SLES 12 SP4 Server crashes with a long BTRFS error list



AAEBHolding
14-May-2019, 09:17
Suddenly I get an error with the dump below when SLES 12.4 starts. The system then becomes unavailable.

When I repair the root filesystem [with the disk attached to another guest VM; otherwise it isn't possible because the root filesystem is mounted], the root partition is afterwards reported as full; df even shows 0 MB available.

What happened, and how can I solve it? The very strange thing is that when I restore a three-week-old backup [where everything was OK] into a new VM, I get the same error messages.

[ 37.505018] invalid opcode: 0000 [#1] SMP NOPTI
[ 37.505170] CPU: 10 PID: 568 Comm: systemd-journal Not tainted 4.12.14-95.13-default #1 SLE12-SP4
[ 37.505476] task: ffff880003bacc00 task.stack: ffffc90041620000
[ 37.505701] RIP: e030:create_reloc_root+0x295/0x2a0 [btrfs]
[ 37.505853] RSP: e02b:ffffc90041623b98 EFLAGS: 00010282
[ 37.505997] RAX: 00000000ffffffef RBX: ffff8800f731ae00 RCX: 0000000000000001
[ 37.506205] RDX: 0000000000000003 RSI: ffff8800f5323460 RDI: 0000000000000200
[ 37.506486] RBP: ffff8800f965f000 R08: ffff8800ef659cb0 R09: ffffc900416239d8
[ 37.506678] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88000376a2d0
[ 37.506873] R13: ffff8800f58a0000 R14: 0000000000000110 R15: ffff880003bacc00
[ 37.507071] FS: 00007fa2c23a0880(0000) GS:ffff8800faa80000(0000) knlGS:0000000000000000
[ 37.507319] CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 37.507532] CR2: 00007fa2bef25550 CR3: 00000000872fa000 CR4: 0000000000040660
[ 37.507796] Call Trace:
[ 37.507923] btrfs_init_reloc_root+0x8e/0xa0 [btrfs]
[ 37.508130] record_root_in_trans+0xa9/0xf0 [btrfs]
[ 37.508333] btrfs_record_root_in_trans+0x4a/0x70 [btrfs]
[ 37.508539] start_transaction+0xab/0x440 [btrfs]
[ 37.508691] btrfs_dirty_inode+0x49/0xe0 [btrfs]
[ 37.508839] file_update_time+0xa6/0xf0
[ 37.508972] btrfs_page_mkwrite+0x129/0x490 [btrfs]
[ 37.509109] ? vsnprintf+0x1e5/0x4b0
[ 37.509212] do_page_mkwrite+0x31/0x70
[ 37.509373] do_wp_page+0x43f/0x570
[ 37.509473] __handle_mm_fault+0x793/0xef0
[ 37.509601] handle_mm_fault+0xc4/0x1d0
[ 37.509719] __do_page_fault+0x1f3/0x4c0
[ 37.509831] do_page_fault+0x2b/0x70
[ 37.509934] ? do_syscall_64+0x9a/0x150
[ 37.510044] ? page_fault+0x2f/0x50
[ 37.510172] page_fault+0x45/0x50
[ 37.510301] RIP: 0510:0x7ffefbe30518
[ 37.510424] RSP: 0024:00005575011ab0a0 EFLAGS: 5575011a46b0
[ 37.510427] Code: 48 83 c6 02 41 83 e8 02 66 89 4f fe e9 37 fe ff ff 8b 0e 48 83 c7 04 48 83 c6 04 41 83 e8 04 89 4f fc e9 2b fe ff ff 0f 0b 0f 0b <0f> 0b 0f 0b 0f 0b 0f 1f 44 00 00 0f 1f 44 00 00 48 89 f9 45 31
[ 37.511135] Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache af_packet iscsi_ibft iscsi_boot_sysfs xenfs xen_privcmd intel_rapl sb_edac x86_pkg_temp_thermal coretemp crc32_pclmul ghash_clmulni_intel pcbc xen_netfront aesni_intel aes_x86_64 crypto_simd glue_helper cryptd pcspkr nfsd auth_rpcgss nfs_acl lockd grace sunrpc btrfs xor raid6_pq xen_blkfront crc32c_intel sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod autofs4
[ 37.512270] Supported: Yes
[ 37.512419] ---[ end trace ab510ab54e7d565e ]---
[ 37.512563] RIP: e030:create_reloc_root+0x295/0x2a0 [btrfs]
[ 37.512721] RSP: e02b:ffffc90041623b98 EFLAGS: 00010282
[ 37.512724] RAX: 00000000ffffffef RBX: ffff8800f731ae00 RCX: 0000000000000001
[ 37.512726] RDX: 0000000000000003 RSI: ffff8800f5323460 RDI: 0000000000000200
[ 37.512728] RBP: ffff8800f965f000 R08: ffff8800ef659cb0 R09: ffffc900416239d8
[ 37.512730] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88000376a2d0
[ 37.512732] R13: ffff8800f58a0000 R14: 0000000000000110 R15: ffff880003bacc00
[ 37.512740] FS: 00007fa2c23a0880(0000) GS:ffff8800faa80000(0000) knlGS:0000000000000000
[ 37.512744] CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 37.512750] CR2: 00007fa2bef25550 CR3: 00000000872fa000 CR4: 0000000000040660

AAEBHolding
14-May-2019, 09:39
This is the output from btrfs check --repair /dev/xvdc2:

enabling repair mode
Checking filesystem on /dev/xvdc2
UUID: c88dbf5b-3513-4966-b3d6-5bb6c9b7717e
checking extents
Fixed 0 roots.
checking free space cache
cache and super generation don't match, space cache will be invalidated
checking fs roots
checking csums
checking root refs
found 9380937728 bytes used err is 0
total csum bytes: 8061100
total tree bytes: 230670336
total fs tree bytes: 181600256
total extent tree bytes: 34013184
btree space waste bytes: 41120097
file data blocks allocated: 10417725440
referenced 8217731072

Then df shows this:

Filesystem 1MB-blocks Used Available Use% Mounted on
/dev/xvda2 11249MB 9918MB 0MB 100% /

malcolmlewis
14-May-2019, 12:42
Hi
The df tools in SP4 are not btrfs friendly...

See this thread: http://forums.suse.com/showthread.php?t=13627

And also https://www.suse.com/documentation/sles11/stor_admin/data/trbl_btrfs_volfull.html
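
For btrfs, the filesystem's own reporting is usually more informative than plain df. A minimal sketch, assuming the root filesystem and device from the posts above:

btrfs filesystem usage /            # allocated vs. used, split into data/metadata/system
btrfs filesystem df /               # per-chunk-type summary
btrfs filesystem show /dev/xvda2    # how much of the device is allocated to chunks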

AAEBHolding
14-May-2019, 15:40
Hi
The df tools in SP4 are not btrfs friendly...

See this thread: http://forums.suse.com/showthread.php?t=13627

And also https://www.suse.com/documentation/sles11/stor_admin/data/trbl_btrfs_volfull.html

@df: You are right, but this is not the problem.
@volume full: This is not the problem and does not solve my issue.

I may not have been precise enough: the root filesystem is totally damaged. When I start the server for the first time [from a three-week-old backup], I get the above error messages after a while.
When I try to start the server, the flood of error messages appears and the server becomes unavailable. I also found this post where the same error messages are reported (https://bugzilla.kernel.org/show_bug.cgi?id=203405).

Please note that in the first line of the dump, the message 'kernel BUG at ../fs/btrfs/relocation.c:1449!' points to a bug in the kernel!

Again: the server is idle, meaning that apart from the normal services nothing special is running. Suddenly the error messages from the first post appear on the console, and then the VM is broken!

It looks to me like the kernel picked up a very serious bug in one of the previous updates.

malcolmlewis
15-May-2019, 03:26
@df: You are right, but this is not the problem.
@volume full: This is not the problem and does not solve my issue.

I may not have been precise enough: the root filesystem is totally damaged. When I start the server for the first time [from a three-week-old backup], I get the above error messages after a while.
When I try to start the server, the flood of error messages appears and the server becomes unavailable. I also found this post where the same error messages are reported (https://bugzilla.kernel.org/show_bug.cgi?id=203405).

Please note that in the first line of the dump, the message 'kernel BUG at ../fs/btrfs/relocation.c:1449!' points to a bug in the kernel!

Again: the server is idle, meaning that apart from the normal services nothing special is running. Suddenly the error messages from the first post appear on the console, and then the VM is broken!

It looks to me like the kernel picked up a very serious bug in one of the previous updates.
Hi
At GRUB, can you select advanced boot options and boot to an earlier snapshot?

AAEBHolding
15-May-2019, 09:15
Hi
At GRUB, can you select advanced boot options and boot to an earlier snapshot?

It is a VM running under XenServer. I only have the snapshots provided by XenServer - the SLES snapshots are not available.

I think I should provide this information because it may be the real cause of this issue: the root partition was running out of space, so I did the following to enlarge it:

Resized the root partition in XenCenter.
Detached the root partition from the SLES 12.4 VM.
Attached the root partition to another SLES 12.4 VM.
Resized the root partition in the other VM with the YaST partitioner.
Detached the root partition from the helper VM.
Attached the resized root partition to the original VM.



All steps (resizing, detaching, and attaching) were performed while the VMs were shut down.

Since then, even an old backup crashes a few seconds after the VM starts.
Is it possible that deeper information is stored on the disk, so that even when I restore a snapshot from before the disk was resized, it no longer matches and the problems start?

Does that help?
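
For reference, once the virtual disk and the partition have been enlarged, the in-guest grow step for btrfs usually boils down to something like this (a generic sketch; YaST may handle it differently under the hood):

btrfs filesystem resize max /mnt    # grow the filesystem to fill the partition (mounted at /mnt on a helper VM)
btrfs filesystem resize max /       # or directly on the running system's root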

malcolmlewis
15-May-2019, 12:39
On Wed 15 May 2019 08:24:01 AM CDT, AAEBHolding wrote:

malcolmlewis;57684 Wrote:
> Hi
> At GRUB, can you select advanced boot options and boot to an earlier
> snapshot?

It is a VM running under XenServer. I only have the snapshots provided
by XenServer - the SLES snapshots are not available.

I think I should provide this information because it may be the real
cause of this issue: the root partition was running out of space, so I
did the following to enlarge it:

- Resized the root partition in XenCenter.
- Detached the root partition from the SLES 12.4 VM.
- Attached the root partition to another SLES 12.4 VM.
- Resized the root partition in the other VM with the YaST partitioner.
- Detached the root partition from the helper VM.
- Attached the resized root partition to the original VM.


All steps (resizing, detaching, and attaching) were performed while the
VMs were shut down.

Since then, even an old backup crashes a few seconds after the VM
starts.
Is it possible that deeper information is stored on the disk, so that
even when I restore a snapshot from before the disk was resized, it no
longer matches and the problems start?

Does that help?




Hi
It does make it clearer :) Are you in a position to raise an SR
(Support Request)?

--
Cheers Malcolm °¿° SUSE Knowledge Partner (Linux Counter #276890)
Tumbleweed 20190512 | GNOME Shell 3.32.1 | 5.0.13-1-default
If you find this post helpful and are logged into the web interface,
please show your appreciation and click on the star below... Thanks!

AAEBHolding
15-May-2019, 12:58
Hi
It does make it clearer :) Are you in a position to raise an SR
(Support Request)?


Not really; I have never raised an SR. I am alone in my business and maintain my servers and the XenServer on my own. It has been working like this more or less perfectly for four years, but with this issue I am completely out of my depth.

malcolmlewis
17-May-2019, 14:40
Hi
So if you boot the system to runlevel 1 (at GRUB, add 1 to the boot options), can you mount the / partition and look at the logs to see what's failing? Or boot the system in rescue mode.
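
Roughly, the GRUB part looks like this (generic steps, not specific to this VM):

# at the GRUB menu, press 'e' on the boot entry and append to the line starting with 'linux':
1
# or, for the systemd rescue target:
systemd.unit=rescue.target
# then boot the edited entry with Ctrl-X or F10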

AAEBHolding
17-May-2019, 15:58
Hi
So if you boot the system to runlevel 1 (at GRUB, add 1 to the boot options), can you mount the / partition and look at the logs to see what's failing? Or boot the system in rescue mode.

Which logs should I check? I am in rescue mode and have mounted the root partition at /mnt.
Can I try to fix it somehow? Running btrfs check --repair /dev/xvda2 doesn't solve the problem. As I wrote, when I then boot normally, the / partition is out of space.

malcolmlewis
17-May-2019, 16:27
Which logs should I check? I am in rescue mode and have mounted the root partition at /mnt.
Can I try to fix it somehow? Running btrfs check --repair /dev/xvda2 doesn't solve the problem. As I wrote, when I then boot normally, the / partition is out of space.
Hi
Check down in /var/log for big files and check those for clues (maybe even copy them off to an external drive), especially the messages log.

Maybe it's coredumping?

Run:



coredumpctl list


If there are old logs you think can be deleted, remove them and see whether that frees up some disk space.
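
A quick way to see what is eating space under /var/log and in the journal (generic commands, not specific to this system):

du -sh /var/log/* | sort -rh | head -n 20    # largest log files and directories first
journalctl --disk-usage                      # space used by the systemd journal
journalctl --vacuum-size=200M                # optionally shrink the journal, here to ~200 MB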

AAEBHolding
18-May-2019, 21:11
Hi
Check down in /var/log for big files and check those for clues (maybe even copy them off to an external drive), especially the messages log.

Maybe it's coredumping?

If there are old logs you think can be deleted, remove them and see whether that frees up some disk space.

Good news: I think I managed to fix it after hours of desperate trying.
These threads did it:

https://unix.stackexchange.com/questions/174446/btrfs-error-error-during-balancing-no-space-left-on-device
http://marc.merlins.org/perso/btrfs/post_2014-05-04_Fixing-Btrfs-Filesystem-Full-Problems.html


To summarize what I did:

Start the VM in rescue mode.
mount /dev/xvda2 /mnt
btrfs balance start -v --full-balance /mnt



When I got 'Done, had to relocate 12 out of 12 chunks', it worked (1).
When I got the message
ERROR: error during balancing '/mnt' - No space left on device
There may be more info in syslog - try dmesg | tail
I had to proceed with (2).

And here, I had two different situations:
1) One VM was completely recovered and now works well, even though before it would not start and crashed while booting - in other words, the VM had become unusable.
2) The other VM was more stubborn and really tiresome. I had to find a XenServer snapshot of the VM where it did at least start properly, even though it crashed within a few seconds.

Then, to check whether it really works, I ran some heavy disk-access routines that previously crashed the VM within seconds. Now it runs without any problem.
I hope it will remain like this.
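
For reference, when the full balance aborts with 'No space left on device', the linked articles suggest freeing chunks incrementally with usage filters first (the percentages below are only examples):

btrfs balance start -dusage=5 /mnt            # compact data chunks that are at most 5% used
btrfs balance start -dusage=25 /mnt           # then raise the threshold step by step
btrfs balance start -dusage=50 /mnt
btrfs balance start -v --full-balance /mnt    # retry the full balance once some space is free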

malcolmlewis
18-May-2019, 22:37
Good news: I think I managed to fix it after hours of desperate trying.
These threads did it:

https://unix.stackexchange.com/questions/174446/btrfs-error-error-during-balancing-no-space-left-on-device
http://marc.merlins.org/perso/btrfs/post_2014-05-04_Fixing-Btrfs-Filesystem-Full-Problems.html


To summarize what I did:

Start the VM in rescue mode.
mount /dev/xvda2 /mnt
btrfs balance start -v --full-balance /mnt



When I got 'Done, had to relocate 12 out of 12 chunks', it worked (1).
When I got the message
ERROR: error during balancing '/mnt' - No space left on device
There may be more info in syslog - try dmesg | tail
I had to proceed with (2).

And here, I had two different situations:
1) One VM was completely recovered and now works well, even though before it would not start and crashed while booting - in other words, the VM had become unusable.
2) The other VM was more stubborn and really tiresome. I had to find a XenServer snapshot of the VM where it did at least start properly, even though it crashed within a few seconds.

Then, to check whether it really works, I ran some heavy disk-access routines that previously crashed the VM within seconds. Now it runs without any problem.
I hope it will remain like this.
Hi
Thanks for the feedback and good work :)

brainwave64
11-Jun-2019, 21:04
I've experienced this exact same error at relocation.c:1449 running "4.12.14-lp150.12.58-default #1 openSUSE Leap 15.0". The call to btrfs_insert_root() within create_reloc_root() returns an error code, which triggers the subsequent BUG_ON() assertion.

I'm hopeful that the following commit in the kernel will fix the underlying issue.

https://bugzilla.kernel.org/show_bug.cgi?id=203405
https://lkml.org/lkml/2019/6/7/720
https://github.com/torvalds/linux/commit/30d40577e322b670551ad7e2faa9570b6e23eb2b
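
One way to check whether a fix like this has reached the installed kernel is to look at the kernel package changelog; a generic sketch (the grep pattern is only an example):

uname -r                                      # currently running kernel version
rpm -q --changelog kernel-default | less      # search for the Bugzilla ID or commit subject
rpm -q --changelog kernel-default | grep -i -m 5 reloc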

AAEBHolding
22-Jun-2019, 02:02
OMG, after I installed the latest updates SUSE offered, it has started again. But now I cannot fix it. One VM that had the problem was not updated, and it has no problems.
It is the same thing again. There is no way I can fix it by restoring a snapshot. This VM is apparently broken.