SLES 12 SP1 Qemu-system-i386 segfaults



Chernishov
06-Dec-2016, 10:01
Hello everyone,

We have 5 HP BL460 Gen9 servers running as Xen hypervisors under SLES 12 SP1; each server hosts 4-5 fully virtualized SLES 12 SP1 guests.
About once every 2 months a random DomU guest becomes unresponsive. In xl list I see the guest state as just "------"; I can ping the guest, but ssh/vnc do not respond. The only way to bring the system back is to power it off and restart it from virt-manager.

In the system logs at that time I can see qemu-system-i386 segfaults:

[16312686.295207] IPv6: udp checksum is 0
[18570901.441606] qemu-system-i38[3619]: segfault at 0 ip 00007fcc3b3e1fae sp 00007ffeed8f5068 error 4 in libc-2.19.so[7fcc3b352000+19e000]
[18570901.527129] br0: port 3(vif1.0-emu) entered disabled state

This happens on all XEN hypervisors.
xl info:

xl info
host : MSK-HVX05
release : 3.12.49-11-xen
version : #1 SMP Wed Nov 11 20:52:43 UTC 2015 (8d714a0)
machine : x86_64
nr_cpus : 40
max_cpu_id : 39
nr_nodes : 2
cores_per_socket : 10
threads_per_core : 2
cpu_mhz : 2297
hw_caps : bfebfbff:2c100800:00000000:00007f00:77fefbff:00000000:00000021:000037ab
virt_caps : hvm hvm_directio
total_memory : 262015
free_memory : 159893
sharing_freed_memory : 0
sharing_used_memory : 0
outstanding_claims : 0
free_cpus : 0
xen_major : 4
xen_minor : 5
xen_extra : .1_12-2
xen_version : 4.5.1_12-2
xen_caps : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64
xen_scheduler : credit
xen_pagesize : 4096
platform_params : virt_start=0xffff800000000000
xen_changeset :
xen_commandline : dom0_mem=5785M,max:5785M
cc_compiler : gcc (SUSE Linux) 4.8.5
cc_compile_by : abuild
cc_compile_domain : suse.de
cc_compile_date : Thu Nov 5 14:42:08 UTC 2015
xend_config_format : 4




I am trying to catch a core dump, but I don't know which one I need:
1) Core dump of the domU kernel
2) Core dump of the crashed qemu-system-i386 process on the hypervisor

Please advise which of these I need to capture.

jmozdzen
06-Dec-2016, 17:13
Hi Chernishov,

I strongly suggest opening a service request (SR) and having an engineer look into this. You'll receive the corresponding requests for collecting details during the SR process.

Regarding your question, I believe that enabling core dumps for the qemu process would help (see ulimit -c), but this would have to be done before starting the VM. Please note that if you set it to "unlimited", you may get a dump file that is pretty large for the host file system.

Regards,
J

Chernishov
07-Dec-2016, 07:16
Yeah, we do plan to open an SR, but I feel that without core dumps it would be impossible to pin down the problem, and since it could take up to 3 months just to catch the crash, I thought I would ask in the forums first.

I guess I should follow these instructions: https://www.novell.com/support/kb/doc.php?id=3054866

But how can I test that dumps really will be created? For example, would it be enough to just send SIGSEGV to the qemu process to check whether a core is dumped?
I did some tests on SLES11SP3: I configured ulimit and core_pattern according to the instructions (in sysctl.conf too) and restarted the system several times, but dumps were generated only for simple examples, like


top &
kill -6 $!   # $! is the PID of the background job just started; signal 6 is SIGABRT
fg %1

When I tried to send SIGABRT or SIGSEGV to the qemu process, no dump was generated.

jmozdzen
07-Dec-2016, 13:15
Hi Chernishov,

> restarted the system several times

The invocation of "ulimit -c unlimited" only affects the current shell and its children and is not persistent across reboots. The same holds for a direct invocation of "sysctl -w"... so if you ran these steps, then rebooted the machine and *did not rerun* them, you'd see no effect. The same is true if you did not start the VM as a child of the session in which you raised the core size limit.
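
To illustrate the inheritance, here is a small sketch (my own example, not taken from the KB article). It lowers the soft core limit inside a subshell, starts a child from there and compares that with the parent shell. It uses 0 rather than "unlimited" only because lowering a soft limit is always permitted; raising it follows the same inheritance rule.

```shell
#!/bin/sh
# Show that a core-size limit set in one shell is inherited by its children
# but does not leak back into the parent shell.
child_limit=$(
    ulimit -S -c 0          # lower the soft limit inside this subshell only
    sh -c 'ulimit -S -c'    # a child started from here reports the inherited value
)
echo "child of the limited shell: $child_limit"
echo "parent shell:               $(ulimit -S -c)"
```

The child reports 0, while the parent still shows whatever limit it had before. A VM started from a different session behaves like the parent here: it never sees the changed limit.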

By looking at the contents of /proc/<pid>/limits you can see whether your changes are in effect for your process.
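
A minimal sketch of that check follows. The helper name core_limits is made up for this example, and it assumes the usual Linux layout of the limits file, where the "Max core file size" line carries the soft limit followed by the hard limit.

```shell
#!/bin/sh
# Print the soft and hard core-file size limits of a running process.
# core_limits is a hypothetical helper name, not an existing tool.
core_limits() {
    pid="$1"
    # The relevant line in /proc/<pid>/limits looks like:
    # Max core file size        0                    unlimited            bytes
    awk '/^Max core file size/ { print "soft=" $5, "hard=" $6 }' "/proc/$pid/limits"
}

# Example: inspect the current shell itself.
core_limits "$$"
```

For the qemu case you would pass the PID of the running qemu-system-i386 process instead of "$$". A soft limit of 0 means no core will be written for that process, regardless of what your login shell shows.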

Sending a kill -SEGV to a process should be sufficient to simulate the effect you see for qemu. OTOH, the programmers may have decided to catch that signal within the process and handle it in their own way, so there's a slight chance (though not very likely) that no core is generated for that reason.
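
Whether a process has a handler installed for a signal can be read from the SigCgt bitmask in /proc/<pid>/status, where bit (signal number - 1) is set for each caught signal. The sketch below is only an illustration: catches_signal is a made-up helper name, and it looks at the low 32 bits of the mask only, which covers the standard signals 1 to 32 (SIGSEGV is 11 on Linux).

```shell
#!/bin/sh
# Succeed (exit status 0) if the process has a handler for the given signal.
# catches_signal is a hypothetical helper name, not an existing tool.
catches_signal() {
    pid="$1"
    signo="$2"
    # SigCgt is a 16-digit hex mask; keep the last 8 digits (signals 1..32).
    mask=$(awk '/^SigCgt:/ { print substr($2, length($2) - 7) }' "/proc/$pid/status")
    [ -n "$mask" ] || return 2      # process is gone or status is unreadable
    [ $(( (0x$mask >> ($signo - 1)) & 1 )) -eq 1 ]
}

# Example: does the current shell catch SIGSEGV (signal 11)?
if catches_signal "$$" 11; then
    echo "SIGSEGV is caught by a handler"
else
    echo "SIGSEGV is not caught, so the default action applies"
fi
```

If the check succeeds for the qemu PID, a kill -SEGV test may not produce a core even with correct limits, because the handler runs instead of the default action.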

Regards,
J