
Crash on SLES 11 SP3 with KVM



jg_cnenet
13-Aug-2014, 18:40
Hi to all,

We have 4 UCS B200 M3 blades running SLES 11 SP3 as KVM hosts supporting SAP.

This morning one of our servers had a crash on 4 of its 13 vNICs.

The trace in /var/log/messages is:

Aug 13 04:47:10 hv-1 kernel: [3511310.839912] ------------[ cut here ]------------
Aug 13 04:47:10 hv-1 kernel: [3511310.839922] WARNING: at /usr/src/packages/BUILD/kernel-default-3.0.101/linux-3.0/net/sched/sch_generic.c:255 dev_watchdog+0x23e/0x250()
Aug 13 04:47:10 hv-1 kernel: [3511310.839925] Hardware name: UCSB-B200-M3
Aug 13 04:47:10 hv-1 kernel: [3511310.839927] NETDEV WATCHDOG: kvm-cluster-pec (enic): transmit queue 0 timed out
Aug 13 04:47:10 hv-1 kernel: [3511310.839928] Modules linked in: af_packet ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables x_tables edd bridge stp llc cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf fuse loop vhost_net macvtap macvlan tun kvm_intel kvm ipv6 ipv6_lib joydev pcspkr iTCO_wdt iTCO_vendor_support i2c_i801 button acpi_power_meter enic container ac sg wmi rtc_cmos ext3 jbd mbcache usbhid hid sd_mod crc_t10dif ttm drm_kms_helper drm i2c_algo_bit sysimgblt sysfillrect i2c_core syscopyarea ehci_hcd usbcore usb_common processor thermal_sys hwmon dm_service_time dm_least_pending dm_queue_length dm_round_robin dm_multipath scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_dh_rdac scsi_dh dm_snapshot dm_mod fnic libfcoe libfc scsi_transport_fc scsi_tgt megaraid_sas scsi_mod
Aug 13 04:47:10 hv-1 kernel: [3511310.839984] Supported: Yes
Aug 13 04:47:10 hv-1 kernel: [3511310.839987] Pid: 0, comm: swapper Not tainted 3.0.101-0.21-default #1
Aug 13 04:47:10 hv-1 kernel: [3511310.839989] Call Trace:
Aug 13 04:47:10 hv-1 kernel: [3511310.840020] [<ffffffff81004935>] dump_trace+0x75/0x310
Aug 13 04:47:10 hv-1 kernel: [3511310.840032] [<ffffffff8145e063>] dump_stack+0x69/0x6f
Aug 13 04:47:10 hv-1 kernel: [3511310.840041] [<ffffffff8106063b>] warn_slowpath_common+0x7b/0xc0
Aug 13 04:47:10 hv-1 kernel: [3511310.840049] [<ffffffff81060735>] warn_slowpath_fmt+0x45/0x50
Aug 13 04:47:10 hv-1 kernel: [3511310.840057] [<ffffffff813c071e>] dev_watchdog+0x23e/0x250
Aug 13 04:47:10 hv-1 kernel: [3511310.840069] [<ffffffff8106f4db>] call_timer_fn+0x6b/0x120
Aug 13 04:47:10 hv-1 kernel: [3511310.840077] [<ffffffff810708f3>] run_timer_softirq+0x173/0x240
Aug 13 04:47:10 hv-1 kernel: [3511310.840087] [<ffffffff8106770f>] __do_softirq+0x11f/0x260
Aug 13 04:47:10 hv-1 kernel: [3511310.840096] [<ffffffff81469fdc>] call_softirq+0x1c/0x30
Aug 13 04:47:10 hv-1 kernel: [3511310.840107] [<ffffffff81004435>] do_softirq+0x65/0xa0
Aug 13 04:47:10 hv-1 kernel: [3511310.840114] [<ffffffff810674d5>] irq_exit+0xc5/0xe0
Aug 13 04:47:10 hv-1 kernel: [3511310.840122] [<ffffffff81026588>] smp_apic_timer_interrupt+0x68/0xa0
Aug 13 04:47:10 hv-1 kernel: [3511310.840130] [<ffffffff81469773>] apic_timer_interrupt+0x13/0x20
Aug 13 04:47:10 hv-1 kernel: [3511310.840142] [<ffffffff812bd0c1>] intel_idle+0xa1/0x130
Aug 13 04:47:10 hv-1 kernel: [3511310.840152] [<ffffffff8137a9ab>] cpuidle_idle_call+0x11b/0x280
Aug 13 04:47:10 hv-1 kernel: [3511310.840161] [<ffffffff81002126>] cpu_idle+0x66/0xb0
Aug 13 04:47:10 hv-1 kernel: [3511310.840172] [<ffffffff81befeff>] start_kernel+0x376/0x447
Aug 13 04:47:10 hv-1 kernel: [3511310.840180] [<ffffffff81bef3c9>] x86_64_start_kernel+0x123/0x13d
Aug 13 04:47:10 hv-1 kernel: [3511310.840186] ---[ end trace f0165b8680ad586b ]---


I cannot recover from this without shutting down the server.

At this moment, after several rounds of testing and debugging, I have made no progress on solving this issue or finding its root cause.

On the UCS side there are no logs or errors regarding the NICs.

So I'm posting here to see if anyone has seen this type of error on systems like these, and if you can suggest a more insightful way of solving it.

NOTE - I have not rebooted the server, as we are still trying to figure out the root cause, since this is not happening on the other 3 servers with the same configuration. All VMs were moved to the other hypervisors.
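In case it helps anyone comparing notes: this is roughly how I'm scanning the logs on the four hypervisors for the watchdog signature. The sed pattern matches the WARNING line above; /var/log/messages is the SLES default, adjust as needed.

```shell
# scan_watchdog: list interface, driver and queue for every "NETDEV
# WATCHDOG ... transmit queue N timed out" line in a syslog file.
scan_watchdog() {
    log="${1:-/var/log/messages}"
    sed -n 's/.*NETDEV WATCHDOG: \([^ ]*\) (\([^)]*\)): transmit queue \([0-9]*\) timed out.*/\1 \2 queue=\3/p' "$log"
}
# e.g.:  scan_watchdog /var/log/messages
```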

Thank you for your support.

Jorge Gomes

jmozdzen
14-Aug-2014, 10:33
Hi Jorge,

this is a typical case for a support request, if that option is available to you.

That error can occur if the system is losing interrupts; there has been at least one severe problem like this, caused by buggy firmware on Intel 55xx chipsets (http://kb.sp.parallels.com/en/121971).

I've found references to similar problems happening with the chipset your blades seem to have (Intel C600): https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=721316

- Is it only that single blade that is affected? It might be a hardware problem then. Is this the first time this happened, or have you had this before?
- Check if the blade has the latest BIOS/firmware installed.
- Do you see any indicators of similar problems (spurious or missing interrupts, bad throughput, reports in the BMC log,...)?
- Is there a known (BIOS) configuration difference between the system affected and the other blades?
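To check the lost-interrupts angle concretely, you could take two snapshots of /proc/interrupts a few seconds apart while the queue is under load: a vector whose counter stops advancing even though packets are queued points at lost interrupts. A rough sketch, assuming a standard /proc/interrupts layout (the "enic" match pattern and the per-CPU column handling are my assumptions; the number of columns varies with the CPU count):

```shell
# irq_delta: print per-IRQ count deltas between two saved copies of
# /proc/interrupts, restricted to lines matching a pattern (e.g. the
# driver or interface name). Sums all numeric per-CPU columns per line.
irq_delta() {
    pat="$1"; snap1="$2"; snap2="$3"
    for f in "$snap1" "$snap2"; do
        awk -v p="$pat" '$0 ~ p {
            s = 0
            for (i = 2; i <= NF; i++) if ($i ~ /^[0-9]+$/) s += $i
            print $1, s
        }' "$f"
    done | awk 'seen[$1]++ { print $1, $2 - first[$1]; next } { first[$1] = $2 }'
}
# on the affected blade:
#   cat /proc/interrupts > /tmp/irq.1; sleep 10; cat /proc/interrupts > /tmp/irq.2
#   irq_delta enic /tmp/irq.1 /tmp/irq.2
```

A vector showing delta 0 while its queue has traffic backed up would support the firmware/chipset theory.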

If this is a reproducible situation, try to see if the error persists when you

- move the vNIC to a different hardware interface
- do a factory reset of the blade's BIOS
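Before a full power-cycle, it may also be worth checking whether the stuck queue clears with a plain link bounce, and failing that a driver reload; if a reload fixes it, that points more towards software. A sketch, not tested against your exact setup (interface name taken from your log; DRY_RUN=1 only prints the commands):

```shell
# bounce_iface: take an interface down and up again. With DRY_RUN=1 the
# commands are only printed, so the sequence can be reviewed before it
# is run as root. The next escalation step, not run here automatically:
#   modprobe -r enic && modprobe enic   # drops ALL enic interfaces!
bounce_iface() {
    iface="$1"
    run() { [ "${DRY_RUN:-0}" = "1" ] && echo "$@" || "$@"; }
    run ip link set dev "$iface" down
    run ip link set dev "$iface" up
}
# review first:  DRY_RUN=1 bounce_iface kvm-cluster-pec
# then for real: bounce_iface kvm-cluster-pec
```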

What I've read about that error so far leads me to believe it is chipset-related and may be caused by hardware or by software (firmware).

Regards,
Jens