Hi all,

We have four UCS B200 M3 blades running SLES 11 SP3 as KVM hypervisors supporting SAP.

This morning one of our servers had a crash on 4 of its 13 vNICs.

The trace in /var/log/messages is:

Aug 13 04:47:10 hv-1 kernel: [3511310.839912] ------------[ cut here ]------------
Aug 13 04:47:10 hv-1 kernel: [3511310.839922] WARNING: at /usr/src/packages/BUILD/kernel-default-3.0.101/linux-3.0/net/sched/sch_generic.c:255 dev_watchdog+0x23e/0x250()
Aug 13 04:47:10 hv-1 kernel: [3511310.839925] Hardware name: UCSB-B200-M3
Aug 13 04:47:10 hv-1 kernel: [3511310.839927] NETDEV WATCHDOG: kvm-cluster-pec (enic): transmit queue 0 timed out
Aug 13 04:47:10 hv-1 kernel: [3511310.839928] Modules linked in: af_packet ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables x_tables edd bridge stp llc cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf fuse loop vhost_net macvtap macvlan tun kvm_intel kvm ipv6 ipv6_lib joydev pcspkr iTCO_wdt iTCO_vendor_support i2c_i801 button acpi_power_meter enic container ac sg wmi rtc_cmos ext3 jbd mbcache usbhid hid sd_mod crc_t10dif ttm drm_kms_helper drm i2c_algo_bit sysimgblt sysfillrect i2c_core syscopyarea ehci_hcd usbcore usb_common processor thermal_sys hwmon dm_service_time dm_least_pending dm_queue_length dm_round_robin dm_multipath scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_dh_rdac scsi_dh dm_snapshot dm_mod fnic libfcoe libfc scsi_transport_fc scsi_tgt megaraid_sas scsi_mod
Aug 13 04:47:10 hv-1 kernel: [3511310.839984] Supported: Yes
Aug 13 04:47:10 hv-1 kernel: [3511310.839987] Pid: 0, comm: swapper Not tainted 3.0.101-0.21-default #1
Aug 13 04:47:10 hv-1 kernel: [3511310.839989] Call Trace:
Aug 13 04:47:10 hv-1 kernel: [3511310.840020] [<ffffffff81004935>] dump_trace+0x75/0x310
Aug 13 04:47:10 hv-1 kernel: [3511310.840032] [<ffffffff8145e063>] dump_stack+0x69/0x6f
Aug 13 04:47:10 hv-1 kernel: [3511310.840041] [<ffffffff8106063b>] warn_slowpath_common+0x7b/0xc0
Aug 13 04:47:10 hv-1 kernel: [3511310.840049] [<ffffffff81060735>] warn_slowpath_fmt+0x45/0x50
Aug 13 04:47:10 hv-1 kernel: [3511310.840057] [<ffffffff813c071e>] dev_watchdog+0x23e/0x250
Aug 13 04:47:10 hv-1 kernel: [3511310.840069] [<ffffffff8106f4db>] call_timer_fn+0x6b/0x120
Aug 13 04:47:10 hv-1 kernel: [3511310.840077] [<ffffffff810708f3>] run_timer_softirq+0x173/0x240
Aug 13 04:47:10 hv-1 kernel: [3511310.840087] [<ffffffff8106770f>] __do_softirq+0x11f/0x260
Aug 13 04:47:10 hv-1 kernel: [3511310.840096] [<ffffffff81469fdc>] call_softirq+0x1c/0x30
Aug 13 04:47:10 hv-1 kernel: [3511310.840107] [<ffffffff81004435>] do_softirq+0x65/0xa0
Aug 13 04:47:10 hv-1 kernel: [3511310.840114] [<ffffffff810674d5>] irq_exit+0xc5/0xe0
Aug 13 04:47:10 hv-1 kernel: [3511310.840122] [<ffffffff81026588>] smp_apic_timer_interrupt+0x68/0xa0
Aug 13 04:47:10 hv-1 kernel: [3511310.840130] [<ffffffff81469773>] apic_timer_interrupt+0x13/0x20
Aug 13 04:47:10 hv-1 kernel: [3511310.840142] [<ffffffff812bd0c1>] intel_idle+0xa1/0x130
Aug 13 04:47:10 hv-1 kernel: [3511310.840152] [<ffffffff8137a9ab>] cpuidle_idle_call+0x11b/0x280
Aug 13 04:47:10 hv-1 kernel: [3511310.840161] [<ffffffff81002126>] cpu_idle+0x66/0xb0
Aug 13 04:47:10 hv-1 kernel: [3511310.840172] [<ffffffff81befeff>] start_kernel+0x376/0x447
Aug 13 04:47:10 hv-1 kernel: [3511310.840180] [<ffffffff81bef3c9>] x86_64_start_kernel+0x123/0x13d
Aug 13 04:47:10 hv-1 kernel: [3511310.840186] ---[ end trace f0165b8680ad586b ]---


I cannot recover from this without shutting down the server.
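For reference, the kind of in-place recovery I mean is roughly the following (a sketch; kvm-cluster-pec is the interface named in the trace above, and reloading enic affects all vNICs on the blade, not just the stuck ones):

    ip link set kvm-cluster-pec down
    ip link set kvm-cluster-pec up
    # heavier hammer: reload the driver for all enic vNICs
    rmmod enic && modprobe enic

Nothing in that class of actions gets the transmit queue going again; only a full shutdown does.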

So far, after several rounds of testing and debugging, I have gotten nowhere on solving this issue or finding its root cause.
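In case it helps, the debugging so far has been along these lines (a rough sketch, not an exhaustive list; the interface name is the one from the trace):

    ethtool -i kvm-cluster-pec        # enic driver and firmware versions
    ethtool -S kvm-cluster-pec        # per-queue tx/rx statistics
    ip -s link show kvm-cluster-pec   # error and drop counters
    grep -i enic /var/log/messages    # earlier driver messages before the watchdog fired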

On the UCS side there are no logs or errors regarding the NICs.

So I'm posting here to ask whether anyone has seen this type of error on these systems, and whether you can suggest a more insightful way of solving it.

NOTE - I have not rebooted the server, as we are still trying to figure out the root cause, since this is not happening on the other 3 servers with the same configuration. All VMs were moved to the other hypervisors.
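(For completeness, the evacuation was plain libvirt live migration, something like the line below, where sap-vm01 and hv-2 are placeholders for a guest and one of the healthy hypervisors:)

    virsh migrate --live sap-vm01 qemu+ssh://hv-2/system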

Thank you for your support.

Jorge Gomes