Thread: BUG: soft lockup on CPU during heavy load

  1. #1

    BUG: soft lockup on CPU during heavy load

    Hi,

    I just upgraded an HP DL580 G7 with 40 CPUs (80 with HT) and 512 GB of memory to SLES 11 SP2. Running jobs that cause heavy CPU load now results in several errors. dmesg gives me the following report.

    Code:
    [124560.864252] BUG: soft lockup - CPU#67 stuck for 22s! [exe:17646]
    [124560.864257] Modules linked in: af_packet st sd_mod crc_t10dif ide_cd_mod ide_core lp parport_pc ppdev parport autofs4 edd xt_tcpudp xt_pkttype ipt_LOG xt_limit nfs lockd fscache auth_rpcgss nfs_acl sunrpc cpufreq_conservative cpufreq_userspace cpufreq_powersave pcc_cpufreq mperf microcode xt_NOTRACK ipt_REJECT xt_state iptable_raw iptable_filter nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables ip6_tables x_tables fuse loop dm_mod ipv6_lib qlcnic i7core_edac netxen_nic sg edac_core sr_mod cdrom hpilo iTCO_wdt hpwdt iTCO_vendor_support joydev pcspkr serio_raw container rtc_cmos acpi_power_meter button ext3 jbd mbcache usbhid hid uhci_hcd ehci_hcd usbcore usb_common thermal processor thermal_sys hwmon scsi_dh_hp_sw scsi_dh_alua scsi_dh_rdac scsi_dh_emc scsi_dh ata_generic ata_piix libata hpsa cciss scsi_mod [last unloaded: parport_pc]
    [124560.864350] Supported: Yes
    [124560.864353] CPU 67 
    [124560.864355] Modules linked in: af_packet st sd_mod crc_t10dif ide_cd_mod ide_core lp parport_pc ppdev parport autofs4 edd xt_tcpudp xt_pkttype ipt_LOG xt_limit nfs lockd fscache auth_rpcgss nfs_acl sunrpc cpufreq_conservative cpufreq_userspace cpufreq_powersave pcc_cpufreq mperf microcode xt_NOTRACK ipt_REJECT xt_state iptable_raw iptable_filter nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables ip6_tables x_tables fuse loop dm_mod ipv6_lib qlcnic i7core_edac netxen_nic sg edac_core sr_mod cdrom hpilo iTCO_wdt hpwdt iTCO_vendor_support joydev pcspkr serio_raw container rtc_cmos acpi_power_meter button ext3 jbd mbcache usbhid hid uhci_hcd ehci_hcd usbcore usb_common thermal processor thermal_sys hwmon scsi_dh_hp_sw scsi_dh_alua scsi_dh_rdac scsi_dh_emc scsi_dh ata_generic ata_piix libata hpsa cciss scsi_mod [last unloaded: parport_pc]
    [124560.864416] Supported: Yes
    [124560.864419] 
    [124560.864423] Pid: 17646, comm: exe Not tainted 3.0.13-0.27-default #1 HP ProLiant DL580 G7
    [124560.864428] RIP: 0010:[<ffffffff81441a08>]  [<ffffffff81441a08>] _raw_spin_unlock_irqrestore+0x8/0x10
    [124560.864445] RSP: 0000:ffff8832fa663970  EFLAGS: 00000246
    [124560.864448] RAX: 0000000000000000 RBX: ffffea0008943900 RCX: ffffea00846f300c
    [124560.864451] RDX: 0000000000000002 RSI: 0000000000000246 RDI: 0000000000000246
    [124560.864454] RBP: ffff8832fa663b18 R08: 0000000000000200 R09: ffff88403ffd9e80
    [124560.864458] R10: 00000000025d6c00 R11: ffff88403ffda3b0 R12: ffffffff8144a06e
    [124560.864461] R13: ffffea000896b600 R14: 0000000000000297 R15: 000000000000000c
    [124560.864465] FS:  00007f08256fc700(0000) GS:ffff88603fc40000(0000) knlGS:0000000000000000
    [124560.864469] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    [124560.864472] CR2: 00007eff20800000 CR3: 000000183c053000 CR4: 00000000000006e0
    [124560.864476] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [124560.864479] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    [124560.864483] Process exe (pid: 17646, threadinfo ffff8832fa662000, task ffff8800941d2540)
    [124560.864486] Stack:
    [124560.864499]  ffffffff81136259 ffff88603fc4e0b0 0000000000000206 ffff88403ffd9e80
    [124560.864516]  0000000000000000 00000000020645b9 0000000000000246 0000000008943af8
    [124560.864527]  ffff88403ffd9ee0 ffffea00846f300c 0000000000000000 0000000000000200
    [124560.864538] Call Trace:
    [124560.864563]  [<ffffffff81136259>] isolate_freepages+0x359/0x3b0
    [124560.864573]  [<ffffffff811362fe>] compaction_alloc+0x4e/0x60
    [124560.864584]  [<ffffffff811404a9>] unmap_and_move+0x49/0x180
    [124560.864593]  [<ffffffff8114067e>] migrate_pages+0x9e/0x1b0
    [124560.864601]  [<ffffffff81136ae3>] compact_zone+0x1f3/0x2f0
    [124560.864609]  [<ffffffff81136e42>] compact_zone_order+0xa2/0xe0
    [124560.864617]  [<ffffffff81136f5f>] try_to_compact_pages+0xdf/0x110
    [124560.864628]  [<ffffffff810f867e>] __alloc_pages_direct_compact+0xee/0x1c0
    [124560.864638]  [<ffffffff810f8ab2>] __alloc_pages_slowpath+0x362/0x7f0
    [124560.864646]  [<ffffffff810f90f1>] __alloc_pages_nodemask+0x1b1/0x1c0
    [124560.864655]  [<ffffffff811354cb>] alloc_pages_vma+0x9b/0x160
    [124560.864666]  [<ffffffff81145170>] do_huge_pmd_anonymous_page+0x160/0x270
    [124560.864677]  [<ffffffff81445327>] do_page_fault+0x207/0x4c0
    [124560.864686]  [<ffffffff81442065>] page_fault+0x25/0x30
    [124560.866515] DWARF2 unwinder stuck at page_fault+0x25/0x30
    [124560.866518] 
    [124560.866520] Leftover inexact backtrace:
    [124560.866521] 
    [124560.866529] Code: 0f c1 07 0f b7 d0 c1 e8 10 39 c2 74 07 f3 90 0f b7 17 eb f5 c3 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 66 ff 07 48 89 f7 57 9d 
    [124560.866561]  66 90 66 90 c3 66 90 b8 ff ff ff ff f0 0f c1 07 83 e8 01 ba 
    [124560.866574] Call Trace:
    [124560.866582]  [<ffffffff81136259>] isolate_freepages+0x359/0x3b0
    [124560.866591]  [<ffffffff811362fe>] compaction_alloc+0x4e/0x60
    [124560.866598]  [<ffffffff811404a9>] unmap_and_move+0x49/0x180
    [124560.866605]  [<ffffffff8114067e>] migrate_pages+0x9e/0x1b0
    [124560.866613]  [<ffffffff81136ae3>] compact_zone+0x1f3/0x2f0
    [124560.866621]  [<ffffffff81136e42>] compact_zone_order+0xa2/0xe0
    [124560.866628]  [<ffffffff81136f5f>] try_to_compact_pages+0xdf/0x110
    [124560.866637]  [<ffffffff810f867e>] __alloc_pages_direct_compact+0xee/0x1c0
    [124560.866644]  [<ffffffff810f8ab2>] __alloc_pages_slowpath+0x362/0x7f0
    [124560.866652]  [<ffffffff810f90f1>] __alloc_pages_nodemask+0x1b1/0x1c0
    [124560.866660]  [<ffffffff811354cb>] alloc_pages_vma+0x9b/0x160
    [124560.866668]  [<ffffffff81145170>] do_huge_pmd_anonymous_page+0x160/0x270
    [124560.866677]  [<ffffffff81445327>] do_page_fault+0x207/0x4c0
    [124560.866684]  [<ffffffff81442065>] page_fault+0x25/0x30
    [124560.868451] DWARF2 unwinder stuck at page_fault+0x25/0x30
    [124560.868453] 
    [124560.868455] Leftover inexact backtrace:
    [124560.868456]

    The program that is experiencing the soft lockup should not be the problem, since it worked on SP1. I have searched for a while now and found different ideas about where to start looking for the problem:

    1. IIRC System
    2. CPU Damage
    3. APIC System


    But I am really not sure what can cause this kind of problem. While the errors are happening, most of the watchdogs climb to well over 100 % CPU load.
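
    To compare reports across CPUs, it may help to pull the headline fields and the first call trace out of the dmesg text. The following is only a minimal sketch written against the log format shown above: `summarize_lockups`, `LOCKUP_RE` and `FRAME_RE` are hypothetical names, and the regular expressions assume the 3.0.13 output layout from this report.

```python
import re

# Headline of a soft lockup report, e.g.
# "BUG: soft lockup - CPU#67 stuck for 22s! [exe:17646]"
# (pattern is an assumption based on the log in this post)
LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)

# One call-trace frame, e.g.
# " [<ffffffff81136259>] isolate_freepages+0x359/0x3b0"
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def summarize_lockups(dmesg_text):
    """Return one dict per soft lockup headline, with its first call trace."""
    reports = []
    current = None
    in_trace = False
    for line in dmesg_text.splitlines():
        headline = LOCKUP_RE.search(line)
        if headline:
            current = {
                "cpu": int(headline.group("cpu")),
                "secs": int(headline.group("secs")),
                "comm": headline.group("comm"),
                "pid": int(headline.group("pid")),
                "frames": [],
            }
            reports.append(current)
            in_trace = False
            continue
        if current is None:
            continue
        if "Call Trace:" in line:
            # Record only the first trace; the report repeats it later.
            in_trace = not current["frames"]
            continue
        if in_trace:
            frame = FRAME_RE.search(line)
            if frame:
                current["frames"].append(frame.group("func"))
            else:
                # First non-frame line ends the trace.
                in_trace = False
    return reports
```

    Saving the output of `dmesg` to a file and passing its contents to `summarize_lockups` gives one entry per lockup, which makes it easier to see whether every report goes through the same functions.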

    Any ideas?

    Best regards
    fbemm
    Last edited by fbemm; 13-Mar-2012 at 21:17.
