Thread: mcelog hardware error - is it my memory or CPU failing?

  1. #1

    mcelog hardware error - is it my memory or CPU failing?

    Hi,
    I've been seeing kernel "[Hardware Error]: Machine check events logged" messages in /var/log/messages. These seem to be from the mcelog daemon, and the corresponding logs (I posted an example below) are in /var/log/mcelog.

    - Is a RAM chip on its way out? Or is it the CPU or CPU cache that's having issues?
    - If it's RAM, how do I determine which chip(s) are having issues?

    /var/log/mcelog:

    Code:
    Hardware event. This is not a software error.
    MCE 0
    CPU 0 4 northbridge
    MISC c0090fff01000000 ADDR 757580490
    TIME 1335182555 Mon Apr 23 08:02:35 2012
      Northbridge RAM Chipkill ECC error
      Chipkill ECC syndrome = 4857
           bit46 = corrected ecc error
           bit59 = misc error valid
           bit62 = error overflow (multiple errors)
      bus error 'local node response, request didn't time out
                 generic read mem transaction
                 memory access, level generic'
    STATUS dc2bc00048080a13 MCGSTATUS 0
    MCGCAP 106 APICID 0 SOCKETID 0
    CPUID Vendor AMD Family 16 Model 4
    (I've never used mcelog before, but since I upgraded from SLES 11 SP1 to SP2, it seems to be configured to start on boot.)

    Thanks,
    J
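
    For anyone reading along, here is a minimal sketch of how the STATUS value in the record above lines up with the flags mcelog printed. The three bit positions come straight from the decoded lines (bit46, bit59, bit62). The syndrome layout is an assumption, chosen because it reproduces the "4857" that mcelog reported; check it against AMD's documentation for your CPU family before relying on it. The function name `decode_status` is made up for illustration.

```python
# Bit positions as named in the mcelog output quoted in the post.
CORRECTED_ECC_BIT = 46  # "bit46 = corrected ecc error"
MISC_VALID_BIT = 59     # "bit59 = misc error valid"
OVERFLOW_BIT = 62       # "bit62 = error overflow (multiple errors)"


def decode_status(status: int) -> dict:
    """Pick out the flags mcelog reported from a 64-bit STATUS value."""

    def bit(n: int) -> bool:
        return bool((status >> n) & 1)

    # Assumed layout: syndrome high byte in bits 31:24, low byte in bits 54:47.
    syndrome = (((status >> 24) & 0xFF) << 8) | ((status >> 47) & 0xFF)
    return {
        "corrected_ecc": bit(CORRECTED_ECC_BIT),
        "misc_valid": bit(MISC_VALID_BIT),
        "overflow": bit(OVERFLOW_BIT),
        "syndrome": f"{syndrome:x}",
    }


if __name__ == "__main__":
    # STATUS value copied from the record in the post.
    print(decode_status(0xDC2BC00048080A13))
```

    The record labels itself a "Northbridge RAM Chipkill ECC error" with the corrected bit set, so the flags describe a memory error that ECC repaired, with the overflow bit indicating that more than one error was seen.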

  2. #2

    Re: mcelog hardware error - is it my memory or CPU failing?

    Hi J,

    Sounds like a RAM chip giving up. Have you had a look at the SEL (system event log)? It may give you more details, as the system behind it ought to know about the hardware layout of your machine.

    Regards,
    Jens

  3. #3

    Re: mcelog hardware error - is it my memory or CPU failing?

    Hi Jens,

    Thanks for the reply. In the System event log, I see several of these messages that occur during boot:

    Code:
    ID = 6eb : 04/22/2012 : 00:27:29 : Memory : BIOS : Configuration Error
    Is it possible that there is a strange setting in BIOS that would not play well with mcelog? The machine in question is a Sun Fire x4140.

    Either way, we plan on taking the server down one evening and running memtest86 overnight.
    Thanks,
    J

  4. #4

    Re: mcelog hardware error - is it my memory or CPU failing?

    Hi J,

    Originally posted by ashbyj:
    Hi Jens,

    Thanks for the reply. In the System event log, I see several of these messages that occur during boot:

    Code:
    ID = 6eb : 04/22/2012 : 00:27:29 : Memory : BIOS : Configuration Error
    Is it possible that there is a strange setting in BIOS that would not play well with mcelog? The machine in question is a Sun Fire x4140.

    Either way, we plan on taking the server down one evening and running memtest86 overnight.
    Thanks,
    J

    My guess is that it's actually something your machine's BIOS has been complaining about independently of mcelog. mcelog is merely the messenger, so don't shoot it for that.

    I don't have any experience with Sun hardware, so I cannot tell for sure. The folks at Sun (or do we have to call them "Oracles" by now?) ought to be more helpful concerning the actual cause of that message. It's probably something that puts your hardware slightly out of spec and has caused no harm so far...

    With regards,

    Jens

  5. #5

    Re: mcelog hardware error - is it my memory or CPU failing?

    Here is an update. We replaced the entire Sun Fire x4140 with another x4140: completely different hardware, except for the iSCSI HBA card, which we kept. I'm still seeing errors in /var/log/mcelog, but they seem to correspond to different DIMMs. So either this server's memory also has issues by coincidence, or the AMD-based x4140 gives mcelog some issues. We have several x4150s (Intel-based) that are fine.

    New output:

    Code:
    Hardware event. This is not a software error.
    MCE 0
    Hardware event. This is not a software error.
    CPU 4 BANK 4
    STATUS 0 MCGSTATUS 0
    CPU 4 4 northbridge
    MISC c0090fff01000000 ADDR edc79c1c0
    Hardware event. This is not a software error.
    CPU 0 BANK 0
    TIME 1335884912 Tue May  1 11:08:32 2012
    STATUS 0 MCGSTATUS 0
    DDR2 DIMM 333 Mhz Synchronous Width 72 Data Width 64 Size 4 GB
    Device Locator: DIMM14
    Bank Locator: BANK14
    Manufacturer: Qimonda
    Serial Number: FFFFFFFF
    Asset Tag: N/A
    Part Number:
    TIME 1335884912 Tue May  1 11:08:32 2012
      Northbridge RAM Chipkill ECC error
      Chipkill ECC syndrome = 5cac
           bit46 = corrected ecc error
           bit59 = misc error valid
           bit62 = error overflow (multiple errors)
      bus error 'local node response, request didn't time out
                 generic read mem transaction
                 memory access, level generic'
    STATUS dc5640005c080a13 MCGSTATUS 0
    MCGCAP 106 APICID 4 SOCKETID 1
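
    To see which DIMMs the errors land on over time, one rough approach is to tally the "Device Locator:" lines in the log. This is only a sketch: it assumes every record that mcelog resolves to a module carries a "Device Locator:" line like the one in the output above, and the helper name `count_dimm_locators` is made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as "Device Locator: DIMM14", ignoring leading indentation.
LOCATOR_RE = re.compile(r"^[ \t]*Device Locator:[ \t]*(\S+)", re.MULTILINE)


def count_dimm_locators(log_text: str) -> Counter:
    """Count how often each DIMM locator appears in mcelog output."""
    return Counter(LOCATOR_RE.findall(log_text))


# Excerpt copied from the output in the post.
SAMPLE = """\
DDR2 DIMM 333 Mhz Synchronous Width 72 Data Width 64 Size 4 GB
Device Locator: DIMM14
Bank Locator: BANK14
Manufacturer: Qimonda
"""

if __name__ == "__main__":
    for locator, hits in count_dimm_locators(SAMPLE).most_common():
        print(locator, hits)
```

    On the server itself, the same function could be fed the contents of /var/log/mcelog. A locator that dominates the tally is the first module to look at.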

  6. #6

    Re: mcelog hardware error - is it my memory or CPU failing?

    I disabled mcelog on this particular server. The service processor should give me a heads up on any hardware issues.

  7. #7

    Re: mcelog hardware error - is it my memory or CPU failing?

    You could swap the memory modules and see if the mcelog message changes.
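
    The reasoning behind the swap test can be sketched as follows, assuming you note which slot mcelog reports before the swap, which slot the suspect module was moved to, and which slot mcelog reports afterwards. The function name and the slot "DIMM12" are hypothetical; "DIMM14" is the locator from the output earlier in the thread.

```python
def interpret_swap(slot_before: str, slot_after: str, module_moved_to: str) -> str:
    """Classify the result of moving the suspect module to another slot.

    slot_before:     slot mcelog reported before the swap
    slot_after:      slot mcelog reports after the swap
    module_moved_to: slot that now holds the originally suspect module
    """
    if module_moved_to == slot_before:
        # The module never left its slot, so the test says nothing.
        return "inconclusive"
    if slot_after == module_moved_to:
        # The error followed the module: suspect the module itself.
        return "module"
    if slot_after == slot_before:
        # The error stayed put: suspect the slot or whatever sits behind it.
        return "slot"
    return "inconclusive"


if __name__ == "__main__":
    print(interpret_swap("DIMM14", "DIMM12", "DIMM12"))
```

    A result of "slot" points away from the module and towards the socket, board, or memory controller; "inconclusive" means more log data is needed before drawing a conclusion.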
