SLES 11 SP3 KVM guest crash



Midata
18-Aug-2015, 08:10
Hello, I'm having problems with my guest servers. I can't find any error messages, but every morning when I come in I have to shut down (force off) the server and turn it back on. It just seems to stop at some point overnight. We have our own software installed, and it runs its updates during that time. After we restart in the morning and run the updates by hand, it works fine. My development section can't find any error messages either. The server just seems to hang. According to YaST, the CPU is under 100% load. At that point I can't log in or do anything else, only turn it off.

The host and guest servers have been running since the middle of May. On the 23rd of July, the network section installed new switches. On the 30th of July, this problem started. So naturally it's my problem. And since there are no error messages, it's hard to locate.

Host Server is a Fujitsu TX2540-M1 with SLES 11 SP3, kernel 3.0.101-0.35-default, using KVM 1.4.1-0.11.1.
C600/X79 series chipset, I210 Gigabit Network Connection

Guest Server:
openSUSE 11.2, kernel 2.6.31.12-0.2-default
Memory: 3072 MB
Processors: 2
Graphics Adapter: Cirrus Logic GD5446 VGA
Key Map: de
Sound: none
Source Path: /KVM/STAMMDATEN/disk0.raw
Storage 1: 17 GB
Partition 1: /boot 78 MB
Partition 2: swap 2 GB
Partition 3: / rest(15 GB)
Storage 2: 13 GB
Partition 1: /usr/bbx 13 GB
Storage 3: 20 GB
Partition 1: /home/samba 20 GB
Network Type: Fully Virtualized realtek8139 (was QEMU Virtualized NIC Card; I changed it to see if it would help)
Source: br0

Thanks for any help or info.

jmozdzen
18-Aug-2015, 12:18
Hi Midata,

may I ask you to clarify a few points?

> I have to shutdown, FORCE OFF, the server and turn it back on.

That's the KVM guest that has to be shut down?

> It just seems to stop at some point over night.
> The server just seems to hang. Acording to yast, the cpu is under 100% load

Again, this is about the guest? Which YaST is reporting this (probably the host's), and which part of YaST? I'd have expected you to use "top" or the like, not YaST.

> And since there are no error messages, it's hard to locate.

Have you had a look at the guest's /var/log/messages after the reboot? Does it simply stop at some point, or does it actually continue to log until the reboot? And since this happens every night: open a remote ssh session to the KVM guest in the evening and run "top" there. If the guest actually hangs, then "top" should hang as well, leaving the time of its last update (top row, second value) on screen along with the most recent process list.

Do you have some SNMP monitoring running? Can you see any pattern, e.g. that it stopped at the same point in time? Do you see any preceding anomalies, e.g. an increase in CPU usage?

I'd recommend increasing the monitoring inside the KVM guest: find out via scripts whether the machine actually hangs or is rather too busy. If it is hanging on virtual disk access or excessive CPU, enable remote logging (to a central syslog server), if not already enabled. Run "ps alx" periodically, logging to disk, to keep a history of what is active right before the hang. Do SNMP monitoring (and logging of history) for CPU, memory consumption, disk IO and network; this is helpful anyhow and should be the base set to monitor for any production server. You can use free tools to do so, e.g. OpenNMS.
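
A minimal sketch of the periodic "ps alx" logging suggested above, assuming a Python 3 interpreter is available inside the guest (an older guest may only ship Python 2, in which case this would need adapting). The function names, the log path and the interval are illustrative choices, not something from this thread. Each snapshot is flushed and synced to disk so that the last entry has a chance of surviving a forced power-off.

```python
import os
import subprocess
import time


def take_snapshot(command=("ps", "alx")):
    """Run `command` and return its output prefixed with a timestamp line."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    proc = subprocess.run(
        list(command),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return "=== {} ===\n{}\n".format(stamp, proc.stdout)


def log_snapshots(path, interval=60, count=None, command=("ps", "alx")):
    """Append a snapshot to `path` every `interval` seconds.

    count=None keeps going until interrupted; a number stops after
    that many snapshots.
    """
    taken = 0
    while count is None or taken < count:
        with open(path, "a") as log:
            log.write(take_snapshot(command))
            log.flush()
            # Force the data out of the page cache before the next sleep.
            os.fsync(log.fileno())
        taken += 1
        if count is None or taken < count:
            time.sleep(interval)
```

Started in the evening as, for example, log_snapshots("/var/log/ps-history.log", interval=60), the last block in the file shows what was running shortly before the hang, and its timestamp narrows down when the hang happened.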

Oh, and what else (besides the earlier network update) changed on July 30? Have you had a look at /var/log/zypp/history both inside the VM and on the host to see if any RPMs were updated?

Regards,
Jens

Midata
18-Aug-2015, 12:55
> Hi Midata,
>
> may I ask you to clarify a few points?
>
> > I have to shutdown, FORCE OFF, the server and turn it back on.
>
> That's the KVM guest that has to be shut down?

Correct.

> > It just seems to stop at some point over night.
> > The server just seems to hang. According to yast, the cpu is under 100% load
>
> Again, this is about the guest? Which YaST is reporting this (probably the host's), and which part of YaST? I'd have expected you to use "top" or the like, not YaST.

Correct.

> > And since there are no error messages, it's hard to locate.
>
> Have you had a look at the guest's /var/log/messages after the reboot? Does it simply stop at some point, or does it actually continue to log until the reboot? And since this happens every night: open a remote ssh session to the KVM guest in the evening and run "top" there. If the guest actually hangs, then "top" should hang as well, leaving the time of its last update (top row, second value) on screen along with the most recent process list.

Correct.

> Do you have some SNMP monitoring running? Can you see any pattern, e.g. that it stopped at the same point in time? Do you see any preceding anomalies, e.g. an increase in CPU usage?

Not until it hangs.

> I'd recommend increasing the monitoring inside the KVM guest: find out via scripts whether the machine actually hangs or is rather too busy. If it is hanging on virtual disk access or excessive CPU, enable remote logging (to a central syslog server), if not already enabled. Run "ps alx" periodically, logging to disk, to keep a history of what is active right before the hang. Do SNMP monitoring (and logging of history) for CPU, memory consumption, disk IO and network; this is helpful anyhow and should be the base set to monitor for any production server. You can use free tools to do so, e.g. OpenNMS.

I did increase memory from 2 GB to 3 GB, no change. Memory goes over 2 GB only when the backup runs, which always works, although it backs up to a completely different server. The backup runs at 8 pm. The server hangs some time between 2 am and 5 am.

> Oh, and what else (besides the earlier network update) changed on July 30? Have you had a look at /var/log/zypp/history both inside the VM and on the host to see if any RPMs were updated?

Per my boss, updates are not done here. The only updates that get done are those that I do. This means that new servers get installed with ServerView and have the same OS and version.

> Regards,
> Jens

jmozdzen
18-Aug-2015, 13:33
Hi Midata,

if you "reply with quote", things might be a bit easier to read ;)


> > Hi Midata,
> >
> > may I ask you to clarify a few points?
> >
> > > I have to shutdown, FORCE OFF, the server and turn it back on.
> >
> > That's the KVM guest that has to be shut down?
>
> Correct

> > > It just seems to stop at some point over night.
> > > The server just seems to hang. According to yast, the cpu is under 100% load
> >
> > Again, this is about the guest? Which YaST is reporting this (probably the host's), and which part of YaST? I'd have expected you to use "top" or the like, not YaST.
>
> Correct

"Correct" in the sense that you actually used "top" and not YaST to diagnose?

> > > And since there are no error messages, it's hard to locate.
> >
> > Have you had a look at the guest's /var/log/messages after the reboot? Does it simply stop at some point, or does it actually continue to log until the reboot? And since this happens every night: open a remote ssh session to the KVM guest in the evening and run "top" there. If the guest actually hangs, then "top" should hang as well, leaving the time of its last update (top row, second value) on screen along with the most recent process list.
>
> Correct

So you have done this already? How did the system behave then, i.e. judging from syslog? Did the "top" run hang, or did it continue? If it hung, what was the last system state (CPU, memory, IO load)?

> > Do you have some SNMP monitoring running? Can you see any pattern, e.g. that it stopped at the same point in time? Do you see any preceding anomalies, e.g. an increase in CPU usage?
>
> Not until it hangs

> > I'd recommend increasing the monitoring inside the KVM guest: find out via scripts whether the machine actually hangs or is rather too busy. If it is hanging on virtual disk access or excessive CPU, enable remote logging (to a central syslog server), if not already enabled. Run "ps alx" periodically, logging to disk, to keep a history of what is active right before the hang. Do SNMP monitoring (and logging of history) for CPU, memory consumption, disk IO and network; this is helpful anyhow and should be the base set to monitor for any production server. You can use free tools to do so, e.g. OpenNMS.
>
> I did increase memory from 2 GB to 3 GB, no change. Memory goes over 2 GB only when the backup runs, which always works, although it backs up to a completely different server. The backup runs at 8 pm. The server hangs some time between 2 am and 5 am.

If something eats up all memory, it'll eat up more memory as well :D Did you trace the system state via some script, and if so, what's the outcome? What happens right before the "hang"?

> > Oh, and what else (besides the earlier network update) changed on July 30? Have you had a look at /var/log/zypp/history both inside the VM and on the host to see if any RPMs were updated?
>
> Per my boss, updates are not done here. The only updates that get done are those that I do. This means that new servers get installed with ServerView and have the same OS and version.

That it's only a single set of hands on the machine will make debugging this easier, indeed. But it's still interesting that no evident change happened on July 30.

Have you checked the hardware monitoring, especially for disk faults? Additionally, you might want to run "rpm -Va" to spot erroneous file modifications.

Regards,
Jens

Midata
18-Aug-2015, 14:45
Hi Jens,

> if you "reply with quote", things might be a bit easier to read ;)

All problems occur with the guest server only. The host server runs fine.

> "correct" in the sense that you actually used "top" and not YaST, to diagnose?

Yes.

> So you have done this already? How did the system behave then, i.e. judging from syslog? Did the "top" run hang, or did it continue? If it hung, what was the last system state (CPU, memory, IO load)?

Everything just stops. top shows no load on CPU, memory or IO. In messages, I just see that the server was running and then it's booting. The processes shown in top at the time of the freeze were normal.

> If something eats up all memory, it'll eat up more memory as well :D Did you trace the system state via some script, and if so, what's the outcome? What happens right before the "hang"?

I increased memory because it was running at around 2 GB most of the time, but that's only because they have Java set to use 1 GB. It's not actually using it; most of it is cached or buffered. I was just trying something.

> That it's only a single set of hands on the machine, will make debugging this easier, indeed. But it's still interesting that no evident change happened on July 30.

Here is where my problem begins. I'm not an expert; I have been learning Linux since 1996, and before that I had some experience with AT&T Unix. I know that around that time our development section updated their software, but we have about 400 customers using it and they were updated too. But they have different network configurations than we do, and that's where I think the problem is. But I can't get them to tell me what was going on in their software at the time.

> Have you checked the hardware monitoring, especially for disk faults? Additionally, you might want to run "rpm -Va" to spot erroneous file modifications.

Yes, everything looked fine.

I hope I'm doing this right; if not, sorry. Info: there are 3 other guest servers on this host, with roughly the same configuration and conditions. Mostly just the hard drives are bigger or smaller. They too have done this, but only once or twice a week and not every day. One is a test server, so I just turned it off.

Thanks

Sorry, did it backwards

jmozdzen
18-Aug-2015, 16:32
Hi Midata,

> Everything just stops. Top Shows no load on CPU, Memory or IO. In Messages, I see just the Server was running and then it's booting. Processes shown on top at time of freeze were normal.

what puzzles me is your earlier reference that the VM, when in this condition, uses 100% CPU... so it's doing *something*, and as everything in userland comes to a halt, it's likely that it's the VM's kernel that is munching away the time.

I guess you already had a look at the KVM logs for that VM, on the host, without finding any indications of what's going on. As you're going to kill the VM anyhow, there are two things you may want to try on the next hang:

- run "strace" on the KVM process. This will show if KVM is spinning around some resource on the host machine - although I doubt it's the case.
- after detaching strace, attach to the process via "gdb" and run "info thread" and "where" for all (or just the interesting-looking) threads. This may give you an idea on where KVM is working internally.


> There are 3 other guest Servers on this Host, with roughly same configuration and conditions. Mostly just Hard drives are bigger or smaller. They too, have done this, but only once or twice a week and not everyday. One is a Test Server, so I just turned it off.

This makes for an interesting option: if you can get the test server to crash reliably, you could then start isolating the cause by turning off one piece of software after the other, starting with the programs that were updated last.

You wrote the VM runs OpenSUSE 11.2 - that's horridly old software, with a kernel full of bugs. Do you have the option to at least upgrade the kernel?

Regards,
Jens

Midata
19-Aug-2015, 08:32
Hi Jens,



> what puzzles me is your earlier reference that the VM, when in this condition, uses 100% CPU... so it's doing *something*, and as everything in userland comes to a halt, it's likely that it's the VM's kernel that is munching away the time.

This is what YaST on the host server shows. All other information and attachments are from the guest server.

> I guess you already had a look at the KVM logs for that VM, on the host, without finding any indications of what's going on. As you're going to kill the VM anyhow, there are two things you may want to try on the next hang:
>
> - run "strace" on the KVM process. This will show if KVM is spinning around some resource on the host machine - although I doubt it's the case.
> - after detaching strace, attach to the process via "gdb" and run "info thread" and "where" for all (or just the interesting-looking) threads. This may give you an idea on where KVM is working internally.

I'm assuming I do this from the host server?
Can you give me the exact commands? I'm no expert. I started "strace qemu-kvm" and got a lot of information, but I'm not sure what it all meant. Also, a window from qemu opened that I didn't understand. See attachments. I tried to save the strace output to "strace.log" and "strace-log", but they both showed up empty.

> This makes for an interesting option: If you can get the test server to crash reliably, you could then start isolating the causing factor by turning off one piece of software after the other. Starting with the programs that were updated last.

I can't get it to crash like this one. The other VM servers crash 1, maybe 2 times a week. I don't understand that, and I can't find the difference.

> You wrote the VM runs OpenSUSE 11.2 - that's horridly old software, with a kernel full of bugs. Do you have the option to at least upgrade the kernel?

Yes, I know that, but I also can't change it.

> Regards,
> Jens

jmozdzen
19-Aug-2015, 14:01
Hi Midata,


> Regards,
> Jens

(are you a "Jens", too? Then I apologize for addressing you by your nickname only :o )


> Hi Jens,
> [...]
>
> [... regarding strace and gdb ...]
> I'm assuming I do this from the host Server?
> Can you give me the exact commands? I'm not expert. I started "strace qemu-kvm" and got a lot of information, but not sure what it all meant. Also a window from qemu opened that I didn't understand. See attachments. I tried to save strace to a "strace.log" and "strace-log", but they both showed up empty.

You'd have to attach to the running KVM instance by specifying its pid (see "man strace"). But that probably was a bad suggestion on my part anyhow: as you didn't know strace yet, you'll probably get no helpful hints from the vast amount of information it will produce. But if you're curious, I encourage you to have a look anyhow. I use it quite a lot for black-box analysis, e.g. to see which files a program actually tries to open when it just reports "file myconfig.cfg not found" :)

Your invocation of strace created a new instance of KVM that was missing the required KVM parameters etc., hence the complaints by KVM that you show in your screenshot. (You'll likely want to attach to the already running instance, especially since you only want to catch the details from after the crash. Otherwise you'd get flooded with details.)
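
A rough sketch of what "attach to the running instance" could look like, written as a small Python helper that looks up the pid and builds the command lines. The helper names are made up for illustration, and it assumes the hung guest can be recognised by some string on its qemu-kvm command line (for example its disk image path). The strace options (-p, -f, -tt, -o) and gdb options (-p, -batch, -ex) are standard ones, but check "man strace" and "man gdb" on the host before relying on them.

```python
import os


def pids_matching(needle, proc_root="/proc"):
    """Return PIDs whose command line contains `needle` (Linux /proc layout)."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as fh:
                raw = fh.read()
        except OSError:
            continue  # process exited meanwhile or is not readable
        cmdline = raw.replace(b"\0", b" ").decode("utf-8", "replace")
        if needle in cmdline:
            found.append(int(entry))
    return sorted(found)


def strace_attach_command(pid, log_path):
    # -p attaches to an existing process instead of starting a new one,
    # -f follows its threads, -tt adds timestamps, -o writes to a file.
    return ["strace", "-f", "-tt", "-p", str(pid), "-o", log_path]


def gdb_backtrace_command(pid):
    # Batch mode: attach, list the threads, print every thread's stack, exit.
    # "bt" is gdb's backtrace command; "where" is an alias for it.
    return ["gdb", "-p", str(pid), "-batch",
            "-ex", "info threads", "-ex", "thread apply all bt"]
```

On the host, as root, pids_matching("/KVM/STAMMDATEN/disk0.raw") should then list the pid of the hung guest (plus the helper's own pid if that string was passed on its command line), and the two lists can be handed to subprocess.call(). Stop strace (Ctrl-C) before starting gdb, since a process can only have one tracer attached at a time.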


> [... test machine ...]
> I can't get it to crash like this one. The other VM Servers crash 1 maybe 2 time a week. I don't understand that. And I can't find the difference.

From your description, I guess that the difference is the usage pattern: Assuming that some "coincidence" (timing-related issues concerning certain calls) needs to happen for the system to crash, the more it's used, the more likely it is to happen.



> [... kernel upgrade for old system ...]
> Yes I know that, but I also can't change it.

Do you have an idea *what* the applications are doing / requiring? If those aren't system-level applications, changing the kernel should not be that problematic... and if, in the end, you have a system that is again running stable, this should be more beneficial to the users. (Yes, I know about application dependencies and about developers, we're doing both system-level and application-level software development. I'm just checking options.)

OTOH, crashing the kernel via typical non-system applications is pretty rare. Not that we haven't seen it: certain (old) Linux kernel versions crashed randomly, but often, when run under specific hypervisors. These VMs were only running compile jobs (including NFS access) at the time of the crash, nothing fancy. Running them under a different hypervisor sometimes helped, if only to reduce the likelihood of a crash from once a day to once a month.

You might want to check if you're eligible to open a service request with SUSE - if running OpenSUSE 11.2 inside KVM on SLES11SP3 is a supported case.

Regards,
Jens

Midata
21-Aug-2015, 07:16
Hello Jens,

I think the problem has been solved. It looks like it was the switches. The network section moved the cable from the slave switch to the master switch, and at least for the last 2 days it has worked fine.
It still doesn't make sense to me, but it's working now. Thanks for all the help. At least I learned some new stuff.

Do I have to Close this? If so HOW?

Have a nice day.

jmozdzen
21-Aug-2015, 11:05
Hi Midata,

while I really appreciate that you're back to a reliable system, and that it was an external cause (rather than some user-land software crashing the VM/kernel), it leaves me puzzling over what side effects hung your VM: just killing the network connection should not leave your VM completely hung.

Might it be that your (virtual console) login hung because of authentication / post-login waiting for network-based services, i.e. LDAP or NFS? (No need to answer, it's just that I prefer to understand *why* something fails.)

Thank you for reporting back!

Regards,
Jens

Midata
21-Aug-2015, 11:14
Hello,

Yes, I know what you mean. We have software from Basis that uses Java and Tomcat. The hangs always seemed to happen when they were starting.

Midata
02-Aug-2016, 08:13
Hello, I finally found the Problem. It has to do with the date and time. See link.

https://blog.jensschanz.de/2012/03/09/kvm-non-smp-guests-become-unresponsive-and-use-100-cpu/

jmozdzen
02-Aug-2016, 16:52
Hi Midata,

great find! And thank you for reporting back.

Regards,
J