SLES 10 SP4 guest slowly in SLES 11 SP4 host - high si



berndgsflinux
27-Feb-2017, 14:24
Hi,

I have a SLES 10 SP4 guest running slowly on a SLES 11 SP4 host. The guest runs a small web application with a MySQL DB, an Apache web server and some Perl scripts. The DB is not really busy, maybe a few hundred requests per day.
But the system performs slowly, especially on the console. When connected via ssh, top, for example, needs between 2 and 3 seconds to refresh its output. It's really no fun to work on the console.
The system has 4 virtual CPUs and 8 GB of RAM. It used to be a physical system and was migrated to a VM. The physical system has just 2 CPUs but runs fine. The host is very powerful: 8 cores and 96 GB of RAM. The host runs fine, with no performance problems, and the guest is not a heavy load for it. The hypervisor is KVM. The vmx flag is set on the host cores.
What I see is that the guest has constantly high si in top.

Here is a typical example:

top - 14:12:31 up 38 min, 9 users, load average: 0.81, 0.69, 0.60
Tasks: 111 total, 2 running, 109 sleeping, 0 stopped, 0 zombie
Cpu0 : 0.0%us, 1.6%sy, 0.0%ni, 95.6%id, 0.0%wa, 0.0%hi, 2.8%si, 0.0%st
Cpu1 : 1.1%us, 1.1%sy, 0.0%ni, 86.8%id, 3.9%wa, 0.0%hi, 7.1%si, 0.0%st
Cpu2 : 0.6%us, 0.6%sy, 0.0%ni, 58.7%id, 0.0%wa, 5.2%hi, 34.8%si, 0.0%st
Cpu3 : 0.3%us, 0.3%sy, 0.0%ni, 85.8%id, 0.0%wa, 0.3%hi, 13.2%si, 0.0%st
Mem: 7995120k total, 1122420k used, 6872700k free, 75368k buffers
Swap: 2104472k total, 0k used, 2104472k free, 790208k cached

As you can see, the system isn't doing much: very little I/O and only a little load in user and system context, but a lot of si.
How can I find out where this si comes from? IIRC, I had this problem before. I upgraded the system completely to SLES 11 SP4,
and the high si was history. But that was just a test; the system needs to stay on SLES 10 SP4.
Maybe the kernel or some modules are the culprit? I tried to update only the kernel, but that's not possible:
tons of dependencies.

Thanks for any help.


Bernd

ab
27-Feb-2017, 15:09
I have not seen this, but I also do not know that I have ever had SLES 10
as a KVM guest, where I have had SLES 11 and 12 as such, both with at
least the 3.x kernel.

One thread I found online said another user was helped with a similar
kernel (version) by changing the clock settings on the guest:

https://www.centos.org/forums/viewtopic.php?t=17663

I have not had to hack kernel clock settings for a while, but "a while"
means back to the SLES 9 or 10 days, so maybe this is valid for you after
all; if nothing else it's a simple test.

I found another thread that indicated a qemu patch had been submitted to
help with si being high due to network traffic, but your box doesn't sound
busy enough to warrant that.... still who knows. If it is "just" a simple
LAMP box, have you considered upgrading? In my experience these types of
upgrades are really easy and low risk, with the hardest part being keeping
downtime to a minimum as you move over the last (if applicable) database
changes going from the old to new systems, but even that can be really
simple depending on the application's use of the DB.

--
Good luck.

If you find this post helpful and are logged into the web interface,
show your appreciation and click on the star below...

berndgsflinux
27-Feb-2017, 16:14
Hi ab,

Unfortunately that didn't help. Still high si. Is there a way to find the reason for it? It's just a "simple" LAMP box, but ...
Although it's not heavily utilized, it is absolutely business-critical. Downtime during the day must not happen.
And the developer who took care of it has left us, and we will not get a replacement. My colleague and I really don't want to upgrade it,
because we can't be absolutely sure that new versions of MySQL, Perl and Apache won't interfere with the application.
Neither of us is a Perl developer. And the app is very powerful; it has a lot of functionality. Testing every function is hell,
and we would surely overlook some. That's the problem. Downtime in the evening or at the weekend could last for hours;
that's not the issue.

Bernd

ab
27-Feb-2017, 17:21
Bummer; I have never, as far as I know, seen an issue with si being high,
so I do not have a lot of experience there, thus my results all being from
Google last time.

I understand the simple-yet-critical argument; having the original
developer gone is painful but that is where you are. If you can come up
with a couple reliable ways to monitor the system, and test the critical
functions fairly quickly, I think your best bet is probably still to
upgrade. The nice thing about this type of setup, usually, is that you
can copy it to a new system, pound it with tests for an hour/day/week, and
then when you are ready to swap over just refresh the data in the database
and update DNS pointers (hopefully clients are using DNS) or change IP
addresses over and that's it. Falling back to the original system is just
as easy, in case you find something was missed after some time.

--
Good luck.


ab
27-Feb-2017, 17:25
What do you get from the following command:



cat /proc/interrupts


I'm trying to find ways to find causes of interrupts, and the types of
interrupts, and perhaps this virtual file will help us.

--
Good luck.


berndgsflinux
27-Feb-2017, 18:00
I would like to upgrade. But as I said, the system is business-critical. Really. We are a research institute, and the DB is for our mouse breeding. We have about 5,000 mice, and if the DB is inconsistent or buggy, we are completely lost.
And the source code is not exactly clean and well written; we have already seen some lines that are horrible. And the DB, which was not developed by us, has no relations !!! Just one table has a primary key. So now you know what we are talking about.
The Perl scripts total about 60,000 lines of code. That's about 2000 pages. That's not really big; surely there are plenty of bigger applications. But it's big enough that we hesitate.
If someone could promise me that an upgrade will run 100 % smoothly, OK. Do you know anyone who will promise that? If you see any way, tell me.

Bernd

berndgsflinux
27-Feb-2017, 18:02
I ran it three times so that you get better insight:

vm58820-4:~ # cat /proc/interrupts
CPU0 CPU1 CPU2 CPU3
0: 439757 0 0 0 IO-APIC-edge timer
1: 53 0 285 0 IO-APIC-edge i8042
8: 1 0 0 0 IO-APIC-edge rtc
9: 0 0 0 0 IO-APIC-level acpi
10: 1 0 0 0 IO-APIC-level virtio2, uhci_hcd:usb1
12: 104 0 0 0 IO-APIC-edge i8042
15: 122 0 0 10637 IO-APIC-edge ide1
177: 0 0 0 0 PCI-MSI-X virtio1-config
185: 6906 26624 0 0 PCI-MSI-X virtio1-requests
193: 0 0 0 0 PCI-MSI-X virtio0-config
201: 146 0 38029 0 PCI-MSI-X virtio0-input
209: 86 0 0 1551 PCI-MSI-X virtio0-output
NMI: 0 0 0 0
LOC: 881734 905777 481458 882053
ERR: 0
MIS: 0
vm58820-4:~ # cat /proc/interrupts
CPU0 CPU1 CPU2 CPU3
0: 440141 0 0 0 IO-APIC-edge timer
1: 53 0 285 0 IO-APIC-edge i8042
8: 1 0 0 0 IO-APIC-edge rtc
9: 0 0 0 0 IO-APIC-level acpi
10: 1 0 0 0 IO-APIC-level virtio2, uhci_hcd:usb1
12: 104 0 0 0 IO-APIC-edge i8042
15: 122 0 0 10643 IO-APIC-edge ide1
177: 0 0 0 0 PCI-MSI-X virtio1-config
185: 6906 26636 0 0 PCI-MSI-X virtio1-requests
193: 0 0 0 0 PCI-MSI-X virtio0-config
201: 146 0 38057 0 PCI-MSI-X virtio0-input
209: 86 0 0 1556 PCI-MSI-X virtio0-output
NMI: 0 0 0 0
LOC: 882538 906581 481864 882757
ERR: 0
MIS: 0
vm58820-4:~ # cat /proc/interrupts
CPU0 CPU1 CPU2 CPU3
0: 440338 0 0 0 IO-APIC-edge timer
1: 53 0 285 0 IO-APIC-edge i8042
8: 1 0 0 0 IO-APIC-edge rtc
9: 0 0 0 0 IO-APIC-level acpi
10: 1 0 0 0 IO-APIC-level virtio2, uhci_hcd:usb1
12: 104 0 0 0 IO-APIC-edge i8042
15: 122 0 0 10649 IO-APIC-edge ide1
177: 0 0 0 0 PCI-MSI-X virtio1-config
185: 6906 26636 0 0 PCI-MSI-X virtio1-requests
193: 0 0 0 0 PCI-MSI-X virtio0-config
201: 146 0 38075 0 PCI-MSI-X virtio0-input
209: 86 0 0 1563 PCI-MSI-X virtio0-output
NMI: 0 0 0 0
LOC: 882988 907053 482069 883108
ERR: 0
MIS: 0
vm58820-4:~ #

Bernd

KBOYLE
27-Feb-2017, 20:02
berndgsflinux wrote:
> I would like to upgrade. But as I said, the system is business-critical. Really. We are a research institute, and the DB is for our mouse breeding. We have about 5,000 mice, and if the DB is inconsistent or buggy, we are completely lost.
> And the source code is not exactly clean and well written; we have already seen some lines that are horrible. And the DB, which was not developed by us, has no relations !!! Just one table has a primary key. So now you know what we are talking about.
> The Perl scripts total about 60,000 lines of code. That's about 2000 pages. That's not really big; surely there are plenty of bigger applications. But it's big enough that we hesitate.
> If someone could promise me that an upgrade will run 100 % smoothly, OK. Do you know anyone who will promise that? If you see any way, tell me.
>
> Bernd

Hi Bernd,

I can understand why you don't want to make changes to an old, unsupported, poorly developed, business-critical system, but this system has an issue that needs to be resolved, and resolving it will likely require some changes to the system.

In a previous post, ab (Aaron) said:

> The nice thing about this type of setup, usually, is that you
> can copy it to a new system, pound it with tests for an hour/day/week, and
> then when you are ready to swap over just refresh the data in the database
> and update DNS pointers (hopefully clients are using DNS) or change IP
> addresses over and that's it. Falling back to the original system is just
> as easy, in case you find something was missed after some time.

If you follow his suggestion, you will not be making any changes to your business critical system!

1. Make a clone of your VM.
2. In the evening, shut down your production VM and start the clone.
3. Make sure there are no IP address conflicts so the production VM can be started and left running.
4. Upgrade the clone to SLES 11 SP4.

This will help determine whether your issue can be resolved by upgrading SLES to a newer version. The newer version has many patches which may resolve your issue. You have not changed your production system and can decide at a later time whether or not you should update your production VM.

A note about MySQL:
While a single table is probably not the best database design, it means that this is a very simple database that is likely exploiting very few MySQL features. While performance may suffer, there is a much better chance it will be compatible with a newer version of MySQL.

We are fortunate to have Aaron helping in these forums. His Linux skills are extensive. If anyone can help you with your issue, it will be he.

I do have one suggestion for you:
When posting the output from commands, please use code tags (click on the "#" icon at the top of the reply box) and paste the results between the [CODE] and [/CODE] tags. This will make your information much easier to read.

ab
28-Feb-2017, 17:06
LOC stands for 'Local timer interrupts' according to my laptop's version
of the file (openSUSE, newer version).

Some reading online says this has to do with multi-CPU process handling,
and that it's not a bad thing generally. Did you run those commands one
second apart? One minute? One hour? Other?

I suppose you could try something crazy and decrease the number of
assigned cores. They're all virtual anyway, so maybe your idle box has an
older bug that causes a lot of unnecessary overhead due to number of
processors which is just causing your problem instead of helping
performance. That would be ironic.

None of your other numbers seem to be that big, or growing very quickly,
but again I do not know the time intervals involved.
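
One way to take the guesswork out of the interval would be to save two snapshots of /proc/interrupts a known number of seconds apart and print how fast each line grows. The following is only a minimal sketch, assuming a POSIX shell and awk on the guest; the function name irq_delta and the /tmp file names are made up for illustration.

```shell
#!/bin/sh
# irq_delta BEFORE AFTER [SECONDS]
# Hypothetical helper: compares two saved copies of /proc/interrupts and
# prints, for every interrupt line, the growth per second summed over all CPUs.
irq_delta() {
    awk -v secs="${3:-1}" '
        # Add up the per-CPU counters that follow the "NAME:" label;
        # stop at the first non-numeric field (controller / device names).
        function total(   i, sum) {
            sum = 0
            for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++) sum += $i
            return sum
        }
        $1 !~ /:$/ { next }                        # skip the CPU0 CPU1 ... header
        NR == FNR  { before[$1] = total(); next }  # first file: remember totals
        ($1 in before) {                           # second file: print the rate
            printf "%-6s %10.1f/s\n", $1, (total() - before[$1]) / secs
        }
    ' "$1" "$2"
}

# Example use (file names are arbitrary):
#   cat /proc/interrupts > /tmp/irq.1; sleep 10; cat /proc/interrupts > /tmp/irq.2
#   irq_delta /tmp/irq.1 /tmp/irq.2 10
```

Lines whose rate stands out (for example one of the virtio lines or LOC) would then be the first candidates to look at.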

--
Good luck.


ab
28-Feb-2017, 17:15
On 02/27/2017 10:04 AM, berndgsflinux wrote:
>
> I would like to upgrade. But as said the system is buisness-critical.
> Really. We are a research institute, and the DB is for our mice
> breeding. We have about 5.000 mice, and if the DB is inconsistent or
> buggy, we are completely lost.

Yes, this is not abnormal though. Servers are usually somewhere from
important to mission-critical, and yet upgrades happen.

As Kevin mentioned, the option I'm proposing is to build a new box (copy
existing or build new, whatever) either as the new code, or to be upgraded
to the current code without being the currently-running prod box.

> And the source code is not really fine and well written, we saw already
> some lines which are horrible. And the DB, which was not developped by
> ourself, has no relations !!! Just one table has a primary key. So now
> you know what we are talking about.

That's not abnormal either, which is why I proposed building a new box,
testing it there without having it be the box that serves the
mission-critical work until you are sure the upgrade is a success.

> The perl-scripts have in sum about 60.000 lines of code. That's about
> 2000 pages. That's not really big, surely there is a mass of bigger

I presume you mean 200 pages; 30 lines per page seems a bit short, not
that lines per page mean anything since pages only count if you print, and
then it's influenced by nonsense like font size, and nobody prints source
code.

> applications. But it's big enough that we hesitate.
> If someone could promise me that an upgrade will run smoothly at 100 %,
> ok. Do you know someone who will promise that ? If you see any way, tell
> me.

You should hesitate; that's why you have the job you do. A guarantee of
success is probably impossible, but I would recommend the other approach
for that very reason, especially when combined with the need for this
service to persist reliably.

1. Build a new SLES 11 or SLES 12 box.
2. Be sure you have Perl/MySQL/Apache2(httpd) (or MariaDB instead of
MySQL) as needed.
3. Copy over the data; Perl scripts, configuration files, the DB itself, etc.
4. Test it, while your existing system keeps happily-ish performing
badly, but otherwise doing its job.
5. When ready, update the DB on the new box, point clients there, and
turn off the old box.

--
Good luck.


berndgsflinux
25-Aug-2017, 15:47
Finally, half a year later, I found the culprit. I was using a virtio network adapter in the VM. You read everywhere that paravirtualized hardware is faster than hardware that has to be completely emulated, but that's not true in my case. I switched the network adapter to e1000 in the VMM, and now the system runs fine, with nearly zero si. Very interesting.
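
For anyone who manages the guest through libvirt on the command line rather than through the VMM GUI, the same change could be sketched roughly as below. This assumes the guest is a libvirt domain (called sles10-guest here, a made-up name) whose XML has a model line of type virtio inside its interface element; nic_to_e1000 is a hypothetical helper, not part of libvirt.

```shell
#!/bin/sh
# nic_to_e1000: hypothetical filter that reads a libvirt domain XML on stdin
# and writes it to stdout with the NIC model changed from virtio to e1000.
# The substitution is limited to <interface> ... </interface> blocks, so
# virtio disks (bus='virtio') are left untouched.
nic_to_e1000() {
    sed "/<interface /,/<\/interface>/ s/<model type=['\"]virtio['\"]\/>/<model type='e1000'\/>/"
}

# Example use on the host (domain name and paths are assumptions):
#   virsh dumpxml sles10-guest --inactive > /tmp/guest.xml
#   nic_to_e1000 < /tmp/guest.xml > /tmp/guest-e1000.xml
#   virsh define /tmp/guest-e1000.xml
# The new model is picked up the next time the guest is shut down
# completely and started again.
```

Keeping the original /tmp/guest.xml around makes it easy to go back to virtio by defining the domain from that file again.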

Bernd

smflood
25-Aug-2017, 17:29
On 25/08/17 15:54, berndgsflinux wrote:

> Finally, half a year later, i found the culprit. I used a virtio network
> adapter in the vm. Although you read everywhere that paravirtualized
> hardware is faster than ones which have to be completely emulated. But
> it's not true in my case. I used e1000 as network adapter in the VMM,
> and now the system is running fine, with nearly zero si. Very
> interesting.

I'm glad you found a solution and thanks for taking the time to report back.
--
Simon
SUSE Knowledge Partner


berndgsflinux
25-Aug-2017, 17:50
Yes, I'm happy. I have had this problem several times already. Strange that the paravirtualized drivers slow down a VM more than the emulated devices. It should be the other way round, judging from everything you read about virtualization.
I think it's important to always post the solution if you find one; surely someone else has the same or a similar problem.
I know that I ask more in mailing lists and forums than I help, so this is my small tribute to the community :-))
Bernd

KBOYLE
25-Aug-2017, 23:19
berndgsflinux wrote:
> Finally, half a year later, I found the culprit. I was using a virtio network adapter in the VM. You read everywhere that paravirtualized hardware is faster than hardware that has to be completely emulated, but that's not true in my case. I switched the network adapter to e1000 in the VMM, and now the system runs fine, with nearly zero si. Very interesting.
>
> Bernd

Normally a paravirtual NIC should be faster and more efficient, but that is not to say there can't be a bug in the driver. SLES 10 is very old, and bug fixes may not have been backported to that release.

I'm glad you found a workaround with the E1000 adapter. If you ever decide to upgrade this server, be sure to try the paravirtual nic and see if the issue has been resolved.