
View Full Version : Odd hangs after writing to SAN



lpphiggp
30-Jun-2014, 18:49
We just started setting up our first chassis/blade based server (Dell m1000e) which will host xen vms.

The physical OS (SLES11sp3) and xen virtual machines reside on the internal blade server drives (mirrored), but their data and mail (Novell NSS volumes) will live on our SAN, on a CLARiiON CX4.

This is still the testing stage, we haven't got any OES vms in the tree yet or NSS volumes to speak of, so I'm just working with straight linux and ext3 FS.

Environment:
I just created a couple of LUNs on the CX4 for testing.
So, I've got the physical server with one LUN assigned and mounted directly to it (/dev/emcpowerb), and one VM. I thought about using NPIV to get the VM to register directly with the CX4, but that seemed overly complicated. So only the physical box is zoned and registered to the CX4, and it sees all the LUNs; any LUN meant for a VM is simply added as additional storage to the VM via Xen (and not mounted or used by the physical host). Make sense?

Situation:
I ran some tests writing a 3.2 GB iso file to the SAN luns on both phys and vm to compare performance; on the physical machine, it copies in about 5 seconds. Nice.
But on the vm, it either takes nearly a minute (!) or it seems done in about 8 seconds, but then any following command makes the machine hang at the prompt, even a simple command like "ls" or "hostname".

I have no idea what is causing this hanging, but it's not a good sign. The phys host seems okay, it's the vm that does this.

I don't really see anything in /var/log/messages that is useful. I know from the physical test that the SAN is good, it screams. But something seems amiss with xen.
Any ideas where to start to look?

cjcox
01-Jul-2014, 04:02
Make sure you don't have a frame mismatch. Check your jumbo frame settings and
make sure everything is consistent. Frame errors will definitely slow things
down to a crawl.


Magic31
01-Jul-2014, 09:16
... So, the physical box alone is zoned and registered to the CX-4, and all luns are seen by it, but any LUNs meant for a vm are simply added as add'l storage to the VM via xen. (and not mounted or used by the physical host). Make sense?..

Sure, that makes sense. And that should work well... just watch out for the corruption that can happen if you end up running the same guest configuration on different blades (meaning one virtual server actually running on two or more different Xen hosts simultaneously).

PowerPath is running on the host, I presume? I'm curious about the phy: path you are using, as a first thought is that an incorrect path might affect throughput if it's not (fully) being handled by the multipath daemon.

Cheers,
Willem

jmozdzen
01-Jul-2014, 14:34
Hi lpphiggp,

seems you're using an FC-based SAN, else NPIV would make no sense.

> I thought about using NPIV to get the vm to directly register to the CX4 but that seemed overly complicated.

I'd say it's one of the best inventions since "sliced bread" ;) Combine it with Xen DomU locking (where the Xen servers share a file system with lock files per VM, so that any VM will only be started once in your cluster) and you gain a lot of flexibility and security (even if NPIV will only run at the Dom0 level; I haven't seen vHBA via NPIV in action yet).

> But on the vm, it either takes nearly a minute (!) or, it seems done in about 8 seconds but then any following command makes the machine hang at the prompt, even a simple command like "ls" or "hostname".

Sounds like the VM is busy flushing its caches. Could you please be more specific on the VM setup - are those HVMs or PVMs, how's the DomU storage defined, what do you see in terms of iowaits, buffer usage, etc inside the DomU and the Dom0 when you hit that situation, what i/o schedulers are you using? Any performance tuning in the I/O layer yet?

The theoretical data path (it may be more complicated, depending on your setup) is

- VM: file is written to file system (and/or the Linux cache)
- VM: the dirty cache pages are written to the (virtual) device
- Xen: The writes to the virtual device are handed to Dom0
- Dom0: The writes go to the HBA device handler, to be written in SAN direction
- (FC network, SAN server)

My bet goes to the virtual device handling, where data is passed to Dom0. But now it's time for bottleneck analysis, to know for sure and how to cure it ;)

With regards,
Jens

lpphiggp
01-Jul-2014, 15:31
Hi Jens,

Yes, it's FC.
The xen vm is default - paravirtualized. I haven't done any kind of tuning, I'm not even sure how to do any. I/O scheduling? I confess I'm at a loss there.
Where do I start?

I added the SAN LUN to the VM by editing the VM's file under /etc/xen/vm. I added the line 'phy:/dev/emcpowera,xvdc,w' to the disk=[ ... ] line, and started it up with xm create <server>.
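For reference, a sketch of what the relevant part of that file then looks like (the root-disk entry here is purely illustrative; only the emcpowera line is from my setup):

```
# /etc/xen/vm/<server> (excerpt)
disk=[ 'phy:/dev/vg0/server-disk0,xvda,w',   # existing root disk (example entry)
       'phy:/dev/emcpowera,xvdc,w', ]        # SAN LUN via the PowerPath pseudo-device
```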

Not sure I need to lock the DomU, this isn't a cluster. Our Unit will only use the SAN for the ability to add and expand space, not sharing. At least, for the foreseeable future.

lpphiggp
01-Jul-2014, 16:55
Thanks, that's unlikely to happen though; we're not clustering. Yes, I'm running PowerPath 5.7 SP4 (for SLES11SP3).
I used the pseudo device /dev/emcpowera for the path.

lpphiggp
01-Jul-2014, 17:12
That last reply was to Magic31, I forgot to quote

jmozdzen
03-Jul-2014, 14:13
Hi lpphiggp,

Hi Jens,

Yes, it's FC.
The xen vm is default - paravirtualized. I haven't done any kind of tuning, I'm not even sure how to do any. I/O scheduling? I confess I'm at a loss there.
Where do I start?

by running "vmstat 1" and "iostat -N 1" both on the DomU (VM) and the Dom0 (host). "iostat" is from the sysstat package, highly recommend at least on the Dom0 ;) Pay special attention to the i/o waits in conjunction with actual disk writes and try to get an idea of what's happening (when is what data going where, and why is it so slow?).

"I/O scheduling" refers to the algorithms used to decide when to write what data from the OS block buffer to the physical media, see "more /sys/block/*/queue/scheduler". You'll probably see "cfq" selected for the disks, which is ok if you have direct access to physical disks. But both within virtual systems, as well as when connected to a SAN, that scheduler can have no idea how to optimize write operations, only the SAN server will. So you could set the scheduler to "deadline" (which gives you a fair amount of tuning options, the defaults may be ok) or even to "noop" (by echoing the according string to the file: "echo deadline > /sys/block/yourDiskDevice/queue/scheduler"). Fine-tuning requires knowledge of your actual usage, I wouldn't want to recommend this to you at this stage. I would not expect a facto 12 improvement by changing the scheduler, it's just something to keep in mind for later.

An interesting question would be whether you're just not seeing the problems on Dom0 - it may be writing to the buffers similarly (5 vs. 8 seconds), but on Dom0, the disk accesses for the system disks and that data LUN go via different controllers. Thus your system stays responsive while, in the background, it is still struggling to write that data to the SAN server. On the DomU, *everything* goes through that single virtual controller to Dom0, so flushing the buffers impacts the system disks, too. How long does "sync" hang on the Dom0, after the test?
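One way to measure that directly (the paths and sizes here are illustrative, not from your setup): time the copy and the following sync separately. If sync takes the bulk of the time, the copy only filled the page cache.

```shell
# Time the copy and the flush separately. A fast cp followed by a long
# sync means the data was still sitting in the page cache when cp
# returned. SRC/DST default to small throwaway files for illustration.
SRC=${1:-/tmp/src.img}
DST=${2:-/tmp/dst.img}
[ -f "$SRC" ] || dd if=/dev/zero of="$SRC" bs=1M count=4 2>/dev/null
t0=$(date +%s); cp "$SRC" "$DST"
t1=$(date +%s); sync
t2=$(date +%s)
echo "cp took $((t1 - t0))s, sync took $((t2 - t1))s"
```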


I added the SAN LUN to the VM by editing the VM's file under /etc/xen/vm. I added the line 'phy:/dev/emcpowera,xvdc,w' to the disk=[ ... ] line, and started it up with xm create <server>.

Sounds pretty straight-forward :) Did I understand that correctly, you tested throughput to "/dev/emcpowera" from within Dom0 with sufficient results? Then how's your CPU load (Dom0) while the VM flushes its buffers?


Not sure I need to lock the DomU, this isn't a cluster. Our Unit will only use the SAN for the ability to add and expand space, not sharing. At least, for the foreseeable future.

Oh, I got that wrong, sorry - reading "blade" I implied you will be running multiple Xen servers (multiple blades). Of course no inter-Xend locking is required with only a single xend :D

With regards,
Jens

lpphiggp
03-Jul-2014, 15:41

Hi Jens,

I just installed the sysstat tools on both host and guest. It will take me a bit to get familiar with what I'm looking at here and how to properly interpret the results.

Regarding Dom0: yes, throughput from the physical machine (aka host aka Dom0) to the SAN space is excellent (I use /dev/emcpowerb for the physical and a different LUN, /dev/emcpowera, for the vm); it's only when writing (copying) from the vm guest that the throughput takes a dive. Note that all LUNs here come from the same array, same storage group/zone, etc.

During my troubleshooting yesterday, I found something interesting though. I started using the dd command for tests instead of cp, and got very different results.

I ran " dd if=/dev/zero of=/SAN/testfile bs=2M oflag=direct " on the phys host, and it copied the 1 GB in about 2 seconds.
When I ran the equivalent on the vm (dd if=/dev/zero of=/san_lun/testfile bs=2M oflag=direct), I expected the same delay, but it also completed in about 2 seconds; this time, no difference. I really don't know why.
I ran tests multiple times; in every instance, the results using dd were identical between host and guest, unlike when I use cp. I'm not sure what that means. It seems like the file metadata written by cp is what takes so long to write.
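That would fit the caching idea: cp returns as soon as the data is in the page cache, while oflag=direct bypasses the cache entirely. A way to make a buffered dd test honest is conv=fdatasync, which makes dd flush before reporting its time (path and size below are illustrative):

```shell
# Buffered write that includes the flush in its timing: conv=fdatasync
# makes dd call fdatasync() before exiting, so the reported rate
# reflects the real write-out, not just the cache fill.
OUT=${1:-/tmp/ddtest.img}
dd if=/dev/zero of="$OUT" bs=2M count=4 conv=fdatasync
```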

Incidentally, here's what I have regarding the scheduler, for both host and guest; I assume [cfq] means that's the current setting?

noop deadline [cfq]

Since this server is not a production server but was created exclusively for testing, I'll try the deadline setting and its defaults for now, and see if that helps.

I'll get back next week probably with some results from iostat and vmstat.

Thanks again for all the information, I'm in your debt.

Paul

lpphiggp
03-Jul-2014, 15:48
Jens,

I just had to add a quick update: since changing the scheduler on the guest vm to deadline, the copy time (using cp to copy a DVD iso) has improved dramatically!! I went from 40 seconds to about 12 seconds!
If subsequent tests are consistent, I'd say we're in really good shape.

jmozdzen
03-Jul-2014, 16:13
Hi Paul,

Hi Jens,

I just installed the sysstat tools on both host and guest. It will take me a bit to get familiar with what I'm looking at here and how to properly interpret the results.

yes, but if you're going to handle such systems in the future, the time spent now will pay off later for sure. When you start pushing your hardware to the edge by running multiple full systems on top, performance difficulties are always just a glimpse away, so knowing how to identify them is rather important.



Regards Dom0, yes, throughput with the physical machine (aka host aka Dom0) to the SAN space is excellent (I use /dev/emcpowerb for the physical; a different lun, /dev/emcpowera for the vm), it's only when writing (copying) from the vm guest that the throughput takes a dive. Note that all luns regardless here though come from the same array, same storage group/zone etc..

Please keep in mind my other comment - some bottlenecks go unnoticed if you don't look closely enough. You came here because your virtual machine took only seconds more than the physical machine for the copy job, but was unresponsive for a minute afterwards. I tried to explain a possible cause for that difference in behaviour.


During my troubleshooting yesterday, I found something interesting though. I started using the dd command for tests instead of cp, and got very different results.

I ran " dd if=/dev/zero of=/SAN/testfile bs=2M oflag=direct " on the phys host, and it copied the 1 GB in about 2 seconds.
When I ran the equivalent on the vm (dd if=/dev/zero of=/san_lun/testfile bs=2M oflag=direct), I expected the same delay, but it also completed in about 2 seconds; this time, no difference. I really don't know why.
I ran tests multiple times; in every instance, the results using dd were identical between host and guest, unlike when I use cp. I'm not sure what that means. It seems like the file metadata written by cp is what takes so long to write.

This points to caching as the cause for the differences seen earlier.



Incidentally, here's what I have regarding the scheduler, for both host and guest: I assume [cfq] means that's the current setting?
noop deadline [cfq]
Since this server is not a production server but was created exclusively for testing, I'll try the deadline setting and its defaults for now, and see if that helps.

You're right about the "current setting", and as your later message showed, you not only were able to change the scheduling algorithm, it also made a difference (for the better :)).
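One follow-up, since the echo only lasts until the next reboot: a sketch of making the choice stick. The helper takes the scheduler file path as an argument so it can be called per disk from e.g. /etc/init.d/boot.local on SLES (the device name is an example). Alternatively, the "elevator=deadline" kernel parameter sets the default for all block devices.

```shell
# Write the chosen scheduler into a queue/scheduler file, e.g. called
# at boot for each relevant disk. $1 = scheduler name, $2 = file path.
set_sched() {
  [ -w "$2" ] && echo "$1" > "$2"
}
# Usage sketch (device name illustrative):
#   set_sched deadline /sys/block/xvdc/queue/scheduler
```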

I'm sure you'll have more questions once you've taken the plunge into the world of Linux I/O; please don't hesitate to come back here for assistance and/or an exchange of opinions!

With regards,
Jens