Copy operation hung



sharfuddin
19-Sep-2012, 09:52
SLES 11 SP1 x86_64

My Customer is using a NAS device for backup.

192.168.0.y:/mnt/HD_a2 on /external_disk type nfs (rw,addr=192.168.0.y)

The directory they back up is on a local file system, "/backup":
/dev/cciss/c0d0p8 on /backup type ext3 (rw,acl,user_xattr)
This /backup file system contains some very large files, for example one of 129 GB.

The problem is that when we copy a very large file, "/backup/19sep/large-file" (about 129 GB), to the NAS, we see the following:
1 - Blocks in (bi) and blocks out (bo) remain very low:


procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 1 0 11230940 91200 19138368 0 0 102 7 18 156 0 0 98 2 0
0 1 0 11231312 91200 19139952 0 0 0 0 1116 1321 0 0 96 4 0
0 1 0 11231296 91200 19140744 0 0 0 0 350 475 0 0 96 4 0
2 1 0 11231296 91200 19141592 0 0 0 0 767 876 0 0 95 5 0
0 1 0 11231544 91208 19143008 0 0 0 12 801 961 0 0 97 3 0
0 1 0 11231544 91208 19144184 0 0 0 268 727 990 0 0 95 5 0
1 1 0 11232108 91208 19145680 0 0 0 16 1020 1498 0 0 96 3 0
0 1 0 11231744 91208 19148480 0 0 0 0 1125 1311 1 0 93 5 0
0 1 0 11231496 91208 19150044 0 0 0 0 863 1034 0 0 96 4 0
0 1 0 11231744 91216 19152192 0 0 32 0 1318 1965 0 0 94 5 0
1 1 0 11231480 91216 19154592 0 0 0 944 1385 1563 0 0 96 4 0
0 1 0 11231736 91216 19155932 0 0 0 0 1050 1091 0 0 96 4 0
0 1 0 11231116 91308 19157936 0 0 516 136 1176 2495 0 0 93 6 0
2 1 0 11231116 91308 19158368 0 0 0 0 217 343 0 0 95 5 0
0 1 0 11227396 91308 19164352 0 0 0 0 1504 1717 1 0 95 4 0
0 2 0 11227396 91324 19166280 0 0 80 372 1373 1706 0 0 95 5 0
0 1 0 11227520 91324 19167488 0 0 8 0 885 1121 0 0 96 4 0
1 1 0 11227528 91324 19169596 0 0 0 0 1254 1473 0 0 95 4 0
0 1 0 11226784 91324 19173440 0 0 0 0 1479 2022 1 0 96 4 0
0 1 0 11225372 91324 19176268 0 0 0 0 1265 1545 0 0 94 6 0
1 1 0 11225372 91336 19178540 0 0 12 468 1067 1220 0 0 96 4 0


2 - Then, after about 2 hours, we see:


procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 4 0 11230940 91200 19138368 0 0 108 12 40 131 0 0 95 5 0
1 3 0 11231312 91200 19139952 0 0 576 288 1571 2068 0 0 85 14 0
1 3 0 11231296 91200 19140744 0 0 216 344 1585 3430 1 0 86 13 0

That is, blocked processes (b), interrupts (in), context switches (cs), and I/O wait (wa) are all high, while bi and bo remain low. The copy operation also goes into uninterruptible sleep (D+).
"ps aux" shows:


root 8866 1.0 0.0 12576 804 pts/0 D+ 11:13 0:08 cp largefile /external_disk/BMP/19sep12/

and we have to reboot the server to recover.

My questions:
1 - Why does the copy start out so slowly? (The 'bi' and 'bo' values in vmstat stay very low while we copy the 129 GB file from the local disk to the NAS.)
2 - Why does the copy operation then hang and go into the D+ state?

Where is the problem? Is something wrong with the NAS device, or with our local disk/file system (/backup)?

Please help.

Bob-O-Rama
20-Sep-2012, 04:00
No idea, but a couple of theories:

As a test, copy the same data to /dev/null and see if you have issues or slowness. If so, something bad is happening with your Smart Array. Check the dmesg output and see if the cciss driver (or whatever it is now) is screaming. Make sure you have the latest HP firmware updates. I have had issues where a disk was going bad, but not quite, and it acted this way, struggling whenever it hit the bad disk.
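
The read-only test could be sketched roughly like this. The function names are made up for illustration, the default path is the one from the original post, and the grep pattern is only a guess at what the controller driver might log:

```shell
# Read a file sequentially and throw the data away, so only the source
# disk/controller is exercised. The default path is the file from the
# original post; pass another path as the first argument.
read_test() {
    src="${1:-/backup/19sep/large-file}"
    # dd prints its transfer summary on stderr; keep only the last line.
    dd if="$src" of=/dev/null bs=1M 2>&1 | tail -n 1
}

# Show recent kernel messages that mention the cciss driver or an I/O error.
# The pattern is an assumption, adjust it to whatever your controller logs.
check_kernel_log() {
    dmesg | grep -i -E 'cciss|i/o error' | tail -n 20
}
```

Run read_test while vmstat 1 is running in another terminal. If 'bi' stays as low as in the output above even with the NAS out of the picture, the local array is the more likely suspect.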

On the NAS side of things, you can run dd count=129000 ibs=1M obs=1M < /dev/zero > some_file_on_the_NAS, or something similar, to exercise the NAS.
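
A slightly more cautious variant of that write test, as a sketch only: it starts with a smaller size and asks dd to flush the data before reporting, so the page cache does not make the NAS look faster than it is. The function name and the scratch file name are hypothetical; /external_disk is the mount point from the original post.

```shell
# Write zeros to a scratch file and flush them to the server before dd
# reports its throughput. Arguments: target file, size in MiB.
# The default target name is a placeholder, not a file that exists already.
write_test() {
    target="${1:-/external_disk/dd-write-test.tmp}"
    count_mb="${2:-1024}"
    # conv=fsync makes dd call fsync() on the output file before it exits.
    dd if=/dev/zero of="$target" bs=1M count="$count_mb" conv=fsync 2>&1 | tail -n 1
}
```

Start small (for example write_test /external_disk/dd-write-test.tmp 1024), watch 'bo' in vmstat, and remove the scratch file afterwards. If small writes are fine, increase the size toward the 129 GB that actually fails.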

This tests each storage device independently of the other. If they both seem to be handling the I/O properly, then we have to look at what is in between them (your backup script). With certain filers you need to disable file locking, which may be an issue depending on the I/O patterns used; disabling opportunistic locking, for example.

Like I said, total guesses. But if you cut the problem in half, you might be able to determine which side is the issue.
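
If the cp gets stuck in D+ again, it may be worth looking at what it is blocked on before rebooting. A minimal sketch, assuming the procps version of ps (the function name is made up):

```shell
# List processes in uninterruptible sleep (state D) together with the
# kernel function they are waiting in (the WCHAN column).
list_d_state() {
    # Keep the header line plus every row whose STAT column starts with D.
    ps -eo pid,stat,wchan:32,args | awk 'NR == 1 || $2 ~ /^D/'
}
```

A WCHAN that points at NFS or RPC code would suggest the process is waiting on the NAS, while one in the block layer would point back at the local disk. As root, cat /proc/<pid>/stack (if your kernel provides it) or echo w > /proc/sysrq-trigger followed by dmesg should give a fuller picture of the blocked task.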

-- Bob

sharfuddin
24-Sep-2012, 10:06
Very nice recommendations/tips, thanks a lot.

I have forwarded the link to this thread to my customer and am awaiting their response.

I will update if and when I get anything from my customer.

