qla2xxx: Ramping down queue depth



fpitinfra
24-Oct-2014, 07:36
Hi,

we have 2 HP BL460 G6 blades running SLES11 SP3 and the CommVault Simpana 10 SP8 Media Agent. They are connected to a 3PAR storage array through a QLogic 4 Gbit/s HBA. The driver is the one shipped with SLES11 SP3.
Now I have the problem that the CommVault Simpana DeDup process gets stuck regularly (every few days). Each time this happens I see a lot of these "qla2xxx: Ramping down queue depth" messages. Could this be related?
To back up the DeDup database, CommVault Simpana creates snapshots of the LVM volumes. If the backup takes too long, the snapshot's COW area becomes full. Could this have an impact on the running DeDup process?
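
To see how close a snapshot is to filling up while the backup runs, something like the following could be polled. It is only a sketch, assuming Python 3 is available and that the installed `lvs` accepts the `snap_percent` report field; the 80% threshold and the function names are arbitrary choices made for this example.

```python
import subprocess

def parse_snapshot_usage(lvs_output):
    """Parse 'vg|lv|origin|percent' lines into (vg, lv, origin, percent) tuples."""
    snapshots = []
    for line in lvs_output.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) != 4:
            continue
        vg, lv, origin, percent = fields
        if not origin or not percent:
            continue  # no origin or no usage value: not a snapshot
        snapshots.append((vg, lv, origin, float(percent)))
    return snapshots

def nearly_full(snapshots, threshold=80.0):
    """Return the snapshots whose COW usage is at or above the threshold."""
    return [s for s in snapshots if s[3] >= threshold]

def read_lvs():
    """Run lvs (needs root) and return its raw output as text."""
    return subprocess.check_output(
        ["lvs", "--noheadings", "--separator", "|",
         "-o", "vg_name,lv_name,origin,snap_percent"],
        universal_newlines=True,
    )

if __name__ == "__main__":
    for vg, lv, origin, pct in nearly_full(parse_snapshot_usage(read_lvs())):
        print("snapshot %s/%s of %s is %.1f%% full" % (vg, lv, origin, pct))
```

Run from cron or a loop during the backup window, this would show whether the snapshot fills up before or after the DeDup process stalls.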

Regards
Bernhard

jmozdzen
28-Oct-2014, 12:46
Hi Bernhard,

> Hi,
>
> we have 2 HP BL460 G6 blades running SLES11 SP3 and CommVault Simpana 10 SP8 Media Agent. They are connected to a 3PAR storage through a QLogic 4GBit/s HBA. The driver is from SLES11 SP3.

IIRC there was a recent update for just that driver. Are you already using it? It must have been a few days ago in the SLES patch announcement mail...


> Now I have the problem that the CommVault Simpana DeDup process gets stuck regularly (every few days). Each time this happens I see a lot of these "qla2xxx: Ramping down queue depth" messages. Could this be related?
> To backup the DeDup database CommVault Simpana creates snapshots of the LVM volumes. If the backup takes too long, the COW file becomes full. Could this have an impact on the running DeDup process?
>
> Regards
> Bernhard

Ramping down seems to be the result of a resource shortage in the HBA, and that feature was reverted about a year ago. Maybe the recently distributed driver in SLES now also uses the static queue depth again?

Anyhow, I believe that the ramp-down is just a result, not the source of the problem. The dedup run might overload your storage back-end, hence the fill-up of the HBA and the ramp-down messages. And yes, this of course would impact the dedup process...
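
To check whether the ramp-down messages really cluster around the dedup stalls, the syslog could be bucketed per hour. This is a rough sketch, assuming Python 3, a traditional syslog timestamp prefix ("Oct 24 07:12:03 ...") and the default `/var/log/messages` location; it matches only on the message text quoted above, not on any exact driver format.

```python
import re
import sys
from collections import Counter

RAMP_DOWN = "Ramping down queue depth"
# Traditional syslog prefix, e.g. "Oct 24 07:12:03 host ..."; keep "Oct 24 07".
STAMP = re.compile(r"^(\w{3}\s+\d{1,2} \d{2}):\d{2}:\d{2}\s")

def ramp_downs_per_hour(lines):
    """Count ramp-down messages per 'Mon DD HH' bucket."""
    counts = Counter()
    for line in lines:
        if RAMP_DOWN not in line:
            continue
        match = STAMP.match(line)
        counts[match.group(1) if match else "unknown"] += 1
    return counts

if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/messages"
    with open(path, errors="replace") as fh:
        for hour, count in sorted(ramp_downs_per_hour(fh).items()):
            print("%s:00  %d" % (hour, count))
```

Comparing the busiest hours with the job history in CommVault should show whether the messages precede the hang or only appear once the storage is already saturated.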

I've had a similar case (impact-wise) with a 2 Gbps FC HBA connected via a 4 Gbps fabric to a 4 Gbps storage back-end. The FC throughput slowed to a crawl during backup runs; it appeared to be some hiccup in the HBA code... rebooting the machine with the 2 Gbps HBA brought everything back to normal. (Unfortunately, that HBA somehow also impacted the HBAs / disks on the storage back-end, so other systems using the same back-end were affected, too.)

How do you currently resolve those situations: via an HBA reset (e.g. during a reboot), or does it get back to normal all by itself (e.g. after killing the dedup process)?

Regards,
Jens

fpitinfra
28-Oct-2014, 14:11
Hi Jens,
thanks for the info.

I patched the system about a week ago, so apparently I don't have the new driver yet. I'll install it ASAP.

When it happens, I can't kill the DeDup process. So, it must be stuck somewhere very deep in the system. And then of course a clean reboot is not possible either. But when the system comes up again, all is back to normal. Most of the time CommVault continues with the pending jobs. Sometimes the DDB is corrupt and needs recovery, but this is an automated process.
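
A process that cannot be killed is typically sitting in uninterruptible sleep (state "D"), usually waiting on I/O. Whether that is the case here is an assumption; a small sketch like this one (Python 3, function names made up for the example) would list such processes from `/proc` the next time it happens.

```python
import os

def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    open_idx = stat_text.index("(")
    close_idx = stat_text.rindex(")")
    pid = int(stat_text[:open_idx].strip())
    comm = stat_text[open_idx + 1:close_idx]
    state = stat_text[close_idx + 1:].split()[0]
    return pid, comm, state

def uninterruptible_processes(proc_root="/proc"):
    """List (pid, comm) of all processes currently in state 'D'."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited meanwhile, or unreadable entry
        if state == "D":
            found.append((pid, comm))
    return sorted(found)

if __name__ == "__main__":
    for pid, comm in uninterruptible_processes():
        print("%d\t%s" % (pid, comm))
```

If the DeDup process shows up here together with kernel threads that handle I/O, that would support the idea that it is blocked on I/O that never completes.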

Regards
Bernhard

jmozdzen
28-Oct-2014, 14:17
Hi Bernhard,

it's this kernel update "Security update for Linux kernel 9750" ( https://download.suse.com/Download?buildid=Nig38l4JlpM~ ) - the announcement email arrived here on Oct 25.

Regards,
Jens

jmozdzen
28-Oct-2014, 14:28
Hi Bernhard,

> When it happens, I can't kill the DeDup process. So, it must be stuck somewhere very deep in the system. And then of course a clean reboot is not possible either. But when the system comes up again, all is back to normal.

That "feels" like the situation we had with our server, too.

Something else you might want to check is that the proper firmware is installed on the HBA. For older cards (you didn't say which one you use, only that it's 4 Gbps, so it will be some QLx246x), the firmware update via OS-supplied firmware files is officially supported, so you could check the boot log for details on the version in use. While I would generally stick with the firmware files provided by SUSE (because they match the driver code), in such cases a test with a more current version from QLogic might be worth the effort.
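
Besides the boot log, the running versions can usually be read from sysfs. The sketch below (Python 3) assumes the driver exposes `model_name`, `fw_version` and `driver_version` under `/sys/class/scsi_host/hostN/`; if those attributes are missing on your kernel, the host is simply skipped.

```python
import glob
import os

ATTRS = ("model_name", "fw_version", "driver_version")

def read_attr(path):
    """Return the stripped contents of a sysfs attribute, or None if unreadable."""
    try:
        with open(path) as fh:
            return fh.read().strip()
    except OSError:
        return None

def hba_versions(sysfs_root="/sys/class/scsi_host"):
    """Map host name -> {attribute: value} for hosts that report a firmware version."""
    result = {}
    for host_dir in sorted(glob.glob(os.path.join(sysfs_root, "host*"))):
        values = {attr: read_attr(os.path.join(host_dir, attr)) for attr in ATTRS}
        if values["fw_version"] is None:
            continue  # not an HBA exposing these attributes
        result[os.path.basename(host_dir)] = values
    return result

if __name__ == "__main__":
    for host, values in hba_versions().items():
        print(host, values)
```

Noting these values before and after the kernel update makes it easy to tell whether the driver and firmware actually changed.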

And if everything else fails, open a service request with SUSE and have *them* wade through the debug log of your qla2xxx driver. But be warned: setting the driver to "debug" will produce tons of syslog output. I recommend routing those syslog messages to a dedicated or at least remote syslog server, and turning debugging on as close to the actual tracing time window as possible ;)

Regards,
Jens