SBD fails to fence node if 1 of 2 SBD devices is unreachable
Inherited a 2-node cluster running SLES11 SP1 with the HA Extension. The servers are attached to two HP P2000 arrays. Each array serves one SBD device, and both devices are visible through multipath.

My problem: if I test a scenario of powering off one server AND one of the disk arrays simultaneously, I can see that SBD attempts to fence the powered-off node, but instead of acknowledging the fence as successful after the SBD msgwait timeout, it receives a return code of 1, and the cluster remains in an unclean state. A fair amount of failed I/O is also logged during this test.

My initial investigation suggests a multipath issue: I/O appears to be queueing on the SBD paths, even though I had already disabled the queue_if_no_path feature for the SBD LUNs.
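For what it's worth, per-LUN settings in a `multipaths` section keyed by WWID normally take precedence over both the `defaults` section and the built-in `devices` entry for the P2000, so that is usually the most reliable place to force queueing off for the SBD LUNs. A sketch of what that might look like in multipath.conf (the WWID and alias are placeholders; substitute the real WWID from `multipath -ll`):

```
multipaths {
    multipath {
        # placeholder - use the real WWID from "multipath -ll"
        wwid            <wwid-of-sbd-lun>
        alias           sbd_a
        # fail I/O immediately when all paths are down,
        # instead of queueing it indefinitely
        no_path_retry   fail
    }
}
```

After editing, the configuration needs to be reloaded (e.g. `multipathd -k"reconfigure"`) before it takes effect on existing maps.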
Here are my relevant timeout settings:
- Cluster stonith-timeout: 300
- Multipath polling interval: 5
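For reference, the usual SUSE guidance is that msgwait should be at least twice the SBD watchdog timeout, and that the cluster's stonith-timeout should exceed msgwait by a comfortable margin (roughly 20%), otherwise the fence can be reported as failed before SBD has had time to confirm it. A quick sanity check with hypothetical values (read your real ones with `sbd -d /dev/mapper/<sbd-device> dump`):

```shell
# hypothetical values - substitute the "Timeout (msgwait)" and
# "Timeout (watchdog)" fields from the sbd dump output
msgwait=120
watchdog=60

# msgwait should be >= 2 * watchdog; stonith-timeout should be
# msgwait plus a ~20% margin
stonith_min=$((msgwait + msgwait / 5))
echo "minimum stonith-timeout: ${stonith_min}s"
```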
I guess what I'm looking for initially is any tips or tricks for running SBD on multipath devices, because whatever configuration I add to multipath.conf regarding queue_if_no_path or no_path_retry seems to be ignored in favour of the built-in controller settings.
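One way to confirm which setting is actually in effect, as opposed to what multipath.conf says, is to inspect the live device-mapper table: if queueing is still on, the features field will show `queue_if_no_path`. Queueing can also be switched off on a live map without a reload. A sketch, using a placeholder map name:

```
# show the live table for the SBD map; a features field of
# "1 queue_if_no_path" means I/O still queues regardless of multipath.conf
dmsetup table /dev/mapper/sbd_a

# disable queueing on the live map immediately (placeholder map name)
dmsetup message sbd_a 0 fail_if_no_path

# dump the effective merged multipath configuration,
# including the built-in device entry being applied
multipathd -k"show config"
```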