I have two SLES 11 SP1 clusters (10 nodes in one blade center) with a strange issue. The nodes randomly lose communication with the sbd partition and reboot. The following warning messages appear in /var/log/warn:
sbd: : WARN: Latency: No liveness for 59 s exceeds threshold of 3 s (healthy servants: 0)
sbd: : WARN: Latency: No liveness for 60 s exceeds threshold of 3 s (healthy servants: 0)
The timeout is 60 seconds, so sbd is doing its job correctly. But why is the communication failing? Why can't the sbd watchdog access the sbd partition? That is the question. There is nothing in the logs about any of the 4 paths going down, and nothing from multipathd, just the sbd messages. The polling interval in multipath.conf is set to 1 second. If I am right, when a path fails there should be something in the logs after 4 seconds showing that multipathd is trying to recover, shouldn't there?
Or can a RAID group on the SAN stall for more than 60 seconds while all 4 paths are up and running?
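
To line the stalls up against SAN-side or multipathd events, the sbd warnings quoted above can be pulled out of /var/log/warn and reduced to numbers. This is only a minimal sketch, assuming the message format is exactly as in the two lines shown; the function names parse_sbd_latency and longest_stall are made up for illustration and are not part of sbd.

```python
import re

# Pattern for the sbd latency warning format quoted in the post.
# Assumption: real log lines may carry a timestamp, host and PID
# around this text, so the pattern searches rather than anchors.
LATENCY_RE = re.compile(
    r"sbd: .*WARN: Latency: No liveness for (\d+) s "
    r"exceeds threshold of (\d+) s \(healthy servants: (\d+)\)"
)


def parse_sbd_latency(line):
    """Return (stall_s, threshold_s, healthy_servants), or None if no match."""
    match = LATENCY_RE.search(line)
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())


def longest_stall(lines):
    """Return the largest stall in seconds found in the lines, or None."""
    stalls = [parsed[0] for parsed in map(parse_sbd_latency, lines) if parsed]
    return max(stalls) if stalls else None


if __name__ == "__main__":
    sample = [
        "sbd: : WARN: Latency: No liveness for 59 s exceeds threshold of 3 s (healthy servants: 0)",
        "sbd: : WARN: Latency: No liveness for 60 s exceeds threshold of 3 s (healthy servants: 0)",
    ]
    for line in sample:
        print(parse_sbd_latency(line))
    print(longest_stall(sample))
```

Feeding it the lines of /var/log/warn instead of the sample list gives the stall length per warning, which can then be compared with timestamps from the storage array.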