Hi there,

We created an OCFS2 cluster using the SUSE Linux Enterprise High Availability Extension on SLES SP3. The cluster's nodes are two Apache servers that share one disk. We have STONITH enabled, with the SBD daemon. It works fine, but...

When one of the nodes is disconnected from the network (network card disconnected in VirtualBox), so that the two nodes can no longer communicate within the cluster, both servers are forcibly rebooted 30 seconds later.

Once the nodes have started again, one of them keeps rebooting the other, so service availability is lost completely. To recover, we reconnect the first failed node to the network (network card connected again in VirtualBox), and the problem goes away.

Questions are:

  1. Why does this happen?
  2. How can I avoid this behaviour?

What we expect is service-level availability: if a node temporarily disconnects from the network, the other one should continue serving and using the disk, even if the network connection between them is lost.

If I either kill the corosync daemon (killall -9 corosync) on one node or shut the node down normally, the remaining node keeps working fine. Why doesn't this work when the network card is disconnected? :-/

Here is the cluster configuration (crm configure show):