On 06/09/2015 06:54 PM, jonvargas wrote:
>
> Hi there,
>
> We created an OCFS2 cluster using the SUSE Linux High Availability
> Extension on SLES SP3. The cluster's nodes are two Apache servers that
> share one disk. We have STONITH enabled with the SBD daemon. It works
> fine, but...
>
> When one of the nodes is disconnected from the network (network card
> disconnected in VirtualBox), so the two nodes can no longer communicate
> with each other, both servers are abruptly rebooted 30 seconds later.
>
> Once the nodes are started again, one of them keeps rebooting the other,
> so service availability is lost completely. To recover, we reconnect the
> first failed node to the network (network card reattached in VirtualBox)
> and the problem goes away.
>
> QUESTIONS ARE:
>
>
> - Why does this happen?
> - How can I avoid this behaviour?


Not much help, but sounds like STONITH isn't working. A failed node should
*not* come back up after any sort of reboot once it has been STONITH'd.

That's the whole idea of STONITH.
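
Since your fencing goes through SBD, one thing worth checking is whether a
fence message is stuck in a node's slot on the shared device; that can keep
a node from rejoining cleanly. A rough sketch (the device path and node name
below are placeholders, not your actual values):

    # List the slots on the SBD device and any pending messages
    sbd -d /dev/disk/by-id/your-sbd-disk list

    # Clear a stale fence message for node1 so it can start cleanly
    sbd -d /dev/disk/by-id/your-sbd-disk message node1 clear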

STONITH is not just a "dead until reboot" mechanism. On physical servers,
this is why STONITH usually integrates with the node's local IPMI controller:
to keep a fenced node from coming back to life even through a hard power
cycle.
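
For reference, an IPMI-backed fencing resource typically looks something
like this in the crm shell; every value below (resource name, host, address,
credentials) is a placeholder, not taken from your setup:

    primitive fence-node1 stonith:external/ipmi \
        params hostname=node1 ipaddr=192.168.100.1 \
               userid=admin passwd=secret interface=lan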

With that said, it's always possible to confuse a cluster by unplugging and
replugging network cables... but even then, ideally, at least one of the
nodes should end up fenced (in most cases), not both of them taking each
other down.
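
If the two nodes are fencing each other simultaneously (a fence race), one
common mitigation is a random delay on the fencing resource so that one node
wins the race, plus telling a two-node cluster to keep running without
quorum. A sketch in crm shell syntax, assuming the SBD resource is named
stonith-sbd and your Pacemaker version supports pcmk_delay_max:

    primitive stonith-sbd stonith:external/sbd \
        params pcmk_delay_max=15
    property no-quorum-policy=ignore \
        stonith-enabled=true

With no-quorum-policy=ignore, keep in mind that working fencing is the only
thing stopping a two-node cluster from running the same resources on both
sides.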

So... no help here, just observation...