
sudden split brain occurred. No reason found



bcunix
12-Jan-2016, 18:09
A processor failed, forming new configuration.
Jan 8 02:47:36 node1 corosync[12261]: [CLM ] CLM CONFIGURATION CHANGE
Jan 8 02:47:36 node1 corosync[12261]: [CLM ] New Configuration:
Jan 8 02:47:36 node1 corosync[12261]: [CLM ] r(0) ip()
Jan 8 02:47:36 node1 corosync[12261]: [CLM ] Members Left:
Jan 8 02:47:36 node1 corosync[12261]: [CLM ] r(0) ip()
Jan 8 02:47:36 node1 corosync[12261]: [CLM ] Members Joined:
Jan 8 02:47:36 node1 corosync[12261]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 712: memb=1, new=0, lost=1
Jan 8 02:47:36 node1 corosync[12261]: [pcmk ] info: pcmk_peer_update: memb: node1 1
Jan 8 02:47:36 node1 corosync[12261]: [pcmk ] info: pcmk_peer_update: lost: node2 2
Jan 8 02:47:37 node1 corosync[12261]: [pcmk ] info: update_member: Node 2/node2 is now: lost
Jan 8 02:47:37 node1 crmd[12271]: notice: crm_update_peer_state: plugin_handle_membership: Node node2[2] - state is now lost (was member)
Jan 8 02:47:37 node1 sbd: [12249]: WARN: CIB: We do NOT have quorum!
Jan 8 02:47:37 node1 sbd: [12245]: WARN: Pacemaker health check: UNHEALTHY
Jan 8 02:47:37 node1 pengine[12270]: notice: unpack_config: On loss of CCM Quorum: Ignore
Jan 8 02:47:37 node1 pengine[12270]: warning: pe_fence_node: Node node2 will be fenced because the node is no longer part of the cluster.
Jan 8 02:47:36 node2 corosync[26232]: [pcmk ] info: pcmk_peer_update: memb: node2 2
Jan 8 02:47:36 node2 corosync[26232]: [pcmk ] info: pcmk_peer_update: lost: node1 1
Jan 8 02:47:37 node2 pengine[26241]: warning: handle_startup_fencing: Blind faith: not fencing unseen nodes

Suddenly, my cluster went into a split-brain situation.
We did not observe any abnormalities at the VMware level.
No latency or network disconnection was observed at the VM level at the time the split brain happened.
What can cause this sudden situation?

jmozdzen
13-Jan-2016, 12:20
Hi bcunix,


A processor failed, forming new configuration.
Jan 8 02:47:36 node1 corosync[12261]: [CLM ] CLM CONFIGURATION CHANGE
Jan 8 02:47:36 node1 corosync[12261]: [CLM ] New Configuration:
Jan 8 02:47:36 node1 corosync[12261]: [CLM ] r(0) ip()
Jan 8 02:47:36 node1 corosync[12261]: [CLM ] Members Left:
Jan 8 02:47:36 node1 corosync[12261]: [CLM ] r(0) ip()
Jan 8 02:47:36 node1 corosync[12261]: [CLM ] Members Joined:
Jan 8 02:47:36 node1 corosync[12261]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 712: memb=1, new=0, lost=1
Jan 8 02:47:36 node1 corosync[12261]: [pcmk ] info: pcmk_peer_update: memb: node1 1
Jan 8 02:47:36 node1 corosync[12261]: [pcmk ] info: pcmk_peer_update: lost: node2 2
Jan 8 02:47:37 node1 corosync[12261]: [pcmk ] info: update_member: Node 2/node2 is now: lost
Jan 8 02:47:37 node1 crmd[12271]: notice: crm_update_peer_state: plugin_handle_membership: Node node2[2] - state is now lost (was member)
Jan 8 02:47:37 node1 sbd: [12249]: WARN: CIB: We do NOT have quorum!
Jan 8 02:47:37 node1 sbd: [12245]: WARN: Pacemaker health check: UNHEALTHY
Jan 8 02:47:37 node1 pengine[12270]: notice: unpack_config: On loss of CCM Quorum: Ignore
Jan 8 02:47:37 node1 pengine[12270]: warning: pe_fence_node: Node node2 will be fenced because the node is no longer part of the cluster.
Jan 8 02:47:36 node2 corosync[26232]: [pcmk ] info: pcmk_peer_update: memb: node2 2
Jan 8 02:47:36 node2 corosync[26232]: [pcmk ] info: pcmk_peer_update: lost: node1 1
Jan 8 02:47:37 node2 pengine[26241]: warning: handle_startup_fencing: Blind faith: not fencing unseen nodes

Suddenly, my cluster went into a split-brain situation.
We did not observe any abnormalities at the VMware level.
No latency or network disconnection was observed at the VM level at the time the split brain happened.
What can cause this sudden situation?

Have you seen or caused any "high LRM workload" situation on node2, e.g. by running "crm_resource --cleanup" without specifying a specific resource to clean up?
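(An unscoped cleanup makes the LRM re-probe every resource on every node at once, which on a loaded node can delay corosync token processing. A sketch of the scoped alternative; "my_resource" is a placeholder, not a resource from your cluster:)

```shell
# Clean up only one resource instead of all of them, so the LRM
# is not hit with a full re-probe in one burst.
crm_resource --cleanup --resource my_resource

# Optionally restrict the cleanup to a single node as well.
crm_resource --cleanup --resource my_resource --node node2
```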

OTOH, both nodes declared the respective other node as lost, so it does look like a ring interruption to me.
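(If you want to check for ring trouble the next time this happens, corosync ships a status tool; these are stock corosync 1.x-era commands, adjust to your version:)

```shell
# Show the status of the totem ring(s) on the local node.
# A healthy ring reports "ring 0 active with no faults";
# an interrupted ring is reported as FAULTY.
corosync-cfgtool -s

# If a redundant ring is marked faulty, re-enable it after the
# underlying network problem has been fixed.
corosync-cfgtool -r

# Cross-check membership as Pacemaker sees it.
crm_mon -1
```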

Is the above all you have in syslog, especially on node2? If so, you might want to increase the debug level to better diagnose future events. Otherwise, it might help to have some context on what happened prior to the "split brain" situation.
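(For reference, a minimal sketch of a more verbose logging section in /etc/corosync/corosync.conf; the exact options available depend on your corosync version, so treat this as an example, not your actual config:)

```shell
logging {
        fileline: off
        to_syslog: yes
        syslog_facility: daemon
        # Raise verbosity while hunting the problem; revert afterwards,
        # as debug logging is very chatty.
        debug: on
        timestamp: on
}
```

After changing the file, the configuration needs to be reloaded or corosync restarted on both nodes for it to take effect.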

Regards,
Jens