View Full Version : Cluster node migration



klemenvi
07-Jun-2012, 15:12
Hello,

I have built a two-node HA cluster (SLES for VMware 11). When both nodes live on the same ESX host, everything works perfectly.

When I migrate (vMotion) one node to another ESX host, they lose their connection. We have other clusters in the same environment without problems.
They are in the same subnet with no firewall in between, and I have also tried a different ESX cluster, so I'm fairly sure it is not a network-related problem.

What could be wrong that makes them lose the connection when they are on different ESX servers?

Thanks

Magic31
07-Jun-2012, 22:19
Hello,


My first thought is that it could be something happening at the switch level, in combination with an overlapping MAC address or IP (VMs sharing the same MAC, which is possible if you have multiple separately managed VMware hosts, e.g. not all in one vCenter environment). A duplicate MAC can also happen when manually cloning between systems.

If you boot both SLES nodes on different clusters, are you able to get communication going between the nodes (simple pings) and to other nodes in the network?
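As a sketch of such checks (the addresses, interface name and port below are assumptions taken from the logs and output posted later in this thread; adjust them to your environment):

```shell
# Plain unicast reachability between the two nodes:
ping -c 3 10.203.202.50
ping -c 3 10.203.202.51

# Corosync uses multicast by default, so also verify that the peer's
# multicast traffic actually arrives on this host (5405 is the
# default corosync mcastport):
tcpdump -ni eth0 'ip multicast and udp port 5405'
```

If unicast pings work across ESX hosts but no multicast packets show up in tcpdump, the problem is multicast forwarding between the hosts rather than basic connectivity.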

Possibly looking at the switch logs could point you to the issue.

-Willem

klemenvi
08-Jun-2012, 08:02

The VMs have different MAC addresses:

xxxx2:~ # ifconfig
eth0 Link encap:Ethernet HWaddr 00:50:56:B7:00:64

xxxx2:~ # ifconfig
eth1 Link encap:Ethernet HWaddr 00:50:56:B7:00:2B

I cannot place one VM in the other cluster because of the different network segments. Could it be related to the way the VMs were created?
I installed one VM and then cloned the first one to create the second. The HA software was installed after cloning.

If it helps, I can post the HA configuration; just tell me which files would be useful.

I also don't understand which logs you mean?
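(For reference, a sketch of the files and commands that usually answer this on SLES 11 HAE; the paths assume a default install:)

```shell
# Cluster communication settings (the totem section: bindnetaddr,
# mcastaddr, mcastport, transport):
cat /etc/corosync/corosync.conf

# The Pacemaker resource configuration (CIB) in crm shell syntax:
crm configure show

# Corosync ring status, run on each node:
corosync-cfgtool -s

# The cluster daemons log to /var/log/messages by default on SLES;
# the switch logs Willem mentions would live on the physical switches.
grep -E 'corosync|pengine|crmd' /var/log/messages | tail -n 50
```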

klemenvi
08-Jun-2012, 09:07
Let's picture the situation:


1. Both nodes are online on the same ESX server (they have to be on the same one to bring them online).
2. I migrate one node to another ESX server in the cluster; everything still works and both nodes stay online.
3. The nodes are still online on different ESX servers.
4. I put one node into standby.
5. The standby host becomes offline (unclean). Logs on the host (/var/log/messages) when it becomes unclean:

Jun 8 09:53:49 Ha-test1 crm_attribute: [4942]: info: Invoked: crm_attribute -t nodes -n standby -v on -N Ha-test1
Jun 8 09:53:49 Ha-test1 cib: [3004]: info: log_data_element: cib:diff: - <cib admin_epoch="0" epoch="27" num_updates="5" >
Jun 8 09:53:49 Ha-test1 cib: [3004]: info: log_data_element: cib:diff: - <configuration >
Jun 8 09:53:49 Ha-test1 cib: [3004]: info: log_data_element: cib:diff: - <nodes >
Jun 8 09:53:49 Ha-test1 cib: [3004]: info: log_data_element: cib:diff: - <node id="Ha-test1" >
Jun 8 09:53:49 Ha-test1 cib: [3004]: info: log_data_element: cib:diff: - <instance_attributes id="nodes-Ha-test1" >
Jun 8 09:53:49 Ha-test1 cib: [3004]: info: log_data_element: cib:diff: - <nvpair value="off" id="nodes-Ha-test1-standby" />
Jun 8 09:53:49 Ha-test1 cib: [3004]: info: log_data_element: cib:diff: - </instance_attributes>
Jun 8 09:53:49 Ha-test1 cib: [3004]: info: log_data_element: cib:diff: - </node>
Jun 8 09:53:49 Ha-test1 cib: [3004]: info: log_data_element: cib:diff: - </nodes>
Jun 8 09:53:49 Ha-test1 cib: [3004]: info: log_data_element: cib:diff: - </configuration>
Jun 8 09:53:49 Ha-test1 cib: [3004]: info: log_data_element: cib:diff: - </cib>
Jun 8 09:53:49 Ha-test1 crmd: [3008]: info: abort_transition_graph: need_abort:59 - Triggered transition abort (complete=1) : Non-status change
Jun 8 09:53:49 Ha-test1 cib: [3004]: info: log_data_element: cib:diff: + <cib admin_epoch="0" epoch="28" num_updates="1" >
Jun 8 09:53:49 Ha-test1 crmd: [3008]: info: need_abort: Aborting on change to admin_epoch
Jun 8 09:53:49 Ha-test1 cib: [3004]: info: log_data_element: cib:diff: + <configuration >
Jun 8 09:53:49 Ha-test1 crmd: [3008]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Jun 8 09:53:49 Ha-test1 cib: [3004]: info: log_data_element: cib:diff: + <nodes >
Jun 8 09:53:49 Ha-test1 crmd: [3008]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
Jun 8 09:53:49 Ha-test1 cib: [3004]: info: log_data_element: cib:diff: + <node id="Ha-test1" >
Jun 8 09:53:49 Ha-test1 crmd: [3008]: info: do_pe_invoke: Query 187: Requesting the current CIB: S_POLICY_ENGINE
Jun 8 09:53:49 Ha-test1 cib: [3004]: info: log_data_element: cib:diff: + <instance_attributes id="nodes-Ha-test1" >
Jun 8 09:53:49 Ha-test1 cib: [3004]: info: log_data_element: cib:diff: + <nvpair value="on" id="nodes-Ha-test1-standby" />
Jun 8 09:53:49 Ha-test1 cib: [3004]: info: log_data_element: cib:diff: + </instance_attributes>
Jun 8 09:53:49 Ha-test1 cib: [3004]: info: log_data_element: cib:diff: + </node>
Jun 8 09:53:49 Ha-test1 cib: [3004]: info: log_data_element: cib:diff: + </nodes>
Jun 8 09:53:49 Ha-test1 cib: [3004]: info: log_data_element: cib:diff: + </configuration>
Jun 8 09:53:49 Ha-test1 cib: [3004]: info: log_data_element: cib:diff: + </cib>
Jun 8 09:53:49 Ha-test1 cib: [3004]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crm_attribute/4, version=0.28.1): ok (rc=0)
Jun 8 09:53:49 Ha-test1 crmd: [3008]: info: do_pe_invoke_callback: Invoking the PE: query=187, ref=pe_calc-dc-1339142029-119, seq=40, quorate=1
Jun 8 09:53:49 Ha-test1 pengine: [3007]: info: unpack_config: Startup probes: enabled
Jun 8 09:53:49 Ha-test1 pengine: [3007]: info: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
Jun 8 09:53:49 Ha-test1 pengine: [3007]: info: unpack_domains: Unpacking domains
Jun 8 09:53:49 Ha-test1 pengine: [3007]: ERROR: unpack_resources: Resource start-up disabled since no STONITH resources have been defined
Jun 8 09:53:49 Ha-test1 pengine: [3007]: ERROR: unpack_resources: Either configure some or disable STONITH with the stonith-enabled option
Jun 8 09:53:49 Ha-test1 pengine: [3007]: ERROR: unpack_resources: NOTE: Clusters with shared data need STONITH to ensure data integrity
Jun 8 09:53:49 Ha-test1 pengine: [3007]: info: unpack_status: Node Ha-test1 is in standby-mode
Jun 8 09:53:49 Ha-test1 pengine: [3007]: info: determine_online_status: Node Ha-test1 is standby
Jun 8 09:53:49 Ha-test1 pengine: [3007]: info: determine_online_status: Node Ha-test2 is online
Jun 8 09:53:49 Ha-test1 pengine: [3007]: info: stage6: Delaying fencing operations until there are resources to manage
Jun 8 09:53:49 Ha-test1 crmd: [3008]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Jun 8 09:53:49 Ha-test1 cib: [4943]: info: write_cib_contents: Archived previous version as /var/lib/heartbeat/crm/cib-30.raw
Jun 8 09:53:49 Ha-test1 crmd: [3008]: info: unpack_graph: Unpacked transition 84: 0 actions in 0 synapses
Jun 8 09:53:49 Ha-test1 crmd: [3008]: info: do_te_invoke: Processing graph 84 (ref=pe_calc-dc-1339142029-119) derived from /var/lib/pengine/pe-input-123.bz2
Jun 8 09:53:49 Ha-test1 crmd: [3008]: info: run_graph: ====================================================
Jun 8 09:53:49 Ha-test1 crmd: [3008]: notice: run_graph: Transition 84 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-123.bz2): Complete
Jun 8 09:53:49 Ha-test1 crmd: [3008]: info: te_graph_trigger: Transition 84 is now complete
Jun 8 09:53:49 Ha-test1 crmd: [3008]: info: notify_crmd: Transition 84 status: done - <null>
Jun 8 09:53:49 Ha-test1 crmd: [3008]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Jun 8 09:53:49 Ha-test1 crmd: [3008]: info: do_state_transition: Starting PEngine Recheck Timer
Jun 8 09:53:49 Ha-test1 cib: [4943]: info: write_cib_contents: Wrote version 0.28.0 of the CIB to disk (digest: 6eb8aa46cbec39938c66a1a1804773e2)
Jun 8 09:53:49 Ha-test1 pengine: [3007]: info: process_pe_message: Transition 84: PEngine Input stored in: /var/lib/pengine/pe-input-123.bz2
Jun 8 09:53:49 Ha-test1 cib: [4943]: info: retrieveCib: Reading cluster configuration from: /var/lib/heartbeat/crm/cib.ZSRx8X (digest: /var/lib/heartbeat/crm/cib.ZZa3ZC)
Jun 8 09:53:49 Ha-test1 pengine: [3007]: info: process_pe_message: Configuration ERRORs found during PE processing. Please run "crm_verify -L" to identify issues.
Jun 8 09:53:50 Ha-test1 mgmtd: [3009]: info: CIB query: cib
Jun 8 09:53:50 Ha-test1 mgmtd: [3009]: ERROR: unpack_resources: Resource start-up disabled since no STONITH resources have been defined
Jun 8 09:53:50 Ha-test1 mgmtd: [3009]: ERROR: unpack_resources: Either configure some or disable STONITH with the stonith-enabled option
Jun 8 09:53:50 Ha-test1 mgmtd: [3009]: ERROR: unpack_resources: NOTE: Clusters with shared data need STONITH to ensure data integrity
Jun 8 09:53:53 Ha-test1 corosync[2995]: [TOTEM ] FAILED TO RECEIVE
Jun 8 09:53:53 Ha-test1 corosync[2995]: [TOTEM ] FAILED TO RECEIVE
Jun 8 09:53:54 Ha-test1 corosync[2995]: [TOTEM ] FAILED TO RECEIVE
Jun 8 09:53:54 Ha-test1 corosync[2995]: [TOTEM ] FAILED TO RECEIVE
Jun 8 09:53:54 Ha-test1 corosync[2995]: [TOTEM ] FAILED TO RECEIVE
Jun 8 09:53:55 Ha-test1 corosync[2995]: [TOTEM ] FAILED TO RECEIVE
Jun 8 09:53:55 Ha-test1 corosync[2995]: [TOTEM ] FAILED TO RECEIVE
Jun 8 09:53:55 Ha-test1 corosync[2995]: [TOTEM ] FAILED TO RECEIVE
Jun 8 09:53:55 Ha-test1 corosync[2995]: [TOTEM ] FAILED TO RECEIVE
Jun 8 09:53:56 Ha-test1 corosync[2995]: [TOTEM ] FAILED TO RECEIVE
Jun 8 09:54:00 Ha-test1 corosync[2995]: [CLM ] CLM CONFIGURATION CHANGE
Jun 8 09:54:00 Ha-test1 corosync[2995]: [CLM ] New Configuration:
Jun 8 09:54:00 Ha-test1 corosync[2995]: [CLM ] r(0) ip(10.203.202.50)
Jun 8 09:54:00 Ha-test1 corosync[2995]: [CLM ] Members Left:
Jun 8 09:54:00 Ha-test1 corosync[2995]: [CLM ] r(0) ip(10.203.202.51)
Jun 8 09:54:00 Ha-test1 corosync[2995]: [CLM ] Members Joined:
Jun 8 09:54:00 Ha-test1 corosync[2995]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 44: memb=1, new=0, lost=1
Jun 8 09:54:00 Ha-test1 corosync[2995]: [pcmk ] info: pcmk_peer_update: memb: Ha-test1 852151050
Jun 8 09:54:00 Ha-test1 corosync[2995]: [pcmk ] info: pcmk_peer_update: lost: Ha-test2 868928266
Jun 8 09:54:00 Ha-test1 corosync[2995]: [CLM ] CLM CONFIGURATION CHANGE
Jun 8 09:54:00 Ha-test1 corosync[2995]: [CLM ] New Configuration:
Jun 8 09:54:00 Ha-test1 corosync[2995]: [CLM ] r(0) ip(10.203.202.50)
Jun 8 09:54:00 Ha-test1 cib: [3004]: notice: ais_dispatch: Membership 44: quorum lost
Jun 8 09:54:00 Ha-test1 crmd: [3008]: notice: ais_dispatch: Membership 44: quorum lost
Jun 8 09:54:00 Ha-test1 corosync[2995]: [CLM ] Members Left:
Jun 8 09:54:00 Ha-test1 cib: [3004]: info: crm_update_peer: Node Ha-test2: id=868928266 state=lost (new) addr=r(0) ip(10.203.202.51) votes=1 born=24 seen=40 proc=00000000000000000000000000151312
Jun 8 09:54:00 Ha-test1 crmd: [3008]: info: ais_status_callback: status: Ha-test2 is now lost (was member)
Jun 8 09:54:00 Ha-test1 corosync[2995]: [CLM ] Members Joined:
Jun 8 09:54:00 Ha-test1 crmd: [3008]: info: crm_update_peer: Node Ha-test2: id=868928266 state=lost (new) addr=r(0) ip(10.203.202.51) votes=1 born=24 seen=40 proc=00000000000000000000000000151312
Jun 8 09:54:00 Ha-test1 corosync[2995]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 44: memb=1, new=0, lost=0
Jun 8 09:54:00 Ha-test1 crmd: [3008]: info: erase_node_from_join: Removed node Ha-test2 from join calculations: welcomed=0 itegrated=0 finalized=0 confirmed=1
Jun 8 09:54:00 Ha-test1 corosync[2995]: [pcmk ] info: pcmk_peer_update: MEMB: Ha-test1 852151050
Jun 8 09:54:00 Ha-test1 cib: [3004]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/188, version=0.28.1): ok (rc=0)
Jun 8 09:54:00 Ha-test1 crmd: [3008]: info: crm_update_quorum: Updating quorum status to false (call=190)
Jun 8 09:54:00 Ha-test1 corosync[2995]: [pcmk ] info: ais_mark_unseen_peer_dead: Node Ha-test2 was not seen in the previous transition
Jun 8 09:54:00 Ha-test1 cib: [3004]: info: log_data_element: cib:diff: - <cib have-quorum="1" admin_epoch="0" epoch="28" num_updates="2" />
Jun 8 09:54:00 Ha-test1 corosync[2995]: [pcmk ] info: update_member: Node 868928266/Ha-test2 is now: lost
Jun 8 09:54:00 Ha-test1 cib: [3004]: info: log_data_element: cib:diff: + <cib have-quorum="0" admin_epoch="0" epoch="29" num_updates="1" />
Jun 8 09:54:00 Ha-test1 corosync[2995]: [pcmk ] info: send_member_notification: Sending membership update 44 to 2 children
Jun 8 09:54:00 Ha-test1 cib: [3004]: info: cib_process_request: Operation complete: op cib_modify for section cib (origin=local/crmd/190, version=0.29.1): ok (rc=0)
Jun 8 09:54:00 Ha-test1 corosync[2995]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 8 09:54:00 Ha-test1 crmd: [3008]: info: crm_ais_dispatch: Setting expected votes to 2
Jun 8 09:54:00 Ha-test1 cib: [3004]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/192, version=0.29.1): ok (rc=0)
Jun 8 09:54:00 Ha-test1 corosync[2995]: [MAIN ] Completed service synchronization, ready to provide service.
Jun 8 09:54:00 Ha-test1 crmd: [3008]: WARN: match_down_event: No match for shutdown action on Ha-test2
Jun 8 09:54:00 Ha-test1 crmd: [3008]: info: te_update_diff: Stonith/shutdown of Ha-test2 not matched
Jun 8 09:54:00 Ha-test1 crmd: [3008]: info: abort_transition_graph: te_update_diff:194 - Triggered transition abort (complete=1, tag=node_state, id=Ha-test2, magic=NA, cib=0.28.2) : Node failure
Jun 8 09:54:00 Ha-test1 crmd: [3008]: info: abort_transition_graph: need_abort:59 - Triggered transition abort (complete=1) : Non-status change
Jun 8 09:54:00 Ha-test1 crmd: [3008]: info: need_abort: Aborting on change to have-quorum
Jun 8 09:54:00 Ha-test1 crmd: [3008]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Jun 8 09:54:00 Ha-test1 crmd: [3008]: info: do_state_transition: All 1 cluster nodes are eligible to run resources.
Jun 8 09:54:00 Ha-test1 crmd: [3008]: info: do_pe_invoke: Query 193: Requesting the current CIB: S_POLICY_ENGINE
Jun 8 09:54:00 Ha-test1 crmd: [3008]: info: do_pe_invoke: Query 194: Requesting the current CIB: S_POLICY_ENGINE
Jun 8 09:54:00 Ha-test1 crmd: [3008]: info: do_pe_invoke_callback: Invoking the PE: query=194, ref=pe_calc-dc-1339142040-121, seq=44, quorate=0
Jun 8 09:54:00 Ha-test1 pengine: [3007]: info: unpack_config: Startup probes: enabled
Jun 8 09:54:00 Ha-test1 pengine: [3007]: info: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
Jun 8 09:54:00 Ha-test1 pengine: [3007]: WARN: cluster_status: We do not have quorum - fencing and resource management disabled
Jun 8 09:54:00 Ha-test1 pengine: [3007]: info: unpack_domains: Unpacking domains
Jun 8 09:54:00 Ha-test1 pengine: [3007]: ERROR: unpack_resources: Resource start-up disabled since no STONITH resources have been defined
Jun 8 09:54:00 Ha-test1 pengine: [3007]: ERROR: unpack_resources: Either configure some or disable STONITH with the stonith-enabled option
Jun 8 09:54:00 Ha-test1 pengine: [3007]: ERROR: unpack_resources: NOTE: Clusters with shared data need STONITH to ensure data integrity
Jun 8 09:54:00 Ha-test1 pengine: [3007]: info: unpack_status: Node Ha-test1 is in standby-mode
Jun 8 09:54:00 Ha-test1 cib: [4944]: info: write_cib_contents: Archived previous version as /var/lib/heartbeat/crm/cib-31.raw
Jun 8 09:54:00 Ha-test1 pengine: [3007]: info: determine_online_status: Node Ha-test1 is standby
Jun 8 09:54:00 Ha-test1 pengine: [3007]: WARN: pe_fence_node: Node Ha-test2 will be fenced because it is un-expectedly down
Jun 8 09:54:00 Ha-test1 pengine: [3007]: info: determine_online_status_fencing: ha_state=active, ccm_state=false, crm_state=online, join_state=member, expected=member
Jun 8 09:54:00 Ha-test1 pengine: [3007]: WARN: determine_online_status: Node Ha-test2 is unclean
Jun 8 09:54:00 Ha-test1 pengine: [3007]: WARN: stage6: Node Ha-test2 is unclean!
Jun 8 09:54:00 Ha-test1 pengine: [3007]: notice: stage6: Cannot fence unclean nodes until quorum is attained (or no-quorum-policy is set to ignore)
Jun 8 09:54:00 Ha-test1 crmd: [3008]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Jun 8 09:54:00 Ha-test1 crmd: [3008]: info: unpack_graph: Unpacked transition 85: 0 actions in 0 synapses
Jun 8 09:54:00 Ha-test1 crmd: [3008]: info: do_te_invoke: Processing graph 85 (ref=pe_calc-dc-1339142040-121) derived from /var/lib/pengine/pe-warn-5.bz2
Jun 8 09:54:00 Ha-test1 pengine: [3007]: WARN: process_pe_message: Transition 85: WARNINGs found during PE processing. PEngine Input stored in: /var/lib/pengine/pe-warn-5.bz2
Jun 8 09:54:00 Ha-test1 cib: [4944]: info: write_cib_contents: Wrote version 0.29.0 of the CIB to disk (digest: e018449c81fe810793f22facfe1c7d13)
Jun 8 09:54:00 Ha-test1 crmd: [3008]: info: run_graph: ====================================================
Jun 8 09:54:00 Ha-test1 pengine: [3007]: info: process_pe_message: Configuration ERRORs found during PE processing. Please run "crm_verify -L" to identify issues.
Jun 8 09:54:00 Ha-test1 crmd: [3008]: notice: run_graph: Transition 85 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-warn-5.bz2): Complete
Jun 8 09:54:00 Ha-test1 crmd: [3008]: info: te_graph_trigger: Transition 85 is now complete
Jun 8 09:54:00 Ha-test1 crmd: [3008]: info: notify_crmd: Transition 85 status: done - <null>
Jun 8 09:54:00 Ha-test1 crmd: [3008]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Jun 8 09:54:00 Ha-test1 crmd: [3008]: info: do_state_transition: Starting PEngine Recheck Timer
Jun 8 09:54:00 Ha-test1 cib: [4944]: info: retrieveCib: Reading cluster configuration from: /var/lib/heartbeat/crm/cib.ql9DEr (digest: /var/lib/heartbeat/crm/cib.T5Nf2z)
Jun 8 09:54:00 Ha-test1 mgmtd: [3009]: info: CIB query: cib
Jun 8 09:54:00 Ha-test1 mgmtd: [3009]: ERROR: unpack_resources: Resource start-up disabled since no STONITH resources have been defined
Jun 8 09:54:00 Ha-test1 mgmtd: [3009]: ERROR: unpack_resources: Either configure some or disable STONITH with the stonith-enabled option
Jun 8 09:54:00 Ha-test1 mgmtd: [3009]: ERROR: unpack_resources: NOTE: Clusters with shared data need STONITH to ensure data integrity

6. The node cannot be brought back up while the nodes live on different ESX hosts.
7. After migrating the nodes back to the same ESX server, the node becomes active again. Logs from the node:

Jun 8 09:59:08 Ha-test1 corosync[2995]: [CLM ] CLM CONFIGURATION CHANGE
Jun 8 09:59:08 Ha-test1 corosync[2995]: [CLM ] New Configuration:
Jun 8 09:59:08 Ha-test1 corosync[2995]: [CLM ] r(0) ip(10.203.202.50)
Jun 8 09:59:08 Ha-test1 corosync[2995]: [CLM ] Members Left:
Jun 8 09:59:08 Ha-test1 corosync[2995]: [CLM ] Members Joined:
Jun 8 09:59:08 Ha-test1 corosync[2995]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 48: memb=1, new=0, lost=0
Jun 8 09:59:08 Ha-test1 corosync[2995]: [pcmk ] info: pcmk_peer_update: memb: Ha-test1 852151050
Jun 8 09:59:08 Ha-test1 corosync[2995]: [CLM ] CLM CONFIGURATION CHANGE
Jun 8 09:59:08 Ha-test1 corosync[2995]: [CLM ] New Configuration:
Jun 8 09:59:08 Ha-test1 corosync[2995]: [CLM ] r(0) ip(10.203.202.50)
Jun 8 09:59:08 Ha-test1 corosync[2995]: [CLM ] r(0) ip(10.203.202.51)
Jun 8 09:59:08 Ha-test1 corosync[2995]: [CLM ] Members Left:
Jun 8 09:59:08 Ha-test1 cib: [3004]: notice: ais_dispatch: Membership 48: quorum acquired
Jun 8 09:59:08 Ha-test1 crmd: [3008]: notice: ais_dispatch: Membership 48: quorum acquired
Jun 8 09:59:08 Ha-test1 corosync[2995]: [CLM ] Members Joined:
Jun 8 09:59:08 Ha-test1 cib: [3004]: info: crm_update_peer: Node Ha-test2: id=868928266 state=member (new) addr=r(0) ip(10.203.202.51) votes=1 born=24 seen=48 proc=00000000000000000000000000151312
Jun 8 09:59:08 Ha-test1 crmd: [3008]: info: ais_status_callback: status: Ha-test2 is now member (was lost)
Jun 8 09:59:08 Ha-test1 corosync[2995]: [CLM ] r(0) ip(10.203.202.51)
Jun 8 09:59:08 Ha-test1 crmd: [3008]: info: crm_update_peer: Node Ha-test2: id=868928266 state=member (new) addr=r(0) ip(10.203.202.51) votes=1 born=24 seen=48 proc=00000000000000000000000000151312
Jun 8 09:59:08 Ha-test1 cib: [3004]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='Ha-test2']/lrm (origin=local/crmd/195, version=0.29.2): ok (rc=0)
Jun 8 09:59:08 Ha-test1 corosync[2995]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 48: memb=2, new=1, lost=0
Jun 8 09:59:08 Ha-test1 crmd: [3008]: info: crm_update_quorum: Updating quorum status to true (call=199)
Jun 8 09:59:08 Ha-test1 cib: [3004]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='Ha-test2']/transient_attributes (origin=local/crmd/196, version=0.29.3): ok (rc=0)
Jun 8 09:59:08 Ha-test1 corosync[2995]: [pcmk ] info: update_member: Node 868928266/Ha-test2 is now: member
Jun 8 09:59:08 Ha-test1 cib: [3004]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/197, version=0.29.3): ok (rc=0)
Jun 8 09:59:08 Ha-test1 corosync[2995]: [pcmk ] info: pcmk_peer_update: NEW: Ha-test2 868928266
Jun 8 09:59:08 Ha-test1 cib: [3004]: info: log_data_element: cib:diff: - <cib have-quorum="0" admin_epoch="0" epoch="29" num_updates="4" />
Jun 8 09:59:08 Ha-test1 corosync[2995]: [pcmk ] info: pcmk_peer_update: MEMB: Ha-test1 852151050
Jun 8 09:59:08 Ha-test1 cib: [3004]: info: log_data_element: cib:diff: + <cib have-quorum="1" admin_epoch="0" epoch="30" num_updates="1" />
Jun 8 09:59:08 Ha-test1 corosync[2995]: [pcmk ] info: pcmk_peer_update: MEMB: Ha-test2 868928266
Jun 8 09:59:08 Ha-test1 cib: [3004]: info: cib_process_request: Operation complete: op cib_modify for section cib (origin=local/crmd/199, version=0.30.1): ok (rc=0)
Jun 8 09:59:08 Ha-test1 corosync[2995]: [pcmk ] info: send_member_notification: Sending membership update 48 to 2 children
Jun 8 09:59:08 Ha-test1 cib: [3004]: WARN: cib_process_diff: Diff 0.28.1 -> 0.28.2 not applied to 0.30.1: current "epoch" is greater than required
Jun 8 09:59:08 Ha-test1 corosync[2995]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 8 09:59:08 Ha-test1 cib: [3004]: WARN: cib_process_diff: Diff 0.28.2 -> 0.28.3 not applied to 0.30.1: current "epoch" is greater than required
Jun 8 09:59:08 Ha-test1 crmd: [3008]: info: crm_ais_dispatch: Setting expected votes to 2
Jun 8 09:59:08 Ha-test1 cib: [3004]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/201, version=0.30.1): ok (rc=0)
Jun 8 09:59:08 Ha-test1 crmd: [3008]: info: erase_xpath_callback: Deletion of "//node_state[@uname='Ha-test2']/lrm": ok (rc=0)
Jun 8 09:59:08 Ha-test1 cib: [3004]: WARN: cib_process_diff: Diff 0.28.3 -> 0.28.4 not applied to 0.30.1: current "epoch" is greater than required
Jun 8 09:59:08 Ha-test1 crmd: [3008]: info: abort_transition_graph: te_update_diff:157 - Triggered transition abort (complete=1, tag=transient_attributes, id=Ha-test2, magic=NA, cib=0.29.3) : Transient attribute: removal
Jun 8 09:59:08 Ha-test1 crmd: [3008]: info: erase_xpath_callback: Deletion of "//node_state[@uname='Ha-test2']/transient_attributes": ok (rc=0)
Jun 8 09:59:08 Ha-test1 crmd: [3008]: info: abort_transition_graph: need_abort:59 - Triggered transition abort (complete=1) : Non-status change
Jun 8 09:59:08 Ha-test1 crmd: [3008]: info: need_abort: Aborting on change to have-quorum
Jun 8 09:59:08 Ha-test1 crmd: [3008]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Jun 8 09:59:08 Ha-test1 crmd: [3008]: info: do_state_transition: Membership changed: 40 -> 48 - join restart
Jun 8 09:59:08 Ha-test1 crmd: [3008]: info: do_pe_invoke: Query 202: Requesting the current CIB: S_POLICY_ENGINE
Jun 8 09:59:08 Ha-test1 crmd: [3008]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL origin=do_state_transition ]
Jun 8 09:59:08 Ha-test1 crmd: [3008]: info: update_dc: Unset DC Ha-test1
Jun 8 09:59:08 Ha-test1 crmd: [3008]: info: join_make_offer: Making join offers based on membership 48
Jun 8 09:59:08 Ha-test1 crmd: [3008]: info: do_dc_join_offer_all: join-5: Waiting on 2 outstanding join acks
Jun 8 09:59:08 Ha-test1 crmd: [3008]: ERROR: crmd_ha_msg_filter: Another DC detected: Ha-test2 (op=noop)
Jun 8 09:59:08 Ha-test1 crmd: [3008]: info: do_state_transition: State transition S_INTEGRATION -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=crmd_ha_msg_filter ]
Jun 8 09:59:08 Ha-test1 cib: [4950]: info: write_cib_contents: Archived previous version as /var/lib/heartbeat/crm/cib-32.raw
Jun 8 09:59:08 Ha-test1 cib: [3004]: WARN: cib_process_diff: Diff 0.28.4 -> 0.29.1 not applied to 0.30.1: current "epoch" is greater than required
Jun 8 09:59:08 Ha-test1 crmd: [3008]: WARN: do_log: FSA: Input I_JOIN_OFFER from route_message() received in state S_ELECTION
Jun 8 09:59:08 Ha-test1 cib: [4950]: info: write_cib_contents: Wrote version 0.30.0 of the CIB to disk (digest: 03fc798ce3a58a1753981a98097c0112)
Jun 8 09:59:08 Ha-test1 corosync[2995]: [MAIN ] Completed service synchronization, ready to provide service.
Jun 8 09:59:08 Ha-test1 cib: [4950]: info: retrieveCib: Reading cluster configuration from: /var/lib/heartbeat/crm/cib.S1OOss (digest: /var/lib/heartbeat/crm/cib.749AEB)
Jun 8 09:59:08 Ha-test1 crmd: [3008]: info: do_election_count_vote: Election 8 (owner: Ha-test2) pass: vote from Ha-test2 (Host name)
Jun 8 09:59:08 Ha-test1 crmd: [3008]: info: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=do_election_check ]
Jun 8 09:59:08 Ha-test1 crmd: [3008]: info: do_dc_takeover: Taking over DC status for this partition
Jun 8 09:59:08 Ha-test1 cib: [3004]: info: cib_process_readwrite: We are now in R/O mode
Jun 8 09:59:08 Ha-test1 cib: [3004]: info: cib_process_request: Operation complete: op cib_slave_all for section 'all' (origin=local/crmd/203, version=0.30.1): ok (rc=0)
Jun 8 09:59:08 Ha-test1 cib: [3004]: info: cib_process_readwrite: We are now in R/W mode
Jun 8 09:59:08 Ha-test1 cib: [3004]: info: cib_process_request: Operation complete: op cib_master for section 'all' (origin=local/crmd/204, version=0.30.1): ok (rc=0)
Jun 8 09:59:08 Ha-test1 cib: [3004]: info: cib_process_request: Operation complete: op cib_modify for section cib (origin=local/crmd/205, version=0.30.1): ok (rc=0)
Jun 8 09:59:08 Ha-test1 cib: [3004]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/207, version=0.30.1): ok (rc=0)
Jun 8 09:59:08 Ha-test1 cib: [3004]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/209, version=0.30.1): ok (rc=0)
Jun 8 09:59:08 Ha-test1 crmd: [3008]: info: do_dc_join_offer_all: join-6: Waiting on 2 outstanding join acks
Jun 8 09:59:08 Ha-test1 crmd: [3008]: info: ais_dispatch: Membership 48: quorum retained
Jun 8 09:59:08 Ha-test1 crmd: [3008]: info: crm_ais_dispatch: Setting expected votes to 2
Jun 8 09:59:08 Ha-test1 cib: [3004]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/212, version=0.30.1): ok (rc=0)
Jun 8 09:59:08 Ha-test1 crmd: [3008]: info: config_query_callback: Checking for expired actions every 900000ms
Jun 8 09:59:08 Ha-test1 crmd: [3008]: info: config_query_callback: Sending expected-votes=2 to corosync
Jun 8 09:59:08 Ha-test1 crmd: [3008]: info: update_dc: Set DC to Ha-test1 (3.0.2)
Jun 8 09:59:08 Ha-test1 crmd: [3008]: info: ais_dispatch: Membership 48: quorum retained
Jun 8 09:59:08 Ha-test1 crmd: [3008]: info: crm_ais_dispatch: Setting expected votes to 2
Jun 8 09:59:08 Ha-test1 cib: [3004]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/215, version=0.30.1): ok (rc=0)
Jun 8 09:59:08 Ha-test1 crmd: [3008]: info: do_state_transition: State transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED cause=C_FSA_INTERNAL origin=check_join_state ]
Jun 8 09:59:08 Ha-test1 crmd: [3008]: info: do_state_transition: All 2 cluster nodes responded to the join offer.
Jun 8 09:59:08 Ha-test1 crmd: [3008]: info: do_dc_join_finalize: join-6: Syncing the CIB from Ha-test1 to the rest of the cluster
Jun 8 09:59:08 Ha-test1 cib: [3004]: info: cib_process_request: Operation complete: op cib_sync for section 'all' (origin=local/crmd/216, version=0.30.1): ok (rc=0)
Jun 8 09:59:08 Ha-test1 cib: [3004]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/217, version=0.30.1): ok (rc=0)
Jun 8 09:59:08 Ha-test1 cib: [3004]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/218, version=0.30.1): ok (rc=0)
Jun 8 09:59:08 Ha-test1 crmd: [3008]: info: do_dc_join_ack: join-6: Updating node state to member for Ha-test1
Jun 8 09:59:08 Ha-test1 cib: [3004]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='Ha-test1']/lrm (origin=local/crmd/219, version=0.30.2): ok (rc=0)
Jun 8 09:59:08 Ha-test1 crmd: [3008]: info: erase_xpath_callback: Deletion of "//node_state[@uname='Ha-test1']/lrm": ok (rc=0)
Jun 8 09:59:08 Ha-test1 crmd: [3008]: info: do_dc_join_ack: join-6: Updating node state to member for Ha-test2
Jun 8 09:59:08 Ha-test1 cib: [3004]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='Ha-test2']/lrm (origin=local/crmd/221, version=0.30.3): ok (rc=0)
Jun 8 09:59:08 Ha-test1 crmd: [3008]: info: erase_xpath_callback: Deletion of "//node_state[@uname='Ha-test2']/lrm": ok (rc=0)
Jun 8 09:59:08 Ha-test1 crmd: [3008]: info: do_state_transition: State transition S_FINALIZE_JOIN -> S_POLICY_ENGINE [ input=I_FINALIZED cause=C_FSA_INTERNAL origin=check_join_state ]
Jun 8 09:59:08 Ha-test1 crmd: [3008]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
Jun 8 09:59:08 Ha-test1 crmd: [3008]: info: do_dc_join_final: Ensuring DC, quorum and node attributes are up-to-date
Jun 8 09:59:08 Ha-test1 crmd: [3008]: info: crm_update_quorum: Updating quorum status to true (call=225)
Jun 8 09:59:08 Ha-test1 crmd: [3008]: info: abort_transition_graph: do_te_invoke:176 - Triggered transition abort (complete=1) : Peer Cancelled
Jun 8 09:59:08 Ha-test1 crmd: [3008]: info: do_pe_invoke: Query 226: Requesting the current CIB: S_POLICY_ENGINE
Jun 8 09:59:08 Ha-test1 cib: [3004]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/223, version=0.30.4): ok (rc=0)
Jun 8 09:59:08 Ha-test1 cib: [3004]: info: cib_process_request: Operation complete: op cib_modify for section cib (origin=local/crmd/225, version=0.30.4): ok (rc=0)
Jun 8 09:59:08 Ha-test1 crmd: [3008]: info: abort_transition_graph: te_update_diff:146 - Triggered transition abort (complete=1, tag=transient_attributes, id=Ha-test2, magic=NA, cib=0.30.5) : Transient attribute: update
Jun 8 09:59:08 Ha-test1 attrd: [3006]: info: attrd_local_callback: Sending full refresh (origin=crmd)
Jun 8 09:59:08 Ha-test1 crmd: [3008]: info: do_pe_invoke_callback: Invoking the PE: query=226, ref=pe_calc-dc-1339142348-134, seq=48, quorate=1
Jun 8 09:59:08 Ha-test1 pengine: [3007]: info: unpack_config: Startup probes: enabled
Jun 8 09:59:08 Ha-test1 attrd: [3006]: info: attrd_trigger_update: Sending flush op to all hosts for: shutdown (<null>)
Jun 8 09:59:08 Ha-test1 crmd: [3008]: info: do_pe_invoke: Query 227: Requesting the current CIB: S_POLICY_ENGINE
Jun 8 09:59:08 Ha-test1 pengine: [3007]: info: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
Jun 8 09:59:08 Ha-test1 attrd: [3006]: info: attrd_trigger_update: Sending flush op to all hosts for: terminate (<null>)
Jun 8 09:59:08 Ha-test1 crmd: [3008]: info: do_pe_invoke_callback: Invoking the PE: query=227, ref=pe_calc-dc-1339142348-135, seq=48, quorate=1
Jun 8 09:59:08 Ha-test1 pengine: [3007]: info: unpack_domains: Unpacking domains
Jun 8 09:59:08 Ha-test1 attrd: [3006]: info: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
Jun 8 09:59:08 Ha-test1 pengine: [3007]: ERROR: unpack_resources: Resource start-up disabled since no STONITH resources have been defined
Jun 8 09:59:08 Ha-test1 pengine: [3007]: ERROR: unpack_resources: Either configure some or disable STONITH with the stonith-enabled option
Jun 8 09:59:08 Ha-test1 pengine: [3007]: ERROR: unpack_resources: NOTE: Clusters with shared data need STONITH to ensure data integrity
Jun 8 09:59:08 Ha-test1 pengine: [3007]: info: unpack_status: Node Ha-test1 is in standby-mode
Jun 8 09:59:08 Ha-test1 pengine: [3007]: info: determine_online_status: Node Ha-test1 is standby
Jun 8 09:59:08 Ha-test1 pengine: [3007]: info: determine_online_status: Node Ha-test2 is online
Jun 8 09:59:08 Ha-test1 pengine: [3007]: info: stage6: Delaying fencing operations until there are resources to manage
Jun 8 09:59:08 Ha-test1 crmd: [3008]: info: handle_response: pe_calc calculation pe_calc-dc-1339142348-134 is obsolete
Jun 8 09:59:08 Ha-test1 pengine: [3007]: info: process_pe_message: Transition 86: PEngine Input stored in: /var/lib/pengine/pe-input-124.bz2
Jun 8 09:59:08 Ha-test1 pengine: [3007]: info: process_pe_message: Configuration ERRORs found during PE processing. Please run "crm_verify -L" to identify issues.
Jun 8 09:59:08 Ha-test1 pengine: [3007]: info: unpack_config: Startup probes: enabled
Jun 8 09:59:08 Ha-test1 pengine: [3007]: info: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
Jun 8 09:59:08 Ha-test1 pengine: [3007]: info: unpack_domains: Unpacking domains
Jun 8 09:59:08 Ha-test1 pengine: [3007]: ERROR: unpack_resources: Resource start-up disabled since no STONITH resources have been defined
Jun 8 09:59:08 Ha-test1 pengine: [3007]: ERROR: unpack_resources: Either configure some or disable STONITH with the stonith-enabled option
Jun 8 09:59:08 Ha-test1 pengine: [3007]: ERROR: unpack_resources: NOTE: Clusters with shared data need STONITH to ensure data integrity
Jun 8 09:59:08 Ha-test1 pengine: [3007]: info: unpack_status: Node Ha-test1 is in standby-mode
Jun 8 09:59:08 Ha-test1 pengine: [3007]: info: determine_online_status: Node Ha-test1 is standby
Jun 8 09:59:08 Ha-test1 pengine: [3007]: info: determine_online_status: Node Ha-test2 is online
Jun 8 09:59:08 Ha-test1 pengine: [3007]: info: stage6: Delaying fencing operations until there are resources to manage
Jun 8 09:59:08 Ha-test1 crmd: [3008]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Jun 8 09:59:08 Ha-test1 crmd: [3008]: info: unpack_graph: Unpacked transition 87: 0 actions in 0 synapses
Jun 8 09:59:08 Ha-test1 crmd: [3008]: info: do_te_invoke: Processing graph 87 (ref=pe_calc-dc-1339142348-135) derived from /var/lib/pengine/pe-input-125.bz2
Jun 8 09:59:08 Ha-test1 crmd: [3008]: info: run_graph: ====================================================
Jun 8 09:59:08 Ha-test1 crmd: [3008]: notice: run_graph: Transition 87 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-125.bz2): Complete
Jun 8 09:59:08 Ha-test1 crmd: [3008]: info: te_graph_trigger: Transition 87 is now complete
Jun 8 09:59:08 Ha-test1 crmd: [3008]: info: notify_crmd: Transition 87 status: done - <null>
Jun 8 09:59:08 Ha-test1 crmd: [3008]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Jun 8 09:59:08 Ha-test1 crmd: [3008]: info: do_state_transition: Starting PEngine Recheck Timer
Jun 8 09:59:08 Ha-test1 pengine: [3007]: info: process_pe_message: Transition 87: PEngine Input stored in: /var/lib/pengine/pe-input-125.bz2
Jun 8 09:59:08 Ha-test1 pengine: [3007]: info: process_pe_message: Configuration ERRORs found during PE processing. Please run "crm_verify -L" to identify issues.
Jun 8 09:59:08 Ha-test1 mgmtd: [3009]: info: CIB query: cib
Jun 8 09:59:08 Ha-test1 mgmtd: [3009]: ERROR: unpack_resources: Resource start-up disabled since no STONITH resources have been defined
Jun 8 09:59:08 Ha-test1 mgmtd: [3009]: ERROR: unpack_resources: Either configure some or disable STONITH with the stonith-enabled option
Jun 8 09:59:08 Ha-test1 mgmtd: [3009]: ERROR: unpack_resources: NOTE: Clusters with shared data need STONITH to ensure data integrity

Conclusion:

- The hosts do work on different ESX servers in the cluster. The problem is that when one ESX host fails, or I put a node into standby, the node cannot come back online if it's not on the same ESX host as the other node.

Magic31
08-Jun-2012, 09:17
VMs have different MAC addresses:

I meant duplicate MACs looking at the total environment (also taking the other VMs' MACs into consideration) :)



I cannot place one VM on another cluster, because of the different network segments.

What I meant was at the ESX level, by doing the vMotion...


Could it be related to the process of creating the VMs? I installed one VM, then cloned the first one to create the second. The HA software was installed after cloning.

Yes, that is possible in certain circumstances... but it should not happen when running the VMs on the same ESX host, or when they are centrally managed through vCenter (which you seem to be using) - in that situation the virtual NIC MACs are checked and adjusted where needed.


I don't understand which logs you mean?

Logging on the network device(s) (the switches the ESX hosts are connected to); those might show errors/warnings that explain why VM communication fails.

Another thing to check is whether clients on the network ever have issues connecting to services/resources on the two SLES VM nodes you are having trouble with.

Maybe take it back a step and check that when the SLES HA cluster is up and running (both nodes)... are the services and IPs reachable from all other devices in your network? Also, I could well be misunderstanding what the issue is...

-Willem

klemenvi
08-Jun-2012, 09:46
Maybe you can see something useful in the logs in my previous post.

klemenvi
08-Jun-2012, 10:15
I tried disabling STONITH (crm configure property stonith-enabled=false) because of the errors in the previous log, and the status of the nodes changed from:

============
Last updated: Fri Jun 8 11:11:13 2012
Stack: openais
Current DC: Ha-test1 - partition WITHOUT quorum
Version: 1.1.2-2e096a41a5f9e184a1c1537c82c6da1093698eb5
2 Nodes configured, 2 expected votes
0 Resources configured.
============

Node Ha-test2: UNCLEAN (offline)
Online: [ Ha-test1 ]

to



============
Last updated: Fri Jun 8 11:07:56 2012
Stack: openais
Current DC: Ha-test1 - partition WITHOUT quorum
Version: 1.1.2-2e096a41a5f9e184a1c1537c82c6da1093698eb5
2 Nodes configured, 2 expected votes
0 Resources configured.
============

Online: [ Ha-test1 ]
OFFLINE: [ Ha-test2 ]

Both nodes see each other as offline.

jmozdzen
08-Jun-2012, 11:09
Hi klemenvi,

there are a few things you have either not looked at, or whose resulting information you have not included in your posts...

- you have two VMs, ha-test1 (IP: 10.203.202.50) and ha-test2 (IP: 10.203.202.51)
- you bring up both VMs on the same ESX node, everything works as expected
- you live migrate ha-test2 to a different ESX node of the cluster, still everything works as expected
- you say "I put a node into standby": what are you talking about, the ESX node or the pacemaker (SLES11 HAE) node? I guess you mean you set pacemaker on ha-test1 to "standby". Where did you initiate this, on ha-test1?
- what does ha-test2 see after you brought ha-test1 into standby?
- you say "the node cannot come back online if it's not on the same ESX host": from ha-test1's log I see you put it into standby successfully, but IP communications between the pacemaker cluster nodes are down:
Jun 8 09:53:56 Ha-test1 corosync[2995]: [TOTEM ] FAILED TO RECEIVE
Jun 8 09:54:00 Ha-test1 corosync[2995]: [CLM ] CLM CONFIGURATION CHANGE
Jun 8 09:54:00 Ha-test1 corosync[2995]: [CLM ] New Configuration:
Jun 8 09:54:00 Ha-test1 corosync[2995]: [CLM ] r(0) ip(10.203.202.50)
Jun 8 09:54:00 Ha-test1 corosync[2995]: [CLM ] Members Left:
Jun 8 09:54:00 Ha-test1 corosync[2995]: [CLM ] r(0) ip(10.203.202.51)
Jun 8 09:54:00 Ha-test1 corosync[2995]: [CLM ] Members Joined:
Jun 8 09:54:00 Ha-test1 corosync[2995]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 44: memb=1, new=0, lost=1
Jun 8 09:54:00 Ha-test1 corosync[2995]: [pcmk ] info: pcmk_peer_update: memb: Ha-test1 852151050
Jun 8 09:54:00 Ha-test1 corosync[2995]: [pcmk ] info: pcmk_peer_update: lost: Ha-test2 868928266
Jun 8 09:54:00 Ha-test1 corosync[2995]: [CLM ] CLM CONFIGURATION CHANGE
Jun 8 09:54:00 Ha-test1 corosync[2995]: [CLM ] New Configuration:
Jun 8 09:54:00 Ha-test1 corosync[2995]: [CLM ] r(0) ip(10.203.202.50)
Jun 8 09:54:00 Ha-test1 cib: [3004]: notice: ais_dispatch: Membership 44: quorum lost

- what happened to ha-test2?
- if both VMs are still up, how's the IP connectivity between these nodes? Can they reach each other via ping? Can they only reach each other when on a common ESX node?
- if the VMs cannot see each other via IP when on different ESX nodes - how's your test running if you wait for some time so that the nodes "detect" they're solo?

Seems you're running into a split brain situation and as always you'd be better off with a proper stonithing solution.

Regards,
Jens

klemenvi
08-Jun-2012, 11:30
Hi klemenvi,

there are a few things you have either not looked at, or whose resulting information you have not included in your posts...

- you have two VMs, ha-test1 (IP: 10.203.202.50) and ha-test2 (IP: 10.203.202.51) yes
- you bring up both VMs on the same ESX node, everything works as expected yes
- you live migrate ha-test2 to a different ESX node of the cluster, still everything works as expected yes
- you say "I put a node into standby": what are you talking about, the ESX node or the pacemaker (SLES11 HAE) node? I guess you mean you set pacemaker on ha-test1 to "standby". Where did you initiate this, on ha-test1? Yes, I put node ha-test1 into standby.
- what does ha-test2 see after you brought ha-test1 into standby? It sees itself as alive and ha-test1 as offline. The same on ha-test1: it sees itself as alive (but standby) and ha-test2 as offline.
- you say "the node cannot come back online if it's not on the same ESX host": from ha-test1's log I see you put it into standby successfully, but IP communications between the pacemaker cluster nodes are down:
Jun 8 09:53:56 Ha-test1 corosync[2995]: [TOTEM ] FAILED TO RECEIVE
Jun 8 09:54:00 Ha-test1 corosync[2995]: [CLM ] CLM CONFIGURATION CHANGE
Jun 8 09:54:00 Ha-test1 corosync[2995]: [CLM ] New Configuration:
Jun 8 09:54:00 Ha-test1 corosync[2995]: [CLM ] r(0) ip(10.203.202.50)
Jun 8 09:54:00 Ha-test1 corosync[2995]: [CLM ] Members Left:
Jun 8 09:54:00 Ha-test1 corosync[2995]: [CLM ] r(0) ip(10.203.202.51)
Jun 8 09:54:00 Ha-test1 corosync[2995]: [CLM ] Members Joined:
Jun 8 09:54:00 Ha-test1 corosync[2995]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 44: memb=1, new=0, lost=1
Jun 8 09:54:00 Ha-test1 corosync[2995]: [pcmk ] info: pcmk_peer_update: memb: Ha-test1 852151050
Jun 8 09:54:00 Ha-test1 corosync[2995]: [pcmk ] info: pcmk_peer_update: lost: Ha-test2 868928266
Jun 8 09:54:00 Ha-test1 corosync[2995]: [CLM ] CLM CONFIGURATION CHANGE
Jun 8 09:54:00 Ha-test1 corosync[2995]: [CLM ] New Configuration:
Jun 8 09:54:00 Ha-test1 corosync[2995]: [CLM ] r(0) ip(10.203.202.50)
Jun 8 09:54:00 Ha-test1 cib: [3004]: notice: ais_dispatch: Membership 44: quorum lost

- what happened to ha-test2? Logs from ha-test2 when ha-test1 became offline:


Jun 8 12:24:59 Ha-test2 corosync[2997]: [TOTEM ] A processor failed, forming new configuration.
Jun 8 12:25:03 Ha-test2 corosync[2997]: [CLM ] CLM CONFIGURATION CHANGE
Jun 8 12:25:03 Ha-test2 corosync[2997]: [CLM ] New Configuration:
Jun 8 12:25:03 Ha-test2 corosync[2997]: [CLM ] r(0) ip(10.203.202.51)
Jun 8 12:25:03 Ha-test2 corosync[2997]: [CLM ] Members Left:
Jun 8 12:25:03 Ha-test2 corosync[2997]: [CLM ] r(0) ip(10.203.202.50)
Jun 8 12:25:03 Ha-test2 corosync[2997]: [CLM ] Members Joined:
Jun 8 12:25:03 Ha-test2 corosync[2997]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 68: memb=1, new=0, lost=1
Jun 8 12:25:03 Ha-test2 corosync[2997]: [pcmk ] info: pcmk_peer_update: memb: Ha-test2 868928266
Jun 8 12:25:03 Ha-test2 corosync[2997]: [pcmk ] info: pcmk_peer_update: lost: Ha-test1 852151050
Jun 8 12:25:03 Ha-test2 corosync[2997]: [CLM ] CLM CONFIGURATION CHANGE
Jun 8 12:25:03 Ha-test2 corosync[2997]: [CLM ] New Configuration:
Jun 8 12:25:03 Ha-test2 corosync[2997]: [CLM ] r(0) ip(10.203.202.51)
Jun 8 12:25:03 Ha-test2 crmd: [3009]: notice: ais_dispatch: Membership 68: quorum lost
Jun 8 12:25:03 Ha-test2 cib: [3005]: notice: ais_dispatch: Membership 68: quorum lost
Jun 8 12:25:03 Ha-test2 corosync[2997]: [CLM ] Members Left:
Jun 8 12:25:03 Ha-test2 crmd: [3009]: info: crm_update_peer: Node Ha-test1: id=852151050 state=lost (new) addr=r(0) ip(10.203.202.50) votes=1 born=24 seen=64 proc=00000000000000000000000000151312
Jun 8 12:25:03 Ha-test2 cib: [3005]: info: crm_update_peer: Node Ha-test1: id=852151050 state=lost (new) addr=r(0) ip(10.203.202.50) votes=1 born=24 seen=64 proc=00000000000000000000000000151312
Jun 8 12:25:03 Ha-test2 corosync[2997]: [CLM ] Members Joined:
Jun 8 12:25:03 Ha-test2 crmd: [3009]: WARN: check_dead_member: Our DC node (Ha-test1) left the cluster
Jun 8 12:25:03 Ha-test2 corosync[2997]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 68: memb=1, new=0, lost=0
Jun 8 12:25:03 Ha-test2 crmd: [3009]: info: do_state_transition: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=check_dead_member ]
Jun 8 12:25:03 Ha-test2 corosync[2997]: [pcmk ] info: pcmk_peer_update: MEMB: Ha-test2 868928266
Jun 8 12:25:03 Ha-test2 crmd: [3009]: info: update_dc: Unset DC Ha-test1
Jun 8 12:25:03 Ha-test2 corosync[2997]: [pcmk ] info: ais_mark_unseen_peer_dead: Node Ha-test1 was not seen in the previous transition
Jun 8 12:25:03 Ha-test2 crmd: [3009]: info: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=do_election_check ]
Jun 8 12:25:03 Ha-test2 corosync[2997]: [pcmk ] info: update_member: Node 852151050/Ha-test1 is now: lost
Jun 8 12:25:03 Ha-test2 crmd: [3009]: info: do_te_control: Registering TE UUID: 443104e2-ceea-42d0-89c3-1b716d922cc6
Jun 8 12:25:03 Ha-test2 corosync[2997]: [pcmk ] info: send_member_notification: Sending membership update 68 to 2 children
Jun 8 12:25:03 Ha-test2 crmd: [3009]: WARN: cib_client_add_notify_callback: Callback already present
Jun 8 12:25:03 Ha-test2 corosync[2997]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 8 12:25:03 Ha-test2 crmd: [3009]: info: set_graph_functions: Setting custom graph functions
Jun 8 12:25:03 Ha-test2 corosync[2997]: [MAIN ] Completed service synchronization, ready to provide service.
Jun 8 12:25:03 Ha-test2 crmd: [3009]: info: unpack_graph: Unpacked transition -1: 0 actions in 0 synapses
Jun 8 12:25:03 Ha-test2 crmd: [3009]: info: do_dc_takeover: Taking over DC status for this partition
Jun 8 12:25:03 Ha-test2 cib: [3005]: info: cib_process_readwrite: We are now in R/W mode
Jun 8 12:25:03 Ha-test2 cib: [3005]: info: cib_process_request: Operation complete: op cib_master for section 'all' (origin=local/crmd/193, version=0.48.5): ok (rc=0)
Jun 8 12:25:03 Ha-test2 cib: [3005]: info: cib_process_request: Operation complete: op cib_modify for section cib (origin=local/crmd/194, version=0.48.5): ok (rc=0)
Jun 8 12:25:03 Ha-test2 cib: [3005]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/196, version=0.48.5): ok (rc=0)
Jun 8 12:25:03 Ha-test2 crmd: [3009]: info: join_make_offer: Making join offers based on membership 68
Jun 8 12:25:03 Ha-test2 cib: [3005]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/198, version=0.48.5): ok (rc=0)
Jun 8 12:25:03 Ha-test2 crmd: [3009]: info: do_dc_join_offer_all: join-7: Waiting on 1 outstanding join acks
Jun 8 12:25:03 Ha-test2 crmd: [3009]: info: ais_dispatch: Membership 68: quorum still lost
Jun 8 12:25:03 Ha-test2 crmd: [3009]: info: crm_ais_dispatch: Setting expected votes to 2
Jun 8 12:25:03 Ha-test2 cib: [3005]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/201, version=0.48.5): ok (rc=0)
Jun 8 12:25:03 Ha-test2 crmd: [3009]: info: config_query_callback: Checking for expired actions every 900000ms
Jun 8 12:25:03 Ha-test2 crmd: [3009]: info: config_query_callback: Sending expected-votes=2 to corosync
Jun 8 12:25:03 Ha-test2 crmd: [3009]: info: update_dc: Set DC to Ha-test2 (3.0.2)
Jun 8 12:25:03 Ha-test2 crmd: [3009]: info: ais_dispatch: Membership 68: quorum still lost
Jun 8 12:25:03 Ha-test2 crmd: [3009]: info: crm_ais_dispatch: Setting expected votes to 2
Jun 8 12:25:03 Ha-test2 cib: [3005]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/204, version=0.48.5): ok (rc=0)
Jun 8 12:25:03 Ha-test2 crmd: [3009]: info: do_state_transition: State transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED cause=C_FSA_INTERNAL origin=check_join_state ]
Jun 8 12:25:03 Ha-test2 crmd: [3009]: info: do_state_transition: All 1 cluster nodes responded to the join offer.
Jun 8 12:25:03 Ha-test2 crmd: [3009]: info: do_dc_join_finalize: join-7: Syncing the CIB from Ha-test2 to the rest of the cluster
Jun 8 12:25:03 Ha-test2 cib: [3005]: info: cib_process_request: Operation complete: op cib_sync for section 'all' (origin=local/crmd/205, version=0.48.5): ok (rc=0)
Jun 8 12:25:03 Ha-test2 cib: [3005]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/206, version=0.48.5): ok (rc=0)
Jun 8 12:25:03 Ha-test2 crmd: [3009]: info: do_dc_join_ack: join-7: Updating node state to member for Ha-test2
Jun 8 12:25:03 Ha-test2 cib: [3005]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='Ha-test2']/lrm (origin=local/crmd/207, version=0.48.6): ok (rc=0)
Jun 8 12:25:03 Ha-test2 crmd: [3009]: info: erase_xpath_callback: Deletion of "//node_state[@uname='Ha-test2']/lrm": ok (rc=0)
Jun 8 12:25:03 Ha-test2 crmd: [3009]: info: do_state_transition: State transition S_FINALIZE_JOIN -> S_POLICY_ENGINE [ input=I_FINALIZED cause=C_FSA_INTERNAL origin=check_join_state ]
Jun 8 12:25:03 Ha-test2 crmd: [3009]: info: do_state_transition: All 1 cluster nodes are eligible to run resources.
Jun 8 12:25:03 Ha-test2 cib: [3005]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/209, version=0.48.7): ok (rc=0)
Jun 8 12:25:03 Ha-test2 crmd: [3009]: info: do_dc_join_final: Ensuring DC, quorum and node attributes are up-to-date
Jun 8 12:25:03 Ha-test2 crmd: [3009]: info: crm_update_quorum: Updating quorum status to false (call=211)
Jun 8 12:25:03 Ha-test2 attrd: [3007]: info: attrd_local_callback: Sending full refresh (origin=crmd)
Jun 8 12:25:03 Ha-test2 cib: [3005]: info: log_data_element: cib:diff: - <cib have-quorum="1" dc-uuid="Ha-test1" admin_epoch="0" epoch="48" num_updates="8" />
Jun 8 12:25:03 Ha-test2 crmd: [3009]: info: abort_transition_graph: do_te_invoke:176 - Triggered transition abort (complete=1) : Peer Cancelled
Jun 8 12:25:03 Ha-test2 attrd: [3007]: info: attrd_trigger_update: Sending flush op to all hosts for: shutdown (<null>)
Jun 8 12:25:03 Ha-test2 cib: [3005]: info: log_data_element: cib:diff: + <cib have-quorum="0" dc-uuid="Ha-test2" admin_epoch="0" epoch="49" num_updates="1" />
Jun 8 12:25:03 Ha-test2 crmd: [3009]: info: do_pe_invoke: Query 212: Requesting the current CIB: S_POLICY_ENGINE
Jun 8 12:25:03 Ha-test2 cib: [3005]: info: cib_process_request: Operation complete: op cib_modify for section cib (origin=local/crmd/211, version=0.49.1): ok (rc=0)
Jun 8 12:25:03 Ha-test2 crmd: [3009]: WARN: match_down_event: No match for shutdown action on Ha-test1
Jun 8 12:25:03 Ha-test2 attrd: [3007]: info: attrd_trigger_update: Sending flush op to all hosts for: terminate (<null>)
Jun 8 12:25:03 Ha-test2 crmd: [3009]: info: te_update_diff: Stonith/shutdown of Ha-test1 not matched
Jun 8 12:25:03 Ha-test2 attrd: [3007]: info: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
Jun 8 12:25:03 Ha-test2 crmd: [3009]: info: abort_transition_graph: te_update_diff:194 - Triggered transition abort (complete=1, tag=node_state, id=Ha-test1, magic=NA, cib=0.48.8) : Node failure
Jun 8 12:25:03 Ha-test2 cib: [5136]: info: write_cib_contents: Archived previous version as /var/lib/heartbeat/crm/cib-43.raw
Jun 8 12:25:03 Ha-test2 crmd: [3009]: info: abort_transition_graph: need_abort:59 - Triggered transition abort (complete=1) : Non-status change
Jun 8 12:25:03 Ha-test2 crmd: [3009]: info: need_abort: Aborting on change to have-quorum
Jun 8 12:25:03 Ha-test2 crmd: [3009]: info: do_pe_invoke_callback: Invoking the PE: query=212, ref=pe_calc-dc-1339151103-98, seq=68, quorate=0
Jun 8 12:25:03 Ha-test2 pengine: [3008]: info: unpack_config: Startup probes: enabled
Jun 8 12:25:03 Ha-test2 crmd: [3009]: info: do_pe_invoke: Query 213: Requesting the current CIB: S_POLICY_ENGINE
Jun 8 12:25:03 Ha-test2 pengine: [3008]: info: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
Jun 8 12:25:03 Ha-test2 crmd: [3009]: info: do_pe_invoke: Query 214: Requesting the current CIB: S_POLICY_ENGINE
Jun 8 12:25:03 Ha-test2 pengine: [3008]: WARN: cluster_status: We do not have quorum - fencing and resource management disabled
Jun 8 12:25:03 Ha-test2 crmd: [3009]: info: do_pe_invoke_callback: Invoking the PE: query=214, ref=pe_calc-dc-1339151103-99, seq=68, quorate=0
Jun 8 12:25:03 Ha-test2 pengine: [3008]: info: unpack_domains: Unpacking domains
Jun 8 12:25:03 Ha-test2 pengine: [3008]: info: determine_online_status: Node Ha-test2 is online
Jun 8 12:25:03 Ha-test2 crmd: [3009]: info: handle_response: pe_calc calculation pe_calc-dc-1339151103-98 is obsolete
Jun 8 12:25:03 Ha-test2 cib: [5136]: info: write_cib_contents: Wrote version 0.49.0 of the CIB to disk (digest: 421aa13bdf8ae39c9788ae1423d5ec78)
Jun 8 12:25:03 Ha-test2 pengine: [3008]: info: process_pe_message: Transition 14: PEngine Input stored in: /var/lib/pengine/pe-input-44.bz2
Jun 8 12:25:03 Ha-test2 pengine: [3008]: info: unpack_config: Startup probes: enabled
Jun 8 12:25:03 Ha-test2 pengine: [3008]: info: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
Jun 8 12:25:03 Ha-test2 pengine: [3008]: WARN: cluster_status: We do not have quorum - fencing and resource management disabled
Jun 8 12:25:03 Ha-test2 pengine: [3008]: info: unpack_domains: Unpacking domains
Jun 8 12:25:03 Ha-test2 pengine: [3008]: info: determine_online_status: Node Ha-test2 is online
Jun 8 12:25:03 Ha-test2 crmd: [3009]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Jun 8 12:25:03 Ha-test2 crmd: [3009]: info: unpack_graph: Unpacked transition 15: 0 actions in 0 synapses
Jun 8 12:25:03 Ha-test2 crmd: [3009]: info: do_te_invoke: Processing graph 15 (ref=pe_calc-dc-1339151103-99) derived from /var/lib/pengine/pe-input-45.bz2
Jun 8 12:25:03 Ha-test2 crmd: [3009]: info: run_graph: ====================================================
Jun 8 12:25:03 Ha-test2 crmd: [3009]: notice: run_graph: Transition 15 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-45.bz2): Complete
Jun 8 12:25:03 Ha-test2 crmd: [3009]: info: te_graph_trigger: Transition 15 is now complete
Jun 8 12:25:03 Ha-test2 crmd: [3009]: info: notify_crmd: Transition 15 status: done - <null>
Jun 8 12:25:03 Ha-test2 crmd: [3009]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Jun 8 12:25:03 Ha-test2 crmd: [3009]: info: do_state_transition: Starting PEngine Recheck Timer
Jun 8 12:25:03 Ha-test2 cib: [5136]: info: retrieveCib: Reading cluster configuration from: /var/lib/heartbeat/crm/cib.TvR0yI (digest: /var/lib/heartbeat/crm/cib.0MB8PY)
Jun 8 12:25:03 Ha-test2 pengine: [3008]: info: process_pe_message: Transition 15: PEngine Input stored in: /var/lib/pengine/pe-input-45.bz2



- if both VMs are still up, how's the IP connectivity between these nodes? Can they reach each other via ping? Can they only reach each other when on a common ESX node? They can reach each other via ping even when on different ESX hosts.
- if the VMs cannot see each other via IP when on different ESX nodes - how's your test running if you wait for some time so that the nodes "detect" they're solo?

Seems you're running into a split brain situation and as always you'd be better off with a proper stonithing solution.

Regards,
Jens

jmozdzen
08-Jun-2012, 11:47
Hi klemenvi,

judging from the logs you have an IP connectivity problem, despite your test result "they can reach each other via ping even when on different ESX hosts". How long did you wait after migrating the VM to the other ESX node before conducting your tests? Can you see successful CIB updates or the like after the migration (as proof of working connectivity between the pacemaker nodes)?

Regards,
Jens

klemenvi
08-Jun-2012, 11:55
The nodes are on different ESX hosts; node ha-test1 has been offline since my last post. Ping works:

Ha-test1:/var/log # ping 10.203.202.51
PING 10.203.202.51 (10.203.202.51) 56(84) bytes of data.
64 bytes from 10.203.202.51: icmp_seq=1 ttl=64 time=3.30 ms
64 bytes from 10.203.202.51: icmp_seq=2 ttl=64 time=0.282 ms
64 bytes from 10.203.202.51: icmp_seq=3 ttl=64 time=0.450 ms
64 bytes from 10.203.202.51: icmp_seq=4 ttl=64 time=0.284 ms
64 bytes from 10.203.202.51: icmp_seq=5 ttl=64 time=0.276 ms
64 bytes from 10.203.202.51: icmp_seq=6 ttl=64 time=0.246 ms
64 bytes from 10.203.202.51: icmp_seq=7 ttl=64 time=0.253 ms
^C
--- 10.203.202.51 ping statistics ---
7 packets transmitted, 7 received, 0% packet loss, time 5998ms
rtt min/avg/max/mdev = 0.246/0.727/3.303/1.053 ms

jmozdzen
08-Jun-2012, 12:26
> The nodes are on different ESX hosts; node ha-test1 has been offline since my last post.

I don't agree: both pacemaker nodes are either online or in standby, but each reports the other node as offline. That is something completely different. Judging from your logs, the pacemaker nodes don't see each other on the IP level (and there's much more to that than a simple "ping"), which is called a "split-brain" situation.

Since this communications problem doesn't happen when both nodes are on the same ESX server, you should have a closer look at the network setup. (I agree that the working "ping" is confusing - but the cluster traffic in question is not ICMP but UDP multicast; maybe that is blocked on your switch(es), or something along those lines.)
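
Since plain ping only proves ICMP, it can help to probe multicast delivery directly at the socket level. The sketch below uses placeholder group/port values (check the totem section of your corosync.conf for the real ones), and as written it exercises both sides over the loopback interface of a single host just to show the mechanics. For a real test, run the receiver half on one node and the sender half on the other, using the cluster NIC's address instead of 127.0.0.1:

```python
import socket
import struct

# Placeholder values -- the real group/port are in the totem section of
# /etc/corosync/corosync.conf on your nodes.
GROUP = "239.255.42.99"
PORT = 5405

# Receiver side: bind the port and join the multicast group.
# (In a real test this half runs on one cluster node, joined on the
# cluster NIC's address instead of 127.0.0.1.)
rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
rx.bind(("", PORT))
mreq = struct.pack("4s4s", socket.inet_aton(GROUP), socket.inet_aton("127.0.0.1"))
rx.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
rx.settimeout(3.0)

# Sender side: push one datagram to the group (runs on the other node
# in a real test).
tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_IF, socket.inet_aton("127.0.0.1"))
tx.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_LOOP, 1)
sent = tx.sendto(b"mcast-probe", (GROUP, PORT))

try:
    data, addr = rx.recvfrom(1024)
    print("received %r from %s" % (data, addr[0]))
except socket.timeout:
    print("no multicast datagram received -- delivery is being dropped")
```

If the receiver on one ESX host never sees the probe sent from the other, the multicast group is being dropped somewhere between the hosts (IGMP snooping on the switches is a common culprit).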

Regards,
Jens

klemenvi
08-Jun-2012, 12:42
I talked with the network team about the multicast traffic. They say multicast is enabled; I tried on both Cisco and Extreme switches, but no difference. They don't see any requests from the nodes.
Is it possible to use broadcast traffic for test?

Thanks for help.

jmozdzen
08-Jun-2012, 13:07
Hi klemenvi,

> I talk with network team [...]They don't see any requests from nodes.

"any requests" as in "no multicast traffic", or as in "no traffic at all"? Because in the latter case you'd have to wonder who's responding to the ICMP requests.

I'm not too deep into ESX - any known problems with multicast traffic there? How's your (related) networking set up within ESX in the first place?

> Is it possible to use broadcast traffic for test?

Not that I know of.

With regards
Jens

klemenvi
08-Jun-2012, 13:19
There is no multicast traffic on these ESX interfaces.
No, no known multicast traffic problems related to ESX. What do you mean about the ESX network configuration? It's nothing special; we have 4 ESX servers in a cluster with DRS for live migration.

jmozdzen
08-Jun-2012, 13:33
There is no multicast traffic on these ESX interfaces.
No, no known multicast traffic problems related to ESX. What do you mean about the ESX network configuration? It's nothing special; we have 4 ESX servers in a cluster with DRS for live migration.

AFAIK there are several ways to configure the network connections, including tuning of the IGMP parameters.

Out of curiosity I did a quick search on the Internet and voila: you're not alone, and I seem to have been wrong concerning protocols other than multicast: http://www.gossamer-threads.com/lists/linuxha/pacemaker/70390
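
One concrete thing to double-check with the network team: the exact group and port corosync uses, from the totem section of /etc/corosync/corosync.conf on both nodes. It looks roughly like this (the values below are illustrative, except the subnet, which I took from your logs):

```
totem {
    version: 2
    interface {
        ringnumber: 0
        bindnetaddr: 10.203.202.0   # network of the cluster interfaces
        mcastaddr: 239.255.1.1      # the group the switches/IGMP snooping must forward
        mcastport: 5405
    }
}
```

That mcastaddr/mcastport pair is what should show up (IGMP joins and UDP traffic) on the switch ports of both ESX hosts.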

Regards,
Jens