
View Full Version : DRBD MySQL Cluster configuration issue or bug



mahmoodhassan
01-Aug-2014, 10:32
Hi There,

I have configured an HA cluster for MySQL with DRBD, Pacemaker and Corosync. I was able to bring the cluster up a few times.
But even when the cluster is up, with node1 as master and node2 as slave, if I shut down node1, node2 never comes up as master until I restart the openais service several times - and vice versa.

Before the last shutdown the cluster was working; now I have rebooted both nodes and the cluster is down and will not come back up.
I have tried the same configuration on Ubuntu and it works absolutely fine, so I am not sure whether this is a bug in the SLES HA solution or whether I am missing something. We have SLES 11 SP3 and are thinking of buying the HA extension, but so far I have not been able to demonstrate whether this solution will work.

Can anyone please help me fix this problem?

Thanks

Mahmood

Versions:
SLES11 SP3 full licensed
SLE-HA-11 evaluation
Kernel: 3.0.101-0.35-default
drbd-8.4.4-0.22.9
corosync-1.4.6-0.7.1
pacemaker-1.1.10-0.15.25
lsb-4.0-2.4.21


node sles1
node sles2
primitive p_drbd_mysql ocf:linbit:drbd \
params drbd_resource="r0" \
op monitor interval="30s" role="Slave" \
op monitor interval="15s" role="Master"
primitive p_fs_mysql ocf:heartbeat:Filesystem \
params device="/dev/drbd0" directory="/var/lib/mysql_drbd" fstype="ext3"
primitive p_ip_mysql ocf:heartbeat:IPaddr2 \
params ip="172.16.100.20" cidr_netmask="24" nic="eth1"
primitive p_mysql lsb:mysql
group g_mysql p_fs_mysql p_ip_mysql p_mysql
ms ms_drbd_mysql p_drbd_mysql \
meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
colocation c_mysql_on_drbd inf: g_mysql ms_drbd_mysql:Master
order o_drbd_before_mysql inf: ms_drbd_mysql:promote g_mysql:start
property $id="cib-bootstrap-options" \
dc-version="1.1.10-f3eeaf4" \
cluster-infrastructure="classic openais (with plugin)" \
expected-quorum-votes="2" \
no-quorum-policy="ignore" \
stonith-enabled="false"
rsc_defaults $id="rsc-options" \
resource-stickiness="100"





Jul 30 11:53:01 sles1 crmd[4494]: notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
Jul 30 11:53:01 sles1 crmd[4494]: notice: get_node_name: Defaulting to uname -n for the local classic openais (with plugin) node name
Jul 30 11:53:01 sles1 crmd[4494]: notice: get_node_name: Defaulting to uname -n for the local classic openais (with plugin) node name
Jul 30 11:53:01 sles1 crmd[4486]: notice: plugin_handle_membership: Membership 128: quorum acquired
Jul 30 11:53:01 sles1 crmd[4486]: notice: crm_update_peer_state: plugin_handle_membership: Node sles1[739271701] - state is now member (wa
Jul 30 11:53:01 sles1 crmd[4486]: notice: crm_update_peer_state: plugin_handle_membership: Node sles2[739271702] - state is now member (wa
Jul 30 11:53:01 sles1 crmd[4494]: notice: plugin_handle_membership: Membership 128: quorum acquired
Jul 30 11:53:01 sles1 crmd[4486]: notice: do_started: The local CRM is operational
Jul 30 11:53:01 sles1 crmd[4486]: notice: do_state_transition: State transition S_STARTING -> S_PENDING [ input=I_PENDING cause=C_FSA_INTE
Jul 30 11:53:22 sles1 crmd[4486]: warning: do_log: FSA: Input I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
Jul 30 11:55:22 sles1 crmd[4486]: notice: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_TIMER_POPPED origin=election_timeout_popped ]
Jul 30 11:58:22 sles1 crmd[4486]: error: crm_timer_popped: Integration Timer (I_INTEGRATED) just popped in state S_INTEGRATION! (180000ms)
Jul 30 11:58:22 sles1 crmd[4486]: warning: do_state_transition: Progressed to state S_FINALIZE_JOIN after C_TIMER_POPPED
Jul 30 11:58:22 sles1 crmd[4486]: warning: do_state_transition: 1 cluster nodes failed to respond to the join offer.
Jul 30 11:58:22 sles1 crmd[4486]: notice: crmd_join_phase_log: join-1: sles2=welcomed
Jul 30 11:58:22 sles1 crmd[4486]: notice: crmd_join_phase_log: join-1: sles1=integrated
^C
sles1:~ # crm status
Last updated: Wed Jul 30 11:59:59 2014
Last change: Wed Jul 30 11:18:38 2014 by root via cibadmin on sles2
Stack: classic openais (with plugin)
Current DC: NONE
2 Nodes configured, 2 expected votes
5 Resources configured


OFFLINE: [ sles1 sles2 ]



Jul 30 11:53:17 sles2 corosync[9865]: [pcmk ] info: send_member_notification: Sending membership update 128 to 1 children
Jul 30 11:53:17 sles2 corosync[9865]: [pcmk ] info: update_member: Node sles1 now has process list: 00000000000000000000000000101302 (1053442)
Jul 30 11:53:17 sles2 corosync[9865]: [pcmk ] info: send_member_notification: Sending membership update 128 to 1 children
Jul 30 11:53:17 sles2 corosync[9865]: [pcmk ] WARN: route_ais_message: Sending message to local.stonith-ng failed: ipc delivery failed (rc=-2)
Jul 30 11:53:17 sles2 corosync[9865]: [pcmk ] WARN: route_ais_message: Sending message to local.crmd failed: ipc delivery failed (rc=-2)
Jul 30 11:53:18 sles2 corosync[9865]: [pcmk ] WARN: route_ais_message: Sending message to local.crmd failed: ipc delivery failed (rc=-2)
Jul 30 11:53:38 sles2 corosync[9865]: [pcmk ] WARN: route_ais_message: Sending message to local.crmd failed: ipc delivery failed (rc=-2)
Jul 30 11:55:38 sles2 corosync[9865]: [pcmk ] WARN: route_ais_message: Sending message to local.crmd failed: ipc delivery failed (rc=-2)
Jul 30 11:55:47 sles2 corosync[9865]: [pcmk ] WARN: route_ais_message: Sending message to local.crmd failed: ipc delivery failed (rc=-2)


sles2:~ # crm status
Last updated: Wed Jul 30 12:00:32 2014
Last change: Wed Jul 30 11:01:12 2014 by root via cibadmin on sles2
Stack: classic openais (with plugin)
Current DC: NONE
2 Nodes configured, 2 expected votes
5 Resources configured


Online: [ sles2 ]
OFFLINE: [ sles1 ]

jmozdzen
01-Aug-2014, 17:10
Hi Mahmood,

Hi There,

I have configured an HA cluster for MySQL with DRBD, Pacemaker and Corosync. I was able to bring the cluster up a few times.
But even when the cluster is up, with node1 as master and node2 as slave, if I shut down node1, node2 never comes up as master until I restart the openais service several times - and vice versa.
[...]
sles1:~ # crm status
Last updated: Wed Jul 30 11:59:59 2014
Last change: Wed Jul 30 11:18:38 2014 by root via cibadmin on sles2
Stack: classic openais (with plugin)
Current DC: NONE
2 Nodes configured, 2 expected votes
5 Resources configured


OFFLINE: [ sles1 sles2 ]


while at first I'd have blamed the quorum, I see you have set that to "ignore" (which is reasonable for testing). What catches the eye is that your cluster communication is down - hence *nothing* will work in the cluster. Do you have a local firewall running? Apparmor? You'll need to fix the communications issue, we then can continue to check why shutting down a single node doesn't make the other node take over.

Regards,
Jens

mahmoodhassan
04-Aug-2014, 09:51
Hi Jens

Thanks for your reply. I disabled the firewall on both nodes when I installed them, so there is no firewall running. I have also stopped the apparmor service and disabled it at boot.
I am a little concerned about "no-quorum-policy=ignore", as all the documents suggest this value for a two-node cluster. Can you please advise what other value could be used?
I have rebooted both nodes, but the problem is still there.
If I reboot either node, after the reboot it always shows offline status and the other node does not come up as master.

Below are some logs from node2, which was master before the reboot:

Aug 4 10:39:59 sles2 corosync[3717]: [pcmk ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
Aug 4 10:40:28 sles2 attrd[3725]: warning: attrd_cib_callback: Update 44 for master-p_drbd_mysql=10000 failed: Timer expired
Aug 4 10:40:29 sles2 attrd[3725]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-p_drbd_mysql (10000)
Aug 4 10:40:29 sles2 attrd[3725]: notice: attrd_perform_update: Sent update 52: master-p_drbd_mysql=10000
Aug 4 10:40:29 sles2 corosync[3717]: [pcmk ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
Aug 4 10:40:29 sles2 corosync[3717]: [pcmk ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
Aug 4 10:40:29 sles2 corosync[3717]: [pcmk ] WARN: route_ais_message: Sending message to local.attrd failed: ipc delivery failed (rc=-2)
Aug 4 10:40:58 sles2 attrd[3725]: warning: attrd_cib_callback: Update 46 for master-p_drbd_mysql=10000 failed: Timer expired
Aug 4 10:40:59 sles2 attrd[3725]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-p_drbd_mysql (10000)
Aug 4 10:40:59 sles2 attrd[3725]: notice: attrd_perform_update: Sent update 54: master-p_drbd_mysql=10000
Aug 4 10:40:59 sles2 corosync[3717]: [pcmk ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
Aug 4 10:40:59 sles2 corosync[3717]: [pcmk ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
Aug 4 10:40:59 sles2 corosync[3717]: [pcmk ] WARN: route_ais_message: Sending message to local.attrd failed: ipc delivery failed (rc=-2)
Aug 4 10:41:29 sles2 attrd[3725]: warning: attrd_cib_callback: Update 48 for master-p_drbd_mysql=10000 failed: Timer expired
Aug 4 10:41:29 sles2 attrd[3725]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-p_drbd_mysql (10000)
Aug 4 10:41:29 sles2 attrd[3725]: notice: attrd_perform_update: Sent update 56: master-p_drbd_mysql=10000
Aug 4 10:41:29 sles2 corosync[3717]: [pcmk ] WARN: route_ais_message: Sending message to local.attrd failed: ipc delivery failed (rc=-2)
Aug 4 10:41:29 sles2 corosync[3717]: [pcmk ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
Aug 4 10:41:29 sles2 corosync[3717]: [pcmk ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
Aug 4 10:41:59 sles2 attrd[3725]: warning: attrd_cib_callback: Update 50 for master-p_drbd_mysql=10000 failed: Timer expired
Aug 4 10:41:59 sles2 attrd[3725]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-p_drbd_mysql (10000)
Aug 4 10:41:59 sles2 attrd[3725]: notice: attrd_perform_update: Sent update 58: master-p_drbd_mysql=10000
Aug 4 10:41:59 sles2 corosync[3717]: [pcmk ] WARN: route_ais_message: Sending message to local.attrd failed: ipc delivery failed (rc=-2)
Aug 4 10:41:59 sles2 corosync[3717]: [pcmk ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
Aug 4 10:41:59 sles2 corosync[3717]: [pcmk ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
Aug 4 10:42:29 sles2 attrd[3725]: warning: attrd_cib_callback: Update 52 for master-p_drbd_mysql=10000 failed: Timer expired
Aug 4 10:42:29 sles2 attrd[3725]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-p_drbd_mysql (10000)
Aug 4 10:42:29 sles2 attrd[3725]: notice: attrd_perform_update: Sent update 60: master-p_drbd_mysql=10000
Aug 4 10:42:29 sles2 corosync[3717]: [pcmk ] WARN: route_ais_message: Sending message to local.attrd failed: ipc delivery failed (rc=-2)
Aug 4 10:42:29 sles2 corosync[3717]: [pcmk ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
Aug 4 10:42:29 sles2 corosync[3717]: [pcmk ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)

sles2:~ # crm status
Last updated: Mon Aug 4 10:43:54 2014
Last change: Wed Jul 30 12:03:57 2014 by root via cibadmin on sles1
Stack: classic openais (with plugin)
Current DC: NONE
2 Nodes configured, 2 expected votes
5 Resources configured


OFFLINE: [ sles1 sles2 ]


------------------------------------------------------------------------------
Node1 was master.
When I rebooted node1, I got the following errors on both nodes:

NODE1:
sles1:~ # crm status
Last updated: Wed Jul 30 12:28:32 2014
Last change: Wed Jul 30 12:08:26 2014 by root via cibadmin on sles1
Stack: classic openais (with plugin)
Current DC: NONE
2 Nodes configured, 2 expected votes
5 Resources configured


OFFLINE: [ sles1 sles2 ]
sles1:~ # tail -f /var/log/messages
Jul 30 12:27:50 sles1 corosync[3776]: [pcmk ] info: send_member_notification: Sending membership update 172 to 3 children
Jul 30 12:27:50 sles1 corosync[3776]: [pcmk ] info: update_member: Node sles2 now has process list: 00000000000000000000000000110202 (1114626)
Jul 30 12:27:50 sles1 corosync[3776]: [pcmk ] info: send_member_notification: Sending membership update 172 to 3 children
Jul 30 12:27:50 sles1 corosync[3776]: [pcmk ] WARN: route_ais_message: Sending message to local.stonith-ng failed: ipc delivery failed (rc=-2)
Jul 30 12:27:50 sles1 corosync[3776]: [pcmk ] WARN: route_ais_message: Sending message to local.stonith-ng failed: ipc delivery failed (rc=-2)
Jul 30 12:27:50 sles1 crmd[3787]: notice: peer_update_callback: Our peer on the DC (sles2) is dead
Jul 30 12:27:50 sles1 corosync[3776]: [pcmk ] WARN: route_ais_message: Sending message to local.stonith-ng failed: ipc delivery failed (rc=-2)
Jul 30 12:27:50 sles1 crmd[3787]: warning: reap_dead_nodes: Our DC node (sles2) left the cluster
Jul 30 12:28:11 sles1 crmd[3787]: notice: do_state_transition: State transition S_ELECTION -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_election_count_vote ]
Jul 30 12:28:11 sles1 crmd[3787]: warning: join_query_callback: No DC for join-1
Jul 30 12:30:01 sles1 crmd[3787]: warning: do_log: FSA: Input I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
Jul 30 12:30:01 sles1 crmd[3787]: notice: do_state_transition: State transition S_ELECTION -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_election_count_vote ]

and the following errors on NODE2:

Aug 4 10:49:08 sles2 corosync[3717]: [pcmk ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
Aug 4 10:49:28 sles2 crmd[5657]: error: finalize_sync_callback: Sync from sles1 failed: Timer expired
Aug 4 10:49:28 sles2 crmd[5657]: warning: do_log: FSA: Input I_ELECTION_DC from finalize_sync_callback() received in state S_FINALIZE_JOIN
Aug 4 10:49:28 sles2 crmd[5657]: notice: do_dc_join_finalize: join-9: Syncing the CIB from sles1 to the rest of the cluster
Aug 4 10:49:28 sles2 crmd[5657]: notice: do_dc_join_finalize: Requested version <generation_tuple epoch="16" num_updates="0" admin_epoch="0" validate-with="pacemaker-1.2" cib-last-written="Wed Jul 30 12:08:26 2014" update-origin="sles1" update-client="cibadmin" update-user="root" crm_feature_set="3.0.8" have-quorum="0"/>
Aug 4 10:49:28 sles2 corosync[3717]: [pcmk ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
Aug 4 10:49:28 sles2 corosync[3717]: [pcmk ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
Aug 4 10:49:38 sles2 crmd[5657]: error: finalize_sync_callback: Sync from sles1 failed: Timer expired
Aug 4 10:49:38 sles2 crmd[5657]: warning: do_log: FSA: Input I_ELECTION_DC from finalize_sync_callback() received in state S_FINALIZE_JOIN
Aug 4 10:49:38 sles2 crmd[5657]: notice: do_dc_join_finalize: join-10: Syncing the CIB from sles1 to the rest of the cluster
Aug 4 10:49:38 sles2 crmd[5657]: notice: do_dc_join_finalize: Requested version <generation_tuple epoch="16" num_updates="0" admin_epoch="0" validate-with="pacemaker-1.2" cib-last-written="Wed Jul 30 12:08:26 2014" update-origin="sles1" update-client="cibadmin" update-user="root" crm_feature_set="3.0.8" have-quorum="0"/>
Aug 4 10:49:38 sles2 corosync[3717]: [pcmk ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
Aug 4 10:49:38 sles2 corosync[3717]: [pcmk ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
Aug 4 10:49:58 sles2 crmd[5657]: notice: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_TIMER_POPPED origin=election_timeout_popped ]
Aug 4 10:49:58 sles2 crmd[5657]: warning: do_log: FSA: Input I_ELECTION_DC from do_election_check() received in state S_INTEGRATION
Aug 4 10:49:58 sles2 crmd[5657]: error: finalize_sync_callback: Sync from sles1 failed: Timer expired
Aug 4 10:49:58 sles2 crmd[5657]: warning: do_log: FSA: Input I_ELECTION_DC from finalize_sync_callback() received in state S_INTEGRATION
Aug 4 10:49:58 sles2 crmd[5657]: warning: join_query_callback: No DC for join-12
Aug 4 10:49:58 sles2 crmd[5657]: notice: do_dc_join_finalize: join-13: Syncing the CIB from sles1 to the rest of the cluster
Aug 4 10:49:58 sles2 crmd[5657]: notice: do_dc_join_finalize: Requested version <generation_tuple epoch="16" num_updates="0" admin_epoch="0" validate-with="pacemaker-1.2" cib-last-written="Wed Jul 30 12:08:26 2014" update-origin="sles1" update-client="cibadmin" update-user="root" crm_feature_set="3.0.8" have-quorum="0"/>
Aug 4 10:49:58 sles2 corosync[3717]: [pcmk ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
Aug 4 10:49:58 sles2 corosync[3717]: [pcmk ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
Aug 4 10:50:08 sles2 crmd[5657]: error: finalize_sync_callback: Sync from sles1 failed: Timer expired
Aug 4 10:50:08 sles2 crmd[5657]: warning: do_log: FSA: Input I_ELECTION_DC from finalize_sync_callback() received in state S_FINALIZE_JOIN
Aug 4 10:50:08 sles2 crmd[5657]: notice: do_dc_join_finalize: join-14: Syncing the CIB from sles1 to the rest of the cluster
Aug 4 10:50:08 sles2 crmd[5657]: notice: do_dc_join_finalize: Requested version <generation_tuple epoch="16" num_updates="0" admin_epoch="0" validate-with="pacemaker-1.2" cib-last-written="Wed Jul 30 12:08:26 2014" update-origin="sles1" update-client="cibadmin" update-user="root" crm_feature_set="3.0.8" have-quorum="0"/>
Aug 4 10:50:08 sles2 corosync[3717]: [pcmk ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
Aug 4 10:50:08 sles2 corosync[3717]: [pcmk ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
^C
sles2:~ # crm status
Last updated: Mon Aug 4 10:50:22 2014
Last change: Wed Jul 30 12:03:57 2014 by root via cibadmin on sles1
Stack: classic openais (with plugin)
Current DC: sles2 - partition with quorum
Version: 1.1.10-f3eeaf4
2 Nodes configured, 2 expected votes
5 Resources configured


Online: [ sles2 ]
OFFLINE: [ sles1 ]
sles2:~ #


Thanks

Mahmood

jmozdzen
04-Aug-2014, 10:33
Hi Mahmood,

the logs / outputs you show are from various points in time, which makes them hard to correlate:


Hi Jens
[...]
Aug 4 10:39:59 sles2
...
Aug 4 10:42:29 sles2
[...]
sles2:~ # crm status
Last updated: Mon Aug 4 10:43:54 2014
Last change: Wed Jul 30 12:03:57 2014 by root via cibadmin on sles1
[...]
sles1:~ # crm status
Last updated: Wed Jul 30 12:28:32 2014
Last change: Wed Jul 30 12:08:26 2014 by root via cibadmin on sles1
[...]
Jul 30 12:27:50 sles1 corosync[3776]: [pcmk ] info: send_member_notification: Sending membership update 172 to 3 children
...
Jul 30 12:30:01 sles1 crmd[3787]: notice: do_state_transition: State transition S_ELECTION -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_election_count_vote ]

and the following errors on NODE2:

Aug 4 10:49:08 sles2
...
Aug 4 10:50:08 sles2
[...]
sles2:~ # crm status
Last updated: Mon Aug 4 10:50:22 2014
Last change: Wed Jul 30 12:03:57 2014 by root via cibadmin on sles1

To diagnose this issue, please run some coordinated tests/traces, i.e.:

- stop / shut down both nodes
- wait a moment (a minute or two, to get distinct time stamps)
- start sles1
- wait a moment after things have settled (two or three minutes)
- start sles2
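The sequence above could be sketched as a checklist script like the one below. This is only an illustration of the ordering; the actual service commands are left as comments because each must be run on the node named in the comment, and the init script name rcopenais is assumed from a standard SLES 11 HA install.

```shell
#!/bin/sh
# Coordinated restart sketch (assumed SLES 11 service name: rcopenais).
step() { echo "== $*"; }

step "stop the cluster stack on BOTH nodes"
# rcopenais stop                # on sles1 AND sles2

step "wait 1-2 minutes so the log timestamps are distinct"
# sleep 120

step "start sles1 only, then let it settle (2-3 minutes)"
# rcopenais start               # on sles1

step "start sles2 and watch both nodes' logs"
# rcopenais start               # on sles2

step "compare /var/log/messages on both nodes side by side"
# tail -f /var/log/messages     # on both nodes
```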

and then read the logs side by side - first have a look at what happens when sles1 comes up - does it report node-internal errors? Then check, from both nodes' point of view, what happens during the start of sles2 - does sles1 see sles2, does it receive corosync messages and react? Does Pacemaker synchronize the CIB?

Once that works and you understand what is going on, shut down sles2 and again wait a moment. sles1 should report sles2 missing after a few moments... and then restart sles2 and compare to how things worked in the first test above.

As it looks to me like a cluster communications issue, how have you set up Corosync? Please show the communications section from both sles1 and sles2. If you feel it is set up according to your needs, verify communications between sles1/sles2 with e.g. wireshark/tcpdump - it wouldn't be the first time that some "intelligent switch" blocked or misrouted multicast traffic between the nodes.
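To make the tcpdump check concrete, something along these lines could be used. The interface, multicast address and port below are taken from the values mentioned in this thread; substitute whatever your corosync.conf actually contains.

```shell
# Values assumed from this thread's setup; adjust to your corosync.conf.
IFACE=eth1
MCAST=239.255.1.1
PORT=5405

# Sanity check: the address must be in the IPv4 multicast range (224-239).
first_octet=${MCAST%%.*}
if [ "$first_octet" -ge 224 ] && [ "$first_octet" -le 239 ]; then
    echo "multicast address ok"
fi

# On each node (as root), watch for corosync traffic from the peer:
#   tcpdump -ni "$IFACE" host "$MCAST" and port "$PORT"
# Packets visible on the sending node but not on the receiving node point
# at the switch/hypervisor dropping multicast between the nodes.
```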

Concerning "no-quorum-policy=ignore": yes, this is a typical configuration for two-node clusters. OTOH, it is also a good way to run into "split brain" situations where, after an inter-node communications break, both nodes believe they are the only remaining node and start all resources - giving you resource clashes all over. Personally, I much prefer to have a "tie-breaker": either a third node (which may be set to plain "standby", hence never running any resources - no big server needed for that) or some pseudo-resource to take into consideration when judging node status (SAN, IP connectivity, ...).
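As an illustration of the pseudo-resource idea, a ping-based tie-breaker could be sketched roughly like this in crm configure syntax. The gateway IP, scores and constraint name here are placeholders, not values from this thread:

```
primitive p_ping ocf:pacemaker:ping \
    params host_list="172.16.100.1" multiplier="1000" \
    op monitor interval="15s"
clone cl_ping p_ping
# only allow the DRBD master where the ping target is reachable
location l_master_on_ping ms_drbd_mysql \
    rule $role="Master" -inf: not_defined pingd or pingd lte 0
```

With this in place, a node that loses connectivity to the ping target will not try to promote DRBD, which reduces the window for split brain.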

But IMO your current concern is to get both nodes to see each other reliably - once that's running, you can tune your cluster to avoid "split brain" :)

Regards,
Jens

mahmoodhassan
04-Aug-2014, 14:38
Hi Jens,

I have stopped and started the services as per your advice and copied the logs below. Most of the time I get the errors "Sending message to local.cib failed: ipc delivery failed (rc=-2)"
and "error: crmd_fast_exit: Could not recover from internal error".

I have set up corosync; please have a look at the corosync configuration below. Both nodes are virtual machines with two IP addresses each:
eth0: 10.40.12.0/22
eth1: 172.16.100.0/24
I have configured corosync and DRBD on the 172.16.100.0/24 network.
Both nodes are virtual machines on VMware Workstation.
I have two NICs in my desktop: one on the 10.40.12.0/22 subnet, the other on the 172.16.100.0/24 subnet.
My desktop and all VMs have full access on all internal networks.
# modify this option.
bindnetaddr: 172.16.100.0
# the network administrator). Do not reuse multicast
mcastaddr: 239.255.1.1
mcastport: 5405
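For comparison, a minimal totem section for this kind of two-node setup might look like the sketch below. The bindnetaddr/mcastaddr/mcastport values are taken from the fragment above; the remaining values are illustrative defaults, not necessarily what the actual file contains:

```
totem {
    version: 2
    interface {
        ringnumber: 0
        # network (not host) address of the cluster interface
        bindnetaddr: 172.16.100.0
        mcastaddr: 239.255.1.1
        mcastport: 5405
    }
}
```

Note that bindnetaddr must be the network address of the subnet (here 172.16.100.0), so corosync picks the NIC on that subnet on each node.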

Can you please have a look?
Thanks

Mahmood
Stopped the service on both nodes.
Started the service on SLES1 only:
Aug 4 14:08:22 sles1 crmd[7529]: notice: get_node_name: Defaulting to uname -n for the local classic openais (with plugin) node name
Aug 4 14:08:22 sles1 crmd[7529]: notice: get_node_name: Defaulting to uname -n for the local classic openais (with plugin) node name
Aug 4 14:08:22 sles1 stonith-ng[7533]: notice: setup_cib: Watching for stonith topology changes
Aug 4 14:08:22 sles1 stonith-ng[7525]: notice: setup_cib: Watching for stonith topology changes
Aug 4 14:08:22 sles1 stonith-ng[7525]: error: qb_ipcs_us_publish: Could not bind AF_UNIX (): Address already in use (98)
Aug 4 14:08:22 sles1 stonith-ng[7525]: error: mainloop_add_ipc_server: Could not start stonith-ng IPC server: Address already in use (-98)
Aug 4 14:08:22 sles1 stonith-ng[7525]: error: stonith_ipc_server_init: Failed to create stonith-ng servers: exiting and inhibiting respawn.
Aug 4 14:08:22 sles1 stonith-ng[7525]: warning: stonith_ipc_server_init: Verify pacemaker and pacemaker_remote are not both enabled.
Aug 4 14:08:22 sles1 stonith-ng[7533]: notice: crm_update_peer_state: plugin_handle_membership: Node sles1[739271701] - state is now member (was (null))
Aug 4 14:08:22 sles1 stonith-ng[7533]: notice: unpack_config: On loss of CCM Quorum: Ignore
Aug 4 14:08:43 sles1 crmd[7537]: warning: do_log: FSA: Input I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
Aug 4 14:08:43 sles1 crmd[7537]: notice: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_TIMER_POPPED origin=election_timeout_popped ]
Aug 4 14:08:43 sles1 crmd[7537]: warning: do_log: FSA: Input I_ELECTION_DC from do_election_check() received in state S_INTEGRATION
Aug 4 14:08:43 sles1 corosync[7519]: [pcmk ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
Aug 4 14:08:43 sles1 corosync[7519]: [pcmk ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
Aug 4 14:08:43 sles1 attrd[7535]: notice: attrd_trigger_update: Sending flush op to all hosts for: shutdown (0)
Aug 4 14:08:43 sles1 attrd[7535]: notice: attrd_perform_update: Sent update 3: shutdown=0
Aug 4 14:08:43 sles1 corosync[7519]: [pcmk ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
Aug 4 14:09:03 sles1 crmd[7537]: error: join_update_complete_callback: Join update 23 failed
Aug 4 14:09:03 sles1 crmd[7537]: error: do_log: FSA: Input I_ERROR from join_update_complete_callback() received in state S_FINALIZE_JOIN
Aug 4 14:09:03 sles1 crmd[7537]: warning: do_state_transition: State transition S_FINALIZE_JOIN -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=join_update_complete_callback ]
Aug 4 14:09:03 sles1 crmd[7537]: warning: do_recover: Fast-tracking shutdown in response to errors
Aug 4 14:09:03 sles1 crmd[7537]: warning: do_election_vote: Not voting in election, we're in state S_RECOVERY
Aug 4 14:09:03 sles1 crmd[7537]: error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
Aug 4 14:09:03 sles1 crmd[7537]: notice: lrm_state_verify_stopped: Stopped 0 recurring operations at shutdown (0 ops remaining)
Aug 4 14:09:03 sles1 crmd[7537]: notice: do_lrm_control: Disconnected from the LRM
Aug 4 14:09:03 sles1 crmd[7537]: notice: terminate_cs_connection: Disconnecting from Corosync
Aug 4 14:09:03 sles1 corosync[7519]: [pcmk ] info: pcmk_ipc_exit: Client crmd (conn=0x6a7800, async-conn=0x6a7800) left
Aug 4 14:09:03 sles1 crmd[7537]: error: crmd_fast_exit: Could not recover from internal error
Aug 4 14:09:04 sles1 corosync[7519]: [pcmk ] ERROR: pcmk_wait_dispatch: Child process crmd exited (pid=7537, rc=201)
Aug 4 14:09:04 sles1 corosync[7519]: [pcmk ] info: update_member: 0x6a34a0 Node 739271701 ((null)) born on: 220
Aug 4 14:09:04 sles1 corosync[7519]: [pcmk ] info: update_member: Node sles1 now has process list: 00000000000000000000000000101002 (1052674)
Aug 4 14:09:04 sles1 corosync[7519]: [pcmk ] notice: pcmk_wait_dispatch: Respawning failed child process: crmd
Aug 4 14:09:04 sles1 corosync[7519]: [pcmk ] info: spawn_child: Using uid=90 and group=90 for process crmd
Aug 4 14:09:04 sles1 corosync[7519]: [pcmk ] info: spawn_child: Forked child 7562 for process crmd
Aug 4 14:09:04 sles1 corosync[7519]: [pcmk ] info: update_member: Node sles1 now has process list: 00000000000000000000000000101202 (1053186)
Aug 4 14:09:04 sles1 corosync[7519]: [pcmk ] info: update_member: Node sles1 now has process list: 00000000000000000000000000101002 (1052674)
Aug 4 14:09:04 sles1 corosync[7519]: [pcmk ] info: update_member: Node sles1 now has process list: 00000000000000000000000000101202 (1053186)
Aug 4 14:09:04 sles1 corosync[7519]: [pcmk ] info: send_member_notification: Sending membership update 220 to 2 children
Aug 4 14:09:04 sles1 crmd[7562]: notice: main: CRM Git Version: f3eeaf4
Aug 4 14:09:04 sles1 crmd[7562]: notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
Aug 4 14:09:04 sles1 crmd[7562]: notice: get_node_name: Defaulting to uname -n for the local classic openais (with plugin) node name
Aug 4 14:09:04 sles1 corosync[7519]: [pcmk ] info: pcmk_ipc: Recorded connection 0x6a7800 for crmd/7562
Aug 4 14:09:04 sles1 corosync[7519]: [pcmk ] info: pcmk_ipc: Sending membership update 220 to crmd
Aug 4 14:09:04 sles1 crmd[7562]: notice: get_node_name: Defaulting to uname -n for the local classic openais (with plugin) node name
Aug 4 14:09:04 sles1 crmd[7562]: notice: crm_update_peer_state: plugin_handle_membership: Node sles1[739271701] - state is now member (was (null))
Aug 4 14:09:04 sles1 crmd[7562]: notice: do_started: The local CRM is operational
Aug 4 14:09:04 sles1 crmd[7562]: notice: do_state_transition: State transition S_STARTING -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_started ]
Aug 4 14:09:25 sles1 crmd[7562]: warning: do_log: FSA: Input I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
Aug 4 14:09:25 sles1 crmd[7562]: notice: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_TIMER_POPPED origin=election_timeout_popped ]
Aug 4 14:09:25 sles1 crmd[7562]: warning: do_log: FSA: Input I_ELECTION_DC from do_election_check() received in state S_INTEGRATION
Aug 4 14:09:25 sles1 corosync[7519]: [pcmk ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
Aug 4 14:09:25 sles1 attrd[7535]: notice: attrd_trigger_update: Sending flush op to all hosts for: shutdown (0)
Aug 4 14:09:25 sles1 corosync[7519]: [pcmk ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
Aug 4 14:09:25 sles1 corosync[7519]: [pcmk ] WARN: route_ais_message: Sending message to

Logs on SLES1 after starting SLES2:
Aug 4 14:11:37 sles1 corosync[7519]: [CLM ] r(0) ip(172.16.100.21)
Aug 4 14:11:37 sles1 corosync[7519]: [CLM ] Members Left:
Aug 4 14:11:37 sles1 corosync[7519]: [CLM ] Members Joined:
Aug 4 14:11:37 sles1 corosync[7519]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 228: memb=1, new=0, lost=0
Aug 4 14:11:37 sles1 corosync[7519]: [pcmk ] info: pcmk_peer_update: memb: sles1 739271701
Aug 4 14:11:37 sles1 corosync[7519]: [CLM ] CLM CONFIGURATION CHANGE
Aug 4 14:11:37 sles1 corosync[7519]: [CLM ] New Configuration:
Aug 4 14:11:37 sles1 corosync[7519]: [CLM ] r(0) ip(172.16.100.21)
Aug 4 14:11:37 sles1 corosync[7519]: [CLM ] r(0) ip(172.16.100.22)
Aug 4 14:11:37 sles1 corosync[7519]: [CLM ] Members Left:
Aug 4 14:11:37 sles1 corosync[7519]: [CLM ] Members Joined:
Aug 4 14:11:37 sles1 corosync[7519]: [CLM ] r(0) ip(172.16.100.22)
Aug 4 14:11:37 sles1 corosync[7519]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 228: memb=2, new=1, lost=0
Aug 4 14:11:37 sles1 corosync[7519]: [pcmk ] info: update_member: Creating entry for node 739271702 born on 228
Aug 4 14:11:37 sles1 corosync[7519]: [pcmk ] info: update_member: Node 739271702/unknown is now: member
Aug 4 14:11:37 sles1 corosync[7519]: [pcmk ] info: pcmk_peer_update: NEW: .pending. 739271702
Aug 4 14:11:37 sles1 corosync[7519]: [pcmk ] info: pcmk_peer_update: MEMB: sles1 739271701
Aug 4 14:11:37 sles1 corosync[7519]: [pcmk ] info: pcmk_peer_update: MEMB: .pending. 739271702
Aug 4 14:11:37 sles1 corosync[7519]: [pcmk ] info: send_member_notification: Sending membership update 228 to 3 children
Aug 4 14:11:37 sles1 corosync[7519]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Aug 4 14:11:37 sles1 corosync[7519]: [pcmk ] info: update_member: 0x6a5270 Node 739271702 (sles2) born on: 228
Aug 4 14:11:37 sles1 cib[7524]: notice: get_node_name: Could not obtain a node name for classic openais (with plugin) nodeid 739271702
Aug 4 14:11:37 sles1 stonith-ng[7533]: notice: get_node_name: Could not obtain a node name for classic openais (with plugin) nodeid 739271702
Aug 4 14:11:37 sles1 stonith-ng[7533]: notice: crm_update_peer_state: plugin_handle_membership: Node (null)[739271702] - state is now member (was (null))
Aug 4 14:11:37 sles1 corosync[7519]: [pcmk ] info: update_member: 0x6a5270 Node 739271702 now known as sles2 (was: (null))
Aug 4 14:11:37 sles1 stonith-ng[7533]: notice: plugin_handle_membership: Membership 228: quorum acquired
Aug 4 14:11:37 sles1 crmd[7572]: notice: get_node_name: Could not obtain a node name for classic openais (with plugin) nodeid 739271702
Aug 4 14:11:37 sles1 corosync[7519]: [pcmk ] info: update_member: Node sles2 now has process list: 00000000000000000000000000151312 (1381138)
Aug 4 14:11:37 sles1 cib[7524]: notice: crm_update_peer_state: plugin_handle_membership: Node (null)[739271702] - state is now member (was (null))
Aug 4 14:11:37 sles1 crmd[7572]: notice: crm_update_peer_state: plugin_handle_membership: Node (null)[739271702] - state is now member (was (null))
Aug 4 14:11:37 sles1 corosync[7519]: [pcmk ] info: update_member: Node sles2 now has 1 quorum votes (was 0)
Aug 4 14:11:37 sles1 cib[7524]: notice: plugin_handle_membership: Membership 228: quorum acquired
Aug 4 14:11:37 sles1 corosync[7519]: [pcmk ] info: send_member_notification: Sending membership update 228 to 3 children
Aug 4 14:11:37 sles1 crmd[7572]: notice: plugin_handle_membership: Membership 228: quorum acquired
Aug 4 14:11:37 sles1 crmd[7572]: notice: crm_update_quorum: Updating quorum status to true (call=31)
Aug 4 14:11:37 sles1 corosync[7519]: [pcmk ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
Aug 4 14:11:37 sles1 corosync[7519]: [pcmk ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
Aug 4 14:11:37 sles1 corosync[7519]: [pcmk ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
Aug 4 14:11:37 sles1 corosync[7519]: [CPG ] chosen downlist: sender r(0) ip(172.16.100.21) ; members(old:1 left:0)
Aug 4 14:11:37 sles1 corosync[7519]: [MAIN ] Completed service synchronization, ready to provide service.
Aug 4 14:11:38 sles1 corosync[7519]: [pcmk ] info: update_member: Node sles2 now has process list: 00000000000000000000000000151302 (1381122)
Aug 4 14:11:38 sles1 corosync[7519]: [pcmk ] info: send_member_notification: Sending membership update 228 to 3 children
Aug 4 14:11:38 sles1 corosync[7519]: [pcmk ] info: update_member: Node sles2 now has process list: 00000000000000000000000000150302 (1377026)
Aug 4 14:11:38 sles1 corosync[7519]: [pcmk ] info: send_member_notification: Sending membership update 228 to 3 children
Aug 4 14:11:41 sles1 corosync[7519]: [pcmk ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
Aug 4 14:11:41 sles1 attrd[7535]: notice: attrd_perform_update: Sent update 15: shutdown=0
Aug 4 14:11:43 sles1 corosync[7519]: [pcmk ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
Aug 4 14:11:51 sles1 crmd[7572]: notice: erase_xpath_callback: Deletion of "//node_state[@uname='sles1']/lrm": Timer expired (rc=-62)
Aug 4 14:11:51 sles1 crmd[7572]: error: join_update_complete_callback: Join update 23 failed
Aug 4 14:11:51 sles1 crmd[7572]: error: do_log: FSA: Input I_ERROR from join_update_complete_callback() received in state S_FINALIZE_JOIN
Aug 4 14:11:51 sles1 crmd[7572]: warning: do_state_transition: State transition S_FINALIZE_JOIN -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=join_update_complete_callback ]
Aug 4 14:11:51 sles1 crmd[7572]: warning: do_recover: Fast-tracking shutdown in response to errors
Aug 4 14:11:51 sles1 crmd[7572]: warning: do_election_vote: Not voting in election, we're in state S_RECOVERY
Aug 4 14:11:51 sles1 crmd[7572]: error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
Aug 4 14:11:51 sles1 crmd[7572]: notice: lrm_state_verify_stopped: Stopped 0 recurring operations at shutdown (0 ops remaining)
Aug 4 14:11:52 sles1 crmd[7579]: notice: get_node_name: Defaulting to uname -n for the local classic openais (with plugin) node name
Aug 4 14:11:52 sles1 corosync[7519]: [pcmk ] info: pcmk_ipc: Recorded connection 0x6a7800 for crmd/7579
Aug 4 14:11:52 sles1 corosync[7519]: [pcmk ] info: pcmk_ipc: Sending membership update 228 to crmd
Aug 4 14:11:52 sles1 crmd[7579]: notice: get_node_name: Defaulting to uname -n for the local classic openais (with plugin) node name
Aug 4 14:11:52 sles1 crmd[7579]: notice: plugin_handle_membership: Membership 228: quorum acquired
Aug 4 14:11:52 sles1 crmd[7579]: notice: crm_update_peer_state: plugin_handle_membership: Node sles1[739271701] - state is now member (was (null))
Aug 4 14:11:52 sles1 crmd[7579]: notice: crm_update_peer_state: plugin_handle_membership: Node sles2[739271702] - state is now member (was (null))
Aug 4 14:11:52 sles1 crmd[7579]: notice: do_started: The local CRM is operational
Aug 4 14:11:52 sles1 crmd[7579]: notice: do_state_transition: State transition S_STARTING -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_started ]
Aug 4 14:12:07 sles1 attrd[7535]: warning: attrd_cib_callback: Update 7 for shutdown=0 failed: Timer expired
Aug 4 14:12:13 sles1 crmd[7579]: warning: do_log: FSA: Input I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
Aug 4 14:12:13 sles1 crmd[7579]: notice: do_state_transition: State transition S_ELECTION -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_election_count_vote ]
Aug 4 14:12:13 sles1 corosync[7519]: [pcmk ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
Aug 4 14:12:13 sles1 crmd[7579]: notice: do_state_transition: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
Aug 4 14:12:13 sles1 attrd[7535]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
Aug 4 14:12:13 sles1 attrd[7535]: notice: attrd_trigger_update: Sending flush op to all hosts for: shutdown (0)
Aug 4 14:12:13 sles1 corosync[7519]: [pcmk ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
Aug 4 14:12:13 sles1 corosync[7519]: [pcmk ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
Aug 4 14:12:13 sles1 corosync[7519]: [pcmk ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
Aug 4 14:12:14 sles1 corosync[7519]: [pcmk ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
Aug 4 14:12:14 sles1 crmd[7579]: notice: process_lrm_event: LRM operation p_ip_mysql_monitor_0 (call=64, rc=0, cib-update=11, confirmed=true) ok
Aug 4 14:12:14 sles1 corosync[7519]: [pcmk ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
Aug 4 14:12:14 sles1 corosync[7519]: [pcmk ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
Aug 4 14:12:14 sles1 corosync[7519]: [pcmk ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
Aug 4 14:12:14 sles1 crmd[7529]: notice: process_lrm_event: LRM operation p_ip_mysql_monitor_0 (call=64, rc=0, cib-update=6, confirmed=false) ok
Aug 4 14:12:14 sles1 corosync[7519]: [pcmk ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
Aug 4 14:12:14 sles1 corosync[7519]: [pcmk ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
Aug 4 14:12:14 sles1 corosync[7519]: [pcmk ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
Aug 4 14:12:14 sles1 crmd[7579]: notice: process_lrm_event: LRM operation p_mysql_monitor_0 (call=68, rc=0, cib-update=12, confirmed=true) ok
Aug 4 14:12:14 sles1 crmd[7529]: notice: process_lrm_event: LRM operation p_mysql_monitor_0 (call=68, rc=0, cib-update=7, confirmed=false) ok
Aug 4 14:12:14 sles1 corosync[7519]: [pcmk ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
Aug 4 14:12:14 sles1 attrd[7535]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-p_drbd_mysql (10000)
Aug 4 14:12:16 sles1 kernel: [ 1676.421567] block drbd0: self B9B336D291522399:09E1DB7BE31AE0D7:B9361CC225AB9E29:B9351CC225AB9E29 bits:0 flags:0
Aug 4 14:12:16 sles1 kernel: [ 1676.421568] block drbd0: peer 09E1DB7BE31AE0D6:0000000000000000:B9361CC225AB9E28:B9351CC225AB9E29 bits:0 flags:0
Aug 4 14:12:16 sles1 kernel: [ 1676.421570] block drbd0: uuid_compare()=1 by rule 70
Aug 4 14:12:16 sles1 kernel: [ 1676.421575] block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( Outdated -> Consistent )
Aug 4 14:12:16 sles1 kernel: [ 1676.421803] block drbd0: send bitmap stats : plain 0(0), RLE 23(1), total 23; compression: 100.0%
Aug 4 14:12:16 sles1 kernel: [ 1676.425209] block drbd0: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0%
Aug 4 14:12:16 sles1 kernel: [ 1676.425214] block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0
Aug 4 14:12:16 sles1 kernel: [ 1676.426054] block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0 exit code 0 (0x0)
Aug 4 14:12:16 sles1 kernel: [ 1676.426062] block drbd0: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent )
Aug 4 14:12:16 sles1 kernel: [ 1676.426069] block drbd0: Began resync as SyncSource (will sync 0 KB [0 bits set]).
Aug 4 14:12:16 sles1 kernel: [ 1676.426191] block drbd0: updated sync UUID B9B336D291522399:09E2DB7BE31AE0D7:09E1DB7BE31AE0D7:B9361CC225AB9E29
Aug 4 14:12:16 sles1 kernel: [ 1676.487680] block drbd0: Resync done (total 1 sec; paused 0 sec; 0 K/sec)
Aug 4 14:12:16 sles1 kernel: [ 1676.487687] block drbd0: updated UUIDs B9B336D291522399:0000000000000000:09E2DB7BE31AE0D7:09E1DB7BE31AE0D7
Aug 4 14:12:16 sles1 kernel: [ 1676.487694] block drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
Aug 4 14:12:24 sles1 crmd[7529]: warning: cib_rsc_callback: Resource update 5 failed: (rc=-62) Timer expired
Aug 4 14:12:30 sles1 attrd[7535]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-p_drbd_mysql (10000)
Aug 4 14:12:30 sles1 attrd[7535]: notice: attrd_perform_update: Sent update 30: master-p_drbd_mysql=10000
Aug 4 14:12:30 sles1 corosync[7519]: [pcmk ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
Aug 4 14:12:30 sles1 corosync[7519]: [pcmk ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
Aug 4 14:12:43 sles1 crmd[7579]: notice: erase_xpath_callback: Deletion of "//node_state[@uname='sles1']/transient_attributes": Timer expired (rc=-62)
Aug 4 14:12:44 sles1 crmd[7579]: warning: cib_rsc_callback: Resource update 10 failed: (rc=-62) Timer expired
Aug 4 14:12:45 sles1 attrd[7535]: notice: attrd_perform_update: Sent update 32: master-p_drbd_mysql=10000
Aug 4 14:12:46 sles1 attrd[7535]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-p_drbd_mysql (10000)
Aug 4 14:12:46 sles1 attrd[7535]: notice: attrd_perform_update: Sent update 34: master-p_drbd_mysql=10000
Aug 4 14:12:46 sles1 corosync[7519]: [pcmk ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
Aug 4 14:12:46 sles1 corosync[7519]: [pcmk ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
Aug 4 14:13:01 sles1 attrd[7535]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-p_drbd_mysql (10000)
Aug 4 14:13:16 sles1 corosync[7519]: [pcmk ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
Aug 4 14:13:16 sles1 corosync[7519]: [pcmk ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
Logs at SLES2: started 3 minutes after SLES1
sles2:~ # tail -f /var/log/messages
Aug 4 12:28:51 sles2 stonith-ng[5866]: error: qb_ipcs_us_publish: Could not bind AF_UNIX (): Address already in use (98)
Aug 4 12:28:51 sles2 stonith-ng[5866]: error: mainloop_add_ipc_server: Could not start stonith-ng IPC server: Address already in use (-98)
Aug 4 12:28:51 sles2 stonith-ng[5866]: error: stonith_ipc_server_init: Failed to create stonith-ng servers: exiting and inhibiting respawn.
Aug 4 12:28:51 sles2 stonith-ng[5866]: warning: stonith_ipc_server_init: Verify pacemaker and pacemaker_remote are not both enabled.
Aug 4 12:28:51 sles2 stonith-ng[5874]: notice: plugin_handle_membership: Membership 184: quorum acquired
Aug 4 12:28:51 sles2 stonith-ng[5874]: notice: crm_update_peer_state: plugin_handle_membership: Node sles1[739271701] - state is now member (was (null))
Aug 4 12:28:51 sles2 stonith-ng[5874]: notice: crm_update_peer_state: plugin_handle_membership: Node sles2[739271702] - state is now member (was (null))
Aug 4 12:28:51 sles2 stonith-ng[5874]: notice: unpack_config: On loss of CCM Quorum: Ignore
Aug 4 12:29:12 sles2 crmd[5870]: warning: do_log: FSA: Input I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
Aug 4 12:29:31 sles2 syslog-ng[1420]: Log statistics; dropped='pipe(/dev/xconsole)=0', dropped='pipe(/dev/tty10)=0', processed='center(queued)=6463', processed='center(received)=3592', processed='destination(messages)=3586', processed='destination(mailinfo)=2', processed='destination(mailwarn)=0', processed='destination(localmessages)=11', processed='destination(newserr)=0', processed='destination(mailerr)=0', processed='destination(netmgm)=0', processed='destination(warn)=1924', processed='destination(console)=467', processed='destination(null)=2', processed='destination(mail)=2', processed='destination(xconsole)=467', processed='destination(firewall)=0', processed='destination(acpid)=2', processed='destination(newscrit)=0', processed='destination(newsnotice)=0', processed='source(src)=3592'
Aug 4 12:31:12 sles2 crmd[5870]: notice: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_TIMER_POPPED origin=election_timeout_popped ]
^C

Logs at SLES2 after starting the service
sles2:~ # crm status
sles2:~ # tail -f /var/log/messages
Aug 4 14:23:31 sles2 crmd[7584]: notice: do_started: The local CRM is operational
Aug 4 14:23:31 sles2 crmd[7584]: notice: do_state_transition: State transition S_STARTING -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_started ]
Aug 4 14:23:31 sles2 crmd[7592]: notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
Aug 4 14:23:31 sles2 crmd[7592]: notice: get_node_name: Defaulting to uname -n for the local classic openais (with plugin) node name
Aug 4 14:23:31 sles2 crmd[7592]: notice: get_node_name: Defaulting to uname -n for the local classic openais (with plugin) node name
Aug 4 14:23:31 sles2 crmd[7592]: notice: plugin_handle_membership: Membership 236: quorum acquired
Aug 4 14:23:39 sles2 crmd[7584]: notice: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_TIMER_POPPED origin=election_timeout_popped ]
Aug 4 14:23:39 sles2 crmd[7584]: warning: do_log: FSA: Input I_ELECTION_DC from do_election_check() received in state S_INTEGRATION
Aug 4 14:23:39 sles2 crmd[7584]: notice: do_dc_join_finalize: join-2: Syncing the CIB from sles1 to the rest of the cluster
Aug 4 14:23:39 sles2 crmd[7584]: notice: do_dc_join_finalize: Requested version <generation_tuple epoch="17" num_updates="41" admin_epoch="0" validate-with="pacemaker-1.2" cib-last-written="Mon Aug 4 13:57:23 2014" update-origin="sles2" update-client="crmd" update-user="hacluster" crm_feature_set="3.0.8" have-quorum="0" dc-uuid="sles1"/>
^C
sles2:~ # tail -f /var/log/messages
Aug 4 14:24:09 sles2 attrd[7590]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
Aug 4 14:24:09 sles2 attrd[7590]: notice: attrd_trigger_update: Sending flush op to all hosts for: shutdown (0)
Aug 4 14:24:09 sles2 attrd[7590]: notice: attrd_perform_update: Sent update 5: shutdown=0
Aug 4 14:24:09 sles2 corosync[7574]: [pcmk ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
Aug 4 14:24:09 sles2 corosync[7574]: [pcmk ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
Aug 4 14:24:09 sles2 corosync[7574]: [pcmk ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
Aug 4 14:24:09 sles2 corosync[7574]: [pcmk ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
Aug 4 14:24:09 sles2 corosync[7574]: [pcmk ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
Aug 4 14:24:09 sles2 corosync[7574]: [pcmk ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
Aug 4 14:24:09 sles2 corosync[7574]: [pcmk ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)

jmozdzen
04-Aug-2014, 15:30
Hi Mahmood,

Hi Jens,

I have stopped and started the service as per your advice and copied the logs below. Most of the time I get the error "Sending message to local.cib failed: ipc delivery failed (rc=-2)".


do you see a process "cib" running?

:~ # ps ax|grep pacemaker
14273 ? S< 7:46 /usr/lib64/pacemaker/cib
14274 ? S< 7:13 /usr/lib64/pacemaker/stonithd
14275 ? S< 23:56 /usr/lib64/pacemaker/lrmd
14276 ? S< 2:18 /usr/lib64/pacemaker/attrd
14277 ? S< 3:33 /usr/lib64/pacemaker/pengine
14278 ? S< 4:41 /usr/lib64/pacemaker/crmd
28617 pts/36 S+ 0:00 grep pacemaker
All those messages about the cib not being reachable are not good. I've seen none so far on any of our cluster installations.

I can see from the messages from sles1 that the ring is extended once sles2 comes online:

Aug 4 14:11:37 sles1 corosync[7519]: [CLM ] CLM CONFIGURATION CHANGE
Aug 4 14:11:37 sles1 corosync[7519]: [CLM ] New Configuration:
Aug 4 14:11:37 sles1 corosync[7519]: [CLM ] r(0) ip(172.16.100.21)
Aug 4 14:11:37 sles1 corosync[7519]: [CLM ] r(0) ip(172.16.100.22)
Aug 4 14:11:37 sles1 corosync[7519]: [CLM ] Members Left:
Aug 4 14:11:37 sles1 corosync[7519]: [CLM ] Members Joined:
Aug 4 14:11:37 sles1 corosync[7519]: [CLM ] r(0) ip(172.16.100.22)
so network-wise, communication seems to be OK.

Are there *any* syslog messages for "cib"? Does "cibadmin --query --local" report the current content of the CIB on each node? Do you see content in "/var/lib/pacemaker/cib"? Do you have any out-of-diskspace condition that might interfere with the cib operation?
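Those checks can be scripted roughly as follows - a sketch only; the paths and commands assume the standard SLES 11 SP3 / pacemaker-1.1.10 layout shown in this thread and may differ on your systems:

```shell
#!/bin/sh
# Quick CIB health checks - a sketch, not a definitive procedure.

# 1. Is the cib daemon running at all?
ps ax | grep '[p]acemaker/cib' || echo "no cib process found"

# 2. Recent syslog traces from the cib daemon, if any.
grep ' cib\[' /var/log/messages 2>/dev/null | tail -n 20

# 3. Can the local CIB be queried (needs a running pacemaker)?
if command -v cibadmin >/dev/null 2>&1 && cibadmin --query --local >/dev/null 2>&1; then
    echo "local CIB query OK"
else
    echo "local CIB query failed or cibadmin unavailable"
fi

# 4. On-disk CIB content and free space on the filesystem holding it.
ls -l /var/lib/pacemaker/cib 2>/dev/null | head
df -h /var/lib
```

Running this on both nodes while the failure is occurring makes it easier to see whether the cib process itself has died.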

Regards,
Jens

PS: Part of your traces seems to be... ancient:

Logs at SLES2 : started 3 minutes after SLES1
sles2:~ # tail -f /var/log/messages
Aug 4 12:28:51 sles2 stonith-ng[5866]: error: qb_ipcs_us_publish: Could not bind AF_UNIX (): Address already in use (98)
Aug 4 12:28:51 sles2 stonith-ng[5866]: error: mainloop_add_ipc_server: Could not start stonith-ng IPC server: Address already in use (-98)
All the other traces are from after 14:00 - does sles2 have a clock issue, or is it just a mis-quote? ;)
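If a clock issue is suspected, a quick skew check between the two nodes could look like the sketch below. It assumes passwordless SSH and the hostname "sles2" from this thread (substitute your own peer node), and falls back to the local clock if SSH is unavailable:

```shell
#!/bin/sh
# Compare epoch seconds on both nodes; "sles2" is the peer hostname
# from this thread - adjust for your environment.
local_epoch=$(date +%s)
remote_epoch=$(ssh sles2 date +%s 2>/dev/null || date +%s)  # local fallback if ssh fails
skew=$(( local_epoch - remote_epoch ))
echo "clock skew between nodes: ${skew}s"  # negative: remote clock is ahead
```

A skew of more than a second or two between cluster nodes is worth fixing with NTP before chasing cluster-level symptoms.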

mahmoodhassan
04-Aug-2014, 16:35
Hi Jens,

Please find the CIB configuration below. All the disk space is free, so there is no space issue. Both systems have the same time - sorry, the old logs were pasted by mistake.
I have started the syslog service and will send you the syslog output if I find anything.

Thanks Mahmood


sles1:~ # cibadmin --query --local
<cib epoch="17" num_updates="46" admin_epoch="0" validate-with="pacemaker-1.2" cib-last-written="Mon Aug 4 13:57:23 2014" update-origin="sles2" update-client="crmd" update-user="hacluster" crm_feature_set="3.0.8" have-quorum="0" dc-uuid="sles1">
<configuration>
<crm_config>
<cluster_property_set id="cib-bootstrap-options">
<nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.10-f3eeaf4"/>
<nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="classic openais (with plugin)"/>
<nvpair id="cib-bootstrap-options-expected-quorum-votes" name="expected-quorum-votes" value="2"/>
<nvpair name="no-quorum-policy" value="ignore" id="cib-bootstrap-options-no-quorum-policy"/>
<nvpair name="stonith-enabled" value="false" id="cib-bootstrap-options-stonith-enabled"/>
</cluster_property_set>
</crm_config>
<nodes>
<node id="sles2" uname="sles2"/>
<node id="sles1" uname="sles1"/>
</nodes>
<resources>
<group id="g_mysql">
<primitive id="p_fs_mysql" class="ocf" provider="heartbeat" type="Filesystem">
<instance_attributes id="p_fs_mysql-instance_attributes">
<nvpair name="device" value="/dev/drbd0" id="p_fs_mysql-instance_attributes-device"/>
<nvpair name="directory" value=" /var/lib/mysql_drbd" id="p_fs_mysql-instance_attributes-directory"/>
<nvpair name="fstype" value="ext3" id="p_fs_mysql-instance_attributes-fstype"/>
</instance_attributes>
</primitive>
<primitive id="p_ip_mysql" class="ocf" provider="heartbeat" type="IPaddr2">
<instance_attributes id="p_ip_mysql-instance_attributes">
<nvpair name="ip" value="172.16.100.20" id="p_ip_mysql-instance_attributes-ip"/>
<nvpair name="cidr_netmask" value="24" id="p_ip_mysql-instance_attributes-cidr_netmask"/>
<nvpair name="nic" value="eth1" id="p_ip_mysql-instance_attributes-nic"/>
</instance_attributes>
</primitive>
<primitive id="p_mysql" class="lsb" type="mysql"/>
</group>
<master id="ms_drbd_mysql">
<meta_attributes id="ms_drbd_mysql-meta_attributes">
<nvpair name="master-max" value="1" id="ms_drbd_mysql-meta_attributes-master-max"/>
<nvpair name="master-node-max" value="1" id="ms_drbd_mysql-meta_attributes-master-node-max"/>
<nvpair name="clone-max" value="2" id="ms_drbd_mysql-meta_attributes-clone-max"/>
<nvpair name="clone-node-max" value="1" id="ms_drbd_mysql-meta_attributes-clone-node-max"/>
<nvpair name="notify" value="true" id="ms_drbd_mysql-meta_attributes-notify"/>
</meta_attributes>
<primitive id="p_drbd_mysql" class="ocf" provider="linbit" type="drbd">
<instance_attributes id="p_drbd_mysql-instance_attributes">
<nvpair name="drbd_resource" value="r0" id="p_drbd_mysql-instance_attributes-drbd_resource"/>
</instance_attributes>
<operations>
<op name="monitor" interval="30s" role="Slave" id="p_drbd_mysql-monitor-30s"/>
<op name="monitor" interval="15s" role="Master" id="p_drbd_mysql-monitor-15s"/>
</operations>
</primitive>
</master>
</resources>
<constraints>
<rsc_colocation id="c_mysql_on_drbd" score="INFINITY" rsc="g_mysql" with-rsc="ms_drbd_mysql" with-rsc-role="Master"/>
<rsc_order id="o_drbd_before_mysql" score="INFINITY" first="ms_drbd_mysql" first-action="promote" then="g_mysql" then-action="start"/>
</constraints>
<rsc_defaults>
<meta_attributes id="rsc-options">
<nvpair name="resource-stickiness" value="100" id="rsc-options-resource-stickiness"/>
</meta_attributes>
</rsc_defaults>
</configuration>
<status>
<node_state id="sles1" uname="sles1" in_ccm="true" crmd="online" crm-debug-origin="do_lrm_query_internal" join="member" expected="member">
<lrm id="sles1">
<lrm_resources/>
</lrm>
<transient_attributes id="sles1">
<instance_attributes id="status-sles1">
<nvpair id="status-sles1-shutdown" name="shutdown" value="0"/>
</instance_attributes>
</transient_attributes>
</node_state>
<node_state id="sles2" uname="sles2" crmd="online" crm-debug-origin="do_lrm_query_internal" in_ccm="true" join="member" expected="member">
<lrm id="sles2">
<lrm_resources/>
</lrm>
</node_state>
</status>
</cib>
sles1:~ # cd /var/lib/pa
pacemaker/ pam_devperm/
sles1:~ # cd /var/lib/pacemaker/
sles1:/var/lib/pacemaker # ls
blackbox cib pengine
sles1:/var/lib/pacemaker # cd cib/
sles1:/var/lib/pacemaker/cib # ls
cib-0.raw cib-10.raw.sig cib-13.raw cib-15.raw.sig cib-18.raw cib-2.raw.sig cib-22.raw cib-24.raw.sig cib-27.raw cib-29.raw.sig cib-31.raw cib-4.raw.sig cib-7.raw cib-9.raw.sig
cib-0.raw.sig cib-11.raw cib-13.raw.sig cib-16.raw cib-18.raw.sig cib-20.raw cib-22.raw.sig cib-25.raw cib-27.raw.sig cib-3.raw cib-31.raw.sig cib-5.raw cib-7.raw.sig cib.last
cib-1.raw cib-11.raw.sig cib-14.raw cib-16.raw.sig cib-19.raw cib-20.raw.sig cib-23.raw cib-25.raw.sig cib-28.raw cib-3.raw.sig cib-32.raw cib-5.raw.sig cib-8.raw cib.xml
cib-1.raw.sig cib-12.raw cib-14.raw.sig cib-17.raw cib-19.raw.sig cib-21.raw cib-23.raw.sig cib-26.raw cib-28.raw.sig cib-30.raw cib-32.raw.sig cib-6.raw cib-8.raw.sig cib.xml.sig
cib-10.raw cib-12.raw.sig cib-15.raw cib-17.raw.sig cib-2.raw cib-21.raw.sig cib-24.raw cib-26.raw.sig cib-29.raw cib-30.raw.sig cib-4.raw cib-6.raw.sig cib-9.raw

jmozdzen
04-Aug-2014, 18:44
Hi Mahmood,

It's somewhat strange: on one hand, the CIB is reported as unreachable, but on the other hand you can query it. Likewise, "crm_mon -1" reports both nodes as "offline", yet your CIB reports <node_state ... crmd="online" /> for both nodes. Did you fetch this cibadmin output while the nodes were in one of the failed states you reported earlier?

What about the expected cib messages in syslog, or the question whether that process is alive (when you see the failure) - did you get a chance to verify that? It somehow looks like the cib process is failing, rendering pacemaker useless. If this can be confirmed, the question is "why".
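One way to confirm a dying or respawning cib would be to sample its PID over time - a hedged sketch, assuming pgrep is available and the /usr/lib64/pacemaker/cib path seen in the ps output above:

```shell
#!/bin/sh
# Sample the cib PID twice; a changed or missing PID while the
# "ipc delivery failed" warnings appear would point at cib restarts.
pid1=$(pgrep -f 'pacemaker/cib' | head -n 1)
sleep 5
pid2=$(pgrep -f 'pacemaker/cib' | head -n 1)
if [ -n "$pid1" ] && [ "$pid1" = "$pid2" ]; then
    echo "cib PID stable: $pid1"
else
    echo "cib missing or respawned (was '$pid1', now '$pid2')"
fi
```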

This whole situation appears rather complicated and hard to be handled off-site via forums. If you have a support contract with SUSE, I'd recommend to open a service request, stating that you're evaluating HAE prior to purchase and have run into this problem - and if no support contract is available, see your SUSE / vendor pre-sales support for assistance.

I'll still try to help as best as I can, of course.

Regards,
Jens