
Proper method for updating a cluster?



allenb_1121
25-Sep-2012, 21:52
What is the "proper" method for updating an HAE cluster?
I have run into a problem several times now after doing the following:
1. Put a node in standby mode.
2. Run zypper update on that node.
3. Reboot.

It seems to come up in some sort of split-brain mess: the updated node sees the node that is still online as offline, while that node sees the updated one as pending or offline.
I can't bring the updated node back online in order to take the other one offline, so I have to disrupt services by taking BOTH down and applying the update to the other one.

The updates that were applied this time were (columns: status | repository | package | installed version | available version | arch):
v | SLE11-HAE-SP2-Updates | cluster-glue | 1.0.9.1-0.36.1 | 1.0.9.1-0.38.2 | x86_64
v | SLE11-HAE-SP2-Updates | cluster-network-kmp-default | 1.4_3.0.34_0.7-2.10.30 | 1.4_3.0.38_0.5-2.16.1 | x86_64
v | SLE11-HAE-SP2-Updates | corosync | 1.4.1-0.13.1 | 1.4.3-0.5.1 | x86_64
v | SLE11-HAE-SP2-Updates | crmsh | 1.1.0-0.17.3 | 1.1.0-0.19.16 | x86_64
v | SLE11-HAE-SP2-Updates | gfs2-kmp-default | 2_3.0.34_0.7-0.7.30 | 2_3.0.38_0.5-0.7.37 | x86_64
v | SLE11-HAE-SP2-Updates | ldirectord | 3.9.2-0.25.5 | 3.9.3-0.7.1 | x86_64
v | SLE11-HAE-SP2-Updates | libcorosync4 | 1.4.1-0.13.1 | 1.4.3-0.5.1 | x86_64
v | SLE11-HAE-SP2-Updates | libglue2 | 1.0.9.1-0.36.1 | 1.0.9.1-0.38.2 | x86_64
v | SLE11-HAE-SP2-Updates | libpacemaker3 | 1.1.6-1.29.1 | 1.1.7-0.9.1 | x86_64
v | SLE11-HAE-SP2-Updates for x86_64 | ocfs2-kmp-default | 1.6_3.0.34_0.7-0.7.30 | 1.6_3.0.38_0.5-0.7.37 | x86_64
v | SLE11-HAE-SP2-Updates | pacemaker | 1.1.6-1.29.1 | 1.1.7-0.9.1 | x86_64
v | SLE11-HAE-SP2-Updates | pacemaker-mgmt | 2.1.0-0.8.74 | 2.1.0-0.10.2 | x86_64
v | SLE11-HAE-SP2-Updates | pacemaker-mgmt-client | 2.1.0-0.8.74 | 2.1.0-0.10.2 | x86_64
v | SLE11-HAE-SP2-Updates | resource-agents | 3.9.2-0.25.5 | 3.9.3-0.7.1 | x86_64
v | SLE11-HAE-SP2-Updates | yast2-cluster | 2.15.0-8.35.4 | 2.15.0-8.39.1 | noarch


The errors spewing into /var/log/messages on the updated node (uaweb02) are:

Sep 25 15:50:08 uaweb02 crmd: [4461]: info: update_dc: Set DC to uaweb01 (3.0.5)
Sep 25 15:50:08 uaweb02 cib: [4456]: WARN: cib_process_replace: Replacement 0.86.74 not applied to 0.88.27: current epoch is greater than the replacement
Sep 25 15:50:08 uaweb02 cib: [4456]: WARN: cib_diff_notify: Update (client: uaweb02, call:392016): -1.-1.-1 -> 0.86.74 (Update was older than existing configuration)
Sep 25 15:50:08 uaweb02 cib: [4456]: info: cib_process_request: Operation complete: op cib_sync for section 'all' (origin=uaweb01/crmd/392018, version=0.88.27): ok (rc=0)
Sep 25 15:50:08 uaweb02 crmd: [4461]: info: do_election_count_vote: Election 175156 (owner: uaweb01) lost: vote from uaweb01 (Version)
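
For readers decoding the CIB warnings above: the three dot-separated numbers are the CIB version tuple (admin_epoch.epoch.num_updates), compared field by field from left to right. A minimal shell sketch of that comparison follows; the function name cib_version_cmp is hypothetical and not part of Pacemaker.

```shell
#!/bin/sh
# Hypothetical helper: compare two CIB version strings of the form
# admin_epoch.epoch.num_updates. Prints whether the FIRST version is
# "older", "newer", or the "same" relative to the SECOND.
cib_version_cmp() {
    # Split both arguments on dots into six positional parameters.
    old_ifs=$IFS
    IFS=.
    set -- $1 $2
    IFS=$old_ifs
    if [ "$#" -ne 6 ]; then
        echo "invalid"
        return 2
    fi
    # $1..$3 are the fields of the first version, $4..$6 of the second.
    if   [ "$1" -lt "$4" ]; then echo "older"
    elif [ "$1" -gt "$4" ]; then echo "newer"
    elif [ "$2" -lt "$5" ]; then echo "older"
    elif [ "$2" -gt "$5" ]; then echo "newer"
    elif [ "$3" -lt "$6" ]; then echo "older"
    elif [ "$3" -gt "$6" ]; then echo "newer"
    else echo "same"
    fi
}
```

With the values from the log, cib_version_cmp 0.86.74 0.88.27 prints "older", which matches the "Update was older than existing configuration" warning.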

Any ideas why this occurs, and whether there is any way I can get this node back online long enough to update the other one?

Thanks.
Allen Beddingfield
Systems Engineer
The University of Alabama

allenb_1121
25-Sep-2012, 22:20
Additionally, there are these errors on the node that has not been updated (uaweb01):

Sep 25 16:15:42 uaweb01 cib: [3944]: ERROR: cib_perform_op: Discarding update with feature set '3.0.6' greater than our own '3.0.5'
Sep 25 16:15:42 uaweb01 cib: [3944]: WARN: cib_diff_notify: Update (client: crmd, call:690821): -1.-1.-1 -> 0.88.0 (The action/feature is not supported)
Sep 25 16:15:42 uaweb01 cib: [3944]: ERROR: cib_process_request: Operation complete: op cib_replace for section 'all' (origin=uaweb02/crmd/690821, version=0.86.85): The action/feature is not supported (rc=-29)
Sep 25 16:15:42 uaweb01 cib: [3944]: info: cib_process_request: Operation complete: op cib_sync for section 'all' (origin=uaweb02/uaweb02/690821, version=0.86.85): ok (rc=0)
Sep 25 16:15:42 uaweb01 crmd: [3948]: ERROR: finalize_sync_callback: Sync from uaweb02 resulted in an error: The action/feature is not supported
Sep 25 16:15:42 uaweb01 crmd: [3948]: WARN: do_log: FSA: Input I_ELECTION_DC from finalize_sync_callback() received in state S_FINALIZE_JOIN
Sep 25 16:15:42 uaweb01 crmd: [3948]: info: do_state_transition: State transition S_FINALIZE_JOIN -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=finalize_sync_callback ]
Sep 25 16:15:42 uaweb01 crmd: [3948]: info: update_dc: Unset DC uaweb01
Sep 25 16:15:42 uaweb01 crmd: [3948]: info: do_dc_join_offer_all: join-324550: Waiting on 2 outstanding join acks
Sep 25 16:15:42 uaweb01 crmd: [3948]: info: update_dc: Set DC to uaweb01 (3.0.5)
Sep 25 16:15:42 uaweb01 crmd: [3948]: info: do_state_transition: State transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED cause=C_FSA_INTERNAL origin=check_join_state ]
Sep 25 16:15:42 uaweb01 crmd: [3948]: info: do_state_transition: All 2 cluster nodes responded to the join offer.
Sep 25 16:15:42 uaweb01 crmd: [3948]: info: do_dc_join_finalize: join-324550: Syncing the CIB from uaweb02 to the rest of the cluster
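
The key line in the log above is the "Discarding update with feature set" error: the updated node offers a CIB with feature set 3.0.6, while the node that has not been updated only supports 3.0.5. As a rough sketch, the following shell filter pulls that pair out of such log lines; the function name feature_set_mismatch is hypothetical, and the pattern is only assumed to match the message format shown in this thread.

```shell
#!/bin/sh
# Hypothetical helper: read syslog lines on stdin and print
# "<offered feature set> <local feature set>" for every
# "Discarding update with feature set" message found.
feature_set_mismatch() {
    sed -n "s/.*Discarding update with feature set '\([0-9.]*\)' greater than our own '\([0-9.]*\)'.*/\1 \2/p"
}
```

It could be used as: feature_set_mismatch < /var/log/messages | sort -u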

Is it possible to get these talking long enough to update the other node, and how can I avoid this situation when applying updates in the future? Fortunately, this is a pre-production setup.


LarsMB
26-Sep-2012, 10:27
allenb_1121 wrote: "Is it possible to get these talking long enough to update the other node, and how can I avoid this situation when applying updates in the future? Fortunately, this is a pre-production setup."


Hi Allen,

Upgrading the nodes in a rolling fashion is indeed supposed to work. However, putting the node into standby may not be enough, since the update then still swaps the binaries and libraries of a running node. I'd recommend just running "rcopenais stop ; zypper dup ; reboot", so that the update happens while the node is completely down from a cluster perspective. After the reboot, the node should be able to rejoin the cluster just fine.

(And once the second node has been updated with the same process, upgrade the feature set in the CIB and make the possibly new features available.)
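
The per-node procedure described above can be sketched as a small shell function, to be run on one node at a time. The three commands are the ones quoted in this reply; the helper names run and update_node and the DRY_RUN variable are hypothetical additions so the sequence can be previewed without touching the node.

```shell
#!/bin/sh
# Hypothetical wrapper: with DRY_RUN=1 the command is only printed,
# otherwise it is executed as given.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Update ONE node while it is fully out of the cluster:
# stop the cluster stack first, then update, then reboot.
# Stops at the first failing step so a failed update does not reboot.
update_node() {
    run rcopenais stop || return 1
    run zypper dup || return 1
    run reboot
}
```

To preview the sequence, set DRY_RUN=1 and call update_node. Only start on the second node once the first has rebooted and rejoined the cluster.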

I hope this helps!

If that still doesn't work (it did in our testing) and results in a split cluster that doesn't join, please do file a service request; we'll likely ask for the hb_report covering the period from just before the first node's upgrade to the state where the nodes refuse to join. Thanks!
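
For collecting such a report, hb_report takes a start time with -f, an optional end time with -t, and a destination name. The sketch below only prints a command line for review and runs nothing; the function name hb_report_cmd, the time window, and the destination path in the usage note are hypothetical placeholders.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) an hb_report command line
# covering the window from $1 to $2 and writing to destination $3.
hb_report_cmd() {
    from=$1
    to=$2
    dest=$3
    printf 'hb_report -f "%s" -t "%s" %s\n' "$from" "$to" "$dest"
}
```

For example, hb_report_cmd "2012/9/25 15:00" "2012/9/25 17:00" /tmp/hae-upgrade-report prints a command whose window should be adjusted to start just before the first node's upgrade.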