Seeing TOTEM messages (Totem is unable to form a cluster because of an operating system or network fault)

Hello,
We have configured both SLES HA 15 and the strongSwan service on our nodes. When we initially set up these services in a particular order, the HA services come up without any issues.
However, when we reboot one of our cluster nodes (say Node-1), crm status shows the FILE status as UNCLEAN and we see the following messages repeatedly:

2021-05-19T07:47:02.908610+00:00 FILE-1 corosync[3610]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuously in gather state). The most common cause of this message is that the local firewall is configured improperly.
2021-05-19T07:47:53.913841+00:00 FILE-1 corosync[3610]: message repeated 34 times: [ [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuously in gather state). The most common cause of this message is that the local firewall is configured improperly.]
2021-05-19T07:47:54.018432+00:00 FILE-1 systemd[1]: Started Service to post platform alerts.
.
.
.
The status stays this way until there is a ping or ssh to the other node (this is a 2-node setup). As soon as there is a ping or ssh, we see the following logs:

2021-05-19T07:47:54.020228+00:00 FILE-1 charon-systemd[2864]: creating acquire job for policy 147.178.40.8/32[udp/34030] === 147.178.40.7/32[udp/blackjack] with reqid {145}
2021-05-19T07:47:54.020488+00:00 FILE-1 charon-systemd[2864]: initiating IKE_SA local-FILE-147178408-147178407[5] to 147.178.40.7
2021-05-19T07:47:54.021248+00:00 FILE-1 charon-systemd[2864]: generating IKE_SA_INIT request 0 [ SA KE No N(NATD_S_IP) N(NATD_D_IP) N(FRAG_SUP) N(HASH_ALG) N(REDIR_SUP) ]
2021-05-19T07:47:54.021470+00:00 FILE-1 charon-systemd[2864]: sending packet: from 147.178.40.8[500] to 147.178.40.7[500] (332 bytes)
2021-05-19T07:47:54.024443+00:00 FILE-1 charon-systemd[2864]: received packet: from 147.178.40.7[500] to 147.178.40.8[500] (340 bytes)

Can you please help us understand what goes wrong after the node reboots, and why it is unable to rejoin the cluster on its own?
TIA!

Comments

  • I am struggling to understand your situation, so I doubt anyone will be able to help.
    Can you provide more details, such as the cluster setup, stonith availability, corosync options, links per host, and network topology?

    Are you trying to tunnel corosync traffic over the VPN? I doubt that is supported, but if you want to stay on that track, your VPN service needs to start before corosync. You will also need the "two_node" option, so that quorum is not lost when the other cluster peer is dead.
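
    As a rough sketch of the ordering and quorum points above, and not a tested configuration: a systemd drop-in can ask for corosync to be started after the VPN unit. The drop-in file name is hypothetical, and the unit name strongswan.service is an assumption (the charon-systemd entries in your log suggest a systemd-native strongSwan unit, but check the actual name on your nodes).

    ```ini
    # /etc/systemd/system/corosync.service.d/10-after-vpn.conf  (hypothetical file name)
    [Unit]
    # Assumption: the strongSwan unit on this system is called strongswan.service
    Wants=strongswan.service
    After=strongswan.service
    ```

    Run "systemctl daemon-reload" after creating the file. Note that After= only orders unit startup; by itself it does not guarantee that the IPsec tunnel is already established when corosync starts.

    The "two_node" option belongs in the quorum section of corosync.conf, roughly like this:

    ```ini
    # Fragment of /etc/corosync/corosync.conf
    quorum {
        provider: corosync_votequorum
        # Keep quorum in a 2-node cluster when one peer is down
        two_node: 1
    }
    ```

    Keep in mind that two_node also enables wait_for_all by default, so after a cold start the cluster becomes quorate only once both nodes have seen each other.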
