Seeing TOTEM messages (Totem is unable to form a cluster because of an operating system or network fault)

Hello,
We have a use case where we have configured both SLES HA 15 and the strongSwan service. When we initially set up these services in a particular order, the HA services come up without any issues.
However, when we reboot any of our cluster nodes (say Node-1), crm status shows the FILE status as UNCLEAN and we see the following messages repeatedly:

2021-05-19T07:47:02.908610+00:00 FILE-1 corosync[3610]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuously in gather state). The most common cause of this message is that the local firewall is configured improperly.
2021-05-19T07:47:53.913841+00:00 FILE-1 corosync[3610]: message repeated 34 times: [ [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuously in gather state). The most common cause of this message is that the local firewall is configured improperly.]
2021-05-19T07:47:54.018432+00:00 FILE-1 systemd[1]: Started Service to post platform alerts.
.
.
.
This state persists until there is a ping or ssh to the other node (in a 2-node setup). As soon as there is a ping/ssh, we see the following logs:

2021-05-19T07:47:54.020228+00:00 FILE-1 charon-systemd[2864]: creating acquire job for policy 147.178.40.8/32[udp/34030] === 147.178.40.7/32[udp/blackjack] with reqid {145}
2021-05-19T07:47:54.020488+00:00 FILE-1 charon-systemd[2864]: initiating IKE_SA local-FILE-147178408-147178407[5] to 147.178.40.7
2021-05-19T07:47:54.021248+00:00 FILE-1 charon-systemd[2864]: generating IKE_SA_INIT request 0 [ SA KE No N(NATD_S_IP) N(NATD_D_IP) N(FRAG_SUP) N(HASH_ALG) N(REDIR_SUP) ]
2021-05-19T07:47:54.021470+00:00 FILE-1 charon-systemd[2864]: sending packet: from 147.178.40.8[500] to 147.178.40.7[500] (332 bytes)
2021-05-19T07:47:54.024443+00:00 FILE-1 charon-systemd[2864]: received packet: from 147.178.40.7[500] to 147.178.40.8[500] (340 bytes)
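
For anyone who wants to line up the two sets of messages, here is a minimal Python sketch that measures how long corosync reports the Totem fault before charon starts an IKE_SA. It only assumes the syslog layout shown in the excerpts above; the function names (first_match, seconds_until_ike) and the SAMPLE list are hypothetical helpers made up for illustration, and the script does not diagnose the cause, it only helps correlate timestamps.

```python
import re
from datetime import datetime

# Syslog layout seen in the excerpts: "<ISO timestamp> <host> <process>[<pid>]: <message>"
LINE_RE = re.compile(r"^(\S+) (\S+) ([\w.-]+)\[\d+\]: (.*)$")


def first_match(lines, process, needle):
    """Return the timestamp of the first line from `process` whose message contains `needle`."""
    for line in lines:
        m = LINE_RE.match(line)
        if m and m.group(3) == process and needle in m.group(4):
            return datetime.fromisoformat(m.group(1))
    return None


def seconds_until_ike(lines):
    """Seconds between the first Totem fault message and the first IKE_SA initiation."""
    totem = first_match(lines, "corosync", "Totem is unable to form a cluster")
    ike = first_match(lines, "charon-systemd", "initiating IKE_SA")
    if totem is None or ike is None:
        return None  # one of the two events is missing from the input
    return (ike - totem).total_seconds()


# Two lines copied verbatim from the log excerpts in this post.
SAMPLE = [
    "2021-05-19T07:47:02.908610+00:00 FILE-1 corosync[3610]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuously in gather state). The most common cause of this message is that the local firewall is configured improperly.",
    "2021-05-19T07:47:54.020488+00:00 FILE-1 charon-systemd[2864]: initiating IKE_SA local-FILE-147178408-147178407[5] to 147.178.40.7",
]

print(seconds_until_ike(SAMPLE))
```

To run it against a real node, pass the lines of your own syslog file instead of SAMPLE; where that file lives depends on the setup.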

Can you please help us understand what goes wrong after the node is rebooted, and why it is unable to join the cluster on its own?
TIA!
