Csync2 not responding properly on new host

Setting up a new SLES 15 SP1 HA host (actually reinstalling)

Got it ready to join the existing cluster

Ran ha-cluster-join; it asks for the IP address or name of an existing cluster member. Filled it in and continued.
Also tried ha-cluster-join with the -c option:
-c HOST, --cluster-node HOST
IP address or hostname of existing cluster node

---
nss-fs6:~ # ha-cluster-join -c nss-fs8
Retrieving SSH keys - This may prompt for root@nss-fs8:
/root/.ssh/id_rsa already exists - overwrite (y/n)? y
One new SSH key installed
Configuring csync2...
WARNING: csync2 run failed - some files may not be sync'd
done
Merging known_hosts
Probing for new partitions...done
Hawk cluster interface is now running. To see cluster status, open:
https://192.168.12.246:7630/
Log in with username 'hacluster'
Waiting for cluster..............
ERROR: cluster.join: Cannot see peer node "nss-fs8", please check the communication IP
---

So the errors are 'WARNING: csync2 run failed - some files may not be sync'd' & 'ERROR: cluster.join: Cannot see peer node "nss-fs8", please check the communication IP'

After checking my hosts file, the DNS server & ping - all is correct & well.
I can passwordless ssh to any host in the cluster, and cluster hosts can passwordless ssh to this new host by IP and by name.

So why can't the new host sync with csync2??

From /var/log/ha-cluster-bootstrap.log on the new host:
---
Connect to 192.168.12.248:30865 (nss-fs8).
SSL: failed to use key file /etc/csync2/csync2_ssl_key.pem and/or certificate file /etc/csync2/csync2_ssl_cert.pem: Error while reading file. (GNUTLS_E_FILE_ERROR)
+ systemctl stop corosync
ERROR: Cannot see peer node "nss-fs8", please check the communication IP
---
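
The GNUTLS_E_FILE_ERROR above names the key and certificate paths, so a quick look at whether those files exist and are readable on the new host may narrow things down. A minimal sketch, assuming the /etc/csync2 directory shown in the log; the function name check_csync2_files is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: report which csync2 SSL files are missing or unreadable.
# The directory defaults to /etc/csync2, the path shown in the bootstrap log.
check_csync2_files() {
    dir="${1:-/etc/csync2}"
    status=0
    for f in csync2_ssl_key.pem csync2_ssl_cert.pem; do
        if [ -r "$dir/$f" ]; then
            echo "ok: $dir/$f"
        else
            echo "missing or unreadable: $dir/$f"
            status=1
        fi
    done
    # Non-zero exit status if at least one file could not be read.
    return "$status"
}
```

Running `check_csync2_files` with no argument on the new host checks the default directory. It only tests for presence and readability; it says nothing about whether the key and certificate are valid.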

I see that csync2 can send files to the existing cluster hosts, but not to this new host.


Any comments, suggestions, wild guesses, concrete fixes??

Comments

  • strahil-nikolov-dxc Established Member
    Have you tried to cleanup the node and restart the join ?
    Check what 'crm_mon' reports on the other nodes.

    As you said reinstall, did you restore any files after the installation?
  • johngoutbeck Established Member
    Have you tried to cleanup the node and restart the join ?
    Check what 'crm_mon' reports on the other nodes.

    As you said reinstall, did you restore any files after the installation?

    ---
    How exactly do I do a 'cleanup' on the new host?

    crm_mon reports two existing nodes - not the new host

    Fresh install of SLES 15 SP1 + HA on the new host
    No restore of any files.

    About a possible cleanup on the new host:

    If /etc/csync2 is emptied and 'ha-cluster-join -c nss-fs8' is run on the new host, several files are added to this directory.
    --
    nss-fs6:~ # l /etc/csync2/
    total 8
    drwxr-xr-x 1 root root 42 Mar 28 21:03 ./
    drwxr-xr-x 1 root root 4690 Mar 27 21:00 ../
    -rw 1 root root 537 Mar 30 07:37 csync2.cfg
    -rw 1 root root 65 Mar 30 07:37 key_hagroup
    --

    So it seems some files are synced, but the *.pem files from the existing cluster host are missing.
    On the existing cluster host:
    --
    nss-fs8:/etc/csync2 # l /etc/csync2/
    total 16
    drwxr-xr-x 1 root root 116 Mar 30 07:37 ./
    drwxr-xr-x 1 root root 5122 Mar 30 07:37 ../
    -rw 1 root root 537 Mar 28 07:52 csync2.cfg
    -rw 1 root root 1021 Feb 20 16:51 csync2_ssl_cert.pem
    -rw 1 root root 359 Feb 20 16:51 csync2_ssl_key.pem
    -rw 1 root root 65 Apr 6 2015 key_hagroup
    --

    Checked the cert.pem file on the existing cluster host - all OK
    --
    nss-fs8:/etc/csync2 # openssl x509 -enddate -noout -in csync2_ssl_cert.pem
    notAfter=May 8 23:51:42 2028 GMT
    --

    Any other suggestions?
  • johngoutbeck Established Member
    Hello All;

    I opened an SR with SUSE.

    Turns out to be a bug they found last Friday.
    SUSE made a pacemaker test patch, which was installed on all my HA hosts, including the new node that was joining the existing cluster.
    This patch also fixes an issue where a node may not start properly.

    Following https://www.suse.com/support/kb/doc/?id=000019604
    Then 'rm /var/lib/corosync/ringid_*' on all nodes. This erases the ringid that pacemaker/corosync creates.

    After the patch, a corosync update, and erasing the ringid on all the nodes, the new node could finally join the cluster.

    SUSE will include this patch/update in the very near future since it affects all current clusters.


    Hope this helps others
    Have a good day
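
For anyone repeating the ringid step from the last comment, it could be scripted roughly as below. This is only a sketch: the function name clear_ringid and the DRY_RUN switch are made up for illustration, root ssh between the nodes is assumed (as described in the original post), and stopping corosync before removing the files is an assumption that is not stated in the thread. The KB article linked above remains the authority on the actual procedure.

```shell
#!/bin/sh
# Sketch: print (or run) the ringid cleanup for each node given as an argument.
# DRY_RUN=1 (the default) only prints the ssh commands; DRY_RUN=0 runs them.
clear_ringid() {
    for node in "$@"; do
        # Assumption: corosync is stopped first; rm -f avoids an error
        # when no ringid file exists on a node.
        cmd="systemctl stop corosync && rm -f /var/lib/corosync/ringid_*"
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "ssh root@$node '$cmd'"
        else
            # The glob is expanded by the remote shell, not locally.
            ssh "root@$node" "$cmd"
        fi
    done
}
```

Calling `clear_ringid nss-fs6 nss-fs8` (plus any other cluster nodes) prints what would be run; `DRY_RUN=0 clear_ringid ...` actually runs it. The cluster services then have to be started again on each node.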