sleha-join Error Messages



epretorious
18-Dec-2012, 04:26
Attempting to set up corosync on san1.example.com [192.168.1.1] fails:

san1:~ # sleha-init
WARNING: Could not detect IP address for eth0
WARNING: Could not detect network address for eth0
Enabling sshd service
/root/.ssh/id_rsa already exists - overwrite? [y/N] y
Generating ssh key
Configuring csync2
Generating csync2 shared key (this may take a while)...ERROR: Can't create csync2 key
So I set up corosync on san2.example.com [192.168.1.2] instead:

san2:~ # sleha-init
WARNING: Could not detect IP address for eth0
WARNING: Could not detect network address for eth0
Enabling sshd service
Generating ssh key
Configuring csync2
Generating csync2 shared key (this may take a while)...done
Enabling csync2 service
Enabling xinetd service
csync2 checking files

Configure Corosync:
This will configure the cluster messaging layer. You will need
to specify a network address over which to communicate (default
is eth0's network, but you can use the network address of any
active interface), a multicast address and multicast port.

Network address to bind to (e.g.: 192.168.1.0) [] 192.168.1.2
Multicast address (e.g.: 239.x.x.x) [239.129.63.45] 239.1.1.2
Multicast port [5405]

Configure SBD:
If you have shared storage, for example a SAN or iSCSI target,
you can use it avoid split-brain scenarios by configuring SBD.
This requires a 1 MB partition, accessible to all nodes in the
cluster. The device path must be persistent and consistent
across all nodes in the cluster, so /dev/disk/by-id/* devices
are a good choice. Note that all data on the partition you
specify here will be destroyed.

Do you wish to use SBD? [y/N]
WARNING: Not configuring SBD - STONITH will be disabled.
Enabling hawk service
HA Web Konsole is now running, to see cluster status go to:
https://SERVER:7630/
Log in with username 'hacluster', password 'linux'
WARNING: You should change the hacluster password to something more secure!
Enabling openais service
Waiting for cluster........done
Loading initial configuration
Done (log saved to /var/log/sleha-bootstrap.log)
But then san1.example.com [192.168.1.1] complains when joining the cluster:

san1:~ # sleha-join
WARNING: Could not detect IP address for eth0
WARNING: Could not detect network address for eth0

Join This Node to Cluster:
You will be asked for the IP address of an existing node, from which
configuration will be copied. If you have not already configured
passwordless ssh between nodes, you will be prompted for the root
password of the existing node.

IP address or hostname of existing node (e.g.: 192.168.1.1) [] 192.168.1.2
Enabling sshd service
/root/.ssh/id_rsa already exists - overwrite? [y/N] y
Retrieving SSH keys from 192.168.1.2
Password:
Configuring csync2
Enabling csync2 service
Enabling xinetd service
WARNING: csync2 of /etc/csync2/csync2.cfg failed - file may not be in sync on all nodes
WARNING: csync2 run failed - some files may not be sync'd
Merging known_hosts
WARNING: known_hosts collection may be incomplete
WARNING: known_hosts merge may be incomplete
Probing for new partitions......ERROR: Failed to probe new partitions
I've verified that the firewall rules of both hosts have been flushed:

san1:~ # iptables -L -v
Chain INPUT (policy ACCEPT 24M packets, 66G bytes)
pkts bytes target prot opt in out source destination

Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
pkts bytes target prot opt in out source destination

Chain OUTPUT (policy ACCEPT 25M packets, 37G bytes)
pkts bytes target prot opt in out source destination


san2:~ # iptables -L -v
Chain INPUT (policy ACCEPT 22M packets, 39G bytes)
pkts bytes target prot opt in out source destination

Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
pkts bytes target prot opt in out source destination

Chain OUTPUT (policy ACCEPT 17M packets, 43G bytes)
pkts bytes target prot opt in out source destination
There are some entries in /var/log/messages of san2.example.com that seem related to this:

Dec 17 19:04:51 san2 cib: [3615]: info: cib_stats: Processed 83 operations (1445.00us average, 0% utilization) in the last 10min
Dec 17 19:10:18 san2 crmd: [3620]: info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped (900000ms)
Dec 17 19:10:18 san2 crmd: [3620]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
Dec 17 19:10:18 san2 crmd: [3620]: info: do_state_transition: Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
Dec 17 19:10:18 san2 pengine: [3619]: notice: unpack_config: On loss of CCM Quorum: Ignore
Dec 17 19:10:18 san2 crmd: [3620]: notice: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Dec 17 19:10:18 san2 crmd: [3620]: info: do_te_invoke: Processing graph 2 (ref=pe_calc-dc-1355800218-15) derived from /var/lib/pengine/pe-input-2.bz2
Dec 17 19:10:18 san2 crmd: [3620]: notice: run_graph: ==== Transition 2 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-2.bz2): Complete
Dec 17 19:10:18 san2 crmd: [3620]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Dec 17 19:10:18 san2 pengine: [3619]: notice: process_pe_message: Transition 2: PEngine Input stored in: /var/lib/pengine/pe-input-2.bz2
Dec 17 19:14:51 san2 cib: [3615]: info: cib_stats: Processed 1 operations (0.00us average, 0% utilization) in the last 10min
What's going on here? Why can't san1.example.com set up/join the cluster?

Eric Pretorious
Truckee, CA

epretorious
18-Dec-2012, 05:05
I noticed that Step #5 of Chapter 3.4 (https://www.suse.com/documentation/sle_ha/book_sleha/?page=/documentation/sle_ha/book_sleha/data/sec_ha_installation_setup_manual.html) in the "SUSE Linux Enterprise High Availability Extension Guide (https://www.suse.com/documentation/sle_ha/book_sleha/data/book_sleha.html)" mentions the /var/log/sleha-bootstrap.log file so I took a look in there and found this:

================================================================
2012-12-17 19:29:01-08:00 /usr/sbin/sleha-join
----------------------------------------------------------------
WARNING: Could not detect IP address for eth0
WARNING: Could not detect network address for eth0
+ chkconfig sshd on
+ mkdir -m 700 -p /root/.ssh
+ scp -oStrictHostKeyChecking=no root@192.168.1.2:/root/.ssh/id_rsa\* /root/.ssh/
+ rm -f /var/lib/csync2/san1.db
+ ssh root@192.168.1.2 sleha-init csync2_remote san1
WARNING: Could not detect IP address for eth0
WARNING: Could not detect network address for eth0
+ scp root@192.168.1.2:/etc/csync2/\{csync2.cfg\,key_hagroup\} /etc/csync2
+ chkconfig csync2 on
+ chkconfig xinetd on
+ ssh root@192.168.1.2 csync2\ -m\ /etc/csync2/csync2.cfg\ \;\ csync2\ -f\ /etc/csync2/csync2.cfg\ \;\ csync2\ -xv\ /etc/csync2/csync2.cfg
Connecting to host san1 (SSL) ...
Can't connect to remote host.
ERROR: Connection to remote host failed.
Host stays in dirty state. Try again later...
Finished with 1 errors.
WARNING: csync2 of /etc/csync2/csync2.cfg failed - file may not be in sync on all nodes
+ ssh root@192.168.1.2 csync2\ -mr\ /\ \;\ csync2\ -fr\ /\ \;\ csync2\ -xv\ -P\ san1
Connecting to host san1 (SSL) ...
Can't connect to remote host.
ERROR: Connection to remote host failed.
Host stays in dirty state. Try again later...
Finished with 1 errors.
WARNING: csync2 run failed - some files may not be sync'd
+ mkdir -p /tmp/sleha-pssh.5727
+ rm -f /tmp/sleha-pssh.5727/\*
+ pssh -H san1 -H san2 -O StrictHostKeyChecking=no -o /tmp/sleha-pssh.5727 cat /root/.ssh/known_hosts
[1] 19:35:31 [SUCCESS] san1
[2] 19:36:31 [FAILURE] san2 Timed out, Killed by signal 9
WARNING: known_hosts collection may be incomplete
+ pscp -H san1 -H san2 -O StrictHostKeyChecking=no /root/.ssh/known_hosts.new /root/.ssh/known_hosts
[1] 19:36:31 [SUCCESS] san1
[2] 19:42:50 [FAILURE] san2 Exited with error code 1
WARNING: known_hosts merge may be incomplete
+ rm /root/.ssh/known_hosts.new
+ rm -r /tmp/sleha-pssh.5727
+ partprobe
Error: Error informing the kernel about modifications to partition /dev/drbd0_part1 -- Invalid argument. This means Linux won't know about any changes you made to /dev/drbd0_part1 until you reboot -- so you shouldn't mount it or use it in any way before rebooting.
Error: Failed to add partition 1 (Invalid argument)
Error: Error informing the kernel about modifications to partition /dev/drbd1_part1 -- Invalid argument. This means Linux won't know about any changes you made to /dev/drbd1_part1 until you reboot -- so you shouldn't mount it or use it in any way before rebooting.
Error: Failed to add partition 1 (Invalid argument)
Error: Error informing the kernel about modifications to partition /dev/drbd2_part1 -- Invalid argument. This means Linux won't know about any changes you made to /dev/drbd2_part1 until you reboot -- so you shouldn't mount it or use it in any way before rebooting.
Error: Failed to add partition 1 (Invalid argument)
Error: Error informing the kernel about modifications to partition /dev/drbd3_part1 -- Invalid argument. This means Linux won't know about any changes you made to /dev/drbd3_part1 until you reboot -- so you shouldn't mount it or use it in any way before rebooting.
Error: Failed to add partition 1 (Invalid argument)
ERROR: Failed to probe new partitions
----------------------------------------------------------------
2012-12-17 19:42:55-08:00 exited (rc=1)
================================================================
Does this help?

Eric Pretorious
Truckee, CA

jmozdzen
18-Dec-2012, 18:08
Hi Eric,

> + ssh root@192.168.1.2 csync2\ -m\ /etc/csync2/csync2.cfg\ \;\ csync2\ -f\ /etc/csync2/csync2.cfg\ \;\ csync2\ -xv\ /etc/csync2/csync2.cfg
> Connecting to host san1 (SSL) ...
> Can't connect to remote host.
> ERROR: Connection to remote host failed.
> Host stays in dirty state. Try again later...
> Finished with 1 errors.
> WARNING: csync2 of /etc/csync2/csync2.cfg failed - file may not be in sync on all nodes
> Does this help?

That log seems to be from san1, where you can see that it connects to san2 via ssh and there invokes a backward SSL connection to san1 - which in turn fails.

I cannot tell exactly why it fails - maybe an SSL certificate problem or DNS? Does "san2" resolve "san1"? This seems to be a csync2-based problem, so I'd focus on debugging that.
It's also interesting that the initial attempt on san1 failed with a key-related error: "Generating csync2 shared key (this may take a while)...ERROR: Can't create csync2 key".
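
A minimal sketch of the name-resolution check suggested above, to be run on both nodes. It assumes getent is available and that the short names san1 and san2 are the ones csync2 uses (as the bootstrap log suggests); the function names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helpers (not part of sleha-bootstrap or csync2).

# resolves NAME: succeed if NAME resolves through the system resolver
# (/etc/hosts, DNS, ... in the order given by /etc/nsswitch.conf).
resolves() {
    getent hosts "$1" >/dev/null 2>&1
}

# check_names NAME...: print one status line per name.
check_names() {
    for name in "$@"; do
        if resolves "$name"; then
            echo "OK   $name -> $(getent hosts "$name" | awk '{ print $1; exit }')"
        else
            echo "FAIL $name does not resolve"
        fi
    done
}

# Example: run on san1 and on san2.
# check_names san1 san2 san1.example.com san2.example.com
```

Unlike ping, this only exercises the resolver, so a FAIL line points at /etc/hosts or DNS rather than at the network path.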

Regards
Jens

epretorious
20-Dec-2012, 04:01
> That log seems to be from san1, where you can see that it connects to san2 via ssh and there invokes a backward SSL connection to san1 - which in turn fails.
>
> I cannot tell exactly why it fails - maybe an SSL certificate problem or DNS? Does "san2" resolve "san1"? This seems to be a csync2-based problem, so I'd focus on debugging that.
> It's also interesting that the initial attempt on san1 failed with a key-related error: "Generating csync2 shared key (this may take a while)...ERROR: Can't create csync2 key".

I corrected the errors in /etc/hosts and san1.example.com & san2.example.com both resolve correctly from both hosts now:

san1:~ # ping -c3 san2.example.com
PING san2.example.com (192.168.1.2) 56(84) bytes of data.
64 bytes from san2.example.com (192.168.1.2): icmp_seq=1 ttl=64 time=0.162 ms
64 bytes from san2.example.com (192.168.1.2): icmp_seq=2 ttl=64 time=0.115 ms
64 bytes from san2.example.com (192.168.1.2): icmp_seq=3 ttl=64 time=0.128 ms

--- san2.example.com ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1998ms
rtt min/avg/max/mdev = 0.115/0.135/0.162/0.019 ms

san2:~ # ping -c3 san1.example.com
PING san1.example.com (192.168.1.1) 56(84) bytes of data.
64 bytes from san1.example.com (192.168.1.1): icmp_seq=1 ttl=64 time=0.188 ms
64 bytes from san1.example.com (192.168.1.1): icmp_seq=2 ttl=64 time=0.115 ms
64 bytes from san1.example.com (192.168.1.1): icmp_seq=3 ttl=64 time=0.117 ms

--- san1.example.com ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1998ms
rtt min/avg/max/mdev = 0.115/0.140/0.188/0.033 ms
...and then re-ran sleha-join from san1.example.com (192.168.1.1):

san1:~ # sleha-join -c 192.168.1.2 -i bond0
Restarting firewall (TCP 30865 5560 7630 21064 and UDP none open)
Enabling sshd service
/root/.ssh/id_rsa already exists - overwrite? [y/N] y
Retrieving SSH keys from 192.168.1.2
Configuring csync2
Enabling csync2 service
Enabling xinetd service
WARNING: csync2 of /etc/csync2/csync2.cfg failed - file may not be in sync on all nodes
WARNING: csync2 run failed - some files may not be sync'd
Merging known_hosts
Probing for new partitions......ERROR: Failed to probe new partitions
This is the latest entry in /var/log/sleha-bootstrap.log:

san1:~ # grep -A 50 -B 1 18:48:00 /var/log/sleha-bootstrap.log
================================================================
2012-12-19 18:48:00-08:00 /usr/sbin/sleha-join -c 192.168.1.2 -i bond0
----------------------------------------------------------------
+ chkconfig sshd on
+ mkdir -m 700 -p /root/.ssh
+ scp -oStrictHostKeyChecking=no root@192.168.1.2:/root/.ssh/id_rsa\* /root/.ssh/
+ rm -f /var/lib/csync2/san1.db
+ ssh root@192.168.1.2 sleha-init csync2_remote san1
WARNING: Could not detect IP address for eth0
WARNING: Could not detect network address for eth0
+ scp root@192.168.1.2:/etc/csync2/\{csync2.cfg\,key_hagroup\} /etc/csync2
+ chkconfig csync2 on
+ chkconfig xinetd on
+ ssh root@192.168.1.2 csync2\ -m\ /etc/csync2/csync2.cfg\ \;\ csync2\ -f\ /etc/csync2/csync2.cfg\ \;\ csync2\ -xv\ /etc/csync2/csync2.cfg
Connecting to host san1 (SSL) ...
ERROR from peer san1: csync2: relocation error: /usr/lib64/libgnutls-openssl.so.26: symbol _gnutls_log_level, version GNUTLS_1_4 not defined in file libgnutls.so.26 with link time reference
SSL command failed.
ERROR: Connection to remote host failed.
Host stays in dirty state. Try again later...
Finished with 2 errors.
WARNING: csync2 of /etc/csync2/csync2.cfg failed - file may not be in sync on all nodes
+ ssh root@192.168.1.2 csync2\ -mr\ /\ \;\ csync2\ -fr\ /\ \;\ csync2\ -xv\ -P\ san1
Connecting to host san1 (SSL) ...
ERROR from peer san1: csync2: relocation error: /usr/lib64/libgnutls-openssl.so.26: symbol _gnutls_log_level, version GNUTLS_1_4 not defined in file libgnutls.so.26 with link time reference
SSL command failed.
ERROR: Connection to remote host failed.
Host stays in dirty state. Try again later...
Finished with 2 errors.
WARNING: csync2 run failed - some files may not be sync'd
+ mkdir -p /tmp/sleha-pssh.7684
+ rm -f /tmp/sleha-pssh.7684/\*
+ pssh -H san1 -H san2 -O StrictHostKeyChecking=no -o /tmp/sleha-pssh.7684 cat /root/.ssh/known_hosts
[1] 18:48:03 [SUCCESS] san2
[2] 18:48:03 [SUCCESS] san1
+ pscp -H san1 -H san2 -O StrictHostKeyChecking=no /root/.ssh/known_hosts.new /root/.ssh/known_hosts
[1] 18:48:04 [SUCCESS] san2
[2] 18:48:04 [SUCCESS] san1
+ rm /root/.ssh/known_hosts.new
+ rm -r /tmp/sleha-pssh.7684
+ partprobe
Error: Error informing the kernel about modifications to partition /dev/drbd0_part1 -- Invalid argument. This means Linux won't know about any changes you made to /dev/drbd0_part1 until you reboot -- so you shouldn't mount it or use it in any way before rebooting.
Error: Failed to add partition 1 (Invalid argument)
Error: Error informing the kernel about modifications to partition /dev/drbd1_part1 -- Invalid argument. This means Linux won't know about any changes you made to /dev/drbd1_part1 until you reboot -- so you shouldn't mount it or use it in any way before rebooting.
Error: Failed to add partition 1 (Invalid argument)
Error: Error informing the kernel about modifications to partition /dev/drbd2_part1 -- Invalid argument. This means Linux won't know about any changes you made to /dev/drbd2_part1 until you reboot -- so you shouldn't mount it or use it in any way before rebooting.
Error: Failed to add partition 1 (Invalid argument)
Error: Error informing the kernel about modifications to partition /dev/drbd3_part1 -- Invalid argument. This means Linux won't know about any changes you made to /dev/drbd3_part1 until you reboot -- so you shouldn't mount it or use it in any way before rebooting.
Error: Failed to add partition 1 (Invalid argument)
ERROR: Failed to probe new partitions
----------------------------------------------------------------
2012-12-19 18:48:09-08:00 exited (rc=1)
================================================================
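
Picking the run out of the log with grep -A/-B requires knowing its start time. As an alternative, here is a rough sketch that prints only the most recent run. It assumes every run starts with a "timestamp /path/to/command" line, as in the excerpts in this thread; last_run is a made-up helper name.

```shell
#!/bin/sh
# last_run [FILE]: print the last run recorded in a sleha-bootstrap style log.
# A line of the form "YYYY-MM-DD <time> /command ..." is treated as the start
# of a run; everything collected before the last such line is discarded.
last_run() {
    awk '
        /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] [^ ]+ \// { buf = "" }
        { buf = buf $0 "\n" }
        END { printf "%s", buf }
    ' "$@"
}

# Example:
# last_run /var/log/sleha-bootstrap.log | grep -E 'ERROR|WARNING'
```

The "exited (rc=...)" lines also begin with a timestamp but are not followed by a path, so they do not reset the buffer.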
When I execute crm_mon it complains:

san1:~ # crm_mon
Attempting connection to the cluster...crm_mon: symbol lookup error: /usr/lib64/libplumb.so.2: undefined symbol: g_malloc_n

san1:~ # ls -al /usr/lib64/libplumb.so.2
lrwxrwxrwx 1 root root 17 Nov 21 17:51 /usr/lib64/libplumb.so.2 -> libplumb.so.2.1.0

san1:~ # ls -al /usr/lib64/libplumb.so.2.1.0
-rwxr-xr-x 1 root root 230112 Jun 22 03:24 /usr/lib64/libplumb.so.2.1.0

Ideas? Suggestions?

Eric Pretorious
Truckee, CA

jmozdzen
20-Dec-2012, 11:01
Hi Eric,


> I corrected the errors in /etc/hosts and san1.example.com & san2.example.com both resolve correctly from both hosts now:

Yes, I already guessed that fixing that other problem might get you somewhere with this one. But obviously, there's more to it:


> This is the latest entry in /var/log/sleha-bootstrap.log:
>
> san1:~ # grep -A 50 -B 1 18:48:00 /var/log/sleha-bootstrap.log
> ================================================================
> [...]
> + ssh root@192.168.1.2 csync2\ -m\ /etc/csync2/csync2.cfg\ \;\ csync2\ -f\ /etc/csync2/csync2.cfg\ \;\ csync2\ -xv\ /etc/csync2/csync2.cfg
> Connecting to host san1 (SSL) ...
> ERROR from peer san1: csync2: relocation error: /usr/lib64/libgnutls-openssl.so.26: symbol _gnutls_log_level, version GNUTLS_1_4 not defined in file libgnutls.so.26 with link time reference
> [...]
> ================================================================
> When I execute crm_mon it complains:
>
> san1:~ # crm_mon
> Attempting connection to the cluster...crm_mon: symbol lookup error: /usr/lib64/libplumb.so.2: undefined symbol: g_malloc_n
>
> san1:~ # ls -al /usr/lib64/libplumb.so.2
> lrwxrwxrwx 1 root root 17 Nov 21 17:51 /usr/lib64/libplumb.so.2 -> libplumb.so.2.1.0
>
> san1:~ # ls -al /usr/lib64/libplumb.so.2.1.0
> -rwxr-xr-x 1 root root 230112 Jun 22 03:24 /usr/lib64/libplumb.so.2.1.0
>
> Ideas? Suggestions?
>
> Eric Pretorious
> Truckee, CA

Your SLES installation looks inconsistent. csync2 (or a library it uses, here libgnutls-openssl.so.26) expects the symbol _gnutls_log_level with the version tag GNUTLS_1_4, but the installed libgnutls.so.26 does not provide it, so the two libraries appear to come from mismatched builds. Likewise, libplumb needs "g_malloc_n" (a GLib function), but none of the libraries loaded by crm_mon or libplumb provides that symbol.

Check the packages containing the affected files (/usr/lib64/libplumb.so.2, /usr/lib64/libgnutls-openssl.so.26, csync2, crm_mon), e.g. via

san1:~ # rpm -qf /usr/lib64/libplumb.so.2
libglue2-1.0.8-0.4.4.1
san1:~ # rpm -V libglue2-1.0.8-0.4.4.1
san1:~ # which crm_mon
/usr/sbin/crm_mon
san1:~ # rpm -qf /usr/sbin/crm_mon
pacemaker-1.1.5-5.9.11.1
san1:~ # rpm -V pacemaker-1.1.5-5.9.11.1
san1:~ #
If "rpm -V" reports nothing, then the packages are as they should be. Via "rpm -qi <packagename>" you can display the installation details of the package, to verify that they are all indeed from SLES11 and not from some third-party source.

Another possible (but currently, less probable) source of trouble could be a misconfigured loader - having an ordering in LD_LIBRARY_PATH (or the system-wide configuration in /etc/ld.so.conf, /etc/ld.so.conf.d) that brings in third-party library versions too early.
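
To extend the per-file check to every library a binary actually loads, something along these lines might help. It is only a sketch: ldd_paths is a hypothetical helper that filters ldd output, and the example pipeline assumes ldd and rpm are both present on the node.

```shell
#!/bin/sh
# ldd_paths: read `ldd` output on stdin and print the resolved library paths,
# skipping entries without a path (e.g. linux-vdso or "not found").
ldd_paths() {
    awk '$2 == "=>" && $3 ~ /^\// { print $3 }'
}

# Example (on the affected node): which packages own the libraries that
# crm_mon loads? Foreign packages should stand out in the list.
# ldd /usr/sbin/crm_mon | ldd_paths | xargs rpm -qf | sort -u

# To rule out the loader-ordering theory, look at the search path overrides:
# echo "${LD_LIBRARY_PATH:-<unset>}"; cat /etc/ld.so.conf /etc/ld.so.conf.d/*.conf
```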

Regards,
Jens

epretorious
26-Dec-2012, 06:24
> Check the packages containing the affected files (/usr/lib64/libplumb.so.2, /usr/lib64/libgnutls-openssl.so.26, csync2, crm_mon), e.g. via
>
> san1:~ # rpm -qf /usr/lib64/libgnutls-openssl.so.26
> libgnutls-extra26-2.4.1-24.39.39.1
> san1:~ # rpm -qf /usr/lib64/libplumb.so.2
> libglue2-1.0.8-0.4.4.1
> san1:~ # rpm -V libglue2-1.0.8-0.4.4.1
> san1:~ # which crm_mon
> /usr/sbin/crm_mon
> san1:~ # rpm -qf /usr/sbin/crm_mon
> pacemaker-1.1.5-5.9.11.1
> san1:~ # rpm -V pacemaker-1.1.5-5.9.11.1
> san1:~ #
> If "rpm -V" reports nothing, then the packages are as they should be. Via "rpm -qi <packagename>" you can display the installation details of the package, to verify that they are all indeed from SLES11 and not from some third-party source.

Thanks, Jens:

I haven't had much time to work on this lately, but here are the initial results:

san1:~ # rpm -qf /usr/lib64/libgnutls-openssl.so.26
libgnutls-extra26-2.4.1-24.39.39.1

san1:~ # rpm -qf /usr/lib64/libgnutls-openssl.so.26 | xargs rpm -V
....L... /usr/lib64/libgnutls-extra.so.26

san1:~ # rpm -qf /usr/lib64/libplumb.so.2
cluster-glue-libs-1.0.5-6.el6

san1:~ # rpm -qf /usr/lib64/libplumb.so.2 | xargs rpm -V
Unsatisfied dependencies for cluster-glue-libs-1.0.5-6.el6.x86_64: rpmlib(FileDigests) <= 4.6.0-1

My initial observation is that, on your system, /usr/lib64/libplumb.so.2 belongs to a different package than on my system.

I'm using SLES-11-SP2. What version of SLES are you using?

Eric Pretorious
Truckee, CA

jmozdzen
26-Dec-2012, 18:05
Hi Eric,

> I'm using SLES-11-SP2. What version of SLES are you using?

SLES11 SP1 - if everything goes right, I'll be testing SP2 HAE first week of January

> san1:~ # rpm -qf /usr/lib64/libplumb.so.2
> cluster-glue-libs-1.0.5-6.el6
> san1:~ # rpm -qf /usr/lib64/libplumb.so.2 | xargs rpm -V
> Unsatisfied dependencies for cluster-glue-libs-1.0.5-6.el6.x86_64: rpmlib(FileDigests) <= 4.6.0-1
>
> My initial observation is that, on your system, /usr/lib64/libplumb.so.2 belongs to a different package than on my system

My initial observation is that you seem to be using RHEL packages (*el6.x86_64) on a SLES system... does "rpm -qi cluster-glue-libs" confirm this? Taking into account the other problems you reported ("Can't start DomU" thread) and the library problems noticeable there, my suspicion grows with each revealed detail... I believe you have an inconsistently installed system... and probably not because of a system crash, but because of a mix of packages for different target systems (RHEL, SLES,...).

Does a simple "rpm -Vall" report more such problems ("Unsatisfied dependencies")? What other packages from outside the "SLES universe" have you installed? You probably got yourself a not-so-nice Xmas present by injecting incompatible libraries into the system...

Regards,
Jens

epretorious
27-Dec-2012, 04:20
> My initial observation is that you seem to be using RHEL packages (*el6.x86_64) on a SLES system... does "rpm -qi cluster-glue-libs" confirm this? Taking into account the other problems you reported ("Can't start DomU" thread) and the library problems noticeable there, my suspicion grows with each revealed detail... I believe you have an inconsistently installed system... and probably not because of a system crash, but because of a mix of packages for different target systems (RHEL, SLES,...).

san1:~ # rpm -qi cluster-glue-libs-1.0.5-6.el6.x86_64:
package cluster-glue-libs-1.0.5-6.el6.x86_64: is not installed

> Does a simple "rpm -Vall" report more such problems ("Unsatisfied dependencies")? What other packages from outside the "SLES universe" have you installed? You probably got yourself a not-so-nice Xmas present by injecting incompatible libraries into the system...

I'm not using any repositories other than the installation CDs (transferred to an NFS server). However, verifying the packages using RPM revealed several problems:

san1:~ # rpm -Vall > rpm_verify.txt
san1:~ # wc -l rpm_verify.txt
108 rpm_verify.txt
san1:~ # grep Unsat rpm_verify.txt
Unsatisfied dependencies for libnetfilter_conntrack-0.0.100-2.el6.x86_64: rpmlib(FileDigests) <= 4.6.0-1
Unsatisfied dependencies for obex-data-server-0.4.3-4.el6.x86_64: rpmlib(FileDigests) <= 4.6.0-1
Unsatisfied dependencies for cluster-glue-libs-1.0.5-6.el6.x86_64: rpmlib(FileDigests) <= 4.6.0-1
Unsatisfied dependencies for openssl098e-0.9.8e-17.el6.centos.2.x86_64: ca-certificates >= 2008-5, rpmlib(FileDigests) <= 4.6.0-1
Unsatisfied dependencies for gnutls-2.8.5-4.el6_2.2.x86_64: rpmlib(FileDigests) <= 4.6.0-1
Unsatisfied dependencies for libnfnetlink-1.0.0-1.el6.x86_64: rpmlib(FileDigests) <= 4.6.0-1
Unsatisfied dependencies for openobex-1.4-7.el6.x86_64: rpmlib(FileDigests) <= 4.6.0-1
Unsatisfied dependencies for clusterlib-3.0.12.1-32.el6_3.2.x86_64: rpmlib(FileDigests) <= 4.6.0-1
So, to summarize: The system is highly suspect so I'm going to reinstall from scratch and start over.
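
For reference, a small sketch that turns the "Unsatisfied dependencies" lines into a plain list of package names, without the trailing colon that rpm prints after each name. unsat_packages is a hypothetical helper, and rpm_verify.txt is the file created above.

```shell
#!/bin/sh
# unsat_packages: read `rpm -V` output on stdin and print the package part of
# every "Unsatisfied dependencies for <package>: ..." line, without the colon.
unsat_packages() {
    sed -n 's/^Unsatisfied dependencies for \([^:]*\):.*/\1/p'
}

# Example:
# unsat_packages < rpm_verify.txt | sort -u
```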

Thanks, Jens!

Eric Pretorious
Truckee, CA

jmozdzen
28-Dec-2012, 16:38
Hi Eric,

I hope reinstallation will work out fine - it might be a good idea to check your source for el6 files...

> san1:~ # rpm -qi cluster-glue-libs-1.0.5-6.el6.x86_64:
> package cluster-glue-libs-1.0.5-6.el6.x86_64: is not installed

May I recommend being "a bit more precise" in typing/c&p? The colon is certainly not part of the package name; it is simply a visual aid in the output of the program that generated the listing you c&p'ed from ;) And I'm sure that the architecture doesn't help, either.

"rpm -qi cluster-glue-libs" might be more helpful. And if you haven't reinstalled yet, I do recommend checking when/where these packages came from, just so that they won't cause trouble with the reinstall.

Can anyone else out there confirm that Eric should not see .el6 RPMs on his install? Just to make sure I'm not barking up the wrong tree... I still have no HAE at SP2 at hand :[
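
One possible way to see when and where packages came from is to list every installed package with its vendor and install time and then filter out the expected vendor. This is a sketch only: the helper names are invented, and using "SUSE" as the vendor pattern is an assumption that should be checked against what "rpm -qi" shows for a known-good package.

```shell
#!/bin/sh
# list_packages: print "name-version-release.arch<TAB>vendor<TAB>install time"
# for every installed package. NAME, VERSION, RELEASE, ARCH, VENDOR and
# INSTALLTIME are standard rpm query tags.
list_packages() {
    rpm -qa --qf '%{NAME}-%{VERSION}-%{RELEASE}.%{ARCH}\t%{VENDOR}\t%{INSTALLTIME:date}\n'
}

# foreign_packages PATTERN: read list_packages output on stdin and keep only
# the lines whose vendor field does NOT match PATTERN.
foreign_packages() {
    awk -F '\t' -v pat="$1" '$2 !~ pat'
}

# Example (the vendor pattern is an assumption, see above):
# list_packages | foreign_packages SUSE
```

Packages without a vendor (imported GPG keys, for example) will also show up in the output and can be ignored.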

With regards,
Jens

ChrisMalarkyIBM
25-Feb-2013, 18:41
> + ssh root@192.168.1.2 csync2\ -mr\ /\ \;\ csync2\ -fr\ /\ \;\ csync2\ -xv\ -P\ san1
> Connecting to host san1 (SSL) ...
> Can't connect to remote host.
> ERROR: Connection to remote host failed.
> [...]
> WARNING: csync2 run failed - some files may not be sync'd

Hi Eric,

I just found the same problem: when I first run sleha-join to connect a new node to an existing cluster, I also see the "csync2 run failed - some files may not be sync'd" warning. I tracked it down to the "Enabling xinetd service" step in sleha-join. As far as I can tell, because xinetd is already running, no action is performed, so xinetd is not listening on the csync2 port and the other node can't connect back to the new node. I restarted xinetd and tried again; this time it worked as expected.
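
A quick way to check for this condition before re-running sleha-join might look like the sketch below. port_listening is a made-up helper that parses netstat (or ss) output; 30865 is csync2's default port and also appears in the firewall line printed by sleha-join earlier in this thread.

```shell
#!/bin/sh
# port_listening PORT: read `netstat -ltn` (or `ss -ltn`) output on stdin and
# succeed if a socket is listening on PORT. In both tools the fourth column
# is the local address, ending in ":PORT".
port_listening() {
    awk -v port="$1" '
        $4 ~ (":" port "$") { found = 1 }
        END { exit(found ? 0 : 1) }
    '
}

# Example, on the node being joined:
# if netstat -ltn | port_listening 30865; then
#     echo "csync2 port is open"
# else
#     echo "nothing listening on 30865; restart xinetd and retry"
# fi
```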

Hope this helps,
Chris

jmozdzen
25-Feb-2013, 19:10
> Hi Eric,
> [...]
> > san1:~ # rpm -qi cluster-glue-libs-1.0.5-6.el6.x86_64:
> > package cluster-glue-libs-1.0.5-6.el6.x86_64: is not installed
> [...]
> Can anyone else out there confirm that Eric should not see .el6 RPMs on his install? Just to make sure I'm not barking up the wrong tree... I still have no HAE at SP2 at hand :[
>
> With regards,
> Jens

Just for the record: I have checked on a running SLES11 HAE SP2 system that there are no .el6 packages installed... e.g. cluster-glue is "cluster-glue-1.0.9.1-0.38.2".

Regards,
Jens