Thread: Unable to bring node 2 in a 2 node cluster online

  1. #1

    Unable to bring node 2 in a 2 node cluster online

    I'm setting up a 2 node cluster for the first time ;o)
    I tried it in a test environment first - had a few difficulties, but managed.
    Now, in the production environment, I have more or less the same difficulties, but am not able to solve the problem...
    To the best of my knowledge, I have configured the 2 nodes correctly (and identically).
    After configuring stonith and a few other resources on node 1, I ran sleha-join on node 2 - apparently with no errors.
    But node 2 never shows up in the crm_gui - if I manually add node 2 in crm_gui, it does not seem to connect, and the node can't be brought online.
    The /var/log/sleha-bootstrap.log on node 2 shows no error.
    It mostly looks as if I had 2 "1 node clusters"...
    I haven't found any troubleshooting guides leading me to the source of my problem...

    Any ideas?

  2. Re: Unable to bring node 2 in a 2 node cluster online

    Hi gertsogaard,

    > Any ideas

    Plenty

    My first guess would be networking. I'm the "if in doubt, do it manually"-type of guy, so I have no experience with "sleha-join" - are both nodes set up with the correct interface definitions? Same multicast port & address, same network address? Is firewalling currently disabled to rule out errors in that department? Could you give us a copy of the interface section(s) from both /etc/corosync/corosync.conf files? Can both nodes see each other through the configured network connection?
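
    For comparing the interface sections of both nodes, a minimal sketch along these lines could pull them out of the config. It assumes the default path /etc/corosync/corosync.conf and an interface block without nested sub-blocks; show_interface_section is just a made-up helper name.

```shell
#!/bin/sh
# Hypothetical helper: print the "interface { ... }" block(s) of a
# corosync.conf with comments, blank lines and indentation removed,
# so the output of two nodes can be compared with diff.
# Assumption: the block has no nested { } sections; the range ends at
# the first line containing a closing brace.
show_interface_section() {
    conf="${1:-/etc/corosync/corosync.conf}"
    sed -n '/interface[[:space:]]*{/,/}/p' "$conf" \
        | sed -e 's/^[[:space:]]*//' \
              -e 's/[[:space:]]*$//' \
              -e '/^#/d' \
              -e '/^$/d'
}

# Usage (run on each node, then diff the two files):
#   show_interface_section > /tmp/interface.$(hostname).txt
```

    If the two outputs differ in anything but whitespace, that is the first thing to fix.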

    > The /var/log/sleha-bootstrap.log on node 2 shows no error.

    How's logging set up in /etc/corosync/corosync.conf, just using files or is syslog enabled(, too)?

    Oh, should have asked that first: What OS are you using and are you up to latest patch levels?

    Regards,
    Jens

  3. #3

    Re: Unable to bring node 2 in a 2 node cluster online

    First, thanks for your response ;o)

    OS version, oh yes I forgot, sorry.
    SLES11SP2 and the corresponding HA software.
    I think I'm a "do it manually" kind of guy too, but when I couldn't make it work, I googled for solutions, and the sleha-join way turned up...

    are both nodes set up with the correct interface definitions? Yes
    Same multicast port & address, same network address? Yes
    Is firewalling currently disabled to rule out errors in that department? Yes

    /etc/corosync/corosync.conf for node 1:
    interface {
        #Network Address to be bind for this interface setting
        bindnetaddr: 192.168.22.0
        #The multicast address to be used
        mcastaddr: 226.94.1.1
        #The multicast port to be used
        mcastport: 5405
        #The ringnumber assigned to this interface setting
        ringnumber: 0
    }

    /etc/corosync/corosync.conf for node 2:
    interface {
        #Network Address to be bind for this interface setting
        bindnetaddr: 192.168.22.0
        #The multicast address to be used
        mcastaddr: 226.94.1.1
        #The multicast port to be used
        mcastport: 5405
        #The ringnumber assigned to this interface setting
        ringnumber: 0
    }

    About logging - I didn't touch any settings - they look like this in corosync.conf on node 2:
    logging {
        #Log to a specified file
        to_logfile: no
        #Log to syslog
        to_syslog: yes
        #Whether or not turning on the debug information in the log
        debug: off
        #Log timestamp as well
        timestamp: off
        #Log to the standard error output
        to_stderr: no
        #Logging file line in the source code as well
        fileline: off
        #Facility in syslog
        syslog_facility: daemon
    }

    I'm not really familiar with multicast - how can I make sure the hosts are seeing each other via multicast?

    Hope this clarifies the situation?

    Regards
    Gert

  4. Re: Unable to bring node 2 in a 2 node cluster online

    Hi Gert,

    > First, thanks for your response ;o)
    >
    > OS version, oh yes I forgot, sorry.
    > SLES11SP2 and the corresponding HA software.

    I have to admit that even my test cluster is still at SP1 :[

    > are both nodes set up with the correct interface definitions? Yes
    > Same multicast port & address, same network address? Yes
    > Is firewalling currently disabled to rule out errors in that department? Yes
    >
    > /etc/corosync/corosync.conf for node 1: [...]

    Yes, that definitely looks good so far.

    The logging statement is set up for syslog, so you might find some useful information there... but I expect it to be buried beneath lots of other messages. HA tends to be quite verbose once you're trying to get some useful information out of it.
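
    To cut through that noise, something like the following may help. It is only a sketch: it assumes syslog ends up in /var/log/messages (check your syslog configuration) and that the daemon tags its lines as corosync[PID]; corosync_messages is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: show the most recent syslog lines written by the
# corosync daemon.
# Assumptions: syslog is written to /var/log/messages and corosync lines
# carry the usual "corosync[PID]:" tag.
corosync_messages() {
    logfile="${1:-/var/log/messages}"
    count="${2:-50}"
    grep 'corosync\[' "$logfile" | tail -n "$count"
}

# Usage on a cluster node (needs read access to the log file):
#   corosync_messages                      # last 50 corosync lines
#   corosync_messages /var/log/messages 200
```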

    > I'm not really familiar with multicast - how can I make sure, the hosts are seeing each other via multicast?

    I'd run a "tcpdump -i <your cluster node's ethernet interface name goes here> -nvv port 5405 and host 226.94.1.1", which should give you quite some output like
    Code:
    cluster02:~ # tcpdump -i eth0 -nvv port 5405 and host 239.103.103.0
    tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
    16:17:12.149368 IP (tos 0x0, ttl 1, id 0, offset 0, flags [DF], proto UDP (17), length 147) 10.0.99.1.5404 > 239.103.103.0.5405: UDP, length 119
    16:17:12.542538 IP (tos 0x0, ttl 1, id 0, offset 0, flags [DF], proto UDP (17), length 147) 10.0.99.1.5404 > 239.103.103.0.5405: UDP, length 119
    16:17:12.936567 IP (tos 0x0, ttl 1, id 0, offset 0, flags [DF], proto UDP (17), length 147) 10.0.99.1.5404 > 239.103.103.0.5405: UDP, length 119
    16:17:13.332483 IP (tos 0x0, ttl 1, id 0, offset 0, flags [DF], proto UDP (17), length 147) 10.0.99.1.5404 > 239.103.103.0.5405: UDP, length 119
    16:17:13.724467 IP (tos 0x0, ttl 1, id 0, offset 0, flags [DF], proto UDP (17), length 147) 10.0.99.1.5404 > 239.103.103.0.5405: UDP, length 119
    16:17:14.124646 IP (tos 0x0, ttl 1, id 0, offset 0, flags [DF], proto UDP (17), length 147) 10.0.99.1.5404 > 239.103.103.0.5405: UDP, length 119
    16:17:14.514524 IP (tos 0x0, ttl 1, id 0, offset 0, flags [DF], proto UDP (17), length 147) 10.0.99.1.5404 > 239.103.103.0.5405: UDP, length 119
    16:17:14.909659 IP (tos 0x0, ttl 1, id 0, offset 0, flags [DF], proto UDP (17), length 147) 10.0.99.1.5404 > 239.103.103.0.5405: UDP, length 119
    16:17:15.301609 IP (tos 0x0, ttl 1, id 0, offset 0, flags [DF], proto UDP (17), length 147) 10.0.99.1.5404 > 239.103.103.0.5405: UDP, length 119
    16:17:15.698635 IP (tos 0x0, ttl 1, id 0, offset 0, flags [DF], proto UDP (17), length 147) 10.0.99.1.5404 > 239.103.103.0.5405: UDP, length 119
    ^C
    10 packets captured
    10 packets received by filter
    0 packets dropped by kernel
    of course with your mcast target address. (You can stop the trace via Ctrl-C).

    > Hope this clarifies the situation?
    >
    > Regards
    > Gert

    If you see mcast traffic on each node (originating from the respective other node, of course), you can run a quick check on the "ring status" via "corosync-cfgtool -s". If this reports "status = ring 0 active with no faults", then it's time to move up in the layers - if not, maybe syslog can help in identifying the root cause.
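
    If you want that ring check in a script, a rough sketch could look like this. rings_healthy is a made-up helper; it assumes every ring is reported on a "status = ..." line as quoted above and treats anything other than "active with no faults" as a problem.

```shell
#!/bin/sh
# Hypothetical helper: reads the output of "corosync-cfgtool -s" on stdin.
# Succeeds only if at least one "status = ..." line was seen and every
# such line reports "active with no faults".
rings_healthy() {
    awk '
        /status[ \t]*=/ {
            seen = 1
            if ($0 !~ /active with no faults/) bad = 1
        }
        END { exit (seen && !bad) ? 0 : 1 }
    '
}

# Usage on a cluster node:
#   corosync-cfgtool -s | rings_healthy && echo "ring OK" || echo "check syslog"
```

    Note that a healthy ring on a node only says the local totem ring has no faults; it does not prove the other node has joined.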

    If you don't see the required multicast traffic, then something is bogus at the network layer.
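
    One thing that can be checked locally in that case is whether the interface has joined the multicast group at all. The sketch below assumes the iproute2 "ip" tool is installed; group_joined is a hypothetical helper name and eth1 a placeholder for your cluster interface.

```shell
#!/bin/sh
# Hypothetical helper: reads "ip maddr show dev <interface>" output on
# stdin and succeeds if the given IPv4 multicast group is listed.
group_joined() {
    awk -v group="$1" '
        $1 == "inet" && $2 == group { found = 1 }
        END { exit found ? 0 : 1 }
    '
}

# Usage (eth1 is a placeholder for the cluster interface):
#   ip maddr show dev eth1 | group_joined 226.94.1.1 && echo "group joined"
```

    This only confirms the local group membership. Whether the switches in between actually forward the group is a separate question.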

    Oh, and I assume that your test cluster is either in a totally different IP network / multicast domain or is using a different mcast address/port combination.

    Regards,
    Jens

  5. #5

    Re: Unable to bring node 2 in a 2 node cluster online

    Hi Jens

    My test environment and my production environment can't see each other.

    My tcpdumps look pretty much like yours, but on node 1 I seem to see only its own traffic - the same applies to node 2.
    Node 1 IP addresses: 10.229.13.46 and 192.168.22.1
    Node 2 IP addresses: 10.229.13.47 and 192.168.22.2
    tcpdump on node 1:
    17:26:56.068681 IP (tos 0x0, ttl 1, id 0, offset 0, flags [DF], proto UDP (17), length 147) 192.168.22.1.5404 > 226.94.1.1.5405: UDP, length 119
    tcpdump on node 2:
    17:28:39.514092 IP (tos 0x0, ttl 1, id 0, offset 0, flags [DF], proto UDP (17), length 147) 192.168.22.2.5404 > 226.94.1.1.5405: UDP, length 119

    So it seems to be a network/communication problem - I would say something to do with multicast, as I can ping between 192.168.22.1 and 192.168.22.2 both ways...
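
    A quick way to confirm this from a capture is to tally the packets per source address, roughly as sketched below. count_sources is a hypothetical helper; it assumes tcpdump lines where the sender appears as "address.port" directly before the ">" sign, as in the dumps above.

```shell
#!/bin/sh
# Hypothetical helper: reads tcpdump output on stdin and prints
# "<packet count> <source address>" for every sender seen.
# Assumption: the sender is the field right before ">" and has the
# form address.port, e.g. 192.168.22.1.5404.
count_sources() {
    awk '
        {
            for (i = 2; i <= NF; i++) {
                if ($i == ">") {
                    src = $(i - 1)
                    sub(/\.[0-9]+$/, "", src)   # strip the port suffix
                    count[src]++
                }
            }
        }
        END { for (s in count) print count[s], s }
    ' | sort -k2
}

# Usage (interface name is a placeholder, stop the capture with Ctrl-C
# or limit it with -c):
#   tcpdump -i eth1 -nvv -c 20 port 5405 and host 226.94.1.1 | count_sources
```

    If only the node's own address shows up in the tally, the other node's multicast packets are not arriving.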

    Regards

    Gert

  6. Re: Unable to bring node 2 in a 2 node cluster online

    Hi Gert,

    > My tcpdumps look pretty much like yours, but on node 1 I seem to see only its own traffic - the same applies to node 2.

    That's pretty strange, indeed. My dumps only show the corresponding other node. Out of curiosity: Are those IP addresses both on the same interface per node, or do they have their own interface (and if yes, are those "physical" interfaces, i.e. eth0/eth1, or VLAN interfaces)?

    May I assume that you have (a) switch(es) between those two nodes? Running multicast across those may require special configuration at the switch level. If possible, you might want to try a crossover connection between the servers just for the cluster protocol, to rule out limitations of the production network.

    As a side note: It is recommended to have a second network path between production cluster nodes. As a matter of fact, I see ring failures between my nodes more often on the regular network path (redundant physical interfaces, bonded, going to the same hardware-redundant switch for all cluster servers); the most reliable path is the "backup path" via a non-managed switch. Of course, that backup switch does not need any reboots for maintenance reasons.

    Regards,
    Jens

  7. #7

    Re: Unable to bring node 2 in a 2 node cluster online

    Hi Jens,

    On each node, I have 2 network interfaces, eth0 and eth1 - they are both attached to the same VLAN, eth0 with a "real IP address" and eth1 with an address used only for the "cluster management" traffic.
    The nodes are virtual servers running on VMware ESX hosts - node 1 and node 2 are running on 2 different hosts at 2 different locations - so the crossover cable will be difficult, but for the test I could probably have node 2 migrated to the same host as node 1?
    I guess you are right, the switches need to be configured for multicasting.
    I will try the 2 "solutions" tomorrow...

    Thanks for your help

    Gert

  8. Re: Unable to bring node 2 in a 2 node cluster online

    Hi Gert,

    > node 1 and node 2 are running on 2 different hosts located on 2 different locations

    Corosync is said to be rather picky about network latency. And you might be better off trying to use unicast (see "transport" parameter) rather than multicast.
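
    For reference, here is a rough sketch of what the unicast variant could look like. It assumes a corosync 1.x release with UDP unicast support, where "transport: udpu" goes into the totem section and every node is listed as a member inside the interface block; later corosync versions use a nodelist section instead. print_udpu_settings is a made-up helper that only prints the lines for review, it does not touch any config file.

```shell
#!/bin/sh
# Hypothetical helper: print the totem settings for UDP unicast (udpu).
# Assumption: corosync 1.x syntax, i.e. one member { memberaddr: ... }
# block per node inside the interface section.
# Arguments: bind network, port, then one address per cluster node.
print_udpu_settings() {
    bindnet="$1"
    port="$2"
    shift 2
    printf '# merge into the existing totem { } section\n'
    printf 'transport: udpu\n'
    printf 'interface {\n'
    printf '    ringnumber: 0\n'
    printf '    bindnetaddr: %s\n' "$bindnet"
    printf '    mcastport: %s\n' "$port"
    for addr in "$@"; do
        printf '    member {\n'
        printf '        memberaddr: %s\n' "$addr"
        printf '    }\n'
    done
    printf '}\n'
}

# Usage with the addresses from this thread:
#   print_udpu_settings 192.168.22.0 5405 192.168.22.1 192.168.22.2
```

    The member list has to be identical on all nodes, and corosync needs a restart on each node after the change.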

    Depending on the WAN link stability, you must be prepared for occasional split-brain situations. I don't know what services your cluster will run, so it's up to you to decide what measures need to be taken when this happens...

    Regards,
    Jens

  9. #9

    Re: Unable to bring node 2 in a 2 node cluster online

    Hi Jens,

    I have now reconfigured the Cluster Communication Channel to use unicast - and now my nodes are seeing each other and I can migrate my resources etc.
    So now I'm happy

    Thanks again

    Gert

  10. Re: Unable to bring node 2 in a 2 node cluster online

    Hi Gert,

    it's nice to know you got it working in the end - thanks for reporting back!

    With regards,
    Jens
