Talking about cluster

zabidin2
08-Jul-2013, 05:00
Hi,

I have 2 servers (server A and server B). Both connect to SAN storage using OCFS2. The issue is: when server A is down, why can't server B access the cluster?

This defeats the purpose of creating a cluster. Can someone who is really an expert with SUSE explain this in detail?

Thanks.

jmozdzen
08-Jul-2013, 10:46
Hi zabidin2,

> I have 2 servers (server A and server B). Both connect to SAN storage using OCFS2. The issue is: when server A is down, why can't server B access the cluster?
> [...] Can someone who is really an expert with SUSE explain this in detail?

I'm sorry, but it's you who'll have to provide some more details first ;)

Which SLES are you talking about? SLES11SP2 with latest patches? SLES11SP3? Do you have HAE installed and if not, how did you put OCFS2 on your servers? How's OCFS2 set up, specifically concerning the heartbeat? Are you running OCFS2 "stand-alone" or have you integrated it with Pacemaker?

Your description seems a bit ambiguous to me, I apologize for that. In my terms, the "cluster" would be the combination of server A and B, but I sense that you're asking about accessing the SAN-LUN-based OCFS2 filesystem, is that correct?

Regards,
Jens

zabidin2
08-Jul-2013, 13:48
Sorry for not giving details. The last person configured it as stand-alone. I installed drbd, drbd-heartbeat and drbd-pacemaker. Does SUSE provide a tutorial on how to configure it? I'm using SLES VERSION = 11, PATCHLEVEL = 1.

Currently we are running a custom web application, so we don't want to upgrade. If we upgrade, we need to tune it again, and that is hard work. Please assist me in making my cluster work as it should.

Thanks.

MoserHans
08-Jul-2013, 15:54
> Sorry for not giving details. The last person configured it as stand-alone. I installed drbd, drbd-heartbeat and drbd-pacemaker. Does SUSE provide a tutorial on how to configure it? I'm using SLES VERSION = 11, PATCHLEVEL = 1.

Try https://www.suse.com/documentation/sle_ha/

jmozdzen
08-Jul-2013, 16:04
Hi zabidin2,

ok, since you only mention SLES11SP1, I assume you have no HAE installed, but OCFS2 etc. from other sources (opensuse.org?).

> Does SUSE provide a tutorial on how to configure it? I'm using SLES VERSION = 11, PATCHLEVEL = 1.
Last time I looked there were detailed docs covering this, but based on SLES 11 HAE - the "high-availability extensions" add-on containing the supported versions of Pacemaker, OCFS2 etc.

How detailed is your experience in this sector (just to "calibrate" our discussion level)? Are you using heartbeat or pacemaker to control your resources (other than OCFS2, which you say is stand-alone)?

> The issue is: when server A is down, why can't server B access the cluster?
"server A down" can be an uncontrolled failure (power failure, system crash,...) or an operator-initiated shutdown. Which of the two (or both?) are you referring to? What exactly are the symptoms you're experiencing? What's in the logs, e.g. syslog and dmesg, when you lose access?

Losing "access to a cluster" has many faces and can have many causes. So we first need to dissect the actual situation to get to the cause, that's why I have so many questions...
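
For the log question above, a minimal sketch of how the relevant lines could be pulled out of dmesg or syslog. The function name `filter_cluster_logs` and the keyword list are assumptions chosen for illustration; the keywords are the usual prefixes of OCFS2-related kernel messages, and the list may need extending for a given setup.

```shell
#!/bin/sh
# Keep only log lines that mention the OCFS2 cluster components.
# Reads log text on stdin, so it works with dmesg output or a syslog file.
# The keyword list is an assumption, not an exhaustive set.
filter_cluster_logs() {
    grep -i -E 'ocfs2|o2net|o2hb|o2cb|dlm'
}

# Typical use on a node (the syslog path is the SLES default, adjust as needed):
#   dmesg | filter_cluster_logs
#   filter_cluster_logs < /var/log/messages | tail -n 50
```

Posting the filtered lines from around the time server A went down, from both nodes, would probably give enough to work with.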

Regards,
Jens

zabidin2
10-Jul-2013, 05:42
I'm using heartbeat to control my resource. Last time OCFS2 went down because iptables had no rule for the heartbeat port. I do not know how to configure Pacemaker.

When I searched, the documentation only showed how to configure SLES 11 SP3; I didn't find anything for SP1. When I install the cluster using YaST, it's totally different from the documentation. So I'm stuck with the configuration for SP1.

jmozdzen
10-Jul-2013, 16:01
Hi zabidin2,

>> Are you running OCFS2 "stand-alone" or have you integrated it with Pacemaker?
> The last person configured it as stand-alone.

but now
> I'm using heartbeat to control my resource.

is a bit confusing to me. Typically you'd integrate OCFS2 with the cluster stack you're running, unless you have no cluster stack...

> When I install the cluster using YaST

So are you using HAE then?

So far, you have provided only tiny bits of information, leaving a lot of guesswork for us - we won't be able to provide much help that way. Clustering is a rather complex subject.

- What is your software starting point? SLES11SP1 as a base system you have already mentioned, but that contains neither OCFS2 nor Heartbeat/Pacemaker. Where do those come from?
- How is your (OCFS2 and most probably heartbeat/Pacemaker) software set up? Please share relevant sections of configuration and/or live status information, e.g. from /sys and the CIB.
- Try to describe the problematic situation as precisely as possible: What did work, what happened, what were the problems arising then?
- Please include *details* of the symptoms you were experiencing - "stopped working" would not be helpful, "tried to open a file for reading but the open call never returned" lets us understand much better
- Please try to fetch relevant information from the log files, e.g. syslog, dmesg and other probably relevant logs

To provide helpful responses, we need to know what is happening - and you're the only one that can describe it to us.

Regards,
Jens

zabidin2
11-Jul-2013, 03:39
This is the configuration:

svr-web1:/etc/ocfs2 # cat cluster.conf
cluster:
	name = MyIPO
	node_count = 2

node:
	name = svr-web1
	cluster = MyIPO
	number = 0
	ip_address = 192.168.1.26
	ip_port = 7777

node:
	name = svr-web2
	cluster = MyIPO
	number = 1
	ip_address = 192.168.1.27
	ip_port = 7777
svr-web1:/etc/ocfs2 #

svr-web1:/etc # cat issue

Welcome to SUSE Linux Enterprise Server 11 SP1 (x86_64) - Kernel \r (\l).
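
Since an earlier outage in this thread was traced to iptables blocking the cluster port, here is a rough sketch that reads a cluster.conf like the one above and prints (it does not apply) one ACCEPT rule per node. The function name `print_o2cb_rules` is made up for this example, and it assumes the o2cb interconnect uses TCP on the `ip_port` from the file. On SLES the firewall is normally managed through SuSEfirewall2, so hand-added iptables rules are only a way to test the idea.

```shell
#!/bin/sh
# Print (not apply) one iptables ACCEPT rule per node listed in an OCFS2
# cluster.conf, so the cluster TCP port is reachable between the nodes.
# The config path can be overridden as the first argument.
print_o2cb_rules() {
    conf=${1:-/etc/ocfs2/cluster.conf}
    awk '
        $1 == "ip_address" { ip = $3 }
        $1 == "ip_port"    { port = $3 }
        # Emit a rule once both values of a node stanza have been seen.
        ip != "" && port != "" {
            printf "iptables -A INPUT -p tcp -s %s --dport %s -j ACCEPT\n", ip, port
            ip = ""; port = ""
        }
    ' "$conf"
}

# Usage: review the printed rules before running any of them by hand.
#   print_o2cb_rules /etc/ocfs2/cluster.conf
```

Each node only needs the rule for the other node's address, so the output would be trimmed accordingly on svr-web1 and svr-web2.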

zabidin2
11-Jul-2013, 03:46
svr-web2:~ # /etc/init.d/o2cb status
Driver for "configfs": Loaded
Filesystem "configfs": Mounted
Stack glue driver: Loaded
Stack plugin "o2cb": Loaded
Driver for "ocfs2_dlmfs": Loaded
Filesystem "ocfs2_dlmfs": Mounted
Checking O2CB cluster MyIPO: Online
Heartbeat dead threshold = 7
Network idle timeout: 5000
Network keepalive delay: 1000
Network reconnect delay: 2000
Checking O2CB heartbeat: Active

zabidin2
11-Jul-2013, 03:49
============
Last updated: Thu Jul 11 10:48:26 2013
Stack: openais
Current DC: svr-web1 - partition WITHOUT quorum
Version: 1.1.2-2e096a41a5f9e184a1c1537c82c6da1093698eb5
1 Nodes configured, 2 expected votes
0 Resources configured.
============

Online: [ svr-web1 ]

jmozdzen
11-Jul-2013, 08:53
Hi zabidin2,

those look fine to me. A single-node cluster does indeed look strange, but that's probably just an interim status until everything is up & configured - it doesn't even run resources, so there should be no harm done.

Regards,
Jens

zabidin2
11-Jul-2013, 09:07
What are resources for? I'm not an expert, just a junior person.

jmozdzen
11-Jul-2013, 11:34
Hi zabidin2,

> What are resources for? I'm not an expert, just a junior person.

ah, ok - then a different approach is required :) You're in for a steep learning curve.

This also explains why most answers are still missing, so I'll be more precise in my questions.

First of all, please check in YaST (Software - Add-on products) if the "SUSE Linux Enterprise High Availability Extension 11 SP1" is installed, so that we know if you have an installation based on official packages (OCFS2, clustering) or if those parts are from some other source.

Then some statements concerning clustering:

In an earlier post you wrote
> I'm using heartbeat to control my resource

but your crm_mon output shows that your "cluster" consists of only one node and has no resources configured:


============
Last updated: Thu Jul 11 10:48:26 2013
Stack: openais
Current DC: svr-web1 - partition WITHOUT quorum
Version: 1.1.2-2e096a41a5f9e184a1c1537c82c6da1093698eb5
1 Nodes configured, 2 expected votes
0 Resources configured.
============

Online: [ svr-web1 ]

So currently, there is no resource, and the main purpose of the clustering software (moving "resources" between nodes) is defeated, as there is only a single node.
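
To make the points above easy to re-check after each configuration change, here is a small sketch that condenses the header of one-shot `crm_mon -1` output into a single line. The function name `summarize_crm_mon` is invented for this example, and the parsing assumes the header layout shown in the output above; other Pacemaker versions may format it differently.

```shell
#!/bin/sh
# Summarize the header of crm_mon output read from stdin: configured
# nodes, expected votes, configured resources and whether quorum is held.
summarize_crm_mon() {
    awk '
        BEGIN { quorum = "unknown" }
        /partition WITHOUT quorum/ { quorum = "no" }
        /partition with quorum/    { quorum = "yes" }
        / Nodes configured/ {
            nodes = $1
            # The vote count is the field just before the word "expected".
            for (i = 2; i <= NF; i++) if ($i == "expected") votes = $(i - 1)
        }
        / Resources configured/ { resources = $1 }
        END {
            printf "nodes=%s expected_votes=%s resources=%s quorum=%s\n",
                   nodes, votes, resources, quorum
        }
    '
}

# Usage on a cluster node:
#   crm_mon -1 | summarize_crm_mon
```

For the output quoted above this would report one node, two expected votes, zero resources and no quorum.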

> What are resources for?

"Resources" are the "entities controlled by the cluster management". They can be IP addresses (that need to be moved from serverA to serverB in case serverA fails), file system mounts (which, in the case of OCFS2, can be active on all cluster nodes in parallel), processes and subsystems (like a MySQL DBMS that may only be active on one node at a time, or httpd running in parallel on more than one node for load sharing).

Back to your situation: You in fact currently seem to be running two clusters with separate cluster management stacks:
1. OCFS2
2. Pacemaker/openais

The usual recommendation is to use only a single cluster stack, that's why OCFS2 can use either its own cluster stack or plug in to Pacemaker:

# cat /sys/fs/ocfs2/cluster_stack
pcmk
(that is taken from one of our installations)
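
The same check wrapped into a small helper that also says what the value means. The function name `ocfs2_stack` is made up for this sketch; `o2cb` and `pcmk` are the two values discussed in this thread, anything else is just echoed back. Given the `Stack plugin "o2cb": Loaded` line in the o2cb status posted earlier, svr-web2 would presumably report `o2cb`.

```shell
#!/bin/sh
# Report which cluster stack OCFS2 is using, based on the sysfs file
# shown above. The path can be overridden as the first argument.
ocfs2_stack() {
    f=${1:-/sys/fs/ocfs2/cluster_stack}
    if [ ! -r "$f" ]; then
        echo "unknown (cannot read $f; is the ocfs2 module loaded?)"
        return 1
    fi
    stack=$(cat "$f")
    case "$stack" in
        o2cb) echo "o2cb: OCFS2 is using its own cluster stack" ;;
        pcmk) echo "pcmk: OCFS2 is plugged into Pacemaker" ;;
        *)    echo "$stack: other stack" ;;
    esac
}

# Usage on each node:
#   ocfs2_stack
```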

Your OCFS2 (o2cb service and file system(s)) is then configured as resources of your Pacemaker cluster. This is covered quite well in the SLES documentation (see e.g. https://www.suse.com/documentation/sle_ha/singlehtml/book_sleha/book_sleha.html#cha.ha.ocfs2), I recommend reading the HAE guide as a starter.
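
To give an idea of what "OCFS2 as Pacemaker resources" looks like, here is a sketch that only prints a crm configuration in the style of the HAE guide's OCFS2 chapter. Treat every detail as an assumption to verify against the guide for your installed version: the function name `print_ocfs2_crm_config`, the resource IDs, the timeouts, and above all the device and mount point, which are placeholders and not values from this thread.

```shell
#!/bin/sh
# Write a Pacemaker configuration sketch for OCFS2 to stdout.
# The device and mount point arguments are placeholders; nothing is
# applied to the cluster by this function.
print_ocfs2_crm_config() {
    device=${1:-/dev/sdX1}
    mountpoint=${2:-/mnt/shared}
    cat <<EOF
primitive dlm ocf:pacemaker:controld \\
    op monitor interval="60" timeout="60"
primitive o2cb ocf:ocfs2:o2cb \\
    op monitor interval="60" timeout="60"
primitive ocfs2-fs ocf:heartbeat:Filesystem \\
    params device="$device" directory="$mountpoint" fstype="ocfs2" \\
    op monitor interval="20" timeout="40"
group base-group dlm o2cb ocfs2-fs
clone base-clone base-group meta interleave="true"
EOF
}

# Usage, on a TEST cluster only:
#   print_ocfs2_crm_config /dev/sdX1 /mnt/shared > ocfs2.crm
#   # review ocfs2.crm, then: crm configure load update ocfs2.crm
```

The group keeps the start order (DLM, then o2cb, then the mount), and the clone runs that group on every node, which matches the "active on all nodes in parallel" nature of OCFS2.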

> When I searched, the documentation only showed how to configure SLES 11 SP3; I didn't find anything for SP1. When I install the cluster using YaST, it's totally different from the documentation.

Might that be because most of the cluster is usually configured outside of YaST? The tools described in the HAE guide, especially "crm", are used from the command line, which to many administrators is the preferred way to interact with Linux systems anyhow. If you don't have HAE installed (see the initial question), the "YaST parts" will be different - but the basic configuration tasks (setting up OCFS2 & Pacemaker) are extremely similar.

Please take some time to read the guide, try the steps mentioned there to set up Pacemaker to control your OCFS2 filesystem(s), and feel free to ask here whenever you do not understand something written there or if you feel that SP1 is totally different from what's described in the manual. I had a quick browse through the text and nothing SP3-specific caught my eye, but SP1 is a bit older, so I may have overlooked something.

Important: Get yourself a test environment - do not experiment with clustering on production servers. Clusters tend to behave differently from the way that was expected - and that includes taking down the whole server.

Something else caught my eye when reading through your messages:

> I have 2 servers (server A and server B). Both connect to SAN storage using OCFS2
> I installed drbd, drbd-heartbeat and drbd-pacemaker.

Since you classify yourself as "junior", I have to question those statements: They don't seem to go together well. As I see it, you have either
- SAN storage, accessible to all cluster nodes via some storage protocol (Fibre Channel, iSCSI, shared SCSI)
- *or* local storage on (two) servers, that is synchronized and presented to upper layers as a single storage, via DRBD

So if you already have SAN storage available ("accessing the same block device from multiple servers"), there's no need for DRBD. For the sake of clarity in future discussion, could you please clarify if DRBD is a required part of your setup or if it was only installed because typical cluster documentation mentions it? (the typical "small cluster" has two servers and no SAN... then DRBD can be part of the picture)
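
One way to answer the DRBD question is to look at the status file the DRBD kernel module provides. A minimal sketch, assuming a DRBD 8.x style /proc/drbd where each configured device shows up as a line like ` 0: cs:...`; the function name `drbd_usage` and the three result strings are invented for this example.

```shell
#!/bin/sh
# Classify DRBD usage from the /proc/drbd status file (DRBD 8.x layout
# assumed). The path can be overridden as the first argument.
drbd_usage() {
    f=${1:-/proc/drbd}
    if [ ! -r "$f" ]; then
        # No status file: the drbd kernel module is not loaded.
        echo "not loaded"
    elif grep -E '^ *[0-9]+: cs:' "$f" | grep -q -v 'cs:Unconfigured'; then
        # At least one device line with a real connection state.
        echo "in use"
    else
        echo "loaded, no resources"
    fi
}

# Usage on each node:
#   drbd_usage
```

If both nodes report "not loaded" or "loaded, no resources", the DRBD packages were most likely installed but never put to use, and the setup is purely SAN-based.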

Regards,
Jens