We have a two-node DRBD cluster. Both nodes were previously running SLES10-SP3 and OES2-SP2; we did an in-place upgrade of one of the nodes to SLES11-SP1 and OES11.
Since the upgrade, the node running SLES11-SP1 refuses to keep its resource in the primary role.

Our configuration :
Node - Blade-2201-P
SLES10-SP3
OES2-SP2
DRBD 8.2.7

Node - Blade-2202-P
SLES11-SP1
OES11
DRBD 8.3.7 from SLES11-SP1 HAE pack.

DRBD is configured for dual-primary operation (allow-two-primaries, with become-primary-on both).

DRBD on node 2201 comes up and runs fine.
When we start DRBD on node 2202, the resource comes up as primary; after about 1 minute it changes role to secondary and becomes standalone.

To test, we have performed the following.
We invalidated node 2202. When DRBD is started it begins to synchronise, but after 1 minute the resource changes to secondary and becomes standalone. We kept restarting it until the node was fully synchronised, and the behaviour described above then continued.

We also disconnected node 2202 and issued the drbdadm primary r0 command. The resource becomes a standalone primary for 1 minute, then reverts to secondary.
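
To pin down exactly when the role flips, the state can be polled from /proc/drbd. Below is a rough Python sketch, assuming the 8.3-style status line (cs:... ro:.../...); 8.2 labels the role field st: instead, which the pattern also accepts. The names parse_status and watch are made up for illustration and are not part of DRBD.

```python
import re
import time

# Assumed /proc/drbd status line, for example:
#  0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----
# DRBD 8.2 prints "st:" where 8.3 prints "ro:".
STATUS = re.compile(
    r"^\s*(?P<minor>\d+):\s+cs:(?P<cs>\w+)\s+(?:ro|st):(?P<local>\w+)/(?P<peer>\w+)"
)


def parse_status(text):
    """Map each minor number to (connection state, local role, peer role)."""
    result = {}
    for line in text.splitlines():
        m = STATUS.match(line)
        if m:
            result[int(m.group("minor"))] = (
                m.group("cs"), m.group("local"), m.group("peer"))
    return result


def watch(path="/proc/drbd", interval=1.0):
    """Print a timestamped line whenever the parsed state changes."""
    last = None
    while True:
        with open(path) as fh:
            current = parse_status(fh.read())
        if current != last:
            print(time.strftime("%H:%M:%S"), current)
            last = current
        time.sleep(interval)

# On the node itself: call watch() and stop it with Ctrl-C.
```

Running this on node 2202 while reproducing the problem gives a timestamp for the role change that can be matched against syslog and cron entries.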

We have fully tested the network and have no network connectivity issues, and node 2202 was working perfectly until the upgrade to SLES11-SP1.

Does anyone have any idea why node 2202 keeps changing role to secondary and disconnecting from the other node?

Regards.


Config files and log outputs:

Node 2201
global {
    usage-count yes;
}

resource r0 {
    protocol C;

    handlers {
        pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
        pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -f";
        outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
        local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
        split-brain "/usr/lib/drbd/notify-split-brain.sh root";
    }

    startup {
        degr-wfc-timeout 120;
        become-primary-on both;
    }

    net {
        allow-two-primaries;
        after-sb-0pri discard-least-changes;
    }

    disk {
        on-io-error pass_on;
    }

    syncer {
        rate 1000M;
        al-extents 257;
    }

    on Blade-2202-P {
        device /dev/drbd0;
        disk /dev/cciss/c0d0p3;
        address 10.14.0.42:7788;
        flexible-meta-disk internal;
    }

    on Blade-2201-P {
        device /dev/drbd0;
        disk /dev/cciss/c0d0p3;
        address 10.14.0.41:7788;
        flexible-meta-disk internal;
    }
}

Jul 1 18:50:57 Blade-2201-P kernel: drbd0: Handshake successful: Agreed network protocol version 88
Jul 1 18:50:57 Blade-2201-P kernel: drbd0: conn( WFConnection -> WFReportParams )
Jul 1 18:50:57 Blade-2201-P kernel: drbd0: Starting asender thread (from drbd0_receiver [27662])
Jul 1 18:50:57 Blade-2201-P kernel: drbd0: data-integrity-alg: <not-used>
Jul 1 18:50:58 Blade-2201-P kernel: drbd0: drbd_sync_handshake:
Jul 1 18:50:58 Blade-2201-P kernel: drbd0: self 2C224DC42E0E03A9:47D1F9D901831687:A61220B09F104B5B:E8A65252749F334D
Jul 1 18:50:58 Blade-2201-P kernel: drbd0: peer 47D1F9D901831686:0000000000000000:A61220B09F104B5A:E8A65252749F334D
Jul 1 18:50:58 Blade-2201-P kernel: drbd0: uuid_compare()=1 by rule 7
Jul 1 18:50:58 Blade-2201-P kernel: drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> UpToDate )
Jul 1 18:50:58 Blade-2201-P kernel: drbd0: conn( WFBitMapS -> SyncSource ) pdsk( UpToDate -> Inconsistent )
Jul 1 18:50:58 Blade-2201-P kernel: drbd0: Began resync as SyncSource (will sync 7684 KB [1921 bits set]).
Jul 1 18:50:59 Blade-2201-P kernel: drbd0: Resync done (total 1 sec; paused 0 sec; 7684 K/sec)
Jul 1 18:50:59 Blade-2201-P kernel: drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
Jul 1 18:51:01 Blade-2201-P /usr/sbin/cron[26889]: (root) CMD (/root/gwcheck)
Jul 1 18:51:52 Blade-2201-P kernel: drbd0: peer( Secondary -> Unknown ) conn( Connected -> TearDown ) pdsk( UpToDate -> DUnknown )
Jul 1 18:51:52 Blade-2201-P kernel: drbd0: Creating new current UUID
Jul 1 18:51:52 Blade-2201-P kernel: drbd0: meta connection shut down by peer.
Jul 1 18:51:52 Blade-2201-P kernel: drbd0: asender terminated
Jul 1 18:51:52 Blade-2201-P kernel: drbd0: Terminating asender thread
Jul 1 18:51:52 Blade-2201-P kernel: drbd0: Connection closed
Jul 1 18:51:52 Blade-2201-P kernel: drbd0: conn( TearDown -> Unconnected )
Jul 1 18:51:52 Blade-2201-P kernel: drbd0: receiver terminated
Jul 1 18:51:52 Blade-2201-P kernel: drbd0: Restarting receiver thread
Jul 1 18:51:52 Blade-2201-P kernel: drbd0: receiver (re)started
Jul 1 18:51:52 Blade-2201-P kernel: drbd0: conn( Unconnected -> WFConnection )

Node 2202
global {
    usage-count yes;
}

resource r0 {
    protocol C;

    handlers {
        pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
        pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -f";
        outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
        local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
        split-brain "/usr/lib/drbd/notify-split-brain.sh root";
    }

    startup {
        degr-wfc-timeout 120;
        become-primary-on both;
    }

    net {
        allow-two-primaries;
        after-sb-0pri discard-least-changes;
    }

    disk {
        on-io-error detach;
    }

    syncer {
        rate 1000M;
        al-extents 257;
    }

    on Blade-2202-P {
        device /dev/drbd0;
        disk /dev/cciss/c0d0p3;
        address 10.14.0.42:7788;
        flexible-meta-disk internal;
    }

    on Blade-2201-P {
        device /dev/drbd0;
        disk /dev/cciss/c0d0p3;
        address 10.14.0.41:7788;
        flexible-meta-disk internal;
    }
}
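
Since both nodes are meant to share one configuration, it can help to compare the two files mechanically. As posted, the disk sections differ (on-io-error pass_on on node 2201, on-io-error detach on node 2202). Below is a small Python sketch using the standard difflib module; the function name config_diff and the node labels are illustrative only.

```python
import difflib


def config_diff(text_a, text_b, name_a="node-a", name_b="node-b"):
    """Return unified-diff lines between two config texts.

    Indentation and blank lines are ignored, so only real
    differences in the settings show up.
    """
    def normalise(text):
        return [ln.strip() for ln in text.splitlines() if ln.strip()]

    return list(difflib.unified_diff(
        normalise(text_a), normalise(text_b),
        fromfile=name_a, tofile=name_b, lineterm=""))


if __name__ == "__main__":
    # Only the disk sections from the two configs above are compared here.
    a = "disk {\non-io-error pass_on;\n}\n"
    b = "disk {\n    on-io-error detach;\n}\n"
    for line in config_diff(a, b, "Blade-2201-P", "Blade-2202-P"):
        print(line)
```

In practice the two inputs would be the contents of the drbd.conf file read from each node.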

Jul 1 18:50:57 Blade-2202-P kernel: [248478.687385] drbd: initialized. Version: 8.3.7 (api:88/proto:86-91)
Jul 1 18:50:57 Blade-2202-P kernel: [248478.687388] drbd: GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by phil@fat-tyre, 2010-01-13 17:17:27
Jul 1 18:50:57 Blade-2202-P kernel: [248478.687390] drbd: registered as block device major 147
Jul 1 18:50:57 Blade-2202-P kernel: [248478.687393] drbd: minor_table @ 0xffff88022cfd1ec0
Jul 1 18:50:57 Blade-2202-P kernel: [248478.800714] block drbd0: Starting worker thread (from cqueue [13875])
Jul 1 18:50:57 Blade-2202-P kernel: [248478.800789] block drbd0: disk( Diskless -> Attaching )
Jul 1 18:50:57 Blade-2202-P kernel: klogd 1.4.1, ---------- state change ----------
Jul 1 18:50:57 Blade-2202-P kernel: [248478.821584] block drbd0: No usable activity log found.
Jul 1 18:50:57 Blade-2202-P kernel: [248478.821588] block drbd0: Method to ensure write ordering: barrier
Jul 1 18:50:57 Blade-2202-P kernel: [248478.821593] block drbd0: max_segment_size ( = BIO size ) = 32768
Jul 1 18:50:57 Blade-2202-P kernel: [248478.821597] block drbd0: drbd_bm_resize called with capacity == 64338280
Jul 1 18:50:57 Blade-2202-P kernel: [248478.821946] block drbd0: resync bitmap: bits=8042285 words=125661
Jul 1 18:50:57 Blade-2202-P kernel: [248478.821950] block drbd0: size = 31 GB (32169140 KB)
Jul 1 18:50:57 Blade-2202-P kernel: [248478.848324] block drbd0: recounting of set bits took additional 0 jiffies
Jul 1 18:50:57 Blade-2202-P kernel: [248478.848327] block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Jul 1 18:50:57 Blade-2202-P kernel: [248478.848336] block drbd0: disk( Attaching -> UpToDate )
Jul 1 18:50:57 Blade-2202-P kernel: [248478.848346] block drbd0: Barriers not supported on meta data device - disabling
Jul 1 18:50:57 Blade-2202-P kernel: [248478.873276] block drbd0: conn( StandAlone -> Unconnected )
Jul 1 18:50:57 Blade-2202-P kernel: [248478.873292] block drbd0: Starting receiver thread (from drbd0_worker [31013])
Jul 1 18:50:57 Blade-2202-P kernel: [248478.873380] block drbd0: receiver (re)started
Jul 1 18:50:57 Blade-2202-P kernel: [248478.873384] block drbd0: conn( Unconnected -> WFConnection )
Jul 1 18:50:57 Blade-2202-P kernel: [248478.972043] block drbd0: Handshake successful: Agreed network protocol version 88
Jul 1 18:50:57 Blade-2202-P kernel: [248478.972052] block drbd0: conn( WFConnection -> WFReportParams )
Jul 1 18:50:57 Blade-2202-P kernel: [248478.972073] block drbd0: Starting asender thread (from drbd0_receiver [31030])
Jul 1 18:50:57 Blade-2202-P kernel: [248478.972135] block drbd0: data-integrity-alg: <not-used>
Jul 1 18:50:57 Blade-2202-P kernel: [248478.972505] block drbd0: drbd_sync_handshake:
Jul 1 18:50:57 Blade-2202-P kernel: [248478.972509] block drbd0: self 47D1F9D901831686:0000000000000000:A61220B09F104B5A:E8A65252749F334D bits:0 flags:0
Jul 1 18:50:57 Blade-2202-P kernel: [248478.972512] block drbd0: peer 2C224DC42E0E03A9:47D1F9D901831687:A61220B09F104B5B:E8A65252749F334D bits:1921 flags:0
Jul 1 18:50:57 Blade-2202-P kernel: [248478.972514] block drbd0: uuid_compare()=-1 by rule 50
Jul 1 18:50:57 Blade-2202-P kernel: [248478.972520] block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
Jul 1 18:50:58 Blade-2202-P kernel: [248479.389507] block drbd0: conn( WFBitMapT -> WFSyncUUID )
Jul 1 18:50:58 Blade-2202-P kernel: [248479.391950] block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0
Jul 1 18:50:58 Blade-2202-P kernel: [248479.394007] block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0)
Jul 1 18:50:58 Blade-2202-P kernel: [248479.394015] block drbd0: conn( WFSyncUUID -> SyncTarget ) disk( UpToDate -> Inconsistent )
Jul 1 18:50:58 Blade-2202-P kernel: [248479.394023] block drbd0: Began resync as SyncTarget (will sync 7684 KB [1921 bits set]).
Jul 1 18:50:58 Blade-2202-P kernel: [248479.397233] block drbd0: write: error=-95 s=7443824s
Jul 1 18:50:58 Blade-2202-P kernel: [248479.397236] block drbd0: Method to ensure write ordering: flush
Jul 1 18:50:58 Blade-2202-P kernel: [248479.700314] block drbd0: local disk flush failed with status -95
Jul 1 18:50:58 Blade-2202-P kernel: [248479.700316] block drbd0: Method to ensure write ordering: drain
Jul 1 18:50:59 Blade-2202-P kernel: [248480.548041] block drbd0: Retrying drbd_rs_del_all() later. refcnt=30
Jul 1 18:50:59 Blade-2202-P kernel: [248480.656719] block drbd0: Resync done (total 1 sec; paused 0 sec; 7684 K/sec)
Jul 1 18:50:59 Blade-2202-P kernel: [248480.656727] block drbd0: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
Jul 1 18:50:59 Blade-2202-P kernel: [248480.656733] block drbd0: helper command: /sbin/drbdadm after-resync-target minor-0
Jul 1 18:50:59 Blade-2202-P kernel: [248480.658518] block drbd0: helper command: /sbin/drbdadm after-resync-target minor-0 exit code 0 (0x0)
Jul 1 18:51:52 Blade-2202-P kernel: [248533.871495] block drbd0: peer( Primary -> Unknown ) conn( Connected -> Disconnecting ) pdsk( UpToDate -> DUnknown )
Jul 1 18:51:52 Blade-2202-P kernel: [248533.871511] block drbd0: short read expecting header on sock: r=-512
Jul 1 18:51:52 Blade-2202-P kernel: [248533.871521] block drbd0: asender terminated
Jul 1 18:51:52 Blade-2202-P kernel: [248533.871526] block drbd0: Terminating asender thread
Jul 1 18:51:52 Blade-2202-P kernel: [248533.887555] block drbd0: Connection closed
Jul 1 18:51:52 Blade-2202-P kernel: [248533.887564] block drbd0: conn( Disconnecting -> StandAlone )
Jul 1 18:51:52 Blade-2202-P kernel: [248533.887580] block drbd0: receiver terminated
Jul 1 18:51:52 Blade-2202-P kernel: [248533.887582] block drbd0: Terminating receiver thread
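
For anyone reading the two logs side by side, the state-change lines (peer( ... ) conn( ... ) disk( ... ) pdsk( ... )) carry most of the information. Below is a minimal Python sketch, assuming the syslog layout shown above, that extracts just those transitions into a timeline. The name parse_transitions is made up for illustration.

```python
import re

# One "name( Old -> New )" group inside a DRBD kernel log line.
TRANSITION = re.compile(r"(\w+)\(\s*(\w+)\s*->\s*(\w+)\s*\)")
# Syslog prefix such as "Jul 1 18:51:52 Blade-2201-P kernel:".
PREFIX = re.compile(r"^(\w{3}\s+\d+\s+[\d:]+)\s+(\S+)\s+kernel:")

# Three lines copied from the logs above.
SAMPLE = [
    "Jul 1 18:51:52 Blade-2201-P kernel: drbd0: peer( Secondary -> Unknown ) "
    "conn( Connected -> TearDown ) pdsk( UpToDate -> DUnknown )",
    "Jul 1 18:51:52 Blade-2201-P kernel: drbd0: Creating new current UUID",
    "Jul 1 18:51:52 Blade-2202-P kernel: [248533.887564] block drbd0: "
    "conn( Disconnecting -> StandAlone )",
]


def parse_transitions(lines):
    """Return (timestamp, host, {field: (old, new)}) for each state change."""
    events = []
    for line in lines:
        head = PREFIX.match(line)
        if not head:
            continue  # not a kernel message
        changes = {name: (old, new)
                   for name, old, new in TRANSITION.findall(line)}
        if changes:
            events.append((head.group(1), head.group(2), changes))
    return events


if __name__ == "__main__":
    for stamp, host, changes in parse_transitions(SAMPLE):
        print(stamp, host, changes)
```

Feeding the full /var/log/messages from both nodes through this and sorting by timestamp shows what each side saw at 18:51:52 without the surrounding noise.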