libnetsnmp.so.30.0.3 segfault and crm_mon coredump



berndgsflinux
16-Jan-2019, 18:13
Hi,

I tried to create a ClusterMon resource:
ha-idg-1:~ # crm configure show SNMP
primitive SNMP ocf:pacemaker:ClusterMon \
params user=root \
params update=5000 \
params extra_options="-S vm49093-4.scidom.de -C idg-ha" \
params htmlfile="/srv/www/hawk/public/crm_mon.html" \
op start timeout=20 interval=0 \
op stop timeout=20 interval=0 \
op monitor interval=30 timeout=20

ClusterMon uses crm_mon, but the resource always fails:
first /usr/lib64/libnetsnmp.so.30.0.3 causes a segfault, and immediately afterwards crm_mon dumps core.

This is the typical sequence:


2019-01-16T14:12:35.921439+01:00 ha-idg-1 pengine[5690]: warning: Processing failed monitor of SNMP:0 on ha-idg-1: not running
2019-01-16T14:12:35.924387+01:00 ha-idg-1 pengine[5690]: warning: Processing failed monitor of SNMP:1 on ha-idg-2: not running
2019-01-16T14:12:35.925833+01:00 ha-idg-1 pengine[5690]: notice: * Recover SNMP:0 ( ha-idg-1 )
2019-01-16T14:12:35.926191+01:00 ha-idg-1 pengine[5690]: notice: Calculated transition 191, saving inputs in /var/lib/pacemaker/pengine/pe-input-2406.bz2
2019-01-16T14:12:35.944837+01:00 ha-idg-1 pengine[5690]: warning: Processing failed monitor of SNMP:0 on ha-idg-1: not running
2019-01-16T14:12:35.945743+01:00 ha-idg-1 pengine[5690]: warning: Processing failed monitor of SNMP:1 on ha-idg-2: not running
2019-01-16T14:12:35.949082+01:00 ha-idg-1 pengine[5690]: notice: * Recover SNMP:0 ( ha-idg-1 )
2019-01-16T14:12:35.949952+01:00 ha-idg-1 pengine[5690]: notice: Calculated transition 192, saving inputs in /var/lib/pacemaker/pengine/pe-input-2407.bz2
2019-01-16T14:12:35.950240+01:00 ha-idg-1 crmd[5691]: notice: Processing graph 192 (ref=pe_calc-dc-1547644355-512) derived from /var/lib/pacemaker/pengine/pe-input-2407.bz2
2019-01-16T14:12:35.950463+01:00 ha-idg-1 crmd[5691]: notice: Initiating stop operation SNMP_stop_0 locally on ha-idg-1
2019-01-16T14:12:35.951522+01:00 ha-idg-1 lrmd[5687]: notice: executing - rsc:SNMP action:stop call_id:242
2019-01-16T14:12:35.967848+01:00 ha-idg-1 lrmd[5687]: notice: SNMP_stop_0:29153:stderr [ /usr/lib/ocf/resource.d/pacemaker/ClusterMon: line 147: kill: (28105) - No such process ]
2019-01-16T14:12:35.968248+01:00 ha-idg-1 lrmd[5687]: notice: finished - rsc:SNMP action:stop call_id:242 pid:29153 exit-code:0 exec-time:17ms queue-time:0ms
2019-01-16T14:12:35.968657+01:00 ha-idg-1 crmd[5691]: notice: Result of stop operation for SNMP on ha-idg-1: 0 (ok)
2019-01-16T14:12:35.971903+01:00 ha-idg-1 crmd[5691]: notice: Initiating start operation SNMP_start_0 locally on ha-idg-1
2019-01-16T14:12:35.972624+01:00 ha-idg-1 lrmd[5687]: notice: executing - rsc:SNMP action:start call_id:243
2019-01-16T14:12:35.989012+01:00 ha-idg-1 su: pam_unix(su-l:session): session opened for user root by (uid=0)
2019-01-16T14:12:35.991876+01:00 ha-idg-1 systemd[1]: Started Session c4 of user root.
2019-01-16T14:12:36.046399+01:00 ha-idg-1 su: pam_unix(su-l:session): session closed for user root
2019-01-16T14:12:36.049003+01:00 ha-idg-1 lrmd[5687]: notice: finished - rsc:SNMP action:start call_id:243 pid:29158 exit-code:0 exec-time:76ms queue-time:1ms
2019-01-16T14:12:36.049729+01:00 ha-idg-1 crmd[5691]: notice: Result of start operation for SNMP on ha-idg-1: 0 (ok)
2019-01-16T14:12:36.055968+01:00 ha-idg-1 crmd[5691]: notice: Initiating monitor operation SNMP_monitor_30000 locally on ha-idg-1
2019-01-16T14:12:36.062611+01:00 ha-idg-1 crmd[5691]: notice: Transition 192 aborted by operation SNMP_monitor_30000 'modify' on ha-idg-2: Old event
2019-01-16T14:12:36.098341+01:00 ha-idg-1 crmd[5691]: notice: Transition 192 (Complete=8, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-2407.bz2): Complete
2019-01-16T14:12:36.107834+01:00 ha-idg-1 kernel: [157578.314958] crm_mon[29187]: segfault at 6c ip 00007fc2ff4d928d sp 00007fff6e231800 error 4 in libnetsnmp.so.30.0.3[7fc2ff49e000+c8000]
2019-01-16T14:12:36.123148+01:00 ha-idg-1 pengine[5690]: warning: Processing failed monitor of SNMP:0 on ha-idg-1: not running
2019-01-16T14:12:36.124655+01:00 ha-idg-1 pengine[5690]: warning: Processing failed monitor of SNMP:1 on ha-idg-2: not running
2019-01-16T14:12:36.127945+01:00 ha-idg-1 pengine[5690]: notice: * Recover SNMP:1 ( ha-idg-2 )
2019-01-16T14:12:36.129056+01:00 ha-idg-1 pengine[5690]: notice: Calculated transition 193, saving inputs in /var/lib/pacemaker/pengine/pe-input-2408.bz2
2019-01-16T14:12:36.129795+01:00 ha-idg-1 crmd[5691]: notice: Processing graph 193 (ref=pe_calc-dc-1547644356-516) derived from /var/lib/pacemaker/pengine/pe-input-2408.bz2
2019-01-16T14:12:36.130047+01:00 ha-idg-1 crmd[5691]: notice: Initiating stop operation SNMP_stop_0 on ha-idg-2
2019-01-16T14:12:36.153502+01:00 ha-idg-1 crmd[5691]: notice: Initiating start operation SNMP_start_0 on ha-idg-2
2019-01-16T14:12:36.244619+01:00 ha-idg-1 crmd[5691]: notice: Initiating monitor operation SNMP_monitor_30000 on ha-idg-2
2019-01-16T14:12:36.288010+01:00 ha-idg-1 crmd[5691]: notice: Transition 193 (Complete=8, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-2408.bz2): Complete
2019-01-16T14:12:36.288350+01:00 ha-idg-1 crmd[5691]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE
2019-01-16T14:12:37.371629+01:00 ha-idg-1 systemd-coredump[29197]: Process 29187 (crm_mon) of user 0 dumped core.


It's always the same:
the cluster detects that the resource SNMP isn't running, stops it, and starts it again.
crm_mon segfaults while accessing the library and terminates with a core dump. The next monitor operation for SNMP detects that it isn't running, and the cycle starts again, every 30 seconds.

Any ideas?

Bernd

smflood
19-Jan-2019, 11:57
Which version of SUSE Linux Enterprise Server (SLES) and High
Availability Extension are you using?

HTH.
--
Simon Flood
SUSE Knowledge Partner



berndgsflinux
21-Jan-2019, 11:40
> Which version of SUSE Linux Enterprise Server (SLES) and High
> Availability Extension are you using?

It's SP4.

Bernd

jmozdzen
22-Jan-2019, 16:40
Hi Bernd,

To make debugging easier, you should be able to reproduce the symptoms by calling


root@ha-idg-1# crm_mon -p /tmp/ClusterMon_testing.pid -i 5000 -S vm49093-4.scidom.de -C idg-ha -h /srv/www/hawk/public/crm_mon.html

This should fail with a segfault, like the invocation done by the cluster resource. The only difference to the resource script is that I omitted the "-d" option to keep the process in the foreground.

Do you see any additional output on stdout/stderr that might hint at the root cause?
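
The kernel line in the log already carries a few hints, and the following is only a rough sketch of how to read it, not a diagnosis. It extracts the faulting address, the error code, and the offset of the instruction pointer inside the library mapping. The helper name `decode_segfault` is hypothetical and not part of Pacemaker or Net-SNMP.

```shell
#!/bin/sh
# Sketch: decode a kernel "segfault at ..." line.
# decode_segfault is a hypothetical helper name used for illustration only.
decode_segfault() {
    line=$1
    # Faulting address, instruction pointer, error code, and mapping base.
    addr=$(printf '%s\n' "$line" | sed -n 's/.*segfault at \([0-9a-f]*\) ip.*/\1/p')
    ip=$(printf '%s\n' "$line" | sed -n 's/.* ip \([0-9a-f]*\) sp.*/\1/p')
    err=$(printf '%s\n' "$line" | sed -n 's/.* error \([0-9]*\) in.*/\1/p')
    base=$(printf '%s\n' "$line" | sed -n 's/.*\[\([0-9a-f]*\)+[0-9a-f]*\].*/\1/p')
    # Give up if the line does not look like a segfault message.
    [ -n "$addr" ] && [ -n "$ip" ] && [ -n "$err" ] && [ -n "$base" ] || return 1
    # Offset inside the library = instruction pointer - mapping base.
    printf 'fault_addr=0x%x lib_offset=0x%x error=%s\n' \
        "$((0x$addr))" "$((0x$ip - 0x$base))" "$err"
}

decode_segfault 'crm_mon[29187]: segfault at 6c ip 00007fc2ff4d928d sp 00007fff6e231800 error 4 in libnetsnmp.so.30.0.3[7fc2ff49e000+c8000]'
# prints: fault_addr=0x6c lib_offset=0x3b28d error=4
```

On x86, error 4 means a user-mode read of an unmapped page, and a fault address as small as 0x6c is typical of reading a struct member through a NULL pointer. If debug symbols for the library are installed, `addr2line -f -e /usr/lib64/libnetsnmp.so.30.0.3 0x3b28d` may name the function involved.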

Regards,
J

PS: I'm not at a matching system right now - what are the -S and -C options about?

smflood
22-Jan-2019, 17:41
> It's SP4.

But SP4 of which version? SLES9, SLES10, SLES11, or SLES12?

HTH.

smflood
22-Jan-2019, 17:44
> PS: I'm not at a matching system right now - what are the -S and -C options about?

From https://manpages.debian.org/testing/pacemaker-cli-utils/crm_mon.8.en.html :



Modes (mutually exclusive):

..snip..


-S, --snmp-traps=value
Send SNMP traps to this station

-C, --snmp-community=value
Specify community for SNMP traps(default is NULL)


HTH.

berndgsflinux
23-Jan-2019, 11:07
> But SP4 of which version? SLES9, SLES10, SLES11, or SLES12?

It's 12.


Bernd

berndgsflinux
24-Jan-2019, 13:41
Hi,

I found this in the syslog of one of the nodes:


2019-01-24T13:35:43.145664+01:00 ha-idg-1 crmd[20846]: warning: Compile-time support for crm_mon SNMP options is deprecated and will be removed in a future release (configure alerts instead)


I think that's a clear statement. We shouldn't waste our time with something deprecated. I will switch to alerts.
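
For anyone following the same route, an alert definition in crm shell syntax might look roughly like the sketch below. Everything in it is an assumption to adapt: the alert id `snmp_alert` is made up, the agent path presumes that the sample SNMP alert script shipped with Pacemaker (usually found as `alert_snmp.sh.sample` under /usr/share/pacemaker/alerts/) has been copied to that location and made executable, and `trap_community` is taken to be the attribute that script reads for the community string. The recipient and community values are simply carried over from the ClusterMon resource.

```
alert snmp_alert /usr/local/bin/alert_snmp.sh \
    attributes trap_community="idg-ha" \
    to "vm49093-4.scidom.de"
```

Check the variables at the top of the sample script before relying on any attribute name.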

Bernd

berndgsflinux
12-Feb-2019, 17:33
Hi,

> I think that's a clear statement. We shouldn't waste our time with something deprecated. I will switch to alerts.

For the sake of completeness: the hostname in my RA was wrong; fixing it solved the problem. Nevertheless, I switched to alerts.
Bernd
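
Since a wrong hostname turned out to be the trigger, a quick name-resolution check before putting a trap receiver into a resource or alert can save some time. This is a minimal sketch, assuming `getent` is available on the node. `check_trap_host` is a hypothetical helper name, not something Pacemaker provides.

```shell
#!/bin/sh
# Sketch: verify that a trap receiver name resolves before configuring it.
# check_trap_host is a hypothetical helper, not part of Pacemaker.
check_trap_host() {
    host=$1
    # getent consults the same NSS sources (hosts file, DNS) the system uses.
    if getent hosts "$host" >/dev/null 2>&1; then
        echo "ok: $host resolves"
    else
        echo "error: $host does not resolve" >&2
        return 1
    fi
}

# Example usage (the hostname is whatever you plan to pass to -S or to the alert):
# check_trap_host vm49093-4.scidom.de || echo "fix the hostname first"
```

A name that resolves is not proof that the receiver is reachable or listening, so treat a passing check only as a first sanity test.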