Mystery RX packet drops on SLES11 SP2 every 30 sec



kevins_5
24-Jul-2012, 18:20
We have deployed several SP2 servers for testing and keep running into this annoying issue.
Approximately every 30 seconds, the RX dropped counter ticks up by 1. This only happens on SLES11 SP2 systems.
Usually when this counter ticks, it lines up with rx_fw_discards in the ethtool statistics, but for these drops I cannot find a matching stat.

eth0 Link encap:Ethernet HWaddr 00:21:9B:A0:07:7E
inet addr:192.168.69.247 Bcast:192.168.69.255 Mask:255.255.255.0
inet6 addr: fe80::221:9bff:fea0:77e/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:61591 errors:0 dropped:276 overruns:0 frame:0
TX packets:38438099 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:39053090 (37.2 Mb) TX bytes:49909260932 (47597.1 Mb)
Interrupt:36 Memory:d2000000-d2012800

NIC statistics:
rx_bytes: 42672588
rx_error_bytes: 0
tx_bytes: 55266311872
tx_error_bytes: 0
rx_ucast_packets: 32153
rx_mcast_packets: 31988
rx_bcast_packets: 4239
tx_ucast_packets: 37456
tx_mcast_packets: 42720893
tx_bcast_packets: 4
tx_mac_errors: 0
tx_carrier_errors: 0
rx_crc_errors: 0
rx_align_errors: 0
tx_single_collisions: 0
tx_multi_collisions: 0
tx_deferred: 0
tx_excess_collisions: 0
tx_late_collisions: 0
tx_total_collisions: 0
rx_fragments: 0
rx_jabbers: 0
rx_undersize_packets: 0
rx_oversize_packets: 0
rx_64_byte_packets: 3667
rx_65_to_127_byte_packets: 22572
rx_128_to_255_byte_packets: 16761
rx_256_to_511_byte_packets: 353
rx_512_to_1023_byte_packets: 70
rx_1024_to_1522_byte_packets: 24957
rx_1523_to_9022_byte_packets: 0
tx_64_byte_packets: 56728
tx_65_to_127_byte_packets: 34705
tx_128_to_255_byte_packets: 1478329
tx_256_to_511_byte_packets: 2166944
tx_512_to_1023_byte_packets: 3321491
tx_1024_to_1522_byte_packets: 35700156
tx_1523_to_9022_byte_packets: 0
rx_xon_frames: 0
rx_xoff_frames: 0
tx_xon_frames: 0
tx_xoff_frames: 0
rx_mac_ctrl_frames: 0
rx_filtered_packets: 45678
rx_ftq_discards: 0
rx_discards: 0
rx_fw_discards: 0

driver: bnx2
version: 2.1.11
firmware-version: 6.4.5 bc 5.2.3 NCSI 2.0.11
bus-info: 0000:01:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes

I have tried the latest driver 2.2.11j from broadcom with same results. Any ideas what might tick this counter?
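
For anyone who wants to make the same comparison mechanically, here is a minimal Python sketch. It parses a few of the `ethtool -S` lines pasted above and subtracts the discard counters from the ifconfig "dropped" figure (276). The function names and the choice of which counters count as "explained" are assumptions for illustration, not anything the bnx2 driver defines.

```python
# Sketch only: compare the ifconfig "dropped" value with the NIC discard counters.
# The sample text is copied from the ethtool -S output in this post.
ETHTOOL_SAMPLE = """\
NIC statistics:
rx_filtered_packets: 45678
rx_ftq_discards: 0
rx_discards: 0
rx_fw_discards: 0
"""

# Assumption: these three counters are the ones that would "explain" a drop.
DISCARD_KEYS = ("rx_ftq_discards", "rx_discards", "rx_fw_discards")


def parse_ethtool_stats(text):
    """Turn 'name: value' lines into a dict of ints, skipping anything else."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def unexplained_drops(ifconfig_dropped, stats, discard_keys=DISCARD_KEYS):
    """Drops reported by ifconfig that no NIC discard counter accounts for."""
    explained = sum(stats.get(key, 0) for key in discard_keys)
    return ifconfig_dropped - explained


stats = parse_ethtool_stats(ETHTOOL_SAMPLE)
print(unexplained_drops(276, stats))
```

With the numbers above, every one of the 276 drops is unaccounted for by the NIC counters, which is the mismatch described in this post.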

enovaklbank
26-Jul-2012, 12:05
I can confirm this behaviour... also within ESXi5 VMs, so it does not depend on the NIC (we use vmxnet3 NICs).
Anyway, one dropped packet every ~30 seconds seems not to bother anything.

kevins7189_5
31-Jul-2012, 17:22
> I can confirm this behaviour... also within ESXi5 VMs, so it does not depend on the NIC (we use vmxnet3 NICs).
> Anyway, one dropped packet every ~30 seconds seems not to bother anything.

Well it is a pain because we monitor these counts for legitimate drops. There has to be a reason why this is happening.

Magic31
01-Aug-2012, 21:30
> Well it is a pain because we monitor these counts for legitimate drops. There has to be a reason why this is happening.

I've been seeing this too with a couple of SLES 11 SP2 Xen setups. I was planning on setting up a small (network) isolated test server to see if it happens there too.

Trying to orientate how to find out what the server is dropping, I came across a tool called dropwatch (http://linux.die.net/man/1/dropwatch). Not sure if it will run on SLES 11, but there are some packages for it here: http://pkgs.org/download/dropwatch

How are you guys planning to go about finding what's getting dropped?

Cheers,
Willem

Magic31
01-Aug-2012, 21:35
> Trying to orientate how to find out what the server is dropping, I came across a tool called dropwatch (http://linux.die.net/man/1/dropwatch). Not sure if it will run on SLES 11, but there are some packages for it here: http://pkgs.org/download/dropwatch



Ah, even better - an OBS build for SLES 11 SP2 : https://build.opensuse.org/package/revisions?package=dropwatch&project=home%3Abenjamin_poirier%3Adropwatch (as I also don't know how trustworthy pkgs.org downloads are)

kevins7189_5
08-Aug-2012, 21:48
> Ah, even better - an OBS build for SLES 11 SP2 : https://build.opensuse.org/package/revisions?package=dropwatch&project=home%3Abenjamin_poirier%3Adropwatch (as I also don't know how trustworthy pkgs.org downloads are)

Interesting utility.
I was able to get this every time I saw the counter move up about every 30 sec. No idea what it means

1 drops at __netif_receive_skb+1fe (0xffffffff8138388e)
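
If you leave dropwatch running for a while, the output can be tallied per kernel symbol to see whether everything really comes from the same place. A minimal Python sketch follows, assuming every drop line has exactly the format shown above. The regular expression and function name are my own, not part of dropwatch.

```python
import re
from collections import Counter

# Assumed line format, taken from the single sample line in this post:
#   "<count> drops at <symbol>+<hex offset> (<hex address>)"
DROP_LINE = re.compile(r"^(\d+) drops at (\S+?)\+([0-9a-f]+) \((0x[0-9a-f]+)\)$")


def tally_drops(lines):
    """Sum dropwatch drop counts per kernel symbol, ignoring unrelated lines."""
    totals = Counter()
    for line in lines:
        match = DROP_LINE.match(line.strip())
        if match:
            totals[match.group(2)] += int(match.group(1))
    return totals


# Three repeats of the line from this post, plus one line that is not a drop report.
sample = ["1 drops at __netif_receive_skb+1fe (0xffffffff8138388e)"] * 3
sample.append("some unrelated status line")
print(tally_drops(sample))
```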

Magic31
09-Aug-2012, 14:10
> Interesting utility.
> I was able to get this every time I saw the counter move up about every 30 sec. No idea what it means
>
> 1 drops at __netif_receive_skb+1fe (0xffffffff8138388e)

Interesting indeed :)

I haven't been able to put more time into this than trying to simulate it in a (very small) test environment. Funnily enough, I did not witness any dropped packets there.

My next move is to do this at the two sites where I am seeing this, but I have not had a window to do so yet (time and other priorities).


If you can open an SR, that would be best. It could well be that certain types of packets are intentionally dropped for some security reason or other. I don't know nearly enough about the workings of the kernel and its modules, but this snippet taken from a Google search did catch my interest:

* Add a packet_type handler and see if we can prevent
* other packet_type's from handling an skb
* Specifically, we will register our packet_type to be
* the first handler invoked by netif_receive_skb()
* If the packet received meets certain conditions, then,
* drop it, i.e, prevent subsequent ptype_all and ptype_base
* handlers in netif_receive_skb() from processing the packet


Sorry I can't be of more help here. I will pass on this thread to my Novell contact to see if this might be something Novell is aware of.

-Willem

Magic31
09-Aug-2012, 16:44
> ...I will pass on this thread to my Novell contact to see if this might be something Novell is aware of.

I should have asked earlier! : http://www.novell.com/support/kb/doc.php?id=7007165

There you go (and me too).

Cheers,
Willem

kevins7189_5
13-Aug-2012, 17:18
> I should have asked earlier! : http://www.novell.com/support/kb/doc.php?id=7007165
>
> There you go (and me too).
>
> Cheers,
> Willem

That reply seems kinda cop-out ish. It seems this would be an easy problem to find information on if it has been happening since 2.6.37, but I can't find anything that easily.

Have no drops here, so that one is out
cat /proc/net/softnet_stat
03619a59 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
035d90a8 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
03609a2e 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
035dce37 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
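
For reference, each row of /proc/net/softnet_stat is one CPU's counters in hexadecimal: the first column is packets processed, the second is packets dropped because the backlog queue was full, and the third is time_squeeze. A minimal Python sketch that decodes the rows pasted above (the function name and dict keys are just illustrative):

```python
# Rows copied from the /proc/net/softnet_stat output in this post.
SOFTNET_SAMPLE = """\
03619a59 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
035d90a8 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
03609a2e 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
035dce37 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
"""


def parse_softnet(text):
    """Decode the first three hex columns of each softnet_stat row."""
    rows = []
    for index, line in enumerate(text.splitlines()):
        fields = [int(field, 16) for field in line.split()]
        if len(fields) < 3:
            continue  # skip blank or truncated rows
        rows.append({
            "row": index,              # one row per online CPU
            "processed": fields[0],
            "dropped": fields[1],
            "time_squeeze": fields[2],
        })
    return rows


rows = parse_softnet(SOFTNET_SAMPLE)
for row in rows:
    print(row)
print("total backlog drops:", sum(row["dropped"] for row in rows))
```

All four rows decode to zero in the dropped column, which matches the reading that the backlog queue is not where these packets are being lost.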

kevins7189_5
13-Aug-2012, 17:52
The other issue is that this goes away if I run tcpdump on the machine and try to catch the packets. I can go hours and hours with tcpdump running and the rx counter won't move, but soon after shutting it down the rx counters start incrementing again.

kevins7189_5
13-Aug-2012, 18:04
does anyone know if oui Unknown messages from arp would cause this counter to increment?

Magic31
14-Aug-2012, 10:47
> does anyone know if oui Unknown messages from arp would cause this counter to increment?

Could well be.... opening an SR with SUSE would seem the way to get the best answers.

Not a prio for me at the moment, as this is not affecting production, but I will be on the lookout to see if the drop can be ignored within the counter.

-Willem

Bob-O-Rama
15-Aug-2012, 02:43
This may be a known issue with the bnx2 driver's default buffer coalescing settings. Does the rx_filtered_packets value match ( approximately ) the rx discards value shown in ifconfig? If so, then show us output of

ethtool -c eth0

and post here....

Then try the following:

ethtool -C eth0 rx-usecs 6 rx-usecs-irq 6 rx-frames 0 rx-frames-irq 0

and see if this reduces / eliminates the counters increasing.

Also this can be ( cough ) "perfectly normal" because when the server is unable to find a sink for the packets, it drops them. This can be things like BPDU packets or other stuff which is not a layer 3 protocol the server listens for. Use of tcpdump or wireshark may well stop the counter from increasing, as it will be a sink for all packets; it also changes the ring buffer to capture packets.

If the above incantation fixes the issue, you can leave it that way. The likely cause is a periodic burst of packets which overruns the driver buffer pool. "Stuff happens."

-- Bob

kevins7189_5
15-Aug-2012, 16:32
Thanks for the reply!

rx_fw_discards = 117
rx dropped in ifconfig (62 days uptime) is 237501

I've done some reading on the coalesce thing with bnx2 but I thought it was more of a troubleshooting step than a "bug", but using default now.
Coalesce parameters for eth0:
Adaptive RX: off TX: off
stats-block-usecs: 999936
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0

rx-usecs: 18
rx-frames: 12
rx-usecs-irq: 18
rx-frames-irq: 2

I tried what you listed

Coalesce parameters for eth0:
Adaptive RX: off TX: off
stats-block-usecs: 999936
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0

rx-usecs: 6
rx-frames: 0
rx-usecs-irq: 6
rx-frames-irq: 0


I still get rx packet drops in ifconfig (now 237523). I'm well aware of the rx_fw_discards issue, because I definitely need to monitor that. I usually monitor it through ifconfig drops, however. But I CAN'T, because these other "drops", whatever they are, are also counted in the same counter. I guess I could monitor rx_fw_discards directly, but I don't think I should have to. There should be a separate counter for the kernel to report dropping unknown packets, rather than the main ifconfig rx counter.
I'm trying to explain to my network guys about these unknown packets, and they are completely confused and uninterested (imagine that).
From what I can ascertain, I believe the OUI UNKNOWN packets are the cause, but I haven't found a good explanation of what oui Unknown means. But since it says "unknown", it seems a good candidate for these drops.
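
As a stopgap for monitoring, the two counters can be compared between samples so that only firmware discards raise an alert. A rough Python sketch of that idea follows. The rx_dropped values are the two ifconfig readings quoted in this post, but the second rx_fw_discards reading was not posted, so it is ASSUMED unchanged here. The dict keys and function name are hypothetical as well.

```python
def drop_deltas(before, after):
    """Return (total, firmware, unattributed) drop deltas between two samples."""
    total = after["rx_dropped"] - before["rx_dropped"]
    firmware = after["rx_fw_discards"] - before["rx_fw_discards"]
    return total, firmware, total - firmware


before = {"rx_dropped": 237501, "rx_fw_discards": 117}
# ASSUMPTION: rx_fw_discards did not move between the two ifconfig readings.
after = {"rx_dropped": 237523, "rx_fw_discards": 117}

total, firmware, unattributed = drop_deltas(before, after)
should_alert = firmware > 0  # alert only on drops the NIC firmware reports
print(total, firmware, unattributed, should_alert)
```

Under that assumption all 22 new drops are unattributed and no alert fires, which is the behaviour the ifconfig counter alone cannot give.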

Bob-O-Rama
16-Aug-2012, 03:43
> Thanks for the reply!


Since the drops go away when you use packet capturing, that is significant. It either means the capture driver is acting as a sink for packets that have no owning protocol ( things like BPDU ) or it messes with the buffering to allow the driver to correctly accept the packets. I would do about 2-5 minutes of packet captures and eliminate everything which is IP / TCP / UDP and look for something with the frequency you are seeing.

If you are good friends with the network guy, have them disable BPDU, CDP, and other digital detritus on the switch port(s) feeding the server.

But that the counters do not increment when you have captures is a significant finding supporting my theory that the drops counter is pretty meaningless.

-- Bob
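
The filtering step suggested above can be sketched in Python: look at bytes 12-13 of each captured Ethernet frame, throw away IPv4 and IPv6, and label what is left. This is only an illustration under stated assumptions: frames are untagged (no VLAN header), the two test frames are hand-built with a made-up source MAC, and the function names are hypothetical. Reading frames out of a capture file is left out.

```python
import struct

IP_ETHERTYPES = {0x0800: "IPv4", 0x86DD: "IPv6"}


def classify_frame(frame):
    """Return a short label for a raw, untagged Ethernet frame."""
    if len(frame) < 14:
        return "truncated"
    (type_or_len,) = struct.unpack("!H", frame[12:14])
    if type_or_len >= 0x0600:
        # Ethernet II: the field is an EtherType.
        return IP_ETHERTYPES.get(type_or_len, "ethertype 0x%04x" % type_or_len)
    # Below 0x0600 the field is an 802.3 length and an LLC header follows.
    if len(frame) >= 16:
        dsap, ssap = frame[14], frame[15]
        if dsap == 0x42 and ssap == 0x42:
            return "LLC spanning tree BPDU"
        if dsap == 0xAA and ssap == 0xAA:
            return "LLC/SNAP"
    return "LLC other"


def is_candidate(frame):
    """True for frames that are not IP, i.e. worth a closer look."""
    return classify_frame(frame) not in IP_ETHERTYPES.values()


SRC = bytes.fromhex("020000000001")  # made-up locally administered MAC
bpdu = bytes.fromhex("0180c2000000") + SRC + bytes.fromhex("0026424203") + bytes(35)
ipv4 = bytes.fromhex("ffffffffffff") + SRC + bytes.fromhex("0800") + bytes(46)

for frame in (bpdu, ipv4):
    print(classify_frame(frame), is_candidate(frame))
```

Counting candidate frames per label over a few minutes should show whether any of them arrive at roughly the rate the drop counter moves.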

kevins7189_5
16-Aug-2012, 17:18
> Since the drops go away when you use packet capturing, that is significant. It either means the capture driver is acting as a sink for packets that have no owning protocol ( things like BPDU ) or it messes with the buffering to allow the driver to correctly accept the packets. I would do about 2-5 minutes of packet captures and eliminate everything which is IP / TCP / UDP and look for something with the frequency you are seeing.
>
> If you are good friends with the network guy, have them disable BPDU, CDP, and other digital detritus on the switch port(s) feeding the server.
>
> But that the counters do not increment when you have captures is a significant finding supporting my theory that the drops counter is pretty meaningless.
>
> -- Bob

I'll keep going with packet captures. The network guys have actually been helpful, but there seems to be a giant lack of information on why this "error" was implemented, and what parameters would make it do so. If the kernel devs thought this necessary, they should document the conditions so people could prepare.
Right now I'm focusing on the only "unknowns" I see in tcpdump which are "OUI Unknown" from arp. I don't even know if these are the correct ones, but all I have to go on. After further investigation, the errors are not quite as timely as every 30 seconds, sometimes no drops occur for several minutes, then 3 or 4 show up.
This is very annoying.

kevins7189_5
16-Aug-2012, 18:53
Are any of these issues related?
http://hardforum.com/showthread.php?t=1472177
http://serverfault.com/questions/77510/unknown-tcpdump-packets

If broadcom nics are supported by SLES, then shouldn't their loopback packets be "supported" and therefore not error?

kevins7189_5
22-Aug-2012, 13:56
has anybody found any new information that they can share? I can't believe this isn't a big problem for a lot of people...

ab
22-Aug-2012, 14:17
Did you read Magic31's reply on 2012-08-09 at 09:54 (MDT)? He posted a
link to a TID which does not seem, as you stated, "cop-out ish", since
it describes what is happening, why it happens, and I think even
provides a link back to when this was checked-in to the mainline kernel;
it also describes using tcpdump as a test, as you found, to stop the
counter from incrementing to show that it is indeed because of the
kernel change.

What would you have be different at this point? It seems to me that
this change gives you a better picture of reality on your network, even
though that now means you can see things which are being dropped where
before you did not ("Pay no attention to the man behind the curtain.").
The reality of networking is that data are stopped, but networking is
designed to handle that too from collision detection (which matters less
today it seems) to TCP checksums and acknowledgments.

Good luck.

kevins7189_5
27-Aug-2012, 17:05
> Did you read Magic31's reply on 2012-08-09 at 09:54 (MDT)? He posted a
> link to a TID which does not seem, as you stated, "cop-out ish", since
> it describes what is happening, why it happens, and I think even
> provides a link back to when this was checked-in to the mainline kernel;
> it also describes using tcpdump as a test, as you found, to stop the
> counter from incrementing to show that it is indeed because of the
> kernel change.
>
> What would you have be different at this point? It seems to me that
> this change gives you a better picture of reality on your network, even
> though that now means you can see things which are being dropped where
> before you did not ("Pay no attention to the man behind the curtain.").
> The reality of networking is that data are stopped, but networking is
> designed to handle that too from collision detection (which matters less
> today it seems) to TCP checksums and acknowledgments.
>
> Good luck.


The issue is that the kernel devs could have made their own "counter" for their "unknown" drops, which are, in reality, the kernel devs not keeping up with network protocols, not the other way around. We've found the SLES11 kernel doesn't understand several cisco protocols (really??), and bonding protocols even. If we need to "see" unknown packets, put it in a different counter. There are already LEGITIMATE drops in the DROP counter, don't need to see the unknown ones here too.
There is nowhere that I can find that says, to be a Kernel 3.0 and up user, you need to clean up the unknown tiny packets that may be lurking on your network, because we think it's important to count them...

rant over.

ab
27-Aug-2012, 20:23
> The issue is that the kernel devs could have made their own
> "counter" for their "unknown" drops, which, are in reality, the
> kernel devs not keeping up with network protocols, not the other way
> around.

What protocol, specifically, is on the wire and should be supported by
the Linux kernel on your SLES machine? Have you searched for relevant
kernel modules to support that protocol? Have you submitted this to the
kernel developers? For some reason, perusing this thread, I thought you
were looking for an explanation on dropped packets, not for missing
functionality because of a lack of support for third-party protocols
which do not affect your server (other than by incrementing a counter of
packets dropped).

> We've found the SLES11 kernel doesn't understand several cisco
> protocols (really??), and bonding protocols even.

Again, ones that you need? Which ones specifically? If support for
something useful is missing then I'm sure engineering will consider
adding it if possible. For many of these cases, though, protocols are
not supported out of the box because they do not matter to the purpose
of the box. Cisco protocols.... if you're talking about proprietary
protocols then this doesn't seem very important at all. If you're
talking about protocols that Cisco happens to implement along with the
rest of the world (BGP, RIP, etc.) then I'd expect Linux can handle
those, but those modules may be omitted out of the box to save
time/space. If that's the case you're welcome to change that by adding
more modules.

> If we need to "see" unknown packets, put it in a different counter.
> There are already LEGITIMATE drops in the DROP counter, don't need to
> see the unknown ones here too.

Sounds like a valid enhancement request.

> There is no where, where I can find, that says, to be a Kernel 3.0
> up user, you need to clean up your unknown tiny packets they may be
> lurking on your network, because we think its important to count
> them...

Indeed, there is not; nor should there be, though. Your system works
just fine dropping unknown packets as it always has. I've heard of two
customers now who cared (you were the second, and the first prompted the
TID) so perhaps this will become a bigger issue as more move to SLES 11
SP2 (or as other non-SLES customers move to the late 2.x and 3+ kernels)
so if that's the case your enhancement request, or even a bug, may be
the right course of action to provide yet another counter for packets.
I doubt the developer implementing this did so to cause problems, but
rather to get a better picture of what really happens on the wire. That
they did not account for people watching for other types of dropped
packets as some kind of network health check seems like an honest
mistake, if that's really what happened.

Good luck.

kevins7189_5
28-Aug-2012, 14:11
The "OUI Unknown" packets shown in tcpdump seem to be the ones that trigger the drops. We've eliminated a few that we have found, but it seems things like spanning tree, bonding, and the bnx2 loopback packets all make these happen (there was one protocol from a Microsoft terminal server that was causing one, but I can't recall what the name was). I have a fedora box that has the same issues with bonding and dropping "unknown" packets (3.x kernel again).
It's really an issue of monitoring. I use the drop counter (and others) to look for drops of important packets that show issues with data we care about. I don't need to see a drop of a small arp packet that the kernel doesn't understand. It needs its own counter and leave useful counters alone.

kevins7189_5
18-Oct-2012, 14:35
Has any resolution been found for this? This is still the absolute dumbest non-error I've ever seen in an OS.

jmozdzen
18-Oct-2012, 14:49
Have you asked the developers for a fix or change?

(that's a change in the upstream Linux kernel - not SLES- nor SuSE-specific, so lkml-net might be a better place to ask your question)