SLES 11 SP2 for VMware lose connectivity



tkarhu
15-Oct-2012, 15:14
Hello,

I am new to SLES, so please bear with me.
Simply put, the scenario is the following:
- my network environment has 4 VMware (vSphere 5) hosts and an IBM Lotus Mobile Connect (LMC) VPN
- I have 4 SLES 11 SP2 for VMware servers which lose connectivity for certain clients that are connected via the LMC VPN. The servers have 1 interface each.
- when connectivity is lost, the servers are still reachable via LAN. This appears to happen only to remote clients
- Looking at tcpdump/wireshark, I see packets coming in but not going out of SUSE

16:23:37.964832 IP 10.31.64.3 > vsv04: ICMP echo request, id 1, seq 27841, length 40
16:23:42.946887 IP 10.31.64.3 > vsv04: ICMP echo request, id 1, seq 27847, length 40
16:23:47.956295 IP 10.31.64.3 > vsv04: ICMP echo request, id 1, seq 27853, length 40
16:23:52.945814 IP 10.31.64.3 > vsv04: ICMP echo request, id 1, seq 27859, length 40
16:23:57.950353 IP 10.31.64.3 > vsv04: ICMP echo request, id 1, seq 27865, length 40
16:24:03.041327 IP 10.31.64.3 > vsv04: ICMP echo request, id 1, seq 27871, length 40
^C12428 packets captured
12428 packets received by filter
0 packets dropped by kernel
(this is Win 7 client -> LMC VPN -> SLES 11 SP2(vsv04))
- for some reason the sequence number skips 5 values between captured requests (it increases by 6 each time)

I'm clueless. Based on the above information, should I suspect something internal to SLES 11 SP2, or the VPN connection (which works perfectly with every other OS I have here)?
Restarting the interface resolves the problem, but only temporarily.

tkarhu
16-Oct-2012, 11:19
Tested the same scenario with SLES 11 SP2 x86_64 (not 'for VMware') on a physical server and the same dropping happened, except that now 6 Echo Requests vanish at a time, so that only every 7th request is seen by tcpdump, and no Echo Replies go from the kernel to the interface.

Next I need to test with a GA or SP1 system.

jmozdzen
16-Oct-2012, 13:06
Hi tkarhu,

how's the routing on your SLES servers set up? Is there a default route set or do you have explicit rules for the client's network?

I believe the "no ICMP echo responses" part is easier than the "consistently 5 to 6 echo requests missing" part, so let's handle that first ;)

Regards,
Jens

tkarhu
16-Oct-2012, 14:39
Thanks for replying!

All servers use default route, like this:

itds:~ # route
Kernel IP routing table
Destination     Gateway            Genmask         Flags Metric Ref    Use Iface
default         router2.domain.fi  0.0.0.0         UG    0      0        0 eth0
10.31.0.0       *                  255.255.192.0   U     0      0        0 eth0
loopback        *                  255.0.0.0       U     0      0        0 lo
link-local      *                  255.255.0.0     U     0      0        0 eth0

I just finished installing SLES 11 x86_64 SP1 on address 10.31.10.250, and from the same client where I miss echoes from the SP2 system(s) I have now been getting replies for over 2h in a row. The SP2 systems have also worked for a couple of hours at a time before stopping all of a sudden.
But this SP1 system has really pinged the longest so far. I'm close to saying this is an SP2 feature/problem, but I won't, yet.

Wrap up so far:
- SLES 11 for VMware x86_64 SP2 running on vSphere ESXi5: no reply to client behind VPN
- SLES 11 x86_64 SP2 running on hardware: no reply to client behind VPN
- SLES 11 x86_64 SP1 running on vSphere ESXi5: has been replying to client behind VPN over 2h now

tkarhu
17-Oct-2012, 10:08
This morning I restarted my LMC VPN, as our visitor WLAN had kicked me out during the night. Same laptop, same LMC, same source IP.
To my surprise, the two SP2 servers kept answering pings, one for over 2h and the other for over 3h. But now they have died again. The test servers I compare against are still pinging without a hitch.

SLES 11 x86_64 SP2 on IBM x366: pinged good for ~2h, no more replies
SLES 11 for VMware x86_64 SP2 on vSphere ESXi5: pinged good for ~3h, no more replies
SLES 9 x86 GA on vSphere ESXi5: pings good and going strong
SLES 11 x86_64 SP1 on vSphere ESXi5: pings good and going strong


I've tried the route of "net.ipv4.conf.default.rp_filter = 0" in sysctl.conf, but to no avail. Does anyone know why this guy would say it doesn't work for SLES 11 SP2? (http://www.softpanorama.info/Net/Internet_layer/Routing/martian_source.shtml) (bottom of the article)
Somehow that return path check would fit this picture, as from the LAN the SP2 servers are all smiles too.

I would have opened a ticket, but we don't have a subscription, as this is a customer demo of several products and SUSE was chosen off the top of our heads as the platform for the demo. It took too long to notice this problem, so changing the OS or patch level is not an option at this point.

jmozdzen
17-Oct-2012, 11:13
Hi again,

I'm not familiar with LMC and its installation - does it act as a network interface on the target ("gatekeeper"?) machine? In your case, where does the VPN tunnel terminate, on the server(s) in question?

If the SLES11SP2 servers are *not* terminating the tunnels, then I don't see how martians are to be expected:

> I have 4 SLES 11 SP2 for VMware servers which lose connectivity for certain clients which are connected via LMC VPN. Servers have 1 interface

If the servers have (in routing terms) only a single interface, all traffic is going through that interface. Always.

If, on the other hand, those servers run the gatekeeper (tunnel end point) component of the LMC VPN software, then I'd expect them to have more than a single interface: the physical interface to the LAN plus some (virtual) interface provided by LMC. Or it's functioning via policy based routing.

Seems I'd need more details of the setup to be of help: How's the actual network setup, where are the packets traveling through? Client with LMC, LMC gatekeeper, probably some local network, SLES11SP2 server?

Could it be the gatekeeper that is causing the trouble? What happens if you re-establish the LMC connection after the ICMP packets are dropped? What happens if you restart the LMC components? If the gatekeeper is on the SLES11SP2 server: Does LMC somehow interact with the policy routing engine?

Too many questions, no answers... sorry :/

Regards,
Jens

tkarhu
17-Oct-2012, 11:57
Thanks Jens,

I appreciate your effort. I could draw a simplified picture of the setup, but to answer your fundamental questions I'll stick to text for the time being.
The flow in this particular case is

LMC Client on Windows 7 (10.31.64.3) --internet--> LMC gateways (one AIX6, one RHEL4, same behavior via either gw) --LAN--> router2 --> x3850(ESXi5) --> SLES 11 SP2 (vsv04)

The VPN pool is 10.31.64.0/24 while the LAN is 10.31.0.0/18. The LMC gatekeepers terminate the VPN, and they certainly do have 2 interfaces (and many virtual interfaces as well). Each SLES 11 SP2 virtual machine or physical server has one interface, connected to the LAN.
You are right about martian packets. However, I got carried away by this article, http://www.novell.com/support/kb/doc.php?id=7007649, and then found that martian packet article, which suggested that this setting cannot be used with SP2. My point is rather to make sure that this "Return Path Check" is definitely switched off. I'm currently trying to understand what a FIB is.

All along I've been on the fence about whether to go the suspect-the-LMC-gatekeeper way or the suspect-SLES11-SP2 way. The LMC gatekeeper trace shows packets going out towards SLES 11 SP2, but none coming back. Since tcpdump on the target SLES 11 SP2 also suggests that no replies are sent back to the network, and SLES 11 SP1 or any other OS does reply to this same LMC client, I'm inclined to suspect that SLES 11 SP2 now has something that is triggered when packets flow via a VPN, or even LMC in particular.

About restarts, so far I have tested:

- restart of LMC Connection Manager (gatekeeper): no help, no reply
- restart of the VPN tunnel from the client end: no help, no reply
- no internet connection for several hours on the client end, then re-establishing the internet connection and VPN tunnel: replies received for ~3h, then stopped again
- on SLES 11 SP2 VMs, removed/reinstalled the NIC via VMware (new MAC): only temporary help, eventually no replies
- ifdown/ifup on SLES 11 SP2: only temporary help, eventually no replies
- reboot of SLES 11 SP2: only temporary help, though it often worked for a couple of hours after the restart; eventually no replies

jmozdzen
17-Oct-2012, 13:08
hi tkarhu,

no need for a picture - the textual description makes things clear enough, at least to me.

> [...] I'm willing to suspect that SLES 11 SP2 has now something

since vsv04 (SLES11SP2) has incoming ICMPs, but is not responding to them, I agree that it's unlikely to be a problem with the LMC gateway.

> VPN pool is 10.31.64.0/24 while LAN is 10.31.0.0/18.

Looks good to me.

> As tcpdump on target SLES 11 SP2 also suggests that no replies are sent back to the network

An interesting question is what probable causes might there be to the "no response" situation. The actual packet flow and associated information, simplified, ought to look like:

- the ICMP echo request from 10.31.64.3 comes through eth0 and is accepted (that seems to be correct, verified by tcpdump on the SLES11SP2 machine)
- the IP stack decides to answer
- the response packet is constructed, destination IP would be 10.31.64.3
- the routing part of the IP stack determines that it's no IP address reachable locally and tries to determine the corresponding router (should be default router in your case)
- the ARP cache is checked for an entry for the default router's IP address
- the ICMP packet is sent to the router's MAC address

If, e.g., there's no ARP entry for the default router and its MAC address cannot be determined via ARP requests, you'd not see an outgoing ICMP echo response in tcpdump.
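
The routing step in the list above can be sketched in a few lines of Python. This is only an illustration of the "local or via router?" decision, not of the kernel's actual code; the function name `first_hop` is made up, and 10.31.0.1 is assumed to be the default router because that is the address labelled "gw" in the ARP listing further down the thread.

```python
import ipaddress

# Values taken from the thread; first_hop() itself is a hypothetical helper.
LAN = ipaddress.ip_network("10.31.0.0/18")
DEFAULT_ROUTER = ipaddress.ip_address("10.31.0.1")  # assumed to be router2


def first_hop(destination: str) -> ipaddress.IPv4Address:
    """Return the IP address whose MAC address must be looked up via ARP."""
    dst = ipaddress.ip_address(destination)
    if dst in LAN:
        # Directly reachable: ARP for the destination itself.
        return dst
    # Not on the local subnet: ARP for the default router instead.
    return DEFAULT_ROUTER


print(first_hop("10.31.64.3"))  # 10.31.0.1 (VPN client, outside the /18)
print(first_hop("10.31.10.2"))  # 10.31.10.2 (LMC gateway, on the LAN)
```

Note that 10.31.0.0/18 ends at 10.31.63.255, so the VPN pool 10.31.64.0/24 starts just outside the LAN.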

An idea just crossed my mind: When I tested openSuSE 12.1 on a netbook, I had to face situations where the IP stack seemed to have missed the "is not a local address" part of the decision making - leading to an ARP entry for the (remote) destination address. Therefore, I couldn't ping that address from the netbook anymore, I had to restart the netbook (or maybe "rcnetwork restart" was sufficient - I'm not 100% sure on that). Any other IP address was pingable without difficulties, even in the same target network. The situation (or rather that specific cause) was easily detectable: "arp -an" gave an (incomplete) entry for the remote IP address (that'd be 10.31.64.3 in your case).

You could also check the interface error stats (/sys/class/net/eth0/statistics/) for anything that catches the eye. May I assume you have had a look at dmesg and syslog for anything out of the ordinary happening around the time the replies stop?

Have you compared the tcpdump packet details of the requests from when responses work with those from when no responses are sent (I wouldn't expect non-standard changes though), and have you checked what other traffic (other than ICMP echo requests/responses) occurs on the SLES11SP2 interface when no responses are sent? (I'm thinking of unanswered ARP requests for the router IP or the like.)

Regards,
Jens

tkarhu
17-Oct-2012, 15:10
> An idea just crossed my mind: When I tested openSuSE 12.1 on a netbook, I had to face situations where the IP stack seemed to have missed the "is not a local address" part of the decision making - leading to an ARP entry for the (remote) destination address. Therefore, I couldn't ping that address from the netbook anymore, I had to restart the netbook (or maybe "rcnetwork restart" was sufficient - I'm not 100% sure on that). Any other IP address was pingable without difficulties, even in the same target network. The situation (or rather that specific cause) was easily detectable: "arp -an" gave an (incomplete) entry for the remote IP address (that'd be 10.31.64.3 in your case).

Now we exit my comfort zone, but that's the mother of all learning I guess!


vsv04:~ # arp -an
? (10.31.0.10) at 00:09:6b:71:57:c6 [ether] on eth0
? (10.31.10.1) at 00:50:56:9d:53:a5 [ether] on eth0
? (10.31.21.176) at 00:27:13:6b:c4:13 [ether] on eth0
? (10.31.0.1) at 00:0b:60:da:c4:38 [ether] on eth0
? (10.31.10.2) at 00:0d:60:0f:29:50 [ether] on eth0

In the above order, they are:
- 10.31.0.10: DNS
- 10.31.10.1: file server
- 10.31.21.176: my laptop in the LAN
- 10.31.0.1: gw
- 10.31.10.2: LMC gw to which 10.31.64.3 is connected



> You could also check the interface error stats (/sys/class/net/eth0/statistics/) for anything that catches the eye. May I assume you have had a look at dmesg and syslog for anything out of the ordinary happening around the time the replies stop?



vsv04:~ # ifconfig
eth0 Link encap:Ethernet HWaddr 00:50:56:9D:53:C4
inet addr:10.31.10.4 Bcast:10.31.63.255 Mask:255.255.192.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:2051413 errors:0 dropped:8674 overruns:0 frame:0
TX packets:115363 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:152714275 (145.6 Mb) TX bytes:29334029 (27.9 Mb)

lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:60358 errors:0 dropped:0 overruns:0 frame:0
TX packets:60358 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:35685182 (34.0 Mb) TX bytes:35685182 (34.0 Mb)

SP2 introduced a growing drop counter, but the article http://www.novell.com/support/kb/doc.php?id=7007165 suggests that this is normal. The drop counter does not grow at the rate of my pings, so I abandoned that line of inquiry entirely.
audit.log, messages, and dmesg all looked normal, as far as I can judge.



> Have you compared the tcpdump packet details of the requests from when responses work with those from when no responses are sent (I wouldn't expect non-standard changes though), and have you checked what other traffic (other than ICMP echo requests/responses) occurs on the SLES11SP2 interface when no responses are sent? (I'm thinking of unanswered ARP requests for the router IP or the like.)


I found something new in the tcpdump output (I accidentally left tcpdump running longer than the earlier <30 sec runs)!

16:56:52.142751 IP vsv04 > 10.31.11.11: ICMP vsv04 udp port 42535 unreachable, length 196
10.31.11.11 is the other LMC gateway (the AIX one), and during this dump I am sure my Win7 LMC client was connected to the other one. I need to read the tcpdump output for longer now; I must have overlooked something. LMC uses only one gateway at a time.
I'll monitor tcpdump for longer and come back.

jmozdzen
17-Oct-2012, 17:48
> Now we exit my comfort zone, but that's the mother of all learning I guess!
>
> vsv04:~ # arp -an
> ? (10.31.0.10) at 00:09:6b:71:57:c6 [ether] on eth0
> ? (10.31.10.1) at 00:50:56:9d:53:a5 [ether] on eth0
> ? (10.31.21.176) at 00:27:13:6b:c4:13 [ether] on eth0
> ? (10.31.0.1) at 00:0b:60:da:c4:38 [ether] on eth0
> ? (10.31.10.2) at 00:0d:60:0f:29:50 [ether] on eth0

These entries all look fine.

A quick explanation: When an IP host sends an IP data packet, it always has to send it to a target host on a network that the sending host is directly attached to. If that network can address different hosts (like an Ethernet LAN, where our sending host can directly reach any machine on the same LAN), then some sort of underlying addressing has to take place. In the case of LANs, that's what the MAC address is used for, and to determine the correct MAC address, ARP is used (a service mapping IP addresses to MAC addresses, so to speak). The important part is that a host's ARP table must only contain entries for hosts on the network(s) that host is directly attached to.
In your list, this is the case: vsv04 is attached to 10.31.0.0/18 and all reported entries are in that subnet. Had there been an entry for 10.31.64.3, this would have been an error (and the entry would probably have been reported as <incomplete>, as no MAC address on the local LAN could have been determined for that host).
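
The check described above (every ARP entry must be on the directly attached subnet, and none should be incomplete) could be automated roughly as follows. This is a sketch that parses the `arp -an` text quoted in this thread; the helper name `suspicious_arp_entries` is hypothetical, and the extra `<incomplete>` line in the usage example is an invented illustration, not output from vsv04.

```python
import ipaddress
import re

# Output copied from the thread.
ARP_OUTPUT = """\
? (10.31.0.10) at 00:09:6b:71:57:c6 [ether] on eth0
? (10.31.10.1) at 00:50:56:9d:53:a5 [ether] on eth0
? (10.31.21.176) at 00:27:13:6b:c4:13 [ether] on eth0
? (10.31.0.1) at 00:0b:60:da:c4:38 [ether] on eth0
? (10.31.10.2) at 00:0d:60:0f:29:50 [ether] on eth0
"""

LAN = ipaddress.ip_network("10.31.0.0/18")


def suspicious_arp_entries(arp_text, network):
    """Return IPs that are off-link or whose MAC address is unresolved."""
    suspicious = []
    for line in arp_text.splitlines():
        match = re.search(r"\((\d+\.\d+\.\d+\.\d+)\) at (\S+)", line)
        if not match:
            continue
        address, hw_address = match.groups()
        off_link = ipaddress.ip_address(address) not in network
        incomplete = hw_address == "<incomplete>"
        if off_link or incomplete:
            suspicious.append(address)
    return suspicious


print(suspicious_arp_entries(ARP_OUTPUT, LAN))  # []

# Invented example of the faulty case described above:
broken = ARP_OUTPUT + "? (10.31.64.3) at <incomplete> on eth0\n"
print(suspicious_arp_entries(broken, LAN))  # ['10.31.64.3']
```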

> [...] I found something new in the tcpdump output (I accidentally left tcpdump running longer than the earlier <30 sec runs)!
>
> 16:56:52.142751 IP vsv04 > 10.31.11.11: ICMP vsv04 udp port 42535 unreachable, length 196
> 10.31.11.11 is the other LMC gateway (the AIX one), and during this dump I am sure my Win7 LMC client was connected to the other one.

That recorded ICMP packet is a report that some packet was received from 10.31.11.11 for the local port 42535 - but there's no one listening on vsv04:42535. This does seem unrelated... but one never knows ;)

> I need to read the tcpdump output for longer now; I must have overlooked something. LMC uses only one gateway at a time.
> I'll monitor tcpdump for longer and come back.

Hm, something else caught my eye:
> LMC Client on Windows 7 (10.31.64.3) --internet--> LMC gateways (one AIX6, one RHEL4, same behavior via either gw) --LAN--> router2 --> x3850(ESXi5) --> SLES 11 SP2 (vsv04)
> LAN is 10.31.0.0/18
> (10.31.10.2) at 00:0d:60:0f:29:50 [ether] on eth0 => lmc gw who is connected to 10.31.64.3
In msg #4 of this thread
> itds:~ # route
> Kernel IP routing table
> Destination Gateway Genmask Flags Metric Ref Use Iface
> default router2.domain.fi 0.0.0.0 UG 0 0 0 eth0
> 10.31.0.0 * 255.255.192.0 U 0 0 0 eth0
> loopback * 255.0.0.0 U 0 0 0 lo
> link-local * 255.255.0.0 U 0 0 0 eth0

1. How does router2 fit into this picture? Is it simply the Dom0 (and all DomUs have direct interfaces to the LAN), or is it an actual router between the LAN and the Dom0/DomUs?
2. If the LMC gateway is at 10.31.10.2 (which is on the LAN - where vsv04 is said to be directly connected), how's the vsv04 machine to know where to send the response packets? It looks as if the default router is (network-wise) not between vsv04 and the LMC gateway and no route to 10.31.64.0 points to the LMC gateway.

If router2 is a separate router between the LAN subnet and vsv04, then the IP/routing setup of vsv04 seems strange. If not, then who's routing the ARP responses from vsv04 to the LMC gw on the same subnet?

Regards,
Jens

tkarhu
18-Oct-2012, 12:38
> 1. How does router2 fit into this picture? Is it simply the Dom0 (and all DomUs have direct interfaces to the LAN), or is it an actual router between the LAN and the Dom0/DomUs?
> 2. If the LMC gateway is at 10.31.10.2 (which is on the LAN - where vsv04 is said to be directly connected), how's the vsv04 machine to know where to send the response packets? It looks as if the default router is (network-wise) not between vsv04 and the LMC gateway and no route to 10.31.64.0 points to the LMC gateway.
>
> If router2 is a separate router between the LAN subnet and vsv04, then the IP/routing setup of vsv04 seems strange. If not, then who's routing the ARP responses from vsv04 to the LMC gw on the same subnet?
>
> Regards,
> Jens

1. router2 is a separate router, yes. Dom0 means nothing to me, but according to Wikipedia it is the 'root domain' or similar, so yes, it's a Dom0. Return packets will flow - on all systems other than SLES11 SP2 - from the virtual machine (vsv**) to router2 to the LMC gw to the LMC client, if I understand this correctly, and all of that works as expected.
2. The default gw for vsv04 is router2, so return packets should first go to router2, which then knows that 10.31.64.0/24 is to be found behind 10.31.10.2 (the LMC gateway), and from there they go to the LMC client.

I see nothing too odd in this, and it does work for SP1 and the other OSes. But I am more than willing to believe that we have a setup in which SLES 11 SP2 triggers something that makes this problem visible, while the others route packets just fine.

tkarhu
18-Oct-2012, 13:32
One more thing I forgot to mention. This morning a project member told me he could not copy 3.5 GB worth of files to the server. I promised to take a look. The copy went over Samba from a file server (SLES 9). I started the copy and it ran at 8 MB/s for around 3 minutes, then stopped. Network utilization dropped to near zero. After some timeout, Nautilus reported a file copy error, "unknown argument". Meanwhile Windows clients and my Ubuntu laptop were able to connect to the file server without problems.
SLES 11 SP2 did this 3 times before I got all 3.5 GB copied. I re-copied the three files that had only been partially copied when connectivity dropped.

I'll try to reproduce this on another SLES 11 SP2 and get a tcpdump of it.

What puzzles me:

- 8 MB/s: this is a gigabit LAN, and the expected copy speed is in the range of 25-35 MB/s
- although the copy progress indicator suggested a gradual grinding to a halt, network throughput actually stalled immediately


Jens, you being more knowledgeable, would you mind taking a look at my tcpdumps once I get the situation captured?

jmozdzen
18-Oct-2012, 14:14
Re,

I was asking these questions because for "traditional" routing, some puzzle pieces don't seem to fit. At least from my point of view.

"Dom0" is a term from e.g. the Xen world (sorry, I missed the "VMware" info in your message): Dom0 is the "host system", i.e. the physical server, on which the DomUs (aka "VMs" - you said you had virtual SLES11SP2 servers) are running. Depending on the networking setup inside the virtual environment, Dom0 can act as a router rather than attaching the interfaces of the DomUs directly to a LAN.

Here's what I gathered so far:

LMC client (IP 10.31.64.3/24, default gateway is the LMC VPN gateway) -> tunnel via Internet -> (10.31.64.1/24) LMC gw (10.31.10.2/18) -> LAN (10.31.0.0/18) -> (10.31.0.1/18) router2 (some IP address) -> someNet -> (10.31.10.4/18) vsv04

Now, IP-wise, this picture doesn't fit.

Routing basics are:
- A packet is delivered directly to the target, if the target is on the same subnet as the sending host
- Otherwise, the packet is delivered to the router with the closest matching route
- Lastly, the sending host will try to deliver the packet to the "default router".

Under all these circumstances, either the target or the "next hop"-router needs to be reachable via the physical network connecting sender and receiver.
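
As a rough illustration of these routing basics, the lookup can be written as a longest-prefix match over a route table. The sketch below is a simplification in Python, not how the kernel implements it; `select_next_hop` is a hypothetical name, the table mirrors the `route` output posted earlier, and 10.31.0.1 is assumed to be router2's LAN address.

```python
import ipaddress


def select_next_hop(destination, routes):
    """Return the next hop of the most specific route containing destination."""
    dst = ipaddress.ip_address(destination)
    matching = [
        (ipaddress.ip_network(prefix), next_hop)
        for prefix, next_hop in routes
        if dst in ipaddress.ip_network(prefix)
    ]
    if not matching:
        raise LookupError(f"no route to {destination}")
    # Longest prefix wins; the default route (prefix length 0) is the last resort.
    _, next_hop = max(matching, key=lambda entry: entry[0].prefixlen)
    return next_hop


ROUTES = [
    ("0.0.0.0/0", "10.31.0.1"),   # default route via router2 (assumed address)
    ("10.31.0.0/18", "on-link"),  # the LAN, delivered directly
]

print(select_next_hop("10.31.10.2", ROUTES))  # on-link
print(select_next_hop("10.31.64.3", ROUTES))  # 10.31.0.1
```

With this table the reply to the VPN client 10.31.64.3 can only take the default route, because nothing more specific covers 10.31.64.0/24.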

In the above picture, I see three subnets:
- the network between LMC client and LMC gateway, said to be "10.31.64.0/24"
- the network between LMC gateway and router2, labeled "LAN" in msg #7, "10.31.0.0/18" according to that msg
- the network between router2 and vsv04, currently no information available

Alternatively it could be:
- the network between LMC client and LMC gateway, said to be "10.31.64.0/24"
- the network between LMC gateway and *both router2 and vsv04*, labeled "LAN" in msg #7, "10.31.0.0/18" according to that msg

If the latter applies, then sending the response from vsv04 to LMCclient would go the following route:
- vsv04 has no explicit route - so takes the default gateway router2
- router2 sees that the forwarded packet leaves the same way it entered the router and thus sends an ICMP redirect to the originating system (vsv04)
- vsv04 ought to remember the redirect and send further packets directly to the proper router (LMC gateway)

If the above, alternative description actually meets your situation, maybe adding an explicit route on vsv04 would help?
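
To make the redirect condition concrete, here is a deliberately simplified Python sketch: a router is expected to send an ICMP redirect when the next hop it would use lies on the same subnet as the host that sent it the packet, i.e. the packet would leave through the interface it arrived on. Real routers check further conditions; `router_would_redirect` is a hypothetical helper and the 192.0.2.10 address is only a documentation-range example.

```python
import ipaddress

LAN = ipaddress.ip_network("10.31.0.0/18")


def router_would_redirect(sender: str, routers_next_hop: str) -> bool:
    """Simplified test: sender and the router's next hop share the LAN."""
    return (
        ipaddress.ip_address(sender) in LAN
        and ipaddress.ip_address(routers_next_hop) in LAN
    )


# vsv04 (10.31.10.4) sends its reply to router2, which would forward it
# to the LMC gateway 10.31.10.2 on the very same LAN:
print(router_would_redirect("10.31.10.4", "10.31.10.2"))  # True

# A sender on some other network would not trigger a redirect here:
print(router_would_redirect("192.0.2.10", "10.31.10.2"))  # False
```

With an explicit route on vsv04, the reply never passes through router2, so no redirect is involved at all. On the server that would be something along the lines of `ip route add 10.31.64.0/24 via 10.31.10.2`; the exact command is an assumption and depends on which LMC gateway should carry the traffic.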

But if the first description applies, or neither of the two, could you please update me? Seems I'm lost somewhere...

Regards,
Jens

jmozdzen
18-Oct-2012, 14:21
> would you mind take a look at my tcpdumps when I get the situation captured?

I'll take a look, but we'll first need to correct my idea of how your network looks, to better meet reality...

Concerning your description, did I get this right:
- user is on the SLES11SP2 server
- tries to copy file to SLES9 samba share (how? Is the share mounted at the SLES11SP2 server? smbclient? ...?)
- SLES9 server is directly attached to LAN (10.31.0.0/18)
- other (working) MSWindows/Ubuntu client is directly attached to 10.31.0.0/18 network?

Regards,
Jens

tkarhu
18-Oct-2012, 19:09
> In the above picture, I see three subnets:
> - the network between LMC client and LMC gateway, said to be "10.31.64.0/24"
> - the network between LMC gateway and router2, labeled "LAN" in msg #7, "10.31.0.0/18" according to that msg
> - the network between router2 and vsv04, currently no information available
> Jens

OK, now I need to take this veeeery slowly, as I drop off otherwise ;).
router2 and vsv04 are in the same 10.31.0.0/18 network, as are the other legs of the LMC gateways (10.31.10.2 and 10.31.11.11)

so if I am still in the same boat, from your choices, it's:

- the network between LMC gateway and *both router2 and vsv04*, labeled "LAN" in msg #7, "10.31.0.0/18" according to that msg

That explicit route idea is splendid. This is a demo project, so any workaround will do just fine.
This of course wouldn't explain why systems other than SLES11SP2 work fine, but what the heck.. off to test!

tkarhu
19-Oct-2012, 10:52
Status update:

VM host vsv04/10.31.10.4/SLES11SP2, which has had an explicit route since yesterday evening, has now pinged for over 4h (a record) in a row. Promising.
Physical host itds/10.31.3.50/SLES11SP2: here I wanted to lose comms first and then add the route to see what happens. It has had requests timing out for 4+ h now. No change even with the route added on the fly


There are now 2 more observations/questions I have (perhaps basic - so bear with me again). Let's focus on that physical box. itds/10.31.3.50 to keep things as simple as possible.

1. I put tcpdump | grep ICMP running to see my echo requests from the Win7/10.31.64.2 (new addr from the pool - gateway for this is gw1/10.31.11.11) behind VPN. What caught my eye looking at tcpdump is:

10:07:28.473471 IP 10.31.64.3 > vsv04: ICMP echo request, id 1, seq 18248, length 40
10:07:28.490802 IP 10.31.64.28 > vsv22: ICMP echo request, id 1792, seq 49222, length 40
10:07:28.500997 IP 10.31.64.5 > vsv01: ICMP echo request, id 1, seq 17074, length 40
10:07:28.501918 IP 10.31.64.6 > 10.31.10.250: ICMP echo request, id 10, seq 53266, length 40
Not being a network specialist, but remembering that a switch's function is to forward each packet to the correct port, I started to wonder why I see completely different LMC clients' traffic to completely different hosts. I called one of our network specialists to clarify, and he said "after the switch learns a host's MAC, only traffic destined to that host (host itds in this case) should be visible".
Do you see any simple reason why I see the above in host itds's tcpdump?

2. For these days I have had ping running on 4 hosts from this VPN client. Our guest wlan only disconnects it every night, other than that ping runs all the time. Ping output Time-to-Live caught my eye.

SLES9: has pinged now 5 days in a row. TTL 62
SLES11SP1: has pinged now 5 days in a row. TTL 62
SLES11SP2 VM: after explicit route ping appears ok. TTL 63. Earlier this week I noticed that the TTL changed from 62 to 63 on the fly while pinging.
SLES11SP2 PHY: doesn't ping even with explicit route added (on the fly). TTL 63 and 63 seen like above VM.

Does TTL play any part in this equation? In particular, why TTL 63 for that SP2 if the client is the same Win7/LMC?

tkarhu
19-Oct-2012, 11:08
The above is partly wrong. The SLES11SP2 physical box also started to ping after the explicit route was added. I messed that part up myself. I tried to delete the above message, but that apparently failed. The TTL and tcpdump observations are still valid though. That 4+ h of pinging is VERY promising now.

jmozdzen
19-Oct-2012, 11:53
> Status update:
>
> VM host vsv04/10.31.10.4/SLES11SP2, which has had an explicit route since yesterday evening, has now pinged for over 4h (a record) in a row. Promising.


This might actually point in the direction of timing-out ICMP redirect entries. But I must admit, my experience in that area isn't too in-depth.


> There are now 2 more observations/questions I have (perhaps basic - so bear with me again). Let's focus on that physical box. itds/10.31.3.50 to keep things as simple as possible.
>
> 1. I put tcpdump | grep ICMP running to see my echo requests from the Win7/10.31.64.2 (new addr from the pool - gateway for this is gw1/10.31.11.11) behind VPN. What caught my eye looking at tcpdump is:

I recommend using
tcpdump -eni eth0 icmp
that way you'll see the MAC addresses as well (and get numeric addresses instead of names, which is helpful in case of wrong DNS entries)



> 10:07:28.473471 IP 10.31.64.3 > vsv04: ICMP echo request, id 1, seq 18248, length 40
> 10:07:28.490802 IP 10.31.64.28 > vsv22: ICMP echo request, id 1792, seq 49222, length 40
> 10:07:28.500997 IP 10.31.64.5 > vsv01: ICMP echo request, id 1, seq 17074, length 40
> 10:07:28.501918 IP 10.31.64.6 > 10.31.10.250: ICMP echo request, id 10, seq 53266, length 40
> Not being a network specialist, but remembering that a switch's function is to forward each packet to the correct port, I started to wonder why I see completely different LMC clients' traffic to completely different hosts. I called one of our network specialists to clarify, and he said "after the switch learns a host's MAC, only traffic destined to that host (host itds in this case) should be visible".
> Do you see any simple reason why I see the above in host itds's tcpdump?

That specialist's answer puts it a bit too plain: A switch will forward traffic for a specific MAC address to the port that the switch believes to lead to that MAC address. (That's a reverse point of view from the one described by your specialist)
If the switch has no clue where to find that specific MAC address, it will broadcast the frame to all ports. And of course, link layer broadcast and multicast frames will be forwarded to all/multiple ports, too. So on any port, you'll see traffic for the MAC address(es) believed to be reachable via that port, then broad- and multicast traffic and packets for destination addresses the switch doesn't know about.
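
The forwarding behaviour described above can be modelled in a few lines of Python. This is a toy model of a learning switch for illustration only; the class name and port numbers are made up, while the two MAC addresses are those of vsv04 and the LMC gateway as shown in the outputs earlier in the thread.

```python
BROADCAST = "ff:ff:ff:ff:ff:ff"


class LearningSwitch:
    """Toy model: learn source MACs, flood frames for unknown destinations."""

    def __init__(self, ports):
        self.ports = list(ports)
        self.mac_table = {}  # MAC address -> port it was last seen on

    def forward(self, in_port, src_mac, dst_mac):
        """Return the list of ports the frame is sent out on."""
        self.mac_table[src_mac] = in_port  # learn where the sender lives
        out_port = self.mac_table.get(dst_mac)
        if dst_mac == BROADCAST or out_port is None:
            # Broadcast or unknown destination: flood to every other port.
            return [port for port in self.ports if port != in_port]
        if out_port == in_port:
            return []  # destination is on the ingress port, nothing to forward
        return [out_port]


switch = LearningSwitch(ports=[1, 2, 3])
vsv04 = "00:50:56:9d:53:c4"
lmc_gw = "00:0d:60:0f:29:50"

print(switch.forward(2, lmc_gw, vsv04))  # [1, 3]  vsv04 unknown: flooded
print(switch.forward(1, vsv04, lmc_gw))  # [2]     gateway already learned
print(switch.forward(2, lmc_gw, vsv04))  # [1]     vsv04 now learned as well
```

Frames seen on a port that are addressed to someone else's MAC address therefore suggest that the switch had no (or no longer a) table entry for that destination at the time.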

An important part in case of the above packets are the link layer addresses (not shown in your output), to which Ethernet adapter were those packets sent? *Then* you can decide why you see these packets in your trace.

Debugging the exact cause might involve looking at the switch's MAC address table, which is far beyond the scope of this discussion. OTOH, another cause could be the LMC VPN gateway and its handling of the packets... but my crystal ball is foggy today ;)



> 2. For these days I have had ping running on 4 hosts from this VPN client. Our guest wlan only disconnects it every night, other than that ping runs all the time. Ping output Time-to-Live caught my eye.
>
> SLES9: has pinged now 5 days in a row. TTL 62
> SLES11SP1: has pinged now 5 days in a row. TTL 62
> SLES11SP2 VM: after explicit route ping appears ok. TTL 63. Earlier this week I noticed that the TTL changed from 62 to 63 on the fly while pinging.
> SLES11SP2 PHY: doesn't ping even with explicit route added (on the fly). TTL 63 and 63 seen like above VM.
>
> Does TTL play any part in this equation? In particular, why TTL 63 for that SP2 if the client is the same Win7/LMC?

TTL gives an indication of the number of routers/hops the packets had to cross on their way. A TTL change indicates a topology change (which may be caused simply by optimized routing tables taking an extra router out of the path).

The default TTL of Linux boxes seems to be 64 - so when you ping the SLES9 and SLES11SP1 systems, the response crosses two routers. One would be the VPN gateway; my guess for the other is the default router (which needn't be in the path at all, as I explained earlier). SLES11SP2 answers are sent directly to the VPN gateway, so only one hop, leading to the remaining TTL of 63.
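
The arithmetic behind this reasoning is simply "initial TTL minus observed TTL". A tiny Python sketch, assuming the replying hosts start at 64 as guessed above (`hops_crossed` is a hypothetical helper name):

```python
INITIAL_TTL = 64  # assumed starting TTL of the replying Linux hosts


def hops_crossed(observed_ttl, initial_ttl=INITIAL_TTL):
    """Each router on the path decrements the TTL by one."""
    return initial_ttl - observed_ttl


print(hops_crossed(62))  # 2  (SLES9 / SLES11SP1: via default router and VPN gateway)
print(hops_crossed(63))  # 1  (SLES11SP2 with direct path to the VPN gateway)
```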

Concerning the 11SP2 physical machine: How do you get a TTL if you receive no ping replies ("SLES11SP2 PHY: doesn't ping")?

I'm under the impression that your network (routing) setup needs some "optimization"... but that would be an on-site job, maybe your network specialist can help out?

Oh, another question: In your original message, you reported that you see only every 5th or 6th ICMP echo request - is this still true when pinging vsv04 with the explicit route? Does the *client* report missing responses? If the latter is the case, it'd be rather interesting to see where the other four/five responses come from...

Regards,
Jens

jmozdzen
19-Oct-2012, 12:10
Hi,

ok, that answers one of my questions.

I had a quick check via Google, and came up with the following:

https://lkml.org/lkml/2012/7/20/428

So it's a known problem (I'd say "imported from the upstream kernel"), probably with an open SR at Novell. You might want to contact your service representative to hook up to that SR...

Regards,
Jens

PS: I'm too old-school... I don't think much of network designs that rely on ICMP redirects... and now I have another reason for that :D Proper static routes will be the workaround.

tkarhu
22-Oct-2012, 07:36
> Oh, another question: In your original message, you reported that you see only every 5th or 6th ICMP echo request - is this still true when pinging vsv04 with the explicit route? Does the *client* report missing responses? If the latter is the case, it'd be rather interesting to see where the other four/five responses come from...
>
> Regards,
> Jens

(Forget about that physical 11SP2 box not pinging after the static route was added. That was complete *bs*, and I couldn't delete the message after I noticed what I had written. The physical box pings just as reliably as the 11SP2 virtual machine. TTL 63 with the static route to the LMC connection manager in place, TTL 62 without.)

I'm now determined to nail this (and prove there's nothing wrong with our network ;)). I removed the static route from the physical box, "itds". This morning it stopped answering pings from the Win client (LMC VPN) after 3 minutes. So, a reverse proof of the fix.
I stopped the client's VPN session and restarted it: still no ping from "itds". I rebooted itds, and only after the reboot did ping come back on the Win7 client.

Now I have

tcpdump -lnni eth0 -w /tmp/itds_2012-10-22.dmp
running. I expect the connection to be lost within a few hours today.
I'll send you the dump, so it's easier to look at the "every 6th echo reply" case. You might also be able to see what's cooking when the connection drop actually happens.

Kind regards,
Timo

tkarhu
22-Oct-2012, 09:08
OK, now it happened again. I can't make anything of it, so I hope it makes sense to you. I'll try to send it to you via private message.

08:57 - 10:05 pinging fine
10:05:49 redirected packet from 10.31.0.1
10:05:49 - 10:06:52 pinging fine
10:06:53 no replies from 11SP2 back to default or LMC gateway
packet skipping, according to Wireshark, is now 16 (if I calculated correctly)

jmozdzen
24-Oct-2012, 15:56
Hi Timo,

I answered in more detail in a private message, but to sum it up for others following this thread:

1. I saw some (to me) unexpected behavior of the ping client: The ICMP sequence numbers were not consecutive, but with varying increments although the request frequency was about 1Hz ( ;), make that one request per second) when responses were received, and 5 seconds when no responses were received. When I could see ICMP requests by the client going to multiple destinations, there was a +1 increment, thus I suspect the client has a system-wide request ID generator.

2. The first ICMP redirect by the router was seen very late, 1000s of seconds into the trace.

3. That redirect wasn't honoured by the server immediately - it took 37 seconds until ICMP echo responses were sent to the LMC gateway instead of the default router!

4. 27 seconds later, the server stopped responding to those specific ICMP echo requests (but answered others). That's roughly a minute, could be an expiring redirect cache entry.
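
As a cross-check, the 37 s and 27 s intervals can be compared with the timestamps Timo posted earlier (redirect seen at 10:05:49, no more replies from 10:06:53). The small Python sketch below only does that arithmetic; it assumes both sets of figures refer to the same capture.

```python
from datetime import datetime

FMT = "%H:%M:%S"
redirect_seen = datetime.strptime("10:05:49", FMT)
replies_stopped = datetime.strptime("10:06:53", FMT)

window = (replies_stopped - redirect_seen).total_seconds()
print(window)   # 64.0 seconds between redirect and silence
print(37 + 27)  # 64, the sum of the two intervals above
```

The two add up to the same 64 seconds, i.e. the "roughly a minute" mentioned in point 4.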

I feel that it's the Linux ICMP redirect bug, referenced above in msg #4. And the "case of the vanishing ICMP echo requests", aka "request id increments by 4", seems to be just unexpected ping client behavior.

Regards,
Jens