PDA

View Full Version : Delayed first network connection



sauerwein
11-Jul-2014, 11:35
Hi,

on an new installed SLES 11 SP3 server (fully pachted) I have a strange phenomenon:

Any first network connection (ssh, vnc, http) results in a wait of about 8 to 20 seconds and then get a reply.
When I close the ssh connection on the client and try another connection (ssh, vnc or http), I immediately get
a response to login / the web page.

I replaced the network card, nothing helps. We have a pure IPv4 network. Client and server are in the same subnet.

For the network specialists: I montitored the network traffic between the client (136.199.34.34 bib-edv34.uni-trier.de)
and the server (136.199.34.247 ub-zcm.uni-trier.de) with an ssh connection:

-----------------------------------------------------------------------------------------------------------------
ub-zcm:/tmp # netstat -pantc | grep 136.199.34.34
tcp 0 21 136.199.34.247:22 136.199.34.34:1304 VERBUNDEN 29658/sshd: [accept
tcp 0 21 136.199.34.247:22 136.199.34.34:1304 VERBUNDEN 29658/sshd: [accept
tcp 0 21 136.199.34.247:22 136.199.34.34:1304 VERBUNDEN 29658/sshd: [accept
tcp 0 21 136.199.34.247:22 136.199.34.34:1304 VERBUNDEN 29658/sshd: [accept
tcp 0 21 136.199.34.247:22 136.199.34.34:1304 VERBUNDEN 29658/sshd: [accept
tcp 0 21 136.199.34.247:22 136.199.34.34:1304 VERBUNDEN 29658/sshd: [accept
tcp 0 21 136.199.34.247:22 136.199.34.34:1304 VERBUNDEN 29658/sshd: [accept
tcp 0 21 136.199.34.247:22 136.199.34.34:1304 VERBUNDEN 29658/sshd: [accept
tcp 0 21 136.199.34.247:22 136.199.34.34:1304 VERBUNDEN 29658/sshd: [accept
tcp 0 0 136.199.34.247:22 136.199.34.34:1304 VERBUNDEN 29658/sshd: root [p
tcp 0 0 136.199.34.247:22 136.199.34.34:1304 VERBUNDEN 29658/sshd: root [p

ub-zcm:/etc/init.d # tcpdump host 136.199.34.34
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on p6p1, link-type EN10MB (Ethernet), capture size 96 bytes
07:19:29.743860 arp who-has ub-zcm.uni-trier.de tell bib-edv34.uni-trier.de
07:19:29.743877 arp reply ub-zcm.uni-trier.de is-at 00:0a:f7:3e:39:20 (oui Unknown)
07:19:29.744049 IP bib-edv34.uni-trier.de.boomerang > ub-zcm.uni-trier.de.ssh: S 2026223864:2026223864(0) win 65535 <mss 1460,nop,nop,sackOK>
07:19:29.744140 IP ub-zcm.uni-trier.de.ssh > bib-edv34.uni-trier.de.boomerang: S 4197396408:4197396408(0) ack 2026223865 win 14600 <mss 1460,nop,nop,sackOK>
07:19:29.744304 IP bib-edv34.uni-trier.de.boomerang > ub-zcm.uni-trier.de.ssh: . ack 1 win 65535
07:19:29.776946 IP ub-zcm.uni-trier.de.ssh > bib-edv34.uni-trier.de.boomerang: P 1:22(21) ack 1 win 14600
07:19:32.783125 IP ub-zcm.uni-trier.de.ssh > bib-edv34.uni-trier.de.boomerang: P 1:22(21) ack 1 win 14600
07:19:34.759187 arp who-has bib-edv34.uni-trier.de tell ub-zcm.uni-trier.de
07:19:34.759332 arp reply bib-edv34.uni-trier.de is-at 00:1e:4f:f4:8b:90 (oui Unknown)
07:19:38.791128 IP ub-zcm.uni-trier.de.ssh > bib-edv34.uni-trier.de.boomerang: P 1:22(21) ack 1 win 14600
07:19:38.791291 IP bib-edv34.uni-trier.de.boomerang > ub-zcm.uni-trier.de.ssh: . ack 22 win 65514
07:19:38.850267 IP bib-edv34.uni-trier.de.boomerang > ub-zcm.uni-trier.de.ssh: P 1:639(638) ack 22 win 65514
07:19:38.850313 IP ub-zcm.uni-trier.de.ssh > bib-edv34.uni-trier.de.boomerang: . ack 639 win 15312
07:19:38.852104 IP ub-zcm.uni-trier.de.ssh > bib-edv34.uni-trier.de.boomerang: . 22:1482(1460) ack 639 win 15312
07:19:38.852124 IP ub-zcm.uni-trier.de.ssh > bib-edv34.uni-trier.de.boomerang: P 1482:1566(84) ack 639 win 15312
07:19:38.852359 IP bib-edv34.uni-trier.de.boomerang > ub-zcm.uni-trier.de.ssh: . ack 1566 win 65535
07:19:38.852818 IP bib-edv34.uni-trier.de.boomerang > ub-zcm.uni-trier.de.ssh: P 639:655(16) ack 1566 win 65535
07:19:38.859502 IP ub-zcm.uni-trier.de.ssh > bib-edv34.uni-trier.de.boomerang: P 1566:1846(280) ack 655 win 15312
07:19:38.961925 IP bib-edv34.uni-trier.de.boomerang > ub-zcm.uni-trier.de.ssh: P 655:927(272) ack 1846 win 65255
07:19:38.972329 IP ub-zcm.uni-trier.de.ssh > bib-edv34.uni-trier.de.boomerang: P 1846:2694(848) ack 927 win 16588
07:19:39.163815 IP bib-edv34.uni-trier.de.boomerang > ub-zcm.uni-trier.de.ssh: P 927:943(16) ack 2694 win 64407
07:19:39.203127 IP ub-zcm.uni-trier.de.ssh > bib-edv34.uni-trier.de.boomerang: . ack 943 win 16588
07:19:39.203303 IP bib-edv34.uni-trier.de.boomerang > ub-zcm.uni-trier.de.ssh: P 943:995(52) ack 2694 win 64407
07:19:39.203343 IP ub-zcm.uni-trier.de.ssh > bib-edv34.uni-trier.de.boomerang: . ack 995 win 16588
07:19:39.203407 IP ub-zcm.uni-trier.de.ssh > bib-edv34.uni-trier.de.boomerang: P 2694:2746(52) ack 995 win 16588
07:19:39.225987 IP bib-edv34.uni-trier.de.boomerang > ub-zcm.uni-trier.de.ssh: P 995:1063(68) ack 2746 win 64355
07:19:39.228870 IP ub-zcm.uni-trier.de.ssh > bib-edv34.uni-trier.de.boomerang: P 2746:2814(68) ack 1063 win 16588
07:19:39.229640 IP bib-edv34.uni-trier.de.boomerang > ub-zcm.uni-trier.de.ssh: P 1063:1163(100) ack 2814 win 64287
07:19:39.237201 IP ub-zcm.uni-trier.de.ssh > bib-edv34.uni-trier.de.boomerang: P 2814:2882(68) ack 1163 win 16588
07:19:39.503134 IP ub-zcm.uni-trier.de.ssh > bib-edv34.uni-trier.de.boomerang: P 2814:2882(68) ack 1163 win 16588
07:19:39.503311 IP bib-edv34.uni-trier.de.boomerang > ub-zcm.uni-trier.de.ssh: . ack 2882 win 64219
^C
31 packets captured
31 packets received by filter
0 packets dropped by kernel
-----------------------------------------------------------------------------------------------------------------

The Send-Q ist filled about 9 seconds and then empty. This is exact the time I have to wait for getting the password
prompt.

Does somebody know this phenomenon?
How ca I analyse the Send-Q problem?

If I transfer a 3 GB file to the server, it interupts after about 1.4 GB with a network error.

I have no ideas anymore to solve the problem.

Regards
Harald Sauerwein

jmozdzen
14-Jul-2014, 12:57
Hi Harald,

> Any first network connection (ssh, vnc, http) results in a wait of about 8 to 20 seconds and then get a reply.
> When I close the ssh connection on the client and try another connection (ssh, vnc or http), I immediately get
> a response to login / the web page.

such descriptions trigger my "DNS reverse resolution" association :)

Is the target server able to resolve the client's IP address to a matching host name in time? For a test, you might want to enter an according entry to /etc/hosts, make sure "files" are checked before "DNS", and rerun your test. Or simply try to resolve a new client's IP address from the server command line before the client's first connection attempt.

Once the first DNS query was run, the result is most probably cached on the server (by nscd), hence the speedier reaction of the server on future requests.

The aborting file transfer would have nothing to do with DNS, though.

Regards,
Jens

sauerwein
15-Jul-2014, 11:51
Hi Jens,

DNS requests are fast and the client's IP name and or address is resolved in time. I also tested to add the client in the hosts file, same behaviour.
With ' make sure "files" are checked before "DNS" ' you mean the entries in nsswitch.conf? They are:
hosts:files dns
networks: files dns

Do you have any idea how to debug my problem?

Regards,
Harald

jmozdzen
15-Jul-2014, 12:33
Hi Harald,

Hi Jens,

DNS requests are fast and the client's IP name and or address is resolved in time. I also tested to add the client in the hosts file, same behaviour.
With ' make sure "files" are checked before "DNS" ' you mean the entries in nsswitch.conf? They are:
hosts:files dns
networks: files dns

yes, all that is like I meant it - I sorry that this didn't solve the problem.


Do you have any idea how to debug my problem?

A key element is to understand what is happening where. From your initial report:

ub-zcm:/etc/init.d # tcpdump host 136.199.34.34
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on p6p1, link-type EN10MB (Ethernet), capture size 96 bytes
07:19:29.743860 arp who-has ub-zcm.uni-trier.de tell bib-edv34.uni-trier.de
07:19:29.743877 arp reply ub-zcm.uni-trier.de is-at 00:0a:f7:3e:39:20 (oui Unknown)
07:19:29.744049 IP bib-edv34.uni-trier.de.boomerang > ub-zcm.uni-trier.de.ssh: S 2026223864:2026223864(0) win 65535 <mss 1460,nop,nop,sackOK>
07:19:29.744140 IP ub-zcm.uni-trier.de.ssh > bib-edv34.uni-trier.de.boomerang: S 4197396408:4197396408(0) ack 2026223865 win 14600 <mss 1460,nop,nop,sackOK>
07:19:29.744304 IP bib-edv34.uni-trier.de.boomerang > ub-zcm.uni-trier.de.ssh: . ack 1 win 65535
07:19:29.776946 IP ub-zcm.uni-trier.de.ssh > bib-edv34.uni-trier.de.boomerang: P 1:22(21) ack 1 win 14600
07:19:32.783125 IP ub-zcm.uni-trier.de.ssh > bib-edv34.uni-trier.de.boomerang: P 1:22(21) ack 1 win 14600
07:19:34.759187 arp who-has bib-edv34.uni-trier.de tell ub-zcm.uni-trier.de
07:19:34.759332 arp reply bib-edv34.uni-trier.de is-at 00:1e:4f:f4:8b:90 (oui Unknown)
07:19:38.791128 IP ub-zcm.uni-trier.de.ssh > bib-edv34.uni-trier.de.boomerang: P 1:22(21) ack 1 win 14600
07:19:38.791291 IP bib-edv34.uni-trier.de.boomerang > ub-zcm.uni-trier.de.ssh: . ack 22 win 65514
07:19:38.850267 IP bib-edv34.uni-trier.de.boomerang > ub-zcm.uni-trier.de.ssh: P 1:639(638) ack 22 win 65514
07:19:38.850313 IP ub-zcm.uni-trier.de.ssh > bib-edv34.uni-trier.de.boomerang: . ack 639 win 15312

If I read this right, the trace was taken on the server side ("ub-zcm") and is filtering traffic concerning the client host (136.199.34.34, "bib-edv34").

- The client send an ARP request to get to the server (so they are on the same physical network/broadcast domain) and receives an answer
- The initial session establishment takes place (up to time stamp 07:19:29.776946)
- then somehow, the packets from the server to the client get dropped:

07:19:29.776946 IP ub-zcm.uni-trier.de.ssh > bib-edv34.uni-trier.de.boomerang: P 1:22(21) ack 1 win 14600
07:19:32.783125 IP ub-zcm.uni-trier.de.ssh > bib-edv34.uni-trier.de.boomerang: P 1:22(21) ack 1 win 14600
07:19:34.759187 arp who-has bib-edv34.uni-trier.de tell ub-zcm.uni-trier.de
07:19:34.759332 arp reply bib-edv34.uni-trier.de is-at 00:1e:4f:f4:8b:90 (oui Unknown)
07:19:38.791128 IP ub-zcm.uni-trier.de.ssh > bib-edv34.uni-trier.de.boomerang: P 1:22(21) ack 1 win 14600
07:19:38.791291 IP bib-edv34.uni-trier.de.boomerang > ub-zcm.uni-trier.de.ssh: . ack 22 win 65514
You can see that the server's ack is initially sent on 07:19:29.776946, then retransmitted on 07:19:32.783125. The server then re-runs an ARP sequence for bib-edv34, retransmits the ACK another time (07:19:38.791128) and now receives the proper response(s) immediately (07:19:38.791291).

There may be several reasons for this. First simple check would be a trace on bib-edv34, to see what is received, and when - do the packets from the server get dropped (do not reach the client), or are the client's responses dropped on their way to the server?

I'm *asssuming* that the MAC address (as returned by the ARP sequence at 07:19:34.759332) is the same as during the initial communication.

Next step, after determining which packets actually get dropped, would be to check the network switch(es) for details on the according network path and indications of errors. Port statistics (any errors on the ports along the path), ARP storms, unexpected adapter relocations, ... anything unusual.

Regards,
Jens

sauerwein
18-Jul-2014, 11:16
Hi Jens,

the problem is solved!
After creating wireshark network captures on client and server at the same time we saw:
2 13:43:40.975 Dell_f4:8b:90 Broadcast ARP Who has 136.199.34.247? Tell 136.199.34.34
3 13:43:40.976 Broadcom_3e:39:20 Dell_f4:8b:90 ARP 136.199.34.247 is at 00:0a:f7:3e:39:20
...
8 13:43:41.006 DellEsgP_f6:e6:48 Dell_f4:8b:90 ARP 136.199.34.247 is at 00:13:72:f6:e6:48

The problem was an old DELL PowerEdge 2800 server where the remote access card had the same IP as ub-zcm.
The new dell PowerEdge servers do respond to a ping at there iDRAC remote access cards IP, the old ones do not.

Normally when using an dupklicat ip I'm used to be getting informed with messages like "duplicate ip" or something else.

Thanks a lot for your proactivity and tips the help me solving this unusual problem!

Kind regards
Harald Sauerwein

jmozdzen
21-Jul-2014, 11:58
Hi Harald,

great to see you got it working, and thank you for reporting back the root cause!

Duplicate IP addresses can be a *****, fortunately you're one of those that can read a tcpdump trace to identify the issue... and that duplicate IP is most probably also the cause of the aborting transfer you mentioned in your initial message.

With regards
Jens