Firewall and MPI over InfiniBand switch



bta
03-Mar-2015, 14:01
Dear forum,

I'm having a bit of trouble with my firewall and MPI configuration. I have two nodes of a cluster connected with both Ethernet and InfiniBand that should run a simulation together, with either Platform MPI or Intel MPI, but it doesn't work (duh). From what I've seen, it looks as if the firewall is still blocking ports, even though it's switched off, the interfaces are assigned to the internal zone, and I've explicitly opened ports.
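A quick way to double-check that the firewall really is out of the picture is to look at the rules the kernel is actually enforcing, rather than trusting the configuration tool. A sketch, assuming SuSEfirewall2 (the usual setup on SUSE; adjust for your distro):

```shell
# Show the live iptables rules with packet counters -- if a DROP/REJECT
# rule's counters are increasing while the MPI job fails, the firewall
# is still in the way despite being "off":
iptables -L -n -v

# On SUSE, stop SuSEfirewall2 entirely to rule it out:
SuSEfirewall2 stop
```

If the INPUT chain is empty (or policy ACCEPT) on both nodes and the job still fails, the problem is elsewhere.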

Here's the result of a port scan from slave to master:


n001:~ # netcat -zv 192.168.20.1 1-50000
admin-ib.default.domain [192.168.20.1] 44322 (pmcdproxy) open
admin-ib.default.domain [192.168.20.1] 44321 (pmcd) open
admin-ib.default.domain [192.168.20.1] 43483 (?) open
admin-ib.default.domain [192.168.20.1] 39828 (?) open
admin-ib.default.domain [192.168.20.1] 37392 (?) open
admin-ib.default.domain [192.168.20.1] 37194 (?) open
admin-ib.default.domain [192.168.20.1] 35288 (?) open
admin-ib.default.domain [192.168.20.1] 15007 (?) open
admin-ib.default.domain [192.168.20.1] 15004 (pbs_sched) open
admin-ib.default.domain [192.168.20.1] 15001 (pbs) open
admin-ib.default.domain [192.168.20.1] 7890 (?) open
admin-ib.default.domain [192.168.20.1] 4673 (cxws) open
admin-ib.default.domain [192.168.20.1] 4672 (rfa) open
admin-ib.default.domain [192.168.20.1] 2049 (nfs) open
admin-ib.default.domain [192.168.20.1] 777 (multiling-http) open
admin-ib.default.domain [192.168.20.1] 737 (?) open
admin-ib.default.domain [192.168.20.1] 682 (xfr) open
admin-ib.default.domain [192.168.20.1] 111 (sunrpc) open
admin-ib.default.domain [192.168.20.1] 22 (ssh) open

The MPI job fails with the following error message (XXX is my master):


[proxy:0:1@n001] HYDU_sock_connect (./utils/sock/sock.c:227): unable to get host address for XXX (2)
[proxy:0:1@n001] main (./pm/pmiserv/pmip.c:396): unable to connect to server XXX at port 34272 (check for firewalls!)

The odd thing is that if I run the port scan again, port 34272 is now listed as open. That doesn't help, though, since MPI picks random ports each time it starts.
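As an aside, the random-port problem can be worked around: both MPICH/Hydra-based launchers and Intel MPI let you pin the TCP port range they use, so a fixed firewall rule becomes possible. A sketch, assuming the documented `I_MPI_PORT_RANGE` (Intel MPI) and `MPIR_CVAR_CH3_PORT_RANGE` (MPICH) variables; the exact names may vary with your version, so check your MPI's reference manual:

```shell
# Restrict the ports MPI may listen on (format min:max):
export I_MPI_PORT_RANGE=60000:60100          # Intel MPI
export MPIR_CVAR_CH3_PORT_RANGE=60000:60100  # MPICH/Hydra

# Then open only that range on the master, e.g. with iptables:
iptables -A INPUT -p tcp --dport 60000:60100 -j ACCEPT
```

That way the "check for firewalls!" hint from Hydra can actually be acted on.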

What's going on? Is some other part of the system blocking ports (I can close all ports with my firewall, but I can't seem to open additional ones)? Could it be the switch?

Cheers and thanks!

bta
03-Mar-2015, 17:32
Wild goose chase... My master node was not in /etc/hosts, at least not under the proper host name. I also learned that a port scan lists a port as open only if something is actually listening on it, which of course explains why the port showed up as open once the MPI program had started. Duh.
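For anyone hitting the same HYDU_sock_connect "unable to get host address" error: make sure the master's host name resolves on every compute node. A minimal sketch (host names are placeholders, the address is the InfiniBand IP from the scan above):

```shell
# /etc/hosts on each node -- map the master's name to the address
# the compute nodes should use:
#   192.168.20.1   master   admin-ib

# Verify that the name from the error message resolves on the slave:
getent hosts master

# And that the slave can actually reach a listener there:
netcat -zv master 22
```

If `getent` comes back empty, the Hydra proxy fails exactly as shown above, no matter what the firewall does.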