Slow name resolution on linked services

I have a service (called printing) which is a Java application running in Tomcat, used to print things via CUPS. I believe this is somewhat irrelevant, but I'm including it for the sake of completeness. The container runs Tomcat and has the CUPS client configured with the following in /etc/cups/client.conf:

ServerName cups:631

The service is linked to a cups service which is a container running the CUPS daemon.

When I start my printing service, it takes a long time before things work. I have a REST call I can make to get the list of printers: Java queries the local CUPS client, which returns the printers available on the remote CUPS server.

If the CUPS client is configured with a server reached via its FQDN (e.g. cups.service.example.com), resolved through my external DNS, that works totally fine.

But if the CUPS client is configured as above, to access the CUPS server through a container link (a.k.a. a service link in Rancher), then it takes a long time before the data is retrieved.

If I pull up the printing service container's console and do wget http://cups:631, it takes 8 seconds for the name resolution:

root@3e5638768eca:/tmp# wget http://cups:631                                    
--2015-10-27 16:32:31--  http://cups:631/                                       
Resolving cups (cups)... 10.42.130.83                                           
Connecting to cups (cups)|10.42.130.83|:631... connected.                       
HTTP request sent, awaiting response... 200 OK                                  
Length: 3697 (3.6K) [text/html]                                                 
Saving to: ‘index.html.1’                                                   
                                                                                
index.html.1        100%[=====================>]   3.61K  --.-KB/s   in 0s      
                                                                                
2015-10-27 16:32:39 (131 MB/s) - ‘index.html.1’ saved [3697/3697]           
                                                                                
root@3e5638768eca:/tmp#             

And that 8 seconds is more or less constant, i.e. it's not just the first hit.

The CUPS service is in a different stack than the printing webapp service.

On the other hand, if I wget http://cups.service.example.com:631, i.e. go through my network DNS instead of the service link, it's instantaneous.

I just realized that since the 8 seconds is constant, it looks to me like a timeout. Could it be that it first tries to query my DNS server, times out, and then tries Rancher's service discovery?

I’m unsure how to go about debugging this.

OK, I did some more tests: if I do ping cups, it takes 8 seconds to get started (name lookup timeout), but if I do ping cups.rancher.internal, it's instant.
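For anyone wanting to quantify this without wget, timing the lookups directly shows the same pattern (a quick sketch; getent should be present in any glibc-based image and goes through the standard libc resolver):

# inside the printing container
time getent hosts cups                    # ~8s: something in the lookup path is timing out
time getent hosts cups.rancher.internal   # near-instant: answered directly by Rancher's DNS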

The resolv.conf file has search example.com rancher.internal in it. That's because my host servers have Docker configured with a DNS search suffix and a DNS server entry. I noticed in resolv.conf that Rancher disabled the DNS server set by Docker but left the search suffix.
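So the container's /etc/resolv.conf ends up looking roughly like this (nameserver address illustrative; it points at the Rancher-managed DNS):

nameserver 10.42.x.x                   # Rancher's DNS (actual address varies)
search example.com rancher.internal    # my Docker suffix first, Rancher's last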

I would say the sequence should be the opposite if Rancher is going to keep my DNS suffix: it would be better to have search rancher.internal example.com instead.

I thought the problem was that I was starting the Docker daemon with --dns-search example.com, but when I took that out, it simply picked up the DNS search suffixes from the host operating system.
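For context, the daemon config on my hosts lived in something like this (path and server address hypothetical):

# /etc/default/docker on the hosts
DOCKER_OPTS="--dns 10.0.0.2 --dns-search example.com"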

So the issue really is that Rancher should put rancher.internal before the host’s DNS suffix.

For now I can work around this by starting the service with its DNS Suffix option set to rancher.internal.
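At the plain Docker level, that workaround is equivalent to something like this (image name hypothetical):

docker run --dns-search rancher.internal my-printing-image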

I'm not sure I agree; we are last in the list because you may have unqualified internal names that could get stomped on if someone happens to link a service with the same name. In that case you can use .rancher.internal explicitly to fix it, as you said, rather than figuring out what the FQDN for the name is in your regular DNS.

What is weird is that checking service.example.com should fail very quickly before moving on to service.rancher.internal. It sounds like the recursive resolvers configured on the host/in the network agent aren't actually reachable, so instead of getting back NXDOMAIN immediately, we're waiting for them to time out.

You are 100% right on the sequence; it's better that rancher.internal be last. That really makes sense; I was just looking at my immediate problem.

I'll have a look at my DNS servers; I think you're right that something is taking too long there.

Thanks for all your work - I love your product!

If you open up a console on the Network Agent on one of the hosts, the config file for the DNS server is in /var/lib/cattle/etc/cattle/dns/answers.json.

There will be a section for each container on the host, but unless you're setting resolvers on specific containers, the "recurse" arrays should all be the same; those are the servers we try to resolve the name on first. I believe the timeout is 2 seconds, so your 8-second delay presumably means four failed attempts: either 4 recurser IPs, or 2 IPs * 2 search paths tried before rancher.internal.
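The shape of the file is roughly this (a sketch; IPs made up, and the real file carries more per-container detail):

{
  "10.42.130.83": {
    "recurse": ["10.0.0.2:53", "10.0.0.3:53"]
  }
}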

Then you can do apt-get install dnsutils to get dig installed and try dig @<one of the recurse IPs> service.
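e.g., with a made-up recurser IP, checking whether the first search-path candidate fails fast:

apt-get update && apt-get install -y dnsutils
dig @10.0.0.2 cups.example.com    # a healthy server returns NXDOMAIN in milliseconds;
                                  # if this hangs for ~2s per try, you've found the delay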

Just to follow up - it turns out the DNS servers (which are Windows) were set up with “Use WINS forward lookup” pointing to two servers which don’t run WINS…

Thanks - you not only solved a problem I was having with Rancher, you solved a problem on my network :slight_smile:

Lookups are basically instantaneous now; the bare service name (without suffix) resolves immediately.

Cool, glad you got it worked out.