Name Resolution Issue Due To Cache Inconsistencies In CoreDNS: Inside The Mind Of A Problem Solver

CoreDNS In Kubernetes
In upcoming days CoreDNS 1.5.1 will be released, which will include a fix for an interesting name resolution issue in the Autopath plugin found by Andras Spitzer  (AAA Minds) – aka Sendai – a Snr. Site Reliability Engineer at Curve with 20 years experience engineering, building and automating wildly diverse systems at world class corporate and startup settings alike. Here Sendai takes us step by step through his investigation. 
 
If you’re using CoreDNS in Kubernetes with cache and autopath plugins enabled, please upgrade as all versions prior 1.5.1 are affected.
 
It feels great to contribute to the Open Source community. We also want to thank the #CoreDNS team who were extremely helpful and responsive during the PR.

The story of our investigation

Symptom

We spotted from our logs that were experiencing issues relating to curl and Twilio’s API, with the following error message:

 curl: (6) Could not resolve host: api.twilio.com

This originally hit us as an exception in one of our Services – the Twilio SDK failed to connect to api.twilio.com.

 message: Services_Twilio_TinyHttpException: Could not resolve host: api.twilio.com in /opt/project/source/vendor/twilio/sdk/Services/Twilio/TinyHttp.php:119

Once we tracked down where this comes from, we saw that the Twilio SDK is using curl to connect to api.twilio.com. These are the the failure events in the past month:

Screenshot 2019-05-25 at 12.25.14

The majority of these cases are related to api.twilio.com, but this issue is not limited to api.twilio.com, or even to curl. This is a DNS resolution problem which affects any external hostname. We even see occasions where this issue impacts resolving our S3 buckets as well.

On 16th May we implemented a workaround by defining static IPs for api.twilio.com, which dropped the visible failure count significantly, but this is just a temporary solution and also won’t prevent from other external names to fail within our cluster.

Scope

In May alone, we had 2378 occasions – that’s around 80 per day – when we failed to resolve a hostname using curl. This is just curl, so it’s most likely we had more resolution failures than this.

Investigation

First and foremost if I want to fix a problem, I have to be able to reproduce the symptom. If I can’t reproduce it, I can’t succeed. I can’t test it, I can’t analyse it, and worst  of all , I can’t provide proof that my solution works.

So as the first step, I always try to isolate the symptom and replicate it an a controlled environment. Since this issue hit us in production, my tools are limited of what I can use to analyse the case. I was quickly able to track down that the exception we saw from our service, through the twilio SDK is actually a curl call. I can call curl outside of our service, so that was step one complete!

At that time I knew nothing about the issue, so I wanted to stay as close as possible to the original problem, so I couldn’t rule out anything. I started testing within the service, using curl. I wrote a small shell script, which calls curl every 5 seconds. I could have had a more frequent call, but I chose 5 seconds to make sure I wasn’t putting load on the production infrastructure.

When I called curl https://api.twilio.com, I got this:

 <?xml version='1.0' encoding='UTF-8'?>
 <TwilioResponse><Versions><Versions><Version><Name>2010-04-01</Name><Uri>/2010-04-01</Uri><SubresourceUris><Accounts>/2010-04-01/Accounts</Accounts></SubresourceUris></Version></Versions></Versions></TwilioResponse>

Test script

That’s not bad but too long for effective analysis. I knew that I had to analyse hours, maybe days of data, so I had to cut out any noise I could so I already had a filtered result to work with. So, I came up with this tiny script:

 #!/bin/sh
 
 while true
 do
  date | tr '\n' ' '
  curl https://api.twilio.com 2>&1 | egrep "TwilioResponse|Could not resolve" | sed 's/^.*curl/curl/g;s/<Versions>.*$//g'
  sleep 5
 done

I called it test1.sh because I’m creative.

This helped me to get a single keywork when we can query Twilio, and the curl error message when we can’t. For example:

 root@testimg7:/var/tmp# ./test1.sh 
 Sat May 25 00:58:10 UTC 2019 <TwilioResponse>
 Sat May 25 00:58:15 UTC 2019 <TwilioResponse>
 Sat May 25 00:58:21 UTC 2019 <TwilioResponse>
 Sat May 25 00:58:26 UTC 2019 <TwilioResponse>
 ...

I could work with that. It’s nice and tidy. So, I started to run my script in the service Pod. One step forward so I can’t move away from production while replicating the problem. Within minutes, this is what I got:

 Sat May 25 00:58:37 UTC 2019 <TwilioResponse>
 Sat May 25 00:58:42 UTC 2019 <TwilioResponse>
 Sat May 25 00:58:47 UTC 2019 <TwilioResponse>
 Sat May 25 00:58:53 UTC 2019 curl: (6) Could not resolve host: api.twilio.com
 Sat May 25 00:58:58 UTC 2019 curl: (6) Could not resolve host: api.twilio.com
 Sat May 25 00:59:03 UTC 2019 curl: (6) Could not resolve host: api.twilio.com
 Sat May 25 00:59:08 UTC 2019 <TwilioResponse>
 Sat May 25 00:59:13 UTC 2019 <TwilioResponse>
 Sat May 25 00:59:18 UTC 2019 <TwilioResponse>
 Sat May 25 00:59:24 UTC 2019 <TwilioResponse>
 Sat May 25 00:59:29 UTC 2019 <TwilioResponse>
 Sat May 25 00:59:34 UTC 2019 <TwilioResponse>

It was time for celebration, as I achieved my first goal: replicate the issue without the service/twilio SDK.

Istio

Since our service is in the default namespace, and this namespace is Istio enabled, my first guess was that the issue could be related to Istio. Even though Istio is a TCP proxy and doesn’t process UDP at all, Istio is usually among our top candidates when it comes to potential root cause of bizarre behaviours.

Since I was at the very beginning of my journey, I knew that I could even shoot in the dark. That’s not really my style but it helps to identify the scope of the issue, in other words who’s affected? Once I can find out the components that are consistently affected, I can start looking for their common pattern, which supposedly gets me closer to the root cause. Also, shooting in the dark is fun. It’s important to make the investigation fun so our brains enjoy solving it. Joy is a great motivator for anything brain related. Or at least, that’s what I was told….

I spun up two Pods for testing. I  wanted to see if we experienced this even outside of Istio, and outside of clientapi. One pod was in the sre namespace (Istio disabled) the other was next to our service, in the default namespace.

To my surprise, the Pod in the sre namespace didn’t show any symptoms, but the the one in the default namespace did. Haha, Istio. I knew it! I always warn myself not to fall in love with my ideas, so how could I confirm that it was indeed Istio? By spinning up a Pod as similar as the the istio enabled ones, but not istio enabled. Got it! So, I have to spin up a pod in the default namespace, but opt-out of the Istio injection.

‘Cool’ I thought ‘let’s do this quickly, and claim victory of the mighty Logic.’

We can opt-out of Istio with the annotation sidecar.istio.io/inject: "off", so I spun up my Pod, ran my sweet test1.sh script… and it kept producing  the symptom.

What?

So I was experiencing this even outside of Istio in the default namespace, but I wasn’t experience this in the sre namespace? How was that possible? Also, God why you hate me?! So it’s not Istio. Fair enough.

Ok, so I knew it was not Istio. How did the namespace comes into play? No idea. I know that the Pods are using their namespace as part of the DNS resolution, but this still didn’t add up. So, knew that I had to dig deeper. Install wireshark, pick up strace, trace, maybe even gdb. Luckily I already have a test image built for such purposes. I call it testimg, because I’m creative.

I also knew that I needed a lab to play with, and since don’t want to risk any production impact, I wanted to move this off of our production cluster. So, I tested this symptom in our lower environments. dev, impacted.. GOOD. feat1… not. stag, neither. What #2? How?

I set up my test script in the Kubernetes node itself, which hosts the Pod. It never made a mistake, always received the correct answer.

I set up a second test script which ran in parallel with my twilio test. It was querying Amazon S3, a different target. I was hoping to see to fail resolving at the same time when twilio fails to resolve. It wasn’t. Even when twilio failed to resolve, Amazon S3 resolution was working fine.

Anyway, at least luckily I had the symptom in dev, where I could do whatever I wanted to hunt this down. So, I moved to dev. Let’s see which components are suspect so far:

  • I can rule out Istio. Istio has an alibi. It was busy preparing 503 packages for Christmas.

  • CoreDNS

  • Upstream DNS (Amazon)

  • Networking/Calico

  • Node

  • curl

  • source Linux Kernel

First enabled logging in CoreDNS to find out if I see anything useful, but it was too noisy I couldn’t make any sense of it. Also, I couldn’t trust any of the components listed above, so I also wanted to have an extra pair of eyes on the DNS traffic. Wanted to run Wireshark both inside CoreDNS and the source Pod, to see if we’re losing any packets. Wanted full control.

Clone wars

So, I could replicate the problem, now I had to isolate it as much as possible reducing noise so I can inspect each component one by one, and find out who’s the murderer. So, I set up a second DNS service called test-kube-dns (the original is called kube-dns), which had a single Pod for the sake of simplicity. That pod is running my testing with all the tools. I fetched CoreDNS and ran it manually. Server side is ready for testing.

It was nice to have test server, but we needed a Pod which will be configured to use the test server. Normally resolv.conf is configured by Kubernetes, but we can override that, so I defined this for pod of my test Pods in sre and the default.

  dnsConfig:
    nameservers:
    - 101.79.112.130
    options:
    - name: ndots
      value: "5"
    searches:
    - default.svc.cluster.local
    - svc.cluster.local
    - cluster.local
    - eu-west-1.compute.internal

I was happy with the test environment. I had a test server with all the tools I need and 2 test pods, one in sre another one in default, none of them Istio enabled.

Let’s start testing.

Have faith

At this stage I prayed to see the symptom again, because that would be a huge step ahead in my analysis. So, I started to run my test script. I waited… 5 minutes. Nothing. 15 minutes. Nothing. I started to get nervous. 30 minutes. No symptom.

As a desperate move I started my test script in my other test Pod as well. I left my other test on. After a few minutes, I started to see:

 Sat May 25 03:29:44 UTC 2019 <TwilioResponse>
 Sat May 25 03:29:49 UTC 2019 <TwilioResponse>
 Sat May 25 03:29:54 UTC 2019 curl: (6) Could not resolve host: api.twilio.com
 Sat May 25 03:29:59 UTC 2019 <TwilioResponse>
 Sat May 25 03:30:05 UTC 2019 curl: (6) Could not resolve host: api.twilio.com
 Sat May 25 03:30:10 UTC 2019 <TwilioResponse>
 Sat May 25 03:30:15 UTC 2019 curl: (6) Could not resolve host: api.twilio.com
 Sat May 25 03:30:20 UTC 2019 curl: (6) Could not resolve host: api.twilio.com
 Sat May 25 03:30:25 UTC 2019 curl: (6) Could not resolve host: api.twilio.com
 Sat May 25 03:30:30 UTC 2019 <TwilioResponse>
 Sat May 25 03:30:35 UTC 2019 curl: (6) Could not resolve host: api.twilio.com

The odd thing was that the failures were in sync with each other. If I ran only one of the test, either of them, I had no problem. But when I ran both of them, I started to see the symptom again.

So, I my attention moved towards CoreDNS. It should be CoreDNS.

CoreDNS

We use this configuration for CoreDNS:

 .:53 {
     log
     errors
     health
     kubernetes cluster.local. in-addr.arpa ip6.arpa {
       pods verified
       upstream
       fallthrough in-addr.arpa ip6.arpa
    }
     prometheus :9153
     proxy . /etc/resolv.conf
     loop
     cache 30
     loadbalance
     reload
     autopath @kubernetes
 }

So, I enabled logging in my test CoreDNS and also started to record the network traffic. I can’t rule out any of these components. The most obvious question was, ‘are we losing UDP packets?’ This is what comes to mind as an obvious option, but it’s not realistic. Our client Pod receives the DNS answer, it’s just not the correct one.

So, I started to record everything, and I saw a few interesting things that I wasn’t suspicious about, even though it was weird. One was autopath, which aims to take over the resolution of external names from the client to the server side, so it reduces the overall DNS traffic between CoreDNS and the Pods. We knew it worked, we saw the DNS request count drops when we enabled it. It’s a pretty cool feature. During my tests, autopath needed time to get triggered, at the beginning of the session it was the client side which took care of the complete resolution.

Search path

In DNS resolution there is a search path defined /etc/resolv.conf, which helps the resolver to find out which domain your entry belongs to. This is important to support short names, so if you’re in the domain of cnn.com, and you want to query www.cnn.com, you can just type ping www and the resolver will try all your default domains the request.

Our search path looks like this, for example:

 root@testimg11:/var/tmp/curl# more /etc/resolv.conf 
 nameserver 101.79.112.130
 search default.svc.cluster.local svc.cluster.local cluster.local eu-west-1.compute.internal
 options ndots:5

It’s worth to note that the default keyword is the namespace of the pod, so our resolv.conf always reflect which namespace our Pod is in. This goes back to the Kubernetes DNS mechanism on how you resolve Pods and Services, and it’s a normal behaviour.

So, just to put this theory into practice, if you call curl https://api.twilio.com this will trigger a name resolution of api.twilio.com, but the resolver will try to resolve this in the following order:

  • api.twilio.com.default.svc.cluster.local

  • api.twilio.com.svc.cluster.local

  • api.twilio.com.cluster.local

  • api.twilio.com.eu-west-1.compute.internal

  • and finally, api.twilio.com

The reason is that the resolver believes you used a short name. There are two ways you can change this behaviour:

  • Add a dot at the end of your entry, for example: curl api.twilio.com.. This will tell the resolver that this is not a short name, it’s an absolute FQDN so we don’t need the search path. Just resolve it as it is.

  • The parameter called ndots you see in the resolv.conf, which is a number, in this case 5. This means that if my entry has more than 5 dots in it (it doesn’t, api.twilio.com has only 2), the order of the request will change in a way that my original entry will be queried first, instead of last. 5 is a well known and fair compromise in Kubernetes clusters, and works fairly well.

Just to clarify, these options are all change on how effective the DNS resolution is, not whether it works or not.

To sum up, we had autopath which supports to query external addresses, and we had the local search path in our resolv.conf which would help our resolver library to resolve the address anyway, even if autopath didn’t work, for whatever reason. It needs some warmup time. Or it’s Christmas. Or Hanukkah.

Bits and pieces

So I had my first packet dump of the DNS conversation which failed after a few minutes. Great. Let’s review that. One thing that occurred to me, is that whenever we started failing the DNS resolution with curl, the client side, the Linux side stopped going through the search path.

It works for a few minutes, it goes through the search path for a few minutes, and suddenly it just gives up on the first item, which is api.twilio.com.default.svc.cluster.local.

I couldn’t understand why, so I had to go deeper. I was sure that I found a bug on the client side. I was comparing two DNS packets, they were identical, the first still triggered the search path in Linux, the second did not. My suspicion from CoreDNS moved to Linux, more precisely to curl or the resolver.

What #n ? Why? How?

Two identical DNS packets, 5 seconds apart, the first triggered the search path (correct way), the second didn’t (incorrect way). I saw why this resulted a DNS resolution problem. I knew it couldn’t be the resolver, as this would be too big of a problem. This must have been a curl issue. Maybe they used the resolver in a unique way?

Curl

I picked up gdb and downloaded curl to recompile it. I wanted to see where it went wrong. I enabled debugging symbols and disabled all C optimisation. When we compiled with optimisation, we couldn’t inspect variables while debugging. I wanted to see it all. I’m Neo. Unfortunately I’m closer to Butters in South Park than to Keanu Reeves, but that’s good enough. Professor Chaos always has a few tricks in his sleeve. Never fails to disappoint.

So, I enabled all debugging, started to drop breakpoints all over the place, and I had to realise that the curl is not the issue here. I even tried with the latest curl, same issue. It goes deeper. The code path leads me to getaddrinfo, dns-host and res-query.

Resolut

My next step had to go deeper, to the resolver, and I was able to pinpoint the code path which acts differently when we give up on the search path. The code path was interesting: it stopped going through the search path, because we received an answer. I couldn’t believe it. No, we didn’t. So, I went deeper, when I realised, actually we might have. Just not what I expected.

The libresolv stopped going through the search path for our entry api.twilio.com.default.svc.cluster.local despite the fact that we received an NXDOMAIN for our A record, because we received a NOERROR for our AAAA record.

Wow. That is interesting. For many reasons. Firstly, if you look up the the A record for api.twilio.com, it has a healthy A record:

 ip-10-12-8-46:~ sendai$ host -t a api.twilio.com
 api.twilio.com is an alias for virginia.us1.api-lb.twilio.com.
 virginia.us1.api-lb.twilio.com is an alias for nlb-api-public-c3207ffe0810c880.elb.us-east-1.amazonaws.com.
 nlb-api-public-c3207ffe0810c880.elb.us-east-1.amazonaws.com has address 18.208.54.140
 nlb-api-public-c3207ffe0810c880.elb.us-east-1.amazonaws.com has address 18.211.224.155
 nlb-api-public-c3207ffe0810c880.elb.us-east-1.amazonaws.com has address 18.212.47.248

It has a broken AAAA record. Why? Because it returns a CNAME:

 ip-10-12-8-46:~ sendai$ host -t aaaa api.twilio.com
 api.twilio.com is an alias for virginia.us1.api-lb.twilio.com.
 virginia.us1.api-lb.twilio.com is an alias for nlb-api-public-c3207ffe0810c880.elb.us-east-1.amazonaws.com.

Which is not NXDomain and it leads to no AAAA addresses:

 ip-10-12-8-46:~ sendai$ host -t aaaa nlb-api-public-c3207ffe0810c880.elb.us-east-1.amazonaws.com
 nlb-api-public-c3207ffe0810c880.elb.us-east-1.amazonaws.com has no AAAA record

Let’s think this through, a few observations/questions occurred to me:

  • If instead of a broken AAAA record Twilio would not have an AAAA record at all, the local resolver would keep going through the search path, and finally would be able to resolve the address.

  • How the hell we receive a half correct record from CoreDNS? True that Twilio has a broken AAAA record, but its A record is correct, so we should either have both A and AAAA complete for api.twilio.com.default.svc.cluster.local if autopath works, or we should have NXDomain for both if autopath doesn’t work. How do we get to a state where it’s half/half?

CureDNS

It was time for my puzzled suspicion to turn again, back to my old friend CoreDNS. The question was: why do I get an inconsistent A/AAAA response from CoreDNS?

I previously tested the symptom with all CoreDNS versions, including the latest one, and they were all affected. It was time to fetch CoreDNS and recompile it with debug symbols and no optimisation. Unfortunately CoreDNS logging is insanely weak – most modules don’t even log errors, not that helpful tracing so I could understand the inner workings of this odd beast.

Cash

Going back to my raw packet dump, I started to review again and again, to make sure I’m not missing anything. After a while, I was able to point out the exact problem. The cache within CoreDNS can become inconsistent as occasionally for the same request autopath either works or not. Since the cache module is also caching autopath results, that’s how we end up having half consistent records in cache of A and AAAA records for the same entry, one NXDomain one NOERROR. When that happens, any client that’s trying to resolve that entry will fail to do so, until the cache expires in CoreDNS.

One more question to ask:

  • The cache is set to 30 seconds – so why do these items expire at a different pace and not at the same time?

The reason is that the TTL we receive for A and AAAA differs, and TTL is stronger than the cache expire value. If both TTLs are larger than the cache expire time, they’ll expire together, which is fine, as whatever method we use to resolve, autopath or without autopath, it’ll be consistent. The problem is that when the TTLs are smaller than the cache timeout value and different from each other. In that case, they both will expire from the cache in different times, which will leave room to make a query. If that query is different from the one which is still in cache, in a sense that it’s autopath enabled or not – BANG. We hit inconsistency.

We could blame the cache here, but a better question is ‘why does autopath stop working occasionally?’ And ‘Under what circumstances does it working?’ If it would work consistently, that would keep the cache consistent, but as it wasn’t, it resulted a cache entry for either NXDomain for A or AAAA in cases when autopath wasn’t triggered, and NOERROR when it was. This half state will break our client resolvers.

Next step: Identify why autopath decides to stop functioning occasionally. Maybe it has AI in it?

AI Path

At this stage I knew that the whole problem was around inconsistent cache state, which was caused by the fact that autopath was triggered in same cases while it wasn’t in others, for the same entry. So I downloaded CoreDNS, and decided to run it through gdb, but this type of analysis is fairly challenging, as the DNS server still has to be fairly responsive, even during my analysis. Having breakpoints and manually controlling the process flow is too slow for that, also CoreDNS is fairly simple, so I knew I had to take another approach.

As a matter of fact CoreDNS is too simple. Even too simple to have proper logging within the plugins, otherwise I just would have tuned up the log level and viola. Not here, not with CoreDNS. So, my approach was that I went through the code and placed analytical checkpoints where I printed the significant parts of the internal state. I just needed visibility.

Once I placed my extra logging into CoreDNS I ran my first few tests, I started to see something interesting.

 [INFO] plugin/autopath: FirstInSearchPath: api.twilio.com.default.svc.cluster.local. - [argo-events.svc.cluster.local. svc.cluster.local. cluster.local. eu-west-1.compute.internal. ]
 

I was playing in an isolated test scenario with my own dedicated test Pods and test CoreDNS. My Pod ii called testimg7 and it’s in the default namespace.

How the hell does argo-events namespace comes into play here?

The reason I added a checkpoint to FirstInSearchPath, was because if this returned False, autopath won’t get applied. Based on this output, I’ wasn’t surprised. We queried from the default namespace, querying api.twilio.com, which ended up being api.twilio.com.default.svc.cluster.local. because of our local search path.. so far that’s normal.

Where argo-events.svc.cluster.local. comes from and why it’s here?

Autopath works in a way that whenever there is a request coming to CoreDNS, autopath has to figure out the client’s search path in /etc/resolv.conf. It has to know what search path to try, for each request, and that’s not constant, that’s changing based on the namespace. That’s not easy, as it has no access to the Pod at all. But, it can reconstruct it.

Once autopath reconstructed the search path of the client, it’ll compare the request. If the request ends with the first domain in the search path, autopath is triggered and will do its thing. If not, it won’t process the request. This feature is not that relevant in Kubernetes, as this always should be the case, but autopath supports other modules as well, not just Kubernetes where you have to check this.

Autopath expects the plugin to do this job on behalf of autopath. In this case, this is the Kubernetes plugin. Kubernetes plugin is doing this by getting the source IP of the Pod, and asking Kubernetes:

Which namespace does this IP belong to?

This way, autopath in Kubernetes plugin in CoreDNS can reconstruct the search path of the source Pod.

CoreDNS is clearly making a mistake here, the reconstructed search path is incorrect, which makes autopath to opt-out. So, I had to go and see this myself.

“This IP is not in the argo-events namespace.” – I told myself.

To my surprise, I was only partially correct. I saw this:

 andras.spitzer@blue.dev.imaginecurve.com:~/tmp/coredns/testimgpod$ kubectl get pods --all-namespaces -o wide | grep 100.104.247.253
 argo-events   webhook-j78g2   0/2     Error        0 45d  100.104.247.253
 argo-events   webhook-mx99l   0/2     Completed    0 30d  100.104.247.253
 default       testimg7        1/1     Running      0 10d  100.104.247.253

As Kubernetes is reusing virtual IP addresses, even historical Pods will show up in the list, with that IP. And they can be in another namespace. And this is what happened in our case.

My image testimg7 is in the default namespace and has the IP of 100.104.247.253.

It’s clear that only my test image was running, but the autopath code in the Kubernetes module only cares about the Pod’s IP, not its state. It’ll get the Pod info for the first item, webhook-j78g2, which is in the argo-events namespace.

Since our test Pod is in the default namespace, our query of curl https://api.twilio.com will first try to resolve api.twilio.com.default.svc.cluster.local, which when  it arrives to autopath, it will try to rebuild the search path of the source Pod by querying the Pod info using only the IP address. It’ll get the Pod info for webhook-j78g2, which is a historical Pod in a different namespace, so autopath will build an incorrect search path, and api.twilio.com.default.svc.cluster.local won’t be equal to the first domain in the search path, which in this case incorrectly will be api.twilio.com.argo-events.svc.cluster.local.

Autopath won’t get applied, and since we since the A and AAAA records expire at different times, one of these entries will become NXDomain in the cache, while the other remains NOERROR.

This response will stop our clients to follow their local search path, so the app who initiated the connection/resolution will fail, just as curl did.

Fruit of mistake

Now we know that the problem is coming from the fact that the search path in a the source Pod (which is correct) is different what CoreDNS’s autopath module believes (which is incorrect). How was I able to test reproduce this at the very beginning?

Because I made a mistake.

If you go back to Clone wars, you’ll see that I set up my test pods with the following manual DNS configuration:

  dnsConfig:
    nameservers:
     - 101.79.112.130
    options:
     - name: ndots
      value: "5"
    searches:
     - default.svc.cluster.local
     - svc.cluster.local
     - cluster.local
     - eu-west-1.compute.internal

My goal was to point my test pods to my test CoreDNS. What I didn’t know at that time, was that I also made a mistake which helped me to reproduce the symptom. I set the search path of my test Pods this:

  • default.svc.cluster.local

  • svc.cluster.local

  • cluster.local

  • eu-west-1.compute.internal

This is fine for my test Pod in the default namespace, but it’s incorrect for my Pod in the sre namespace. The one in the sre namespace should have had sre.svc.cluster.local as its first item. The fact that I made this mistake, even though my source Pod in the sre namespace wasn’t impacted by this bug, still allowed me to experience the symptom, just the other way around.

This time CoreDNS was correct and my Pod was incorrect. Very similar to what we have in production, just the other way around.

How odd was it that a small mistake can lead towards a solution? This was the fruit of my mistake.

Context

We need to have the following items in place to trigger this bug:

  • CoreDNS running with

    • cache enabled (cache 30, for example)

    • autopath enabled using the kubernetes plugin (autopath @kubernetes)

  • glibc Linux (tested with Ubuntu 18.04.1 LTS)

    • may also work with musl/Alpine, haven’t tested

  • The context has to support IPv4 and IPv6

  • 2 test Pods

    • one with the default (correct) search path

    • one with an incorrect search path (the first entry should suggest a different namespace, to emulate the scenario autopath fails to identify our Pod properly)

  • An external address (for example, api.twilio.com) we want to resolve, and which

    • has a proper resolvable A record

    • has also an AAAA record with a CNAME, which has no AAAA record (broken configuration)

  • The A and AAAA records must have different TTLs

    • so A and AAAA will expire at different times

Reproduce

Step 1

Write a shell script, test.sh, which will try to resolve api.twilio.com at every 5 seconds, for example:

 #!/bin/sh
 
 while true
 do
  date | tr '\n' ' '
  curl https://api.twilio.com 2>&1 | egrep "TwilioResponse|Could not resolve" | sed 's/^.*curl/curl/g;s/<Versions>.*$//g'
  sleep 5
 done

This script will give us two type of output:

 Sat May 25 03:29:44 UTC 2019 <TwilioResponse>
 Sat May 25 03:29:49 UTC 2019 <TwilioResponse>

When we were able to resolve api.twilio.com

and

 Sat May 25 03:29:54 UTC 2019 curl: (6) Could not resolve host: api.twilio.com

When we failed.

Also, it wouldn’t matter much if we increase the resolution frequency, as the inconsistency potentially can hit only when the CoreDNS entry cache expired and has to be repopulated.

Step 2

Set up two test Pods in the default namespace, called test1 and test2. Let test1 configuration be default, while have this configuration for test2:

   dnsConfig:
    searches:
      - test2.svc.cluster.local
      - svc.cluster.local
      - cluster.local
      - eu-west-1.compute.internal

This will make Kubernetes autopath to ignore this Pod, as when we’ll try to resolve api.twilio.com which will turn into api.twilio.com.default.svc.cluster.local, this won’t match test2.svc.cluster.local .

With this we simulate the case when autopath gets a historical Pod which used to have our IP, and which was in a different namespace, like test.

With this, test1 will be served by autopath while test2 won’t.

Step 3

Start running our test script in test2 - this will work, never fail, autopath will never kick in, api.twilio.com.default.svc.cluster.local will always result NXDomain, and the client side iterates through the search path. At the end of the search path, it’ll resolve api.twilio.com which has a proper A and a broken AAAA address. Curl will be happy as we have a proper A address. Always.

Step 4

Leave test2 on, and start running the test script on test as well. Within 5-20 minutes you’ll see:

 Sat May 25 03:29:54 UTC 2019 curl: (6) Could not resolve host: api.twilio.com

Messages in sync with each other, randomly come and go, depending on whether our current cache is consistent/inconsistent. It can stay broken for only a few seconds  up to minutes, depending on the request patterns, cache settings and TTLs.

When the inconsistency hits, no one can resolve api.twilio.com via CoreDNS from the cluster.

If

  • If we would use CoreDNS without cache, we would never hit this bug

  • If we would use CoreDNS without autopath with the Kubernetes plugin, we would never hit this bug

  • If we would use only IPv4 in this example, we would never hit this bug. Cache would still be inconsistent, sometimes api.twilio.com.default.svc.cluster.local would resolve as NOERROR when autopath was applied and would resolve to NXDomain when it wasn’t, but even when it’s NXDomain our client side resolver will keep going through the search path and finally would resolve api.twilio.com. We need both IPv4 and IPv6 entries to have this half-way inconsistency.

  • If IPv6 address would be properly configured with twilio, we might hit this bug, depending on whether our network can route IPv6.

  • If api.twilio.com would have no IPv6 DNS entry configured at all we would never hit this bug.

  • If the TTLs for the A and AAAA records would be in sync, we would never hit this bug, as both entries would expire at the same time. This means it would not give a chance to have A and AAAA records for the same entry, one triggered by a correctly identified and the other by an incorrectly identified Pod. If we would only use one of these, it would result in consistency, even with or without autopath. It would result differing performance, but not inconsistency.

As you see the bug is highly contextual.

Note: it’s interesting to see why libresolv /getaddrinfo gives up on search path when for a single request it receives an IPv4 NXDomain and IPv6 CNAME which has no AAAA record, at the end, this is not sufficient info to initiate a connection to my requested address. I haven’t dug deep enough to see whether this is a bug or feature. I only went down to confirm the behaviour. Still, it makes me wonder if this is the correct behaviour.

Summary

Cache inconsistencies in CoreDNS leads to occasional DNS resolution problems in any client. The inconsistencies are coming from the fact that Autopath occasionally identifies the source Pod incorrectly in Kubernetes.

Workarounds

In order of preference:

Disable cache

Disabling cache will make sure that we’ll always get fresh data, also the cost of having an internal DNS query is low so the cache is not that important in our case. Even if we disable the cache and leave in autopath not all requests will trigger autopath, but at least it won’t respond with an inconsistent state to our clients, so it should never fail to resolve as a result.

Disable autopath

If we disable autopath, the resolution will work always but the number of requests will increase 3-5 times, which in itself might result higher number of latency related issues we just discussed at the beginning.

Overall either of these options will result no more resolution failures, but their performance implications differ, hence the preference of leaving autopath in instead of the cache.

Solution

The solution is to make sure autopath either works all the time or none of the time at all, so cache can remain consistent. If it works half the time, it’ll result inconsistent cache entries, which will result in failing DNS resolutions.

The autopath code in the Kubernetes plugin within CoreDNS has to add a check to make sure they not just look up Pods based on IPs, but also make sure to look for Pods only that are Running, ignoring any previous Pods from the past.

Just opened a bug report with CoreDNS about this and proposed a fix.