CloudFormation, Route53, and ... EKS?

Originally published at: CloudFormation, Route53, and ... EKS? - Skycrafters

The other day I got halfway through writing a very irate support ticket to AWS, stopped to do some fact checking, and learned something deeply annoying.


One of the teams I work with manages a bunch of services. One of these “services” is some common Amazon Route53 infrastructure that is set up using AWS CloudFormation. Over the history of the project, deployments in the team’s development account have been a little flaky, and each time we hit a problem it turned out to be rate limiting. It never happened in production, so the flakiness never quite got the attention it deserved.

Rate limiting in Route53

Some background: Route 53 has a hard limit of 5 API requests per second per AWS account, shared by every caller in the account. For most folks, that’s plenty of headroom. For this team, it wasn’t.

The team raised support tickets and got advice like “CloudFormation will attempt to create your resources in parallel. One option to avoid rate limiting is to add DependsOn links to serialize the resource creation.” We weren’t super-happy with that answer. Granted, there is a CloudFormation roadmap item to fix this, but we needed something in the interim, so we added the DependsOn links, and it worked … mostly.
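For anyone who hasn’t used it, here’s a minimal sketch of the kind of DependsOn chaining that advice describes. The zone and record names are made up for illustration; the point is that the second record set isn’t created until the first one has finished, so CloudFormation only has one Route53 change in flight at a time.

```yaml
Resources:
  # Hypothetical zone for this sketch; in a real stack it might be a
  # parameter or an imported value instead.
  DevZone:
    Type: AWS::Route53::HostedZone
    Properties:
      Name: dev.example.com

  ApiRecord:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneId: !Ref DevZone
      Name: api.dev.example.com
      Type: CNAME
      TTL: "300"
      ResourceRecords:
        - api-origin.example.com

  # Without DependsOn, CloudFormation would create this record in
  # parallel with ApiRecord and chew through the request budget faster.
  WebRecord:
    Type: AWS::Route53::RecordSet
    DependsOn: ApiRecord
    Properties:
      HostedZoneId: !Ref DevZone
      Name: www.dev.example.com
      Type: CNAME
      TTL: "300"
      ResourceRecords:
        - web-origin.example.com
```

Every additional record gets a DependsOn on the previous one, which is tedious to maintain, hence the grumbling and the roadmap item.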

On this day a deployment had failed again, and after going after the usual suspects and making sure that the resources were properly serialized, I got very irate. I was halfway through writing a support ticket:

We are still encountering Route53 rate limits and our CloudFormation stack deployment / updates are intermittently failing, sometimes after only two resources are created. There are no Route53 API calls being made by our applications, only through CloudFormation.

We are quite frustrated at this point and would like to request a session with a solution architect to help us understand how we should be doing this and

and I paused.

Check yourself before you wreck yourself

“There is no point in using the word ‘impossible’ to describe something that has clearly happened.” — Douglas Adams, Dirk Gently’s Holistic Detective Agency

“Is it true that there are no other Route53 API calls being made?” I asked myself. A quick jaunt into AWS CloudTrail told me the answer, and also opened a gaping pit beneath my feet.

There were 436 Route53 API calls made in the 2-minute period surrounding our CloudFormation failure. If you do the math, that’s about 3.6 requests per second on average, so it’s not at all surprising that we tipped over the limit of 5 at some point in there.

“But where are these coming from?” was my immediate question, and it was immediately answered.

Virtually all of these requests were being made by an EC2 instance that was part of an Amazon Elastic Kubernetes Service (EKS) cluster.

Talking through this with some other folks, I learned that they’d configured external-dns on the cluster, and that this behavior is actually documented.

The production account doesn’t have the EKS cluster, so it’s not overwhelmed with Route53 API calls, which explains why deployment never failed there.

Buh-bye

I wanted to decommission the cluster immediately, but unfortunately some teams still need it, so I wasn’t able to.

The external-dns documentation says that one workaround for the controller eating your entire Route53 request budget is to extend the interval at which its reconciliation loop runs. In this particular cluster, the loop was running every minute (the default!) to reconcile a set of records that change approximately never. I followed the instructions, set the interval to a week, and settled in to see what happened.
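The interval is just a command-line flag on the external-dns container, so the change is a one-liner in the Deployment. Here’s an illustrative sketch of what the relevant bit can look like; the image tag, domain filter, and owner ID are placeholders, and the week is written as 168h to stay on the safe side of duration parsing.

```yaml
# Illustrative excerpt of an external-dns Deployment pod template;
# the image tag, domain filter, and txt-owner-id are placeholders.
spec:
  template:
    spec:
      containers:
        - name: external-dns
          image: registry.k8s.io/external-dns/external-dns:v0.14.0
          args:
            - --provider=aws
            - --source=service
            - --source=ingress
            - --registry=txt
            - --txt-owner-id=dev-cluster        # placeholder
            - --domain-filter=dev.example.com   # placeholder
            - --interval=168h                   # one week, up from the 1m default
```

The docs also mention a few AWS-specific knobs (batching, zone caching) for clusters where records actually change often and a long interval isn’t an option.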

The first thing I noticed was that the calls to Route53 stopped immediately. Not surprising, but great to get confirmation. Several hours after the change, there were still no calls from the previously misbehaving cluster.

All is well now, and I get to put away my detective hat for another day.

“The light works,” he said, indicating the window, “the gravity works,” he said, dropping a pencil on the floor. “Anything else we have to take our chances with.” — Douglas Adams, Dirk Gently’s Holistic Detective Agency

What I learned

First, CloudTrail was instrumental here. I’m still a novice, but I’m learning how powerful a tool it is. Once I knew what to look for, it was immediately obvious what the source of the rate limiting was. The events in CloudTrail identified the EC2 instance and even made it clear that the source of the requests was in an EKS cluster.
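To give a flavour of what that looks like, here’s a mock-up of the fields that mattered in those events; every identifier below is invented for illustration. Calls made with an EC2 instance-profile role show up as an assumed role whose session name is the instance ID, and the role name itself can be enough to tie it back to a particular cluster’s node group.

```yaml
# Mock-up of the interesting fields in a Route53 CloudTrail event;
# all names, account numbers, and IDs here are invented.
eventSource: route53.amazonaws.com
eventName: ListResourceRecordSets
awsRegion: us-east-1          # Route 53 is a global service, so its events land here
sourceIPAddress: 203.0.113.10
userIdentity:
  type: AssumedRole
  arn: arn:aws:sts::111122223333:assumed-role/dev-eks-node-role/i-0abc123def4567890
  sessionContext:
    sessionIssuer:
      userName: dev-eks-node-role
```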

Second, I was reminded that Kubernetes is not a get-out-of-ops-free card. There is a lot of expertise involved in running Kubernetes well, even when you’re using a managed service like EKS. I knew this before, but this was an example of a cluster I didn’t even know existed (don’t worry: someone more responsible did know!) having side effects way outside its scope.

Got any detective stories you’d like to share? Comment below, I’d love to hear them!


Nice write-up @glb, and I love this kind of investigation, especially when you get to find the culprit at the end.
The API call quota has hit me big time in the past, and I even came up with a custom exponential back-off solution to reduce the number of API calls made.
I’m actually surprised the controller doesn’t have that kind of back-off baked in; it would make it “smart” about when to call Route53, instead of relying on a static interval during which an important reconciliation might have to wait.


Great content here, @glb. I love how you expose your fail moment while linking it to the learning you got out of it.

I have a quick story about the back-off mechanism built into the AWS tools. First, I learned the hard way that API calls are throttled, obviously for the sake of quality of service for all customers. To speed up the deployment of hundreds of CloudFormation stacks, I decided to instantiate a new SDK client for each stack and deploy it. It didn’t take long for the deployment failures to start coming back :sweat_smile:

Then I did some research and found out that a back-off mechanism is already built into the AWS SDK. I changed my approach to instantiate just one SDK client and loop through all the stack deployments I had to do. And it worked! Mostly. I would still get a deployment failure or two every hundred deployments or so. I went the lazy route, added a 3-second sleep between deployments, and stopped having issues. It was far from an ideal solution, even though it didn’t impact me much, but I never found the actual root cause.

I wonder how many other people have challenges with the (lack of a?) back-off mechanism built into AWS tools and services.


@glb I loved this post! We have all been there: about ready to send an irate message somewhere, only to find out that we didn’t actually have all of the information. lol. At least you did not send the email to AWS :sweat_smile:

I do have one story like this, except the email went to Azure! I didn’t send it myself, but I was part of the story. It was about the different flavors of encryption Azure offers and how they interact with Azure’s best practices. There is more than one place in Azure that shows whether you have encryption, at different levels and for different purposes. At first glance my co-worker thought Azure had it set up wrong and sent an email, but we just didn’t understand that those places represented different things. We figured that out a day later, after further research into how encryption works in the Azure environment. Never did find out if they replied or not :thinking:


Thanks @xabi ! The external-dns controller has a number of options that can help it behave better, but yeah it seems like there is room for improvement. I’m sure that as it gets more use folks will iterate on it; as much as I’m not a fan of Kubernetes, the community sure moves fast!


Thanks @raphabot ! Great story about back-offs, I think everyone who works with the SDK runs into exactly that problem at some point :slight_smile:


I think I’ve heard that before :eyes:


Yep, I didn’t know that. Lesson learned, I guess! :slight_smile:


Thanks @Tabs ! That quote was literally what I had in my draft, and the trailing “and” is where I stopped in my tracks and said “wait.”

Love the story about docs and things being inconsistent – I’ve often said that one of the first hires I’d make if I were to start a company would be someone who trained as a librarian (yes, even before I got to information architects!). I have so much respect for people who can organize documentation effectively, and so often I find that project wikis and other internal docs get less love than they need.


@raphabot you might have heard my ranting and complaining about it before :joy: I learned so much though!
