
BLUF (Bottom Line Up Front)
It can be hard to determine if EKS or ECS is the best fit for a project.
If you think in terms of individual traits, it’s easy to pick a clear winner. The difficulty comes from the fact that, to make a valid comparison, you need to view each product as a bundle of traits, so it’s more accurate to think of each product as a mixed blessing of simultaneous pros and cons.
Once you realize that, two natural consequences follow:
1. Neither option is best by default. Best fit should be determined based on project and organization-level requirements and constraints.
2. Objectively determining which is the best fit for a project requires a bit of critical thinking and careful reasoning.
In this article, you’ll find rarely documented differences, implications, and insights relevant to common project and organizational level constraints, along with other factors that are worth considering when deciding.
It’s best to think of this content as guided reasoning through conditional advice that, when paired with a specific project, quickly turns into practical advice and guidance that you can use to make a well-reasoned decision.
Table of Contents
BLUF (Bottom Line Up Front)
Table of Contents
Target Audience
Introduction
Section 1: EKS and ECS differences of minor note
1. ECS has significantly better Fargate pricing (EKS Fargate is expensive).
2. EKS’s EC2 container density is significantly better than ECS’s.
3. ECS offers more advanced service discovery options.
4. EKS has a slight advantage over ECS in terms of multicloud and local development, since it’s based on Kubernetes.
5. ECS doesn’t charge for control plane costs, while EKS does.
Section 2: Insights on potentially critical differences
1. EKS offers faster scaling of containers by default and advanced autoscaling support, when an add-on named keda.sh is used.
2. EKS’s IaC (Infrastructure as Code) is fundamentally superior to ECS’s, because it’s standardized, easy to read, decoupled, declarative, and supports stateful metadata.
3. ECS has no built-in support for volumes.
4. EKS’s level of difficulty tends to be variable and paradoxical:
“EKS tends to be hard, because it’s too easy.”
5. ECS updates are extremely rare, which is a big boon in terms of stability and avoiding toil associated with cluster operations.
6. EKS has unavoidable maintenance overhead and an order of magnitude more moving parts than ECS, which never needs maintenance.
Section 3: Rules of thumb for choosing ECS, EKS, or neither.
1. ECS might be the better choice in scenarios where:
2. Attempting to objectively justify EKS as the better choice requires a bit more involved contemplation of situational factors: (This section also contains good advice on EKS Auto Mode.)
3. Stateful Applications are a scenario that’s common enough to justify a discussion of rule-of-thumb based logic:
Conclusion
Target Audience
This article is for anyone who wants to know how to make a well-reasoned decision when choosing between EKS and ECS.
If you are the target audience, then this relatively long article will be worth your time: after reading, you’ll be well positioned to achieve the main goal of making a well-reasoned decision, along with three useful auxiliary goals:
1. Discover unexpected differences, pros, and cons.
(I’ll focus on those that are hard to google and often undocumented.)
2. Learn decision-making metadata.
(Meaningful implications of choices, plus insights about commonly encountered critical factors and constraints worth weighing heavily when contemplating options. These also double as great candidates for rule-of-thumb recommendations.)
3. Understand the reasoning upon which claims are based.
If you can understand, agree with, and paraphrase the reasoning used to show that claims, implications, or insights are valid, then you can use that knowledge to increase confidence (at a personal, team, or organizational level) that a decision is both reasonable and based on valid claims.
Introduction
The rest of this article is split into sections matching the above table of contents. Each section will have a list of claims along with a mix of clarifying explanations, factual evidence, and anecdotal evidence to help establish their validity.
The first section has a numbered list of EKS or ECS specific differences that are handy to know about, but usually not a big deal.
The second section contains insights on differences that are worth paying attention to, because they have the potential to be critical factors when making decisions.
The third section contains conditional rules of thumb based on common project, team, or organizational-level concerns. Pairing these with your specific circumstances can produce practical advice.
Section 1: EKS and ECS differences of minor note
It’s possible for both ECS and EKS to be backed by Fargate or EC2. While possible, there’s a strong tendency for ECS to use Fargate and EKS to use EC2 instances.
The first two differences of this section point out a pro of each that explains why this tendency’s existence makes sense. The next three differences involve minor pros along with explanations of why they’re relatively negligible.
1. ECS has significantly better Fargate pricing (EKS Fargate is expensive).
This is of minor concern, because regardless of price, EKS users might still prefer EC2 due to functionality advantages. (EC2 supports container image caching and daemonsets, while Fargate doesn’t.)
- In theory, according to AWS’s Fargate docs, Fargate instances are supposed to be slightly more expensive than EC2 in exchange for convenience and security benefits, a trade-off intended to promote a lower total cost of ownership.
In reality, that understanding is only valid for ECS, because ECS supports x86_64 (Intel/AMD), ARM, on-demand, and spot-based Fargate instances.
- EKS only supports on-demand x86_64-based Fargate instances.
EKS doesn’t support Fargate spot instances or ARM-based Fargate.
- This means that when doing a price comparison for EKS, you have to compare the most expensive option to the least expensive one AND deal with Fargate’s instance-size rounding logic on top of its extra costs.
- For the sake of a concrete example, let’s say a container underwent right-sizing analysis, which determined it runs best with 1.8 CPU and 1.2 GB ram.
Fargate rounds 1.8 CPU up to 2 CPU, and 2 CPU forces a minimum of 4 GB ram; we also have to use x86_64-based on-demand Fargate pricing. In us-east-1 the Fargate price is $2.37/day. This container would fit on a t4g.small ARM-based EC2 EKS spot node, which would cost $0.1512/day.
- So EKS backed by Fargate can be 15x more expensive than EKS backed by EC2! (It’s also worth pointing out that this difference is based on the cheapest region; the 15x difference is even higher in more expensive regions!)
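The arithmetic above can be sketched as a quick back-of-the-envelope script. The per-vCPU and per-GB Fargate rates and the t4g.small spot price are an illustrative snapshot of us-east-1 pricing, not current figures, so treat them as assumptions:

```shell
# Assumed us-east-1 rates (snapshot for illustration; check current AWS pricing):
fargate_vcpu_hr=0.04048   # per vCPU-hour, x86_64 on-demand Fargate
fargate_gb_hr=0.004445    # per GB-hour, x86_64 on-demand Fargate
spot_t4g_small_hr=0.0063  # illustrative t4g.small spot rate

# Fargate rounds 1.8 CPU / 1.2 GB up to the 2 vCPU / 4 GB tier.
fargate_day=$(awk -v c="$fargate_vcpu_hr" -v m="$fargate_gb_hr" \
  'BEGIN { printf "%.2f", (2*c + 4*m) * 24 }')
spot_day=$(awk -v s="$spot_t4g_small_hr" 'BEGIN { printf "%.4f", s * 24 }')
ratio=$(awk -v f="$fargate_day" -v s="$spot_day" 'BEGIN { printf "%.1f", f/s }')

echo "Fargate: \$${fargate_day}/day  EC2 spot: \$${spot_day}/day  (~${ratio}x)"
```

Running this reproduces the $2.37/day vs $0.1512/day comparison; the exact multiple will drift as AWS adjusts prices, but the order of magnitude is the point.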
2. EKS’s EC2 container density is significantly better than ECS’s.
If you deploy a large number of microservices with low CPU and ram usage, those containers can be significantly cheaper to run on EKS thanks to EKS’s higher container density.
(Note: If most of your containers use at least 1 CPU and 1 GB ram, the price difference will be minimal, because a t3/t4g.small can host 2 ECS tasks of that size. Higher density could be a significant cost-savings differentiator for organizations that do large-scale deployments of many microservices, but for most organizations the difference won’t be overly significant, which is why I consider this a minor difference of note.)
- Many ECS users tend to use Fargate instances, which only allow one task/pod per instance anyway. What’s not common knowledge is that ECS backed by EC2 has significantly worse container density than EKS backed by EC2.
- A t4g.small can run 11 EKS pods, or 2 ECS tasks.
A t4g.large can run 35 EKS pods, or 2 ECS tasks.
- I calculated that using:
aws ec2 describe-instance-types \
--filters "Name=instance-type,Values=t4g.*" \
--query "InstanceTypes[].{Type: InstanceType, MaxENI: NetworkInfo.MaximumNetworkInterfaces, IPv4addr: NetworkInfo.Ipv4AddressesPerInterface}" \
--output table
- The command’s output shows t4g.small supports 3 ENIs. ECS needs 1 for the host VM, then gives 1 per task. A good shorthand formula is $MAX_ENI - 1 = $MAX_TASKS_PER_EC2_INSTANCE (3 - 1 = 2).
- Note: ECS does have a poorly named feature called ENI trunking (AKA AWS VPC Trunking). In theory, it’s supposed to increase ECS’s container density, but in practice it’s only supported by instance types that NEVER make sense to use from a FinOps cost-optimization perspective, so I’d personally recommend you just pretend it doesn’t exist.
- Some related notes of interest:
Normally, AMD instances tend to be the best default choice due to lower cost. The a in t3a stands for AMD, so as a general rule of thumb, t3a is better than t3. However, ECS’s container density is a rare case where t3 beats t3a: t3a.small oddly has only 2 ENIs, so it can support just 1 ECS task, while t3.small has 3 ENIs and can support 2 ECS tasks.
- I think the 2 biggest reasons why ECS users prefer Fargate instances are:
1. Setting up ECS backed by Fargate is relatively turnkey, while ECS backed by EC2 requires some extra work.
2. The lack of significantly higher container density for EC2 instances results in a lower incentive to switch from ECS backed by Fargate to ECS backed by EC2.
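The shorthand formula from the CLI output above can be expressed as a tiny helper. This is an approximation that deliberately ignores ENI trunking, per the note earlier:

```shell
# Rough ECS tasks-per-instance estimate for EC2 in awsvpc mode:
# each task needs its own ENI, and the host VM consumes one.
max_tasks_per_instance() {
  local max_eni=$1
  echo $(( max_eni - 1 ))
}

echo "t4g.small (3 ENIs): $(max_tasks_per_instance 3) ECS tasks"
echo "t3a.small (2 ENIs): $(max_tasks_per_instance 2) ECS task"
```

Compare that against the 11 and 35 pods the same instance sizes can run under EKS, and the density gap is obvious.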
3. ECS offers more advanced service discovery options.
I consider it a minor advantage of ECS as it’s only useful in rare use cases, and in such cases, EKS can achieve similar outcomes by installing optional tools to extend EKS’s baseline functionality.
- EKS’s service discovery takes the form of in-cluster DNS names that are only resolvable by pods within the cluster and follow a predictable pattern of $SERVICE.$NAMESPACE.svc.cluster.local
- ECS offers two service discovery options: one is API-based and supports communication within an ECS cluster; the other is DNS-based and supports communication within the VPC. (You can read more about them here if you’re interested.)
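The EKS pattern above is purely mechanical, which is part of its appeal. For example (the service and namespace names here are hypothetical):

```shell
# Construct the predictable in-cluster DNS name for a Kubernetes Service.
# SERVICE and NAMESPACE are hypothetical values for illustration.
SERVICE=orders-api
NAMESPACE=shop
FQDN="${SERVICE}.${NAMESPACE}.svc.cluster.local"

echo "$FQDN"   # resolvable only from pods inside the cluster
```

Any pod in the cluster can resolve that name without extra configuration, which is why EKS rarely needs the more advanced ECS options.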
4. EKS has a slight advantage over ECS in terms of multicloud and local development, since it’s based on Kubernetes.
Here’s why that difference is relatively negligible:
- Multicloud is almost always a terrible idea; the only way to do it right is by leveraging cloud-agnostic design, which is also rarely a good idea. The main issue with multicloud is that in practice, it’s easy to find 10 real problems for every 1 theoretical benefit.
- Minikube and Rancher Desktop can be used to run Kubernetes locally, but in practice, I’ve only seen highly skilled individual engineers benefit from this through manually implementing it on their individual laptops.
There are many integration specific nuances that make it prohibitively difficult to implement a consistent local development experience with useful integrations team-wide without significant investment. So in reality, the practical value of this benefit tends to be very limited.
- Since ECS is based on Docker, highly skilled engineers can do Docker-based local development on their individual laptops. Local Docker-based development is partially applicable to ECS, just like local Kubernetes is partially applicable to EKS.
Likewise, since only highly skilled engineers can navigate the integration-specific nuances, and the benefits apply only to them as individual developers, being Docker-based tends to have limited practical value for ECS as well.
5. ECS doesn’t charge for control plane costs, while EKS does.
I consider this to be a relatively minor advantage, because EKS’s costs are very affordable and easy to justify. That said, I could see it mattering for a startup that needs to run as cheaply as possible.
- If you keep your EKS cluster up to date, the control plane costs $0.10/hour, or about $876/year per cluster. That said, it’s a common practice to have dev, stage, prod, and a few temporary sandbox clusters, so $3–4k/year is a more realistic estimate.
- Here are 3 reasons why EKS’s control plane costs are easy to justify:
- 1. They make sense regardless of ECS.
If you tried to run a cloud-agnostic Kubernetes distribution like Talos or Rancher, you’d need to pay for 3 VMs to act as HA/FT control plane nodes. EKS’s control plane costs are cheaper than the DIY option, while offering the benefits of a managed service.
- 2. EKS has features worth paying a small premium for, mainly easier debuggability and faster feedback loops. (The next section elaborates on this further.)
- 3. EKS has its own cost-saving features: EKS backed by EC2 is cheaper than ECS backed by Fargate, and potentially cheaper than ECS backed by EC2 thanks to higher container density; Kubernetes Ingress makes it easier to share one load balancer across multiple services, so you pay for fewer AWS LBs; karpenter.sh offers automated right-sized autoscaling; and keda.sh offers advanced container autoscaling.
- These can result in EKS having a cheaper hosting cost than ECS. There are too many conditional variables involved to judge one as tending to be cheaper, but what is true is that it’s not uncommon for EKS to be cheaper overall or have a net negligible difference in cost, and that makes the existence of EKS control plane costs a potentially irrelevant point.
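For a rough sense of scale on point 1 above, here is a back-of-the-envelope comparison of the managed control plane against the DIY option. Both rates are illustrative snapshots (the commonly cited $0.10/cluster-hour for EKS, and a ~$0.0416/hour t3.medium on-demand rate in us-east-1 standing in for a small control plane VM):

```shell
# EKS managed control plane: assumed $0.10 per cluster-hour.
eks_year=$(awk 'BEGIN { printf "%.0f", 0.10 * 24 * 365 }')

# DIY HA control plane: 3 small VMs (t3.medium on-demand, ~$0.0416/hr assumed).
diy_year=$(awk 'BEGIN { printf "%.0f", 3 * 0.0416 * 24 * 365 }')

echo "EKS managed: \$${eks_year}/yr vs DIY 3-node control plane: \$${diy_year}/yr"
```

Even before counting the operational labor of running your own control plane nodes, the managed option comes out cheaper on raw compute.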
Section 2: Insights on potentially critical differences
This section covers three pros of EKS, then an EKS paradox that tends to be a con but can be a pro, followed by two pros of ECS.
1. EKS offers faster scaling of containers by default and advanced autoscaling support, when an add-on named keda.sh is used.
- EKS tends to scale up faster than ECS in general. When comparing EKS backed by EC2 vs ECS backed by Fargate this is intuitive given Fargate lacks support for container image caching.
What’s not intuitive is that it’s also true that EKS tends to produce fast startup times more frequently, even when EKS and ECS are both backed by EC2 instances.
This is because EKS has higher container density, which allows it to take advantage of container image caching more frequently. Thanks to image caching, it’s not uncommon for an EKS pod to start within seconds.
- ECS has good support for autoscaling and can scale based on custom CloudWatch metrics, but it’s not as good as EKS at scaling. EKS is noticeably faster and better at autoscaling containers, thanks to nuances around ECS’s implementation details, metric resolution, and support for scaling to 0.
- I’ll elaborate on ECS’s problematic implementation details. Target-tracking autoscaling can only scale up by a set number of capacity units. A second option, step-based autoscaling, allows a degree of variable scale-up, but there’s a 1-minute lag between scaling decisions, because metric data points are collected once per minute.
- (How often metrics are collected is often called metric resolution.)
- EKS uses Kubernetes’ HPA (Horizontal Pod Autoscaler), which has a default metric resolution of 15 seconds, so it can scale 4x sooner.
- ECS’s scaling works based on CloudWatch Metrics and CloudWatch Alarms, and a mix of suboptimal and uneditable defaults tends to make it worse than EKS’s HPA options.
ECS does have a best-case scenario where it can beat EKS in one way, in exchange for bad trade-offs, but overall ECS’s options aren’t as good.
- Here’s what ECS’s best case metric autoscaling scenario looks like:
A high-resolution custom CloudWatch metric can have a resolution of up to 1 second.
In reality, only 10-second granularity matters, because the CloudWatch alarm is what triggers the scaling, and alarms have a minimum evaluation period of 10 seconds.
This technically beats EKS’s uneditable HPA default evaluation period of 15 seconds, but this benefit gets paired with a bad trade-off. If you implement a 10-second metric resolution (actually anything under a minute), then you only get 3 hours of metric retention.
- Here’s what ECS’s metric autoscaling tends to look like in common scenarios:
- ECS’s CPU and ram metrics have metric resolutions of 1 minute, which is an uneditable default, and that forces the CloudWatch Alarm’s minimum evaluation period to be every minute for the most common scenarios.
- Even if you implement a custom metric, you might choose to make it once a minute so you can have 15 days of retention and a cheaper price.
- It’s also worth pointing out that ECS’s metric resolution has an implicit default value of 1 minute, and it needs explicit configuration to enable 10 second granularity.
Kubernetes’ metrics server has a better metric-resolution implicit default value of 15 seconds.
- ECS can only scale to 0 when SQS queue metrics are used; normally it can’t scale to 0.
- EKS can install a free add-on named keda.sh to enable more robust autoscaling capabilities, like custom metrics, scaling HTTP traffic to 0, and cron based scaling.
The cron option is especially useful in the common scenario of needing extra baseline capacity in addition to autoscaling, to smoothly handle traffic spikes during peak business hours, and allow the baseline capacity to scale down during predictable periods of low activity.
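As a sketch of what the cron option described above can look like, here is a minimal KEDA ScaledObject using the cron scaler. The deployment name, timezone, and schedule are hypothetical, and the exact schema should be checked against the keda.sh docs for your KEDA version:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: business-hours-baseline   # hypothetical name
spec:
  scaleTargetRef:
    name: web-frontend            # hypothetical Deployment to scale
  minReplicaCount: 1              # baseline outside the cron window
  triggers:
    - type: cron
      metadata:
        timezone: America/New_York
        start: 0 8 * * 1-5        # raise baseline weekdays at 08:00
        end: 0 18 * * 1-5         # drop back down at 18:00
        desiredReplicas: "10"     # extra capacity during peak hours
```

Outside the cron window, the workload falls back to minReplicaCount, which is exactly the "scale down during predictable low activity" behavior described above.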
2. EKS’s IaC (Infrastructure as Code) is fundamentally superior to ECS’s, because it’s standardized, easy to read, decoupled, declarative, and supports stateful metadata.
These traits produce order-of-magnitude advantages in terms of debuggability, feedback loops, and overall DevOps user experience.
- EKS follows the Kubernetes standard of kubectl + YAML.
It’s quick and easy for humans to read, learn, and edit. (As an added bonus, YAML is a superset of JSON, so JSON snippets drop in directly.)
- ECS doesn’t really have an official standard, in terms of both IaC and tooling. This makes ECS’s IaC harder to learn since there’s an immediate need to research and choose between common tooling options of AWS Copilot, AWS CDK, Terraform, or Pulumi.
- When it comes to IaC and tooling options, EKS has more of them, but there’s only one standard, and all of EKS’s options are built on it. This gives EKS a significant advantage both in ease of learning and in the odds of developing a skill set that transfers between employers who develop according to common standards.
- ECS’s lack of conceptual decoupling in its IaC results in several inherent disadvantages.
An ECS service is conceptually equivalent to an EKS deployment, service, configmap, secret, and AWS IAM role tightly coupled together into one bundle. This creates multiple problems:
- 1. One issue is that ECS’s tight coupling of deployments imposes limits on in-place edits that can be made to a deployed object.
It’s not uncommon for changes to require full redeployments or annoying two-step processes involving deletion and recreation.
That gets old quickly when you need to make iterative changes.
ECS makes it very annoying to add, remove, or switch load balancer types. You’ll find you can’t reuse the same service name without deleting and recreating from scratch, which would cause downtime. You can work around this with a blue-green cutover, but that involves overhead.
Engineers interacting with EKS, on the other hand, get a great user experience thanks to declarative and idempotent updates.
- 2. ECS deployments are naturally slower than EKS deployments.
- 3. When an ECS deployment goes wrong, it’s easy to find yourself debugging a black box with no clear feedback signals that the actual state has converged to the desired state.
So it’s not uncommon for ECS engineers to need to wait for timeouts and spend about 4 minutes between iterations of experiencing “computer says no”.
Some tooling choices, like AWS Copilot, can make this black-box experience even worse. When Copilot hits certain debug scenarios, the wait time can often be 20–60 minutes between iterations, due to waiting for ECS, CloudFormation, and other black-box abstraction layers to time out.
Let’s not forget that when failures occur, ECS’s black-box nature often fails to give feedback or hints about the cause of the error.
- 4. There’s a tendency for debugging ECS deployments to require more iterations than debugging EKS YAML objects, which are simpler and decoupleable.
This happens because an ECS task is a tightly coupled bundle of multiple components; the scope of troubleshooting is larger, with no easy way to narrow down which component to focus on.
- When you eventually need to debug EKS, you can easily systematically deploy, edit, and debug components independently of each other. And it’s easy to get to a feedback loop measured in seconds, because there are clear feedback signals when you run a kubectl describe or output yaml command and look at a YAML object’s stateful events and status. EKS’s feedback loop tends to be constrained by how fast you can think and type.
- Thanks to kubectl port-forward, it’s trivially easy to debug private IP services on EKS. ECS doesn’t really have an equivalent.
- An essential element of Kubernetes’ great feedback experience is that controllers append stateful metadata to the YAML manifests of component objects that get kubectl applied against a live cluster’s etcd database.
Thanks to that, engineers can leverage kubectl describe and output yaml commands to see an object’s event or status metadata; this gives fast and often specific feedback about the success or failure of various stages in individual component objects.
So when something goes wrong, you can quickly become confident about which individual component failed and what went wrong.
- Let’s compare that to ECS’s broken workflow, where the AWS web GUI will let you create an ECS service of type load balancer, and the GUI’s deployment questionnaire asks what subnets you want to deploy to.
It tells you that you can’t deploy to both public and private subnets and need to pick one, but it omits a critical detail of the question being asked:
Are these the subnets for the load balancer, or the subnets for the backend instances?
Also, it asks the subnet question only once, when it should ask twice so you can follow the standard best practice of making the load balancer public and the backend instances private.
ECS’s workflow allows you to do things it should be programmed to prevent you from doing, like deploying a public IP load balancer in a private subnet.
That obviously won’t work, and should be caught by input validation; instead, you get an avoidable error paired with poor feedback about why it’s not working.
- Another common debug scenario where EKS shines:
Let’s say an engineer makes a fresh VPC to do some testing, and then mistakenly deploys an EKS or ECS cluster into a VPC that doesn’t have a NAT gateway.
- With ECS, you’ll get zero logs and zero metrics, because the container image pull will fail due to a lack of internet access. The thing is, you’ll be flying blind with zero feedback about what went wrong. ECS Exec isn’t enabled by default, and it’s hard to enable.
It’s easy to waste many iterations fruitlessly debugging the ECS task or service’s config if you don’t realize it’s a VPC issue, because ECS is more of a black box with poor feedback.
Nothing even tells you the image pull failed; you’re forced to intuit that it must be the cause of the failure.
- If you run into this same scenario with EKS, it’s much easier to solve: even if the worker nodes lack internet access, the control plane is publicly reachable by default, so kubectl can be used to get feedback like an image pull failure, which is a good hint about the lack of internet connectivity.
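To make the decoupling discussed above concrete, here is a minimal sketch of the kind of independent objects that an ECS service bundles together. All names and the image are hypothetical; each object can be applied, edited, and kubectl describe’d on its own:

```yaml
# Three separate objects; each has its own lifecycle, events, and status.
apiVersion: v1
kind: ConfigMap
metadata:
  name: orders-config        # hypothetical
data:
  LOG_LEVEL: info
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders               # hypothetical
spec:
  replicas: 2
  selector:
    matchLabels: { app: orders }
  template:
    metadata:
      labels: { app: orders }
    spec:
      containers:
        - name: orders
          image: example.com/orders:1.0   # hypothetical image
          envFrom:
            - configMapRef:
                name: orders-config       # config lives outside the workload
---
apiVersion: v1
kind: Service
metadata:
  name: orders
spec:
  selector: { app: orders }
  ports:
    - port: 80
      targetPort: 8080
```

When something breaks, `kubectl describe deployment orders` (or the equivalent for the Service or ConfigMap) surfaces that one object’s events and status, which is exactly the narrow troubleshooting scope ECS’s bundled model lacks.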
3. ECS has no built-in support for volumes.
- Something I was surprised to learn about ECS: if you want to inject configuration or a secret (from config in-lined in an ecs-task-definition.json or referenced from AWS Secrets Manager), environment variables are ECS’s only official built-in option for loading configuration. There’s no easy, built-into-the-platform option to mount config or secrets as a file.
- This inability to easily inject dynamic configuration files is a major reason why ECS doesn’t have an equivalent to the Kubernetes’ concept of an Ingress Controller.
- ECS can use EFS for stateful storage, and hacky solutions do exist to mount files, but the point is those are all forced methods that aren’t easy to implement, because there’s no ECS Platform level built-in support for volumes.
There’s a nuance worth clarifying about that statement:
ECS has AWS-Platform level support for EFS, but there’s no ECS-Platform level integrations that make it easier to set up.
(Not only is it not easy, I’d argue it’s often harder to set up EFS on ECS than on standalone EC2 or EKS, because ECS tends to act like a black box that’s hard to debug with a slow feedback loop.)
- EKS, on the other hand, has an EFS CSI driver that sets up a Kubernetes StorageClass; that integration and EKS-platform-level built-in support can make EFS easier to set up on EKS.
- In EKS, config and secrets can be mounted as environment variables or as files. Statefulsets, storage classes, persistent volumes, and other advanced features make stateful workloads relatively easy to implement.
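For contrast with ECS’s env-var-only approach, mounting config as a file in EKS is a few lines of built-in YAML. The pod, config, and image names here are hypothetical:

```yaml
apiVersion: v1
kind: Pod                      # the same pattern works in a Deployment's pod template
metadata:
  name: app                    # hypothetical
spec:
  containers:
    - name: app
      image: example.com/app:1.0   # hypothetical image
      volumeMounts:
        - name: app-config
          mountPath: /etc/app      # each ConfigMap key appears as a file here
          readOnly: true
  volumes:
    - name: app-config
      configMap:
        name: app-config           # a Secret mounts the same way via `secret:` + `secretName:`
```

No task-definition rework or sidecar hacks required; file-based config is first-class in the platform.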
4. EKS’s level of difficulty tends to be variable and paradoxical: “EKS tends to be hard, because it’s too easy.”
ECS’s level of difficulty is relatively static, because ECS has more fundamental constraints. The significance of these nuances is that when an implementation team has access to insightful consultation based on experience and uses a good implementation strategy, then EKS can be easier than ECS.
(For anyone unaware, that’s a critical thinking based viewpoint that goes against a common claim found in AWS docs, that ECS is always easier than EKS.)
- Please bear with me: this particular insight takes time to explain, because it involves nuances, an empirical paradox, and a few trains of thought that are essential for making it intuitive.
- EKS has the potential to be easier than ECS. ECS has intrinsic disadvantages that are unavoidable; most of the important ones have been mentioned earlier. EKS’s disadvantages are fundamentally different, because they tend to be an emergent property that arises from a paradox caused by natural biases.
If you understand the big picture of EKS’s paradoxical disadvantage, its cause, and how to strategically avoid it, then EKS becomes much easier.
- EKS is a great example of an old adage: “You can have too much of a good thing. When good things are taken to extremes, they often give rise to problems.”
- Kubernetes is so good, so capable, and so wildly successful that it has a massive ecosystem. That ecosystem is complex, and that complexity, paired with the natural bias to adopt Kubernetes tools with clear benefits and hidden costs, is a big reason EKS is perceived as hard.
- An important realization is that Kubernetes’ big complex ecosystem is all optional. If you strategically choose to minimize its use, then EKS stays easy.
- I’m going to give some context, and then go into another related paradox:
- One of the principal-level engineering skill sets I have is the ability to distinguish between problem solvers and problem transformers. Problem transformers are tools and techniques that solve 1 problem in exchange for creating N new problems; they tend to be bad ideas to avoid unless you really know what you’re doing and have thought through the various second-order consequences.
- That logic is why I think the following tend to be bad ideas:
Kubernetes Operators, Statefulsets, using APIs that aren’t v1 (operators that deploy statefulsets using alpha APIs are a big one), service meshes, and the nginx-ingress controller. We also can’t forget HashiCorp Vault: “Friends don’t let friends use hashi-vault.”
- Some examples of problems introduced are critical CVEs, maintenance overhead, and forced updates to EKS eventually leading to forced updates of apps.
It’s not uncommon for updates to introduce breaking changes, and there are now new ways your services can fail and a larger blast radius if something does go wrong, which means you need extra testing.
- A sufficiently complex setup can create issues related to IaC, automation, documentation, skill set bottlenecks, staffing concerns, and more.
- With that context in mind, here’s the related paradox:
If someone has a bad idea, ECS is restrictive, inflexible, and just difficult enough to make it easy to realize a bad idea is going to be hard to implement. It’ll be hard enough to do the bare minimum, so the bare minimum will be done, and as a result, ECS tends to be viewed as easier.
- EKS again is too flexible and too capable for its own good.
When someone comes up with a bad idea, EKS offers so much freedom and ease of implementation that any idea can be implemented, including bad ideas.
It sucks when bad ideas get shared, because Kubernetes is so easy to use that others can easily adopt and implement unintentionally bad ideas.
Then, when problems happen, Kubernetes takes the heat for being too difficult.
The hidden truth is that many problems can be avoided by simply avoiding bad ideas.
Or phrased another way, “EKS is hard when people make it hard.”
- The above insights can be combined to form the basis of a good EKS implementation strategy:
Basically, try to stick to simple EKS functionality: implement a rule of thumb to prefer built-in functionality, regularly consider the YAGNI principle, and avoid features that aren’t available in ECS. (ECS doesn’t really have operators, ingress controllers, persistent volumes, non-stable APIs, or a massive ecosystem of third-party tools, and AWS App Mesh support is being removed.)
If you compare the two with this strategy in mind, you get closer to an apples to apples comparison; only now, EKS starts to look like it might be a flavorful Honeycrisp Apple.
5. ECS updates are extremely rare, which is a big boon in terms of stability and avoiding toil associated with cluster operations.
- In theory, an on-demand Fargate instance could go years without updating; this avoids outages caused by breaking changes associated with updates.
- The EKS platform will eventually force adopters to update. If the applications running on a cluster were neglected and never updated, a forced platform update risks breaking outdated application versions that were designed to work with specific versions of Kubernetes.
- The nginx-ingress controller is a good example of a well-known application that maintains a table listing which Kubernetes versions each version of the ingress controller is meant to run on.
- A relatively common self-inflicted, yet troublesome, scenario that some organizations run into with EKS is that they’ll hire a contractor to set up an EKS based solution as fast and as cheap as possible. Then, when the contract ends, the EKS cluster and its workloads can go untouched for years.
Eventually someone learns that the EKS platform eventually forces updates. That’s either after something broke or said organization tasked someone with updating their EKS cluster(s) at the last minute before a forced auto update.
It’s often the case an engineer discovers they’ve underestimated the level of effort the task involves, because they have to update multiple workloads deployed on the cluster in addition to updating the cluster.
- Normally that’s no big deal, but it can be stressful under a last-minute deadline that, if missed, could cause production outages. Especially since such organizations usually throw the task at a non-Kubernetes expert, and neglected EKS clusters tend to be the result of poorly planned rush jobs, so they lack IaC or documentation on how they were set up by an engineer who left over 2 years ago.
- In addition to forced updates, EKS’s EC2 nodes have more scenarios where worker nodes need to reboot than an ECS cluster backed by Fargate instances.
(EC2 nodes can be scaled down by a cost-saving cluster autoscaler like karpenter.sh, which also defaults to expiring and replacing on-demand nodes every 30 days so they pick up the latest patched EKS worker node AMIs.)
- EKS also makes it easier to introduce optional complexity in the form of karpenter.sh, Kubernetes Gateway API controllers, ingress controllers, and service meshes.
- Adopting these optional components introduces new failure modes: misconfigurations, bad updates, supply-chain vulnerabilities, critical vulnerabilities that force rushed updates, semi-frequent node replacement as a feature, and breaking changes associated with updates.
- The irony is that service meshes like Istio have uptime-improving features built in, like multi-regional meshes spanning multiple clusters and retry logic; in practice, however, they're a common source of downtime.
- Once an app is deployed and running, ECS tends to be maintenance-free.
- EKS setups often have at least 3 clusters, and each has multiple components, all of which update and collectively result in inescapable maintenance toil.
- Luckily, many advancements have minimized maintenance toil; the biggest is EKS Auto Mode, but even without it, AWS Managed Add-ons and v1 stable APIs for Ingress and Karpenter have made updates easier, faster, and lower-risk.
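The version-compatibility concern above (the kind of support matrix ingress-nginx publishes) can be made concrete with a small pre-upgrade gate. This is an illustrative sketch only: the version pairs below are hypothetical placeholders, not the real ingress-nginx support matrix.

```python
# Hypothetical sketch of a pre-upgrade compatibility gate.
# SUPPORT_MATRIX maps an add-on version to the Kubernetes minor
# versions it supports; entries are illustrative, NOT the real
# ingress-nginx support matrix.
SUPPORT_MATRIX = {
    "controller-v1.9": {"1.26", "1.27", "1.28"},
    "controller-v1.10": {"1.28", "1.29", "1.30"},
}

def upgrade_is_safe(addon_version: str, target_k8s_minor: str) -> bool:
    """Return True only if the installed add-on supports the target cluster version."""
    supported = SUPPORT_MATRIX.get(addon_version, set())
    return target_k8s_minor in supported

# A forced platform update to 1.30 would break a neglected controller-v1.9:
print(upgrade_is_safe("controller-v1.9", "1.30"))   # False -> update the add-on first
print(upgrade_is_safe("controller-v1.10", "1.30"))  # True
```

Running a check like this in CI before each cluster upgrade is one way to avoid discovering an incompatible add-on mid-upgrade.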
6. EKS has unavoidable maintenance overhead and an order of magnitude more moving parts than ECS, which needs almost none.
These 2 seemingly minor downsides of EKS produce second-order effects that result in significant disadvantages in terms of operational overhead. For the sake of adding a bit of clarity to the contextual meaning of significant, let’s use 2–14 days/year of engineering time dedicated to EKS maintenance as a ballpark estimate of what to expect.
- ECS’s traits result in it being very forgiving when common DevOps best practices aren’t known or are intentionally ignored/minimized.
ECS admins can achieve reliability despite frowned-upon patterns like:
- 1. Mixing manual operations with partial automation: using minimal IaC, or automating deployment of workloads but not provisioning of clusters.
(This can be successful, because once a team gets ECS working, it tends to keep working.)
- 2. Ignoring the security benefits of multiple clusters and throwing everything in 1 ECS cluster.
(Even if dev and prod are mixed in 1 ECS cluster, it rarely hurts reliability thanks to ECS’s common architectural patterns, which result in most possible problems having a small blast radius:
It’s common for ECS services to have their own AWS Managed LB instead of a shared Kubernetes Ingress Load Balancer. Fargate containers get their own individual VMs, which improves isolation.)
- 3. Not implementing accurate resource limits and requests.
(Because Fargate runs 1 ECS task per VM, this is only a cost-optimization issue rather than both a cost-optimization and reliability issue.)
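To make point 3 concrete: Fargate requires task-level cpu and memory to be declared up front, and the micro-VM is sized to the task, so a sloppy estimate wastes money without destabilizing neighbors. Below is a sketch of the shape of a Fargate task definition, expressed as a plain dict; the family name and image are hypothetical, and nothing is sent to AWS.

```python
# Illustrative sketch of a Fargate task definition request body.
# This dict mirrors the shape of an ECS register-task-definition
# input, but it is never sent anywhere; names are hypothetical.
task_definition = {
    "family": "example-service",
    "requiresCompatibilities": ["FARGATE"],
    "networkMode": "awsvpc",
    "cpu": "512",      # task-level CPU units; sizes the whole micro-VM
    "memory": "1024",  # task-level memory (MiB); over-sizing wastes money,
                       # but cannot starve other tasks: 1 task == 1 VM
    "containerDefinitions": [
        {
            "name": "app",
            "image": "example.registry/app:1.0",  # hypothetical image
            "essential": True,
        }
    ],
}

# If the app actually uses ~600 MiB, a bad estimate is purely a cost issue:
wasted_mib = int(task_definition["memory"]) - 600
print(f"wasted per task: {wasted_mib} MiB")  # wasted per task: 424 MiB
```

On EKS, by contrast, a missing or wrong resource request affects bin-packing on shared nodes, so the same mistake can become a reliability issue as well.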
- EKS has enough moving parts and complexity that rigorous implementation of best practices is practically a requirement for admins who want their EKS clusters to stay reliable and maintainable in the long run.
- The need for rigorous implementation of best practices has a significant downside.
Chores associated with maintenance and the implementation of best practices tie up engineering time, and engineering hours are expensive. Worse still, the pursuit of best practices can lead engineers down DevOps Yak Shaving rabbit holes.
DevOps Yak Shaving can be a perfectly reasonable activity, but it’s also a potentially problematic rabbit hole, because it’s hard to distinguish between perfectly reasonable, potentially overkill, and completely unnecessary.
I’d like to elaborate on the significance of this by pointing out that ECS admins aren’t only spared from chores; they also encounter fewer DevOps Yak Shaving scenarios.
- Any team managing EKS will quickly conclude that they need at least 2 EKS clusters for the sake of reliability.
It’s easy to discover that EKS component updates or misconfigurations can cause breaking changes, and problems with many components like ingress, DNS, CNI, karpenter.sh, service meshes, and unhealthy nodes have large blast radiuses.
That makes it intuitively obvious that isolated environments and testing updates in lower environments are essential to EKS’s reliability.
- Once EKS teams get used to an environment promotion workflow, another common realization occurs:
Testing in a lower environment is only valid when the lower and higher environments are relatively similar to each other.
- Many EKS teams invest time in developing rigorous implementations of infrastructure as code, end-to-end automation, and CICD pipelines that can ensure their lower and higher environments stay in sync.
- Another common experience is that dev environments change faster, more frequently, and see more manual changes. Break-glass manual changes to production to resolve outages and incidents also aren’t uncommon.
Thus, it’s not uncommon for EKS teams to experience config drift, where a live environment no longer matches the IaC and automation defined in git or a CICD pipeline.
So ensuring live matches IaC requires further rigorous implementation of best practices, which usually involves a mix of CICD pipelines and GitOps implementations.
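At its core, the drift problem reduces to continuously diffing desired state (IaC in git) against live state and reconciling the difference. Here is a toy sketch of that comparison; real GitOps tools like Argo CD or Flux do this at scale, and the dicts below are hypothetical stand-ins for a rendered manifest and a live cluster object.

```python
# Toy drift check: compare desired state (from git) against live state
# (from the cluster). Both dicts are hypothetical stand-ins.
desired = {"replicas": 3, "image": "app:1.4", "cpu_limit": "500m"}
live    = {"replicas": 5, "image": "app:1.4", "cpu_limit": "500m"}  # break-glass scale-up

def detect_drift(desired: dict, live: dict) -> dict:
    """Return {field: (desired_value, live_value)} for every drifted field."""
    keys = desired.keys() | live.keys()
    return {
        k: (desired.get(k), live.get(k))
        for k in keys
        if desired.get(k) != live.get(k)
    }

drift = detect_drift(desired, live)
print(drift)  # {'replicas': (3, 5)} -> either reconcile live, or update git
```

A GitOps controller runs this loop continuously and either auto-reconciles or flags the drift, which is why GitOps is the usual answer to the "live no longer matches git" problem described above.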
Section 3: Rules of thumb for choosing ECS, EKS, or neither.
1. ECS might be the better choice in scenarios where:
- You want to design an app to run 2–10 years with maximum uptime and minimum maintenance.
Let’s say you have an application that rarely updates, like internal tooling services such as a custom internal auditing application, or a public-facing application that wouldn’t cause significant problems if it were hit by a zero-day critical remote code execution vulnerability. (Due to best practices like using a distroless image that has no shell and IAM rights that follow the principle of least privilege.)
In such cases, it may make sense to prioritize ECS’s strength of near-zero operational overhead over EKS’s strength of easier debugging and faster feedback loops.
- You have relatively simple applications, deploy updates to environments only a few times a day, have stable application and architectural needs that rarely change, or plan to invest in a custom ECS CICD pipeline that only deploys simple patterns that always work.
In these circumstances, you’ll likely minimize your exposure to ECS’s downsides of being hard to debug when something goes wrong and having a slow feedback loop.
- You have a small team made up only of developers, none of whom are remotely interested in or willing to learn how to rigorously implement DevOps best practices.
In that case, ECS is more forgiving than EKS when operational best practices are lacking or minimized.
2. Attempting to objectively justify EKS as the better choice requires a bit more involved contemplation of situational factors:
- ECS and EKS are both mixed blessings, with simultaneous benefits and downsides. That said, you can think of ECS as more balanced, with benefits and downsides that are both relatively mild (low risk, low reward). EKS, on the other hand, is a case of more significant benefits and moderate downsides (moderate risk, significant reward).
- Before going into EKS’s benefits, it’s prudent to start with the downsides, as considering the downsides up front can put you in a better frame of mind to get more out of a question that adds good perspective:
“Do the benefits I’m seeing at least offset if not outweigh the downsides?”
- EKS’s 1st downside: Successful adoption of EKS requires rigorous implementation of DevOps Best Practices, which in turn requires willingness along with a significant investment of resources.
This can take more time than many would expect, because “DevOps Problems” often require “DevOps Solutions”, which often need to be transformed into other problems repeatedly before genuine solutions can be applied.
This is also the reason DevOps professionals sometimes jokingly refer to themselves as professional yak shavers. Transform a problem enough times, and it devolves into shaving a yak.
(Let’s ballpark estimate this at 1–4 months of a one-off investment in engineering effort.)
- If you want to positively influence that 1–4 month time commitment, pay enough to hire critical thinkers who understand the nuanced differences between problem solutions and problem transformations, and make sure your team understands the logic and reasoning behind the insightful paradox that “EKS tends to be hard because it’s too easy.”
And favor principles and advice like:
KIS (Keep It Simple), YAGNI (You Aren’t Gonna Need It), v1 stable APIs are your friends, managed services like the AWS LB Controller should be preferred over DIY ingress controllers like Nginx (hint: the former is a solution and the latter a problem transformation; quay.io’s vulnerability scan of a 6-year-old nginx-ingress-controller image reported 76 critical vulnerabilities, an average of about 1 critical CVE per month), and EKS Auto Mode is worth considering.
(Auto Mode is also a problem transformation as it has some downsides like introducing a black box effect and increasing costs; however, it’s often worth it for small-scale greenfield clusters.)
Do these things, and EKS can stay relatively easy.
Remember: “EKS is hard when people make it hard.”
- EKS’s 2nd downside: It has an inescapable degree of maintenance toil.
(Let’s ballpark estimate it at 2–14 days of maintenance per year for 3 clusters.)
- It’s worth pointing out that EKS Auto Mode, introduced Dec 2024, can be of great help in minimizing both of EKS’s 2 main downsides.
- EKS Auto Mode eliminates a lot of prerequisite work (involved in setting up common add-ons), prerequisite knowledge, ongoing maintenance, and even failure modes, since it avoids breaking changes by only using and automating upgrades of add-ons based on v1 stable APIs.
- That said, it doesn’t eliminate maintenance entirely, and it has some downsides: it adds costs and turns EKS into more of a black box, which hurts debuggability.
(It runs karpenter.sh pods on the managed control plane nodes, so you can’t access karpenter.sh pod’s logs, and those are often needed to debug edge cases.)
- Prerequisites are another thing EKS Auto Mode doesn’t eliminate entirely. AWS’s documentation mentions you’ll need to supply an Auto Mode-specific ingress class, load balancer service annotations, and a storage class that references an Auto Mode-specific EBS volume provisioner to use all of Auto Mode’s features.
- It’s also useful to know up front that it’s hard to migrate from Auto Mode disabled to Auto Mode enabled. At first glance, the EKS web portal makes it look like Auto Mode can simply be toggled off and on.
However, the Migration Reference page reveals that an in-place migration is a painful manual process, to the extent that it’s often easier to deploy a new cluster and do a blue-green migration to it. The difficulty arises because EKS Auto Mode has its own unique ingress class, load balancer class, and EBS CSI volume provisioner, so an in-place migration on a pre-existing cluster requires re-creating Kubernetes Services of type LoadBalancer, Ingress objects, and any PVCs created before Auto Mode was enabled.
- Given those minor issues and the added costs, it doesn’t make sense to recommend EKS Auto Mode in all cases; even so, I think it’s often a good choice for small-scale greenfield clusters that need to be managed by teams who are new to EKS. I also think EKS Auto Mode makes a lot of sense for anyone who plans to have more than 6 long-lived clusters.
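To make the prerequisite and migration points concrete, here is a sketch of the Auto Mode-specific classes, expressed as Python dicts standing in for Kubernetes YAML manifests. The controller and provisioner strings reflect my recollection of the AWS Auto Mode documentation; treat them as assumptions and verify against the current docs before use.

```python
# Sketch of Auto Mode-specific classes, as plain dicts standing in for
# Kubernetes manifests. Controller/provisioner strings are recalled
# from AWS docs -- verify before relying on them.
ingress_class = {
    "apiVersion": "networking.k8s.io/v1",
    "kind": "IngressClass",
    "metadata": {"name": "alb"},
    "spec": {"controller": "eks.amazonaws.com/alb"},  # Auto Mode's ALB controller
}

storage_class = {
    "apiVersion": "storage.k8s.io/v1",
    "kind": "StorageClass",
    "metadata": {"name": "auto-ebs"},
    "provisioner": "ebs.csi.eks.amazonaws.com",  # Auto Mode's EBS provisioner
}

# Pre-existing PVCs/Ingresses reference the older classes (e.g. the
# standard "ebs.csi.aws.com" provisioner), which is why in-place
# migration forces object re-creation:
print(storage_class["provisioner"] != "ebs.csi.aws.com")  # True
```

Because class names and provisioners are immutable references on existing objects, pointing old PVCs and Ingresses at the new classes means re-creating them, which is the root of the migration pain described above.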
- Now we can talk about benefits; the most important is a time saver that does the most to offset the time costs mentioned earlier.
- EKS offers a significant benefit in the form of easier, faster debugging and faster feedback loops. Since development often involves continuous debugging, under the right circumstances this trait alone can introduce enough time savings to offset the time costs and produce a net savings, on top of EKS’s other benefits.
- Here are some common scenarios where EKS becomes easy to justify: your app deploys to environments frequently enough to justify a CICD pipeline that’s as fast as possible; it has frequent changes, complex integrations, or a service-oriented architecture; it’s actively undergoing a transformation from monolithic to service-oriented architecture using the strangler pattern; or it has complexities that make easy debugging with a fast feedback loop highly valuable.
- Do you expect that some of your apps will have very spiky traffic, and scaling up as fast as possible could be a deciding factor?
If so, EKS’s advanced scaling options (powered by the keda.sh add-on), like scaling to zero and cron schedule based scaling, can become significant advantages in addition to cost-saving opportunities.
- Do any of your apps have a hard requirement for loading application configuration as files, or need robust storage options?
- Do you need or see sufficient benefit in having access to advanced functionality and engineering patterns like cluster-level RBAC, policy as code, GitOps, advanced load balancing, OIDC/Authn/z integrations, and generally customizability that makes it easy to implement any idea in ways that offer a great engineering user experience?
- EKS can be a great fit for projects where its benefits outweigh the 2–14 days/year maintenance overhead its adoption incurs, and where you can get stakeholders to agree to the 1–4 months of one-off investment lead time needed to implement it according to best practices.
3. Stateful Applications are a scenario that’s common enough to justify a discussion of rule-of-thumb based logic:
- EKS makes it easy to run stateful applications and has good support for them, but just because it’s easy doesn’t mean it’s automatically a good idea. I’ve tried to offer a well-informed view of the difficulty that comes alongside EKS’s benefits, but here’s an important clarification: the level of difficulty I’ve mentioned thus far assumes you’re running stateless applications (and maybe a few stateful apps where data loss is acceptable, as with valkey/redis caches or self-hosted monitoring tools like Grafana Labs’ PLG observability stack).
- People often install stateful applications on Kubernetes but fail to consider the application’s full lifecycle, including automated backups, automated restores, testing, and CICD pipeline integrations.
When these are considered in full, the operational overhead needed to maintain great long-term reliability is often extremely high.
- High enough to consider 3 options:
- 1. Accept an increased risk of multiple hours of downtime roughly once a year, in exchange for avoiding significant operations overhead, by intentionally not rigorously implementing best practices associated with stateful workloads.
- 2. If you need highly reliable uptime and easier disaster recovery options, then offloading to an expensive managed service may result in a lower total cost of ownership.
- 3. Verify if stakeholders have appetite and willingness to invest 1–6 months (per individual stateful app/database) of dedicated engineering time and effort to rigorously implement best practices associated with stateful workloads, along with increased maintenance overhead.
If you understand the paradox explained above, you’ll see why EKS ironically tends to be viewed as hard because it’s too easy. With the right strategy, EKS can stay easy, and in many cases it’s the better choice, but not always: while ECS is painful to debug, once you get it right it tends to keep working maintenance-free.
If you found this useful, you may want to check out doit.com/services.