Hey guys, it’s been a while. I have been around, soaking in new things that I am excited to share. After three years of using Kubernetes for testing and roughly a year of running it in production, I have come away with some lessons I would like to share.
As we all know, Kubernetes is the buzz of the internet, which makes it one of the hottest topics around right now. Sysadmins, DevOps engineers, and C-level executives are all excited about the promise of Kubernetes. But as with every tool and system, the saying goes: ‘When the purpose is not known, abuse is inevitable.’ Kubernetes promises a lot of things, most of which tie back to high availability. In the end, though, no matter how well a tool is talked about and no matter how many articles you read, the architecture is always the key to bringing out the best in it.
In this piece, I will be talking about how to plan capacity on your Kubernetes cluster, and how that affects the availability and scalability of your system. In the on-prem days, we were taught to always over-estimate capacity so that during spikes, new resources could be easily provisioned and attached to existing ones. Now, in the days of the cloud, principles like the AWS Well-Architected Framework teach us to stop guessing capacity: use capacity on demand and tear it down when it is not in use.
Both of these schools of thought are valid, depending on the environment and the circumstances. In Kubernetes, three major resources determine the availability of your cluster and application:
- Node CPU Core
- Node Memory
- Node Pod Capacity
Node Pod Capacity is our focus here. The number of pods a node can take is determined by several factors; here we will focus on the factors around EC2 instances on AWS. There is a formula for calculating the number of pods a node can take:

Max Pods = (Maximum supported Network Interfaces for instance type * IPv4 Addresses per Interface) - 1
As an example, say you have a t3.large instance that you want to use as a node in your cluster, and you need to calculate the maximum number of pods it can accommodate. Parameters such as the maximum supported network interfaces for the instance type and the IPv4 addresses per interface can be found here. Since we will be using that document, we will reference those parameters and rename them.
Let:
Maximum supported Network Interfaces for instance type = e
IPv4 Addresses per Interface = i
From the document, these are the parameters for e and i:
For a t3.large, the document lists a maximum of 3 network interfaces and 12 IPv4 addresses per interface, so e = 3 and i = 12.
Max Pods = (e * i) - 1
Max Pods = (3 * 12) - 1 = 35
This means that the instance can accommodate a maximum of 35 pods.
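If you would rather not do the arithmetic by hand, here is a minimal Python sketch of the formula above. The instance values are hard-coded for illustration; always confirm them against the AWS document referenced earlier for your own instance types.

```python
# Minimal sketch of the Max Pods formula above.
# e = maximum supported network interfaces, i = IPv4 addresses per interface.
# Values below are illustrative; confirm against the AWS EC2 documentation.
ENI_LIMITS = {
    "t3.medium": (3, 6),
    "t3.large": (3, 12),
    "m5.large": (3, 10),
}

def max_pods(instance_type: str) -> int:
    """Max Pods = (e * i) - 1."""
    e, i = ENI_LIMITS[instance_type]
    return (e * i) - 1

print(max_pods("t3.large"))  # 35
```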
The danger of not knowing this is that you cannot plan how many pods you should have in your cluster. And if you do not plan the number of pods in your cluster, you stand a chance of hitting scheduling issues: once the pod capacity fills up, the cluster is no longer able to schedule new pods. It looks like this:
[Figure: Rancher dashboard showing scheduled pods at 100% of pod capacity, with the nodes flagged red]
In the figure above, you can see that the number of scheduled pods has filled up the number of available pods. This has made the cluster unstable, hence the nodes showing red. The view comes from Rancher, a third-party tool that helps us visualize the cluster.
The strange part of this picture is that CPU and memory are not even at 30% usage, yet the pod capacity is 100% used up. The lesson: as you plan CPU and memory capacity, pod capacity is always just as key.
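You do not need Rancher to catch this condition. As a rough sketch, assuming the official Kubernetes Python client (`pip install kubernetes`) and a working kubeconfig, you can print each node’s pod ceiling and list any pods stuck in Pending:

```python
# Sketch: inspect each node's pod ceiling and look for unschedulable pods.
# Assumes the official `kubernetes` client and a kubeconfig for the cluster.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Each node advertises how many pods it can run via status.allocatable.
for node in v1.list_node().items:
    alloc = node.status.allocatable
    print(f"{node.metadata.name}: pods={alloc['pods']} "
          f"cpu={alloc['cpu']} memory={alloc['memory']}")

# Pods the scheduler cannot place sit in the Pending phase.
pending = v1.list_pod_for_all_namespaces(field_selector="status.phase=Pending")
for pod in pending.items:
    print(f"Pending: {pod.metadata.namespace}/{pod.metadata.name}")
```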
Recommendations
We have seen the danger involved if we do not plan our cluster properly. If we do not know the number of pods a node can take, and plan appropriately, we risk catastrophic failure of pods and services. From the standpoint of high availability and effective failover, these are my recommendations on pod capacity planning:
- Always know the number of pods a single Node can take at any time, using the formula and steps shared above.
- Run at least 2 replicas of every pod for high availability of a microservice.
- Overestimate your pod capacity: provision roughly double what your pods currently consume. This leaves room for pods to be rescheduled whenever there is a failure.
- Always audit your pod capacity against the number of scheduled pods.
For instance, suppose the pod capacity of a cluster is 70 pods (the pod capacity of a cluster is the sum of what each of its nodes can accommodate) and 50 pods are scheduled. That leaves a surplus of only 20 pods for services to recover with. A better target would be a surplus of 35, which gives more room for pods to be rescheduled in the event of a failure, or when pods need to be moved for high availability.
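A small script can make this audit routine. Here is a sketch under the same assumptions as above (official Kubernetes Python client, working kubeconfig); the 50% threshold mirrors the double-capacity recommendation and is just an example value:

```python
# Sketch: audit cluster-wide pod capacity vs. scheduled pods.
# Assumes the official `kubernetes` client and a kubeconfig for the cluster.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Cluster pod capacity = sum of each node's allocatable pod count.
capacity = sum(int(n.status.allocatable["pods"]) for n in v1.list_node().items)

# Pods bound to a node and not yet terminated count against capacity.
pods = v1.list_pod_for_all_namespaces().items
scheduled = sum(
    1 for p in pods
    if p.spec.node_name and p.status.phase not in ("Succeeded", "Failed")
)

surplus = capacity - scheduled
print(f"capacity={capacity} scheduled={scheduled} surplus={surplus}")

# Example threshold: keep at least half of the capacity free for failover,
# in line with the double-capacity recommendation above.
if surplus < capacity / 2:
    print("WARNING: pod surplus is below 50% of capacity; consider adding nodes")
```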
Proper auditing and monitoring of your cluster are crucial to maintaining high availability. Designing your system to recover from failure should be part of your architecture from the start, not an afterthought. Kubernetes comes with failover, yes. But you need to understand it and architect around it to reap its true benefits, along with all the other beautiful features of Kubernetes.