Top Ten Challenges - Part 6: Sizing and Footprint Optimization

(This is part 6 of a series about challenges and considerations that I have come across with customers when deploying software into Kubernetes. You can find the intro post here.)

This is a topic that is relatively hard to cover in a blog post, because I could write a whole book about it. So it will stay at a somewhat high level, but I plan to dig deeper into some of the aspects described below in later blog posts.

For us at IBM, many customer discussions start with the ask for a PoC or some other form of trial, allowing the customer to take a first-hand look at our software. And as part of that, the question comes up: "How big does the cluster need to be to run this?" That leads directly to the questions of whether a new cluster has to be created or an existing one can be used, and where the cluster should run (on premises or in a cloud, and if a cloud, which one). The answer to that last question is, indeed, that we don't care, it can be wherever; after all, that is one of the benefits of Kubernetes, and OpenShift, for example, runs pretty much everywhere. And in most cases, the required capacity to run our software does not change depending on where the cluster is hosted.

So how do you size a containerized workload running in Kubernetes? To start, you can specify two values for both CPU and memory for each container in a pod, namely the "request" and the "limit". In a nutshell, the request value determines how much capacity is reserved for that pod, and for memory that is quite simple: the aggregate memory requested by all pods on a given cluster node cannot exceed the memory that is available on that node. Duh, right? For CPU, it is a bit more complex, because CPU is a shared resource; in other words, a CPU request reserves a share of CPU time (rather than a whole CPU) for the pod.
Limits let you define upper boundaries for both CPU and memory; in other words, they represent the point a pod cannot go past, which is useful for ensuring that a pod, or set of pods, cannot monopolize a node or an entire cluster. Note that the aggregate CPU limits across all pods can go above the available CPUs in the cluster; in other words, you can "overcommit" your cluster. Most of our test clusters are heavily overcommitted, because they don't need production-level stability.
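As a minimal sketch, here is how requests and limits appear in a pod spec; the image name and the numbers are made up for illustration and not a recommendation for any particular workload:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sample-app          # hypothetical pod name
spec:
  containers:
  - name: app
    image: registry.example.com/sample-app:1.0   # placeholder image
    resources:
      requests:
        cpu: "500m"      # half a CPU's worth of time reserved for scheduling
        memory: "512Mi"  # memory reserved on the node
      limits:
        cpu: "2"         # the container is throttled beyond two CPUs' worth of time
        memory: "1Gi"    # the container is OOM-killed if it exceeds this
```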

So back to the question: how do you determine the required size of a deployment? One way of doing that is to simply add up all the pods' CPU and memory requests and limits. We have a rule for our software that each pod must declare these values. As I also mentioned in a previous post, another option that we discuss regularly is to not declare any of these at the pod level, but to use namespace quotas instead. And we regularly run a useful tool called kube-capacity against our clusters, which captures the current requests and limits for both CPU and memory.
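For the namespace quota approach, a minimal sketch of a Kubernetes ResourceQuota could look like this (the namespace name and the numbers are hypothetical):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: demo            # hypothetical namespace
spec:
  hard:
    requests.cpu: "8"        # total CPU requests across all pods in the namespace
    requests.memory: "16Gi"  # total memory requests across all pods
    limits.cpu: "16"         # total CPU limits across all pods
    limits.memory: "32Gi"    # total memory limits across all pods
```

With a quota like this in place, the namespace itself becomes the unit you size, rather than every individual pod.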

A problem with all this is that it paints a fairly static picture of the capacity that a workload consumes. But we want systems to be elastic and able to react dynamically to fluctuations in load, right? And Kubernetes offers us a variety of tricks that make it possible to scale, at the cluster level (by adding and removing nodes), but also at the pod level, using things like the Horizontal Pod Autoscaler. Moreover, and I hope you forgive me for a shameless plug here, we use Turbonomic to monitor resource consumption in our clusters, which then gives us recommendations on how to rightsize the environments, and those recommendations can even be applied automatically. Very cool.
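To make the Horizontal Pod Autoscaler a little more concrete, here is a minimal sketch that scales a (hypothetical) Deployment between 2 and 10 replicas based on average CPU utilization relative to the pods' requests:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sample-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sample-app          # hypothetical deployment name
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # add replicas when average CPU usage exceeds 70% of requests
```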

However, while it is true that it is hard to predict exactly how much capacity a workload consumes, and that there are mechanisms to optimize the footprint, both statically and automatically, you still need to have at least a rough idea about the required infrastructure. Our software comes with install flavors we call "profiles", which have predefined settings for things like pod sizes based on expected use, for example, a "starter profile" for cases where we expect minimal load and don't need any redundancy, or a "production profile" that comes with fully redundant components for high availability. (Note that I haven't even mentioned required storage capacity here, I'll leave that for a future post.)
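To illustrate the idea, a profile could be expressed as a set of values along these lines; this is a purely hypothetical sketch, not the actual settings shipped with our software:

```yaml
# Hypothetical profile definitions for illustration only
profiles:
  starter:
    replicas: 1              # no redundancy, minimal expected load
    resources:
      requests: { cpu: "250m", memory: "512Mi" }
      limits:   { cpu: "1",    memory: "1Gi" }
  production:
    replicas: 3              # redundant components for high availability
    resources:
      requests: { cpu: "1", memory: "2Gi" }
      limits:   { cpu: "4", memory: "4Gi" }
```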

Let me also mention a couple of challenges we have seen, and I'm sure everyone running a Kubernetes cluster in production can share similar experiences.
We found that the way CPU limits are enforced (via CFS quotas in the kernel), at least on OpenShift, pods start getting throttled long before they approach their defined CPU limit. Pods that run at, say, half their CPU limit are already seriously slowed down. That has led some of our teams to recommend setting CPU limits to really high values, indeed to avoid any throttling, but that has other negative side effects, for example, heavy overcommitting of clusters.
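The pattern looks roughly like the container-spec fragment below (a sketch with made-up numbers): a modest request for scheduling, combined with a generous CPU limit so that bursts have headroom before throttling kicks in. The flip side is that a node full of such pods is heavily overcommitted on CPU.

```yaml
# Illustrative fragment of a container spec: modest request, generous CPU limit
resources:
  requests:
    cpu: "500m"     # what the scheduler reserves on the node
    memory: "1Gi"
  limits:
    cpu: "4"        # high ceiling to leave burst headroom and reduce throttling
    memory: "1Gi"   # memory limit kept tight; overcommitting memory is riskier
```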
A related side effect is that we have seen cases where making requests and limits too small leads to erratic behavior of a pod; for example, throttling can slow down the pod to a point where the liveness probes assume the pod is failing, and subsequently the pod is killed and recreated, even though it was healthy. So, if you see pods in your cluster that are being restarted constantly, you may have set the CPU limit too low. This has reached the point where our support teams, when alerted to pods either being restarted or crashing, might recommend increasing the memory and CPU for the pod as a first measure, to see if that corrects the situation. (Hint: when doing a health check on your cluster, don't just look for crashed pods - look for restarts, too.)
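One way to soften this, besides raising the limits, is to give the liveness probe more forgiving timings, so that a temporarily throttled pod is not killed prematurely. A hypothetical sketch (the endpoint, port, and numbers are assumptions, not values from our software):

```yaml
# Liveness probe with relaxed timings to tolerate CPU throttling
livenessProbe:
  httpGet:
    path: /healthz          # assumed health endpoint
    port: 8080
  initialDelaySeconds: 60   # give the pod time to start even when throttled
  periodSeconds: 20
  timeoutSeconds: 10        # allow slow responses before counting a failure
  failureThreshold: 5       # require several consecutive failures before a restart
```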
And by the way, keep in mind that if you run Java code in your container, you have to make sure that the JVM can actually use enough of the memory provided to the container to function. We saw a case recently where we kept getting OutOfMemory exceptions, and increasing the memory for the pod didn't resolve it, because the JVM was started with a hardcoded heap size.
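A common way to avoid that, assuming a container-aware JVM (Java 10+ or Java 8u191 and later), is to size the heap as a percentage of the container's memory limit instead of hardcoding -Xmx. A sketch, with a placeholder image:

```yaml
# Let the JVM derive its heap size from the container memory limit
containers:
- name: app
  image: registry.example.com/java-app:1.0     # placeholder image
  env:
  - name: JAVA_TOOL_OPTIONS
    value: "-XX:MaxRAMPercentage=75.0"         # use up to 75% of the container limit for the heap
  resources:
    limits:
      memory: "2Gi"
```

With this setup, raising the pod's memory limit automatically raises the heap ceiling as well.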

Finally, the overall footprint of a solution is also influenced by the structure of the cluster as a whole. For example, all the public cloud providers offer different types of servers (different amounts of CPU and memory, as well as different CPU types), with a different cost associated with each type. I have seen calculations where a cluster consisting of a standard server type was cheaper in infrastructure cost than a cluster with equivalent capacity built from more expensive servers. I often get asked whether we prefer clusters with fewer, larger nodes, or clusters with many, smaller nodes. And I don't think there is one answer to that question; it really depends on what you are running on these nodes, where it runs, how you operate it, and so on. My favorite worker node type has 16 vCPU and 64GB of memory, but the default nodes deployed by (OpenShift) installers are often smaller. The master nodes we deploy typically have 8 vCPU and 32GB of memory, or 4 vCPU and 16GB, and that is just fine for our use.

As I mentioned above, it is hard to do this topic justice in just a short blog post, and I will try to dive into some of the aspects mentioned here in much more detail in future posts.

(Photos by the blowup and by Siora Photography on Unsplash.)
