Making the case for Kubernetes operators

In my previous post, I tried to describe what operators are. Here, I will outline what drove us towards using them across almost all of our offerings, what has and has not changed since then, and whether I believe we would make the same decision again today.

As background, when we first started creating software for Kubernetes, we chose Helm as the orchestration tool. Helm 2, that is, which required the infamous 'tiller' to be deployed in the target cluster. That tiller was broadly seen as a potential "Trojan horse": a component running at cluster scope that could possibly be hijacked to do bad things. And it was indeed one of the initial reasons we looked for alternatives. Well, as of Helm 3, there is no tiller anymore, so that problem went away. But we had already switched over to operators.

Operators. I remember more than one meeting with customers where someone would ask "Do you use operators?", and when I answered "Yes", it was seen as a sure sign that we knew what we were doing. Really! But, seriously, leveraging the operator pattern had tremendous momentum, and after some exploration and evaluation, we decided to build operators as the main way to install and manage our software in Kubernetes and OpenShift. OpenShift, specifically, was rearchitected at the same time, going from version 3 to version 4, to be built entirely around operators. Moreover, it had (and still has) first-class operator support built in via OLM and OperatorHub, and that was another factor.

So, besides the fact that they are cool, why build operators? Let me start with what I consider the biggest reason: Kubernetes itself is based on a model that uses operators, where desired target state is captured as a set of declarative resources and associated runtime components are deployed to create (and maintain!) that desired state. Extending that same model to custom workloads and applications ensures you stay well aligned with the operational model and all the tools and technologies that exist for Kubernetes. What's more, we have seen GitOps become more and more popular lately, and operators are a natural fit for GitOps, because you can articulate the entire stack - infrastructure, Kubernetes and applications - in a consistent and declarative model. Ultimately, installing an application, or whatever other workload, becomes as simple as "kubectl apply -f myApplicationCR.yaml". It doesn't get much simpler than that.
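To make that a little more concrete, here is a minimal sketch of what such a custom resource could look like. The group, kind, and fields are hypothetical and purely for illustration, not taken from any particular product:

    # Hypothetical custom resource for an application managed by an operator.
    # Applying this file is all it takes to request an instance; the operator's
    # controller does the rest and keeps the actual state in sync with it.
    apiVersion: example.com/v1
    kind: MyApplication
    metadata:
      name: my-application
      namespace: my-apps
    spec:
      version: "2.3.0"      # desired application version
      replicas: 3           # desired scale
      storageClass: fast    # illustrative knob exposed by the CRD

Checked into a Git repository, that same file becomes the GitOps source of truth for the instance.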

The next point is equally important. I have always thought of Helm as being mostly a "fire and forget" type of approach. You kick off the installation of the chart, and when the deployment is done, hopefully successfully, its state is "deployed". There is not much that Helm does for you afterwards: it does not monitor the deployed resources, and you have to explicitly call whatever other function you want to execute. It does support upgrades, though, which is obviously important, and it also has a "rollback" function. Rollback support with operators, and OLM, is sorely lacking (or should I say: missing?), in my opinion. At the same time, rollback is a tricky topic, because even if there is a mechanism to say "go back to a previous version", the software has to support or at least tolerate it, and do so without the need to simply delete and redeploy everything.

Operators can go above and beyond Helm's basic cycle of install -> upgrade/rollback -> uninstall, because they keep watching and monitoring the resources they own and take corrective action in case something changes. This allows a system to become more automated and autonomous in maintaining its desired state. The highest level of operator maturity is called "auto pilot" for that very reason. But note that I wrote "can go above and beyond", because this corrective action has to be implemented to begin with; if the developer of the operator hasn't added the required reconcile logic, it won't reconcile anything, of course.
And that brings me to an important point: operators have to be implemented, and as a result, there can be very good and very bad, very mature and very immature operators. Saying that you have one has no value in itself; it is like saying "I have code". That has certainly changed from the earlier days when, as I mentioned above, merely saying the word "operator" made you an expert.

Each operator represents an abstraction, and that can be both a good and a bad thing. It is good in that you can encapsulate the logic to manage and integrate a number of Kubernetes resources in one place. It is bad in that you are now one level removed from those basic resources. The creation of the underlying resources is implemented in code (unless you use Helm or Ansible-based operators), and if you want to change anything about those resources, you have to hope that the abstraction (the CRD) gives you a way to do so, or you have to change and rebuild the entire operator. For example, assume you want to add affinity rules to a deployment that isolate a particular workload on a set of nodes. The operator may not let you define these affinity rules in its CRD, and even worse, it may overwrite any changes you make to the pods at runtime as part of its reconciliation.
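For illustration, this only works if the CRD explicitly exposes such a knob. The sketch below assumes a hypothetical CRD whose spec includes an affinity field that the reconcile logic copies into the pod templates it owns; the group, kind, and field names are made up:

    # Hypothetical: this only helps if the operator's CRD exposes scheduling
    # settings and its reconcile logic passes them through to the pods it owns.
    apiVersion: example.com/v1
    kind: MyApplication
    metadata:
      name: my-application
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: workload
                    operator: In
                    values:
                      - my-application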
With Helm charts, making this kind of change is much easier, because you can simply override resource definitions in the chart, without the need to recompile or rebuild anything. At the same time, it is of course also easier to break things if you start making changes at that low a level.
A colleague of mine has developed a very clever mechanism by which users of our software can optionally "inject" changes to resources, captured in a separate CR, without impacting the existing operator. I think that such a "resource override" mechanism should be standard for all operators, and maybe we should try to get it added to the Operator SDK.
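I can't share the actual implementation here, but purely as an illustration, such an override could be expressed in a separate CR that names a target resource and a patch to lay over it; every name and field below is made up:

    # Purely illustrative sketch of a "resource override" CR. The idea is that
    # the operator applies this patch on top of the resources it generates
    # during each reconcile, instead of the user editing those resources
    # directly (and having the changes reverted).
    apiVersion: example.com/v1
    kind: ResourceOverride
    metadata:
      name: isolate-worker-nodes
    spec:
      target:
        apiVersion: apps/v1
        kind: Deployment
        name: my-application-worker
      patch:
        spec:
          template:
            spec:
              nodeSelector:
                workload: my-application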

Let me mention a couple of other drawbacks. For starters, there is a bit of a learning curve to get through before you can write your own operator. Being skilled in Go definitely helps. Next, it basically adds a whole new step to the install experience for an application: before you can install the application, you have to install the operator managing that application. Maybe not that big of a deal, but the expectation often is that an operator is installed and managed by the cluster administrator, whereas the application instance is installed and managed by the application owner. And I have heard of administrators who would rather not be responsible for dealing with operators.
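With OLM, for example, that extra step typically means creating a Subscription for the operator before any application CR can be applied; roughly along these lines, where the package, channel, and catalog names are placeholders:

    # The operator has to be installed first, e.g. via an OLM Subscription;
    # package, channel, and catalog names below are placeholders.
    apiVersion: operators.coreos.com/v1alpha1
    kind: Subscription
    metadata:
      name: my-application-operator
      namespace: operators
    spec:
      name: my-application-operator      # package name in the catalog
      channel: stable
      source: my-catalog
      sourceNamespace: olm
      installPlanApproval: Automatic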
If you are using OLM, operators can be installed in two different modes: basically, either creating and watching resources in all namespaces, or creating and watching resources in one namespace only. The first option is often considered risky by cluster owners, because it implicitly requires cluster-level roles and access for that operator - which could be abused in a malicious way, or do bad things by accident. The second option limits access to one namespace, but what if your application runs across two namespaces? I think we need to evolve OLM towards a model where we have exact control over which resources and namespaces the operator has access to, and it's possible that work towards such a model is already happening.
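For context, that watch scope is expressed through an OperatorGroup. Restricting an operator to a single namespace looks roughly like this (namespace and group names are again just examples):

    # An OperatorGroup with a targetNamespaces list restricts its member
    # operators to that namespace; omitting the list instead gives you the
    # all-namespaces mode.
    apiVersion: operators.coreos.com/v1
    kind: OperatorGroup
    metadata:
      name: my-application-group
      namespace: my-apps
    spec:
      targetNamespaces:
        - my-apps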

While I am at it: OLM is a bit of a beast. It is very powerful, but it is also complex. I keep saying that two of its main benefits are the ability to handle dependencies and the ability to control versions. On that latter point, though, while there is a way to (manually or automatically) upgrade operators, that does not necessarily mean the operands are upgraded, too. You have two options: either you tie the operator version to the operand version, or you decouple them and, for example, have an attribute in the CR that indicates the operand version, so that an upgrade is triggered by updating that attribute. I also don't find the creation of catalogs and catalog entries trivial, with bundles and indexes and what have you, even though I know that this part is about to change. Another improvement that is coming, by the way, is better support for air-gapped environments.
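The decoupled variant can be as simple as a version attribute in the CR, which the operator compares against the running operand during reconciliation; continuing the purely illustrative example from above:

    # Illustrative only: the operand version is declared in the CR, so upgrading
    # the operand is just a matter of editing this field; the operator's
    # reconcile logic detects the difference and performs the rollout.
    apiVersion: example.com/v1
    kind: MyApplication
    metadata:
      name: my-application
    spec:
      version: "2.4.0"   # bumping this triggers an operand upgrade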

Another, relatively minor drawback is that there are now more moving parts. The operator has a runtime component, the controller, and that will use additional capacity on your cluster. I say it is a relatively minor issue because operators should really be implemented as small pods that won't take up a lot of space. A related consideration is granularity. As in: how many operators do you need? Assume you have a complex application that consists of many different services. Should you create one large operator that controls the entire set of services, or should you define an operator per service and effectively create a hierarchy of operators working together? There is no right or wrong answer, I think; it is really a design decision you have to make. Similarly, it is critical to define appropriate CRDs, because they are hard to change later on once you have them in your cluster.

So where does that leave us? My answer is that I still believe that the use of operators is a good architectural choice. But I also think that this is only true if you create operators that support end-to-end lifecycle management, not 'just' installation, with the goal of eventually reaching "auto pilot" mode. I also recommend using OLM, since it offers some much needed functionality, especially its catalog function, but it is complex and requires some (self-)education before you can fully leverage it.
Teams that are keen to keep everything at the level of standard Kubernetes resources, and are willing to pay the price of having to manage those resources at a fine-grained level (and gain the benefits from it, of course), might be well served with Helm.
