Kubernetes operators

The use of "operators" has been one of the key architectural decisions we took when embarking on the journey to bring our software to Kubernetes and OpenShift. Not that It has been that way from day one. And it's being challenged regularly, which I totally appreciate, and even enjoy - few things are more fun than arguing over an architecture, or a pattern, and constantly question if we are doing the right thing.

In this post, I'll try to describe the basic principles of operators and how they led us to use operators across our software offerings. There are lots of articles out there that describe operators, but I still feel it makes sense to add my own description. I will then follow up with a separate post outlining some of the pros and cons, and explaining why I still believe it is the right choice.

So I'll start out with some level-setting on what operators are; if you are already familiar with the topic, you can safely skip ahead to the next part.

Operators are native to Kubernetes

In Kubernetes, everything has an abstraction, in the form of a well-defined resource type. There are resources describing containers and pods, network interfaces, storage volumes, secrets, and more. There are even abstractions describing the underlying topology, for example, the nodes that make up the cluster. For each of these resources, there is a runtime component that understands the individual resource types and their attributes and maps them into the actual software that runs the system. These runtime components, called "controllers", not only watch for the creation of a resource and its initial deployment, they are also in charge of keeping the actual system in sync with the declared resources. In other words, the runtime tries to keep the cluster in exactly the state that is described by its resources. If the system starts to drift from that target state, the controllers try to reconcile it accordingly.

A simple example of this is the ReplicaSet controller. One of the things a ReplicaSet describes is one or more pods (which then run the actual containers), and the controller attempts to keep the number of running pods equal to the expected number stated in the .spec.replicas attribute of the ReplicaSet instance. This means that if a pod fails, or is deleted for whatever reason, the controller will automatically create a new one. In that respect, Kubernetes lets you automate tasks that would otherwise have to be done manually by a human operator. And that's exactly where the name "operator" for this pattern originates: the system contains software operators that monitor and manage the cluster and its resources.
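
To make the reconciliation idea concrete, here is a deliberately simplified sketch in Go. It is not the real ReplicaSet controller; the toyCluster type and the polling loop are invented purely to illustrate the compare-and-correct cycle.

    package main

    import (
        "fmt"
        "time"
    )

    // toyCluster is a hypothetical stand-in for cluster state; a real controller
    // observes this through the Kubernetes API server instead.
    type toyCluster struct {
        desiredReplicas int // corresponds to .spec.replicas
        runningPods     int // what is actually running right now
    }

    // reconcile compares desired and actual state and takes corrective action.
    func reconcile(c *toyCluster) {
        switch {
        case c.runningPods < c.desiredReplicas:
            fmt.Printf("creating %d pod(s)\n", c.desiredReplicas-c.runningPods)
            c.runningPods = c.desiredReplicas
        case c.runningPods > c.desiredReplicas:
            fmt.Printf("deleting %d pod(s)\n", c.runningPods-c.desiredReplicas)
            c.runningPods = c.desiredReplicas
        default:
            fmt.Println("in sync, nothing to do")
        }
    }

    func main() {
        c := &toyCluster{desiredReplicas: 3, runningPods: 2}
        // Real controllers are event-driven; a polling loop keeps the sketch simple.
        for i := 0; i < 3; i++ {
            reconcile(c)
            time.Sleep(time.Second)
        }
    }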

Custom operators

So why not extend that same pattern to the applications and workloads running in Kubernetes? Kubernetes lets you define your own resource types, in the form of Custom Resource Definitions (CRDs), and you can deploy your own controllers, which understand these custom types and map them into 'real' things. For example, you can define a resource type for a "database", where the controller takes care of starting the appropriate containers to run the DB server, allocating persistent volumes and creating secrets with the credentials needed to access that database. All you have to do is create a resource instance of type "database" and let the controller handle the rest. And this combination of one or more CRDs and a runtime controller is what makes an operator in the Kubernetes world.
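
From the operator author's point of view, such a custom type is usually defined as a Go struct from which the CRD is generated. The following is a minimal, hypothetical sketch in the style produced by the Operator SDK's scaffolding; the group, fields and names are invented for illustration.

    // Package v1alpha1 defines a hypothetical "Database" custom resource.
    package v1alpha1

    import (
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // DatabaseSpec is the desired state: what the user asks for.
    type DatabaseSpec struct {
        // Version of the database engine to run.
        Version string `json:"version"`
        // StorageSize requested for the persistent volume, e.g. "10Gi".
        StorageSize string `json:"storageSize"`
        // SecretName of the Secret the controller creates with access credentials.
        SecretName string `json:"secretName,omitempty"`
    }

    // DatabaseStatus is the observed state, maintained by the controller.
    type DatabaseStatus struct {
        Phase string `json:"phase,omitempty"` // e.g. "Provisioning", "Ready"
    }

    // +kubebuilder:object:root=true
    // +kubebuilder:subresource:status

    // Database is the resource instance a user creates; the controller turns it
    // into the pods, persistent volumes and secrets mentioned above.
    type Database struct {
        metav1.TypeMeta   `json:",inline"`
        metav1.ObjectMeta `json:"metadata,omitempty"`

        Spec   DatabaseSpec   `json:"spec,omitempty"`
        Status DatabaseStatus `json:"status,omitempty"`
    }

The CRD itself, i.e. the YAML document that registers the new type with the API server, is typically generated from markers like the kubebuilder comments above.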
 

At IBM, we created custom types for many of the components in our software portfolio, and we developed controllers that take over the lifecycle management for these types. This allowed us to codify much of the orchestration needed to deploy our software, across the container, network, storage and other resources involved. We can test and harden that logic to the point where our customers no longer need to worry about doing it themselves. This makes our software easier to consume, but it also takes away fine-grained control over individual resources: as with any abstraction, it hides details in favor of a simpler interface. This lack of control is one of the main objections I have heard from operational teams that oppose the use of operators. I'll get back to that in my later post.

Install versus lifecycle management

One aspect to keep in mind is that operators are meant to not only create the runtime elements for an application, in other words "install" it, but also to ensure the application stays at its desired state. To do this, the controller effectively runs in an endless loop that constantly checks the system for changes and takes corrective action if something drifts. This loop is commonly described as having distinct "Observe", "Analyze" and "Act" phases in the controller.
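
In Go operators built on the controller-runtime library, that loop surfaces as a Reconcile function that gets called whenever a watched resource changes. The sketch below continues the hypothetical Database example from above; the module path and the desiredDeployment helper are made up.

    package controllers

    import (
        "context"

        appsv1 "k8s.io/api/apps/v1"
        corev1 "k8s.io/api/core/v1"
        apierrors "k8s.io/apimachinery/pkg/api/errors"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        ctrl "sigs.k8s.io/controller-runtime"
        "sigs.k8s.io/controller-runtime/pkg/client"

        examplev1 "example.com/database-operator/api/v1alpha1" // hypothetical module path
    )

    // DatabaseReconciler reconciles Database resources.
    type DatabaseReconciler struct {
        client.Client
    }

    func (r *DatabaseReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
        // Observe: fetch the Database instance this event refers to.
        var db examplev1.Database
        if err := r.Get(ctx, req.NamespacedName, &db); err != nil {
            // The resource is gone; nothing left to do.
            return ctrl.Result{}, client.IgnoreNotFound(err)
        }

        // Analyze: does the Deployment that should run the DB server exist?
        var dep appsv1.Deployment
        err := r.Get(ctx, req.NamespacedName, &dep)
        if apierrors.IsNotFound(err) {
            // Act: create what is missing to bring the cluster back to the desired state.
            dep = desiredDeployment(&db)
            err = r.Create(ctx, &dep)
        }
        return ctrl.Result{}, err
    }

    // desiredDeployment builds the Deployment we want for a given Database.
    // A real controller would also set owner references, resources, probes, etc.
    func desiredDeployment(db *examplev1.Database) appsv1.Deployment {
        labels := map[string]string{"app": db.Name}
        replicas := int32(1)
        return appsv1.Deployment{
            ObjectMeta: metav1.ObjectMeta{Name: db.Name, Namespace: db.Namespace},
            Spec: appsv1.DeploymentSpec{
                Replicas: &replicas,
                Selector: &metav1.LabelSelector{MatchLabels: labels},
                Template: corev1.PodTemplateSpec{
                    ObjectMeta: metav1.ObjectMeta{Labels: labels},
                    Spec: corev1.PodSpec{
                        Containers: []corev1.Container{{
                            Name:  "db",
                            Image: "postgres:" + db.Spec.Version, // image choice is illustrative
                        }},
                    },
                },
            },
        }
    }

In practice the SDK scaffolds this function for you and you fill in the logic; a production controller would also update the resource's status, handle upgrades, and clean up on deletion.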

Moreover, operators can offer lifecycle management functions. Going back to the database example I used above, there could be an additional CRD that represents a backup of the database. The controller would watch for the creation of such a backup instance and trigger the actual backup of the database accordingly. This means that the system is managed through the creation, deletion and change of custom resources, with runtime controllers acting on those events. I'm emphasizing this because I often have the impression that operators are thought of as "installers", whereas they are meant to be much more. In fact, there is a five-level capability model that lets you indicate how much of that lifecycle your operator covers.
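
Staying with the hypothetical example, such a backup would simply be one more custom type that the same operator watches; creating an instance of it is what triggers the work, and the controller records the outcome in its status. Again a sketch, with invented fields:

    package v1alpha1

    import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

    // DatabaseBackupSpec names the Database to back up and where to store the result.
    type DatabaseBackupSpec struct {
        DatabaseName string `json:"databaseName"`
        Destination  string `json:"destination,omitempty"` // e.g. an object-storage bucket
    }

    // DatabaseBackupStatus is filled in by the controller once the backup has run.
    type DatabaseBackupStatus struct {
        Completed bool   `json:"completed,omitempty"`
        Message   string `json:"message,omitempty"`
    }

    // +kubebuilder:object:root=true

    // DatabaseBackup requests a backup of an existing Database; the controller
    // reacts to its creation and performs the actual backup.
    type DatabaseBackup struct {
        metav1.TypeMeta   `json:",inline"`
        metav1.ObjectMeta `json:"metadata,omitempty"`

        Spec   DatabaseBackupSpec   `json:"spec,omitempty"`
        Status DatabaseBackupStatus `json:"status,omitempty"`
    }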

Operator Lifecycle Manager

So, operators manage the lifecycle of whatever application or workload they were built for. But how do you manage the operators themselves? That is where the Operator Lifecycle Manager (OLM) comes in. It lets you build catalogs of operators from which you can deploy both the operators and their "operands" (i.e. the resources they manage), it lets you define dependencies between operators, it controls how operators are versioned and upgraded, and more. To be sure, you don't have to use OLM to use operators; remember, an operator is just one or more CRDs plus a runtime controller that you deploy in the form of, say, a Deployment. But OLM makes using operators a lot easier, which is why we have OLM definitions for all of our operators.

By the way, OLM is part of the larger Operator Framework, which is a CNCF project, so it's entirely community-driven. Another element of that framework is the Operator SDK, and if you plan to create your own operator, I highly recommend that you use it.

Operator SDK

Indeed, we found that by far the easiest way to develop your own operator is to use the Operator SDK. At a high level, it supports the development of three types of operators, namely
  • Go-based, 
  • Ansible-based, or 
  • Helm-based. 
The Ansible and Helm flavors are similar in that they effectively wrap an existing Ansible playbook or Helm chart into an operator. This is a good option if you already have assets you would like to reuse, and/or if you are not familiar with writing Go code. We had quite a few of these operators in the early days. Over time, however, almost all of our teams switched to Go-based operators, because that is the most flexible and extensible way to develop them, and it became almost a necessity when adding functionality that goes beyond the initial deployment ("day 2" management). These operators use the rich support for the Kubernetes API in Go to build and manage resources.
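
For a rough feel of what that looks like in practice, here is a minimal main function wiring the hypothetical reconciler from earlier into controller-runtime's manager. The module paths are invented, and AddToScheme is something the SDK's scaffolding generates for the API package.

    package main

    import (
        "os"

        "k8s.io/apimachinery/pkg/runtime"
        clientgoscheme "k8s.io/client-go/kubernetes/scheme"
        ctrl "sigs.k8s.io/controller-runtime"

        examplev1 "example.com/database-operator/api/v1alpha1" // hypothetical module path
        "example.com/database-operator/controllers"            // hypothetical module path
    )

    func main() {
        // Register both the built-in Kubernetes types and our custom Database types.
        scheme := runtime.NewScheme()
        _ = clientgoscheme.AddToScheme(scheme)
        _ = examplev1.AddToScheme(scheme) // generated by the SDK scaffolding

        mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{Scheme: scheme})
        if err != nil {
            os.Exit(1)
        }

        // Watch Database resources and invoke the reconciler whenever they change.
        if err := ctrl.NewControllerManagedBy(mgr).
            For(&examplev1.Database{}).
            Complete(&controllers.DatabaseReconciler{Client: mgr.GetClient()}); err != nil {
            os.Exit(1)
        }

        // Run the reconcile loop until a termination signal arrives.
        if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
            os.Exit(1)
        }
    }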

Alternatives

What are the alternatives to developing and/or using operators? Well, the simplest is to not use operators at all: no CRDs and no custom controllers. That in turn means sticking with the native Kubernetes resource types and using "kubectl" as the main (and maybe only) way to interact with the cluster and the workloads deployed on it.

If you need to orchestrate multiple resources in support of your application, a number of tools exist that let you automate this. The most prominent and most popular is probably Helm. It lets you assemble related resource files into so-called "Helm charts", and it comes with a template engine that supports embedding templates into these resource files to account for variations between individual deployments. Helm used to require an agent deployed into each target cluster, called Tiller, but recent versions (Helm 3 and later) no longer need it.
We actually started out using Helm when we first created software for Kubernetes, and it served us really well. I'll go into the reasons why we moved away from Helm in my next post.

Another popular tool in this space is "kustomize". I don't know what the numbers are in terms of users for Helm versus kustomize, but I have the impression that kustomize is quickly gaining popularity. It doesn't come with templates; instead it uses YAML overlays to let you define variations, and it is natively plugged into the kubectl CLI (via "kubectl apply -k"), so it doesn't require a separate client the way Helm does.

Besides Kubernetes-native solutions like these, you also have more, umm, traditional orchestration and automation tools, like Ansible or Terraform, but I won't go into those here. There are plenty of websites comparing all of these options, and I don't want to repeat them. Also, keep in mind that these solutions are not mutually exclusive; you can definitely mix and match them. You can take the output of a Helm chart and feed it into kustomize. And you can definitely use both to handle custom resources that are managed by an operator on the cluster.


