Making the case for Kubernetes operators
In my previous post, I tried to describe what operators are. Here, I will outline what drove us towards using them across almost all of our offerings, what has and has not changed since then, and whether I believe we would make the same decision again today.
As background, when we first started creating software for Kubernetes, we chose Helm as the orchestration tool. Helm 2, that is, which required the infamous 'tiller' to be deployed in the target cluster. Tiller was broadly seen as a potential "Trojan horse": something that ran at cluster scope and could possibly be hijacked to do bad things. That was indeed one of the initial reasons we looked for alternatives. As of Helm 3 there is no tiller anymore, so that particular problem went away, but by then we had already switched over to operators.
Operators.
I remember more than one meeting with customers where someone would ask
"Do you use operators?", and when I answered "Yes", it was seen as a
sure sign that we knew what we were doing. Really! But, seriously,
leveraging the operator pattern had tremendous momentum, and after some
exploration and evaluation, we decided to build operators as the main
way to install and manage our software in Kubernetes and OpenShift.
OpenShift, specifically, was rearchitected at the same time, going from version 3 to version 4, to be built entirely around operators. Moreover, it had (and still has) first-class operator support built in via OLM and OperatorHub, and that was another factor.
So, besides the fact that they are cool, why build operators? Let me start with what I consider the biggest reason: Kubernetes itself is built on the same model that operators use, where desired target state is captured as a set of declarative resources and associated runtime components are deployed to create (and maintain!) that desired state. Extending that model to custom workloads and applications keeps you well aligned with the operational model and all the tools and technologies that exist for Kubernetes. What's more, we have seen GitOps become more and more popular lately, and operators are a natural fit for GitOps, because you can articulate the entire stack - infrastructure, Kubernetes and applications - in a consistent, declarative model. Ultimately, installing an application, or any other workload, becomes as simple as "kubectl apply -f myApplicationCR.yaml". It doesn't get much simpler than that.
The next point is equally important. I have always thought of Helm as mostly a "fire and forget" approach. You kick off the installation of the chart, and when the deployment is done, hopefully successfully, its state is "deployed". There is not much that Helm does for you afterwards: it does not monitor the deployed resources, and you have to explicitly call whatever other function you want to execute. It does support upgrades, though, which is obviously important, and it also has a "rollback" function. Rollback support with operators, and OLM, is sorely lacking (or should I say: missing?), in my opinion. At the same time, rollback is a tricky topic, because even if there is a mechanism to say "go back to a previous version", the software has to support it, or at least tolerate it, without the need to simply delete and redeploy everything.
Operators
can go above and beyond Helm's basic cycle of install ->
upgrade/rollback -> uninstall, and that is because they will keep
watching and monitoring their owned resources and take corrective action
in case something changes. This makes a system more automated and autonomous in maintaining its desired state. The highest level of operator maturity is called "auto pilot" for that very reason. But note that I wrote "can
go above and beyond", because this corrective action has to be
implemented to begin with, and if the developer of the operator hasn't
added the required reconcile logic, it won't reconcile anything, of
course.
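To illustrate what that reconcile logic looks like in practice, here is a hedged sketch of a controller-runtime style reconcile loop for the hypothetical MyApplication resource from above. The module path and the details of the owned Deployment are assumptions for the sake of the example, not our actual code.

```go
// A minimal, hypothetical reconcile loop (controller-runtime style).
// It continuously drives an owned Deployment towards the state declared in the CR.
package controllers

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"

	myv1alpha1 "example.com/myapp-operator/api/v1alpha1" // hypothetical module path
)

type MyApplicationReconciler struct {
	client.Client
}

func (r *MyApplicationReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Fetch the custom resource that triggered this reconcile.
	var app myv1alpha1.MyApplication
	if err := r.Get(ctx, req.NamespacedName, &app); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Create the owned Deployment if it is missing, or patch it back to the
	// desired shape if someone has changed it out-of-band.
	deploy := &appsv1.Deployment{
		ObjectMeta: metav1.ObjectMeta{Name: app.Name, Namespace: app.Namespace},
	}
	_, err := controllerutil.CreateOrUpdate(ctx, r.Client, deploy, func() error {
		deploy.Spec.Replicas = &app.Spec.Replicas
		// ... pod template, labels, etc. would be filled in here ...
		return controllerutil.SetControllerReference(&app, deploy, r.Scheme())
	})
	return ctrl.Result{}, err
}

func (r *MyApplicationReconciler) SetupWithManager(mgr ctrl.Manager) error {
	// Watching owned Deployments is what makes out-of-band changes to them
	// re-trigger Reconcile, which is where the "corrective action" happens.
	return ctrl.NewControllerManagedBy(mgr).
		For(&myv1alpha1.MyApplication{}).
		Owns(&appsv1.Deployment{}).
		Complete(r)
}
```

The key point is the last function: if the developer never wires up those watches and the corresponding corrective logic, nothing gets reconciled.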
And
that gets me to an important point: operators have to be implemented,
and as a result, there can be very good and very bad, very mature and
very immature operators. Saying that you have one has no value in itself; it is like saying "I have code". That has certainly changed from
the earlier days when, as I mentioned above, saying the word "operator"
made you an expert.
Each operator represents an abstraction, and that can be both a good and a bad thing. It is good in that you can encapsulate the logic to manage and integrate a number of Kubernetes resources in one place. It is bad in that you are now one level removed from those basic resources. The creation of the underlying resources is implemented in code (unless you use Helm- or Ansible-based operators), and if you want to change anything
about these resources, you have to hope that the abstraction (CRD)
gives you a way to do so, or you have to change and rebuild the entire
operator. For example, assume you want to add certain pod affinity rules
to a deployment that will isolate a particular workload on a set of
nodes. The operator may not let you define these affinity rules in its
CRD, and even worse, it may overwrite any changes you make to the pods
at runtime as part of its reconciliation.
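One way an operator can avoid this trap is to expose scheduling controls explicitly in its CRD. Purely as an illustration, and not something every operator offers, the hypothetical spec from above could be extended like this:

```go
// Hypothetical sketch: a CRD spec fragment that explicitly exposes pod
// scheduling controls, so users do not have to patch the generated
// Deployment directly.
package v1alpha1

import (
	corev1 "k8s.io/api/core/v1"
)

type MyApplicationSchedulingSpec struct {
	// Affinity is copied verbatim into the pod template of the owned
	// Deployment on every reconcile. If a field like this is absent from
	// the CRD, the user has no supported way to set affinity, and manual
	// edits to the pods will be reverted by the controller.
	Affinity *corev1.Affinity `json:"affinity,omitempty"`
	// NodeSelector and Tolerations are typically exposed alongside it.
	NodeSelector map[string]string   `json:"nodeSelector,omitempty"`
	Tolerations  []corev1.Toleration `json:"tolerations,omitempty"`
}
```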
With
Helm charts, doing this is much easier, because you can simply override
any resource definition in the chart, without the need to recompile or
rebuild anything. At the same time, it's also easier to break things if
you start making changes at a low level, of course.
A colleague of mine has developed a very clever mechanism by which users of our software can optionally "inject" changes to resources, captured in a separate CR, without impacting the existing operator. I
think that such a "resource override" mechanism should be standard for
all operators, and maybe we should try to get it added to the Operator
SDK.
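I won't reproduce my colleague's design here, but purely as a hypothetical sketch of the idea, such an override CR could look roughly like this (all names are invented for illustration and do not come from an existing API):

```go
// Purely hypothetical sketch of a "resource override" CR: users declare
// patches in a separate resource, and the operator merges them into the
// objects it generates during reconciliation.
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
)

type ResourceOverrideSpec struct {
	// TargetKind and TargetName identify the generated object to patch,
	// e.g. the Deployment created by the operator.
	TargetKind string `json:"targetKind"`
	TargetName string `json:"targetName"`
	// Patch holds a strategic-merge-style patch applied after the operator
	// has rendered its own version of the object.
	Patch runtime.RawExtension `json:"patch"`
}

type ResourceOverride struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec              ResourceOverrideSpec `json:"spec,omitempty"`
}
```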
Let
me mention a couple of other drawbacks. For starters, there is a bit of
a learning curve to get through before you can write your own operator.
Being skilled in Go definitely helps. Next, it basically adds a whole
new step to the install experience for an application: before you can
install the application, you have to install the operator managing that
application. Maybe not that big of a deal, but the expectation often is
that an operator is installed and managed by the cluster administrator,
whereas the application instance is installed and managed by the
application owner. And I have heard of administrators who would rather
not be responsible for dealing with operators.
If
you are using OLM, operators can be installed in two different modes,
basically either creating and watching resources in all namespaces, or
creating and watching in one namespace only. The first option is often
considered risky by cluster owners, because it implicitly requires cluster-level roles and access for that operator - which could be abused
in a malicious way, or do bad things by accident. The second option
limits access to one namespace, but what if your application runs across
two namespaces? I think we need to evolve OLM towards a model where we
have exact control over which resources and namespaces the operator has
access to, and it's possible that work towards such a model is already
happening.
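For reference, the choice between those two modes is driven by the OperatorGroup resource that OLM evaluates. The sketch below uses the OLM Go types as I remember them, so treat the module path and field names as assumptions to verify against the operator-framework project.

```go
// Rough sketch, from memory, of how an OLM OperatorGroup scopes which
// namespaces an operator watches.
package main

import (
	"fmt"

	operatorsv1 "github.com/operator-framework/api/pkg/operators/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// An empty TargetNamespaces list means "all namespaces" (cluster scope);
	// listing exactly one namespace restricts the operator to that namespace.
	og := operatorsv1.OperatorGroup{
		ObjectMeta: metav1.ObjectMeta{Name: "my-app-group", Namespace: "my-app"},
		Spec: operatorsv1.OperatorGroupSpec{
			TargetNamespaces: []string{"my-app"},
		},
	}
	fmt.Println(og.Spec.TargetNamespaces)
}
```

What is missing today is something in between: a way to say "these two namespaces, these resources, and nothing else".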
While I am at it, OLM is a bit of a beast. It is very powerful, but it is also complex. I keep saying that two of its main benefits are the ability to handle dependencies and the ability to control versions. On that latter point, though, while there is a way to (manually or automatically) upgrade operators, that does not necessarily mean the operands are upgraded, too. You have two options: either you tie the operator version to the operand version, or you decouple them, for example with an attribute in the CR that indicates the operand version, so that an upgrade is triggered by updating that attribute. I also don't find the creation of catalogs and catalog entries trivial, with bundles and indexes and what have you, even though I know that this part is about to change. Another improvement that is coming, by the way, is better support for air-gapped environments.
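Here is a rough sketch of that second option, again using the hypothetical MyApplication type and reconciler from earlier: the CR carries the operand version, and the reconcile loop only touches the operand when that declared version changes. The image name is made up for illustration.

```go
// Hypothetical sketch of decoupling operator and operand versions: the CR
// declares the operand version, and the controller upgrades the owned
// Deployment's image only when that declared version changes.
package controllers

import (
	"context"
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	myv1alpha1 "example.com/myapp-operator/api/v1alpha1" // hypothetical module path
)

func (r *MyApplicationReconciler) reconcileOperandVersion(ctx context.Context, app *myv1alpha1.MyApplication) (ctrl.Result, error) {
	var deploy appsv1.Deployment
	key := client.ObjectKey{Name: app.Name, Namespace: app.Namespace}
	if err := r.Get(ctx, key, &deploy); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	if len(deploy.Spec.Template.Spec.Containers) == 0 {
		return ctrl.Result{}, nil
	}

	// spec.version in the CR is the single knob the application owner changes
	// to trigger an operand upgrade; the operator version moves independently.
	desiredImage := fmt.Sprintf("registry.example.com/myapp:%s", app.Spec.Version) // made-up image
	if deploy.Spec.Template.Spec.Containers[0].Image != desiredImage {
		deploy.Spec.Template.Spec.Containers[0].Image = desiredImage
		if err := r.Update(ctx, &deploy); err != nil {
			return ctrl.Result{}, err
		}
	}
	return ctrl.Result{}, nil
}
```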
Another,
relatively minor drawback is that there are now more moving parts. The
operator has a runtime component, the controller, and that will use
additional capacity on your cluster. I say that it is a relatively minor
issue, because operators should really be implemented in small pods
that won't take up a lot of space. A related consideration is
granularity. As in, how many operators do you need? Assume you have a
complex application that consists of many different services. Should you
create one large operator that controls the entire set of services, or
should you define an operator per service and effectively create a
hierarchy of operators working together? There is no right or wrong answer, I think; it is really a design decision you have to make.
Similarly, it is critical to define appropriate CRDs, because it is hard
to change those later on once you have them in your cluster.
So
where does that leave us? My answer is that I still believe that the
use of operators is a good architectural choice. But I also think that
that's only true if you create operators that support end-to-end lifecycle management functions, not 'just' installation, with the goal of eventually going into "auto pilot" mode. I also recommend using OLM, since it offers some much-needed functionality, especially its catalog
function, but it is complex and requires some (self-)education before
you can fully leverage it.
Teams
that are keen to keep everything at standard Kubernetes resource level,
and are willing to pay the price for having to manage those resources
at a fine-grained level (and gain benefit from it, of course), might be
well served with Helm.