Top Ten Challenges - Part 9: Air gap, disconnected clusters

(This is part 9 of a series about challenges and considerations that I have come across with customers when deploying software into Kubernetes. You can find the intro post here.)

From the early days of working with customers on the deployment of our Cloud Pak offerings, we have been asked whether and how we support "air gapped installs". Air gapped means that the environment into which we deploy the software is not connected to the Internet, and thus we cannot pull container images and other content from public sites. In my experience, though, almost all of these environments are not truly air gapped; rather, their connectivity to the Internet is tightly controlled, and, for example, pulling container images from public registries is prohibited.

Ironically, one of the supposed benefits of using containers is that images can be 'pulled' on demand, where and when needed, from a variety of public registries, so that software binaries don't have to be manually downloaded from a file server. This is similar to using tools like "apt" to install and maintain packages on Linux. But with air gap, we are back to needing downloads after all; somehow the bits have to make it into the target system, right?

We can divide the assets that form a solution to be deployed into Kubernetes into two categories: container images and resource definitions. The former live in an image registry, the latter in a Git repository such as GitHub. Luckily, it is relatively easy to run both locally, that is, within the air gapped network that the cluster sits in.

There are plenty of options for running your own image registry (OpenShift comes with one built-in), and you can typically find private image registries in your public cloud provider accounts, too. So it becomes a matter of getting the right set of images copied from wherever they live on the Internet into your private registry. Typically you would use a "bastion host" for this, that is, a machine that can access the Internet and is also connected to your private network, and then run a tool like "skopeo" to do the actual copying of the images. The challenge here lies in making sure you get all of the images you need, and in having a process that ensures you regularly update your local registry with more recent image versions, especially when it comes to addressing security vulnerabilities. For Cloud Paks, we created a packaging artifact we call a "CASE", which serves many purposes; one is that it contains a list of all the images needed for a particular component, which can then be used to mirror them to a local image registry. I actually expect that we'll see more innovations and tools emerging that help with the management of images overall, including for air gap scenarios.
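
To make this concrete, here is a minimal sketch of what the mirroring step can look like with skopeo, run from the bastion host. The registry host (registry.internal.example), the image names, and the images.txt file are all illustrative placeholders, not the actual Cloud Pak image lists:

```
# Log in to the private registry (the public source may need a login, too)
skopeo login registry.internal.example

# Copy a single image, tag preserved, from a public registry into the private one
skopeo copy \
  docker://quay.io/example/some-operator:v1.2.3 \
  docker://registry.internal.example/mirror/some-operator:v1.2.3

# Given a plain-text list of images (for example, extracted from a CASE),
# mirror them all; ${image#*/} strips the public registry host prefix
while read -r image; do
  skopeo copy "docker://${image}" \
    "docker://registry.internal.example/mirror/${image#*/}"
done < images.txt
```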

The other category of assets, i.e. the resource definitions for the Kubernetes resources, ideally lives in a Git repository and becomes part of a fully automated management system based on GitOps principles. I happen to have a strong opinion about this: as a general rule, all Kubernetes-based systems should be managed through GitOps, that is, by storing files in Git and using toolchains to automate their deployment. In case you were wondering, we use ArgoCD as the CD tool for our internal testing of Cloud Paks, and it is working very well for us.
(GitOps is not in scope for this post, I may pick it up in a separate post, so let me simply point you to a GitHub repo we use to store some of our early artifacts.)
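
For illustration, here is roughly what an Argo CD Application pointing at such a repo looks like; the repo URL, path, and names are placeholders, not our actual setup:

```
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-component                 # illustrative name
  namespace: argocd                  # the namespace Argo CD runs in
spec:
  project: default
  source:
    repoURL: https://git.internal.example/platform/deployments.git   # placeholder
    targetRevision: main
    path: environments/prod           # folder holding the resource definitions
  destination:
    server: https://kubernetes.default.svc
    namespace: my-component
  syncPolicy:
    automated:
      prune: true                     # remove resources that were deleted from Git
      selfHeal: true                  # revert manual drift back to the Git state
```

With automated sync enabled, whatever lands in the Git repo becomes the desired state of the cluster, which is exactly the property that makes the air gapped case manageable: you only have to worry about getting content into the repo.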

So how do you use Git and GitOps in an air gapped environment? Well, the principle is very similar to the one we use for images: we run the Git repo locally, use a bastion host to clone the public GitHub repo to a private one, and then maintain content there. We use operators for the deployment and management of Cloud Paks; these operators live in an operator catalog we manage via OLM, and we are eagerly awaiting improvements in how these catalogs are built that will make them much easier to use in air gapped environments.
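
The Git side of this is straightforward; a sketch of what runs on the bastion host, with placeholder URLs, might look like this:

```
# Create a bare mirror of the public repo...
git clone --mirror https://github.com/example-org/deploy-artifacts.git
cd deploy-artifacts.git

# ...and push all branches and tags to the internal Git server
git push --mirror https://git.internal.example/platform/deploy-artifacts.git

# Later, pick up upstream changes and push them on again
git remote update --prune
git push --mirror https://git.internal.example/platform/deploy-artifacts.git
```

The operator catalogs, by the way, are themselves container images, so they can be mirrored into the private registry like any other image (on OpenShift, "oc adm catalog mirror" helps with that).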

Putting those two pieces together, that is, resource definitions that reference container images, raises another challenge: image references are defined in the container section of the Pod specification (typically driven by a template spec in a Deployment or DaemonSet), including the registry they can be pulled from. Using a local image registry instead of the original (public) one means you have to update every place where an image is referenced to point to the local registry. That can be very tedious with many images, so we rely on a component in OpenShift called an ImageContentSourcePolicy, which lets you configure a redirect of sorts, basically telling the cluster "if an image is said to be at 'quay.io', pull it from 'privatereg.local' instead". It feels a bit like overriding the DNS server, just for images.
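
Here is a sketch of what such a policy looks like; the names and registry hosts are placeholders:

```
apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata:
  name: mirror-public-registries               # illustrative name
spec:
  repositoryDigestMirrors:
  - source: quay.io/example                    # where the manifests say the image lives
    mirrors:
    - registry.internal.example/mirror/example # where the cluster should actually pull from
```

One caveat worth knowing: as the field name repositoryDigestMirrors suggests, the redirect only applies to images that are referenced by digest, not by tag.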

In the early days of our Kubernetes journey, we found a number of instances where this approach caused issues, for example, when an image we used moved from, say, Docker Hub to quay.io and our ImageContentSourcePolicy wasn't updated accordingly. Moreover, we had to make sure we didn't incorporate code that indiscriminately made HTTP calls to public addresses. And, needless to say, before you can deploy any software into your Kubernetes cluster, you have to deploy that cluster to begin with, so you need an installer that supports air gapped installs (OpenShift has one).
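
For OpenShift specifically, that means mirroring the release payload itself into the private registry before running the installer. A sketch, with a placeholder registry host and an arbitrary example version:

```
oc adm release mirror \
  --from=quay.io/openshift-release-dev/ocp-release:4.10.22-x86_64 \
  --to=registry.internal.example/ocp4/openshift4 \
  --to-release-image=registry.internal.example/ocp4/openshift4:4.10.22-x86_64
```

When it completes, the command prints the imageContentSources stanza to put into the install configuration, so the installer knows to pull from the mirror.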

Again, keep in mind with all of this that almost all environments (well, the ones that I have seen) are not truly air gapped. Instead, they have restrictive proxies in place that force you to explicitly allow each public address before internal systems can access it. And often, security teams won't let you define wildcards, for example, something like *.acme.com. The main reason for using an air gap, though, is that Kubernetes teams won't allow any image that has been downloaded directly from the Internet without vetting or scanning. I sure wouldn't either.
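
From the cluster's perspective, on OpenShift this typically surfaces as the cluster-wide proxy configuration; the addresses below are placeholders:

```
apiVersion: config.openshift.io/v1
kind: Proxy
metadata:
  name: cluster                                   # a singleton, set up at install time
spec:
  httpProxy: http://proxy.internal.example:3128   # placeholder
  httpsProxy: http://proxy.internal.example:3128  # placeholder
  noProxy: .cluster.local,.svc,.internal.example  # traffic that bypasses the proxy
```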
 
Ultimately, the lessons learned here are: 
(1) Air gap is a very common way of running a Kubernetes cluster, so you need to treat it as a first-class citizen, so to speak.
(2) It has to be easy to mirror images and other artifacts into a private location.
(3) You have to regularly (or rather, continuously) test it! (And, believe it or not, we had a hard time simulating an air gap environment in our lab; a partial approximation is sketched below. :-))
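
On that last point, one crude approximation worth sketching is a default-deny egress NetworkPolicy in a test namespace. Be aware that this only restricts pod traffic; image pulls are done by the kubelet on the node and are not affected, so it is a partial simulation at best (the namespace name is illustrative):

```
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-external-egress
  namespace: airgap-test          # illustrative test namespace
spec:
  podSelector: {}                 # applies to all pods in the namespace
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector: {}       # allow in-cluster traffic only, nothing external
```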
 
