Using ChatGPT to monitor and optimize OpenShift

In my previous posts, I have talked about using ChatGPT as a tool to learn Kubernetes and as a tool to troubleshoot Kubernetes. In this part, I will complete the mini-series, so to speak, with a look at using ChatGPT to monitor and optimize Kubernetes.

Operating a Kubernetes cluster carries the primary objective of keeping workloads running at or above their respective service levels, and of doing so with maximum efficiency. That is, maintaining the availability and performance of deployed applications while minimizing the cost of infrastructure (compute, storage, networking, etc.) and other parts of the environment. And just like with any other aspect of IT, organizations are applying Artificial Intelligence to do this better and faster.

Sidenote: I am using Kubernetes and OpenShift interchangeably in these blog posts, because what I write about applies to both. Personally, I am using OpenShift clusters for all my development and testing. 

That triggered the question for me of whether this extends to ChatGPT, the tool everyone is currently excited about. How would I use it to ensure the overall health of my cluster, and to look for opportunities to scale workloads so as to minimize the required (costly) capacity? What would I need to share with the engine to get meaningful advice back?

To answer that last question, let's simply ask: what data should I share with ChatGPT to allow it to help me?

As we have seen in previous cases, the answer is surprisingly comprehensive and detailed. Note that it doesn't say "give me the admin password to the cluster and I'll take a look for you", or "send me the following log files". It goes without saying that I wouldn't have done such a thing anyway. While I like ChatGPT for early exploration and prototyping, I would not share actual production data with it, simply because I wouldn't know where it goes. Besides, its developers openly admit that it can make mistakes - and sometimes just make things up - which makes it impossible to use in a true enterprise context, in my opinion. But as before, it gives me good pointers about the things that matter.
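In practice, the kind of high-level data ChatGPT asked for can be collected with read-only commands and sanitized before anything leaves your environment. Here is a minimal sketch (my own illustration, not something ChatGPT produced): it parses sample output in the format `kubectl top nodes` prints, and replaces node names with generic labels so no real identifiers end up in the prompt.

```python
# A sketch of "sharing high-level data" safely: summarize the output of
# `kubectl top nodes` (captured to a string here as sample data) into
# anonymized one-liners you could paste into ChatGPT. Node names are
# replaced with generic labels.

SAMPLE_TOP_NODES = """\
NAME        CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
worker-1    1890m        47%    12011Mi         78%
worker-2    640m         16%    5120Mi          33%
worker-3    3400m        85%    14200Mi         92%
"""

def summarize_node_usage(top_output: str) -> list[str]:
    """Turn `kubectl top nodes` output into anonymized per-node summaries."""
    lines = top_output.strip().splitlines()[1:]  # skip the header row
    summary = []
    for i, line in enumerate(lines, start=1):
        _name, _cpu, cpu_pct, _mem, mem_pct = line.split()
        summary.append(f"node-{i}: CPU {cpu_pct} of capacity, memory {mem_pct} of capacity")
    return summary

for line in summarize_node_usage(SAMPLE_TOP_NODES):
    print(line)
```

The point is not the parsing itself, but the habit: aggregate and anonymize first, then share only the summary.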

So I answered its questions, somewhat generically, and got this back: 

Some good advice here. It still doesn't give me the final solution, but given that I only shared some high-level details about my cluster and the workloads running on it, that's not a surprise. It certainly gives me food for thought about where to take a closer look and possibly take some action. And it does point out that I should take advantage of monitoring and logging tools, even though it stops short of telling me which ones it recommends.

Finally, I gave it a concrete alert message that I saw in my console overview, and asked what it suggested I do about it:

If you have followed along with my previous question-and-answer game with ChatGPT, this won't come as a surprise anymore: it gives a solid overview explanation and recommends a number of actions, each with yet another brief explanation. It is not the final answer that easily lets me solve my problem, but it gets me on the right track, from where I can investigate further. It is hard to get similar advice from other sources, or at least it would take a lot longer. So, there it is again: this AI engine doesn't replace humans and their troubleshooting and deep technical skills, but it lets them accelerate their work.
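Whatever the specific alert, the first investigation step ChatGPT tends to recommend is inspecting the affected pods. That step can be sketched in a few lines (my own illustration, operating on sample data in the shape that `oc get pods -o json` returns): walk the pod list and report containers stuck in a waiting state such as CrashLoopBackOff.

```python
import json

# A sketch of a typical first troubleshooting step: scan pod status
# (embedded sample data here, in the shape of `oc get pods -o json`)
# and report containers waiting with an error reason.

SAMPLE_PODS = json.loads("""
{
  "items": [
    {
      "metadata": {"name": "frontend-abc"},
      "status": {
        "containerStatuses": [
          {"name": "web", "restartCount": 0, "state": {"running": {}}}
        ]
      }
    },
    {
      "metadata": {"name": "worker-xyz"},
      "status": {
        "containerStatuses": [
          {"name": "job", "restartCount": 17,
           "state": {"waiting": {"reason": "CrashLoopBackOff"}}}
        ]
      }
    }
  ]
}
""")

def find_unhealthy(pods: dict) -> list[tuple[str, str, str, int]]:
    """Return (pod, container, waiting reason, restart count) for unhealthy containers."""
    problems = []
    for pod in pods["items"]:
        for cs in pod["status"].get("containerStatuses", []):
            waiting = cs["state"].get("waiting")
            if waiting:
                problems.append((pod["metadata"]["name"], cs["name"],
                                 waiting.get("reason", "unknown"),
                                 cs["restartCount"]))
    return problems

for pod, container, reason, restarts in find_unhealthy(SAMPLE_PODS):
    print(f"{pod}/{container}: {reason} ({restarts} restarts)")
```

From there, `oc logs` and `oc describe pod` on the flagged pods are the obvious next steps, which is exactly the kind of direction ChatGPT's answer pointed me in.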

Let me compare and contrast this with a commercial tool that does this type of thing for a living, so to speak. I picked one of our own products, IBM Turbonomic, as an example. It does indeed require that you give it credentials and access to your cluster. It then scans and assesses the state of the cluster and the workloads running there, and gives recommendations about corrective actions you can take (or you can let it execute those actions automatically, without human intervention).


This goes way above and beyond what we saw above with ChatGPT, in that it tells me specifically what to do: for example, which pod to move from one node to another, or where and how to resize the environment, based on real usage. What isn't shown above is that it can even tell me the concrete dollar savings in the cost of the environment (if it runs on a public cloud provider) after these actions have been completed. Very cool!
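To make "resize based on real usage" concrete, here is a toy sketch of the general idea. To be clear: this is not Turbonomic's actual algorithm (which is far more sophisticated and proprietary), just an assumed percentile-plus-headroom heuristic to illustrate how observed usage can drive a sizing recommendation.

```python
import math

# Toy illustration of usage-based rightsizing (NOT Turbonomic's method):
# recommend a CPU request from observed usage samples by taking a high
# percentile of the samples and adding a safety headroom.

def recommend_cpu_request(usage_millicores: list[int],
                          percentile: float = 0.9,
                          headroom: float = 1.2) -> int:
    """Suggest a CPU request (in millicores) that covers `percentile`
    of observed samples, multiplied by a `headroom` safety margin."""
    samples = sorted(usage_millicores)
    idx = min(len(samples) - 1, math.ceil(percentile * len(samples)) - 1)
    return round(samples[idx] * headroom)

# CPU usage observed over time for one container, in millicores
usage = [120, 140, 110, 500, 130, 150, 135, 125, 145, 160]
print(recommend_cpu_request(usage))  # prints 192
```

Note how the 90th-percentile choice deliberately ignores the one 500m spike; whether that is acceptable is exactly the kind of trade-off a real optimization product reasons about for you.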

Again, this is an unfair comparison, because ChatGPT is a generic generative AI model, and Turbonomic is commercially available software for resource and workload optimization. And I gave it direct access to my cluster, whereas I didn't let ChatGPT do that (for the reasons I mentioned earlier). But where I see all this going is not that you would pick one over the other - you would use both in a complementary way. Let specialized tools scan environments and collect relevant data, and then use a large language model (ChatGPT is just one of several out there, and there are many more to come) to summarize results and offer an intuitive natural language interface to the end user.
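That complementary pattern can be sketched in code: let a specialized tool (or plain `kubectl`) collect the facts, then hand a compact summary to a large language model for a natural-language explanation. The prompt-building half is shown below as a self-contained illustration of my own; the actual API call depends entirely on your LLM provider and SDK, so it is only hinted at in a comment.

```python
# A sketch of the "specialized tool collects, LLM explains" pattern:
# assemble findings from monitoring into a single natural-language
# question for a large language model.

def build_prompt(cluster_name: str, findings: list[str]) -> str:
    """Assemble collected findings into a question for an LLM."""
    bullet_list = "\n".join(f"- {f}" for f in findings)
    return (
        f"I operate a Kubernetes cluster ('{cluster_name}'). "
        f"A monitoring tool reported the following:\n{bullet_list}\n"
        "Summarize the overall health and suggest the top three actions "
        "I should take, in order of impact."
    )

findings = [
    "node-3 at 92% memory utilization",
    "deployment 'worker' restarting with CrashLoopBackOff",
    "cluster CPU requests at 40% of actual usage (over-provisioned)",
]
prompt = build_prompt("demo-cluster", findings)
print(prompt)

# Then, hypothetically (provider- and SDK-specific, not shown here):
# response = llm_client.ask(prompt)
```

The monitoring tool stays the source of truth; the language model only translates its findings into something a human can act on quickly.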

