Using ChatGPT (and other tools) to troubleshoot OpenShift
For this post, I wanted to use ChatGPT as a tool for troubleshooting (which, you could say, is a specific way of learning). The opportunity presented itself when one of the OpenShift clusters I am using went into a state of disarray, with lots of pods failing, and I was curious to find out what the problem was.
As a disclaimer, I don't necessarily think of myself as an expert in OpenShift troubleshooting; I seem to constantly see issues I have never seen before, and I tend to take a rather chaotic approach to getting to the root cause of a problem. Moreover, the troubled cluster runs in one of our development environments, which makes no promises about stability and which assumes that clusters are used only for short-lived purposes. I mention that to make it clear that the situation I encountered is not typical for an OpenShift cluster.
By the way, neither of the errors I am using as examples is trivial, in my opinion. I am deliberately trying to solve hard problems here.
Example 1
Again, the cluster has lots of failing pods, and many of them are stuck in the Terminating state. So I looked at the logs and found that there was a common error message, one that I have indeed seen many times before when clusters are in trouble: "context deadline exceeded".
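As an aside, "Terminating" is not a real pod phase in Kubernetes; the CLI shows it for pods whose metadata carries a deletionTimestamp. The sketch below is one way to list such pods from the JSON that `oc get pods -A -o json` produces. The function name and the sample data are made up for illustration and are not taken from the troubled cluster.

```python
import json
from typing import List


def terminating_pods(pod_list_json: str) -> List[str]:
    """Return 'namespace/name' for every pod that has a deletionTimestamp set."""
    items = json.loads(pod_list_json).get("items", [])
    stuck = []
    for pod in items:
        meta = pod.get("metadata", {})
        # A pod that is being deleted has metadata.deletionTimestamp populated.
        if meta.get("deletionTimestamp"):
            stuck.append(f"{meta.get('namespace', '?')}/{meta.get('name', '?')}")
    return stuck


# Hypothetical sample standing in for the output of `oc get pods -A -o json`.
SAMPLE = json.dumps({
    "items": [
        {"metadata": {"namespace": "demo", "name": "healthy-pod"}},
        {"metadata": {"namespace": "demo", "name": "stuck-pod",
                      "deletionTimestamp": "2023-03-01T10:00:00Z"}},
    ]
})

print(terminating_pods(SAMPLE))
```

On a live cluster, the same function could be fed the captured output of the `oc` command instead of the sample.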
I usually start by putting the error message into my Google search window to see if it offers pointers to useful resources:
In this case, however, the search results don't seem overly helpful, certainly not the top ones: a couple of Bugzilla entries for an old OpenShift release, and an issue raised against Ceph over three years ago.
I then tried a new engine that I only recently became aware of, Perplexity AI. It uses large language models (think GPT), combined with an existing search engine (in this case, I believe, Bing). I expect that we'll see a fundamental shift in search coming real soon, where we replace 'traditional' search engines with those offering additional, AI-driven functionality. This one is a good example.
On the website, it says you can "Ask anything...". I entered the following:
how do i resolve the following error in openshift: KillContainerError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"
Here is what came back:
Again, not overly useful, to be honest. I seriously doubt that the "timeout value for the pod" has anything to do with it, nor am I sure what that timeout value even is (it doesn't say).
Another interesting aspect of AI-driven search is that it supposedly learns continuously and incorporates new sources into its answers. I let a day pass and asked the exact same question again, and this time I got a slightly different answer:
Not much better, but at least it is acknowledging that I was looking for a reason for this error when deleting (killing) a container. But how do I reduce the workload of a container I want to delete? Oh well.
Back to ChatGPT then. I asked the same question there, and here is the response:
I really like this! It gives background information about what can cause the error, suggests actions I can take to resolve it, and adds the actual commands I should execute. Instead of telling me to change the "timeout value", it talks about the "grace period", which is an actual thing in Kubernetes. Much better!
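The commands ChatGPT suggested are not reproduced here, so the following is only a sketch of what a grace-period based deletion commonly looks like. It assembles an `oc delete pod` call with `--grace-period`, adding `--force` when the grace period is zero. The helper name, pod name, and namespace are hypothetical. Note that a forced deletion only removes the pod object from the API; it does not guarantee that the container has actually stopped on the node.

```python
import shlex
from typing import List


def force_delete_command(pod: str, namespace: str, grace_seconds: int = 0) -> List[str]:
    """Build (but do not run) an `oc delete pod` command with a grace period."""
    cmd = ["oc", "delete", "pod", pod, "-n", namespace,
           f"--grace-period={grace_seconds}"]
    # A grace period of 0 is only accepted together with --force.
    if grace_seconds == 0:
        cmd.append("--force")
    return cmd


# Hypothetical pod and namespace names, for illustration only.
print(shlex.join(force_delete_command("stuck-pod", "demo")))
```

Building the command as a list keeps it easy to review before handing it to something like `subprocess.run`.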
One aspect of a chat bot like ChatGPT is that it stores the context of the exchange, so I could have followed up with a question like "I tried all this, but it didn't work - do you have any other ideas?" I did not do that and will leave it for another day.
Example 2
Instead, I looked for help with another problem I saw in a number of logs on the cluster: it couldn't create new pods, because their names were reserved. Using the same approach as in the first example, I googled the error message:
The very first link in the search result looks promising. Following it, I get a good description of what the problem might be:
This all sounds really great. However, it claims that this problem has been resolved as of OCP 4.6.21. Since I am using OCP 4.10.50, the answer should not apply to my situation.
Next I try Perplexity AI again:
And here again, I must admit that I don't find the answer very satisfactory. It claims that the cluster network provider is to blame. We are indeed using OpenShiftSDN as the network type for clusters running in this lab environment, and it has never caused any problems, as far as I know.
By the way, one of the "sources" it quoted for its answer is called "we-share-food.de", which looks like this:
An odd choice of source, isn't it?
Anyway, let's see what ChatGPT does:
And just like in the first example, this gives us the most coherent answer: a number of possible options, with some explanation and concrete commands I can use to resolve the problem. What the screenshot above doesn't show is the last option it gave, which is indeed the one I think is most promising, namely that I drain the node and restart it. We know from the Google results above that timeouts in CRI-O may be responsible for at least some of what we are seeing. I notice the same error across a number of pods, running across several namespaces, so restarting nodes may help get the cluster back to a better state.
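For reference, a drain-and-restart cycle for a single node usually boils down to three steps: drain, reboot, uncordon. The sketch below only builds the command lines and does not execute anything. The drain flags and the reboot via `oc debug` are common choices, not necessarily what ChatGPT proposed, and the function and node names are hypothetical.

```python
from typing import List


def drain_and_reboot_commands(node: str) -> List[List[str]]:
    """Return the command sequence for draining, rebooting, and uncordoning a node."""
    return [
        # Evict workloads; drain also marks the node as unschedulable.
        ["oc", "adm", "drain", node, "--ignore-daemonsets", "--delete-emptydir-data"],
        # Reboot the host from a debug pod (an assumption; other reboot methods work too).
        ["oc", "debug", f"node/{node}", "--", "chroot", "/host", "systemctl", "reboot"],
        # Allow scheduling again once the node has come back up.
        ["oc", "adm", "uncordon", node],
    ]


# Hypothetical node name, for illustration only.
for command in drain_and_reboot_commands("worker-0"):
    print(" ".join(command))
```

In practice there should be a wait between the reboot and the uncordon step until the node reports Ready again.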
By the way, I ended up trying to upgrade the entire cluster, because that includes draining and restarting every node in the cluster, but it wasn't successful. I now suspect that a problem with the storage provider is behind all this, but I'm still trying to get to the bottom of it. Keep your fingers crossed that I'll get it back up and running!
Summary
First of all, there is no silver bullet. None of the approaches I took gave me an answer I could blindly follow. All of them gave me pointers I could follow, some good, some not so good. But ChatGPT clearly gave me the most comprehensive advice:
- some background information I can separately follow up on and study
- multiple options I could try to resolve my problem
- concrete command line examples
This really shows how solving a technical problem with a commercial search engine has become much harder over the years. While technology has certainly increased in complexity, the results of commercial search engines have also been monetized, which is not the case for ChatGPT. And while its results are not always completely accurate, that's fine when debugging a technical problem, as you're really looking for possible causes and ideas.