How to Troubleshoot Kubernetes Service 503 Errors (Service Unavailable)

motoskia

2 years ago

The HTTP 503 Service Unavailable error means that a website cannot be reached right now because the server is not ready to handle the request. The server may be too busy or under maintenance, or there may be another cause that requires deeper analysis.

In Kubernetes, it means a Service tried to route a request to a pod, but something went wrong along the way:

The Service could not find any pods matching its selector.
The Service found some pods matching the selector, but none of them were Running.
Pods are running but were removed from the Service endpoint because they did not pass the readiness probe.
Some other networking or configuration issue prevented the Service from connecting with the pods.

Running into errors on your site can be intimidating. However, most errors give you some hint as to what caused them, which helps you troubleshoot them in the right way. The 503 error is not as polite, unfortunately, and does not give you much information to go on.

In this article, we list the possible issues that cause a 503 error and show how you can troubleshoot each one.

What Is an HTTP Error 503?

The Internet Engineering Task Force (IETF) defines the 503 Service Unavailable as:

The 503 (Service Unavailable) status code indicates that the server is currently unable to handle the request due to a temporary overload or scheduled maintenance, which will likely be alleviated after some delay. The server MAY send a Retry-After header field to suggest an appropriate amount of time for the client to wait before retrying the request.

Troubleshooting Kubernetes Service 503 Errors

Fix 1: Check if the Pod Label Matches the Service Selector

A possible cause of 503 errors is that a Kubernetes pod does not have the expected label, and the Service selector does not identify it. If the Service does not find any matching pod, requests will return a 503 error.

Run the following command to see the current selector:

# kubectl describe service [service-name] -n [namespace-name]

Example output:

Name: [service-name]
Namespace: [namespace-name]
Labels: <none>
Annotations: <none>
Selector: [label]
…

Note: Replace [service-name] with your Service name and [namespace-name] with the namespace of your Service.

The Selector section shows which label or labels are used to match the Service with pods.

Check if there are pods with this label:

# kubectl get pods -n [namespace-name] -l "[label]"

If you get the message "No resources found", the Service cannot discover any pods, which is why clients get an HTTP 503 error. Add the label to the relevant pods to resolve the problem.
If there are pods with the required label, continue to the next solution.
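
The two checks above can be combined into one helper. The following is a minimal sketch, assuming a shell with kubectl on the PATH and access to the cluster. The function name pods_for_service is hypothetical, and the Go template is only one possible way to turn the selector map into the key=value form that -l expects.

```shell
# Hypothetical helper: list the pods matched by a Service's selector.
# Usage: pods_for_service <service-name> <namespace>
pods_for_service() {
  local svc="$1" ns="$2" selector
  # Render the selector map as "key=value,key=value," with a Go template.
  selector=$(kubectl get service "$svc" -n "$ns" \
    -o go-template='{{range $k, $v := .spec.selector}}{{$k}}={{$v}},{{end}}') || return 1
  selector="${selector%,}"  # drop the trailing comma
  if [ -z "$selector" ]; then
    echo "Service $svc has no selector" >&2
    return 1
  fi
  echo "Selector: $selector"
  # --show-labels prints each pod's labels so mismatches are easy to spot.
  kubectl get pods -n "$ns" -l "$selector" --show-labels
}
```

If the pod list comes back empty, compare the printed selector with the labels on the pods you expected to match, then either relabel the pods (kubectl label pod can do this for a single pod) or correct the Service selector.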

Fix 2: Verify that Pods Defined for the Service are Running

In Fix 1 we found the label used by the Service selector. Run the following command to make sure the pods matched by the selector are in the Running state:

# kubectl -n [namespace-name] get pods -l "[label]"

If the pod status is not Running, diagnose and resolve the error in your pod. The status can show a variety of errors, and you will need to understand the specific one before you can fix it. You can find some interesting troubleshooting from our previous articles.
If the status is Running, proceed to the next solution.
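
As a starting point for that diagnosis, here is a minimal sketch assuming kubectl access to the cluster. The function names not_running_pods and inspect_pod are hypothetical. Note that a pod whose container keeps crashing can still report the Running phase, so also look at the READY and STATUS columns of the plain pod listing.

```shell
# Hypothetical helper: show pods behind a selector that are not in the Running phase.
# Usage: not_running_pods <namespace> <label-selector>
not_running_pods() {
  local ns="$1" selector="$2"
  kubectl get pods -n "$ns" -l "$selector" \
    --field-selector=status.phase!=Running
}

# Hypothetical helper: gather the usual first clues for one unhealthy pod.
# Usage: inspect_pod <namespace> <pod-name>
inspect_pod() {
  local ns="$1" pod="$2"
  # The Events section at the end of the output often names the failure.
  kubectl describe pod "$pod" -n "$ns"
  # Logs of the previous container instance; fails harmlessly if there is none.
  kubectl logs "$pod" -n "$ns" --previous --tail=50 || true
}
```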

Fix 3: Check Pods Pass the Readiness Probe for the Deployment

Next, we can check if a readiness probe is configured for the pod:

# kubectl describe pod [pod-name] -n [namespace-name] | grep -i readiness

This step provides helpful output only if the application is listening on the right path and port. Send a request with the curl -Ivk command and make sure the path defined at the service level returns a valid response. For example, an HTTP 200 status is a good response.

Readiness probe failed:

If the output indicates that the readiness probe failed, investigate in detail why the probe is failing. Refer to our guide to Kubernetes readiness probes.
If there is no readiness probe or it succeeded, proceed to the next solution.
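
The readiness check can also be scripted. This is a minimal sketch, assuming kubectl access to the cluster. The function names pod_is_ready and pod_events are hypothetical. It reads the pod's Ready condition directly instead of grepping the describe output.

```shell
# Hypothetical helper: succeed only if the pod's Ready condition is "True".
# Usage: pod_is_ready <namespace> <pod-name>
pod_is_ready() {
  local ns="$1" pod="$2" status
  status=$(kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}') || return 2
  [ "$status" = "True" ]
}

# Hypothetical helper: list events for the pod, which include
# "Readiness probe failed" messages while the probe is failing.
# Usage: pod_events <namespace> <pod-name>
pod_events() {
  local ns="$1" pod="$2"
  kubectl get events -n "$ns" --field-selector "involvedObject.name=$pod"
}
```

To confirm the effect on the Service itself, kubectl get endpoints for the Service shows which pod addresses are currently receiving traffic; pods that fail the readiness probe are left out of that list.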

Fix 4: Verify that Instances are Registered with Load Balancer

If none of the steps above revealed an issue, another common cause of 503 errors is that no instances are registered with the load balancer. Check the following:

Security groups - ensure that worker nodes running the relevant pods have an inbound rule that allows port access, and that nothing is blocking network traffic on the relevant port ranges.
Availability zones – if your Kubernetes cluster is running in a public cloud, make sure that there are worker nodes in every availability zone specified by the subnets.
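
The cluster-side part of these checks can be sketched as follows, assuming kubectl access and nodes that carry the standard topology.kubernetes.io/zone label. The function names are hypothetical. Security group rules and load balancer target registration live in the cloud provider and have to be checked with that provider's console or CLI, which this sketch does not cover.

```shell
# Hypothetical helper: list worker nodes with their availability zone as an extra column.
nodes_by_zone() {
  kubectl get nodes -L topology.kubernetes.io/zone
}

# Hypothetical helper: show the Service type, external address, ports and selector.
# Usage: service_exposure <namespace> <service-name>
service_exposure() {
  local ns="$1" svc="$2"
  kubectl get service "$svc" -n "$ns" -o wide
}
```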

These steps should help you uncover the most common issues behind a Service 503 error. If they do not quickly reveal the root cause, you will need a more in-depth investigation across multiple components in the Kubernetes deployment. The more components are malfunctioning, the harder it is to identify the exact cause. Another cause worth considering is how pods shut down, which graceful shutdown addresses.

Avoiding 503 with Graceful Shutdown

Another common cause of 503 errors is that when Kubernetes terminates a pod, containers on the pod drop existing connections. Clients then receive a 503 response. This can be resolved by implementing graceful shutdown.

To understand the concept of graceful shutdown, let’s quickly review how Kubernetes shuts down containers. When a user or the Kubernetes control plane requests deletion of a pod, the kubelet running on the node first has a SIGTERM signal sent to the main process of each container.

The container can register a handler for SIGTERM and perform some clean-up activity before shutting down. Then, after a configurable grace period, Kubernetes sends a SIGKILL signal, and the container is forced to shut down.

Here are two ways to implement graceful shutdown to avoid a 503 error:

Implement a handler for SIGTERM in the containers matched to the Service. This handler should catch the SIGTERM signal, keep the server running until it has completed all in-flight requests, and then clean up and shut down.
Add a preStop hook. Kubernetes runs this hook before it sends SIGTERM, so the container does not receive SIGTERM until the hook finishes or the grace period runs out. The hook can be used to give existing connections time to finish and so avoid 503 errors. You can read our detailed explanation in Kubernetes Pod Graceful Shutdown — How?.
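
The first option can be illustrated with a minimal shell sketch of a worker loop that drains on SIGTERM. It assumes the script runs as the container's main process. The names handle_one_request and cleanup are hypothetical placeholders for application code, and a real server would normally rely on its framework's own shutdown support instead.

```shell
# Sketch of a worker loop that drains on SIGTERM instead of exiting mid-request.
shutting_down=0

# Signal handler: only records that shutdown was requested.
on_sigterm() {
  shutting_down=1
}
trap on_sigterm TERM

# handle_one_request and cleanup are hypothetical placeholders that the
# real application would provide.
serve_until_sigterm() {
  while [ "$shutting_down" -eq 0 ]; do
    handle_one_request   # the request in progress always runs to completion
  done
  cleanup                # close connections, flush buffers, and so on
}
```

The handler does no work itself. It only sets a flag, and the loop checks the flag between requests, so the request in progress is never cut off. A preStop hook, by contrast, is declared in the pod spec under lifecycle.preStop rather than in application code.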