If you work with Kubernetes, or have spoken with anyone who does, you have probably heard that Kubernetes is complex and hard to manage or troubleshoot. At the same time, things in a Kubernetes cluster do occasionally go wrong. The most common issues you are likely to experience are an unavailable container or a pod that doesn't respond. So how do DevOps/SRE teams figure out the cause of such issues and fix them?
In this section we will look at the common scenarios DevOps/SRE teams may encounter, and how they address them.
1. Node unavailable
One of the key reasons Kubernetes is known for high availability is that it automatically distributes applications across the available nodes, which may be hosted on physical machines in a datacenter or on virtual machines. If you see an availability issue, the most likely cause is an insufficient number of available nodes.
If you run into a node-related issue, first make sure you have enough nodes assigned to the cluster. A highly available cluster should contain a minimum of two control-plane (master) nodes; three are generally recommended so that etcd can maintain quorum.
Even with enough nodes, you may find that nodes fail after you’ve set up and joined them to a cluster. One way to address this issue is to enable auto-recovery of any VMs that host nodes. Most cloud providers and on-premises VM platforms offer auto-recovery features that restart a failed machine automatically.
Increasing the number of servers in a cluster may also improve node availability, even if the number of nodes stays the same. When you spread nodes across multiple servers, you limit the harm done to your cluster by a server failure.
2. Noisy neighbors
The noisy neighbor problem, in which one application hogs resources in a way that deprives other applications of the resources they need, is a common challenge in a multi-tenant Kubernetes cluster.
One of Kubernetes' major roles is to ensure that all applications have the resources they require. You can define resource requests and limits in your Kubernetes configuration files, but doing so is optional, and the scheduler can only honor what you declare. Kubernetes has no way to automatically determine exactly how much compute, memory, or other resources an application may need at a given time; it can act only on the resource configurations specified in Kubernetes configuration files.
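As a sketch, resource requests and limits are declared per container in the pod spec (the names and values below are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-app            # illustrative name
spec:
  containers:
  - name: web
    image: nginx:1.25      # illustrative image
    resources:
      requests:            # what the scheduler uses for placement
        cpu: "250m"
        memory: "128Mi"
      limits:              # hard caps enforced at runtime
        cpu: "500m"
        memory: "256Mi"
```

The scheduler places the pod based on its requests, while limits are enforced at runtime: a container that exceeds its memory limit is terminated, and one that exceeds its CPU limit is throttled.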
To troubleshoot Kubernetes’ noisy neighbor problems, first ensure that Kubernetes is configured with the information it needs to assign the right amount of resources to each workload. You can do this at the level of individual containers or pods using Limit Ranges, a capability that specifies the maximum resources that a container or pod can consume.
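For instance, a Limit Range applies per-container defaults and maximums within a namespace (the name, namespace, and values here are just examples):

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: container-limits   # example name
  namespace: team-a        # example namespace
spec:
  limits:
  - type: Container
    default:               # limit applied when a container declares none
      cpu: "500m"
      memory: "256Mi"
    defaultRequest:        # request applied when a container declares none
      cpu: "100m"
      memory: "128Mi"
    max:                   # no container in the namespace may exceed this
      cpu: "1"
      memory: "512Mi"
```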
Proper use of namespaces can also help address noisy neighbors in a Kubernetes cluster. Namespaces divide a single cluster into separate virtual spaces, and the Resource Quotas tool can place limits on the amount of resources that a single namespace can consume, thereby helping to prevent one namespace from using more than its fair share. Keep in mind, however, that Resource Quotas apply to an entire namespace; to control the behavior of individual applications within the same namespace, use Limit Ranges.
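A Resource Quota caps the aggregate consumption of a namespace; for example (illustrative name and values):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota       # example name
  namespace: team-a        # example namespace
spec:
  hard:
    requests.cpu: "4"      # total CPU all pods in the namespace may request
    requests.memory: "8Gi"
    limits.cpu: "8"        # total CPU limit across all pods
    limits.memory: "16Gi"
    pods: "20"             # maximum number of pods in the namespace
```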
3. Non-responsive containers
In a cluster that has sufficient nodes and properly configured Limit Ranges and Resource Quotas, containers or pods may still be less responsive than they should be. This is usually due to poorly configured readiness or liveness probes.
Kubernetes uses liveness probes to check whether a container is responsive. If it’s not, Kubernetes restarts the container. Readiness probes determine whether a container or set of containers is both up and ready to accept traffic.
In general, these probes are good safeguards against situations where you need to manually restart a failed container or where containers are not yet fully initialized and therefore not ready for traffic.
Readiness and liveness probes that are too aggressive, however, can lead, somewhat paradoxically, to containers that are unavailable. For example, consider a liveness probe that checks a container every second and restarts the container if the check fails. In some situations, network congestion or latency problems will cause the liveness check to take longer than one second to complete — even if the container is running without issue. In that case, the container will be restarted constantly for no good reason, leaving it unavailable.
To prevent this, configure readiness and liveness probes in ways that make sense for your containers and environment. Avoid one-size-fits-all configurations; each container probably needs its own policies.
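As a sketch, a probe configuration with more forgiving timing than the one-second check described above might look like this (the pod name, image, endpoints, and thresholds are all illustrative and should be tuned per container):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: api-server            # illustrative name
spec:
  containers:
  - name: api
    image: example/api:1.0    # illustrative image
    livenessProbe:
      httpGet:
        path: /healthz        # assumed health endpoint
        port: 8080
      initialDelaySeconds: 15 # let the app finish starting first
      periodSeconds: 10       # check every 10s, not every second
      timeoutSeconds: 5       # tolerate slow responses
      failureThreshold: 3     # restart only after 3 consecutive failures
    readinessProbe:
      httpGet:
        path: /ready          # assumed readiness endpoint
        port: 8080
      periodSeconds: 5
      timeoutSeconds: 3
      failureThreshold: 3     # stop routing traffic only after 3 failures
```

With settings like these, transient network latency has to persist across several consecutive checks before Kubernetes restarts the container or pulls it out of rotation.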