Kubernetes Autoscaling – A Complete Guide


As we move deeper into the Kubernetes world, we all experiment a lot and extend our services aggressively, which means we are growing faster than before. With that growth, we cannot reliably predict how many computing resources we will need; no one has fully mastered Kubernetes, and growth is hard to forecast. Still, it is our responsibility to keep roughly the right amount of resources available at all times without letting them go to waste during periods of low usage. Kubernetes autoscaling addresses this: it maintains capacity and makes sure only the necessary resources are used.

Introduction

Autoscaling is a technique used in computing to dynamically adjust computational resources, such as CPU and memory, depending on the incoming traffic to your application. The technique was available even in the virtual machine era; today it is also one of the core features of container orchestrators like Kubernetes.

Let’s imagine we have an application deployed and running on Kubernetes, and we are not sure of its scaling requirements or how many resources it needs. Without autoscaling, we would end up paying a lot more for resources we never use. This is where autoscaling helps us utilize resources efficiently, in two ways. It helps by,

  • Decreasing the number of pods or nodes when the load is low.
  • Increasing it when there’s a spike in traffic.

Here are a few specific ways autoscaling optimizes resource use:

  • Saving on cost by using only the infrastructure you actually need.
  • Increasing the uptime of your workloads in cases where you have an unpredictable load.
  • The ability to run less time-sensitive workloads on the spare capacity that autoscaling frees up in low-traffic scenarios.

In this article, let’s look at what Kubernetes autoscaling is, what methods it provides, and how autoscaling works in Kubernetes.

What Is Autoscaling?

Autoscaling was first introduced in Kubernetes 1.3. When we talk about autoscaling in the Kubernetes context, in most cases, we ultimately scale pod replicas up and down automatically based on a given metric, like CPU or RAM.

We can achieve this by using the Horizontal Pod Autoscaler (HPA). The autoscaling capabilities offered by HPA apply at the pod level. You can also autoscale your Kubernetes worker nodes using the cluster/node autoscaler, which adds new nodes dynamically. Managed Kubernetes offerings such as GKE by Google already provide such autoscaling capabilities, so you don’t have to reinvent the wheel or worry about the implementation.

In most cases with managed Kubernetes instances such as GKE, you specify a minimum and a maximum number of nodes, and the cluster autoscaler automatically adjusts the rest. Google has sort of won the Kubernetes battle among the cloud vendors by introducing Autopilot. GKE Autopilot is a hands-off approach to managed Kubernetes instances where Google manages every part (control plane, nodes, etc.) of your Kubernetes infrastructure.

Let’s discuss three different autoscaling methods offered by Kubernetes.

Horizontal Pod Autoscaler (HPA)

This method can also be referred to as scaling out. Here, Kubernetes allows a DevOps engineer, SRE, or cluster admin to increase or decrease the number of pods automatically based on the application’s resource usage. With HPA, you typically set a threshold for metrics such as CPU and memory, and the number of running pods is then scaled up or down based on their current usage against that threshold.
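As a rough illustration, the sketch below defines an HPA that keeps average CPU utilization around 60%, scaling between 2 and 10 replicas. The Deployment name web is a hypothetical target; the rest follows the autoscaling/v2 API.

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: web-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: web                      # hypothetical Deployment to scale
      minReplicas: 2
      maxReplicas: 10
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 60   # add pods when average CPU rises above ~60%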

Vertical Pod Autoscaler (VPA)

This method can also be referred to as scaling up. Typically, with vertical scaling, we throw more resources, such as CPU and memory, at existing machines. In the Kubernetes context, the Vertical Pod Autoscaler recommends or automatically adjusts values for CPU and memory. VPA frees you from worrying about what values to use for the CPU and memory requests and limits of your pods.
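A minimal sketch, assuming the VPA components are installed in the cluster and a Deployment named web exists, might look like this; the updateMode field controls whether VPA only recommends values or actually applies them:

    apiVersion: autoscaling.k8s.io/v1
    kind: VerticalPodAutoscaler
    metadata:
      name: web-vpa
    spec:
      targetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: web              # hypothetical Deployment to manage
      updatePolicy:
        updateMode: "Auto"     # "Off" gives recommendations only; "Auto" applies them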

Cluster Autoscaler

This method typically comes to your rescue when pods cannot be scaled to their maximum capacity because there are not enough nodes to handle the load. The cluster autoscaler will autoscale the cluster itself by adding new nodes to the cluster to handle the increased demand. It will also periodically check the status of pods and nodes and take the following action:

  • If we cannot schedule pods because there are not enough nodes, then cluster autoscaler will add nodes up to the maximum size of the node pool.
  • If node utilization is low and we can schedule pods on fewer nodes, then cluster autoscaler will remove nodes from the node pool.
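The exact setup depends on your platform: on managed offerings such as GKE you typically just set minimum and maximum node pool sizes, while on self-managed clusters the cluster autoscaler Deployment takes node-group bounds as flags. The snippet below is a rough sketch of such flags; the cloud provider, node-group name, and values are assumptions, not a definitive configuration.

    # excerpt from a cluster-autoscaler container spec (flags vary by cloud provider)
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws
      - --nodes=1:10:my-node-group             # min:max:node-group-name (hypothetical group)
      - --scale-down-utilization-threshold=0.5 # consider removing nodes below ~50% utilization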

How Does It Work?

We first need to install the metrics server on a Kubernetes cluster for autoscaling to work. The metrics server API plays an essential part in autoscaling, as the autoscalers (HPA, VPA, etc.) use it to collect metrics about your pods’ CPU and memory utilization.
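If your cluster doesn’t ship the metrics server by default, the upstream manifest published by the metrics-server project can usually be applied directly (check the project’s documentation for your cluster version), and kubectl top is a quick way to confirm metrics are flowing:

    kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
    kubectl top pods    # should list per-pod CPU/memory once the metrics server is running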

The autoscaler is defined as a Kubernetes API resource and a controller. The controller periodically scans the metrics server API and increases or decreases the number of replicas in a replication controller or deployment to bring the observed metric, such as average CPU utilization, average memory utilization, or any other custom metric, to the target specified by the user.
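Roughly speaking, the HPA controller derives the desired replica count from the ratio of the observed metric to its target:

    desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue)

For example, with 2 replicas running at 80% average CPU against a 40% target, the controller would ask for ceil(2 × 80 / 40) = 4 replicas.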

The main difference between pod-scaling and node-scaling is that in pod-scaling, we scale in/out the number of pods (based on resource utilization, custom metrics, etc.), and in node-scaling, we add/remove nodes from the cluster to handle the increase/decrease in the demand.

Why Autoscaling?

Consider a scenario where you’re not using the autoscaling capabilities of Kubernetes. You end up manually provisioning resources (and later scaling them down) every time there’s a change in demand. You either pay for peak capacity or your services fail because you don’t have enough resources available to handle the load. You might increase the number of pod replicas manually, but this trial-and-error approach isn’t sustainable in the long term. It also adds to the frustration of users currently working with your application. You can overcome these issues by using the autoscaling features of Kubernetes.

Limitations of Kubernetes Autoscaling

As we’ve seen, effective Kubernetes autoscaling requires a lot of fine-tuning. However, knowing how to best set configuration options isn’t always intuitive or obvious. Each of Kubernetes’ three autoscaling methods has challenges and limitations that we need to note if we choose to use them.

For example, neither HPA nor VPA takes input/output operations per second (IOPS), the network, or storage into account when making their calculations. This leaves applications susceptible to slowdowns or breakdowns. VPA also doesn’t allow us to update resource limits on actively running pods, meaning we must remove the pods and create new ones to set new limits.

Additionally, the Cluster Autoscaler has limited compatibility with newer, third-party tools, meaning we can only use it on Kubernetes-supported platforms. Cluster autoscaling also only looks at a pod’s requests when making scaling decisions, rather than assessing actual usage. As a result, it can’t identify spare capacity inside what a user has requested, which can lead to inefficient, wasteful clusters.

Best practices for Kubernetes autoscaling

1. Make sure that HPA and VPA policies don’t conflict

The Vertical Pod Autoscaler automatically adjusts the request and limit configuration, reducing overhead and achieving cost savings. The Horizontal Pod Autoscaler aims to scale out, and it is more likely to scale up than down. Double-check that your VPA and HPA policies aren’t interfering with each other, and review your bin-packing and density settings when designing clusters for a business- or purpose-class tier of service.
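One common pattern, sketched below rather than prescribed, is to run VPA in recommendation-only mode for a workload that an HPA already scales on CPU, so the two controllers don’t act on the same signal. The Deployment name web is again a hypothetical example.

    apiVersion: autoscaling.k8s.io/v1
    kind: VerticalPodAutoscaler
    metadata:
      name: web-vpa-recommendations
    spec:
      targetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: web                 # hypothetical Deployment also targeted by an HPA
      updatePolicy:
        updateMode: "Off"         # only surface recommendations; don't resize pods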

2. Use instance weighted scores

Let’s say that one of your workloads often ends up consuming more than it requested. Is that happening because the resources are needed? Or were they consumed because they were simply available, but not critically required?

Use instance weighted scores when picking the instance sizes and types that are a good fit for autoscaling. Instance weighting comes in handy especially when you adopt a diversified allocation strategy and use spot instances.

3. Reduce costs with mixed instances

A mixed-instance strategy forges a path towards great availability and performance — at a reasonable cost. You basically choose from various instance types. While some are cheaper and just good enough, they might not be a good match for high-throughput, low-latency workloads.

Depending on your workload, you can often select the cheapest machines and make it all work. Or you could run the same workload on a smaller number of machines with higher specs. This could potentially bring you great cost savings, because each node requires Kubernetes to be installed on it, which adds a little overhead.

But how do you scale mixed instances?

In a mixed-instance situation, every instance uses a different type of resource. So, when you scale instances in autoscaling groups and use metrics like CPU and network utilization, you might get inconsistent metrics.

The Cluster Autoscaler is a must-have here. It allows mixing instance types in a node group — but your instances need to have the same capacity in terms of CPU and memory.
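If you go this route on a self-managed setup, the cluster autoscaler exposes flags for treating similarly sized node groups as interchangeable and for choosing which group to expand. The snippet below is a sketch; flag availability depends on your autoscaler version and cloud provider.

    # excerpt from a cluster-autoscaler container spec
    command:
      - ./cluster-autoscaler
      - --balance-similar-node-groups=true   # spread scale-ups across node groups with matching CPU/memory capacity
      - --expander=least-waste               # prefer the node group that leaves the least unused CPU/memory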

Conclusion

You might ask which type of autoscaling method you should use, and the honest answer is: it depends. It’s not that HPA is better than VPA, or that vertical scaling is wrong compared to horizontal scaling. Each serves a different purpose, and you’ll use them in different circumstances.
