Horizontal Pod Autoscaler (HPA) – Know Everything About It

In a recent post, we discussed Kubernetes autoscaling, its methods, benefits, and best practices. In this article, we look at one of those methods, the Horizontal Pod Autoscaler. Like the other methods it has its own advantages, and as far as we can tell it is also widely adopted. Let’s look at it in detail.

What Is Horizontal Pod Autoscaler (HPA)?

A Kubernetes cluster is made up of one or more machines, virtual or physical, called nodes. In Kubernetes, a pod is the smallest deployable unit, and your application containers are deployed as pods. A pod is a logical construct and requires a node to run on, and a node can have one or more pods running on it.

Horizontal Pod Autoscaler is a type of autoscaler that can increase or decrease the number of pods in a Deployment, ReplicationController, StatefulSet, or ReplicaSet, usually in response to CPU utilization patterns. This process represents horizontal scaling because it changes the number of instances, not the resources allocated to a given container.

How Does HPA Work and What Are Its Benefits?

By default, HPA scales workloads based on resource metrics such as average CPU or memory utilization across pods. It can also use custom or externally provided metrics. After the initial setup it operates automatically: you only need to define the target and the minimum and maximum number of replicas that suit your requirements or demand.

The configured HPA controller is responsible for checking metrics and scaling replicas accordingly by adding or removing pods. This scaling occurs automatically, but you can sometimes account for predictable fluctuations in loading requirements. HPA works in a loop by checking, updating, and re-checking metrics.

In the first step of the HPA loop, the controller reads resource utilization (CPU, memory, and any custom metrics) via the metrics server. Next, HPA calculates the desired number of replicas from the observed and target metric values. Then, the autoscaler decides whether to scale the application up or down. In the last step of the loop, HPA applies the target number of replicas.
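
The calculation step can be sketched in a few lines of Python. This is a simplified illustration of the formula given in the Kubernetes documentation (ceil of current replicas times the ratio of current to target metric, with a default tolerance of 0.1), not the controller's actual code. The function name and parameters are made up for this example, and details such as missing metrics, not-yet-ready pods, and stabilization windows are left out.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas, max_replicas, tolerance=0.1):
    """Simplified version of the documented HPA scaling formula."""
    ratio = current_metric / target_metric
    # Inside the tolerance band the replica count is left unchanged.
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # The result is always clamped to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


# One replica at 158% CPU against a 50% target, bounds 1..10.
print(desired_replicas(1, 158, 50, 1, 10))  # 4
# Four replicas at 20% CPU against a 50% target.
print(desired_replicas(4, 20, 50, 1, 10))   # 2
```

The first call uses the same numbers as the load test later in this article, where a single replica at 158% of a 50% target is scaled to four replicas.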

Because HPA is a continuous monitoring process, this loop repeats at a fixed interval. The default interval for HPA checks is 15 seconds. Use the --horizontal-pod-autoscaler-sync-period flag of kube-controller-manager to change the interval.

The autoscaling/v1 API version of the HPA only supports the average CPU utilization metric. The autoscaling/v2 API version allows scaling according to memory usage, defining custom metrics, and using multiple metrics in a single HPA object.

What Is the Impact of HPA on Kubernetes Resource Costs?

Running multiple workloads on a server instance can be cost-effective but tracking your Kubernetes costs and identifying where you can save is challenging. Autoscaling lets you tightly configure scaling to reduce waste and minimize application running costs.

Application usage often changes over time, requiring more or fewer pod replicas. HPA scales your workloads automatically. It is useful for stateless and stateful applications. Combining HPA with cluster scaling helps reduce costs for workloads with frequent demand changes, decreasing the number of nodes alongside the pods.

Properly configured, the HPA controller can monitor pods to determine if the number of replicas needs changing. It compares the current value to the target value.

How to Use HPA Metrics

As discussed above, the Horizontal Pod Autoscaler (HPA) enables horizontal scaling of container workloads running in Kubernetes. For HPA to work, the Kubernetes cluster needs to have metrics enabled, which means running Metrics Server. The Requirements section below shows how to check for it and install it.

Kubernetes HPA supports four kinds of metrics:

Resource Metric

Resource metrics refer to the CPU and memory utilization of Kubernetes pods, measured against the requests in the pod spec. These metrics are natively known to Kubernetes through the metrics server. The values are averaged together before being compared with the target values. That is, if three replicas are running for your application, their utilization values are averaged, expressed as a percentage of the CPU and memory requests defined in your deployment spec, and then compared with the target.
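
As a small illustration of that averaging, the sketch below computes the average CPU utilization of three replicas as a percentage of their request. The usage and request figures are hypothetical, and the function name is made up for this example.

```python
def average_utilization(usage_millicores, request_millicores):
    """Average CPU utilization across pods, as a percentage of the request."""
    total_usage = sum(usage_millicores)
    total_request = request_millicores * len(usage_millicores)
    return 100 * total_usage / total_request


# Three replicas, each requesting 200m CPU (hypothetical numbers).
usage = [100, 150, 50]
print(average_utilization(usage, 200))  # 50.0
```

With a 50% target, this workload sits exactly on target even though one pod is at 75% and another at 25%.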

Object Metric

Object metrics describe the information available in a single Kubernetes resource. An example of this would be hits per second for an ingress object.

Pod Metric

Pod metrics (referred to as PodsMetricSource) describe each pod at runtime and are collected per pod in Kubernetes. An example would be transactions processed per second in a pod. If there are multiple pods for a given PodsMetricSource, the values will be collected and averaged together before being compared against the target threshold values.

External Metrics

External metrics are metrics gathered from sources running outside the scope of a Kubernetes cluster. For example, metrics from Prometheus can be queried for the length of a queue in a cloud messaging service, or QPS from a load balancer running outside of the cluster.

Horizontal Pod Autoscaler API Versions

API version autoscaling/v1 is stable, but it only supports CPU utilization-based autoscaling.

The autoscaling/v2 version of the API, stable since Kubernetes 1.23, adds support for multiple metrics as well as custom and external metrics.

You can verify which API versions your cluster supports by querying api-versions. The following command lists all the autoscaling versions.

# kubectl api-versions | grep autoscaling
autoscaling/v1
autoscaling/v2
autoscaling/v2beta1
autoscaling/v2beta2

Requirements

Horizontal Pod Autoscaler (and also Vertical Pod Autoscaler) requires Metrics Server to be installed in the Kubernetes cluster. Metrics Server is a source of container resource metrics (such as memory and CPU usage) that is scalable, can be configured for high availability, and is efficient in its own resource usage. By default, Metrics Server gathers metrics from the kubelets every 15 seconds, which allows rapid autoscaling.

You can check whether Metrics Server is installed by running the following command:

# kubectl top pods

The following message will be shown if the metrics server is not installed.

error: Metrics API not available

On the other hand, if Metrics Server is installed, you should get output showing resource utilization.

Installation of Metrics Server

If you have already installed Metrics Server, you can skip this section.

Metrics Server offers two easy installation mechanisms. The first uses kubectl to apply a single file that includes all the manifests.

# kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

The second option is the Helm chart, which is preferred. The available Helm values are documented with the chart.

First, add the Metrics-Server Helm repository to your local repository list as follows.

# helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/

Now you can install the Metrics Server via Helm.

# helm upgrade --install metrics-server metrics-server/metrics-server

If your cluster’s kubelets use self-signed certificates, add --set args={--kubelet-insecure-tls} to the command above.

Verifying the Installation

Once the installation has finished, allow some time for Metrics Server to become ready, then try the command again.

# kubectl top pods -n argocd
NAME                                                CPU(cores)   MEMORY(bytes)
argocd-application-controller-0                     4m           24Mi
argocd-applicationset-controller-596ddc6c7d-d7lgl   4m           29Mi
argocd-dex-server-78c894df5b-svc87                  14m          28Mi
argocd-notifications-controller-6f65c4ccdb-5cpb8    3m           22Mi
argocd-redis-ha-haproxy-787f9b5689-rpn62            6m           71Mi
argocd-redis-ha-server-0                            13m          20Mi
argocd-repo-server-75b7c59bfb-cqtbz                 15m          26Mi
argocd-server-d86d7959d-sd98v                       16m          31Mi

Also, we can see the resources of the nodes with a similar command.

# kubectl top nodes
NAME                                CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
aks-agentpool-12792500-vmss000002   125m         3%     1233Mi          9%

You can also send queries directly to Metrics Server via kubectl.

# kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes | jq

We can also verify our pod’s metrics from the API.

# kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/argocd/pods/argocd-server-d86d7959d-sd98v | jq

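If a query returns no data, or an HPA later reports its current value as <unknown>, the cause is usually one of three things: the Metrics Server control loop hasn’t run yet, it is not running correctly, or resource requests are not set on the target pod spec.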

How to Configure Horizontal Pod Autoscaling?

As an illustration of the horizontal pod autoscaling capabilities, this article will show you how to:

  • Create a test deployment.
  • Create an HPA via the command line or use the declarative approach.
  • Apply custom metrics.
  • Apply multiple metrics.

Create a Deployment

The following section shows how to create a Docker image for a small PHP app that performs a resource-intensive calculation.

1. Create the app directory:

# mkdir hpa-test && cd hpa-test

2. Create a Dockerfile in a text editor:

# vim Dockerfile

3. Put the following contents into the file:

FROM php:5-apache
COPY index.php /var/www/html/index.php
RUN chmod a+rx index.php

4. Create the index.php file:

# vim index.php

5. Add the mathematical operation to the file to create a CPU load:

<?php
$x = 0.0001;
for ($i = 0; $i <= 1000000; $i++) {
    $x += sqrt($x);
}
echo "OK!";
?>

6. Build the Docker image:

# docker build -t hpa-test:v1 .

7. Tag the image. Use your Docker Hub account username:

# docker image tag hpa-test:v1 motoskia/hpa-test:v1

8. Push the image to Docker Hub by typing:

# docker image push motoskia/hpa-test:v1

9. Create a deployment YAML in a text editor:

# vim hpa-test.yaml

10. The YAML defines the deployment and the service that exposes it. The spec.template.spec.containers section specifies that the deployment uses the Docker image created in the previous steps. Furthermore, the resources sub-section contains resource limits and requests.

11. Create the deployment by using the kubectl apply command:

# kubectl apply -f hpa-test.yaml -n foxutech

12. Confirm that the objects are ready and running.

# kubectl get all -n foxutech
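
Step 10 describes the manifest without listing it. As a rough sketch only, the Python script below generates an equivalent Deployment and Service as JSON, which kubectl apply also accepts. The names hpa-test and motoskia/hpa-test:v1 and port 80 come from the steps in this article; the replica count, the labels, and the CPU request and limit values are assumptions, not the contents of the original file.

```python
import json


def build_manifests(name="hpa-test", image="motoskia/hpa-test:v1"):
    """Build a Deployment and a Service wrapped in a v1 List object."""
    labels = {"app": name}  # assumed label scheme
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": name,
                        "image": image,
                        "ports": [{"containerPort": 80}],
                        # Assumed values. The HPA needs the CPU request
                        # to compute utilization percentages.
                        "resources": {
                            "requests": {"cpu": "200m"},
                            "limits": {"cpu": "500m"},
                        },
                    }],
                },
            },
        },
    }
    service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "selector": labels,
            "ports": [{"port": 80, "targetPort": 80}],
        },
    }
    return {"apiVersion": "v1", "kind": "List", "items": [deployment, service]}


if __name__ == "__main__":
    print(json.dumps(build_manifests(), indent=2))
```

Saving the output to a file (for example, python3 make_manifest.py > hpa-test.json, where both file names are hypothetical) gives a manifest that can be applied with kubectl apply -f in place of hpa-test.yaml.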

Create HPA

With the deployment up and running, proceed to create a HorizontalPodAutoscaler object. An HPA can be created imperatively with kubectl autoscale or declaratively from a YAML file; the section below uses the declarative method.

Install the Horizontal Pod Autoscaler

We now have the sample application as part of our deployment, and the service is accessible on port 80. To scale our resources, we will use HPA to scale up when traffic increases and scale down the resources when traffic decreases.

Let’s create the HPA configuration file as shown below:

# cat hpa.yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: hpa-test
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: hpa-test
  minReplicas: 1
  maxReplicas: 10
  targetCPUUtilizationPercentage: 50

Apply the changes:

# kubectl apply -f hpa.yaml -n foxutech
horizontalpodautoscaler.autoscaling/hpa-test created

Verify the HPA deployment:

# kubectl get hpa -n foxutech
NAME       REFERENCE             TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
hpa-test   Deployment/hpa-test   0%/50%    1         10        1          3m19s

The above output shows that the HPA maintains between 1 and 10 replicas of the pods controlled by the hpa-test deployment. In the TARGETS column, 50% is the average CPU utilization that the HPA tries to maintain, whereas 0% is the current usage.

You can modify the minimum and maximum in the YAML. You can also change them from the CLI, but we recommend following the GitOps approach so that all changes are tracked and manual changes are avoided.

Increase the Load

Before increasing the load to see the HPA in action, use kubectl get to check the status of the controller.

# kubectl get hpa -n foxutech

The TARGETS column in the output shows that no load is generated. Consequently, the number of created pod replicas is still one.

Now increase the CPU load using the load generator. Execute the following command in a new terminal tab/window:

# kubectl -n foxutech run -i --tty load-generator --rm --image=busybox --restart=Never -- /bin/sh -c "while sleep 0.01; do wget -q -O- http://hpa-test; done"

The command uses the busybox image to generate load and repeatedly query the hpa-test deployment. The deployment performs mathematical operations and engages the CPU.

In the first terminal tab/window, type the following command to watch the status of the HPA:

# kubectl get hpa -w -n foxutech

After a short time, the TARGETS column shows that the CPU load exceeds the target specified when the HPA was created. Consequently, the HPA increases the number of replicas to match the current load.

# kubectl get hpa -w -n foxutech
NAME       REFERENCE             TARGETS    MINPODS   MAXPODS   REPLICAS   AGE
hpa-test   Deployment/hpa-test   0%/50%     1         10        1          5m20s
hpa-test   Deployment/hpa-test   158%/50%   1         10        1          5m32s
hpa-test   Deployment/hpa-test   158%/50%   1         10        4          5m47s
hpa-test   Deployment/hpa-test   129%/50%   1         10        4          6m2s
hpa-test   Deployment/hpa-test   154%/50%   1         10        4          6m17s
hpa-test   Deployment/hpa-test   124%/50%   1         10        4          6m32s

Stop Generating Load

To stop generating the CPU load, switch to the load generator terminal tab/window and press CTRL+C.

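Type the following command and wait for the number of replicas to drop back to one. Scale-down is not immediate: by default, the HPA waits for a five-minute stabilization window before removing replicas.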

# kubectl get hpa -w -n foxutech

Autoscaling Based on Custom Metrics

The autoscaling/v2 API version lets you trigger the HPA on metrics other than CPU utilization, including memory and custom metrics. To switch to autoscaling/v2, create a new HPA manifest in a text editor:

# vim hpav2.yaml

For example, to create a memory-based HPA condition, define the target in the spec.metrics section:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: hpa-testv2-memory
spec:
  minReplicas: 2
  maxReplicas: 10
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: hpa-test
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: AverageValue
        averageValue: 50Mi

With a metrics adapter that serves the custom or external metrics API, the HPA can also scale on custom metrics, including:

  • Load balancer traffic.
  • Outbound connections.
  • Queue depth.
  • Latency of a function or dependency.
  • HTTP request throughput.
  • Size of the message queue.
  • Request latency, etc.

Autoscaling Based on Multiple Metrics

The spec.metrics field in autoscaling/v2 allows setting up multiple metrics to be monitored by a single HPA. To specify more than one metric in a YAML, list them one after another. For example, the following section defines both CPU and memory metrics:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: hpa-testv2-multi
spec:
  minReplicas: 2
  maxReplicas: 10
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: hpa-test
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
  - type: Resource
    resource:
      name: memory
      target:
        type: AverageValue
        averageValue: 50Mi

Limitations of Horizontal Pod Autoscaler

While HPA is most useful for autoscaling a stateless application, it can also work with stateful sets. However, HPA also has some limitations:

  • The application architecture must support distributed workloads: you might need to architect the application to support scaling. Otherwise, it might be impossible to distribute workloads across different servers.
  • HPA cannot always handle unexpected spikes in demand: new pods, and any new virtual machines needed to host them, can take several minutes to become ready, making it hard to keep up with sudden changes in demand.
  • Pods can waste resources or terminate frequently: if you don’t configure memory and CPU requests and limits on the pods, they might work inefficiently.
  • The cluster can run out of capacity: HPA will not be able to increase the number of pods until you add new nodes to your cluster. You can use Cluster Autoscaler (CA) to scale nodes automatically.
  • DaemonSets: you cannot use the HPA to scale DaemonSets.
  • Conflicts: if vertical pod autoscaling is set up for the same workload and metrics, it may conflict with the HPA.
  • Network and storage: the HPA does not take network and storage capacity into account, so scaling out can overload those resources and cause outages.

Kubernetes HPA Best Practices

When running production workloads with autoscaling enabled, there are a few best practices to keep in mind.

  • Install Metrics Server: Kubernetes requires Metrics Server to be installed for autoscaling to work. It enables the Kubernetes metrics APIs, which the autoscaling algorithms use to make scaling decisions.
  • Define pod requests and limits: The Kubernetes scheduler makes scheduling decisions according to the resource requests set on the pod, while limits cap what a container may use. If they are not set properly, Kubernetes cannot make an informed scheduling decision, and pods will not go into a Pending state when resources are short. Instead, they can end up in CrashLoopBackOff, and Cluster Autoscaler won’t kick in to scale the nodes. Furthermore, HPA calculates utilization as a percentage of the pod’s requests, so without requests it has no baseline for percentage-based targets.
  • Specify PodDisruptionBudgets for mission-critical applications: A PodDisruptionBudget protects critical pods running in the Kubernetes cluster from voluntary disruption. When a PodDisruptionBudget is defined for an application, operations such as node drains and Cluster Autoscaler scale-downs will not evict replicas below the minimum configured in the disruption budget.
  • Resource requests should be close to the average usage of the pods: An appropriate resource request can be hard to determine for new applications, as they have no previous resource utilization data. However, you can run Vertical Pod Autoscaler in recommendation mode, where it recommends CPU and memory requests for your pods based on observations of your application’s usage.
  • Set higher CPU limits for slow-starting applications: Some applications (for example, Java Spring) need an initial CPU burst to get up and running, but at runtime typically use a small amount of CPU compared to the initial load. To handle this, set the CPU limit well above the request. The higher limit lets these containers start up quickly, while the lower request matches their typical runtime usage.
  • Don’t mix HPA with VPA on the same metrics: Horizontal Pod Autoscaler and Vertical Pod Autoscaler should not act on the same CPU or memory metrics for the same workload. It is recommended to run Vertical Pod Autoscaler first in recommendation mode, to get proper values for CPU and memory requests, and then to run HPA to handle traffic spikes.
  • Create HPAs using YAML files. The command-line method makes it more difficult to version-control.
  • If you employ custom metrics, ensure that you use the correct target type for pods and objects.
  • Using microservice architecture is a great way to ensure that your deployment takes full advantage of horizontal autoscaling.
  • Adding native support for parallel pods allows the HPA to create and terminate pods in parallel, speeding up the process.

Troubleshooting Kubernetes HPA

Insufficient Time to Scale

A common challenge with HPA is that it takes time to scale up a workload by adding another pod. Loads can sometimes change sharply, and during the time it takes to scale up, the existing pod can reach 100% utilization, resulting in service degradation and failures.

For example, consider a pod that handles 800 requests per second at 80% CPU utilization, which puts its full capacity at roughly 1,000 requests per second. HPA is configured to scale up when the 80% CPU threshold is reached, and a new pod takes 10 seconds to start.

If the load then grows by 100 requests per second every second, the pod will reach 100% utilization within 2 seconds, while it takes 8 more seconds for the second pod to start receiving requests.
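
The arithmetic behind this example can be written out as a short sketch. It assumes that CPU utilization grows linearly with request rate and that the load rises by 100 requests per second every second; the function and variable names are made up for illustration.

```python
def seconds_until_saturated(current_rps, capacity_rps, growth_rps_per_second):
    """Seconds until the load reaches the pod's full capacity."""
    return (capacity_rps - current_rps) / growth_rps_per_second


threshold_rps = 800   # load at which the pod hits the scaling threshold
threshold_pct = 80    # HPA scaling threshold, in percent CPU
pod_startup_s = 10    # time for a new pod to start serving

# Linear assumption: 800 rps at 80% CPU implies 1000 rps at 100% CPU.
capacity_rps = threshold_rps * 100 / threshold_pct

saturated_after = seconds_until_saturated(threshold_rps, capacity_rps, 100)
overloaded_for = max(0, pod_startup_s - saturated_after)

print(saturated_after)  # 2.0
print(overloaded_for)   # 8.0
```

Under these assumptions the existing pod runs at or above full capacity for 8 of the 10 seconds it takes the new pod to come up.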

Possible solutions

  • Reducing the scaling threshold to keep a safety margin, so that each pod has some spare capacity to deal with sudden traffic spikes. Keep in mind that this has a cost, which is multiplied by the number of pods running your application.
  • Always keeping one extra pod in reserve to account for sudden traffic spikes.

Brief Spikes in Load

When a workload experiences brief spikes in CPU utilization (or any other scaling metrics), you might expect that HPA will immediately spin up an additional pod. However, if the spikes are short enough, this will not happen.

To understand why, consider that:

  • When an event like high CPU utilization happens, HPA does not directly receive the event from the pod.
  • HPA polls for metrics every few seconds from the Kubernetes Metrics Server (unless you have integrated a custom component).
  • The Kubernetes Metrics Server polls aggregate metrics from pods.
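  • The Metrics Server --metric-resolution flag sets how often metrics are scraped, which is the time window each value covers. The default is 15 seconds in recent releases; the example below assumes 30 seconds.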

For example, assume HPA is set to scale when CPU utilization exceeds 80%. If CPU utilization suddenly spikes to 90%, but this occurs for only 2 seconds out of a 30 second metric resolution window, and in the rest of the 30-second period utilization is 20%, the average utilization is:

(2 * 90% + 28 * 20%) / 30 ≈ 24.7%

When HPA polls for the CPU utilization metric, it will observe a value of about 25%, which is not even close to the scaling threshold of 80%. This means HPA will not scale, even though in reality the workload experienced high load.
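
The same time-weighted average can be checked with a few lines of Python. The sketch below takes a 2-second spike at 90% followed by 28 seconds at 20%; the function name is made up for this example.

```python
def window_average(segments):
    """Time-weighted average of (seconds, utilization_percent) segments."""
    total_seconds = sum(seconds for seconds, _ in segments)
    weighted = sum(seconds * pct for seconds, pct in segments)
    return weighted / total_seconds


# 2 seconds at 90% CPU, then 28 seconds at 20% CPU.
avg = window_average([(2, 90), (28, 20)])
print(round(avg, 1))  # 24.7
print(avg > 80)       # False: the 80% threshold is never crossed
```

Even a spike to 100% for those 2 seconds would only raise the window average to about 25%, so short bursts are effectively invisible at this resolution.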

Possible solutions

  • Increase metric resolution — you can set the --metric-resolution flag to a lower number. However, this might cause unwanted scaling events because HPA will become much more sensitive to changes in load.
  • Use burstable QoS on the pod — if you set the limits parameter significantly higher than the requests parameter (for example, 3-4 times higher), in the example above more resources will be allocated to the pod, if available. This can preclude the need to scale horizontally using HPA. This solution does not guarantee scaling, and also risks that the pod will be evicted from the node due to resource pressure.
  • Combine HPA with VPA — if you expect resources to be available on the node to provide more resources in case of brief spikes in load, you can use VPA in combination with HPA. Make sure to configure VPA on a separate metric from HPA, and one that immediately responds to increased loads.

Scaling Delay Due to Application Readiness

It can often happen that HPA correctly issues a scaling request, but for various reasons, it takes time for the new container to be up and running. These reasons can include:

  • Image downloads — some images are large and network conditions might result in a long download time.
  • Initialization procedures — some applications require a complex initialization or warmup, and while they are taking place, they cannot serve loads.
  • Readiness checks — a pod might have readiness checks such as initialDelaySeconds, meaning that Kubernetes will not send traffic to the pod until the delay is over, even if in reality the container is ready for work.

Possible solutions

  • Keep container images small.
  • Keep the initialization procedures short.
  • Identify Kubernetes readiness checks and ensure they are not overly strict.

Excessive Scaling

In some cases, HPA might scale an application so much that it could consume almost all the resources in the cluster. You could set up HPA in combination with Cluster Autoscaler to automatically add more nodes to the cluster. However, this might sometimes get out of hand.

Consider these scenarios:

  • A denial of service (DoS) attack in which an application is flooded with fake traffic.
  • An application is experiencing high loads but is not mission critical and the organization cannot invest in additional resources and scale up the cluster.
  • The application is using excessive resources due to a misconfiguration or design issue, which should be resolved instead of automatically scaling it on demand.

In these, and many similar scenarios, it is better not to scale the application beyond a certain limit. However, HPA does not know this and will continue to scale the application even when this does not make business sense.

Possible solutions

The best solution is to limit the number of replicas that can be created by HPA. You can define this in the spec.maxReplicas field of the HPA configuration:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: hpa-testv2-memory
spec:
  minReplicas: 2
  maxReplicas: 10

In this configuration, maxReplicas is set to 10. Calculate the maximum expected load of your application and ensure you set a realistic maximal scale, with some buffer for surprise peaks in traffic.
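
One rough way to arrive at such a number is sketched below. The peak load, per-pod capacity, and buffer percentage are hypothetical inputs that you would replace with measurements from your own application; the function name is made up for this example.

```python
import math


def max_replicas_for(peak_rps, per_pod_rps, buffer_pct=20):
    """Replicas needed to serve the peak load plus a safety buffer."""
    # Integer percentages keep the arithmetic exact before rounding up.
    return math.ceil(peak_rps * (100 + buffer_pct) / (100 * per_pod_rps))


# Hypothetical figures: 6500 requests/s at peak, 800 requests/s per pod.
print(max_replicas_for(peak_rps=6500, per_pod_rps=800))  # 10
```

With these assumed figures, a 20% buffer gives a ceiling of 10 replicas, which matches the maxReplicas value in the configuration above.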
