Kubernetes High Availability Deployment with Pod Anti-Affinity

We know that a main role of Kubernetes is to provide high availability for our microservice applications. There are many options and best practices to follow to keep an application highly available. Today we will look at one more option: keeping an application highly available in a public cloud that spans multiple zones.

With a Kubernetes Deployment, we can achieve high availability natively through Pod replicas. But if those replicas are scheduled on the same node and that node has a problem, the system can experience downtime. Likewise, if the replicas are all scheduled in the same availability zone (AZ) and the zone fails, the system will experience downtime.

The Kubernetes scheduler places Pods mainly by fitting their resource requests onto nodes, and it favors a balanced overall node load over spreading one application's replicas precisely across nodes. Therefore, by default, running multiple replicas does not guarantee a multi-AZ spread. To avoid the downtime that follows when all Pods land on a single node or AZ, production services can use pod anti-affinity to distribute replicas across AZs.

In this article, let's look at how affinity works, what options it provides, and how to configure it in Deployments.

How Does Affinity Work?

Affinities are used to express Pod scheduling constraints that can match characteristics of candidate Nodes and the Pods that are already running on those Nodes. A Pod that has an “affinity” to a given Node is more likely to be scheduled to it; conversely, an “anti-affinity” makes it less probable it will be scheduled. The overall balance of these weights is used to determine the final placement of each Pod.

Affinity assessments can produce either hard or soft outcomes. A “hard” result means the Node must have the characteristics defined by the affinity expression. “Soft” affinities act as a preference, indicating to the scheduler that it should use a Node with the characteristics if one is available. A Node that does not meet the condition will still be selected if necessary.

Types of Affinity Condition

There are currently two different kinds of affinity that you can define:

  • Node Affinity – Used to constrain the Nodes that can receive a Pod by matching labels of those Nodes. Node Affinity can only be used to set positive affinities that attract Pods to the Node.
  • Inter-Pod Affinity – Used to constrain the Nodes that can receive a Pod by matching labels of the existing Pods already running on each of those Nodes. Inter-Pod Affinity can be either an attracting affinity or a repelling anti-affinity.

Setting Node Affinities

Node Affinity has two distinct sub-types:

  • requiredDuringSchedulingIgnoredDuringExecution – This is the “hard” affinity matcher that requires the Node meet the constraints you define.
  • preferredDuringSchedulingIgnoredDuringExecution – This is the “soft” matcher to express a preference that’s ignored when it can’t be fulfilled.

The IgnoredDuringExecution part of these verbose names makes it explicit that affinity is only considered while scheduling Pods. Once a Pod has made it onto a Node, affinity is not re-evaluated. Later changes to the Node's labels will not cause the Pod to be evicted.
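
As a minimal sketch of the two sub-types used together, the Pod below must land on a Node labelled disktype=ssd and, among those Nodes, prefers one zone. The Pod name, the disktype label, the zone value zone-a, and the image are placeholders chosen for illustration, not values from a real cluster.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: affinity-demo              # hypothetical name
spec:
  affinity:
    nodeAffinity:
      # Hard rule: only Nodes labelled disktype=ssd are candidates.
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype          # assumed custom Node label
            operator: In
            values:
            - ssd
      # Soft rule: among the candidates, favor Nodes in zone-a.
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 50                 # allowed range is 1-100
        preference:
          matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values:
            - zone-a               # placeholder zone name
  containers:
  - name: app
    image: nginx                   # placeholder image
```

If no Node carries disktype=ssd, the Pod stays Pending. If no matching Node is in zone-a, the Pod is still scheduled on another matching Node.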

Example

In the simplest possible example, a Pod that includes a Node Affinity condition of label=value will only be scheduled to Nodes with a label=value label. A Pod with the same condition but defined as an Inter-Pod Affinity will be scheduled to a Node that already hosts a Pod with a label=value label.
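
The two cases in this example could be written roughly as follows. This is only a sketch: the Pod names and the image are placeholders, and the topologyKey of kubernetes.io/hostname is an assumption that makes "same Node" the unit of co-location.

```yaml
# Variant 1: Node Affinity. Only Nodes that carry label=value are eligible.
apiVersion: v1
kind: Pod
metadata:
  name: node-affinity-pod          # hypothetical name
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: label
            operator: In
            values:
            - value
  containers:
  - name: app
    image: nginx                   # placeholder image
---
# Variant 2: Inter-Pod Affinity. Only Nodes that already run a Pod
# carrying label=value are eligible.
apiVersion: v1
kind: Pod
metadata:
  name: pod-affinity-pod           # hypothetical name
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: label
            operator: In
            values:
            - value
        topologyKey: kubernetes.io/hostname   # assumed: match per Node
  containers:
  - name: app
    image: nginx                   # placeholder image
```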

Zones

This configuration makes a best effort to schedule replicas of a workload in different zones from each other.

spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: <label-key>
              operator: In
              values:
              - <label-value>
          topologyKey: topology.kubernetes.io/zone

Here, the label key-value pair should be one that is unique to the Deployment's Pods.
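
To show where this fragment sits, here is a sketch of a complete Deployment. The affinity block belongs under spec.template.spec, and the selector reuses the label from the Pod template. The name web, the label app=web, the replica count, and the image are illustrative assumptions.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                        # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web                   # label carried only by this Deployment's Pods
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app         # same label as in the Pod template above
                  operator: In
                  values:
                  - web
              topologyKey: topology.kubernetes.io/zone
      containers:
      - name: web
        image: nginx               # placeholder image
```

Because the rule is soft, all three replicas are still scheduled even if the cluster has fewer than three zones.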

Nodes

This configuration makes a best effort to schedule replicas of a workload on different nodes from each other.

spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: <label-key>
              operator: In
              values:
              - <label-value>
          topologyKey: kubernetes.io/hostname

You could also use both configurations. This would be useful in scenarios where you have more pod replicas and nodes than zones.
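
A combined rule might look like the sketch below, which lists both terms under one podAntiAffinity block. The weights 100 and 50 are an assumption that expresses a stronger preference for spreading across zones than across nodes; tune them for your cluster.

```yaml
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      # Stronger preference: avoid zones that already run a replica.
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: <label-key>
              operator: In
              values:
              - <label-value>
          topologyKey: topology.kubernetes.io/zone
      # Weaker preference: avoid Nodes that already run a replica.
      - weight: 50
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: <label-key>
              operator: In
              values:
              - <label-value>
          topologyKey: kubernetes.io/hostname
```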

The examples above are soft anti-affinity rules, which the scheduler does not enforce strictly. If the rule must be satisfied before the Pod can be scheduled, use hard AZ-based anti-affinity. Here is an example.

    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: foxapp
                operator: In
                values:
                - tools
                - learn
              - key: internal
                operator: Exists
            topologyKey: topology.kubernetes.io/zone

This manifest creates a hard anti-affinity rule: the Pod is never scheduled into a zone that already runs a Pod meeting both of the following criteria:

  • It has a foxapp label with either tools or learn as the value.
  • It has an internal label with any value.

You can attach additional conditions by adding entries to the matchExpressions list. Supported operators are In, NotIn, Exists, and DoesNotExist; Node Affinity additionally supports Gt (greater than) and Lt (less than).

In Node Affinity, the matchExpressions entries grouped under a single nodeSelectorTerms entry are combined with a boolean AND: they all need to match for a Pod to gain affinity to a particular Node. You can list multiple nodeSelectorTerms entries too; these are combined as a logical OR. You can assemble complex scheduling criteria by using both structures.
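
The AND/OR combination can be sketched as the Node Affinity rule below. Here foxapp and internal are treated as Node labels, and the cpu-count label is a hypothetical addition that shows the Gt operator. A Node qualifies if it satisfies either term.

```yaml
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        # Term 1: foxapp is tools or learn AND an internal label exists.
        - matchExpressions:
          - key: foxapp
            operator: In
            values:
            - tools
            - learn
          - key: internal
            operator: Exists
        # OR Term 2: cpu-count is greater than 8.
        - matchExpressions:
          - key: cpu-count         # hypothetical Node label
            operator: Gt
            values:
            - "8"                  # Gt/Lt take one integer, written as a string
```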

Hard AZ-based anti-affinity ensures that the Pods are distributed across the Availability Zones (AZs). Note that it schedules at most one Pod per zone: if your cluster spans 3 zones and you request 4 replicas, only 3 Pods are scheduled and the 4th stays in the Pending state, because there are only 3 zones. The replica count is therefore limited by the number of AZs, and you may sometimes need to pick a region based on its AZ count. Because of this hard limit, this option is not used much in production, but it is good to know the available options.

Use cases

  • PodAffinity makes sense when a certain set of Pods needs to be scheduled together, for example a web server and a cache.
  • PodAntiAffinity makes sense when a certain set of Pods must not be scheduled together.
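
Both use cases can appear in one manifest. The sketch below assumes cache Pods labelled app=cache already run in the cluster; the names, labels, and image are illustrative. Each web server replica must share a Node with a cache Pod and prefers a Node that does not already host another web server replica.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-server                 # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-server
  template:
    metadata:
      labels:
        app: web-server
    spec:
      affinity:
        # Together: require a Node that already runs a cache Pod.
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - cache            # assumed label on the cache Pods
            topologyKey: kubernetes.io/hostname
        # Apart: prefer Nodes without another web-server replica.
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - web-server
              topologyKey: kubernetes.io/hostname
      containers:
      - name: web
        image: nginx               # placeholder image
```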