
A Guide to Disaster Recovery in the Kubernetes Cluster

So far, we have covered various Kubernetes topics; in this article, let's look at another important one: disaster recovery in a Kubernetes cluster. As Kubernetes adoption grows everywhere, it is important to make industry-standard processes part of your cluster implementation and configuration. Backup is one of them, and it is what lets you recover your Kubernetes cluster from any major failure.

Why Do We Need Backup and Recovery?

There are three reasons why we need a backup and recovery mechanism in place for our Kubernetes cluster.

  1. To recover the cluster from disasters: for example, someone accidentally deleted the namespace where your deployments reside.
  2. To replicate the environment: you want to replicate your production environment to staging before any major upgrade.
  3. To migrate the Kubernetes cluster: let's say you want to move your Kubernetes cluster from one environment to another.

What to Backup?

We now know why we need a backup, but what should we back up? Here are the things:

  1. Your Kubernetes control plane state is stored in etcd, so you need to back up the etcd state to capture all the Kubernetes resources.
  2. If you have stateful workloads (which you will have in the real world), you need a backup of the persistent volumes as well (see the commands after this list).
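
Before planning the backup, it helps to take a quick inventory of the stateful data in the cluster. The following read-only kubectl commands are a minimal sketch for listing the persistent volumes and claims that would need a volume-level backup:

# kubectl get pv
# kubectl get pvc --all-namespaces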

Best practices for Kubernetes disaster recovery

Kubernetes workloads should not be backed up using a traditional approach. To make sure that backup and recovery are seamless, organizations should keep the following things in mind.

  • Understand the backup requirement

It is important to understand what to back up and how important it is. For example, if you are running your Kubernetes cluster in a cloud environment with GitOps, backups are less of a concern because all your changes live in Git; you can focus on backing up volumes if you use any. Understand the backup requirement in this way and plan for it. In this article we assume you are running the cluster on bare metal and show one possible way to take a backup. If you wish to adopt the GitOps way, you can follow our ArgoCD series.

  • Have a restore plan

You should have detailed steps and a plan for how to restore the backup in case anything happens. Always test it at least twice in different environments so you will be more confident when it matters. Document the steps with detailed explanations so anyone can perform them quickly.

  • Application-aware backups

Kubernetes' portability is a double-edged sword: it makes it easy to build new applications from existing services and eases migration to different environments, but it also spreads application state across many components. While many workloads on the Kubernetes platform are stateless, the stateful ones need application-aware backups that provide context about the backup and the different components involved in it. This can be done with the help of a Kubernetes backup solution. Organizations can automate the entire backup and recovery process to avoid failures. These solutions also provide options to store backups in various locations and make restoring to a brand-new environment a breeze.

  • Security is key

We need to protect our backups from attackers. Organizations often make the mistake of slacking on backup security, yet your application is only as secure as your backup. To avoid unwarranted access to backups, employ identity and access management (IAM) or role-based access control (RBAC): only the members who are assigned to monitor or verify backups should be given access rights. Another important measure against attacks is data encryption. Organizations can also invest in a disaster recovery solution that takes care of backup security for them.
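
As a rough illustration of restricting backup access with RBAC, the commands below create a read-only role scoped to the backup CronJob and its Jobs in kube-system; the role, binding, and user names are purely illustrative:

# kubectl create role backup-viewer -n kube-system --verb=get,list,watch --resource=cronjobs,jobs
# kubectl create rolebinding backup-viewer-binding -n kube-system --role=backup-viewer --user=backup-auditor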

Requirements

  1. Make sure both environments are using the same versions of kubeadm, kubectl, and kubelet (see the version checks below).
  2. You can use https://foxutech.com/setup-a-multi-master-kubernetes-cluster-with-kubeadm/ to set up the cluster in your environment.
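
A quick way to compare the versions is to run the following on both the source and target environments and confirm they match:

# kubeadm version -o short
# kubectl version
# kubelet --version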

ETCD Backup

How to Take etcd backup:

The mechanism for taking an etcd backup depends on how the etcd cluster was set up in your Kubernetes environment. There are two ways to set up an etcd cluster for Kubernetes (a quick way to tell which one you have is shown after the list):

  1. Internal etcd cluster: you run etcd as containers/pods inside the Kubernetes cluster, and it is the responsibility of Kubernetes to manage those pods.
  2. External etcd cluster: you run the etcd cluster outside the Kubernetes cluster, usually as Linux services, and provide its endpoints for the Kubernetes cluster to write to.
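
If you are not sure which topology you are running, a quick check (assuming a kubeadm-built cluster) is to look for etcd pods in kube-system; stacked/internal clusters will show them, external clusters will not:

# kubectl get pods -n kube-system -l component=etcd -o wide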

Backup Strategy for Internal Etcd Cluster:

To take a backup from inside an etcd pod, we will use the Kubernetes CronJob functionality, which does not require the etcdctl client to be installed on the host. Alternatively, you can use etcdctl directly, as shown below.

Note: The backup location should be external or otherwise secure, and itself backed up properly or kept in a highly available environment such as cloud volumes.

Command:

For this, etcdctl needs to be installed on the node(s).

# etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/client.crt --key=/etc/kubernetes/pki/etcd/client.key snapshot save /backup/etcd-snapshot-$(date +%Y-%m-%d_%H:%M:%S_%Z).db

If you are not aware of the etcd details, you can find the required information by using the command below.

# kubectl get pods etcd-k8s-master -n kube-system -o=jsonpath='{.spec.containers[0].command}' | jq

Cronjob:

The following is the definition of a Kubernetes CronJob which will take an etcd backup every minute:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: backup
  namespace: kube-system
spec:
  # activeDeadlineSeconds: 100
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: backup
            # Same image as in /etc/kubernetes/manifests/etcd.yaml
            image: k8s.gcr.io/etcd:3.2.24
            env:
            - name: ETCDCTL_API
              value: "3"
            command: ["/bin/sh"]
            args: ["-c", "etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/client.crt --key=/etc/kubernetes/pki/etcd/client.key snapshot save /backup/etcd-snapshot-$(date +%Y-%m-%d_%H:%M:%S_%Z).db"]
            volumeMounts:
            - mountPath: /etc/kubernetes/pki/etcd
              name: etcd-certs
              readOnly: true
            - mountPath: /backup
              name: backup
          restartPolicy: OnFailure
          hostNetwork: true
          volumes:
          - name: etcd-certs
            hostPath:
              path: /etc/kubernetes/pki/etcd
              type: DirectoryOrCreate
          - name: backup
            hostPath:
              path: /data/backup
              type: DirectoryOrCreate
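
To use it, save the manifest to a file (the file name below is just an example), create the CronJob, and confirm that snapshots are landing in the hostPath directory defined above:

# kubectl apply -f etcd-backup-cronjob.yaml
# kubectl get cronjob backup -n kube-system
# kubectl get jobs -n kube-system
# ls -lh /data/backup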

We can check the snapshot status.

# ETCDCTL_API=3 etcdctl --write-out=table snapshot status /backup/etcd-snapshot.db

Backup Strategy for External Etcd Cluster:

If you are running your etcd cluster on Linux hosts as a service, you should set up a Linux cron job to back up your cluster (an example crontab entry is shown after the command below).

Run the following command to save etcd backup:

# ETCDCTL_API=3 etcdctl --endpoints $ENDPOINT snapshot save /path/for/backup/snapshot.db
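
As an illustration, a root crontab entry like the one below takes a snapshot every day at 01:00; the etcdctl path, the $ENDPOINT value, and the backup directory are assumptions you should adapt to your setup (note that % must be escaped in crontab):

0 1 * * * ETCDCTL_API=3 /usr/local/bin/etcdctl --endpoints $ENDPOINT snapshot save /backup/etcd-snapshot-$(date +\%Y-\%m-\%d).db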

Disaster Recovery

Now, let's assume you wish to migrate the cluster to another environment, replicate it, or your cluster has gone down, and you need to recover from the etcd snapshot. You can set up the new cluster in the new environment and start the etcd cluster, then run kubeadm init on the master node with the etcd endpoints. Don't forget to place the backed-up certificates into the /etc/kubernetes/pki folder before kubeadm init so it picks up the same certificates, and remember to copy the backup itself to the new environment.
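
For example, if the original /etc/kubernetes/pki directory was backed up, a minimal sketch of restoring it on the new master before running kubeadm init could look like this (the backup path and host name are assumptions):

# scp -r /data/backup/pki/* root@new-master:/etc/kubernetes/pki/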

Restore Strategy for Internal Etcd Cluster:

# docker run --rm \
-v '/data/backup:/backup' \
-v '/var/lib/etcd:/var/lib/etcd' \
--env ETCDCTL_API=3 \
'k8s.gcr.io/etcd:3.3.13' \
/bin/sh -c "etcdctl snapshot restore '/backup/etcd-snapshot-2022-10-23_10:44:02_UTC.db' ; mv /default.etcd/member/ /var/lib/etcd/"

# kubeadm init --ignore-preflight-errors=DirAvailable--var-lib-etcd
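
Once kubeadm init completes, you can verify that the state restored from the snapshot is visible again, for example:

# kubectl get nodes
# kubectl get ns
# kubectl get pods --all-namespaces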

Restore Strategy for External Etcd Cluster

Now, stop the etcd service on all the nodes, replace the etcd data directory with the restored data directory on each node, and then start the etcd service on the nodes one by one (a sketch of these steps follows the restore commands below). You should then see all the nodes become ready; however, sometimes only one master node is ready and the other nodes are in a NotReady state. In that case you need to join those two nodes again with the existing ca.crt file (you should have a backup of it).

Restore etcd on the three nodes using the following commands:

# ETCDCTL_API=3 etcdctl snapshot restore snapshot-69.db \
--name 10.1.1.21 \
--initial-cluster 10.1.1.21=https://10.1.1.21:2380,10.1.1.22=https://10.1.1.22:2380,10.1.1.23=https://10.1.1.23:2380 \
--initial-cluster-token my-etcd-token \
--initial-advertise-peer-urls https://10.1.1.21:2380

# ETCDCTL_API=3 etcdctl snapshot restore snapshot-78.db \
--name 10.1.1.22 \
--initial-cluster 10.1.1.21=https://10.1.1.21:2380,10.1.1.22=https://10.1.1.22:2380,10.1.1.23=https://10.1.1.23:2380 \
--initial-cluster-token my-etcd-token \
--initial-advertise-peer-urls https://10.1.1.22:2380

# ETCDCTL_API=3 etcdctl snapshot restore snapshot-77.db \
--name 10.1.1.23 \
--initial-cluster 10.1.1.21=https://10.1.1.21:2380,10.1.1.22=https://10.1.1.22:2380,10.1.1.23=https://10.1.1.23:2380 \
--initial-cluster-token my-etcd-token \
--initial-advertise-peer-urls https://10.1.1.23:2380
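
Without a --data-dir flag, etcdctl snapshot restore writes the restored data to ./<name>.etcd (for example ./10.1.1.21.etcd). A rough sketch of swapping it in on one node, assuming etcd runs as a systemd service with its data under /var/lib/etcd, looks like this; repeat on the other nodes with their own restored directories:

# systemctl stop etcd
# mv /var/lib/etcd /var/lib/etcd.old
# mv ./10.1.1.21.etcd /var/lib/etcd
# systemctl start etcd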

Run the following command on the master node:

# kubeadm token create --print-join-command

It will give you a kubeadm join command; add --ignore-preflight-errors to it and run that command on the other two nodes for them to come into the Ready state.
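
The printed command will look roughly like the following; the token and hash here are placeholders, and the extra flag is simply appended at the end:

# kubeadm join 10.1.1.21:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash> --ignore-preflight-errors=all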

Even though there are many popular backup solutions available now, such as Velero, and you can make use of those tools to back up the cluster. For example, you can refer to https://foxutech.com/how-to-take-azure-kubernetes-backup-using-velero/.
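
As a brief illustration of Velero usage (assuming Velero is already installed and configured with a backup storage location), a full-cluster backup and a restore from it look roughly like this:

# velero backup create cluster-backup
# velero restore create --from-backup cluster-backup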
