This is the multi-page printable view of this section. Click here to print.
etcd backup and restore
1 - External etcd backup and restore
Note
External ETCD topology is supported for vSphere, CloudStack, Snow and Nutanix clusters, but not yet for Bare Metal clusters.This page contains steps for backing up a cluster by taking an ETCD snapshot, and restoring the cluster from a snapshot.
Use case
EKS-Anywhere clusters use ETCD as the backing store. Taking a snapshot of ETCD backs up the entire cluster data. This can later be used to restore a cluster back to an earlier state if required.
ETCD backups can be taken prior to cluster upgrade, so if the upgrade doesn’t go as planned, you can restore from the backup.
Important
Restoring to a previous cluster state is a destructive and destablizing action to take on a running cluster. It should be considered only when all other options have been exhausted.
If you are able to retrieve data using the Kubernetes API server, then etcd is available and you should not restore using an etcd backup.
2 - Bottlerocket
Note
External etcd topology is supported for vSphere, CloudStack, Snow and Nutanix clusters, but not yet for Bare Metal clusters.This guide requires some common shell tools such as:
grep
xargs
ssh
scp
cut
Make sure you have these installed on your admin machine before continuing.
Admin machine environment variables setup
On your admin machine, set the following environment variables that will later come in handy
export MANAGEMENT_CLUSTER_NAME="eksa-management" # Set this to the management cluster name
export CLUSTER_NAME="eksa-workload" # Set this to name of the cluster you want to backup (management or workload)
export SSH_KEY="path-to-private-ssh-key" # Set this to the cluster's private SSH key path
export SSH_USERNAME="ec2-user" # Set this to the SSH username
export SNAPSHOT_PATH="/tmp/snapshot.db" # Set this to the path where you want the etcd snapshot to be saved
export MANAGEMENT_KUBECONFIG=${MANAGEMENT_CLUSTER_NAME}/${MANAGEMENT_CLUSTER_NAME}-eks-a-cluster.kubeconfig
export CLUSTER_KUBECONFIG=${CLUSTER_NAME}/${CLUSTER_NAME}-eks-a-cluster.kubeconfig
export ETCD_ENDPOINTS=$(kubectl --kubeconfig=${MANAGEMENT_KUBECONFIG} -n eksa-system get machines --selector cluster.x-k8s.io/cluster-name=${CLUSTER_NAME},cluster.x-k8s.io/etcd-cluster=${CLUSTER_NAME}-etcd -ojsonpath='{.items[*].status.addresses[0].address}')
export CONTROL_PLANE_ENDPOINTS=($(kubectl --kubeconfig=${MANAGEMENT_KUBECONFIG} -n eksa-system get machines --selector cluster.x-k8s.io/control-plane-name=${CLUSTER_NAME} -ojsonpath='{.items[*].status.addresses[0].address}'))
Prepare etcd nodes for backup and restore
Install SCP on the etcd nodes:
echo -n ${ETCD_ENDPOINTS} | xargs -I {} -d" " ssh -o StrictHostKeyChecking=no -i ${SSH_KEY} ${SSH_USERNAME}@{} sudo yum -y install openssh-clients
Create etcd Backup
Make sure to setup the admin environment variables and prepare your ETCD nodes for backup before moving forward.
-
SSH into one of the etcd nodes
export ETCD_NODE=$(echo -n ${ETCD_ENDPOINTS} | cut -d " " -f1) ssh -i ${SSH_KEY} ${SSH_USERNAME}@${ETCD_NODE}
-
Drop into Bottlerocket’s root shell
sudo sheltie
-
Set these environment variables
# get the container ID corresponding to etcd pod export ETCD_CONTAINER_ID=$(ctr -n k8s.io c ls | grep -w "etcd-io" | head -1 | cut -d " " -f1) # get the etcd endpoint export ETCD_ENDPOINT=$(cat /etc/kubernetes/manifests/etcd | grep -wA1 ETCD_ADVERTISE_CLIENT_URLS | tail -1 | grep -oE '[^ ]+$')
-
Create the etcd snapshot
ctr -n k8s.io t exec -t --exec-id etcd ${ETCD_CONTAINER_ID} etcdctl \ --endpoints=${ETCD_ENDPOINT} \ --cacert=/var/lib/etcd/pki/ca.crt \ --cert=/var/lib/etcd/pki/server.crt \ --key=/var/lib/etcd/pki/server.key \ snapshot save /var/lib/etcd/data/etcd-backup.db
ctr -n k8s.io t exec -t --exec-id etcd ${ETCD_CONTAINER_ID} etcdctl \ --endpoints=${ETCD_ENDPOINT} \ --cacert=/var/lib/etcd/pki/ca.crt \ --cert=/var/lib/etcd/pki/server.crt \ --key=/var/lib/etcd/pki/server.key \ snapshot save /var/lib/etcd/data/etcd-backup.db
-
Move the snapshot to another directory and set proper permissions
mv /var/lib/etcd/data/etcd-backup.db /run/host-containerd/io.containerd.runtime.v2.task/default/admin/rootfs/home/ec2-user/snapshot.db chown 1000 /run/host-containerd/io.containerd.runtime.v2.task/default/admin/rootfs/home/ec2-user/snapshot.db
-
Exit out of etcd node. You will have to type
exit
twice to get back to the admin machineexit exit
-
Copy over the snapshot from the etcd node
scp -i ${SSH_KEY} ${SSH_USERNAME}@${ETCD_NODE}:/home/ec2-user/snapshot.db ${SNAPSHOT_PATH}
You should now have the etcd snapshot in your current working directory.
Restore etcd from Backup
Make sure to setup the admin environment variables and prepare your etcd nodes for restore before moving forward.
-
Pause cluster reconciliation
Before starting the process of restoring etcd, you have to pause some cluster reconciliation objects so EKS Anywhere doesn’t try to perform any operations on the cluster while you restore the etcd snapshot.
kubectl annotate clusters.anywhere.eks.amazonaws.com $CLUSTER_NAME anywhere.eks.amazonaws.com/paused=true --kubeconfig=$MANAGEMENT_KUBECONFIG kubectl patch clusters.cluster.x-k8s.io $CLUSTER_NAME --type merge -p '{"spec":{"paused": true}}' -n eksa-system --kubeconfig=$MANAGEMENT_KUBECONFIG
-
Stop control plane core components
You also need to stop the control plane core components so the Kubernetes API server doesn’t try to communicate with etcd while you perform etcd operations.
- You can use this command to get the control plane node IPs which you can use to SSH
echo -n ${CONTROL_PLANE_ENDPOINTS[@]} | xargs -I {} -d " " echo "{}"
- SSH into the node and stop the core components. You must do this for each control plane node.
# SSH into the control plane node using the IPs printed in previous command ssh -i ${SSH_KEY} ${SSH_USERNAME}@<Control Plane IP from previous command> # drop into bottlerocket root shell sudo sheltie # create a temporary directory and move the static manifests to it mkdir -p /tmp/temp-manifests mv /etc/kubernetes/manifests/* /tmp/temp-manifests
- Exit out of the Bottlerocket node
# exit from bottlerocket's root shell exit # exit from bottlerocket node exit
Repeat these steps for each control plane node.
-
Copy the backed-up etcd snapshot to all the etcd nodes
echo -n ${ETCD_ENDPOINTS} | xargs -I {} -d" " scp -o StrictHostKeyChecking=no -i ${SSH_KEY} ${SNAPSHOT_PATH} ${SSH_USERNAME}@{}:/home/ec2-user
-
Perform the etcd restore
For this step, you have to SSH into each etcd node and run the restore command.
- Get etcd nodes IPs for SSH’ing into the nodes
# This should print out all the etcd IPs echo -n ${ETCD_ENDPOINTS} | xargs -I {} -d " " echo "{}"
# SSH into the etcd node using the IPs printed in previous command ssh -i ${SSH_KEY} ${SSH_USERNAME}@<etcd IP from previous command> # drop into bottlerocket\'s root shell sudo sheltie # copy over the etcd snapshot to the appropriate location cp /run/host-containerd/io.containerd.runtime.v2.task/default/admin/rootfs/home/ec2-user/snapshot.db /var/lib/etcd/data/etcd-snapshot.db # setup the etcd environment export ETCD_NAME=$(cat /etc/kubernetes/manifests/etcd | grep -wA1 ETCD_NAME | tail -1 | grep -oE '[^ ]+$') export ETCD_INITIAL_ADVERTISE_PEER_URLS=$(cat /etc/kubernetes/manifests/etcd | grep -wA1 ETCD_INITIAL_ADVERTISE_PEER_URLS | tail -1 | grep -oE '[^ ]+$') export ETCD_INITIAL_CLUSTER=$(cat /etc/kubernetes/manifests/etcd | grep -wA1 ETCD_INITIAL_CLUSTER | tail -1 | grep -oE '[^ ]+$') export INITIAL_CLUSTER_TOKEN="etcd-cluster-1" # get the container ID corresponding to etcd pod export ETCD_CONTAINER_ID=$(ctr -n k8s.io c ls | grep -w "etcd-io" | head -1 | cut -d " " -f1)
# run the restore command ctr -n k8s.io t exec -t --exec-id etcd ${ETCD_CONTAINER_ID} etcdctl \ snapshot restore /var/lib/etcd/data/etcd-snapshot.db \ --name=${ETCD_NAME} \ --initial-cluster=${ETCD_INITIAL_CLUSTER} \ --initial-cluster-token=${INITIAL_CLUSTER_TOKEN} \ --initial-advertise-peer-urls=${ETCD_INITIAL_ADVERTISE_PEER_URLS} \ --cacert=/var/lib/etcd/pki/ca.crt \ --cert=/var/lib/etcd/pki/server.crt \ --key=/var/lib/etcd/pki/server.key
# run the restore command ctr -n k8s.io t exec -t --exec-id etcd ${ETCD_CONTAINER_ID} etcdctl \ snapshot restore /var/lib/etcd/data/etcd-snapshot.db \ --name=${ETCD_NAME} \ --initial-cluster=${ETCD_INITIAL_CLUSTER} \ --initial-cluster-token=${INITIAL_CLUSTER_TOKEN} \ --initial-advertise-peer-urls=${ETCD_INITIAL_ADVERTISE_PEER_URLS} \ --cacert=/var/lib/etcd/pki/ca.crt \ --cert=/var/lib/etcd/pki/server.crt \ --key=/var/lib/etcd/pki/server.key
# move the etcd data files out of the container to a temporary location mkdir -p /tmp/etcd-files $(ctr -n k8s.io snapshot mounts /tmp/etcd-files/ ${ETCD_CONTAINER_ID}) mv /tmp/etcd-files/${ETCD_NAME}.etcd /tmp/ # stop the etcd pod mkdir -p /tmp/temp-manifests mv /etc/kubernetes/manifests/* /tmp/temp-manifests # backup the previous etcd data files mv /var/lib/etcd/data/member /var/lib/etcd/data/member.backup # copy over the new etcd data files to the data directory mv /tmp/${ETCD_NAME}.etcd/member /var/lib/etcd/data/ # copy the WAL folder for the cluster member data before the restore to the data/member directory cp -r /var/lib/etcd/data/member.backup/wal /var/lib/etcd/data/member # re-start the etcd pod mv /tmp/temp-manifests/* /etc/kubernetes/manifests/
- Cleanup temporary files and folders
# clean up all the temporary files umount /tmp/etcd-files rm -rf /tmp/temp-manifests /tmp/${ETCD_NAME}.etcd /tmp/etcd-files/ /var/lib/etcd/data/etcd-snapshot.db
- Exit out of the Bottlerocket node
# exit from bottlerocket's root shell exit # exit from bottlerocket node exit
Repeat this step for each etcd node.
-
Restart control plane core components
- You can use this command to get the control plane node IPs which you can use to SSH
echo -n ${CONTROL_PLANE_ENDPOINTS[@]} | xargs -I {} -d " " echo "{}"
- SSH into the node and restart the core components. You must do this for each control plane node.
# SSH into the control plane node using the IPs printed in previous command ssh -i ${SSH_KEY} ${SSH_USERNAME}@<Control Plane IP from previous command> # drop into bottlerocket root shell sudo sheltie # move the static manifests back to the right directory mv /tmp/temp-manifests/* /etc/kubernetes/manifests/
- Exit out of the Bottlerocket node
# exit from bottlerocket's root shell exit # exit from bottlerocket node exit
Repeat these steps for each control plane node.
-
Unpause the cluster reconcilers
Once the etcd restore is complete, you can resume the cluster reconcilers.
kubectl annotate clusters.anywhere.eks.amazonaws.com $CLUSTER_NAME anywhere.eks.amazonaws.com/paused- --kubeconfig=$MANAGEMENT_KUBECONFIG kubectl patch clusters.cluster.x-k8s.io $CLUSTER_NAME --type merge -p '{"spec":{"paused": false}}' -n eksa-system --kubeconfig=$MANAGEMENT_KUBECONFIG
At this point you should have the etcd cluster restored to snapshot. To verify, you can run the following commands:
kubectl --kubeconfig=${CLUSTER_KUBECONFIG} get nodes
kubectl --kubeconfig=${CLUSTER_KUBECONFIG} get pods -A
You may also need to restart some deployments/daemonsets manually if they are stuck in an unhealthy state.
3 - Ubuntu and RHEL
Note
External etcd topology is supported for vSphere, CloudStack, Snow and Nutanix clusters, but not yet for Bare Metal clusters.This page contains steps for backing up a cluster by taking an etcd snapshot, and restoring the cluster from a snapshot. These steps are for an EKS Anywhere cluster provisioned using the external etcd topology (selected by default) with Ubuntu OS.
Use case
EKS-Anywhere clusters use etcd as the backing store. Taking a snapshot of etcd backs up the entire cluster data. This can later be used to restore a cluster back to an earlier state if required. Etcd backups can be taken prior to cluster upgrade, so if the upgrade doesn’t go as planned you can restore from the backup.
Backup
Etcd offers a built-in snapshot mechanism. You can take a snapshot using the etcdctl snapshot save
or etcdutl snapshot save
command by following the steps given below.
Note
The following commands use ec2-user as the username. For EKS Anywhere on vSphere, Bare Metal, and Snow, the default username is ec2-user. For EKS Anywhere on Apache CloudStack, the default username is capc. For EKS Anywhere on Nutanix, the default username is eksa. The default username cannot be changed.- Login to any one of the etcd VMs
ssh -i $PRIV_KEY ec2-user@$ETCD_VM_IP
- Run the etcdctl or etcdutl command to take a snapshot with the following steps
sudo su source /etc/etcd/etcdctl.env etcdctl snapshot save snapshot.db chown ec2-user snapshot.db
sudo su etcdutl snapshot save snapshot.db chown ec2-user snapshot.db
- Exit the VM. Copy the snapshot from the VM to your local/admin setup where you can save snapshots in a secure place. Before running scp, make sure you don’t already have a snapshot file saved by the same name locally.
scp -i $PRIV_KEY ec2-user@$ETCD_VM_IP:/home/ec2-user/snapshot.db .
NOTE: This snapshot file contains all information stored in the cluster, so make sure you save it securely (encrypt it).
Restore
Restoring etcd is a 2-part process. The first part is restoring etcd using the snapshot, creating a new data-dir for etcd. The second part is replacing the current etcd data-dir with the one generated after restore. During etcd data-dir replacement, we cannot have any kube-apiserver instances running in the cluster. So we will first stop all instances of kube-apiserver and other controlplane components using the following steps for every controlplane VM:
Pausing Etcdadm cluster and control plane machine health check reconcile
During restore, it is required to pause the Etcdadm controller reconcile and the control plane machine healths checks for the target cluster (whether it is management or workload cluster). To do that, you need to add a cluster.x-k8s.io/paused
annotation to the target cluster’s etcdadmclusters
and machinehealthchecks
resources. For example,
kubectl annotate clusters.anywhere.eks.amazonaws.com $CLUSTER_NAME anywhere.eks.amazonaws.com/paused=true --kubeconfig mgmt-cluster.kubeconfig
kubectl patch clusters.cluster.x-k8s.io $CLUSTER_NAME --type merge -p '{"spec":{"paused": true}}' -n eksa-system --kubeconfig mgmt-cluster.kubeconfig
Stopping the controlplane components
- Login to a controlplane VM
ssh -i $PRIV_KEY ec2-user@$CONTROLPLANE_VM_IP
- Stop controlplane components by moving the static pod manifests under a temp directory:
sudo su
mkdir temp-manifests
mv /etc/kubernetes/manifests/*.yaml temp-manifests
- Repeat these steps for all other controlplane VMs
After this you can restore etcd from a saved snapshot using the snapshot save
command following the steps given below.
Restoring from the snapshot
- The snapshot file should be made available in every etcd VM of the cluster. You can copy it to each etcd VM using this command:
scp -i $PRIV_KEY snapshot.db ec2-user@$ETCD_VM_IP:/home/ec2-user
- To run the etcdctl or etcdutl snapshot restore command, you need to provide the following configuration parameters:
- name: This is the name of the etcd member. The value of this parameter should match the value used while starting the member. This can be obtained by running:
sudo su
export ETCD_NAME=$(cat /etc/etcd/etcd.env | grep ETCD_NAME | awk -F'=' '{print $2}')
- initial-advertise-peer-urls: This is the advertise peer URL with which this etcd member was configured. It should be the exact value with which this etcd member was started. This can be obtained by running:
export ETCD_INITIAL_ADVERTISE_PEER_URLS=$(cat /etc/etcd/etcd.env | grep ETCD_INITIAL_ADVERTISE_PEER_URLS | awk -F'=' '{print $2}')
- initial-cluster: This should be a comma-separated mapping of etcd member name and its peer URL. For this, get the
ETCD_NAME
andETCD_INITIAL_ADVERTISE_PEER_URLS
values for each member and join them. And then use this exact value for all etcd VMs. For example, for a 3 member etcd cluster this is what the value would look like (The command below cannot be run directly without substituting the required variables and is meant to be an example)
export ETCD_INITIAL_CLUSTER=${ETCD_NAME_1}=${ETCD_INITIAL_ADVERTISE_PEER_URLS_1},${ETCD_NAME_2}=${ETCD_INITIAL_ADVERTISE_PEER_URLS_2},${ETCD_NAME_3}=${ETCD_INITIAL_ADVERTISE_PEER_URLS_3}
- initial-cluster-token: Set this to a unique value and use the same value for all etcd members of the cluster. It can be any value such as
etcd-cluster-1
as long as it hasn’t been used before.
- Gather the required env vars for the restore command
cat <<EOF >> restore.env
export ETCD_NAME=$(cat /etc/etcd/etcd.env | grep ETCD_NAME | awk -F'=' '{print $2}')
export ETCD_INITIAL_ADVERTISE_PEER_URLS=$(cat /etc/etcd/etcd.env | grep ETCD_INITIAL_ADVERTISE_PEER_URLS | awk -F'=' '{print $2}')
EOF
cat /etc/etcd/etcdctl.env >> restore.env
-
Make sure you form the correct
ETCD_INITIAL_CLUSTER
value using all etcd members, and set it as an env var in the restore.env file created in the above step. -
Once you have obtained all the right values, run the following commands to restore etcd replacing the required values:
sudo su source restore.env etcdctl snapshot restore snapshot.db \ --name=${ETCD_NAME} \ --initial-cluster=${ETCD_INITIAL_CLUSTER} \ --initial-cluster-token=etcd-cluster-1 \ --initial-advertise-peer-urls=${ETCD_INITIAL_ADVERTISE_PEER_URLS}
sudo su source restore.env etcdutl snapshot restore snapshot.db \ --name=${ETCD_NAME} \ --initial-cluster=${ETCD_INITIAL_CLUSTER} \ --initial-cluster-token=etcd-cluster-1 \ --initial-advertise-peer-urls=${ETCD_INITIAL_ADVERTISE_PEER_URLS}
-
This is going to create a new data-dir for the restored contents under a new directory
{ETCD_NAME}.etcd
. To start using this, restart etcd with the new data-dir with the following steps:
systemctl stop etcd.service
mv /var/lib/etcd/member /var/lib/etcd/member.bak
mv ${ETCD_NAME}.etcd/member /var/lib/etcd/
- Perform this directory swap on all etcd VMs, and then start etcd again on those VMs
systemctl start etcd.service
NOTE: Until the etcd process is started on all VMs, it might appear stuck on the VMs where it was started first, but this should be temporary.
Starting the controlplane components
- Login to a controlplane VM
ssh -i $PRIV_KEY ec2-user@$CONTROLPLANE_VM_IP
- Start the controlplane components by moving back the static pod manifests from under the temp directory to the /etc/kubernetes/manifests directory:
mv temp-manifests/*.yaml /etc/kubernetes/manifests
- Repeat these steps for all other controlplane VMs
- It may take a few minutes for the kube-apiserver and the other components to get restarted. After this you should be able to access all objects present in the cluster at the time the backup was taken.
Resuming Etcdadm cluster and control plane machine health checks reconcile
Resume Etcdadm cluster reconcile and control plane machine health checks for the target cluster by removing the cluster.x-k8s.io/paused
annotation in the target cluster’s resource. For example,
kubectl annotate clusters.anywhere.eks.amazonaws.com $CLUSTER_NAME anywhere.eks.amazonaws.com/paused- --kubeconfig mgmt-cluster.kubeconfig
kubectl patch clusters.cluster.x-k8s.io $CLUSTER_NAME --type merge -p '{"spec":{"paused": false}}' -n eksa-system --kubeconfig mgmt-cluster.kubeconfig