Kubernetes Backup and Disaster Recovery

Back up both etcd (cluster state) and persistent volume data. Use Velero or a similar tool for app-level backups. Test restores in a separate cluster, and document the restore order (etcd, then PVs, then workloads).

What to back up

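etcd: Holds the cluster state (namespaces, Deployments, Services, Secrets, and so on). Without it you cannot restore the cluster as it was. Snapshot etcd regularly: many managed Kubernetes services back up the control plane for you; on self-managed clusters, take snapshots with etcdctl snapshot save. Note that Velero does not snapshot etcd itself; it backs up resources through the Kubernetes API.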
Persistent volumes (PV/PVC): Application data lives here. Back it up with whatever the storage driver supports (volume snapshots, export to object storage, or Velero volume backup). Make sure the backup is consistent, for example by quiescing the app or using a consistent snapshot method.
Manifests and config: Keep these in Git (GitOps, with Git as the source of truth) or as exported YAML. Velero can back up resources too; having manifests in Git gives you a second, independent way to recreate objects.
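
For the self-managed etcd case above, a snapshot job could be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the helper name etcd_snapshot_command is hypothetical, the endpoint and certificate paths are assumed kubeadm-style defaults that may differ on your cluster, and the code only builds the etcdctl snapshot save command line rather than running it.

```python
import shlex
from datetime import datetime, timezone
from typing import List, Optional


def etcd_snapshot_command(
    snapshot_dir: str,
    endpoint: str = "https://127.0.0.1:2379",          # assumption: local etcd member
    cacert: str = "/etc/kubernetes/pki/etcd/ca.crt",    # assumption: kubeadm-style paths
    cert: str = "/etc/kubernetes/pki/etcd/server.crt",
    key: str = "/etc/kubernetes/pki/etcd/server.key",
    now: Optional[datetime] = None,
) -> List[str]:
    """Build the argv for `etcdctl snapshot save` with a timestamped file name."""
    now = now or datetime.now(timezone.utc)
    # A UTC timestamp in the file name keeps snapshots sortable and unique.
    snapshot_path = f"{snapshot_dir.rstrip('/')}/etcd-{now:%Y%m%dT%H%M%SZ}.db"
    return [
        "etcdctl", "snapshot", "save", snapshot_path,
        f"--endpoints={endpoint}",
        f"--cacert={cacert}",
        f"--cert={cert}",
        f"--key={key}",
    ]


if __name__ == "__main__":
    # Print the command for review; a real job might pass the list to subprocess.run.
    print(shlex.join(etcd_snapshot_command("/var/backups/etcd")))
```

Building the command as a list (rather than a shell string) avoids quoting problems if it is later handed to subprocess.run. The snapshot file contains Secrets, so treat it like any other sensitive backup.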

Velero and app-level backup

Velero: Backs up cluster resources and, optionally, PV data (via CSI snapshots or file-system backup, which uses restic or, in newer releases, Kopia). Schedule backups and store them in object storage (S3-compatible). Restore to the same cluster or to another one.
Order of restore: When rebuilding a cluster, typically restore etcd (or the cluster state) first, then PVs, then workloads. With Velero, a single restore can handle resources and volumes together; verify the order in a drill.
Disaster recovery: Keep a runbook that covers restoring etcd (if applicable), then PVs, then workloads (with Velero or kubectl apply). Test it periodically in a separate cluster.
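
The schedule and restore steps above might be scripted along these lines. This is a sketch under stated assumptions: the function names, the backup and namespace names, the 720h retention, and the /var/lib/etcd-restored data directory are all hypothetical, and the code only assembles velero and etcdutl command lines in runbook order without executing them.

```python
import shlex
from typing import List, Optional


def velero_schedule_command(
    name: str, cron: str, namespaces: List[str], ttl: str = "720h"
) -> List[str]:
    """Build `velero schedule create` for a recurring backup."""
    return [
        "velero", "schedule", "create", name,
        f"--schedule={cron}",                             # standard cron expression
        f"--include-namespaces={','.join(namespaces)}",
        f"--ttl={ttl}",                                   # assumption: 30-day retention
    ]


def velero_restore_command(backup_name: str) -> List[str]:
    """Build `velero restore create` from a named backup, waiting for completion."""
    return ["velero", "restore", "create", f"--from-backup={backup_name}", "--wait"]


def dr_runbook(
    backup_name: str,
    etcd_snapshot: Optional[str] = None,
    etcd_data_dir: str = "/var/lib/etcd-restored",        # hypothetical target directory
) -> List[List[str]]:
    """Return restore commands in order: etcd first (if applicable), then Velero."""
    steps: List[List[str]] = []
    if etcd_snapshot is not None:
        # Only relevant when rebuilding a self-managed control plane.
        steps.append(
            ["etcdutl", "snapshot", "restore", etcd_snapshot, f"--data-dir={etcd_data_dir}"]
        )
    # Velero restores resources and volume data in one pass.
    steps.append(velero_restore_command(backup_name))
    return steps


if __name__ == "__main__":
    print(shlex.join(velero_schedule_command("nightly", "0 2 * * *", ["shop"])))
    for step in dr_runbook("nightly-example", etcd_snapshot="/var/backups/etcd/snap.db"):
        print(shlex.join(step))
```

Keeping the runbook as an ordered list of commands makes the restore order explicit and easy to review before a drill; in a real drill each step would still need verification (for example, checking that pods and volumes come up) before moving to the next.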

Summary

Back up etcd and PVC data; use Velero for app-level and volume backup. Test restore in a separate cluster; document the order (etcd, PVs, workloads). Run DR drills.