Disaster Recovery Planning for Hosted Systems

Define a recovery time objective (RTO) and a recovery point objective (RPO) for each system. Backups and replication are the foundation, and restores should be tested regularly. For critical systems, consider running in multiple sites or failing over to another region.

RTO and RPO

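RTO (recovery time objective): Maximum acceptable downtime, i.e. how quickly the system must be back in service.
RPO (recovery point objective): Maximum acceptable data loss, measured as time, i.e. how old the most recent recoverable data may be.
Set these per system or per tier; a critical database may need a tighter RTO and RPO than static assets.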
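
As a rough illustration of setting objectives per tier, the sketch below records an RTO and RPO for two tiers and checks a backup schedule and an estimated restore time against them. The tier names, the values, and the helpers `meets_rpo` and `meets_rto` are hypothetical, and the RPO check assumes that worst-case data loss is about one full backup interval.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjective:
    rto: timedelta  # maximum acceptable downtime
    rpo: timedelta  # maximum acceptable data loss, measured as time


# Hypothetical tiers and values, for illustration only.
OBJECTIVES = {
    "critical-db": RecoveryObjective(rto=timedelta(minutes=15), rpo=timedelta(minutes=5)),
    "static-assets": RecoveryObjective(rto=timedelta(hours=8), rpo=timedelta(hours=24)),
}


def meets_rpo(tier: str, backup_interval: timedelta) -> bool:
    """Assume worst-case data loss is roughly one full backup interval."""
    return backup_interval <= OBJECTIVES[tier].rpo


def meets_rto(tier: str, estimated_restore_time: timedelta) -> bool:
    """Compare a measured or estimated restore duration with the tier's RTO."""
    return estimated_restore_time <= OBJECTIVES[tier].rto


if __name__ == "__main__":
    # Hourly backups cannot satisfy a 5-minute RPO; replication would be needed.
    print(meets_rpo("critical-db", timedelta(hours=1)))
    print(meets_rpo("static-assets", timedelta(hours=24)))
```

A check like this is only as good as the restore time fed into it, which is one reason to measure restores during tests rather than estimate them.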

Backups and replication

Backups: Scheduled, encrypted, and stored off-server or in another region. Test a restore at least quarterly.
Replication: The database, and sometimes application state, is replicated to a secondary site for fast failover.
Snapshots: Quick point-in-time copies kept on the same storage as the source. They do not protect against loss of that storage, so complement them with off-site backups for DR.
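
One way to make restore testing concrete is sketched below: compare a restored file against a checksum recorded at backup time, and flag a system whose last restore test is more than a quarter old. The function names are hypothetical, "quarterly" is approximated as 92 days, and a real test would usually also start the restored service and run application-level checks.

```python
import hashlib
from datetime import date, timedelta
from pathlib import Path

# Assumption: "quarterly" is approximated as 92 days.
QUARTER = timedelta(days=92)


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large backups do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches(restored: Path, expected_sha256: str) -> bool:
    """True if the restored file has the checksum recorded at backup time."""
    return sha256_of(restored) == expected_sha256


def restore_test_overdue(last_test: date, today: date) -> bool:
    """True if more than a quarter has passed since the last restore test."""
    return today - last_test > QUARTER
```

A checksum match only shows that the bytes came back intact; it does not show that the application can actually use them.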

Failover and multi-site

Failover: An automated or manual switch to a standby when the primary fails. It typically requires a DNS or load balancer update to redirect traffic.
Multi-site: Run active-active or active-passive in more than one region; this adds cost and complexity but improves resilience.
Runbooks: Document the steps to declare a failover, restore from backup, and verify the result. Run drills to rehearse them.
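
The decision step of an automated failover can be sketched as a small state machine that switches to the standby only after several consecutive failed health checks, so a single blip does not trigger a switch. The class name, the threshold of three, and the choice to make failback manual are assumptions for illustration; the actual DNS or load balancer update is left out.

```python
from dataclasses import dataclass


@dataclass
class FailoverMonitor:
    failure_threshold: int = 3      # assumed value; tune per system
    consecutive_failures: int = 0
    active: str = "primary"

    def record(self, primary_healthy: bool) -> str:
        """Record one health-check result and return the site that should serve traffic."""
        if self.active == "standby":
            # Failback is treated as a deliberate manual step in this sketch.
            return self.active
        if primary_healthy:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.active = "standby"
        return self.active


if __name__ == "__main__":
    monitor = FailoverMonitor()
    for healthy in [True, False, False, True, False, False, False]:
        print(monitor.record(healthy))
```

In a drill, the same sequence of health-check results can be replayed to confirm that the switch happens when the runbook says it should.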

Summary

Define RTO and RPO per system, rely on backups and replication, and test restores. For critical systems, plan failover or multi-site operation and keep runbooks up to date.