Disaster recovery runbook
What we'll do if something serious goes wrong, and what you can do today to be ready.
Tiers of failure (and our response)
1. Process crash: auto-restart within seconds.
2. Container failure: also auto-restarted within seconds, no customer impact.
3. Host failure: manual restore to a standby VM from the nightly backup. RTO ~2h.
4. Data centre failure: switch to the standby DC. RTO ~4h.
5. Catastrophic loss: restore from off-region encrypted backups. RPO ≤ 24h.
What we do for you
Each tenant's site is backed up nightly into sites/<tenant>/private/backups/ (files, database, and settings). Encrypted snapshots are also kept off-host. A nightly backup-drill picks one tenant, restores it into a sandbox, and runs a sanity probe against the result. Drill failures are reported in HQ and on Telegram.
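As an illustration only, the nightly drill described above could be sketched roughly like this. It is not our actual job: the function name run_backup_drill, the restore/probe/notify callables, and the choice to pick the tenant at random are all assumptions made for the example.

```python
import random
from typing import Callable, Optional, Sequence


def run_backup_drill(
    tenants: Sequence[str],
    restore: Callable[[str], str],   # hypothetical: restores a tenant, returns a sandbox id
    probe: Callable[[str], bool],    # hypothetical: sanity probe against the sandbox
    notify: Callable[[str], None],   # hypothetical: sends the alert (e.g. HQ, Telegram)
    rng: Optional[random.Random] = None,
) -> bool:
    """Restore one tenant into a sandbox and probe it.

    Returns True when the drill passes. On any failure, calls notify()
    with a short message and returns False.
    """
    if not tenants:
        raise ValueError("no tenants to drill")
    rng = rng or random.Random()
    # Assumption: the tenant is picked at random; the real job may rotate instead.
    tenant = rng.choice(list(tenants))
    try:
        sandbox = restore(tenant)
        ok = probe(sandbox)
    except Exception as exc:
        notify(f"backup-drill failed for {tenant}: {exc}")
        return False
    if not ok:
        notify(f"backup-drill failed for {tenant}: sanity probe did not pass")
        return False
    return True
```

Passing the restore, probe, and notify steps in as callables keeps the sketch independent of any particular backup tool or alert channel.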
What you can do today
1. Enable warehouse so ClickHouse-side history covers your restore window.
2. Designate a recovery contact (phone + email) in your tenant.
3. Keep at least two ops users in HQ with TOTP enrolled.
4. Keep Knowledge Documents current. They drive customer self-service during a recovery window.
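If you want to track the first three items yourself, a small self-check along these lines may help. It is a minimal sketch: the TenantReadiness and OpsUser types and all their field names are hypothetical and do not correspond to any HQ API. Item 4 (keeping Knowledge Documents current) needs human review, so it is left out.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class OpsUser:
    name: str
    totp_enrolled: bool


@dataclass
class TenantReadiness:
    # All field names are hypothetical, chosen for illustration only.
    warehouse_enabled: bool = False
    recovery_phone: str = ""
    recovery_email: str = ""
    ops_users: List[OpsUser] = field(default_factory=list)


def readiness_gaps(t: TenantReadiness) -> List[str]:
    """Return the checklist items still unmet; an empty list means ready."""
    gaps = []
    if not t.warehouse_enabled:
        gaps.append("enable warehouse")
    # The recovery contact needs both a phone number and an email address.
    if not (t.recovery_phone and t.recovery_email):
        gaps.append("designate a recovery contact with phone and email")
    # Count only ops users who have completed TOTP enrolment.
    enrolled = sum(1 for u in t.ops_users if u.totp_enrolled)
    if enrolled < 2:
        gaps.append("keep at least two ops users with TOTP enrolled")
    return gaps
```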
Communications during an incident
We post updates on status.mojaedge.com every 30 minutes until the incident is resolved. We also send SMS and email to your designated recovery contact, and Telegram/Slack messages if you are subscribed.
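If you monitor the status page from your own tooling, the 30-minute cadence can be turned into a simple staleness check. This is a sketch under assumptions: the helper names are made up for the example, and how you obtain the timestamp of the last update is up to you.

```python
from datetime import datetime, timedelta

# The 30-minute cadence stated above.
UPDATE_INTERVAL = timedelta(minutes=30)


def next_update_due(last_update: datetime) -> datetime:
    """Time by which the next status update is expected."""
    return last_update + UPDATE_INTERVAL


def is_update_overdue(last_update: datetime, now: datetime) -> bool:
    """True when more than 30 minutes have passed since the last update."""
    return now > next_update_due(last_update)
```

For example, with a last update at 10:00, is_update_overdue returns False at 10:30 exactly and True at 10:31.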