Operations

Disaster recovery runbook

What we'll do if something serious goes wrong, and what you can do today to be ready.

Tiers of failure (and our response)

1. Process crash: auto-restart within seconds.
2. Container failure: also auto-restarted within seconds, with no customer impact.
3. Host failure: manual restore to a standby VM from the nightly backup. Recovery time objective (RTO) about 2 hours.
4. Data centre failure: switch to the standby DC. RTO about 4 hours.
5. Catastrophic loss: restore from off-region encrypted backups. Recovery point objective (RPO) ≤ 24 hours.
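
For teams that want to reference these tiers from their own tooling, the list above can be written down as a small lookup table. This is only an illustrative sketch: the `Tier` class, the `TIERS` list, and the `tiers_with_rto_within` helper are hypothetical names, not part of any product API. Where the runbook states no RTO or RPO for a tier, the value is left as `None` rather than guessed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Tier:
    level: int
    failure: str
    response: str
    rto_hours: Optional[float]  # None where the runbook states no RTO
    rpo_hours: Optional[float]  # None where the runbook states no RPO


# Values copied from the tier list; nothing here is inferred.
TIERS: List[Tier] = [
    Tier(1, "process crash", "auto-restart within seconds", None, None),
    Tier(2, "container failure", "auto-restart, no customer impact", None, None),
    Tier(3, "host failure", "manual restore to standby VM from nightly backup", 2, None),
    Tier(4, "data centre failure", "switch to standby DC", 4, None),
    Tier(5, "catastrophic loss", "restore from off-region encrypted backups", None, 24),
]


def tiers_with_rto_within(max_hours: float) -> List[Tier]:
    """Return the tiers whose stated RTO is at most max_hours."""
    return [t for t in TIERS if t.rto_hours is not None and t.rto_hours <= max_hours]
```

Calling `tiers_with_rto_within(2)` returns only the host-failure tier, since it is the only one with a stated RTO of two hours or less.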

What we do for you

Each tenant's site is backed up nightly into sites/<tenant>/private/backups/ (files, database, and settings). We also keep encrypted off-host snapshots. A nightly backup drill picks one tenant, restores its backup into a sandbox, and runs a sanity probe. Drill failures are reported to HQ and Telegram.
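
The drill described above follows a pick, restore, probe, alert pattern, which can be sketched roughly as follows. This is not the actual drill implementation: `run_backup_drill` and its `restore`, `probe`, and `notify` callables are hypothetical stand-ins, and choosing the tenant at random is an assumption (the runbook only says the drill picks one tenant).

```python
import random
from typing import Callable, Optional, Sequence


def run_backup_drill(
    tenants: Sequence[str],
    restore: Callable[[str], str],   # hypothetical: restores a tenant's backup, returns a sandbox id
    probe: Callable[[str], bool],    # hypothetical: sanity probe against the sandbox
    notify: Callable[[str], None],   # hypothetical: sends an alert (for example to HQ and Telegram)
    rng: Optional[random.Random] = None,
) -> bool:
    """Restore one tenant's backup into a sandbox and probe it.

    Returns True if the drill passed; on any failure, sends one alert
    through notify and returns False.
    """
    if not tenants:
        raise ValueError("no tenants to drill")
    rng = rng or random.Random()
    tenant = rng.choice(list(tenants))  # assumption: random selection
    try:
        sandbox = restore(tenant)
        ok = probe(sandbox)
    except Exception as exc:
        # A restore or probe that raises counts as a failed drill.
        notify(f"backup drill failed for {tenant}: {exc}")
        return False
    if not ok:
        notify(f"backup drill sanity probe failed for {tenant}")
    return ok
```

Passing the restore, probe, and alert steps in as callables keeps the sketch independent of any particular backup tool and makes it easy to exercise with fakes.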

What you can do today

1. Enable warehouse so ClickHouse-side history covers your restore window.
2. Designate a recovery contact (phone + email) in your tenant.
3. Keep at least two ops users in HQ with TOTP enrolled.
4. Keep Knowledge Documents current: those drive customer self-service during a recovery window.
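
If you track this checklist in your own tooling, the first three items can be checked mechanically. The sketch below is illustrative only: `OpsUser`, `TenantReadiness`, and `readiness_gaps` are hypothetical names, and the fields are assumptions about how you might record this state rather than a real tenant schema. Whether Knowledge Documents are current needs human review, so it is not modelled here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class OpsUser:
    name: str
    totp_enrolled: bool


@dataclass
class TenantReadiness:
    # Hypothetical record of the checklist state for one tenant.
    warehouse_enabled: bool
    recovery_phone: Optional[str] = None
    recovery_email: Optional[str] = None
    ops_users: List[OpsUser] = field(default_factory=list)


def readiness_gaps(t: TenantReadiness) -> List[str]:
    """Return the checklist items that are not yet satisfied."""
    gaps = []
    if not t.warehouse_enabled:
        gaps.append("enable warehouse")
    if not (t.recovery_phone and t.recovery_email):
        gaps.append("designate a recovery contact (phone + email)")
    enrolled = sum(1 for u in t.ops_users if u.totp_enrolled)
    if enrolled < 2:
        gaps.append("keep at least two ops users with TOTP enrolled")
    return gaps
```

An empty list from `readiness_gaps` means the three checkable items are in place.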

Communications during an incident

We post updates on status.mojaedge.com every 30 minutes until the incident is resolved. We also send SMS and email to your designated recovery contact, and Telegram or Slack messages if you are subscribed.
