Operational Runbooks
Practice the operator path: find truth, classify risk, take the smallest correct action, and leave evidence.
Bad Deploy
Production error rate rises after promotion.
Steps
Freeze further promotions for the environment.
Read operation state and route owner epoch before logs.
Classify the failure as app-only, DB migration, runtime service, or route propagation.
If the old artifact and database are compatible, roll back the route owner by operation.
If the DB is incompatible, choose a forward fix or a data restore path.
Do not
Do not rebuild from main and call it rollback.
Do not mutate route tables manually at the edge.
Do not hide DB incompatibility behind a green rollback button.
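The Bad Deploy decision can be sketched as a small planning function. This is a minimal illustration, not the platform's API: `IssueClass`, `plan_bad_deploy`, and the step labels are hypothetical names chosen for the example. The point it shows is that the rollback branch is gated on artifact/database compatibility, not on how the failure was classified.

```python
from enum import Enum


class IssueClass(Enum):
    # The four classes named in the runbook; the string values are illustrative.
    APP_ONLY = "app-only"
    DB_MIGRATION = "db-migration"
    RUNTIME_SERVICE = "runtime-service"
    ROUTE_PROPAGATION = "route-propagation"


def plan_bad_deploy(issue: IssueClass, old_artifact_db_compatible: bool) -> list:
    """Return the ordered runbook steps for a bad deploy (hypothetical labels)."""
    steps = [
        "freeze-promotions",
        "read-operation-state-and-route-owner-epoch",
        "classified:" + issue.value,
    ]
    if old_artifact_db_compatible:
        # Safe to point traffic back at the previous owner.
        steps.append("rollback-route-owner-by-operation")
    else:
        # A route rollback would not undo the data change, so it is not offered.
        steps.append("forward-fix-or-data-restore")
    return steps
```

For example, `plan_bad_deploy(IssueClass.DB_MIGRATION, False)` ends in `forward-fix-or-data-restore` rather than a rollback, which is the behavior the last "Do not" item asks for.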
Stuck Wake
Sleeping app receives safe traffic but never becomes hot.
Steps
Find wake operation and allocation attempt.
Check artifact pull, secrets, runtime service bindings, and readiness logs.
Coalesce duplicate wake attempts.
Return a clear retry response (503) once the wake budget is exhausted.
Surface repair action to app owner.
Do not
Do not buffer unsafe requests indefinitely.
Do not start unlimited allocations under stampede.
Do not mark app hot before readiness.
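Coalescing and the wake budget can be sketched together. The following is an assumption-heavy illustration: `WakeCoalescer`, its method, and the returned labels are hypothetical, and time is passed in explicitly so the logic stays deterministic. It shows one wake per app, duplicates joining that wake, and a clear 503 after the budget instead of indefinite buffering.

```python
from dataclasses import dataclass, field


@dataclass
class WakeCoalescer:
    """Tracks at most one in-flight wake per app and enforces a wake budget."""
    budget_seconds: float
    started_at: dict = field(default_factory=dict)  # app_id -> wake start time

    def on_request(self, app_id: str, now: float, ready: bool) -> str:
        """Decide what to do with a request for an app that may still be waking."""
        if ready:
            # Only readiness makes the app hot; the wake record is cleared.
            self.started_at.pop(app_id, None)
            return "serve"
        if app_id not in self.started_at:
            # First request starts the single wake operation.
            self.started_at[app_id] = now
            return "start-wake"
        if now - self.started_at[app_id] > self.budget_seconds:
            # Budget exhausted: answer clearly instead of holding the request.
            return "retry-503"
        # Duplicate request joins the existing wake; no new allocation starts.
        return "join-wake"
```

With a 10-second budget, requests at t=0, t=5, and t=11 for the same sleeping app yield `start-wake`, `join-wake`, and `retry-503` in that order.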
Route Event Replay Gap
Edge detects missing route event sequence after snapshot.
Steps
Fail closed for affected route keys or use last verified owner within policy.
Fetch fresh signed snapshot from control plane.
Verify checksum and sequence.
Resume JetStream consumption from the snapshot sequence plus one.
Record propagation incident evidence.
Do not
Do not apply later events over a gap.
Do not let allocation health choose production owner.
Do not clear gap without a new baseline.
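The gap handling above can be sketched without any messaging client. This is a simplified model under stated assumptions: `EdgeRouteTable` and `snapshot_checksum` are hypothetical names, the checksum is a SHA-256 over canonical JSON, and signature verification of the snapshot is left out. It shows that events are refused once a gap is seen, and that only a verified snapshot clears the gap and yields the resume sequence.

```python
import hashlib
import json


def snapshot_checksum(owners: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the route owner map."""
    payload = json.dumps(owners, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class EdgeRouteTable:
    def __init__(self):
        self.owners = {}
        self.last_seq = 0
        self.gap = False

    def apply_event(self, seq: int, route_key: str, owner: str) -> bool:
        """Apply an event only if it is exactly the next sequence number."""
        if self.gap:
            return False  # nothing is applied until a new baseline is loaded
        if seq <= self.last_seq:
            return False  # already applied; ignore the duplicate
        if seq != self.last_seq + 1:
            self.gap = True  # missing sequence: stop applying events
            return False
        self.owners[route_key] = owner
        self.last_seq = seq
        return True

    def load_snapshot(self, owners: dict, seq: int, checksum: str) -> int:
        """Install a verified baseline and return the sequence to resume from."""
        if snapshot_checksum(owners) != checksum:
            raise ValueError("snapshot checksum mismatch")
        if seq < self.last_seq:
            raise ValueError("snapshot is older than applied state")
        self.owners = dict(owners)
        self.last_seq = seq
        self.gap = False
        return seq + 1
```

The returned value is the "snapshot sequence plus one" from the steps; a real consumer would pass it to its stream client as the start position.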
Database Restore
Customer needs recovery from corruption or destructive migration.
Steps
Identify data class, RPO, RTO, and environment.
Stop or fence writers if restore would fork truth.
Choose PITR, backup restore, or forward repair.
Check app artifact compatibility with restored schema/data.
Record the user-facing impact and an audit entry.
Do not
Do not treat route rollback as data rollback.
Do not restore preview over production.
Do not ignore idempotency/event reprocessing after restore.
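The last "Do not" item is easiest to see with a small replay sketch. Everything here is illustrative: the `db` dictionary, the `balance` field, and `replay_after_restore` are hypothetical stand-ins for a restored database and its event handler. The assumption is that processed idempotency keys are stored with the data, so they are restored to the same point in time and replayed events are applied exactly once.

```python
def replay_after_restore(db: dict, events: list) -> list:
    """Re-apply events to a restored database, skipping ones it already contains.

    Returns the idempotency keys that were actually applied.
    """
    applied = []
    for event in events:
        key = event["idempotency_key"]
        if key in db["processed_keys"]:
            # The restored data already reflects this event.
            continue
        db["balance"] += event["amount"]
        db["processed_keys"].add(key)
        applied.append(key)
    return applied
```

If the restored state already contains event `e1`, replaying `e1`, `e2`, `e3` applies only `e2` and `e3`, and a second replay of the same events changes nothing.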
Host OS Upgrade
Runner host needs patching or replacement.
Steps
Cordon host.
Verify quorum and runner headroom.
Start replacement allocations where needed.
Drain active Vango sessions or enforce the drain deadline.
Upgrade/rebuild host through bootstrap automation.
Run postchecks and uncordon.
Do not
Do not kill active sessions while claiming drain.
Do not drain below quorum.
Do not hand-edit a host outside IaC without recording the drift.
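The quorum and headroom check before a drain can be sketched as a precheck that returns blocking reasons. This is a sketch under assumptions the runbook does not state: quorum is taken to be a simple majority of voting members, headroom is counted in free runner slots, and `drain_precheck` and its parameters are hypothetical names.

```python
def drain_precheck(
    voters_total: int,
    voters_healthy: int,
    host_is_voter: bool,
    free_slots_elsewhere: int,
    allocations_on_host: int,
) -> list:
    """Return the reasons a drain must not start; an empty list means proceed."""
    reasons = []
    quorum = voters_total // 2 + 1  # assumed: simple majority
    remaining = voters_healthy - (1 if host_is_voter else 0)
    if remaining < quorum:
        reasons.append("would drop below quorum")
    if free_slots_elsewhere < allocations_on_host:
        # Replacement allocations would have nowhere to start.
        reasons.append("insufficient runner headroom")
    return reasons
```

With three healthy voters out of three, draining one voter leaves two, which still meets a quorum of two; with only two healthy, the same drain is blocked.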
Provider Outage
Database, object store, queue, or payment provider is degraded.
Steps
Classify provider and affected bindings.
Separate platform control-plane health from customer app dependency health.
Pause unsafe deploys/migrations if evidence cannot be collected.
Expose app-owner status with affected environments.
Fail over only if the target's data and runtime topology is proven ready.
Do not
Do not disguise provider failures as generic app crashes.
Do not fail over sessions to regions without data/runtime readiness.
Do not rotate secrets as a blind fix.
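Classifying affected bindings and exposing them per environment can be sketched as a simple grouping. The binding records, their field names, and `affected_environments` are hypothetical; a real platform would read bindings from its own inventory.

```python
def affected_environments(bindings: list, degraded_provider: str) -> dict:
    """Group the names of bindings that use the degraded provider by environment."""
    affected = {}
    for binding in bindings:
        if binding["provider"] == degraded_provider:
            affected.setdefault(binding["environment"], []).append(binding["name"])
    return affected


# Illustrative inventory; all names are made up for the example.
example_bindings = [
    {"name": "orders-db", "provider": "database", "environment": "production"},
    {"name": "uploads", "provider": "object-store", "environment": "production"},
    {"name": "orders-db", "provider": "database", "environment": "preview"},
]
```

Calling `affected_environments(example_bindings, "database")` lists `orders-db` under both `production` and `preview`, which is the kind of status an app owner can act on without it being reported as a generic crash.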