Study Lab

Operational Runbooks

Practice the operator path: find truth, classify risk, take the smallest correct action, and leave evidence.

Bad Deploy

Production error rate rises after promotion.

Steps

Freeze further promotions for the environment.
Read operation state and route owner epoch before logs.
Classify the failure as app-only, DB migration, runtime service, or route propagation.
If the old artifact and the database are compatible, roll back the route owner through a rollback operation.
If DB is incompatible, choose forward fix or data restore path.

Do not

Do not rebuild from main and call it rollback.
Do not mutate route tables manually at edge.
Do not hide DB incompatibility behind a green rollback button.
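
The decision in the steps above can be sketched as a small planning function. This is an illustrative sketch only: `FailureClass`, `plan_bad_deploy`, and the step labels are hypothetical names, not part of any platform API. The one rule it encodes is that a route-owner rollback is offered only when the old artifact is compatible with the current database.

```python
from enum import Enum


class FailureClass(Enum):
    # The four classes named in the runbook steps.
    APP_ONLY = "app-only"
    DB_MIGRATION = "db-migration"
    RUNTIME_SERVICE = "runtime-service"
    ROUTE_PROPAGATION = "route-propagation"


def plan_bad_deploy(failure_class: FailureClass, old_artifact_db_compatible: bool) -> list[str]:
    """Return the ordered actions for a bad deploy (labels are hypothetical)."""
    plan = [
        "freeze-promotions",
        "read-operation-state-and-route-owner-epoch",
        f"classified:{failure_class.value}",
    ]
    if old_artifact_db_compatible:
        # Safe to point the route back at the previous owner via an operation.
        plan.append("rollback-route-owner-by-operation")
    else:
        # Rollback would hide a DB incompatibility, so it is not offered.
        plan.append("choose-forward-fix-or-data-restore")
    return plan
```

For example, `plan_bad_deploy(FailureClass.DB_MIGRATION, False)` ends with `"choose-forward-fix-or-data-restore"` and never offers a rollback.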

Stuck Wake

Sleeping app receives safe traffic but never becomes hot.

Steps

Find wake operation and allocation attempt.
Check artifact pull, secrets, runtime service bindings, and readiness logs.
Coalesce duplicate wake attempts.
Return a clear 503 with retry guidance once the wake budget is exhausted.
Surface repair action to app owner.

Do not

Do not buffer unsafe requests indefinitely.
Do not start unlimited allocations under stampede.
Do not mark app hot before readiness.
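
Coalescing and the wake budget can be sketched with `asyncio`. This is a minimal sketch under assumptions: `WakeCoordinator`, `handle_safe_request`, and the `start_allocation` callback are hypothetical, and `start_allocation` is assumed to return only after readiness has passed. Concurrent requests for the same app share one allocation attempt, and each request gives up with a 503 once its budget expires.

```python
import asyncio


class WakeBudgetExceeded(Exception):
    """The app did not become ready within the wake budget."""


class WakeCoordinator:
    def __init__(self, start_allocation, budget_seconds: float):
        # start_allocation: hypothetical coroutine function (app_id) that
        # returns only once the allocation has passed readiness.
        self._start_allocation = start_allocation
        self._budget = budget_seconds
        self._inflight: dict[str, asyncio.Task] = {}

    async def wake(self, app_id: str) -> None:
        task = self._inflight.get(app_id)
        if task is None:
            # First caller starts the single allocation attempt for this app.
            task = asyncio.create_task(self._start_allocation(app_id))
            self._inflight[app_id] = task
            task.add_done_callback(lambda t: self._forget(app_id, t))
        try:
            # shield() keeps the shared attempt alive if this waiter times out.
            await asyncio.wait_for(asyncio.shield(task), timeout=self._budget)
        except asyncio.TimeoutError:
            raise WakeBudgetExceeded(app_id) from None

    def _forget(self, app_id: str, task: asyncio.Task) -> None:
        if self._inflight.get(app_id) is task:
            del self._inflight[app_id]
        if not task.cancelled():
            task.exception()  # mark any failure as retrieved


async def handle_safe_request(coord: WakeCoordinator, app_id: str):
    """Return (status, headers); 200 stands in for proxying to the hot app."""
    try:
        await coord.wake(app_id)
    except WakeBudgetExceeded:
        return 503, {"Retry-After": "5"}  # retry interval is an arbitrary example
    return 200, {}
```

A request that times out leaves the shared attempt running, so a later retry joins it instead of starting a new allocation.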

Route Event Replay Gap

Edge detects missing route event sequence after snapshot.

Steps

Fail closed for affected route keys or use last verified owner within policy.
Fetch fresh signed snapshot from control plane.
Verify checksum and sequence.
Resume JetStream consumption from the snapshot sequence plus one.
Record propagation incident evidence.

Do not

Do not apply later events over a gap.
Do not let allocation health choose production owner.
Do not clear gap without a new baseline.
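
The gap handling above can be sketched as a small state machine. This is a sketch under stated assumptions: `RouteTable`, `Snapshot`, and `routes_checksum` are hypothetical names, the checksum is assumed to be a SHA-256 over canonical JSON, and signature verification and the actual JetStream subscription are not shown. Once a gap is seen, every later event is refused until a verified snapshot establishes a new baseline.

```python
import hashlib
import json
from dataclasses import dataclass, field


def routes_checksum(routes: dict) -> str:
    # Assumed checksum format: hex SHA-256 over key-sorted JSON.
    return hashlib.sha256(json.dumps(routes, sort_keys=True).encode()).hexdigest()


@dataclass(frozen=True)
class Snapshot:
    sequence: int   # last event sequence included in the snapshot
    routes: dict    # route key -> owner
    checksum: str


@dataclass
class RouteTable:
    last_seq: int = 0
    routes: dict = field(default_factory=dict)
    gapped: bool = False

    def apply_event(self, seq: int, route_key: str, owner: str) -> bool:
        """Apply one event; return False if it was refused or ignored."""
        if self.gapped:
            return False              # never apply events over a gap
        if seq <= self.last_seq:
            return False              # duplicate or stale event, ignore
        if seq != self.last_seq + 1:
            self.gapped = True        # missing sequence detected
            return False
        self.routes[route_key] = owner
        self.last_seq = seq
        return True

    def load_snapshot(self, snap: Snapshot) -> int:
        """Install a verified baseline and return the sequence to resume from."""
        if routes_checksum(snap.routes) != snap.checksum:
            raise ValueError("snapshot checksum mismatch")
        if snap.sequence < self.last_seq:
            raise ValueError("snapshot is older than applied state")
        self.routes = dict(snap.routes)
        self.last_seq = snap.sequence
        self.gapped = False           # gap cleared only by a new baseline
        return snap.sequence + 1
```

The value returned by `load_snapshot` is the sequence at which event consumption would resume.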

Database Restore

Customer needs recovery from corruption or destructive migration.

Steps

Identify data class, RPO, RTO, and environment.
Stop or fence writers if restore would fork truth.
Choose PITR, backup restore, or forward repair.
Check app artifact compatibility with restored schema/data.
Record the user-facing impact and an audit trail of the restore.

Do not

Do not treat route rollback as data rollback.
Do not restore preview over production.
Do not ignore idempotency/event reprocessing after restore.
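
A restore request can be checked against these rules before anything runs. The sketch below is illustrative only: `RestoreRequest`, its fields, and the environment and method strings are hypothetical. It returns the reasons a restore must not proceed, and an empty list means no blocker was found by these checks.

```python
from dataclasses import dataclass

ALLOWED_METHODS = {"pitr", "backup", "forward-repair"}


@dataclass(frozen=True)
class RestoreRequest:
    source_env: str            # environment the backup or PITR data comes from
    target_env: str            # environment being restored
    method: str                # one of ALLOWED_METHODS
    would_fork_truth: bool     # live writers could diverge from restored data
    writers_fenced: bool       # writers are stopped or fenced
    artifact_compatible: bool  # app artifact works with restored schema/data


def restore_blockers(req: RestoreRequest) -> list[str]:
    """Return the reasons this restore must not proceed."""
    blockers = []
    if req.method not in ALLOWED_METHODS:
        blockers.append("unknown restore method")
    if req.source_env == "preview" and req.target_env == "production":
        blockers.append("preview data must not be restored over production")
    if req.would_fork_truth and not req.writers_fenced:
        blockers.append("writers must be stopped or fenced first")
    if not req.artifact_compatible:
        blockers.append("app artifact is incompatible with restored schema/data")
    return blockers
```

Idempotency and event reprocessing after the restore still need a separate check; this sketch does not cover them.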

Host OS Upgrade

Runner host needs patching or replacement.

Steps

Cordon host.
Verify quorum and runner headroom.
Start replacement allocations where needed.
Drain active Vango sessions or enforce deadline.
Upgrade/rebuild host through bootstrap automation.
Run postchecks and uncordon.

Do not

Do not kill active sessions while claiming drain.
Do not drain below quorum.
Do not hand-edit host drift outside IaC without recording it.
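
The quorum and headroom precheck can be sketched as a pure function. The assumptions are labeled: quorum is taken to be a simple majority of the configured cluster size, headroom is counted in allocation slots, and `can_drain` and its parameters are hypothetical names.

```python
def majority_quorum(cluster_size: int) -> int:
    # Assumption: quorum means a strict majority of configured members.
    return cluster_size // 2 + 1


def can_drain(
    cluster_size: int,
    healthy_members: int,
    host_is_member: bool,
    spare_slots_elsewhere: int,
    allocations_on_host: int,
) -> tuple[bool, str]:
    """Decide whether a cordoned host may be drained right now."""
    remaining = healthy_members - (1 if host_is_member else 0)
    if remaining < majority_quorum(cluster_size):
        return False, "drain would drop below quorum"
    if spare_slots_elsewhere < allocations_on_host:
        return False, "insufficient runner headroom for replacements"
    return True, "ok"
```

With three members all healthy, draining one member leaves two, which still meets the majority of two. With only two healthy, the same drain is refused.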

Provider Outage

Database, object store, queue, or payment provider is degraded.

Steps

Classify provider and affected bindings.
Separate platform control-plane health from customer app dependency health.
Pause unsafe deploys/migrations if evidence cannot be collected.
Expose app-owner status with affected environments.
Fail over only if data/runtime topology is proven.

Do not

Do not hide provider failures as generic app crashes.
Do not fail over sessions to regions without data/runtime readiness.
Do not rotate secrets as a blind fix.
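
The failover rule can be expressed as a filter over candidate regions. This is a sketch with hypothetical names (`RegionReadiness`, `failover_candidates`); how each flag is proven is outside its scope. A region qualifies only when both its data and its runtime readiness have been verified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionReadiness:
    region: str
    data_verified: bool      # data topology proven for this region
    runtime_verified: bool   # runtime services and bindings proven


def failover_candidates(regions: list[RegionReadiness]) -> list[str]:
    """Return regions that are safe failover targets, in input order."""
    return [r.region for r in regions if r.data_verified and r.runtime_verified]
```

An empty result means no failover should happen, and the outage is reported to app owners as a provider failure.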