Runbooks
1. Purpose
This directory contains operational runbooks for incident response and service recovery procedures.
Each runbook lists step-by-step actions for diagnosing and resolving production issues in the DSP infrastructure.
These documents are intended for on-call engineers and system administrators.
2. Scope
Runbooks should cover:
- Service outages
- Region-specific bidding failures
- Infrastructure component failures
- Monitoring alerts requiring manual intervention
Runbooks must NOT duplicate architectural documentation.
System design details belong in docs/architecture/.
3. Structure Guidelines
Each runbook should follow a consistent structure:
- Overview
- Symptoms
- Root Cause (if known)
- Recovery Procedure
- Post-Recovery Validation
- Escalation
- Notes (optional)
Clarity and precision are critical. Runbooks must be usable under incident pressure.
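One way to keep runbooks consistent is to check the section list mechanically, for example in CI. The sketch below is illustrative only and rests on several assumptions that this README does not state: that runbooks are AsciiDoc files whose sections are level-1 headings (`== Overview`), that "Root Cause" and "Notes" may be omitted entirely, and that the helper name `missing_sections` is hypothetical.

```python
import re

# Sections treated as mandatory in this sketch. "Root Cause (if known)" and
# "Notes (optional)" are assumed to be omittable and are not checked.
REQUIRED_SECTIONS = [
    "Overview",
    "Symptoms",
    "Recovery Procedure",
    "Post-Recovery Validation",
    "Escalation",
]

# Assumption: sections are level-1 AsciiDoc headings such as "== Symptoms".
# Deeper headings ("=== Step 1") and the document title ("= Title") do not match.
HEADING = re.compile(r"^==[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def missing_sections(runbook_text: str) -> list[str]:
    """Return the required section titles that do not appear in the runbook."""
    found = set(HEADING.findall(runbook_text))
    return [name for name in REQUIRED_SECTIONS if name not in found]
```

An empty result means every required section is present. The check only looks at headings; it cannot tell whether the content under them is usable during an incident.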
4. Naming Convention
File names should:
- Be lowercase
- Use hyphen-separated words
- Clearly describe the incident type
Examples:
- bidding-outage.adoc
- redis-cluster-failure.adoc
- nginx-503-spike.adoc
Do not create a separate runbook per region unless the recovery procedures differ significantly between regions.
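The mechanical parts of the naming convention (lowercase, hyphen-separated words) can be expressed as a pattern. The following is a minimal sketch, assuming that digits are allowed (as in nginx-503-spike.adoc) and that every runbook uses the .adoc extension shown in the examples. The function name `is_valid_runbook_name` is hypothetical, and whether a name clearly describes the incident type still needs human review.

```python
import re

# Lowercase alphanumeric words joined by single hyphens, ending in ".adoc".
# Assumption: digits are permitted and ".adoc" is the only extension in use.
RUNBOOK_NAME = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*\.adoc")


def is_valid_runbook_name(filename: str) -> bool:
    """Check a bare file name (no directory part) against the convention."""
    return RUNBOOK_NAME.fullmatch(filename) is not None
```

Under these assumptions the three example names above pass, while names such as Bidding-Outage.adoc, bidding_outage.adoc, or bidding--outage.adoc are rejected.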