Runbooks

1. Purpose

This directory contains operational runbooks for incident response and service recovery procedures.

Runbooks describe step-by-step actions for diagnosing and resolving production issues in the DSP infrastructure.

These documents are intended for on-call engineers and system administrators.

2. Scope

Runbooks should cover:

  • Service outages

  • Region-specific bidding failures

  • Infrastructure component failures

  • Monitoring alerts requiring manual intervention

Runbooks must NOT duplicate architectural documentation. System design details belong in docs/architecture/.

3. Structure Guidelines

Each runbook should follow a consistent structure:

  1. Overview

  2. Symptoms

  3. Root Cause (if known)

  4. Recovery Procedure

  5. Post-Recovery Validation

  6. Escalation

  7. Notes (optional)

Clarity and precision are critical. Runbooks must be usable under incident pressure.
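
The structure above lends itself to an automated check. The following is a minimal sketch, not an existing tool in this repository. It assumes runbooks are AsciiDoc files (as the .adoc examples suggest) and that each section is introduced by a heading line such as "== Overview". The names REQUIRED_SECTIONS, HEADING_RE and missing_sections are hypothetical.

```python
import re

# Required section titles from the list above. "Notes" is optional and
# therefore not checked.
REQUIRED_SECTIONS = [
    "Overview",
    "Symptoms",
    "Root Cause",
    "Recovery Procedure",
    "Post-Recovery Validation",
    "Escalation",
]

# Assumption: section titles are AsciiDoc headings, e.g. "== Overview".
HEADING_RE = re.compile(r"^=+\s+(.*\S)\s*$")


def missing_sections(text):
    """Return the required section titles that have no matching heading."""
    found = []
    for line in text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            found.append(match.group(1).lower())
    # Prefix match so that "Root Cause (if known)" satisfies "Root Cause".
    return [
        title
        for title in REQUIRED_SECTIONS
        if not any(heading.startswith(title.lower()) for heading in found)
    ]
```

The sketch only checks that each section is present. It does not check section order or content.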

4. Naming Convention

File names should:

  • Be lowercase

  • Use hyphen-separated words

  • Clearly describe the incident type

Examples:

  • bidding-outage.adoc

  • redis-cluster-failure.adoc

  • nginx-503-spike.adoc

Do not create a separate runbook per region unless the procedures differ significantly.
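
The mechanical parts of this convention can be expressed as a pattern. The following is a minimal sketch. It assumes .adoc is the only permitted extension, as in the examples above, and the names RUNBOOK_NAME_RE and is_valid_runbook_name are hypothetical.

```python
import re

# Lowercase alphanumeric words separated by single hyphens.
# Assumption: every runbook uses the ".adoc" extension.
RUNBOOK_NAME_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*\.adoc")


def is_valid_runbook_name(filename):
    """Return True if the file name follows the naming convention."""
    return RUNBOOK_NAME_RE.fullmatch(filename) is not None
```

Whether a name clearly describes the incident type cannot be checked this way and still needs review.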

5. Operational Principles

Runbooks must:

  • Contain tested commands only

  • Include verification steps

  • Specify expected outcomes

  • Avoid ambiguity

If procedures differ by region or environment, use parameterized placeholders (e.g. <REGION>).
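
A placeholder such as <REGION> has to be replaced before the command is run. The following is a minimal sketch of that substitution, which fails loudly when a value is missing rather than emitting a half-filled command. The function name render_command, the uppercase placeholder pattern and the host name in the example are hypothetical.

```python
import re

# Assumption: placeholders are uppercase names in angle brackets, e.g. <REGION>.
PLACEHOLDER_RE = re.compile(r"<([A-Z][A-Z0-9_]*)>")


def render_command(template, values):
    """Replace every <NAME> placeholder; raise KeyError if a value is missing."""

    def substitute(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value supplied for placeholder <{name}>")
        return values[name]

    return PLACEHOLDER_RE.sub(substitute, template)


# Hypothetical command template; the host name is illustrative only.
print(render_command(
    "redis-cli -h redis.<REGION>.example.internal ping",
    {"REGION": "eu-west-1"},
))
# prints: redis-cli -h redis.eu-west-1.example.internal ping
```

Raising on a missing value keeps an unresolved placeholder from reaching a production shell.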

6. Maintenance

Runbooks must be updated whenever:

  • Recovery procedures change

  • Infrastructure changes affect commands

  • Incident postmortems identify improvements