Webhook Incident Playbook: How to Triage Failures Without Panic
A repeatable workflow for the inevitable “why aren’t we receiving events?” moment.
Webhook incidents feel personal because they hit the invisible nervous system of your SaaS: billing, signups, CRM updates, fulfillment, access control, and internal automations.
The worst part is the uncertainty. Is it just one customer? One endpoint? A deploy? A provider outage? When you don’t have a process, you either overreact or underreact — both are expensive.
This playbook turns webhook failures into a calm checklist you can run in minutes.
Step 1: Confirm the blast radius
Before you fix anything, define the incident boundary. Ask:
- Is it one endpoint or many?
- Is it one provider (e.g., billing) or multiple systems?
- Is it constant failures or intermittent spikes?
- Did it start after a deploy, config change, or certificate renewal?
If you can group failures by endpoint, status code, and time window, you’ll usually see the shape of the problem immediately.
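As a rough illustration of that grouping, here is a minimal Python sketch. The record shape (`endpoint`, `status`, `at`) and the sample data are assumptions made for the example; your delivery log will have its own field names.

```python
from collections import Counter
from datetime import datetime

# Hypothetical delivery-log records; the field names are assumptions.
failures = [
    {"endpoint": "/webhooks/billing", "status": 401, "at": "2024-05-01T10:02:11"},
    {"endpoint": "/webhooks/billing", "status": 401, "at": "2024-05-01T10:03:40"},
    {"endpoint": "/webhooks/billing", "status": 401, "at": "2024-05-01T10:17:05"},
    {"endpoint": "/webhooks/crm", "status": 504, "at": "2024-05-01T10:04:59"},
]

def group_failures(records, window_minutes=15):
    """Count failures per (endpoint, status, start of time window).

    window_minutes should divide 60 evenly so windows align to the hour.
    """
    counts = Counter()
    for r in records:
        ts = datetime.fromisoformat(r["at"])
        # Floor the timestamp to the start of its window.
        minute = ts.minute - ts.minute % window_minutes
        window = ts.replace(minute=minute, second=0, microsecond=0)
        counts[(r["endpoint"], r["status"], window.isoformat())] += 1
    return counts

# Largest groups first: the top rows are usually the incident.
for key, n in group_failures(failures).most_common():
    print(n, key)
```

One endpoint with one status code dominating a single window suggests a narrow cause. Failures spread across endpoints and codes point at something shared, such as a deploy or an infrastructure problem.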
Step 2: Classify the failure type (use this quick map)
4xx errors
Usually “your app rejected it” (auth, signature, payload validation). Check recent code changes, signature secrets, and request parsing.
5xx errors
Usually “your app crashed” (exceptions, DB down, queue broken). Check logs, database connectivity, and worker health.
Timeouts
Usually “too slow” (heavy work inside the request cycle). Move work to a queue, return fast, and reduce external calls.
Network / DNS / TLS
Usually “can’t reach you” (expired cert, DNS, firewall). Verify domain, certificates, and upstream availability.
This classification prevents thrashing. You don’t debug a timeout the same way you debug a 401.
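The quick map above can be sketched as a small function. This is an illustration, not a complete taxonomy: the `error` labels (`"timeout"`, `"dns"`, `"tls"`, `"connection_refused"`) are hypothetical names for whatever your delivery log records when no HTTP response came back.

```python
def classify_failure(status=None, error=None):
    """Map a failed delivery to one of the four buckets in the quick map.

    status: HTTP status code, if the endpoint sent a response.
    error:  transport-level error label (hypothetical values), if it did not.
    """
    if error == "timeout":
        return "timeout"   # too slow: look at work done in the request cycle
    if error in {"dns", "tls", "connection_refused"}:
        return "network"   # can't reach you: certs, DNS, firewall
    if status is not None and 400 <= status < 500:
        return "4xx"       # your app rejected it: auth, signature, validation
    if status is not None and 500 <= status < 600:
        return "5xx"       # your app crashed: exceptions, DB, workers
    return "unknown"

print(classify_failure(status=401))
print(classify_failure(error="tls"))
```

Transport errors are checked first because a delivery that never got a response has no meaningful status code.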
Step 3: Reproduce with one real payload
Pick a single failed delivery and try to replay it against your endpoint in a controlled way. You’re looking for answers to three questions:
- Does the request reach the app server?
- Does your app validate and accept it?
- Does the downstream work complete (DB writes, queue jobs, side effects)?
Reproduction keeps you honest. It’s easy to “fix” something that wasn’t the real cause.
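A controlled replay might look like the sketch below, which uses only the Python standard library. It assumes you have stored the raw body and headers of the failed delivery. The function names are made up for this example.

```python
import urllib.error
import urllib.request

def build_replay_request(url, raw_body, headers):
    """Rebuild the original delivery: same bytes, same headers, POST."""
    # Send the stored raw bytes unchanged. If the provider signs the raw
    # body, re-serializing the JSON can change the bytes and break the
    # signature check.
    return urllib.request.Request(url, data=raw_body, headers=headers, method="POST")

def replay(request, timeout=10):
    """Return the HTTP status, or None if no response came back."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the app answered, but with a 4xx/5xx
    except OSError:
        return None      # DNS, TLS, connection, or timeout failure
```

Calling `replay(build_replay_request(url, raw_body, headers))` against a staging copy of the endpoint answers the first two questions: `None` means the request never reached the app, and a 4xx or 5xx means it arrived but was rejected or crashed. The third question still needs a look at the database and queue. Some providers sign a timestamp along with the body, so an old payload may fail verification for that reason alone.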
Step 4: Stop the bleeding first
When incidents hit billing or access control, speed matters. The fastest safe actions are usually:
- Roll back the last deploy (if clearly correlated).
- Restore a known-good secret/config (if signatures suddenly fail).
- Scale workers or restart the queue (if the backlog is exploding).
- Return quick 2xx and queue the rest (if timeouts are the problem).
You’re not trying to be clever here — you’re trying to restore a stable baseline.
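The “return quick 2xx and queue the rest” option can be sketched without any particular web framework. The in-process `queue.Queue` below is a stand-in assumption: in production you would more likely use a durable queue, since in-memory jobs are lost if the process dies. `handle_webhook` and `worker` are illustrative names.

```python
import json
import queue
import threading

jobs = queue.Queue()  # stand-in for a durable job queue

def handle_webhook(raw_body: bytes) -> int:
    """Request-cycle work only: parse, enqueue, acknowledge."""
    try:
        event = json.loads(raw_body)
    except ValueError:  # covers invalid JSON and invalid UTF-8
        return 400
    jobs.put(event)
    return 202  # accepted; the slow work happens later

def worker(process):
    """Runs outside the request cycle; `process` is the slow business logic."""
    while True:
        event = jobs.get()
        try:
            process(event)
        except Exception as exc:
            # Keep the worker alive; a real system would retry or dead-letter.
            print("job failed:", exc)
        finally:
            jobs.task_done()
```

Starting the worker with `threading.Thread(target=worker, args=(process_event,), daemon=True).start()` keeps the slow path out of the request handler, which then only needs time to parse and enqueue.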
Step 5: Backfill and reconcile (the part people skip)
After the endpoint is healthy, you still need to answer: what did we miss?
A good backfill strategy starts from the source of truth:
- For billing: fetch recent invoices/subscriptions from the provider API and reconcile local state.
- For fulfillment: re-run job pipelines based on orders created in the incident window.
- For access control: compute entitlements from current plan state, then re-apply.
This is where idempotency pays off. If applying the same event twice leaves the same state as applying it once, backfills become safe to run repeatedly.
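A minimal sketch of an idempotent backfill, assuming each event carries a unique `id` from the provider. The `customer` and `plan` fields and the in-memory `seen_ids` set are hypothetical. A real system would more likely enforce this with a unique constraint in the database.

```python
def apply_event(event, state, seen_ids):
    """Apply an event at most once, keyed by its event ID."""
    if event["id"] in seen_ids:
        return False  # already applied; skipping makes the re-run harmless
    state[event["customer"]] = event["plan"]
    seen_ids.add(event["id"])
    return True

def backfill(events, state, seen_ids):
    """Apply every event not yet seen; return how many were newly applied."""
    return sum(apply_event(e, state, seen_ids) for e in events)

# Hypothetical events fetched from the provider for the incident window.
events = [
    {"id": "evt_1", "customer": "cus_a", "plan": "pro"},
    {"id": "evt_2", "customer": "cus_b", "plan": "free"},
]
state, seen = {}, set()
print(backfill(events, state, seen))  # first run applies both events
print(backfill(events, state, seen))  # second run applies nothing new
```

Because the second run changes nothing, you can rerun the backfill after a partial failure without double-applying side effects.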
Step 6: Prevent the same incident next week
Prevention is not “write more code.” Prevention is “make the next failure obvious and smaller.” Here are upgrades that compound quickly:
- Incident grouping by endpoint + error signature, so you see patterns instead of noise.
- Alert routing that’s specific (don’t send 200 emails for one outage).
- Operational checks for queue workers, database connectivity, and webhook route latency (via your infra/tooling stack).
- Dashboards that show recent failures, trends, and “time since last success.”
- Runbooks (like this one) linked directly from the incident view.
The goal isn’t “never fail.” The goal is “fail loudly, recover fast, and learn once.”
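The “time since last success” idea from the list above is cheap to prototype. The sketch below assumes you record a timestamp for each endpoint’s most recent successful delivery. The 30-minute threshold and the function name are arbitrary choices for the example.

```python
from datetime import datetime, timedelta, timezone

def stale_endpoints(last_success, now, max_silence=timedelta(minutes=30)):
    """Return endpoints whose last successful delivery is older than max_silence."""
    return sorted(ep for ep, ts in last_success.items() if now - ts > max_silence)

# Hypothetical data: endpoint -> time of last successful delivery.
now = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
last_success = {
    "/webhooks/billing": now - timedelta(hours=2),
    "/webhooks/crm": now - timedelta(minutes=5),
}
print(stale_endpoints(last_success, now))
```

A check like this catches the quiet failure mode where nothing errors because nothing arrives. A sensible threshold depends on how often each endpoint normally receives events.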
A gentle reminder
If you’re building a SaaS, webhook incidents are not a sign you’re doing something wrong — they’re a sign you’re shipping real systems that touch real networks.
With a simple playbook, your future self won’t dread these moments. You’ll just run the steps, fix the cause, backfill cleanly, and move on.