Debugging and Incident Response

Webhook Incident Playbook

Last updated: May 12, 2026 6:40 PM

When webhook integrations fail, the symptoms usually appear somewhere else first: missing payments, broken provisioning, delayed syncs, or incorrect account state. This playbook gives engineers a practical response workflow for production incidents.

Webhook incidents are confusing because the system that looks broken is often not the system where the failure actually occurred.

A payment provider may still be sending events. Your endpoint may still be reachable. The real problem could be a timeout, a queue worker crash, a signature mismatch, or a silent processing failure deeper in the application.

The goal of an incident playbook is not just to investigate. It is to reduce time-to-diagnosis and help engineers restore correct event flow safely.

Step 1: Confirm whether the provider is still sending events

Start with the provider dashboard. Platforms like Stripe, Paddle, and GitHub all expose webhook delivery logs.

Check:

recent delivery attempts
HTTP response codes
retry attempts
response latency

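If no recent events appear, the problem is more likely configuration-related than runtime-related: for example, a disabled endpoint, a changed endpoint URL, or event types that are no longer subscribed.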
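
The four checks above can be rolled into one summary once delivery attempts have been exported from the provider dashboard. The following is a minimal sketch, not any provider's API: the `DeliveryAttempt` record and its field names are hypothetical, and how you obtain the attempts (export, API call, copy and paste) is left open.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DeliveryAttempt:
    # Hypothetical record shape; map your provider's log fields onto it.
    event_id: str
    status_code: Optional[int]  # None when the provider got no HTTP response
    latency_ms: float
    attempt_number: int  # 1 for the first delivery, 2+ for retries


def summarize_attempts(attempts: List[DeliveryAttempt]) -> dict:
    """Group attempts by status class and report retries and worst latency."""
    by_status = Counter()
    for attempt in attempts:
        if attempt.status_code is None:
            by_status["no_response"] += 1
        else:
            by_status[f"{attempt.status_code // 100}xx"] += 1
    return {
        "total": len(attempts),
        "by_status_class": dict(by_status),
        "retries": sum(1 for a in attempts if a.attempt_number > 1),
        "max_latency_ms": max((a.latency_ms for a in attempts), default=None),
    }
```

A `total` of zero points toward the configuration checks; a summary dominated by `5xx` or `no_response` points toward the endpoint itself.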

Step 2: Determine whether this is a delivery problem or a processing problem

This distinction matters immediately.

Delivery problem: the provider cannot successfully POST to your endpoint
Processing problem: the endpoint returns success, but downstream jobs or business logic fail later

Engineers often waste time debugging application code when the endpoint never received the event in the first place, or the reverse.
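
The distinction can be written down as a small decision function. This is an illustrative sketch: `FailureKind` and `classify_failure` are hypothetical names, and the `processed_ok` flag stands in for whatever evidence you have (job records, database rows, logs) that the event's business logic completed.

```python
from enum import Enum
from typing import Optional


class FailureKind(Enum):
    DELIVERY = "delivery"
    PROCESSING = "processing"
    NONE = "none"


def classify_failure(provider_status: Optional[int], processed_ok: bool) -> FailureKind:
    """Classify one event from the provider's view and the application's view.

    provider_status is the HTTP status the provider recorded for the delivery,
    or None if it recorded a timeout or connection failure.
    """
    if provider_status is None or not (200 <= provider_status < 300):
        # The provider never got a success response: start at the endpoint.
        return FailureKind.DELIVERY
    if not processed_ok:
        # The endpoint said "success" but the work did not complete downstream.
        return FailureKind.PROCESSING
    return FailureKind.NONE
```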

Step 3: Inspect endpoint response behavior

Review what your webhook endpoint is returning.

HTTP 4xx responses usually indicate validation or signature verification failures
HTTP 5xx responses usually indicate application exceptions
timeouts usually indicate slow synchronous processing

If the request takes too long, the provider treats the delivery as failed and retries it, even when part of the business logic already ran.

For deeper timeout-specific debugging, see webhook timeout debugging.
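
Signature mismatches, one of the 4xx causes listed above, are often easiest to confirm by recomputing the signature offline from a captured request. The sketch below assumes a generic scheme: a hex-encoded HMAC-SHA256 of the raw request body using a shared secret. Real providers differ (some sign a timestamp together with the body, or use a different encoding), so treat the scheme here as an assumption and check your provider's documentation.

```python
import hashlib
import hmac


def verify_signature(secret: bytes, raw_body: bytes, received_hex: str) -> bool:
    """Return True if received_hex matches HMAC-SHA256(secret, raw_body).

    Assumes a generic hex-encoded HMAC-SHA256 scheme, not any specific
    provider's format. The body must be the exact bytes that were received.
    """
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking match position through timing.
    return hmac.compare_digest(expected.encode(), received_hex.encode())
```

A common cause of mismatches is verifying against a re-serialized body (for example, parsed JSON that was encoded again) rather than the raw bytes, so capture the request before any framework parsing.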

Step 4: Check application and queue workers

Many webhook endpoints acknowledge the request quickly and delegate heavy work to background jobs.

That means the HTTP response alone does not prove the event was processed correctly.

Look for:

queue worker crashes
stuck jobs
database constraint failures
dead letter queue growth
unexpected retry loops inside the worker system

If you use queue-based processing, also see webhook processing architecture.
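
To make the acknowledge-then-delegate pattern concrete, here is a minimal in-process sketch using Python's standard `queue` module. It is an illustration, not a production design: a real system would use a durable queue, and `handle_webhook`, `run_worker`, `dead_letters`, and the `_attempts` field are all hypothetical names.

```python
import json
import queue

jobs = queue.Queue()  # stands in for a durable job queue
dead_letters = []     # stands in for a dead letter queue


def handle_webhook(raw_body: bytes) -> int:
    """Acknowledge quickly: validate, enqueue, and return without doing the work."""
    try:
        event = json.loads(raw_body)
    except ValueError:  # covers invalid JSON and undecodable bytes
        return 400
    if not isinstance(event, dict):
        return 400
    jobs.put(event)
    return 200  # success here only means "queued", not "processed"


def run_worker(process, max_attempts: int = 3) -> None:
    """Drain the queue, retrying failures and dead-lettering after max_attempts."""
    while not jobs.empty():
        event = jobs.get()
        attempts = event.get("_attempts", 0) + 1
        try:
            process(event)
        except Exception as exc:
            if attempts >= max_attempts:
                dead_letters.append({"event": event, "error": repr(exc)})
            else:
                jobs.put({**event, "_attempts": attempts})
```

In this shape the provider sees a 200 for every queued event, so a failing `process` function is visible only through worker logs and dead letter growth, which is why those are the places to look.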

Step 5: Verify duplicate protection

Providers retry failed events automatically. If the first attempt partially succeeded, a retry can create duplicate side effects.

Typical symptoms:

duplicate subscription activation
repeated provisioning jobs
multiple receipt emails
duplicate rows in local billing tables

If you cannot answer “have we already processed this event ID?”, the incident may become more damaging during recovery.

See idempotent webhooks in Laravel.
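
One way to answer the "have we already processed this event ID?" question is a table with a uniqueness constraint on the event ID. The sketch below uses SQLite from Python's standard library purely for illustration; the `processed_events` table and the `process_once` helper are hypothetical names, and the same idea carries over to any database with unique constraints.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed_events (event_id TEXT PRIMARY KEY)"
    )
    return conn


def process_once(conn: sqlite3.Connection, event_id: str, side_effect) -> bool:
    """Run side_effect only if event_id is new. Returns True if it ran.

    The insert and the side effect share one transaction, so if side_effect
    raises, the event ID is rolled back and a later retry can run it again.
    """
    try:
        with conn:  # commits on success, rolls back on exception
            conn.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)", (event_id,)
            )
            side_effect()
    except sqlite3.IntegrityError:
        return False  # duplicate event ID: already processed, skip
    return True
```

The rollback only covers database writes made on the same connection. Side effects outside the database, such as emails or calls to other services, are not undone, so they need their own duplicate protection.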

Step 6: Decide whether replay is safe

During webhook incidents, teams often want to replay failed events immediately.

That is sometimes correct, but not always.

Replay is safest when:

the event was not processed at all
idempotency protections exist
current application state is understood

If the first attempt partially succeeded or later events already changed the resource state, replay may create new inconsistencies.

See replaying failed webhooks safely.
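
The replay conditions above can be captured as an explicit pre-replay check, so the decision is recorded rather than made from memory mid-incident. This is a deliberately conservative sketch: the `ReplayContext` fields and the `replay_decision` function are hypothetical, and each flag has to be filled in by an engineer who has actually inspected the event and the resource.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ReplayContext:
    partially_processed: bool   # first attempt ran some side effects
    idempotency_in_place: bool  # duplicate protection exists for this event type
    newer_events_applied: bool  # later events already changed the resource
    state_understood: bool      # current application state has been inspected


def replay_decision(ctx: ReplayContext) -> Tuple[bool, List[str]]:
    """Return (safe_to_replay, blockers). Any blocker means: do not replay yet."""
    blockers = []
    if not ctx.state_understood:
        blockers.append("current application state is not understood")
    if not ctx.idempotency_in_place:
        blockers.append("no idempotency protection for this event")
    if ctx.partially_processed:
        blockers.append("first attempt partially succeeded")
    if ctx.newer_events_applied:
        blockers.append("later events already changed the resource state")
    return (not blockers, blockers)
```

Treating every unmet condition as a blocker is stricter than some teams need. A partially processed event may still be replayable when idempotency is solid, but that should be an explicit, reviewed exception rather than the default.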

Quick incident response checklist

Check the provider delivery log
Separate delivery failures from downstream processing failures
Review endpoint response codes and response time
Inspect queue workers, jobs, and application logs
Confirm duplicate-event protection is working
Replay only when current state makes replay safe

The fastest incident response is usually the one that avoids guessing.