Webhook Incident Playbook

When webhook integrations fail, the symptoms usually appear somewhere else first: missing payments, broken provisioning, delayed syncs, or incorrect account state. This playbook gives engineers a practical response workflow for production incidents.

Webhook incidents are confusing because the system that looks broken is often not the system where the failure actually occurred.

A payment provider may still be sending events. Your endpoint may still be reachable. The real problem could be a timeout, a queue worker crash, a signature mismatch, or a silent processing failure deeper in the application.

The goal of an incident playbook is not just to investigate. It is to reduce time-to-diagnosis and help engineers restore correct event flow safely.

Step 1: Confirm whether the provider is still sending events

Start with the provider dashboard. Platforms like Stripe, Paddle, and GitHub all expose webhook delivery logs.

Check:

  • recent delivery attempts
  • HTTP response codes
  • retry attempts
  • response latency

If no recent events appear, the problem is more likely configuration-related (for example, a disabled endpoint or an unsubscribed event type) than runtime-related.
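As a rough illustration of the checks above, the following sketch summarizes a batch of delivery attempts copied out of a provider's delivery log. The `DeliveryAttempt` record and its field names are hypothetical; each provider exposes its own log format, so treat this as a shape to adapt rather than a real API.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeliveryAttempt:
    # Hypothetical record shape; real provider logs use their own field names.
    event_id: str
    status_code: Optional[int]  # None means no HTTP response (timeout or connection error)
    latency_ms: int
    attempt_number: int  # 1 for the first delivery, higher for retries


def summarize_deliveries(attempts):
    """Summarize the four things worth checking in a delivery log."""
    if not attempts:
        # An empty log points toward configuration rather than runtime.
        return {"total": 0, "note": "no recent events: check endpoint configuration"}
    by_class = Counter(
        "no_response" if a.status_code is None else f"{a.status_code // 100}xx"
        for a in attempts
    )
    return {
        "total": len(attempts),
        "by_status_class": dict(by_class),
        "retries": sum(1 for a in attempts if a.attempt_number > 1),
        "max_latency_ms": max(a.latency_ms for a in attempts),
    }
```

A summary dominated by `5xx` or `no_response` entries suggests a delivery problem; a clean run of `2xx` responses moves the investigation downstream.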

Step 2: Determine whether this is a delivery problem or a processing problem

This distinction matters immediately.

  • Delivery problem: the provider cannot successfully POST to your endpoint
  • Processing problem: the endpoint returns success, but downstream jobs or business logic fail later

Engineers often waste time debugging application code when the endpoint never received the event in the first place, or chasing network and delivery issues when the event arrived and failed later.
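The distinction can be reduced to two facts per event: what the provider recorded as the response, and whether your own system has evidence the event was processed. A minimal sketch, assuming you can look up both (the function name and inputs are illustrative, not part of any provider SDK):

```python
from typing import Optional


def classify_failure(provider_status: Optional[int], processed_locally: bool) -> str:
    """Classify one event as a delivery problem, a processing problem, or healthy.

    provider_status: HTTP status the provider logged, or None if it got no response.
    processed_locally: whether your own records show the event was fully handled.
    """
    if provider_status is None or not (200 <= provider_status < 300):
        return "delivery"
    if not processed_locally:
        return "processing"
    return "ok"
```

For example, an event the provider logged as `200` with no matching local record classifies as `"processing"`, which points the investigation at workers and business logic rather than at the network.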

Step 3: Inspect endpoint response behavior

Review what your webhook endpoint is returning.

  • HTTP 4xx responses usually indicate validation or signature verification failures
  • HTTP 5xx responses usually indicate application exceptions
  • timeouts usually indicate slow synchronous processing

If the request exceeds the provider's timeout, the provider treats the delivery as failed and retries it, even when part of the business logic already ran.
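The mapping above can be written down as a small triage helper. This is a sketch only: the default `timeout_seconds` value is an assumed placeholder, since each provider documents its own limit, and the returned strings are just hints that restate the rules of thumb in this step.

```python
from typing import Optional


def likely_cause(status_code: Optional[int], elapsed_seconds: float,
                 timeout_seconds: float = 10.0) -> str:
    """Map an endpoint response to the most likely failure category.

    timeout_seconds is an assumed placeholder; check your provider's documented limit.
    """
    if status_code is None or elapsed_seconds >= timeout_seconds:
        return "timeout: look for slow synchronous processing"
    if 400 <= status_code < 500:
        return "4xx: check validation and signature verification"
    if 500 <= status_code < 600:
        return "5xx: check application exceptions"
    if 200 <= status_code < 300:
        return "2xx: delivered, so check downstream processing"
    return "unexpected status: inspect the raw response"
```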

For deeper timeout-specific debugging, see webhook timeout debugging.

Step 4: Check application and queue workers

Many webhook endpoints acknowledge the request quickly and delegate heavy work to background jobs.

That means the HTTP response alone does not prove the event was processed correctly.

Look for:

  • queue worker crashes
  • stuck jobs
  • database constraint failures
  • dead letter queue growth
  • unexpected retry loops inside the worker system
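To see why a successful HTTP response proves so little, consider a minimal sketch of the acknowledge-then-delegate pattern. It uses Python's in-process `queue` module and a plain list as stand-ins for a real job queue and dead letter queue; `handle_webhook` and `run_worker` are hypothetical names, not any framework's API.

```python
import queue

jobs = queue.Queue()
dead_letters = []  # stand-in for a real dead letter queue


def handle_webhook(event: dict) -> int:
    """Acknowledge immediately; the 200 says nothing about later processing."""
    jobs.put(event)
    return 200


def run_worker(process) -> None:
    """Drain the queue, moving failed jobs to the dead letter list."""
    while not jobs.empty():
        event = jobs.get()
        try:
            process(event)
        except Exception as exc:
            # The provider already saw a 200, so this failure is invisible to it.
            dead_letters.append({"event_id": event["id"], "error": repr(exc)})
        finally:
            jobs.task_done()
```

In this shape, the provider's delivery log shows only successes, while the failure is visible solely through dead letter growth and worker logs. That is why both need monitoring during an incident.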

If you use queue-based processing, also see webhook processing architecture.

Step 5: Verify duplicate protection

Providers retry failed events automatically. If the first attempt partially succeeded, a retry can create duplicate side effects.

Typical symptoms:

  • duplicate subscription activation
  • repeated provisioning jobs
  • multiple receipt emails
  • duplicate rows in local billing tables

If you cannot answer “have we already processed this event ID?”, provider retries and manual replays during recovery can make the incident worse.
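One common way to answer that question is to record each event ID under a uniqueness constraint before running side effects. The sketch below uses Python's built-in `sqlite3` module purely for illustration; the table name `processed_events` and the helper names are assumptions, and a production system would use its own database and schema.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open a store whose primary key rejects duplicate event IDs."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed_events (event_id TEXT PRIMARY KEY)"
    )
    return conn


def process_once(conn: sqlite3.Connection, event_id: str, side_effect) -> bool:
    """Run side_effect only if event_id has not been recorded before.

    Returns True if the event was processed now, False if it was a duplicate.
    """
    try:
        with conn:  # commits on success, rolls back if side_effect raises
            conn.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)", (event_id,)
            )
            side_effect()
    except sqlite3.IntegrityError:
        return False  # primary key violation: already processed
    return True
```

Because the insert and the side effect share a transaction, a failed side effect rolls back the marker and a later retry can process the event again. Note that this only protects database writes; external effects such as emails are not undone by a rollback and need their own safeguards.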

See idempotent webhooks in Laravel.

Step 6: Decide whether replay is safe

During webhook incidents, teams often want to replay failed events immediately.

That is sometimes correct, but not always.

Replay is safest when:

  • the event was not processed at all
  • idempotency protections exist
  • current application state is understood

If the first attempt partially succeeded or later events already changed the resource state, replay may create new inconsistencies.
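The conditions above can be turned into an explicit pre-replay check so the decision is not made from memory under pressure. This is a sketch under stated assumptions: the `ReplayContext` fields are hypothetical inputs that an engineer would fill in by hand or from internal tooling, and the function only restates the criteria in this step.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReplayContext:
    # All fields are hypothetical inputs gathered during the incident.
    partially_processed: bool   # did the first attempt apply any side effects?
    idempotency_enabled: bool   # are duplicate-event protections in place?
    state_understood: bool      # is the current application state known?
    newer_events_applied: bool  # did later events already change the resource?


def replay_blockers(ctx: ReplayContext) -> List[str]:
    """Return reasons to hold off on replay; an empty list means none were found."""
    blockers = []
    if ctx.partially_processed:
        blockers.append("first attempt partially succeeded")
    if not ctx.idempotency_enabled:
        blockers.append("no idempotency protection")
    if not ctx.state_understood:
        blockers.append("current state not understood")
    if ctx.newer_events_applied:
        blockers.append("later events already changed the resource")
    return blockers
```

An empty result does not guarantee a safe replay; it only means none of the known risk conditions were flagged.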

See replaying failed webhooks safely.

Quick incident response checklist

  1. Check the provider delivery log
  2. Separate delivery failures from downstream processing failures
  3. Review endpoint response codes and response time
  4. Inspect queue workers, jobs, and application logs
  5. Confirm duplicate-event protection is working
  6. Replay only when current state makes replay safe

The fastest incident response is usually the one that avoids guessing.