Debugging and Incident Response

Replaying Failed Webhooks Safely

Webhook retries help, but they do not solve every delivery failure. When an event fails permanently, teams often need a replay mechanism. Done carelessly, replay can be more dangerous than the original failure.

Most webhook providers retry failed deliveries automatically. That is useful for temporary outages, slow responses, and short-lived infrastructure problems. But retry windows eventually end.

After that point, the event may still matter. A payment succeeded, a subscription was renewed, or an entitlement changed, but your application never finished processing it correctly.

This is where replay becomes operationally important. The system needs a safe way to attempt processing again without creating duplicate side effects such as double provisioning, repeated emails, or inconsistent billing state.

Why replay is harder than it sounds

Replay is not just “run the same payload again.”

By the time an operator decides to replay an event, the surrounding system may already have changed. The customer record may have been updated manually. A later webhook may already have succeeded. A queue job may have partially completed. A downstream system may already contain some of the intended side effects.

That means replay must assume the event might be:

  • completely unprocessed
  • partially processed
  • already reconciled by another workflow
  • no longer safe to apply blindly

Safe replay starts with accepting that you are operating in an uncertain state.

Store webhook events before heavy processing

Replay is much easier when the original event has been stored durably.

A production-ready webhook pipeline should persist the incoming payload and delivery metadata before doing expensive business logic. At minimum, store:

  • provider name
  • event ID
  • event type
  • received timestamp
  • raw payload
  • processing status
  • last error message
  • replay count

This creates an auditable source of truth for operators and makes controlled replay possible later.

                    
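use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\Schema;

Schema::create('webhook_events', function (Blueprint $table) {
    $table->id();
    $table->string('provider');
    $table->string('event_id');
    $table->string('event_type')->nullable();
    $table->longText('payload'); // raw request body, stored unmodified
    // received, processing, processed, failed, replaying
    $table->string('status')->default('received');
    $table->unsignedInteger('replay_count')->default(0);
    $table->text('last_error')->nullable();
    $table->timestamp('processed_at')->nullable();
    $table->timestamps(); // created_at doubles as the received timestamp

    // One row per provider event; duplicate deliveries hit this constraint.
    $table->unique(['provider', 'event_id']);
});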
                    
                

This table is not just for debugging. It is the foundation of a safe replay workflow.

Separate event receipt from event handling

One of the best replay-friendly patterns is to treat the webhook endpoint as an ingestion layer, not the place where all business logic happens.

A safer request path looks like this:

                    
receive webhook
verify signature
store event
dispatch processing job
return HTTP 200
                    
                

This does three useful things:

  1. It reduces timeout risk.
  2. It gives you a durable event record even if processing fails later.
  3. It lets you replay events from your own application instead of relying solely on the provider's dashboard.

This pattern pairs naturally with queue-based processing. See our guide on webhook queue processing patterns.
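
The request path above can be sketched in plain PHP. This is a minimal illustration under stated assumptions, not a definitive implementation: it assumes the provider signs the raw body with HMAC-SHA256 in hex (providers differ), that the payload carries `id` and `type` fields, and it uses an in-memory array and a callback as hypothetical stand-ins for the `webhook_events` table and the queue. The names `verifySignature` and `ingestWebhook` are invented for this example.

```php
<?php

// Hypothetical sketch: verify, store, dispatch, acknowledge.
// Assumes an HMAC-SHA256 hex signature over the raw body; check your
// provider's documentation for its actual signing scheme.

function verifySignature(string $rawBody, string $signature, string $secret): bool
{
    $expected = hash_hmac('sha256', $rawBody, $secret);

    // hash_equals compares in constant time to avoid timing leaks.
    return hash_equals($expected, $signature);
}

/**
 * @param array    $store    stand-in for the webhook_events table
 * @param callable $dispatch stand-in for queueing a processing job
 * @return int HTTP status code to return to the provider
 */
function ingestWebhook(
    string $provider,
    string $rawBody,
    string $signature,
    string $secret,
    array &$store,
    callable $dispatch
): int {
    if (!verifySignature($rawBody, $signature, $secret)) {
        return 401; // reject before storing anything
    }

    $payload = json_decode($rawBody, true);
    if (!is_array($payload) || !isset($payload['id'])) {
        return 400; // malformed body: nothing usable to store
    }

    $key = $provider . ':' . $payload['id'];

    // Duplicate delivery: already stored, so acknowledge without re-dispatching.
    if (isset($store[$key])) {
        return 200;
    }

    $store[$key] = [
        'provider' => $provider,
        'event_id' => (string) $payload['id'],
        'event_type' => $payload['type'] ?? null,
        'payload' => $rawBody, // keep the raw bytes, not the decoded array
        'status' => 'received',
        'replay_count' => 0,
        'last_error' => null,
    ];

    $dispatch($key); // heavy work happens later, outside the request

    return 200;
}
```

Because the endpoint only verifies, stores, and dispatches, a failure in business logic never costs you the event record itself.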

Replays must be idempotent

Replay only becomes safe when the business logic behind the event is idempotent or state-aware.

Suppose a replayed event tries to:

  • activate a subscription that is already active
  • grant credits that were already granted
  • send a receipt email that was already sent
  • create a downstream record that already exists

If your code treats replay as a fresh event every time, you are turning an operational recovery tool into a production risk.

Two common protections are:

  • event-level idempotency using provider + event_id
  • state-based reconciliation before applying side effects

Our Laravel guide on idempotent webhook handling covers the first layer in more depth.
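
As a rough sketch of how the two protections combine, the handler below checks the event record first and the business state second. It uses plain arrays in place of database rows, and the function name `handleActivation` and the result strings are hypothetical; in a real application the event-level check would typically lean on the unique `provider` + `event_id` record.

```php
<?php

// Hypothetical sketch of both layers, using arrays in place of tables.

/**
 * @param array $event        stored event row, keyed by provider + event_id
 * @param array $subscription current subscription state
 */
function handleActivation(array &$event, array &$subscription): string
{
    // Layer 1: event-level idempotency. An event that already completed
    // is never applied again, no matter how often it is redelivered.
    if ($event['status'] === 'processed') {
        return 'duplicate';
    }

    // Layer 2: state-based reconciliation. Only apply the side effect if
    // the current state does not already reflect it.
    if ($subscription['status'] === 'active') {
        $result = 'already_active'; // no provisioning, no email
    } else {
        $subscription['status'] = 'active';
        $result = 'activated';
    }

    // Mark the event as done only after handling succeeded.
    $event['status'] = 'processed';

    return $result;
}
```

The second layer is what protects a replay of a different event that targets the same subscription: the event is new, but the state change is not.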

Use explicit replay states

A replayable event should move through explicit states, not vague booleans.

A simple state model might look like:

  • received — event stored but not yet processed
  • processing — a worker is actively handling it
  • processed — handling completed successfully
  • failed — handling failed and needs inspection
  • replaying — operator-triggered retry in progress

This matters because operators need to know whether they are replaying a clean failure or racing against a still-running worker.

                    
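// Requires: use Illuminate\Support\Facades\DB;
public function replay(WebhookEvent $event): void
{
    // Claim the event atomically: the UPDATE only matches rows that are
    // still replayable, so two concurrent replays cannot both succeed.
    $claimed = WebhookEvent::query()
        ->whereKey($event->getKey())
        ->whereIn('status', ['failed', 'received'])
        ->update([
            'status' => 'replaying',
            'replay_count' => DB::raw('replay_count + 1'),
        ]);

    if ($claimed === 0) {
        throw new RuntimeException('This event is not in a replayable state.');
    }

    ProcessWebhookEvent::dispatch($event->id);
}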
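
The state model can also be written down as an explicit transition table, which makes illegal moves easy to reject. The sketch below is one possible arrangement: the state names come from the list above, but which transitions are allowed is an assumption for illustration, and `ALLOWED_TRANSITIONS`, `canTransition`, and `transition` are hypothetical names.

```php
<?php

// Hypothetical transition table for the states listed above.
// The allowed moves are an assumption; adjust them to your workflow.
const ALLOWED_TRANSITIONS = [
    'received'   => ['processing', 'replaying'],
    'processing' => ['processed', 'failed'],
    'failed'     => ['replaying'],
    'replaying'  => ['processed', 'failed'],
    'processed'  => [], // terminal: a completed event is never replayed
];

function canTransition(string $from, string $to): bool
{
    // Unknown states have no allowed transitions.
    return in_array($to, ALLOWED_TRANSITIONS[$from] ?? [], true);
}

function transition(array $event, string $to): array
{
    if (!canTransition($event['status'], $to)) {
        throw new RuntimeException(
            sprintf('Cannot move event from "%s" to "%s".', $event['status'], $to)
        );
    }

    $event['status'] = $to;

    return $event;
}
```

With this table, an attempt to replay an event that a worker is still processing fails loudly instead of racing the worker.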
                    
                

Operators need context before replaying

A replay button alone is not enough. The person replaying the event should be able to answer a few questions first:

  1. What failed?
  2. Did the first attempt do any partial work?
  3. Did later events already repair the state?
  4. Is replay the right fix, or should the state be reconciled another way?

In other words, replay should be an informed action, not a reflex.

This is why storing processing logs and error details alongside the event record is so useful. Operators should not need to search three systems before deciding whether replay is safe.
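
One way to capture that context is to wrap the handler so every attempt leaves a trace on the event record. The following is a sketch only: `processWithContext` and the `attempts` field are hypothetical and not part of the schema shown earlier, and the event is a plain array rather than a model.

```php
<?php

// Hypothetical wrapper: runs a handler and records the outcome on the
// event record, so operators can see what failed in one place.

function processWithContext(array $event, callable $handler): array
{
    $event['status'] = 'processing';

    try {
        $handler($event);
        $event['status'] = 'processed';
        $event['last_error'] = null;
    } catch (Throwable $e) {
        $event['status'] = 'failed';
        // Keep the exception class and message; both help triage.
        $event['last_error'] = get_class($e) . ': ' . $e->getMessage();
    }

    // Append-only attempt log: one entry per processing attempt.
    $event['attempts'][] = [
        'status' => $event['status'],
        'error' => $event['last_error'],
    ];

    return $event;
}
```

An operator looking at a failed event can then read the last error and the attempt history before deciding whether a replay is safe.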

Reconcile state when replay is risky

Sometimes replaying the original event is the wrong move.

Consider a subscription workflow where:

  • the original renewal webhook failed
  • a later subscription_updated webhook already arrived
  • the provider API now shows the current subscription state correctly

In cases like this, a reconciliation job may be safer than replaying the old event payload.

Replay is best when the event is still semantically relevant and the handling code is designed to tolerate repeated execution. Reconciliation is better when the canonical truth now lives in the provider’s current state.
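
The choice between the two can be expressed as a small decision helper. This is a deliberately simplified sketch: the field names `resource_id` and `occurred_at`, the function `chooseRecovery`, and the rule itself are assumptions for illustration, not a provider-defined policy.

```php
<?php

// Hypothetical decision helper for choosing a recovery path.

/**
 * @param array $failedEvent ['resource_id' => string, 'occurred_at' => int]
 * @param array $laterEvents already-processed events, same shape
 * @return string 'replay' or 'reconcile'
 */
function chooseRecovery(array $failedEvent, array $laterEvents): string
{
    foreach ($laterEvents as $later) {
        $sameResource = $later['resource_id'] === $failedEvent['resource_id'];
        $isNewer = $later['occurred_at'] > $failedEvent['occurred_at'];

        // A newer event for the same resource was already applied, so the
        // old payload may describe a state that no longer exists.
        if ($sameResource && $isNewer) {
            return 'reconcile';
        }
    }

    return 'replay';
}
```

In the subscription example above, the later subscription_updated webhook would push this helper toward reconciliation, where a job reads the current state from the provider's API instead of trusting the old payload.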

A practical replay checklist

  1. Store every incoming event before heavy processing.
  2. Track status, error details, and replay count.
  3. Make event processing idempotent or state-aware.
  4. Replay only from explicit, safe states.
  5. Give operators enough context to decide whether replay is appropriate.
  6. Prefer reconciliation when the old event is no longer the safest source of truth.

The point of replay is not merely to “try again.” The point is to recover safely.

That difference matters a lot once billing, subscriptions, and customer access are involved.
