Reliability Runbooks

Webhook ingress failures

Question: are webhook requests accepted and routed correctly?

Check:

event.name:webhook_handler_failed
event.name:webhook_non_success_response
span.op:http.server url.path:"/api/webhooks"

Queue callback failures

Question: are messages enqueued and processed successfully?

Check:

event.name:queue_callback_failed OR event.name:queue_message_failed
event.name:queue_ingress_dedup_hit
span.op:queue.process_message

Turn execution failures

Question: are assistant turns timing out or failing due to provider/tool issues?

Check:

event.name:agent_turn_timeout
event.name:agent_turn_failed OR event.name:agent_turn_provider_error
span.op:gen_ai.invoke_agent

Tool failure hotspots

Question: which tools fail most and why?

Check:

event.name:agent_tool_call_failed
event.name:agent_tool_call_invalid_input
span.op:gen_ai.execute_tool
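
Once the events matching the queries above have been exported, the "which tools fail most and why" question comes down to a grouped count. The following is a minimal sketch, not the project's actual tooling: the `tool` and `reason` field names and the sample records are assumptions made for illustration, and only the two event names come from this runbook.

```python
from collections import Counter

# Event names taken from the checks in this section.
TOOL_FAILURE_EVENTS = {"agent_tool_call_failed", "agent_tool_call_invalid_input"}

# Hypothetical export of events. The "tool" and "reason" keys are assumed
# field names, not a documented schema.
events = [
    {"name": "agent_tool_call_failed", "tool": "search", "reason": "timeout"},
    {"name": "agent_tool_call_failed", "tool": "search", "reason": "timeout"},
    {"name": "agent_tool_call_invalid_input", "tool": "calendar", "reason": "missing_field"},
    {"name": "agent_tool_call_failed", "tool": "search", "reason": "rate_limited"},
]


def tool_hotspots(events):
    """Count failures per tool and per (tool, reason) pair."""
    by_tool = Counter()
    by_reason = Counter()
    for event in events:
        if event["name"] not in TOOL_FAILURE_EVENTS:
            continue  # ignore events that are not tool failures
        by_tool[event["tool"]] += 1
        by_reason[(event["tool"], event["reason"])] += 1
    return by_tool, by_reason


by_tool, by_reason = tool_hotspots(events)

# Tools sorted by failure count, most failures first.
for tool, count in by_tool.most_common():
    print(tool, count)
```

The per-tool count answers "which tools fail most"; the `(tool, reason)` count is a starting point for "why".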

Recovery order

1. Confirm the release boundary where the failures started.
2. Triage the symptom with the highest error count first (webhook, queue, turn, tool).
3. Apply a rollback or hotfix.
4. Re-run the health check and the Slack-thread verification.
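
The "highest-error symptom first" step can be sketched as a small ranking over per-event counts. This is an illustration under stated assumptions: the event names are the ones listed in the checks above, but the counts are made up, and `triage_order` is a hypothetical helper rather than part of any existing tooling. `queue_ingress_dedup_hit` is left out on the assumption that a dedup hit is not itself a failure.

```python
# Failure event names grouped by symptom, taken from the checks in this runbook.
SYMPTOM_EVENTS = {
    "webhook": ["webhook_handler_failed", "webhook_non_success_response"],
    "queue": ["queue_callback_failed", "queue_message_failed"],
    "turn": ["agent_turn_timeout", "agent_turn_failed", "agent_turn_provider_error"],
    "tool": ["agent_tool_call_failed", "agent_tool_call_invalid_input"],
}


def triage_order(event_counts):
    """Return (symptom, total_errors) pairs, highest total first."""
    totals = {
        symptom: sum(event_counts.get(name, 0) for name in names)
        for symptom, names in SYMPTOM_EVENTS.items()
    }
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Made-up counts for the incident window; real values would come from
# running each event.name query over the same time range.
sample_counts = {
    "webhook_handler_failed": 4,
    "queue_message_failed": 12,
    "agent_turn_timeout": 7,
    "agent_turn_failed": 1,
    "agent_tool_call_failed": 2,
}

for symptom, total in triage_order(sample_counts):
    print(symptom, total)
```

With these sample counts the queue symptom ranks first, so queue triage would start before turn, webhook, and tool.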

Next step

Use Verify & Troubleshoot for first-response checks, then return to Observability to confirm recovery.