Reliability Runbooks
Webhook ingress failures
Question: are webhook requests accepted and routed correctly?
Check:
event.name:webhook_handler_failed
event.name:webhook_non_success_response
span.op:http.server url.path:"/api/webhooks"
Queue callback failures
Question: are messages enqueued and processed successfully?
Check:
event.name:queue_callback_failed OR event.name:queue_message_failed
event.name:queue_ingress_dedup_hit
span.op:queue.process_message
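The first queue check joins two event names with OR. If you script these checks, a small helper can produce the same form for any set of event names. This is a minimal sketch; `any_event` is a hypothetical helper name, and it assumes only the `event.name:<value>` and `OR` syntax already used in this runbook.

```python
def any_event(*names):
    """Build a query matching any of the given event names.

    Hypothetical helper: it only reproduces the `event.name:a OR event.name:b`
    form that appears in this runbook.
    """
    if not names:
        raise ValueError("at least one event name is required")
    return " OR ".join(f"event.name:{name}" for name in names)


query = any_event("queue_callback_failed", "queue_message_failed")
print(query)
```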
Turn execution failures
Question: are assistant turns timing out or failing due to provider/tool issues?
Check:
event.name:agent_turn_timeout
event.name:agent_turn_failed OR event.name:agent_turn_provider_error
span.op:gen_ai.invoke_agent
Tool failure hotspots
Question: which tools fail most and why?
Check:
event.name:agent_tool_call_failed
event.name:agent_tool_call_invalid_input
span.op:gen_ai.execute_tool
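To answer "which tools fail most and why" from exported results, one option is to count matching events per tool and break each count down by event name. The sketch below assumes events are available as a list of dicts. The `event.name` key mirrors the query field, while `tool_name` is a hypothetical attribute that you would replace with whatever your events actually carry.

```python
from collections import Counter

# Event names taken from the tool failure checks in this runbook.
FAILURE_EVENTS = {"agent_tool_call_failed", "agent_tool_call_invalid_input"}


def tool_failure_hotspots(events, top_n=5):
    """Rank tools by failure count, with a per-event-name breakdown.

    `events` is a list of dicts. "tool_name" is an assumed attribute name;
    adjust it to match your event schema.
    """
    totals = Counter()
    breakdown = {}
    for event in events:
        name = event.get("event.name")
        if name not in FAILURE_EVENTS:
            continue  # skip events that are not tool failures
        tool = event.get("tool_name", "unknown")
        totals[tool] += 1
        breakdown.setdefault(tool, Counter())[name] += 1
    return [
        (tool, count, dict(breakdown[tool]))
        for tool, count in totals.most_common(top_n)
    ]


# Made-up sample data for illustration only.
sample = [
    {"event.name": "agent_tool_call_failed", "tool_name": "search"},
    {"event.name": "agent_tool_call_invalid_input", "tool_name": "search"},
    {"event.name": "agent_tool_call_failed", "tool_name": "fetch"},
    {"event.name": "agent_turn_timeout"},  # ignored: not a tool failure
]
for tool, count, by_event in tool_failure_hotspots(sample):
    print(tool, count, by_event)
```

The breakdown separates invalid-input failures, which usually point at the caller, from execution failures, which usually point at the tool itself.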
Recovery order
- Confirm the release boundary where failures started.
- Triage highest-error symptom first (webhook, queue, turn, tool).
- Apply rollback/hotfix.
- Re-run health checks and Slack-thread verification.
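The first two steps can be sketched as small functions over failure counts you have already collected from the checks above. This is an illustration under stated assumptions, not part of the product: `first_failing_release`, `triage_order`, the `threshold` parameter, and the release labels in the usage lines are all hypothetical.

```python
# Symptom names in the order this runbook lists them.
SYMPTOMS = ("webhook", "queue", "turn", "tool")


def first_failing_release(releases, failures_by_release, threshold=0):
    """Return the first release whose failure count exceeds `threshold`.

    `releases` must be ordered oldest to newest. Returns None when no
    release crosses the threshold.
    """
    for release in releases:
        if failures_by_release.get(release, 0) > threshold:
            return release
    return None


def triage_order(error_counts):
    """List symptoms that have errors, highest count first.

    Ties keep the runbook order because Python's sort is stable,
    including when reverse=True is used.
    """
    active = [s for s in SYMPTOMS if error_counts.get(s, 0) > 0]
    return sorted(active, key=lambda s: error_counts[s], reverse=True)


# Made-up release labels and counts for illustration only.
print(first_failing_release(["1.0", "1.1", "1.2"], {"1.1": 4, "1.2": 9}))
print(triage_order({"webhook": 3, "queue": 10, "turn": 3, "tool": 0}))
```

Raising `threshold` above zero helps when a low background rate of failures exists in every release and only a jump marks the real boundary.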
Next step
Use Verify & Troubleshoot for first-response checks, then return to Observability to confirm recovery.