Reliability Runbooks

Question: are webhook requests accepted and routed correctly?

Check:

  • event.name:webhook_handler_failed
  • event.name:webhook_non_success_response
  • span.op:http.server url.path:"/api/webhooks"

Question: are messages enqueued and processed successfully?

Check:

  • event.name:queue_callback_failed OR event.name:queue_message_failed
  • event.name:queue_ingress_dedup_hit
  • span.op:queue.process_message

Question: are assistant turns timing out or failing due to provider/tool issues?

Check:

  • event.name:agent_turn_timeout
  • event.name:agent_turn_failed OR event.name:agent_turn_provider_error
  • span.op:gen_ai.invoke_agent

Question: which tools fail most and why?

Check:

  • event.name:agent_tool_call_failed
  • event.name:agent_tool_call_invalid_input
  • span.op:gen_ai.execute_tool
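
The checks above can be kept as data so that each question maps to its queries. The following is a minimal sketch, not part of the runbook tooling: the query strings are copied verbatim from the lists above, while the area keys (`webhook`, `queue`, `turn`, `tool`) and the `event_query` helper are hypothetical names introduced for illustration. It assumes the search backend accepts `OR` between `event.name` terms, as the queue and turn checks already do.

```python
# Hypothetical catalog of the runbook checks. Query strings are copied from
# the runbook; the area keys and the helper below are illustrative only.
RUNBOOK_CHECKS = {
    "webhook": [
        "event.name:webhook_handler_failed",
        "event.name:webhook_non_success_response",
        'span.op:http.server url.path:"/api/webhooks"',
    ],
    "queue": [
        "event.name:queue_callback_failed OR event.name:queue_message_failed",
        "event.name:queue_ingress_dedup_hit",
        "span.op:queue.process_message",
    ],
    "turn": [
        "event.name:agent_turn_timeout",
        "event.name:agent_turn_failed OR event.name:agent_turn_provider_error",
        "span.op:gen_ai.invoke_agent",
    ],
    "tool": [
        "event.name:agent_tool_call_failed",
        "event.name:agent_tool_call_invalid_input",
        "span.op:gen_ai.execute_tool",
    ],
}


def event_query(area: str) -> str:
    """Join all event.name terms for one area into a single OR query."""
    terms = []
    for query in RUNBOOK_CHECKS[area]:
        # A catalog entry may already contain several OR-ed terms.
        for term in query.split(" OR "):
            if term.startswith("event.name:"):
                terms.append(term)
    return " OR ".join(terms)


if __name__ == "__main__":
    for area in RUNBOOK_CHECKS:
        print(f"{area}: {event_query(area)}")
```

The span queries are left out of the combined string on purpose: they answer a different question (latency and volume of the operation) and are usually easier to read on their own.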

  1. Confirm the release boundary where failures started.
  2. Triage the highest-error symptom first (webhook, queue, turn, or tool).
  3. Apply a rollback or hotfix.
  4. Re-run the health check and the Slack-thread verification.
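
Step 2 can be sketched as a small selection function. This is an illustration under stated assumptions: the per-area error counts are hypothetical inputs that you would read off the queries for each area, and the tie-break (the order webhook, queue, turn, tool, as listed in step 2) is an assumption rather than a documented rule.

```python
from typing import Dict, Optional

# Order in which step 2 lists the symptoms; used here only as a tie-break
# (an assumption, not a documented rule).
TRIAGE_ORDER = ("webhook", "queue", "turn", "tool")


def first_symptom(error_counts: Dict[str, int]) -> Optional[str]:
    """Return the area with the most errors, or None if no area has errors."""
    best_area = None
    best_count = 0
    for area in TRIAGE_ORDER:
        count = error_counts.get(area, 0)
        # Strict comparison: on a tie, the area listed earlier wins.
        if count > best_count:
            best_area, best_count = area, count
    return best_area


if __name__ == "__main__":
    # Hypothetical counts gathered from the per-area queries.
    counts = {"webhook": 3, "queue": 12, "turn": 12, "tool": 1}
    print(first_symptom(counts))
```

With the sample counts, `queue` and `turn` tie, so the function picks `queue` because it comes first in the assumed order.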

Use Verify & Troubleshoot for first-response checks, then return to Observability to confirm recovery.