Methodology: Audio Arena Static Benchmarks

Audio Arena is a suite of 6 static multi-turn voice agent benchmarks designed to stress-test speech-to-speech and text-to-speech models on realistic customer-facing scenarios. Each benchmark places the model in a different domain with a unique system prompt, tool set, and knowledge base, then runs a fixed sequence of user audio turns that escalate in complexity — from straightforward requests to adversarial traps, chained corrections, and cross-entity state tracking.

Audio Arena started from an earlier 30-turn evaluation created by Kwindla Kramer at Daily (blog post), built around an AI conference scenario. When frontier models began scoring above 90% on nearly every category (medium benchmark results), we discarded the majority of the original turns and rebuilt the evaluation as a suite of harder, domain-diverse benchmarks.

All user audio is pre-recorded with TTS (OpenAI tts-1, alloy voice) so every model receives identical input. Benchmarks are static: the same turns, golden answers, and scoring expectations are used for every run, making results directly comparable across models.
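
For reference, pre-rendering a single turn with the OpenAI Python SDK looks roughly like the sketch below; the helper name, example text, and file path are illustrative, not the benchmark's actual tooling.

```python
# Illustrative only: pre-render one user turn with OpenAI tts-1 (alloy voice)
# so every model under test receives byte-identical audio. The helper name
# and output path are assumptions, not the benchmark's actual tooling.
from pathlib import Path
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def render_turn(text: str, out_path: Path) -> None:
    response = client.audio.speech.create(
        model="tts-1",
        voice="alloy",
        input=text,
        response_format="wav",
    )
    response.write_to_file(out_path)

render_turn("Hi, I'd like to book a cleaning with Dr. Perry next Tuesday.", Path("turn_001.wav"))
```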


The 6 Benchmarks

Appointment Bench (25 turns) — A dental office scheduling assistant for Bayshore Family Dental. The caller books and modifies appointments across two patients with near-identical names (Daniel/Danielle Nolan) and two doctors (Perry/Barry). Tests confusable-name disambiguation, phone number swap and revert, slot-taken error recovery, false memory traps, and cross-entity state tracking.

Assistant Bench (31 turns) — A personal assistant (Atlas) that handles flights, hotels, calendar, reminders, and email. The caller issues dual requests in single turns, switches topics mid-conversation, and circles back to earlier topics. Tests multi-intent segmentation, mid-sentence self-correction, retroactive email correction, cross-reference arithmetic (total cost calculations), correction-chain recall, vague pronoun disambiguation, and audio traps on name spelling, airport codes, dates, and times.

Conversation Bench (75 turns) — A conference assistant for the AI Engineer World's Fair. The original benchmark, rebuilt from scratch with substantially harder turns. Tests adversarial traps, multi-step tool use with long-range memory, cascading error recovery, cancellation flow, ambiguous entities, implicit correction, and distractor injection.

Event Bench (29 turns) — An event planning assistant for Evergreen Events in Austin. The caller books and modifies venue, catering, and guest count details across a single evolving event. Tests mid-sentence self-corrections, vague pronoun resolution, wrong-math correction, multi-request reversals, ambiguous add-on disambiguation, hypothetical reasoning, retroactive date changes, phone number swaps, false memory traps, and cross-entity state tracking.

Grocery Bench (30 turns) — A grocery ordering assistant for Harvest & Hearth Market. The caller builds, modifies, and confirms a multi-item order. Tests multi-item single turns, relative-math quantity changes, conditional additions and removals by price threshold, chained corrections, homophone collisions (flower/flour), fifteen/fifty audio confusion, partial name references, swap operations, retroactive quantity changes, and full order reconciliation.

Product Bench (31 turns) — A laptop sales assistant for TechMart Electronics. The caller compares, selects, and purchases laptops. Tests multi-intent turns, retroactive correction via reported speech, conditional arithmetic chains (discount stacking policy edges), cross-reference counting, 3-step order modification chains, confusable model numbers (X1490/X1940), false memory traps, and out-of-scope deflection.


Scoring Rubric

All benchmarks share the same scoring rubric and judge. Each turn is evaluated on up to 5 dimensions, governed by the rules below (a simplified sketch of the scoring bookkeeping follows the list):

  • Category-aware dimensions — Core dimensions (tool use, instruction following, KB grounding) are scored on every turn. state_tracking and ambiguity_handling are scored only on turns tagged with the relevant categories, so models are not penalized on dimensions that are out of scope for a given turn.
  • Two-phase evaluation — An initial turn-by-turn pass is followed by a realignment pass that detects early or late function calls and cascading effects. If a required call was made a turn early, later turns are not penalized for a "missing" call; if a call was made a turn late, the turn where it was actually made gets credit.
  • Penalty absorption — When a missed tool call has a more specific root cause, the penalty lands on that dimension instead of tool_use_correct. Over-clarification penalties go to ambiguity_handling; forgotten conversational state goes to state_tracking. If the specific dimension is not in scope, the penalty falls back to tool_use_correct. This avoids double-penalizing while ensuring every failure is counted exactly once.
  • Strict separation of instruction_following and tool_use_correct — Failing to call a tool when expected is scored only under tool_use_correct. instruction_following is failed only when the assistant's words and actions contradict each other in a non-tool sense.
  • Turn-taking and leniency — For speech-to-speech runs, a pre-computed turn_taking dimension reflects audio timing (overlaps, interruptions, missing response). When turn_taking fails, the judge is more lenient on instruction_following to account for possible transcription or audio issues.
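
The sketch below illustrates the bookkeeping these rules imply: dimension scoping, penalty absorption, and a realignment pass simplified to a one-turn window. The judge itself is an LLM prompt, so only the dimension names are taken from the rubric; everything else is an assumption made for illustration.

```python
# Illustrative sketch of the scoring bookkeeping. Dimension names follow the
# rubric; the category-equals-dimension naming and the +/-1-turn realignment
# window are simplifications for this sketch.

CORE_DIMENSIONS = {"tool_use_correct", "instruction_following", "kb_grounding"}
TAGGED_DIMENSIONS = {"state_tracking", "ambiguity_handling"}  # scored only on tagged turns

def dimensions_in_scope(turn_categories: set[str]) -> set[str]:
    """Core dimensions on every turn; tagged dimensions only when the turn carries the tag."""
    return CORE_DIMENSIONS | (TAGGED_DIMENSIONS & turn_categories)

def absorb_penalty(root_cause: str, in_scope: set[str]) -> str:
    """Charge a missed tool call to its more specific root cause exactly once;
    fall back to tool_use_correct when that dimension is out of scope."""
    return root_cause if root_cause in in_scope else "tool_use_correct"

def realign(required: dict[int, set[str]], made: dict[int, set[str]]) -> dict[int, set[str]]:
    """Second pass, simplified to a one-turn window: a call made a turn early
    still credits the turn that expected it; a call made a turn late credits
    the turn where it was actually made."""
    credit: dict[int, set[str]] = {t: set() for t in required}
    for t, calls in required.items():
        for call in calls:
            if call in made.get(t, set()) or call in made.get(t - 1, set()):
                credit[t].add(call)
            elif call in made.get(t + 1, set()):
                credit.setdefault(t + 1, set()).add(call)
    return credit
```

In this framing, an over-clarification failure on an ambiguity-tagged turn lands on ambiguity_handling, while the same failure on an untagged turn falls back to tool_use_correct.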

Evaluation Pipeline

All speech-to-speech model runs are orchestrated with Pipecat, an open-source framework for building voice and multimodal AI pipelines. Pipecat manages audio transport, VAD, and model-specific service adapters so each model is exercised through a consistent pipeline: pre-recorded user audio turns are fed in, tool calls are intercepted and resolved, and the model's audio + text output is captured for scoring.
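
Conceptually, each run reduces to the loop sketched below. This is not Pipecat's actual frame-and-pipeline API; the session interface, type names, and response shape are assumptions made for the sketch.

```python
# Conceptual harness loop (illustrative; not Pipecat's actual API).
from dataclasses import dataclass, field
from typing import Callable, Protocol

@dataclass
class ToolCall:
    name: str
    arguments: dict

@dataclass
class ModelResponse:
    text: str = ""
    audio: bytes = b""
    tool_calls: list[ToolCall] = field(default_factory=list)

class VoiceSession(Protocol):
    """Assumed interface over a model-specific service adapter."""
    def send_audio(self, audio: bytes) -> ModelResponse: ...
    def send_tool_results(self, results: dict[str, dict]) -> ModelResponse: ...

def run_benchmark(session: VoiceSession, turns: list[bytes],
                  tools: dict[str, Callable[..., dict]]) -> list[dict]:
    transcript = []
    for audio in turns:                        # identical pre-recorded TTS turns
        response = session.send_audio(audio)
        calls_made: list[str] = []
        while response.tool_calls:             # intercept and resolve tool calls
            results = {c.name: tools[c.name](**c.arguments) for c in response.tool_calls}
            calls_made += [c.name for c in response.tool_calls]
            response = session.send_tool_results(results)
        transcript.append({                    # captured output, handed to the judge
            "assistant_text": response.text,
            "assistant_audio": response.audio,
            "tool_calls": calls_made,
        })
    return transcript
```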

Amazon Nova Sonic

Amazon Nova Sonic is Amazon's native speech-to-speech foundation model, accessed via AWS Bedrock. Nova Sonic connections have an ~8-minute server-side timeout, so our pipeline proactively rotates sessions every couple of minutes, replaying the complete system prompt and full conversation history on each rotation.
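
A simplified sketch of that rotation strategy follows; the connection interface, method names, and the exact rotation interval are illustrative rather than the Bedrock API.

```python
# Simplified sketch of proactive session rotation with context replay.
# The connection object, its methods, and the 2-minute interval are
# assumptions for this sketch, not the Nova Sonic / Bedrock API.
import time

ROTATION_INTERVAL_S = 120  # comfortably under the ~8-minute server-side timeout

class RotatingSession:
    def __init__(self, connect, system_prompt: str):
        self._connect = connect               # callable that opens a fresh model session
        self._system_prompt = system_prompt
        self._history: list[tuple[str, str]] = []
        self._session = None
        self._started_at = 0.0

    def _open(self) -> None:
        if self._session is not None:
            self._session.close()
        self._session = self._connect()
        self._started_at = time.monotonic()
        self._session.send_system_prompt(self._system_prompt)
        for role, content in self._history:   # replay the full conversation so far
            self._session.send_history_item(role, content)

    def send_turn(self, user_audio: bytes, user_text: str):
        if self._session is None or time.monotonic() - self._started_at > ROTATION_INTERVAL_S:
            self._open()                       # rotate before the server can time us out
        reply = self._session.send_audio(user_audio)
        self._history.append(("user", user_text))
        self._history.append(("assistant", reply.text))
        return reply
```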

GLM Realtime

GLM Realtime (Air & Flash) from Zhipu AI enforces a character limit on conversation context, a constraint analogous to Nova Sonic's session timeout. Our pipeline uses the same session-rotation strategy: proactively rotate sessions and replay the system prompt and conversation history to stay within the context window.

GLM Realtime Air is currently excluded from the leaderboard. While the text model responds correctly (response.done events arrive, typically after ~60s latency), the audio generation backend appears to be down on Zhipu's side: the server returns "tgi 请求失败,错误信息: Cannot connect to host tob-glm-4o-audiocall-32b-lb:8080" ("tgi request failed, error message: Cannot connect to host tob-glm-4o-audiocall-32b-lb:8080") hundreds of times per run, producing zero audio output. We will re-enable GLM Realtime Air once the issue is resolved upstream.



Changelog

March 31, 2026 — New models: Gemini 3.1 Flash & GLM Realtime

  • Added Gemini 3.1 Flash (gemini-3.1-flash-live-preview)
  • Added GLM Realtime Air and GLM Realtime Flash from Zhipu AI
  • GLM uses session rotation with context replay to work within its character limit, similar to Nova Sonic

March 27, 2026 — Turns updated & runs rejudged

  • Updated turns across all benchmarks
  • All runs rejudged with updated scoring

March 2026 — Multi-benchmark support

  • Added 5 benchmarks: Appointment, Assistant, Event, Grocery, and Product
  • Tightened existing turns on Conversation Bench
  • Leaderboard now supports switching between benchmarks via tab selector
  • Published real audio files for all benchmarks on HuggingFace

January 2026 — Hard benchmark launch

  • Expanded from 30 to 75 turns with substantially harder questions and named the result Conversation Bench
  • Added ambiguity handling and state tracking as scoring dimensions
  • Introduced two-phase evaluation with realignment pass
  • Added penalty absorption to avoid double-penalizing failures

November 2025 — Initial release

  • 30-turn medium benchmark (AI Engineer World's Fair scenario) created by Kwindla Kramer at Daily
  • 3 scoring dimensions: tool use, instruction following, KB grounding
  • Claude judge with per-turn reasoning