Audio Arena started from an earlier 30-turn multi-turn evaluation benchmark created by Kwindla Kramer at Daily (blog post), built around the AI Engineer World's Fair conference scenario. During development, we found that the original 30 questions were not sufficiently challenging — when we reran the benchmark, most frontier models scored above 90% on nearly every category (medium benchmark results). We discarded the majority of the original turns and rebuilt the benchmark from scratch as a 75-turn conference assistant benchmark.
Of the original 30 turns, only a small number of basic QA and tool-use turns were retained, and even those were revised. The remaining turns are entirely new, with new TTS-generated audio (OpenAI tts-1, alloy voice), new golden answers, and new scoring expectations. The benchmark is 2.5× larger and substantially harder across every category. The new and redesigned turns specifically test:
- Adversarial traps — Authority appeals, plausible hallucinations, subtle prompt injection, near-miss entities, and false recall (replacing the more obvious attacks in the original)
- Multi-step tool use and long-range memory — Conditional logic, parallel chains, implicit requirements, rollbacks, and correct use of information from many turns earlier
- Error recovery — Cascading failures, partial success states, and ambiguous error messages
- Cancellation flow and state tracking — User changes of mind and correct handling of cancelled actions across turns
- Ambiguity handling — Ambiguous entities (e.g., two people with the same name), compound ambiguity, and dependent or contradictory constraints
- Implicit correction — Nested misconceptions, partial truths, and false attributions that the model must correct without over-correcting
- Distractor injection — Buried questions, emotional manipulation, and technical tangents that require focusing on the actual user intent
We also adapted the scoring rubric to account for the following:
- Category-aware dimensions — Core dimensions (tool use, instruction following, KB grounding) are scored on every turn. The dimensions `state_tracking` and `ambiguity_handling` are scored only on turns tagged with the relevant categories (e.g., long-range memory, cancellation flow, implicit correction, ambiguous entity), so we do not penalize models on dimensions that are out of scope for a given turn.
- Two-phase evaluation — An initial turn-by-turn pass is followed by a realignment pass that detects early or late function calls and cascading effects. If a required call was made a turn early, later turns that "expected" that call are not penalized for a "missing" call; if a call was made a turn late, the turn where it was actually made gets credit.
- Penalty absorption — When a missed tool call has a more specific root cause, the penalty lands on that dimension instead of `tool_use_correct`. If a model over-clarified (asked for confirmation when it wasn't needed) and `ambiguity_handling` is in scope, the penalty goes to ambiguity. If a model forgot earlier conversational state (e.g., re-asked the user's name) and `state_tracking` is in scope, the penalty goes to state tracking. If the specific dimension is not in scope for that turn, the penalty falls back to `tool_use_correct`. This avoids double-penalizing while ensuring every failure is counted exactly once.
- Strict separation of `instruction_following` and `tool_use_correct` — Failing to call a tool when expected, or asking for unnecessary confirmation instead of calling, is scored only under `tool_use_correct` (or absorbed by a more specific dimension as above). `instruction_following` is failed only when the assistant's words and actions contradict each other in a non-tool sense (e.g., saying "I'll wait for your confirmation" and then calling the function in the same turn).
- Turn-taking and leniency — For speech-to-speech runs, a pre-computed `turn_taking` dimension reflects audio timing (overlaps, interruptions, missing responses). When `turn_taking` fails, the judge is more lenient on `instruction_following` to account for possible transcription or audio issues.
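The penalty-absorption rule above can be sketched as a small routing function. The dimension names come from the rubric; the root-cause labels and the helper itself are hypothetical, not the benchmark's actual implementation:

```python
# Illustrative sketch of penalty absorption: route a tool-call failure to
# the most specific in-scope dimension, falling back to tool_use_correct.
# Root-cause labels ("over_clarified", "forgot_state") are hypothetical.

def absorb_penalty(root_cause: str, in_scope: set[str]) -> str:
    """Return the scoring dimension that should absorb this failure."""
    preferred = {
        "over_clarified": "ambiguity_handling",  # asked for unneeded confirmation
        "forgot_state": "state_tracking",        # e.g., re-asked the user's name
    }
    dim = preferred.get(root_cause)
    if dim is not None and dim in in_scope:
        return dim
    # No more specific dimension is in scope on this turn: fall back.
    return "tool_use_correct"
```

The key property is that every failure lands on exactly one dimension, so a single mistake never counts twice.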
The benchmark is static: the same 75 user inputs (and corresponding audio) are used for every run, with golden expectations and category tags defined in `benchmarks/_shared/turns.py`, so results are comparable across models and runs.
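A static turn definition of this kind might look like the following sketch. The field names are hypothetical; the real definitions live in `benchmarks/_shared/turns.py`:

```python
# Hypothetical shape of a static turn definition: fixed audio input,
# category tags for dimension scoping, and golden expectations.
from dataclasses import dataclass

@dataclass(frozen=True)
class Turn:
    index: int                                 # 1..75, fixed across runs
    audio_file: str                            # pre-recorded TTS user input
    transcript: str                            # text of the user utterance
    categories: tuple[str, ...]                # e.g. ("cancellation_flow",)
    expected_tool_calls: tuple[str, ...] = ()  # golden function calls
    golden_answer: str = ""                    # reference answer for the judge

example = Turn(
    index=12,
    audio_file="turn_012.wav",
    transcript="Actually, cancel that reservation.",
    categories=("cancellation_flow", "state_tracking"),
    expected_tool_calls=("cancel_reservation",),
)
```

Freezing the dataclass reflects the benchmark's static nature: the same definitions are reused verbatim for every model run.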
Evaluation Pipeline
All speech-to-speech model runs are orchestrated with Pipecat, an open-source framework for building voice and multimodal AI pipelines. Pipecat manages audio transport, VAD, and model-specific service adapters so each model is exercised through a consistent pipeline: the same 75 pre-recorded user audio turns are fed in, tool calls are intercepted and resolved, and the model's audio + text output is captured for scoring.
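Conceptually, the harness loop looks something like this. The `send_audio`/`send_tool_results` adapter API is hypothetical shorthand for the model-specific service adapters, not Pipecat's actual interface:

```python
# Conceptual sketch of the evaluation loop: feed each pre-recorded turn,
# intercept and resolve tool calls, and capture the model's reply.
# The Reply shape and the adapter methods are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Reply:
    text: str = ""
    tool_calls: list = field(default_factory=list)  # [(name, kwargs), ...]

def run_turns(model, turns, tools):
    """Drive one model through the fixed turn list, logging its outputs."""
    log = []
    for turn in turns:
        reply = model.send_audio(turn["audio_file"])
        # Resolve tool calls and feed results back until the turn settles.
        while reply.tool_calls:
            results = {name: tools[name](**args) for name, args in reply.tool_calls}
            reply = model.send_tool_results(results)
        log.append((turn["index"], reply.text))
    return log
```

Because the same turn list and tool resolvers are shared across adapters, every model sees an identical conversation regardless of its underlying API.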
Amazon Nova Sonic
Amazon Nova Sonic is Amazon's native speech-to-speech foundation model, accessed via AWS Bedrock. Nova Sonic sessions have an ~8-minute limit, so our pipeline performs session rotation mid-conversation. On each rotation we provide the complete system prompt, all tool calls and their results, and the entire conversation history. If Nova Sonic rejects the context due to size limits, we fall back progressively: first to a tool-priority mode that preserves tool-call state and fills the remaining budget with recent messages, then to a zero-context mode (system prompt only) as a last resort. Nova Sonic is evaluated with the same scoring rubric and judge as every other model on the leaderboard.
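The progressive fallback on rotation can be sketched as trying successively smaller contexts until one is accepted. The helper names, the recent-message budget, and the `None`-as-rejection convention are all hypothetical stand-ins for the actual Bedrock error handling:

```python
# Sketch of progressive context fallback on Nova Sonic session rotation.
# start_session is a hypothetical callable that returns None when the
# provider rejects the context for size; the budget of 10 recent
# messages is an illustrative value, not the real one.

def rotate_session(start_session, system_prompt, tool_history, messages):
    """Open a fresh session, degrading context until it is accepted."""
    attempts = [
        # 1. Full context: system prompt + all tool state + full history.
        (system_prompt, tool_history, messages),
        # 2. Tool-priority: keep tool-call state, fill with recent messages.
        (system_prompt, tool_history, messages[-10:]),
        # 3. Zero-context last resort: system prompt only.
        (system_prompt, [], []),
    ]
    for prompt, tools, msgs in attempts:
        session = start_session(prompt, tools, msgs)
        if session is not None:
            return session
    raise RuntimeError("even zero-context rotation was rejected")
```

The ordering encodes the priority described above: tool-call state is the last context to be sacrificed, since later turns depend on it.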
Links & Resources
- Audio Arena GitHub — evaluation code, model runners, and scoring pipeline
- HuggingFace Dataset — benchmark data, turn definitions, and audio files
- Original 30-turn benchmark by Kwindla Kramer — the foundation this hard benchmark builds on
- Benchmarking LLMs for Voice Agent Use Cases — Daily blog post describing the original evaluation
- Daily — the real-time voice and video API used to power the evaluation pipeline