τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

Summary

Annotations

Annotation

« We propose τ-bench, a benchmark emulating dynamic conversations between a user (simulated by language models) and a language agent provided with domain-specific API tools and policy guidelines. »()
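
A minimal sketch of the interaction loop this quote describes, assuming a turn-based alternation between an LM-simulated user, the agent, and a tool-backed environment. All names here (`Action`, `run_episode`, `agent.act`, `user_sim.respond`, `env.call_tool`) are hypothetical stand-ins, not the paper's actual API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Action:
    """Either a user-facing message or a tool call (hypothetical shape)."""
    text: Optional[str] = None
    tool_name: Optional[str] = None
    tool_args: Optional[dict] = None

    @property
    def is_tool_call(self) -> bool:
        return self.tool_name is not None

def run_episode(agent, user_sim, env, max_turns=30):
    """Alternate user and agent turns until the simulated user ends the chat."""
    history = [{"role": "user", "content": user_sim.first_message()}]
    for _ in range(max_turns):
        action = agent.act(history)  # agent chooses: tool call or user-facing reply
        if action.is_tool_call:
            # Tool calls hit domain APIs (e.g., an order-cancellation endpoint)
            # and may write to the environment's database.
            observation = env.call_tool(action.tool_name, action.tool_args)
            history.append({"role": "tool", "content": observation})
        else:
            history.append({"role": "assistant", "content": action.text})
            reply = user_sim.respond(history)  # user turn, simulated by an LM
            if reply is None:  # simulated user ends the conversation
                break
            history.append({"role": "user", "content": reply})
    return env.db_state(), history  # the final DB state is what gets evaluated
```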

Annotation

« We also propose a new metric (pass^k) to evaluate the reliability of agent behavior over multiple trials. »()

Annotation

« Our evaluation scheme compares the database state at the end of each episode with the ground truth expected state. »(2)
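
Since the reward is defined over database state rather than dialogue text, the check can be as simple as a canonical-serialization comparison. A sketch under that assumption (helper names are mine, not from the paper):

```python
import json

def canonical(db: dict) -> str:
    # Sort keys so two logically identical states serialize identically.
    return json.dumps(db, sort_keys=True)

def episode_reward(final_db: dict, expected_db: dict) -> int:
    # 1 iff the agent's writes left the DB in exactly the annotated goal state.
    return int(canonical(final_db) == canonical(expected_db))
```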

Annotation

« We also introduce the metric of pass^k, which measures the consistency and robustness of the agent across k i.i.d. trials. »(2)
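
Given n ≥ k i.i.d. trials per task with c successes, the natural unbiased estimator for pass^k averages C(c, k) / C(n, k) over tasks: the "all k must succeed" analogue of the standard pass@k estimator. A sketch, with the function name and input layout my own:

```python
from math import comb

def pass_hat_k(trials: list[tuple[int, int]], k: int) -> float:
    """Estimate pass^k from (c, n) pairs: c successes out of n trials per task.

    C(c, k) / C(n, k) is the probability that k trials drawn without
    replacement from the n observed trials are all successes.
    """
    assert all(n >= k for _, n in trials)
    return sum(comb(c, k) / comb(n, k) for c, n in trials) / len(trials)

# Example: a task solved in 5 of 8 trials contributes C(5,2)/C(8,2) = 10/28
# to pass^2, while a task solved 8/8 contributes 1.0.
print(pass_hat_k([(5, 8), (8, 8)], k=2))  # ≈ 0.679
```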

Annotation

« For instance, even state-of-the-art LMs like gpt-4o achieve low task success rates (pass^1) using function calling (∼61% on τ-retail and ∼35% on τ-airline). With increasing k, the chance of consistently solving a task drops rapidly, to as low as ∼25% for pass^8 on τ-retail for the same model »(2)
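
A quick sanity check on these numbers (my arithmetic, not from the paper): if every task had the same per-trial success probability $p = 0.61$, independence would give $\text{pass}^k = p^k$, so $\text{pass}^8 \approx 0.61^8 \approx 0.02$. The reported ∼25% at $k = 8$ is an order of magnitude higher, which implies success is heterogeneous across tasks: some tasks are solved in nearly every trial while others fail in nearly every trial.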

Annotation

« For example, if the preferred payment method is not specified, the user might answer differently and cause the final database to be different across trials »(5)

Annotation

« we run each τ-retail task with > 40 gpt-4-turbo trials and check all tasks with zero or low success rates »(5)

Annotation

« so the cost is mainly due to long system prompt (domain policy + function definitions). »(7)
