τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

Summary

Annotations

Annotation

« We propose τ-bench, a benchmark emulating dynamic conversations between a user (simulated by language models) and a language agent provided with domain-specific API tools and policy guidelines. »()
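
A minimal sketch of the interaction loop this quote describes, assuming a turn-based alternation between an LM-simulated user, the agent, and a tool-backed environment. All names here (`Action`, `run_episode`, `agent.act`, `user_sim.respond`, `env.call_tool`) are hypothetical stand-ins, not the paper's actual API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Action:
    """Either a user-facing message or a tool call (hypothetical shape)."""
    text: Optional[str] = None
    tool_name: Optional[str] = None
    tool_args: Optional[dict] = None

    @property
    def is_tool_call(self) -> bool:
        return self.tool_name is not None

def run_episode(agent, user_sim, env, max_turns=30):
    """Alternate user and agent turns until the simulated user ends the chat."""
    history = [{"role": "user", "content": user_sim.first_message()}]
    for _ in range(max_turns):
        action = agent.act(history)  # agent chooses: tool call or user-facing reply
        if action.is_tool_call:
            # Tool calls hit domain APIs (e.g., an order-cancellation endpoint)
            # and may write to the environment's database.
            observation = env.call_tool(action.tool_name, action.tool_args)
            history.append({"role": "tool", "content": observation})
        else:
            history.append({"role": "assistant", "content": action.text})
            reply = user_sim.respond(history)  # user turn, simulated by an LM
            if reply is None:  # simulated user ends the conversation
                break
            history.append({"role": "user", "content": reply})
    return env.db_state(), history  # the final DB state is what gets evaluated
```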

Annotation

« We also propose a new metric (pass^k) to evaluate the reliability of agent behavior over multiple trials. »()

Annotation

« Our evaluation scheme compares the database state at the end of each episode with the ground truth expected state. »(2)
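
Since the reward is defined over database state rather than dialogue text, the check can be as simple as a canonical-serialization comparison. A sketch under that assumption (helper names are mine, not from the paper):

```python
import json

def canonical(db: dict) -> str:
    # Sort keys so two logically identical states serialize identically.
    return json.dumps(db, sort_keys=True)

def episode_reward(final_db: dict, expected_db: dict) -> int:
    # 1 iff the agent's writes left the DB in exactly the annotated goal state.
    return int(canonical(final_db) == canonical(expected_db))
```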

Annotation

« We also introduce the metric of pass^k, which measures the consistency and robustness of the agent across k i.i.d. trials. »(2)
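
Given n ≥ k i.i.d. trials per task with c successes, the natural unbiased estimator for pass^k averages C(c, k) / C(n, k) over tasks: the "all k must succeed" analogue of the standard pass@k estimator. A sketch, with the function name and input layout my own:

```python
from math import comb

def pass_hat_k(trials: list[tuple[int, int]], k: int) -> float:
    """Estimate pass^k from (c, n) pairs: c successes out of n trials per task.

    C(c, k) / C(n, k) is the probability that k trials drawn without
    replacement from the n observed trials are all successes.
    """
    assert all(n >= k for _, n in trials)
    return sum(comb(c, k) / comb(n, k) for c, n in trials) / len(trials)

# Example: a task solved in 5 of 8 trials contributes C(5,2)/C(8,2) = 10/28
# to pass^2, while a task solved 8/8 contributes 1.0.
print(pass_hat_k([(5, 8), (8, 8)], k=2))  # ≈ 0.679
```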

Annotation

« For instance, even state-of-the-art LMs like gpt-4o achieve low task success rates (pass^1) using function calling (∼61% on τ-retail and ∼35% on τ-airline). With increasing k, the chance of consistently solving a task drops rapidly, to as low as ∼25% for pass^8 on τ-retail for the same model »(2)
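
A quick sanity check on these numbers (my arithmetic, not from the paper): if every task had the same per-trial success probability $p = 0.61$, independence would give $\text{pass}^k = p^k$, so $\text{pass}^8 \approx 0.61^8 \approx 0.02$. The reported ∼25% at $k = 8$ is an order of magnitude higher, which implies success is heterogeneous across tasks: some tasks are solved in nearly every trial while others fail in nearly every trial.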

Annotation

« For example, if the preferred payment method is not specified, the user might answer differently and cause the final database to be different across trials »(5)

Annotation

« we run each τ-retail task with > 40 gpt-4-turbo trials and check all tasks with zero or low success rates »(5)

Annotation

« so the cost is mainly due to long system prompt (domain policy + function definitions). »(7)
