$τ$-bench - A Benchmark for Tool-Agent-User Interaction in Real-World Domains
Summary
- This paper provides:
- A benchmark dataset covering two domains (retail and airline), each consisting of a domain policy, API definitions, and user-agent-tool conversations, with positive (reward 1) and negative (reward 0) trajectories
- The conversations were produced by using an LLM to simulate the user
- First, the authors wrote user instructions that are given to the LLM so that it can simulate a real human
- The API definitions were produced through a mix of manual writing and LLM generation
- The database records were first written manually and then extended synthetically, using the manual records as examples
- A new metric (pass^k): it checks whether all k trials of a task succeed, i.e. whether the agent is consistently right, whereas pass@k checks whether at least one of the k trials is correct
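The contrast between pass^k and pass@k can be made concrete with a small sketch. It assumes each task is run n times with c successes and uses the usual combinatorial estimators: pass^k as C(c, k) / C(n, k) and pass@k as 1 - C(n - c, k) / C(n, k), averaged over tasks. The function names are hypothetical and are not taken from the paper's code.

```python
from math import comb


def pass_hat_k(num_trials: int, num_successes: int, k: int) -> float:
    """Estimate the chance that k trials drawn from the n observed ones ALL succeed."""
    if not 1 <= k <= num_trials:
        raise ValueError("k must be between 1 and num_trials")
    # comb(c, k) is 0 when k > c, so a task with fewer than k successes scores 0.
    return comb(num_successes, k) / comb(num_trials, k)


def pass_at_k(num_trials: int, num_successes: int, k: int) -> float:
    """Estimate the chance that AT LEAST ONE of k drawn trials succeeds."""
    if not 1 <= k <= num_trials:
        raise ValueError("k must be between 1 and num_trials")
    failures = num_trials - num_successes
    return 1.0 - comb(failures, k) / comb(num_trials, k)


def benchmark_pass_hat_k(successes_per_task: list[int], num_trials: int, k: int) -> float:
    """Average pass^k over tasks (assumes every task was run num_trials times)."""
    scores = [pass_hat_k(num_trials, c, k) for c in successes_per_task]
    return sum(scores) / len(scores)
```

For a task solved in 4 of 8 trials, `pass_hat_k(8, 4, 2)` is 6/28 (about 0.21) while `pass_at_k(8, 4, 2)` is 22/28 (about 0.79), which shows why pass^k falls as k grows while pass@k rises.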
Annotations
« Our evaluation scheme compares the database state at the end of each episode with the ground truth expected state. »(2)
« We also introduce the metric of pass^k, which measures the consistency and robustness of the agent across k i.i.d. trials. »(2)
« For instance, even state-of-the-art LMs like gpt-4o achieve low task success rates (pass^1) using function calling (∼61% on τ-retail and ∼35% on τ-airline). With increasing k, the chance of consistently solving a task drops rapidly, to as low as ∼25% for pass^8 on τ-retail for the same model »(2)
« For example, if the preferred payment method is not specified, the user might answer differently and cause the final database to be different across trials »(5)
« we run each τ-retail task with > 40 gpt-4-turbo trials and check all tasks with zero or low success rates) »(5)
« so the cost is mainly due to long system prompt (domain policy + function definitions). »(7)
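The end-of-episode database comparison quoted in the first annotation can be illustrated with a minimal sketch. It assumes the database is a JSON-serializable dict and that the reward is 1 only on an exact match; `state_fingerprint`, `episode_reward`, and the sample records are hypothetical, and the paper's actual reward may include further checks.

```python
import hashlib
import json


def state_fingerprint(database: dict) -> str:
    """Hash a canonical JSON dump so that key order does not affect the result."""
    canonical = json.dumps(database, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def episode_reward(final_db: dict, expected_db: dict) -> int:
    """Return 1 if the final state equals the expected state, else 0."""
    return int(state_fingerprint(final_db) == state_fingerprint(expected_db))


# Hypothetical records: the same order refunded to different payment methods.
expected = {"orders": {"W1": {"status": "returned", "refund_to": "credit_card"}}}
matching = {"orders": {"W1": {"refund_to": "credit_card", "status": "returned"}}}
diverged = {"orders": {"W1": {"status": "returned", "refund_to": "gift_card"}}}
```

Here `episode_reward(matching, expected)` is 1 and `episode_reward(diverged, expected)` is 0. The second case mirrors the annotation about an unspecified payment method: a different user answer changes the final database and turns an otherwise valid episode into a failure.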
Date : 06-17-2024
Authors : Shunyu Yao, Noah Shinn, Pedram Razavi, Karthik Narasimhan
Paper Link : http://arxiv.org/abs/2406.12045
Zotero Link: Full Text PDF
Tags : #Computer-Science---Artificial-Intelligence, #Computer-Science---Computation-and-Language
Citation :