OpenPI-C
Summary
What did authors try to accomplish?
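In this paper, the authors try to fix the original OpenPI benchmark: they clean the dataset (yielding OpenPI-C), replace the evaluation metric with a cluster-based one that does not reward repetition, and strengthen the seq2seq baseline with temporal dependency and entity awareness.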
What are the key elements of the approach?
What can I use for myself?
What are the other references I can read next?
3+ Most Important Things
1+ Deficiencies
3+ New Ideas
Annotations
« However, we identify issues with the dataset quality and evaluation metric »(1)
« procedure level, step level and state change level respectively »(1)
« For the evaluation metric, we propose a cluster-based metric to fix the original metric’s preference for repetition. »(1)
« temporal dependency and entity awareness. »(1)
« OpenPI dataset (Tandon et al., 2020) is, to our knowledge, the first and only dataset for this task »(1)
« non-procedural documents »(1)
« out-of-order steps »(1)
« ambiguous state changes »(1)
« ∼32% of the state changes cannot be reliably inferred from the input, »(1)
« allows matching multiple predicted state changes to a single gold state change, inadvertently inflating the score when the model produces repetitive outputs »(1)
« propose a cluster-based metric that automatically merges repetitive state changes and enforces 1-to-1 assignment between clusters. »(1)
« entity memory »(1)
« This requires the model to jointly identify involved entities and predict their state changes »(1)
« For input, we find that ∼15% input texts are not procedure texts because the steps do not show any temporal continuity »(2)
« In valid procedure text inputs, ∼7.4% steps are invalid steps in the context of the procedure texts »(2)
« For output, ∼32% state changes cannot be reliably inferred from the input »(2)
« (1) filtering out non-procedure input texts, »(2)
« (2) filtering out invalid steps, »(2)
« (3) filtering out unreliable state changes »(2)
« We cluster the predicted set and the gold-standard set respectively based on Sentence-BERT (Reimers and Gurevych, 2019a) embedding similarity. »(3)
« which enforces one-to-one mapping »(3)
« Eventually, we use the assignment to calculate precision, recall and F1 scores »(3)
« We did a binary classification on each output state change to classify whether it contains hallucinations or not »(5)
« the model trained on OpenPI produced 749 hallucinated state changes while the model trained on OpenPI-C produced 393 (47.53% less). »(5)
« For future work, we consider using external resources such as ConceptNet (Amigó et al., 2009) to assist entity prediction »(5)
« For output clustering, we use stsb-distilroberta-base-v2 model provided by sentence-transformers package »(6)
« We set the threshold th as 0.7 »(6)
« Stage 1: »(7)
« Stage 2: »(7)
« Stage 3: »(7)
« We first remove steps with no state changes, and then remove procedure texts with < 3 steps. »(7)
« In decoding, we use beam search with beam size of 4 »(7)
« best decoding strategy is found by manual tuning on the original OpenPI dataset »(7)
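Sketch for myself: the cluster-based metric from the annotations above (cluster predictions and gold separately, enforce a 1-to-1 assignment between clusters, then compute precision, recall and F1) could look roughly like the code below. This is not the authors' implementation. Assumptions: a token-overlap (Jaccard) score stands in for Sentence-BERT embedding similarity so the example runs without downloading a model; the clustering is a simple greedy pass; the 1-to-1 assignment is found by brute force, which only works for small sets; and precision/recall are taken as matched similarity divided by the number of clusters. All function names are hypothetical. Only the threshold 0.7 comes from the paper.

```python
from itertools import permutations


def jaccard(a, b):
    """Token-overlap similarity. ASSUMPTION: stand-in for the
    Sentence-BERT embedding similarity used in the paper."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def cluster(items, sim=jaccard, th=0.7):
    """Greedy clustering (ASSUMPTION): an item joins the first cluster
    whose first member is at least `th` similar, else starts a new one."""
    clusters = []
    for item in items:
        for c in clusters:
            if sim(item, c[0]) >= th:
                c.append(item)
                break
        else:
            clusters.append([item])
    return clusters


def cluster_prf(pred, gold, sim=jaccard, th=0.7):
    """Precision, recall, F1 under a 1-to-1 matching of clusters."""
    pc, gc = cluster(pred, sim, th), cluster(gold, sim, th)
    if not pc or not gc:
        return 0.0, 0.0, 0.0
    # score[i][j]: best member-pair similarity between pred cluster i
    # and gold cluster j.
    score = [[max(sim(a, b) for a in p for b in g) for g in gc] for p in pc]
    # Brute-force search for the 1-to-1 assignment with the highest total
    # similarity (exponential cost, fine only for small sets).
    if len(pc) <= len(gc):
        total = max(sum(score[i][j] for i, j in enumerate(perm))
                    for perm in permutations(range(len(gc)), len(pc)))
    else:
        total = max(sum(score[i][j] for j, i in enumerate(perm))
                    for perm in permutations(range(len(pc)), len(gc)))
    precision, recall = total / len(pc), total / len(gc)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# A model that repeats one correct state change three times.
pred = ["knife was clean before and dirty afterwards"] * 3
gold = ["knife was clean before and dirty afterwards",
        "pan was cold before and hot afterwards"]
p, r, f1 = cluster_prf(pred, gold)
```

Here the three repeated predictions collapse into one cluster, so the repetition earns nothing extra: precision is 1.0 but recall stays at 0.5 because the second gold state change is never matched. For realistic set sizes, an assignment solver would replace the brute-force search.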
Date : 06-01-2023
Authors : Xueqing Wu, Sha Li, Heng Ji
Paper Link : https://arxiv.org/abs/2306.00887v2
Zotero Link: OpenPI-C.pdf
Tags : ##now, ##toread
Citation : @misc{Wu_Li_Ji_2023, title={OpenPI-C: A Better Benchmark and Stronger Baseline for Open-Vocabulary State Tracking}, url={https://arxiv.org/abs/2306.00887v2}, abstractNote={Open-vocabulary state tracking is a more practical version of state tracking that aims to track state changes of entities throughout a process without restricting the state space and entity space. OpenPI is to date the only dataset annotated for open-vocabulary state tracking. However, we identify issues with the dataset quality and evaluation metric. For the dataset, we categorize 3 types of problems on the procedure level, step level and state change level respectively, and build a clean dataset OpenPI-C using multiple rounds of human judgment. For the evaluation metric, we propose a cluster-based metric to fix the original metric’s preference for repetition. Model-wise, we enhance the seq2seq generation baseline by reinstating two key properties for state tracking: temporal dependency and entity awareness. The state of the world after an action is inherently dependent on the previous state. We model this dependency through a dynamic memory bank and allow the model to attend to the memory slots during decoding. On the other hand, the state of the world is naturally a union of the states of involved entities. Since the entities are unknown in the open-vocabulary setting, we propose a two-stage model that refines the state change prediction conditioned on entities predicted from the first stage. Empirical results show the effectiveness of our proposed model especially on the cluster-based metric. The code and data are released at https://github.com/shirley-wu/openpi-c}, journal={arXiv.org}, author={Wu, Xueqing and Li, Sha and Ji, Heng}, year={2023}, month=jun, language={en} }