OpenPI-C

Summary

What did authors try to accomplish?

In this paper, the authors try to fix the original OpenPI benchmark for open-vocabulary state tracking, whose dataset quality and evaluation metric both have issues: they release a cleaned dataset (OpenPI-C), a cluster-based metric that is robust to repetition, and a stronger baseline model.

What are the key elements of the approach?

(1) Data cleaning of OpenPI at three levels: filtering non-procedure input texts (procedure level), invalid steps (step level), and unreliable state changes (state change level). (2) A cluster-based evaluation metric that merges repetitive state changes and enforces one-to-one assignment between clusters. (3) A stronger baseline with an entity memory for temporal dependency and entity awareness.

What can I use for myself?

3+ Most Important Things

1+ Deficiencies

3+ New Ideas

Annotations

Annotation

« However, we identify issues with the dataset quality and evaluation metric »(1)

Annotation

« procedure level, step level and state change level respectively »(1)

Annotation

« For the evaluation metric, we propose a cluster-based metric to fix the original metric’s preference for repetition. »(1)

Annotation

« temporal dependency and entity awareness. »(1)

Annotation

« OpenPI dataset (Tandon et al., 2020) is, to our knowledge, the first and only dataset for this task »(1)

Annotation

« non-procedural documents »(1)

Annotation

« out-of-order steps »(1)

Annotation

« ambiguous state changes »(1)

Annotation

« ∼32% of the state changes cannot be reliably inferred from the input, »(1)

Annotation

« allows matching multiple predicted state changes to a single gold state change, inadvertently inflating the score when the model produces repetitive outputs »(1)

Annotation

« propose a cluster-based metric that automatically merges repetitive state changes and enforces 1-to-1 assignment between clusters. »(1)

Annotation

« entity memory »(1)

Annotation

« This requires the model to jointly identify involved entities and predict their state changes »(1)
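
The note does not spell out the entity memory architecture, but the idea lends itself to a toy sketch. Everything below (the class, the read/write interface, the detach) is an illustrative assumption, not the paper's design:

    import torch

    class EntityMemory:
        # Toy sketch: cache one vector per entity and reuse it when the
        # entity reappears in a later step, so predictions for "pan" at
        # step 4 can depend on what happened to "pan" at step 1.
        def __init__(self):
            self.memory = {}  # entity name -> last representation

        def read(self, entity, default):
            return self.memory.get(entity, default)

        def write(self, entity, vector):
            self.memory[entity] = vector.detach()

    mem = EntityMemory()
    h0 = torch.zeros(8)
    h_pan = mem.read("pan", default=h0)  # first mention: fallback
    mem.write("pan", torch.randn(8))     # store after a step
    h_pan = mem.read("pan", default=h0)  # later step: cached state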

Annotation

« For input, we find that ∼15% input texts are not procedure texts because the steps do not show any temporal continuity »(2)

Annotation

« In valid procedure text inputs, ∼7.4% steps are invalid steps in the context of the procedure texts »(2)

Annotation

« For output, ∼32% state changes cannot be reliably inferred from the input »(2)

Annotation

« (1) filtering out non-procedure input texts, »(2)

Annotation

« (2) filtering out invalid steps, »(2)

Annotation

« (3) filtering out unreliable state changes »(2)

Annotation

« We cluster the predicted set and the gold-standard set respectively based on Sentence-BERT (Reimers and Gurevych, 2019a) embedding similarity. »(3)
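
A minimal sketch of how this clustering could be implemented. The note only fixes the embedding model family and that a similarity threshold is used; grouping state changes as connected components of a cosine-similarity graph, and the function name, are my assumptions:

    from sentence_transformers import SentenceTransformer, util

    def cluster_state_changes(state_changes, model, th=0.7):
        # Embed every state change and link two of them whenever their
        # cosine similarity reaches the threshold `th`.
        emb = model.encode(state_changes, convert_to_tensor=True)
        sim = util.cos_sim(emb, emb)
        n = len(state_changes)
        cluster_id = [-1] * n
        next_id = 0
        for i in range(n):
            if cluster_id[i] != -1:
                continue
            # BFS collects one connected component = one cluster.
            cluster_id[i] = next_id
            frontier = [i]
            while frontier:
                u = frontier.pop()
                for v in range(n):
                    if cluster_id[v] == -1 and sim[u][v] >= th:
                        cluster_id[v] = next_id
                        frontier.append(v)
            next_id += 1
        clusters = [[] for _ in range(next_id)]
        for i, c in enumerate(cluster_id):
            clusters[c].append(state_changes[i])
        return clusters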

Annotation

« which enforces one-to-one mapping »(3)

Annotation

« Eventually, we use the assignment to calculate precision, recall and F1 scores »(3)
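
Combining the two annotations above, a hedged sketch of the scoring step: clusters are matched one-to-one with the Hungarian algorithm, then precision and recall are computed over the matched clusters. The cluster-similarity function and the rule that a match counts only above the threshold are my assumptions:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def cluster_f1(pred_clusters, gold_clusters, sim_fn, th=0.7):
        # sim_fn(p, g) scores two clusters, e.g. mean pairwise
        # embedding similarity (assumed; not fixed by the note).
        if not pred_clusters or not gold_clusters:
            return 0.0, 0.0, 0.0
        scores = np.array([[sim_fn(p, g) for g in gold_clusters]
                           for p in pred_clusters])
        # Hungarian algorithm: one-to-one, maximizing total similarity.
        rows, cols = linear_sum_assignment(-scores)
        hits = sum(scores[r, c] >= th for r, c in zip(rows, cols))
        precision = hits / len(pred_clusters)
        recall = hits / len(gold_clusters)
        f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
        return precision, recall, f1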

Annotation

« We did a binary classification on each output state change to classify whether it contains hallucinations or not »(5)

Annotation

« the model trained on OpenPI produced 749 hallucinated state changes while the model trained on OpenPI-C produced 393 (47.53% less). »(5)
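
Sanity check on the arithmetic: (749 − 393) / 749 = 356 / 749 ≈ 0.4753, i.e. the reported 47.53% reduction.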

Annotation

« For future work, we consider using external resources such as ConceptNet (Speer et al., 2017) to assist entity prediction »(5)

Annotation

« For output clustering, we use stsb-distilroberta-base-v2 model provided by sentence-transformers package »(6)

Annotation

« We set the threshold th as 0.7 »(6)
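
Plugging this configuration into the clustering sketch above (hypothetical usage; the variable names are placeholders):

    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("stsb-distilroberta-base-v2")
    pred_clusters = cluster_state_changes(pred_state_changes, model, th=0.7)
    gold_clusters = cluster_state_changes(gold_state_changes, model, th=0.7)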

Annotation

« Stage 1: »(7)

Annotation

« Stage 2: »(7)

Annotation

« Stage 3: »(7)

Annotation

« We first remove steps with no state changes, and then remove procedure texts with < 3 steps. »(7)
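
These two heuristics translate directly into code; the (step, state-changes) pair representation below is my assumption, not the paper's data format:

    def clean_procedure(steps):
        # steps: list of (step_text, state_changes) pairs for one
        # procedure text. Drop steps with no state changes first,
        # then drop the procedure if fewer than 3 steps remain.
        kept = [(step, changes) for step, changes in steps if changes]
        return kept if len(kept) >= 3 else None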

Annotation

« In decoding, we use beam search with beam size of 4 »(7)
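
With a Hugging Face model this is a one-flag setting. The note does not say which pretrained generator the paper fine-tunes, so bart-base below is only a stand-in to make the snippet runnable:

    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
    model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-base")

    inputs = tokenizer("Now pour the batter into the pan.",
                       return_tensors="pt")
    output_ids = model.generate(**inputs, num_beams=4)  # beam size 4
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))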

Annotation

« best decoding strategy is found by manual tuning on the original OpenPI dataset »(7)


Related Notes