OpenPI-C

Summary

What did authors try to accomplish?

In this paper, the authors try to fix the original OpenPI benchmark for open-vocabulary state tracking, whose dataset quality and evaluation metric both have issues: they release a cleaned dataset (OpenPI-C), a cluster-based metric that is robust to repetition, and a stronger baseline model.

What are the key elements of the approach?

(1) Data cleaning of OpenPI at three levels: filtering non-procedure input texts (procedure level), invalid steps (step level), and unreliable state changes (state change level). (2) A cluster-based evaluation metric that merges repetitive state changes and enforces one-to-one assignment between clusters. (3) A stronger baseline with an entity memory for temporal dependency and entity awareness.

What can I use for myself?

3+ Most Important Things

1+ Deficiencies

3+ New Ideas

Annotations

Annotation

« However, we identify issues with the dataset quality and evaluation metric »(1)

Annotation

« procedure level, step level and state change level respectively »(1)

Annotation

« For the evaluation metric, we propose a cluster-based metric to fix the original metric’s preference for repetition. »(1)

Annotation

« temporal dependency and entity awareness. »(1)

Annotation

« OpenPI dataset (Tandon et al., 2020) is, to our knowledge, the first and only dataset for this task »(1)

Annotation

« non-procedural documents »(1)

Annotation

« out-of-order steps »(1)

Annotation

« ambiguous state changes »(1)

Annotation

« ∼32% of the state changes cannot be reliably inferred from the input, »(1)

Annotation

« allows matching multiple predicted state changes to a single gold state change, inadvertently inflating the score when the model produces repetitive outputs »(1)

Annotation

« propose a cluster-based metric that automatically merges repetitive state changes and enforces 1-to-1 assignment between clusters. »(1)

Annotation

« entity memory »(1)

Annotation

« This requires the model to jointly identify involved entities and predict their state changes »(1)
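
The note does not spell out the entity memory architecture, but the idea lends itself to a toy sketch. Everything below (the class, the read/write interface, the detach) is an illustrative assumption, not the paper's design:

    import torch

    class EntityMemory:
        # Toy sketch: cache one vector per entity and reuse it when the
        # entity reappears in a later step, so predictions for "pan" at
        # step 4 can depend on what happened to "pan" at step 1.
        def __init__(self):
            self.memory = {}  # entity name -> last representation

        def read(self, entity, default):
            return self.memory.get(entity, default)

        def write(self, entity, vector):
            self.memory[entity] = vector.detach()

    mem = EntityMemory()
    h0 = torch.zeros(8)
    h_pan = mem.read("pan", default=h0)  # first mention: fallback
    mem.write("pan", torch.randn(8))     # store after a step
    h_pan = mem.read("pan", default=h0)  # later step: cached state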

Annotation

« For input, we find that ∼15% input texts are not procedure texts because the steps do not show any temporal continuity »(2)

Annotation

« In valid procedure text inputs, ∼7.4% steps are invalid steps in the context of the procedure texts »(2)

Annotation

« For output, ∼32% state changes cannot be reliably inferred from the input »(2)

Annotation

« (1) filtering out non-procedure input texts, »(2)

Annotation

« (2) filtering out invalid steps, »(2)

Annotation

« (3) filtering out unreliable state changes »(2)

Annotation

« We cluster the predicted set and the gold-standard set respectively based on Sentence-BERT (Reimers and Gurevych, 2019a) embedding similarity. »(3)
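
A minimal sketch of how this clustering could be implemented. The note only fixes the embedding model family and that a similarity threshold is used; grouping state changes as connected components of a cosine-similarity graph, and the function name, are my assumptions:

    from sentence_transformers import SentenceTransformer, util

    def cluster_state_changes(state_changes, model, th=0.7):
        # Embed every state change and link two of them whenever their
        # cosine similarity reaches the threshold `th`.
        emb = model.encode(state_changes, convert_to_tensor=True)
        sim = util.cos_sim(emb, emb)
        n = len(state_changes)
        cluster_id = [-1] * n
        next_id = 0
        for i in range(n):
            if cluster_id[i] != -1:
                continue
            # BFS collects one connected component = one cluster.
            cluster_id[i] = next_id
            frontier = [i]
            while frontier:
                u = frontier.pop()
                for v in range(n):
                    if cluster_id[v] == -1 and sim[u][v] >= th:
                        cluster_id[v] = next_id
                        frontier.append(v)
            next_id += 1
        clusters = [[] for _ in range(next_id)]
        for i, c in enumerate(cluster_id):
            clusters[c].append(state_changes[i])
        return clusters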

Annotation

« which enforces one-to-one mapping »(3)

Annotation

« Eventually, we use the assignment to calculate precision, recall and F1 scores »(3)
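
Combining the two annotations above, a hedged sketch of the scoring step: clusters are matched one-to-one with the Hungarian algorithm, then precision and recall are computed over the matched clusters. The cluster-similarity function and the rule that a match counts only above the threshold are my assumptions:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def cluster_f1(pred_clusters, gold_clusters, sim_fn, th=0.7):
        # sim_fn(p, g) scores two clusters, e.g. mean pairwise
        # embedding similarity (assumed; not fixed by the note).
        if not pred_clusters or not gold_clusters:
            return 0.0, 0.0, 0.0
        scores = np.array([[sim_fn(p, g) for g in gold_clusters]
                           for p in pred_clusters])
        # Hungarian algorithm: one-to-one, maximizing total similarity.
        rows, cols = linear_sum_assignment(-scores)
        hits = sum(scores[r, c] >= th for r, c in zip(rows, cols))
        precision = hits / len(pred_clusters)
        recall = hits / len(gold_clusters)
        f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
        return precision, recall, f1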

Annotation

« We did a binary classification on each output state change to classify whether it contains hallucinations or not »(5)

Annotation

« the model trained on OpenPI produced 749 hallucinated state changes while the model trained on OpenPI-C produced 393 (47.53% less). »(5)
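
Sanity check on the arithmetic: (749 − 393) / 749 = 356 / 749 ≈ 0.4753, i.e. the reported 47.53% reduction.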

Annotation

« For future work, we consider using external resources such as ConceptNet (Speer et al., 2017) to assist entity prediction »(5)

Annotation

« For output clustering, we use stsb-distilroberta-base-v2 model provided by sentence-transformers package »(6)

Annotation

« We set the threshold th as 0.7 »(6)
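
Plugging this configuration into the clustering sketch above (hypothetical usage; the variable names are placeholders):

    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("stsb-distilroberta-base-v2")
    pred_clusters = cluster_state_changes(pred_state_changes, model, th=0.7)
    gold_clusters = cluster_state_changes(gold_state_changes, model, th=0.7)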

Annotation

« Stage 1: »(7)

Annotation

« Stage 2: »(7)

Annotation

« Stage 3: »(7)

Annotation

« We first remove steps with no state changes, and then remove procedure texts with < 3 steps. »(7)
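
These two heuristics translate directly into code; the (step, state-changes) pair representation below is my assumption, not the paper's data format:

    def clean_procedure(steps):
        # steps: list of (step_text, state_changes) pairs for one
        # procedure text. Drop steps with no state changes first,
        # then drop the procedure if fewer than 3 steps remain.
        kept = [(step, changes) for step, changes in steps if changes]
        return kept if len(kept) >= 3 else None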

Annotation

« In decoding, we use beam search with beam size of 4 »(7)
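
With a Hugging Face model this is a one-flag setting. The note does not say which pretrained generator the paper fine-tunes, so bart-base below is only a stand-in to make the snippet runnable:

    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
    model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-base")

    inputs = tokenizer("Now pour the batter into the pan.",
                       return_tensors="pt")
    output_ids = model.generate(**inputs, num_beams=4)  # beam size 4
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))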

Annotation

« best decoding strategy is found by manual tuning on the original OpenPI dataset »(7)


Related Notes