What is More Likely to Happen Next?

Summary

What did authors try to accomplish?

In this paper, the authors propose a new task and a new dataset for future event prediction. They use a multi-stage paradigm to collect data from MTurk, which helped them generate data spanning easy to hard examples, without negation artifacts or the usual biases from MTurk annotators.

What are the key elements of the approach?

The key contribution of this paper is the data collection method rather than the future event prediction model itself. The data collection procedure has three stages:

Standard Data Collection: In this stage, annotators are given the video and its premise summary and asked to write (1) a more-likely future event and (2) a less-likely future event. Together with the adversarial round below, this direct human annotation makes up 50% of the dataset. Data from this stage is treated as the easy data.

Adversarial Data Collection: In this stage, the authors first train a classifier on the round-one data to distinguish more-likely from less-likely events. When an annotator submits a more-likely and a less-likely event, both are sent to this classifier, and if the probability gap between them is too large (pm − pl > Δ, with Δ = 0.1), the annotator is prompted to rewrite the less-likely event. The hypothesis is that when the two probabilities are close, it is hard for a model to tell the more-likely event from the less-likely one. Data from this stage is treated as the hard data (see the first sketch after the list of stages).

Adversarial Matching: With enough human-written data in hand, the authors generate the remaining examples synthetically via adversarial matching. For each premise and its positive (more-likely) event, they search for a negative among other premises' positive events such that the matched negative is relevant to the premise but not too similar to the true positive. The former keeps the example hard for a model, and the latter ensures the negative does not accidentally become a valid positive for the premise (see the second sketch below).
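
Below is a minimal sketch of the round-two, model-in-the-loop check (the classifier interface model.prob and the function name are hypothetical stand-ins; the threshold Δ = 0.1 is from the paper):

    # Sketch of the adversarial data collection check (hypothetical API).
    # "model" is a classifier trained on round-one data.

    DELTA = 0.1  # difficulty threshold from the paper

    def is_hard_enough(model, premise, more_likely, less_likely, delta=DELTA):
        """Accept the pair only if the round-one model cannot easily
        separate the two events; otherwise the annotator is prompted
        to rewrite the less-likely event."""
        p_more = model.prob(premise, more_likely)  # P(more-likely | premise)
        p_less = model.prob(premise, less_likely)  # P(less-likely | premise)
        return (p_more - p_less) <= delta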
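
And a greedy sketch of the adversarial matching rule (the paper follows Zellers et al., 2019a, where this is solved as a global bipartite assignment; the scoring functions relevance and similarity and the weight lam are assumptions here):

    # Sketch of adversarial matching: for each premise, pick a negative
    # from OTHER premises' positive events. relevance() and similarity()
    # are hypothetical scoring functions, e.g. from a text-matching model.

    def match_negative(premise, positive, other_positives,
                       relevance, similarity, lam=1.0):
        """Prefer candidates relevant to the premise (hard to reject),
        penalized by similarity to the true positive (to avoid picking
        an accidental second positive)."""
        def score(cand):
            return relevance(premise, cand) - lam * similarity(positive, cand)
        return max(other_positives, key=score)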

What can I use for myself?

The data collection idea is an interesting and clever one that we can reuse whenever we need to build a synthetic or human-annotated dataset. Beyond that, my main motivation was to find a dataset for training a future event prediction model that might help my current work, and the dataset in this paper does address that need. The only issue: when I looked through the released dataset, there was no premise summary; each example had a video, a positive event, and a negative event, but not the premise summary described in the paper.

I am still trying to find more work of this type.

3+ Most Important Things

  1. Data collection method
  2. VLEP dataset

1+ Deficiencies

  1. I did not find any major deficiencies, but by now this dataset should be easily solvable by LLMs

Annotations

Annotation

« but also a significant amount of commonsense knowledge »()

Annotation

« we collect a new dataset, named Video-and-Language Event Prediction (VLEP), with 28,726 future event prediction examples (along with their rationales) from 10,234 diverse TV Show and YouTube Lifestyle Vlog video clips »()

Annotation

« Given a video with aligned dialogue, and two possible future events, the AI system is required to understand both visual and language semantics from this video, and commonsense world knowledge, and then make a sound and practical judgment about the future »(1-2)

Annotation

« The VLEP dataset contains 28,726 examples from 10,234 short video clips. »(2)

Annotation

« Each example (see Figure 1) consists of a Premise Event (a short video clip with dialogue), a Premise Summary (a text summary of the premise event), and two potential natural language Future Events (along with Rationales) written by people »(2)

Annotation

« To mitigate this, we combine two recent effective approaches, adversarial human-and-model-in-the-loop data collection (Nie et al., 2020) and adversarial matching (Zellers et al., 2019a), to build a larger, more-challenging, and less-biased dataset. »(2)

Annotation

« Specifically, 50% of the examples in VLEP are directly annotated by humans over two rounds: round one of standard data collection, i.e., crowd-workers perform the annotations with no model feedback, and round two of adversarial data collection, i.e., crowd-workers perform the annotations with the goal of fooling our basic models trained on round one data (thus avoiding obvious biases) »(2)

Annotation

« Overall, our dataset is collected via 3 methods (standard-human, adversarial-human, adversarial-matching) »(2)

Annotation

« We also inject commonsense reasoning knowledge into our model from the ATOMIC dataset »(2)

Annotation

« video understanding »(2)

Annotation

« dialogue understanding »(2)

Annotation

« commonsense knowledge »(2)

Annotation

« 50% are collected directly from human annotators over two rounds: (1) round one: standard data collection; (2) round two: adversarial data collection. »(3)

Annotation

« We use TV show clips from TVQA (Lei et al., 2018) »(3)

Annotation

« We segment the videos into 60-second clips. For each video, we drop the first and the last clip, as most of them are high-level introductions or subscription pleas. »(4)
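
For reference, a minimal sketch of this segmentation rule (illustrative only; times in seconds):

    # Sketch: cut a video into 60-second clips, then drop the first and
    # last clip, which are often intros or subscription pleas.

    def segment_clips(duration_sec, clip_len=60):
        clips = [(start, min(start + clip_len, duration_sec))
                 for start in range(0, int(duration_sec), clip_len)]
        return clips[1:-1]

    # e.g. segment_clips(300) -> [(60, 120), (120, 180), (180, 240)]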

Annotation

« that are likely to happen right after the ‘premise’ video »(4)

Annotation

« For example, we found workers tend to use negation when writing the less-likely event. »(4)

Annotation

« Specifically, each submitted result is sent to the model for evaluation and writers are prompted to rewrite their negative event if our model predicts a much higher probability for the more-likely event (pm) than the less-likely event (pl), i.e., pm − pl > ∆, where ∆ is a hyperparameter that controls how difficult we want the collected examples to be and is set to 0.1 empirically. »(5)

Annotation

« Given a premise event and its positive event, the goal of adversarial matching is to find a negative from other premise events’ positives, such that the matched negative is very relevant to the premise event (so that they are still hard for machines) and at the same time, not overly similar to the true positive (in case they incidentally become a positive event to the premise). »(6)

Annotation

« Since our model has a pre-trained component (RoBERTa), we use a two-phase training strategy »(8)

Annotation

« Specifically, we first freeze RoBERTa’s weights up to the second last layer and then pre-train the rest of model for 3 epochs with initial learning rate of 1e-4, learning rate warmup over the first 10% of the steps and linear decay the learning rate to 0 »(8)
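
A sketch of that two-phase setup in PyTorch (the module names, the freezing condition, and the total step count are my assumptions, not the paper's code):

    import torch
    from transformers import RobertaModel

    # Phase one: freeze RoBERTa up to the second-to-last layer; only the
    # final encoder layer stays trainable (the task head is omitted here).
    roberta = RobertaModel.from_pretrained("roberta-base")
    for name, param in roberta.named_parameters():
        param.requires_grad = "encoder.layer.11" in name  # last layer only

    optimizer = torch.optim.AdamW(
        [p for p in roberta.parameters() if p.requires_grad], lr=1e-4)

    def lr_lambda(step, total_steps=10_000, warmup_frac=0.1):
        """Linear warmup over the first 10% of steps, then linear decay to 0."""
        warmup = int(total_steps * warmup_frac)
        if step < warmup:
            return step / max(1, warmup)
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)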


Related Notes