HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

Date : 2019-07-31

Authors : Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, Josef Sivic

Citation : Miech, Antoine, et al. "HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips." arXiv:1906.03327, 31 July 2019. https://doi.org/10.48550/arXiv.1906.03327.

Paper Link : http://arxiv.org/abs/1906.03327

Zotero Link: HowTo100M.pdf

Annotation

« we introduce HowTo100M: a large-scale dataset of 136 million video clips sourced from 1.22M narrated instructional web videos depicting humans performing and describing over 23k different visual tasks »(1)

Annotation

« we demonstrate that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask »(1)
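
The joint text-video embedding mentioned here is the paper's central artifact. As a rough illustration of what such an embedding looks like, below is a minimal PyTorch sketch: two linear projection heads map pre-extracted video and text features into a shared space, trained with a max-margin ranking loss over in-batch negatives (the loss family the paper uses). The dimensions and the use of plain linear layers are placeholder assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    """Sketch: project pre-extracted video and text features into a shared
    space where similarity is a dot product. Dimensions are illustrative."""
    def __init__(self, video_dim=4096, text_dim=300, embed_dim=512):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)

    def forward(self, video_feats, text_feats):
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        return v, t

def max_margin_ranking_loss(v, t, margin=0.2):
    """Bidirectional max-margin ranking loss with in-batch negatives."""
    scores = v @ t.T                      # (B, B) clip-caption similarity matrix
    pos = scores.diag().unsqueeze(1)      # scores of the matching pairs
    # penalize negatives that come within `margin` of the positive score
    cost_v2t = (margin + scores - pos).clamp(min=0)
    cost_t2v = (margin + scores - pos.T).clamp(min=0)
    mask = 1 - torch.eye(scores.size(0), device=scores.device)  # drop positives
    return ((cost_v2t + cost_t2v) * mask).mean()

# usage: a batch of 8 clip/caption feature pairs (random features for the sketch)
model = JointEmbedding()
v, t = model(torch.randn(8, 4096), torch.randn(8, 300))
loss = max_margin_ranking_loss(v, t)
loss.backward()
```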

Annotation

« we show that this embedding transfers well to other domains »(1)

Annotation

« MSR-VTT [58] »(1)

Annotation

« DiDeMo [15] »(1)

Annotation

« EPIC-KITCHENS [7] »(1–2)

Annotation

« 136 million video clips sourced from 1.22 million narrated instructional videos depicting humans performing more than 23,000 different tasks »(2)
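
A back-of-envelope scale check from those figures (my own arithmetic, not a number quoted from the paper): the counts imply on the order of a hundred clips per video and dozens of videos per task.

```python
clips, videos, tasks = 136_000_000, 1_220_000, 23_000
print(f"{clips / videos:.0f} clips per video on average")  # ~111
print(f"{videos / tasks:.0f} videos per task on average")  # ~53
```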

Annotation

« YouCook2 [67] »(2)

Annotation

« CrossTask [68] »(2)

Annotation

« Sener et al. [47] use WikiHow, an encyclopedia of how-to articles, to collect 17 popular physical tasks, and obtain videos by querying these tasks on YouTube. In a similar vein, COIN [50] and CrossTask [68] datasets are collected by first searching for tasks on WikiHow and then videos for each task on YouTube. »(2)
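
The two-stage collection recipe quoted above (task list from WikiHow, then a per-task video search on YouTube) is easy to picture as a pipeline. A schematic Python sketch follows; both helper functions are hypothetical stand-ins with placeholder return values, not real WikiHow or YouTube APIs.

```python
def fetch_wikihow_tasks():
    """Hypothetical stand-in: would scrape WikiHow's task hierarchy."""
    return ["make pancakes", "change a tire"]  # illustrative placeholder list

def search_youtube(query, max_results=200):
    """Hypothetical stand-in: would query YouTube search for videos."""
    return [f"{query.replace(' ', '_')}_video_{i}" for i in range(3)]  # fake ids

# stage 1: tasks from WikiHow; stage 2: videos per task from YouTube
dataset = {}
for task in fetch_wikihow_tasks():
    dataset[task] = search_youtube(f"how to {task}")

print(dataset)
```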

Annotation

« We are primarily interested in “visual tasks” that involve some interaction with the physical world »(3)

Annotation

« However, unlike previous datasets, HowTo100M does not have clean annotated captions »(4)
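
Concretely, the paper pairs each clip with its narration: subtitles either written by the uploader or generated automatically by YouTube's ASR, aligned to time intervals. That is why the "captions" are noisy transcript segments rather than clean annotations. A hypothetical example record; the field names are my own illustration, not the dataset's actual schema.

```python
# one clip-caption pair, illustrative only (hypothetical ids and values)
clip = {
    "video_id": "abc123",  # hypothetical YouTube video id
    "start": 12.4,         # narration segment start time (seconds)
    "end": 16.9,           # narration segment end time (seconds)
    "caption": "so now we just add the flour and mix",  # raw ASR text, not a curated label
}
print(clip["caption"])
```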


Related Notes