HowTo100M: Learning a Text-Video Embedding
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
Date: 2019-07-31
Authors: Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, Josef Sivic
Citation: Miech, Antoine, et al. "HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips." arXiv:1906.03327, 31 July 2019. https://doi.org/10.48550/arXiv.1906.03327.
Paper Link: http://arxiv.org/abs/1906.03327
Zotero Link: HowTo100M.pdf
« we introduce HowTo100M: a large-scale dataset of 136 million video clips sourced from 1.22M narrated instructional web videos depicting humans performing and describing over 23k different visual tasks »(1)
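(Note: at this scale the dataset averages roughly 136M / 1.22M ≈ 110 clips per narrated video.)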
« we demonstrate that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask »(1)
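The retrieval setting behind these results can be pictured as nearest-neighbour search in a joint embedding space. The sketch below is only an illustration of that idea, not the authors' code: it assumes caption and clip features have already been projected into a shared d-dimensional space, and it ranks clips by cosine similarity to a text query embedding.

```python
# Minimal sketch of text-to-video retrieval in a joint embedding space.
# Assumes text and clip embeddings already live in the same d-dim space.
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Normalize vectors so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def retrieve(query_emb, clip_embs, top_k=5):
    """Rank video clips by cosine similarity to a text query embedding."""
    q = l2_normalize(query_emb)            # (d,)
    c = l2_normalize(clip_embs)            # (num_clips, d)
    scores = c @ q                         # cosine similarity per clip
    order = np.argsort(-scores)[:top_k]    # indices of best-matching clips
    return order, scores[order]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clip_embs = rng.normal(size=(1000, 512))   # placeholder clip embeddings
    query_emb = rng.normal(size=512)           # placeholder caption embedding
    idx, sim = retrieve(query_emb, clip_embs)
    print(idx, sim)
```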
« we show that this embedding transfers well to other domains »(1)
« MSR-VTT [58] »(1)
« DiDeMo [15] »(1)
« EPIC-KITCHENS [7] »(1-2)
« 136 million video clips sourced from 1.22 million narrated instructional videos depicting humans performing more than 23,000 different tasks »(2)
« YouCook2 [67] »(2)
« CrossTask [68] »(2)
« Sener et al. [47] use WikiHow, an encyclopedia of how-to articles, to collect 17 popular physical tasks, and obtain videos by querying these tasks on YouTube. In a similar vein, COIN [50] and CrossTask [68] datasets are collected by first searching for tasks on WikiHow and then videos for each task on YouTube. »(2)
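The two-stage recipe in this quote (a task list from WikiHow, then a per-task video search) can be sketched as below. This is only a hedged illustration, not any of these datasets' actual pipelines: `search_youtube` is a hypothetical helper standing in for whatever video search API is used, and the "how to ..." query template is an assumption.

```python
# Hedged sketch of the task-then-video collection recipe described above.
# `search_youtube` is a hypothetical callable; only the two-stage structure
# (WikiHow task list first, one YouTube query per task second) comes from
# the quoted text.
from typing import Callable, Dict, List

def collect_videos(tasks: List[str],
                   search_youtube: Callable[[str], List[str]],
                   per_task_limit: int = 200) -> Dict[str, List[str]]:
    """Map each WikiHow-style task name to a list of candidate video ids."""
    dataset = {}
    for task in tasks:
        query = f"how to {task}"   # assumed instructional phrasing of the task
        dataset[task] = search_youtube(query)[:per_task_limit]
    return dataset
```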
« We are primarily interested in “visual tasks” that involve some interaction with the physical world »(3)
« However, unlike previous datasets, HowTo100M does not have clean annotated captions »(4)