COIN

Paper Name: COIN: A Large-scale Dataset for Comprehensive Instructional Video Analysis

Summary

In this paper, the authors propose an instructional video dataset in which each sample comprises (see the sketch after this list):

  1. Video
  2. Step text with step localization (start and end time)
  3. Task name
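
As a rough illustration, one annotated sample could look like the sketch below; the field names and values are my own assumptions, not the dataset's actual file format.

```python
# Hypothetical structure of one annotated COIN sample (field names and
# values are illustrative assumptions, not the official schema).
sample = {
    "video_id": "some_video_id",        # assumed identifier
    "domain": "Vehicles",               # one of the 12 domains (example name)
    "task": "Change Car Tire",          # one of the 180 tasks (example name)
    "steps": [
        # each step: label text plus start/end time in seconds
        {"label": "jack up the car",   "start": 12.3, "end": 25.0},
        {"label": "unscrew the bolts", "start": 25.0, "end": 41.7},
    ],
}
```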

The dataset consists of 11,827 videos covering 180 tasks in 12 domains. The authors argue that the dataset differs from others in two ways: (a) Diversity: whereas existing datasets typically focus on a single domain or only a few tasks, this dataset covers 12 domains and 180 tasks. (b) Scale: compared to the datasets available at the time, this dataset is larger.

Prior approaches to instructional video analysis can be divided into three categories by their level of supervision:

  1. Unsupervised Learning
  2. Weakly Supervised Learning
  3. Fully Supervised Learning

The authors organize the dataset into a three-level hierarchy (a sketch follows the list):

  1. Domain: The first level; the authors chose 12 domains based on instructional websites.
  2. Task: The second level, linked to a domain; the authors selected the 180 most-viewed tasks.
  3. Step: The third level, a step-wise description of each task. In total, the authors defined 778 steps, distributed across the tasks.
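
A minimal sketch of this hierarchy as a nested mapping; the names below are illustrative placeholders, not the official COIN labels.

```python
# Three-level hierarchy: domain -> task -> ordered list of steps.
# Names are placeholders; COIN has 12 domains, 180 tasks, and 778 steps.
taxonomy = {
    "Vehicles": {                        # level 1: domain
        "Change Car Tire": [             # level 2: task
            "unscrew the bolts",         # level 3: steps
            "jack up the car",
            "put on the spare tire",
        ],
    },
}

def steps_for_task(taxonomy, task_name):
    """Return the ordered step list for a given task name."""
    for tasks in taxonomy.values():
        if task_name in tasks:
            return tasks[task_name]
    raise KeyError(task_name)
```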

For annotation, in contrast to the usual approach, the authors propose two modes to make the annotation work more efficient (see the sketch after this list):

  1. Video Mode: The conventional mode, in which the annotator watches the video and annotates each step with its start and end time.
  2. Frame Mode: The novel mode proposed by the authors, in which the annotator selects the frames associated with each step. By default, frames are sampled at 2 frames per second; because of the time gap between sampled frames, some quick steps can be overlooked and missed.
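
A small sketch of the frame-mode timing, assuming the stated 2-frames-per-second sampling (the function names and indexing convention are mine):

```python
SAMPLE_FPS = 2.0  # frame mode samples 2 frames per second (per the paper)

def frame_time(index, fps=SAMPLE_FPS):
    """Timestamp in seconds of the index-th sampled frame."""
    return index / fps

def frames_to_interval(selected, fps=SAMPLE_FPS):
    """Convert the selected frame indices for a step into a (start, end) interval."""
    return frame_time(min(selected), fps), frame_time(max(selected), fps)

# The gap between consecutive sampled frames is 0.5 s, so a step shorter
# than that may contain no sampled frame at all and can be missed.
print(frames_to_interval([24, 25, 26]))  # -> (12.0, 13.0)
```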

The annotation proceeds in three stages:

  1. The first annotation pass is done in frame mode.
  2. A second worker refines the results of the first annotator.
  3. A final worker switches to video mode, checks for any missing steps, and refines the annotation.

The COIN dataset is split into 9,030 training and 2,797 testing videos. The authors also enforce consistency, ensuring that a step defined under one domain does not appear under another domain.
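
A quick sanity check of the reported split, plus a sketch of the consistency constraint; the annotation schema used here is the assumed one from the earlier sketch.

```python
# Reported split sizes should add up to the total number of videos.
train_videos, test_videos = 9030, 2797
assert train_videos + test_videos == 11827

def check_step_consistency(annotations):
    """Raise if any step label appears under more than one domain
    (the consistency property described above). Assumes each annotation
    is a dict with "domain" and "steps" keys."""
    step_owner = {}
    for ann in annotations:
        for step in ann["steps"]:
            owner = step_owner.setdefault(step["label"], ann["domain"])
            if owner != ann["domain"]:
                raise ValueError(f"step {step['label']!r} appears in multiple domains")
```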

3+ Most Important Things

  1. An instructional video dataset with step localization, which provides the specific temporal location of each step within the video.

1+ Deficiencies

  1. The steps are pre-defined labels rather than free-form text, so the dataset does not support human-like free-form step descriptions.

3+ New Ideas


Related Notes