MultiVENT

Paper Name: MultiVENT: Multilingual Videos of Events with Aligned Natural Text

Summary

In this paper, the authors present a multilingual, multimodal, event-centric dataset. Each instance consists of the following (a schematic sketch follows the list):

  1. A video, sourced from
    1. YouTube
    2. Twitter
    3. YouTube Shorts
  2. A natural-language video description
  3. A video-grounded text document, either
    1. A Wikipedia page
    2. A news article
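
As a schematic, one instance might be represented along the following lines. This is a minimal sketch; the class and field names are hypothetical, not the dataset's released schema.

```python
from dataclasses import dataclass

# Hypothetical sketch of a single MultiVENT instance; the class and
# field names are illustrative, not the dataset's released schema.
@dataclass
class MultiVENTInstance:
    video_url: str       # YouTube, Twitter, or YouTube Shorts link
    description: str     # natural-language description of the video
    grounding_doc: str   # Wikipedia page or news-article text
    language: str        # one of: Arabic, Chinese, English, Korean, Russian
    event_id: str        # one of the 260 current events covered
```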

The text spans five languages: Arabic, Chinese, English, Korean, and Russian. Although the dataset is multilingual, an English description and grounding document are available for every video. In total, the dataset contains 2,396 videos covering 260 current events, grounded in 468 documents.

There are four primary event categories: disasters, political events, social events, and technology events. For each category and each target language there are thirteen current events (4 categories × 13 events × 5 languages = 260 events, matching the total above). The authors used Google Trends to identify trending events for each target language, querying from the country where each language is most widely spoken (five countries in total). By design, the same event does not appear in multiple languages.

For each event, the authors collected ten candidate videos and then manually removed duplicates. The natural-language description is curated from the video's own description when available, and otherwise from its title. The authors show that roughly half of the entities in an event are visually salient, while the other half can be extracted from the text.

The authors provide a multimodal retrieval baseline that uses a video encoder (CLIP) and a text encoder (XLM-RoBERTa) to encode videos and text, respectively. For each video, they sample 12 frames and mean-pool the frame representations to obtain a single video embedding. The text and video embeddings are then used to build a similarity matrix, from which the most similar videos for a given text query (or vice versa) are retrieved.
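
A minimal sketch of such a pipeline is given below, assuming Hugging Face `transformers` checkpoints for CLIP's vision tower and for XLM-RoBERTa. The specific model names, the [CLS] pooling, and the linear projection that aligns the two embedding spaces are illustrative assumptions, not the paper's exact setup.

```python
import torch
from transformers import CLIPVisionModel, CLIPImageProcessor
from transformers import XLMRobertaModel, XLMRobertaTokenizer

# Assumed checkpoints; the paper's exact models may differ.
vision = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
img_proc = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
text_enc = XLMRobertaModel.from_pretrained("xlm-roberta-base")
tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")

# Placeholder projection to put both modalities in one space; in a real
# baseline this (or a multilingual CLIP text tower) would be learned.
proj = torch.nn.Linear(text_enc.config.hidden_size, vision.config.hidden_size)

def encode_video(frames):
    """frames: list of 12 PIL images sampled uniformly from one video."""
    inputs = img_proc(images=frames, return_tensors="pt")
    with torch.no_grad():
        frame_embs = vision(**inputs).pooler_output   # (12, d)
    return frame_embs.mean(dim=0)                     # mean-pool -> (d,)

def encode_texts(descriptions):
    """descriptions: list of n descriptions (any of the 5 languages)."""
    inputs = tokenizer(descriptions, padding=True, truncation=True,
                       return_tensors="pt")
    with torch.no_grad():
        cls = text_enc(**inputs).last_hidden_state[:, 0]  # (n, d_text)
    return proj(cls)                                      # (n, d)

def similarity_matrix(text_embs, video_embs):
    """Cosine similarities between all text and video embeddings."""
    t = torch.nn.functional.normalize(text_embs, dim=-1)
    v = torch.nn.functional.normalize(video_embs, dim=-1)
    return t @ v.T                                    # (n_texts, n_videos)
```

Given the matrix, text-to-video retrieval ranks the columns of each row by similarity, and video-to-text retrieval ranks the rows of each column.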

3+ Most Important Things

  1. The authors provide a novel multilingual dataset that uses non-professional, amateur videos for information retrieval.
  2. Every video has an English grounding document and description available, which also supports research that is not focused on multilingual settings.
  3. The authors show that video and text play similar roles in understanding an event.

1+ Deficiency

  1. The authors could have demonstrated more tasks on this dataset beyond information retrieval, such as video topic modeling or video question answering.

New Ideas

  1. This dataset can be used as the base for a multimodal counterfactual dataset
    1. Because,
      1. The instances are based on events that, unlike instructional videos, do not build toward a specific end goal, yet still follow a proper sequence of sub-events; this makes them easier to alter than goal-driven scenarios, e.g., an Olympic sport vs. a wedding party, where the former is easier to alter (an athlete breaks a leg and finishes last).
      2. Having a long grounding document helps in finding multiple entities and agents, and since those agents and entities can be connected toward a goal, the dataset can be expanded.
    2. Issues
      1. This dataset does not have specific step-by-step text; it has a grounding document that can be long, which makes annotating alternate (counterfactual) documents hard.