Reno Talk @ UMBC on SCALE 2024

Full handwritten notes: https://web.goodnotes.com/s/TL37bZae7UNxYWtBRCxmLR#page-1 (for some reason the notes render blank 😞)

Motivation:

  1. Most videos don’t have descriptions or other metadata
  2. YouTube provides:
    1. title
    2. description
    3. likes / dislikes
    4. hashtags

Methods:

Video

  1. video → caption → embedding
  2. video → key frame → embedding
  3. video → key frame → caption → embedding

Audio

  1. video → ASR → embedding
  2. video → ASR → translation → embedding

OCR

  1. video → key frame → OCR → text → embedding
  2. video → key frame → OCR → text → summary → embedding

Description

  1. Description → embedding
  2. Description → summary → embedding
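
The four pipeline families above all reduce a video to text and then to a vector. A minimal sketch of that composition, where the caption/ASR/OCR model calls and the hash-based embedder are stand-ins I made up for illustration (a real system would call actual models here):

```python
import hashlib
import math

def embed(text: str, dim: int = 8) -> list[float]:
    """Stand-in text embedder: deterministic hash-based vector, L2-normalized.
    A real pipeline would use a trained text encoder instead."""
    digest = hashlib.sha256(text.encode()).digest()
    vec = [b / 255.0 for b in digest[:dim]]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

# Stand-ins for the model stages named in the notes.
def caption(frame: str) -> str:   # e.g. an image-captioning model
    return f"caption of {frame}"

def asr(video: str) -> str:       # e.g. Whisper transcription
    return f"transcript of {video}"

def ocr(frame: str) -> str:       # e.g. PaddleOCR on a key frame
    return f"text visible in {frame}"

# Each pipeline is then a simple composition:
video = "video_001"
key_frame = f"{video}:frame_0"

emb_caption = embed(caption(key_frame))  # video → key frame → caption → embedding
emb_asr = embed(asr(video))              # video → ASR → embedding
emb_ocr = embed(ocr(key_frame))          # video → key frame → OCR → text → embedding
```
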

Team Members:

Data:

MultiVENT:

InternVid:

SCALE / MultiVENT 2.0:

Train

Sub-Train / Train-2k

Test

Query Creation:

MultiVENT

InternVid:

Evaluation:

Pre-Scale Baseline

  1. Video → PySceneDetect → 10 frames → CLIP
  2. Video → PySceneDetect → 10 frames → OCR → PaddleOCR → CLIP
  3. ASR → Whisper → CLIP
  4. Desc → CLIP
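
The "10 frames" step in the baseline can be sketched as evenly spaced frame-index sampling; this is my own minimal illustration of that step, not the talk's actual code, and the PySceneDetect/CLIP stages are omitted:

```python
def sample_frames(n_total: int, k: int = 10) -> list[int]:
    """Pick k evenly spaced frame indices (scene midpoint style) from a clip
    of n_total frames, roughly what a 'scene → 10 frames' step produces
    before the frames are encoded with CLIP."""
    if n_total <= 0 or k <= 0:
        return []
    if n_total <= k:
        return list(range(n_total))
    step = n_total / k
    return [int(i * step + step / 2) for i in range(k)]

idx = sample_frames(300, 10)  # 10 indices spread across a 300-frame clip
```
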

After-Scale:

  1. Video → 16 uniform frames → OCR → ICDAR → original + Google Translate → original + LLaMA-summarized Google Translate → ColBERT
  2. Video → 16 uniform frames → SigLIP → maxSim
  3. Desc → original + Google Translate → ColBERT
  4. ASR → Whisper → original + Google Translate → original + LLaMA-summarized Google Translate → ColBERT
  5. Weighted fusion based on whether the video is professional, edited, or raw (using a trained classifier)
  6. Query → query decomposition
  7. Rerank with MonoT5
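
Step 2 above scores a query against the 16 per-frame SigLIP embeddings with maxSim. A sketch of ColBERT-style maxSim in pure Python, using toy vectors rather than real SigLIP outputs:

```python
def max_sim(query_vecs: list[list[float]], doc_vecs: list[list[float]]) -> float:
    """ColBERT-style late interaction: for each query vector take its best
    dot-product match among the document (here: frame) vectors, then sum."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Toy example: one query vector, three frame embeddings.
query = [[1.0, 0.0]]
frames = [[0.2, 0.9], [0.8, 0.1], [0.5, 0.5]]
score = max_sim(query, frames)  # best frame dot product is 0.8
```
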

Without Reranking

With Reranking

Takeaways:

  1. Late fusion works better
  2. Summarizing the OCR & STT text representations helps
  3. SigLIP works better as a vision encoder than CLIP
  4. Combining the translation with the original text helps
  5. No single modality is sufficient
  6. The more text we can extract, the better
  7. LLM summarization works better than raw text
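
Takeaways 1 and 5 point at weighted late fusion of per-modality scores, gated by the video-type classifier. A toy sketch; the weight values and the modality ordering are invented for illustration, not the talk's actual numbers:

```python
# Hypothetical per-video-type weights over (vision, OCR, ASR, description) scores.
WEIGHTS = {
    "professional": [0.2, 0.2, 0.3, 0.3],
    "edited":       [0.3, 0.3, 0.2, 0.2],
    "raw":          [0.5, 0.2, 0.2, 0.1],
}

def fuse(scores: list[float], video_type: str) -> float:
    """Late fusion: weight each modality's retrieval score by the
    classifier-predicted video type, then sum into one ranking score."""
    weights = WEIGHTS[video_type]
    return sum(w * s for w, s in zip(weights, scores))

fused = fuse([0.9, 0.1, 0.4, 0.2], "raw")
```
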

Papers to Read:

  1. https://aclanthology.org/2022.lrec-1.739/
  2. https://arxiv.org/abs/2104.00650
  3. https://arxiv.org/abs/2106.11097
  4. https://arxiv.org/abs/2212.03191
  5. https://arxiv.org/abs/2305.18500