Reno Talk @UMBC on Scale-2024
Full handwritten notes: https://web.goodnotes.com/s/TL37bZae7UNxYWtBRCxmLR#page-1 (for some reason the note is coming up blank)
Motivation:
- Most videos don't have a description or metadata
- YouTube has:
- title
- description
- likes / dislikes
- hashtags
Methods (two of these pipelines are sketched after this list):
Video
- video → caption → embedding
- video → key frame → embedding
- video → key frame → caption → embedding
Audio
- video → ASR → embedding
- video → ASR → translation → embedding
OCR
- video → key frame → OCR → text → embedding
- video → key frame → OCR → text → summary → embedding
Description
- Description → embedding
- Description → summary → embedding
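A minimal sketch of two of the pipelines above (video → key frame → embedding and video → ASR → embedding). It assumes openai/clip-vit-base-patch32 as the image encoder, openai-whisper for ASR, all-MiniLM-L6-v2 for the transcript embedding, and uniform frame sampling with OpenCV; none of these choices come from the talk itself (the models actually used appear later in these notes).

```python
import cv2
import torch
import whisper
from sentence_transformers import SentenceTransformer
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
asr = whisper.load_model("base")
text_encoder = SentenceTransformer("all-MiniLM-L6-v2")

def sample_frames(video_path, n=10):
    """Uniformly sample up to n RGB frames from the video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in range(0, total, max(total // n, 1)):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames[:n]

def video_to_frame_embedding(video_path):
    """video -> key frames -> CLIP image embeddings, mean-pooled into one vector."""
    inputs = clip_proc(images=sample_frames(video_path), return_tensors="pt")
    with torch.no_grad():
        feats = clip.get_image_features(**inputs)
    return feats.mean(dim=0)

def video_to_asr_embedding(video_path):
    """video -> Whisper transcript -> sentence embedding."""
    transcript = asr.transcribe(video_path)["text"]
    return text_encoder.encode(transcript)
```

Queries would be embedded the same way (CLIP text encoder for the frame route, the sentence encoder for the ASR route) and compared with cosine similarity.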
Team Members:
- Expertise areas: video, audio, OCR, text, information retrieval
Data:
MultiVENT:
- ~2400 videos
- Professional → Edited → Raw (from easiest to hardest to retrieve)
InternVid:
- ~10M videos
- Large array of categories
- Multilingual
Scale / Multivent 2.0:
Train
- 108k InternVid videos as distractors
- 1400 train queries
- No MultiVENT videos here
Sub-Train / Train-2k
- 2000 videos, including all relevant videos
- Used for quick prototyping
Test
- 110k InternVid videos as needles & distractors + 2400 MultiVENT videos
- 2546 test queries → 2400 MultiVENT + ~146 InternVid
Query Creation:
MultiVENT:
- MultiVENT has the description, title, or tweet as the query, which is really long
- MultiVENT has a document associated with each video
- 255 base event queries
- 884 specific event queries
InternVID:
- InternVid has a title
- Base query from the title or description
- Specific event query from transcribed speech and OCR
Evaluation:
- Primary metric: NDCG@10 (a toy computation follows this list)
- Score 0.3-0.4: decent
- Score 0.6-0.7: good
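For a sanity check of the numbers above, NDCG@10 is easy to compute by hand; this is just the standard formula, not the official scorer (a real evaluation would use a standard tool such as trec_eval or ir_measures).

```python
import math

def ndcg_at_k(ranked_ids, relevance, k=10):
    """ranked_ids: doc ids in ranked order; relevance: {doc_id: graded relevance}."""
    gains = [relevance.get(doc, 0) for doc in ranked_ids[:k]]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# One relevant video ("v9") retrieved at rank 3 -> NDCG@10 = 0.5
print(ndcg_at_k(["v7", "v2", "v9"], {"v9": 1}))
```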
Pre-Scale Baseline (frame selection and OCR sketched after this list):
- Video → PySceneDetect → 10 frames → CLIP
- Video → PySceneDetect → 10 frames → OCR → PaddleOCR → CLIP
- ASR → Whisper → CLIP
- Desc → CLIP
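A rough sketch of the baseline's frame selection and OCR steps, assuming PySceneDetect's ContentDetector and PaddleOCR's default English model; the selected frames and concatenated OCR text would then go through CLIP as in the earlier sketch. The exact frame count, detector settings, and OCR output parsing used in the talk may differ.

```python
import cv2
from paddleocr import PaddleOCR
from scenedetect import ContentDetector, detect

ocr_engine = PaddleOCR(lang="en")

def scene_frames(video_path, max_frames=10):
    """Detect scene cuts and grab the first frame of up to max_frames scenes."""
    scenes = detect(video_path, ContentDetector())
    cap = cv2.VideoCapture(video_path)
    frames = []
    for start, _end in scenes[:max_frames]:
        cap.set(cv2.CAP_PROP_POS_FRAMES, start.get_frames())
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames

def frames_to_ocr_text(frames):
    """Run PaddleOCR on each frame and concatenate the recognized lines."""
    lines = []
    for frame in frames:
        for page in ocr_engine.ocr(frame) or []:
            for line in page or []:
                lines.append(line[1][0])  # each line is (box, (text, confidence))
    return " ".join(lines)
```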
After-Scale (SigLIP MaxSim and the weighted fusion are sketched after this list):
- Video → 16 uniform frames → OCR → ICDAR → original + Google Translate → original + LLaMA-summarized Google Translate → ColBERT
- Video → 16 uniform frames → SigLIP → MaxSim
- Desc → original + Google Translate → ColBERT
- ASR → Whisper → original + Google Translate → original + LLaMA-summarized Google Translate → ColBERT
- Weighted fusion based on whether the video is professional, edited, or raw (using a trained classifier)
- Query → query decomposition
- Rerank with MonoT5
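A hedged sketch of two of the post-Scale pieces: SigLIP MaxSim scoring over uniformly sampled frames, and type-conditioned late fusion of per-modality scores. The checkpoint name, the fusion weights, and the video-type label are placeholders rather than the values from the talk, and query decomposition and MonoT5 reranking are omitted.

```python
import torch
from transformers import AutoModel, AutoProcessor

siglip = AutoModel.from_pretrained("google/siglip-base-patch16-224")
siglip_proc = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")

def siglip_maxsim(query_text, frames):
    """Score a query against a video as the max over frames of cosine similarity."""
    img_in = siglip_proc(images=frames, return_tensors="pt")
    txt_in = siglip_proc(text=[query_text], padding="max_length", return_tensors="pt")
    with torch.no_grad():
        img = torch.nn.functional.normalize(siglip.get_image_features(**img_in), dim=-1)
        txt = torch.nn.functional.normalize(siglip.get_text_features(**txt_in), dim=-1)
    return (txt @ img.T).max().item()

# Illustrative per-type fusion weights; the talk tuned these and predicted the
# type (professional / edited / raw) with a trained classifier.
FUSION_WEIGHTS = {
    "professional": {"visual": 0.2, "ocr": 0.3, "asr": 0.3, "desc": 0.2},
    "edited":       {"visual": 0.3, "ocr": 0.3, "asr": 0.2, "desc": 0.2},
    "raw":          {"visual": 0.4, "ocr": 0.2, "asr": 0.2, "desc": 0.2},
}

def fuse(scores_by_modality, video_type):
    """Weighted late fusion of per-modality retrieval scores for one video."""
    weights = FUSION_WEIGHTS[video_type]
    return sum(weights[m] * s for m, s in scores_by_modality.items())
```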
Takeaways:
- Late fusion works better
- Summarizing the OCR & STT text improves the text representation
- SigLIP works better than CLIP as the vision encoder
- Combining the translation with the original helps
- No single modality is sufficient
- The more text we can extract, the better
- LLM summarization works better than raw text