Reno Talk @UMBC on Scale-2024
Full handwritten notes: https://web.goodnotes.com/s/TL37bZae7UNxYWtBRCxmLR#page-1 (for some reason the note is coming up blank)
Motivation:
- Most videos don't have a description or metadata
- YouTube has:
- title
- description
- likes / dislikes
- hashtags
Methods (two of these pipelines are sketched after this list):
Video
- video → caption → embedding
- video → key frame → embedding
- video → key frame → caption → embedding
Audio
- video → ASR → embedding
- video → ASR → translation → embedding
OCR
- video → key frame → OCR → text → embedding
- video → key frame → OCR → text → summary → embedding
Description
- Description → embedding
- Description → summary → embedding
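A minimal sketch of two of the pipelines above (video → key frame → embedding and video → ASR → embedding). It assumes openai/clip-vit-base-patch32 as the image encoder, openai-whisper for ASR, all-MiniLM-L6-v2 for the transcript embedding, and uniform frame sampling with OpenCV; none of these choices come from the talk itself (the models actually used appear later in these notes).

```python
import cv2
import torch
import whisper
from sentence_transformers import SentenceTransformer
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
asr = whisper.load_model("base")
text_encoder = SentenceTransformer("all-MiniLM-L6-v2")

def sample_frames(video_path, n=10):
    """Uniformly sample up to n RGB frames from the video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in range(0, total, max(total // n, 1)):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames[:n]

def video_to_frame_embedding(video_path):
    """video -> key frames -> CLIP image embeddings, mean-pooled into one vector."""
    inputs = clip_proc(images=sample_frames(video_path), return_tensors="pt")
    with torch.no_grad():
        feats = clip.get_image_features(**inputs)
    return feats.mean(dim=0)

def video_to_asr_embedding(video_path):
    """video -> Whisper transcript -> sentence embedding."""
    transcript = asr.transcribe(video_path)["text"]
    return text_encoder.encode(transcript)
```

Queries would be embedded the same way (CLIP text encoder for the frame route, the sentence encoder for the ASR route) and compared with cosine similarity.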
Team Members:
- Expertise areas: video, audio, OCR, text, information retrieval
Data:
MultiVENT:
- ~2400 videos
- Professional → Edited → Raw (from easiest to hardest to retrieve)
InternVid:
- ~10M videos
- Large array of categories
- Multilingual
Scale / Multivent 2.0:
Train
- 108k InternVid videos as distractors
- 1400 train queries
- No MultiVENT videos here
Sub-Train / Train-2k
- 2000 videos, including all relevant videos
- Used for quick prototyping
Test
- 110k InternVid videos as needles & distractors + 2400 MultiVENT videos
- 2546 test queries → 2400 MultiVENT + ~146 InternVid
Query Creation:
MultiVENT:
- MultiVENT has the description, title, or tweet as the query, which is really long
- MultiVENT has a document associated with each video
- 255 base event queries
- 884 specific event queries
InternVID:
- InternVid has a title
- Base query from the title or description
- Specific event query from transcribed speech and OCR
Evaluation:
- Primary metric: NDCG@10 (a toy computation follows this list)
- Score 0.3-0.4: decent
- Score 0.6-0.7: good
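For a sanity check of the numbers above, NDCG@10 is easy to compute by hand; this is just the standard formula, not the official scorer (a real evaluation would use a standard tool such as trec_eval or ir_measures).

```python
import math

def ndcg_at_k(ranked_ids, relevance, k=10):
    """ranked_ids: doc ids in ranked order; relevance: {doc_id: graded relevance}."""
    gains = [relevance.get(doc, 0) for doc in ranked_ids[:k]]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# One relevant video ("v9") retrieved at rank 3 -> NDCG@10 = 0.5
print(ndcg_at_k(["v7", "v2", "v9"], {"v9": 1}))
```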
Pre-Scale Baseline (frame selection and OCR sketched after this list):
- Video → PySceneDetect → 10 frames → CLIP
- Video → PySceneDetect → 10 frames → OCR → PaddleOCR → CLIP
- ASR → Whisper → CLIP
- Desc → CLIP
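A rough sketch of the baseline's frame selection and OCR steps, assuming PySceneDetect's ContentDetector and PaddleOCR's default English model; the selected frames and concatenated OCR text would then go through CLIP as in the earlier sketch. The exact frame count, detector settings, and OCR output parsing used in the talk may differ.

```python
import cv2
from paddleocr import PaddleOCR
from scenedetect import ContentDetector, detect

ocr_engine = PaddleOCR(lang="en")

def scene_frames(video_path, max_frames=10):
    """Detect scene cuts and grab the first frame of up to max_frames scenes."""
    scenes = detect(video_path, ContentDetector())
    cap = cv2.VideoCapture(video_path)
    frames = []
    for start, _end in scenes[:max_frames]:
        cap.set(cv2.CAP_PROP_POS_FRAMES, start.get_frames())
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames

def frames_to_ocr_text(frames):
    """Run PaddleOCR on each frame and concatenate the recognized lines."""
    lines = []
    for frame in frames:
        for page in ocr_engine.ocr(frame) or []:
            for line in page or []:
                lines.append(line[1][0])  # each line is (box, (text, confidence))
    return " ".join(lines)
```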
After-Scale (SigLIP MaxSim and the weighted fusion are sketched after this list):
- Video → 16 uniform frames → OCR → ICDAR → original + Google Translate → original + LLaMA-summarized Google Translate → ColBERT
- Video → 16 uniform frames → SigLIP → MaxSim
- Desc → original + Google Translate → ColBERT
- ASR → Whisper → original + Google Translate → original + LLaMA-summarized Google Translate → ColBERT
- Weighted fusion based on whether the video is professional, edited, or raw (using a trained classifier)
- Query → query decomposition
- Rerank with MonoT5
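A hedged sketch of two of the post-Scale pieces: SigLIP MaxSim scoring over uniformly sampled frames, and type-conditioned late fusion of per-modality scores. The checkpoint name, the fusion weights, and the video-type label are placeholders rather than the values from the talk, and query decomposition and MonoT5 reranking are omitted.

```python
import torch
from transformers import AutoModel, AutoProcessor

siglip = AutoModel.from_pretrained("google/siglip-base-patch16-224")
siglip_proc = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")

def siglip_maxsim(query_text, frames):
    """Score a query against a video as the max over frames of cosine similarity."""
    img_in = siglip_proc(images=frames, return_tensors="pt")
    txt_in = siglip_proc(text=[query_text], padding="max_length", return_tensors="pt")
    with torch.no_grad():
        img = torch.nn.functional.normalize(siglip.get_image_features(**img_in), dim=-1)
        txt = torch.nn.functional.normalize(siglip.get_text_features(**txt_in), dim=-1)
    return (txt @ img.T).max().item()

# Illustrative per-type fusion weights; the talk tuned these and predicted the
# type (professional / edited / raw) with a trained classifier.
FUSION_WEIGHTS = {
    "professional": {"visual": 0.2, "ocr": 0.3, "asr": 0.3, "desc": 0.2},
    "edited":       {"visual": 0.3, "ocr": 0.3, "asr": 0.2, "desc": 0.2},
    "raw":          {"visual": 0.4, "ocr": 0.2, "asr": 0.2, "desc": 0.2},
}

def fuse(scores_by_modality, video_type):
    """Weighted late fusion of per-modality retrieval scores for one video."""
    weights = FUSION_WEIGHTS[video_type]
    return sum(weights[m] * s for m, s in scores_by_modality.items())
```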
Takeaways:
- Late fusion works better
- Summarizing the OCR & STT text improves the text representation
- SigLIP works better than CLIP as the vision encoder
- Combining the translation with the original helps
- No single modality is sufficient
- The more text we can extract, the better
- LLM summarization works better than raw text