Molmo and PixMo

Summary

What did authors try to accomplish?

In this paper, the authors (from AI2) open source Molmo, a family of Multimodal Open Language Models that is on par with or better than most open-source and many closed-source models. The idea of the paper is very simple, and even the architecture is a fairly conventional one. What makes the difference is a very high-quality and diverse dataset, mostly collected by the authors themselves and combined with existing open-source datasets. Most current VLMs are trained on data synthetically generated by GPT-4V or other closed-source models, which is effectively a distillation of those large models. As a result, there is a lack of truly open foundational VLMs and of public knowledge about how to train them.

The authors have worked in exactly that niche. They found that the main reason behind their success is not the model but the data. They created a very high-quality dataset family named PixMo (Pixels for Molmo). PixMo-Cap, the dense-captioning dataset, is used to pre-train the model, which is then fine-tuned on the remaining PixMo-* datasets together with several academic datasets. They do not use any RLHF training.
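
The two-stage recipe is simple enough to sketch. Below is a minimal, hypothetical outline of the pipeline in Python; the train() helper and the mixture list are placeholders of mine, not the authors' code.

```python
# Hypothetical sketch of Molmo's two-stage training recipe (helper names are mine).

def train(model, dataset, task):
    """Placeholder for a standard supervised next-token-prediction loop."""
    print(f"training on {dataset} ({task})")
    return model

def train_molmo(model):
    # Stage 1: multimodal pre-training for caption generation on PixMo-Cap.
    # All model parameters (vision encoder, connector, LLM) are updated.
    model = train(model, dataset="PixMo-Cap", task="dense captioning")

    # Stage 2: supervised fine-tuning on a mixture of academic datasets
    # and the newly collected PixMo-* datasets.
    for dataset in ["academic datasets", "PixMo-* datasets"]:
        model = train(model, dataset=dataset, task="supervised fine-tuning")

    # There is no third stage: the authors do not use RLHF.
    return model
```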

Model Architecture

What are the key elements of the approach?

  1. Their data collection strategy: unlike most prior work, they did not ask annotators to write dense captions or other annotations directly as text. Instead, annotators were asked to describe each image in a 60-90 second spoken recording. This simple "trick" prevents annotators from copy-pasting or using LLM-generated text, and empirically produces more detailed descriptions in less time.
  2. They collected a unique pointing dataset, on which the model is trained to point at the objects in the image referred to by the prompt.
  3. The authors have released four different models:
    1. MolmoE-1B, based on the OLMoE-1B-7B mixture-of-experts LLM - most efficient
    2. Molmo-7B-O, based on OLMo-7B
    3. Molmo-7B-D, based on Qwen2 7B
    4. Molmo-72B, based on Qwen2 72B - best performing

What can I use for myself?

In practice I have found the model's outputs to be really good, which has motivated me to use it in different projects.
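
For reference, the released checkpoints can be run through Hugging Face transformers with the model's custom remote code. The sketch below follows my recollection of the official model card; the repo id and the process() / generate_from_batch() methods come from that remote code, so treat them as assumptions and check the current model card.

```python
# Sketch of running a released Molmo checkpoint via Hugging Face transformers.
# The repo id and the process()/generate_from_batch() methods are assumptions
# based on my recollection of the model card's custom remote code.
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

repo = "allenai/Molmo-7B-D-0924"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True,
                                          torch_dtype="auto", device_map="auto")
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True,
                                             torch_dtype="auto", device_map="auto")

image = Image.open(requests.get("https://picsum.photos/536/354", stream=True).raw)
inputs = processor.process(images=[image], text="Describe this image.")
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

output = model.generate_from_batch(
    inputs,
    GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
    tokenizer=processor.tokenizer,
)
generated = output[0, inputs["input_ids"].size(1):]
print(processor.tokenizer.decode(generated, skip_special_tokens=True))
```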

Also, once released, the dataset will be a seminal contribution to the open-source vision-language community.

Annotations

Annotation

« effectively distilling these closed models into open ones »()


Annotation

« The resulting VLMs, therefore, are effectively distillations of proprietary VLMs »()

Annotation

« The success of our approach relies on careful choices for the model architecture details, a well-tuned training pipeline, and most critically, the quality of our new datasets, collectively named PixMo (Pixels for Molmo), all of which will be released. »(2)

Annotation

« Empirically, we found that with this modality switching “trick” annotators provide far more detailed descriptions in less time, and for each description, »(2)

Annotation

« After training our models to generate dense captions we fine-tune them on a broad range of use cases with supervised training data. »(2)

Annotation

« This novel pointing data enables our models to answer some questions more naturally by pointing to the pixels that support the answer, improves counting accuracy »(2)

Annotation

« Our best-in-class Molmo-72B model, based on Qwen2 72B, achieves the highest academic benchmark score and ranks second by human preference, just behind GPT-4o. »(2)

Annotation

« (1) a pre-processor that converts the input image into a set of multiscale, multi-crop images »(2)

Annotation

« (2) a ViT image encoder that independently maps each of these images into a set of vision tokens, »(2)

Annotation

« (3) a connector that projects the vision tokens to the language model’s input dimension with an MLP and then pools the vision tokens to reduce their count, »(2)

Annotation

« (4) a decoder-only Transformer LLM »(2)
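
Putting the four annotated components together, here is a rough, runnable PyTorch sketch of that forward pass. The dimensions, crop count, pooling scheme, and the stand-in ViT/LLM modules are illustrative assumptions, not the paper's exact design.

```python
# Hypothetical sketch of the Molmo forward pass described in the annotations above.
# Dimensions, crop counts, and the pooling are illustrative; the ViT and LLM are
# dummy stand-ins so the sketch runs end to end.
import torch
import torch.nn as nn

VIT_DIM, LLM_DIM, PATCHES_PER_CROP = 1024, 4096, 576  # 576 = (336 / 14) ** 2

class MolmoSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # (2) ViT image encoder stand-in (the paper uses CLIP ViT-L/14 336px).
        self.vit = nn.Conv2d(3, VIT_DIM, kernel_size=14, stride=14)
        # (3) connector: pool groups of 4 vision tokens, then project to the LLM input dim.
        self.connector = nn.Sequential(
            nn.Linear(VIT_DIM * 4, LLM_DIM), nn.GELU(), nn.Linear(LLM_DIM, LLM_DIM)
        )
        # (4) decoder-only Transformer LLM stand-in.
        self.llm = nn.TransformerEncoderLayer(d_model=LLM_DIM, nhead=8, batch_first=True)

    def forward(self, image, text_embeds):
        # (1) pre-processor: multiscale, multi-crop views of the input image (toy crops here).
        crops = torch.stack([image] * 5)                            # [5, 3, 336, 336]
        # (2) encode each crop independently into vision tokens.
        tokens = self.vit(crops).flatten(2).transpose(1, 2)         # [5, 576, VIT_DIM]
        # (3) concatenate groups of 4 neighbouring tokens (a crude pooling stand-in),
        #     then project into the LLM's embedding space.
        pooled = tokens.reshape(tokens.size(0), PATCHES_PER_CROP // 4, VIT_DIM * 4)
        vision_embeds = self.connector(pooled).flatten(0, 1).unsqueeze(0)  # [1, 720, LLM_DIM]
        # (4) vision tokens followed by text tokens go into the decoder.
        return self.llm(torch.cat([vision_embeds, text_embeds], dim=1))

model = MolmoSketch()
out = model(torch.randn(3, 336, 336), torch.randn(1, 16, LLM_DIM))
print(out.shape)  # torch.Size([1, 736, 4096]) = 5 crops * 144 pooled tokens + 16 text tokens
```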

Annotation

« For the vision encoder, all of our released models use OpenAI’s ViT-L/14 336px CLIP model [27], which provides consistently good results »(3)

Annotation

« For the LLM, we offer a variety of choices at different scales and degrees of openness: fully open OLMo-7B-1024 (a pre-released October, 2024 backbone, which will be released at a later date), fully open OLMoE-1B-7B (our most efficient model), open-weight Qwen2 7B, and open-weight Qwen2 72B (our best-performing model). »(3)

Annotation

« 1) multimodal pre-training for caption generation using PixMo-Cap, our newly collected caption data »(3)

Annotation

« (2) supervised fine-tuning using a mixture of academic datasets and our newly collected supervised PixMo-⋆ family of datasets »(3)

Annotation

« We do not use RLHF. »(3)

Annotation

« train all model parameters on the task of caption generation »(3)

Annotation

« We also created a fourth image description by asking the language-only LLM to summarize the three original transcripts into a single description. »(3)

Annotation

« In total, we trained on 712k distinct images with ∼1.3M captions (including the augmentation). »(3)

Annotation

« We used our stage 1 model to generate a dense caption for the image and passed that caption »(3)

Annotation

« OCR output for the image (from a non-VLM, off-the-shelf OCR model), »(3)

Annotation

« the question to a language-only LLM. »(3)

Annotation

« This dataset has 162k question-answer pairs and 73k images. »(3)
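
The three annotations above describe how this question-answer data was produced: a dense caption from the stage-1 model, OCR text from an off-the-shelf model, and the user's question are all handed to a language-only LLM. A hypothetical sketch of that pipeline (the helper objects and the prompt wording are mine, not the paper's):

```python
# Hypothetical sketch of the caption + OCR + question -> language-only LLM pipeline.
# caption_model, ocr_model, and llm are placeholders for a stage-1 Molmo model,
# an off-the-shelf (non-VLM) OCR system, and a language-only LLM respectively.

def generate_answer(image, question, caption_model, ocr_model, llm):
    caption = caption_model.describe(image)   # dense caption from the stage-1 model
    ocr_text = ocr_model.read(image)          # text read off the image by conventional OCR
    # The language-only LLM answers from textual context alone; it never sees pixels.
    prompt = (
        "Image description:\n" + caption + "\n\n"
        "Text found in the image (OCR):\n" + ocr_text + "\n\n"
        "Question: " + question + "\n"
        "Answer using only the information above."
    )
    return llm.complete(prompt)
```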

Annotation

« (1) enables the model to point to anything described by text »(3)

Annotation

« (2) enables the model to count by pointing »(3)

Annotation

« (3) enables the model to use pointing as a natural form of visual explanation when answering questions »(3)

Annotation

« We collected 2.3M question-point pairs from 428k images. »(3)
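
If I remember the model card correctly, pointing answers come back as inline <point x="..." y="..."> tags with coordinates given as percentages of the image size; treat that exact format as an assumption. A small parser like the sketch below turns them into pixel coordinates (and counting is then just the number of parsed points):

```python
# Hypothetical parser for Molmo-style point outputs.
# Assumes points are emitted as <point x="..." y="..."> tags with coordinates
# expressed as percentages (0-100) of the image width/height; verify against
# the released model's actual output format before relying on this.
import re

POINT_RE = re.compile(r'<point\s+x="([\d.]+)"\s+y="([\d.]+)"[^>]*>')

def parse_points(answer: str, image_width: int, image_height: int):
    """Return (x, y) pixel coordinates for every point tag in the model output."""
    return [(float(x) / 100.0 * image_width, float(y) / 100.0 * image_height)
            for x, y in POINT_RE.findall(answer)]

# Example on a made-up output string:
pts = parse_points('Two mugs: <point x="12.5" y="40.0"> and <point x="73.0" y="38.5">.',
                   1024, 768)
print(pts)       # [(128.0, 307.2), (747.52, 295.68)]
print(len(pts))  # 2 -> counting by pointing
```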

Annotation

« We collected 79k question-answer pairs from 29k images. »(3)

Annotation

« Our most efficient model, MolmoE-1B, based on the OLMoE-1B-7B mixture-of-experts LLM, nearly matches the performance of GPT-4V on both academic benchmarks and Elo. »(6)

