Molmo and PixMo

Summary

What did authors try to accomplish?

In this paper, the authors (from AI2) open source Molmo, a family of Multimodal Open Language Models that is on par with or better than most open-source and many closed-source models. The idea of the paper is very simple, and even the architecture is a fairly conventional one. What makes the difference is a very high-quality and diverse dataset, mostly collected by the authors themselves and combined with existing open-source datasets. Most current VLMs are trained on data synthetically generated by GPT-4V or other closed-source models, which is effectively a distillation of those large models. As a result, there is a lack of truly open foundational VLMs and of public knowledge about how to train them.

The authors have worked in exactly that niche. They found that the main reason behind their success is not the model but the data. They created a very high-quality dataset family named PixMo (Pixels for Molmo). PixMo-Cap, the dense-captioning dataset, is used to pre-train the model, which is then fine-tuned on the remaining PixMo-* datasets together with several academic datasets. They do not use any RLHF training.
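
The two-stage recipe is simple enough to sketch. Below is a minimal, hypothetical outline of the pipeline in Python; the train() helper and the mixture list are placeholders of mine, not the authors' code.

```python
# Hypothetical sketch of Molmo's two-stage training recipe (helper names are mine).

def train(model, dataset, task):
    """Placeholder for a standard supervised next-token-prediction loop."""
    print(f"training on {dataset} ({task})")
    return model

def train_molmo(model):
    # Stage 1: multimodal pre-training for caption generation on PixMo-Cap.
    # All model parameters (vision encoder, connector, LLM) are updated.
    model = train(model, dataset="PixMo-Cap", task="dense captioning")

    # Stage 2: supervised fine-tuning on a mixture of academic datasets
    # and the newly collected PixMo-* datasets.
    for dataset in ["academic datasets", "PixMo-* datasets"]:
        model = train(model, dataset=dataset, task="supervised fine-tuning")

    # There is no third stage: the authors do not use RLHF.
    return model
```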

Model Architecture

What are the key elements of the approach?

  1. Their data collection strategy: unlike most prior work, they did not ask annotators to write dense captions or other annotations directly as text. Instead, annotators were asked to describe each image in a 60-90 second spoken recording. This simple "trick" prevents annotators from copy-pasting or using LLM-generated text, and empirically produces more detailed descriptions in less time.
  2. They collected a unique pointing dataset, on which the model is trained to point at the objects in the image referred to by the prompt.
  3. The authors have released four different models:
    1. MolmoE-1B, based on the OLMoE-1B-7B mixture-of-experts LLM - most efficient
    2. Molmo-7B-O, based on OLMo-7B
    3. Molmo-7B-D, based on Qwen2 7B
    4. Molmo-72B, based on Qwen2 72B - best performing

What can I use for myself?

In practice I have found the model's outputs to be really good, which has motivated me to use it in different projects.
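
For reference, the released checkpoints can be run through Hugging Face transformers with the model's custom remote code. The sketch below follows my recollection of the official model card; the repo id and the process() / generate_from_batch() methods come from that remote code, so treat them as assumptions and check the current model card.

```python
# Sketch of running a released Molmo checkpoint via Hugging Face transformers.
# The repo id and the process()/generate_from_batch() methods are assumptions
# based on my recollection of the model card's custom remote code.
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

repo = "allenai/Molmo-7B-D-0924"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True,
                                          torch_dtype="auto", device_map="auto")
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True,
                                             torch_dtype="auto", device_map="auto")

image = Image.open(requests.get("https://picsum.photos/536/354", stream=True).raw)
inputs = processor.process(images=[image], text="Describe this image.")
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

output = model.generate_from_batch(
    inputs,
    GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
    tokenizer=processor.tokenizer,
)
generated = output[0, inputs["input_ids"].size(1):]
print(processor.tokenizer.decode(generated, skip_special_tokens=True))
```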

Also, once released, the dataset will be a seminal contribution to the open-source vision-language community.

Annotations

Annotation

« effectively distilling these closed models into open ones »()


Annotation

« The resulting VLMs, therefore, are effectively distillations of proprietary VLMs »()

Annotation

« The success of our approach relies on careful choices for the model architecture details, a well-tuned training pipeline, and most critically, the quality of our new datasets, collectively named PixMo (Pixels for Molmo), all of which will be released. »(2)

Annotation

« Empirically, we found that with this modality switching “trick” annotators provide far more detailed descriptions in less time, and for each description, »(2)

Annotation

« After training our models to generate dense captions we fine-tune them on a broad range of use cases with supervised training data. »(2)

Annotation

« This novel pointing data enables our models to answer some questions more naturally by pointing to the pixels that support the answer, improves counting accuracy »(2)

Annotation

« Our best-in-class Molmo-72B model, based on Qwen2 72B, achieves the highest academic benchmark score and ranks second by human preference, just behind GPT-4o. »(2)

Annotation

« (1) a pre-processor that converts the input image into a set of multiscale, multi-crop images »(2)

Annotation

« (2) a ViT image encoder that independently maps each of these images into a set of vision tokens, »(2)

Annotation

« (3) a connector that projects the vision tokens to the language model’s input dimension with an MLP and then pools the vision tokens to reduce their count, »(2)

Annotation

« (4) a decoder-only Transformer LLM »(2)
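
Putting the four annotated components together, here is a rough, runnable PyTorch sketch of that forward pass. The dimensions, crop count, pooling scheme, and the stand-in ViT/LLM modules are illustrative assumptions, not the paper's exact design.

```python
# Hypothetical sketch of the Molmo forward pass described in the annotations above.
# Dimensions, crop counts, and the pooling are illustrative; the ViT and LLM are
# dummy stand-ins so the sketch runs end to end.
import torch
import torch.nn as nn

VIT_DIM, LLM_DIM, PATCHES_PER_CROP = 1024, 4096, 576  # 576 = (336 / 14) ** 2

class MolmoSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # (2) ViT image encoder stand-in (the paper uses CLIP ViT-L/14 336px).
        self.vit = nn.Conv2d(3, VIT_DIM, kernel_size=14, stride=14)
        # (3) connector: pool groups of 4 vision tokens, then project to the LLM input dim.
        self.connector = nn.Sequential(
            nn.Linear(VIT_DIM * 4, LLM_DIM), nn.GELU(), nn.Linear(LLM_DIM, LLM_DIM)
        )
        # (4) decoder-only Transformer LLM stand-in.
        self.llm = nn.TransformerEncoderLayer(d_model=LLM_DIM, nhead=8, batch_first=True)

    def forward(self, image, text_embeds):
        # (1) pre-processor: multiscale, multi-crop views of the input image (toy crops here).
        crops = torch.stack([image] * 5)                            # [5, 3, 336, 336]
        # (2) encode each crop independently into vision tokens.
        tokens = self.vit(crops).flatten(2).transpose(1, 2)         # [5, 576, VIT_DIM]
        # (3) concatenate groups of 4 neighbouring tokens (a crude pooling stand-in),
        #     then project into the LLM's embedding space.
        pooled = tokens.reshape(tokens.size(0), PATCHES_PER_CROP // 4, VIT_DIM * 4)
        vision_embeds = self.connector(pooled).flatten(0, 1).unsqueeze(0)  # [1, 720, LLM_DIM]
        # (4) vision tokens followed by text tokens go into the decoder.
        return self.llm(torch.cat([vision_embeds, text_embeds], dim=1))

model = MolmoSketch()
out = model(torch.randn(3, 336, 336), torch.randn(1, 16, LLM_DIM))
print(out.shape)  # torch.Size([1, 736, 4096]) = 5 crops * 144 pooled tokens + 16 text tokens
```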

Annotation

« For the vision encoder, all of our released models use OpenAI’s ViT-L/14 336px CLIP model [27], which provides consistently good results »(3)

Annotation

« For the LLM, we offer a variety of choices at different scales and degrees of openness: fully open OLMo-7B-1024 (a pre-released October, 2024 backbone, which will be released at a later date), fully open OLMoE-1B-7B (our most efficient model), open-weight Qwen2 7B, and open-weight Qwen2 72B (our best-performing model). »(3)

Annotation

« 1) multimodal pre-training for caption generation using PixMo-Cap, our newly collected caption data »(3)

Annotation

« (2) supervised fine-tuning using a mixture of academic datasets and our newly collected supervised PixMo-⋆ family of datasets »(3)

Annotation

« We do not use RLHF. »(3)

Annotation

« train all model parameters on the task of caption generation »(3)

Annotation

« We also created a fourth image description by asking the language-only LLM to summarize the three original transcripts into a single description. »(3)

Annotation

« In total, we trained on 712k distinct images with ∼1.3M captions (including the augmentation). »(3)

Annotation

« We used our stage 1 model to generate a dense caption for the image and passed that caption »(3)

Annotation

« OCR output for the image (from a non-VLM, off-the-shelf OCR model), »(3)

Annotation

« the question to a language-only LLM. »(3)

Annotation

« This dataset has 162k question-answer pairs and 73k images. »(3)
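
The three annotations above describe how this question-answer data was produced: a dense caption from the stage-1 model, OCR text from an off-the-shelf model, and the user's question are all handed to a language-only LLM. A hypothetical sketch of that pipeline (the helper objects and the prompt wording are mine, not the paper's):

```python
# Hypothetical sketch of the caption + OCR + question -> language-only LLM pipeline.
# caption_model, ocr_model, and llm are placeholders for a stage-1 Molmo model,
# an off-the-shelf (non-VLM) OCR system, and a language-only LLM respectively.

def generate_answer(image, question, caption_model, ocr_model, llm):
    caption = caption_model.describe(image)   # dense caption from the stage-1 model
    ocr_text = ocr_model.read(image)          # text read off the image by conventional OCR
    # The language-only LLM answers from textual context alone; it never sees pixels.
    prompt = (
        "Image description:\n" + caption + "\n\n"
        "Text found in the image (OCR):\n" + ocr_text + "\n\n"
        "Question: " + question + "\n"
        "Answer using only the information above."
    )
    return llm.complete(prompt)
```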

Annotation

« (1) enables the model to point to anything described by text »(3)

Annotation

« (2) enables the model to count by pointing »(3)

Annotation

« (3) enables the model to use pointing as a natural form of visual explanation when answering questions »(3)

Annotation

« We collected 2.3M question-point pairs from 428k images. »(3)
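
If I remember the model card correctly, pointing answers come back as inline <point x="..." y="..."> tags with coordinates given as percentages of the image size; treat that exact format as an assumption. A small parser like the sketch below turns them into pixel coordinates (and counting is then just the number of parsed points):

```python
# Hypothetical parser for Molmo-style point outputs.
# Assumes points are emitted as <point x="..." y="..."> tags with coordinates
# expressed as percentages (0-100) of the image width/height; verify against
# the released model's actual output format before relying on this.
import re

POINT_RE = re.compile(r'<point\s+x="([\d.]+)"\s+y="([\d.]+)"[^>]*>')

def parse_points(answer: str, image_width: int, image_height: int):
    """Return (x, y) pixel coordinates for every point tag in the model output."""
    return [(float(x) / 100.0 * image_width, float(y) / 100.0 * image_height)
            for x, y in POINT_RE.findall(answer)]

# Example on a made-up output string:
pts = parse_points('Two mugs: <point x="12.5" y="40.0"> and <point x="73.0" y="38.5">.',
                   1024, 768)
print(pts)       # [(128.0, 307.2), (747.52, 295.68)]
print(len(pts))  # 2 -> counting by pointing
```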

Annotation

« We collected 79k question-answer pairs from 29k images. »(3)

Annotation

« Our most efficient model, MolmoE-1B, based on the OLMoE-1B-7B mixture-of-experts LLM, nearly matches the performance of GPT-4V on both academic benchmarks and Elo. »(6)

