What did the authors try to accomplish?

In this paper, the authors open-source Molmo (Multimodal Open Language Model), a family of vision-language models (VLMs) that performs on par with or better than most open-source and closed-source models. It was developed by AI2. The idea of the paper is simple, and the architecture is simple and conventional as well. What stands out is the data: a high-quality, diverse dataset, mostly collected by the authors, combined with currently available open-source datasets. Most current VLMs are trained on data synthetically generated by GPT-4V or other closed-source models, which effectively makes them distillations of those larger models. As a result, there is a lack of foundational VLMs and little shared knowledge of how they should be trained.

The authors work in that niche. They find that the main reason for their success is not the model but the data. They created a high-quality dataset named PixMo (Pixels for Molmo). They pre-train the model on PixMo-Cap, the dense caption dataset, and then fine-tune the pre-trained model on the PixMo-* datasets and various academic datasets. They did not use any RLHF training.

Model Architecture:

What are the key elements of the approach?

  1. Their data collection strategy. Unlike most others, they did not use an image-to-text approach for dense captioning or any other data collection. Instead, they asked annotators to describe each image aloud for 60-90 seconds. This simple "trick" prevented annotators from copy-pasting text or using LLM-generated output.
  2. They used a unique dataset (collected by them) to train the model to point at objects in the image as requested by the prompt.
  3. The authors have released four models:
    1. MolmoE-1B, based on the OLMoE-1B-7B MoE (most efficient)
    2. Molmo-7B-O, based on OLMo-7B
    3. Molmo-7B-D, based on Qwen2-7B
    4. Molmo-72B, based on Qwen2-72B (best performance)
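
To make the pointing capability in item 2 more concrete, here is a minimal sketch of how pointing output could be turned into pixel coordinates. It assumes the model emits points as XML-like tags such as `<point x="50.0" y="25.0" alt="dog">dog</point>`, with `x` and `y` given as percentages of the image width and height. That format is an assumption not described in this summary, and `parse_points` is a hypothetical helper, not part of any released Molmo code.

```python
import re
from typing import List, Tuple

# Assumed output format (not taken from this summary):
#   <point x="50.0" y="25.0" alt="dog">dog</point>
# where x and y are percentages of the image width and height.
POINT_RE = re.compile(r'<point\s+x="(\d+(?:\.\d+)?)"\s+y="(\d+(?:\.\d+)?)"[^>]*>')


def parse_points(text: str, width: int, height: int) -> List[Tuple[float, float]]:
    """Extract every point tag in `text` and scale it to pixel coordinates."""
    points = []
    for x_pct, y_pct in POINT_RE.findall(text):
        x_px = float(x_pct) / 100.0 * width
        y_px = float(y_pct) / 100.0 * height
        points.append((x_px, y_px))
    return points


if __name__ == "__main__":
    reply = 'Here it is: <point x="50.0" y="25.0" alt="dog">dog</point>'
    print(parse_points(reply, width=640, height=480))  # [(320.0, 120.0)]
```

Keeping coordinates relative to the image size, if that is indeed how the model reports them, would let the same answer be reused for any resized copy of the image.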

What can I use for myself?

In practice, I found the model's output to be really good, which has motivated me to use it in different projects.

Also, once it is released, the dataset should be a seminal contribution to the open-source vision-language community.



