Large Multimodal Models (LMMs) such as LLaVA have shown strong performance in visual-linguistic reasoning. These models first embed images into a fixed, large number of visual tokens and then feed them into a Large Language Model (LLM). However, this design causes an excessive number of tokens for dense visual scenarios such as high-resolution images and videos, leading to great inefficiency. While token pruning/merging methods do exist, they produce a single-length output for each image and do not afford flexibility in trading off information density vs. efficiency. Inspired by the concept of Matryoshka Dolls, we propose M3: Matryoshka Multimodal Models, which learns to represent visual content as nested sets of visual tokens that capture information across multiple coarse-to-fine granularities. Our approach offers several unique benefits for LMMs: (1) One can explicitly control the visual granularity per test instance during inference, e.g., adjusting the number of tokens used to represent an image based on the anticipated complexity or simplicity of the content; (2) M3 provides a framework for analyzing the granularity needed for existing datasets, where we find that COCO-style benchmarks only need around 9 visual tokens to obtain accuracy similar to that of using all 576 tokens; (3) Our approach provides a foundation to explore the best trade-off between performance and visual token length at the sample level, where our investigation reveals that a large gap exists between the oracle upper bound and current fixed-scale representations.
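To make the nested representation concrete, here is a minimal sketch of how coarse-to-fine token sets could be built with 2D average pooling (the operation named in the training description below). The 24x24 CLIP feature grid and the exact scale set {576, 144, 36, 9, 1} are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def nested_visual_tokens(tokens, scales=(576, 144, 36, 9, 1)):
    """Build coarse-to-fine (Matryoshka-style) visual token sets via 2D average pooling.

    tokens: (B, 576, D) visual features, assumed to come from a 24x24 CLIP grid.
    Returns {scale: (B, scale, D)}, where coarser sets are pooled versions of finer ones.
    """
    B, N, D = tokens.shape
    side = int(N ** 0.5)                                   # 24 for 576 tokens
    grid = tokens.transpose(1, 2).reshape(B, D, side, side)
    nested = {}
    for s in scales:
        out_side = int(s ** 0.5)                           # 24, 12, 6, 3, 1
        pooled = F.adaptive_avg_pool2d(grid, out_side)     # (B, D, out_side, out_side)
        nested[s] = pooled.flatten(2).transpose(1, 2)      # (B, s, D)
    return nested
```

At inference time, one simply selects a single scale from this dictionary, e.g. `nested[9]`, to trade information density for efficiency on a per-sample basis.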
M3 enables LMMs to learn a nested representation of visual tokens in a coarse-to-fine manner. Our approach shows several benefits:
Our training is very simple: we take the average of the language generation loss over diverse visual token scales. In our paper, we use average pooling to obtain the multi-granularity visual tokens.
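A minimal sketch of that training objective, assuming a LLaVA-style model whose forward pass accepts pre-projected visual tokens (the `llm(...)` wrapper and its signature below are hypothetical):

```python
import torch

def m3_training_loss(llm, nested_tokens, input_ids, labels):
    """Average the language generation loss over all visual token scales.

    nested_tokens: {scale: (B, scale, D)} coarse-to-fine visual tokens.
    llm(...) is a hypothetical wrapper around a LLaVA-style forward pass
    that returns an object with a .loss attribute (autoregressive CE loss).
    """
    losses = []
    for scale, vis_tokens in nested_tokens.items():
        out = llm(visual_tokens=vis_tokens, input_ids=input_ids, labels=labels)
        losses.append(out.loss)
    # M3 objective: mean of the per-scale generation losses
    return torch.stack(losses).mean()
```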
The figure above shows the performance comparison between LLaVA-1.5-7B-M3 and various baselines on MMBench.
The table below shows the performance comparison between LLaVA-NeXT-7B-M3 and various baselines on MMBench.
The table below shows the performance of LLaVA-1.5-7B-M3 and various baselines on MMBench.
We extract the responses from LLaVA-NeXT-M3 on the TextVQA benchmark and show samples where visual tokens at different scales answer the question correctly or incorrectly. As shown in the figure above, OCR performance aligns with the complexity of the images, which indicates that M3 can serve as a metric for sample-level complexity.
@article{cai2024matryoshka,
title={Matryoshka Multimodal Models},
author={Cai, Mu and Yang, Jianwei and Gao, Jianfeng and Lee, Yong Jae},
journal={arXiv preprint arXiv:2405.17430},
year={2024}
}
This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. We thank the LLaMA team for giving us access to their models, and the open-source projects Alpaca and Vicuna.
Usage and License Notices: The data, code, and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of CLIP, LLaMA, Vicuna, and GPT-4. The dataset is CC BY NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.
Related Links: [LLaVA] [Instruction Tuning with GPT-4]