Evaluating Multimodal Models: Are Our Benchmarks Enough?

In recent years, vision-language models (VLMs) have exploded in popularity. From CLIP and LLaVA to, more recently, PaLI-Gemma, the field has moved towards models capable of processing both images and text. These systems promise powerful capabilities across domains — from education and accessibility to robotics and document processing. But as this line of research advances, there is one question that we need to keep in mind:

How do we evaluate these models?

Why Evaluation Matters

Different VLMs have different strengths. Some are trained primarily on natural images and captions, while others are fine-tuned on documents, charts, or even diagrams. Architectural choices — like whether a model uses a transformer-based image encoder or relies on region features — can have a huge impact on performance for specific tasks.

This means there is no one-size-fits-all solution. Depending on the application — say, answering questions about a scientific diagram vs. interpreting a business chart — the best model can vary significantly. That’s why rigorous, thoughtful evaluation is essential.

The Current State of Evaluation

So, how are VLMs evaluated today? The short answer: mostly through benchmark datasets. These datasets pose questions or tasks for the model to solve, typically structured as question-answer pairs, and a handful of them have become the de facto standard for comparing models.

These benchmarks offer a solid foundation for comparison, but they are often limited to multiple-choice questions, which deviate from real use cases.
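
To make this concrete, here is a minimal sketch of how such a benchmark is typically scored. It assumes a hypothetical query_vlm wrapper around the model under test and benchmark items stored as dictionaries with an image path, a question, answer options, and the gold option index; real harnesses differ mainly in prompt format and answer parsing.

```python
# Minimal sketch of multiple-choice benchmark scoring.
# `query_vlm` is a hypothetical wrapper around whatever VLM is being tested.

def query_vlm(image_path: str, prompt: str) -> str:
    raise NotImplementedError  # API call or local inference for the model under test

def multiple_choice_accuracy(items: list[dict]) -> float:
    """items: dicts with "image_path", "question", "options", "answer_idx"."""
    correct = 0
    for item in items:
        letters = "ABCD"[: len(item["options"])]
        prompt = (
            item["question"] + "\n"
            + "\n".join(f"{l}) {o}" for l, o in zip(letters, item["options"]))
            + "\nAnswer with a single letter."
        )
        prediction = query_vlm(item["image_path"], prompt).strip().upper()[:1]
        correct += prediction == letters[item["answer_idx"]]
    return correct / len(items)
```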

The Limitations We Need to Address

While benchmark datasets are useful, current evaluation methods lean heavily on multiple-choice question answering and image captioning. This leads to several core challenges:

1. Artificial Evaluation Settings

Multiple-choice questions offer clear metrics (accuracy), but they’re not representative of real-world usage. When you ask a model, “What does this chart say about sales in Q4?” you don’t provide four possible answers — you expect an open-ended, context-aware response.

Moreover, multiple-choice setups can mask weaknesses. Models may exploit statistical patterns or biases in the dataset without truly understanding the content — problems that have long plagued datasets like SNLI and MMLU.
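
One common sanity check for this kind of shortcut is a "blind" run: score the same benchmark with the question and options but without the image. If the blind score sits well above chance, part of the dataset can be solved from textual priors alone. The sketch below assumes a hypothetical query_text_only call and the same item format as above.

```python
# Sanity check for answer-prior leakage: run the benchmark without the image.
# A blind score far above chance means the dataset leaks textual biases.

def query_text_only(prompt: str) -> str:
    raise NotImplementedError  # model call with no image attached

def blind_accuracy(items: list[dict]) -> float:
    correct = 0
    for item in items:
        letters = "ABCD"[: len(item["options"])]
        prompt = (
            item["question"] + "\n"
            + "\n".join(f"{l}) {o}" for l, o in zip(letters, item["options"]))
            + "\nAnswer with a single letter."
        )
        pred = query_text_only(prompt).strip().upper()[:1]
        correct += pred == letters[item["answer_idx"]]
    return correct / len(items)

def chance_accuracy(items: list[dict]) -> float:
    """Expected accuracy of uniform random guessing, for comparison."""
    return sum(1 / len(item["options"]) for item in items) / len(items)
```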

2. Weakness in Multimodal Reasoning

Many VLMs can recognize entities in individual modalities: text in a document, or objects in an image. But connecting information across modalities remains a major challenge. This is particularly apparent in tasks like visual multimodal entity linking. In our European project LUMINOUS, we are actively working on ways to evaluate and improve this capability. For example, we have created a dataset, MATE, that probes VLMs' ability to perform simple linking tasks which require understanding both the visual and textual modalities.

Image from Alonso et al. (2025) Vision-Language Models Struggle to Align Entities across Modalities

In the chart below, you can see that even SoTA VLMs struggle to link entities across modalities as the number of objects increases, while human performance stays flat regardless of the number of objects. Strikingly, this suggests that current VLMs are unable to correctly combine information that is available to them in separate modalities.

Image from Alonso et al. (2025) Vision-Language Models Struggle to Align Entities across Modalities
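
To give a flavour of what such a linking probe looks like (this is a simplified illustration, not the actual MATE format), imagine scenes of objects whose colour and shape are only visible in the image, while their names are only given in the text. Answering requires chaining both modalities: find the red object in the image, then look up its name in the text.

```python
# Illustrative cross-modal linking probe (simplified, not the MATE format).
# Colour and shape are visual attributes; the name exists only in the text.

from dataclasses import dataclass
import random

@dataclass
class SceneObject:
    shape: str   # visible in the image, also used as the key in the text
    color: str   # visible only in the image
    name: str    # given only in the prompt text

def make_probe(objects: list[SceneObject]) -> tuple[str, str]:
    """Build one question and its gold answer (shapes and colours assumed unique)."""
    target = random.choice(objects)
    listing = "\n".join(f"- the {o.shape} is called {o.name}" for o in objects)
    question = (
        "The image shows several objects.\n"
        f"{listing}\n"
        f"What is the name of the {target.color} object?"
    )
    return question, target.name

# Example scene: the model must see in the image that the red object is the
# cube, then read in the text that the cube is called "Kira".
scene = [
    SceneObject("cube", "red", "Kira"),
    SceneObject("sphere", "blue", "Odo"),
    SceneObject("cylinder", "green", "Worf"),
]
question, gold = make_probe(scene)
```

Scaling the number of objects per scene then gives exactly the axis along which the paper reports that model accuracy drops while human accuracy stays flat.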

3. Ambiguity in Natural Input

Real users often ask vague or underspecified questions. Yet our benchmarks rarely test models in these scenarios. Can the model handle ambiguous queries? Can it ask clarifying questions or make reasonable assumptions based on context? These nuances are largely missing from current evaluations.
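
A rough way to start probing this, pending proper benchmarks, is to feed deliberately underspecified questions and measure how often the model asks for clarification instead of guessing. The sketch below uses a crude keyword heuristic for illustration; in practice such judgements should be validated with human raters or a judge model. The example prompts and the query_vlm wrapper are hypothetical.

```python
# Crude probe for clarification behaviour on underspecified questions.

UNDERSPECIFIED = [
    ("report.png", "Is this good?"),                # good for what, and for whom?
    ("chart.png", "What about the other region?"),  # which region?
]

def query_vlm(image_path: str, prompt: str) -> str:
    raise NotImplementedError  # call to the VLM under test

def asks_for_clarification(reply: str) -> bool:
    """Very rough heuristic; real evaluations need human or judge ratings."""
    text = reply.strip().lower()
    cues = ("could you clarify", "what do you mean", "can you specify", "which one")
    return text.endswith("?") or any(cue in text for cue in cues)

def clarification_rate(items: list[tuple[str, str]]) -> float:
    replies = [query_vlm(img, q) for img, q in items]
    return sum(asks_for_clarification(r) for r in replies) / len(replies)
```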

Where Do We Go From Here?

To truly assess multimodal models, we need to move beyond rigid benchmarks and into more realistic evaluation scenarios. Promising directions include open-ended responses judged for quality rather than matched against fixed options, tasks that genuinely require linking information across modalities, and test sets that contain the ambiguous, underspecified questions real users actually ask.
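
For the first of these directions, one common pattern is to let the model answer freely and have a separate judge model compare its answer to a reference, instead of forcing a choice among fixed options. The sketch below assumes hypothetical query_vlm and query_judge wrappers; judge-based scoring has known biases of its own, so verdicts should be spot-checked against human ratings.

```python
# Sketch of open-ended scoring with a judge model instead of option matching.
# items: (image_path, question, reference_answer) triples.

JUDGE_TEMPLATE = """You are grading an answer to a question about an image.
Question: {question}
Reference answer: {reference}
Model answer: {prediction}
Reply with exactly one word: CORRECT or INCORRECT."""

def query_vlm(image_path: str, prompt: str) -> str:
    raise NotImplementedError  # call to the VLM under test

def query_judge(prompt: str) -> str:
    raise NotImplementedError  # call to the judge model

def open_ended_score(items: list[tuple[str, str, str]]) -> float:
    verdicts = []
    for image_path, question, reference in items:
        prediction = query_vlm(image_path, question)
        verdict = query_judge(
            JUDGE_TEMPLATE.format(
                question=question, reference=reference, prediction=prediction
            )
        )
        verdicts.append(verdict.strip().upper().startswith("CORRECT"))
    return sum(verdicts) / len(verdicts)
```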

Final Thoughts

Benchmarking VLMs is a necessary first step — but it’s only the beginning. As these models move into more domains and user-facing applications, evaluation must evolve to keep pace with reality. We need tools that test not just accuracy, but adaptability, reasoning, and robustness across modalities. Most importantly, we need to be sure that we understand what we are evaluating.