{"id":786,"date":"2025-07-01T16:25:49","date_gmt":"2025-07-01T14:25:49","guid":{"rendered":"https:\/\/luminous-horizon.eu\/?page_id=786"},"modified":"2025-07-02T14:58:12","modified_gmt":"2025-07-02T12:58:12","slug":"evaluating-multimodal-models-are-our-benchmarks-enough","status":"publish","type":"page","link":"https:\/\/luminous-horizon.eu\/index.php\/blogs\/evaluating-multimodal-models-are-our-benchmarks-enough\/","title":{"rendered":"Evaluating Multimodal Models: Are Our Benchmarks Enough?"},"content":{"rendered":"\n<p class=\"has-x-large-font-size\">Evaluating Multimodal Models: Are Our Benchmarks Enough?<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p id=\"c65c\">In recent years,&nbsp;<strong>vision-language models (VLMs)<\/strong>&nbsp;have exploded in popularity. From&nbsp;<a href=\"https:\/\/arxiv.org\/abs\/2103.00020\" rel=\"noreferrer noopener\" target=\"_blank\">CLIP<\/a>,&nbsp;<a href=\"https:\/\/arxiv.org\/abs\/2304.08485\" rel=\"noreferrer noopener\" target=\"_blank\">LLaVa<\/a>&nbsp;and more recently&nbsp;<a href=\"https:\/\/arxiv.org\/abs\/2407.07726\" rel=\"noreferrer noopener\" target=\"_blank\">PaLI-Gemma<\/a>, the field has moved towards models capable of processing both images and text. These systems promise powerful capabilities across domains \u2014 from education and accessibility to robotics and document processing. But as this line of research advances, there is one question that we need to keep in mind:<\/p>\n<\/blockquote>\n\n\n\n<p id=\"b62c\"><strong>How do we evaluate these models?<\/strong><\/p>\n\n\n\n<h1 class=\"wp-block-heading has-large-font-size\" id=\"aaed\">Why Evaluation Matters<\/h1>\n\n\n\n<p id=\"9d75\">Different VLMs have different strengths. Some are trained primarily on natural images and captions, while others are fine-tuned on documents, charts, or even diagrams. Architectural choices \u2014 like whether a model uses a transformer-based image encoder or relies on region features \u2014 can have a huge impact on performance for specific tasks.<\/p>\n\n\n\n<p id=\"2e11\">This means there is no one-size-fits-all solution. Depending on the application \u2014 say, answering questions about a scientific diagram vs. interpreting a business chart \u2014 the best model can vary significantly. That\u2019s why rigorous, thoughtful&nbsp;<strong>evaluation<\/strong>&nbsp;is essential.<\/p>\n\n\n\n<h1 class=\"wp-block-heading has-large-font-size\" id=\"757b\">The Current State of Evaluation<\/h1>\n\n\n\n<p id=\"df04\">So, how are VLMs evaluated today? The short answer: mostly through&nbsp;<strong>benchmark datasets<\/strong>. These datasets pose questions or tasks for the model to solve, typically structured as question-answer pairs. 
Some of the most widely used include:

- **VQA (Visual Question Answering)** ([link](https://arxiv.org/abs/1505.00468)) — 760K questions, generally with short, one-word answers.
- **DocVQA** ([link](https://arxiv.org/abs/2007.00398)) — Questions about text in documents (50K examples).
- **ChartQA** ([link](https://arxiv.org/abs/2203.10244)) — Understanding and reasoning over data visualizations.
- **AI2D** ([link](https://arxiv.org/abs/1603.07396)) — 15K multiple-choice questions on diagrams.
- **TextVQA** ([link](https://arxiv.org/abs/1904.08920)) — 45K questions focused on reading text in images.
- **MMMU** ([website](https://mmmu-benchmark.github.io/)) — A multi-domain, expert-level dataset with 11.5K questions.
- **MathVista** ([link](https://arxiv.org/pdf/2310.02255v3)) — Visual math problems with diagrammatic reasoning.
- **MM-Bench** ([link](https://arxiv.org/pdf/2307.06281v5)) — 3K vision-language questions with multiple-choice answers.

These benchmarks offer a solid foundation for comparison — but they are often limited to multiple-choice questions, a format that deviates from real use cases.

# The Limitations We Need to Address

While benchmark datasets are useful, current evaluation methods lean heavily on **multiple-choice question answering** and **image captioning**. This leads to several core challenges:

## 1. Artificial Evaluation Settings

Multiple-choice questions offer clear metrics (accuracy), but they are not representative of real-world usage. When you ask a model, "What does this chart say about sales in Q4?" you don't provide four possible answers — you expect an open-ended, context-aware response.

Moreover, multiple-choice setups can **mask weaknesses**. Models may exploit statistical patterns or biases in the dataset without truly understanding the content — problems that have long plagued datasets like [SNLI](https://aclanthology.org/N18-2017/) and [MMLU](https://aclanthology.org/2025.naacl-long.262/).
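One cheap way to expose this kind of shortcut learning is an image-ablation probe: score the same multiple-choice items with and without the image and compare the two accuracies. If the "blind" score stays close to the full score, the items can largely be answered from textual priors alone. The sketch below is only an illustration of that idea, not part of any specific benchmark toolkit; `answer_fn` is a hypothetical wrapper around whatever VLM is being tested, and the `(image, question, options, gold)` records are assumed to come from whichever benchmark you use.

```python
# Minimal sketch of an image-ablation ("blind baseline") probe for a
# multiple-choice VQA-style benchmark. `answer_fn` is a hypothetical wrapper
# around the VLM under test: it receives the image (or None) plus the question
# and options, and returns one of the option strings.

def accuracy(records, answer_fn, use_image=True):
    """Exact-match accuracy over (image, question, options, gold) records."""
    correct = 0
    for image, question, options, gold in records:
        pred = answer_fn(image if use_image else None, question, options)
        correct += int(pred.strip().lower() == gold.strip().lower())
    return correct / len(records)

def language_prior_gap(records, answer_fn):
    """Compare full vs. image-blind accuracy; a small gap is a red flag."""
    full = accuracy(records, answer_fn, use_image=True)
    blind = accuracy(records, answer_fn, use_image=False)
    return {"with_image": full, "without_image": blind, "gap": full - blind}
```

A small gap does not prove the model ignores the image on every item, but it is a quick diagnostic that the benchmark (or the model) can be gamed from text alone.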
## 2. Weakness in Multimodal Reasoning

Many VLMs can recognize entities in individual modalities — text in a document, or objects in an image. But [**connecting information across modalities remains a major challenge**](https://arxiv.org/abs/2503.03854). This is particularly apparent in tasks like visual multimodal entity linking. In our [European Project LUMINOUS](https://luminous-horizon.eu/), we are actively working on ways to evaluate and improve this aspect. For example, we have created a dataset (MATE) that probes VLMs' ability to perform simple linking tasks requiring an understanding of both the visual and textual modalities.

*Figure: Image from Alonso et al. (2025), "Vision-Language Models Struggle to Align Entities across Modalities".*

In the chart below, you can see that even SoTA VLMs struggle to link across modalities as the number of objects increases, while human performance stays level regardless of the number of objects. Strikingly, this suggests that current VLMs are unable to correctly combine information that is available to them in separate modalities.

*Figure: Image from Alonso et al. (2025), "Vision-Language Models Struggle to Align Entities across Modalities".*
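To make this kind of task concrete, here is a hypothetical linking item in the spirit of such probes. The object names, file name, and schema below are invented for illustration and are not the actual MATE format: the colour of each object is only visible in the image, its name only appears in the text, and the question can only be answered by connecting the two.

```python
# Hypothetical cross-modal linking item (illustrative only; not the MATE schema).
# The name "Brimo" exists only in the textual description, while the colour of
# each object is only visible in the image, so answering requires linking both.

item = {
    "image": "scene_042.png",  # invented file name; the image shows a cube, a ball and a cone
    "description": (
        "The scene contains three objects. "
        "The cube is called Brimo, the ball is called Talu, and the cone is called Vex."
    ),
    "question": "What colour is the object called Brimo?",
    "gold": "red",  # only recoverable by mapping the name (text) to the cube (image)
}

def build_prompt(item: dict) -> str:
    """Combine the textual scene description with the question."""
    return f"{item['description']}\n\n{item['question']}"

def is_correct(prediction: str, item: dict) -> bool:
    """Simple exact-match check; the difficulty lies in the linking step."""
    return prediction.strip().lower() == item["gold"].lower()
```

Scaling the number of objects in items like this is what makes the trend in the chart above visible: each additional object adds another candidate for the text-to-image mapping the model has to get right.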
## 3. Ambiguity in Natural Input

Real users often ask vague or underspecified questions. Yet our benchmarks rarely test models in these scenarios. Can the model handle ambiguous queries? Can it ask clarifying questions or make reasonable assumptions based on context? These nuances are largely missing from current evaluations.

# Where Do We Go From Here?

To truly assess multimodal models, we need to move beyond rigid benchmarks and into **more realistic evaluation scenarios**. Some promising directions include:

- **Open-ended evaluation tasks**, where models must generate full responses or summaries based on visual and textual inputs.
- **Interactive evaluation**, where models respond to follow-up or clarifying questions.
- **Task-based evaluations**, where models complete a goal (e.g., extract structured data from a receipt) instead of simply answering a question.
- **Bias and robustness checks**, ensuring models perform consistently across diverse content and aren't exploiting spurious patterns.

# Final Thoughts

Benchmarking VLMs is a necessary first step — but it's only the beginning. As these models move into more domains and user-facing applications, **evaluation must evolve to keep pace with reality**. We need tools that test not just accuracy, but adaptability, reasoning, and robustness across modalities. Most importantly, we need to be sure that we understand what we are evaluating.