Models are ranked by their average performance on the perception and reasoning tasks, from highest to lowest. “OCR”, “RS”, “DT”, “MO”, and “AD” denote the five task domains: Optical Character Recognition in the Wild, Remote Sensing, Diagram and Table, Monitoring, and Autonomous Driving, respectively. “Avg” is the weighted average accuracy across the task domains of a split (each domain weighted by its number of QA pairs), and “Avg-C” is the unweighted mean of the per-domain accuracies.
By default, the leaderboard is sorted by the Overall score.
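The “Avg”, “Avg-C”, and “Overall” columns can be reproduced from the per-domain scores and the QA-pair counts in the table. The short sketch below is not the benchmark’s own evaluation code; it assumes “Avg” weights each domain by its QA-pair count and “Avg-C” takes the plain mean of the domain scores, and it matches the LLaVA-OneVision Perception row up to rounding of the displayed values.

```python
# Minimal sketch (not the official evaluation code) of how "Avg", "Avg-C", and
# "Overall" appear to be computed, assuming "Avg" weights each task domain by
# its number of QA pairs and "Avg-C" is the plain mean of the per-domain scores.
# Inputs are the Perception row of LLaVA-OneVision from the table; tiny
# discrepancies vs. the table come from rounding of the displayed scores.

qa_pairs = {"OCR": 5740, "RS": 3738, "DT": 5433, "MO": 2196, "AD": 3660}
accuracy = {"OCR": 78.69, "RS": 53.53, "DT": 60.70, "MO": 40.26, "AD": 45.77}

# Weighted average ("Avg"): average accuracy over all QA pairs in the split.
avg = sum(accuracy[d] * qa_pairs[d] for d in qa_pairs) / sum(qa_pairs.values())

# Unweighted average ("Avg-C"): plain mean of the per-domain accuracies.
avg_c = sum(accuracy.values()) / len(accuracy)

print(f"Avg   = {avg:.2f}")    # ~59.59 (table: 59.59)
print(f"Avg-C = {avg_c:.2f}")  # ~55.79 (table: 55.81)

# "Overall" appears to combine the Perception and Reasoning "Avg" scores,
# again weighted by QA pairs: (59.59 * 20767 + 41.17 * 2832) / 23599 ≈ 57.4
```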
In the table below, columns prefixed with “P-” report Perception accuracy and columns prefixed with “R-” report Reasoning accuracy; the “QA pairs” row lists the number of questions behind each column.

| # | Method | Affiliation | LLM | Date | Overall | P-OCR | P-RS | P-DT | P-MO | P-AD | P-Avg | P-Avg-C | R-OCR | R-DT | R-MO | R-AD | R-Avg | R-Avg-C |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | QA pairs | | | | 23599 | 5740 | 3738 | 5433 | 2196 | 3660 | 20767 | 20767 | 500 | 500 | 498 | 1334 | 2832 | 2832 |
| 1 | LLaVA-OneVision | Bytedance & NTU S-Lab | 7B | 2024-09-29 | 57.4 | 78.69 | 53.53 | 60.70 | 40.26 | 45.77 | 59.59 | 55.81 | 61.80 | 40.00 | 40.76 | 34.08 | 41.17 | 44.16 |
| 2 | Qwen2-VL | Alibaba | 7B | 2024-09-03 | 56.5 | 81.38 | 44.81 | 70.18 | 37.30 | 34.62 | 58.96 | 53.66 | 63.40 | 48.60 | 33.13 | 31.47 | 40.39 | 44.15 |
| 3 | Xiaosuan-2.0-VL | OpenBayes | - | 2024-09-30 | 55.7 | 80.75 | 44.66 | 68.01 | 37.07 | 31.94 | 57.64 | 52.48 | 63.40 | 49.40 | 35.74 | 31.62 | 41.06 | 45.04 |
| 4 | InternVL2 | Shanghai AI Lab | 7B | 2024-08-26 | 53.5 | 73.92 | 39.35 | 62.80 | 53.19 | 35.46 | 55.82 | 52.94 | 57.40 | 39.00 | 43.57 | 29.84 | 38.74 | 42.45 |
| 5 | Claude 3.5 Sonnet | Anthropic | - | 2024-08-26 | 51.6 | 72.47 | 25.74 | 67.44 | 32.19 | 40.77 | 52.90 | 47.72 | 61.90 | 61.20 | 41.79 | 31.92 | 44.12 | 49.20 |
| 6 | InternLM-XComposer2.5 | Shanghai AI Lab | 7B | 2024-08-26 | 50.0 | 69.25 | 36.12 | 63.92 | 39.48 | 33.63 | 52.47 | 48.48 | 53.40 | 41.00 | 17.67 | 29.99 | 33.90 | 35.52 |
| 7 | InternVL-Chat-V1.5 | Shanghai AI Lab | 20B | 2024-08-26 | 49.4 | 71.51 | 33.55 | 55.83 | 51.16 | 31.42 | 51.56 | 48.69 | 56.80 | 35.40 | 37.35 | 28.94 | 36.48 | 39.62 |
| 8 | VITA | Tencent Youtu Lab | 8*7B | 2024-09-12 | 47.5 | 70.60 | 39.40 | 42.60 | 37.50 | 38.20 | 48.40 | 45.66 | 62.20 | 31.80 | 43.20 | 35.40 | 40.90 | 43.15 |
| 9 | Mini-Gemini-34B-HD | CUHK | 34B | 2024-08-26 | 45.9 | 69.55 | 40.40 | 44.36 | 39.61 | 32.70 | 48.05 | 45.32 | 59.20 | 39.20 | 20.48 | 22.84 | 31.73 | 35.43 |
| 10 | MiniCPM-V 2.5 | OpenBMB | 8B | 2024-08-26 | 45.6 | 66.79 | 27.69 | 52.81 | 38.70 | 34.15 | 47.37 | 44.03 | 44.00 | 31.80 | 36.95 | 31.03 | 34.50 | 35.95 |
| 11 | GPT-4o | OpenAI | - | 2024-08-26 | 45.2 | 77.69 | 28.92 | 46.68 | 33.93 | 22.43 | 46.43 | 41.93 | 61.40 | 44.80 | 36.51 | 26.41 | 37.61 | 42.28 |
| 12 | CogVLM2-llama3-Chat | THU & Zhipu AI | 8B | 2024-08-26 | 44.6 | 69.97 | 28.76 | 47.51 | 33.74 | 30.22 | 45.85 | 42.04 | 54.00 | 32.80 | 41.16 | 31.18 | 37.25 | 39.62 |
| 13 | Cambrian-1-34B | NYU | 34B | 2024-08-26 | 44.1 | 66.45 | 38.63 | 40.44 | 45.98 | 33.61 | 46.68 | 45.02 | 55.00 | 36.00 | 19.48 | 16.07 | 27.06 | 31.64 |
| 14 | Cambrian-1-8B | NYU | 8B | 2024-08-26 | 42.7 | 58.68 | 40.05 | 32.73 | 47.68 | 38.52 | 43.82 | 43.53 | 53.20 | 27.40 | 42.37 | 30.73 | 36.16 | 38.43 |
| 15 | SliME-8B | CASIA | 8B | 2024-08-26 | 39.6 | 53.45 | 42.27 | 29.34 | 40.62 | 33.66 | 40.29 | 39.87 | 53.20 | 29.40 | 36.14 | 31.55 | 35.80 | 37.57 |
| 16 | Gemini-1.5-Pro | Google | - | 2024-08-26 | 38.2 | 67.62 | 13.99 | 39.90 | 31.11 | 26.64 | 39.63 | 35.85 | 52.70 | 33.20 | 28.33 | 19.20 | 29.19 | 33.36 |
| 17 | GPT-4o-mini | OpenAI | - | 2024-08-26 | 36.4 | 62.51 | 6.69 | 44.23 | 26.50 | 24.18 | 37.12 | 32.82 | 47.00 | 39.08 | 25.81 | 26.76 | 32.48 | 24.85 |
| 18 | Monkey | HUST | 7B | 2024-08-26 | 35.3 | 54.63 | 24.99 | 23.51 | 28.01 | 29.67 | 36.30 | 33.96 | 27.20 | 20.80 | 27.31 | 33.04 | 28.84 | 27.09 |
| 19 | mPLUG-DocOwl 1.5 | Alibaba | 7B | 2024-08-26 | 32.7 | 51.15 | 23.71 | 29.34 | 24.97 | 28.28 | 33.71 | 31.49 | 42.60 | 19.80 | 20.48 | 26.04 | 26.88 | 27.23 |
| 20 | DeepSeek-VL | DeepSeek-AI | 7B | 2024-08-26 | 32.4 | 49.55 | 25.49 | 23.38 | 26.97 | 33.39 | 33.14 | 31.76 | 45.20 | 23.80 | 16.67 | 27.31 | 27.98 | 28.25 |
| 21 | SliME-13B | CASIA | 13B | 2024-08-26 | 31.7 | 50.58 | 25.82 | 20.93 | 24.73 | 27.16 | 31.50 | 29.84 | 41.00 | 39.00 | 33.13 | 30.80 | 34.46 | 35.98 |
| 22 | Yi-VL-34B | 01.AI | 34B | 2024-08-26 | 31.0 | 44.95 | 31.62 | 15.99 | 34.85 | 28.31 | 30.97 | 31.14 | 42.40 | 26.00 | 31.33 | 31.55 | 32.45 | 32.82 |
| 23 | Mini-Gemini-7B-HD | CUHK | 7B | 2024-08-26 | 30.3 | 42.02 | 31.30 | 22.31 | 34.15 | 24.81 | 31.07 | 30.92 | 35.40 | 24.60 | 25.90 | 23.29 | 26.12 | 27.30 |
| 24 | LLaVA-NeXT-Llama3-8B | Bytedance & NTU S-Lab | 8B | 2024-08-26 | 30.2 | 47.94 | 25.42 | 26.63 | 19.46 | 18.66 | 30.14 | 27.62 | 55.20 | 23.40 | 21.08 | 30.73 | 32.06 | 32.60 |
| 25 | LLaVA-NeXT-Qwen-72B | Bytedance & NTU S-Lab | 72B | 2024-08-26 | 28.7 | 37.07 | 29.13 | 27.68 | 29.37 | 17.98 | 29.01 | 28.25 | 17.20 | 34.20 | 27.31 | 29.69 | 27.86 | 27.10 |
| 26 | LLaVA1.5-13B | UW-Madison | 13B | 2024-08-26 | 28.0 | 44.10 | 23.27 | 20.17 | 20.45 | 26.12 | 28.42 | 26.82 | 30.20 | 20.80 | 27.51 | 24.78 | 25.51 | 25.82 |
| 27 | ShareGPT4V-13B | USTC & Shanghai AI Lab | 13B | 2024-08-26 | 27.8 | 44.55 | 23.06 | 20.17 | 19.26 | 26.12 | 28.38 | 26.63 | 26.00 | 20.80 | 27.31 | 24.55 | 24.63 | 24.67 |
| 28 | MiniGPT-v2 | KAUST & Meta AI | 7B | 2024-08-26 | 26.4 | 39.02 | 23.33 | 20.41 | 19.26 | 25.96 | 26.94 | 25.60 | 30.00 | 20.40 | 16.87 | 23.66 | 23.01 | 22.73 |
| 29 | ShareGPT4V-7B | USTC & Shanghai AI Lab | 7B | 2024-08-26 | 26.3 | 39.39 | 22.10 | 20.08 | 19.13 | 26.04 | 26.73 | 22.35 | 24.15 | 20.60 | 26.10 | 24.18 | 23.88 | 23.76 |
| 30 | LLaVA1.5-7B | UW-Madison | 7B | 2024-08-26 | 26.1 | 38.69 | 22.12 | 20.08 | 19.13 | 16.04 | 26.54 | 25.21 | 26.00 | 20.60 | 25.90 | 24.18 | 24.17 | 24.17 |
| 31 | Qwen-VL-Chat | Alibaba | 7B | 2024-08-26 | 21.1 | 32.37 | 15.14 | 15.59 | 22.13 | 15.08 | 20.75 | 20.06 | 28.60 | 13.60 | 16.47 | 24.63 | 21.95 | 20.83 |
| 32 | TextMonkey | HUST | 7B | 2024-08-26 | 17.8 | 37.30 | 11.69 | 5.93 | 16.14 | 14.26 | 18.18 | 17.06 | 30.40 | 2.20 | 4.42 | 20.01 | 15.96 | 14.26 |