Benchmarking Reciprocal Cross-Modal Reasoning for Omnimodal Generation

Yongyuan Liang△*, Wei Chow▲*, Feng Li, Ziqiao Ma, Xiyao Wang, Jiageng Mao,
Jiuhai Chen, Jiatao Gu, Yue Wang, Furong Huang,
University of Maryland, College Park; University of Pennsylvania; University of Southern California;
University of Michigan; The Hong Kong University of Science and Technology
*Equal contribution. Equal advising.
Reciprocal Reasoning: We introduce the first benchmark targeting reciprocal cross-modal reasoning where one modality guides, verifies, or refines outputs in another.
Dual Settings: ROVER evaluates verbally-augmented visual generation and visually-augmented verbal reasoning across diverse domains and reasoning types.
Comprehensive Evaluation: Multi-dimensional evaluation protocol assessing reasoning process, output alignment, and cross-modal consistency.
Key Insights: Cross-modal reasoning strongly correlates with visual generation performance, while current models show limited visually-augmented reasoning capabilities.
ROVER Teaser
Figure 1: The ROVER benchmark. ROVER evaluates UMMs through reciprocal cross-modal reasoning: ROVER-IG (left) requires generating images with verbally-augmented reasoning, while ROVER-TG (right) requires generating text answers with visually-augmented reasoning.

Unified multimodal models (UMMs) have shown remarkable advances in understanding and generating text and images. However, prevailing evaluations treat these abilities in isolation, so tasks with multimodal inputs and outputs are scored primarily through unimodal reasoning. Existing benchmarks rarely require models to use one modality to guide, verify, or refine outputs in the other, and thus fail to capture a central aspiration of unified multimodal models: seamless reasoning across modalities. We address this gap with ROVER, a human-annotated benchmark that explicitly targets reciprocal cross-modal reasoning, comprising 1,285 tasks grounded in 2,048 images across two complementary settings. Verbally-augmented reasoning for visual generation evaluates whether models can use structured verbal prompts and reasoning chains to guide faithful image synthesis. Visually-augmented reasoning for verbal generation evaluates whether models can generate intermediate visualizations that strengthen their own reasoning processes.

ROVER Benchmark Statistics

ROVER Data Viewer

ROVER-IG

Example selector: Natural Science 1–12
Original Data Chart Placeholder

Prompt:

Generate a detailed scientific diagram showing the molecular structure of water, including accurate bond angles and electron cloud representations. Use clear labels and a clean, academic style suitable for a chemistry textbook.
Nano Banana
Generated Image Placeholder
GPT-5
Generated Image Placeholder
BAGEL-Think
Generated Image Placeholder
Qwen-Image
Generated Image Placeholder

ROVER-TG

Example selector: World Model 1–5
Original Data Chart Placeholder

Prompt:

Task prompt will be displayed here.

Answer:

Expected answer will be displayed here.
Nano Banana
Generated Image Placeholder
Generated text response will appear here...
GPT-5
Generated Image Placeholder
Generated text response will appear here...

ROVER Benchmark

Benchmark Overview: ROVER introduces the first benchmark specifically designed to evaluate reciprocal cross-modal reasoning in unified multimodal models. Unlike existing benchmarks that evaluate modalities in isolation, ROVER requires models to use information from one modality to inform and improve outputs in another.

benchmark category
Figure 2: Verbally-Augmented Reasoning for Visual Generation. The benchmark spans 4 domains (natural science, culture and art, common sense, and logic), instantiated across 7 reasoning subtasks.
benchmark category
Figure 3: Visually-Augmented Reasoning for Verbal Generation. The benchmark spans 3 scenarios (physical world modeling, logical assistance, and visual perception enhancement), instantiated across 6 subtasks.

Verbally-Augmented Reasoning for Visual Generation: This setting evaluates whether models can use structured verbal prompts and reasoning chains to guide faithful image synthesis. It spans 4 domains (natural science, culture and art, common sense, and logic) instantiated across 7 reasoning types: temporal, spatial, causal, synthetic, quantitative, abstract, and mathematical. Each task provides a textual prompt, an initial image, and a chain of constraints that a correct output image must satisfy, so faithful generation requires genuine visual understanding and multi-step reasoning; a hypothetical task record is sketched below.
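
To make this task structure concrete, the sketch below shows what a single ROVER-IG task record might look like. The field names and the example content are illustrative assumptions for exposition, not the benchmark's actual schema.

```python
# Hypothetical ROVER-IG task record; field names and values are illustrative,
# not the benchmark's actual data format.
rover_ig_task = {
    "task_id": "natural_science_0001",
    "domain": "natural_science",   # natural_science, culture_art, common_sense, or logic
    "reasoning_type": "causal",    # one of the 7 types (temporal, spatial, causal, ...)
    "prompt": "Show the ice cube in the glass after it has fully melted.",
    "input_image": "images/ice_cube_glass.png",
    # Chain of constraints that a correct output image must satisfy.
    "constraints": [
        "The solid ice cube is no longer visible.",
        "The water level is slightly higher than in the input image.",
        "The glass, table, and background remain unchanged.",
    ],
}
```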

Visually-Augmented Reasoning for Verbal Generation: This setting evaluates whether models can generate intermediate visualizations that strengthen their own reasoning processes. Unlike text-only Chain-of-Thought, we examine scenarios where models generate intermediate visual representations to facilitate reasoning. The benchmark focuses on 3 scenarios: physical world modeling (functioning as world simulators), logical assistance (generating visual aids for abstract problems), and visual perception enhancement (creating supportive images for challenging perception tasks).
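
As a minimal illustration of what visually-augmented reasoning asks of a model, the sketch below assumes a hypothetical unified-model interface with `generate_image` and `generate_text` methods; it is not ROVER's evaluation code, only a schematic of the interleaved loop being probed.

```python
# Minimal sketch of visually-augmented verbal reasoning (ROVER-TG).
# `model` is a hypothetical unified multimodal model exposing
# generate_image(...) and generate_text(...); the interface is assumed.
def answer_with_visual_thought(model, question, input_images):
    # 1. The model externalizes an intermediate visualization of its reasoning,
    #    e.g. the scene after a physical event or a diagram for a logic puzzle.
    sketch = model.generate_image(
        prompt=f"Visualize the intermediate state needed to answer: {question}",
        images=input_images,
    )
    # 2. The final verbal answer is conditioned on both the original inputs
    #    and the self-generated visualization.
    answer = model.generate_text(
        prompt=f"Using the visualization, answer: {question}",
        images=input_images + [sketch],
    )
    return sketch, answer
```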

Evaluation Protocol

Multi-Dimensional Assessment: We adopt a multi-dimensional protocol that combines an automated VLM judge (GPT-4.1) with expert validation on stratified samples.

Verbally-Augmented Generation Metrics: We assess model performance across 5 rubric dimensions: (1) Reasoning Process (RP) evaluates the quality of verbal reasoning through logical structure and domain knowledge application; (2) Reasoning Visual (RV) measures how well generated visuals match target descriptions; (3) Reasoning Alignment (Align.) quantifies consistency between verbal reasoning and visual outcomes; (4) Visual Consistency (VC) ensures non-target elements remain unchanged; (5) Image Quality (IQ) assesses technical excellence and visual coherence.

Visually-Augmented Generation Metrics: We evaluate across 3 dimensions: (1) Interleaved Reasoning Quality (IR) evaluates plausibility and relevance of intermediate visual representations; (2) Final Answer Accuracy (Acc.) measures whether the model's final reasoning outcome matches ground truth; (3) Reasoning-Answer Alignment (Align.) quantifies how effectively generated images contribute to reaching correct conclusions.
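
The snippet below sketches how per-dimension judge scores might be collected for a single ROVER-IG sample, reusing the hypothetical task fields from the earlier sketch. The `query_vlm_judge` helper, its keyword arguments, and the 0–100 score scale are all assumptions for illustration, not the protocol's actual implementation.

```python
# Hypothetical collection of per-dimension judge scores for one ROVER-IG sample.
# query_vlm_judge(...) is an assumed helper that asks GPT-4.1 to score one
# rubric dimension; the 0-100 scale is an assumption.
IG_DIMENSIONS = ["RP", "RV", "Align", "VC", "IQ"]
# ROVER-TG uses its own dimension set: IR, Acc., and Align.
TG_DIMENSIONS = ["IR", "Acc", "Align"]

def judge_ig_sample(query_vlm_judge, task, model_reasoning, model_image):
    scores = {}
    for dim in IG_DIMENSIONS:
        scores[dim] = query_vlm_judge(
            rubric=dim,
            prompt=task["prompt"],
            constraints=task["constraints"],
            reasoning=model_reasoning,
            image=model_image,
        )
    return scores  # e.g. {"RP": 72, "RV": 65, "Align": 70, "VC": 88, "IQ": 81}
```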

ROVER Leaderboard

For ROVER-IG, leaderboard scores are computed with the weighted formula 20% × VC + 20% × IQ + 60% × RV; a minimal implementation is sketched below.
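
In code, the ROVER-IG leaderboard score is just this weighted average. The example values in the comment are illustrative, not reported numbers.

```python
def rover_ig_score(vc: float, iq: float, rv: float) -> float:
    """Leaderboard score for ROVER-IG: 20% VC + 20% IQ + 60% RV."""
    return 0.2 * vc + 0.2 * iq + 0.6 * rv

# Illustrative example (not reported values):
# rover_ig_score(vc=85.0, iq=80.0, rv=70.0) -> 75.0
```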

| Model | Type | Params | Date | Natural Science | Culture / Art | Common Sense | Logic | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Nano Banana | Proprietary | - | 2025-08-22 | 80.9 | 79.5 | 81.1 | 63.3 | 76.2 |
| Gemini 2.0 Flash | Proprietary | - | 2025-02-05 | 72.0 | 72.9 | 73.2 | 55.0 | 68.3 |
| GPT-5 | Proprietary | - | 2025-08-07 | 74.9 | 74.5 | 72.8 | 59.8 | 70.5 |
| BAGEL-Think | Open Source | 14B | 2025-05-23 | 61.1 | 65.7 | 65.0 | 37.5 | 57.3 |
| BAGEL | Open Source | 14B | 2025-05-23 | 46.2 | 53.9 | 51.6 | 44.1 | 49.0 |
| Step1X-Edit v1.2 | Open Source | - | 2025-09-08 | 59.2 | 58.8 | 57.0 | 37.7 | 53.2 |
| UniCoT | Open Source | 14B | 2025-07-29 | 49.1 | 69.4 | 63.6 | 28.2 | 52.6 |
| BLIP3o-NEXT | Open Source | 8B | 2025-08-04 | 50.2 | 52.0 | 56.0 | 43.2 | 50.3 |
| UniPic2-Metaquery-9B | Open Source | 9B | 2025-04-28 | 50.2 | 48.1 | 49.3 | 43.1 | 47.6 |
| Emu2-Gen | Open Source | 8B | 2025-02-13 | 52.2 | 47.5 | 49.5 | 43.7 | 48.2 |
| OmniGen2 | Open Source | 27B | 2025-06-16 | 48.2 | 49.5 | 51.3 | 42.6 | 47.9 |
| Qwen-Image-Edit | Edit | 20B | 2025-08-04 | 59.8 | 70.5 | 65.6 | 48.6 | 61.1 |
| FLUX.1 Kontext | Edit | 12B | 2025-08-06 | 51.5 | 57.6 | 54.8 | 37.9 | 50.5 |
| UltraEdit (SD3) | Edit | 2B | 2024-08-31 | 40.1 | 51.4 | 39.1 | 42.4 | 43.3 |
| VAREdit-8B | Edit | 8B | 2025-08-21 | 48.7 | 55.2 | 47.0 | 31.2 | 45.5 |
| Step1X-Edit v1.1 | Edit | - | 2025-06-09 | 55.2 | 59.6 | 51.8 | 39.1 | 51.4 |

Findings and Insights

We conducted a comprehensive evaluation of 17 state-of-the-art unified multimodal models across both settings in ROVER. Our experiments reveal critical insights into the current state and limitations of cross-modal reasoning in modern UMMs.

Key Finding 1

Cross-modal reasoning capabilities and alignment strongly correlate with visual generation effectiveness.

Key Finding 2

Unified models capable of interleaved image-text generation demonstrate superior reasoning-dependent visual generation performance.

Key Finding 3

Current models remain severely limited in visually-augmented reasoning, showing relative strength in perception and physical modeling but weakness in logical tasks.


Cross-Modal Reasoning Matters for UMMs: To validate that UMMs perform cross-modal reasoning internally and that this mechanism cannot be replicated by chaining external models, we conducted a comparative analysis between unified models and cascade approaches. The results show that reasoning across modalities does not fully transfer across separate model components: unified models must cross modality boundaries internally to produce emergent cross-modal insights. A schematic of the cascade pipeline is sketched below.
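
For reference, the cascade baseline in Figure 6 can be summarized as a two-stage pipeline in which reasoning and generation live in different models. The helpers `refine_prompt_with_llm` and `generate_with_image_model` below are placeholders standing in for the GPT-4o prompt refiner and the FLUX generator, not the paper's exact code.

```python
# Schematic cascade baseline (cf. Figure 6): reasoning and generation are
# handled by separate models. Both helper functions are hypothetical
# placeholders for the LLM refiner and the image generator.
def cascade_generate(task_prompt, input_image,
                     refine_prompt_with_llm, generate_with_image_model):
    # Step 1: a text-only LLM rewrites the task into an explicit, literal
    # image-editing instruction.
    refined_prompt = refine_prompt_with_llm(task_prompt)
    # Step 2: a separate image model executes the refined instruction
    # without access to the LLM's internal reasoning.
    return generate_with_image_model(refined_prompt, input_image)
```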

Cascade Analysis
Figure 6: Cascade reasoning evaluation comparing cascade approaches (FLUX+GPT with GPT-4o prompt refinement) against unified multimodal models.

Coherence Between Reasoning Subtasks: Analysis reveals uneven performance across reasoning dimensions, with models excelling in temporal, spatial, and causal reasoning while struggling with abstract and mathematical tasks. This pattern indicates that current UMMs better handle concrete, observable phenomena than symbolic reasoning. Strong interdependence among physical reasoning types suggests shared mechanisms for processing spatiotemporal relationships, while abstract reasoning develops as a distinct capability.

Reasoning Analysis
Figure 7: Analysis of reasoning capabilities showing performance patterns across different reasoning subtasks and their correlations.

Evaluation Protocol Reliability: We conducted user studies with 4 human experts to validate our VLM-as-judge evaluation protocol. Results demonstrate strong alignment between GPT-4.1 and human expert judgments across all evaluation dimensions. Visual-quality-related metrics show particularly strong human-VLM agreement, while reasoning-related metrics exhibit larger but acceptable discrepancies due to inherent complexities in multimodal reasoning assessment.
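
Figure 8 reports Pearson correlation and Mean Absolute Error between GPT-4.1 and human scores. The sketch below shows one straightforward way to compute both statistics for a single evaluation dimension, assuming paired lists of judge and human scores; the exact aggregation used in the figure is not specified here.

```python
import numpy as np
from scipy.stats import pearsonr

def judge_reliability(vlm_scores, human_scores):
    """Pearson r and MAE between VLM-judge and human expert scores
    for one evaluation dimension (paired, identically ordered lists)."""
    vlm = np.asarray(vlm_scores, dtype=float)
    human = np.asarray(human_scores, dtype=float)
    r, _p_value = pearsonr(vlm, human)       # linear agreement
    mae = np.mean(np.abs(vlm - human))       # average absolute score gap
    return r, mae
```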

Evaluation Reliability
Figure 8: Evaluation reliability of GPT-4.1 across five assessment dimensions, showing Pearson correlation coefficients and Mean Absolute Error compared to human experts.

Conclusion

We introduce ROVER, the first benchmark for reciprocal cross-modal reasoning, which systematically evaluates 17 unified multimodal models across 23 diverse task types spanning both verbally-augmented visual generation and visually-augmented verbal reasoning. Our evaluation exposes substantial performance gaps in current models and establishes that interleaved generation capabilities are strongly correlated with cross-modal reasoning effectiveness. These findings highlight critical limitations in existing unified models and provide insights for advancing cross-modal reasoning in future omnimodal models. ROVER represents a critical step toward enabling true omnimodal generation through reciprocal cross-modal reasoning.