Composition-Grounded Instruction Synthesis for Visual Reasoning

Anonymous ICLR submission
COGS teaser
Figure 1: COmposition-Grounded Instruction Synthesis (COGS). Starting from a small set of reasoning-intensive seed questions, COGS decomposes them into primitive perception and reasoning factors, which are then recombined with new image sources to synthesize question–answer pairs. This process expands both the quantity and diversity of reasoning types beyond the original seeds.

Abstract

Pretrained multi-modal large language models (MLLMs) demonstrate strong performance on diverse multimodal tasks, but remain limited in reasoning capabilities for domains where annotations are difficult to collect. In this work, we focus on artificial image domains such as charts, rendered documents, and webpages, which are abundant in practice yet lack large-scale human annotated reasoning datasets.

We introduce COGS (COmposition-Grounded instruction Synthesis), a data-efficient framework for equipping MLLMs with advanced reasoning abilities from a small set of seed questions. The key idea is to decompose each seed question into primitive perception and reasoning factors, which can then be systematically recomposed with new images to generate large collections of synthetic question-answer pairs. Each generated question is paired with subquestions and intermediate answers, enabling reinforcement learning with factor-level process rewards. Experiments on chart reasoning show that COGS substantially improves performance on unseen questions, with the largest gains on reasoning-heavy and compositional questions.

Moreover, training with a factor-level mixture of different seed data yields better transfer across multiple datasets, suggesting that COGS induces generalizable capabilities rather than dataset-specific overfitting. We further demonstrate that the framework extends beyond charts to other domains such as webpages.

COGS

COGS is a data-efficient framework that compositionally generates reasoning data to equip MLLMs with complex visual reasoning for chart and web GUI understanding tasks.

COGS operates in three stages (a code sketch follows the list):

1) Seed Data Decomposition: given a small seed set of complex questions, a multimodal LLM decomposes each question into interpretable perception and reasoning factors. The aggregated factor set captures the seed domain’s compositional structure.

2) Question Generation via Factor Recomposition: given a new image and factors sampled from the decomposed set, a multimodal LLM generates grounded subquestions, composes them into a complex question, and outputs both the intermediate answers to the subquestions and the overall answer, enabling annotation-free data expansion.

3) Reinforcement Learning-Based Fine-tuning: we adopt GRPO to fine-tune a pretrained MLLM with the generated question–answer data. The structure of the data generated by COGS enables richer reward modeling beyond final-answer correctness.
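To make the three stages concrete, here is a minimal Python sketch. The `mllm` callable, the prompt wording, and the JSON output keys are illustrative assumptions rather than the exact interface used in the paper.

```python
# Minimal sketch of the first two COGS stages. `mllm` stands in for a
# multimodal LLM client (hypothetical interface: text prompt + image in,
# JSON string out); the real prompts and output schema may differ.
import json
import random
from typing import Any, Callable

def decompose_seed(mllm: Callable[..., str], question: str, image: Any) -> list[dict]:
    """Stage 1: extract perception/reasoning factors from one seed question."""
    prompt = ("Decompose this question into primitive perception and reasoning "
              f"factors; return a JSON list.\nQuestion: {question}")
    return json.loads(mllm(prompt, image=image))

def recompose(mllm: Callable[..., str], image: Any,
              factor_pool: list[dict], k: int = 3) -> dict:
    """Stage 2: sample factors, then synthesize subquestions and a composed question."""
    factors = random.sample(factor_pool, k)
    prompt = ("Given the image and these factors, write grounded subquestions, "
              "compose them into one complex question, and answer each subquestion "
              f"and the final question as JSON.\nFactors: {json.dumps(factors)}")
    return json.loads(mllm(prompt, image=image))  # e.g. {"question", "sub_qas", "answer"}

# Stage 3 (not sketched): fine-tune the base MLLM with GRPO on the synthesized
# pairs, using the subquestion answers for factor-level process rewards.
```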

COGS framework
Figure 2: The overall pipeline of COGS.

Results

We evaluate COGS across multiple artificial image domains and report results separately for each setting.


Chart Understanding

Chart Question Answering (CQA) requires interpreting visual representations in charts and reasoning over their spatial relations and underlying data. The recently released ChartQAPro benchmark consists of 1,948 human-curated question–answer pairs targeting complex reasoning over diverse chart types.

We randomly select 33% of the released test set as validation data and treat it as the seed questions for data synthesis. The remaining 67% is held out as a fully unseen test set for all experiments. We use the training set of ChartQA as the image source.
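A minimal sketch of this split, assuming the released test set is available as a list of QA records; the 33%/67% proportions follow the protocol above, and the RNG seed is an arbitrary placeholder.

```python
# Sketch of the seed / held-out split over the ChartQAPro test set.
# The 33% / 67% proportions follow the protocol above; rng_seed is illustrative.
import random

def split_seed_and_test(questions: list, seed_frac: float = 0.33, rng_seed: int = 0):
    rng = random.Random(rng_seed)
    shuffled = questions[:]                      # copy; leave the original order intact
    rng.shuffle(shuffled)
    n_seed = int(len(shuffled) * seed_frac)
    return shuffled[:n_seed], shuffled[n_seed:]  # (seed/validation set, unseen test set)
```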

| Model | Factoid | MCQ | Convers. | FactChk. | Hypoth. | Overall |
|---|---|---|---|---|---|---|
| Proprietary Models | | | | | | |
| GPT-5-nano | 45.95 | 63.64 | 49.40 | 63.58 | 49.82 | 50.74 |
| GPT-4o-mini | 43.63 | 66.43 | 45.48 | 59.88 | 45.20 | 48.32 |
| Gemini 2.5 Flash-Lite | 40.42 | 19.96 | 48.77 | 37.43 | 16.66 | 38.72 |
| Claude Haiku 3.5 | 43.44 | 65.03 | 39.84 | 61.79 | 38.77 | 46.74 |
| Open-source Models (7B+) | | | | | | |
| Qwen2.5-VL-7B (base) | 42.07 | 62.59 | 44.88 | 60.78 | 50.72 | 47.36 |
| InternVL3.5-GPT-OSS | 43.02 | 58.74 | 42.86 | 58.02 | 54.48 | 46.86 |
| Phi-4-14B | 23.18 | 34.27 | 40.93 | 46.91 | 36.31 | 31.61 |
| Chart Specialist Models | | | | | | |
| ChartLLaMA | 8.11 | 23.08 | 18.37 | 45.06 | 29.55 | 17.19 |
| ChartMoE | 19.03 | 35.66 | 32.97 | 45.68 | 27.08 | 27.28 |
| Data Synthesis Approaches (over Qwen2.5-VL-7B) | | | | | | |
| ChartQA-Train | 38.77 | 60.14 | 49.72 | 61.11 | 53.12 | 46.64 |
| Chart-R1 | 42.17 | 46.85 | 50.53 | 61.11 | 55.55 | 47.32 |
| In-Context Q Example | 46.33 | 62.94 | 46.91 | 61.11 | 61.72 | 50.58 |
| COGS (Ours) | 46.88 | 65.73 | 51.16 | 61.85 | 58.25 | 52.02 |
Table 1: Accuracy (%) on ChartQAPro grouped by question type. COGS performs the best.

Webpage GUI Understanding

To demonstrate the generality of COGS, we also evaluate it on the webpage question answering domain, which requires visual, semantic, and structural reasoning over graphical user interfaces (GUIs). We adopt VisualWebBench, a benchmark consisting of diverse real-world webpages paired with reasoning-intensive, human-curated questions.

We use questions from VisualWebBench as seeds and screenshots from MultiUI as the image source.

| Model | WebQA |
|---|---|
| Proprietary Models | |
| GPT-5-nano | 89.47 |
| GPT-4o-mini | 81.34 |
| Gemini 2.5 Flash-Lite | 81.85 |
| Claude Haiku 3.5 | 80.86 |
| Open-source Models (~7B) | |
| Qwen2.5-VL-7B (base model) | 85.65 |
| InternVL3.5-GPT-OSS | 74.64 |
| Phi-4-14B | 74.16 |
| Specialist Models | |
| UiX-Qwen2 | 68.90 |
| Inference-time Decomposition | |
| Decompositional CoT | 86.12 |
| Data Synthesis Approaches | |
| MultiUI-WQA | 86.60 |
| COGS (Ours) | 88.04 |
Table 2: Accuracy (%) on VisualWebBench (WebQA). COGS performs the best among non-proprietary models.

Generalization over Mixture of Datasets

We extend COGS to a multi-dataset setting by incorporating the MultiModal Chart Benchmark (MMC-Bench).

We compare two strategies for synthesizing data across two seed datasets, A and B (see the code sketch after this list):

1. Data-level mixture: decompose and recompose A and B independently, then combine the synthesized data, i.e., Recompose(Decompose(A)) + Recompose(Decompose(B)).

2. Factor-level mixture: decompose A and B separately, merge all extracted factors into a joint pool, and recompose using this combined pool, i.e., Recompose(Decompose(A) ∪ Decompose(B)).
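In code, the two strategies differ only in when the factor pools are merged. The sketch below treats the COGS stages as opaque `decompose` and `recompose` callables; these are hypothetical stand-ins rather than the paper's exact interface.

```python
# Sketch of the two mixing strategies. `decompose` maps a seed set to a factor
# pool; `recompose` maps (factor pool, images) to synthesized QA pairs. Both
# are hypothetical stand-ins for the COGS stages described above.
from typing import Callable

def data_level_mix(seeds_a, seeds_b, images, decompose: Callable, recompose: Callable):
    # Recompose(Decompose(A)) + Recompose(Decompose(B)): each synthesized
    # question draws factors from a single seed dataset.
    return recompose(decompose(seeds_a), images) + recompose(decompose(seeds_b), images)

def factor_level_mix(seeds_a, seeds_b, images, decompose: Callable, recompose: Callable):
    # Recompose(Decompose(A) ∪ Decompose(B)): factors are pooled first, so a
    # synthesized question may combine factors from both seed datasets.
    joint_pool = decompose(seeds_a) + decompose(seeds_b)
    return recompose(joint_pool, images)
```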

In addition, we include two "specialist models" trained only with augmented data from a single domain (e.g., trained on augmented A and evaluated on A). These serve as "upper-bound references" for in-domain data augmentation. All methods use Qwen2.5-VL-7B as the base model and are trained with GRPO and ProcessRM-max.

The results show that factor-level mixture is the better mixing strategy, outperforming data-level mixture on both benchmarks.

| Model | ChartQAPro | MMC |
|---|---|---|
| Qwen2.5VL | 47.36 | 85.65 |
| + ChartQAPro | 52.02 | 85.69 |
| + MMC | 49.93 | 88.10 |
| + Data-level Mix | 50.72 | 86.99 |
| + Factor-level Mix | 52.33 | 87.55 |
Table 3: Multi-data co-training results.

Reward Model

We ablate three reward models (a code sketch follows the list):

  • StandardRM: r(y) = r_final(y), which only evaluates final-answer correctness. This is the default option when subquestion supervision is not available.
  • ProcessRM-sum: r(y) = r_final(y) + λ · r_sub(y), which combines correctness of the final answer with the average subquestion accuracy, encouraging faithful reasoning at the factor level.
  • ProcessRM-max: r(y) = max(r_final(y), λ · r_sub(y)), which prioritizes the final answer but still provides reward shaping when the intermediate reasoning is largely correct.
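A minimal sketch of the three variants, assuming a binary final-answer score and per-subquestion correctness scores in [0, 1]; the λ default below is an arbitrary placeholder rather than the value used in the paper.

```python
# Sketch of the three reward variants. `final_correct` is 1.0 if the final
# answer matches the reference and 0.0 otherwise; `sub_scores` holds the
# per-subquestion correctness values in [0, 1]; `lam` corresponds to λ
# (the 0.5 default is a placeholder).

def standard_rm(final_correct: float, sub_scores: list[float], lam: float = 0.5) -> float:
    return final_correct                       # r(y) = r_final(y)

def process_rm_sum(final_correct: float, sub_scores: list[float], lam: float = 0.5) -> float:
    r_sub = sum(sub_scores) / len(sub_scores) if sub_scores else 0.0
    return final_correct + lam * r_sub         # r(y) = r_final(y) + λ·r_sub(y)

def process_rm_max(final_correct: float, sub_scores: list[float], lam: float = 0.5) -> float:
    r_sub = sum(sub_scores) / len(sub_scores) if sub_scores else 0.0
    return max(final_correct, lam * r_sub)     # r(y) = max(r_final(y), λ·r_sub(y))
```

Under ProcessRM-max, a rollout with an incorrect final answer can still receive partial credit when most subquestions are answered correctly, which is the reward-shaping behavior described above.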

A theoretical analysis* and empirical results show that ProcessRM-max is the most effective reward model in the COGS setting.

*Please find more details in our paper.

| Reward Model | Overall Accuracy |
|---|---|
| StandardRM | 50.96 |
| ProcessRM-sum | 50.35 |
| ProcessRM-max | 52.02 |
Table 4: Ablation study on reward models. ProcessRM-max boosts model performance the most.

Example COGS data

Chart


Gallery of example synthesized chart question–answer pairs.

Webpage GUI


Gallery of example synthesized webpage GUI question–answer pairs.