Leaderboard
Rank | Model | Type | Similarity Score (mean) |
---|---|---|---|
1 | Human* | Human | 0.547 |
2 | GPT-5 | LLM | 0.431 |
3 | Gemini 2.5 Pro | LLM | 0.421 |
4 | GPT-5-mini | LLM | 0.400 |
5 | Grok-4 | LLM | 0.393 |
6 | Codex CLI (GPT-5) | Agent | 0.388 |
7 | Gemini 2.5 Flash | LLM | 0.362 |
8 | Claude Code (Opus 4) | Agent | 0.355 |
9 | GPT Image | Image Model | 0.300 |
10 | Claude Opus 4 | LLM | 0.286 |
11 | Random baseline | Random | 0.279 |
12 | Claude Sonnet 4 | LLM | 0.272 |
13 | NanoBanana (Gemini 2.5 Flash Image) | Image Model | 0.168 |
14 | GPT-4o | LLM | 0.129 |
The eval
Blueprint-Bench tests spatial reasoning: models must convert apartment photographs into accurate 2D floor plans. Given ~20 interior photos, a model generates a floor plan showing room layouts, connections, and relative sizes.

Converting apartment photographs (left) into a 2D floor plan (right). Red dots indicate rooms, green lines show doorways.
Success requires identifying rooms, inferring spatial relationships, understanding scale, and generating structured output following strict formatting rules.
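The exact output schema is not reproduced here. Purely as an illustrative sketch, a structured floor plan of this kind could be expressed as a set of rooms plus the doorway connections between them; all field names below are hypothetical, not Blueprint-Bench's actual format:

```python
# Hypothetical structured floor-plan output (illustrative only; not the
# exact schema Blueprint-Bench requires).
floor_plan = {
    "rooms": [
        {"id": "living_room", "center": [5.0, 3.0], "size": [6.0, 4.0]},
        {"id": "kitchen",     "center": [9.0, 3.0], "size": [3.0, 4.0]},
        {"id": "bedroom",     "center": [5.0, 7.5], "size": [4.0, 3.5]},
    ],
    # Each doorway connects exactly two rooms, analogous to the green lines
    # in the rendered plan; the red dots would mark the room centers.
    "doorways": [
        ["living_room", "kitchen"],
        ["living_room", "bedroom"],
    ],
}
```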

Mean similarity scores across apartments with human baselines. Humans outperform all AI systems.
Our results reveal a striking failure of current AI capabilities: most models perform at or below our random baseline (0.279), and even the best models, GPT-5 and Gemini 2.5 Pro, score well below the human baseline of 0.547. This gap suggests that while photographs are well within the training distribution of modern multimodal models, the task of spatial reconstruction (inferring room layouts, understanding connectivity, and maintaining consistent scale) requires genuine spatial intelligence that current systems lack.
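The similarity metric itself is not detailed in this section. As a rough sketch of how such a score could be computed, assuming each floor plan is reduced to a room-connectivity graph and rooms are matched by label, one could compare the sets of doorway connections; this is an assumption for illustration, not the benchmark's actual metric:

```python
def connectivity_similarity(predicted, ground_truth):
    """Jaccard similarity between predicted and true doorway connections.

    A simplified stand-in for Blueprint-Bench's scoring: it assumes rooms
    in both plans share labels and ignores room positions and sizes.
    """
    def edge_set(plan):
        # Normalize each doorway to an unordered pair of room ids.
        return {frozenset(pair) for pair in plan["doorways"]}

    pred, true = edge_set(predicted), edge_set(ground_truth)
    if not pred and not true:
        return 1.0
    return len(pred & true) / len(pred | true)
```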

Mean similarity scores across all apartments. Most models perform at or near the random baseline.
Many models, especially image generation systems, failed to follow the strict formatting rules required for robust scoring. This reveals challenges in instruction adherence, not just spatial reasoning.
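Robust scoring of this kind depends on outputs being machine-checkable before they are compared. As a hypothetical illustration of the sort of validation involved (not the benchmark's actual checks), a submission could be flagged when doorways reference rooms that were never declared or when disallowed extra fields appear:

```python
ALLOWED_KEYS = {"rooms", "doorways"}  # hypothetical; the real rules differ

def validate_floor_plan(plan):
    """Return a list of formatting problems; an empty list means the plan parses cleanly."""
    problems = []
    extra = set(plan) - ALLOWED_KEYS
    if extra:
        problems.append(f"unexpected fields: {sorted(extra)}")
    room_ids = {room["id"] for room in plan.get("rooms", [])}
    for pair in plan.get("doorways", []):
        if len(pair) != 2 or not set(pair) <= room_ids:
            problems.append(f"doorway references unknown rooms: {pair}")
    return problems
```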


Examples of poor instruction following: GPT-4o fails to label rooms with dots (left), NanoBanana includes forbidden details like furniture (right).
Humans worked iteratively, viewing images multiple times and refining their drawings. To test whether this explained the AI systems' poor performance, we evaluated agent systems with similar capabilities.

Claude Code agent attempting iterative refinement, making multiple attempts but still producing errors despite claiming success.
Agent-based approaches with iterative refinement capabilities showed no meaningful improvement over single-pass generation, confirming that the core limitation is spatial understanding rather than methodology.
Blueprint-Bench provides the first numerical framework for comparing spatial intelligence across LLMs, image models, and agents. As AI systems advance, spatial reasoning remains fundamental for real-world applications.
While spatial intelligence isn’t inherently dangerous, it’s a prerequisite for many advanced capabilities. Blueprint-Bench helps monitor progress toward AI systems that can truly understand physical spaces.
Are you a researcher and want to test a model on Blueprint-Bench?
Contact us at research@andonlabs.com.