Prompt to Polygon: Testing 10 AI Models on Editable Vector Logo Generation. The Results Were Weird.

This report is unapologetically from my own perspective. It reflects how I evaluate AI logo generation inside Logo Lattice, what I care about in production, and the failures I personally find most costly.

That probably makes it fairly pointless as a universal benchmark, but I still think it is interesting: once you force models to produce structured, editable vector geometry instead of pretty pixels, their strengths and weaknesses become very obvious.

Most AI logo tools generate a flat image and stop there. Logo Lattice does something harder. It converts a text prompt into editable, grid-snapped vector geometry that survives a real production pipeline.

{
  "format": "logolattice",
  "version": 1,
  "snapshot": {
    "document": {
      "shapes": [
        {
          "id": "fd17ccba-913c-4396-917a-fe99f1d30591",
          "points": [
            { "x": 390, "y": 329.0896534380867 },
            { "x": 360, "y": 311.7691453623979 },
            { "x": 390, "y": 294.44863728670913 },
            { "x": 420, "y": 311.7691453623979 }
          ],
          "style": {
            "stroke": "#111111",
            "fill": "#6fbaff",
            "strokeWidth": 1
          },
          "layerNumber": 1,
          "groupId": "e17a92e4-fd04-4962-867f-d1a600c74108"
        }
      ],
      "groups": 2
    },
    "gridType": "isometric",
    "isometricStyle": "triangular",
    "gridSize": 60,
    "style": {
      "stroke": "#111111",
      "fill": "#488ccd",
      "strokeWidth": 3
    }
  }
}

That means the model isn't just "being creative". It has to:

Return valid structured JSON
Follow a lattice coordinate system
Generate usable geometry
Survive normalisation
Export clean SVGs
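To make the first of those requirements concrete, here is a minimal sketch of a structural check on model output. The type names and the `parseLatticeFile` helper are my own illustration, inferred from the sample document above; they are not the actual Logo Lattice schema or parser, and the check covers only the fields the sample shows.

```typescript
// Hypothetical types inferred from the sample file; not the real Logo Lattice schema.
interface Point { x: number; y: number }
interface Shape { id: string; points: Point[]; layerNumber: number; groupId?: string }
type GridType = "square" | "isometric";
interface LatticeFile {
  format: "logolattice";
  version: number;
  snapshot: { document: { shapes: Shape[] }; gridType: GridType; gridSize: number };
}

const isRecord = (v: unknown): v is Record<string, unknown> =>
  typeof v === "object" && v !== null && !Array.isArray(v);

const isPoint = (v: unknown): v is Point =>
  isRecord(v) &&
  typeof v.x === "number" && Number.isFinite(v.x) &&
  typeof v.y === "number" && Number.isFinite(v.y);

/** Returns the parsed document, or null when the output is not structurally usable. */
function parseLatticeFile(raw: string): LatticeFile | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // not JSON at all
  }
  if (!isRecord(data) || data.format !== "logolattice") return null;
  const snapshot = data.snapshot;
  if (!isRecord(snapshot) || !isRecord(snapshot.document)) return null;
  if (snapshot.gridType !== "square" && snapshot.gridType !== "isometric") return null;
  const shapes = snapshot.document.shapes;
  if (!Array.isArray(shapes)) return null;
  // Every shape needs at least one point with finite coordinates.
  const shapesOk = shapes.every(
    (s) => isRecord(s) && Array.isArray(s.points) && s.points.length > 0 && s.points.every(isPoint),
  );
  return shapesOk ? (data as unknown as LatticeFile) : null;
}
```

A check like this only proves the output is parseable. It says nothing about whether the geometry is any good, which is where most of the interesting failures showed up.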

So, I built a benchmark and tested every generative model enabled in our development build. The results were far less predictable than expected.

The benchmark

I ran 10 models, 4 logo prompts, 3 times each, for a total of 120 generations.

Models

OpenRouter id	Label
`openai/gpt-5.2`	GPT-5.2
`openai/gpt-5.5`	GPT-5.5
`openai/gpt-5.4`	GPT-5.4
`anthropic/claude-sonnet-4.6`	Claude Sonnet 4.6
`anthropic/claude-opus-4.8`	Claude Opus 4.8
`anthropic/claude-opus-4.7`	Claude Opus 4.7
`anthropic/claude-opus-4.6`	Claude Opus 4.6
`google/gemini-2.5-pro`	Gemini 2.5 Pro
`openai/o4-mini`	o4-mini
`deepseek/deepseek-v4-pro`	DeepSeek V4 Pro

Prompts

Case id	Prompt (summary)	Expectations
`teal-mountain`	Isometric teal mountain mark	≥3 shapes, teal, isometric grid
`minimal-circle`	Orange circle on square grid	1–6 shapes, orange, square grid
`complex-facets`	Purple/gold faceted gem	≥5 shapes, purple + gold
`arc-swoosh`	Navy open arc on isometric grid	≥1 shape, navy, isometric grid

Every run went through the same production pipeline:

No per-provider tuning
Every run used the same system prompt, the same post-processing and the same scoring rules
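The run matrix itself is simple to sketch. The model ids and case expectations below come from the tables above; the `BenchmarkCase` shape and the `buildRunPlan` helper are hypothetical names for illustration, not the benchmark's actual code.

```typescript
type GridType = "square" | "isometric";

// Hypothetical case definition mirroring the "Expectations" column.
interface BenchmarkCase {
  id: string;
  minShapes: number;
  maxShapes?: number;
  gridType?: GridType;
}

interface PlannedRun { model: string; caseId: string; run: number }

const MODELS: string[] = [
  "openai/gpt-5.2", "openai/gpt-5.5", "openai/gpt-5.4",
  "anthropic/claude-sonnet-4.6", "anthropic/claude-opus-4.8",
  "anthropic/claude-opus-4.7", "anthropic/claude-opus-4.6",
  "google/gemini-2.5-pro", "openai/o4-mini", "deepseek/deepseek-v4-pro",
];

const CASES: BenchmarkCase[] = [
  { id: "teal-mountain", minShapes: 3, gridType: "isometric" },
  { id: "minimal-circle", minShapes: 1, maxShapes: 6, gridType: "square" },
  { id: "complex-facets", minShapes: 5 },
  { id: "arc-swoosh", minShapes: 1, gridType: "isometric" },
];

const RUNS_PER_CASE = 3;

/** Expands models x cases x repetitions into a flat list of runs (runs are 1-indexed). */
function buildRunPlan(models: string[], cases: BenchmarkCase[], runsPerCase: number): PlannedRun[] {
  const plan: PlannedRun[] = [];
  for (const model of models) {
    for (const benchmarkCase of cases) {
      for (let run = 1; run <= runsPerCase; run++) {
        plan.push({ model, caseId: benchmarkCase.id, run });
      }
    }
  }
  return plan;
}
```

With 10 models, 4 cases and 3 repetitions this yields the 120 generations, 12 per model.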

This was not a "which image looks coolest" benchmark. We measured reliability.

Metrics included:

Structured output validity
Parse success rate
Grid adherence
Shape complexity
Colour fidelity
SVG export success
Latency
Token usage
Cost

The biggest surprise

Most models succeeded... but the failures mattered.

Overall success rate: 117/120 successful generations (97.5%)

Failure cases

Model	Case	Run	Error	Latency
`google/gemini-2.5-pro`	`complex-facets`	2, 3	Empty response	~2–3 min
`anthropic/claude-opus-4.6`	?	?	Invalid JSON	…

On paper, a 97.5% success rate sounds excellent. But the failures exposed something important. When users spend credits on generation, even a 2–3% failure rate becomes very noticeable, especially when a failed request takes 2–3 minutes to fail.

[
  {
    "model": "google/gemini-2.5-pro",
    "caseId": "complex-facets",
    "run": 2,
    "ok": false,
    "latencyMs": 165505.65229100036,
    "error": "OpenRouter returned no content."
  },
  {
    "model": "google/gemini-2.5-pro",
    "caseId": "complex-facets",
    "run": 3,
    "ok": false,
    "latencyMs": 199037.96783299977,
    "error": "OpenRouter returned no content."
  }
]
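To show why slow failures hurt more than the headline rate suggests, here is a small sketch that summarises result records like the ones above. The `RunResult` fields match the log excerpt; `summariseRuns` and `RunSummary` are hypothetical names, not the benchmark's own reporting code.

```typescript
interface RunResult {
  model: string;
  caseId: string;
  run: number;
  ok: boolean;
  latencyMs: number;
  error?: string;
}

interface RunSummary {
  total: number;
  failed: number;
  successRate: number;  // fraction between 0 and 1
  failedWaitMs: number; // time spent waiting on runs that produced nothing
}

/** Aggregates the success rate and the wall-clock time lost to failed runs. */
function summariseRuns(results: RunResult[]): RunSummary {
  const failures = results.filter((r) => !r.ok);
  const failedWaitMs = failures.reduce((sum, r) => sum + r.latencyMs, 0);
  const successRate =
    results.length === 0 ? 0 : (results.length - failures.length) / results.length;
  return { total: results.length, failed: failures.length, successRate, failedWaitMs };
}
```

Fed the two Gemini records above, `failedWaitMs` comes to roughly 364,544 ms: about six minutes of waiting for two empty responses.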

The second surprise

"Good scores" do not always mean "good logos".

The benchmark exposed a weird gap between machine-valid output and human-perceived quality. A model could score extremely high while still generating decagons instead of circles, awkward nested arcs, overbuilt geometry and strange compositions that technically passed validation.

The easiest prompt in the benchmark, creating a simple orange circle, ended up exposing some of the biggest weaknesses.

Prompt: "Simple orange circle on square grid, flat geometric logo mark"

Several top-tier models generated polygon approximations instead of circles.
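A quick bit of geometry shows why a decagon reads as "not a circle". For a regular polygon inscribed in a circle of radius r, the largest gap between the circle and the polygon sits at each edge midpoint and equals r · (1 − cos(π / n)). The sketch below just evaluates that formula; `polygonCircleGap` is a hypothetical helper, and it ignores any corner rounding a model may have added.

```typescript
/**
 * Largest radial gap between a circle of the given radius and a regular
 * polygon with `sides` edges inscribed in it (measured at an edge midpoint).
 */
function polygonCircleGap(radius: number, sides: number): number {
  if (sides < 3) throw new Error("a polygon needs at least 3 sides");
  return radius * (1 - Math.cos(Math.PI / sides));
}

// A decagon misses the circle by about 4.9% of the radius at every edge.
const decagonGap = polygonCircleGap(100, 10);
```

At logo sizes, a gap of nearly 5% of the radius is easy to see, which is why these outputs passed validation but failed the eye test.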

Example benchmark results

Model	Score	shapeCount	gridSnapError	Visual
GPT-5.5	100	12	0.4	decagon "circle"
Opus 4.8	100	6	0.1	decagon "circle"
DeepSeek	91	1	0.0	single horizontal line

OpenAI GPT-5.5 minimal circle prompt result

Anthropic Opus 4.8 minimal circle prompt result

One DeepSeek run returned a single horizontal line.

DeepSeek V4 Pro minimal circle prompt result

Technically, it was valid. Visually, it was absurd.
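One cheap guard against this kind of output is an area check: a "shape" whose vertices are collinear has zero area. The sketch below uses the shoelace formula; `polygonArea` and `isDegenerate` are hypothetical helpers, not checks the benchmark currently runs, and a legitimately open stroke would need to be exempted.

```typescript
interface Point { x: number; y: number }

/** Absolute area of a simple polygon via the shoelace formula. */
function polygonArea(points: Point[]): number {
  let twiceArea = 0;
  for (let i = 0; i < points.length; i++) {
    const a = points[i];
    const b = points[(i + 1) % points.length]; // wrap around to close the polygon
    twiceArea += a.x * b.y - b.x * a.y;
  }
  return Math.abs(twiceArea) / 2;
}

/** True when a closed shape encloses (almost) no area, e.g. a straight line. */
function isDegenerate(points: Point[], epsilon = 1e-6): boolean {
  return points.length < 3 || polygonArea(points) < epsilon;
}
```

For example, `isDegenerate([{ x: 300, y: 300 }, { x: 360, y: 300 }, { x: 420, y: 300 }])` is true for a horizontal line, while a 60 × 60 grid square has an area of 3,600 and passes.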

JSON output (truncated - GPT-5.5, minimal circle, run 1)

{
  "format": "logolattice",
  "version": 1,
  "exportedAt": "2026-05-30T18:24:06.438Z",
  "snapshot": {
    "document": {
      "shapes": [
        {
          "id": "5b6a5c6b-5d9b-4b68-9e7d-4b2c0b2f5f3a",
          "points": [
            { "x": 390, "y": 240 },
            { "x": 450, "y": 240 },
            { "x": 510, "y": 300 }
          ],
          "style": {
            "stroke": "#f97316",
            "fill": "#f97316",
            "strokeWidth": 1
          },
          "layerNumber": 1,
          "closed": true,
          "vertexCornerRadii": [36, 36, 36, 36, 36, 36, 36, 36]
        }
      ]
    },
    "gridType": "square",
    "gridSize": 60,
    "style": {
      "stroke": "#f97316",
      "fill": "#f97316",
      "strokeWidth": 1
    }
  }
}

Simple briefs became the harshest quality test in the entire suite. Complex prompts gave models room to design; simple prompts forced precision.

Complex prompts were actually easier

The strongest outputs happened on higher complexity prompts, such as isometric mountains, faceted gems and layered geometric marks. Those prompts gave models more room to "design". Simple prompts forced precision.

That became a recurring pattern across all 120 generations. In other words: simple briefs exposed geometry problems. Complex briefs exposed creativity differences.

Simple circle benchmark score distribution by model

Shape count distribution for passed runs

Benchmark scores capture whether an output passes the checks, but not how efficiently a model gets there. To examine structural behaviour, I analysed the number of shapes used in successful outputs for the simple circle benchmark.

Ideally, a circle should be represented by a single shape. The chart below shows that while many models achieved this minimal representation, others produced more complex constructions to reach a visually similar result - revealing differences in structural efficiency beyond overall score.

Simple circle shape count distribution for passed runs

Interesting result from the data:

26 runs passed with 1 shape (the "correct" minimal circle)
1 run passed with 6 shapes (the decagon / polygon approximation)
A few odd middle cases (2, 3, 4 shapes)

The takeaway: some models "pass" while taking a much more complex route to the same visual result. Shape count highlights structural efficiency, not just visual correctness.
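A distribution like this can be tallied with a few lines. This is a minimal sketch; `shapeCountHistogram` is a hypothetical helper, and the input in the usage note is made-up illustration data, not the benchmark's results.

```typescript
/** Counts how many runs used each number of shapes, sorted by shape count. */
function shapeCountHistogram(shapeCounts: number[]): Map<number, number> {
  const histogram = new Map<number, number>();
  for (const count of shapeCounts) {
    histogram.set(count, (histogram.get(count) ?? 0) + 1);
  }
  // Rebuild the map with keys in ascending order so it charts left to right.
  return new Map(Array.from(histogram.entries()).sort(([a], [b]) => a - b));
}
```

For instance, `shapeCountHistogram([1, 6, 1, 2, 1])` maps 1 → 3, 2 → 1 and 6 → 1.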

Grid adherence and vertex drift

To measure geometric precision beyond simple parse success, I included gridSnapError - a metric that captures the average distance between raw vertices and their normalised grid-aligned positions prior to snapping. This measures pre-normalisation vertex drift, rather than simply whether an output could be parsed.

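export function gridSnapError(
  rawSnapshot: EditorSnapshot,
  normalizedSnapshot: EditorSnapshot,
): number {
  let total = 0; // summed distance between raw and snapped vertices
  let count = 0; // number of vertices compared
  // ... average distance from each raw vertex to snapped grid position
  // No vertices to compare: report an infinite error rather than a perfect 0.
  return count === 0 ? Number.POSITIVE_INFINITY : total / count;
}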

Many models achieve strong grid adherence even when overall composition or semantic correctness is weaker, which is one reason benchmark scores frequently cluster between 91 and 100. In practice, the metric helps separate structural precision from higher-level design quality.
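For intuition, here is a simplified, self-contained version of the idea for a square grid only. It snaps each raw vertex to the nearest grid intersection itself, whereas the real `gridSnapError` compares against the normalised snapshot; the isometric lattice needs different snapping that this sketch does not attempt. Both function names are hypothetical.

```typescript
interface Point { x: number; y: number }

/** Nearest intersection on a square grid (assumption: square grid, origin at 0,0). */
function snapToSquareGrid(p: Point, gridSize: number): Point {
  return {
    x: Math.round(p.x / gridSize) * gridSize,
    y: Math.round(p.y / gridSize) * gridSize,
  };
}

/** Average distance between each raw vertex and its snapped position. */
function averageSnapDistance(rawPoints: Point[], gridSize: number): number {
  if (rawPoints.length === 0) return Number.POSITIVE_INFINITY; // same convention as above
  const total = rawPoints.reduce((sum, p) => {
    const snapped = snapToSquareGrid(p, gridSize);
    return sum + Math.hypot(p.x - snapped.x, p.y - snapped.y);
  }, 0);
  return total / rawPoints.length;
}
```

With a grid size of 60, the vertices (0, 0) and (63, 60) drift by 0 and 3 units respectively, so the average is 1.5.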

Model highlights

Claude Sonnet 4.6 was the most interesting model

Not because it had the highest raw score. Because it consistently hit the best balance of quality, reliability, speed and cost. Typical generations took 5–25 seconds, whilst still producing some of the best complex marks in the benchmark.

Its outputs felt the most like something a designer had made, rather than shapes an AI had assembled. Oddly though, it struggled on very simple geometry. Its "circle" attempts were sometimes worse than those of smaller models.

Metric	Value
Model	`anthropic/claude-sonnet-4.6`
Success rate	47/48 (98%)
Median latency	13.2s
P95 latency	23.1s
Total cost	$1.92
Avg completion tokens	1,281
Best case	`teal-mountain` (score 98)
Weakest case	`arc-swoosh` (score 87)

Anthropic Claude Sonnet 4.6 full benchmark outputs

GPT-5.5 was strong but expensive

GPT-5.5 produced consistently high-quality outputs. But it also cost roughly 4× more than its smaller sibling, o4-mini, for the same benchmark suite, and visually the difference was smaller than expected. In several cases, cheaper models produced equally usable marks.

Metric	Value
Model	`openai/gpt-5.5`
Success rate	12/12 (100%)
Median latency	28.9s
P95 latency	74.5s
Total cost	$0.78
Avg completion tokens	2,078
Best case	`complex-facets` (score 98)
Weakest case	`arc-swoosh` (score 87, inconsistent curve geometry)

DeepSeek was the chaos model

DeepSeek V4 Pro was easily the most unpredictable system we tested. Some runs were genuinely excellent, others looked completely broken. Latency was similarly inconsistent: some generations finished in under 10 seconds, whereas others took nearly 4 minutes.

It behaved like a lottery ticket: occasionally amazing, occasionally bizarre. One DeepSeek generation exceeded 13,000 completion tokens for a single logo attempt.

Metric	Value
Model	`deepseek/deepseek-v4-pro`
Case	`complex-facets`
Run	1
Score	95
Success	✅
Completion tokens	13,873
Prompt tokens	1,959
Total tokens	15,832
Latency	239.1s
Cost	$0.0453
Shape count	8
Grid snap error	2.64
SVG size	2,399 bytes

The benchmark repeatedly showed that output complexity, not prompt size, drove costs. Prompt tokens were relatively stable across all models. Completion tokens exploded on harder prompts.
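The cost arithmetic behind that is straightforward to sketch. The per-token prices below are placeholders I made up for illustration, not any provider's real pricing; only the token counts come from the DeepSeek run above. `estimateCost` is a hypothetical helper.

```typescript
interface Usage { promptTokens: number; completionTokens: number }

// Prices in USD per million tokens. Placeholder values, NOT real provider pricing.
interface Pricing { inputPerMillion: number; outputPerMillion: number }

interface CostBreakdown {
  promptCost: number;
  completionCost: number;
  total: number;
  completionShare: number; // fraction of the total driven by output tokens
}

function estimateCost(usage: Usage, pricing: Pricing): CostBreakdown {
  const promptCost = (usage.promptTokens / 1_000_000) * pricing.inputPerMillion;
  const completionCost = (usage.completionTokens / 1_000_000) * pricing.outputPerMillion;
  const total = promptCost + completionCost;
  return {
    promptCost,
    completionCost,
    total,
    completionShare: total === 0 ? 0 : completionCost / total,
  };
}

// Token counts from the DeepSeek run above, priced with the placeholder rates.
const deepseekRun = estimateCost(
  { promptTokens: 1959, completionTokens: 13873 },
  { inputPerMillion: 1, outputPerMillion: 2 },
);
```

Even with made-up prices, the shape of the result holds: with roughly seven times more completion tokens than prompt tokens, output dominates the bill unless input is priced far higher than output.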

Metric	Value
Model	`deepseek/deepseek-v4-pro`
Success rate	12/12 (100%)
Median latency	93.1s
P95 latency	207.4s
Total cost	$0.19
Avg completion tokens	4,462
Best case	`teal-mountain` (score 96)
Weakest case	`arc-swoosh` (score 81)

Gemini's results were frustrating

Because when Gemini worked, it often looked good. Its successful outputs were clean, minimal and visually sharp. But the empty responses and thin geometry made the model difficult to trust for production defaults. Reliability matters more than occasional brilliance when generations cost money.

Metric	Value
Model	`google/gemini-2.5-pro`
Success rate	10/12 (83%)
Median latency	40.1s
P95 latency	180.6s
Total cost	$0.47
Avg completion tokens	4,496
Best case	`teal-mountain` (score 93)
Weakest case	`complex-facets` (score 79)

Google Gemini 2.5 Pro full benchmark outputs

The benchmark also exposed a scoring problem

Each successful run gets a 0–100 composite score from automated checks: non-empty output after normalisation, minimum shape count, grid snap error, colour-hint match, SVG export, a small complexity bonus, and light penalties for wrong grid type or too many shapes.

Composite score (0–100)

Component	Points	Rule
Non-empty logo	+40	At least one shape after normalisation
Minimum complexity	+15	`shapeCount >= minShapes` (default `minShapes` is 1)
Grid adherence	0–15	Pre-normalisation vertex snap error (see below)
Colour fidelity	+10	`colorHintMatch === true`
SVG export	+10	`svgBytes > 0`
Complexity bonus	0–10	`min(10, shapeCount)`
Too many shapes	−5	Only when `maxShapes` is set and exceeded
Wrong grid type	−5	Expected square or isometric and snapshot doesn't match

export function scoreDescribeSnapshot(...) {
  let score = 0;
  score += shapeCount > 0 ? 40 : 0;
  // minShapes +15, grid snap 0–15, color +10, SVG +10, complexity bonus...
}

Grid adherence

gridSnapError	Points
< 2 px (average)	+15
≥ 2 px	`+max(0, 15 − error)` (capped at 15)

gridSnapError = average distance from each raw vertex to its snapped grid position (before normalisation).
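Putting the two tables together, the rubric can be sketched end to end as below. This is my reconstruction from the tables, not the body of `scoreDescribeSnapshot`: the `RunMetrics` and `Expectations` shapes are hypothetical, and clamping the result to 0–100 and leaving it unrounded are assumptions.

```typescript
type GridType = "square" | "isometric";

interface RunMetrics {
  shapeCount: number;
  gridSnapError: number; // average distance, pre-normalisation
  colorHintMatch: boolean;
  svgBytes: number;
  gridType: GridType;
}

interface Expectations {
  minShapes?: number; // defaults to 1, as in the table
  maxShapes?: number;
  gridType?: GridType;
}

/** 15 points below an average error of 2, then a linear fall-off to 0. */
function gridAdherencePoints(error: number): number {
  if (error < 2) return 15;
  return Math.min(15, Math.max(0, 15 - error));
}

function compositeScore(m: RunMetrics, expect: Expectations = {}): number {
  const minShapes = expect.minShapes ?? 1;
  let score = 0;
  score += m.shapeCount > 0 ? 40 : 0;           // non-empty logo
  score += m.shapeCount >= minShapes ? 15 : 0;  // minimum complexity
  score += gridAdherencePoints(m.gridSnapError);
  score += m.colorHintMatch ? 10 : 0;           // colour fidelity
  score += m.svgBytes > 0 ? 10 : 0;             // SVG export
  score += Math.min(10, m.shapeCount);          // complexity bonus
  if (expect.maxShapes !== undefined && m.shapeCount > expect.maxShapes) score -= 5;
  if (expect.gridType !== undefined && m.gridType !== expect.gridType) score -= 5;
  return Math.max(0, Math.min(100, score));     // clamp is an assumption
}
```

As a sanity check, the DeepSeek `complex-facets` run reported earlier (8 shapes, snap error 2.64, SVG exported) comes out at 95.36 under this sketch if the colour hint matched, which lines up with its reported score of 95. A single-shape run that passes every check lands on 91.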

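Most successful runs scored between 91 and 100. That is built into the rubric: a run that passes every check collects 90 points before the complexity bonus, which then adds between 1 and 10. So raw score alone wasn't very useful. The real differences appeared in: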

Geometry quality
Consistency
Shape decisions
Prompt interpretation
Visual polish
Reliability under repetition

In other words, the benchmark became more interesting when we looked at the SVGs, instead of the numbers.

Biggest overall findings

Structured vector generation is much harder than image generation - Generating pretty pixels is easy compared to generating valid geometry, coherent structure, editable vectors, and machine-readable layouts. Many models that are excellent at image tasks struggled with geometric discipline.
Simplicity is deceptive - A "minimal circle" sounds trivial. But simple prompts leave nowhere to hide. You instantly notice polygon artefacts, asymmetry, awkward spacing and excessive geometry. Simple briefs became the harshest quality test in the entire suite.
Reliability matters more than peak quality - The benchmark changed how we think about defaults. A model that occasionally produces incredible work, but sometimes fails entirely, creates a worse product experience than a model that is consistently good - especially in a credit-based generation flow.
High scores do not equal aesthetic quality - Automated metrics are useful but they do not fully capture balance, composition, silhouette quality, logo "feel" and visual confidence. Human review still matters enormously.

Master model comparison

Model	Success Rate	Median Score	Median Latency	P95 Latency	Total Cost	Avg Completion Tokens	Best Case	Weakest Case
GPT-5.5	12/12 (100%)	92	28.9s	74.5s	$0.78	2,078	complex-facets (98)	arc-swoosh (87)
Claude Sonnet 4.6	47/48 (98%)	95	13.2s	23.1s	$1.92	1,281	teal-mountain (98)	arc-swoosh (87)
DeepSeek V4 Pro	12/12 (100%)	92	93.1s	207.4s	$0.19	4,462	teal-mountain (96)	arc-swoosh (81)
Gemini 2.5 Pro	10/12 (83%)	88	40.1s	180.6s	$0.47	4,496	teal-mountain (93)	complex-facets (79)
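The median and P95 columns depend on which percentile definition is used, and the post does not specify one. The sketch below uses linear interpolation between the two closest ranks, which is one common choice and an assumption on my part. With only 12 runs per model, a nearest-rank P95 would simply return the slowest run; DeepSeek's P95 (207.4s) sits below its slowest reported run (239.1s), which is at least consistent with an interpolating definition.

```typescript
/**
 * Percentile with linear interpolation between the closest ranks.
 * `p` is in the range 0..100. Assumption: this is not necessarily the
 * definition used to produce the table above.
 */
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("percentile of an empty sample");
  if (p < 0 || p > 100) throw new Error("p must be between 0 and 100");
  const sorted = [...values].sort((a, b) => a - b);
  const position = (p / 100) * (sorted.length - 1);
  const lower = Math.floor(position);
  const upper = Math.ceil(position);
  const weight = position - lower;
  return sorted[lower] * (1 - weight) + sorted[upper] * weight;
}

const median = (values: number[]): number => percentile(values, 50);
```

For a sample of `[10, 20, 30, 40]`, the median is 25 and the P95 is about 38.5, just short of the maximum.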

The models, in one sentence each

Model	Summary
Claude Sonnet 4.6	Best overall balance
Claude Opus 4.6	Strong ideas, weak fundamentals
Claude Opus 4.7	Most consistently "brand-ready"
Claude Opus 4.8	Creative but inconsistent
o4-mini	Cheap, clean, surprisingly capable
GPT-5.2	Reliable and symmetrical
GPT-5.4	Excellent on complex geometry
GPT-5.5	Strong, but expensive
Gemini 2.5 Pro	Sharp when it works
DeepSeek V4 Pro	Pure variance

The most important takeaway

The benchmark made one thing very clear. "AI generated SVG" is not a solved problem.

Getting a model to generate aesthetically strong, structurally valid, editable, normalised, export-safe vector geometry is still surprisingly difficult - especially when the outputs need to survive a real product pipeline instead of just looking good in a screenshot.

And honestly? That's what made this benchmark interesting. Not which model got the highest score, but how differently each model failed.

What's next

We're expanding the benchmark with:

Reference-image prompts
More stroke-based tests
Multi-turn refinement
Larger sample sizes
Human ranking evaluation

Because after looking through all 120 outputs, one thing became obvious: the future of AI design tools is probably less about image generation, and more about controllable structure.

Appendices

Full per-model output galleries and detailed breakdowns are available in the Describe-Benchmark repository:

Anthropic results
OpenAI results
Gemini and DeepSeek results

Raw results and reproducibility

I am currently working on a way to open source the benchmark. At the moment, too much of the benchmark relies on code I do not want to open source.

In the meantime, you can view the full results and raw outputs on GitHub → logolattice/Describe-Benchmark

Originally posted on X on 2 June 2026.