Prompt to Polygon: Testing 10 AI Models on Editable Vector Logo Generation. The Results Were Weird.

This report is unapologetically from my own perspective. It reflects how I evaluate AI logo generation inside Logo Lattice, what I care about in production, and the failures I personally find most costly.

That probably makes it fairly pointless as a universal benchmark, but I still think it is interesting: once you force models to produce structured, editable vector geometry instead of pretty pixels, their strengths and weaknesses become very obvious.

Most AI logo tools generate a flat image and stop there. Logo Lattice does something harder. It converts a text prompt into editable, grid-snapped vector geometry that survives a real production pipeline.

{
  "format": "logolattice",
  "version": 1,
  "snapshot": {
    "document": {
      "shapes": [
        {
          "id": "fd17ccba-913c-4396-917a-fe99f1d30591",
          "points": [
            { "x": 390, "y": 329.0896534380867 },
            { "x": 360, "y": 311.7691453623979 },
            { "x": 390, "y": 294.44863728670913 },
            { "x": 420, "y": 311.7691453623979 }
          ],
          "style": {
            "stroke": "#111111",
            "fill": "#6fbaff",
            "strokeWidth": 1
          },
          "layerNumber": 1,
          "groupId": "e17a92e4-fd04-4962-867f-d1a600c74108"
        }
      ],
      "groups": 2
    },
    "gridType": "isometric",
    "isometricStyle": "triangular",
    "gridSize": 60,
    "style": {
      "stroke": "#111111",
      "fill": "#488ccd",
      "strokeWidth": 3
    }
  }
}

That means the model isn't just "being creative". It has to:

  • Return valid structured JSON
  • Follow a lattice coordinate system
  • Generate usable geometry
  • Survive normalisation
  • Export clean SVGs
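
As a rough illustration of the first requirement, the sketch below parses a raw model response and checks only the fields visible in the sample document above. The type names, the `parseLatticeOutput` helper and its error strings are hypothetical; the real Logo Lattice schema and validator are not shown in this post and are likely stricter.

```typescript
// Hypothetical types inferred from the sample document; the real schema may differ.
interface LatticePoint { x: number; y: number }
interface LatticeShape { id: string; points: LatticePoint[] }
interface ParseResult { ok: boolean; shapes: LatticeShape[]; error: string | null }

function fail(error: string): ParseResult {
  return { ok: false, shapes: [], error };
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null;
}

function isPoint(value: unknown): value is LatticePoint {
  return isRecord(value)
    && typeof value.x === "number" && Number.isFinite(value.x)
    && typeof value.y === "number" && Number.isFinite(value.y);
}

// Parse the raw text and verify format, shapes array, ids and point coordinates.
function parseLatticeOutput(raw: string): ParseResult {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return fail("Invalid JSON");
  }
  if (!isRecord(data) || data.format !== "logolattice") return fail("Unexpected format");
  const snapshot = data.snapshot;
  const doc = isRecord(snapshot) ? snapshot.document : undefined;
  const shapes = isRecord(doc) ? doc.shapes : undefined;
  if (!Array.isArray(shapes)) return fail("Missing shapes array");
  const valid = shapes.every(
    (s) => isRecord(s) && typeof s.id === "string"
      && Array.isArray(s.points) && s.points.every(isPoint),
  );
  return valid
    ? { ok: true, shapes: shapes as LatticeShape[], error: null }
    : fail("Malformed shape");
}
```

An empty response fails at the `JSON.parse` step, so in this sketch it is reported the same way as malformed JSON.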

So, I built a benchmark and tested every generative model enabled in our development build. The results were far less predictable than expected.

The benchmark

I ran 10 models, 4 logo prompts, 3 times each, for a total of 120 generations.

Models

| OpenRouter id | Label |
|---|---|
| openai/gpt-5.2 | GPT-5.2 |
| openai/gpt-5.5 | GPT-5.5 |
| openai/gpt-5.4 | GPT-5.4 |
| anthropic/claude-sonnet-4.6 | Claude Sonnet 4.6 |
| anthropic/claude-opus-4.8 | Claude Opus 4.8 |
| anthropic/claude-opus-4.7 | Claude Opus 4.7 |
| anthropic/claude-opus-4.6 | Claude Opus 4.6 |
| google/gemini-2.5-pro | Gemini 2.5 Pro |
| openai/o4-mini | o4-mini |
| deepseek/deepseek-v4-pro | DeepSeek V4 Pro |

Prompts

| Case id | Prompt (summary) | Expectations |
|---|---|---|
| teal-mountain | Isometric teal mountain mark | ≥3 shapes, teal, isometric grid |
| minimal-circle | Orange circle on square grid | 1–6 shapes, orange, square grid |
| complex-facets | Purple/gold faceted gem | ≥5 shapes, purple + gold |
| arc-swoosh | Navy open arc on isometric grid | ≥1 shape, navy, isometric grid |

Every run went through the same production pipeline:

[Diagram: Logo Lattice Describe flow from user prompt to output]

  • No per-provider tuning was applied
  • Every run used the same system prompt, the same post-processing and the same scoring rules

This was not a "which image looks coolest" benchmark. We measured reliability.

Metrics included:

  • Structured output validity
  • Parse success rate
  • Grid adherence
  • Shape complexity
  • Colour fidelity
  • SVG export success
  • Latency
  • Token usage
  • Cost

The biggest surprise

Most models succeeded... but the failures mattered.

Overall success rate: 117/120 successful generations (97.5%)

Failure cases

| Model | Case | Run | Error | Latency |
|---|---|---|---|---|
| google/gemini-2.5-pro | complex-facets | 2, 3 | Empty response | ~2–3 min |
| anthropic/claude-opus-4.6 | ? | ? | Invalid JSON | |

In theory, that sounds excellent. But the failures exposed something important. When users spend credits on generation, even a 2–3% failure rate becomes very noticeable - especially when failures take 2–3 minutes before timing out.

{
  "model": "google/gemini-2.5-pro",
  "caseId": "complex-facets",
  "run": 2,
  "ok": false,
  "latencyMs": 165505.65229100036,
  "error": "OpenRouter returned no content."
},
{
  "model": "google/gemini-2.5-pro",
  "caseId": "complex-facets",
  "run": 3,
  "ok": false,
  "latencyMs": 199037.96783299977,
  "error": "OpenRouter returned no content."
}
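
Records like the two above are enough to derive the per-model reliability and latency figures quoted later. The following is a minimal sketch of that aggregation, not the benchmark's actual code: the `RunRecord` fields mirror the log entries, while `ModelSummary`, `percentile` and `summarise` are hypothetical names, and the nearest-rank percentile method is an assumption. Failed runs are deliberately kept in the latency figures, since a user waits for those too.

```typescript
// Field names mirror the run records shown above.
interface RunRecord {
  model: string;
  caseId: string;
  run: number;
  ok: boolean;
  latencyMs: number;
  error?: string;
}

interface ModelSummary {
  runs: number;
  successes: number;
  successRate: number; // fraction between 0 and 1
  medianLatencyMs: number;
  p95LatencyMs: number;
}

// Nearest-rank percentile over an ascending-sorted array (assumed method).
function percentile(sortedAsc: number[], p: number): number {
  if (sortedAsc.length === 0) return Number.NaN;
  const rank = Math.ceil((p / 100) * sortedAsc.length);
  const index = Math.min(sortedAsc.length - 1, Math.max(0, rank - 1));
  return sortedAsc[index];
}

// Group records by model, then compute success rate and latency percentiles.
function summarise(records: RunRecord[]): Record<string, ModelSummary> {
  const grouped: Record<string, RunRecord[]> = {};
  for (const record of records) {
    const list = grouped[record.model] ?? [];
    list.push(record);
    grouped[record.model] = list;
  }
  const summary: Record<string, ModelSummary> = {};
  for (const model of Object.keys(grouped)) {
    const runs = grouped[model];
    const latencies = runs.map((r) => r.latencyMs).sort((a, b) => a - b);
    const successes = runs.filter((r) => r.ok).length;
    summary[model] = {
      runs: runs.length,
      successes,
      successRate: successes / runs.length,
      medianLatencyMs: percentile(latencies, 50),
      p95LatencyMs: percentile(latencies, 95),
    };
  }
  return summary;
}
```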

The second surprise

"Good scores" does not always mean "good logos".

The benchmark exposed a weird gap between machine-valid output and human-perceived quality. A model could score extremely high while still generating decagons instead of circles, awkward nested arcs, overbuilt geometry and strange compositions that technically passed validation.

The easiest prompt in the benchmark, creating a simple orange circle, ended up exposing some of the biggest weaknesses.

Prompt: "Simple orange circle on square grid, flat geometric logo mark"

Several top-tier models generated polygon approximations instead of circles.

Example benchmark results

| Model | Score | shapeCount | gridSnapError | Visual |
|---|---|---|---|---|
| GPT-5.5 | 100 | 12 | 0.4 | decagon "circle" |
| Opus 4.8 | 100 | 6 | 0.1 | decagon "circle" |
| DeepSeek | 91 | 1 | 0.0 | single horizontal line |

[Images: OpenAI GPT-5.5 minimal circle prompt result; Anthropic Opus 4.8 minimal circle prompt result]

One DeepSeek run returned a single horizontal line.

[Image: DeepSeek V4 Pro minimal circle prompt result]

Technically, it was valid. Visually, it was absurd.

JSON output (truncated - GPT-5.5, minimal circle, run 1)

{
  "format": "logolattice",
  "version": 1,
  "exportedAt": "2026-05-30T18:24:06.438Z",
  "snapshot": {
    "document": {
      "shapes": [
        {
          "id": "5b6a5c6b-5d9b-4b68-9e7d-4b2c0b2f5f3a",
          "points": [
            { "x": 390, "y": 240 },
            { "x": 450, "y": 240 },
            { "x": 510, "y": 300 }
          ],
          "style": {
            "stroke": "#f97316",
            "fill": "#f97316",
            "strokeWidth": 1
          },
          "layerNumber": 1,
          "closed": true,
          "vertexCornerRadii": [36, 36, 36, 36, 36, 36, 36, 36]
        }
      ]
    },
    "gridType": "square",
    "gridSize": 60,
    "style": {
      "stroke": "#f97316",
      "fill": "#f97316",
      "strokeWidth": 1
    }
  }
}

Simple briefs became the harshest quality test in the entire suite. Complex prompts gave models room to design; simple prompts forced precision.

Complex prompts were actually easier

The strongest outputs happened on higher complexity prompts, such as isometric mountains, faceted gems and layered geometric marks. Those prompts gave models more room to "design". Simple prompts forced precision.

That became a recurring pattern across all 120 generations. In other words: simple briefs exposed geometry problems. Complex briefs exposed creativity differences.

[Chart: Simple circle benchmark score distribution by model]

Shape count distribution for passed runs

Benchmark scores capture whether an output passes the machine checks, but not how efficiently a model gets there. To examine structural behaviour, I analysed the number of shapes used in successful outputs for the simple circle benchmark.

Ideally, a circle should be represented by a single shape. The chart below shows that while many models achieved this minimal representation, others produced more complex constructions to reach a visually similar result - revealing differences in structural efficiency beyond overall score.

[Chart: Simple circle shape count distribution for passed runs]

Interesting result from the data:

  • 26 runs passed with 1 shape (the "correct" minimal circle)
  • 1 run passed with 6 shapes (the decagon / polygon approximation)
  • A few odd middle cases (2, 3, 4 shapes)

The takeaway: some models "pass" while taking a much more complex route to the same visual result. Shape count highlights structural efficiency, not just visual correctness.
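
A distribution like this can be tallied in a few lines. The sketch below assumes each passed run is available as a record with a `shapeCount` field; the `PassedRun` type and `shapeCountHistogram` function are hypothetical names used only for illustration.

```typescript
// Hypothetical record for a run that passed the checks.
interface PassedRun {
  model: string;
  shapeCount: number;
}

// Count how many passed runs used each number of shapes.
function shapeCountHistogram(runs: PassedRun[]): Record<number, number> {
  const histogram: Record<number, number> = {};
  for (const run of runs) {
    histogram[run.shapeCount] = (histogram[run.shapeCount] ?? 0) + 1;
  }
  return histogram;
}
```

Any bucket above 1 is worth opening by hand, since that is where the polygon approximations tend to show up.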

Grid adherence and vertex drift

To measure geometric precision beyond simple parse success, I included gridSnapError - a metric that captures the average distance between raw vertices and their normalised grid-aligned positions prior to snapping. This measures pre-normalisation vertex drift, rather than simply whether an output could be parsed.

export function gridSnapError(
  rawSnapshot: EditorSnapshot,
  normalizedSnapshot: EditorSnapshot,
): number {
  // ... average distance from each raw vertex to snapped grid position
  return count === 0 ? Number.POSITIVE_INFINITY : total / count;
}

In practice, most runs kept this error small, so benchmark scores frequently cluster between 91 and 100: many models achieve strong grid adherence even when overall composition or semantic correctness is weaker. This helps separate structural precision from higher-level design quality.
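
To make the metric concrete, here is a simplified sketch of the same idea for the square-grid case only. It is not the elided body of `gridSnapError`: it assumes grid intersections sit at whole multiples of `gridSize` from the origin, it takes a flat list of vertices rather than two `EditorSnapshot` objects, and the names `Vertex`, `snapToSquareGrid` and `squareGridSnapError` are hypothetical. Isometric snapping is not covered.

```typescript
interface Vertex {
  x: number;
  y: number;
}

// Snap a vertex to the nearest intersection of a square grid.
// Assumption: intersections sit at whole multiples of gridSize from the origin.
function snapToSquareGrid(v: Vertex, gridSize: number): Vertex {
  return {
    x: Math.round(v.x / gridSize) * gridSize,
    y: Math.round(v.y / gridSize) * gridSize,
  };
}

// Average Euclidean distance between each raw vertex and its snapped position.
// Mirrors the original function's convention of returning infinity when
// there are no vertices to measure.
function squareGridSnapError(vertices: Vertex[], gridSize: number): number {
  let total = 0;
  for (const v of vertices) {
    const snapped = snapToSquareGrid(v, gridSize);
    const dx = v.x - snapped.x;
    const dy = v.y - snapped.y;
    total += Math.sqrt(dx * dx + dy * dy);
  }
  return vertices.length === 0 ? Number.POSITIVE_INFINITY : total / vertices.length;
}
```

For example, with a grid size of 60, a vertex at (60, 60) contributes 0 and a vertex at (63, 64) contributes 5, giving an average error of 2.5.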

Model highlights

Claude Sonnet 4.6 was the most interesting model

Not because it had the highest raw score. Because it consistently hit the best balance of quality, reliability, speed and cost. Typical generations took 5–25 seconds, whilst still producing some of the best complex marks in the benchmark.

Its outputs felt the most like something a designer made, rather than shapes an AI assembled. Oddly though, it struggled on very simple geometry. Its "circle" attempts were sometimes worse than those of smaller models.

| Metric | Value |
|---|---|
| Model | anthropic/claude-sonnet-4.6 |
| Success rate | 47/48 (97%) |
| Median latency | 13.2s |
| P95 latency | 23.1s |
| Total cost | $1.92 |
| Avg completion tokens | 1,281 |
| Best case | teal-mountain (score 98) |
| Weakest case | arc-swoosh (score 87) |

[Image: Anthropic Claude Sonnet 4.6 full benchmark outputs]

GPT-5.5 was strong but expensive

GPT-5.5 produced consistently high-quality outputs. But it also cost roughly 4× as much as its smaller sibling, o4-mini, for the same benchmark suite - and visually, the difference was smaller than expected. In several cases, cheaper models produced equally usable marks.

| Metric | Value |
|---|---|
| Model | openai/gpt-5.5 |
| Success rate | 12/12 (100%) |
| Median latency | 28.9s |
| P95 latency | 74.5s |
| Total cost | $0.78 |
| Avg completion tokens | 2,078 |
| Best case | complex-facets (score 98) |
| Weakest case | arc-swoosh (score 87, inconsistent curve geometry) |

[Image: OpenAI GPT-5.5 full benchmark outputs]

DeepSeek was the chaos model

DeepSeek V4 Pro was easily the most unpredictable system we tested. Some runs were genuinely excellent, others looked completely broken. Latency was similarly inconsistent: some generations finished in under 10 seconds, whereas others took nearly 4 minutes.

It behaved like a lottery ticket: occasionally amazing, occasionally bizarre. One DeepSeek generation exceeded 13,000 completion tokens for a single logo attempt.

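| Metric | Value |
|---|---|
| Model | deepseek/deepseek-v4-pro |
| Case | complex-facets |
| Run | 1 |
| Score | 95 |
| Success | Yes |
| Completion tokens | 13,873 |
| Prompt tokens | 1,959 |
| Total tokens | 15,832 |
| Latency | 239.1s |
| Cost | $0.0453 |
| Shape count | 8 |
| Grid snap error | 2.64 |
| SVG size | 2,399 bytes |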

The benchmark repeatedly showed that output complexity, not prompt size, drove costs. Prompt tokens were relatively stable across all models. Completion tokens exploded on harder prompts.

| Metric | Value |
|---|---|
| Model | deepseek/deepseek-v4-pro |
| Success rate | 12/12 (100%) |
| Median latency | 93.1s |
| P95 latency | 207.4s |
| Total cost | $0.19 |
| Avg completion tokens | 4,462 |
| Best case | teal-mountain (score 96) |
| Weakest case | arc-swoosh (score 81) |

[Image: DeepSeek V4 Pro full benchmark outputs]

Gemini's results were frustrating

Because when Gemini worked, it often looked good. Its successful outputs were clean, minimal and visually sharp. But the empty responses and thin geometry made the model difficult to trust as a production default. Reliability matters more than occasional brilliance when generations cost money.

| Metric | Value |
|---|---|
| Model | google/gemini-2.5-pro |
| Success rate | 10/12 (83%) |
| Median latency | 40.1s |
| P95 latency | 180.6s |
| Total cost | $0.47 |
| Avg completion tokens | 4,496 |
| Best case | teal-mountain (score 93) |
| Weakest case | complex-facets (score 79) |

[Image: Google Gemini 2.5 Pro full benchmark outputs]

The benchmark also exposed a scoring problem

Each successful run gets a 0–100 composite score built from machine-verifiable checks: non-empty output after normalisation, minimum shape count, grid snap error, colour-hint match, SVG export, a small complexity bonus, and light penalties for wrong grid type or too many shapes.

Composite score (0–100)

| Component | Points | Rule |
|---|---|---|
| Non-empty logo | +40 | At least one shape after normalisation |
| Minimum complexity | +15 | shapeCount >= minShapes (default minShapes is 1) |
| Grid adherence | 0–15 | Pre-normalisation vertex snap error (see below) |
| Colour fidelity | +10 | colorHintMatch === true |
| SVG export | +10 | svgBytes > 0 |
| Complexity bonus | 0–10 | min(10, shapeCount) |
| Too many shapes | −5 | Only when maxShapes is set and exceeded |
| Wrong grid type | −5 | Expected square or isometric and snapshot doesn't match |

export function scoreDescribeSnapshot(...) {
  let score = 0;
  score += shapeCount > 0 ? 40 : 0;
  // minShapes +15, grid snap 0–15, color +10, SVG +10, complexity bonus...
}

Grid adherence

| gridSnapError | Points |
|---|---|
| < 2 px (average) | +15 |
| ≥ 2 px | +max(0, 15 − error) (capped at 15) |

gridSnapError = average distance from each raw vertex to its snapped grid position (before normalisation).
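
Putting the two tables together, the composite can be sketched as a single function. This is a reconstruction from the rules listed above, not the real `scoreDescribeSnapshot`: the `ScoreInput` shape, the `gridTypeMatches` flag, the `compositeScore` name and the final clamp to 0–100 are all assumptions.

```typescript
// Hypothetical input; field names follow the scoring tables where they exist.
interface ScoreInput {
  shapeCount: number;
  gridSnapError: number; // average pre-normalisation drift in px
  colorHintMatch: boolean;
  svgBytes: number;
  gridTypeMatches: boolean; // assumed flag for the "wrong grid type" rule
  minShapes?: number; // defaults to 1, as in the table
  maxShapes?: number; // penalty applies only when this is set
}

function compositeScore(input: ScoreInput): number {
  const minShapes = input.minShapes ?? 1;
  let score = 0;
  if (input.shapeCount > 0) score += 40; // non-empty logo
  if (input.shapeCount >= minShapes) score += 15; // minimum complexity
  // Grid adherence: full marks under 2 px, then a linear falloff to zero.
  score += input.gridSnapError < 2 ? 15 : Math.max(0, 15 - input.gridSnapError);
  if (input.colorHintMatch) score += 10; // colour fidelity
  if (input.svgBytes > 0) score += 10; // SVG export
  score += Math.min(10, input.shapeCount); // complexity bonus
  if (input.maxShapes !== undefined && input.shapeCount > input.maxShapes) score -= 5;
  if (!input.gridTypeMatches) score -= 5;
  // Assumption: the result is clamped to the stated 0–100 range.
  return Math.max(0, Math.min(100, score));
}
```

Under these assumptions, a single shape with zero snap error, a colour match and a successful export scores 91, and eight shapes with a snap error of 2.64 score about 95.4. It also shows why a one-shape output cannot exceed 91: the complexity bonus rewards shape count, which works against the "ideal" minimal circle.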

Most successful runs scored between 91 and 100, so raw score alone wasn't very useful. The real differences appeared in:

  • Geometry quality
  • Consistency
  • Shape decisions
  • Prompt interpretation
  • Visual polish
  • Reliability under repetition

In other words, the benchmark became more interesting when we looked at the SVGs, instead of the numbers.

Biggest overall findings

  1. Structured vector generation is much harder than image generation - Generating pretty pixels is easy compared to generating valid geometry, coherent structure, editable vectors, and machine-readable layouts. Many models that are excellent at image tasks struggled with geometric discipline.

  2. Simplicity is deceptive - A "minimal circle" sounds trivial. But simple prompts leave nowhere to hide. You instantly notice polygon artefacts, asymmetry, awkward spacing and excessive geometry. Simple briefs became the harshest quality test in the entire suite.

  3. Reliability matters more than peak quality - The benchmark changed how we think about defaults. A model that occasionally produces incredible work, but sometimes fails entirely, creates a worse product experience than a model that is consistently good - especially in a credit-based generation flow.

  4. High scores do not equal aesthetic quality - Automated metrics are useful but they do not fully capture balance, composition, silhouette quality, logo "feel" and visual confidence. Human review still matters enormously.

Master model comparison

| Model | Success rate | Median score | Median latency | P95 latency | Total cost | Avg completion tokens | Best case | Weakest case |
|---|---|---|---|---|---|---|---|---|
| GPT-5.5 | 12/12 (100%) | 92 | 28.9s | 74.5s | $0.78 | 2,078 | complex-facets (98) | arc-swoosh (87) |
| Claude Sonnet 4.6 | 47/48 (97%) | 95 | 13.2s | 23.1s | $1.92 | 1,281 | teal-mountain (98) | arc-swoosh (87) |
| DeepSeek V4 Pro | 12/12 (100%) | 92 | 93.1s | 207.4s | $0.19 | 4,462 | teal-mountain (96) | arc-swoosh (81) |
| Gemini 2.5 Pro | 10/12 (83%) | 88 | 40.1s | 180.6s | $0.47 | 4,496 | teal-mountain (93) | complex-facets (79) |

The models, in one sentence each

| Model | Summary |
|---|---|
| Claude Sonnet 4.6 | Best overall balance |
| Claude Opus 4.6 | Strong ideas, weak fundamentals |
| Claude Opus 4.7 | Most consistently "brand-ready" |
| Claude Opus 4.8 | Creative but inconsistent |
| o4-mini | Cheap, clean, surprisingly capable |
| GPT-5.2 | Reliable and symmetrical |
| GPT-5.4 | Excellent on complex geometry |
| GPT-5.5 | Strong, but expensive |
| Gemini 2.5 Pro | Sharp when it works |
| DeepSeek V4 Pro | Pure variance |

The most important takeaway

The benchmark made one thing very clear. "AI generated SVG" is not a solved problem.

Getting a model to generate aesthetically strong, structurally valid, editable, normalised, export-safe vector geometry is still surprisingly difficult - especially when the outputs need to survive a real product pipeline instead of just looking good in a screenshot.

And honestly? That's what made this benchmark interesting. Not which model got the highest score, but how differently each model failed.

What's next

We're expanding the benchmark with:

  • Reference-image prompts
  • More stroke-based tests
  • Multi-turn refinement
  • Larger sample sizes
  • Human ranking evaluation

Because after looking through all 120 outputs, one thing became obvious: the future of AI design tools is probably less about image generation, and more about controllable structure.

Appendices

Full per-model output galleries and detailed breakdowns are available in the Describe-Benchmark repository:

  • Anthropic results
  • OpenAI results
  • Gemini and DeepSeek results

Raw results and reproducibility

I am currently working on a way to open source the benchmark. At the moment, too much of the benchmark relies on code I do not want to open source.

In the meantime, you can view the full results and raw outputs on GitHub → logolattice/Describe-Benchmark


Originally posted on X on 2 June 2026.