Exploring whether large language models can generate quality pixel art through structured drawing commands
LLMs output structured drawing commands. A server-side rasterizer turns those commands into indexed-color pixel grids.
The naive approach — asking an LLM to output raw pixel arrays — produces incoherent results. Models struggle with spatial reasoning at the individual pixel level. But shape primitives are a different story: rectangles, circles, and polygons are the kind of structured, compositional output that language models handle well.
Asset Forge gives the model a fixed palette and a canvas size, then asks it to compose a sprite from geometric primitives. The server rasterizes the commands into a pixel grid, producing genuine pixel art with hard edges and an indexed color palette.
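To make the pipeline concrete, here is a minimal sketch of what such a rasterizer might look like. The command schema, field names, and the subset of primitives shown (`rect` and `circle` only) are illustrative assumptions, not Asset Forge's actual format.

```python
# Minimal sketch of a drawing-command rasterizer. The command schema
# ({"type": ..., "color": ...}) is an illustrative assumption.
def rasterize(commands, width, height, background=0):
    """Turn a list of shape commands into an indexed-color pixel grid."""
    grid = [[background] * width for _ in range(height)]

    def put(x, y, color):
        # Clip to the canvas so shapes can safely overflow the edges.
        if 0 <= x < width and 0 <= y < height:
            grid[y][x] = color

    for cmd in commands:
        c = cmd["color"]  # a palette index, not an RGB value
        if cmd["type"] == "rect":
            for y in range(cmd["y"], cmd["y"] + cmd["h"]):
                for x in range(cmd["x"], cmd["x"] + cmd["w"]):
                    put(x, y, c)
        elif cmd["type"] == "circle":
            cx, cy, r = cmd["cx"], cmd["cy"], cmd["r"]
            for y in range(cy - r, cy + r + 1):
                for x in range(cx - r, cx + r + 1):
                    if (x - cx) ** 2 + (y - cy) ** 2 <= r * r:
                        put(x, y, c)
    return grid

sprite = rasterize(
    [{"type": "rect", "x": 2, "y": 2, "w": 4, "h": 4, "color": 1},
     {"type": "circle", "cx": 4, "cy": 4, "r": 1, "color": 2}],
    width=8, height=8)
```

Because the model emits only structured commands, the hard edges and indexed palette are guaranteed by the rasterizer rather than hoped for in the model's output.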
Supported drawing primitives: `rect`, `circle`, `ellipse`, `line`, `polygon`, `fill`. A calibrated LLM judge (Opus evaluating Sonnet's output) scores across six quality dimensions, validated against human grading.
Automated evaluation of generative art is hard. Subjective quality, prompt fidelity, and pixel-art-specific conventions all matter. We built a judge that scores each sprite on a 1–5 Likert scale across six dimensions, then calibrated its rubrics against blind human scoring over five rounds.
Each dimension has explicit anchor descriptions at every score level, so both the LLM judge and human graders share a common rubric. Component Separation, for example, distinguishes between multi-part subjects (where parts should be visually distinct) and simple subjects (where cohesion is the goal).
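The shared-rubric idea can be expressed as a simple data structure used by both the judge prompt and the grading UI. The anchor wording below is invented for illustration; only the structure (one explicit anchor per Likert score) mirrors the text.

```python
# Illustrative rubric entry: one anchor description per Likert score.
# The wording here is hypothetical, not the project's actual rubric.
COMPONENT_SEPARATION = {
    1: "Parts blend into an unreadable mass; no subject structure visible.",
    2: "Major parts exist but bleed together; boundaries need guessing.",
    3: "Most parts distinguishable; some edges muddy or overlapping.",
    4: "Parts clearly separated (or cohesive, for simple subjects) with minor flaws.",
    5: "Multi-part subjects: crisp, distinct parts. Simple subjects: clean cohesion.",
}

def anchor_for(score: int) -> str:
    """Return the rubric text shown to both the LLM judge and human graders."""
    return COMPONENT_SEPARATION[score]
```

Keeping the anchors in one place means a rubric rewrite (as in Round 2 below) changes what the judge and the humans see simultaneously, so their scores stay comparable.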
We ran five rounds of human grading, tightening rubrics and changing the judge's inputs after each round. The key breakthrough was giving the judge vision — both the rendered PNG and an ASCII grid representation — which dramatically improved Spatial Coverage and Prompt Adherence scoring.
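The ASCII grid half of that input is cheap to produce from the indexed pixel grid. A sketch, assuming index 0 is the background; the index-to-character mapping is an arbitrary choice:

```python
# Render an indexed-color pixel grid as an ASCII grid for the judge's
# text input. The character mapping is an assumption; any stable,
# visually distinct mapping works.
def ascii_grid(grid, charset=".123456789ABCDEF"):
    """One character per pixel; '.' marks the background index 0."""
    return "\n".join("".join(charset[p] for p in row) for row in grid)

art = ascii_grid([[0, 1, 0],
                  [1, 2, 1],
                  [0, 1, 0]])
# art ==
# .1.
# 121
# .1.
```

The PNG lets the judge see the sprite; the ASCII grid lets it count pixels and reason about exact positions, which is where vision models are weakest.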
- **Round 1 — Baseline:** Initial rubrics; code-based scoring for Detail Density and Spatial Coverage. Overall MAD 0.98; Component Separation wildly off at 1.40.
- **Round 2 — Rubric Tightening:** Rewrote anchor descriptions with subject-relative guidance. Overall MAD 0.86; Spatial Coverage worsened to 1.20 under the code-based formula.
- **Round 3 — Subject-Aware:** Refined the Spatial Coverage rubric to account for subject shape. Overall MAD 0.96; Prompt Adherence worst at 1.60 without vision.
- **Round 4 — Expanded Set:** 20 prompts across 5 difficulty levels; multi-shot judging. Overall MAD 0.87; converging, but coverage still code-based.
- **Round 5 — Vision:** Gave the judge the rendered PNG and ASCII grid; Spatial Coverage moved to LLM-judged. Overall MAD 0.61; Spatial Coverage went from worst dimension to best.
| Round | Comp. Sep. | Color | Detail | Coverage | Pixel Disc. | Prompt Adh. | Overall MAD |
|---|---|---|---|---|---|---|---|
| 1 — Baseline | 1.40 | 1.00 | 1.00 | 0.20 | — | 1.30 | 0.98 |
| 2 — Rubric Tightening | 0.70 | 0.80 | 0.80 | 1.20 | — | 0.80 | 0.86 |
| 3 — Subject-Aware | 1.00 | 0.60 | 0.90 | 0.70 | — | 1.60 | 0.96 |
| 4 — Expanded Set | 1.05 | 0.58 | 0.74 | 0.89 | — | 1.11 | 0.87 |
| 5 — Vision | 0.50 | 0.56 | 0.39 | 0.61 | — | 1.00 | 0.61 |
MAD = Mean Absolute Deviation between LLM judge scores and blind human scores (lower is better). Color coding: green ≤ 0.65, gold ≤ 1.10, red > 1.10.
The same prompts were regenerated each round. The sprites differ because LLM output varies, but the generation prompt stayed the same; what changed was the judge.
MAD = Mean Absolute Deviation (lower is better). Agreement = percentage of scores within 1 point of the human grade.
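Both calibration metrics are straightforward to compute per dimension. A sketch, with hypothetical scores for illustration only:

```python
# MAD (mean absolute deviation from blind human scores, lower is
# better) and agreement (share of judge scores within 1 point of the
# human grade) for one dimension.
def calibration_metrics(judge_scores, human_scores):
    diffs = [abs(j - h) for j, h in zip(judge_scores, human_scores)]
    mad = sum(diffs) / len(diffs)
    agreement = sum(d <= 1 for d in diffs) / len(diffs)
    return mad, agreement

# Hypothetical judge vs. human scores on a 1-5 Likert scale.
mad, agreement = calibration_metrics([4, 3, 5, 2, 4], [4, 4, 3, 2, 5])
# mad == 0.8, agreement == 0.8
```

With 1–5 Likert scores, a MAD of 0.61 means the judge lands, on average, within about six tenths of a point of the human grade.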
| Dimension | Round 1 (Baseline) | Round 2 (Rubric rewrites) | Round 3 (Model separation) | Round 4 (Vision + pre-checks) | Round 5 (LLM spatial coverage, 30 prompts) |
|---|---|---|---|---|---|
| Comp Sep | 1.40 | 0.70 | 1.00 | 0.50 | 0.93 |
| Color | 1.00 | 0.80 | 0.60 | 0.56 | 0.71 |
| Detail | 1.00 | 0.80 | 0.90 | 0.39 | 1.04 |
| Spatial | 0.20 | 1.20 | 0.70 | 0.61 | 0.61 |
| Pixel Art | — | — | — | — | 0.89 |
| Prompt | 1.30 | 0.80 | 1.60 | 1.00 | 1.32 |
Tiles, simple objects, and the calibration methodology itself were clear successes.
Key insight: Giving the LLM judge both the rendered PNG and an ASCII text grid of the sprite dramatically improved its ability to assess spatial qualities. The model needs to see the output, not just analyze the drawing commands.
The latest round with 30 prompts, expanded categories, and LLM-judged spatial coverage.
Generation prompt iteration made things worse. Complex subjects hit a quality ceiling.
| Version | Comp. Sep. | Color | Detail | Coverage | Pixel Disc. | Prompt Adh. | Overall |
|---|---|---|---|---|---|---|---|
| Current (baseline) | 4.28 | 4.50 | 4.06 | 4.33 | — | 3.61 | 4.16 |
| v2 (detail emphasis) | 4.05 | 4.05 | 3.58 | 4.16 | 4.05 | 3.95 | 3.97 |
| v3 (category guidance) | 3.60 | 4.00 | 3.35 | 4.10 | 3.75 | 3.35 | 3.69 |
Human-graded mean scores (1–5 Likert scale). Each version evaluated on the same prompt set. v2 added detail emphasis and stronger instructions; v3 added per-category guidance and lighter perspective rules.
The same prompts rendered by each generation prompt version. Human scores (1–5) shown below each sprite.
Lesson learned: When the baseline model is already producing good output, adding more constraints in the system prompt consistently makes things worse. The model knows more about composing pixel art than our prescriptive rules can capture.
Same judge, different generation prompts. The baseline (current) outperformed both v2 and v3. More prescriptive instructions made sprites worse.
| Dimension | current | v2 | v3 |
|---|---|---|---|
| Comp Sep | 4.11 | 4.05 | 3.60 |
| Color | 4.32 | 4.05 | 4.00 |
| Detail | 4.05 | 3.58 | 3.35 |
| Spatial | 4.42 | 4.16 | 4.10 |
| Pixel Art | 4.05 | 4.05 | 3.75 |
| Prompt | 3.84 | 3.95 | 3.35 |
| OVERALL | 4.13 | 3.97 | 3.69 |
The eval infrastructure is solid. The generation approach may need to change.
The drawing-commands approach works well for tiles and simple objects but hits a quality ceiling for complex subjects. One alternative under consideration: generate a high-resolution image with an image-generation model, then use custom tooling to convert it down to pixel art with proper palette constraints and hard edges.
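That conversion step is itself simple to prototype: downscale by block averaging, then snap each pixel to the nearest color in a fixed palette. A sketch under assumed sizes and palette; nothing here is the project's actual tooling:

```python
import numpy as np

# Sketch of the proposed alternative: reduce a high-res image to an
# indexed-color pixel grid. Palette and sizes are illustrative.
def to_pixel_art(image, out_size, palette):
    """image: (H, W, 3) uint8 array; palette: (K, 3) uint8 array.
    Returns an (out_size, out_size) grid of palette indices."""
    h, w, _ = image.shape
    by, bx = h // out_size, w // out_size
    # Block-average downscale; hard edges come from palette snapping,
    # not from the resampling itself.
    small = image[:by * out_size, :bx * out_size].reshape(
        out_size, by, out_size, bx, 3).mean(axis=(1, 3))
    # Nearest-palette-color lookup (squared distance in RGB space).
    dists = ((small[:, :, None, :] - palette[None, None, :, :]) ** 2).sum(-1)
    return dists.argmin(-1)

palette = np.array([[0, 0, 0], [255, 255, 255], [255, 0, 0]], dtype=np.uint8)
img = np.zeros((32, 32, 3), dtype=np.uint8)
img[:, 16:] = [255, 0, 0]          # right half red, left half black
indices = to_pixel_art(img, 8, palette)
```

A production version would want a perceptual color space and possibly dithering control, but even this naive form yields genuine indexed-color output with hard edges.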
Regardless of the generation approach, the evaluation harness and calibrated judge remain valuable. The rubrics, prompt sets, human grading UI, and calibration methodology are general-purpose infrastructure for assessing pixel art quality.
- **Evaluation harness:** Prompt sets, automated judging, multi-shot generation, and statistical analysis — reusable regardless of generation method.
- **Calibrated judge:** Six-dimension rubrics validated against human scoring. MAD 0.61 overall with vision-augmented judging.
- **Human grading UI:** Web-based blind grading interface for sprite quality assessment. Supports re-grading and score comparison.
- **Generation pipeline:** Drawing-command generation with server-side rasterization. Works well for simple subjects; may evolve for complex ones.