Asset Forge

Exploring whether large language models can generate quality pixel art through structured drawing commands

Generator: Claude Sonnet · Judge: Claude Opus · Grid: 32–64px, indexed color · Stack: Node.js + Express

01 The Approach

LLMs output structured drawing commands. A server-side rasterizer turns those commands into indexed-color pixel grids.

The naive approach — asking an LLM to output raw pixel arrays — produces incoherent results. Models struggle with spatial reasoning at the individual pixel level. But shape primitives are a different story: rectangles, circles, and polygons are the kind of structured, compositional output that language models handle well.

Asset Forge gives the model a fixed palette and a canvas size, then asks it to compose a sprite from geometric primitives. The server rasterizes the commands into a pixel grid, producing genuine pixel art with hard edges and an indexed color palette.
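As a concrete sketch, a command list and rasterizer loop might look like the following. The command and field names are illustrative, not Asset Forge's actual schema; the point is that the model emits structured shapes and the server turns them into palette indices.

```javascript
// Minimal sketch of the command-to-grid idea (field names are illustrative,
// not Asset Forge's actual schema). Each command references a palette index;
// the rasterizer writes indices into a 2D grid, with -1 meaning transparent.
function rasterize(commands, width, height) {
  const grid = Array.from({ length: height }, () => new Array(width).fill(-1));
  const put = (x, y, c) => {
    if (x >= 0 && x < width && y >= 0 && y < height) grid[y][x] = c;
  };
  for (const cmd of commands) {
    if (cmd.type === "rect") {
      for (let y = cmd.y; y < cmd.y + cmd.h; y++)
        for (let x = cmd.x; x < cmd.x + cmd.w; x++) put(x, y, cmd.color);
    } else if (cmd.type === "circle") {
      for (let y = cmd.cy - cmd.r; y <= cmd.cy + cmd.r; y++)
        for (let x = cmd.cx - cmd.r; x <= cmd.cx + cmd.r; x++)
          if ((x - cmd.cx) ** 2 + (y - cmd.cy) ** 2 <= cmd.r ** 2)
            put(x, y, cmd.color);
    }
  }
  return grid;
}

// A toy "potion": round red body with a short neck above it.
const grid = rasterize(
  [
    { type: "circle", cx: 16, cy: 20, r: 8, color: 2 },
    { type: "rect", x: 14, y: 8, w: 4, h: 6, color: 1 },
  ],
  32, 32
);
```

Because every pixel is a palette index written by a shape primitive, the output is genuinely indexed-color art with hard edges; no anti-aliasing sneaks in.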

Drawing Primitives

rect · circle · ellipse · line · polygon · fill
front-facing wizard with tall pointed purple hat, flowing blue robes, and holding a glowing staff (Human 5.0 · LLM 4.3 · 48×48)
top-down grass ground tile with subtle texture variation and scattered tiny flowers (Human 5.0 · LLM 4.5 · 32×32)
top-down sandy desert tile with small scattered rocks and wind-swept ripples (Human 5.0 · LLM 4.2 · 32×32)
front-view red health potion in a round glass bottle with a cork stopper (Human 5.0 · LLM 4.0 · 32×32)
front-view golden crown with three gemstones set along the top band (Human 5.0 · LLM 4.3 · 48×48)
top-down large grey boulder with cracks, moss patches, and shading for depth (Human 5.0 · LLM 4.2 · 32×32)
front-facing peasant farmer holding a pitchfork with straw hat and brown overalls (Human 4.8 · LLM 4.2 · 32×32)
top-down lava tile with glowing orange cracks on dark rock surface (Human 4.8 · LLM 4.5 · 48×48)

02 The Evaluation Harness

A calibrated LLM judge (Opus evaluating Sonnet's output) scoring across six quality dimensions, validated against human grading.

Automated evaluation of generative art is hard. Subjective quality, prompt fidelity, and pixel-art-specific conventions all matter. We built a judge that scores each sprite on a 1–5 Likert scale across six dimensions, then calibrated its rubrics against blind human scoring over five rounds.

Quality Dimensions

Component Separation · Color Usage · Detail Density · Spatial Coverage · Pixel Art Discipline · Prompt Adherence

Each dimension has explicit anchor descriptions at every score level, so both the LLM judge and human graders share a common rubric. Component Separation, for example, distinguishes between multi-part subjects (where parts should be visually distinct) and simple subjects (where cohesion is the goal).
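The anchor idea can be pictured as a small data structure. The anchor text below is abridged and invented for illustration; the real rubrics are longer and subject-relative.

```javascript
// Hypothetical shape of one scoring dimension. The anchor strings here are
// illustrative placeholders, not Asset Forge's actual rubric text.
const componentSeparation = {
  name: "Component Separation",
  anchors: {
    1: "Parts indistinguishable; subject reads as a single blob.",
    2: "Major parts merge; outlines unclear.",
    3: "Main parts identifiable but boundaries muddy in places.",
    4: "Parts clearly distinct with minor blending.",
    5: "Every named part visually distinct (or cohesive, for simple subjects).",
  },
};

// Both the LLM judge and human graders score against the same anchor text.
function anchorFor(rubric, score) {
  return rubric.anchors[score];
}
```

Keeping the anchors in data rather than prose means the same rubric can be injected into the judge prompt and rendered in the human grading UI.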

Calibration Journey

We ran five rounds of human grading, tightening rubrics and changing the judge's inputs after each round. The key breakthrough was giving the judge vision — both the rendered PNG and an ASCII grid representation — which dramatically improved Spatial Coverage and Prompt Adherence scoring.
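The ASCII view can be sketched as a direct dump of the indexed grid; the exact characters Asset Forge feeds the judge are an assumption here.

```javascript
// Sketch of turning an indexed pixel grid into the ASCII view given to the
// judge alongside the rendered PNG ("." = transparent, digits = palette
// indices). The actual character mapping is an assumption.
function toAscii(grid) {
  return grid
    .map((row) => row.map((c) => (c < 0 ? "." : String(c))).join(""))
    .join("\n");
}

const ascii = toAscii([
  [-1, 2, -1],
  [2, 2, 2],
  [-1, 2, -1],
]);
// ascii:
// .2.
// 222
// .2.
```

The text grid gives the judge exact per-pixel coordinates, complementing the PNG's holistic view.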

Round 1 — Baseline

Initial rubrics, code-based scoring for Detail Density and Spatial Coverage. Overall MAD 0.98. Component Separation wildly off at 1.40.

Round 2 — Rubric Tightening

Rewrote anchor descriptions with subject-relative guidance. Overall MAD 0.86. Spatial Coverage worsened to 1.20 under code-based formula.

Round 3 — Subject-Aware Coverage

Refined Spatial Coverage rubric to account for subject shape. Overall MAD 0.96. Prompt Adherence worst at 1.60 without vision.

Round 4 — Expanded Prompt Set

20 prompts across 5 difficulty levels. Multi-shot judging. Overall MAD 0.87. Converging but coverage still code-based.

Round 5 — LLM Vision

Gave the judge the rendered PNG and ASCII grid. Spatial Coverage moved to LLM-judged. Overall MAD 0.61. Spatial Coverage went from worst dimension to best.

Calibration Data by Round

Round | Comp. Sep. | Color | Detail | Coverage | Pixel Disc. | Prompt Adh. | Overall MAD
1 — Baseline | 1.40 | 1.00 | 1.00 | 0.20 | — | 1.30 | 0.98
2 — Rubric Tightening | 0.70 | 0.80 | 0.80 | 1.20 | — | 0.80 | 0.86
3 — Subject-Aware | 1.00 | 0.60 | 0.90 | 0.70 | — | 1.60 | 0.96
4 — Expanded Set | 1.05 | 0.58 | 0.74 | 0.89 | — | 1.11 | 0.87
5 — Vision | 0.50 | 0.56 | 0.39 | 0.61 | — | 1.00 | 0.61

MAD = Mean Absolute Deviation between LLM judge scores and blind human scores (lower is better).
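Both calibration metrics are standard; a sketch of how they might be computed, including the within-1-point agreement rate used alongside MAD (the harness's own implementation may differ in detail):

```javascript
// MAD: mean absolute deviation between judge and human scores (lower is
// better). Agreement: share of sprites where the judge lands within 1 point
// of the blind human grade. Standard definitions, not the harness's exact code.
function mad(judge, human) {
  const diffs = judge.map((j, i) => Math.abs(j - human[i]));
  return diffs.reduce((a, b) => a + b, 0) / diffs.length;
}

function agreementWithin1(judge, human) {
  const hits = judge.filter((j, i) => Math.abs(j - human[i]) <= 1).length;
  return hits / judge.length;
}

// Example with made-up per-sprite scores:
const judgeScores = [4.3, 4.5, 4.2, 4.0];
const humanScores = [5.0, 5.0, 5.0, 5.0];
// mad(judgeScores, humanScores) ≈ 0.75
```

A low MAD with high agreement means the judge is usable as an automated proxy for human grading.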

Sprites Used for Judge Calibration

The same prompts were regenerated each round. The sprites differ because LLM output varies, but the generation prompt stayed fixed; what changed was the judge.

top-down military tank with visible tracks, hull, and turret
Round 1 H:3.0 L:4.8 · Round 2 H:3.0 L:4.4 · Round 3 H:2.4 L:4.8 · Round 4 H:4.8 L:4.6

side-view red sports car
Round 1 H:3.6 L:4.2 · Round 2 H:4.0 L:3.8 · Round 3 H:3.6 L:4.2

pixel art knight with sword and shield
Round 1 H:3.2 L:4.2 · Round 2 H:3.6 L:3.8 · Round 3 H:3.6 L:4.2

wizard with pointed hat and staff
Round 1 H:3.2 L:3.8 · Round 2 H:3.8 L:3.6 · Round 3 H:3.8 L:4.0

grass ground tile with subtle texture variation
Round 1 H:3.8 L:3.6 · Round 2 H:2.8 L:4.0 · Round 3 H:3.8 L:3.8

water tile with wave pattern
Round 1 H:2.6 L:4.4 · Round 2 H:2.4 L:4.4 · Round 3 H:3.4 L:4.4

Calibration Metrics Across Rounds

MAD = Mean Absolute Deviation (lower is better). Agreement = % of scores within 1 point of human.

Dimension | R1 Baseline | R2 Rubric rewrites | R3 Model separation | R4 Vision + preChecks | R5 LLM spatial coverage (30 prompts)
Comp Sep | 1.40 | 0.70 | 1.00 | 0.50 | 0.93
Color | 1.00 | 0.80 | 0.60 | 0.56 | 0.71
Detail | 1.00 | 0.80 | 0.90 | 0.39 | 1.04
Spatial | 0.20 | 1.20 | 0.70 | 0.61 | 0.61
Pixel Art | — | — | — | — | 0.89
Prompt | 1.30 | 0.80 | 1.60 | 1.00 | 1.32

03 What Worked

Tiles, simple objects, and the calibration methodology itself were clear successes.

Key insight: Giving the LLM judge both the rendered PNG and an ASCII text grid of the sprite dramatically improved its ability to assess spatial qualities. The model needs to see the output, not just analyze the drawing commands.

front-facing wizard with tall pointed purple hat, flowing blue robes, and holding a glowing staff (Human 5.0 · LLM 4.3 · 48×48)
top-down grass ground tile with subtle texture variation and scattered tiny flowers (Human 5.0 · LLM 4.5 · 32×32)
top-down sandy desert tile with small scattered rocks and wind-swept ripples (Human 5.0 · LLM 4.2 · 32×32)
front-view red health potion in a round glass bottle with a cork stopper (Human 5.0 · LLM 4.0 · 32×32)
front-view golden crown with three gemstones set along the top band (Human 5.0 · LLM 4.3 · 48×48)
top-down large grey boulder with cracks, moss patches, and shading for depth (Human 5.0 · LLM 4.2 · 32×32)
front-facing peasant farmer holding a pitchfork with straw hat and brown overalls (Human 4.8 · LLM 4.2 · 32×32)
top-down lava tile with glowing orange cracks on dark rock surface (Human 4.8 · LLM 4.5 · 48×48)

Round 5: Best and Worst Sprites

The latest round with 30 prompts, expanded categories, and LLM-judged spatial coverage.


04 What Didn't Work

Generation prompt iteration made things worse. Complex subjects hit a quality ceiling.

Generation Prompt Version Comparison

Version | Comp. Sep. | Color | Detail | Coverage | Pixel Disc. | Prompt Adh. | Overall
Current (baseline) | 4.28 | 4.50 | 4.06 | 4.33 | — | 3.61 | 4.16
v2 (detail emphasis) | 4.05 | 4.05 | 3.58 | 4.16 | 4.05 | 3.95 | 3.97
v3 (category guidance) | 3.60 | 4.00 | 3.35 | 4.10 | 3.75 | 3.35 | 3.69

Human-graded mean scores (1–5 Likert scale). Each version evaluated on the same prompt set. v2 added detail emphasis and stronger instructions; v3 added per-category guidance and lighter perspective rules.

Side-by-Side Comparison

The same prompts rendered by each generation prompt version. Human scores (1–5) shown below each sprite.

front-view red health potion in a round glass bottle with a cork stopper (Current 5.0 · v2 4.8 · v3 3.2)
top-down sandy desert tile with small scattered rocks and wind-swept ripples (Current 5.0 · v2 4.7 · v3 3.3)
top-down military tank with visible tracks, hull, and turret with barrel poin... (Current 3.8 · v2 4.2 · v3 2.3)
flat iron key with ornate loop handle, viewed from above, with rust patina (Current 4.2 · v2 3.5 · v3 3.2)
top-down grass ground tile with subtle texture variation and scattered tiny f... (Current 5.0 · v2 4.3 · v3 4.2)

Lesson learned: When the baseline model is already producing good output, adding more constraints in the system prompt consistently makes things worse. The model knows more about composing pixel art than our prescriptive rules can capture.

Generator Prompt Iterations

Same judge, different generation prompts. The baseline (current) outperformed both v2 and v3. More prescriptive instructions made sprites worse.

three-quarter view treasure chest with gold trim, lid closed (current H:5.0 · v2 H:4.5 · v3 H:3.0)
top-down military tank with visible tracks, hull, and turret (current H:4.0 · v2 H:4.2 · v3 H:2.3)
front-view red health potion in a round glass bottle with a cork (current H:5.0 · v2 H:4.8 · v3 H:3.2)
top-down sandy desert tile with small scattered rocks (current H:4.3 · v2 H:4.7 · v3 H:3.3)
flat iron key with ornate loop handle, viewed from above (current H:4.5 · v2 H:3.5 · v3 H:3.2)
front-facing skeleton warrior holding a bone club (current H:4.2 · v2 H:3.2 · v3 H:3.0)
front-view brick wall segment with alternating brick pattern (current H:3.8 · v2 H:5.0 · v3 H:3.8)
top-down small red dragon with wings spread, viewed from above (current H:2.8 · v2 H:3.8 · v3 H:3.3)

Human Scores by Generator Version

Dimension | current | v2 | v3
Comp Sep | 4.11 | 4.05 | 3.60
Color | 4.32 | 4.05 | 4.00
Detail | 4.05 | 3.58 | 3.35
Spatial | 4.42 | 4.16 | 4.10
Pixel Art | 4.05 | 4.05 | 3.75
Prompt | 3.84 | 3.95 | 3.35
OVERALL | 4.13 | 3.97 | 3.69
top-down blue pickup truck facing up with cab, open truck bed, and four visible wheels (Human 2.3 · LLM 4.3 · 48×48)
three-quarter view wooden barrel with two metal bands and visible wood stave grain (Human 2.7 · LLM 4.2 · 32×32)
side-view red sports car facing right with visible wheels, low profile, and tinted windows (Human 3.0 · LLM 4.2 · 64×32)
front-facing pixel art knight holding a sword in one hand and a shield in the other, wearing silver armour (Human 3.0 · LLM 4.3 · 48×48)
front-facing dwarf blacksmith with thick beard, leather apron, and hammer raised overhead (Human 3.0 · LLM 3.8 · 48×48)
top-down small green go-kart facing right with driver helmet visible (Human 3.2 · LLM 4.2 · 32×32)

05 Where It's Going

The eval infrastructure is solid. The generation approach may need to change.

The drawing-commands approach works well for tiles and simple objects but hits a quality ceiling for complex subjects. One alternative under consideration: generate a high-resolution image with an image-generation model, then use custom tooling to convert it down to pixel art with proper palette constraints and hard edges.
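That conversion step could be sketched as a box-average downscale followed by nearest-palette snapping. The function names and the simple averaging are illustrative, not a committed design.

```javascript
// Sketch of the alternative pipeline's last step: downscale a high-res RGB
// image to a small grid and snap each cell to the nearest palette color.
// pixels is a flat array of [r, g, b] triples, row-major.
function quantizeToPalette(pixels, srcW, srcH, outW, outH, palette) {
  const grid = [];
  const bw = srcW / outW, bh = srcH / outH;
  for (let gy = 0; gy < outH; gy++) {
    const row = [];
    for (let gx = 0; gx < outW; gx++) {
      // Average the source block covered by this output cell.
      let r = 0, g = 0, b = 0, n = 0;
      for (let y = Math.floor(gy * bh); y < Math.floor((gy + 1) * bh); y++)
        for (let x = Math.floor(gx * bw); x < Math.floor((gx + 1) * bw); x++) {
          const p = pixels[y * srcW + x];
          r += p[0]; g += p[1]; b += p[2]; n++;
        }
      row.push(nearest([r / n, g / n, b / n], palette));
    }
    grid.push(row);
  }
  return grid;
}

// Nearest palette entry by squared RGB distance; returns a palette index,
// which is what gives the result hard edges and an indexed-color output.
function nearest(rgb, palette) {
  let best = 0, bestD = Infinity;
  palette.forEach((p, i) => {
    const d = (p[0] - rgb[0]) ** 2 + (p[1] - rgb[1]) ** 2 + (p[2] - rgb[2]) ** 2;
    if (d < bestD) { bestD = d; best = i; }
  });
  return best;
}

// Example: a 4×4 reddish source image down to a 2×2 indexed grid.
const spriteGrid = quantizeToPalette(
  Array.from({ length: 16 }, () => [250, 10, 5]),
  4, 4, 2, 2,
  [[0, 0, 0], [255, 0, 0]]
);
```

Snapping to the palette after averaging keeps the fixed-palette constraint of the current approach while letting an image model handle composition.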

Regardless of the generation approach, the evaluation harness and calibrated judge remain valuable. The rubrics, prompt sets, human grading UI, and calibration methodology are general-purpose infrastructure for assessing pixel art quality.

Eval Harness

Prompt sets, automated judging, multi-shot generation, and statistical analysis — reusable regardless of generation method.

Calibrated Judge

Six-dimension rubrics validated against human scoring. MAD 0.61 overall with vision-augmented judging.

Human Grading UI

Web-based blind grading interface for sprite quality assessment. Supports re-grading and score comparison.

Generation Pipeline

Drawing-command generation with server-side rasterization. Works well for simple subjects; may evolve for complex ones.