Exploring whether large language models can generate quality pixel art through structured drawing commands
LLMs output structured drawing commands. A server-side rasterizer turns those commands into indexed-color pixel grids.
The naive approach — asking an LLM to output raw pixel arrays — produces incoherent results. Models struggle with spatial reasoning at the individual pixel level. But shape primitives are a different story: rectangles, circles, and polygons are the kind of structured, compositional output that language models handle well.
Asset Forge gives the model a fixed palette and a canvas size, then asks it to compose a sprite from geometric primitives. The server rasterizes the commands into a pixel grid, producing genuine pixel art with hard edges and an indexed color palette.
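To make the pipeline concrete, here is a minimal sketch of what such a rasterizer might look like. The command schema, field names, and the subset of primitives shown (`rect` and `circle` only) are illustrative assumptions, not Asset Forge's actual format.

```python
# Minimal sketch of a drawing-command rasterizer. The command schema
# ({"type": ..., "color": ...}) is an illustrative assumption.
def rasterize(commands, width, height, background=0):
    """Turn a list of shape commands into an indexed-color pixel grid."""
    grid = [[background] * width for _ in range(height)]

    def put(x, y, color):
        # Clip to the canvas so shapes can safely overflow the edges.
        if 0 <= x < width and 0 <= y < height:
            grid[y][x] = color

    for cmd in commands:
        c = cmd["color"]  # a palette index, not an RGB value
        if cmd["type"] == "rect":
            for y in range(cmd["y"], cmd["y"] + cmd["h"]):
                for x in range(cmd["x"], cmd["x"] + cmd["w"]):
                    put(x, y, c)
        elif cmd["type"] == "circle":
            cx, cy, r = cmd["cx"], cmd["cy"], cmd["r"]
            for y in range(cy - r, cy + r + 1):
                for x in range(cx - r, cx + r + 1):
                    if (x - cx) ** 2 + (y - cy) ** 2 <= r * r:
                        put(x, y, c)
    return grid

sprite = rasterize(
    [{"type": "rect", "x": 2, "y": 2, "w": 4, "h": 4, "color": 1},
     {"type": "circle", "cx": 4, "cy": 4, "r": 1, "color": 2}],
    width=8, height=8)
```

Because the model emits only structured commands, the hard edges and indexed palette are guaranteed by the rasterizer rather than hoped for in the model's output.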
Supported drawing primitives: `rect`, `circle`, `ellipse`, `line`, `polygon`, `fill`. A calibrated LLM judge (Opus evaluating Sonnet's output) scores across six quality dimensions, validated against human grading.
Automated evaluation of generative art is hard. Subjective quality, prompt fidelity, and pixel-art-specific conventions all matter. We built a judge that scores each sprite on a 1–5 Likert scale across six dimensions, then calibrated its rubrics against blind human scoring over five rounds.
Each dimension has explicit anchor descriptions at every score level, so both the LLM judge and human graders share a common rubric. Component Separation, for example, distinguishes between multi-part subjects (where parts should be visually distinct) and simple subjects (where cohesion is the goal).
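The shared-rubric idea can be expressed as a simple data structure used by both the judge prompt and the grading UI. The anchor wording below is invented for illustration; only the structure (one explicit anchor per Likert score) mirrors the text.

```python
# Illustrative rubric entry: one anchor description per Likert score.
# The wording here is hypothetical, not the project's actual rubric.
COMPONENT_SEPARATION = {
    1: "Parts blend into an unreadable mass; no subject structure visible.",
    2: "Major parts exist but bleed together; boundaries need guessing.",
    3: "Most parts distinguishable; some edges muddy or overlapping.",
    4: "Parts clearly separated (or cohesive, for simple subjects) with minor flaws.",
    5: "Multi-part subjects: crisp, distinct parts. Simple subjects: clean cohesion.",
}

def anchor_for(score: int) -> str:
    """Return the rubric text shown to both the LLM judge and human graders."""
    return COMPONENT_SEPARATION[score]
```

Keeping the anchors in one place means a rubric rewrite (as in Round 2 below) changes what the judge and the humans see simultaneously, so their scores stay comparable.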
We ran five rounds of human grading, tightening rubrics and changing the judge's inputs after each round. The key breakthrough was giving the judge vision — both the rendered PNG and an ASCII grid representation — which dramatically improved Spatial Coverage and Prompt Adherence scoring.
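The ASCII grid half of that input is cheap to produce from the indexed pixel grid. A sketch, assuming index 0 is the background; the index-to-character mapping is an arbitrary choice:

```python
# Render an indexed-color pixel grid as an ASCII grid for the judge's
# text input. The character mapping is an assumption; any stable,
# visually distinct mapping works.
def ascii_grid(grid, charset=".123456789ABCDEF"):
    """One character per pixel; '.' marks the background index 0."""
    return "\n".join("".join(charset[p] for p in row) for row in grid)

art = ascii_grid([[0, 1, 0],
                  [1, 2, 1],
                  [0, 1, 0]])
# art ==
# .1.
# 121
# .1.
```

The PNG lets the judge see the sprite; the ASCII grid lets it count pixels and reason about exact positions, which is where vision models are weakest.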
- **Round 1 — Baseline:** Initial rubrics; code-based scoring for Detail Density and Spatial Coverage. Overall MAD 0.98; Component Separation wildly off at 1.40.
- **Round 2 — Rubric Tightening:** Rewrote anchor descriptions with subject-relative guidance. Overall MAD 0.86; Spatial Coverage worsened to 1.20 under the code-based formula.
- **Round 3 — Subject-Aware:** Refined the Spatial Coverage rubric to account for subject shape. Overall MAD 0.96; Prompt Adherence worst at 1.60 without vision.
- **Round 4 — Expanded Set:** 20 prompts across 5 difficulty levels; multi-shot judging. Overall MAD 0.87; converging, but coverage still code-based.
- **Round 5 — Vision:** Gave the judge the rendered PNG and ASCII grid; Spatial Coverage moved to LLM-judged. Overall MAD 0.61; Spatial Coverage went from worst dimension to best.
| Round | Comp. Sep. | Color | Detail | Coverage | Pixel Disc. | Prompt Adh. | Overall MAD |
|---|---|---|---|---|---|---|---|
| 1 — Baseline | 1.40 | 1.00 | 1.00 | 0.20 | — | 1.30 | 0.98 |
| 2 — Rubric Tightening | 0.70 | 0.80 | 0.80 | 1.20 | — | 0.80 | 0.86 |
| 3 — Subject-Aware | 1.00 | 0.60 | 0.90 | 0.70 | — | 1.60 | 0.96 |
| 4 — Expanded Set | 1.05 | 0.58 | 0.74 | 0.89 | — | 1.11 | 0.87 |
| 5 — Vision | 0.50 | 0.56 | 0.39 | 0.61 | — | 1.00 | 0.61 |
MAD = Mean Absolute Deviation between LLM judge scores and blind human scores (lower is better). Color coding: green ≤ 0.65, gold ≤ 1.10, red > 1.10.
The same prompts were regenerated each round. The sprites differ because LLM output varies, but the generation prompt stayed the same; what changed was the judge.
MAD = Mean Absolute Deviation (lower is better). Agreement = percentage of scores within 1 point of the human grade.
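Both calibration metrics are straightforward to compute per dimension. A sketch, with hypothetical scores for illustration only:

```python
# MAD (mean absolute deviation from blind human scores, lower is
# better) and agreement (share of judge scores within 1 point of the
# human grade) for one dimension.
def calibration_metrics(judge_scores, human_scores):
    diffs = [abs(j - h) for j, h in zip(judge_scores, human_scores)]
    mad = sum(diffs) / len(diffs)
    agreement = sum(d <= 1 for d in diffs) / len(diffs)
    return mad, agreement

# Hypothetical judge vs. human scores on a 1-5 Likert scale.
mad, agreement = calibration_metrics([4, 3, 5, 2, 4], [4, 4, 3, 2, 5])
# mad == 0.8, agreement == 0.8
```

With 1–5 Likert scores, a MAD of 0.61 means the judge lands, on average, within about six tenths of a point of the human grade.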
| Dimension | Round 1 (Baseline) | Round 2 (Rubric rewrites) | Round 3 (Model separation) | Round 4 (Vision + pre-checks) | Round 5 (LLM spatial coverage, 30 prompts) |
|---|---|---|---|---|---|
| Comp Sep | 1.40 | 0.70 | 1.00 | 0.50 | 0.93 |
| Color | 1.00 | 0.80 | 0.60 | 0.56 | 0.71 |
| Detail | 1.00 | 0.80 | 0.90 | 0.39 | 1.04 |
| Spatial | 0.20 | 1.20 | 0.70 | 0.61 | 0.61 |
| Pixel Art | — | — | — | — | 0.89 |
| Prompt | 1.30 | 0.80 | 1.60 | 1.00 | 1.32 |
Tiles, simple objects, and the calibration methodology itself were clear successes.
Key insight: Giving the LLM judge both the rendered PNG and an ASCII text grid of the sprite dramatically improved its ability to assess spatial qualities. The model needs to see the output, not just analyze the drawing commands.
The latest round with 30 prompts, expanded categories, and LLM-judged spatial coverage.
Generation prompt iteration made things worse. Complex subjects hit a quality ceiling.
| Version | Comp. Sep. | Color | Detail | Coverage | Pixel Disc. | Prompt Adh. | Overall |
|---|---|---|---|---|---|---|---|
| Current (baseline) | 4.28 | 4.50 | 4.06 | 4.33 | — | 3.61 | 4.16 |
| v2 (detail emphasis) | 4.05 | 4.05 | 3.58 | 4.16 | 4.05 | 3.95 | 3.97 |
| v3 (category guidance) | 3.60 | 4.00 | 3.35 | 4.10 | 3.75 | 3.35 | 3.69 |
Human-graded mean scores (1–5 Likert scale). Each version evaluated on the same prompt set. v2 added detail emphasis and stronger instructions; v3 added per-category guidance and lighter perspective rules.
The same prompts rendered by each generation prompt version. Human scores (1–5) shown below each sprite.
Lesson learned: When the baseline model is already producing good output, adding more constraints in the system prompt consistently makes things worse. The model knows more about composing pixel art than our prescriptive rules can capture.
Same judge, different generation prompts. The baseline (current) outperformed both v2 and v3. More prescriptive instructions made sprites worse.
| Dimension | current | v2 | v3 |
|---|---|---|---|
| Comp Sep | 4.11 | 4.05 | 3.60 |
| Color | 4.32 | 4.05 | 4.00 |
| Detail | 4.05 | 3.58 | 3.35 |
| Spatial | 4.42 | 4.16 | 4.10 |
| Pixel Art | 4.05 | 4.05 | 3.75 |
| Prompt | 3.84 | 3.95 | 3.35 |
| OVERALL | 4.13 | 3.97 | 3.69 |
The eval infrastructure is solid. The generation approach may need to change.
The drawing-commands approach works well for tiles and simple objects but hits a quality ceiling for complex subjects. One alternative under consideration: generate a high-resolution image with an image-generation model, then use custom tooling to convert it down to pixel art with proper palette constraints and hard edges.
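That conversion step is itself simple to prototype: downscale by block averaging, then snap each pixel to the nearest color in a fixed palette. A sketch under assumed sizes and palette; nothing here is the project's actual tooling:

```python
import numpy as np

# Sketch of the proposed alternative: reduce a high-res image to an
# indexed-color pixel grid. Palette and sizes are illustrative.
def to_pixel_art(image, out_size, palette):
    """image: (H, W, 3) uint8 array; palette: (K, 3) uint8 array.
    Returns an (out_size, out_size) grid of palette indices."""
    h, w, _ = image.shape
    by, bx = h // out_size, w // out_size
    # Block-average downscale; hard edges come from palette snapping,
    # not from the resampling itself.
    small = image[:by * out_size, :bx * out_size].reshape(
        out_size, by, out_size, bx, 3).mean(axis=(1, 3))
    # Nearest-palette-color lookup (squared distance in RGB space).
    dists = ((small[:, :, None, :] - palette[None, None, :, :]) ** 2).sum(-1)
    return dists.argmin(-1)

palette = np.array([[0, 0, 0], [255, 255, 255], [255, 0, 0]], dtype=np.uint8)
img = np.zeros((32, 32, 3), dtype=np.uint8)
img[:, 16:] = [255, 0, 0]          # right half red, left half black
indices = to_pixel_art(img, 8, palette)
```

A production version would want a perceptual color space and possibly dithering control, but even this naive form yields genuine indexed-color output with hard edges.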
Regardless of the generation approach, the evaluation harness and calibrated judge remain valuable. The rubrics, prompt sets, human grading UI, and calibration methodology are general-purpose infrastructure for assessing pixel art quality.
- **Evaluation harness:** Prompt sets, automated judging, multi-shot generation, and statistical analysis — reusable regardless of generation method.
- **Calibrated judge:** Six-dimension rubrics validated against human scoring. MAD 0.61 overall with vision-augmented judging.
- **Human grading UI:** Web-based blind grading interface for sprite quality assessment. Supports re-grading and score comparison.
- **Generation pipeline:** Drawing-command generation with server-side rasterization. Works well for simple subjects; may evolve for complex ones.