Rethinking How We Evaluate AI Creativity
The Problem
When someone says an AI story is "creative," what do they actually mean? After running a controlled study with 115 readers evaluating AI-generated stories, the answer surprised us: creativity and enjoyment operate on fundamentally different axes.
Most LLM evaluation treats "creative output" as a single score. But our research at Kurohashi Lab found that creativity judgments rely primarily on novelty, while enjoyment depends on emotional resonance. Optimizing for one can actively hurt the other.
The Framework
We proposed a four-dimensional evaluation framework:
- Novelty — Is this surprising or unexpected?
- Value — Does this contribute something meaningful?
- Adherence — Does it follow the constraints of the prompt?
- Resonance — Does the reader feel something?
The key finding: these dimensions are evaluated hierarchically, not cumulatively. Readers don't average across all four — they gate on novelty first, then assess the rest.
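The gating behavior described above can be made concrete with a minimal sketch. Everything here is illustrative: the 0-1 rating scale, the `novelty_gate` threshold, and the equal weighting of the remaining dimensions are assumptions for exposition, not the paper's fitted model.

```python
from dataclasses import dataclass

@dataclass
class StoryRatings:
    novelty: float    # 0-1: surprising or unexpected?
    value: float      # 0-1: contributes something meaningful?
    adherence: float  # 0-1: follows the prompt's constraints?
    resonance: float  # 0-1: makes the reader feel something?

def creativity_judgment(r: StoryRatings, novelty_gate: float = 0.5) -> float:
    """Hierarchical (gated) judgment, not a cumulative average.

    Stories below the novelty gate are judged "not creative" outright;
    only stories that pass the gate have the other dimensions assessed.
    """
    if r.novelty < novelty_gate:
        return 0.0
    return (r.value + r.adherence + r.resonance) / 3
```

Note how this differs from averaging: a story scoring 0.9 on value, adherence, and resonance still gets a zero creativity judgment if its novelty falls below the gate.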
Spike Prompting
To generate stories with controlled creativity levels, we developed "Spike Prompting" — a technique that deliberately amplifies specific creativity dimensions while holding others constant. This let us isolate which dimensions drive which judgments.
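The shape of such a prompt can be sketched as follows. The dimension names come from the framework above, but the instruction wording and the `spike_prompt` helper are hypothetical reconstructions, not the paper's actual templates.

```python
# Hypothetical per-dimension amplifier instructions (illustrative wording).
SPIKE_INSTRUCTIONS = {
    "novelty": "Take the story somewhere genuinely unexpected.",
    "value": "Make the story say something meaningful about its theme.",
    "adherence": "Follow every constraint in the prompt exactly.",
    "resonance": "Make the reader feel the protagonist's emotions directly.",
}

def spike_prompt(base_prompt: str, spike_dim: str) -> str:
    """Amplify one creativity dimension; ask for the rest at a baseline level."""
    spike = SPIKE_INSTRUCTIONS[spike_dim]
    held = [d for d in SPIKE_INSTRUCTIONS if d != spike_dim]
    return (
        f"{base_prompt}\n\n"
        f"Emphasize strongly: {spike}\n"
        f"Keep these qualities at an ordinary, baseline level: {', '.join(held)}."
    )
```

Generating one story per spiked dimension from the same base prompt yields a set of stories that differ (by instruction) along a single axis, which is what makes the dimension-by-dimension comparison possible.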
What This Means for LLM Development
If you're building systems that generate creative text, optimizing for novelty alone (as most RLHF pipelines do) is a trap. You'll get higher "creativity scores" but lower user satisfaction. The reward model needs to capture resonance separately — which is exactly what we're working on next.
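One way to see the trap is to score candidates with two separate signals and vary how much weight resonance gets. This toy selector is an assumption-laden sketch (the scores, weights, and candidates are invented), not the reward model under development.

```python
def combined_reward(novelty: float, resonance: float,
                    w_resonance: float = 0.5) -> float:
    # w_resonance = 0 reproduces novelty-only optimization: the most
    # novel candidate wins even if readers feel nothing.
    return (1.0 - w_resonance) * novelty + w_resonance * resonance

def pick_best(candidates, w_resonance: float = 0.5):
    """candidates: list of (story_id, novelty_score, resonance_score)."""
    return max(candidates,
               key=lambda c: combined_reward(c[1], c[2], w_resonance))[0]
```

With a bizarre-but-hollow candidate and a familiar-but-moving one, the novelty-only selector picks the former and any selector that weights resonance meaningfully picks the latter, which is the score-vs-satisfaction gap the paragraph above describes.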
The full paper is on arXiv (2601.03698).