Tags: nlp · evaluation · llm · creativity · python
AI Creativity Evaluation Framework
Human-grounded framework for evaluating creativity in AI-generated stories. Decomposes creative quality into 4 dimensions and 11 sub-components, validated through crowdsourced evaluation with 115 participants.
Problem
Most current approaches to evaluating AI-generated text treat quality as a single scalar. But creativity and user satisfaction are driven by fundamentally different features: optimizing for novelty alone can reduce user enjoyment.
Framework
We decompose creative quality into four orthogonal dimensions:
- Novelty (Vocabulary Freshness, Plot Uniqueness, Surprise) — captures originality
- Resonance (Emotional Impact, Empathy, Thought-Provocation) — captures affective engagement
- Value (Engagement, Stylistic Quality, Logical Coherence) — captures craftsmanship
- Adherence (Topic Fidelity, Tone Fidelity) — captures constraint satisfaction
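The decomposition above can be sketched as a simple scoring schema. This is an illustrative sketch, not the paper's implementation: the dict layout, the `dimension_scores` helper, and the averaging rule are our assumptions.

```python
# Illustrative sketch (not the authors' code): the four dimensions and
# their eleven sub-components, plus a simple aggregate that averages
# sub-component ratings per dimension.
FRAMEWORK = {
    "Novelty": ["Vocabulary Freshness", "Plot Uniqueness", "Surprise"],
    "Resonance": ["Emotional Impact", "Empathy", "Thought-Provocation"],
    "Value": ["Engagement", "Stylistic Quality", "Logical Coherence"],
    "Adherence": ["Topic Fidelity", "Tone Fidelity"],
}

def dimension_scores(ratings: dict[str, float]) -> dict[str, float]:
    """Average per-sub-component ratings (e.g. 1-5 Likert) per dimension."""
    return {
        dim: sum(ratings[sub] for sub in subs) / len(subs)
        for dim, subs in FRAMEWORK.items()
    }
```

A flat per-sub-component rating dict maps cleanly onto crowdsourced questionnaire items, while the dimension-level averages are what the downstream regressions would consume.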
Methodology
- Spike Prompting: Controlled generation strategy using 4 narrative tones × 3 topics = 12 stories via Gemini 3.0 Pro
- Human Evaluation: Crowdsourced study (N=115) separating immediate affective judgments from reflective analytical ratings
- Validation: Cronbach's α > 0.69 for all constructs; manipulation checks confirm discriminant validity
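The internal-consistency check uses the standard Cronbach's α formula. A stdlib-only sketch, assuming ratings are arranged as one list per sub-component item (the function name and data layout are ours, not the study's):

```python
from statistics import pvariance

def cronbach_alpha(items: list[list[float]]) -> float:
    """Cronbach's alpha: alpha = k/(k-1) * (1 - sum(item variances) / variance of totals).

    `items` holds k lists, one per questionnaire item, each with one
    rating per participant (all lists the same length n).
    """
    k = len(items)
    item_vars = sum(pvariance(item) for item in items)
    # Per-participant total score across all k items.
    totals = [sum(scores) for scores in zip(*items)]
    return k / (k - 1) * (1 - item_vars / pvariance(totals))
```

Perfectly parallel items yield α = 1.0, and α rises toward 1 as items co-vary more strongly; the study's threshold of 0.69 is a conventional bar for acceptable reliability.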
Key Results
- Creativity and enjoyment rely on largely non-overlapping features (regression weight correlation r = -0.057)
- Participants who increased creativity ratings after reflection reported significantly lower enjoyment (p = 0.010)
- Novelty drives creativity (β = 0.436) but contributes minimally to enjoyment; Value drives enjoyment (β = 0.748) but not creativity
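The weight-correlation result can be illustrated by computing a Pearson correlation between two per-feature coefficient vectors. A minimal sketch: the weight values below are hypothetical placeholders, not the study's fitted coefficients.

```python
from math import sqrt

def pearson_r(x: list[float], y: list[float]) -> float:
    """Pearson correlation between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical per-feature regression weights (NOT the paper's values):
# features that predict creativity need not predict enjoyment, so the
# correlation between the two weight vectors can be near zero or negative.
creativity_w = [0.44, 0.30, 0.25, 0.10, 0.05, -0.02]
enjoyment_w = [0.05, -0.03, 0.10, 0.40, 0.35, 0.30]
r = pearson_r(creativity_w, enjoyment_w)
```

A near-zero (or slightly negative) `r` between the two weight vectors is exactly the "non-overlapping features" pattern reported above.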
Impact
- Published at NLP2026 and as arXiv preprint (2601.03698)
- Framework adopted as evaluation basis for ongoing multimodal creativity research
- Training reward models to align LLM-as-judge with human creativity definitions