nick.dev
nlp · evaluation · llm · creativity · python

AI Creativity Evaluation Framework

Human-grounded framework for evaluating creativity in AI-generated stories. Decomposes creative quality into 4 dimensions and 11 sub-components, validated through crowdsourced evaluation with 115 participants.

Problem

Current approaches to evaluating AI-generated text treat quality as a single dimension. But creativity and user satisfaction are driven by fundamentally different features — optimizing for novelty alone can actually reduce user enjoyment.

Framework

We decompose creative quality into four orthogonal dimensions:

  • Novelty (Vocabulary Freshness, Plot Uniqueness, Surprise) — captures originality
  • Resonance (Emotional Impact, Empathy, Thought-Provocation) — captures affective engagement
  • Value (Engagement, Stylistic Quality, Logical Coherence) — captures craftsmanship
  • Adherence (Topic Fidelity, Tone Fidelity) — captures constraint satisfaction
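
The decomposition above maps naturally to a simple rubric data structure. The sketch below is illustrative only: the per-dimension aggregation (a plain mean of sub-component ratings) is an assumption for demonstration, not the paper's scoring rule.

```python
# Hypothetical encoding of the 4-dimension / 11-sub-component rubric.
FRAMEWORK = {
    "Novelty": ["Vocabulary Freshness", "Plot Uniqueness", "Surprise"],
    "Resonance": ["Emotional Impact", "Empathy", "Thought-Provocation"],
    "Value": ["Engagement", "Stylistic Quality", "Logical Coherence"],
    "Adherence": ["Topic Fidelity", "Tone Fidelity"],
}

def dimension_scores(ratings: dict[str, float]) -> dict[str, float]:
    """Collapse sub-component ratings into one score per dimension.

    `ratings` maps each sub-component name to a rater's score; the mean
    per dimension is an illustrative aggregation choice.
    """
    return {
        dim: sum(ratings[c] for c in comps) / len(comps)
        for dim, comps in FRAMEWORK.items()
    }
```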

Methodology

  • Spike Prompting: Controlled generation strategy using 4 narrative tones × 3 topics = 12 stories via Gemini 3.0 Pro
  • Human Evaluation: Crowdsourced study (N=115) separating immediate affective judgments from reflective analytical ratings
  • Validation: Cronbach's α > 0.69 for all constructs, manipulation checks confirm discriminant validity
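
The reliability check uses the standard Cronbach's α formula, α = k/(k-1) · (1 - Σ σ²_item / σ²_total), over each construct's item-by-participant rating matrix. A minimal, self-contained sketch (not the study's analysis code):

```python
from statistics import pvariance

def cronbach_alpha(items: list[list[float]]) -> float:
    """Cronbach's alpha for a construct.

    `items[i][j]` is participant j's rating on item i; all items must
    have the same number of participant ratings.
    """
    k = len(items)  # number of items in the construct
    # Each participant's total score across the construct's items.
    totals = [sum(col) for col in zip(*items)]
    item_var_sum = sum(pvariance(item) for item in items)
    return k / (k - 1) * (1 - item_var_sum / pvariance(totals))
```

With perfectly consistent items the statistic reaches its ceiling of 1.0; the paper's reported α > 0.69 would come from applying this to each of the 11 sub-component constructs.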

Key Results

  • Creativity and enjoyment rely on largely non-overlapping features (regression weight correlation r = -0.057)
  • Participants who increased creativity ratings after reflection reported significantly lower enjoyment (p = 0.010)
  • Novelty drives creativity (β = 0.436) but contributes minimally to enjoyment; Value drives enjoyment (β = 0.748) but not creativity
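
The "regression weight correlation" here is the Pearson correlation between two coefficient vectors: one from regressing creativity ratings on the sub-component features, one from regressing enjoyment on the same features. A minimal sketch of that comparison, assuming the two β vectors have already been fitted (the weights passed in are hypothetical):

```python
def pearson_r(x: list[float], y: list[float]) -> float:
    """Pearson correlation between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# r near zero (as in the reported -0.057) means the features that
# predict creativity are largely disjoint from those predicting enjoyment.
```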

Impact

  • Published at NLP2026 and as arXiv preprint (2601.03698)
  • Framework adopted as evaluation basis for ongoing multimodal creativity research
  • Training reward models to align LLM-as-judge with human creativity definitions