Tags: nlp · evaluation · llm · creativity · python
AI Creativity Evaluation Framework
Human-grounded framework for evaluating creativity in AI-generated stories. Decomposes creative quality into 4 dimensions and 11 sub-components, validated through crowdsourced evaluation with 115 participants.
Problem
Most current approaches to evaluating AI-generated text treat quality as a single scalar. But creativity and user satisfaction are driven by fundamentally different features: optimizing for novelty alone can reduce user enjoyment.
Framework
We decompose creative quality into four orthogonal dimensions:
- Novelty (Vocabulary Freshness, Plot Uniqueness, Surprise) — captures originality
- Resonance (Emotional Impact, Empathy, Thought-Provocation) — captures affective engagement
- Value (Engagement, Stylistic Quality, Logical Coherence) — captures craftsmanship
- Adherence (Topic Fidelity, Tone Fidelity) — captures constraint satisfaction
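The decomposition above can be sketched as a simple scoring schema. This is an illustrative sketch, not the paper's implementation: the dict layout, the `dimension_scores` helper, and the averaging rule are our assumptions.

```python
# Illustrative sketch (not the authors' code): the four dimensions and
# their eleven sub-components, plus a simple aggregate that averages
# sub-component ratings per dimension.
FRAMEWORK = {
    "Novelty": ["Vocabulary Freshness", "Plot Uniqueness", "Surprise"],
    "Resonance": ["Emotional Impact", "Empathy", "Thought-Provocation"],
    "Value": ["Engagement", "Stylistic Quality", "Logical Coherence"],
    "Adherence": ["Topic Fidelity", "Tone Fidelity"],
}

def dimension_scores(ratings: dict[str, float]) -> dict[str, float]:
    """Average per-sub-component ratings (e.g. 1-5 Likert) per dimension."""
    return {
        dim: sum(ratings[sub] for sub in subs) / len(subs)
        for dim, subs in FRAMEWORK.items()
    }
```

A flat per-sub-component rating dict maps cleanly onto crowdsourced questionnaire items, while the dimension-level averages are what the downstream regressions would consume.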
Methodology
- Spike Prompting: Controlled generation strategy using 4 narrative tones × 3 topics = 12 stories via Gemini 3.0 Pro
- Human Evaluation: Crowdsourced study (N=115) separating immediate affective judgments from reflective analytical ratings
- Validation: Cronbach's α > 0.69 for all constructs; manipulation checks confirm discriminant validity
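The internal-consistency check uses the standard Cronbach's α formula. A stdlib-only sketch, assuming ratings are arranged as one list per sub-component item (the function name and data layout are ours, not the study's):

```python
from statistics import pvariance

def cronbach_alpha(items: list[list[float]]) -> float:
    """Cronbach's alpha: alpha = k/(k-1) * (1 - sum(item variances) / variance of totals).

    `items` holds k lists, one per questionnaire item, each with one
    rating per participant (all lists the same length n).
    """
    k = len(items)
    item_vars = sum(pvariance(item) for item in items)
    # Per-participant total score across all k items.
    totals = [sum(scores) for scores in zip(*items)]
    return k / (k - 1) * (1 - item_vars / pvariance(totals))
```

Perfectly parallel items yield α = 1.0, and α rises toward 1 as items co-vary more strongly; the study's threshold of 0.69 is a conventional bar for acceptable reliability.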
Key Results
- Creativity and enjoyment rely on largely non-overlapping features (regression weight correlation r = -0.057)
- Participants who increased creativity ratings after reflection reported significantly lower enjoyment (p = 0.010)
- Novelty drives creativity (β = 0.436) but contributes minimally to enjoyment; Value drives enjoyment (β = 0.748) but not creativity
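The weight-correlation result can be illustrated by computing a Pearson correlation between two per-feature coefficient vectors. A minimal sketch: the weight values below are hypothetical placeholders, not the study's fitted coefficients.

```python
from math import sqrt

def pearson_r(x: list[float], y: list[float]) -> float:
    """Pearson correlation between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical per-feature regression weights (NOT the paper's values):
# features that predict creativity need not predict enjoyment, so the
# correlation between the two weight vectors can be near zero or negative.
creativity_w = [0.44, 0.30, 0.25, 0.10, 0.05, -0.02]
enjoyment_w = [0.05, -0.03, 0.10, 0.40, 0.35, 0.30]
r = pearson_r(creativity_w, enjoyment_w)
```

A near-zero (or slightly negative) `r` between the two weight vectors is exactly the "non-overlapping features" pattern reported above.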
Impact
- Published at NLP2026 and as arXiv preprint (2601.03698)
- Framework adopted as evaluation basis for ongoing multimodal creativity research
- Training reward models to align LLM-as-judge with human creativity definitions