
StoryScope: Investigating Idiosyncrasies in AI Fiction

A fascinating new paper on arXiv (2604.03136) introduces STORYSCOPE, a pipeline that extracts discourse-level narrative features — plot structure, character agency, temporal structure — from fiction to distinguish AI-written from human-written stories.

Unlike surface-level signals such as word choice or the overused word "delve," these structural features are robust to editing and paraphrasing, making them far harder to circumvent.

The Dataset

  • 10,272 human-written stories sourced from Books3
  • 5 LLMs generated mirrored stories: Claude, DeepSeek, Gemini, GPT, and Kimi
  • 61,608 total stories in the corpus

Three-Stage Pipeline

  1. Structured Narrative Representations — Stories are analyzed across 10 dimensions defined by the NarraBench framework.
  2. Cross-Source LLM Comparison — The narrative profiles of each model are compared against each other and against humans.
  3. Feature Discovery — 304 interpretable features are distilled into a compact fingerprint of writing style.
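
To make the idea of a narrative profile concrete, here is a minimal sketch that assumes each story has already been annotated with a set of binary narrative features. The feature names, the source labels, and the toy data are hypothetical and are not taken from the paper, whose extraction step is not reproduced here. The sketch only shows how per-source feature prevalence could be tabulated so that sources can be compared.

```python
from collections import defaultdict

# Hypothetical feature names and toy annotations (assumptions for illustration);
# these are not the paper's 304 features or its data.
STORIES = [
    {"source": "human", "features": {"nonlinear_timeline", "reader_address"}},
    {"source": "human", "features": {"intertextual_reference"}},
    {"source": "model_a", "features": {"explicit_theme_statement"}},
    {"source": "model_a", "features": {"explicit_theme_statement", "single_track_plot"}},
]


def prevalence_by_source(stories):
    """Return {source: {feature: fraction of that source's stories showing it}}."""
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for story in stories:
        totals[story["source"]] += 1
        for feature in story["features"]:
            counts[story["source"]][feature] += 1
    return {
        source: {feature: n / totals[source] for feature, n in feats.items()}
        for source, feats in counts.items()
    }


profiles = prevalence_by_source(STORIES)
print(profiles["model_a"]["explicit_theme_statement"])         # 1.0
print(profiles["human"].get("explicit_theme_statement", 0.0))  # 0.0
```

Features that no story from a source exhibits are simply absent from that source's profile, which is why the lookup for the human profile uses `.get(..., 0.0)`.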

Results

  • 93.2% macro-F1 for Human vs. AI detection using only narrative features (97% of the performance achieved when style features are also included).
  • 68.4% macro-F1 for 6-way authorship attribution (identifying which model wrote a given piece).
  • Robust to LAMP editing — still achieves 93.9% F1 after text is edited, meaning surface-level paraphrasing does not evade detection.
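
For readers unfamiliar with the metric, macro-F1 is the unweighted mean of the per-class F1 scores, so a rare class counts as much as a common one. The following is a small from-scratch sketch on made-up labels. It is not the paper's evaluation code, and the toy predictions are assumptions chosen only to show the calculation.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels (hypothetical): one human story is misclassified as AI.
y_true = ["human", "human", "ai", "ai", "ai", "ai"]
y_pred = ["human", "ai", "ai", "ai", "ai", "ai"]

score = macro_f1(y_true, y_pred)
print(round(score, 4))  # 0.7778
```

In this toy case accuracy is 5/6 (about 0.83), while macro-F1 is lower because the single error halves recall on the smaller "human" class. The same averaging applies to the 6-way attribution setting, with six per-class scores instead of two.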

Key AI vs. Human Differences

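| Dimension                 | AI                                 | Human             |
|---------------------------|------------------------------------|-------------------|
| Theme explanation         | 77% of stories over-explain themes | 52%               |
| Olfactory/sensory imagery | 81% over-describe body/senses      | 38%               |
| Plot structure            | Favors tidy single-track plots     | More nonlinearity |
| Reader address            | Rare                               | Common            |
| Intertextual references   | Rare                               | Common            |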
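
As a quick worked check on the two quantified rows of the table above, the snippet below computes the gap in percentage points and the AI-to-human ratio. The percentages are copied from the table. The helper name `gap_summary` is only an illustration and does not come from the paper.

```python
# (AI %, human %) for the two rows of the table that report percentages.
REPORTED = {
    "Theme explanation": (77, 52),
    "Olfactory/sensory imagery": (81, 38),
}


def gap_summary(ai_pct, human_pct):
    """Return (difference in percentage points, AI-to-human ratio)."""
    return ai_pct - human_pct, ai_pct / human_pct


for dimension, (ai_pct, human_pct) in REPORTED.items():
    gap, ratio = gap_summary(ai_pct, human_pct)
    print(f"{dimension}: +{gap} points, {ratio:.2f}x")
# Theme explanation: +25 points, 1.48x
# Olfactory/sensory imagery: +43 points, 2.13x
```

On these figures the sensory-imagery row shows the larger gap of the two, both in absolute points and as a ratio.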

Perhaps most interestingly, the paper reports distinct model fingerprints: each LLM leaves a characteristic narrative signature that can be identified even when the topic and genre are the same.


Source: arXiv 2604.03136 — StoryScope