StoryScope: Investigating Idiosyncrasies in AI Fiction
A fascinating new paper on arXiv (2604.03136) introduces STORYSCOPE, a pipeline that extracts discourse-level narrative features — plot structure, character agency, temporal structure — from fiction to distinguish AI-written from human-written stories.
Unlike surface-level signals such as word choice or the overused word "delve," these structural features are robust to editing and paraphrasing, making them far harder to circumvent.
The Dataset
- 10,272 human-written stories sourced from Books3
- Mirrored stories generated by five LLMs: Claude, DeepSeek, Gemini, GPT, and Kimi
- 61,608 total stories in the corpus
Three-Stage Pipeline
- Structured Narrative Representations — Stories are analyzed across 10 dimensions defined by the NarraBench framework.
- Cross-Source LLM Comparison — The narrative profiles of each model are compared against each other and against humans.
- Feature Discovery — 304 interpretable features are distilled into a compact fingerprint of writing style.
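To make the idea of a per-source narrative profile concrete, here is a minimal Python sketch. It assumes each story has already been reduced to a set of binary narrative features; the feature names, source labels, and toy stories are hypothetical and are not taken from the paper, and this is not the paper's implementation.

```python
from collections import defaultdict


def source_profiles(stories):
    """Return {source: {feature: fraction of that source's stories showing it}}.

    `stories` is an iterable of (source, set_of_feature_names) pairs.
    """
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for source, features in stories:
        totals[source] += 1
        for name in features:
            counts[source][name] += 1
    return {
        source: {name: n / totals[source] for name, n in feats.items()}
        for source, feats in counts.items()
    }


# Hypothetical toy data: feature names are illustrative only.
toy_stories = [
    ("human", {"nonlinear_timeline", "reader_address"}),
    ("human", {"intertextual_reference"}),
    ("model_a", {"explicit_theme_statement", "single_track_plot"}),
    ("model_a", {"explicit_theme_statement"}),
]

profiles = source_profiles(toy_stories)
for source, profile in sorted(profiles.items()):
    print(source, dict(sorted(profile.items())))
```

Comparing such profiles across sources is one simple way to see which features separate one writer from another.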
Results
- 93.2% macro-F1 for Human vs. AI detection using only narrative features (97% of the performance achieved when style features are included as well).
- 68.4% macro-F1 for 6-way authorship attribution (identifying whether a given piece was written by a human or by one of the five models).
- Robust to LAMP editing — still achieves 93.9% F1 after text is edited, meaning surface-level paraphrasing does not evade detection.
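Macro-F1, the metric quoted in these results, averages the per-class F1 scores with equal weight, so a small class counts as much as a large one. The sketch below computes it in plain Python; the toy labels are illustrative only, and this is not the paper's evaluation code.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Illustrative toy labels, not data from the paper.
y_true = ["human", "human", "ai", "ai"]
y_pred = ["human", "ai", "ai", "ai"]
score = macro_f1(y_true, y_pred)
print(round(score, 4))
```

The same function works unchanged for the 6-way attribution setting, since it simply iterates over whatever labels are present.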
Key AI vs. Human Differences
| Dimension | AI | Human |
|---|---|---|
| Theme explanation | 77% of stories over-explain themes | 52% |
| Olfactory/sensory imagery | 81% over-describe body/senses | 38% |
| Plot structure | Favors tidy single-track plots | More nonlinearity |
| Reader address | Rare | Common |
| Intertextual references | Rare | Common |
Perhaps most interestingly, the paper discovered distinct model fingerprints — meaning each LLM leaves a unique narrative signature that can be identified even when the topic and genre are the same.
Source: arXiv 2604.03136 — StoryScope