We collect datasets that have been used for style-related tasks before. These datasets were also compiled with resources from Huang et al., 2025, Authorship Attribution in the Era of LLMs: Problems, Methodologies, and Challenges. (Since they provide many machine text detection resources and that task was not our focus, we did not repeat them here.) Authorship Attribution datasets can usually also be used for authorship verification. Note that many more datasets could be used for style-related tasks.
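
To illustrate the point about reusing attribution data for verification, here is a minimal sketch (not a method prescribed by any of the listed papers). It assumes the attribution data is a list of `(author, text)` tuples. The function name `make_verification_pairs`, the toy samples, and the balanced sampling of negatives are illustrative assumptions. Real setups often also control for topic and text length when pairing.

```python
import random
from collections import defaultdict
from itertools import combinations


def make_verification_pairs(samples, seed=0):
    """Turn (author, text) samples into (text_a, text_b, same_author) pairs.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)

    # Group texts by their author label.
    by_author = defaultdict(list)
    for author, text in samples:
        by_author[author].append(text)

    # Positive pairs: all pairs of distinct texts by the same author.
    # (For large corpora you would subsample instead of enumerating all.)
    positives = [
        (text_a, text_b, True)
        for texts in by_author.values()
        for text_a, text_b in combinations(texts, 2)
    ]

    # Negative pairs: sample one cross-author pair per positive pair
    # so that both classes are balanced.
    authors = sorted(by_author)
    negatives = []
    if len(authors) >= 2:
        for _ in range(len(positives)):
            first, second = rng.sample(authors, 2)
            negatives.append(
                (rng.choice(by_author[first]), rng.choice(by_author[second]), False)
            )

    pairs = positives + negatives
    rng.shuffle(pairs)
    return pairs


# Toy attribution data (invented for this example).
samples = [
    ("alice", "I reckon we ought to leave soon."),
    ("alice", "I reckon the weather will turn."),
    ("bob", "gonna head out now, cya"),
    ("bob", "lol that was wild"),
    ("carol", "Please find the report attached."),
]

pairs = make_verification_pairs(samples)
for text_a, text_b, same_author in pairs:
    print(same_author, "|", text_a, "|", text_b)
```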
Note that several datasets commonly used in the 2010s and earlier may no longer be accessible, or may not be recommended because of ToS and copyright conflicts. Proceed with caution. We try to highlight safe corpora (✓), but give no guarantees.
We welcome additions via Pull Requests!
| Dataset / Collection | Domain | Link | Paper / Reference | Common Task | Availability / License | Size / Volume |
|---|---|---|---|---|---|---|
| Enron Email Corpus | E-mails | dataset | Klimt & Yang 2004 | Authorship Attribution | ✓ | ~500k emails |
| Blog Authorship Corpus | Blogs | dataset | Schler et al. 2006 | Author Profiling (age, gender) | ✓ | ~680k blog posts |
| Pushshift Reddit Dataset | Reddit | dataset, not available anymore | Baumgartner et al. 2020 | Authorship Attribution | ✗ (Reddit ToS change) | - |
| Million Reddit User Dataset (MUD) | Reddit | Request form | Khan et al., 2021 | Authorship Attribution | ✓ Request-only (research use) | ~1 million users |
| Amazon Reviews | Reviews | dataset, also includes links to previous versions | 2023: Hou et al., 2024; 2019: Ni et al.; classic: McAuley 2015 | Authorship Attribution | ✓ | ~570 million reviews |
| Amazon Multilingual Reviews | Reviews | dataset | Keung et al., 2020 | Authorship Attribution | ✗ Removed | ~200k reviews in 6 languages |
| Yelp Reviews | Reviews | dataset | - | Authorship Attribution | ✓ (educational use) | ~7 million reviews |
| IMDB Reviews | Reviews | dataset | Maas et al. 2011 | Authorship Attribution | ✓ | 50,000 movie reviews |
| IMDb1M | Reviews (Movies) | - | Seroussi et al., 2014 | Authorship Attribution | ? (available upon request) | ~270k reviews |
| IMDb62 | Reviews (Movies) | - | Seroussi et al., 2014 | Authorship Attribution | ? (available upon request) | ~80k reviews by 62 authors |
| AO3 / Fanfiction.net | Fanfiction | reddit posts link1 link2 | - | Authorship Attribution | ✗ previous conflicts with ToS | |
| Wikipedia Million Author Corpus | Wikipedia | dataset | Israeli et al., 2025 | Authorship Attribution | ✓ CC BY-SA | ~60 million text chunks |
| BookCorpus | Books | dataset | Zhu et al. 2015 | Authorship Attribution | ✗ (copyright issues) | Originally ~11k books |
| Gutenberg / PG-19 | Books published before 1919 | dataset | Rae et al. 2019 | Authorship Attribution | ✓ Apache 2.0 | ~29k books |
| PAN datasets | PAN | dataset | varies by dataset, see the PAN website | Authorship Verification, Authorship Attribution, Style Change Detection | ✓ (some require permission) | |
| Aston 100 Idiolects Corpus | Mixed (emails, essays, texts, memos) | corpus info | - | Authorship Verification, Attribution, Style across discourse types | ? Permission required (Aston Institute for Forensic Linguistics) | ~112 individuals (ages 18–22) across written and spoken modalities; PAN 22 Author Identification uses a subset |
| Forensic Linguistic Databank | Collection of datasets for forensic linguistics with varying access categories | link | Aston Institute for Forensic Linguistics | - | ✓ | |
| Valla | standardized benchmark (Amazon, Blogs, and others) | github | Tyo et al. | Authorship Attribution, Authorship Verification, Benchmark | ? (collection of different sources, looks largely fine though) | not providing the datasets themselves |
| Reuters | Reuters News stories | RCV1, RCV2, TRC2: dataset info | Lewis et al., 2004 | Authorship Attribution, Topic-controlled Style Analysis | ? (restricted via request) | |
| Reuters-21578 | Reuters News stories | dataset, huggingface | - | Topic-controlled Style Analysis | ✓ (research purposes only) | ~21k news articles |
| CMCC | Forensic (letters, notes) | corpus info | Goldstein et al., 2008 | Forensic Authorship Attribution, Threat Letter Analysis | ? (e-mail authors) | ~3,500+ communications; authorship labels available but access controlled |
| Guardian Corpus | News / Journalism | dataset | Stamatatos, 2013 | Authorship Attribution | ? (copyright is probably with the Guardian) | ~150,000 articles |
| STEL | Mixed (Wikipedia, Reddit, GYAFC) | dataset | Wegmann and Nguyen, 2021 | Style Benchmark | ✓ (larger selection available on request) | ~2k sentences |
| SynthSTEL | GPT-4 generated sentences | huggingface | Patel et al., 2025 | Style Transfer, Style Classification | ✓ | ~4k |
| mSynthSTEL | GPT-4 generated multilingual sentences | huggingface | Qiu et al., 2025 | Style Transfer, Style Classification | ✓ | ~36k |
| xSLUE | Mixed | dataset | Kang and Hovy, 2021 | Style Classification Benchmark | ✓ (some datasets with restricted access) | |
| GYAFC | Yahoo Answers | dataset | Rao and Tetreault, 2018 | Style Transfer, Style Classification | ✓ (available on request) | ~100k sentences |
| PASTEL | Image Caption Stories | dataset | Kang et al., 2019 | Author Profiling (Gender, Age, Country, Politics, Education, Ethnicity), Style Transfer | ✓ | ~42k sentences |
| Corpus of diverse styles (COD) | Mixed (Joyce, poetry, tweets, conversational speech, biblical text) | huggingface | Krishna et al., 2020 | Style Transfer, Style Classification | ✓ (might be partially questionable, especially tweets) | ~15 million sentences |
| Wikipedia Text Simplification | Wikipedia Texts simplified by annotators | github | Xu et al., 2016, Xu et al., 2015 | Style Transfer, Style Classification | ✓ | |
| Fisher Speech Transcripts | Transcribed telephone conversations | dataset pt. 1, pt. 2 | Aggazzotti et al., 2024, 2025a, 2025b | Automatic Speech Recognition, Speaker Attribution | ✓ (available with LDC license) | 11,699 conversations (1,960 hours) |
| HANSEN | Mixed (conversations, speeches, interviews, QA, talk-shows) | huggingface | Tripto et al., 2023 | Author/Speaker Attribution, Machine Text Detection | ✓ (some require scraping, accepting terms) | 514k human, 23k LLM |
| NIST SRE | Speech conversations | resource list | NIST SRE | Speaker Recognition | ✓ (most require LDC license) | varies |
| GEDE | Essays | github | Gehring & Paaßen, 2025 | Machine Text Detection | ✓ CC BY-NC-SA | 916 human, 12,703 LLM |