StyleSurvey

We collect datasets that have previously been used for style-related tasks. The list also draws on resources from Huang et al., 2025, Authorship Attribution in the Era of LLMs: Problems, Methodologies, and Challenges. (They provide many machine text detection resources; since that task was not our focus, we do not repeat them here.) Authorship attribution datasets can usually also be used for authorship verification. The list is not exhaustive: many more datasets could be used for style-related tasks.
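To illustrate how an attribution dataset can double as a verification dataset, here is a minimal sketch. It assumes the data is available as (author, text) records, which is an assumption about format and not something every corpus listed here guarantees. The function name `make_verification_pairs` and the toy records are hypothetical and only serve the example.

```python
import itertools
import random
from collections import defaultdict


def make_verification_pairs(records, seed=0):
    """Turn (author, text) records into (text_a, text_b, same_author) pairs."""
    by_author = defaultdict(list)
    for author, text in records:
        by_author[author].append(text)

    # Positive pairs: every two distinct texts written by the same author.
    positives = [
        (a, b, True)
        for texts in by_author.values()
        for a, b in itertools.combinations(texts, 2)
    ]

    # Candidate negative pairs: one text each from two different authors.
    # This enumerates all cross-author pairs, which is fine for a toy
    # example but too expensive for corpora with many authors.
    cross = [
        (a, b, False)
        for (_, texts_1), (_, texts_2) in itertools.combinations(
            sorted(by_author.items()), 2
        )
        for a in texts_1
        for b in texts_2
    ]

    # Sample as many negatives as there are positives to balance the classes.
    rng = random.Random(seed)
    negatives = rng.sample(cross, min(len(positives), len(cross)))
    return positives + negatives


# Toy records; author IDs and texts are made up for illustration.
toy = [
    ("alice", "Hi all, quick update on the report."),
    ("alice", "Hi all, the meeting moved to 3pm."),
    ("bob", "yo, report's done. ping me if u need it"),
    ("bob", "yo running late, start w/o me"),
]
pairs = make_verification_pairs(toy)
```

On a real corpus one would likely also control for topic and text length when pairing, since otherwise a verifier can pick up on content instead of style.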

Note that several datasets commonly used in the 2010s and earlier might no longer be accessible, or their use might not be recommended because of ToS and copyright conflicts. Proceed with caution. We try to highlight safe corpora (✓), but give no guarantees.

We welcome additions via pull requests!

| Dataset / Collection | Domain | Link | Paper / Reference | Common Task | Availability / License | Size / Volume |
| --- | --- | --- | --- | --- | --- | --- |
| Enron Email Corpus | E-mails | dataset | Klimt & Yang 2004 | Authorship Attribution | | ~500k emails |
| Blog Authorship Corpus | Blogs | dataset | Schler et al. 2006 | Author Profiling (age, gender) | | ~680k blog posts |
| Pushshift Reddit Dataset | Reddit | dataset, not available anymore | Baumgartner et al. 2020 | Authorship Attribution | ✗ (Reddit ToS change) | - |
| Million Reddit User Dataset (MUD) | Reddit | Request form | Khan et al., 2021 | Authorship Attribution | ✓ Request-only (research use) | ~1 million users |
| Amazon Reviews | Reviews | dataset, also includes links to previous versions | 2023: Hou et al., 2024, 2019: Ni et al., classic: McAuley 2015 | Authorship Attribution | | ~570 million reviews |
| Amazon Multilingual Reviews | Reviews | dataset | Keung et al., 2020 | Authorship Attribution | ✗ Removed | ~200k reviews in 6 languages |
| Yelp Reviews | Reviews | dataset | - | Authorship Attribution | ✓ (educational use) | ~7 million reviews |
| IMDB Reviews | Reviews | dataset | Maas et al. 2011 | Authorship Attribution | | 50,000 movie reviews |
| IMDb1M | Reviews (Movies) | | Seroussi et al., 2014 | Authorship Attribution | ? (available upon request) | ~270k reviews |
| IMDb62 | Reviews (Movies) | | Seroussi et al., 2014 | Authorship Attribution | ? (available upon request) | ~80k reviews by 62 authors |
| AO3 / Fanfiction.net | Fanfiction | reddit posts link1 link2 | - | Authorship Attribution | previous conflicts with ToS | |
| Wikipedia Million Author Corpus | Wikipedia | dataset | Israeli et al., 2025 | Authorship Attribution | ✓ CC BY-SA | ~60 million text chunks |
| BookCorpus | Books | dataset | Zhu et al. 2015 | Authorship Attribution | ✗ (copyright issues) | Originally ~11k books |
| Gutenberg / PG-19 | Books published before 1919 | dataset | Rae et al. 2019 | Authorship Attribution | ✓ Apache 2.0 | ~29k books |
| PAN datasets | PAN | dataset | varies by dataset, see PAN website | Authorship Verification, Authorship Attribution, Style Change Detection | ✓ (some require permission) | |
| Aston 100 Idiolects Corpus | Mixed (emails, essays, texts, memos) | corpus info | - | Authorship Verification, Attribution, Style across discourse types | ? Permission required (Aston Institute for Forensic Linguistics) | ~112 individuals (ages 18–22) across written and spoken modalities; PAN 22 Author Identification uses a subset |
| Forensic Linguistic Databank | Collection of datasets for forensic linguistics with varying access categories | link | Aston Institute for Forensic Linguistics | | | |
| Valla | standardized benchmark (Amazon, Blogs, and others) | github | Tyo et al. | Authorship Attribution, Authorship Verification, Benchmark | ? (collection of different sources, looks largely fine though) | not providing the datasets themselves |
| Reuters | Reuters News stories | RCV1, RC2, TRC2: dataset info | Lewis et al., 2004 | Authorship Attribution, Topic-controlled Style Analysis | ? (restricted via request) | |
| Reuters-21578 | Reuters News stories | dataset, huggingface | | Topic-controlled Style Analysis | ✓ (research purposes only) | ~21k news articles |
| CMCC | Forensic (letters, notes) | corpus info | Goldstein et al., 2008 | Forensic Authorship Attribution, Threat Letter Analysis | ? (e-mail authors) | ~3,500+ communications; authorship labels available but access controlled |
| Guardian Corpus | News / Journalism | dataset | Stamatatos, 2013 | Authorship Attribution | ? (copyright is probably with the Guardian) | ~150,000 articles |
| STEL | Mixed (Wikipedia, Reddit, GYAFC) | dataset | Wegmann and Nguyen, 2021 | Style Benchmark | ✓ (larger selection available on request) | ~2k sentences |
| SynthSTEL | GPT-4 generated sentences | huggingface | Patel et al., 2025 | Style Transfer, Style Classification | | ~4k |
| mSynthSTEL | GPT-4 generated multilingual sentences | huggingface | Qiu et al., 2025 | Style Transfer, Style Classification | | ~36k |
| xSLUE | Mixed | dataset | Kang and Hovy, 2021 | Style Classification Benchmark | ✓ (some datasets with restricted access) | |
| GYAFC | Yahoo Answers | dataset | Rao and Tetreault, 2018 | Style Transfer, Style Classification | ✓ (available on request) | ~100k sentences |
| PASTEL | Image Caption Stories | dataset | Kang et al., 2019 | Author Profiling (Gender, Age, Country, Politics, Education, Ethnicity), Style Transfer | | ~42k sentences |
| Corpus of diverse styles (COD) | Mixed (Joyce, poetry, tweets, conversational speech, biblical text) | huggingface | Krishna et al., 2020 | Style Transfer, Style Classification | ✓ (might be partially questionable, especially tweets) | ~15 million sentences |
| Wikipedia Text Simplification | Wikipedia Texts simplified by annotators | github | Xu et al., 2016, Xu et al., 2015 | Style Transfer, Style Classification | | |
| Fisher Speech Transcripts | Transcribed telephone conversations | dataset pt. 1, pt. 2 | Aggazzotti et al., 2024, 2025a, 2025b | Automatic Speech Recognition, Speaker Attribution | ✓ (available with LDC license) | 11,699 conversations (1,960 hours) |
| HANSEN | Mixed (conversations, speeches, interviews, QA, talk-shows) | huggingface | Tripto et al., 2023 | Author/Speaker Attribution, Machine Text Detection | ✓ (some require scraping, accepting terms) | 514k human, 23k LLM |
| NIST SRE | Speech conversations | resource list | NIST SRE | Speaker Recognition | ✓ (most require LDC license) | varies |
| GEDE | Essays | github | Gehring & Paaßen, 2025 | Machine Text Detection | ✓ CC BY-NC-SA | 916 human, 12,703 LLM |