StyleSurvey

We collect datasets that have previously been used for style-related tasks. The list was compiled in part from the resources in Huang et al., 2025, Authorship Attribution in the Era of LLMs: Problems, Methodologies, and Challenges. (They provide many machine text detection resources; since that task was not our focus, we do not repeat them here.) Authorship Attribution datasets can usually also be used for Authorship Verification. Many more datasets could be used for style-related tasks than are listed here.
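As a rough illustration of why attribution data can usually double as verification data: any corpus of author-labeled texts can be regrouped into same-author and different-author text pairs. The sketch below is one minimal way to do that, not a method taken from any of the listed papers. The function name `make_verification_pairs`, the balanced negative sampling, and the toy samples are all assumptions made for illustration.

```python
import itertools
import random
from collections import defaultdict


def make_verification_pairs(samples, seed=0):
    """Turn (author, text) samples into (text_a, text_b, same_author) pairs.

    Hypothetical helper for illustration; not part of any listed dataset.
    """
    rng = random.Random(seed)
    by_author = defaultdict(list)
    for author, text in samples:
        by_author[author].append(text)

    # Positive pairs: every combination of two texts by the same author.
    positives = [
        (a, b, True)
        for texts in by_author.values()
        for a, b in itertools.combinations(texts, 2)
    ]

    # Negative pairs: sample one cross-author pair per positive pair,
    # so the two classes are balanced (an assumption, not a requirement).
    authors = sorted(by_author)
    negatives = []
    if len(authors) >= 2:
        for _ in positives:
            first, second = rng.sample(authors, 2)
            negatives.append(
                (rng.choice(by_author[first]), rng.choice(by_author[second]), False)
            )
    return positives + negatives


# Toy attribution data, invented for this example.
samples = [
    ("alice", "Hi all, quick note on the schedule."),
    ("alice", "Hi team, quick update on the budget."),
    ("bob", "pls send the file asap thx"),
    ("bob", "cant make it today, sorry"),
]
pairs = make_verification_pairs(samples)
```

In practice, one would likely also keep the authors in the training and test pairs disjoint and control for topic, which this sketch does not attempt.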

Note that several datasets commonly used in the 2010s and earlier might no longer be accessible, or their use might not be recommended because of ToS and copyright conflicts. Proceed with caution. We try to highlight safe corpora (✓), but give no guarantees.

We welcome additions via Pull Requests!

Dataset / Collection	Domain	Link	Paper / Reference	Common Task	Availability / License	Size / Volume
Enron Email Corpus	E-mails	dataset	Klimt & Yang 2004	Authorship Attribution	✓	~500k emails
Blog Authorship Corpus	Blogs	dataset	Schler et al. 2006	Author Profiling (age, gender)	✓	~680k blog posts
Pushshift Reddit Dataset	Reddit	dataset, not available anymore	Baumgartner et al. 2020	Authorship Attribution	✗ (Reddit ToS change)	-
Million Reddit User Dataset (MUD)	Reddit	Request form	Khan et al., 2021	Authorship Attribution	✓ Request-only (research use)	~1 million users
Amazon Reviews	Reviews	dataset, also includes links to previous versions	2023: Hou et al., 2024, 2019: Ni et al., classic: McAuley 2015	Authorship Attribution	✓	~570 million reviews
Amazon Multilingual Reviews	Reviews	dataset	Keung et al., 2020	Authorship Attribution	✗ Removed	~200k reviews in 6 languages
Yelp Reviews	Reviews	dataset	-	Authorship Attribution	✓ (educational use)	~7 million reviews
IMDB Reviews	Reviews	dataset	Maas et al. 2011	Authorship Attribution	✓	50,000 movie reviews
IMDb1M	Reviews (Movies)		Seroussi et al., 2014	Authorship Attribution	? (available upon request)	~270k reviews
IMDb62	Reviews (Movies)		Seroussi et al., 2014	Authorship Attribution	? (available upon request)	~80k reviews by 62 authors
AO3 / Fanfiction.net	Fanfiction	reddit posts link1 link2	-	Authorship Attribution	✗ previous conflicts with ToS
Wikipedia Million Author Corpus	Wikipedia, multilingual	dataset	Israeli et al., 2025	Authorship Attribution	✓ CC BY-SA	~60 million text chunks
BookCorpus	Books	dataset	Zhu et al. 2015	Authorship Attribution	✗ (copyright issues)	Originally ~11k books
Gutenberg / PG-19	Books published before 1919	dataset	Rae et al. 2019	Authorship Attribution	✓ Apache 2.0	~29k books
PAN datasets	PAN	dataset	varies by dataset, see pan website	Authorship Verification, Authorship Attribution, Style Change Detection	✓ (some require permission)
Aston 100 Idiolects Corpus	Mixed (emails, essays, texts, memos)	corpus info	-	Authorship Verification, Attribution, Style across discourse types	? Permission required (Aston Institute for Forensic Linguistics)	~112 individuals (ages 18–22) across written and spoken modalities; PAN 22 Author Identification uses a subset
Forensic Linguistic Databank	Collection of datasets for forensic linguistics with varying access categories	link	Aston Institute for Forensic Linguistics		✓
Valla	standardized benchmark (Amazon, Blogs, and others)	github	Tyo et al.	Authorship Attribution, Authorship Verification, Benchmark	✗ (collection of different sources that need to be downloaded manually, several are not accessible anymore)	not providing the datasets themselves
Reuters	Reuters News stories	RCV1, RCV2, TRC2: dataset info	Lewis et al., 2004	Authorship Attribution, Topic-controlled Style Analysis	? (restricted via request)
Reuters-21578	Reuters News stories	dataset, huggingface		Topic-controlled Style Analysis	✓ (research purposes only)	~21k news articles
CMCC	Forensic (letters, notes)	corpus info	Goldstein et al., 2008	Forensic Authorship Attribution, Threat Letter Analysis	? (e-mail authors)	~3,500+ communications; authorship labels available but access controlled
Guardian Corpus	News / Journalism	dataset	Stamatatos, 2013	Authorship Attribution	? (copyright is probably with the Guardian)	~150,000 articles
STEL	Mixed (Wikipedia, Reddit, GYAFC)	dataset	Wegmann and Nguyen, 2021	Style Benchmark	✓ (larger selection available on request)	~2k sentences
SynthSTEL	GPT-4 generated sentences	huggingface	Patel et al., 2025	Style Transfer, Style Classification	✓	~4k
mSynthSTEL	GPT-4 generated multilingual sentences	huggingface	Qiu et al., 2025	Style Transfer, Style Classification	✓	~36k
xSLUE	Mixed	dataset	Kang and Hovy, 2021	Style Classification Benchmark	✓ (some datasets with restricted access)
GYAFC	Yahoo Answers	dataset	Rao and Tetreault, 2018	Style Transfer, Style Classification	✓ (available on request)	~100k sentences
PASTEL	Image Caption Stories	dataset	Kang et al., 2019	Author Profiling (Gender, Age, Country, Politics, Education, Ethnicity), Style Transfer	✓	~42k sentences
Corpus of diverse styles (COD)	Mixed (Joyce, poetry, tweets, conversational speech, biblical text)	huggingface	Krishna et al., 2020	Style Transfer, Style Classification	✓ (might be partially questionable, especially tweets)	~15 million sentences
Wikipedia Text Simplification	Wikipedia Texts simplified by annotators	github	Xu et al., 2016, Xu et al., 2015	Style Transfer, Style Classification	✓
StylePTB	Penn Treebank	github	Lyu et al., 2021	Style Transfer (parallel, pre-defined features, compositional)	✓ (transformations accessible with Creative Commons Attribution 4.0 on GitHub; underlying PTB might require a license)	21 fine-grained transfers + compositions over PTB sentences
Fisher Speech Transcripts	Transcribed telephone conversations	dataset pt. 1, pt. 2	Aggazzotti et al., 2024, 2025a, 2025b	Automatic Speech Recognition, Speaker Attribution	✓ (available with LDC license)	11,699 conversations (1,960 hours)
HANSEN	Mixed (conversations, speeches, interviews, QA, talk-shows)	huggingface	Tripto et al., 2023	Author/Speaker Attribution, Machine Text Detection	✓ (some require scraping, accepting terms)	514k human, 23k LLM
NIST SRE	Speech conversations	resource list	NIST SRE	Speaker Recognition	✓ (most require LDC license)	varies
GEDE	Essays	github	Gehring & Paaßen, 2025	Machine Text Detection	✓ CC BY-NC-SA	916 human, 12,703 LLM
RedDust	Reddit comments	corpus info	Tigunova et al., 2020	Author Profiling (profession, hobby, family status, age, gender)	? (comment IDs and classifications available, comments are not, questionable as with all Reddit data)	~300k comments
Reddit L2 corpus	Reddit comments	corpus info	Goldin et al., 2018, Rabinovich et al., 2018	Author Profiling (native language)	? (questionable as with all Reddit data)	5-14GB
StyleEmbedding data	Reddit comments	huggingface	Wegmann et al., 2022	Authorship Verification	? (openly accessible, license questionable as with all Reddit data)	~300k rows
Hiatus data	mix of genres	IARPA	Agarwal et al., 2025	Authorship Attribution	? (claimed open access of all splits etc., but the linked website only talks about development datasets gated through e-mailing)
Hiatus data	mix of genres	HuggingFace, GitHub	Man et al., 2026	Authorship Attribution	✓
CrossNews	cross-genre setup (news articles and Tweets) with author information that crosses genre	raw_data.zip on GitHub	Ma et al., 2025	Authorship Attribution, Authorship Verification	✓
MADAR Parallel Corpus	Arabic dialects (25 cities), travel domain	dataset page	Bouamor et al. 2018	Dialect Identification, Authorship Attribution	✓ (request via Google form, research license)	~12k parallel sentences across up to 25 dialects
DSL-ML 2024	VarDial 2024, Spanish, English, Portuguese, French, different domains	github	Chifu et al. 2024	Dialect Identification	✓
IDIOLEX evaluation sets	Arabic and Spanish dialects (idiolectal/dialectal variation)	github	Kantharuban et al. 2026	Dialect Identification, Authorship Attribution, Idiolectal Representation	✓ (MIT license)	-
LambdaG content-masked datasets	Mixed genres (Enron emails, Wikipedia, Perverted Justice conversations, Apricity forum, TripAdvisor reviews, blogs); content-masked with POSnoise	github (Zenodo linked from repo)	Nini et al., 2026	Authorship Verification	✓	6 content-masked datasets across genres
