StyleSurvey

Papers and works from the Style Survey

We welcome contributions via GitHub issues or pull requests to collect relevant style papers that are not mentioned in our survey. Note that this list contains all references from our survey, including works that may not be directly related to style or style representations.

This page is intended to list the references from the Style Survey paper as a sortable, searchable table.

To filter the table, start typing in the search box below; click any column header to sort.

Title · Authors · Link · Year · Venue
Writeprints: A stylometric approach to identity-level identification and similarity detection in cyberspace
One of the problems often associated with online anonymity is that it hinders social accountability, as substantiated by the high levels of cybercrime. Although identity cues are scarce in cyberspace, individuals often leave behind textual identity traces. In this study we proposed the use of stylometric analysis techniques to help identify individuals based on writing style. We incorporated a rich set of stylistic features, including lexical, syntactic, structural, content-specific, and idiosyncratic attributes. We also developed the Writeprints technique for identification and similarity detection of anonymous identities. Writeprints is a Karhunen-Loeve transforms-based technique that uses a sliding window and pattern disruption algorithm with individual author-level feature sets. The Writeprints technique and extended feature set were evaluated on a testbed encompassing four online datasets spanning different domains: email, instant messaging, feedback comments, and program code. Writeprints outperformed benchmark techniques, including SVM, Ensemble SVM, PCA, and standard Karhunen-Loeve transforms, on the identification and similarity detection tasks with accuracy as high as 94% when differentiating between 100 authors. The extended feature set also significantly outperformed a baseline set of features commonly used in previous research. Furthermore, individual-author-level feature sets generally outperformed use of a single group of attributes.
Ahmed Abbasi and Hsinchun Chen · link · 2008 · ACM Transactions on Information Systems (TOIS), 26(2):1–29
Fine-grained Analysis of Sentence Embeddings Using Auxiliary Prediction Tasks
There is a lot of research interest in encoding variable length sentences into fixed length vectors, in a way that preserves the sentence meanings. Two common methods include representations based on averaging word vectors, and representations based on the hidden states of recurrent neural networks such as LSTMs. The sentence vectors are used as features for subsequent machine learning tasks or for pre-training in the context of deep learning. However, not much is known about the properties that are encoded in these sentence representations and about the language information they capture. We propose a framework that facilitates better understanding of the encoded representations. We define prediction tasks around isolated aspects of sentence structure (namely sentence length, word content, and word order), and score representations by the ability to train a classifier to solve each prediction task when using the representation as input. We demonstrate the potential contribution of the approach by analyzing different sentence representation mechanisms. The analysis sheds light on the relative strengths of different sentence embedding methods with respect to these low level prediction tasks, and on the effect of the encoded vector’s dimensionality on the resulting representations.
Yossi Adi, Einat Kermany, Yonatan Belinkov, Ofer Lavi, and Yoav Goldberg · link · 2017 · ICLR
Can Authorship Attribution Models Distinguish Speakers in Speech Transcripts?
Authorship verification is the task of determining if two distinct writing samples share the same author and is typically concerned with the attribution of written text. In this paper, we explore the attribution of transcribed speech, which poses novel challenges. The main challenge is that many stylistic features, such as punctuation and capitalization, are not informative in this setting. On the other hand, transcribed speech exhibits other patterns, such as filler words and backchannels (e.g., um, uh-huh), which may be characteristic of different speakers. We propose a new benchmark for speaker attribution focused on human-transcribed conversational speech transcripts. To limit spurious associations of speakers with topic, we employ both conversation prompts and speakers participating in the same conversation to construct verification trials of varying difficulties. We establish the state of the art on this new benchmark by comparing a suite of neural and non-neural baselines, finding that although written text attribution models achieve surprisingly good performance in certain settings, they perform markedly worse as conversational topic is increasingly controlled. We present analyses of the impact of transcription style on performance as well as the ability of fine-tuning on speech transcripts to improve performance.
Cristina Aggazzotti, Nicholas Andrews, and Elizabeth Allyn Smith · link · 2024 · Transactions of the Association for Computational Linguistics, 12:875–891
Content Anonymization for Privacy in Long-form Audio
Voice anonymization techniques have been found to successfully obscure a speaker's acoustic identity in short, isolated utterances in benchmarks such as the VoicePrivacy Challenge. In practice, however, utterances seldom occur in isolation: long-form audio is commonplace in domains such as interviews, phone calls, and meetings. In these cases, many utterances from the same speaker are available, which pose a significantly greater privacy risk: given multiple utterances from the same speaker, an attacker could exploit an individual's vocabulary, syntax, and turns of phrase to re-identify them, even when their voice is completely disguised. To address this risk, we propose a new approach that performs a contextual rewriting of the transcripts in an ASR-TTS pipeline to eliminate speaker-specific style while preserving meaning. We present results in a long-form telephone conversation setting demonstrating the effectiveness of a content-based attack on voice-anonymized speech. Then we show how the proposed content-based anonymization methods can mitigate this risk while preserving speech utility. Overall, we find that paraphrasing is an effective defense against content-based attacks and recommend that stakeholders adopt this step to ensure anonymity in long-form audio.
Cristina Aggazzotti, Ashi Garg, Zexin Cai, and Nicholas Andrews · link · 2025 · arXiv preprint arXiv:2510.12780
The impact of automatic speech transcription on speaker attribution
Speaker attribution from speech transcripts is the task of identifying a speaker from the transcript of their speech based on patterns in their language use. This task is especially useful when the audio is unavailable (e.g. deleted) or unreliable (e.g. anonymized speech). Prior work in this area has primarily focused on the feasibility of attributing speakers using transcripts produced by human annotators. However, in real-world settings, one often only has more errorful transcripts produced by automatic speech recognition (ASR) systems. In this paper, we conduct what is, to our knowledge, the first comprehensive study of the impact of automatic transcription on speaker attribution performance. In particular, we study the extent to which speaker attribution performance degrades in the face of transcription errors, as well as how properties of the ASR system impact attribution. We find that attribution is surprisingly resilient to word-level transcription errors and that the objective of recovering the true transcript is minimally correlated with attribution performance. Overall, our findings suggest that speaker attribution on more errorful transcripts produced by ASR is as good, if not better, than attribution based on human-transcribed data, possibly because ASR transcription errors can capture speaker-specific features revealing of speaker identity.
Cristina Aggazzotti, Matthew Wiesner, Elizabeth Allyn Smith, and Nicholas Andrews · link · 2025 · Transactions of the Association for Computational Linguistics, in press
Neurobiber: Fast and Interpretable Stylistic Feature Extraction
Linguistic style is pivotal for understanding how texts convey meaning and fulfill communicative purposes, yet extracting detailed stylistic features at scale remains challenging. We present Neurobiber, a transformer-based system for fast, interpretable style profiling built on Biber's Multidimensional Analysis (MDA). Neurobiber predicts 96 Biber-style features from our open-source BiberPlus library (a Python toolkit that computes stylistic features and provides integrated analytics, e.g., PCA and factor analysis). Despite being up to 56 times faster than existing open source systems, Neurobiber replicates classic MDA insights on the CORE corpus and achieves competitive performance on the PAN 2020 authorship verification task without extensive retraining. Its efficient and interpretable representations readily integrate into downstream NLP pipelines, facilitating large-scale stylometric research, forensic analysis, and real-time text monitoring. All components are made publicly available.
Kenan Alkiek, Anna Wegmann, Jian Zhu, and David Jurgens · link · 2025 · arXiv preprint arXiv:2502.18590
SmolLM2: When Smol Goes Big — Data-Centric Training of a Fully Open Small Language Model
Large language models, while groundbreaking, are computationally expensive and difficult to deploy in resource-constrained settings. To address this challenge, small language models have emerged, but their performance critically depends on the quality and composition of the pretraining datasets—yet many recent models, such as Qwen2.5-1.5B and Llama3.2-1B, remain opaque about their training data, limiting reproducibility and scientific understanding. In this paper, we document and publicly release SmolLM2, a fully transparent state-of-the-art "small" (1.7 billion parameter) language model (LM), along with its training datasets and code. To attain strong performance, we overtrain SmolLM2 on 11 trillion tokens of data using a multi-stage training process that mixes web text with specialized math, code, and instruction-following data. We additionally curate and release new specialized datasets (FineMath, Stack-Edu, and SmolTalk) at stages where we found existing datasets to be problematically small or low-quality. To inform our design decisions, we perform both small-scale ablations and a manual refinement process that updates the dataset mixing rates at each stage based on the performance at the previous one. Ultimately, we demonstrate that SmolLM2 outperforms other recent small LMs including Qwen2.5-1.5B, Llama3.2-1B, and Falcon3-1.6B. By releasing our model, datasets, and code, we aim to facilitate future research on LM development as well as applications of small LMs.
Team SmolLM, Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, et al. · link · 2025 · COLM
Masks and mimicry: Strategic obfuscation and impersonation attacks on authorship verification
The increasing use of Artificial Intelligence (AI) technologies, such as Large Language Models (LLMs), has led to nontrivial improvements in various tasks, including accurate authorship identification of documents. However, while LLMs improve such defense techniques, they also simultaneously provide a vehicle for malicious actors to launch new attack vectors. To combat this security risk, we evaluate the adversarial robustness of authorship models (specifically an authorship verification model) to potent LLM-based attacks. These attacks include untargeted methods (authorship obfuscation) and targeted methods (authorship impersonation). For both attacks, the objective is to mask or mimic the writing style of an author while preserving the original texts’ semantics, respectively. Thus, we perturb an accurate authorship verification model, and achieve maximum attack success rates of 92% and 78% for obfuscation and impersonation attacks, respectively.
Kenneth Alperin, Rohan Leekha, Adaku Uchendu, Trang Nguyen, Srilakshmi Medarametla, Carlos Levya Capote, Seth Aycock, and Charlie Dagli · link · 2025 · Proceedings of the 5th International Conference on NLP for Digital Humanities, pages 102–116
Latent Space Interpretation for Stylistic Analysis and Explainable Authorship Attribution
Recent state-of-the-art authorship attribution methods learn authorship representations of text in a latent, uninterpretable space, which hinders their usability in real-world applications. We propose a novel approach for interpreting learned embeddings by identifying representative points in the latent space and leveraging large language models to generate informative natural language descriptions of the writing style associated with each point. We evaluate the alignment between our interpretable and latent spaces and demonstrate superior prediction agreement over baseline methods. Additionally, we conduct a human evaluation to assess the quality of these style descriptions and validate their utility in explaining the latent space. Finally, we show that human performance on the challenging authorship attribution task improves by +20% on average when aided with explanations from our method.
Milad Alshomary, Narutatsu Ri, Marianna Apidianaki, Ajay Patel, Smaranda Muresan, and Kathleen McKeown · link · 2025 · COLING, pages 1124–1135
Layered Insights: Generalizable Analysis of Human Authorial Style by Leveraging All Transformer Layers
We propose a new approach for the authorship attribution task that leverages the various linguistic representations learned at different layers of pre-trained transformer-based models. We evaluate our approach on two popular authorship attribution models and three evaluation datasets, in in-domain and out-of-domain scenarios. We found that utilizing various transformer layers improves the robustness of authorship attribution models when tested on out-of-domain data, resulting in a much stronger performance. Our analysis gives further insights into how our model’s different layers get specialized in representing certain linguistic aspects that we believe benefit the model when tested out of the domain.
Milad Alshomary, Nikhil Reddy Varimalla, Vishal Anand, Smaranda Muresan, and Kathleen McKeown · link · 2025 · EMNLP, pages 10290–10303
The topic confusion task: A novel evaluation scenario for authorship attribution
PLACEHOLDER
Malik Altakrori, Jackie Chi Kit Cheung, and Benjamin CM Fung · link · 2021 · Findings of EMNLP 2021, pages 4242–4256
Learning invariant representations of social media users
The evolution of social media users’ behavior over time complicates user-level comparison tasks such as verification, classification, clustering, and ranking. As a result, naive approaches may fail to generalize to new users or even to future observations of previously known users. In this paper, we propose a novel procedure to learn a mapping from short episodes of user activity on social media to a vector space in which the distance between points captures the similarity of the corresponding users’ invariant features. We fit the model by optimizing a surrogate metric learning objective over a large corpus of unlabeled social media content. Once learned, the mapping may be applied to users not seen at training time and enables efficient comparisons of users in the resulting vector space. We present a comprehensive evaluation to validate the benefits of the proposed approach using data from Reddit, Twitter, and Wikipedia.
Nicholas Andrews and Marcus Bishop · link · 2019 · EMNLP-IJCNLP, pages 1684–1695
(Dis)improved?! How Simplified Language Affects Large Language Model Performance across Languages
Simplified language enhances the accessibility and human understanding of texts. However, whether it also benefits large language models (LLMs) remains underexplored. This paper extensively studies whether LLM performance improves on simplified data compared to its original counterpart. Our experiments span six datasets and nine automatic simplification systems across three languages. We show that English models, including GPT-4o-mini, show a weak generalization and exhibit a significant performance drop on simplified data. This introduces an intriguing paradox: simplified data is helpful for humans but not for LLMs. At the same time, the performance in non-English languages sometimes improves, depending on the task and quality of the simplifier. Our findings offer a comprehensive view of the impact of simplified language on LLM performance and uncover severe implications for people depending on simple language.
Miriam Anschütz, Anastasiya Damaratskaya, Chaeeun Joy Lee, Arthur Schmalz, Edoardo Mosca, and Georg Groh · link · 2025 · GEM² Workshop, pages 847–861
A light in the dark web: Linking dark web aliases to real internet identities
PLACEHOLDER
Ehsan Arabnezhad, Massimo La Morgia, Alessandro Mei, Eugenio Nerio Nemmi, and Julinda Stefa · link · 2020 · ICDCS, pages 311–321
Computational forensic authorship analysis: Promises and pitfalls
PLACEHOLDER
Shlomo Argamon · link · 2018 · Language and Law/Linguagem e Direito, 5(2):7–37
Overview of the International Authorship Identification Competition at PAN-2011
PLACEHOLDER
Shlomo Argamon and Patrick Juola · link · 2011 · CLEF 2011
Efficient Large Scale Language Modeling with Mixtures of Experts
Mixture of Experts layers (MoEs) enable efficient scaling of language models through conditional computation. This paper presents a detailed empirical study of how autoregressive MoE language models scale in comparison with dense models in a wide range of settings: in- and out-of-domain language modeling, zero- and few-shot priming, and full-shot fine-tuning. With the exception of fine-tuning, we find MoEs to be substantially more compute efficient. At more modest training budgets, MoEs can match the performance of dense models using ~4 times less compute. This gap narrows at scale, but our largest MoE model (1.1T parameters) consistently outperforms a compute-equivalent dense model (6.7B parameters). Overall, this performance gap varies greatly across tasks and domains, suggesting that MoE and dense models generalize differently in ways that are worthy of future study. We make our code and models publicly available for research use.
Meta AI, Mikel Artetxe, Shruti Bhosale, Naman Goyal, et al. · link · 2022 · EMNLP, pages 11699–11732
The Routledge Handbook of Sociolinguistics Around the World, 2nd edition
PLACEHOLDER
Martin J. Ball, Rajend Mesthrie, and Chiara Meluzzi · link · 2023 · Routledge
The Language That Drives Engagement: A Systematic Large-scale Analysis of Headline Experiments
We use a large-scale data set of thousands of field experiments conducted on Upworthy.com, an online media platform, to investigate the cognitive, motivational, affective, and grammatical factors implementable in messages that increase engagement with online content.
Akshina Banerjee and Oleg Urminsky · link · 2025 · Marketing Science, 44(3):566–592
Keep it Private: Unsupervised privatization of online text
Authorship obfuscation techniques hold the promise of helping people protect their privacy in online communications by automatically rewriting text to hide the identity of the original author. However, obfuscation has been evaluated in narrow settings in the NLP literature and has primarily been addressed with superficial edit operations that can lead to unnatural outputs. In this work, we introduce an automatic text privatization framework that fine-tunes a large language model via reinforcement learning to produce rewrites that balance soundness, sense, and privacy. We evaluate it extensively on a large-scale test set of English Reddit posts by 68k authors composed of short-medium length texts. We study how the performance changes among evaluative conditions including authorial profile length and authorship detection strategy. Our method maintains high text quality according to both automated metrics and human evaluation, and successfully evades several automated authorship attacks.
Calvin Bao and Marine Carpuat · link · 2024 · NAACL, pages 8678–8693
Measuring what Matters: Construct Validity in Large Language Model Benchmarks
Evaluating large language models (LLMs) is crucial for both assessing their capabilities and identifying safety or robustness issues prior to deployment. Reliably measuring abstract and complex phenomena such as `safety' and `robustness' requires strong construct validity, that is, having measures that represent what matters to the phenomenon. With a team of 29 expert reviewers, we conduct a systematic review of 445 LLM benchmarks from leading conferences in natural language processing and machine learning. Across the reviewed articles, we find patterns related to the measured phenomena, tasks, and scoring metrics which undermine the validity of the resulting claims. To address these shortcomings, we provide eight key recommendations and detailed actionable guidance to researchers and practitioners in developing LLM benchmarks.
Andrew M. Bean, Ryan Othniel Kearns, Angelika Romanou, Franziska Sofia Hafner, Harry Mayne, Jan Batzner, Negar Foroutan, Chris Schmitz, Karolina Korgul, Hunar Batra, Oishi Deb, Emma Beharry, Cornelius Emde, Thomas Foster, Anna Gausen, María Grandury, Simeng Han, Valentin Hofmann, Lujain Ibrahim, Hazel Kim, et al. · link · 2025 · NeurIPS
Probing classifiers: Promises, shortcomings, and advances
Probing classifiers have emerged as one of the prominent methodologies for interpreting and analyzing deep neural network models of natural language processing. The basic idea is simple—a classifier is trained to predict some linguistic property from a model’s representations—and has been used to examine a wide variety of models and properties. However, recent studies have demonstrated various methodological limitations of this approach. This squib critically reviews the probing classifiers framework, highlighting their promises, shortcomings, and advances.
Yonatan Belinkov · link · 2022 · Computational Linguistics, 48(1):207–219
What do neural machine translation models learn about morphology?
Neural machine translation (MT) models obtain state-of-the-art performance while maintaining a simple, end-to-end architecture. However, little is known about what these models learn about source and target languages during the training process. In this work, we analyze the representations learned by neural MT models at various levels of granularity and empirically evaluate the quality of the representations for learning morphology through extrinsic part-of-speech and morphological tagging tasks. We conduct a thorough investigation along several parameters: word-based vs. character-based representations, depth of the encoding layer, the identity of the target language, and encoder vs. decoder representations. Our data-driven, quantitative evaluation sheds light on important aspects in the neural MT system and its ability to capture word structure.
Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and James Glass · link · 2017 · ACL, pages 861–872
Language style as audience design
PLACEHOLDER
Allan Bell · link · 1984 · Language in Society, 13(2):145–204
Overview of PAN 2025: Generative AI Detection, Multilingual Text Detoxification, Multi-author Writing Style Analysis, and Generative Plagiarism Detection
PLACEHOLDER
Janek Bevendorff, Daryna Dementieva, Maik Fröbe, Bela Gipp, André Greiner-Petter, Jussi Karlgren, Maximilian Mayerl, Preslav Nakov, Alexander Panchenko, Martin Potthast, Artem Shelmanov, Efstathios Stamatatos, Benno Stein, Yuxia Wang, Matti Wiegmann, and Eva Zangerle · link · 2025 · Advances in Information Retrieval, pages 434–441
The two paradigms of LLM detection: Authorship attribution vs. authorship verification
The detection of texts generated by LLMs has quickly become an important research problem. Many supervised and zero-shot detectors have already been proposed, yet their effectiveness and precision remain disputed. Current research therefore focuses on making detectors robust against domain shifts and on building corresponding benchmarks. In this paper, we show that the actual limitations hindering progress in LLM detection lie elsewhere: LLM detection is often implicitly modeled as an authorship attribution task, while its true nature is that of authorship verification. We systematically analyze the current research with respect to this misunderstanding, conduct an in-depth comparative analysis of the benchmarks, and validate our claim using state-of-the-art LLM detectors. Our contributions open the realm of authorship analysis technology for understanding and tackling the problem of LLM detection.
Janek Bevendorff, Matti Wiegmann, Emmelie Richter, Martin Potthast, and Benno Stein · link · 2025 · Findings of ACL 2025, pages 3762–3787
Variation across Speech and Writing
Similarities and differences between speech and writing have been the subject of innumerable studies, but until now there has been no attempt to provide a unified linguistic analysis of the whole range of spoken and written registers in English. In this widely acclaimed empirical study, Douglas Biber uses computational techniques to analyse the linguistic characteristics of twenty-three spoken and written genres, enabling identification of the basic, underlying dimensions of variation in English. In Variation Across Speech and Writing, six dimensions of variation are identified through a factor analysis, on the basis of linguistic co-occurrence patterns. The resulting model of variation provides for the description of the distinctive linguistic characteristics of any spoken or written text and demonstrates the ways in which the polarization of speech and writing has been misleading, and thus enables reconciliation of the contradictory conclusions reached in previous research.
Douglas Biber · link · 1988 · Cambridge University Press
Register, Genre, and Style, 2nd edition
PLACEHOLDER
Douglas Biber and Susan Conrad · link · 2019 · Cambridge University Press
Natural Language Processing with Python
PLACEHOLDER
Steven Bird, Ewan Klein, and Edward Loper · link · 2019 · O'Reilly Media
Centering the speech community
How can NLP/AI practitioners engage with oral societies and develop locally appropriate language technologies? We report on our experience of working together over five years in a remote community in the far north of Australia, and how we prototyped simple language technologies to support our collaboration. We navigated different understandings of language, the functional differentiation of oral vs institutional languages, and the distinct technology opportunities for each. Our collaboration unsettled the first author’s western framing of language as data for exploitation by machines, and we devised a design pattern that seems better aligned with local interests and aspirations. We call for new collaborations on the design of locally appropriate technologies for languages with primary orality.
Steven Bird and Dean Yibarbuk · link · 2024 · EACL, pages 826–839
ETS corpus of non-native written English LDC2014T06
PLACEHOLDER
Daniel Blanchard, Joel Tetreault, Derrick Higgins, Aoife Cahill, and Martin Chodorow · link · 2014 · Linguistic Data Consortium
The language of intergroup distinctiveness
PLACEHOLDER
Richard Y. Bourhis and Howard Giles · link · 1977 · Language, Ethnicity and Intergroup Relations, pages 119–135
Rethinking the Authorship Verification Experimental Setups
One of the main drivers of the recent advances in authorship verification is the PAN large-scale authorship dataset. Despite generating significant progress in the field, inconsistent performance differences between the closed and open test sets have been reported. To this end, we improve the experimental setup by proposing five new public splits over the PAN dataset, specifically designed to isolate and identify biases related to the text topic and to the author’s writing style. We evaluate several BERT-like baselines on these splits, showing that such models are competitive with authorship verification state-of-the-art methods. Furthermore, using explainable AI, we find that these baselines are biased towards named entities. We show that models trained without the named entities obtain better results and generalize better when tested on DarkReddit, our new dataset for authorship verification.
Florin Brad, Andrei Manolache, Elena Burceanu, Antonio Barbalau, Radu Tudor Ionescu, and Marius Popescu · link · 2022 · EMNLP, pages 5634–5643
Adversarial stylometry: Circumventing authorship recognition to preserve privacy and anonymity
The use of stylometry, authorship recognition through purely linguistic means, has contributed to literary, historical, and criminal investigation breakthroughs. Existing stylometry research assumes that authors have not attempted to disguise their linguistic writing style. We challenge this basic assumption of existing stylometry methodologies and present a new area of research: adversarial stylometry. Adversaries have a devastating effect on the robustness of existing classification methods. Our work presents a framework for creating adversarial passages, including obfuscation, where a subject attempts to hide her identity; imitation, where a subject attempts to frame another subject by imitating his writing style; and translation, where original passages are obfuscated with machine translation services. This research demonstrates that manual circumvention methods work very well while automated translation methods are not effective. The obfuscation method reduces the techniques' effectiveness to the level of random guessing and the imitation attempts succeed up to 67% of the time depending on the stylometry technique used. These results are more significant given the fact that experimental subjects were unfamiliar with stylometry, were not professional writers, and spent little time on the attacks. This article also contributes to the field by using human subjects to empirically validate the claim of high accuracy for four current techniques (without adversaries). We have also compiled and released two corpora of adversarial stylometry texts to promote research in this field with a total of 57 unique authors. We argue that this field is important to a multidisciplinary approach to privacy, security, and anonymity.
Michael Brennan, Sadia Afroz, and Rachel Greenstadt · link · 2012 · ACM TISSEC, 15:1–22
'Delta': a measure of stylistic difference and a guide to likely authorship
PLACEHOLDER
John Burrows · link · 2002 · Literary and Linguistic Computing, 17(3):267–287
How the communication style of chatbots influences consumers' satisfaction, trust, and engagement in the context of service failure
This study examines consumers’ reactions to the communication styles of chatbots during failed service experiences. The current study explores whether the communication style adopted by a chatbot impacts consumer satisfaction and behavior intention and how expectancy violations can moderate these relationships in the service context. A pre-test examined the validity of the stimuli of chatbots that were either task-oriented or social-oriented after consumers encountered service failure. The experiment was designed to manipulate the AI-based chatbot agent’s process and style of communication and measure the role of expectancy violations. The main experiment results showed that interactions with social-oriented communication style chatbots enhance the level of consumers’ interaction satisfaction and intention of behavior. Respondents experienced a higher perception of warmth when interacting with social-oriented communication style chatbots than with task-oriented ones. Moreover, expectancy violation moderates the mediation of warmth on the relationship between the chatbot’s communication style/type and interaction satisfaction, trust, and intention of patronage. Setting chatbots’ communication styles to be social-oriented can help reduce negative emotions among consumers caused by service failure; specifically, the perception of warmth created by the social-oriented communication style can alleviate negative evaluations of service agents and companies, such as dissatisfaction and loss of interest. Therefore, in managerial practice, the firm should choose the social-oriented communication style chatbot agent to recover the customer relationship after a service failure.
Na Cai, Shuhong Gao, and Jinzhe Yan · link · 2024 · Humanities and Social Sciences Communications, 11(1):687
Accent, (ING), and the social logic of listener perceptionsThis article reports on the relationship between the English variable (ING) and two divergent accents (Southern and gay) as they are conceptualized and given social meaning in listeners' perceptions of spontaneous speech. The study used an expanded form of the Matched Guise Technique, using recordings collected through sociolinguistic interviews with 8 speakers from North Carolina and California. Excerpts were digitally manipulated to create 32 matched pairs differing only in tokens of (ING), which were used to collect responses in group interviews (N = 55) and a Web-based experiment (N = 124). The alveolar variant -in increased the perceived strength of Southern accents and dampened an accent heard as gay and urban. The influence of (ING) on these accents is linked to shared social meanings of the alveolar form -in and Southern accents on the one hand (lack of education, the country, and the term “redneck”) and the velar variant -ing and the gay accent on the other (lowered masculinity, the city, and the term “metrosexual”). These two accents are contrasted with a third variety, heard as nonaccented and aregional. These effects demonstrate the status of the three linguistic objects, the two accents and (ING), as social objects as well. Kathryn Campbell-Kibler link 2007 American Speech, 82(1):32–64
The nature of sociolinguistic perceptionThis study investigates how linguistic variation carries social meaning, examining the impact of the English variable (ING) on perceptions of eight speakers from the U.S. West Coast and South. Thirty-two excerpts of spontaneous speech were digitally manipulated to vary only in tokens of (ING) and used to collect listener perceptions in group interviews (N = 55) and an experiment (N = 124). Interview data and experimental results show that (ING) impacts social perception variably, inhabiting an indexical field of related meanings (Eckert, Penelope. [2008]. Variation and the indexical field. Journal of Sociolinguistics 12(4):453–476). One of these meanings, intelligence/education, is explored in detail to understand how a given meaning is realized or not in a specific context. Speakers were heard as less educated/intelligent when they used -in, but this effect is driven by reactions to speakers heard as aregional and not as working-class. Some implications on our future understanding of the processing of socially laden variation are discussed. Kathryn Campbell-Kibler link 2009 Language Variation and Change, 21(1):135–156
The sociolinguistic variant as a carrier of social meaningTraditionally used as a “heuristic device” (Labov, 1978), the sociolinguistic variable has taken on a new role as a primitive of speaker/hearer mental models in third-wave variation work (Eckert, 2005, 2008). Results from a sociolinguistic perception study suggest that at least in some cases, variants of the same variable function independently as loci of indexically linked social meaning. Listener responses were collected to three matched guises of the English variable (ING): -in, -ing, and a neutral guise with no audible (ING) tokens. The results counter the study hypothesis that listener expectation, triggered by speaker regional accent, would shape (ING)'s impact. Instead, the two variants showed distinct social associations: the -ing guises were rated as more intelligent/educated, more articulate, and less likely to be a student than either the -in or neutral guises, which did not differ significantly. In contrast, -in guises made speakers sound less formal and less likely to be gay than the -ing and neutral guises, which did not differ. These results suggest that third-wave work needs to more closely examine the role of the variable in theorizing the relationship between linguistic and social structures. Kathryn Campbell-Kibler link 2011 Language Variation and Change, 22(3):423–441
The elements of stylePLACEHOLDER Kathryn Campbell-Kibler, Penelope Eckert, Norma Mendoza-Denton, and Emma Moore link 2006 NWAV Poster Session
Expertise style transfer: A new task towards better communication between experts and laymenThe curse of knowledge can impede communication between experts and laymen. We propose a new task of expertise style transfer and contribute a manually annotated dataset with the goal of alleviating such cognitive biases. Solving this task not only simplifies the professional language, but also improves the accuracy and expertise level of laymen descriptions using simple words. This is a challenging task, unaddressed in previous work, as it requires the models to have expert intelligence in order to modify text with a deep understanding of domain knowledge and structures. We establish the benchmark performance of five state-of-the-art models for style transfer and text simplification. The results demonstrate a significant gap between machine and human performance. We also discuss the challenges of automatic evaluation, to provide insights into future research directions. The dataset is publicly available at https://srhthu.github.io/expertise-style-transfer/. Yixin Cao, Ruihao Shui, Liangming Pan, Min-Yen Kan, Zhiyuan Liu, and Tat-Seng Chua link 2020 ACL, pages 1061–1071
On the diversity of synthetic data and its impact on training large language modelsThe rise of Large Language Models (LLMs) has accentuated the need for diverse, high-quality pre-training data. Synthetic data emerges as a viable solution to the challenges of data scarcity and inaccessibility. While previous literature has focused predominantly on the quality and quantity of real data, our work enables the measurement of diversity in synthetic data and explores its impact on LLM performance. We study the downstream effects of synthetic data diversity during both the pre-training and fine-tuning stages by introducing a new diversity metric, LLM cluster-agent, designed to evaluate the diversity of synthetic datasets. Through a series of controlled experiments with models of 350M and 1.4B parameters, we demonstrate that the proposed cluster-based LLM scoring of diversity correlates positively with both pre-training and supervised fine-tuning performance. Our findings also reveal that synthetic data diversity in pre-training affects supervised fine-tuning more significantly than pre-training itself, even for smaller models. We hope this study advances our understanding of the optimal use of synthetic data in LLM training and opens new avenues for efficient data generation processes. Hao Chen, Abdul Waheed, Xiang Li, Yidong Wang, Jindong Wang, Bhiksha Raj, and Marah I. Abdin link 2024 arXiv preprint ArXiv:2410.15226
HumT DumT: Measuring and controlling human-like language in LLMsShould LLMs generate language that makes them seem human? Human-like language might improve user experience, but might also lead to deception, overreliance, and stereotyping. Assessing these potential impacts requires a systematic way to measure human-like tone in LLM outputs. We introduce HumT and SocioT, metrics for human-like tone and other dimensions of social perceptions in text data based on relative probabilities from an LLM. By measuring HumT across preference and usage datasets, we find that users prefer less human-like outputs from LLMs in many contexts. HumT also offers insights into the perceptions and impacts of anthropomorphism: human-like LLM outputs are highly correlated with warmth, social closeness, femininity, and low status, which are closely linked to the aforementioned harms. We introduce DumT, a method using HumT to systematically control and reduce the degree of human-like tone while preserving model performance. DumT offers a practical approach for mitigating risks associated with anthropomorphic language generation. Myra Cheng, Sunny Yu, and Dan Jurafsky link 2025 ACL, pages 25983–26008
CLUB: a contrastive log-ratio upper bound of mutual informationPLACEHOLDER Pengyu Cheng, Weituo Hao, Shuyang Dai, Jiachang Liu, Zhe Gan, and Lawrence Carin link 2020 ICML
Improving disentangled text representation learning with information-theoretic guidanceLearning disentangled representations of natural language is essential for many NLP tasks, e.g., conditional text generation, style transfer, personalized dialogue systems, etc. Similar problems have been studied extensively for other forms of data, such as images and videos. However, the discrete nature of natural language makes the disentangling of textual representations more challenging (e.g., the manipulation over the data space cannot be easily achieved). Inspired by information theory, we propose a novel method that effectively manifests disentangled representations of text, without any supervision on semantics. A new mutual information upper bound is derived and leveraged to measure dependence between style and content. By minimizing this upper bound, the proposed method induces style and content embeddings into two independent low-dimensional spaces. Experiments on both conditional text generation and text-style transfer demonstrate the high quality of our disentangled representation in terms of content and style preservation. Pengyu Cheng, Martin Renqiang Min, Dinghan Shen, Christopher Malon, Yizhe Zhang, Yitong Li, and Lawrence Carin link 2020 ACL, pages 7530–7541
Evaluating synthetic data generation from user generated textUser-generated content provides a rich resource to study social and behavioral phenomena. Although its application potential is currently limited by the paucity of expert labels and the privacy risks inherent in personal data, synthetic data can help mitigate this bottleneck. In this work, we introduce an evaluation framework to facilitate research on synthetic language data generation for user-generated text. We define a set of aspects for assessing data quality, namely, style preservation, meaning preservation, and divergence, as a proxy for privacy. We introduce metrics corresponding to each aspect. Moreover, through a set of generation strategies and representative tasks and baselines across domains, we demonstrate the relation between the quality aspects of synthetic user generated content, generation strategies, metrics, and downstream performance. To our knowledge, our work is the first unified evaluation framework for user-generated text in relation to the specified aspects, offering both intrinsic and extrinsic evaluation. We envisage it will facilitate developments towards shareable, high-quality synthetic language data. Jenny Chim, Julia Ive, and Maria Liakata link 2025 Computational Linguistics, 51(1):191–233
When Variants Lack Semantic Equivalence: Adverbial Subclause Word OrderPLACEHOLDER Tanya Karoli Christensen and Torben Juel Jensen link 2022 Cambridge University Press, pages 171–206
Conventionality and contrast: Pragmatic principles with lexical consequencesPLACEHOLDER Eve V. Clark link 1992 Frames, Fields, and Contrasts, pages 171–188
Dimensions of abusive language on TwitterIn this paper, we use a new categorical form of multidimensional register analysis to identify the main dimensions of functional linguistic variation in a corpus of abusive language, consisting of racist and sexist Tweets. By analysing the use of a wide variety of parts-of-speech and grammatical constructions, as well as various features related to Twitter and computer-mediated communication, we discover three dimensions of linguistic variation in this corpus, which we interpret as being related to the degree of interactive, antagonistic and attitudinal language exhibited by individual Tweets. We then demonstrate that there is a significant functional difference between racist and sexist Tweets, with sexist Tweets tending to be more interactive and attitudinal than racist Tweets. Isobelle Clarke and Jack Grieve link 2017 First Workshop on Abusive Language Online, pages 1–11
Detecting collaborations in text: Comparing the authors' rhetorical language choices in The Federalist PapersPLACEHOLDER Jeff Collins, David Kaufer, Pantelis Vlachos, Brian Butler, and Suguru Ishizaki link 2004 Computers and the Humanities, 38:15–36
Author identification, idiolect, and linguistic uniquenessPLACEHOLDER Malcolm Coulthard link 2004 Applied Linguistics, 25(4):431–447
Style: Language Variation and IdentityStyle refers to ways of speaking - how speakers use the resource of language variation to make meaning in social encounters. This 2007 book develops a coherent theoretical approach to style in sociolinguistics, illustrated with copious examples. It explains how speakers project different social identities and create different social relationships through their style choices, and how speech-style and social context inter-relate. Style therefore refers to the wide range of strategic actions and performances that speakers engage in, to construct themselves and their social lives. Coupland draws on and integrates a wide variety of contemporary sociolinguistic research as well as his own extensive research in this field. The emphasis is on how social meanings are made locally, in specific relationships, genres, groups and cultures, and on studying language variation as part of the analysis of spoken discourse. Nikolas Coupland link 2007 Cambridge University Press
Txtng: The gr8 db8PLACEHOLDER David Crystal link 2008 Oxford University Press
A Dictionary of Linguistics and Phonetics, 6th editionPLACEHOLDER David Crystal link 2011 Blackwell Publishing
Investigating English StylePLACEHOLDER David Crystal and Derek Davy link 1969 Routledge
Learning stylometric representations for authorship analysisPLACEHOLDER Steven H. H. Ding, Benjamin C. M. Fung, Farkhund Iqbal, and William K. Cheung link 2019 IEEE Transactions on Cybernetics, 49(1):107–121
Speaker recognition based on idiolectal differences between speakersPLACEHOLDER George R. Doddington link 2001 Eurospeech 2001, pages 2521–2524
Automatically constructing a corpus of sentential paraphrasesPLACEHOLDER William B. Dolan and Chris Brockett link 2005 IWP2005
Triplet loss in siamese network for object trackingPLACEHOLDER Xingping Dong and Jianbing Shen link 2018 ECCV 2018, pages 472–488
Refocusing on relevance: Personalization in NLGMany NLG tasks such as summarization, dialogue response, or open domain question answering, focus primarily on a source text in order to generate a target response. This standard approach falls short, however, when a user’s intent or context of work is not easily recoverable based solely on that source text – a scenario that we argue is more of the rule than the exception. In this work, we argue that NLG systems in general should place a much higher level of emphasis on making use of additional context, and suggest that relevance (as used in Information Retrieval) be thought of as a crucial tool for designing user-oriented text-generating tasks. We further discuss possible harms and hazards around such personalization, and argue that value-sensitive design represents a crucial path forward through these challenges. Shiran Dudy, Steven Bedrick, and Bonnie Webber link 2021 EMNLP, pages 5190–5202
HotFlip: White-Box Adversarial Examples for Text ClassificationWe propose an efficient method to generate white-box adversarial examples to trick a character-level neural classifier. We find that only a few manipulations are needed to greatly decrease the accuracy. Our method relies on an atomic flip operation, which swaps one token for another, based on the gradients of the one-hot input vectors. Due to efficiency of our method, we can perform adversarial training which makes the model more robust to attacks at test time. With the use of a few semantics-preserving constraints, we demonstrate that HotFlip can be adapted to attack a word-level classifier as well. Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou link 2018 ACL, pages 31–36
Jocks and Burnouts: Social Categories and Identity in the High SchoolPLACEHOLDER Penelope Eckert - 1989 Teachers College Press
Variation and the indexical fieldThis paper argues for a focus on the social meaning of variation, based in a study of stylistic practice. It is common in the study of variation to interpret variables as reflections of speakers' membership in social categories. Others have argued more recently that variables are associated not with the categories themselves, but with stances and characteristics that constitute those categories. The paper reviews some variation studies that show that variables do not have static meanings, but rather general meanings that become more specific in the context of styles. Building on Michael Silverstein's notion of indexical order, I argue that the meanings of variables are not precise or fixed but rather constitute a field of potential meanings – an indexical field, or constellation of ideologically related meanings, any one of which can be activated in the situated use of the variable. The field is fluid, and each new activation has the potential to change the field by building on ideological connections. Thus variation constitutes an indexical system that embeds ideology in language and that is in turn part and parcel of the construction of ideology. Penelope Eckert link 2008 Journal of Sociolinguistics, 12(4):453–476
Three waves of variation study: The emergence of meaning in the study of sociolinguistic variationThe treatment of social meaning in sociolinguistic variation has come in three waves of analytic practice. The first wave of variation studies established broad correlations between linguistic variables and the macrosociological categories of socioeconomic class, gender, ethnicity, and age. The second wave employed ethnographic methods to explore the local categories and configurations that inhabit, or constitute, these broader categories. In both waves, variation was seen as marking social categories. This article sets out a theoretical foundation for the third wave, arguing that (a) variation constitutes a robust social semiotic system, potentially expressing the full range of social concerns in a given community; (b) the meanings of variables are underspecified, gaining more specific meanings in the context of styles, and (c) variation does not simply reflect, but also constructs, social meaning and hence is a force in social change. Penelope Eckert link 2012 Annual Review of Anthropology, 41(1):87–100
Stylometry with R: A Package for Computational Text AnalysisPLACEHOLDER Maciej Eder, Jan Rybicki, and Mike Kestemont link 2016 The R Journal, 8(1):107–121
Analyzing the Persuasive Effect of Style in News Editorial ArgumentationNews editorials argue about political issues in order to challenge or reinforce the stance of readers with different ideologies. Previous research has investigated such persuasive effects for argumentative content. In contrast, this paper studies how important the style of news editorials is to achieve persuasion. To this end, we first compare content- and style-oriented classifiers on editorials from the liberal NYTimes with ideology-specific effect annotations. We find that conservative readers are resistant to NYTimes style, but on liberals, style even has more impact than content. Focusing on liberals, we then cluster the leads, bodies, and endings of editorials, in order to learn about writing style patterns of effective argumentation. Roxanne El Baff, Henning Wachsmuth, Khalid Al Khatib, and Benno Stein link 2020 ACL, pages 3154–3160
Adversarial removal of demographic attributes from text dataRecent advances in Representation Learning and Adversarial Training seem to succeed in removing unwanted features from the learned representation. We show that demographic information of authors is encoded in—and can be recovered from—the intermediate representations learned by text-based neural classifiers. The implication is that decisions of classifiers trained on textual data are not agnostic to—and likely condition on—demographic attributes. When attempting to remove such demographic information using adversarial training, we find that while the adversarial component achieves chance-level development-set accuracy during training, a post-hoc classifier, trained on the encoded sentences from the first part, still manages to reach substantially higher classification accuracies on the same data. This behavior is consistent across several tasks, demographic properties and datasets. We explore several techniques to improve the effectiveness of the adversarial component. Our main conclusion is a cautionary one: do not rely on the adversarial training to achieve invariant representation to sensitive features. Yanai Elazar and Yoav Goldberg link 2018 EMNLP, pages 11–21
Evaluating the efficacy of AI content detection tools in differentiating between human and AI-generated textThe proliferation of artificial intelligence (AI)-generated content, particularly from models like ChatGPT, presents potential challenges to academic integrity and raises concerns about plagiarism. This study investigates the capabilities of various AI content detection tools in discerning human and AI-authored content. Fifteen paragraphs each from ChatGPT Models 3.5 and 4 on the topic of cooling towers in the engineering process and five human-written control responses were generated for evaluation. AI content detection tools developed by OpenAI, Writer, Copyleaks, GPTZero, and CrossPlag were used to evaluate these paragraphs. Findings reveal that the AI detection tools were more accurate in identifying content generated by GPT 3.5 than GPT 4. However, when applied to human-written control responses, the tools exhibited inconsistencies, producing false positives and uncertain classifications. This study underscores the need for further development and refinement of AI content detection tools as AI-generated content becomes more sophisticated and harder to distinguish from human-written text. Ahmed M. Elkhatat, Khaled Elsaid, and Saeed Almeer link 2023 International Journal for Educational Integrity, 19(1):1–16
MMTEB: Massive Multilingual Text Embedding BenchmarkText embeddings are typically evaluated on a narrow set of tasks, limited in terms of languages, domains, and task types. To circumvent this limitation and to provide a more comprehensive evaluation, we introduce the Massive Multilingual Text Embedding Benchmark (MMTEB) -- a large-scale community-driven initiative expanding MTEB to over 500 quality-controlled evaluation tasks across 1,000+ languages. MMTEB includes a wide range of challenging novel tasks such as instruction following, long-document retrieval, and code retrieval, and represents the largest multilingual collection of evaluation tasks for embedding models to date. We use this collection to construct multiple highly multilingual benchmarks. We evaluate a representative set of models on these benchmarks. Our findings indicate that, while LLM-based models can achieve state-of-the-art performance on a subset of languages, the best-performing publicly available model across languages is the notably smaller, multilingual-e5-large-instruct. Massive benchmarks often impose high computational demands, limiting accessibility, particularly for low-resource communities. To address this, we downsample tasks based on inter-task correlation (i.e., selecting only a diverse set of tasks) while preserving relative rankings. We further optimize tasks such as retrieval by sampling hard negatives, creating smaller but effective splits. These optimizations allow us to introduce benchmarks at a significantly lower computational cost. For instance, we introduce a new zero-shot English benchmark that maintains a similar ordering at a fraction of the cost. Team MMTEB, Kenneth Enevoldsen, Isaac Chung, Imene Kerboua, et al. link 2024 ICLR
Variety, style-shifting, and ideologyPLACEHOLDER Susan M. Ervin-Tripp link 2001 Style and Sociolinguistic Variation, pages 44–56
OLMo 3PLACEHOLDER Team OLMo, Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, et al. link 2025 Technical Report
Leveraging Measurement Theory for Natural Language Processing ResearchThis dissertation explores the intersection of natural language processing (NLP) and measurement theory. NLP is a field aimed at enabling machines to process and generate human languages, such as English, German, and Mandarin. These languages are complex, diverse, and full of irregularities, making them challenging for machines to handle compared to structured artificial languages, like programming languages. NLP research ranges from simple tasks, like word frequency analysis, to complex ones involving the understanding and generation of human language. Measurement theory, traditionally applied in social sciences, addresses how we measure various properties scientifically. Key concepts include construct validity, which examines whether a measure accurately represents what it intends to measure, and reliability, which focuses on the consistency of a measure across different conditions. This dissertation argues that many challenges in NLP relate to measurement issues and suggests that principles from measurement theory can help address these challenges, particularly by providing tools to evaluate and improve the quality of NLP models. The structure of the dissertation is as follows: Chapter 2 offers background on NLP and measurement theory, covering essential text representation techniques in NLP, the history of measurement theory, and recent discussions on its unification across fields. Chapters 3-5 apply measurement theory to evaluate NLP models: Chapter 3 explores the reliability of gender bias measures in NLP by using classical reliability estimators from social sciences. Chapter 4 adapts a construct validity testing framework to assess the quality of text representations for social science constructs. Chapter 5 introduces a psychometric-based benchmarking approach to evaluate large language models, demonstrated through a case study on eighth-grade math proficiency. Chapters 6-7 focus on using measurement theory to improve NLP model performance: Chapter 6 presents a framework for designing user models based on measurement principles, achieving better-quality user representations than current methods. Chapter 7 examines how integrating human values into model training can enhance models’ ability to recognize values in human arguments. Chapters 8 and 9 reflect on current NLP research challenges and propose future directions: Chapter 8 identifies challenges in text-based personality computing, offering potential solutions and avenues for research. Chapter 9 concludes with a summary of the dissertation’s findings and suggests future work at the intersection of measurement theory and NLP. This work underscores the potential of measurement theory to enhance NLP research by offering frameworks for evaluating and designing more reliable and valid models. By integrating these approaches, the dissertation aims to bridge NLP and measurement theory, advancing NLP's capability to address complex measurement challenges. Qixiang Fang link 2024 Dissertation, Utrecht University
Linguistic bias in ChatGPT: Language models reinforce dialect discriminationWe present a large-scale study of linguistic bias exhibited by ChatGPT covering ten dialects of English (Standard American English, Standard British English, and eight widely spoken non-“standard” varieties from around the world). We prompted GPT-3.5 Turbo and GPT-4 with text by native speakers of each variety and analyzed the responses via detailed linguistic feature annotation and native speaker evaluation. We find that the models default to “standard” varieties of English; based on evaluation by native speakers, we also find that model responses to non-“standard” varieties consistently exhibit a range of issues: stereotyping (19% worse than for “standard” varieties), demeaning content (25% worse), lack of comprehension (9% worse), and condescending responses (15% worse). Moreover, if these models are asked to imitate the writing style of prompts in non-“standard” varieties, they produce text that exhibits lower comprehension of the input and is especially prone to stereotyping. GPT-4 improves on GPT-3.5 in terms of comprehension, warmth, and friendliness, but also exhibits a marked increase in stereotyping (+18%). The results indicate that GPT-3.5 Turbo and GPT-4 can perpetuate linguistic discrimination toward speakers of non-“standard” varieties. Eve Fleisig, Genevieve Smith, Madeline Bossi, Ishita Rustagi, Xavier Yin, and Dan Klein link 2024 EMNLP, pages 13541–13564
Survey of the state of the art in natural language generation: Core tasks, applications and evaluationThis paper surveys the current state of the art in Natural Language Generation (NLG), defined as the task of generating text or speech from non-linguistic input. A survey of NLG is timely in view of the changes that the field has undergone over the past two decades, especially in relation to new (usually data-driven) methods, as well as new applications of NLG technology. This survey therefore aims to (a) give an up-to-date synthesis of research on the core tasks in NLG and the architectures adopted in which such tasks are organised; (b) highlight a number of recent research topics that have arisen partly as a result of growing synergies between NLG and other areas of artificial intelligence; (c) draw attention to the challenges in NLG evaluation, relating them to similar challenges faced in other areas of NLP, with an emphasis on different evaluation methods and the relationships between them. Albert Gatt and Emiel Krahmer link 2018 JAIR, 61:65–170
GLTR: Statistical detection and visualization of generated textThe rapid improvement of language models has raised the specter of abuse of text generation systems. This progress motivates the development of simple methods for detecting generated text that can be used by non-experts. In this work, we introduce GLTR, a tool to support humans in detecting whether a text was generated by a model. GLTR applies a suite of baseline statistical methods that can detect generation artifacts across multiple sampling schemes. In a human-subjects study, we show that the annotation scheme provided by GLTR improves the human detection-rate of fake text from 54% to 72% without any prior training. GLTR is open-source and publicly deployed, and has already been widely used to detect generated outputs. Sebastian Gehrmann, Hendrik Strobelt, and Alexander Rush link 2019 ACL System Demonstrations, pages 111–116
Accommodation theory: Communication, context, and consequencePLACEHOLDER Howard Giles, Nikolas Coupland, and Justine Coupland link 1991 Contexts of accommodation, 1:1–68
Speech Style and Social EvaluationPLACEHOLDER Howard Giles and Peter F. Powesland - 1975 Academic Press
Assessing BERT's syntactic abilitiesI assess the extent to which the recently introduced BERT model captures English syntactic phenomena, using (1) naturally-occurring subject-verb agreement stimuli; (2) "colorless green ideas" subject-verb agreement stimuli, in which content words in natural sentences are randomly replaced with words sharing the same part-of-speech and inflection; and (3) manually crafted stimuli for subject-verb agreement and reflexive anaphora phenomena. The BERT model performs remarkably well on all cases. Yoav Goldberg link 2019 arXiv preprint ArXiv:1901.05287
Coh-Metrix: Analysis of text on cohesion and languagePLACEHOLDER Arthur C. Graesser, Danielle S. McNamara, Max M. Louwerse, and Zhiqiang Cai link 2004 Behavior Research Methods, Instruments, & Computers, 36(2):193–202
The Idea of Progress in Forensic Authorship AnalysisThis Element examines progress in research and practice in forensic authorship analysis. It describes the existing research base and examines what makes an authorship analysis more or less reliable. Further to this, the author describes the recent history of forensic science and the scientific revolution brought about by the invention of DNA evidence. They chart the rise of three major changes in forensic science – the recognition of contextual bias in analysts, the need for validation studies and shift in logic of providing identification evidence. This Element addresses the idea of progress in forensic authorship analysis in terms of these three issues with regard to new knowledge about the nature of authorship and methods in stylistics and stylometry. The author proposes that the focus needs to shift to validation of protocols for approaching case questions, rather than on validation of systems or general approaches. This title is also available as Open Access on Cambridge Core. Tim Grant link 2022 Cambridge University Press
Quantitative authorship attribution: An evaluation of techniquesPLACEHOLDER Jack Grieve link 2007 Literary and Linguistic Computing, 22(3):251–270
Register variation explains stylometric authorship analysisFor centuries, investigations of disputed authorship have shown that people have unique styles of writing. Given sufficient data, it is generally possible to distinguish between the writings of a small group of authors, for example, through the multivariate analysis of the relative frequencies of common function words. There is, however, no accepted explanation for why this type of stylometric analysis is successful. Authorship analysts often argue that authors write in subtly different dialects, but the analysis of individual words is not licensed by standard theories of sociolinguistic variation. Alternatively, stylometric analysis is consistent with standard theories of register variation. In this paper, I argue that stylometric methods work because authors write in subtly different registers. To support this claim, I present the results of parallel stylometric and multidimensional register analyses of a corpus of newspaper articles written by two columnists. I demonstrate that both analyses not only distinguish between these authors but also identify the same underlying patterns of linguistic variation. I therefore propose that register variation, as opposed to dialect variation, provides a basis for explaining these differences and for explaining stylometric analyses of authorship more generally. Jack Grieve link 2023 Corpus Linguistics and Linguistic Theory, 19(1):47–77
The sociolinguistic foundations of language modelingIn this article, we introduce a sociolinguistic perspective on language modeling. We claim that language models in general are inherently modeling varieties of language, and we consider how this insight can inform the development and deployment of language models. We begin by presenting a technical definition of the concept of a variety of language as developed in sociolinguistics. We then discuss how this perspective could help us better understand five basic challenges in language modeling: social bias, domain adaptation, alignment, language change, and scale. We argue that to maximize the performance and societal value of language models it is important to carefully compile training corpora that accurately represent the specific varieties of language being modeled, drawing on theories, methods, and descriptions from the field of sociolinguistics. Jack Grieve, Sara Bartl, Matteo Fuoli, Jason Grafmiller, Weihang Huang, Alejandro Jawerbaum, Akira Murakami, Marcus Perlman, Dana Roemling, and Bodo Winter link 2025 Frontiers in Artificial Intelligence, 7:1472411
Variation among blogs: A multi-dimensional analysisPLACEHOLDER Jack Grieve, Douglas Biber, Eric Friginal, and Tatiana Nekrasova link 2011 Genres on the Web, pages 303–322
Benchmarking Linguistic Diversity of Large Language ModelsThe development and evaluation of Large Language Models (LLMs) has primarily focused on their task-solving capabilities, with recent models even surpassing human performance in some areas. However, this focus often neglects whether machine-generated language matches the human level of diversity, in terms of vocabulary choice, syntactic construction, and expression of meaning, raising questions about whether the fundamentals of language generation have been fully addressed. This paper emphasizes the importance of examining the preservation of human linguistic richness by language models, given the concerning surge in online content produced or aided by LLMs. We propose a comprehensive framework for evaluating LLMs from various linguistic diversity perspectives including lexical, syntactic, and semantic dimensions. Using this framework, we benchmark several state-of-the-art LLMs across all diversity dimensions, and conduct an in-depth case study for syntactic diversity. Finally, we analyze how different development and deployment choices impact the linguistic diversity of LLM outputs. Yanzhu Guo, Guokan Shang, and Chloé Clavel link 2025 arXiv preprint ArXiv:2412.10271
The curious decline of linguistic diversity: Training language models on synthetic textThis study investigates the consequences of training language models on synthetic data generated by their predecessors, an increasingly prevalent practice given the prominence of powerful generative models. Diverging from the usual emphasis on performance metrics, we focus on the impact of this training methodology on linguistic diversity, especially when conducted recursively over time. To assess this, we adapt and develop a set of novel metrics targeting lexical, syntactic, and semantic diversity, applying them in recursive finetuning experiments across various natural language generation tasks in English. Our findings reveal a consistent decrease in the diversity of the model outputs through successive iterations, especially pronounced for tasks demanding high levels of creativity. This trend underscores the potential risks of training language models on synthetic text, particularly concerning the preservation of linguistic richness. Our study highlights the need for careful consideration of the long-term effects of such training approaches on the linguistic capabilities of language models. Yanzhu Guo, Guokan Shang, Michalis Vazirgiannis, and Chloé Clavel link 2024 Findings of NAACL 2024, pages 3589–3604
Annotation artifacts in natural language inference dataLarge-scale datasets for natural language inference are created by presenting crowd workers with a sentence (premise), and asking them to generate three new sentences (hypotheses) that it entails, contradicts, or is logically neutral with respect to. We show that, in a significant portion of such data, this protocol leaves clues that make it possible to identify the label by looking only at the hypothesis, without observing the premise. Specifically, we show that a simple text categorization model can correctly classify the hypothesis alone in about 67% of SNLI (Bowman et al., 2015) and 53% of MultiNLI (Williams et al., 2017). Our analysis reveals that specific linguistic phenomena such as negation and vagueness are highly correlated with certain inference classes. Our findings suggest that the success of natural language inference models to date has been overestimated, and that the task remains a hard open problem. Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman, and Noah A. Smith link 2018 NAACL, pages 107–112
Towards style alignment in cross-cultural translationSuccessful communication depends on the speaker’s intended style (i.e., what the speaker is trying to convey) aligning with the listener’s interpreted style (i.e., what the listener perceives). However, cultural differences often lead to misalignment between the two; for example, politeness is often lost in translation. We characterize the ways that LLMs fail to translate style – biasing translations towards neutrality and performing worse in non-Western languages. We mitigate these failures with RASTA (Retrieval-Augmented STylistic Alignment), a method that leverages learned stylistic concepts to encourage LLM translation to appropriately convey cultural communication norms and align style. Shreya Havaldar, Adam Stein, Eric Wong, and Lyle Ungar link 2025 ACL, pages 32213–32230
Representation learning of writing styleIn this paper, we introduce a new method of representation learning that aims to embed documents in a stylometric space. Previous studies in the field of authorship analysis focused on feature engineering techniques in order to represent document styles and to enhance model performance in specific tasks. Instead, we directly embed documents in a stylometric space by relying on a reference set of authors and the intra-author consistency property which is one of two components in our definition of writing style. The main intuition of this paper is that we can define a general stylometric space from a set of reference authors such that, in this space, the coordinates of different documents will be close when the documents are by the same author, and spread away when they are by different authors, even for documents by authors who are not in the set of reference authors. The method we propose allows for the clustering of documents based on stylistic clues reflecting the authorship of documents. For the empirical validation of the method, we train a deep neural network model to predict authors of a large reference dataset consisting of news and blog articles. Although the learning process is supervised, it does not require dedicated labeling of the data; it relies only on the metadata of the articles, which is available in huge amounts. We evaluate the model on multiple datasets, on both the authorship clustering and the authorship attribution tasks. Julien Hay, Bich-Lien Doan, Fabrice Popineau, and Ouassim Ait Elhara link 2020 W-NUT 2020, pages 232–243
Measuring Mathematical Problem Solving With the MATH DatasetMany intellectual endeavors require mathematical problem solving, but this skill remains beyond the capabilities of computers. To measure this ability in machine learning models, we introduce MATH, a new dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations. To facilitate future research and increase accuracy on MATH, we also contribute a large auxiliary pretraining dataset which helps teach models the fundamentals of mathematics. Even though we are able to increase accuracy on MATH, our results show that accuracy remains relatively low, even with enormous Transformer models. Moreover, we find that simply increasing budgets and model parameter counts will be impractical for achieving strong mathematical reasoning if scaling trends continue. While scaling Transformers is automatically solving most other text-based tasks, scaling is not currently solving MATH. To have more traction on mathematical problem solving we will likely need new algorithmic advancements from the broader research community. Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt link 2021 NeurIPS
Looking for the inner music: Probing LLMs' understanding of literary styleLanguage models have the ability to identify the characteristics of much shorter literary passages than was thought feasible with traditional stylometry. We evaluate authorship and genre detection for a new corpus of literary novels. We find that a range of LLMs are able to distinguish authorship and genre, but that different models do so in different ways. Some models rely more on memorization, while others make greater use of author or genre characteristics learned during fine-tuning. We additionally use three methods – direct syntactic ablation of input text and two means of studying internal model values – to probe one high-performing LLM for features that characterize styles. We find that authorial style is easier to characterize than genre-level style and is more impacted by minor syntactic decisions and contextual word usage. However, some traits like pronoun usage and word order prove significant for defining both kinds of literary style. Rebecca M. M. Hicke and David Mimno link 2025 Computational Humanities Research, 1:e3
AI generates covertly racist decisions about people based on their dialectHundreds of millions of people now interact with language models, with uses ranging from help with writing to informing hiring decisions. However, these language models are known to perpetuate systematic racial prejudices, making their judgements biased in problematic ways about groups such as African Americans. Although previous research has focused on overt racism in language models, social scientists have argued that racism with a more subtle character has developed over time, particularly in the United States after the civil rights movement. It is unknown whether this covert racism manifests in language models. Here, we demonstrate that language models embody covert racism in the form of dialect prejudice, exhibiting raciolinguistic stereotypes about speakers of African American English (AAE) that are more negative than any human stereotypes about African Americans ever experimentally recorded. By contrast, the language models’ overt stereotypes about African Americans are more positive. Dialect prejudice has the potential for harmful consequences: language models are more likely to suggest that speakers of AAE be assigned less-prestigious jobs, be convicted of crimes and be sentenced to death. Finally, we show that current practices of alleviating racial bias in language models, such as human preference alignment, exacerbate the discrepancy between covert and overt stereotypes, by superficially obscuring the racism that language models maintain on a deeper level. Our findings have far-reaching implications for the fair and safe use of language technology. Valentin Hofmann, Pratyusha Ria Kalluri, Dan Jurafsky, and Sharese King link 2024 Nature, 633:147–154
Intonation and referee design phenomena in the narrative speech of Black/biracial menThis study examines how men with one Black parent and one white parent variably construct their racial identities through both linguistic practice and explicit testimonials, with a specific focus on how this construction is realized in narratives about law enforcement. The data consist of interviews with five young men, aged 18–32, in Washington, D.C., and the analysis compares use of intonational phenomena associated with African American Language (AAL) in response to questions about aspects of their racial identities. Declarative intonational phrases from responses to questions were MAE-ToBI annotated and analyzed for use of intonational features subject to racialized stylistic variation, including use of L+H* versus H*, focus marking, and peak delay interval length. Results of multiple regression models indicate speakers avoid intonational features associated with AAL in police narratives, especially L+H* pitch accents with broad focus marking and longer peak delay intervals. These findings illuminate an important aspect of the relationship between linguistic performance and identity: both racial and linguistic identities are subject to topic and audience/referee-conditioned variation and individuals can use specific intonational variables to align themselves within specific audience and topic-influenced constraints. In the context of police narratives, avoidance of salient features of AAL intonation can serve as linguistic respectability politics; these speakers have motivation to employ linguistic behavior that distances them from the most societally and physically precarious implications of their identities. Nicole Holliday link 2021 Journal of English Linguistics, 49(3):283–304
The analysis of literary style – a reviewPLACEHOLDER David I. Holmes link 1985 Journal of the Royal Statistical Society: Series A, 148(4):328–341
ParaGuide: Guided diffusion paraphrasers for plug-and-play textual style transferTextual style transfer is the task of transforming stylistic properties of text while preserving meaning. Target "styles" can be defined in numerous ways, ranging from single attributes (e.g. formality) to authorship (e.g. Shakespeare). Previous unsupervised style-transfer approaches generally rely on significant amounts of labeled data for only a fixed set of styles or require large language models. In contrast, we introduce a novel diffusion-based framework for general-purpose style transfer that can be flexibly adapted to arbitrary target styles at inference time. Our parameter-efficient approach, ParaGuide, leverages paraphrase-conditioned diffusion models alongside gradient-based guidance from both off-the-shelf classifiers and strong existing style embedders to transform the style of text while preserving semantic information. We validate the method on the Enron Email Corpus, with both human and automatic evaluations, and find that it outperforms strong baselines on formality, sentiment, and even authorship style transfer. Zachary Horvitz, Ajay Patel, Chris Callison-Burch, Zhou Yu, and Kathleen McKeown link 2024 AAAI, pages 18216–18224
TinyStyler: Efficient few-shot text style transfer with authorship embeddingsThe goal of text style transfer is to transform the style of texts while preserving their original meaning, often with only a few examples of the target style. Existing style transfer methods generally rely on the few-shot capabilities of large language models or on complex controllable text generation approaches that are inefficient and underperform on fluency metrics. We introduce TinyStyler, a lightweight but effective approach, which leverages a small language model (800M params) and pre-trained authorship embeddings to perform efficient, few-shot text style transfer. We evaluate on the challenging task of authorship style transfer and find TinyStyler outperforms strong approaches such as GPT-4. We also evaluate TinyStyler’s ability to perform text attribute style transfer (formal ↔ informal). Zachary Horvitz, Ajay Patel, Kanishk Singh, Chris Callison-Burch, Kathleen McKeown, and Zhou Yu link 2024 Findings of EMNLP 2024, pages 13376–13390
N-gram feature selection for authorship identificationPLACEHOLDER John Houvardas and Efstathios Stamatatos link 2006 AIMSA'06, pages 77–86
Demographic factors improve classification performancePLACEHOLDER Dirk Hovy link 2015 ACL-IJCNLP, pages 752–762
"You Sound Just Like Your Father" Commercial Machine Translation Systems Include Stylistic BiasesThe main goal of machine translation has been to convey the correct content. Stylistic considerations have been at best secondary. We show that as a consequence, the output of three commercial machine translation systems (Bing, DeepL, Google) make demographically diverse samples from five languages “sound” older and more male than the original. Our findings suggest that translation models reflect demographic bias in the training data. This opens up interesting new research avenues in machine translation to take stylistic considerations into account. Dirk Hovy, Federico Bianchi, and Tommaso Fornaciari link 2020 ACL, pages 1686–1690
The social impact of natural language processingPLACEHOLDER Dirk Hovy and Shannon L. Spruit link 2016 ACL, pages 591–598
Tagging Performance Correlates with Author AgePLACEHOLDER Dirk Hovy and Anders Søgaard link 2015 ACL-IJCNLP, pages 483–488
The importance of modeling social factors of language: Theory and practiceNatural language processing (NLP) applications are now more powerful and ubiquitous than ever before. With rapidly developing (neural) models and ever-more available data, current NLP models have access to more information than any human speaker during their life. Still, it would be hard to argue that NLP models have reached human-level capacity. In this position paper, we argue that the reason for the current limitations is a focus on information content while ignoring language’s social factors. We show that current NLP systems systematically break down when faced with interpreting the social factors of language. This limits applications to a subset of information-related tasks and prevents NLP from reaching human-level performance. At the same time, systems that incorporate even a minimum of social factors already show remarkable improvements. We formalize a taxonomy of seven social factors based on linguistic theory and exemplify current failures and emerging successes for each of them. We suggest that the NLP community address social factors to get closer to the goal of human-like language understanding. Dirk Hovy and Diyi Yang link 2021 NAACL, pages 588–602
Authorship Attribution in the Era of LLMs: Problems, Methodologies, and ChallengesAccurate attribution of authorship is crucial for maintaining the integrity of digital content, improving forensic investigations, and mitigating the risks of misinformation and plagiarism. Addressing the imperative need for proper authorship attribution is essential to uphold the credibility and accountability of authentic authorship. The rapid advancements of Large Language Models (LLMs) have blurred the lines between human and machine authorship, posing significant challenges for traditional methods. We present a comprehensive literature review that examines the latest research on authorship attribution in the era of LLMs. This survey systematically explores the landscape of this field by categorizing four representative problems: (1) Human-written Text Attribution; (2) LLM-generated Text Detection; (3) LLM-generated Text Attribution; and (4) Human-LLM Co-authored Text Attribution. We also discuss the challenges related to ensuring the generalization and explainability of authorship attribution methods. Generalization requires the ability to generalize across various domains, while explainability emphasizes providing transparent and understandable insights into the decisions made by these models. By evaluating the strengths and limitations of existing methods and benchmarks, we identify key open problems and future research directions in this field. This literature review serves as a roadmap for researchers and practitioners interested in understanding the state of the art in this rapidly evolving field. Additional resources and a curated list of papers are available and regularly updated at https://llm-authorship.github.io/. Baixiang Huang, Canyu Chen, and Kai Shu link 2025 ACM SIGKDD Explorations Newsletter, 26(2):21–43
Sparse autoencoders find highly interpretable features in language modelsOne of the roadblocks to a better understanding of neural networks' internals is polysemanticity, where neurons appear to activate in multiple, semantically distinct contexts. Polysemanticity prevents us from identifying concise, human-understandable explanations for what neural networks are doing internally. One hypothesised cause of polysemanticity is superposition, where neural networks represent more features than they have neurons by assigning features to an overcomplete set of directions in activation space, rather than to individual neurons. Here, we attempt to identify those directions, using sparse autoencoders to reconstruct the internal activations of a language model. These autoencoders learn sets of sparsely activating features that are more interpretable and monosemantic than directions identified by alternative approaches, where interpretability is measured by automated methods. Moreover, we show that with our learned set of features, we can pinpoint the features that are causally responsible for counterfactual behaviour on the indirect object identification task (Wang et al., 2022) to a finer degree than previous decompositions. This work indicates that it is possible to resolve superposition in language models using a scalable, unsupervised method. Our method may serve as a foundation for future mechanistic interpretability work, which we hope will enable greater model transparency and steerability. Robert Huben, Hoagy Cunningham, Logan Riggs Smith, Aidan Ewart, and Lee Sharkey link 2024 ICLR
"Style" as distinctiveness: the culture and ideology of linguistic differentiationPLACEHOLDER Judith T. Irvine link 2001 Style and Sociolinguistic Variation, pages 21–43
The Million Authors Corpus: A Cross-Lingual and Cross-Domain Wikipedia Dataset for Authorship VerificationAuthorship verification (AV) is a crucial task for applications like identity verification, plagiarism detection, and AI-generated text identification. However, datasets for training and evaluating AV models are primarily in English and primarily in a single domain. This precludes analysis of AV techniques for generalizability and can cause seemingly valid AV solutions to, in fact, rely on topic-based features rather than actual authorship features. To address this limitation, we introduce the Million Authors Corpus, a novel dataset encompassing contributions from dozens of languages on Wikipedia. It includes only long and contiguous textual chunks taken from Wikipedia edits and links those texts to their authors. The corpus comprises 60.08M textual chunks, contributed by 1.29M Wikipedia authors. It enables broad-scale cross-lingual and cross-domain AV evaluation to ensure accurate analysis of model capabilities that are not overly optimistic. We provide baseline evaluations using state-of-the-art AV models as well as information retrieval models that are not AV-specific in order to demonstrate the corpus's unique cross-lingual and cross-domain ablation capabilities. Abraham Israeli, Shuai Liu, Jonathan May, and David Jurgens link 2025 Findings of ACL 2025, pages 25997–26017
Style versus Content: A distinction without a (learnable) difference?Textual style transfer involves modifying the style of a text while preserving its content. This assumes that it is possible to separate style from content. This paper investigates whether this separation is possible. We use sentiment transfer as our case study for style transfer analysis. Our experimental methodology frames style transfer as a multi-objective problem, balancing style shift with content preservation and fluency. Due to the lack of parallel data for style transfer we employ a variety of adversarial encoder-decoder networks in our experiments. Also, we use a probing methodology to analyse how these models encode style-related features in their latent spaces. The results of our experiments, which are further confirmed by a human evaluation, reveal the inherent trade-off between the multiple style transfer objectives which indicates that style cannot be usefully separated from content within these style-transfer systems. Somayeh Jafaritazehjani, Gwénolé Lecorvé, Damien Lolive, and John Kelleher link 2020 COLING, pages 2169–2180
Evaluating Style-Personalized Text Generation: Challenges and DirectionsWith the surge of large language models (LLMs) and their ability to produce customized output, style-personalized text generation--"write like me"--has become a rapidly growing area of interest. However, style personalization is highly specific, relative to every user, and depends strongly on the pragmatic context, which makes it uniquely challenging. Although prior research has introduced benchmarks and metrics for this area, they tend to be non-standardized and have known limitations (e.g., poor correlation with human subjects). Since LLMs have been found not to capture author-specific style well, it follows that the metrics themselves must be scrutinized carefully. In this work we critically examine the effectiveness of the most common metrics used in the field, such as BLEU, embeddings, and LLMs-as-judges. We evaluate these metrics using our proposed style discrimination benchmark, which spans eight diverse writing tasks across three evaluation settings: domain discrimination, authorship attribution, and LLM-generated personalized vs non-personalized discrimination. We find strong evidence that employing ensembles of diverse evaluation metrics consistently outperforms single-evaluator methods, and conclude by providing guidance on how to reliably assess style-personalized text generation. Anubhav Jangra, Bahareh Sarrafzadeh, Adrian de Wynter, Silviu Cucerzan, and Sujay Kumar Jauhar link 2025 arXiv preprint ArXiv:2508.06374
Shakespearizing Modern Language Using Copy-Enriched Sequence to Sequence ModelsVariations in writing styles are commonly used to adapt the content to a specific context, audience, or purpose. However, applying stylistic variations is still by and large a manual process, and there have been few efforts to automate it. In this paper we explore automated methods to transform text from modern English to Shakespearean English using an end to end trainable neural model with pointers to enable copy action. To tackle the limited amount of parallel data, we pre-train embeddings of words by leveraging external dictionaries mapping Shakespearean words to modern English words as well as additional text. Our methods are able to get a BLEU score of 31+, an improvement of ≈ 6 points above the strongest baseline. We publicly release our code to foster further research in this area. Harsh Jhamtani, Varun Gangal, Eduard Hovy, and Eric Nyberg link 2017 Workshop on Stylistic Variation, pages 10–19
Mistral 7BWe introduce Mistral 7B v0.1, a 7-billion-parameter language model engineered for superior performance and efficiency. Mistral 7B outperforms Llama 2 13B across all evaluated benchmarks, and Llama 1 34B in reasoning, mathematics, and code generation. Our model leverages grouped-query attention (GQA) for faster inference, coupled with sliding window attention (SWA) to effectively handle sequences of arbitrary length with a reduced inference cost. We also provide a model fine-tuned to follow instructions, Mistral 7B -- Instruct, that surpasses the Llama 2 13B -- Chat model both on human and automated benchmarks. Our models are released under the Apache 2.0 license. Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed link 2023 arXiv preprint ArXiv:2310.06825
Deep learning for text style transfer: A surveyText style transfer is an important task in natural language generation, which aims to control certain attributes in the generated text, such as politeness, emotion, humor, and many others. It has a long history in the field of natural language processing, and recently has re-gained significant attention thanks to the promising performance brought by deep neural models. In this article, we present a systematic survey of the research on neural text style transfer, spanning over 100 representative articles since the first neural text style transfer work in 2017. We discuss the task formulation, existing datasets and subtasks, evaluation, as well as the rich methodologies in the presence of parallel and non-parallel data. We also provide discussions on a variety of important topics regarding the future development of this task. Di Jin, Zhijing Jin, Zhiting Hu, Olga Vechtomova, and Rada Mihalcea link 2022 Computational Linguistics, 48(1):155–205
Disentangled representation learning for non-parallel text style transferThis paper tackles the problem of disentangling the latent representations of style and content in language models. We propose a simple yet effective approach, which incorporates auxiliary multi-task and adversarial objectives, for style prediction and bag-of-words prediction, respectively. We show, both qualitatively and quantitatively, that the style and content are indeed disentangled in the latent space. This disentangled latent representation learning can be applied to style transfer on non-parallel corpora. We achieve high performance in terms of transfer accuracy, content preservation, and language fluency, in comparison to various previous approaches. Vineet John, Lili Mou, Hareesh Bahuleyan, and Olga Vechtomova link 2019 ACL, pages 424–434
Authorship attributionAuthorship attribution, the science of inferring characteristics of the author from the characteristics of documents written by that author, is a problem with a long history and a wide range of application. Recent work in “non-traditional” authorship attribution demonstrates the practicality of automatically analyzing documents based on authorial style, but the state of the art is confusing. Analyses are difficult to apply, little is known about type or rate of errors, and few “best practices” are available. In part because of this confusion, the field has perhaps had less uptake and general acceptance than is its due. This review surveys the history and present state of the discipline, presenting some comparative results when available. It shows, first, that the discipline is quite successful, even in difficult cases involving small documents in unfamiliar and less studied languages; it further analyzes the types of analysis and features used and tries to determine characteristics of well-performing systems, finally formulating these in a set of recommendations for best practices. Patrick Juola link 2006 Foundations and Trends in Information Retrieval, 1(3):233–334
JGAAP 4.0–A revised authorship attribution toolPLACEHOLDER Patrick Juola, John Noecker Jr., Mike Ryan, and Sandy Speer link 2009 Digital Humanities
(male, bachelor) and (female, Ph.D) have different connotations: Parallelly annotated stylistic language dataset with multiple personasStylistic variation in text needs to be studied with different aspects including the writer’s personal traits, interpersonal relations, rhetoric, and more. Despite recent attempts on computational modeling of the variation, the lack of parallel corpora of style language makes it difficult to systematically control the stylistic change as well as evaluate such models. We release PASTEL, the parallel and annotated stylistic language dataset, that contains ~41K parallel sentences (8.3K parallel stories) annotated across different personas. Each persona has different styles in conjunction: gender, age, country, political view, education, ethnic, and time-of-writing. The dataset is collected from human annotators with solid control of input denotation: not only preserving original meaning between text, but promoting stylistic diversity to annotators. We test the dataset on two interesting applications of style language, where PASTEL helps design appropriate experiment and evaluation. First, in predicting a target style (e.g., male or female in gender) given a text, multiple styles of PASTEL make other external style variables controlled (or fixed), which is a more accurate experimental design. Second, a simple supervised model with our parallel text outperforms the unsupervised models using nonparallel text in style transfer. Our dataset is publicly available. Dongyeop Kang, Varun Gangal, and Eduard Hovy link 2019 EMNLP-IJCNLP, pages 1696–1706
Style is NOT a single variable: Case Studies for Cross-Stylistic Language UnderstandingEvery natural text is written in some style. Style is formed by a complex combination of different stylistic factors, including formality markers, emotions, metaphors, etc. One cannot form a complete understanding of a text without considering these factors. The factors combine and co-vary in complex ways to form styles. Studying the nature of the covarying combinations sheds light on stylistic language in general, sometimes called cross-style language understanding. This paper provides the benchmark corpus (XSLUE) that combines existing datasets and collects a new one for sentence-level cross-style language understanding and evaluation. The benchmark contains text in 15 different styles under the proposed four theoretical groupings: figurative, personal, affective, and interpersonal groups. For valid evaluation, we collect an additional diagnostic set by annotating all 15 styles on the same text. Using XSLUE, we propose three interesting cross-style applications in classification, correlation, and generation. First, our proposed cross-style classifier trained with multiple styles together helps improve overall classification performance against individually-trained style classifiers. Second, our study shows that some styles are highly dependent on each other in human-written text. Finally, we find that combinations of some contradictive styles likely generate stylistically less appropriate text. We believe our benchmark and case studies help explore interesting future directions for cross-style research. The preprocessed datasets and code are publicly available. Dongyeop Kang and Eduard Hovy link 2021 ACL-IJCNLP, pages 2376–2387
Function words in authorship attribution. From black magic to theory?PLACEHOLDER Mike Kestemont link 2014 CLFL, pages 59–66
A deep metric learning approach to account linkingWe consider the task of linking social media accounts that belong to the same author in an automated fashion on the basis of the content and meta-data of the corresponding document streams. We focus on learning an embedding that maps variable-sized samples of user activity–ranging from single posts to entire months of activity–to a vector space, where samples by the same author map to nearby points. Our approach does not require human-annotated data for training purposes, which allows us to leverage large amounts of social media content. The proposed model outperforms several competitive baselines under a novel evaluation framework modeled after established recognition benchmarks in other domains. Our method achieves high linking accuracy, even with small samples from accounts not seen at training time, a prerequisite for practical applications of the proposed linking framework. Aleem Khan, Elizabeth Fleming, Noah Schofield, Marcus Bishop, and Nicholas Andrews link 2021 NAACL, pages 5275–5287
Learning to generate text in arbitrary writing stylesPrior work in style-controlled text generation has focused on tasks such as emulating the style of prolific literary authors, producing formal or informal text, and mitigating toxicity of generated text. Plentiful demonstrations of these styles are available, and as a result modern language models are often able to emulate them, either via prompting or discriminative control. However, in applications such as writing assistants, it is desirable for language models to produce text in an author-specific style on the basis of a potentially small writing sample. For example, someone writing in a particular dialect may prefer writing suggestions that retain the same dialect. We find that instruction-tuned language models can struggle to reproduce author-specific style demonstrated in a prompt. Instead, we propose to guide a language model to generate text in a target style using contrastively-trained representations that capture stylometric features. Our approach (StyleMC) combines an author-adapted language model with sequence-level inference to improve stylistic consistency, and is found to be effective in a variety of conditions, including unconditional generation and style transfer. Additionally, we find that the proposed approach can serve as an effective anonymization method, by editing a document to mask authorship while preserving the original meaning. Aleem Khan, Andrew Wang, Sophia Hager, and Nicholas Andrews link 2023 arXiv:2312.17242
Supervised contrastive learningPLACEHOLDER Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan link 2020 NeurIPS, 33:18661–18673
Leveraging Multilingual Training for Authorship Representation: Enhancing Generalization across Languages and DomainsAuthorship representation (AR) learning, which models an author’s unique writing style, has demonstrated strong performance in authorship attribution tasks. However, prior research has primarily focused on monolingual settings—mostly in English—leaving the potential benefits of multilingual AR models underexplored. We introduce a novel method for multilingual AR learning that incorporates two key innovations: probabilistic content masking, which encourages the model to focus on stylistically indicative words rather than content-specific words, and language-aware batching, which improves contrastive learning by reducing cross-lingual interference. Our model is trained on over 4.5 million authors across 36 languages and 13 domains. It consistently outperforms monolingual baselines in 21 out of 22 non-English languages, achieving an average Recall@8 improvement of 4.85%, with a maximum gain of 15.91% in a single language. Furthermore, it exhibits stronger cross-lingual and cross-domain generalization compared to a monolingual model trained solely on English. Our analysis confirms the effectiveness of both proposed techniques, highlighting their critical roles in the model’s improved performance. Junghwan Kim, Haotian Zhang, and David Jurgens link 2025 EMNLP, pages 34855–34880
Working in Language and Law: A German PerspectivePLACEHOLDER Hannes Kniffka link 2007 Palgrave Macmillan UK
What's in an embedding? Analyzing word embeddings through multilingual evaluationPLACEHOLDER Arne Köhn link 2015 EMNLP, pages 2067–2073
Automatically categorizing written texts by author genderPLACEHOLDER Moshe Koppel, Shlomo Argamon, and Anat Rachel Shimoni link 2002 Literary and Linguistic Computing, 17(4):401–412
Stylometric detection of ai-generated text in twitter timelinesRecent advancements in pre-trained language models have enabled convenient methods for generating human-like text at a large scale. Though these generation capabilities hold great potential for breakthrough applications, it can also be a tool for an adversary to generate misinformation. In particular, social media platforms like Twitter are highly susceptible to AI-generated misinformation. A potential threat scenario is when an adversary hijacks a credible user account and incorporates a natural language generator to generate misinformation. Such threats necessitate automated detectors for AI-generated tweets in a given user's Twitter timeline. However, tweets are inherently short, thus making it difficult for current state-of-the-art pre-trained language model-based detectors to accurately detect at what point the AI starts to generate tweets in a given Twitter timeline. In this paper, we present a novel algorithm using stylometric signals to aid detecting AI-generated tweets. We propose models corresponding to quantifying stylistic changes in human and AI tweets in two related tasks: Task 1 - discriminate between human and AI-generated tweets, and Task 2 - detect if and when an AI starts to generate tweets in a given Twitter timeline. Our extensive experiments demonstrate that the stylometric features are effective in augmenting the state-of-the-art AI-generated text detectors. Tharindu Kumarage, Joshua Garland, Amrita Bhattacharjee, Kirill Trapeznikov, Scott Ruston, and Huan Liu link 2023 arXiv:2303.03697
Sociolinguistic PatternsPLACEHOLDER William Labov - 1972 University of Pennsylvania Press
The Social Stratification of English in New York City, 2nd editionOne of the first accounts of social variation in language, this groundbreaking study founded the discipline of sociolinguistics, providing the model on which thousands of studies have been based. In this second edition, Labov looks back on forty years of sociolinguistic research, bringing the reader up to date on its methods, findings and achievements. In over thirty pages of new material, he explores the unforeseen implications of his earlier work, addresses the political issues involved, and evaluates the success of newer approaches to sociolinguistic investigation. In doing so, he reveals the outstanding accomplishments of sociolinguistics since his original study, which laid the foundations for studying language variation, introduced the crucial concept of the linguistic variable, and showed how variation across age groups is an indicator of language change. Bringing Labov's pioneering study into the 21st century, this classic volume will remain the benchmark in the field for years to come. William Labov link 2006 Cambridge University Press
Tulu 3: Pushing Frontiers in Open Language Model Post-TrainingLanguage model post-training is applied to refine behaviors and unlock new skills across a wide range of recent language models, but open recipes for applying these techniques lag behind proprietary ones. The underlying training data and recipes for post-training are simultaneously the most important pieces of the puzzle and the portion with the least transparency. To bridge this gap, we introduce Tulu 3, a family of fully-open state-of-the-art post-trained models, alongside its data, code, and training recipes, serving as a comprehensive guide for modern post-training techniques. Tulu 3, which builds on Llama 3.1 base models, achieves results surpassing the instruct versions of Llama 3.1, Qwen 2.5, Mistral, and even closed models such as GPT-4o-mini and Claude 3.5-Haiku. The training algorithms for our models include supervised finetuning (SFT), Direct Preference Optimization (DPO), and a novel method we call Reinforcement Learning with Verifiable Rewards (RLVR). With Tulu 3, we introduce a multi-task evaluation scheme for post-training recipes with development and unseen evaluations, standard benchmark implementations, and substantial decontamination of existing open datasets on said benchmarks. We conclude with analysis and discussion of training methods that did not reliably improve performance. In addition to the Tulu 3 model weights and demo, we release the complete recipe -- including datasets for diverse core skills, a robust toolkit for data curation and evaluation, the training code and infrastructure, and, most importantly, a detailed report for reproducing and further adapting the Tulu 3 approach to more domains. Nathan Lambert, Jacob Morrison, Valentina Pyatkin, et al. link 2025 arXiv:2411.15124
Where does the sociolinguistic variable stop?In 1972 Labov described the basic sociolinguistic question as the one ‘posed by the need to understand why anyone says anything’ (1972:207). Clearly the aim is very different from that of specifying the form of a grammar that generates all and only the well-formed sentences of a language. The goal is a theory of utterances. Moreover, the ‘why’ question can also be read as ‘what for’. What does anyone say anything for? I think we can safely say that this question places sociolinguistic analysis in a functional framework. If sociolinguistics looks for answers to the ‘why’ of saying something, it is seeking functional explanations. Beatriz R. Lavandera link 1978 Language in Society, 7(2):171–192
LFTK: Handcrafted Features in Computational LinguisticsPast research has identified a rich set of handcrafted linguistic features that can potentially assist various tasks. However, their extensive number makes it difficult to effectively select and utilize existing handcrafted features. Coupled with the problem of inconsistent implementation across research works, there has been no categorization scheme or generally-accepted feature names. This creates unwanted confusion. Also, no actively-maintained open-source library extracts a wide variety of handcrafted features. The current handcrafted feature extraction practices have several inefficiencies, and a researcher often has to build such an extraction system from the ground up. We collect and categorize more than 220 popular handcrafted features grounded on past literature. Then, we conduct a correlation analysis study on several task-specific datasets and report the potential use cases of each feature. Lastly, we devise a multilingual handcrafted linguistic feature extraction system in a systematically expandable manner. We open-source our system to give the community a rich set of pre-implemented handcrafted features. Bruce W. Lee and Jason Lee link 2023 BEA 2023, pages 1–19
Diverse Demonstrations Improve In-context Compositional GeneralizationIn-context learning has shown great success in i.i.d semantic parsing splits, where the training and test sets are drawn from the same distribution. In this setup, models are typically prompted with demonstrations that are similar to the input utterance. However, in the setup of compositional generalization, where models are tested on outputs with structures that are absent from the training set, selecting similar demonstrations is insufficient, as often no example will be similar enough to the input. In this work, we propose a method to select diverse demonstrations that aims to collectively cover all of the structures required in the output program, in order to encourage the model to generalize to new structures from these demonstrations. We empirically show that combining diverse demonstrations with in-context learning substantially improves performance across three compositional generalization semantic parsing datasets in the pure in-context learning setup and when combined with finetuning. Itay Levy, Ben Bogin, and Jonathan Berant link 2023 ACL, pages 1401–1422
TextBugger: Generating Adversarial Text Against Real-world ApplicationsPLACEHOLDER Jinfeng Li, Shouling Ji, Tianyu Du, Bo Li, and Ting Wang link 2019 NDSS
Towards robust and privacy-preserving text representationsWritten text often provides sufficient clues to identify the author, their gender, age, and other important attributes. Consequently, the authorship of training and evaluation corpora can have unforeseen impacts, including differing model performance for different user groups, as well as privacy implications. In this paper, we propose an approach to explicitly obscure important author characteristics at training time, such that representations learned are invariant to these attributes. Evaluating on two tasks, we show that this leads to increased privacy in the learned representations, as well as more robust models to varying evaluation conditions, including out-of-domain corpora. Yitong Li, Timothy Baldwin, and Trevor Cohn link 2018 ACL, pages 25–30
Textbooks Are All You Need II: phi-1.5 technical reportWe continue the investigation into the power of smaller Transformer-based language models as initiated by TinyStories -- a 10 million parameter model that can produce coherent English -- and the follow-up work on phi-1, a 1.3 billion parameter model with Python coding performance close to the state-of-the-art. The latter work proposed to use existing Large Language Models (LLMs) to generate "textbook quality" data as a way to enhance the learning process compared to traditional web data. We follow the "Textbooks Are All You Need" approach, focusing this time on common sense reasoning in natural language, and create a new 1.3 billion parameter model named phi-1.5, with performance on natural language tasks comparable to models 5x larger, and surpassing most non-frontier LLMs on more complex reasoning tasks such as grade-school mathematics and basic coding. More generally, phi-1.5 exhibits many of the traits of much larger LLMs, both good -- such as the ability to "think step by step" or perform some rudimentary in-context learning -- and bad, including hallucinations and the potential for toxic and biased generations -- encouragingly though, we are seeing improvement on that front thanks to the absence of web data. We open-source phi-1.5 to promote further research on these urgent topics. Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, and Yin Tat Lee link 2023 arXiv:2309.05463
GPT detectors are biased against non-native English writersPLACEHOLDER Weixin Liang, Mert Yuksekgonul, Yining Mao, Eric Wu, and James Zou link 2023 Patterns, 4(7):100779
Let's Verify Step by StepIn recent years, large language models have greatly improved in their ability to perform complex multi-step reasoning. However, even state-of-the-art models still regularly produce logical mistakes. To train more reliable models, we can turn either to outcome supervision, which provides feedback for a final result, or process supervision, which provides feedback for each intermediate reasoning step. Given the importance of training reliable models, and given the high cost of human feedback, it is important to carefully compare both methods. Recent work has already begun this comparison, but many questions still remain. We conduct our own investigation, finding that process supervision significantly outperforms outcome supervision for training models to solve problems from the challenging MATH dataset. Our process-supervised model solves 78% of problems from a representative subset of the MATH test set. Additionally, we show that active learning significantly improves the efficacy of process supervision. To support related research, we also release PRM800K, the complete dataset of 800,000 step-level human feedback labels used to train our best reward model. Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe link 2023 ICLR
Style over Substance: Distilled Language Models Reason Via Stylistic ReplicationSpecialized reasoning language models (RLMs) have demonstrated that scaling test-time computation through detailed reasoning traces significantly enhances performance. Although these traces effectively facilitate knowledge distillation into smaller, instruction-tuned models, the precise nature of transferred reasoning remains unclear. In this study, we investigate to what extent distilled models internalize replicated stylistic patterns during reasoning. To this end, we systematically analyze reasoning traces, identifying structural and lexical patterns that characterize successful reasoning. We then introduce two new datasets -- a dataset of emergent reasoning traces and a synthetic dataset explicitly constructed to replicate these stylistic patterns -- to precisely examine their influence on distilled models' reasoning capabilities. We find that models trained on the synthetic traces achieve comparable performance, indicating that distilled reasoning abilities rely significantly on surface-level patterns. Surprisingly, we observe an increase in performance even when the synthetic traces are altered to lead to the wrong answer. Our findings highlight how stylistic patterns can be leveraged to efficiently enhance LM reasoning across diverse model families. Philip Lippmann and Jie Yang link 2025 arXiv
Anonymisation models for text data: State of the art, challenges and future directionsThis position paper investigates the problem of automated text anonymisation, which is a prerequisite for secure sharing of documents containing sensitive information about individuals. We summarise the key concepts behind text anonymisation and provide a review of current approaches. Anonymisation methods have so far been developed in two fields with little mutual interaction, namely natural language processing and privacy-preserving data publishing. Based on a case study, we outline the benefits and limitations of these approaches and discuss a number of open challenges, such as (1) how to account for multiple types of semantic inferences, (2) how to strike a balance between disclosure risk and data utility and (3) how to evaluate the quality of the resulting anonymisation. We lay out a case for moving beyond sequence labelling models and incorporate explicit measures of disclosure risk into the text anonymisation process. Pierre Lison, Ildikó Pilán, David Sanchez, Montserrat Batet, and Lilja Øvrelid link 2021 ACL-IJCNLP, pages 4188–4203
EncT5: A framework for fine-tuning T5 as non-autoregressive modelsPre-trained encoder-decoder transformer architectures have become increasingly popular recently with the advent of T5 models. T5 has also become more favorable over other architectures like BERT due to the amount of data that it is pre-trained on, increased scale of model parameter sizes and easy applicability to a diverse set of tasks due to the generative nature of the model. While being able to generalize to a wide variety of tasks, it is not clear that encoder-decoder architectures are the most efficient for fine-tuning tasks that don't require auto-regressive decoding. In this work, we study fine-tuning pre-trained encoder-decoder models for tasks such as classification, multi-label classification, and structured prediction. We propose EncT5, a framework for these problems, and illustrate instantiations for these tasks. Our experiment results show that EncT5 has advantages over T5, such as efficiency and usability, and outperforms BERT when evaluated on publicly available pre-trained checkpoints. Frederick Liu, Terry Huang, Shihang Lyu, Siamak Shakeri, Hongkun Yu, and Jing Li link 2022 arXiv:2110.08426
A Survey of Personalized Large Language Models: Progress and Future DirectionsLarge Language Models (LLMs) excel in handling general knowledge tasks, yet they struggle with user-specific personalization, such as understanding individual emotions, writing styles, and preferences. Personalized Large Language Models (PLLMs) tackle these challenges by leveraging individual user data, such as user profiles, historical dialogues, content, and interactions, to deliver responses that are contextually relevant and tailored to each user's specific needs. This is a highly valuable research topic, as PLLMs can significantly enhance user satisfaction and have broad applications in conversational agents, recommendation systems, emotion recognition, medical assistants, and more. This survey reviews recent advancements in PLLMs from three technical perspectives: prompting for personalized context (input level), finetuning for personalized adapters (model level), and alignment for personalized preferences (objective level). To provide deeper insights, we also discuss current limitations and outline several promising directions for future research. Updated information about this survey can be found at https://github.com/JiahongLiu21/Awesome-Personalized-Large-Language-Models. Jiahong Liu, Zexuan Qiu, Zhongyang Li, Quanyu Dai, Wenhao Yu, Jieming Zhu, Minda Hu, Menglin Yang, Tat-Seng Chua, and Irwin King link 2025 arXiv:2502.11528
RECAP: Retrieval-enhanced context-aware prefix encoder for personalized dialogue response generationEndowing chatbots with a consistent persona is essential to an engaging conversation, yet it remains an unresolved challenge. In this work, we propose a new retrieval-enhanced approach for personalized response generation. Specifically, we design a hierarchical transformer retriever trained on dialogue domain data to perform personalized retrieval and a context-aware prefix encoder that fuses the retrieved information to the decoder more effectively. Extensive experiments on a real-world dataset demonstrate the effectiveness of our model at generating more fluent and personalized responses. We quantitatively evaluate our model’s performance under a suite of human and automatic metrics and find it to be superior compared to state-of-the-art baselines on English Reddit conversations. Shuai Liu, Hyundong Cho, Marjorie Freedman, Xuezhe Ma, and Jonathan May link 2023 ACL, pages 8404–8419
More than words: The influence of affective content and linguistic style matches in online reviews on conversion ratesCustomers increasingly rely on other consumers' reviews to make purchase decisions online. New insights into the customer review phenomenon can be derived from studying the semantic content and style properties of verbatim customer reviews to examine their influence on online retail sites' conversion rates. The authors employ text mining to extract changes in affective content and linguistic style properties of customer book reviews on Amazon.com. A dynamic panel data model reveals that the influence of positive affective content on conversion rates is asymmetrical, such that greater increases in positive affective content in customer reviews have a smaller effect on subsequent increases in conversion rate. No such tapering-off effect occurs for changes in negative affective content in reviews. Furthermore, positive changes in affective cues and increasing congruence with the product interest group's typical linguistic style directly and conjointly increase conversion rates. These findings suggest that managers should identify and promote the most influential reviews in a given product category, provide instructions to stimulate reviewers to write powerful reviews, and adapt the style of their own editorial reviews to the relevant product category. Stephan Ludwig, Ko de Ruyter, Max Friedman, Elisabeth Constantin Brüggen, Martin Wetzels, and Gerard Pfann link 2013 Journal of Marketing, 77(1):87–103
Politeness transfer: A tag and generate approachThis paper introduces a new task of politeness transfer which involves converting non-polite sentences to polite sentences while preserving the meaning. We also provide a dataset of more than 1.39 million instances automatically labeled for politeness to encourage benchmark evaluations on this new task. We design a tag and generate pipeline that identifies stylistic attributes and subsequently generates a sentence in the target style while preserving most of the source content. For politeness as well as five other transfer tasks, our model outperforms the state-of-the-art methods on automatic metrics for content preservation, with a comparable or better performance on style transfer accuracy. Additionally, our model surpasses existing methods on human evaluations for grammaticality, meaning preservation and transfer accuracy across all the six style transfer tasks. The data and code are located at https://github.com/tag-and-generate. Aman Madaan, Amrith Setlur, Tanmay Parekh, Barnabas Poczos, Graham Neubig, Yiming Yang, Ruslan Salakhutdinov, Alan W Black, and Shrimai Prabhumoye link 2020 ACL, pages 1869–1881
Jointly learning author and annotated character n-gram embeddings: A case study in literary textPLACEHOLDER Suraj Maharjan, Deepthi Mave, Prasha Shrestha, Manuel Montes, Fabio A. González, and Thamar Solorio link 2019 RANLP 2019, pages 684–692
Rephrasing the Web: A Recipe for Compute and Data-Efficient Language ModelingLarge language models are trained on massive scrapes of the web, which are often unstructured, noisy, and poorly phrased. Current scaling laws show that learning from such data requires an abundance of both compute and data, which grows with the size of the model being trained. This is infeasible both because of the large compute costs and duration associated with pre-training, and the impending scarcity of high-quality data on the web. In this work, we propose Web Rephrase Augmented Pre-training (WRAP) that uses an off-the-shelf instruction-tuned model prompted to paraphrase documents on the web in specific styles such as "like Wikipedia" or in "question-answer format" to jointly pre-train LLMs on real and synthetic rephrases. First, we show that using WRAP on the C4 dataset, which is naturally noisy, speeds up pre-training by ~3x. At the same pre-training compute budget, it improves perplexity by more than 10% on average across different subsets of the Pile, and improves zero-shot question answer accuracy across 13 tasks by more than 2%. Second, we investigate the impact of the re-phrasing style on the performance of the model, offering insights into how the composition of the training data can impact the performance of LLMs in OOD settings. Our gains are attributed to the fact that re-phrased synthetic data has higher utility than just real data because it (i) incorporates style diversity that closely reflects downstream evaluation style, and (ii) has higher 'quality' than web-scraped data. Pratyush Maini, Skyler Seto, He Bai, David Grangier, Yizhe Zhang, and Navdeep Jaitly link 2024 arXiv:2401.16380
Counterfactual augmentation for robust authorship representation learningPLACEHOLDER Hieu Man and Thien Huu Nguyen link 2024 SIGIR '24, pages 2347–2351
Language technologies as if people mattered: Centering communities in language technology developmentIn this position paper we argue that researchers interested in language and/or language technologies should attend to challenges of linguistic and algorithmic injustice together with language communities. We put forward that this can be done by drawing together diverse scholarly and experiential insights, building strong interdisciplinary teams, and paying close attention to the wider social, cultural and historical contexts of both language communities and the technologies we aim to develop. Nina Markl, Lauren Hall-Lew, and Catherine Lai link 2024 LREC-COLING 2024, pages 10085–10099
Umap: Uniform manifold approximation and projection for dimension reductionUMAP (Uniform Manifold Approximation and Projection) is a novel manifold learning technique for dimension reduction. UMAP is constructed from a theoretical framework based in Riemannian geometry and algebraic topology. The result is a practical scalable algorithm that applies to real world data. The UMAP algorithm is competitive with t-SNE for visualization quality, and arguably preserves more of the global structure with superior run time performance. Furthermore, UMAP has no computational restrictions on embedding dimension, making it viable as a general purpose dimension reduction technique for machine learning. Leland McInnes, John Healy, and James Melville link 2020 arXiv:1802.03426
Introducing SociolinguisticsPLACEHOLDER Miriam Meyerhoff link 2006 Routledge
Linguistic profiling of a neural language modelIn this paper we investigate the linguistic knowledge learned by a Neural Language Model (NLM) before and after a fine-tuning process and how this knowledge affects its predictions during several classification problems. We use a wide set of probing tasks, each of which corresponds to a distinct sentence-level feature extracted from different levels of linguistic annotation. We show that BERT is able to encode a wide range of linguistic characteristics, but it tends to lose this information when trained on specific downstream tasks. We also find that BERT’s capacity to encode different kinds of linguistic properties has a positive influence on its predictions: the more readable linguistic information of a sentence it stores, the higher its capacity to predict the expected label assigned to that sentence. Alessio Miaschi, Dominique Brunato, Felice Dell'Orletta, and Giulia Venturi link 2020 COLING, pages 745–756
Stranger than paradigms word embedding benchmarks don't align with morphologyPLACEHOLDER Timothee Mickus and Maria Copot link 2024 SCiL 2024, pages 173–189
Investigating topic influence in authorship attributionPLACEHOLDER George K Mikros and Eleni K Argiri link 2007 SIGIR'07 Workshop
The signature stylometric systemPLACEHOLDER Peter Millican link 2003 Software documentation, University of Oxford
State of what art? a call for multi-prompt LLM evaluationRecent advances in LLMs have led to an abundance of evaluation benchmarks, which typically rely on a single instruction template per task. We create a large-scale collection of instruction paraphrases and comprehensively analyze the brittleness introduced by single-prompt evaluations across 6.5M instances, involving 20 different LLMs and 39 tasks from 3 benchmarks. We find that different instruction templates lead to very different performance, both absolute and relative. Instead, we propose a set of diverse metrics on multiple instruction paraphrases, specifically tailored for different use cases (e.g., LLM vs. downstream development), ensuring a more reliable and meaningful assessment of LLM capabilities. We show that our metrics provide new insights into the strengths and limitations of current LLMs. Moran Mizrahi, Guy Kaplan, Dan Malkin, Rotem Dror, Dafna Shahaf, and Gabriel Stanovsky link 2024 TACL, 12:933–949
Inference in an authorship problem: A comparative study of discrimination methods applied to the authorship of the disputed Federalist papersPLACEHOLDER Frederick Mosteller and David L. Wallace link 1963 JASA, 58(302):275–309
MTEB: Massive Text Embedding BenchmarkText embeddings are commonly evaluated on a small set of datasets from a single task not covering their possible applications to other tasks. It is unclear whether state-of-the-art embeddings on semantic textual similarity (STS) can be equally well applied to other tasks like clustering or reranking. This makes progress in the field difficult to track, as various models are constantly being proposed without proper evaluation. To solve this problem, we introduce the Massive Text Embedding Benchmark (MTEB). MTEB spans 8 embedding tasks covering a total of 58 datasets and 112 languages. Through the benchmarking of 33 models on MTEB, we establish the most comprehensive benchmark of text embeddings to date. We find that no particular text embedding method dominates across all tasks. This suggests that the field has yet to converge on a universal text embedding method and scale it up sufficiently to provide state-of-the-art results on all embedding tasks. MTEB comes with open-source code and a public leaderboard at https://github.com/embeddings-benchmark/mteb. Niklas Muennighoff, Nouamane Tazi, Loic Magne, and Nils Reimers link 2023 EACL, pages 2014–2037
s1: Simple test-time scalingTest-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI's o1 model showed this capability but did not publicly share its methodology, leading to many replication efforts. We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model's thinking process or lengthening it by appending "Wait" multiple times to the model's generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps. After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, our model s1-32B exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24). Further, scaling s1-32B with budget forcing allows extrapolating beyond its performance without test-time intervention: from 50% to 57% on AIME24. Our model, data, and code are open-source at https://github.com/simplescaling/s1 Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candes, and Tatsunori Hashimoto link 2025 EMNLP, pages 20286–20332
Does your style engage? linguistic styles of influencers and digital consumer engagement on youtubePLACEHOLDER Ana Cristina Munaro, Renato Hübner Barcelos, Eliane Cristine Francisco Maffezzolli, João Pedro Santos Rodrigues, and Emerson Cabrera Paraiso link 2024 Computers in Human Behavior, 156(C)
Surveying stylometry techniques and applicationsThe analysis of authorial style, termed stylometry, assumes that style is quantifiably measurable for evaluation of distinctive qualities. Stylometry research has yielded several methods and tools over the past 200 years to handle a variety of challenging cases. This survey reviews several articles within five prominent subtasks: authorship attribution, authorship verification, authorship profiling, stylochronometry, and adversarial stylometry. Discussions on datasets, features, experimental techniques, and recent approaches are provided. Further, a current research challenge lies in the inability of authorship analysis techniques to scale to a large number of authors with few text samples. Here, we perform an extensive performance analysis on a corpus of 1,000 authors to investigate authorship attribution, verification, and clustering using 14 algorithms from the literature. Finally, several remaining research challenges are discussed, along with descriptions of various open-source and commercial software that may be useful for stylometry subtasks. Tempestt Neal, Kalaivani Sundararajan, Aneez Fatima, Yiming Yan, Yingfei Xiang, and Damon Woodard link 2017 ACM Computing Surveys, 50(6):86
Collaborative growth: When large language models meet sociolinguisticsLarge Language Models (LLMs) have dramatically transformed the AI landscape. They can produce remarkably fluent text and exhibit a range of natural language understanding and generation capabilities. This article explores how LLMs might be used for sociolinguistic research and, conversely, how sociolinguistics can contribute to the development of LLMs. It argues that both areas of research will benefit from a thoughtful, engaging collaboration. Sociolinguists are not merely end users of LLMs; they have a crucial role to play in the development of LLMs. Dong Nguyen link 2025 Language and Linguistics Compass, 19(2):e70010
Computational sociolinguistics: A SurveyLanguage is a social phenomenon and variation is inherent to its social nature. Recently, there has been a surge of interest within the computational linguistics (CL) community in the social dimension of language. In this article we present a survey of the emerging field of “computational sociolinguistics” that reflects this increased interest. We aim to provide a comprehensive overview of CL research on sociolinguistic themes, featuring topics such as the relation between language and social identity, language use in social interaction, and multilingual communication. Moreover, we demonstrate the potential for synergy between the research communities involved, by showing how the large-scale data-driven methods that are widely used in CL can complement existing sociolinguistic studies, and how sociolinguistics can inform and challenge the methods and assumptions used in CL studies. We hope to convey the possible benefits of a closer collaboration between the two communities and conclude with a discussion of open challenges. Dong Nguyen, A. Seza Doğruöz, Carolyn P. Rosé, and Franciska de Jong link 2016 Computational Linguistics, 42(3):537–593
"How old do you think I am?" A study of language and age in TwitterPLACEHOLDER Dong Nguyen, Rilana Gravel, Dolf Trieschnigg, and Theo Meder link 2013 ICWSM, pages 439–448
Do word embeddings capture spelling variation?Analyses of word embeddings have primarily focused on semantic and syntactic properties. However, word embeddings have the potential to encode other properties as well. In this paper, we propose a new perspective on the analysis of word embeddings by focusing on spelling variation. In social media, spelling variation is abundant and often socially meaningful. Here, we analyze word embeddings trained on Twitter and Reddit data. We present three analyses using pairs of word forms covering seven types of spelling variation in English. Taken together, our results show that word embeddings encode spelling variation patterns of various types to some extent, even embeddings trained using the skipgram model which does not take spelling into account. Our results also suggest a link between the intentionality of the variation and the distance of the non-conventional spellings to their conventional spellings. Dong Nguyen and Jack Grieve link 2020 COLING, pages 870–881
We Need to Measure Data Diversity in NLP – Better and BroaderAlthough diversity in NLP datasets has received growing attention, the question of how to measure it remains largely underexplored. This opinion paper examines the conceptual and methodological challenges of measuring data diversity and argues that interdisciplinary perspectives are essential for developing more fine-grained and valid measures. Dong Nguyen and Esther Ploeger link 2025 arXiv preprint arXiv:2505.20264
On learning and representing social meaning in NLP: a sociolinguistic perspectiveThe field of NLP has made substantial progress in building meaning representations. However, an important aspect of linguistic meaning, social meaning, has been largely overlooked. We introduce the concept of social meaning to NLP and discuss how insights from sociolinguistics can inform work on representation learning in NLP. We also identify key challenges for this new line of research. Dong Nguyen, Laura Rosseel, and Jack Grieve link 2021 NAACL, pages 603–612
The Multi-Dimensional Analysis TaggerPLACEHOLDER Andrea Nini link 2019 Multi-Dimensional Analysis: Research Methods and Current Issues
A Theory of Linguistic Individuality for Authorship AnalysisThis Element examines progress in research and practice in forensic authorship analysis. It describes the existing research base and examines what makes an authorship analysis more or less reliable. Further to this, the author describes the recent history of forensic science and the scientific revolution brought about by the invention of DNA evidence. They chart the rise of three major changes in forensic science – the recognition of contextual bias in analysts, the need for validation studies and shift in logic of providing identification evidence. This Element addresses the idea of progress in forensic authorship analysis in terms of these three issues with regard to new knowledge about the nature of authorship and methods in stylistics and stylometry. The author proposes that the focus needs to shift to validation of protocols for approaching case questions, rather than on validation of systems or general approaches. This title is also available as Open Access on Cambridge Core. Andrea Nini link 2023 Cambridge University Press
A study of style in machine translation: Controlling the formality of machine translation outputStylistic variations of language, such as formality, carry speakers’ intention beyond literal meaning and should be conveyed adequately in translation. We propose to use lexical formality models to control the formality level of machine translation output. We demonstrate the effectiveness of our approach in empirical evaluations, as measured by automatic metrics and human assessments. Xing Niu, Marianna Martindale, and Marine Carpuat link 2017 EMNLP, pages 2814–2819
Multi-task neural models for translating between styles within and across languagesGenerating natural language requires conveying content in an appropriate style. We explore two related tasks on generating text of varying formality: monolingual formality transfer and formality-sensitive machine translation. We propose to solve these tasks jointly using multi-task learning, and show that our models achieve state-of-the-art performance for formality transfer and are able to perform formality-sensitive translation without being explicitly trained on style-annotated translation examples. Xing Niu, Sudha Rao, and Marine Carpuat link 2018 COLING, pages 1008–1021
2 OLMo 2 FuriousWe present OLMo 2, the next generation of our fully open language models. OLMo 2 includes a family of dense autoregressive language models at 7B, 13B and 32B scales with fully released artifacts -- model weights, full training data, training code and recipes, training logs and thousands of intermediate checkpoints. In this work, we describe our modified model architecture and training recipe, focusing on techniques for achieving better training stability and improved per-token efficiency. Our updated pretraining data mixture introduces a new, specialized data mix called Dolmino Mix 1124, which significantly improves model capabilities across many downstream task benchmarks when introduced via late-stage curriculum training (i.e. specialized data during the annealing phase of pretraining). Finally, we incorporate best practices from Tülu 3 to develop OLMo 2-Instruct, focusing on permissive data and extending our final-stage reinforcement learning with verifiable rewards (RLVR). Our OLMo 2 base models sit at the Pareto frontier of performance to training compute, often matching or outperforming open-weight only models like Llama 3.1, Qwen 2.5, and Gemma 2 while using fewer FLOPs and with fully transparent training data, code, and recipe. Our fully open OLMo 2-Instruct models are competitive with open-weight only models of comparable size and even some proprietary models like GPT-3.5 Turbo and GPT 4o Mini. Team OLMo, Pete Walsh, Luca Soldaini, et al. link 2025 arXiv preprint arXiv:2501.00656
Linguistic style and crowdfunding success among social and commercial entrepreneursPLACEHOLDER Annaleena Parhankangas and Maija Renko link 2017 Journal of Business Venturing, 32(2):215–236
Learning interpretable style embeddings via prompting LLMsStyle representation learning builds content-independent representations of author style in text. To date, no large dataset of texts with stylometric annotations on a wide range of style dimensions has been compiled, perhaps because the linguistic expertise to perform such annotation would be prohibitively expensive. Therefore, current style representation approaches make use of unsupervised neural methods to disentangle style from content to create style vectors. These approaches, however, result in uninterpretable representations, complicating their usage in downstream applications like authorship attribution where auditing and explainability is critical. In this work, we use prompting to perform stylometry on a large number of texts to generate a synthetic stylometry dataset. We use this synthetic data to then train human-interpretable style representations we call LISA embeddings. We release our synthetic dataset (StyleGenome) and our interpretable style embedding model (LISA) as resources. Ajay Patel, Delip Rao, Ansh Kothary, Kathleen McKeown, and Chris Callison-Burch link 2023 Findings of EMNLP 2023, pages 15270–15290
StyleDistance: Stronger content-independent style embeddings with synthetic parallel examplesStyle representations aim to embed texts with similar writing styles closely and texts with different styles far apart, regardless of content. However, the contrastive triplets often used for training these representations may vary in both style and content, leading to potential content leakage in the representations. We introduce StyleDistance, a novel approach to training stronger content-independent style embeddings. We use a large language model to create a synthetic dataset of near-exact paraphrases with controlled style variations, and produce positive and negative examples across 40 distinct style features for precise contrastive learning. We assess the quality of our synthetic data and embeddings through human and automatic evaluations. StyleDistance enhances the content-independence of style embeddings, which generalize to real-world benchmarks and outperform leading style representations in downstream applications. Ajay Patel, Jiacheng Zhu, Justin Qiu, Zachary Horvitz, Marianna Apidianaki, Kathleen McKeown, and Chris Callison-Burch link 2025 NAACL, pages 8662–8685
Language independent authorship attribution using character level language modelsPLACEHOLDER Fuchun Peng, Dale Schuurmans, Shaojun Wang, and Vlado Keselj link 2003 EACL, pages 267–274
The Development and Psychometric Properties of LIWC2015PLACEHOLDER James W. Pennebaker, Ryan L. Boyd, Kayla Jordan, and Kate Blackburn link 2015 University of Texas at Austin
JSAN–The Integrated JStylo and Anonymouth PackagePLACEHOLDER Drexel University PSAL link 2013 Drexel University
Mind the style of text! adversarial and backdoor attacks based on text style transferPLACEHOLDER Fanchao Qi, Yangyi Chen, Xurui Zhang, Mukai Li, Zhiyuan Liu, and Maosong Sun link 2021 EMNLP, pages 4569–4580
mStyleDistance: Multilingual style embeddings and their evaluationStyle embeddings are useful for stylistic analysis and style transfer, yet they only exist for English. We introduce Multilingual StyleDistance (mStyleDistance), a method that can generate style embeddings in new languages using synthetic data and a contrastive loss. We create style embeddings in nine languages and a multilingual STEL-or-Content benchmark (Wegmann et al., 2022) that serves to assess their quality. We also employ our embeddings in an authorship verification task involving different languages. Our results show that mStyleDistance embeddings outperform existing style embeddings on these benchmarks and generalize well to unseen features and languages. We make our models and datasets publicly available. Justin Qiu, Jiacheng Zhu, Ajay Patel, Marianna Apidianaki, and Chris Callison-Burch link 2025 Findings of ACL 2025, pages 16917–16931
Personalized machine translation: Preserving original author traitsThe language that we produce reflects our personality, and various personal and demographic characteristics can be detected in natural language texts. We focus on one particular personal trait of the author, gender, and study how it is manifested in original texts and in translations. We show that author’s gender has a powerful, clear signal in originals texts, but this signal is obfuscated in human and machine translation. We then propose simple domain-adaptation techniques that help retain the original gender traits in the translation, without harming the quality of the translation, thereby creating more personalized machine translation systems. Ella Rabinovich, Raj Nath Patel, Shachar Mirkin, Lucia Specia, and Shuly Wintner link 2017 EACL, pages 1074–1084
Overview of the author profiling task at PAN 2013PLACEHOLDER Francisco Rangel, Paolo Rosso, Moshe Koppel, Efstathios Stamatatos, and Giacomo Inches link 2013 CLEF
Dear sir or madam, may I introduce the GYAFC dataset: Corpus, benchmarks and metrics for formality style transferStyle transfer is the task of automatically transforming a piece of text in one particular style into another. A major barrier to progress in this field has been a lack of training and evaluation datasets, as well as benchmarks and automatic metrics. In this work, we create the largest corpus for a particular stylistic transfer (formality) and show that techniques from the machine translation community can serve as strong baselines for future work. We also discuss challenges of using automatic metrics. Sudha Rao and Joel Tetreault link 2018 NAACL, pages 129–140
A recipe for arbitrary text style transfer with large language modelsIn this paper, we leverage large language models (LLMs) to perform zero-shot text style transfer. We present a prompting method that we call augmented zero-shot learning, which frames style transfer as a sentence rewriting task and requires only a natural language instruction, without model fine-tuning or exemplars in the target style. Augmented zero-shot learning is simple and demonstrates promising results not just on standard style transfer tasks such as sentiment, but also on arbitrary transformations such as ‘make this melodramatic’ or ‘insert a metaphor.’ Emily Reif, Daphne Ippolito, Ann Yuan, Andy Coenen, Chris Callison-Burch, and Jason Wei link 2022 ACL, pages 837–848
Addressee- and topic-influenced style shift: A quantitative sociolinguistic studyThis chapter is a study of addressee-and topic-influenced style shift in language, within the framework of quantitative or “variationist” sociolinguistics. The first section is written from a theoretical, history-of-science perspective; we begin by contrasting the taxonomic, polydimensional approach of sociolinguists like Hymes (1972) and Halliday (1978) with the empirical, unidimensional approach of Labov (1966:90-135, 1972a:70-109), for whom styles were ordered on a single dimension, involving attention paid to speech. We suggest that the neglect of style within the American variationist school from the 1970s onward was due in part to methodological and theoretical difficulties with this approach. As we note, an alternative unidimensional approach, considering style as audience accommodation (Giles and Powesland 1975, Bell 1984), is more promising, but although several quantitative studies within this framework have been made over the past decade and a half, most of them were done outside the United States, primarily in Britain. John R. Rickford and Faye McNair-Knox link 1994 Sociolinguistic Perspectives on Register, pages 235–276
Few-shot detection of machine-generated text using style representationsPLACEHOLDER Rafael Rivera Soto, Kailin Koch, Aleem Khan, Barry Y. Chen, Marcus Bishop, and Nicholas Andrews link 2024 ICLR
Learning universal authorship representationsDetermining whether two documents were composed by the same author, also known as authorship verification, has traditionally been tackled using statistical methods. Recently, authorship representations learned using neural networks have been found to outperform alternatives, particularly in large-scale settings involving hundreds of thousands of authors. But do such representations learned in a particular domain transfer to other domains? Or are these representations inherently entangled with domain-specific features? To study these questions, we conduct the first large-scale study of cross-domain transfer for authorship verification considering zero-shot transfers involving three disparate domains: Amazon reviews, fanfiction short stories, and Reddit comments. We find that although a surprising degree of transfer is possible between certain domains, it is not so successful between others. We examine properties of these domains that influence generalization and propose simple but effective methods to improve transfer. Rafael Rivera Soto, Olivia Elizabeth Miano, Juanita Ordonez, Barry Y. Chen, Aleem Khan, Marcus Bishop, and Nicholas Andrews link 2021 EMNLP, main conference
My LLM might Mimic AAE - But When Should It?We examine the representation of African American English (AAE) in large language models (LLMs), exploring (a) the perceptions Black Americans have of how effective these technologies are at producing authentic AAE, and (b) in what contexts Black Americans find this desirable. Sandra Camille Sandoval, Christabel Acquaye, Kwesi Adu Cobbina, Mohammad Nayeem Teli, and Hal Daumé III link 2025 NAACL, pages 5277–5302
Topic-regularized authorship representation learningAuthorship attribution is a task that aims to identify the author of a given piece of writing. We aim to develop a generalized solution that can handle a large number of texts from authors and topics unavailable in training data. Previous studies have proposed strategies to address only either unseen authors or unseen topics. Authorship representation learning has been shown to work in open-set environments with a large number of unseen authors but has not been explicitly designed for cross-topic environments at the same time. To handle a large number of unseen authors and topics, we propose Authorship Representation Regularization (ARR), a distillation framework that creates authorship representation with reduced reliance on topic-specific information. To assess the performance of our framework, we also propose a cross-topic-open-set evaluation method. Our proposed method has improved performances in the cross-topic-open set setup over baselines in 4 out of 6 cases. Jitkapat Sawatphol, Nonthakit Chaiwong, Can Udomcharoenchaikit, and Sarana Nutanong link 2022 EMNLP, pages 1076–1082
Addressing Topic Leakage in Cross-Topic Evaluation for Authorship VerificationAuthorship verification (AV) aims to identify whether a pair of texts has the same author. We address the challenge of evaluating AV models’ robustness against topic shifts. The conventional evaluation assumes minimal topic overlap between training and test data. However, we argue that there can still be topic leakage in test data, causing misleading model performance and unstable rankings. To address this, we propose an evaluation method called Heterogeneity-Informed Topic Sampling (HITS), which creates a smaller dataset with a heterogeneously distributed topic set. Our experimental results demonstrate that HITS-sampled datasets yield a more stable ranking of models across random seeds and evaluation splits. Our contributions include: 1. An analysis of causes and effects of topic leakage; 2. A demonstration of the HITS in reducing the effects of topic leakage; and 3. The Robust Authorship Verification bENchmark (RAVEN) that allows topic shortcut test to uncover AV models’ reliance on topic-specific features. Jitkapat Sawatphol, Can Udomcharoenchaikit, and Sarana Nutanong link 2024 TACL, 12:1363–1377
MATCHED: Multimodal Authorship-Attribution To Combat Human Trafficking in Escort-Advertisement DataHuman trafficking (HT) remains a critical issue, with traffickers increasingly leveraging online escort advertisements to advertise victims anonymously. Existing detection methods, including text-based Authorship Attribution (AA), overlook the multimodal nature of these ads, which combine text and images. To bridge this gap, we introduce MATCHED, a multimodal AA dataset comprising 27,619 unique text descriptions and 55,115 unique images sourced from Backpage across seven U.S. cities in four geographic regions. This study extensively benchmarks text-only, vision-only, and multimodal baselines for vendor identification and verification tasks, employing multitask (joint) training objectives that achieve superior classification and retrieval performance on in-sample and out-of-data distribution datasets. The results demonstrate that while text remains the dominant modality, integrating visual features adds stylistic cues that enrich model performance. Moreover, text-image alignment strategies like CLIP and BLIP2 struggle due to low semantic overlap and vague connections between the modalities of escort ads, with end-to-end multimodal training proving more robust. Our findings emphasize the potential of multimodal AA to combat HT, providing Law Enforcement Agencies with robust tools to link advertisements and disrupt trafficking networks. Vageesh Kumar Saxena, Benjamin Ashpole, Gijs Van Dijck, and Gerasimos Spanakis link 2025 Findings of ACL 2025, pages 4334–4373
Frequent-words analysis for forensic speaker comparisonPLACEHOLDER Eleni-Konstantina Sergidou, Nelleke Scheijen, Jeannette Leegwater, Tina Cambier-Langeveld, and Wauter Bosma link 2023 Speech Communication, 150:1–8
The power of words: Driving online consumer engagement in FintechPurpose: This study aims to explore the role of the linguistic style used in the brand-posted social media content on consumer engagement in the Fintech domain. Design/methodology/approach: A total of 3,286 tweets (registering nearly 1.35 million impressions) published by 10 leading Fintech unicorns in India were extracted using the Twitter API. The Linguistic Inquiry and Word Count (LIWC) dictionary was used to analyse the linguistic characteristics of the shared tweets. Negative Binomial Regression (NBR) was used for testing the hypotheses. Findings: This study finds that using drive words and cognitive language increases consumer engagement with Fintech messages via the central route of information processing. Further, affective words and conversational language drive consumer engagement through the peripheral route of information processing. Research limitations/implications: The study extends the literature on brand engagement by unveiling the effect of linguistic features used to design social media messages. Practical implications: The study provides guidance to social media marketers of Fintech brands regarding what content strategies best enhance consumer engagement. The linguistic style to improve online consumer engagement (OCE) is detailed. Originality/value: The study’s findings contribute to the growing stream of Fintech literature by exploring the role of linguistic style on consumer engagement in social media communication. The study’s findings indicate the relevance of the dual processing mechanism of elaboration likelihood model (ELM) as an explanatory theory for evaluating consumer engagement with messages posted by Fintech brands. R.V. ShabbirHusain, Atul Arun Pathak, Shabana Chandrasekaran, and Balamurugan Annamalai link 2023 International Journal of Bank Marketing, 42(2):331–355
Style transfer from non-parallel text by cross-alignmentPLACEHOLDER Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola link 2017 NIPS'17, pages 6833–6844
Does string-based neural MT learn source syntax?PLACEHOLDER Xing Shi, Inkit Padhi, and Kevin Knight link 2016 EMNLP, pages 1526–1534
Personalized author obfuscation with large language modelsIn this paper, we investigate the efficacy of large language models (LLMs) in obfuscating authorship by paraphrasing and altering writing styles. Rather than adopting a holistic approach that evaluates performance across the entire dataset, we focus on user-wise performance to analyze how obfuscation effectiveness varies across individual authors. While LLMs are generally effective, we observe a bimodal distribution of efficacy, with performance varying significantly across users. To address this, we propose a personalized prompting method that outperforms standard prompting techniques and partially mitigates the bimodality issue. Mohammad Shokri, Sarah Ita Levitan, and Rivka Levitan link 2025 arXiv preprint arXiv:2505.12090
A survey of modern authorship attribution methods Efstathios Stamatatos link 2009 JASIST, 60(3):538–556
Masking topic-related information to enhance authorship attributionAuthorship attribution attempts to reveal the authors of documents. In recent years, research in this field has grown rapidly. However, the performance of state-of-the-art methods is heavily affected when texts of known authorship and texts under investigation differ in topic and/or genre. So far, it is not clear how to quantify the personal style of authors in a way that is not affected by topic shifts or genre variations. In this paper, a set of text distortion methods are used attempting to mask topic-related information. These methods transform the input texts into a more topic-neutral form while maintaining the structure of documents associated with the personal style of the author. Using a controlled corpus that includes a fine-grained range of topics and genres it is demonstrated how the proposed approach can be combined with existing authorship attribution methods to enhance their performance in very challenging tasks, especially in cross-topic attribution. We also examine cross-genre attribution and the most challenging, yet realistic, cross-topic-and-genre attribution scenarios and show how the proposed techniques should be tuned to enhance performance in these tasks. Finally, we demonstrate that there are important differences in attribution effectiveness when either conversational genres, nonconversational genres, or a mix of them are considered. Efstathios Stamatatos link 2017 JASIST, 69(3):461–473
Multi-label style change detection by solving a binary classification problem Eivind Strøm link 2021 CLEF 2021, pages 2146–2157
Dialect-robust evaluation of generated textText generation metrics that are not robust to dialect variation make it impossible to tell how well systems perform for many groups of users, and can even penalize systems for producing text in lower-resource dialects. In this paper, we introduce a suite of methods to assess whether metrics are dialect robust. These methods show that state-of-the-art metrics are not dialect robust: they often prioritize dialect similarity over semantics, preferring outputs that are semantically incorrect over outputs that match the semantics of the reference but contain dialect differences. As a step towards dialect-robust metrics for text generation, we propose NANO, which introduces regional and language information to the metric’s pretraining. NANO significantly improves dialect robustness while preserving the correlation between automated metrics and human ratings. It also enables a more ambitious approach to evaluation, dialect awareness, in which system outputs are scored by both semantic match to the reference and appropriateness in any specified dialect. Jiao Sun, Thibault Sellam, Elizabeth Clark, Tu Vu, Timothy Dozat, Dan Garrette, Aditya Siddhant, Jacob Eisenstein, and Sebastian Gehrmann link 2023 ACL, pages 6010–6028
Idiosyncrasies in large language modelsIn this work, we unveil and study idiosyncrasies in Large Language Models (LLMs) -- unique patterns in their outputs that can be used to distinguish the models. To do so, we consider a simple classification task: given a particular text output, the objective is to predict the source LLM that generates the text. We evaluate this synthetic task across various groups of LLMs and find that simply fine-tuning text embedding models on LLM-generated texts yields excellent classification accuracy. Notably, we achieve 97.1% accuracy on held-out validation data in the five-way classification problem involving ChatGPT, Claude, Grok, Gemini, and DeepSeek. Our further investigation reveals that these idiosyncrasies are rooted in word-level distributions. These patterns persist even when the texts are rewritten, translated, or summarized by an external LLM, suggesting that they are also encoded in the semantic content. Additionally, we leverage LLM as judges to generate detailed, open-ended descriptions of each model's idiosyncrasies. Finally, we discuss the broader implications of our findings, including training on synthetic data, inferring model similarity, and robust evaluation of LLMs. Code is available at https://github.com/locuslab/llm-idiosyncrasies. Mingjie Sun, Yida Yin, Zhiqiu Xu, J. Zico Kolter, and Zhuang Liu link 2025 arXiv:2502.12150
Unsupervised neural text simplificationThe paper presents a first attempt towards unsupervised neural text simplification that relies only on unlabeled text corpora. The core framework is composed of a shared encoder and a pair of attentional-decoders, crucially assisted by discrimination-based losses and denoising. The framework is trained using unlabeled text collected from en-Wikipedia dump. Our analysis (both quantitative and qualitative involving human evaluators) on public test data shows that the proposed model can perform text-simplification at both lexical and syntactic levels, competitive to existing supervised methods. It also outperforms viable unsupervised baselines. Adding a few labeled pairs helps improve the performance further. Sai Surya, Abhijit Mishra, Anirban Laha, Parag Jain, and Karthik Sankaranarayanan link 2019 ACL, pages 2058–2068
What do you learn from context? Probing for sentence structure in contextualized word representationsContextualized representation models such as ELMo (Peters et al., 2018a) and BERT (Devlin et al., 2018) have recently achieved state-of-the-art results on a diverse array of downstream NLP tasks. Building on recent token-level probing work, we introduce a novel edge probing task design and construct a broad suite of sub-sentence tasks derived from the traditional structured NLP pipeline. We probe word-level contextual representations from four recent models and investigate how they encode sentence structure across a range of syntactic, semantic, local, and long-range phenomena. We find that existing models trained on language modeling and translation produce strong representations for syntactic phenomena, but only offer comparably small improvements on semantic tasks over a non-contextual baseline. Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R. Thomas McCoy, Najoung Kim, Benjamin Van Durme, Samuel R. Bowman, Dipanjan Das, and Ellie Pavlick link 2018 ICLR
Writing Style Author Embedding EvaluationLearning authors representations from their textual productions is now widely used to solve multiple downstream tasks, such as classification, link prediction or user recommendation. Author embedding methods are often built on top of either Doc2Vec (Mikolov et al. 2014) or the Transformer architecture (Devlin et al. 2019). Evaluating the quality of these embeddings and what they capture is a difficult task. Most articles use either classification accuracy or authorship attribution, which does not clearly measure the quality of the representation space, if it really captures what it has been built for. In this paper, we propose a novel evaluation framework of author embedding methods based on the writing style. It allows to quantify if the embedding space effectively captures a set of stylistic features, chosen to be the best proxy of an author writing style. This approach gives less importance to the topics conveyed by the documents. It turns out that recent models are mostly driven by the inner semantic of authors’ production. They are outperformed by simple baselines, based on state-of-the-art pretrained sentence embedding models, on several linguistic axes. These baselines can grasp complex linguistic phenomena and writing style more efficiently, paving the way for designing new style-driven author embedding models. Enzo Terreau, Antoine Gourru, and Julien Velcin link 2021 Evaluation and Comparison of NLP Systems Workshop, pages 84–93
StAyaL | Multilingual Style TransferStylistic text generation plays a vital role in enhancing communication by reflecting the nuances of individual expression. This paper presents a novel approach for generating text in a specific speaker's style across different languages. We show that by leveraging only 100 lines of text, an individual's unique style can be captured as a high-dimensional embedding, which can be used for both text generation and stylistic translation. This methodology breaks down the language barrier by transferring the style of a speaker between languages. The paper is structured into three main phases: augmenting the speaker's data with stylistically consistent external sources, separating style from content using machine learning and deep learning techniques, and generating an abstract style profile by mean pooling the learned embeddings. The proposed approach is shown to be topic-agnostic, with test accuracy and F1 scores of 74.9% and 0.75, respectively. The results demonstrate the potential of the style profile for multilingual communication, paving the way for further applications in personalized content generation and cross-linguistic stylistic transfer. Karishma Thakrar, Katrina Lawrence, and Kyle Howard link 2025 arXiv:2501.11639
RedDust: A large reusable dataset of Reddit user traits Anna Tigunova, Paramita Mirza, Andrew Yates, and Gerhard Weikum link 2020 LREC, pages 6118–6126
HANSEN: Human and AI spoken text benchmark for authorship analysis Nafis Tripto, Adaku Uchendu, Thai Le, Mattia Setzu, Fosca Giannotti, and Dongwon Lee link 2023 Findings of EMNLP 2023, pages 13706–13724
Research Methods: The Essential Knowledge Base William M. K. Trochim, James P. Donnelly, and Kanika Arora link 2015 Cengage Learning
Persona-Augmented Benchmarking: Evaluating LLMs Across Diverse Writing StylesCurrent benchmarks for evaluating Large Language Models (LLMs) often do not exhibit enough writing style diversity, with many adhering primarily to standardized conventions. Such benchmarks do not fully capture the rich variety of communication patterns exhibited by humans. Thus, it is possible that LLMs, which are optimized on these benchmarks, may demonstrate brittle performance when faced with "non-standard" input. In this work, we test this hypothesis by rewriting evaluation prompts using persona-based LLM prompting, a low-cost method to emulate diverse writing styles. Our results show that, even with identical semantic content, variations in writing style and prompt formatting significantly impact the estimated performance of the LLM under evaluation. Notably, we identify distinct writing styles that consistently trigger either low or high performance across a range of models and tasks, irrespective of model family, size, and recency. Our work offers a scalable approach to augment existing benchmarks, improving the external validity of the assessments they provide for measuring LLM performance across linguistic variations. Kimberly Le Truong, Riccardo Fogliato, Hoda Heidari, and Zhiwei Steven Wu link 2025 arXiv preprint ArXiv:2507.22168
Authorship attribution for neural text generationIn recent years, the task of generating realistic short and long texts have made tremendous advancements. In particular, several recently proposed neural network-based language models have demonstrated their astonishing capabilities to generate texts that are challenging to distinguish from human-written texts with the naked eye. Despite many benefits and utilities of such neural methods, in some applications, being able to tell the “author” of a text in question becomes critically important. In this work, in the context of this Turing Test, we investigate the so-called authorship attribution problem in three versions: (1) given two texts T1 and T2, are both generated by the same method or not? (2) is the given text T written by a human or machine? (3) given a text T and k candidate neural methods, can we single out the method (among k alternatives) that generated T? Against one human-written and eight machine-generated texts (i.e., CTRL, GPT, GPT2, GROVER, XLM, XLNET, PPLM, FAIR), we empirically experiment with the performance of various models in three problems. By and large, we find that most generators still generate texts significantly different from human-written ones, thereby making three problems easier to solve. However, the qualities of texts generated by GPT2, GROVER, and FAIR are better, often confusing machine classifiers in solving three problems. All codes and datasets of our experiments are available at: https://bit.ly/302zWdz Adaku Uchendu, Thai Le, Kai Shu, and Dongwon Lee link 2020 EMNLP, pages 8384–8395
Paraphrase types elicit prompt engineering capabilitiesMuch of the success of modern language models depends on finding a suitable prompt to instruct the model. Until now, it has been largely unknown how variations in the linguistic expression of prompts affect these models. This study systematically and empirically evaluates which linguistic features influence models through paraphrase types, i.e., different linguistic changes at particular positions. We measure behavioral changes for five models across 120 tasks and six families of paraphrases (i.e., morphology, syntax, lexicon, lexico-syntax, discourse, and others). We also control for other prompt engineering factors (e.g., prompt length, lexical diversity, and proximity to training data). Our results show a potential for language models to improve tasks when their prompts are adapted in specific paraphrase types (e.g., 6.7% median gain in Mixtral 8x7B; 5.5% in LLaMA 3 8B). In particular, changes in morphology and lexicon, i.e., the vocabulary used, showed promise in improving prompts. These findings contribute to developing more robust language models capable of handling variability in linguistic expression. Jan Philip Wahle, Terry Ruas, Yang Xu, and Bela Gipp link 2024 EMNLP, pages 11004–11033
Can authorship representation learning capture stylistic features?Abstract Automatically disentangling an author’s style from the content of their writing is a longstanding and possibly insurmountable problem in computational linguistics. At the same time, the availability of large text corpora furnished with author labels has recently enabled learning authorship representations in a purely data-driven manner for authorship attribution, a task that ostensibly depends to a greater extent on encoding writing style than encoding content. However, success on this surrogate task does not ensure that such representations capture writing style since authorship could also be correlated with other latent variables, such as topic. In an effort to better understand the nature of the information these representations convey, and specifically to validate the hypothesis that they chiefly encode writing style, we systematically probe these representations through a series of targeted experiments. The results of these experiments suggest that representations learned for the surrogate authorship prediction task are indeed sensitive to writing style. As a consequence, authorship representations may be expected to be robust to certain kinds of data shift, such as topic drift over time. Additionally, our findings may open the door to downstream applications that require stylistic representations, such as style transfer. Andrew Wang, Cristina Aggazzotti, Rebecca Kotula, Rafael Rivera Soto, Marcus Bishop, and Nicholas Andrews link 2023 TACL, 11:1416–1431
Feature vector difference based neural network and logistic regression models for authorship verification Janith Weerasinghe and Rachel Greenstadt link 2020 PAN at CLEF 2020, 2695
Does it capture STEL? a modular, similarity-based linguistic style evaluation frameworkStyle is an integral part of natural language. However, evaluation methods for style measures are rare, often task-specific and usually do not control for content. We propose the modular, fine-grained and content-controlled similarity-based STyle EvaLuation framework (STEL) to test the performance of any model that can compare two sentences on style. We illustrate STEL with two general dimensions of style (formal/informal and simple/complex) as well as two specific characteristics of style (contrac’tion and numb3r substitution). We find that BERT-based methods outperform simple versions of commonly used style measures like 3-grams, punctuation frequency and LIWC-based approaches. We invite the addition of further tasks and task instances to STEL and hope to facilitate the improvement of style-sensitive measures. Anna Wegmann and Dong Nguyen link 2021 EMNLP, pages 7109–7130
Tokenization is sensitive to language variationVariation in language is ubiquitous and often systematically linked to regional, social, and contextual factors. Tokenizers split texts into smaller units and might behave differently for less common linguistic forms. This might affect downstream LLM performance differently on two types of tasks: Tasks where the model should be robust to language variation (e.g., for semantic tasks like NLI, labels do not depend on whether a text uses British or American spelling) and tasks where the model should be sensitive to language variation (e.g., for form-based tasks like authorship verification, labels depend on whether a text uses British or American spelling). We pre-train BERT base models with the popular Byte-Pair Encoding algorithm to investigate how key tokenization design choices impact the performance of downstream models: the corpus used to train the tokenizer, the pre-tokenizer and the vocabulary size. We find that the best tokenizer varies on the two task types and that the pre-tokenizer has the biggest overall impact on performance. Further, we introduce a new approach to estimate tokenizer impact on downstream LLM performance, showing substantial improvement over metrics like Rényi efficiency. We encourage more work on language variation and its relation to tokenizers and thus LLM performance. Anna Wegmann, Dong Nguyen, and David Jurgens link 2025 Findings of ACL 2025, pages 10958–10983
Same Author or Just Same Topic? Towards Content-Independent Style RepresentationsLinguistic style is an integral component of language. Recent advances in the development of style representations have increasingly used training objectives from authorship verification (AV): Do two texts have the same author? The assumption underlying the AV training task (same author approximates same writing style) enables self-supervised and, thus, extensive training. However, a good performance on the AV task does not ensure good “general-purpose” style representations. For example, as the same author might typically write about certain topics, representations trained on AV might also encode content information instead of style alone. We introduce a variation of the AV training task that controls for content using conversation or domain labels. We evaluate whether known style dimensions are represented and preferred over content information through an original variation to the recently proposed STEL framework. We find that representations trained by controlling for conversation are better than representations trained with domain or no content control at representing style independent from content. Anna Wegmann, Marijn Schraagen, and Dong Nguyen link 2022 RepL4NLP Workshop, pages 249–268
Constraints on the agentless passiveThis paper is a quantitative study of the factors that determine the selection of passive constructions over active ones by English speakers. By examining a large body of passives used in spontaneous speech, together with the sentences that show an opposing choice, we are able to throw light on the crucial question of which syntactic and which semantic features of the environment act to constrain the choice and whether syntactic or semantic factors predominate in this case. In the course of the analysis, we will also have something to say about the social factors that have been reported to determine the use of the passive. E. Judith Weiner and William Labov link 1983 Journal of Linguistics, 19(1):29–58
Disentangling style factors from speaker representations Jennifer Williams and Simon King link 2019 Interspeech, pages 3945–3949
Style over substance: Evaluation biases for large language modelsAs large language models (LLMs) continue to advance, accurately and comprehensively evaluating their performance becomes increasingly challenging. Ranking the relative performance of LLMs based on Elo ratings, according to human or LLM judgment, is gaining more popularity. However, the extent to which humans and LLMs are capable evaluators remains uncertain. This study investigates the behavior of crowd-sourced and expert annotators, as well as LLMs, when comparing outputs from different models. To achieve this, we curate a dataset of intentionally flawed, machine-generated answers. Our findings reveal a concerning bias in the evaluation process, as answers with factual errors are rated more favorably than answers that are too short or contained grammatical errors. To address this issue, we propose independently evaluating machine-generated text across multiple dimensions, rather than merging all the evaluation aspects into a single score. We instantiate this idea with the Elo rating system, resulting in the Multi-Elo Rating System (MERS). Empirical results from our study reveal that this proposed approach significantly enhances the quality of LLM-based evaluations, particularly in terms of factual accuracy. However, there is no significant improvement in crowd-sourced evaluations, indicating the need for further investigation. Minghao Wu and Alham Fikri Aji link 2025 COLING, pages 297–312
Out-of-distribution generalization in natural language processing: Past, present, and futureMachine learning (ML) systems in natural language processing (NLP) face significant challenges in generalizing to out-of-distribution (OOD) data, where the test distribution differs from the training data distribution. This poses important questions about the robustness of NLP models and their high accuracy, which may be artificially inflated due to their underlying sensitivity to systematic biases. Despite these challenges, there is a lack of comprehensive surveys on the generalization challenge from an OOD perspective in natural language understanding. Therefore, this paper aims to fill this gap by presenting the first comprehensive review of recent progress, methods, and evaluations on this topic. We further discuss the challenges involved and potential future research directions. By providing convenient access to existing work, we hope this survey will encourage future research in this area. Linyi Yang, Yaoxian Song, Xuan Ren, Chenyang Lyu, Yidong Wang, Jingming Zhuo, Lingqiao Liu, Jindong Wang, Jennifer Foster, and Yue Zhang link 2023 EMNLP, pages 4533–4559
A Survey of Controllable Text Generation Using Transformer-based Pre-trained Language ModelsControllable Text Generation (CTG) is an emerging area in the field of natural language generation (NLG). It is regarded as crucial for the development of advanced text generation technologies that better meet the specific constraints in practical applications. In recent years, methods using large-scale pre-trained language models (PLMs), in particular the widely used Transformer-based PLMs, have become a new paradigm of NLG, allowing generation of more diverse and fluent text. However, due to the limited level of interpretability of deep neural networks, the controllability of these methods needs to be guaranteed. To this end, controllable text generation using Transformer-based PLMs has become a rapidly growing yet challenging new research hotspot. A diverse range of approaches have emerged in the past 3 to 4 years, targeting different CTG tasks that require different types of controlled constraints. In this article, we present a systematic critical review on the common tasks, main approaches, and evaluation methods in this area. Finally, we discuss the challenges that the field is facing, and put forward various promising future directions. To the best of our knowledge, this is the first survey article to summarize the state-of-the-art CTG techniques from the perspective of Transformer-based PLMs. We hope it can help researchers and practitioners in the related fields to quickly track the academic and technological frontier, providing them with a landscape of the area and a roadmap for future research. Hanqing Zhang, Haolin Song, Shaoyu Li, Ming Zhou, and Dawei Song link 2023 ACM Computing Surveys, 56(3):64:1–64:37
Personalized Text Generation with Contrastive Activation SteeringPersonalized text generation aims to infer users’ writing style preferences from their historical texts and generate outputs that faithfully reflect these stylistic characteristics. Existing solutions primarily adopt two paradigms: retrieval-augmented generation (RAG) and parameter-efficient fine-tuning (PEFT). While these approaches have advanced the field, they suffer from two critical limitations: (1) the entanglement of content semantics and stylistic patterns in historical texts impedes accurate modeling of user-specific writing preferences; and (2) scalability challenges arising from both RAG’s inference latency by retrieval operations and PEFT’s parameter storage requirements for per user model. To overcome these limitations, we propose StyleVector, a training-free framework that disentangles and represents personalized writing style as a vector in LLM’s activation space, enabling style-steered generation during inference without requiring costly retrieval or parameter storage. Comprehensive experiments demonstrate that our framework achieves a significant 8% relative improvement in personalized generation while reducing storage requirements by 1700×. Jinghao Zhang, Yuting Liu, Wenjie Wang, Qiang Liu, Shu Wu, Liang Wang, and Tat-Seng Chua link 2025 ACL, pages 7128–7141
How Well Do Text Embedding Models Understand Syntax?Text embedding models have significantly contributed to advancements in natural language processing by adeptly capturing semantic properties of textual data. However, the ability of these models to generalize across a wide range of syntactic contexts remains under-explored. In this paper, we first develop an evaluation set, named SR, to scrutinize the capability for syntax understanding of text embedding models from two crucial syntactic aspects: Structural heuristics, and Relational understanding among concepts, as revealed by the performance gaps in previous studies. Our findings reveal that existing text embedding models have not sufficiently addressed these syntactic understanding challenges, and such ineffectiveness becomes even more apparent when evaluated against existing benchmark datasets. Furthermore, we conduct rigorous analysis to unearth factors that lead to such limitations and examine why previous evaluations fail to detect such ineffectiveness. Lastly, we propose strategies to augment the generalization ability of text embedding models in diverse syntactic scenarios. This study serves to highlight the hurdles associated with syntactic generalization and provides pragmatic guidance for boosting model performance across varied syntactic contexts. Yan Zhang, Zhaopeng Feng, Zhiyang Teng, Zuozhu Liu, and Haizhou Li link 2023 Findings of EMNLP 2023, pages 9717–9728
Personalization of Large Language Models: A SurveyPersonalization of Large Language Models (LLMs) has recently become increasingly important with a wide range of applications. Despite the importance and recent progress, most existing works on personalized LLMs have focused either entirely on (a) personalized text generation or (b) leveraging LLMs for personalization-related downstream applications, such as recommendation systems. In this work, we bridge the gap between these two separate main directions for the first time by introducing a taxonomy for personalized LLM usage and summarizing the key differences and challenges. We provide a formalization of the foundations of personalized LLMs that consolidates and expands notions of personalization of LLMs, defining and discussing novel facets of personalization, usage, and desiderata of personalized LLMs. We then unify the literature across these diverse fields and usage scenarios by proposing systematic taxonomies for the granularity of personalization, personalization techniques, datasets, evaluation methods, and applications of personalized LLMs. Finally, we highlight challenges and important open problems that remain to be addressed. By unifying and surveying recent research using the proposed taxonomies, we aim to provide a clear guide to the existing literature and different facets of personalization in LLMs, empowering both researchers and practitioners. Zhehao Zhang, Ryan A. Rossi, Branislav Kveton, Yijia Shao, Diyi Yang, Hamed Zamani, Franck Dernoncourt, Joe Barrow, Tong Yu, Sungchul Kim, Ruiyi Zhang, Jiuxiang Gu, Tyler Derr, Hongjie Chen, Junda Wu, Xiang Chen, Zichao Wang, Subrata Mitra, Nedim Lipka, Nesreen K. Ahmed, and Yu Wang link 2025 Transactions on Machine Learning Research
Unmasking style sensitivity: A causal analysis of bias evaluation instability in large language modelsNatural language processing applications are increasingly prevalent, but social biases in their outputs remain a critical challenge. While various bias evaluation methods have been proposed, these assessments show unexpected instability when input texts undergo minor stylistic changes. This paper conducts a comprehensive analysis of how different style transformations impact bias evaluation results across multiple language models and bias types using causal inference techniques. Our findings reveal that formality transformations significantly affect bias scores, with informal style showing substantial bias reductions (up to 8.33% in LLaMA-2-13B). We identify appearance bias, sexual orientation bias, and religious bias as most susceptible to style changes, with variations exceeding 20%. Larger models demonstrate greater sensitivity to stylistic variations, with bias measurements fluctuating up to 3.1% more than in smaller models. These results highlight critical limitations in current bias evaluation methods and emphasize the need for reliable and fair assessments of language models. Jiaxu Zhao, Meng Fang, Kun Zhang, and Mykola Pechenizkiy link 2025 ACL, pages 16314–16338
Disentangled sequence to sequence learning for compositional generalizationThere is mounting evidence that existing neural network models, in particular the very popular sequence-to-sequence architecture, struggle to systematically generalize to unseen compositions of seen components. We demonstrate that one of the reasons hindering compositional generalization relates to representations being entangled. We propose an extension to sequence-to-sequence models which encourage disentanglement by adaptively re-encoding (at each time step) the source input. Specifically, we condition the source representations on the newly decoded target context which makes it easier for the encoder to exploit specialized information for each prediction rather than capturing it all in a single forward pass. Experimental results on semantic parsing and machine translation empirically show that our proposal delivers more disentangled representations and better generalization. Hao Zheng and Mirella Lapata link 2022 ACL, pages 4256–4268
Idiosyncratic but not Arbitrary: Learning Idiolects in Online Registers Reveals Distinctive yet Consistent Individual StylesAn individual’s variation in writing style is often a function of both social and personal attributes. While structured social variation has been extensively studied, e.g., gender based variation, far less is known about how to characterize individual styles due to their idiosyncratic nature. We introduce a new approach to studying idiolects through a massive cross-author comparison to identify and encode stylistic features. The neural model achieves strong performance at authorship identification on short texts and through an analogy-based probing task, showing that the learned representations exhibit surprising regularities that encode qualitative and quantitative shifts of idiolectal styles. Through text perturbation, we quantify the relative contributions of different linguistic elements to idiolectal variation. Furthermore, we provide a description of idiolects through measuring inter- and intra-author variation, showing that variation in idiolects is often distinctive yet consistent. Jian Zhu and David Jurgens link 2021 EMNLP, pages 279–297
StyleFlow: Disentangle latent representations via normalizing flow for unsupervised text style transferUnsupervised text style transfer aims to modify the style of a sentence while preserving its content without parallel corpora. Existing approaches attempt to separate content from style, but some words contain both content and style information. It makes them difficult to disentangle, where unsatisfactory disentanglement results in the loss of the content information or the target style. To address this issue, researchers adopted a “cycle reconstruction” mechanism to maintain content information, but it is still hard to achieve satisfactory content preservation due to incomplete disentanglement. In this paper, we propose a new disentanglement-based method, StyleFlow, which effectively avoids the loss of contents through a better cycle reconstruction via a reversible encoder. The reversible encoder is a normalizing flow that can not only produce output given input but also infer the exact input given the output reversely. We design a stack of attention-aware coupling layers, where each layer is reversible and adopts the attention mechanism to improve the content-style disentanglement. Moreover, we propose a data augmentation method based on normalizing flow to enhance the training data. Our experiments on sentiment transfer and formality transfer tasks show that StyleFlow outperforms strong baselines on both content preservation and style transfer. Kangchen Zhu, Zhiliang Tian, Jingyu Wei, Ruifeng Luo, Yiping Song, and Xiaoguang Mao link 2024 LREC-COLING 2024, pages 15384–15397
Trans self-identification and the language of neoliberal selfhood: Agency, power, and the limits of monologic discourseSociocultural linguists share with transgender communities a strong interest in the power of individuals to assert agency over linguistic patterns. For trans people, a key principle of activism is gender self-determination, which treats each individual as the ultimate authority on their own gender identity. This article explores some of the ways gender self-determination and self-identification surface in transgender people’s linguistic practices. Three particular manifestations are highlighted: gendered identity labels, third person pronouns, and body part terminology. The observations offered on these subjects are based on a series of ethnographic projects carried out from 2006–2016 in transgender communities across several metropolitan areas in the United States and in online spaces frequented by trans people. However, the analysis goes beyond mere description by treating this kind of individualized linguistic agency as the product of cultural practice rather than an asocial given. Such a perspective introduces questions concerning why this form of agency arose in the time and place that it has. This article frames gender self-identification as an enactment of neoliberal personhood, in which individuals are framed as the driver of their destiny. What the ideology of neoliberal agency obscures, however, is that agency is not an equally distributed resource. Lal Zimman link 2019 IJSL, 2019(256):147–175
An ensemble-rich multi-aspect approach for robust style change detection Dimitrina Zlatkova, Daniel Kopev, Kristiyan Mitov, Atanas Atanasov, Momchil Hardalov, Ivan Koychev, and Preslav Nakov link 2018 PAN at CLEF-2018
Style change detection with feed-forward neural networks Chaoyuan Zuo, Yu Zhao, and Ritwik Banerjee link 2019 PAN at CLEF 2019, 93