Do you want to do a BSc/MSc or research project in natural language processing (i.e., any algorithm that uses ‘natural language’, e.g., English, as input or output) related to language variation (e.g., how people say something as opposed to what they say; this includes any way that language varies, think differences between identities, regions, …)? If you have topic suggestions that fit these general fields and seem manageable within your given time frame, I am open to discussing those. If you are unsure whether it could be a fit, drop me an email. My personal research focus lies on measuring, evaluating and representing language variation.

Current topics I would find interesting to develop further with you are listed below. Of course, these projects are mere suggestions with very tentative descriptions. Everything is open to change and up for discussion.

Training Data in NLP

  • Dynamic Annotator Recruitment In NLP, data curation, and especially human annotation, is key for every step of the pipeline in training an NLP model. Recently, the field has moved away from aiming for binary-truth annotations (yes, this text contains hate speech; no, this text does not) towards distributions over what people would annotate (80% said hate speech, 20% said not, often also noting annotator demographics etc.), see for example Learning from Disagreement. This implies an expensive shift in data collection: asking a bigger, more diverse group of people to annotate your data points. You could work on methods to optimize the distribution of annotation resources, e.g., how to tell which instances should be annotated by more or fewer annotators, see for example Minimizing manual annotation cost in supervised training from corpora. This project could be largely theory and statistics: simulating different distributions of annotations and working on methods/measures that help in deciding on an ideal resource distribution (see the sketch after this list).
  • Language Variation in Data (Under Construction): Future projects I will work on concern the measurement and incorporation of language variation in (training) data. Think of curating datasets to include texts written by diverse authors (e.g., Gender Identity in Pretrained Language Models: An Inclusive Approach to Data Creation and Probing, Exploring Generational Identity: A Multiparadigm Approach). Option to get creative and collaborate with me on a publication.
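
To give an idea of what the simulation side could look like, here is a minimal sketch (all names, the toy setup, and the allocation heuristic are my own placeholders, not an established method): instances have hidden label distributions, and each round the remaining annotation budget goes to the instances whose estimated distribution is still most uncertain.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical setup: each instance i has a true probability p_i that an
# annotator labels it "hate speech"; we only observe sampled annotations.
true_p = rng.uniform(0.0, 1.0, size=200)

# Start with a small, equal annotation budget per instance.
counts = np.full(200, 3)             # annotations collected so far
hits = rng.binomial(counts, true_p)  # "hate speech" votes observed

def allocate_next(hits, counts, batch=20):
    """Greedy heuristic: give the next annotations to the instances whose
    estimated label distribution is most uncertain (widest Wald interval)."""
    p_hat = (hits + 1) / (counts + 2)  # smoothed estimate
    half_width = 1.96 * np.sqrt(p_hat * (1 - p_hat) / counts)
    return np.argsort(-half_width)[:batch]  # most uncertain first

# Simulate a few rounds of dynamic annotator recruitment.
for _ in range(10):
    chosen = allocate_next(hits, counts)
    hits[chosen] += rng.binomial(1, true_p[chosen])
    counts[chosen] += 1

# In a real experiment you would compare this against a uniform-budget
# baseline of equal total cost.
p_hat = (hits + 1) / (counts + 2)
print("mean abs. error (adaptive):", np.abs(p_hat - true_p).mean())
```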

NLP models and Language Variation

  • Style Evaluation Do you want to expand on an existing NLP project? You could work on testing whether NLP models are able to capture differences in how people express themselves (e.g., whether they use more formal or more informal words). In our paper, we proposed a general framework to do this. However, the framework is far from finished (i.e., far from becoming an actual benchmark). You could work on an existing NLP project and add a new dimension that state-of-the-art “style models” can be tested on. Part of your thesis project would be motivating this new style dimension, collecting data to demonstrate it, and possibly testing models on it. Your contribution could be part of a future publication on expanding the STEL framework. Possible new style dimensions you could work on include:
    + Active vs. passive voice (“The cashier counted the money.” vs. “The money was counted by the cashier.”, see also: https://www.grammarly.com/blog/active-vs-passive-voice/). Here, you could develop an algorithm to detect active as opposed to passive voice (see the sketch after this list).
    + Punctuation use (i.e., !, ?, ., …). Are people using punctuation at all? When are exclamation marks used? When do people repeat the same punctuation mark?
    + Casing. Are people starting a sentence with an upper-case letter or not? Are they writing “i” or “I”? Does it depend on the context?
    + Simple vs. complex language. Is the word “principal” or “main” simpler? (A Report on the Complex Word Identification Shared Task 2018, Optimizing Statistical Machine Translation for Text Simplification)
    + British vs. American vs. Australian … English
    + Grammatically correct vs. incorrect (BLiMP: The Benchmark of Linguistic Minimal Pairs for English) …
    Related Keywords: Style Evaluation, Linguistic Style, Language Variation
  • Style Embeddings In Natural Language Processing, there is a lot of work on training representations of sentences that encode the meaning of a sentence in machine-readable form (i.e., often as vectors in a high-dimensional space, where paraphrases are mapped to the same point). However, less work has been done on learning representations that encode the style (as opposed to the content) of a sentence. Work on this is important to improve NLP models for demographics whose language use differs from “the standard”. You could expand on our current model of style representations (see here), e.g., by changing the training task, training on more data, finding harder negatives for the contrastive learning approach (see the loss sketch after this list), controlling for content using semantic similarity scores, experimenting with different forms of tokenization … Further reading: Style Representations, Universal Authorship Representations
  • Dutch NLP Models Do you want to translate the big strides made in English NLP to Dutch? You could work on a Dutch version of a style representation model (see previous point). You would have a good basis to build on (knowledge of what currently works best in the English setting) but would face quite some interesting challenges in making it work for Dutch. You could also choose any other language.
  • Style Change Detection Task Did you always want to take part in a leaderboard competition? You can make it your thesis project. For example, you could participate in a Style Change Detection Task (2021, see here). Submissions are usually due around April. In case that does not align with your starting date, you might not be able to formally submit for this year, but you can still test your model and compare it to other people’s work, or submit it a year later. The Style Change Detection Task is about detecting whether and where the author of a text changes. These kinds of tasks are often also known as authorship attribution tasks (e.g., see here). You could try out different feature sets from that field (e.g., LIWC, character n-grams, style embeddings …) and train some classification methods (e.g., logistic regression, RoBERTa, …); see the baseline sketch after this list. Related Keywords: Authorship Verification, Authorship Attribution, Style Measurement
  • Language Variation Evaluation Work on evaluation methods and/or datasets for language variation.
  • Generalizable Evaluation LLMs get larger and larger, which means they become harder and harder to evaluate. When we change the training data or the tokenizer, it is infeasible to re-train a huge model just to measure the effect. It is a pressing research question how to create inexpensive evaluation methods (e.g., on smaller models) whose results generalize to larger models.
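
For the active/passive dimension of the Style Evaluation project, a rule-based detector on top of a dependency parse is one plausible starting point. A minimal sketch using spaCy (assuming the en_core_web_sm model is installed; real texts will contain trickier constructions than this heuristic covers):

```python
import spacy

# Assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def is_passive(sentence: str) -> bool:
    """Heuristic: a sentence counts as passive if the parse contains a
    passive subject ("The money ...") or a passive auxiliary ("was counted")."""
    doc = nlp(sentence)
    return any(tok.dep_ in ("nsubjpass", "auxpass") for tok in doc)

print(is_passive("The cashier counted the money."))         # False
print(is_passive("The money was counted by the cashier."))  # True
```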
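
For the Style Embeddings project, the core of the contrastive learning approach fits in a few lines. A minimal sketch in PyTorch with placeholder embeddings (the encoder, batch size, and margin value are assumptions, not our actual training setup):

```python
import torch
import torch.nn.functional as F

def style_triplet_loss(anchor, positive, negative, margin=0.5):
    """Contrastive objective sketch: pull sentences by the same author
    (anchor, positive) together, push a different-author sentence away.
    A 'hard' negative would be a different-author sentence with the SAME
    content, so the model cannot solve the task via topic alone."""
    d_pos = 1 - F.cosine_similarity(anchor, positive)
    d_neg = 1 - F.cosine_similarity(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()

# Placeholder embeddings; in practice these would come from your encoder,
# e.g. a (Ro)BERTa model whose sentence vector you fine-tune.
anchor, positive, negative = (torch.randn(8, 768) for _ in range(3))
print(style_triplet_loss(anchor, positive, negative))
```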
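
For the Style Change Detection Task, a character n-gram baseline is quick to set up with scikit-learn. A minimal sketch on toy stand-in data (the texts and labels are invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in data: hypothetical binary labels (e.g., author change or not).
texts = ["i dont know... why even bother!!",
         "Indeed, one might reasonably object."]
labels = [0, 1]

# Character n-grams are a classic authorship signal: they capture
# punctuation habits, casing, and spelling without relying on topic words.
clf = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
clf.fit(texts, labels)
print(clf.predict(["Why even bother, indeed!!"]))
```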

Other Projects

  • Is Next Sentence Prediction worth something after all? It is a half-agreed fact in NLP that, of the two BERT pretraining objectives, “Masked Language Model” (MLM) is the more effective one, and that one should rather spend training resources entirely on MLM than waste them on Next Sentence Prediction (NSP). This is very probably true for most downstream NLP tasks. However, our work on Style Evaluation inspired the question whether NSP adds style-related information to language models that is not learned (as well) with MLM. Maybe, by asking whether the second sentence comes after the first, the model has to understand more about style than when predicting which word might be missing in the same sentence. In your project/thesis, you could systematically change BERT’s pretraining objective to only NSP vs. only MLM and see how this affects performance on Authorship Verification (AV) and Style Representations (a sketch of NSP pair construction follows below). Either answer would be a contribution: that NSP does not improve style representations in language models, as well as that it does.
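
To make the NSP side concrete, here is a minimal sketch (plain Python, my own simplification of BERT’s actual procedure) of how NSP training pairs are typically constructed; in a real experiment you would feed such pairs into a pretraining setup, e.g. Hugging Face’s BertForPreTraining, and toggle which objective you optimize.

```python
import random

def make_nsp_examples(documents, rng=random.Random(0)):
    """Build Next Sentence Prediction pairs roughly as in BERT pretraining:
    50% real consecutive sentence pairs (label 1), 50% pairs where the
    second sentence is sampled at random (label 0). Simplification: the
    random sentence may come from the same document, which BERT's original
    recipe avoids."""
    examples = []
    for doc in documents:
        for first, second in zip(doc, doc[1:]):
            if rng.random() < 0.5:
                examples.append((first, second, 1))       # true next sentence
            else:
                other = rng.choice(rng.choice(documents))  # random sentence
                examples.append((first, other, 0))
    return examples

docs = [["I went home.", "Then I slept."],
        ["Style matters!!", "srsly, it does."]]
for pair in make_nsp_examples(docs):
    print(pair)
```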

Your first step will be to develop a research plan that is doable in your given time frame. That usually means making several assumptions and simplifications to fit, e.g., a 10-week project plan. It is great if you can already bring a plan to our first meeting. It does not have to be perfect by any means; we can develop it further together.

Hope to work with you soon!