Do you want to do a BSc/MSc (thesis) project in the fields of natural language processing (i.e., any algorithm that uses ‘natural language’, e.g., English, as input or output) and/or computational social sciences (i.e., social science questions that are tackled with modern computational approaches)? If you have topic suggestions that fit these general fields and should be manageable in your given time frame, I am open to discussing them. If you are unsure whether it could be a fit, drop me a mail. My personal research focus lies on language variation (e.g., how people say something as opposed to what they say) and online discussions.

Current topics I would find interesting to develop further with you are listed below. Of course, these projects are only suggestions with very tentative descriptions. Everything is open to change and up for discussion.

Training Data in NLP

  • Dynamic Annotator Recruitment In NLP, data curation, and especially human annotation, is key at every step of the pipeline when training an NLP model. Recently, the field has moved away from aiming for a single binary truth per annotation (yes, this text contains hate speech, or no, it does not) towards distributions over what people would annotate (80% said hate speech, 20% said not, often noting annotator demographics etc.), see for example Learning from Disagreement. This implies an expensive shift in data collection: asking a bigger, more diverse group of people to annotate your data points. You could work on methods to optimize the distribution of annotation resources, e.g., how to tell which instances should be annotated by more or fewer annotators, see for example Minimizing manual annotation cost in supervised training from corpora. This project could be largely theory and statistics: simulating different distributions of annotations and working on methods/measures that help in deciding on an ideal resource distribution (a minimal simulation sketch follows below).
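
To make the simulation idea concrete, here is a minimal sketch. It assumes binary labels and one invented stopping rule (stop recruiting annotators for an instance once a Wilson confidence interval around the empirical label proportion is narrow enough); all names and thresholds are illustrative, not a proposed method.

```python
# Minimal simulation: when is it worth asking another annotator?
# Binary labels, with a made-up stopping rule based on the width of a
# Wilson score interval; purely illustrative.
import numpy as np

rng = np.random.default_rng(42)

def wilson_interval_width(k, n, z=1.96):
    """Width of the Wilson score interval for a binomial proportion."""
    if n == 0:
        return 1.0
    p = k / n
    denom = 1 + z**2 / n
    half = (z * np.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))) / denom
    return 2 * half

def annotators_needed(true_p, max_annotators=50, target_width=0.3):
    """Sample annotations one by one until the estimated label
    distribution is 'stable enough' (interval narrower than target)."""
    k = 0
    for n in range(1, max_annotators + 1):
        k += rng.random() < true_p  # one more annotator votes
        if wilson_interval_width(k, n) < target_width:
            return n
    return max_annotators

# Instances with clear majorities need fewer annotators than genuinely
# ambiguous ones (true_p close to 0.5).
for true_p in [0.95, 0.8, 0.5]:
    runs = [annotators_needed(true_p) for _ in range(200)]
    print(f"true_p={true_p}: mean annotators used = {np.mean(runs):.1f}")
```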

NLP Models and Style

  • Style Evaluation Do you want to expand on an existing NLP project? You could work on testing whether NLP models are able to capture differences in how people express themselves (e.g., whether they use more formal or more informal words). In our paper, we proposed a general framework to do this. However, the framework is far from finished (i.e., far from becoming an actual benchmark). You could work on an existing NLP project and add a new dimension that state-of-the-art “style models” can be tested on. Part of your thesis project would be motivating this new style dimension, collecting data to demonstrate it, and possibly testing models on it. Your contribution could even be part of a future publication on expanding the STEL framework. Possible new style dimensions you could work on include:
    + Are people using the active or the passive voice (“The cashier counted the money.” vs. “The money was counted by the cashier.”, see also: https://www.grammarly.com/blog/active-vs-passive-voice/)? Here, you could develop an algorithm to detect active as opposed to passive voice (a first detection sketch follows after this list).
    + How are people using punctuation (i.e., !, ?, ., …)? E.g., are they using punctuation at all? When are exclamation marks used? When are people repeating the same punctuation mark?
    + How are people casing their words? Are people starting the sentence with an upper-case letter or not? Are they writing “i” or “I”? Does it depend on the context?
    + Simple vs. complex language: is the word “principal” or “main” simpler? (A Report on the Complex Word Identification Shared Task 2018, Optimizing Statistical Machine Translation for Text Simplification)
    + British vs. American vs. Australian … English
    + Grammatically correct vs. incorrect (BLiMP: The Benchmark of Linguistic Minimal Pairs for English)
    Related Keywords: Style Evaluation, Linguistic Style, Language Variation
  • Style Embeddings In Natural Language Processing, there is a lot of work on training representations of sentences that encode the meaning of a sentence in machine-readable form (i.e., often in the form of vectors in a high-dimensional space, where paraphrases are mapped to the same point). However, less work has been done on learning representations that encode the style (as opposed to the content) of a sentence. Work on this is important to improve NLP models for demographics that use language differently from “the standard”. You could expand on our current model of style representations (see here), e.g., by changing the training task, training on more data, finding harder negatives for the contrastive learning approach (see the contrastive-training sketch after this list), controlling for content using semantic similarity scores, experimenting with different forms of tokenization … Further reading: Style Representations, Universal Authorship Representations
  • Dutch NLP Models Do you want to translate the big strides made in English NLP to Dutch? You could work on a Dutch version of a style representation model (see the previous point). You would have a good basis to build on (knowledge of what currently works best in the English setting) but face some interesting challenges in making it work for Dutch.
  • Style Change Detection Task Did you always want to take part in a leaderboard competition? You can make it your thesis project. For example, you could participate in a Style Change Detection Task (2021, see here). Submissions are usually due around April. In case that does not align with your starting date, you might not be able to formally submit this year, but you can still test your model, compare it to other people’s work, or submit it a year later. The Style Change Detection Task is about detecting whether and where the author of a text changes. These kinds of tasks are often also known as authorship attribution tasks (e.g., see here). You could try out different methods from that field (e.g., LIWC, character n-grams, style embeddings …) and train some classification methods (e.g., logistic regression; a baseline sketch follows after this list). Related Keywords: Authorship Verification, Authorship Attribution, Style Measurement
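
For the active vs. passive dimension above, a rough heuristic detector is a natural starting point. The sketch below uses spaCy’s dependency labels (nsubjpass/auxpass in the English models); it is a heuristic baseline to build on, not a finished algorithm.

```python
# Rough passive-voice detector via spaCy dependency labels.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def is_passive(sentence: str) -> bool:
    """True if the sentence contains a passive construction
    (a passive nominal subject or passive auxiliary)."""
    doc = nlp(sentence)
    return any(tok.dep_ in ("nsubjpass", "auxpass") for tok in doc)

print(is_passive("The cashier counted the money."))         # False
print(is_passive("The money was counted by the cashier."))  # True
```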
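For the style embeddings project, a contrastive setup could look roughly like the sketch below: the anchor and positive share an author (same style, different content) while the negative comes from another author. It uses the sentence-transformers library; the base model, the triplet loss, and the toy examples are illustrative choices on my part, not the exact setup from our paper.

```python
# Sketch of a contrastive setup for style (not content) representations.
# Assumes: pip install sentence-transformers
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("distilroberta-base")

# (anchor, positive, negative): positive = same author, negative = other
# author. "Harder" negatives would match the anchor's content but not
# its style.
train_examples = [
    InputExample(texts=[
        "lol yeah i guess so",            # author A
        "idk, maybe tomorrow??",          # author A (same style)
        "I am afraid I cannot attend.",   # author B (different style)
    ]),
]

loader = DataLoader(train_examples, shuffle=True, batch_size=1)
loss = losses.TripletLoss(model=model)
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=0)
```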
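For the style change detection / authorship direction, a classic baseline combines character n-grams with logistic regression. A minimal scikit-learn sketch, with made-up two-author toy data:

```python
# Character n-grams + logistic regression: a common authorship baseline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "well, i suppose that might work...",   # author 0
    "hmm, i suppose we could try that...",  # author 0
    "This approach is clearly optimal.",    # author 1
    "The results are clearly conclusive.",  # author 1
]
authors = [0, 0, 1, 1]

clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),  # char n-grams
    LogisticRegression(max_iter=1000),
)
clf.fit(texts, authors)
print(clf.predict(["i suppose this is clearly fine..."]))
```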

Other Projects

  • Is Next Sentence Prediction worth something after all? It is a half-agreed fact in NLP that, of the two BERT pretraining objectives, “Masked Language Modeling” (MLM) is the more effective one, and that one should rather spend training resources completely on MLM than waste them on Next Sentence Prediction (NSP). This is very probably true for most downstream NLP tasks. However, our work on Style Evaluation inspired the question whether NSP adds style-related information to language models that is not learned (as well) with MLM. Maybe, by asking whether the second sentence comes after the first, the model has to understand more about style than when predicting which word might be missing in a sentence. In your project/thesis, you could systematically change BERT’s pretraining objective to only NSP vs. only MLM and see how this affects performance on Authorship Verification (AV) and Style Representations (see the pretraining sketch after this list). Either answer would be a contribution: that NSP does improve style representations in language models, or that it does not.
  • Open/Closed Question Detection: Is this a closed question? What open questions do you typically ask new acquaintances? See also: Open thinking, closed questioning: Two kinds of open and closed questions. You would build a classifier that detects whether a question is open or closed and study the share of open/closed questions and how it affects a conversation (e.g., conversation length, overall sentiment). A naive rule-based baseline is sketched after this list.
  • Generation Detection No, this is not about generating texts. Rather, it is about whether a text was written by a Boomer, Zoomer, or Millennial. Do you sometimes read a text and just know it has been written by an unhappy teenager? Or a disgruntled grandmother? You could work on an algorithm that learns to predict generational identity (e.g., Exploring Generational Identity: A Multiparadigm Approach) from a social media post. This could later be used (e.g., on a platform like Reddit) to see how the average Boomer argues about climate change in comparison to the average Millennial. There has been some work on predicting age from short texts (e.g., Why Gender and Age Prediction from Tweets is Hard: Lessons from a Crowdsourcing Experiment, Age Groups Classification in Social Network Using Deep Learning). You could use those as a (first) approximation of generational identity and go further from there.
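
For the NSP project, the two ablations could be set up roughly as below: pretrain one small BERT from scratch with only the MLM head and one with only the NSP head, then compare both on style tasks downstream. This uses Hugging Face transformers; the config sizes are illustrative and the training loops are omitted.

```python
# Two pretraining ablations: MLM-only vs. NSP-only, from scratch.
from transformers import (
    BertConfig,
    BertForMaskedLM,
    BertForNextSentencePrediction,
)

# Deliberately small config so pretraining stays feasible for a thesis.
config = BertConfig(
    hidden_size=256, num_hidden_layers=4,
    num_attention_heads=4, intermediate_size=1024,
)

mlm_model = BertForMaskedLM(config)                # MLM-only variant
nsp_model = BertForNextSentencePrediction(config)  # NSP-only variant

# Both take the usual input_ids/attention_mask batches; the training
# loops (e.g., Trainer plus DataCollatorForLanguageModeling for MLM,
# sentence-pair batches with next-sentence labels for NSP) are omitted.
```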
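For open/closed question detection, a naive rule-based baseline is a useful sanity check before training anything: questions starting with a wh-word tend to be open, those starting with an auxiliary verb tend to be closed (yes/no). Many exceptions exist, and the word lists below are illustrative only.

```python
# Naive rule-based open/closed question baseline.
OPEN_STARTERS = {"what", "why", "how", "where", "when", "which", "who"}
CLOSED_STARTERS = {"is", "are", "was", "were", "do", "does", "did",
                   "can", "could", "will", "would", "should", "have", "has"}

def question_type(question: str) -> str:
    """Classify a question as open/closed from its first word."""
    first = question.strip().split()[0].lower().strip("?,.")
    if first in OPEN_STARTERS:
        return "open"
    if first in CLOSED_STARTERS:
        return "closed"
    return "unknown"

print(question_type("Is this a closed question?"))          # closed
print(question_type("What do you ask new acquaintances?"))  # open
```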

Project Ideas that are very tentative

Your first step will be to develop a research plan that is doable in your given time frame. That usually means making several assumptions and simplifications to fit, e.g., a 10-week project plan. It is great if you can already bring a plan to our first meeting. It does not have to be perfect by any means. We can develop it further together.

Hope to work with you soon!