Do you want to do a BSc/MSc (thesis) project in the fields of natural language processing (i.e., any algorithm that uses ‘natural language’, e.g., English, as input or output) and/or computational social science (i.e., social science questions that are tackled with modern computational approaches)? If you have topic suggestions that fit these general fields and should be manageable in your given time frame, I am open to discussing them. If you are unsure whether your idea could be a fit, drop me an email. My personal research focus lies on language variation (e.g., how people say something as opposed to what they say) and online discussions.

Current topics I would find interesting to develop further with you are:

  • Style Evaluation: Do you want to expand on an existing NLP project? We recently published a paper on testing whether NLP models are able to capture differences in how people express themselves (e.g., whether they use more formal or more informal words). We proposed a general framework (STEL) to do this. However, the framework is far from finished (i.e., far from becoming an actual benchmark). You could add a new dimension that state-of-the-art “style models” can be tested on. Part of your thesis project would be motivating this new style dimension, collecting data to demonstrate it and possibly testing models on it. Your contribution could even be part of a future publication on expanding the STEL framework. Possible new style dimensions you could work on include:
      + Are people using the active or passive voice (“The cashier counted the money.” vs. “The money was counted by the cashier.”)? Here, you could develop an algorithm to detect active as opposed to passive voice.
      + How are people using punctuation (i.e., !, ?, ., …)? E.g., are they using punctuation at all? When are exclamation marks used? When are people repeating the same punctuation mark?
      + How are people casing their words? Are people starting the sentence with an upper case letter or not? Are they writing “i” or “I”? Does it depend on the context?
      + Simple vs. complex language: is the word “principal” or “main” simpler? (A Report on the Complex Word Identification Shared Task 201, Optimizing Statistical Machine Translation for Text Simplification)
      + British vs. American vs. Australian … English
      + Grammatically correct vs. incorrect (BLiMP: The Benchmark of Linguistic Minimal Pairs for English)
      + …
    Related Keywords: Style Evaluation, Linguistic Style, Language Variation
  • Language Variation: Maybe you do not want to meddle with an existing project but are still interested in language variation and the different ways people express themselves? You could take a look at one specific style dimension (the same dimensions as before are possible, e.g., active vs. passive voice, punctuation usage, …) and when/how online communities use it. You would develop a method to detect whether people use a specific style or not (e.g., active vs. passive voice) and then (possibly) quantitatively analyze when people use these styles and what the effects of using them are (e.g., are other people adapting to this style or not?). For example, see How Active, Passive and Nominal Styles Affect Readability of Science Writing. Related Keywords: Linguistic Style, Language Variation, Linguistic Accommodation
  • Style Embeddings: In Natural Language Processing, there is a lot of work on training representations that encode the meaning of a sentence in machine-readable form (often as vectors in a high-dimensional space, where paraphrases are ideally mapped to the same point). However, less work has been done on learning representations that encode the style (as opposed to the content) of a sentence. You could expand on our current model of style representations (see here), e.g., by training on more data, finding harder negatives for the contrastive learning approach, controlling for content using semantic similarity scores, experimenting with different forms of tokenization, … Further reading: Style Representations, Universal Authorship Representations
  • Style Change Detection Task: Did you always want to take part in a leaderboard competition? You can make it your thesis project. For example, you could participate in the Style Change Detection Task for 2022 (for 2021, see here). The submission deadline is probably sometime in April this year. If that does not fit your starting date, you might not be able to formally submit this year, but you can still test your model and compare it to other people’s work or submit it a year later. The Style Change Detection Task is about detecting whether and where the author of a text changes. These kinds of tasks are often also known as authorship attribution tasks (e.g., see here). You could try out different methods from that field (e.g., LIWC, character n-grams, style embeddings, …) and train some classification methods (e.g., logistic regression). Related Keywords: Authorship Verification, Authorship Attribution, Style Measurement
  • Is Next Sentence Prediction worth something after all? It is a half-agreed fact in NLP that, out of the two BERT pretraining objectives, “Masked Language Model” (MLM) is the more effective one, and that one should spend training resources completely on MLM rather than waste them on Next Sentence Prediction (NSP). This is very probably true for most downstream NLP tasks. However, our work on Style Evaluation inspired the question of whether NSP adds style-related information to language models that is not learned (as well) with MLM. Maybe by asking whether the second sentence comes after the first, the model has to understand more about style than when predicting what word might be missing in the same sentence. In your project/thesis, you could systematically change BERT’s pretraining objective to only NSP vs. only MLM and see how this affects performance on Authorship Verification (AV) and Style Representations. Either answer would be a contribution: “NSP does not improve style representations in language models” as well as “NSP does improve style representations in language models”.
  • Open/Closed Question Detection: Is this a closed question? What open question do you typically ask new acquaintances? See also: Open thinking, closed questioning: Two kinds of open and closed questions. You will build a classifier that detects whether a question is open or closed, study the share of open/closed questions in conversations and analyze how it affects a conversation (e.g., conversation length, overall sentiment).
  • Generation Detection: No, this is not about generating texts. Rather, it is about whether a text was written by a boomer, zoomer or millennial. Do you sometimes read a text and just know it has been written by an unhappy teenager? Or a disgruntled grandmother? You could work on an algorithm that learns to predict generational identity (e.g., Exploring Generational Identity: A Multiparadigm Approach) from a social media post. This could later be used to see (e.g., on a platform like Reddit) how the average Boomer argues about climate change in comparison to the average Millennial. There has been some work on predicting age from short texts (e.g., Why Gender and Age Prediction from Tweets is Hard: Lessons from a Crowdsourcing Experiment, Age Groups Classification in Social Network Using Deep Learning). You could use those as a (first) approximation of generational identity and go further from there.
  • Conversations: You can work on conversation datasets, for example, online discussions (e.g., on Reddit), and investigate questions about interactions. You could calculate the Discourse Quality Index (DQI) (for computational applications see, e.g., here) on a political discussion dataset. Other research questions could include: What factors worsen or improve the overall flow of a conversation (see, e.g., Conversations gone awry, convokit)? When do participants say about the same number of words in a discussion? What is the influence of topic change in a conversation?
  • Conflict conversations – Detection, Resolving, Strategies (e.g., see Conversational Receptiveness)
  • Intrinsic plagiarism detection (see, e.g., Is writing style predictive of scientific fraud?) – Topics could be about improving detection algorithms, finding features relating to fraud, finding areas that are especially susceptible to fraud, …
  • Gender Bias in Fiction (e.g., Analyzing Gender Bias within Narrative Tropes) – How does popular culture influence popular belief?
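
To make the punctuation and casing dimensions from the Style Evaluation topic more concrete, here is a minimal sketch of how such surface cues could be counted in a single post. The function name, the feature names and the naive sentence splitting are illustrative assumptions, not part of the STEL framework.

```python
import re
from collections import Counter


def surface_style_features(text: str) -> dict:
    """Count a few punctuation and casing cues in one post (illustrative only)."""
    # Naive sentence split: break on whitespace that follows ., !, ? or …
    sentences = [s for s in re.split(r"(?<=[.!?…])\s+", text.strip()) if s]
    punct = Counter(ch for ch in text if ch in "!?.…")
    return {
        "exclamations": punct["!"],
        "questions": punct["?"],
        # Runs such as "!!!" or "??" count once per run
        "repeated_punct": len(re.findall(r"([!?.])\1+", text)),
        "lowercase_i": len(re.findall(r"\bi\b", text)),
        "uppercase_I": len(re.findall(r"\bI\b", text)),
        "lowercase_sentence_starts": sum(1 for s in sentences if s[0].islower()),
        "n_sentences": len(sentences),
    }


if __name__ == "__main__":
    print(surface_style_features("i am so happy!!! are you coming? Yes. i think so"))
```

Counts like these could serve as a first baseline before testing whether learned style models capture the same distinctions.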
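
For the Open/Closed Question Detection topic, a very crude rule-based baseline might look at the first word of a question. The sketch below assumes that questions starting with an auxiliary verb are closed (yes/no) and questions starting with a wh-word are open. This is a simplification and not the definition used in the paper mentioned above; a trained classifier would be expected to do better.

```python
# Hypothetical word lists for a first-word heuristic; both are assumptions.
AUXILIARIES = {
    "is", "are", "was", "were", "am", "do", "does", "did", "can", "could",
    "will", "would", "should", "shall", "may", "might", "must",
    "have", "has", "had",
}
WH_WORDS = {"what", "why", "how", "who", "whom", "whose", "which", "where", "when"}


def guess_question_type(question: str) -> str:
    """Return 'closed', 'open' or 'unknown' based only on the first word."""
    words = question.strip().lower().split()
    if not words:
        return "unknown"
    first = words[0].strip("\"'“”‘’(")
    if first in AUXILIARIES:
        return "closed"
    if first in WH_WORDS:
        return "open"
    return "unknown"


if __name__ == "__main__":
    for q in ["Is this a closed question?",
              "What open question do you typically ask new acquaintances?"]:
        print(q, "->", guess_question_type(q))
```

A heuristic like this could serve as a comparison point for the classifier you build, or as a way to pre-select examples for manual annotation.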
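
For the Conversations topic, the question of when participants say about the same number of words could be operationalized in many ways. One possible sketch, assuming a conversation is given as a list of (speaker, utterance) pairs and using normalized entropy over per-speaker word shares as the balance measure (both are assumptions made for illustration):

```python
import math
from collections import Counter


def word_share_balance(turns):
    """Normalized entropy of per-speaker word shares, between 0 and 1.

    1.0 means every speaker contributed the same number of words;
    values near 0 mean one speaker dominates. Returns 0.0 for fewer
    than two speakers or an empty conversation.
    """
    counts = Counter()
    for speaker, utterance in turns:
        counts[speaker] += len(utterance.split())  # whitespace tokenization
    total = sum(counts.values())
    if len(counts) < 2 or total == 0:
        return 0.0
    shares = [c / total for c in counts.values() if c > 0]
    entropy = -sum(p * math.log(p) for p in shares)
    return entropy / math.log(len(counts))


if __name__ == "__main__":
    balanced = [("a", "one two three"), ("b", "four five six")]
    skewed = [("a", "one two three four five six seven eight nine"), ("b", "ok")]
    print(word_share_balance(balanced), word_share_balance(skewed))
```

A score like this could then be related to other properties of a discussion, such as its length or topic changes.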

Of course, these projects are mere suggestions and very tentative descriptions. Everything is open to change and for discussion.

Your first step will be to develop a research plan that is doable in your given time frame. That usually means making several assumptions and simplifications to fit, e.g., a 10-week project plan. It is great if you can already bring a plan to our first meeting. It does not have to be perfect by any means. We can develop it further together.

Hope to work with you soon!