Methods
diversify-text uses a pluggable method architecture. Each method is a
DiversificationMethod subclass that generates
paraphrases using a different model or algorithm.
Overview
| Method | Model Size | Speed | Performance | Description |
|---|---|---|---|---|
| TinyStyler | ~800M params | TBD | TBD | Few-shot style transfer using authorship embeddings |
| Prompting | ~3B params (default) | TBD | TBD | Prompt-based paraphrasing using a causal LM |
TinyStyler
TinyStyler is a T5-based model that performs few-shot text style transfer by conditioning on authorship-embedding representations.
Given a source text and a set of style example sentences, TinyStyler generates
a paraphrase that preserves the content while shifting toward the demonstrated
writing style. diversify-text cycles through different style groups from a
configurable style bank to produce multiple stylistically diverse outputs.
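The cycling behaviour can be sketched roughly as follows. The bank contents here are invented for illustration, and `pick_styles` is a hypothetical helper, not library code; the real bank lives in `diversify_text.method.tinystyler.styles.DEFAULT_STYLE_BANK`.

```python
# Sketch of round-robin style cycling: each requested output draws its
# few-shot examples from the next named style in the bank.
from itertools import cycle, islice

style_bank = {
    "informal": ["omg the cat just plopped onto the mat lol"],
    "formal": ["The feline positioned itself upon the mat."],
}

def pick_styles(n_outputs: int) -> list[str]:
    """Assign each requested output the next style name in the bank."""
    return list(islice(cycle(style_bank), n_outputs))

print(pick_styles(3))
# → ['informal', 'formal', 'informal']
```

Because the bank is cycled rather than sampled, requesting more outputs than there are styles simply wraps around, so every style is used before any repeats.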
Note
TinyStyler is built on CISR style embeddings, which have been shown to work well in social-media-like settings and for formality transfer; the model may not perform as well when reproducing other styles.
Default style bank. The built-in bank contains named styles drawn from the CORE corpus, the TinyStyler repository, and the STEL demo for the formality dimension.
See diversify_text.method.tinystyler.styles.DEFAULT_STYLE_BANK for the
full list of available styles.
Citation:
@inproceedings{horvitz-etal-2024-tinystyler,
title = "{T}iny{S}tyler: Efficient Few-Shot Text Style Transfer with Authorship Embeddings",
author = "Horvitz, Zachary and
Patel, Ajay and
Singh, Kanishk and
Callison-Burch, Chris and
McKeown, Kathleen and
Yu, Zhou",
editor = "Al-Onaizan, Yaser and
Bansal, Mohit and
Chen, Yun-Nung",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
month = nov,
year = "2024",
address = "Miami, Florida, USA",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.findings-emnlp.781",
pages = "13376--13390",
}
Prompting
The prompting method generates paraphrases by sending input texts to a
local HuggingFace causal language model together with a prompt template.
The default model is SmolLM3-3B, with prompt design informed by The
Synthetic Data Playbook.
results = diversify("The cat sat on the mat.", methods=["prompting"])
Choosing a model. Any HuggingFace causal LM can be used. Pass the model identifier to the constructor:
from diversify_text import Diversifier
from diversify_text.method.prompting import PromptingMethod
method = PromptingMethod(model="mistralai/Mistral-7B-Instruct-v0.3")
results = Diversifier(methods=[method]).diversify("The cat sat on the mat.")
Instruct-tuned models are recommended. Chat templates are applied automatically when the tokenizer provides one.
Note
Thinking/reasoning models (e.g. SmolLM3-3B) are detected automatically and
have their thinking mode turned off (enable_thinking=False) during
generation. Thinking tokens add overhead without improving paraphrase
quality in this setting.
Inference backend. The method currently uses the transformers library
for inference.
Note
vLLM support, batched inference, and streaming from large files are planned for a future release.
Default prompt bank. The built-in bank contains multiple prompt templates
covering different rewriting styles (paraphrasing, simplification, dialogue,
tables, and more). When no explicit selection is made, the templates listed in
DEFAULT_PROMPTS are used.
See Prompt Templates for the full list of available templates.
Customising the prompt bank. Like TinyStyler’s style bank, you can provide
a custom prompt bank or select specific prompts via method_kwargs. Each
prompt template must contain the placeholder [DOCUMENT SEGMENT]:
custom_bank = {
    "simple": "Rewrite the following text in simpler words: [DOCUMENT SEGMENT]",
    "formal": "Rewrite the following text in a formal academic tone: [DOCUMENT SEGMENT]",
}

results = diversify(
    "The cat sat on the mat.",
    methods=["prompting"],
    method_kwargs={"prompting": {"prompt_bank": custom_bank}},
)
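The placeholder substitution that the `[DOCUMENT SEGMENT]` requirement implies can be sketched as follows; `render_prompt` is a hypothetical helper, not the library's internal name.

```python
# Sketch of filling a template's [DOCUMENT SEGMENT] placeholder with the
# input text, rejecting templates that omit the placeholder entirely.

PLACEHOLDER = "[DOCUMENT SEGMENT]"

def render_prompt(template: str, text: str) -> str:
    if PLACEHOLDER not in template:
        raise ValueError(f"template must contain {PLACEHOLDER}")
    return template.replace(PLACEHOLDER, text)

template = "Rewrite the following text in simpler words: [DOCUMENT SEGMENT]"
print(render_prompt(template, "The cat sat on the mat."))
# → Rewrite the following text in simpler words: The cat sat on the mat.
```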
You can also select specific prompts by key name:
results = diversify(
    "The cat sat on the mat.",
    methods=["prompting"],
    method_kwargs={"prompting": {"prompt_keys": ["wikipedia_paraphrase"]}},
)
Zero-shot humanize rewriting
The prompt bank includes humanize prompts based on Zhang et al. (2024) that rewrite machine-generated text to appear more human-written. These prompts instruct the model to introduce informal elements such as typos, slang, hashtags, and varied casing:
results = diversify(
    "The experiment was conducted in a controlled lab setting.",
    methods=["prompting"],
    method_kwargs={"prompting": {"prompt_keys": ["humanize_llm-as-coauthor"]}},
)
A stricter variant, humanize_llm-as-coauthor_original, uses the original
five modifications from the paper and explicitly forbids emojis.
Few-shot style transfer with prompting
The prompting method can also perform few-shot style transfer by combining
style examples from the shared style bank with a few-shot prompt template.
When styles is provided without explicit prompt_keys, the method
automatically uses the style_transfer template from
EXAMPLE_BASED_PROMPT_BANK:
results = diversify(
    "The experiment was conducted in a controlled lab setting.",
    methods=["prompting"],
    method_kwargs={
        "prompting": {
            "styles": ["informal_tinystyler"],
        }
    },
)
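A rough sketch of how such a few-shot template might weave style examples into the prompt; the wording below is invented for illustration, and the real templates live in EXAMPLE_BASED_PROMPT_BANK.

```python
# Invented few-shot template: list the style examples, then ask the
# model to rewrite the input in the demonstrated style.

def build_few_shot_prompt(style_examples: list[str], text: str) -> str:
    shots = "\n".join(f"- {ex}" for ex in style_examples)
    return (
        "Rewrite the text below in the style of these examples:\n"
        f"{shots}\n\n"
        f"Text: {text}\n"
        "Rewrite:"
    )

prompt = build_few_shot_prompt(
    ["ngl the lab was like, super controlled"],
    "The experiment was conducted in a controlled lab setting.",
)
print(prompt)
```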
You can select a different few-shot template via prompt_keys. For
example, humanize_transfer combines humanization instructions with the
style examples:
results = diversify(
    "The experiment was conducted in a controlled lab setting.",
    methods=["prompting"],
    method_kwargs={
        "prompting": {
            "styles": ["informal_tinystyler"],
            "prompt_keys": ["humanize_transfer"],
        }
    },
)
Development
To see the exact prompts sent to the model, enable debug logging:
import logging
logging.basicConfig(level=logging.DEBUG)
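To avoid debug noise from unrelated libraries, you can instead raise the level of just this package's logger. This assumes the library follows the common convention of naming its logger after the package; that name is an assumption, not confirmed by the source.

```python
import logging

# Keep the root at the default level, but show DEBUG output from the
# (assumed) package logger only.
logging.basicConfig()
logging.getLogger("diversify_text").setLevel(logging.DEBUG)
```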
Adding a new method
See Creating a custom method in the Usage Guide for instructions on
implementing your own DiversificationMethod.
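The pluggable architecture described at the top of this page can be sketched roughly as follows. The base-class name comes from this documentation, but its exact interface (the `generate` signature) is an illustrative assumption, with a toy method standing in for a real model.

```python
# Hypothetical sketch of the pluggable-method pattern. The real
# DiversificationMethod interface lives in diversify_text; the
# generate() signature used here is an assumption for illustration.

class DiversificationMethod:
    """Stand-in for the library's base class."""

    def generate(self, text: str, n: int = 1) -> list[str]:
        raise NotImplementedError


class ReverseMethod(DiversificationMethod):
    """Toy method: 'paraphrases' by reversing word order."""

    def generate(self, text: str, n: int = 1) -> list[str]:
        words = text.split()
        return [" ".join(reversed(words)) for _ in range(n)]


print(ReverseMethod().generate("The cat sat on the mat."))
# → ['mat. the on sat cat The']
```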