Usage Guide
Control number of paraphrases
results = diversify("Some text.", n=3)
[{"original": "Some text.", "paraphrases": ["...", "...", "..."]}]
Reproducibility (seed)
diversify sets a default random seed (51173) to make runs more
reproducible. The seed is applied to Python’s random, PyTorch
(CPU and CUDA), and NumPy. It is logged at the start of each run, but
exact determinism is not guaranteed across different hardware, library
versions, or backends.
To get a different set of paraphrases, pass a different seed:
results = diversify("Some text.", seed=123)
To disable seeding entirely (non-deterministic output):
results = diversify("Some text.", seed=None)
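As a rough illustration of what seeding these libraries involves, the sketch below seeds Python's random module and, when they are installed, NumPy and PyTorch. The helper name seed_everything is hypothetical and is not taken from diversify_text; the library's own implementation may differ.

```python
import logging
import random

logger = logging.getLogger(__name__)


def seed_everything(seed=51173):
    """Hypothetical helper: seed the common RNGs; None leaves them untouched."""
    if seed is None:
        return
    logger.info("Using random seed %d", seed)
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy is not installed, so there is nothing to seed
    try:
        import torch
        torch.manual_seed(seed)  # seeds the CPU generator
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)  # seeds every visible GPU
    except ImportError:
        pass  # PyTorch is not installed


seed_everything(123)
first = random.random()
seed_everything(123)
second = random.random()
print(first == second)  # True: the same seed replays the same draw
```

Even with all generators seeded, GPU kernels and library upgrades can still change results, which is why the guarantee above is limited to "more reproducible".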
List of texts
results = diversify([
    "The experiment was conducted in a controlled lab setting.",
    "She graduated from MIT in 2019.",
])
[
    {"original": "The experiment ...", "paraphrases": ["...", "...", ...]},
    {"original": "She graduated ...", "paraphrases": ["...", "...", ...]},
]
CSV / TSV file
For a CSV or TSV path, diversify reads the texts from the column named by
text_column and writes a JSONL file next to the input
(<input>_diversified.jsonl).
results = diversify("bios.csv", text_column="bio")
# writes bios_diversified.jsonl
Each line in the JSONL output is one JSON object:
{"original": "Jane is a ...", "paraphrases": ["Jane works as a ...", "As a ..., Jane ..."]}
{"original": "John studied ...", "paraphrases": ["John was educated ...", "..."]}
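Because each line is a standalone JSON object, the output can be loaded back with the standard library alone. The sketch below is one way to do that; read_jsonl is a hypothetical helper, and the sample records are placeholders written to a temporary file, not real output.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path):
    """Return one dict per non-empty line of a JSONL file."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Placeholder records in the documented {"original", "paraphrases"} shape.
sample = [
    {"original": "text one", "paraphrases": ["variant a", "variant b"]},
    {"original": "text two", "paraphrases": ["variant c"]},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample_diversified.jsonl"
    lines = "\n".join(json.dumps(record) for record in sample) + "\n"
    path.write_text(lines, encoding="utf-8")
    records = read_jsonl(path)

for record in records:
    print(record["original"], len(record["paraphrases"]))
```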
TXT file
Each non-empty line is treated as a separate text to diversify. Output is
written to <input>.jsonl.
results = diversify("texts.txt")
# writes texts.jsonl
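The line-splitting rule can be pictured with a few lines of plain Python. This is only a sketch: split_nonempty_lines is a hypothetical name, and stripping surrounding whitespace from each line is an assumption rather than documented behaviour.

```python
def split_nonempty_lines(raw_text):
    """Return the non-empty lines of raw_text, one text per line."""
    # Assumption: whitespace-only lines count as empty and lines are stripped.
    return [line.strip() for line in raw_text.splitlines() if line.strip()]


texts = split_nonempty_lines("First text.\n\n   \nSecond text.\n")
print(texts)  # ['First text.', 'Second text.']
```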
Controlling output location
By default, file inputs write their output next to the input file, while
in-memory inputs (strings, lists) return a Python list. You can override
this with output_dir and output_name:
# Write output to a specific directory
results = diversify("bios.csv", text_column="bio", output_dir="/results")
# writes /results/bios_diversified.jsonl
# Also set a custom filename
results = diversify("bios.csv", text_column="bio", output_dir="/results", output_name="my_output")
# writes /results/my_output.jsonl
# Force a list input to write to disk instead of returning in-memory
results = diversify(["text one", "text two"], output_dir=".")
# writes ./diversified_output.jsonl
The .jsonl extension is always added automatically.
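The naming rules above can be summarised in a small path-resolution sketch. The function resolve_output_path is hypothetical and only mirrors the documented examples. How the library handles cases not shown here (for instance output_name without output_dir on an in-memory input) is an assumption.

```python
from pathlib import PurePosixPath


def resolve_output_path(input_path=None, output_dir=None, output_name=None):
    """Hypothetical helper: return the JSONL path, or None for in-memory results."""
    if input_path is None:
        # In-memory input: assumed to stay in memory unless output_dir is given.
        if output_dir is None:
            return None
        stem = output_name or "diversified_output"
        return PurePosixPath(output_dir) / f"{stem}.jsonl"

    source = PurePosixPath(input_path)
    if output_name:
        stem = output_name
    elif source.suffix.lower() in {".csv", ".tsv"}:
        stem = f"{source.stem}_diversified"
    else:
        stem = source.stem  # e.g. texts.txt -> texts.jsonl
    directory = PurePosixPath(output_dir) if output_dir else source.parent
    return directory / f"{stem}.jsonl"


print(resolve_output_path("bios.csv"))
print(resolve_output_path("bios.csv", output_dir="/results"))
print(resolve_output_path("bios.csv", output_dir="/results", output_name="my_output"))
print(resolve_output_path("texts.txt"))
```

PurePosixPath is used so the sketch behaves the same on every platform; real code would more likely use pathlib.Path.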
Longer texts
For tips on handling longer texts (punctuation splitting, increasing
max_new_tokens), see Longer Texts.
Multiple methods
You can combine methods to get diverse paraphrases from different approaches.
The requested n paraphrases are distributed across the methods:
results = diversify("The cat sat on the mat.", methods=["tinystyler", "prompting"], n=4)
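The exact distribution rule is not spelled out here. One plausible scheme, shown purely as an assumption, is an even split in which any remainder goes to the methods listed first. The helper split_n is hypothetical.

```python
def split_n(n, methods):
    """Hypothetical helper: share n paraphrases across methods as evenly as possible."""
    if not methods:
        raise ValueError("at least one method is required")
    base, extra = divmod(n, len(methods))
    # Assumption: the first `extra` methods each produce one additional paraphrase.
    return {
        method: base + (1 if index < extra else 0)
        for index, method in enumerate(methods)
    }


print(split_n(4, ["tinystyler", "prompting"]))  # {'tinystyler': 2, 'prompting': 2}
print(split_n(5, ["tinystyler", "prompting"]))  # {'tinystyler': 3, 'prompting': 2}
```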
Customising the TinyStyler style bank
TinyStyler generates each paraphrase by conditioning on a style example — a short sentence that demonstrates the target writing style. The style bank is the list of such examples that get cycled through when producing multiple paraphrases.
The default bank is a dictionary mapping style labels to lists of example
sentences (drawn from the CORE corpus). You can replace or extend it by
passing a custom bank via method_kwargs.
A style bank can be a dict[str, list[str]] or a list[list[str]]:
from diversify_text import diversify
from diversify_text.styles import DEFAULT_STYLE_BANK
custom_bank = {
    "academic": ["The results demonstrate a statistically significant effect."],
    "enthusiastic": ["We found something really interesting — check this out!"],
    "telegraphic": ["Key finding: effect confirmed. Details follow."],
}
results = diversify(
    "The experiment was conducted in a controlled lab setting.",
    method_kwargs={"tinystyler": {"style_bank": custom_bank}},
)
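To picture how a bank might be cycled when more paraphrases are requested than there are styles, here is a minimal sketch. The function cycle_styles is hypothetical, and using list positions as labels for a list[list[str]] bank is an assumption made for the illustration.

```python
from itertools import cycle, islice


def cycle_styles(style_bank, n):
    """Hypothetical helper: return n (label, examples) pairs, wrapping around the bank."""
    if isinstance(style_bank, dict):
        entries = list(style_bank.items())
    else:
        # Assumption: an unlabelled bank is addressed by position.
        entries = list(enumerate(style_bank))
    if not entries:
        raise ValueError("style bank is empty")
    return list(islice(cycle(entries), n))


bank = {
    "academic": ["The results demonstrate a statistically significant effect."],
    "telegraphic": ["Key finding: effect confirmed. Details follow."],
}
for label, examples in cycle_styles(bank, 3):
    print(label, examples[0])
```

With two styles and n=3, the third paraphrase reuses the first style.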
DEFAULT_STYLE_BANK is exported from diversify_text.styles so you
can build on it:
from diversify_text.styles import DEFAULT_STYLE_BANK
extended_bank = {
    **DEFAULT_STYLE_BANK,
    "scientific": ["The data clearly indicate a statistically significant result."],
}
You can also select specific styles by key name with styles, instead of
cycling through the entire bank. The number of paraphrases is determined by
the number of selected styles:
results = diversify(
    "The experiment was conducted in a controlled lab setting.",
    method_kwargs={"tinystyler": {"styles": ["research_article", "personal_blog", "recipe"]}},
)
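Selecting styles by key amounts to picking a subset of the bank, so the number of selected keys fixes the number of paraphrases. The sketch below assumes that unknown keys should raise an error. select_styles is a hypothetical helper, and the example sentences are placeholders rather than entries from the default bank.

```python
def select_styles(style_bank, styles):
    """Hypothetical helper: return the requested styles, in the order given."""
    missing = [key for key in styles if key not in style_bank]
    if missing:
        # Assumption: an unknown key is an error rather than silently skipped.
        raise KeyError(f"Unknown style keys: {missing}")
    return {key: style_bank[key] for key in styles}


# Placeholder bank; the sentences are illustrative only.
bank = {
    "research_article": ["Placeholder sentence in a research register."],
    "personal_blog": ["Placeholder sentence in a blog register."],
    "recipe": ["Placeholder sentence in a recipe register."],
}
selected = select_styles(bank, ["recipe", "personal_blog"])
print(list(selected))  # ['recipe', 'personal_blog']
print(len(selected))   # 2
```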
Creating a custom method
from diversify_text import Diversifier
from diversify_text.method import DiversificationMethod


class MyMethod(DiversificationMethod):
    name = "my_method"

    def generate(self, texts, *, n, max_new_tokens, temperature, top_p, **kwargs):
        # Return one list of n paraphrases per input text, in input order.
        return [[f"{text} :: variant {i}" for i in range(n)] for text in texts]


results = Diversifier(methods=[MyMethod()]).diversify("Hello", n=3)
[{"original": "Hello", "paraphrases": ["Hello :: variant 0", "Hello :: variant 1", "Hello :: variant 2"]}]
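When writing a custom method, it can help to check the shape of what generate returns before plugging it in. The sketch below assumes, based on the example above, that generate should return one list of exactly n strings per input text. check_generate_output and fake_generate are hypothetical stand-ins that do not import diversify_text.

```python
def check_generate_output(texts, outputs, n):
    """Hypothetical check: one list of n strings per input text."""
    if len(outputs) != len(texts):
        raise ValueError(f"expected {len(texts)} result lists, got {len(outputs)}")
    for text, paraphrases in zip(texts, outputs):
        if len(paraphrases) != n:
            raise ValueError(f"expected {n} paraphrases for {text!r}, got {len(paraphrases)}")
        if not all(isinstance(item, str) for item in paraphrases):
            raise TypeError(f"paraphrases for {text!r} must all be strings")
    return True


def fake_generate(texts, n):
    """Stand-in for a custom method's generate(), mirroring the example above."""
    return [[f"{text} :: variant {i}" for i in range(n)] for text in texts]


texts = ["Hello", "World"]
print(check_generate_output(texts, fake_generate(texts, n=3), n=3))  # True
```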