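Usage Guide
===========

Control number of paraphrases
-----------------------------

Use ``n`` to set how many paraphrases are generated for each input text:

.. code-block:: python

    from diversify_text import diversify
    results = diversify("Some text.", n=3)

The call returns a list with one entry per input text:

.. code-block:: python

    [{"original": "Some text.", "paraphrases": ["...", "...", "..."]}]
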

Reproducibility (seed)
----------------------

``diversify`` sets a default random seed (``51173``) to make runs more
reproducible. The seed is applied to Python's ``random``, PyTorch
(CPU and CUDA), and NumPy. It is logged at the start of each run, but
exact determinism is **not** guaranteed across different hardware, library
versions, or backends.

To get a different set of paraphrases, pass a different seed:

.. code-block:: python

    results = diversify("Some text.", seed=123)

To disable seeding entirely (non-deterministic output):

.. code-block:: python

    results = diversify("Some text.", seed=None)

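
The seeding described in this section can be sketched as follows. This is a
minimal illustration, not the library's implementation: ``seed_everything``
is a hypothetical helper name, and NumPy and PyTorch are seeded only when
they are installed:

.. code-block:: python

    import random


    def seed_everything(seed):
        """Seed the common RNGs; a seed of None leaves them untouched."""
        if seed is None:
            return
        random.seed(seed)
        try:
            import numpy as np
            np.random.seed(seed)
        except ImportError:
            pass  # NumPy not installed; nothing to seed
        try:
            import torch
            torch.manual_seed(seed)  # seeds the CPU generator
            if torch.cuda.is_available():
                torch.cuda.manual_seed_all(seed)  # seeds every CUDA device
        except ImportError:
            pass  # PyTorch not installed; nothing to seed


    seed_everything(51173)
    first = [random.random() for _ in range(3)]
    seed_everything(51173)
    second = [random.random() for _ in range(3)]
    assert first == second  # same seed, same draws from Python's RNG

Re-seeding makes the draws repeat on one machine and one set of library
versions; it does not remove the hardware and backend caveats noted above.
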

List of texts
-------------

Pass a list of strings to diversify several texts in one call:

.. code-block:: python

    results = diversify([
        "The experiment was conducted in a controlled lab setting.",
        "She graduated from MIT in 2019.",
    ])

The result contains one entry per input text, in the same order:

.. code-block:: python

    [
        {"original": "The experiment ...", "paraphrases": ["...", "...", ...]},
        {"original": "She graduated ...", "paraphrases": ["...", "...", ...]},
    ]

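
Because the return value is a plain list of dictionaries, it can be consumed
with ordinary Python. The sketch below uses a hand-written stand-in for
``results`` (placeholder strings, not real model output) that follows the
shape shown above:

.. code-block:: python

    # Stand-in for the list returned by diversify(); the paraphrase strings
    # are placeholders, not real model output.
    results = [
        {"original": "Text A.", "paraphrases": ["A1", "A2"]},
        {"original": "Text B.", "paraphrases": ["B1", "B2"]},
    ]

    # Flatten into (original, paraphrase) pairs, e.g. for building a dataset.
    pairs = [
        (item["original"], paraphrase)
        for item in results
        for paraphrase in item["paraphrases"]
    ]
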

CSV / TSV file
--------------

Pass the path to a CSV or TSV file together with ``text_column``.
``diversify`` reads the file and writes a JSONL file next to the input,
named after the input file with a ``_diversified.jsonl`` suffix.

.. code-block:: python

    results = diversify("bios.csv", text_column="bio")
    # writes bios_diversified.jsonl

Each line in the JSONL output is one JSON object:

.. code-block:: json

    {"original": "Jane is a ...", "paraphrases": ["Jane works as a ...", "As a ..., Jane ..."]}
    {"original": "John studied ...", "paraphrases": ["John was educated ...", "..."]}

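
A JSONL file in this format can be read back with the standard library
alone. The sketch below is one way to do it: ``read_jsonl`` is a
hypothetical helper, and the demo writes its own small file with placeholder
records so that it runs without the library:

.. code-block:: python

    import json
    import tempfile
    from pathlib import Path


    def read_jsonl(path):
        """Return one dict per non-empty line of a JSONL file."""
        with open(path, encoding="utf-8") as handle:
            return [json.loads(line) for line in handle if line.strip()]


    # Placeholder records in the documented shape.
    sample = [
        {"original": "Text A.", "paraphrases": ["A1", "A2"]},
        {"original": "Text B.", "paraphrases": ["B1"]},
    ]

    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "bios_diversified.jsonl"
        lines = "\n".join(json.dumps(row) for row in sample) + "\n"
        path.write_text(lines, encoding="utf-8")
        records = read_jsonl(path)
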

TXT file
--------

Each non-empty line is treated as a separate text to diversify. Output is
written to a ``.jsonl`` file named after the input.

.. code-block:: python

    results = diversify("texts.txt")
    # writes texts.jsonl

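
The "one text per non-empty line" rule can be pictured with a few lines of
Python. This is only a sketch: ``split_lines`` is a hypothetical helper, and
stripping surrounding whitespace is an assumption rather than documented
behaviour:

.. code-block:: python

    def split_lines(raw):
        """Return the non-empty lines of a text blob, one text per line."""
        return [line.strip() for line in raw.splitlines() if line.strip()]


    texts = split_lines("First text.\n\nSecond text.\n   \nThird text.\n")
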

Controlling output location
----------------------------

By default, file inputs write output next to the input file, and in-memory
inputs (strings, lists) return a Python list. You can override this with
``output_dir`` and ``output_name``:

.. code-block:: python

    # Write output to a specific directory
    results = diversify("bios.csv", text_column="bio", output_dir="/results")
    # writes /results/bios_diversified.jsonl

    # Also set a custom filename
    results = diversify("bios.csv", text_column="bio", output_dir="/results", output_name="my_output")
    # writes /results/my_output.jsonl

    # Force a list input to write to disk instead of returning in-memory
    results = diversify(["text one", "text two"], output_dir=".")
    # writes ./diversified_output.jsonl

The ``.jsonl`` extension is always added automatically.

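
The naming in the examples above can be summarised in a small function. This
is a sketch that mirrors only the CSV and in-memory cases shown in this
section; ``resolve_output_path`` is a hypothetical name and not part of the
library:

.. code-block:: python

    from pathlib import Path


    def resolve_output_path(input_path=None, output_dir=None, output_name=None):
        """Mirror the file naming seen in the examples (illustration only)."""
        if output_name is not None:
            stem = output_name
        elif input_path is not None:
            stem = f"{Path(input_path).stem}_diversified"
        else:
            stem = "diversified_output"  # in-memory input written to disk

        if output_dir is not None:
            directory = Path(output_dir)
        elif input_path is not None:
            directory = Path(input_path).parent  # next to the input file
        else:
            directory = Path(".")

        return directory / f"{stem}.jsonl"


    default_path = resolve_output_path("bios.csv")
    custom_path = resolve_output_path("bios.csv", "/results", "my_output")
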

Longer texts
-------------

For tips on handling longer texts (punctuation splitting, increasing
``max_new_tokens``), see :doc:`longer_texts`.


Multiple methods
----------------

You can combine methods to get diverse paraphrases from different approaches.
The ``n`` requested paraphrases are distributed across the methods:

.. code-block:: python

    results = diversify("The cat sat on the mat.", methods=["tinystyler", "prompting"], n=4)

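
How a budget of ``n`` might be divided is easiest to see with a small
example. The scheme below is an assumption made for illustration: an even
split, with any remainder going to the methods listed first. How the library
handles remainders is not stated here, and ``split_budget`` is a
hypothetical helper:

.. code-block:: python

    def split_budget(n, methods):
        """Split n as evenly as possible; earlier methods get any remainder."""
        base, extra = divmod(n, len(methods))
        return {
            name: base + (1 if index < extra else 0)
            for index, name in enumerate(methods)
        }


    shares = split_budget(4, ["tinystyler", "prompting"])

Under this scheme, ``n=4`` with two methods gives each method two
paraphrases.
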

Customising the TinyStyler style bank
--------------------------------------

TinyStyler generates each paraphrase by conditioning on a *style example*: a
short sentence that demonstrates the target writing style. The style bank is
the list of such examples that get cycled through when producing multiple
paraphrases.

The default bank is a dictionary mapping style labels to lists of example
sentences (drawn from the CORE corpus). You can replace or extend it by
passing a custom bank via ``method_kwargs``.

A style bank can be a ``dict[str, list[str]]`` or a ``list[list[str]]``:

.. code-block:: python

    from diversify_text import diversify

    custom_bank = {
        "academic": ["The results demonstrate a statistically significant effect."],
        "enthusiastic": ["We found something really interesting — check this out!"],
        "telegraphic": ["Key finding: effect confirmed. Details follow."],
    }

    results = diversify(
        "The experiment was conducted in a controlled lab setting.",
        method_kwargs={"tinystyler": {"style_bank": custom_bank}},
    )

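
To make "cycling through" the bank concrete, the sketch below picks style
labels for ``n`` paraphrases and wraps around when ``n`` exceeds the number
of styles. The iteration order and the ``pick_styles`` helper are
assumptions for illustration; the library's own selection logic may differ:

.. code-block:: python

    from itertools import cycle, islice

    bank = {
        "academic": ["The results demonstrate a statistically significant effect."],
        "telegraphic": ["Key finding: effect confirmed. Details follow."],
    }


    def pick_styles(style_bank, n):
        """Return n style labels, wrapping around the bank when needed."""
        return list(islice(cycle(style_bank), n))


    labels = pick_styles(bank, 3)
    assert labels == ["academic", "telegraphic", "academic"]
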

``DEFAULT_STYLE_BANK`` is exported from ``diversify_text.styles`` so you
can build on it:

.. code-block:: python

    from diversify_text.styles import DEFAULT_STYLE_BANK

    extended_bank = {
        **DEFAULT_STYLE_BANK,
        "scientific": ["The data clearly indicate a statistically significant result."],
    }

You can also select specific styles by key name with ``styles``, instead of
cycling through the entire bank. The number of paraphrases is determined by
the number of selected styles:

.. code-block:: python

    results = diversify(
        "The experiment was conducted in a controlled lab setting.",
        method_kwargs={"tinystyler": {"styles": ["research_article", "personal_blog", "recipe"]}},
    )

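
Selecting styles by key amounts to taking a subset of the bank. The sketch
below shows the idea on a stand-in bank with placeholder sentences;
``select_styles`` is a hypothetical helper, and raising ``KeyError`` for an
unknown key is an assumption, not documented behaviour:

.. code-block:: python

    def select_styles(style_bank, styles):
        """Return the sub-bank for the requested keys, in the requested order."""
        missing = [name for name in styles if name not in style_bank]
        if missing:
            raise KeyError(f"Unknown style(s): {missing}")
        return {name: style_bank[name] for name in styles}


    # Stand-in bank; the sentences are placeholders, not the default examples.
    stand_in_bank = {
        "research_article": ["Placeholder research sentence."],
        "personal_blog": ["Placeholder blog sentence."],
        "recipe": ["Placeholder recipe sentence."],
    }

    selected = select_styles(stand_in_bank, ["recipe", "personal_blog"])
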

.. _creating-a-custom-method:

Creating a custom method
------------------------

Subclass ``DiversificationMethod``, set ``name``, and implement
``generate``, which returns one list of paraphrases per input text:

.. code-block:: python

    from diversify_text import Diversifier
    from diversify_text.method import DiversificationMethod


    class MyMethod(DiversificationMethod):
        name = "my_method"

        def generate(self, texts, *, n, max_new_tokens, temperature, top_p, **kwargs):
            # One inner list of n variants for each input text.
            return [[f"{text} :: variant {i}" for i in range(n)] for text in texts]


    results = Diversifier(methods=[MyMethod()]).diversify("Hello", n=3)

The result has the same shape as for the built-in methods:

.. code-block:: python

    [{"original": "Hello", "paraphrases": ["Hello :: variant 0", "Hello :: variant 1", "Hello :: variant 2"]}]

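
When writing a custom method, it can help to check the shape of what
``generate`` returns before plugging it in. The helper below is a
hypothetical sanity check, not part of the library, and it assumes that each
inner list should hold exactly ``n`` strings, as in the example above:

.. code-block:: python

    def check_generate_output(outputs, texts, n):
        """Raise ValueError unless outputs has one list of n strings per text."""
        if len(outputs) != len(texts):
            raise ValueError("expected one list of paraphrases per input text")
        for variants in outputs:
            if len(variants) != n:
                raise ValueError(f"expected {n} paraphrases per text")
            if not all(isinstance(variant, str) for variant in variants):
                raise ValueError("paraphrases must be strings")


    texts = ["Hello", "World"]
    # Same list-of-lists construction as MyMethod.generate in the example.
    outputs = [[f"{text} :: variant {i}" for i in range(3)] for text in texts]
    check_generate_output(outputs, texts, n=3)  # passes silently
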