Usage Guide
===========

Control number of paraphrases
-----------------------------

Use ``n`` to set how many paraphrases are generated per text:

.. code-block:: python

    from diversify_text import diversify

    results = diversify("Some text.", n=3)

The result is a list with one dictionary per input text:

.. code-block:: python

    [{"original": "Some text.", "paraphrases": ["...", "...", "..."]}]

Reproducibility (seed)
----------------------

``diversify`` sets a default random seed (``51173``) to make runs more
reproducible. The seed is applied to Python's ``random``, PyTorch (CPU and
CUDA), and NumPy, and it is logged at the start of each run. Exact determinism
is **not** guaranteed across different hardware, library versions, or backends.

To get a different set of paraphrases, pass a different seed:

.. code-block:: python

    results = diversify("Some text.", seed=123)

To disable seeding entirely (non-deterministic output):

.. code-block:: python

    results = diversify("Some text.", seed=None)

List of texts
-------------

.. code-block:: python

    results = diversify([
        "The experiment was conducted in a controlled lab setting.",
        "She graduated from MIT in 2019.",
    ])

.. code-block:: python

    [
        {"original": "The experiment ...", "paraphrases": ["...", "...", ...]},
        {"original": "She graduated ...", "paraphrases": ["...", "...", ...]},
    ]

CSV / TSV file
--------------

Reads the file and writes a JSONL file next to the input
(``_diversified.jsonl``).

.. code-block:: python

    results = diversify("bios.csv", text_column="bio")
    # writes bios_diversified.jsonl

Each line in the JSONL output is one JSON object:

.. code-block:: json

    {"original": "Jane is a ...", "paraphrases": ["Jane works as a ...", "As a ..., Jane ..."]}
    {"original": "John studied ...", "paraphrases": ["John was educated ...", "..."]}

TXT file
--------

Each non-empty line is treated as a separate text to diversify. Output is
written to ``.jsonl``.

.. code-block:: python

    results = diversify("texts.txt")
    # writes texts.jsonl

Controlling output location
---------------------------

By default, file inputs write their output next to the input file, and
in-memory inputs (strings, lists) return a Python list. You can override this
with ``output_dir`` and ``output_name``:

.. code-block:: python

    # Write output to a specific directory
    results = diversify("bios.csv", text_column="bio", output_dir="/results")
    # writes /results/bios_diversified.jsonl

    # Also set a custom filename
    results = diversify("bios.csv", text_column="bio", output_dir="/results", output_name="my_output")
    # writes /results/my_output.jsonl

    # Force a list input to write to disk instead of returning in-memory
    results = diversify(["text one", "text two"], output_dir=".")
    # writes ./diversified_output.jsonl

The ``.jsonl`` extension is always added automatically.

Longer texts
------------

For tips on handling longer texts (punctuation splitting, increasing
``max_new_tokens``), see :doc:`longer_texts`.

Multiple methods
----------------

You can combine methods to get diverse paraphrases from different approaches.
The ``n`` requested paraphrases are distributed across the methods:

.. code-block:: python

    results = diversify("The cat sat on the mat.", methods=["tinystyler", "prompting"], n=4)

Customising the TinyStyler style bank
-------------------------------------

TinyStyler generates each paraphrase by conditioning on a *style example*: a
short sentence that demonstrates the target writing style. The style bank is
the list of such examples that is cycled through when producing multiple
paraphrases.

The default bank is a dictionary mapping style labels to lists of example
sentences (drawn from the CORE corpus). You can replace or extend it by passing
a custom bank via ``method_kwargs``. A style bank can be a
``dict[str, list[str]]`` or a ``list[list[str]]``:

.. code-block:: python

    from diversify_text import diversify

    custom_bank = {
        "academic": ["The results demonstrate a statistically significant effect."],
        "enthusiastic": ["We found something really interesting — check this out!"],
        "telegraphic": ["Key finding: effect confirmed. Details follow."],
    }

    results = diversify(
        "The experiment was conducted in a controlled lab setting.",
        method_kwargs={"tinystyler": {"style_bank": custom_bank}},
    )

``DEFAULT_STYLE_BANK`` is exported from ``diversify_text.styles`` so you can
build on it:

.. code-block:: python

    from diversify_text.styles import DEFAULT_STYLE_BANK

    extended_bank = {
        **DEFAULT_STYLE_BANK,
        "scientific": ["The data clearly indicate a statistically significant result."],
    }

Instead of cycling through the entire bank, you can also select specific styles
by key name with ``styles``. The number of paraphrases is then determined by
the number of selected styles:

.. code-block:: python

    results = diversify(
        "The experiment was conducted in a controlled lab setting.",
        method_kwargs={"tinystyler": {"styles": ["research_article", "personal_blog", "recipe"]}},
    )

.. _creating-a-custom-method:

Creating a custom method
------------------------

To add your own method, subclass ``DiversificationMethod``, give it a ``name``,
implement ``generate``, and pass an instance to ``Diversifier``:

.. code-block:: python

    from diversify_text import Diversifier
    from diversify_text.method import DiversificationMethod


    class MyMethod(DiversificationMethod):
        name = "my_method"

        def generate(self, texts, *, n, max_new_tokens, temperature, top_p, **kwargs):
            # Return one list of n paraphrases per input text
            return [[f"{text} :: variant {i}" for i in range(n)] for text in texts]


    results = Diversifier(methods=[MyMethod()]).diversify("Hello", n=3)

.. code-block:: python

    [{"original": "Hello", "paraphrases": ["Hello :: variant 0", "Hello :: variant 1", "Hello :: variant 2"]}]
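
The example above suggests that ``generate`` returns one list of ``n`` strings
per input text. When developing a custom method, it can help to check that
shape before handing the method to ``Diversifier``. The following is a minimal
sketch under that assumption: ``check_generate_output`` and ``toy_generate``
are hypothetical helpers written for illustration, not part of
``diversify_text``, and the library may enforce different or additional rules.

.. code-block:: python

    def check_generate_output(texts, outputs, n):
        """Check the assumed shape: one list of n strings per input text."""
        if len(outputs) != len(texts):
            raise ValueError(f"expected {len(texts)} result lists, got {len(outputs)}")
        for text, variants in zip(texts, outputs):
            if len(variants) != n:
                raise ValueError(f"expected {n} paraphrases for {text!r}, got {len(variants)}")
            if not all(isinstance(v, str) for v in variants):
                raise TypeError(f"non-string paraphrase returned for {text!r}")
        return outputs


    def toy_generate(texts, *, n, **kwargs):
        # Hypothetical stand-in with the same logic as MyMethod.generate above
        return [[f"{text} :: variant {i}" for i in range(n)] for text in texts]


    checked = check_generate_output(["Hello"], toy_generate(["Hello"], n=3), n=3)
    print(checked)
    # [['Hello :: variant 0', 'Hello :: variant 1', 'Hello :: variant 2']]

Running a check like this in a unit test catches shape mistakes (for example,
returning a flat list of strings) without loading any model.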