API Reference

Top-level API

diversify_text.diversify(texts, *, device=None, methods=None, semantic_filter=False, **kwargs)[source]

One-shot convenience function: create a Diversifier and run it.

The generation method(s) and the MIS filter are cached independently between calls. Generation methods are cached per (device, resolved method list), while the MIS filter is cached per device. Switching semantic_filter on or off reuses the cached generation models, and changing methods reuses the cached MIS filter when possible. Expensive components are only recreated when their respective cache keys change; changing filter thresholds (min_score, n_candidates) updates the existing MIS filter instance rather than reloading it.

Parameters:

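texts (str | list[str] | Iterable[str]) – Input text(s): a single text, a list of texts, an iterable of texts, or a path to a .csv, .tsv, or .txt file (see Diversifier.diversify()).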
device (str, optional) – Torch device.
methods (sequence[str | DiversificationMethod], optional) – Method names and/or pre-built method instances.
semantic_filter (bool) – When True, score each paraphrase with the Mutual Implication Score model and select the best candidate above a minimum score.
**kwargs – Forwarded to Diversifier (min_score, n_candidates) and Diversifier.diversify() (n, text_column, batch_size, max_new_tokens, temperature, top_p, seed, method_kwargs, preprocess_kwargs, output_dir, output_name).

Returns:

See Diversifier.diversify().

Return type:

list[dict] | Path

Notes

The internal cache is not thread-safe. For multi-threaded applications, use Diversifier directly.
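The caching rules above can be pictured as two independent dictionaries. The following is a minimal sketch of that idea, not the library's implementation: FakeMethods, FakeMISFilter, get_components, and the default threshold values are all hypothetical stand-ins.

```python
from dataclasses import dataclass


# Hypothetical stand-ins for the expensive components; not part of diversify_text.
@dataclass
class FakeMethods:
    device: str
    names: tuple


@dataclass
class FakeMISFilter:
    device: str
    min_score: float = 0.5  # illustrative value, not the library default
    n_candidates: int = 5   # illustrative value, not the library default


_GEN_CACHE = {}  # keyed by (device, resolved method names)
_MIS_CACHE = {}  # keyed by device only


def get_components(device, methods, semantic_filter=False, **filter_kwargs):
    """Return (generation methods, MIS filter or None), reusing cached objects."""
    gen_key = (device, tuple(methods))
    if gen_key not in _GEN_CACHE:
        _GEN_CACHE[gen_key] = FakeMethods(device, tuple(methods))

    mis = None
    if semantic_filter:
        if device not in _MIS_CACHE:
            _MIS_CACHE[device] = FakeMISFilter(device)
        mis = _MIS_CACHE[device]
        # Threshold changes update the cached instance instead of reloading it.
        for key, value in filter_kwargs.items():
            setattr(mis, key, value)
    return _GEN_CACHE[gen_key], mis
```

Because the two caches use different keys, toggling semantic_filter never invalidates the generation entry, and changing the method list never invalidates the filter entry. Plain module-level dictionaries like these are also unsynchronized, which illustrates why such a cache is not thread-safe.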

class diversify_text.Diversifier(device=None, *, methods=None, semantic_filter=False, _methods=None, _mis_filter=None, **filter_kwargs)[source]

Bases: object

Generate stylistic paraphrases using one or more pluggable methods.

Each method can be a separate model or algorithm. The class supports combining multiple methods and automatically distributing requested styles across them.

Parameters:

device (str, optional) – Torch device ("cuda", "cpu", "mps", …).
methods (sequence[str | DiversificationMethod], optional) – Method names and/or pre-built method instances.
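semantic_filter (bool) – When True, score each paraphrase with the Mutual Implication Score model and select the best candidate above a minimum score.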
_methods (list[DiversificationMethod] | None)
_mis_filter (MISFilter | None)
filter_kwargs (Any) – Options for the MIS filter, such as min_score and n_candidates.

Example

>>> div = Diversifier(methods=["tinystyler"])
>>> results = div.diversify("The experiment was conducted in a lab.")
>>> len(results)  # one dict per input text
1
>>> list(results[0].keys())
['original', 'paraphrases']

diversify(texts, *, n=None, text_column='text', batch_size=32, max_new_tokens=None, temperature=None, top_p=None, seed=<object object>, method_kwargs=None, preprocess_kwargs=None, output_dir=None, output_name=None)[source]

Produce n stylistic paraphrases for each input text.

Parameters:

texts (str | list[str] | Iterable[str]) – A single text, a list of texts, a generator/iterable of texts, or a path to a .csv, .tsv, or .txt file.
n (int or None) – Number of stylistically diverse paraphrases to generate per input text. None (default) uses len(prompt_keys) when prompt keys are provided via method_kwargs, or 5 otherwise.
text_column (str) – Column name to extract when texts points to a CSV/TSV file.
batch_size (int) – Number of texts to pull from the input iterator per batch.
max_new_tokens (int, optional) – Maximum number of tokens to generate per paraphrase. None lets each method choose its own default.
temperature (float, optional) – Sampling temperature. None lets each method choose its own default.
top_p (float, optional) – Nucleus-sampling probability mass. None lets each method choose its own default.
seed (int or None, optional) – Random seed for reproducible output. Seeds Python’s random, PyTorch (CPU + CUDA), and NumPy if available. When omitted, the default seed (51173) is applied on the first call only and skipped on subsequent calls. Pass an explicit integer to always (re-)seed. Pass None to disable seeding entirely.
method_kwargs (mapping[str, dict], optional) – Per-method keyword arguments. Example: {"tinystyler": {"style_bank": [...]}}.
preprocess_kwargs (dict, optional) – Keyword arguments forwarded to preprocess(). Example: {"split_on_punctuation": True}.
output_dir (str | Path, optional) – Directory to write output files into. When provided for str / list[str] input, forces disk output instead of in-memory. Defaults vary by input type (see resolve_output_path()).
output_name (str, optional) – Base filename (without extension). The .jsonl extension is appended automatically.

Returns:

For in-memory input (str, list[str]) without output_dir, returns a list with one entry per input text:

{"original": str, "paraphrases": list[str]}

Otherwise, returns the Path to the output file(s).

Return type:

list[dict] | Path
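The three seed modes (omitted, explicit integer, explicit None) are usually distinguished with a sentinel default, which is what the <object object> in the signature suggests. A minimal sketch under that assumption follows; SeedPolicy and _UNSET are hypothetical names, and only Python's random module is seeded here.

```python
import random

_UNSET = object()     # sentinel: tells "omitted" apart from an explicit None
DEFAULT_SEED = 51173  # default seed stated in the parameter description


class SeedPolicy:
    """Hypothetical helper that mimics the documented seeding rules."""

    def __init__(self):
        self._default_applied = False

    def apply(self, seed=_UNSET):
        """Seed the RNG if the rules call for it; return the seed used or None."""
        if seed is None:
            return None  # seeding disabled entirely
        if seed is _UNSET:
            if self._default_applied:
                return None  # default seed is applied on the first call only
            self._default_applied = True
            seed = DEFAULT_SEED
        random.seed(seed)  # explicit integers always (re-)seed
        return seed
```

A plain None default could not express "omitted", since None already means "do not seed".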

Input resolution

Input resolution for diversify.

Converts the many input forms users can provide (single string, list, generator, CSV/TSV/TXT file path) into a uniform Iterator[str] plus an InputContext that describes the source.

class diversify_text._input.InputKind(*values)[source]

Bases: Enum

Discriminator for how the user provided input.

SINGLE_STR = 1

LIST = 2

ITERABLE = 3

FILE_CSV = 4

FILE_TSV = 5

FILE_TXT = 6

class diversify_text._input.InputContext(kind, input_path=None, text_column=None, total=None)[source]

Bases: object

Read-only metadata about the resolved input source.

Parameters:

kind (InputKind)
input_path (Path | None)
text_column (str | None)
total (int | None)

kind: InputKind

input_path: Path | None = None

text_column: str | None = None

total: int | None = None

diversify_text._input.resolve_input(texts, text_column='text')[source]

Convert any supported input into a lazy Iterator[str] plus metadata.

Parameters:

texts (str | list[str] | Iterable[str]) – A single text, a list of texts, a generator / iterable of texts, or a path to a .csv, .tsv, or .txt file.
text_column (str) – Column name to extract when texts points to a CSV/TSV file.

Return type:

(Iterator[str], InputContext)
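To illustrate how an input might map onto InputKind, here is a minimal sketch. The enum mirrors the documented members, but classify_input is a hypothetical helper, and the rule that a string counts as a file path only when it names an existing file with a known suffix is an assumption rather than documented behaviour.

```python
from enum import Enum
from pathlib import Path


class InputKind(Enum):
    """Mirror of the documented enum members."""
    SINGLE_STR = 1
    LIST = 2
    ITERABLE = 3
    FILE_CSV = 4
    FILE_TSV = 5
    FILE_TXT = 6


_SUFFIX_TO_KIND = {
    ".csv": InputKind.FILE_CSV,
    ".tsv": InputKind.FILE_TSV,
    ".txt": InputKind.FILE_TXT,
}


def classify_input(texts):
    """Hypothetical classifier; the file-detection rule is an assumption."""
    if isinstance(texts, str):
        path = Path(texts)
        kind = _SUFFIX_TO_KIND.get(path.suffix.lower())
        if kind is not None and path.is_file():
            return kind
        return InputKind.SINGLE_STR
    if isinstance(texts, list):
        return InputKind.LIST
    return InputKind.ITERABLE  # generators and other iterables
```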

Output writing

Output path resolution and incremental writing for diversify.

Decides where results go (in-memory vs. disk) and writes them in the appropriate format (Python list or JSONL).

diversify_text._output.resolve_output_path(input_context, output_dir=None, output_name=None)[source]

Determine where output should be written, or None for in-memory.

The user controls where (directory) and what name (stem) to use, but the extension is always .jsonl.

Directory defaults (when output_dir is None):

SINGLE_STR / LIST → None (keep in memory, return as list[dict]). These are small, known-size inputs from Python code, so the caller typically wants results as Python objects.
ITERABLE → current working directory.
FILE_CSV / FILE_TSV / FILE_TXT → same directory as the input file.

If output_dir is provided for SINGLE_STR / LIST, results are written to disk instead of being returned in memory.

Name defaults (when output_name is None):

FILE_CSV / FILE_TSV → <input_stem>_diversified
FILE_TXT → <input_stem>
Everything else → diversified_output

Parameters:

input_context (InputContext) – Metadata produced by resolve_input().
output_dir (str, Path, or None) – Directory to write output files into.
output_name (str or None) – Base filename (without extension). The .jsonl extension is appended automatically. If the name already contains another extension, that extension is kept and .jsonl is appended after it; a name that already ends with .jsonl is used unchanged.

Returns:

None means in-memory mode; otherwise the path to write to.

Return type:

Path or None
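The directory and name defaults described above can be restated as a small function. This is a sketch of the documented rules only: sketch_output_path is a hypothetical name, and it takes the kind as a plain string instead of an InputContext to stay self-contained.

```python
from pathlib import Path


def sketch_output_path(kind, input_path=None, output_dir=None, output_name=None):
    """Hypothetical restatement of the documented defaults.

    kind is an InputKind member name given as a string, e.g. "FILE_CSV".
    """
    is_file = kind in {"FILE_CSV", "FILE_TSV", "FILE_TXT"}

    # Directory defaults.
    if output_dir is None:
        if kind in {"SINGLE_STR", "LIST"}:
            return None  # in-memory mode
        directory = Path(input_path).parent if is_file else Path.cwd()
    else:
        directory = Path(output_dir)

    # Name defaults.
    if output_name is None:
        if kind in {"FILE_CSV", "FILE_TSV"}:
            output_name = f"{Path(input_path).stem}_diversified"
        elif kind == "FILE_TXT":
            output_name = Path(input_path).stem
        else:
            output_name = "diversified_output"

    # The extension is always .jsonl; other extensions are kept, not stripped.
    if not output_name.endswith(".jsonl"):
        output_name += ".jsonl"
    return directory / output_name
```

With these rules, a LIST input with no output_dir yields None, while a hypothetical data/reviews.csv input yields data/reviews_diversified.jsonl.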

class diversify_text._output.OutputWriter(input_context, n, output_path)[source]

Bases: object

Incrementally writes diversify results to the right format.

Modes

In-memory (output_path is None): accumulates list[dict] with keys "original" and "paraphrases".
JSONL (output_path is not None): writes one JSON object per line to a .jsonl file.

__init__(input_context, n, output_path)[source]

Initialize the writer.

Parameters:

input_context (InputContext) – Metadata about the input source (kind, path, etc.).
n (int) – Number of paraphrase styles requested per text.
output_path (Path or None) – Where to write results on disk. None means results are kept in memory and returned as list[dict].

Return type:

None

open()[source]

Open the file handle when writing to disk.

Must be called before write_batch().

If output_path is None, does nothing (in-memory mode).
Otherwise, opens a single JSONL file for writing.

Return type:

None

write_batch(originals, paraphrases_by_text)[source]

Append one batch of results.

Parameters:

originals (list[str]) – The original texts in this batch.
paraphrases_by_text (list[list[str]]) – One inner list per original text, each containing n paraphrased variants. For example, with 2 styles and 2 texts: [["a_style1", "a_style2"], ["b_style1", "b_style2"]].

Raises:

ValueError – If originals and paraphrases_by_text have different lengths.

Return type:

None

finish()[source]

Close the file handle and return the final result.

Returns:

list[dict] – When output_path was None (in-memory mode). Each dict has keys "original" and "paraphrases".
Path – When results were written to disk — the .jsonl path.

Return type:

list[dict] | Path
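The open / write_batch / finish lifecycle can be sketched with a small stand-in class. SketchWriter is hypothetical and omits input_context and n. Using the same "original" and "paraphrases" keys for the JSONL rows is an assumption, as is raising RuntimeError when open() was skipped.

```python
import json


class SketchWriter:
    """Hypothetical stand-in illustrating the two documented modes."""

    def __init__(self, output_path=None):
        self.output_path = output_path
        self._rows = []       # used in in-memory mode
        self._handle = None   # used in JSONL mode

    def open(self):
        if self.output_path is not None:
            self._handle = open(self.output_path, "w", encoding="utf-8")

    def write_batch(self, originals, paraphrases_by_text):
        if len(originals) != len(paraphrases_by_text):
            raise ValueError("originals and paraphrases_by_text differ in length")
        if self.output_path is not None and self._handle is None:
            raise RuntimeError("open() must be called before write_batch()")
        for original, paraphrases in zip(originals, paraphrases_by_text):
            row = {"original": original, "paraphrases": list(paraphrases)}
            if self._handle is None:
                self._rows.append(row)
            else:
                # One JSON object per line.
                self._handle.write(json.dumps(row, ensure_ascii=False) + "\n")

    def finish(self):
        if self._handle is None:
            return self._rows
        self._handle.close()
        return self.output_path
```

Writing each batch as it arrives means a large file or generator input never has to be held in memory as a whole.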


Preprocessing

Text preprocessing utilities for diversify.

class diversify_text._preprocess.PreprocessContext(segments_per_text=None)[source]

Bases: object

State produced by preprocess() and consumed by postprocess().

New preprocessing steps can add fields here without changing the caller in core.py.

Parameters:

segments_per_text (list[list[str]] | None)

segments_per_text: list[list[str]] | None = None

diversify_text._preprocess.split_sentences(text)[source]

Split text into sentences using pysbd.

Returns a list of stripped sentence strings. If the text is empty or whitespace-only, returns a single-element list containing the stripped (possibly empty) input.

Parameters:

text (str)

Return type:

list[str]

diversify_text._preprocess.preprocess(texts, *, split_on_punctuation=False)[source]

Prepare a batch of texts for generation.

Returns the (possibly transformed) texts to feed into the generation method, together with a PreprocessContext that postprocess() needs to undo the transformations.

Parameters:

texts (list[str]) – Original input texts.
split_on_punctuation (bool) – If True, split each text into sentence-level segments and flatten the result. The per-text segment mapping is stored in the context so that postprocess() can reassemble them.

Returns:

generation_texts (list[str]) – Texts to pass to the generation method.
context (PreprocessContext) – Context needed by postprocess().

Return type:

tuple[list[str], PreprocessContext]
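The split-and-flatten step can be sketched as follows. Both functions are hypothetical: naive_split is a regex stand-in for split_sentences() (the library uses pysbd), and sketch_preprocess returns the segment mapping directly instead of a PreprocessContext.

```python
import re


def naive_split(text):
    """Regex stand-in for split_sentences(); the library uses pysbd instead."""
    stripped = text.strip()
    if not stripped:
        return [stripped]  # single-element list, possibly holding ""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", stripped)]


def sketch_preprocess(texts, split_on_punctuation=False):
    """Return (generation_texts, segments_per_text or None)."""
    if not split_on_punctuation:
        return list(texts), None
    segments_per_text = [naive_split(t) for t in texts]
    # Flatten so the generation method sees one sentence per entry.
    flat = [seg for segments in segments_per_text for seg in segments]
    return flat, segments_per_text
```

The per-text segment lists record how many flattened entries belong to each original text, which is the information needed to reassemble them later.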

Postprocessing

Text postprocessing utilities for diversify.

diversify_text._postprocess.reassemble_segments(segments_per_text, paraphrases_by_segment)[source]

Join per-segment paraphrases back into per-original-text paraphrases.

Parameters:

segments_per_text (list[list[str]]) – The sentence segments for each original text (from split_sentences()).
paraphrases_by_segment (list[list[str]]) – Flat list of paraphrases for every segment, shape [total_segments][n].

Returns:

Shape [n_texts][n] — reassembled paraphrases.

Return type:

list[list[str]]
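A minimal sketch of the reassembly step, going from shape [total_segments][n] back to [n_texts][n]. sketch_reassemble is a hypothetical name, and joining segments with a single space is an assumption; the separator the library actually uses is not documented here.

```python
def sketch_reassemble(segments_per_text, paraphrases_by_segment, sep=" "):
    """Hypothetical reassembly; the separator is an assumption."""
    results = []
    cursor = 0
    for segments in segments_per_text:
        # Take the paraphrase rows that belong to this original text.
        chunk = paraphrases_by_segment[cursor:cursor + len(segments)]
        cursor += len(segments)
        # zip(*chunk) groups the k-th paraphrase of every segment together.
        results.append([sep.join(parts) for parts in zip(*chunk)])
    return results
```

For example, two texts split into 2 and 1 segments with n = 2 consume three rows of paraphrases_by_segment and produce two rows of two strings each.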

diversify_text._postprocess.postprocess(candidate, context)[source]

Undo preprocessing transformations on a candidate set.

Applies the inverse of each step performed by preprocess(), using the state stored in context.

Parameters:

candidate (list[list[str]]) – Raw generation output, shape [n_generation_texts][n].
context (PreprocessContext) – Context returned by preprocess().

Returns:

Shape [n_texts][n] — one paraphrase per original text per style.

Return type:

list[list[str]]

Methods

class diversify_text.method.base.DiversificationMethod[source]

Bases: ABC

Interface for pluggable diversification methods.

name = 'base'

prepare()[source]

Load any resources (models, tokenizers) needed before generation.

Called once before the progress bar starts so that loading messages appear before generation begins. No-op by default.

Return type:

None

abstractmethod generate(texts, *, n, max_new_tokens, temperature, top_p, **kwargs)[source]

Return paraphrases per input text.

Output shape must be len(texts) x n.

Parameters:

texts (list[str])
n (int)
max_new_tokens (int | None)
temperature (float | None)
top_p (float | None)
kwargs (Any)

Return type:

list[list[str]]
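A custom method only needs to implement generate() and honour the len(texts) x n output shape. The sketch below uses a local stand-in base class so that it runs without the library installed. MethodBase mirrors the documented interface but is not the library class, and UppercaseMethod is a toy example.

```python
from abc import ABC, abstractmethod


class MethodBase(ABC):
    """Stand-in mirroring the documented interface; not the library class."""

    name = "base"

    def prepare(self):
        """Load resources before generation; no-op by default."""
        return None

    @abstractmethod
    def generate(self, texts, *, n, max_new_tokens, temperature, top_p, **kwargs):
        """Return len(texts) lists of n paraphrases each."""


class UppercaseMethod(MethodBase):
    """Toy method: returns n upper-cased copies of each text."""

    name = "uppercase"

    def generate(self, texts, *, n, max_new_tokens=None, temperature=None,
                 top_p=None, **kwargs):
        # Sampling parameters are ignored because nothing is sampled here.
        return [[text.upper()] * n for text in texts]
```

With the real base class, the same subclass body should apply after swapping MethodBase for DiversificationMethod, provided the abstract interface is as documented above.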

class diversify_text.method.registry.MethodRegistry[source]

Bases: object

Registry of named diversification method classes.

register(name, method_cls)[source]

Parameters:

name (str)
method_cls (type[DiversificationMethod])

Return type:

None

get(name)[source]

Parameters:

name (str)

Return type:

type[DiversificationMethod]

resolve(methods, **kwargs)[source]

Resolve a sequence of method names/instances into ready instances.

String entries are looked up in the registry and instantiated. Pre-built DiversificationMethod instances are passed through as-is. Extra kwargs (e.g. device) are forwarded to each constructor only if it accepts them.

Parameters:

methods (sequence[str | DiversificationMethod]) – Method names and/or pre-built instances.
**kwargs – Keyword arguments forwarded to method constructors (only those accepted by the constructor signature).

Return type:

list[DiversificationMethod]

Raises:

KeyError – If a string name is not registered.
TypeError – If an element is neither str nor DiversificationMethod.
ValueError – If the resulting list is empty.
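Forwarding keyword arguments "only if the constructor accepts them" is commonly done with inspect.signature. The sketch below shows that technique under stated assumptions: SketchRegistry, WithDevice, and NoArgs are hypothetical, and a pre-built instance is recognised by a callable generate attribute rather than by an isinstance check against DiversificationMethod.

```python
import inspect


class SketchRegistry:
    """Hypothetical registry illustrating the documented resolve() rules."""

    def __init__(self):
        self._classes = {}

    def register(self, name, method_cls):
        self._classes[name] = method_cls

    def resolve(self, methods, **kwargs):
        resolved = []
        for entry in methods:
            if isinstance(entry, str):
                cls = self._classes[entry]  # KeyError if not registered
                params = inspect.signature(cls).parameters
                takes_any = any(p.kind is p.VAR_KEYWORD for p in params.values())
                # Forward only the kwargs that the constructor accepts.
                accepted = {k: v for k, v in kwargs.items()
                            if takes_any or k in params}
                resolved.append(cls(**accepted))
            elif callable(getattr(entry, "generate", None)):
                resolved.append(entry)  # pre-built instance, passed through
            else:
                raise TypeError(f"unsupported method entry: {entry!r}")
        if not resolved:
            raise ValueError("no methods were resolved")
        return resolved


class WithDevice:
    def __init__(self, device=None):
        self.device = device

    def generate(self, texts, **kwargs):
        return [[] for _ in texts]


class NoArgs:
    def __init__(self):
        pass

    def generate(self, texts, **kwargs):
        return [[] for _ in texts]
```

Here resolve(["with_device", "no_args"], device="cpu") passes device to WithDevice only, so NoArgs does not fail on an unexpected keyword.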

unregister(name)[source]

Parameters:

name (str)

Return type:

None

names()[source]

Return type:

list[str]