API Reference
Top-level API
- diversify_text.diversify(texts, *, device=None, methods=None, semantic_filter=False, **kwargs)[source]
One-shot convenience function: create a Diversifier and run it.
The generation method(s) and the MIS filter are cached independently between calls. Generation methods are cached per (device, resolved method list), while the MIS filter is cached per device. Switching semantic_filter on or off reuses the cached generation models, and changing methods reuses the cached MIS filter when possible. Expensive components are only recreated when their respective cache keys change; changing filter thresholds (min_score, n_candidates) updates the existing MIS filter instance rather than reloading it.
- Parameters:
device (str, optional) – Torch device.
methods (sequence[str | DiversificationMethod], optional) – Method names and/or pre-built method instances.
semantic_filter (bool) – When True, score each paraphrase with the Mutual Implication Score model and select the best candidate above a minimum score.
**kwargs – Forwarded to Diversifier (min_score, n_candidates) and Diversifier.diversify() (n, text_column, batch_size, max_new_tokens, temperature, top_p, seed, method_kwargs, preprocess_kwargs, output_dir, output_name).
- Returns:
- Return type:
Notes
The internal cache is not thread-safe. For multi-threaded applications, use Diversifier directly.
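The independent caching described above can be pictured with a small stand-alone sketch. It does not use diversify_text; ComponentCache, FakeMISFilter, and the threshold defaults are hypothetical names and values chosen only to illustrate the two cache keys and the in-place threshold update.

```python
from dataclasses import dataclass


@dataclass
class FakeMISFilter:
    """Hypothetical stand-in for a filter that is expensive to load."""
    device: str
    min_score: float = 0.5   # illustrative default, not the library's
    n_candidates: int = 4    # illustrative default, not the library's


class ComponentCache:
    """Two independent caches, keyed the way the description above states."""

    def __init__(self):
        self._methods = {}   # (device, tuple of method names) -> loaded methods
        self._filters = {}   # device -> FakeMISFilter
        self.loads = []      # records every simulated expensive load

    def get_methods(self, device, names):
        key = (device, tuple(names))
        if key not in self._methods:
            self.loads.append(("methods", key))
            self._methods[key] = [f"<{name} on {device}>" for name in names]
        return self._methods[key]

    def get_filter(self, device, **thresholds):
        if device not in self._filters:
            self.loads.append(("filter", device))
            self._filters[device] = FakeMISFilter(device)
        mis = self._filters[device]
        # Threshold changes update the cached instance instead of reloading it.
        for name, value in thresholds.items():
            setattr(mis, name, value)
        return mis


cache = ComponentCache()
cache.get_methods("cpu", ["tinystyler"])
first = cache.get_filter("cpu", min_score=0.7)
second = cache.get_filter("cpu", min_score=0.9)  # same object, new threshold
cache.get_methods("cpu", ["tinystyler"])         # cache hit, nothing reloaded
print(cache.loads)
```

Only two loads are recorded in this sketch: one for the method list and one for the filter. Nothing here is locked, which mirrors the note that the shared cache is not thread-safe.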
- class diversify_text.Diversifier(device=None, *, methods=None, semantic_filter=False, _methods=None, _mis_filter=None, **filter_kwargs)[source]
Bases: object
Generate stylistic paraphrases using one or more pluggable methods.
Each method can be a separate model or algorithm. The class supports combining multiple methods and automatically distributing requested styles across them.
- Parameters:
device (str, optional) – Torch device ("cuda", "cpu", "mps", …).
methods (sequence[str | DiversificationMethod], optional) – Method names and/or pre-built method instances.
semantic_filter (bool)
_methods (list[DiversificationMethod] | None)
_mis_filter (MISFilter | None)
filter_kwargs (Any)
Example
>>> div = Diversifier(methods=["tinystyler"])
>>> results = div.diversify("The experiment was conducted in a lab.")
>>> len(results)  # one dict per input text
1
>>> list(results[0].keys())
['original', 'paraphrases']
- diversify(texts, *, n=None, text_column='text', batch_size=32, max_new_tokens=None, temperature=None, top_p=None, seed=<object object>, method_kwargs=None, preprocess_kwargs=None, output_dir=None, output_name=None)[source]
Produce n stylistic paraphrases for each input text.
- Parameters:
texts (str | list[str] | Iterable[str]) – A single text, a list of texts, a generator/iterable of texts, or a path to a .csv, .tsv, or .txt file.
n (int or None) – Number of stylistically diverse paraphrases to generate per input text. None (default) uses len(prompt_keys) when prompt keys are provided via method_kwargs, or 5 otherwise.
text_column (str) – Column name to extract when texts points to a CSV/TSV file.
batch_size (int) – Number of texts to pull from the input iterator per batch.
max_new_tokens (int, optional) – Maximum number of tokens to generate per paraphrase. None lets each method choose its own default.
temperature (float, optional) – Sampling temperature. None lets each method choose its own default.
top_p (float, optional) – Nucleus-sampling probability mass. None lets each method choose its own default.
seed (int or None, optional) – Random seed for reproducible output. Seeds Python’s random, PyTorch (CPU + CUDA), and NumPy if available. When omitted, the default seed (51173) is applied on the first call only and skipped on subsequent calls. Pass an explicit integer to always (re-)seed. Pass None to disable seeding entirely.
method_kwargs (mapping[str, dict], optional) – Per-method keyword arguments. Example: {"tinystyler": {"style_bank": [...]}}.
preprocess_kwargs (dict, optional) – Keyword arguments forwarded to preprocess(). Example: {"split_on_punctuation": True}.
output_dir (str | Path, optional) – Directory to write output files into. When provided for str/list[str] input, forces disk output instead of in-memory. Defaults vary by input type (see resolve_output_path()).
output_name (str, optional) – Base filename (without extension). The .jsonl extension is appended automatically.
- Returns:
For in-memory input (str, list[str]) without output_dir, returns a list with one entry per input text:
{"original": str, "paraphrases": list[str]}
Otherwise, returns the Path to the output file(s).
- Return type:
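The three-way behaviour of seed (omitted, explicit integer, explicit None) is the kind of thing a sentinel default makes possible, which is presumably why the signature shows seed=<object object>. The sketch below is a stand-alone illustration, not the library's code: SeedPolicy and _UNSET are hypothetical names, only Python's random is seeded, and it assumes that an explicit seed does not count as "the first call" for the default.

```python
import random

_UNSET = object()      # sentinel: distinguishes "omitted" from an explicit None
DEFAULT_SEED = 51173   # default seed named in the parameter description


class SeedPolicy:
    """Hypothetical helper that decides which seed, if any, to apply."""

    def __init__(self):
        self._default_applied = False

    def resolve(self, seed=_UNSET):
        """Return the seed to apply for this call, or None to skip seeding."""
        if seed is _UNSET:
            if self._default_applied:
                return None            # default was already applied once
            self._default_applied = True
            return DEFAULT_SEED
        return seed  # explicit int: always re-seed; explicit None: never seed

    def apply(self, seed=_UNSET):
        resolved = self.resolve(seed)
        if resolved is not None:
            random.seed(resolved)  # this sketch seeds only Python's random
        return resolved


policy = SeedPolicy()
print(policy.apply())      # first call, seed omitted
print(policy.apply())      # later call, seed omitted
print(policy.apply(7))     # explicit integer
print(policy.apply(None))  # explicit None
```

A plain None default could not express "omitted" separately from "disable seeding", which is the design reason for a dedicated sentinel object.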
Input resolution
Input resolution for diversify.
Converts the many input forms users can provide (single string, list,
generator, CSV/TSV/TXT file path) into a uniform Iterator[str] plus
an InputContext that describes the source.
- class diversify_text._input.InputKind(*values)[source]
Bases: Enum
Discriminator for how the user provided input.
- SINGLE_STR = 1
- LIST = 2
- ITERABLE = 3
- FILE_CSV = 4
- FILE_TSV = 5
- FILE_TXT = 6
- class diversify_text._input.InputContext(kind, input_path=None, text_column=None, total=None)[source]
Bases: object
Read-only metadata about the resolved input source.
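To make the role of InputKind concrete, here is a rough sketch of how an input might be mapped to one of the six members. classify_input is a hypothetical helper, and the rule that a string is treated as a file purely because of its suffix is an assumption; the reference does not say how the library tells a file path from a plain text.

```python
import os
from collections.abc import Iterable
from enum import Enum


class InputKind(Enum):
    SINGLE_STR = 1
    LIST = 2
    ITERABLE = 3
    FILE_CSV = 4
    FILE_TSV = 5
    FILE_TXT = 6


_SUFFIX_TO_KIND = {
    ".csv": InputKind.FILE_CSV,
    ".tsv": InputKind.FILE_TSV,
    ".txt": InputKind.FILE_TXT,
}


def classify_input(texts):
    """Hypothetical helper: map user input to an InputKind."""
    if isinstance(texts, str):
        suffix = os.path.splitext(texts)[1].lower()
        # Assumption: a string counts as a file only if its suffix is known.
        return _SUFFIX_TO_KIND.get(suffix, InputKind.SINGLE_STR)
    if isinstance(texts, list):
        return InputKind.LIST
    if isinstance(texts, Iterable):
        return InputKind.ITERABLE
    raise TypeError(f"unsupported input type: {type(texts).__name__}")


print(classify_input("data/reviews.csv"))
print(classify_input("The experiment was conducted in a lab."))
print(classify_input(["first text", "second text"]))
print(classify_input(t for t in ["streamed text"]))
```

The string check comes first because a str is itself iterable; testing Iterable earlier would misclassify single texts.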
Output writing
Output path resolution and incremental writing for diversify.
Decides where results go (in-memory vs. disk) and writes them in the appropriate format (Python list or JSONL).
- diversify_text._output.resolve_output_path(input_context, output_dir=None, output_name=None)[source]
Determine where output should be written, or None for in-memory.
The user controls where (directory) and what name (stem) to use, but the extension is always .jsonl.
Directory defaults (when output_dir is None):
- SINGLE_STR/LIST → None (keep in memory, return as list[dict]). These are small, known-size inputs from Python code, so the caller typically wants results as Python objects.
- ITERABLE → current working directory.
- FILE_CSV/FILE_TSV/FILE_TXT → same directory as the input file.
If output_dir is provided for SINGLE_STR/LIST, results are written to disk instead of being returned in memory.
Name defaults (when output_name is None):
- FILE_CSV/FILE_TSV → <input_stem>_diversified
- FILE_TXT → <input_stem>
- Everything else → diversified_output
- Parameters:
input_context (InputContext) – Metadata produced by resolve_input().
output_dir (str, Path, or None) – Directory to write output files into.
output_name (str or None) – Base filename (without extension). The .jsonl extension is appended automatically. If the name already contains a different extension, that extension is not stripped and .jsonl is appended after it; a name that already ends with .jsonl is left unchanged.
- Returns:
None means in-memory mode; otherwise the path to write to.
- Return type:
Path or None
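The directory, name, and extension rules above can be restated as a short stand-alone function. This is a sketch written from the documented rules, not the library's source: sketch_resolve_output_path and the trimmed-down InputContext are hypothetical and redefined here so the block runs on its own.

```python
from dataclasses import dataclass
from enum import Enum
from pathlib import Path
from typing import Optional


class InputKind(Enum):
    SINGLE_STR = 1
    LIST = 2
    ITERABLE = 3
    FILE_CSV = 4
    FILE_TSV = 5
    FILE_TXT = 6


@dataclass(frozen=True)
class InputContext:  # reduced stand-in with only the fields this sketch needs
    kind: InputKind
    input_path: Optional[Path] = None


_IN_MEMORY = {InputKind.SINGLE_STR, InputKind.LIST}
_FILES = {InputKind.FILE_CSV, InputKind.FILE_TSV, InputKind.FILE_TXT}


def sketch_resolve_output_path(ctx, output_dir=None, output_name=None):
    # Directory: an explicit value wins; otherwise apply the per-kind default.
    if output_dir is not None:
        directory = Path(output_dir)
    elif ctx.kind in _IN_MEMORY:
        return None  # in-memory mode
    elif ctx.kind in _FILES:
        directory = ctx.input_path.parent
    else:  # ITERABLE
        directory = Path.cwd()

    # Name: an explicit value wins; otherwise derive it from the input kind.
    if output_name is None:
        if ctx.kind in (InputKind.FILE_CSV, InputKind.FILE_TSV):
            output_name = f"{ctx.input_path.stem}_diversified"
        elif ctx.kind is InputKind.FILE_TXT:
            output_name = ctx.input_path.stem
        else:
            output_name = "diversified_output"

    # Extension: always .jsonl, appended unless the name already ends with it.
    if not output_name.endswith(".jsonl"):
        output_name += ".jsonl"
    return directory / output_name


csv_ctx = InputContext(InputKind.FILE_CSV, Path("data/reviews.csv"))
print(sketch_resolve_output_path(csv_ctx))
print(sketch_resolve_output_path(InputContext(InputKind.LIST)))
print(sketch_resolve_output_path(InputContext(InputKind.LIST), output_dir="out"))
```

Note how passing output_dir for a LIST input bypasses the in-memory branch entirely, matching the rule that an explicit directory forces disk output.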
- class diversify_text._output.OutputWriter(input_context, n, output_path)[source]
Bases: object
Incrementally writes diversify results to the right format.
Modes
- In-memory (output_path is None): accumulates list[dict] with keys "original" and "paraphrases".
- JSONL (output_path is not None): writes one JSON object per line to a .jsonl file.
- __init__(input_context, n, output_path)[source]
Initialize the writer.
- Parameters:
input_context (InputContext) – Metadata about the input source (kind, path, etc.).
n (int) – Number of paraphrase styles requested per text.
output_path (Path or None) – Where to write results on disk. None means results are kept in memory and returned as list[dict].
- Return type:
None
- open()[source]
Open the file handle when writing to disk.
Must be called before write_batch().
- output_path is None: does nothing (in-memory mode).
- Otherwise: opens a single JSONL file for writing.
- Return type:
None
- write_batch(originals, paraphrases_by_text)[source]
Append one batch of results.
- Parameters:
- Raises:
ValueError – If originals and paraphrases_by_text have different lengths.
- Return type:
None
- Parameters:
input_context (InputContext)
n (int)
output_path (Path | None)
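The two writer modes can be sketched as follows. SketchWriter is a hypothetical, simplified stand-in: it omits input_context and n, and its close() method and the RuntimeError for a missing open() call are assumptions added so the example is complete.

```python
import json
import tempfile
from pathlib import Path


class SketchWriter:
    """Simplified illustration of in-memory vs. JSONL writing."""

    def __init__(self, output_path=None):
        self.output_path = output_path
        self.results = []     # used only in in-memory mode
        self._handle = None

    def open(self):
        # In-memory mode has nothing to open.
        if self.output_path is not None:
            self._handle = open(self.output_path, "w", encoding="utf-8")

    def write_batch(self, originals, paraphrases_by_text):
        if len(originals) != len(paraphrases_by_text):
            raise ValueError("originals and paraphrases_by_text differ in length")
        if self.output_path is not None and self._handle is None:
            raise RuntimeError("open() must be called before write_batch()")
        for original, paraphrases in zip(originals, paraphrases_by_text):
            record = {"original": original, "paraphrases": list(paraphrases)}
            if self.output_path is None:
                self.results.append(record)
            else:
                # One JSON object per line.
                self._handle.write(json.dumps(record, ensure_ascii=False) + "\n")

    def close(self):  # hypothetical; not listed in the reference above
        if self._handle is not None:
            self._handle.close()
            self._handle = None


mem = SketchWriter()
mem.open()
mem.write_batch(["Hello."], [["Hi.", "Hey there."]])

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "diversified_output.jsonl"
    disk = SketchWriter(path)
    disk.open()
    disk.write_batch(["Hello."], [["Hi.", "Hey there."]])
    disk.close()
    lines = path.read_text(encoding="utf-8").splitlines()

print(mem.results)
print(lines)
```

Writing each batch as it arrives keeps memory use flat for file and iterable inputs, which is the point of the incremental design.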
Preprocessing
Text preprocessing utilities for diversify.
- class diversify_text._preprocess.PreprocessContext(segments_per_text=None)[source]
Bases: object
State produced by preprocess() and consumed by postprocess().
New preprocessing steps can add fields here without changing the caller in core.py.
- diversify_text._preprocess.split_sentences(text)[source]
Split text into sentences using pysbd.
Returns a list of stripped sentence strings. If the text is empty or whitespace-only, returns a single-element list containing the stripped (possibly empty) input.
- diversify_text._preprocess.preprocess(texts, *, split_on_punctuation=False)[source]
Prepare a batch of texts for generation.
Returns the (possibly transformed) texts to feed into the generation method, together with a PreprocessContext that postprocess() needs to undo the transformations.
- Parameters:
split_on_punctuation (bool) – If True, split each text into sentence-level segments and flatten the result. The per-text segment mapping is stored in the context so that postprocess() can reassemble them.
- Returns:
generation_texts (list[str]) – Texts to pass to the generation method.
context (PreprocessContext) – Context needed by postprocess().
- Return type:
Postprocessing
Text postprocessing utilities for diversify.
- diversify_text._postprocess.reassemble_segments(segments_per_text, paraphrases_by_segment)[source]
Join per-segment paraphrases back into per-original-text paraphrases.
- Parameters:
- Returns:
Shape [n_texts][n]: reassembled paraphrases.
- Return type:
- diversify_text._postprocess.postprocess(candidate, context)[source]
Undo preprocessing transformations on a candidate set.
Applies the inverse of each step performed by preprocess(), using the state stored in context.
- Parameters:
candidate (list[list[str]]) – Raw generation output, shape [n_generation_texts][n].
context (PreprocessContext) – Context returned by preprocess().
- Returns:
Shape [n_texts][n]: one paraphrase per original text per style.
- Return type:
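The split, flatten, and reassemble round trip described for preprocess() and postprocess() can be sketched without the library. Several details here are assumptions: the regex splitter stands in for pysbd, segments_per_text is modelled as a list of per-text segment counts, and segments are rejoined with a single space. sketch_preprocess and sketch_reassemble are hypothetical names.

```python
import re


def sketch_preprocess(texts):
    """Split each text into sentences and flatten; remember the counts."""
    segments_per_text = []
    flat = []
    for text in texts:
        # Naive splitter for illustration; the library uses pysbd instead.
        segments = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
        segments = segments or [text.strip()]
        segments_per_text.append(len(segments))
        flat.extend(segments)
    return flat, segments_per_text


def sketch_reassemble(segments_per_text, paraphrases_by_segment):
    """Turn shape [n_segments][n] back into shape [n_texts][n]."""
    out, start = [], 0
    for count in segments_per_text:
        group = paraphrases_by_segment[start:start + count]
        # zip(*group) gathers, for each style, that style's segment paraphrases.
        out.append([" ".join(parts) for parts in zip(*group)])
        start += count
    return out


texts = ["It rained. We stayed in.", "Fine."]
flat, counts = sketch_preprocess(texts)
# Pretend generation output with n = 2 styles per segment.
fake_generation = [[f"{seg} (a)", f"{seg} (b)"] for seg in flat]
restored = sketch_reassemble(counts, fake_generation)
print(flat)
print(restored)
```

Storing only the per-text counts is enough to undo the flattening, because the flattened segments keep their original order.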
Methods
- class diversify_text.method.base.DiversificationMethod[source]
Bases: ABC
Interface for pluggable diversification methods.
- name = 'base'
- prepare()[source]
Load any resources (models, tokenizers) needed before generation.
Called once before the progress bar starts so that loading messages appear before generation begins. No-op by default.
- Return type:
None
- class diversify_text.method.registry.MethodRegistry[source]
Bases: object
Registry of named diversification method classes.
- register(name, method_cls)[source]
- Parameters:
name (str)
method_cls (type[DiversificationMethod])
- Return type:
None
- resolve(methods, **kwargs)[source]
Resolve a sequence of method names/instances into ready instances.
String entries are looked up in the registry and instantiated. Pre-built DiversificationMethod instances are passed through as-is. Extra kwargs (e.g. device) are forwarded to each constructor only if it accepts them.
- Parameters:
methods (sequence[str | DiversificationMethod]) – Method names and/or pre-built instances.
**kwargs – Keyword arguments forwarded to method constructors (only those accepted by the constructor signature).
- Return type:
- Raises:
KeyError – If a string name is not registered.
TypeError – If an element is neither str nor DiversificationMethod.
ValueError – If the resulting list is empty.
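One plausible way to forward keyword arguments "only if the constructor accepts them" is to inspect the constructor signature. The sketch below assumes that approach; SketchRegistry, Method, DeviceAware, and NoArgs are hypothetical stand-ins, and the real registry may implement the check differently.

```python
import inspect


class Method:  # stand-in for DiversificationMethod
    name = "base"


class DeviceAware(Method):
    name = "device_aware"

    def __init__(self, device=None):
        self.device = device


class NoArgs(Method):
    name = "no_args"

    def __init__(self):
        pass


class SketchRegistry:
    def __init__(self):
        self._classes = {}

    def register(self, name, method_cls):
        self._classes[name] = method_cls

    def resolve(self, methods, **kwargs):
        resolved = []
        for entry in methods:
            if isinstance(entry, Method):
                resolved.append(entry)          # pre-built: pass through as-is
            elif isinstance(entry, str):
                cls = self._classes[entry]      # KeyError if not registered
                params = inspect.signature(cls).parameters
                takes_any = any(p.kind is p.VAR_KEYWORD for p in params.values())
                # Keep only the kwargs this constructor can actually receive.
                accepted = {k: v for k, v in kwargs.items()
                            if takes_any or k in params}
                resolved.append(cls(**accepted))
            else:
                raise TypeError(
                    f"expected str or Method, got {type(entry).__name__}")
        if not resolved:
            raise ValueError("no methods resolved")
        return resolved


registry = SketchRegistry()
registry.register("device_aware", DeviceAware)
registry.register("no_args", NoArgs)
prebuilt = DeviceAware(device="cuda")
a, b, c = registry.resolve(["device_aware", "no_args", prebuilt], device="cpu")
print(a.device, type(b).__name__, c.device)
```

Filtering by signature lets one call site pass device to every method without forcing each method class to accept it.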