Methods
=======

``diversify-text`` uses a pluggable method architecture. Each method is a
:class:`~diversify_text.method.base.DiversificationMethod` subclass that
generates paraphrases using a different model or algorithm.

Overview
--------

.. list-table::
   :header-rows: 1
   :widths: 20 15 15 15 35

   * - Method
     - Model Size
     - Speed
     - Performance
     - Description
   * - ``tinystyler``
     - ~800M params
     - TBD
     - TBD
     - Few-shot style transfer using authorship embeddings
   * - ``prompting``
     - ~1.7B params (default)
     - TBD
     - TBD
     - Prompt-based paraphrasing using a causal LM

TinyStyler
----------

`TinyStyler `_ is a T5-based model that performs few-shot text style transfer
by conditioning on authorship-embedding representations. Given a source text
and a set of style example sentences, TinyStyler generates a paraphrase that
preserves the content while shifting toward the demonstrated writing style.
``diversify-text`` cycles through different style groups from a configurable
*style bank* to produce multiple stylistically diverse outputs.

.. note::

   TinyStyler is based on `CISR `_ style embeddings, which have been shown to
   work well for **social-media-like settings** and **formality transfer**.
   The model may not perform as expected when reproducing other styles.

**Default style bank.** The built-in bank contains named styles drawn from the
`CORE corpus `_, the `TinyStyler repository `_, and the `STEL demo for the
formality dimension `_. See
:data:`diversify_text.method.tinystyler.styles.DEFAULT_STYLE_BANK` for the
full list of available styles.

**Citation:**

.. code-block:: bibtex

   @inproceedings{horvitz-etal-2024-tinystyler,
       title = "{T}iny{S}tyler: Efficient Few-Shot Text Style Transfer with Authorship Embeddings",
       author = "Horvitz, Zachary and Patel, Ajay and Singh, Kanishk and Callison-Burch, Chris and McKeown, Kathleen and Yu, Zhou",
       editor = "Al-Onaizan, Yaser and Bansal, Mohit and Chen, Yun-Nung",
       booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
       month = nov,
       year = "2024",
       address = "Miami, Florida, USA",
       publisher = "Association for Computational Linguistics",
       url = "https://aclanthology.org/2024.findings-emnlp.781",
       pages = "13376--13390",
   }

Prompting
---------

The ``prompting`` method generates paraphrases by sending input texts to a
local HuggingFace causal language model with a prompt template. The default
model is `SmolLM3-3B `_, chosen using insights from `The Synthetic Data
Playbook `_.

.. code-block:: python

   results = diversify("The cat sat on the mat.", methods=["prompting"])

**Choosing a model.** Any HuggingFace causal LM can be used. Pass the model
identifier to the constructor:

.. code-block:: python

   from diversify_text import Diversifier
   from diversify_text.method.prompting import PromptingMethod

   method = PromptingMethod(model="mistralai/Mistral-7B-Instruct-v0.3")
   results = Diversifier(methods=[method]).diversify("The cat sat on the mat.")

Instruct-tuned models are recommended. Chat templates are applied
automatically when the tokenizer provides one.

.. note::

   Thinking/reasoning models (e.g. SmolLM3-3B) are detected automatically and
   have their thinking mode turned off (``enable_thinking=False``) during
   generation. Thinking tokens add overhead without improving paraphrase
   quality in this setting.

**Inference backend.** The method currently uses the ``transformers`` library
for inference.

.. note::

   `vLLM `_ support, batched inference, and streaming from large files are
   planned for a future release.
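The chat-template handling described above can be sketched as follows. This is a minimal illustration, not the library's actual internals: the helper name ``build_prompt`` is hypothetical, but ``apply_chat_template`` is the standard ``transformers`` tokenizer API, and ``enable_thinking=False`` is the keyword honoured by chat templates of thinking-capable models such as SmolLM3.

.. code-block:: python

   def build_prompt(tokenizer, user_prompt):
       """Route a prompt through the tokenizer's chat template if it has one
       (hypothetical helper sketching the behaviour described above)."""
       if getattr(tokenizer, "chat_template", None):
           messages = [{"role": "user", "content": user_prompt}]
           # enable_thinking=False turns off thinking mode on models whose
           # chat templates support it (e.g. SmolLM3-3B).
           return tokenizer.apply_chat_template(
               messages,
               tokenize=False,
               add_generation_prompt=True,
               enable_thinking=False,
           )
       # Base LMs without a chat template receive the raw prompt text.
       return user_prompt

With an instruct-tuned model, ``build_prompt(AutoTokenizer.from_pretrained(model_id), prompt)`` would return the fully templated string; with a plain base model it falls through to the raw prompt.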
**Default prompt bank.** The built-in bank contains multiple prompt templates
covering different rewriting styles (paraphrasing, simplification, dialogue,
tables, and more). When no explicit selection is made, the templates listed in
:data:`~diversify_text.method.prompting.prompts.DEFAULT_PROMPTS` are used. See
:doc:`prompts` for the full list of available templates.

**Customising the prompt bank.** Like TinyStyler's style bank, you can provide
a custom prompt bank or select specific prompts via ``method_kwargs``. Each
prompt template must contain the placeholder ``[DOCUMENT SEGMENT]``:

.. code-block:: python

   custom_bank = {
       "simple": "Rewrite the following text in simpler words: [DOCUMENT SEGMENT]",
       "formal": "Rewrite the following text in a formal academic tone: [DOCUMENT SEGMENT]",
   }

   results = diversify(
       "The cat sat on the mat.",
       methods=["prompting"],
       method_kwargs={"prompting": {"prompt_bank": custom_bank}},
   )

You can also select specific prompts by key name:

.. code-block:: python

   results = diversify(
       "The cat sat on the mat.",
       methods=["prompting"],
       method_kwargs={"prompting": {"prompt_keys": ["wikipedia_paraphrase"]}},
   )

Zero-shot humanize rewriting
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The prompt bank includes humanize prompts based on `Zhang et al. (2024) `_
that rewrite machine-generated text to appear more human-written. These
prompts instruct the model to introduce informal elements such as typos,
slang, hashtags, and varied casing:

.. code-block:: python

   results = diversify(
       "The experiment was conducted in a controlled lab setting.",
       methods=["prompting"],
       method_kwargs={"prompting": {"prompt_keys": ["humanize_llm-as-coauthor"]}},
   )

A stricter variant, ``humanize_llm-as-coauthor_original``, uses the original
five modifications from the paper and explicitly forbids emojis.
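The ``[DOCUMENT SEGMENT]`` placeholder convention described under *Customising the prompt bank* can be sketched as a simple substitution step. The helper ``render_prompt`` below is hypothetical (the library's internal names may differ); it only illustrates how a template from the bank would be combined with an input text.

.. code-block:: python

   PLACEHOLDER = "[DOCUMENT SEGMENT]"

   def render_prompt(template, text):
       """Fill a prompt-bank template with the input text (hypothetical helper)."""
       if PLACEHOLDER not in template:
           raise ValueError(f"template must contain {PLACEHOLDER!r}")
       return template.replace(PLACEHOLDER, text)

   prompt = render_prompt(
       "Rewrite the following text in simpler words: [DOCUMENT SEGMENT]",
       "The cat sat on the mat.",
   )
   # prompt == "Rewrite the following text in simpler words: The cat sat on the mat."

Validating the placeholder up front turns a malformed custom bank entry into an immediate error rather than a silently unchanged prompt.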
Few-shot style transfer with prompting
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The prompting method can also perform few-shot style transfer by combining
style examples from the shared style bank with a few-shot prompt template.
When ``styles`` is provided without explicit ``prompt_keys``, the method
automatically uses the ``style_transfer`` template from
:data:`~diversify_text.method.prompting.prompts.EXAMPLE_BASED_PROMPT_BANK`:

.. code-block:: python

   results = diversify(
       "The experiment was conducted in a controlled lab setting.",
       methods=["prompting"],
       method_kwargs={
           "prompting": {
               "styles": ["informal_tinystyler"],
           }
       },
   )

You can select a different few-shot template via ``prompt_keys``. For example,
``humanize_transfer`` combines humanization instructions with the style
examples:

.. code-block:: python

   results = diversify(
       "The experiment was conducted in a controlled lab setting.",
       methods=["prompting"],
       method_kwargs={
           "prompting": {
               "styles": ["informal_tinystyler"],
               "prompt_keys": ["humanize_transfer"],
           }
       },
   )

Development
^^^^^^^^^^^

To see the exact prompts sent to the model, enable debug logging:

.. code-block:: python

   import logging

   logging.basicConfig(level=logging.DEBUG)

Adding a new method
-------------------

See :ref:`creating-a-custom-method` in the Usage Guide for instructions on
implementing your own
:class:`~diversify_text.method.base.DiversificationMethod`.
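The shape of a custom method can be sketched as follows. This is a toy illustration only: the stand-in base class and the ``generate`` method name are assumptions for demonstration, not the library's actual abstract interface, which is documented in the Usage Guide.

.. code-block:: python

   from abc import ABC, abstractmethod

   # Illustrative stand-in for the real base class; the actual abstract
   # interface (method names, signatures) is defined by
   # diversify_text.method.base.DiversificationMethod.
   class DiversificationMethod(ABC):
       @abstractmethod
       def generate(self, text: str) -> list[str]:
           """Return one or more paraphrases of ``text``."""

   class ReverseMethod(DiversificationMethod):
       """Toy method: 'paraphrases' by reversing word order."""

       def generate(self, text: str) -> list[str]:
           return [" ".join(reversed(text.split()))]

   outputs = ReverseMethod().generate("The cat sat on the mat.")

A real method would wrap a model or algorithm in ``generate`` and could then be passed to the diversifier alongside the built-in methods.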