Longer Texts

diversify is designed for short texts: single sentences or short paragraphs.

If you need to diversify longer texts, there are two approaches: increasing the token limit and splitting on punctuation.

Increasing max_new_tokens

By default, the cap on newly generated tokens is set automatically from the input length, up to a maximum of 256 tokens. You can override it with max_new_tokens:

from diversify_text import diversify

results = diversify(
    "Stephen Hawking was born on 8 January 1942 to Frank and Isobel Hawking. "
    "Despite their families' financial constraints, both parents attended "
    "the University of Oxford.",
    max_new_tokens=512,
)
[{
    "original": "Stephen Hawking was born on 8 January 1942 to Frank and Isobel Hawking. Despite their families' financial constraints, both parents attended the University of Oxford.",
    "paraphrases": [
        "both parents went to the university of Oxford, Stephen Hawking was born 8 January 1942...",
        "Well I know that both parents went to Oxford.",
        "How is that? Stephen Hawking was born 8 January 1942 to Frank and Isobel Hawking, who both attended the University of Oxford.",
        "Isobel and Frank Hawking were both at Oxford.",
        "Well I mean, Stephen Hawking was born on 8 January 1942 to Frank and Isobel Hawking who attended Oxford.",
    ]
}]

Note how information is lost: none of the five paraphrases mentions the financial constraints, and two of them drop the birth date as well.
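One rough way to spot this kind of loss is to check whether a few key terms from the original survive in each paraphrase. The following is a minimal sketch, not part of diversify_text: the helper `missing_terms` and the choice of key terms are hypothetical, and only the result structure ("original" and "paraphrases" keys) is taken from the output above. The sample data is a subset of that output.

```python
def missing_terms(result, key_terms):
    """Map each paraphrase to the key terms it does not contain (case-insensitive)."""
    report = {}
    for paraphrase in result["paraphrases"]:
        lowered = paraphrase.lower()
        report[paraphrase] = [t for t in key_terms if t.lower() not in lowered]
    return report


# Subset of the output shown above, in the same structure.
result = {
    "original": (
        "Stephen Hawking was born on 8 January 1942 to Frank and Isobel Hawking. "
        "Despite their families' financial constraints, both parents attended "
        "the University of Oxford."
    ),
    "paraphrases": [
        "Well I know that both parents went to Oxford.",
        "Isobel and Frank Hawking were both at Oxford.",
        "Well I mean, Stephen Hawking was born on 8 January 1942 to Frank and Isobel Hawking who attended Oxford.",
    ],
}

# Hypothetical key terms chosen by hand for this example.
report = missing_terms(result, ["1942", "Oxford", "financial"])
for paraphrase, missing in report.items():
    print(missing, "<-", paraphrase)
```

Plain substring matching is crude (a paraphrase may express the same fact with different words), so treat an empty list as "probably fine" rather than as proof that nothing was lost.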

Warning

Increasing max_new_tokens beyond the default may produce unexpected results. The underlying models were not tested for long-form generation and may hallucinate, repeat themselves, or drift off-topic.

Splitting on punctuation

The package can also split inputs on punctuation. With this option, each input is split into sentence-level segments, each segment is paraphrased independently (short segments are where the model works best), and the results are reassembled:

from diversify_text import diversify

results = diversify(
    "Stephen Hawking was born on 8 January 1942 to Frank and Isobel Hawking. "
    "Despite their families' financial constraints, both parents attended "
    "the University of Oxford.",
    preprocess_kwargs={"split_on_punctuation": True},
)
[{
    "original": "Stephen Hawking was born on 8 January 1942 to Frank and Isobel Hawking. Despite their families' financial constraints, both parents attended the University of Oxford.",
    "paraphrases": [
        "Stephen Hawking was born 8 January 1942 to Frank and isobel Hawking... both parents went to the university of oxford despite their families financial constraints...",
        "Well, Stephen Hawking was born on 8 January 1942 to Frank and Isobel Hawking. Well, both parents went to the University of Oxford despite their families' financial constraints.",
        "How is Stephen Hawking? He was born on 8 January 1942 to Frank and Isobel Hawking. What? Both parents went to the University of Oxford despite their families financial constraints.",
        "I believe Stephen Hawking was born on 8 January 1942. I have heard both parents went to Oxford despite their families financial constraints.",
        "Well, Stephen Hawking was born on 8 January 1942 to Frank and Isobel Hawking. So both parents went to Oxford despite their families financial constraints? I just want to say, the university is a great place to live.",
    ]
}]

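These paraphrases retain more information than those from the max_new_tokens approach above: all five keep the financial constraints. Individual paraphrases can still drop details (the fourth omits the parents' names) or add unsupported text (the final sentence of the fifth).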
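To make the split-and-reassemble idea concrete, here is a minimal sketch of the general technique. It is not the package's implementation: the regular expression, the helper names `split_sentences` and `paraphrase_by_sentence`, and the stand-in `paraphrase_fn` are all assumptions made for illustration.

```python
import re


def split_sentences(text):
    """Split text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def paraphrase_by_sentence(text, paraphrase_fn):
    """Paraphrase each sentence independently, then join the results with spaces."""
    return " ".join(paraphrase_fn(s) for s in split_sentences(text))


text = (
    "Stephen Hawking was born on 8 January 1942 to Frank and Isobel Hawking. "
    "Despite their families' financial constraints, both parents attended "
    "the University of Oxford."
)

segments = split_sentences(text)
print(len(segments))  # 2

# Stand-in for a real paraphrasing model: it only prefixes each segment.
rebuilt = paraphrase_by_sentence(text, lambda s: "Well, " + s)
print(rebuilt)
```

A naive rule like this also splits after abbreviations such as "Dr.", so a real splitter needs more care. The point is only that each segment handed to the model stays short.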

Combining both

You can also combine the two approaches by splitting on punctuation and raising the token limit for individual segments that may still be long:

from diversify_text import diversify

results = diversify(
    "Stephen Hawking was born on 8 January 1942 to Frank and Isobel Hawking. "
    "Despite their families' financial constraints, both parents attended "
    "the University of Oxford.",
    preprocess_kwargs={"split_on_punctuation": True},
    max_new_tokens=512,
)