Longer Texts
diversify is designed for short texts (single sentences or short
paragraphs).
For longer texts, there are two approaches: increasing the token limit and splitting on punctuation. The two can also be combined.
Increasing max_new_tokens
By default, the number of new tokens is capped automatically based on input
length (up to 256 tokens). You can override this with max_new_tokens:
from diversify_text import diversify
results = diversify(
"Stephen Hawking was born on 8 January 1942 to Frank and Isobel Hawking. "
"Despite their families' financial constraints, both parents attended "
"the University of Oxford.",
max_new_tokens=512,
)
[{
"original": "Stephen Hawking was born on 8 January 1942 to Frank and Isobel Hawking. Despite their families' financial constraints, both parents attended the University of Oxford.",
"paraphrases": [
"both parents went to the university of Oxford, Stephen Hawking was born 8 January 1942...",
"Well I know that both parents went to Oxford.",
"How is that? Stephen Hawking was born 8 January 1942 to Frank and Isobel Hawking, who both attended the University of Oxford.",
"Isobel and Frank Hawking were both at Oxford.",
"Well I mean, Stephen Hawking was born on 8 January 1942 to Frank and Isobel Hawking who attended Oxford.",
]
}]
Note how information is lost in several paraphrases (e.g. the financial constraints are dropped entirely).
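Information loss of this kind can be spotted with a simple keyword check. The following is a minimal sketch, not part of the package: missing_terms is a hypothetical helper, and example_result is hand-written stand-in data that only mirrors the shape of one entry in the list returned by diversify (a dict with "original" and "paraphrases" keys).

```python
def missing_terms(result, required_terms):
    """Map each paraphrase to the required terms it no longer contains."""
    report = {}
    for paraphrase in result["paraphrases"]:
        lowered = paraphrase.lower()
        # Case-insensitive substring match; a deliberately crude check.
        missing = [t for t in required_terms if t.lower() not in lowered]
        if missing:
            report[paraphrase] = missing
    return report


# Hand-written stand-in for one entry of the list returned by diversify().
example_result = {
    "original": (
        "Despite their families' financial constraints, both parents "
        "attended the University of Oxford."
    ),
    "paraphrases": [
        "Isobel and Frank Hawking were both at Oxford.",
        "Both parents went to Oxford despite their families' financial constraints.",
    ],
}

report = missing_terms(example_result, ["financial constraints", "Oxford"])
for paraphrase, missing in report.items():
    print(f"{paraphrase!r} is missing {missing}")
```

A substring check only catches terms that vanish verbatim; a paraphrase that rewords "financial constraints" would be flagged even though the meaning survives, so treat the report as a prompt for manual review.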
Warning
Increasing max_new_tokens beyond the default may produce unexpected
results. The underlying models were not tested for long-form
generation and may hallucinate, repeat themselves, or drift off-topic.
Splitting on punctuation
The package can also split inputs on punctuation. With this option, each input is split into sentence-level segments, each segment is paraphrased independently (where the model works best), and the results are reassembled:
results = diversify(
"Stephen Hawking was born on 8 January 1942 to Frank and Isobel Hawking. "
"Despite their families' financial constraints, both parents attended "
"the University of Oxford.",
preprocess_kwargs={"split_on_punctuation": True},
)
[{
"original": "Stephen Hawking was born on 8 January 1942 to Frank and Isobel Hawking. Despite their families' financial constraints, both parents attended the University of Oxford.",
"paraphrases": [
"Stephen Hawking was born 8 January 1942 to Frank and isobel Hawking... both parents went to the university of oxford despite their families financial constraints...",
"Well, Stephen Hawking was born on 8 January 1942 to Frank and Isobel Hawking. Well, both parents went to the University of Oxford despite their families' financial constraints.",
"How is Stephen Hawking? He was born on 8 January 1942 to Frank and Isobel Hawking. What? Both parents went to the University of Oxford despite their families financial constraints.",
"I believe Stephen Hawking was born on 8 January 1942. I have heard both parents went to Oxford despite their families financial constraints.",
"Well, Stephen Hawking was born on 8 January 1942 to Frank and Isobel Hawking. So both parents went to Oxford despite their families financial constraints? I just want to say, the university is a great place to live.",
]
}]
These paraphrases retain more of the original information than those from the
max_new_tokens approach above; the financial constraints, for example, appear in every one.
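The split, paraphrase, and reassemble flow can be pictured with a few lines of plain Python. This is only a conceptual sketch under stated assumptions: the regular expression, split_into_sentences, paraphrase_by_segment, and toy_paraphrase are all hypothetical names chosen for illustration, and the package's actual splitting rules may differ.

```python
import re


def split_into_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def paraphrase_by_segment(text, paraphrase_fn):
    """Paraphrase each sentence on its own and join the results with spaces."""
    return " ".join(paraphrase_fn(s) for s in split_into_sentences(text))


def toy_paraphrase(sentence):
    # Placeholder standing in for a real paraphrasing model.
    return "Well, " + sentence


text = (
    "Stephen Hawking was born on 8 January 1942 to Frank and Isobel Hawking. "
    "Despite their families' financial constraints, both parents attended "
    "the University of Oxford."
)

segments = split_into_sentences(text)
rebuilt = paraphrase_by_segment(text, toy_paraphrase)
print(segments)
print(rebuilt)
```

A naive punctuation split like this one also breaks after abbreviations such as "Dr." or "e.g.", which is worth keeping in mind when a reassembled paraphrase looks oddly fragmented.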
Combining both
You can also combine both approaches, splitting on punctuation and raising the token limit for individual segments that may still be long:
results = diversify(
"Stephen Hawking was born on 8 January 1942 to Frank and Isobel Hawking. "
"Despite their families' financial constraints, both parents attended "
"the University of Oxford.",
preprocess_kwargs={"split_on_punctuation": True},
max_new_tokens=512,
)
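One way to decide between the options is to inspect the text before calling diversify. The sketch below is an illustration only: build_diversify_kwargs is a hypothetical helper, and the 40-word threshold is an arbitrary assumption rather than a value recommended by the package. It uses only the two keyword arguments shown above, max_new_tokens and preprocess_kwargs.

```python
import re


def build_diversify_kwargs(text, long_segment_words=40, raised_limit=512):
    """Pick diversify() keyword arguments from a rough look at the text."""
    # Rough sentence split; the package's own splitting may behave differently.
    segments = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    kwargs = {}
    if len(segments) > 1:
        # More than one sentence: paraphrase sentence by sentence.
        kwargs["preprocess_kwargs"] = {"split_on_punctuation": True}
    if any(len(s.split()) > long_segment_words for s in segments):
        # At least one segment is still long: raise the token limit too.
        kwargs["max_new_tokens"] = raised_limit
    return kwargs


short_text = "Both parents attended the University of Oxford."
longer_text = (
    "Stephen Hawking was born on 8 January 1942 to Frank and Isobel Hawking. "
    "Despite their families' financial constraints, both parents attended "
    "the University of Oxford."
)

print(build_diversify_kwargs(short_text))
print(build_diversify_kwargs(longer_text))

# With the package installed, the result can be passed straight through:
# from diversify_text import diversify
# results = diversify(longer_text, **build_diversify_kwargs(longer_text))
```

Keeping the decision in a small function makes the threshold easy to tune against your own data, and leaves the defaults untouched for texts that are already short.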