Longer Texts
============

``diversify`` is designed for **short texts**: single sentences or short paragraphs. If you need to diversify longer texts, there are two approaches: increasing the token limit and splitting on punctuation.

Increasing ``max_new_tokens``
-----------------------------

By default, the number of new tokens is capped automatically based on input length (up to 256 tokens). You can override this with ``max_new_tokens``:

.. code-block:: python

    from diversify_text import diversify

    results = diversify(
        "Stephen Hawking was born on 8 January 1942 to Frank and Isobel Hawking. "
        "Despite their families' financial constraints, both parents attended "
        "the University of Oxford.",
        max_new_tokens=512,
    )

.. code-block:: python

    [{
        "original": "Stephen Hawking was born on 8 January 1942 to Frank and Isobel Hawking. Despite their families' financial constraints, both parents attended the University of Oxford.",
        "paraphrases": [
            "both parents went to the university of Oxford, Stephen Hawking was born 8 January 1942...",
            "Well I know that both parents went to Oxford.",
            "How is that? Stephen Hawking was born 8 January 1942 to Frank and Isobel Hawking, who both attended the University of Oxford.",
            "Isobel and Frank Hawking were both at Oxford.",
            "Well I mean, Stephen Hawking was born on 8 January 1942 to Frank and Isobel Hawking who attended Oxford.",
        ]
    }]

Note how information is lost in several paraphrases (e.g. the financial constraints are dropped entirely).

.. warning::

    Increasing ``max_new_tokens`` beyond the default may produce unexpected results. The underlying models were not tested for long-form generation and may hallucinate, repeat themselves, or drift off-topic.

Splitting on punctuation
------------------------

The package can also split inputs on punctuation. With this option, each input is split into sentence-level segments, each segment is paraphrased independently (the setting in which the model works best), and the results are reassembled:

.. code-block:: python

    from diversify_text import diversify

    results = diversify(
        "Stephen Hawking was born on 8 January 1942 to Frank and Isobel Hawking. "
        "Despite their families' financial constraints, both parents attended "
        "the University of Oxford.",
        preprocess_kwargs={"split_on_punctuation": True},
    )

.. code-block:: python

    [{
        "original": "Stephen Hawking was born on 8 January 1942 to Frank and Isobel Hawking. Despite their families' financial constraints, both parents attended the University of Oxford.",
        "paraphrases": [
            "Stephen Hawking was born 8 January 1942 to Frank and isobel Hawking... both parents went to the university of oxford despite their families financial constraints...",
            "Well, Stephen Hawking was born on 8 January 1942 to Frank and Isobel Hawking. Well, both parents went to the University of Oxford despite their families' financial constraints.",
            "How is Stephen Hawking? He was born on 8 January 1942 to Frank and Isobel Hawking. What? Both parents went to the University of Oxford despite their families financial constraints.",
            "I believe Stephen Hawking was born on 8 January 1942. I have heard both parents went to Oxford despite their families financial constraints.",
            "Well, Stephen Hawking was born on 8 January 1942 to Frank and Isobel Hawking. So both parents went to Oxford despite their families financial constraints? I just want to say, the university is a great place to live.",
        ]
    }]

These paraphrases retain more information than those produced with the ``max_new_tokens`` approach above.

Combining both
--------------

You can also combine both approaches, splitting on punctuation *and* raising the token limit for individual segments that may still be long:

.. code-block:: python

    from diversify_text import diversify

    results = diversify(
        "Stephen Hawking was born on 8 January 1942 to Frank and Isobel Hawking. "
        "Despite their families' financial constraints, both parents attended "
        "the University of Oxford.",
        preprocess_kwargs={"split_on_punctuation": True},
        max_new_tokens=512,
    )
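
For intuition about what ``split_on_punctuation`` does, the split, paraphrase, and reassemble flow can be sketched roughly as follows. This is a minimal illustration and not the package's actual implementation: the helper names (``split_on_punctuation``, ``paraphrase_by_segment``, ``toy_paraphrase``) and the splitting rule (break after ``.``, ``!`` or ``?``) are assumptions made for this example, and ``toy_paraphrase`` merely stands in for the model call.

.. code-block:: python

    import re
    from typing import Callable, List

    # Assumed splitting rule: break on whitespace that follows ., ! or ?
    _SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


    def split_on_punctuation(text: str) -> List[str]:
        """Split text into sentence-level segments (hypothetical helper)."""
        return [seg for seg in _SENTENCE_END.split(text.strip()) if seg]


    def paraphrase_by_segment(text: str, paraphrase: Callable[[str], str]) -> str:
        """Paraphrase each segment independently, then rejoin with spaces."""
        return " ".join(paraphrase(seg) for seg in split_on_punctuation(text))


    def toy_paraphrase(segment: str) -> str:
        # Stand-in for the paraphrasing model; it only prepends a filler word.
        return "Well, " + segment


    text = (
        "Stephen Hawking was born on 8 January 1942 to Frank and Isobel Hawking. "
        "Despite their families' financial constraints, both parents attended "
        "the University of Oxford."
    )
    segments = split_on_punctuation(text)
    result = paraphrase_by_segment(text, toy_paraphrase)
    print(result)

Because every segment is processed on its own, each model call sees a short input, which is the setting ``diversify`` is designed for.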