LongForm
LongForm: Optimizing Instruction Tuning for Long Text Generation with Corpus Extraction is a research paper that introduces a method for improving the long-text-generation performance of language models by optimizing instruction tuning.[1] The paper was authored by Abdullatif Köksal, Timo Schick, Anna Korhonen, and Hinrich Schütze and was published on the arXiv preprint server in April 2023.
The LongForm dataset and LongForm models are publicly available on GitHub.
Motivation
The authors note that instruction tuning helps language models generalize better and follow user intent, but that obtaining instruction data is expensive and challenging. Previous work has relied on costly human annotation, on crowd-sourced datasets with alignment issues, or on noisy examples generated by language models.
Method
To address this issue, the authors introduce the LongForm dataset, which is created by leveraging English corpus examples with augmented instructions. The authors select a diverse set of human-written documents from existing corpora such as C4 and Wikipedia and generate instructions for the given documents via language models. This approach provides a cheaper and cleaner instruction-tuning dataset suitable for long text generation.
Evaluation
The authors fine-tune T5, OPT, and LLaMA models on their dataset and show that even smaller LongForm models generalize well to text generation tasks. Their models outperform larger language models without instruction tuning on tasks such as story/recipe generation and long-form question answering. Furthermore, LongForm models outperform prior instruction-tuned models such as FLAN-T5 and Alpaca by a large margin. Finally, the authors demonstrate that their models can effectively follow and answer multilingual instructions, as shown on multilingual news generation.
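A minimal sketch of this fine-tuning setup with Hugging Face transformers is shown below. The Hub dataset name akoksal/LongForm and the input/output field names are assumptions rather than details confirmed by the paper, and the hyperparameters are illustrative only.

```python
from datasets import load_dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

# Assumed Hub location and field names for the LongForm dataset; verify before use.
dataset = load_dataset("akoksal/LongForm")
tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

def preprocess(batch):
    # The generated instruction is the source; the human-written document is the target.
    model_inputs = tokenizer(batch["input"], truncation=True, max_length=512)
    labels = tokenizer(text_target=batch["output"], truncation=True, max_length=1024)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True,
                        remove_columns=dataset["train"].column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="longform-t5",
                                  per_device_train_batch_size=8,
                                  num_train_epochs=3),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```

The same recipe applies to decoder-only models such as OPT and LLaMA, with the instruction and output concatenated into a single causal-LM training sequence instead of a source/target pair.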
Overall, the LongForm dataset and the resulting models provide a cost-effective and efficient method for instruction tuning in long text generation. The authors suggest that this approach can be extended to other languages and domains, and can benefit various downstream applications such as chatbots, summarization, and content creation.
Dataset creation
The LongForm dataset is created by collecting diverse corpus examples from existing corpora such as C4 and Wikipedia and generating instructions for the given documents via LLMs. The authors collect paragraphs and documents from these corpora and use zero-shot templates to prompt LLMs to generate instructions in different styles. They also leverage structured examples from Stack Exchange and WikiHow, as well as long text generation tasks from NLP benchmarks, to enhance the diversity and quality of the dataset.
The final instruction is produced by selecting one of three styles (instruction, informal chatbot, search engine query) and optionally including length information. The authors demonstrate that the generated instructions are highly relevant to the corpus examples and cover a diverse set of tasks.
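Concretely, this assembly step might look like the following sketch. The style names mirror the three templates listed under Prompts below; the llm_generate callable and the exact phrasing of the length information are hypothetical.

```python
import random

# Sketch of turning one human-written corpus document into an
# (instruction, output) training pair, under the stated assumptions.
STYLES = ["instruction", "chatbot", "search"]  # the three template styles

def build_example(document: str, templates: dict, llm_generate) -> dict:
    """templates maps a style name to its zero-shot prompt template;
    llm_generate is a hypothetical callable returning an LLM completion."""
    style = random.choice(STYLES)
    prompt = templates[style].replace("{CORPUS_EXAMPLE}", document)
    instruction = llm_generate(prompt).strip()  # the LLM fills in X
    # Optionally attach length information; this wording is an assumption,
    # not the paper's exact formulation.
    if random.random() < 0.5:
        n_sentences = max(1, document.count("."))
        instruction += f" Your output should be about {n_sentences} sentences long."
    return {"input": instruction, "output": document}
```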
Prompts
The template for the instruction style:
Instruction: X
Output: {CORPUS_EXAMPLE}
What kind of instruction could this be the answer to?
X:
The template for the informal chatbot style:
You are a chatbot. A user sent you an informal message and your reply is as follows.
Message: X
Reply: {CORPUS_EXAMPLE}
What is the informal message X?
X:
The template for the search engine query style:
You are a search engine. A person queried something in detail and the most relevant document about the query is as follows.
Query: X
Document: {CORPUS_EXAMPLE}
What is the detailed query X?
X:
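To illustrate how these templates are used, the sketch below fills the instruction-style template with a corpus example and asks a language model to complete the final X. The small gpt2 model is chosen only to keep the example runnable; the paper used larger LLMs for this step, and the sample document is invented for illustration.

```python
from transformers import pipeline

# Instruction-style template from above; the trailing "X:" prompts the model
# to produce the instruction that the document could be the answer to.
TEMPLATE = (
    "Instruction: X\n"
    "Output: {CORPUS_EXAMPLE}\n"
    "What kind of instruction could this be the answer to?\n"
    "X:"
)

document = "Mix the flour, sugar, and eggs, then bake at 180C for 25 minutes."
prompt = TEMPLATE.replace("{CORPUS_EXAMPLE}", document)

generator = pipeline("text-generation", model="gpt2")  # stand-in model
completion = generator(prompt, max_new_tokens=40, return_full_text=False)
instruction = completion[0]["generated_text"].strip()
print(instruction)  # the reverse-generated instruction X
```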
References
- ↑ Köksal, Abdullatif, Timo Schick, Anna Korhonen, and Hinrich Schütze. "LongForm: Optimizing Instruction Tuning for Long Text Generation with Corpus Extraction." arXiv preprint arXiv:2304.08460 (2023).