Created at 6am, Jan 22
benjamin · Artificial Intelligence
Knowledge Fusion of Large Language Models
cGdrNY8UtxuIcbxWmsTPY4JYIH6Ee_kiNYgH-j0hoE0
File Type: PDF
Entry Count: 112
Embed. Model: jina_embeddings_v2_base_en
Index Type: hnsw

Abstract of the Paper: While training large language models (LLMs) from scratch can generate models with distinct functionalities and strengths, it comes at significant costs and may result in redundant capabilities. Alternatively, a cost-effective and compelling approach is to merge existing pre-trained LLMs into a more potent model. However, due to the varying architectures of these LLMs, directly blending their weights is impractical. In this paper, we introduce the notion of knowledge fusion for LLMs, aimed at combining the capabilities of existing LLMs and transferring them into a single LLM. By leveraging the generative distributions of source LLMs, we externalize their collective knowledge and unique strengths, thereby potentially elevating the capabilities of the target model beyond those of any individual source LLM. We validate our approach using three popular LLMs with different architectures--Llama-2, MPT, and OpenLLaMA--across various benchmarks and tasks. Our findings confirm that the fusion of LLMs can improve the performance of the target model across a range of capabilities such as reasoning, commonsense, and code generation. Our code, model weights, and data are public at \url{this https URL}.

Original Paper: https://arxiv.org/abs/2401.10491
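To make the abstract's description concrete, here is a minimal training-objective sketch. It is not the authors' implementation: it assumes all models share one tokenizer (the paper additionally aligns tokens across different vocabularies), it simply averages the source distributions where the paper uses more elaborate fusion strategies, and the names fused_distribution, fusellm_loss, and fuse_weight are illustrative.

# Minimal sketch of a FuseLLM-style objective: continual training of the target
# LLM on a corpus, with an extra term pulling its token distributions toward a
# fused distribution built from the source LLMs' predictions.
import torch
import torch.nn.functional as F

def fused_distribution(source_logits):
    # source_logits: list of (batch, seq_len, vocab) tensors from the source LLMs.
    # Simplification: average their probabilities; the paper instead selects or
    # weights source distributions based on how well each one predicts the text.
    probs = [F.softmax(logits, dim=-1) for logits in source_logits]
    return torch.stack(probs, dim=0).mean(dim=0)

def fusellm_loss(target_logits, source_logits, labels, fuse_weight=0.9):
    # Standard causal-LM cross-entropy on the ground-truth next tokens.
    vocab = target_logits.size(-1)
    clm = F.cross_entropy(target_logits.reshape(-1, vocab), labels.reshape(-1))
    # Divergence between the target's distributions and the fused source distribution.
    fused = fused_distribution(source_logits)
    log_probs = F.log_softmax(target_logits, dim=-1)
    kl = F.kl_div(log_probs, fused, reduction="none").sum(-1).mean()
    # Weighted combination; the weighting scheme here is illustrative, not the paper's.
    return fuse_weight * clm + (1.0 - fuse_weight) * kl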

Table 7: Comparison of perplexity between FUSELLM and ensemble & weight merging.

Test set | Pythia | Phil   | NIH    | USPTO  | Ensemble | Weight Merging | FUSELLM
Phil     | 0.9008 | 0.8397 | 0.9248 | 0.9296 | 0.8960   | 0.8786         | 0.8463
NIH      | 0.6740 | 0.6861 | 0.6215 | 0.6872 | 0.6647   | 0.6496         | 0.6569
USPTO    | 0.6077 | 0.6228 | 0.6278 | 0.6017 | 0.6180   | 0.6054         | 0.6068
Average  | 0.7275 | 0.7162 | 0.7247 | 0.7395 | 0.7262   | 0.7112         | 0.7034
id: a484f33394365139ccf22a1b3e555890 - page: 9
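The ensemble and weight-merging columns above are the two standard baselines the paper compares against when all source models share one architecture (the Pythia experiment). The sketch below, assuming Hugging Face-style causal LMs and illustrative function names, shows the distinction: an ensemble averages output distributions at inference time, while weight merging averages parameters into a single model.

# Assumes `models` is a list of structurally identical causal LMs whose forward
# pass returns an object with a .logits attribute (Hugging Face style).
import torch
import torch.nn.functional as F

@torch.no_grad()
def ensemble_next_token_probs(models, input_ids):
    # Ensemble: run every source model and average their next-token distributions.
    probs = [F.softmax(m(input_ids).logits[:, -1, :], dim=-1) for m in models]
    return torch.stack(probs, dim=0).mean(dim=0)

@torch.no_grad()
def merge_weights(models):
    # Weight merging: average the (identically shaped) parameters into one model.
    state_dicts = [m.state_dict() for m in models]
    averaged = {
        name: torch.stack([sd[name].float() for sd in state_dicts]).mean(dim=0)
        for name in state_dicts[0]
    }
    models[0].load_state_dict(averaged)
    return models[0]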
5 CONCLUSION
In this study, we have explored the realm of knowledge fusion for LLMs to create a unified model that combines the capabilities and distinctive strengths of multiple structurally diverse LLMs. We introduced a novel method, FUSELLM, which leverages the generative distributions of these source LLMs to externalize their knowledge and employs them in the continual training of the target LLM. Through a series of experiments, we have demonstrated the superiority of FUSELLM over individual source LLMs and established baselines. Notably, in a simulated experiment featuring multiple structurally identical LLMs, FUSELLM has showcased its competitive effectiveness compared to ensemble and weight merging methods. Hence, the domain of LLM fusion emerges as a more promising avenue for exploration, particularly given the diverse structures and substantial model sizes of LLMs. We believe that these findings will inspire future research endeavors.
id: 4615c2205dd1d63d87c59258806c7b61 - page: 9
Published as a conference paper at ICLR 2024.

ACKNOWLEDGEMENTS
This work was supported by the National Natural Science Foundation of China (No. 62176270), the Guangdong Basic and Applied Basic Research Foundation (No. 2023A1515012832), and the Tencent AI Lab Rhino-Bird Focused Research Program.

REFERENCES
Rishabh Agarwal, Nino Vieillard, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, and Olivier Bachem. GKD: Generalized knowledge distillation for auto-regressive sequence models. arXiv preprint arXiv:2306.13649, 2023.
Devansh Arpit, Huan Wang, Yingbo Zhou, and Caiming Xiong. Ensemble of averages: Improving model selection and boosting performance in domain generalization. Advances in Neural Information Processing Systems, 35:8265–8277, 2022.
Loubna Ben Allal, Niklas Muennighoff, Logesh Kumar Umapathi, Ben Lipkin, and Leandro von
id: d939a0c180f86935c7ea902f27faa64a - page: 10
Werra. A framework for the evaluation of code generation models, 2022.
Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pp. 2397–2430. PMLR, 2023.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
Federico Cassano, John Gouwar, Daniel Nguyen, Sydney Nguyen, Luna Phipps-Costin, Donald Pinckney, Ming-Ho Yee, Yangtian Zi, Carolyn Jane Anderson, Molly Q Feldman, et al. MultiPL-E: A scalable and extensible approach to benchmarking neural code generation. arXiv preprint arXiv:2208.08227, 2022.
id: c6bb952ece5e39d7694d05c5fedbf2e7 - page: 10
How to Retrieve?
# Search

curl -X POST "https://search.dria.co/hnsw/search" \
-H "x-api-key: <YOUR_API_KEY>" \
-H "Content-Type: application/json" \
-d '{"rerank": true, "top_n": 10, "contract_id": "cGdrNY8UtxuIcbxWmsTPY4JYIH6Ee_kiNYgH-j0hoE0", "query": "What is alexanDRIA library?"}'
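
The same search request can be issued programmatically; below is a sketch using Python's requests library, with the placeholder API key kept from the curl example.

# POST the /search request shown above from Python instead of curl.
import requests

response = requests.post(
    "https://search.dria.co/hnsw/search",
    headers={"x-api-key": "<YOUR_API_KEY>", "Content-Type": "application/json"},
    json={
        "rerank": True,
        "top_n": 10,
        "contract_id": "cGdrNY8UtxuIcbxWmsTPY4JYIH6Ee_kiNYgH-j0hoE0",
        "query": "What is alexanDRIA library?",
    },
)
print(response.json())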
        
# Query

curl -X POST "https://search.dria.co/hnsw/query" \
-H "x-api-key: <YOUR_API_KEY>" \
-H "Content-Type: application/json" \
-d '{"vector": [0.123, 0.5236], "top_n": 10, "contract_id": "cGdrNY8UtxuIcbxWmsTPY4JYIH6Ee_kiNYgH-j0hoE0", "level": 2}'
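
Note that the two-element vector in the example body is only a placeholder: the /query endpoint expects an embedding produced by the index's embedding model listed above (jina_embeddings_v2_base_en). A sketch, assuming the model is loaded through sentence-transformers (loading options such as trust_remote_code may vary with your setup):

# Embed the query text with the index's embedding model, then query the index.
import requests
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("jinaai/jina-embeddings-v2-base-en", trust_remote_code=True)
vector = model.encode("What is knowledge fusion of LLMs?").tolist()

response = requests.post(
    "https://search.dria.co/hnsw/query",
    headers={"x-api-key": "<YOUR_API_KEY>", "Content-Type": "application/json"},
    json={
        "vector": vector,
        "top_n": 10,
        "contract_id": "cGdrNY8UtxuIcbxWmsTPY4JYIH6Ee_kiNYgH-j0hoE0",
        "level": 2,
    },
)
print(response.json())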