Created at 4pm, Apr 4
Ms-RAG · Artificial Intelligence
BADAM: A MEMORY EFFICIENT FULL PARAMETER TRAINING METHOD FOR LARGE LANGUAGE MODELS
Contract ID
1rKX_pAShsggTxcH0FPVfaOSe3YFKqB9BLRCAqL1hvg
File Type
PDF
Entry Count
65
Embed. Model
jina_embeddings_v2_base_en
Index Type
hnsw

Qijun Luo
School of Science and Engineering
Shenzhen Research Institute of Big Data
The Chinese University of Hong Kong, Shenzhen
qijunluo@link.cuhk.edu.cn

Hengxu Yu
School of Data Science
The Chinese University of Hong Kong, Shenzhen
hengxuyu@link.cuhk.edu.cn

Xiao Li
School of Data Science
The Chinese University of Hong Kong, Shenzhen
lixiao@cuhk.edu.cn

ABSTRACT
This work presents BAdam, an optimizer that leverages the block coordinate optimization framework with Adam as the inner solver. BAdam offers a memory efficient approach to the full parameter finetuning of large language models and reduces the running time of the backward process thanks to the chain rule property. Experimentally, we apply BAdam to instruction-tune the Llama 2-7B model on the Alpaca-GPT4 dataset using a single RTX3090-24GB GPU. The results indicate that BAdam exhibits superior convergence behavior in comparison to LoRA and LOMO. Furthermore, our downstream performance evaluation of the instruction-tuned models using MT-bench shows that BAdam modestly surpasses LoRA and more substantially outperforms LOMO. Finally, we compare BAdam with Adam on a medium-sized task, i.e., finetuning RoBERTa-large on the SuperGLUE benchmark. The results demonstrate that BAdam is capable of narrowing the performance gap with Adam. Our code is available at https://github.com/Ledzy/BAdam.
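The abstract describes BAdam as block coordinate optimization with Adam as the inner solver: parameters are partitioned into blocks, and only the active block is updated with Adam while the others stay frozen, so optimizer states are kept only for that block. The following is a minimal sketch of that idea under stated assumptions (a generic PyTorch model, one block per top-level submodule, a fixed number of inner Adam steps per block); the helper names are illustrative and this is not the authors' implementation.

# Minimal sketch of block coordinate optimization with Adam as the inner solver
# (block partition and function names are illustrative, not the authors' code).
import torch

def block_coordinate_adam(model, dataloader, loss_fn, block_epochs=1, inner_steps=50, lr=1e-5):
    # Treat each top-level submodule that has parameters as one block.
    blocks = [list(m.parameters()) for m in model.children()
              if sum(p.numel() for p in m.parameters()) > 0]
    data_iter = iter(dataloader)

    for _ in range(block_epochs):
        for block in blocks:
            # Freeze everything, then activate only the current block.
            for p in model.parameters():
                p.requires_grad_(False)
            for p in block:
                p.requires_grad_(True)

            # Adam states (first/second moments) exist only for the active block,
            # which is the source of the memory savings.
            optimizer = torch.optim.Adam(block, lr=lr)

            for _ in range(inner_steps):
                try:
                    inputs, targets = next(data_iter)
                except StopIteration:
                    data_iter = iter(dataloader)
                    inputs, targets = next(data_iter)
                loss = loss_fn(model(inputs), targets)
                loss.backward()  # gradients are produced only for the active block
                optimizer.step()
                optimizer.zero_grad(set_to_none=True)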

3.1 Experiment Setup
We consider both natural language generation (NLG) and natural language understanding (NLU) tasks. For NLG, we adopt the Alpaca-GPT4 dataset, which consists of 52k instruction-following examples generated by GPT-4 using prompts from the Alpaca dataset. Our implementation is based on . We perform supervised finetuning (SFT) on the Alpaca-GPT4 dataset for the Llama 2-7B model, which contains approximately 6.7 billion parameters. The resulting model is then evaluated on MT-bench to test its downstream performance. As for NLU, we finetune the RoBERTa-large model with 355 million parameters on the SuperGLUE benchmark, with a particular focus on 6 tasks, i.e., BoolQ, COPA, MultiRC, RTE, WiC, and WSC, as they are selected in [17, 18]. We evaluate the NLU downstream performance on the test sets of the 6 tasks.
id: 9ff8db13166a0db28a13a7fbff4552ac - page: 6
For each task, we compare BAdam with existing approaches, including 1) LoRA, which adds trainable low-rank adapters to the original pretrained base model; 2) LOMO, which executes the stochastic gradient descent (SGD) update on the fly during the BP process, so that one does not need to physically store the stochastic gradient of the full set of trainable model parameters; and 3) Adam, which is the standard optimizer for full parameter training. For training Llama 2-7B on the Alpaca-GPT4 dataset, we set the learning rate to 1e-5 for all the methods. The batch size is set to 8, and we apply 15 steps of gradient accumulation for all the methods, resulting in an effective batch size of 120. Note that LOMO does not support gradient accumulation, as it has to perform the update during the backward process, and hence its effective batch size is 8. For a fair comparison, we count 15 actual iterations of LOMO as one iteration in the sequel. For tasks in SuperGLUE
id: 803978090b9c5beefc85ca877e9b03aa - page: 6
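To make the effective batch size arithmetic above concrete (micro-batch 8 with 15 accumulation steps gives 8 × 15 = 120 samples per optimizer update), here is a hedged sketch of the standard gradient accumulation pattern; the function and variable names are illustrative, not the paper's training script.

import torch

ACCUM_STEPS = 15  # 15 micro-batches of size 8 -> effective batch size 120

def train_one_epoch(model, dataloader, loss_fn, optimizer):
    optimizer.zero_grad(set_to_none=True)
    for step, (inputs, targets) in enumerate(dataloader):
        # Scale the loss so the accumulated gradient matches one batch of 120 samples.
        loss = loss_fn(model(inputs), targets) / ACCUM_STEPS
        loss.backward()  # gradients accumulate in the parameters' .grad buffers
        if (step + 1) % ACCUM_STEPS == 0:
            optimizer.step()  # one optimizer update per 15 micro-batches
            optimizer.zero_grad(set_to_none=True)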
For all the experiments, we choose the rank of LoRA to be 100 and use low-rank adaptation for all the trainable matrices rather than only the query and key matrices. In this manner, the number of trainable parameters for LoRA is nearly the same as that for BAdam at each iteration, ensuring a fairer comparison.
id: 5ba101a09b1a0a8bfa588a50acd568f2 - page: 6
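As a rough illustration of this LoRA setup (rank 100, adapters on all weight matrices rather than only query and key), a configuration could be written with the Hugging Face peft library as below; the module names listed are the usual Llama 2 projection layers and the alpha value is an assumption, since the text reports only the rank and the target coverage.

# Sketch: LoRA with rank 100 applied to all linear projection matrices of Llama 2-7B.
# Module names are the standard Hugging Face Llama projections (an assumption here).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=100,            # rank used for the comparison in the text
    lora_alpha=100,   # illustrative scaling factor; not specified in this section
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()  # sanity-check the trainable parameter count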
Due to the limitation of GPU memory, the performance of Adam is only reported for the RoBERTa-large model. Throughout all the experiments for training Llama 2-7B, we enable gradient checkpointing for all the tested optimization methods to reduce the memory cost caused by storing activations, so that a larger batch size can be applied.
3.2 Experiments on Llama 2-7B using a Single RTX3090-24GB GPU
In this subsection, we conduct instruction-tuning for the Llama 2-7B model on the Alpaca-GPT4 dataset. We illustrate the convergence behaviors of different methods. Additionally, we evaluate the downstream performance of the instruction-tuned models on MT-bench. Note that all the experiments in this subsection are conducted using a single RTX3090-24GB GPU.
id: c208613b93999fb3d86ea6162d8d2e3d - page: 6
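Gradient checkpointing, as used here, discards intermediate activations during the forward pass and recomputes them in the backward pass, trading compute for memory so that a larger batch fits on a single 24GB GPU. A minimal sketch with a Hugging Face model, assuming the standard transformers API:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Recompute activations during the backward pass instead of storing them,
# lowering peak memory so larger batch sizes fit on a single 24GB GPU.
model.gradient_checkpointing_enable()
model.config.use_cache = False  # the generation KV cache conflicts with checkpointing during training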
How to Retrieve?
# Search

curl -X POST "https://search.dria.co/hnsw/search" \
-H "x-api-key: <YOUR_API_KEY>" \
-H "Content-Type: application/json" \
-d '{"rerank": true, "top_n": 10, "contract_id": "1rKX_pAShsggTxcH0FPVfaOSe3YFKqB9BLRCAqL1hvg", "query": "What is alexanDRIA library?"}'
        
# Query

curl -X POST "https://search.dria.co/hnsw/query" \
-H "x-api-key: <YOUR_API_KEY>" \
-H "Content-Type: application/json" \
-d '{"vector": [0.123, 0.5236], "top_n": 10, "contract_id": "1rKX_pAShsggTxcH0FPVfaOSe3YFKqB9BLRCAqL1hvg", "level": 2}'