LLM in a Flash: Efficient Large Language Model Inference with Limited Memory
Contract ID: b5NAKWoZHOL9jz1ayE5a99YkGwRvOPdHxCVQt-pu_as
File Type: PDF
Entry Count: 65
Embed. Model: jina_embeddings_v2_base_en
Index Type: hnsw

The paper 'LLM in a Flash: Efficient Large Language Model Inference with Limited Memory' by Apple presents a novel method to run large language models (LLMs) on devices with limited memory, such as mobile devices or low-end computers. The authors address the challenge of efficiently running LLMs that exceed the available DRAM capacity by storing the model parameters on flash memory and optimizing the inference process.

Alizadeh, K., Mirzadeh, I., Belenko, D., Khatamifard, K., Cho, M., Del Mundo, C. C., Rastegari, M., & Farajtabar, M. (2023). LLM in a Flash: Efficient Large Language Model Inference with Limited Memory. arXiv preprint arXiv:2312.11514.

For the implementation of our inference process, we use the HuggingFace transformers library with KV caching. This setup is tested under the condition that approximately half of the model size is available in DRAM. We select this amount to showcase the idea of hosting the LLM in flash memory; with a different level of sparsity or by employing quantization, one can also work with a smaller available DRAM capacity. Such a configuration demonstrates the practicality of executing inference with a lower memory footprint.
id: 6ef08b12d02a66bc65e3936f48b452c6 - page: 7
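As a rough illustration of this kind of setup, the following is a minimal sketch of generation with the HuggingFace transformers library and KV caching enabled; the model choice and generation parameters are illustrative, not the paper's exact configuration, and all weights are kept resident in memory here (no flash offloading).

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# OPT 6.7B is one of the models evaluated in the paper.
model_name = "facebook/opt-6.7b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

prompt = "Large language models with limited memory"
inputs = tokenizer(prompt, return_tensors="pt")

# use_cache=True enables KV caching: keys and values of already generated
# tokens are stored and reused, so each decoding step only computes attention
# for the newest token instead of re-running the whole prefix.
outputs = model.generate(**inputs, max_new_tokens=32, use_cache=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))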
Hardware Configuration. Our models are evaluated using two distinct hardware setups. The first setup includes an Apple M1 Max with a 1TB solid-state drive (SSD) for flash memory. In this configuration, computations are performed on the CPU, and the models are maintained in a 32-bit format. The second setup involves a Linux machine equipped with a 24 GB NVIDIA GeForce RTX 4090 graphics card. For this machine, computations are GPU-based, and models are run in the bfloat16 format. For both setups, we operate under the assumption that almost half of the total available memory (DRAM plus GPU memory) is allocated for model computations. Models. We use OPT 6.7B (Zhang et al., 2022b) and a sparsified Falcon 7B (Mirzadeh et al., 2023) model for our evaluations.
id: 7e48a4c5f5b66ff1f067d9dfd7fbe7d0 - page: 7
Baselines. For methods not employing sparsity or weight sharing, at least half of the model must be transferred from flash memory during the forward pass. This necessity arises because, initially, only half of the model is available in DRAM, but as the forward pass progresses, the entire model capacity is utilized. Consequently, any data not present at the start must be transferred at least once. Thus, the most efficient theoretical baseline involves loading half of the model size from the flash memory into DRAM. This optimal I/O scenario serves as our primary baseline. Comparative methods, such as FlexGen (Sheng et al., 2023) and Petals (Borzunov et al., 2023), are also constrained by the limited available DRAM or GPU memory, and therefore cannot surpass this theoretical I/O efficiency.
id: fc1a9aaebf4a99c19c977ff2b3f8b620 - page: 7
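To make the theoretical I/O baseline concrete, a small back-of-the-envelope calculation; the SSD bandwidth figure below is an assumed, illustrative number, not a measurement from the paper.

# Rough lower bound on I/O time for the "load half the model" baseline.
params = 6.7e9                 # OPT 6.7B parameters
bytes_per_param = 2            # bfloat16
half_model_bytes = params * bytes_per_param / 2
ssd_bandwidth = 3e9            # assumed sequential read speed in bytes/s (illustrative)

print(f"data to transfer: {half_model_bytes / 1e9:.1f} GB")
print(f"I/O lower bound:  {half_model_bytes / ssd_bandwidth:.2f} s per forward pass")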
Flash memory Data Loading Implementation. To optimize data loading from flash memory, our system employs a 32-thread reading process. This multithreading approach is specifically designed to enhance data retrieval efficiency, allowing for simultaneous access to multiple data segments (Figure 2b). Caching Considerations for Data Loading from Flash Memory. When data is read from flash memory, the operating system typically caches these pages, anticipating future reuse. However, this caching mechanism consumes additional memory in DRAM beyond what is allocated for the model. To accurately assess the real throughput of flash memory under limited DRAM conditions, benchmarks should be conducted without relying on caching.
id: f5479579956f5069b758fdea06cc0f12 - page: 7
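Below is a rough sketch of the multithreaded reading idea, assuming a hypothetical flat weights file and fixed-size chunks; the paper's loader is more involved, and this only illustrates issuing 32 concurrent reads plus the page-cache caveat.

import os
from concurrent.futures import ThreadPoolExecutor

WEIGHTS_PATH = "weights.bin"   # hypothetical file holding the model parameters
CHUNK_SIZE = 4 * 1024 * 1024   # read granularity, 4 MiB (illustrative)
NUM_THREADS = 32               # matches the 32-thread reading process described above

def read_chunk(fd, offset):
    # os.pread is thread-safe: each call carries its own offset, so many
    # threads can read disjoint regions of the same descriptor concurrently.
    return os.pread(fd, CHUNK_SIZE, offset)

def load_offsets(offsets):
    fd = os.open(WEIGHTS_PATH, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=NUM_THREADS) as pool:
            return list(pool.map(lambda off: read_chunk(fd, off), offsets))
    finally:
        os.close(fd)

# Caveat from the caching discussion above: the OS page cache serves repeated
# reads from DRAM, so raw flash throughput should be benchmarked with the
# cache bypassed or dropped (e.g., O_DIRECT on Linux).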
How to Retrieve?
# Search

curl -X POST "https://search.dria.co/hnsw/search" \
-H "x-api-key: <YOUR_API_KEY>" \
-H "Content-Type: application/json" \
-d '{"rerank": true, "top_n": 10, "contract_id": "b5NAKWoZHOL9jz1ayE5a99YkGwRvOPdHxCVQt-pu_as", "query": "What is alexanDRIA library?"}'
        
# Query

curl -X POST "https://search.dria.co/hnsw/query" \
-H "x-api-key: <YOUR_API_KEY>" \
-H "Content-Type: application/json" \
-d '{"vector": [0.123, 0.5236], "top_n": 10, "contract_id": "b5NAKWoZHOL9jz1ayE5a99YkGwRvOPdHxCVQt-pu_as", "level": 2}'