Created at 1pm, Dec 29
firstbatch / Artificial Intelligence
Efficient LLM Inference on CPUs
Contract ID: zQDVmYyJcVgbP3S6JDA3LeqY_c8Xdugg23bNuQXSPYY
File Type: PDF
Entry Count: 25
Embed. Model: jina_embeddings_v2_base_en
Index Type: hnsw

Large language models (LLMs) have demonstrated remarkable performance and tremendous potential across a wide range of tasks. However, deploying these models is challenging due to their astronomical number of parameters, which demands large memory capacity and high memory bandwidth. In this paper, the researchers propose an effective approach that makes the deployment of LLMs more efficient. They support an automatic INT4 weight-only quantization flow and design a special LLM runtime with highly optimized kernels to accelerate LLM inference on CPUs. They demonstrate the general applicability of their approach on popular LLMs including Llama2, Llama, and GPT-NeoX, and showcase extreme inference efficiency on CPUs. The code is publicly available at: https://github.com/intel/intel-extension-for-transformers.
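
The core technique is group-wise INT4 weight-only quantization (weights stored as 4-bit integers with one scale per group, activations kept in higher precision). The snippet below is a minimal NumPy sketch of symmetric group-wise INT4 quantization and dequantization; it illustrates the idea only and is not the paper's exact recipe or the optimized kernels shipped in the repository. The group_size values mirror the 32/128 configurations evaluated later.

import numpy as np

def quantize_int4_groupwise(w, group_size=128):
    """Symmetric group-wise INT4 weight-only quantization (illustrative sketch).

    Weights are split into contiguous groups of `group_size` values; each group
    shares one FP32 scale, and its values are stored as integers in [-8, 7].
    Assumes w.size is divisible by group_size.
    """
    groups = w.reshape(-1, group_size)
    scales = np.maximum(np.abs(groups).max(axis=1, keepdims=True), 1e-12) / 7.0
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize_int4_groupwise(q, scales, shape):
    """Recover an FP32 approximation of the original weights."""
    return (q.astype(np.float32) * scales).reshape(shape)

# Example: quantization error for a random 4096x4096 weight matrix.
w = np.random.randn(4096, 4096).astype(np.float32)
q, s = quantize_int4_groupwise(w.ravel(), group_size=128)
w_hat = dequantize_int4_groupwise(q, s, w.shape)
print("mean abs error:", float(np.abs(w - w_hat).mean()))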

Supported compute ISAs: AVX2, AVX512F, AVX_VNNI, AVX512_VNNI, AMX_INT8, AVX512_FP16, AMX_BF16. LLM Optimizations. Most recent LLMs are typically decoder-only Transformer-based models (Vaswani et al.). Given the unique characteristics of next token generation, the KV cache becomes performance-critical for LLM inference. We describe the optimizations in Figure 3. Figure 3: KV cache optimization. Left (a) shows the default KV cache, where new token generation requires memory reallocation for all the tokens (5 in this example); right (b) shows the optimized KV cache with pre-allocated KV memory, where only the new token is updated each time. 3 Results
id: 946750bc32214f15d2f8821aea23c3a0 - page: 3
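
To make the Figure 3(b) idea concrete, here is a minimal Python sketch of a pre-allocated KV cache with hypothetical dimensions; it shows that each decoding step writes only the new token's K/V slice instead of reallocating and copying the whole cache. It is an illustration of the concept, not the runtime's actual implementation.

import numpy as np

# Hypothetical dimensions, for illustration only.
MAX_SEQ, N_HEADS, HEAD_DIM = 2048, 32, 128

class PreallocatedKVCache:
    """Pre-allocated KV cache: memory for up to MAX_SEQ tokens is reserved once,
    and each decode step writes the new token's K/V in place (Figure 3(b)),
    avoiding the per-token reallocation of the default scheme (Figure 3(a))."""

    def __init__(self):
        self.k = np.empty((MAX_SEQ, N_HEADS, HEAD_DIM), dtype=np.float32)
        self.v = np.empty((MAX_SEQ, N_HEADS, HEAD_DIM), dtype=np.float32)
        self.length = 0  # number of tokens currently cached

    def append(self, k_new, v_new):
        # O(1) update: write in place, no reallocation or copy of earlier tokens.
        self.k[self.length] = k_new
        self.v[self.length] = v_new
        self.length += 1

    def view(self):
        # Attention reads only the filled prefix of the cache.
        return self.k[: self.length], self.v[: self.length]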
3.1 Experimental Setup To demonstrate the generality, we select popular LLMs across a wide range of architectures, with model parameter sizes from 7B to 20B. We evaluate the accuracy of both FP32 and INT4 models using open-source datasets from lm-evaluation-harness, including lambada_openai (Paperno et al.), hellaswag (Zellers et al.), winogrande (Sakaguchi et al.), piqa (Bisk et al.), and wikitext. To demonstrate the performance, we measure the latency of next token generation on 4th Generation Intel Xeon Scalable Processors, available on public clouds such as AWS. 3.2 Accuracy We evaluate the accuracy on the aforementioned datasets and show the average accuracy in Table 2. The table shows that the accuracy of the INT4 model is nearly on par with that of the FP32 model, within 1% relative loss from the FP32 baseline. Table 2: INT4 and FP32 model accuracy. The INT4 model has two configurations: group size=32 and group size=128.
id: 01b54c7ed38028dfb3a2beb864b33212 - page: 4
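
The accuracy numbers are produced with lm-evaluation-harness on the datasets listed above. The sketch below shows how such a run could look using the harness's Python entry point; the exact function and argument names vary between harness releases, so treat them as assumptions rather than the paper's script.

# Minimal sketch of an accuracy run with lm-evaluation-harness (v0.4-style API).
# Function and argument names are assumptions and may differ per release.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-2-7b-hf,dtype=float32",
    tasks=["lambada_openai", "hellaswag", "winogrande", "piqa", "wikitext"],
    batch_size=8,
)
for task, metrics in results["results"].items():
    print(task, metrics)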
LLM                              FP32    INT4 (group size=32)   INT4 (group size=128)
EleutherAI/gpt-j-6B              0.643   0.644                  0.64
meta-llama/Llama-2-7b-hf         0.69    0.69                   0.685
decapoda-research/llama-7b-hf    0.689   0.682                  0.68
EleutherAI/gpt-neox-20b          0.674   0.672                  0.669
tiiuae/falcon-7b                 0.698   0.694                  0.693

3.3 Performance We measure the latency of next token generation using LLM runtime and the popular open-source ggml-based implementation. Table 3 presents the latency under a proxy configuration with 32 as both the input and output token length. Note that the ggml-based solution only supports group size=32 in this test. Table 3: INT4 performance using LLM runtime and the ggml-based solution. LLM runtime outperforms the ggml-based solution by up to 1.6x under group size=128 and up to 1.3x under group size=32. Models measured (rows of Table 3): EleutherAI/gpt-j-6B, meta-llama/Llama-2-7b-hf, decapoda-research/llama-7b-hf, EleutherAI/gpt-neox-20b, tiiuae/falcon-7b.
id: 928f402b2d8977b90682b68b106faca6 - page: 4
model                            LLM Runtime (group size=32)   LLM Runtime (group size=128)   ggml-based (group size=32)
EleutherAI/gpt-j-6B              22.99ms                       19.98ms                        31.62ms
meta-llama/Llama-2-7b-hf         23.4ms                        21.96ms                        27.71ms
decapoda-research/llama-7b-hf    23.88ms                       22.04ms                        27.2ms
EleutherAI/gpt-neox-20b          80.16ms                       61.21ms                        92.36ms
tiiuae/falcon-7b                 31.23ms                       22.26ms                        36.22ms

3.4 Discussion Though we demonstrate the performance advantage over the ggml-based solution, there are still opportunities for LLM runtime to further improve performance through additional tuning, such as the thread scheduler in LLM runtime and the blocking strategy in the CPU tensor library.
id: 3557d414e1749ad92d57b05afde0c855 - page: 4
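
The latency metric above is time per generated next token under a 32-in/32-out configuration. A simple harness like the sketch below is enough to reproduce the measurement methodology (warm up, then time each decode step and average); it is illustrative only, not the paper's benchmark code, and generate_step is a hypothetical stand-in for one decode step of the model under test.

import time

def measure_next_token_latency(generate_step, n_tokens=32, warmup=4):
    """Average per-token decode latency in milliseconds.

    `generate_step` is any callable that produces the next token; warm-up
    iterations are excluded from the measurement.
    """
    for _ in range(warmup):
        generate_step()
    latencies = []
    for _ in range(n_tokens):
        start = time.perf_counter()
        generate_step()
        latencies.append((time.perf_counter() - start) * 1e3)  # ms
    return sum(latencies) / len(latencies)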
How to Retrieve?
# Search

curl -X POST "https://search.dria.co/hnsw/search" \
-H "x-api-key: <YOUR_API_KEY>" \
-H "Content-Type: application/json" \
-d '{"rerank": true, "top_n": 10, "contract_id": "zQDVmYyJcVgbP3S6JDA3LeqY_c8Xdugg23bNuQXSPYY", "query": "What is alexanDRIA library?"}'
        
# Query

curl -X POST "https://search.dria.co/hnsw/query" \
-H "x-api-key: <YOUR_API_KEY>" \
-H "Content-Type: application/json" \
-d '{"vector": [0.123, 0.5236], "top_n": 10, "contract_id": "zQDVmYyJcVgbP3S6JDA3LeqY_c8Xdugg23bNuQXSPYY", "level": 2}'