Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache
Contract ID
7oDK7A5VsIsA1kLMEKm_BDRnzjFXZm3sc9q6PjJnsBk
File Type
PDF
Entry Count
89
Embed. Model
jina_embeddings_v2_base_en
Index Type
hnsw

Abstract of the paper: The rapid proliferation of Large Language Models (LLMs) has been a driving force in the growth of cloud-based LLM services, which are now integral to advancing AI applications. However, the dynamic auto-regressive nature of LLM service, along with the need to support exceptionally long context lengths, demands the flexible allocation and release of substantial resources. This presents considerable challenges in designing cloud-based LLM service systems, where inefficient management can lead to performance degradation or resource wastage. In response to these challenges, this work introduces DistAttention, a novel distributed attention algorithm that segments the KV Cache into smaller, manageable units, enabling distributed processing and storage of the attention module. Based on that, we propose DistKV-LLM, a distributed LLM serving system that dynamically manages KV Cache and effectively orchestrates all accessible GPU and CPU memories spanning across the data center. This ensures a high-performance LLM service on the cloud, adaptable to a broad range of context lengths. Validated in a cloud environment with 32 NVIDIA A100 GPUs in configurations from 2 to 32 instances, our system exhibited 1.03-2.4x end-to-end throughput improvements and supported context lengths 2-19x longer than current state-of-the-art LLM service systems, as evidenced by extensive testing across 18 datasets with context lengths up to 1,900K.

Original paper: https://arxiv.org/abs/2401.02669

Competing Candidates : In scenarios where multiple debtor instances concurrently send requests to an rManager, the system must arbitrate these competing demands efficiently. The global debt ledger plays an important role here, enabling the gManager to distribute requests evenly among instances and thereby prevent overload on any single instance. On its side, the rManager adopts a first-come-first-serve policy when allocating physical space to rBlocks from remote instances. If the rManager cannot allocate sufficient physical rBlocks for the remote rBlocks due to space constraints, it returns a failure (false) to the debtor instance. This response also prompts the gManager to update its record of current resource availability, effectively pausing the forwarding of new requests until more resources become available. This approach ensures a balanced and orderly allocation of memory resources, mitigating potential bottlenecks in the system.
id: d0611c00fdf3be3c524d9a6e916fbd9b - page: 7
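
The interplay between the first-come-first-serve rManager and the ledger-updating gManager can be sketched in a few lines of Python. This is a minimal illustration under assumed names (RManager, GManager, try_allocate, and borrow are not from the paper's code), with a thread lock standing in for whatever serialization the real system uses:

import threading

class RManager:
    # One instance's memory manager; grants space to remote debtors
    # strictly first-come-first-serve.
    def __init__(self, total_rblocks: int):
        self.free_rblocks = total_rblocks
        self._lock = threading.Lock()  # serializes concurrent debtor requests

    def try_allocate(self, n: int) -> bool:
        with self._lock:  # whoever arrives first wins
            if self.free_rblocks >= n:
                self.free_rblocks -= n
                return True
            return False  # the "false" reply: no physical rBlocks left

class GManager:
    # Global coordinator; spreads borrow requests across creditor instances
    # and records which creditors have reported themselves full.
    def __init__(self, rmanagers: dict):
        self.rmanagers = rmanagers        # instance id -> RManager
        self.available = set(rmanagers)   # instances believed to have space

    def borrow(self, debtor: str, n: int):
        for creditor in sorted(self.available - {debtor}):
            if self.rmanagers[creditor].try_allocate(n):
                return creditor           # debtor can place its rBlocks here
            # A false reply updates the ledger: stop forwarding requests to
            # this creditor until it reports freed space again.
            self.available.discard(creditor)
        return None                       # no creditor currently has room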
Coherency : We employ a loose coherence policy between the gManager and the rManagers. Under this approach, the gManager is not required to meticulously track every memory allocation or release across all instances. Instead, it gathers this information through regular heartbeats sent automatically by the rManagers. Consequently, the gManager maintains an overview of general space usage throughout the data center rather than detailed, real-time data. When responding to a debtor rManager's request to borrow space, the gManager only provides recommendations of potential creditor candidates; the debtor must then negotiate with these suggested creditors to finalize the memory allocation. Situations involving multiple concurrent requests to the same rManager are handled with the competing-candidates strategy discussed above. This loosely coupled coherence framework not only streamlines operations but also minimizes excessive transaction overheads.
id: a5cb3d4c6de6d2243c973313148d5579 - page: 7
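
The heartbeat-driven bookkeeping can be sketched similarly. All identifiers below (GManagerView, on_heartbeat, recommend_creditors) and the staleness window are invented for illustration; the key property is that the gManager only recommends creditors from a periodically refreshed overview, so a debtor's follow-up negotiation may still fail:

import time

class GManagerView:
    # The gManager's approximate, heartbeat-fed view of cluster memory.
    def __init__(self, staleness_s: float = 5.0):
        self.free_rblocks = {}  # instance id -> last reported free rBlocks
        self.last_seen = {}     # instance id -> time of last heartbeat
        self.staleness_s = staleness_s

    def on_heartbeat(self, instance_id: str, free_rblocks: int) -> None:
        # Heartbeats are the only coherence mechanism: no per-allocation
        # tracking, just a periodically refreshed overview.
        self.free_rblocks[instance_id] = free_rblocks
        self.last_seen[instance_id] = time.monotonic()

    def recommend_creditors(self, debtor_id: str, n: int, k: int = 3):
        # Return up to k candidates believed able to lend n rBlocks.
        # These are recommendations, not reservations: the debtor still
        # negotiates with each candidate and may receive a false reply.
        now = time.monotonic()
        fresh = [(free, iid) for iid, free in self.free_rblocks.items()
                 if iid != debtor_id
                 and now - self.last_seen[iid] < self.staleness_s
                 and free >= n]
        fresh.sort(reverse=True)  # most spacious candidates first
        return [iid for _, iid in fresh[:k]]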
Scalability : To meet varying throughput demands, the gManager enhances scalability by deploying multiple processes that concurrently handle query requests. To expedite the identification of instances with surplus memory, the gManager periodically initiates a sorting operation that arranges instances by their remaining available memory, enabling query requests to efficiently bypass instances with minimal memory resources. This approach keeps the gManager within its optimal capacity, maintaining system efficiency and responsiveness while scaling to accommodate the dynamic needs of the network.
id: 2aedeba5c2c079ce3e45b408bdaa6422 - page: 7
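
The periodic sorting step might look like the snippet below (identifiers invented for illustration): a background job resorts a snapshot of per-instance free memory so that query handlers can binary-search straight past instances too full to help:

import bisect

class SortedCreditorIndex:
    # Periodically rebuilt index of instances ordered by free memory.
    def __init__(self):
        self.snapshot = []  # ascending list of (free_rblocks, instance_id)

    def resort(self, free_rblocks: dict) -> None:
        # Run periodically, not per request: a cheap, slightly stale index.
        self.snapshot = sorted(
            (free, iid) for iid, free in free_rblocks.items())

    def candidates(self, n: int):
        # bisect skips every instance with fewer than n free rBlocks.
        start = bisect.bisect_left(self.snapshot, (n, ""))
        # Richest-first, steering new debts away from nearly full nodes.
        return [iid for _, iid in reversed(self.snapshot[start:])]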
4.5 Fragmented Memory Management : Due to the dynamic variability of context lengths and batching, a critical challenge emerges in the form of fragmented memory. Each instance within the system operates both as a creditor and a debtor of memory space, lending to and borrowing from other instances as required. For example, instances handling requests with long contexts may grow continuously, necessitating borrowed space from remote instances. Conversely, instances with short-lived requests release memory sooner, which can then be lent to others or allocated to new requests. This dynamism leads to a significant issue: the deterioration of data locality. As instances frequently access data stored in remote memory locations, the system incurs a substantial performance penalty, such as increased latency and reduced throughput.
id: 6f30116cf158a5a8bb364a8a91e69852 - page: 8
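
A toy calculation makes the locality penalty concrete. The latency constants below are invented for illustration (the excerpt gives no numbers); the point is only that the average per-block read cost climbs steeply as the remote share of a request's KV cache grows:

def effective_read_latency(local_blocks: int, remote_blocks: int,
                           local_us: float = 2.0, remote_us: float = 50.0):
    # Average per-rBlock KV cache read latency, in microseconds, for one
    # request whose cache is split between local and remote memory.
    total = local_blocks + remote_blocks
    return (local_blocks * local_us + remote_blocks * remote_us) / total

# A long-context request that outgrew its instance: three quarters of its
# KV cache now lives on remote creditors.
print(effective_read_latency(local_blocks=256, remote_blocks=768))  # 38.0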
How to Retrieve?
# Search

curl -X POST "https://search.dria.co/hnsw/search" \
-H "x-api-key: <YOUR_API_KEY>" \
-H "Content-Type: application/json" \
-d '{"rerank": true, "top_n": 10, "contract_id": "7oDK7A5VsIsA1kLMEKm_BDRnzjFXZm3sc9q6PjJnsBk", "query": "What is alexanDRIA library?"}'
        
# Query

curl -X POST "https://search.dria.co/hnsw/query" \
-H "x-api-key: <YOUR_API_KEY>" \
-H "Content-Type: application/json" \
-d '{"vector": [0.123, 0.5236], "top_n": 10, "contract_id": "7oDK7A5VsIsA1kLMEKm_BDRnzjFXZm3sc9q6PjJnsBk", "level": 2}'