System Design Interview An Insider’s Guide by Alex Yu

Synthetic Data Generation Join Blog

Join Discord

Created at 8pm, Jan 4

brk

Software Development

System Design Interview An Insider’s Guide by Alex Yu

Contract ID

SY8n93E-SkJzdBylBCyVm6uTayQtO2Ymo6MKQpeDFPM

File Type

PDF

Entry Count

357

Embed. Model

jina_embeddings_v2_base_en

Index Type

hnsw

System Design Interview helper

Step 7: Link extractor extracts links from HTML pages. Step 8: Extracted links are passed to the URL filter. Step 9: After links are filtered, they are passed to the URL Seen? component. Step 10: URL Seen component checks if a URL is already in the storage, if yes, it is processed before, and nothing needs to be done. Step 11: If a URL has not been processed before, it is added to the URL Frontier. Step 3 Design deep dive Up until now, we have discussed the high-level design. Next, we will discuss the most important building components and techniques in depth: Depth-first search (DFS) vs Breadth-first search (BFS) URL frontier HTML Downloader Robustness Extensibility Detect and avoid problematic content

id: 665d4cd6ab5c4890302fb6467239c738 - page: 139

DFS vs BFS You can think of the web as a directed graph where web pages serve as nodes and hyperlinks (URLs) as edges. The crawl process can be seen as traversing a directed graph from one web page to others. Two common graph traversal algorithms are DFS and BFS. However, DFS is usually not a good choice because the depth of DFS can be very deep. BFS is commonly used by web crawlers and is implemented by a first-in-first-out (FIFO) queue. In a FIFO queue, URLs are dequeued in the order they are enqueued. However, this implementation has two problems: Most links from the same web page are linked back to the same host. In Figure 9-5, all the links in wikipedia.com are internal links, making the crawler busy processing URLs from the same host (wikipedia.com). When the crawler tries to download web pages in parallel, Wikipedia servers will be flooded with requests. This is considered as impolite.

id: fa5027fde25d9b6062a1c64f4d75a965 - page: 140

Standard BFS does not take the priority of a URL into consideration. The web is large and not every page has the same level of quality and importance. Therefore, we may want to prioritize URLs according to their page ranks, web traffic, update frequency, etc. URL frontier URL frontier helps to address these problems. A URL frontier is a data structure that stores URLs to be downloaded. The URL frontier is an important component to ensure politeness, URL prioritization, and freshness. A few noteworthy papers on URL frontier are mentioned in the reference materials . The findings from these papers are as follows:

id: 6f08e8ca8b13f02a95a78588f6e8f6fd - page: 141

Politeness Generally, a web crawler should avoid sending too many requests to the same hosting server within a short period. Sending too many requests is considered as impolite or even treated as denial-of-service (DOS) attack. For example, without any constraint, the crawler can send thousands of requests every second to the same website. This can overwhelm the web servers. The general idea of enforcing politeness is to download one page at a time from the same host. A delay can be added between two download tasks. The politeness constraint is implemented by maintain a mapping from website hostnames to download (worker) threads. Each downloader thread has a separate FIFO queue and only downloads URLs obtained from that queue. Figure 9-6 shows the design that manages politeness. Queue router: It ensures that each queue (b1, b2, bn) only contains URLs from the same host. Mapping table: It maps each host to a queue.

id: 58fa1188cfe7420713838fbf64bf7d38 - page: 141

How to Retrieve?

# Search

curl -X POST "https://search.dria.co/hnsw/search" \
-H "x-api-key: <YOUR_API_KEY>" \
-H "Content-Type: application/json" \
-d '{"rerank": true, "top_n": 10, "contract_id": "SY8n93E-SkJzdBylBCyVm6uTayQtO2Ymo6MKQpeDFPM", "query": "What is alexanDRIA library?"}'
        
# Query

curl -X POST "https://search.dria.co/hnsw/query" \
-H "x-api-key: <YOUR_API_KEY>" \
-H "Content-Type: application/json" \
-d '{"vector": [0.123, 0.5236], "top_n": 10, "contract_id": "SY8n93E-SkJzdBylBCyVm6uTayQtO2Ymo6MKQpeDFPM", "level": 2}'