Created at 4pm, Mar 24
t2ruva
Artificial Intelligence
HOW CHATGPT UNDERSTANDS CONTEXT: THE POWER OF SELF-ATTENTION
Contract ID: 7dK4AdLB8aQJF6pv7egEeyKosJwXmKRPwFwCtsfrmRs
File Type: PDF
Entry Count: 33
Embed. Model: jina_embeddings_v2_base_en
Index Type: hnsw

BUSINESS SUMMARY

Anyone who has used ChatGPT will notice that the quality of responses of this next-generation chatbot is superior to that of older chatbots. ChatGPT and other similar apps are somehow able to produce responses that are more coherent and well-organized and that go to the heart of the users' instructions. This leap in quality can be in large part explained by the innovation of "self-attention", a mechanism by which a machine learning model can use the context of inputs to extract and apply more information about language, and thus produce higher-quality outputs. In this technology explainer, we describe how the self-attention mechanism works, at a technical level, and highlight its legal implications. With the rise of legal disputes implicating generative AI apps and their outputs, it is imperative that legal practitioners understand the underlying technology in order to make informed assertions and adequately guide clients.

INTRODUCTION

Large language models ("LLMs") are machine learning models designed for natural language processing tasks. Generative LLMs focus on generating new text. One of the most influential and well-known generative LLMs is "GPT", a model developed by OpenAI. GPT utilizes a "transformer" model architecture (the "T" in "GPT"), first defined by researchers at Google in a 2017 paper titled "Attention Is All You Need."1 As this title implies, the lodestar of the transformer architecture is the concept of "attention", specifically, "self-attention": a way for each element in a sequence to focus on other elements in the sequence and consider their importance adaptively. This mechanism captures contextual relationships between elements of natural language to produce a level of human-like coherence that makes it appear as if the model "understands" natural language.

In this technology explainer, we begin with the assumption that our model has already been trained, and we are examining how the attention mechanism works at "inference time", that is, operating on users' inputs it has not seen before. We will describe how the transformer model, specifically the subtype used by GPT,2 uses self-attention to produce a mathematical representation of "context" to generate new text.

The first step in the attention mechanism is to pass the Input Embedding matrix into the model so that it can perform a sequence of operations with it. Using matrix multiplication, the model will combine the Input Embedding matrix with three different matrices: the Query Weight, the Key Weight and the Value Weight matrices (each, a Weight matrix).5 Just as the Input Embedding matrix contains components adjusted during the training process, the components in the Weight matrices are likewise adjusted during the training process, and are similarly somewhat inscrutable. Whereas the Input Embedding matrix is meant to directly represent the relevant tokens, the Weight matrices will be used to capture contextualized information among tokens.
id: dd7bca8067d65abdce5fd8114b0a307a - page: 5
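To make this step concrete, the following is a minimal numerical sketch (in Python, using the numpy library) of how an Input Embedding matrix can be multiplied against the three Weight matrices. The tokens, dimensions and numbers are illustrative assumptions only; in a real GPT model the embeddings and weights are learned during training and are vastly larger.

# Example (illustrative sketch, not actual GPT weights)

import numpy as np

# Four tokens -- "I", "picked", "up", "a" -- each represented here by a
# 3-dimensional input embedding. Real models use far larger, learned values.
input_embeddings = np.array([
    [0.2, 0.7, 0.1],   # "I"
    [0.5, 0.1, 0.9],   # "picked"
    [0.3, 0.4, 0.6],   # "up"
    [0.8, 0.2, 0.3],   # "a"
])

# Three Weight matrices (here filled with random stand-in values):
# the Query Weight, Key Weight and Value Weight matrices.
rng = np.random.default_rng(0)
W_query = rng.normal(size=(3, 3))
W_key = rng.normal(size=(3, 3))
W_value = rng.normal(size=(3, 3))

# Matrix multiplication combines the Input Embedding matrix with each
# Weight matrix, producing the Query, Key and Value matrices.
Q = input_embeddings @ W_query
K = input_embeddings @ W_key
V = input_embeddings @ W_value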
The Input Embedding matrix is first multiplied against each of the Query Weight, Key Weight and Value Weight matrices. The result is three new matrices: the Query, Key and Value matrices. The Query matrix and Key matrix are then combined, again using matrix multiplication, to produce the Attention Score matrix. (We will return to the Value matrix at a later step.)

FIGURE 6: Visual representation of matrix multiplication steps to combine the Input Embedding, Query Weight and Key Weight matrices to produce the Attention Score matrix.6
id: 989b221f7cc7a98330e258ce1b955335 - page: 5
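Continuing the same illustrative sketch, the Query and Key matrices can be combined by matrix multiplication (with the Key matrix transposed so the dimensions line up) to produce the Attention Score matrix. The values below are stand-ins, not numbers from an actual model.

# Example (continuing the illustrative sketch above)

import numpy as np

# Stand-in Query and Key matrices for the four tokens "I", "picked", "up", "a".
Q = np.array([
    [0.4, 0.9, 0.2],
    [1.1, 0.3, 0.7],
    [0.2, 0.5, 0.1],
    [0.9, 0.6, 0.4],
])
K = np.array([
    [0.3, 0.8, 0.1],
    [0.7, 1.0, 0.5],
    [0.2, 0.6, 0.3],
    [0.9, 0.9, 0.8],
])

# Multiplying Q by the transpose of K yields a 4 x 4 Attention Score matrix:
# entry [i, j] is the attention score of token i with respect to token j.
attention_scores = Q @ K.T
print(attention_scores.round(2))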
5 In this article, our narrow goal is to demystify how GPT can extract contextual information through self-attention. In reality, GPT has many other trained weights than those we describe in this article (including the other sets of query, key and value matrices). And even within this small piece of the transformer model we focus on, we are simplifying significantly for the purposes of this discussion. For example, in addition to the fact that there are considerably more weights for each input in the Weight matrices, we also do not discuss how the model accounts for token order (addressed by positional embeddings).

6 Matrix multiplication combines two matrices to produce a third matrix, where each number in the resulting matrix is the dot product of a row from the first matrix and a column from the second.
id: b2831e449e166d08402076003326bf30 - page: 5
In matrix multiplication, the number of columns in the first matrix must equal the number of rows in the second matrix. The resulting third matrix will have dimensions corresponding with the number of rows in the first matrix and the number of columns in the second matrix. See Figure 10 for an example.

The numbers in this resulting Attention Score matrix represent a combination of the information captured in the Input Embedding, Query Weight and Key Weight matrices, or, put another way, the Attention Score matrix constitutes a set of numbers representing the relationship between each of the tokens. This can be represented in a table as follows:

                      I       PICKED    UP      A
I ATTENTION TO       0.06     0.89     0.40    1.42
PICKED ATTENTION TO  0.51     1.19     0.88    1.30
UP ATTENTION TO      0.04     0.29     0.17    0.58
A ATTENTION TO       0.84     0.47     0.69    1.25
id: c4c18f9399efb017bb38d335d90139d8 - page: 5
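Because Figure 10 is not reproduced here, the following small sketch illustrates the dimension rule described in footnote 6, and why multiplying a four-token Query matrix by the transposed Key matrix yields a 4 x 4 table of attention scores like the one above. The shapes and values are illustrative assumptions.

# Example (dimension rule from footnote 6)

import numpy as np

# First matrix: 4 rows x 3 columns (e.g., a Query matrix for four tokens).
# Second matrix: 3 rows x 4 columns (e.g., a transposed Key matrix).
# The 3 columns of the first match the 3 rows of the second, so the product
# is defined, and it has 4 rows and 4 columns -- one score per pair of tokens.
first = np.arange(12).reshape(4, 3)
second = np.arange(12).reshape(3, 4)

product = first @ second
print(product.shape)   # (4, 4)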
How to Retrieve?
# Search

curl -X POST "https://search.dria.co/hnsw/search" \
-H "x-api-key: <YOUR_API_KEY>" \
-H "Content-Type: application/json" \
-d '{"rerank": true, "top_n": 10, "contract_id": "7dK4AdLB8aQJF6pv7egEeyKosJwXmKRPwFwCtsfrmRs", "query": "What is alexanDRIA library?"}'
        
# Query

curl -X POST "https://search.dria.co/hnsw/query" \
-H "x-api-key: <YOUR_API_KEY>" \
-H "Content-Type: application/json" \
-d '{"vector": [0.123, 0.5236], "top_n": 10, "contract_id": "7dK4AdLB8aQJF6pv7egEeyKosJwXmKRPwFwCtsfrmRs", "level": 2}'