Created at 1pm, Jan 2
Artificial Intelligence
Attention Is All You Need
Contract ID
Es2GbfGHzdUDRl6UKEGIwIpD2rByjOs11A0e3n2vVP8
File Type
PDF
Entry Count
53
Embed. Model
jina_embeddings_v2_base_en
Index Type
hnsw

'Attention Is All You Need' is a groundbreaking paper by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, published in 2017. The paper introduced the Transformer, a novel network architecture based solely on attention mechanisms, without relying on recurrence or convolutions. This work has significantly influenced the field of natural language processing and led to the development of many state-of-the-art models.

As a side benefit, self-attention could yield more interpretable models. We inspect attention distributions from our models and present and discuss examples in the appendix. Not only do individual attention heads clearly learn to perform different tasks, many appear to exhibit behavior related to the syntactic and semantic structure of the sentences.

5 Training

This section describes the training regime for our models.
id: b0dfa3895ebcb54dad4cb49863b1969d - page: 7
5.1 Training Data and Batching

We trained on the standard WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs. Sentences were encoded using byte-pair encoding, which has a shared source-target vocabulary of about 37000 tokens. For English-French, we used the significantly larger WMT 2014 English-French dataset consisting of 36M sentences and split tokens into a 32000 word-piece vocabulary. Sentence pairs were batched together by approximate sequence length. Each training batch contained a set of sentence pairs containing approximately 25000 source tokens and 25000 target tokens.
id: d1c2d6bc762f492c22219fc09dd114f5 - page: 7
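The length-based batching described in 5.1 can be sketched roughly as follows. This is an illustrative Python outline; the function name, its arguments, and the greedy packing strategy are assumptions, not the authors' actual data pipeline.

# Rough sketch of batching sentence pairs by approximate length so each
# batch holds about 25000 source tokens and 25000 target tokens.
def batch_by_length(sentence_pairs, max_tokens=25000):
    """Group (source_tokens, target_tokens) pairs into token-budgeted batches."""
    # Sort by approximate length so pairs inside a batch are similar in size.
    pairs = sorted(sentence_pairs, key=lambda p: (len(p[0]), len(p[1])))
    batches, current, src_count, tgt_count = [], [], 0, 0
    for src, tgt in pairs:
        if current and (src_count + len(src) > max_tokens
                        or tgt_count + len(tgt) > max_tokens):
            batches.append(current)
            current, src_count, tgt_count = [], 0, 0
        current.append((src, tgt))
        src_count += len(src)
        tgt_count += len(tgt)
    if current:
        batches.append(current)
    return batches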
5.2 Hardware and Schedule

We trained our models on one machine with 8 NVIDIA P100 GPUs. For our base models using the hyperparameters described throughout the paper, each training step took about 0.4 seconds. We trained the base models for a total of 100,000 steps or 12 hours. For our big models (described on the bottom line of Table 3), step time was 1.0 seconds. The big models were trained for 300,000 steps (3.5 days).

5.3 Optimizer

We used the Adam optimizer with β₁ = 0.9, β₂ = 0.98 and ε = 10⁻⁹. We varied the learning rate over the course of training, according to the formula:

lrate = d_model^(-0.5) · min(step_num^(-0.5), step_num · warmup_steps^(-1.5))

This corresponds to increasing the learning rate linearly for the first warmup_steps training steps, and decreasing it thereafter proportionally to the inverse square root of the step number. We used warmup_steps = 4000.
id: 3ebc9de1f0575d99482b149392ec70bb - page: 7
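The schedule in 5.3 is straightforward to reproduce. Below is a small Python sketch: warmup_steps = 4000 comes from the text, d_model = 512 is the base configuration, and the function name and the step-zero guard are illustrative assumptions.

# Warmup / inverse-square-root learning-rate schedule from Section 5.3.
def transformer_lrate(step_num, d_model=512, warmup_steps=4000):
    step_num = max(step_num, 1)  # guard against 0 ** -0.5 at step 0
    return d_model ** -0.5 * min(step_num ** -0.5,
                                 step_num * warmup_steps ** -1.5)

# The schedule peaks at step_num == warmup_steps:
print(transformer_lrate(4000))  # ~7e-4 for d_model = 512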
5.4 Regularization

We employ three types of regularization during training:

Table 2: The Transformer achieves better BLEU scores than previous state-of-the-art models on the English-to-German and English-to-French newstest2014 tests at a fraction of the training cost.

Model                        BLEU EN-DE   BLEU EN-FR
ByteNet                      23.75        -
Deep-Att + PosUnk            -            39.2
GNMT + RL                    24.6         39.92
ConvS2S                      25.16        40.46
MoE                          26.03        40.56
Deep-Att + PosUnk Ensemble   -            40.4
GNMT + RL Ensemble           26.30        41.16
ConvS2S Ensemble             26.36        41.29
Transformer (base model)     27.3         38.1
Transformer (big)            28.4         41.8
id: d8519d3fadf4807a88456017943a4f3d - page: 7
How to Retrieve?
# Search

curl -X POST "https://search.dria.co/hnsw/search" \
-H "x-api-key: <YOUR_API_KEY>" \
-H "Content-Type: application/json" \
-d '{"rerank": true, "top_n": 10, "contract_id": "Es2GbfGHzdUDRl6UKEGIwIpD2rByjOs11A0e3n2vVP8", "query": "What is alexanDRIA library?"}'
        
# Query

curl -X POST "https://search.dria.co/hnsw/query" \
-H "x-api-key: <YOUR_API_KEY>" \
-H "Content-Type: application/json" \
-d '{"vector": [0.123, 0.5236], "top_n": 10, "contract_id": "Es2GbfGHzdUDRl6UKEGIwIpD2rByjOs11A0e3n2vVP8", "level": 2}'