Created at 6am, Apr 5
Ms-RAG · Artificial Intelligence
Sailor: Open Language Models for South-East Asia
Contract ID: XMHWYK5nAo-cklPdQWp1nGr6t3SQjJo6dAFPtpm5ayI
File Type: PDF
Entry Count: 131
Embed. Model: jina_embeddings_v2_base_en
Index Type: hnsw

Longxu Dou1∗ Qian Liu1∗ Guangtao Zeng2 Jia Guo1 Jiahui Zhou1 Wei Lu2 Min Lin1
1Sea AI Lab, Singapore  2SUTD, Singapore
{doulx, liuqian}@sea.com
Homepage: https://sailorllm.github.io
Model: https://huggingface.co/sail

Abstract

We present Sailor, a family of open language models ranging from 0.5B to 7B parameters, tailored for South-East Asian (SEA) languages. These models are continually pre-trained from Qwen1.5, a great language model for multilingual use cases. From Qwen1.5, Sailor models accept 200B to 400B tokens, primarily covering the languages of English, Chinese, Vietnamese, Thai, Indonesian, Malay, and Lao. The training leverages several techniques, including BPE dropout for improving the model robustness, aggressive data cleaning and deduplication, and small proxy models to optimize data mixture. Experimental results on four typical tasks indicate that Sailor models demonstrate strong performance across different benchmarks, including commonsense reasoning, question answering, reading comprehension and examination. Embracing the open-source spirit, we share our insights through this report to spark a wider interest in developing large language models for multilingual use cases.
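The abstract names BPE dropout as the robustness technique used during continual pre-training. As a rough illustration only, not the authors' training code, the sketch below enables BPE dropout with the Hugging Face tokenizers library on a toy corpus; the dropout rate, vocabulary size, and corpus are made-up values for demonstration.

# BPE dropout sketch (assumptions: toy corpus, dropout=0.1, vocab_size=300)
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# With dropout, each BPE merge is skipped with probability 0.1 at encoding time,
# so the same word can receive different segmentations across passes.
tokenizer = Tokenizer(BPE(unk_token="[UNK]", dropout=0.1))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=300, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(["selamat pagi dunia", "terima kasih banyak"] * 500, trainer)

# Repeated encodings of the same text may differ, exposing the model to
# alternative sub-word segmentations.
for _ in range(3):
    print(tokenizer.encode("selamat").tokens)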

5.2 Training Details

We adopt most of the pre-training settings and model architectures from Qwen1.5 (Bai et al., 2023). It follows the standard Transformer architecture (Vaswani et al., 2017), adopting pre-normalization with RMSNorm (Jiang et al., 2023b), the SwiGLU activation (Shazeer, 2020), and rotary positional embeddings (Su et al., 2022). Notably, Qwen1.5 adds a bias term to the QKV projections in attention to improve the extrapolation ability. Meanwhile, for the 0.5B model, we set tie_word_embeddings to False, i.e., we do not tie the learning of the input embedding (the embedding module) and the output projection (the lm_head module). Thus, the parameter count of Sailor-0.5B is approximately 0.6B. However, we still name it 0.5B to stay consistent with Qwen1.5.
id: 3257749bd4388a1e2bbeabab8e496736 - page: 15
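To make the architecture description above concrete, here is a minimal sketch, not the authors' released configuration, of instantiating an untied-embedding Qwen1.5-style 0.5B model with Hugging Face transformers; the layer sizes mirror the public Qwen1.5-0.5B model card and are assumptions rather than values stated in the paper.

from transformers import Qwen2Config, Qwen2ForCausalLM

config = Qwen2Config(
    vocab_size=151_936,            # assumption: public Qwen1.5-0.5B vocabulary size
    hidden_size=1024,
    intermediate_size=2816,
    num_hidden_layers=24,
    num_attention_heads=16,
    num_key_value_heads=16,
    hidden_act="silu",             # gated MLP, i.e. the SwiGLU activation
    rms_norm_eps=1e-6,             # pre-normalization with RMSNorm
    max_position_embeddings=4096,  # matches the 4,096-token training context
    tie_word_embeddings=False,     # untied embedding / lm_head, as in Sailor-0.5B
)
model = Qwen2ForCausalLM(config)   # Qwen2-style attention includes the QKV bias by default
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e9:.2f}B parameters")  # roughly 0.6B with the untied lm_head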
During training, we utilize a context window length of 4,096, and integrate Flash Attention 2 (Dao, 2023) to improve the training efficiency and reduce the memory usage. We utilize AdamW (Kingma & Ba, 2014) for optimization, with the hyper-parameters β1 = 0.9, β2 = 0.95, eps = 1e-5. We use a weight decay of 0.1 and gradient clipping of 1.0. We train models with BFloat16 mixed precision to balance training efficiency and stability. Notably, we set attention_softmax_in_fp32 to True to execute the attention masking and Softmax operations in fp32, thereby preventing precision underflow.
id: 0e5ad265d96bf757d64481456eca5c91 - page: 15
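The optimizer and precision settings above, together with the learning-rate schedule described in the next paragraph, translate almost directly into PyTorch. The sketch below is illustrative only: model is assumed to be the backbone from the previous snippet, batch a tokenized input batch with labels, and the eps value follows the reconstructed 1e-5 above.

import torch
from transformers import get_constant_schedule_with_warmup

# AdamW hyper-parameters as stated above; the learning rate and 500-step warmup
# follow the schedule described in the next paragraph (constant at 1e-4 afterwards).
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,
    betas=(0.9, 0.95),
    eps=1e-5,
    weight_decay=0.1,
)
scheduler = get_constant_schedule_with_warmup(optimizer, num_warmup_steps=500)

def training_step(batch):
    # BFloat16 mixed precision; unlike FP16, BF16 needs no loss scaling.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(**batch).loss
    loss.backward()
    # Gradient clipping of 1.0, as described above.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
    return loss.item()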
The final pre-training corpus, SailCraft, is composed of approximately 200B tokens, integrating both SEA tokens and replay tokens, as elaborated in Section 4.4. We use a batch size of 4M tokens and a learning rate of 1e-4. Following a warmup period of 500 steps, the learning rate remains constant. This scheduling strategy encourages more transferable conclusions from simulations and allows for easier recovery from interrupted training sessions. Generally, Sailor models consume around 200B tokens, completing a full pass through the SailCraft corpus once. However, the Sailor-0.5B model undergoes training with 400B tokens, equivalent to 2 epochs.

6 Experiments

Sailor models are evaluated on several high-quality benchmarks, including question answering, commonsense reasoning, reading comprehension and examination.
id: d40c6548a58b0b0c5152fe722ef1bfbe - page: 15
6.1 Benchmark

Question Answering. The XQuAD dataset (Artetxe et al., 2020) (Thai, Vietnamese) and the TydiQA dataset (Clark et al., 2020) (Indonesian) were selected as representative benchmarks for question answering. The XQuAD dataset comprises 1,190 question-answer pairs from professional translations of the development set of SQuAD v1.1 (Rajpurkar et al., 2016). The TydiQA dataset covers 204,000 question-answer pairs sourced directly from data in their original languages, with human-written questions.
id: 3b4b4625e35737665e7272f2adc9c06b - page: 15
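For reference, the evaluation sets named above can be pulled from the Hugging Face Hub. The dataset and config names below, and the Indonesian id-prefix filter, follow the public Hub versions and are assumptions, not loading details given in the paper.

from datasets import load_dataset

# XQuAD: 1,190 professionally translated SQuAD v1.1 dev questions per language.
xquad_th = load_dataset("xquad", "xquad.th", split="validation")  # Thai
xquad_vi = load_dataset("xquad", "xquad.vi", split="validation")  # Vietnamese

# TydiQA GoldP ("secondary_task"), keeping only the Indonesian examples;
# assumes GoldP example ids are prefixed with the language name.
tydiqa = load_dataset("tydiqa", "secondary_task", split="validation")
tydiqa_id = tydiqa.filter(lambda ex: ex["id"].startswith("indonesian"))

print(len(xquad_th), len(xquad_vi), len(tydiqa_id))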
How to Retrieve?
# Search

curl -X POST "https://search.dria.co/hnsw/search" \
-H "x-api-key: <YOUR_API_KEY>" \
-H "Content-Type: application/json" \
-d '{"rerank": true, "top_n": 10, "contract_id": "XMHWYK5nAo-cklPdQWp1nGr6t3SQjJo6dAFPtpm5ayI", "query": "What is alexanDRIA library?"}'
        
# Query

curl -X POST "https://search.dria.co/hnsw/query" \
-H "x-api-key: <YOUR_API_KEY>" \
-H "Content-Type: application/json" \
-d '{"vector": [0.123, 0.5236], "top_n": 10, "contract_id": "XMHWYK5nAo-cklPdQWp1nGr6t3SQjJo6dAFPtpm5ayI", "level": 2}'