DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
Contract ID: VOo1Zrx7Ke6A385RLialwboa58W0bzJ2maAJgH5QbKs
File Type: PDF
Entry Count: 146
Embed. Model: jina_embeddings_v2_base_en
Index Type: hnsw

The rapid development of open-source large language models (LLMs) has been truly remarkable. However, the scaling laws described in previous literature present varying conclusions, which casts a dark cloud over scaling LLMs. We delve into the study of scaling laws and present our distinctive findings that facilitate the scaling of large-scale models in two commonly used open-source configurations, 7B and 67B. Guided by the scaling laws, we introduce DeepSeek LLM, a project dedicated to advancing open-source language models with a long-term perspective. To support the pre-training phase, we have developed a dataset that currently consists of 2 trillion tokens and is continuously expanding. We further conduct supervised fine-tuning (SFT) and direct preference optimization (DPO) on the DeepSeek LLM Base models, resulting in the creation of DeepSeek Chat models. Our evaluation results demonstrate that DeepSeek LLM 67B surpasses LLaMA-2 70B across a range of benchmarks, especially in the domains of code, mathematics, and reasoning. Furthermore, open-ended evaluations reveal that our DeepSeek LLM 67B Chat exhibits superior performance compared to GPT-3.5.

Table 9 | Held-out Dataset Evaluation. We have conducted a comparative analysis of our model against various baseline models of different sizes, namely Qwen 72B Chat (Bai et al., 2023), ChatGLM3 (Du et al., 2022), Baichuan2 (Yang et al., 2023), and Yi-34B Chat. Our observations indicate that there exists a significant performance gap between large models and small models on these held-out datasets, even if certain small models achieve promising results on conventional benchmarks. For instance, ChatGLM3 achieves a score of 52.4 on MBPP, a code test set, which is close to DeepSeek 67B. However, when evaluated on new benchmarks, its performance falls considerably short of DeepSeek 67B. A similar trend is observed on math datasets, where ChatGLM3 is very strong on GSM8K (72.3), but its Hungarian Exam score is inferior to that of the large models. Furthermore, the instruction-following capability demonstrates that total compute plays a crucial role.
id: d9129f7c2f99f254d73e746299091d9f - page: 19
The DeepSeek 7B and 67B models utilize the same training pipeline, but there is a significant disparity in their performance. Through our subjective evaluation, we have observed a notable discrepancy in intelligence across various tasks when scaling the model size to 67B. While DeepSeek 7B falls behind other smaller language models on standard benchmarks, its performance on held-out tasks is relatively commendable when compared to others.

5.4. Safety Evaluation

We profoundly recognize the importance of safety for general artificial intelligence. The premise for establishing a truly helpful artificial intelligence model is that it possesses values consistent with those of humans and exhibits friendliness towards humanity. We incorporate the assurance of model safety throughout the entire training process, including pre-training, SFT, and DPO. To validate the safety of our model, we established a 20-person expert team from various

Category | Subcategory | #Safety Answers / #Total Cases
id: 225db5b5e890c6b61b7e9bb357182f3c - page: 19
(Discrimination and Prejudice Questions) | (Ethnic and Racial), (Religious Belief), (Nationality and Geography), (Gender), (Age), (Occupation), (Health), (Discrimination in Other Aspects) | 486/500
(Infringement of Others' Legal Rights) | (Physical and Mental Health), (Legitimate Property), (Portrait Rights), (Reputation Rights), (Honor Rights), (Privacy Rights), (Information Rights), (Other Legal Rights) | 473/500
(Trade Secrets and Intellectual Property Rights) | (Infringing Others' Intellectual Property Rights), (Monopolistic and Unfair Competitive Actions), (Other Commercially Illegal and Non-compliant Behaviors), (Violating Business Ethics), (Disclosing Others' Trade Secrets) | 281/300
(Illegal and Non-compliant Behavior) | (Cults and Superstition), (Pornography), (Gambling), (Drugs and Prohibited Items), (Insults and Abuse), (Violent Behavior), (Involvement in Organized Crime), (Other Illegal and Non-compliant Behaviors) | 290/300
id: 773abce9d7911cc0fb6691faea34a0fa - page: 20
(Other Safety Issues) | (Issues of Illusion and Reality), (Time-sensitive Issues), (Self-recognition Problems), (Other Sensitive Topics) | 767/800

Table 10 | Our taxonomy for safety evaluation. The total number of test cases for each category and the number of safe answers provided by our model (DeepSeek-67B-Chat) are listed in the far-right column of the table. The annotation of test questions and the evaluation of the generated results are carried out by a professional human team. We can observe that our model demonstrates strong safety performance across the various types of safety test sets; summing the listed counts, it answers 2,297 of 2,400 test cases safely (roughly 95.7%).
id: 360e5602941c582a9f2a75377239f83e - page: 20
How to Retrieve?
# Search

curl -X POST "https://search.dria.co/hnsw/search" \
-H "x-api-key: <YOUR_API_KEY>" \
-H "Content-Type: application/json" \
-d '{"rerank": true, "top_n": 10, "contract_id": "VOo1Zrx7Ke6A385RLialwboa58W0bzJ2maAJgH5QbKs", "query": "What is alexanDRIA library?"}'
        
# Query

curl -X POST "https://search.dria.co/hnsw/query" \
-H "x-api-key: <YOUR_API_KEY>" \
-H "Content-Type: application/json" \
-d '{"vector": [0.123, 0.5236], "top_n": 10, "contract_id": "VOo1Zrx7Ke6A385RLialwboa58W0bzJ2maAJgH5QbKs", "level": 2}'