Created at 12pm, Mar 28
Proactive Software Development

Vulnerability Detection with Code Language Models: How Far Are We?
Contract ID: 9FP5ELPHrVpqw5jhplqcr9QFfgm0VkvUPuMGAzXAZTg

File Type: DOCX
Entry Count: 68
Embed. Model: jina_embeddings_v2_base_en
Index Type: hnsw

Abstract—In the context of the rising interest in code language models (code LMs) and vulnerability detection, we study the effectiveness of code LMs for detecting vulnerabilities. Our analysis reveals significant shortcomings in existing vulnerability datasets, including poor data quality, low label accuracy, and high duplication rates, leading to unreliable model performance in realistic vulnerability detection scenarios. Additionally, the evaluation methods used with these datasets are not representative of real-world vulnerability detection. To address these challenges, we introduce PRIMEVUL, a new dataset for training and evaluating code LMs for vulnerability detection. PRIMEVUL incorporates a novel set of data labeling techniques that achieve comparable label accuracy to human-verified benchmarks while significantly expanding the dataset. It also implements a rigorous data de-duplication and chronological data splitting strategy to mitigate data leakage issues, alongside introducing more realistic evaluation metrics and settings. This comprehensive approach aims to provide a more accurate assessment of code LMs’ performance in real-world conditions. Evaluating code LMs on PRIMEVUL reveals that existing benchmarks significantly overestimate the performance of these models. For instance, a state-of-the-art 7B model scored 68.26% F1 on BigVul but only 3.09% F1 on PRIMEVUL. Attempts to improve performance through advanced training techniques and larger models like GPT-3.5 and GPT-4 were unsuccessful, with results akin to random guessing in the most stringent settings. These findings underscore the considerable gap between current capabilities and the practical requirements for deploying code LMs in security roles, highlighting the need for more innovative research in this domain.

Concretely, we find the original commit for each sample and collect the time of that commit, tying it with the sample. Then, we sort the samples according to the commit time: the oldest 80% form the train set, the middle 10% the validation set, and the most recent 10% the test set. We also make sure that samples from the same commit are not split across different sets. This ensures that the vulnerability detection model is trained on past data and tested on future data.

B. More Realistic and Challenging Evaluation

1) Vulnerability Detection Score: The primary goal in vulnerability detection is to catch as many real vulnerabilities as possible (measured by the False Negative Rate, or FNR, which we want to be low). Meanwhile, from a practical perspective, a certain level of false positives can be manageable.
id: e314d45114e41ed56a10ea03f264254e - page: 9
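The chronological, commit-grouped 80/10/10 split described above can be sketched as follows (the sample schema and function name are illustrative assumptions, not from the paper):

```python
from collections import defaultdict

def chronological_split(samples, train=0.8, valid=0.1):
    """Split samples by commit time: oldest ~80% train, next ~10%
    validation, newest ~10% test, never splitting one commit's
    samples across sets. Each sample is assumed to be a dict with
    'commit' and 'time' keys."""
    # Group samples by commit so a commit's samples stay together.
    by_commit = defaultdict(list)
    for s in samples:
        by_commit[s["commit"]].append(s)
    # Order commits from oldest to newest by their commit time.
    commits = sorted(by_commit, key=lambda c: by_commit[c][0]["time"])
    n = len(samples)
    train_set, valid_set, test_set = [], [], []
    for c in commits:
        # Route the whole commit to the first split whose quota is open.
        if len(train_set) < train * n:
            train_set += by_commit[c]
        elif len(train_set) + len(valid_set) < (train + valid) * n:
            valid_set += by_commit[c]
        else:
            test_set += by_commit[c]
    return train_set, valid_set, test_set
```

Because whole commits are routed together, the split ratios are approximate when a commit contributes several samples, which matches the no-leakage requirement above.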
Therefore, a metric that focuses on minimizing the false negative rate within a tolerable level of false positives is essential. To this end, we propose the Vulnerability Detection Score (VDS), which evaluates the
id: 4f10547145ae8f9142eb3fb1a964c1e4 - page: 9
False Negative Rate of a vulnerability detector within an acceptable False Positive Rate, i.e., FNR@(FPR ≤ r), where r ∈ [0%, 100%] is a configurable parameter. In this paper, we choose a tolerance rate r = 0.5% to perform the evaluation in Section V.

2) Paired Functions and Pair-wise Evaluation: As discussed in Section II-D2, evaluating the models on paired functions, i.e., vulnerable and benign versions of the same code, could potentially reveal whether a model merely relies on superficial text patterns to make predictions without grasping the underlying security implications, indicating areas where the model needs improvement to reduce false positives and false negatives. We collected 5,480 such pairs in PRIMEVUL, significantly more than in existing paired datasets [12, 20]. Concretely, we match the vulnerable functions with their patches in PRIMEVUL to construct such pairs. As we show in Table III, the paired vulnerable functions are fewer than all vulnerable functions, since not all vulnerable
id: 7fe64b42526f079150e7e72f1faf98c9 - page: 10
(e.g., a patch could delete the vulnerable function), and we only include those challenging pairs that share at least 80% of the string between the vulnerable and benign versions. Accordingly, we also propose a pair-wise evaluation method. The core idea is to evaluate the model's predictions on the entire pair as a single entity, emphasizing the importance of correctly identifying both the presence and absence of vulnerabilities in a textually similar context, while recording the model's concrete prediction behaviors. We define four outcomes of the pair-wise prediction:

Pair-wise Correct Prediction (P-C): The model correctly predicts the ground-truth labels for both elements of a pair.
Pair-wise Vulnerable Prediction (P-V): The model incorrectly predicts both elements of the pair as vulnerable.
Pair-wise Benign Prediction (P-B): The model incorrectly predicts both elements of the pair as benign.
Pair-wise Reversed Prediction (P-R): The model incorrectly and inversely predicts the labels for both elements of the pair.
id: 332d609ae0f1781b185df0cc1d0af526 - page: 10
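The four pair-wise outcomes reduce to a simple mapping over the two predicted labels, since the ground truth of every pair is fixed at (vulnerable, benign). A minimal sketch, with an assumed 0/1 label interface:

```python
def pairwise_outcome(pred_vuln, pred_benign):
    """Classify a model's predictions on one (vulnerable, benign) pair.
    pred_vuln / pred_benign are the predicted labels (1 = vulnerable)
    for the vulnerable function and its patched counterpart; the
    ground truth is (1, 0). Interface is an illustrative assumption."""
    if pred_vuln == 1 and pred_benign == 0:
        return "P-C"  # both elements predicted correctly
    if pred_vuln == 1 and pred_benign == 1:
        return "P-V"  # both flagged vulnerable
    if pred_vuln == 0 and pred_benign == 0:
        return "P-B"  # both passed as benign
    return "P-R"      # labels reversed on both elements
```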
How to Retrieve?
# Search

curl -X POST "https://search.dria.co/hnsw/search" \
-H "x-api-key: <YOUR_API_KEY>" \
-H "Content-Type: application/json" \
-d '{"rerank": true, "top_n": 10, "contract_id": "9FP5ELPHrVpqw5jhplqcr9QFfgm0VkvUPuMGAzXAZTg", "query": "What is alexanDRIA library?"}'
        
# Query

curl -X POST "https://search.dria.co/hnsw/query" \
-H "x-api-key: <YOUR_API_KEY>" \
-H "Content-Type: application/json" \
-d '{"vector": [0.123, 0.5236], "top_n": 10, "contract_id": "9FP5ELPHrVpqw5jhplqcr9QFfgm0VkvUPuMGAzXAZTg", "level": 2}'