Abstract—As malicious cyber threats become more sophisticated in breaching computer networks, the need for effective intrusion detection systems (IDSs) becomes crucial. Techniques such as Deep Packet Inspection (DPI) have been introduced to allow IDSs to analyze the content of network packets, providing more context for identifying potential threats. IDSs traditionally rely on anomaly-based and signature-based detection techniques to detect unrecognized and suspicious activity. Deep learning techniques have shown great potential in DPI for IDSs due to their efficiency in learning intricate patterns from the packet content being transmitted through the network. In this paper, we propose a revolutionary DPI algorithm based on transformers, adapted with a classifier head for the purpose of detecting malicious traffic. Transformers learn the complex content of sequence data and generalize well to similar scenarios thanks to their self-attention mechanism. Our proposed method uses the raw payload bytes that represent the packet contents and is deployed as a man-in-the-middle. The payload bytes are used to detect malicious packets and classify their types. Experimental results on the UNSW-NB15 and CIC-IOT23 datasets demonstrate that our transformer-based model is effective in distinguishing malicious from benign traffic in the test dataset, attaining an average accuracy of 79% on binary classification and 72% on multi-class classification, both using solely payload bytes.
Index Terms—Malware Detection, Malware Classification, Deep Packet Inspection, Transport Layer Security
The intermediate size determines the dimension of the intermediate layer within each transformer layer. This value helps ensure that the network can process and transform the input data at each layer effectively. The maximum position embeddings denote the maximum length of the input sequence. For this study, we do not restrict the maximum payload length, which is up to 1460 bytes. The last hyperparameter is the number of labels, which is set to 2 for binary classification or 3 for multi-class classification. These configurations remained consistent throughout our study of the payload bytes.
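As an illustration only, the hyperparameters named above could be collected in a small configuration object. This is a minimal sketch, not the authors' code: the class name `PayloadTransformerConfig`, the `vocab_size` of 256, and the `intermediate_size` value are assumptions introduced here; only the 1460-byte maximum length and the label counts (2 or 3) come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PayloadTransformerConfig:
    """Hypothetical container for the hyperparameters discussed above."""

    # Assumption: one token per possible byte value (0..255).
    vocab_size: int = 256
    # Assumption: placeholder value; the text does not state it.
    intermediate_size: int = 3072
    # From the text: payloads are up to 1460 bytes long.
    max_position_embeddings: int = 1460
    # From the text: 2 for binary, 3 for multi-class classification.
    num_labels: int = 2

    def __post_init__(self):
        if self.num_labels not in (2, 3):
            raise ValueError("num_labels must be 2 (binary) or 3 (multi-class)")
        if self.max_position_embeddings < 1:
            raise ValueError("max_position_embeddings must be positive")


binary_cfg = PayloadTransformerConfig()
multi_cfg = PayloadTransformerConfig(num_labels=3)
```

Keeping the configuration immutable (`frozen=True`) reflects the statement that these settings stayed the same across experiments, with only the number of labels changing between the binary and multi-class runs.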
B. Model Training

The model is trained using a 70-20-10 split: 70% of the data is used for training, 20% is used for testing, and 10% is used for validation. To enhance the model's generalization across

Fig. 1: The overall architecture of the proposed packet detection algorithm. The payload is shown as the input to the model. First, the input payload is pre-processed and converted to hexadecimal format. Next, every two hexadecimal characters are converted to a decimal integer value in the range [0, 255]. The resulting integer sequence is then fed into the embedding layer to obtain its embedding vectors, which are processed by the self-attention mechanism. The classifier head then predicts whether the packet is benign or malicious.
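The data split and the pre-processing steps in the caption of Fig. 1 can be sketched as follows. This is a hedged illustration under stated assumptions: the function names `payload_to_token_ids` and `split_70_20_10` are hypothetical, the shuffle seed is arbitrary, and the `max_len` guard is a safety check added here (the text states that payloads are at most 1460 bytes, so it should not trigger in practice).

```python
import random


def payload_to_token_ids(payload: bytes, max_len: int = 1460) -> list:
    """Convert raw payload bytes to integers in [0, 255] via hexadecimal."""
    # Step 1: hexadecimal representation, two characters per byte.
    hex_string = payload.hex()
    # Step 2: every two hex characters become one decimal integer.
    ids = [int(hex_string[i:i + 2], 16) for i in range(0, len(hex_string), 2)]
    # Assumption: guard against inputs longer than the position embeddings.
    return ids[:max_len]


def split_70_20_10(samples, seed: int = 0):
    """Shuffle and split into 70% train, 20% test, 10% validation."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 70 // 100
    n_test = len(items) * 20 // 100
    train = items[:n_train]
    test = items[n_train:n_train + n_test]
    val = items[n_train + n_test:]  # remainder, roughly 10%
    return train, test, val


token_ids = payload_to_token_ids(b"GET")  # [71, 69, 84]
train, test, val = split_70_20_10(range(100))
```

Going through the hexadecimal string mirrors the description in the caption; calling `list(payload)` on a `bytes` object yields the same integers directly.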
The training process of the transformer and classification head is conducted end-to-end. The transformer is responsible for encoding the input sequences into representations that capture the patterns and dependencies in the data. The classification head takes these representations and maps them to their respective class labels. Cross-entropy loss is used as the cost function to train the model, and the AdamW optimizer is used with a learning rate of 2e-5. The AdamW optimizer provides weight decay regularization, which is crucial because it limits the magnitude of the weights, preventing the model from becoming overly complex and too closely fitted to the training data, which in turn mitigates overfitting.
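To make the loss and the weight decay concrete, the sketch below computes cross-entropy for one sample and applies a single AdamW update to a scalar weight in plain Python. Only the learning rate of 2e-5 comes from the text; the values of `beta1`, `beta2`, `eps`, and `weight_decay` are assumed defaults, and the function names are hypothetical. A real training run would use a deep learning framework's implementation rather than this scalar version.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one sample from raw class scores (logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def adamw_step(w, grad, m, v, t, lr=2e-5, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a scalar weight; t is the 1-based step count.

    Assumption: beta1, beta2, eps and weight_decay are illustrative
    defaults, not values reported in the text.
    """
    # Exponential moving averages of the gradient and its square.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction for the zero-initialised averages.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay: shrink the weight independently of the gradient.
    w = w - lr * weight_decay * w
    # Adam step using the bias-corrected moments.
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


loss = cross_entropy([0.0, 0.0], target=0)  # equals log(2) for two equal logits
w, m, v = adamw_step(w=1.0, grad=0.0, m=0.0, v=0.0, t=1)
```

With a zero gradient the weight still shrinks slightly, which is the regularizing effect described above: the decay term pulls weights toward zero regardless of the loss gradient.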
The scheduler gradually increases the learning rate from zero to a specified learning rate during the warmup period, then linearly decreases the learning rate over the remaining epochs of training. This process encourages the model to find a solution that generalizes to the test and validation sets, rather than just the training set. This learning rate strategy also helps prevent overfitting, as it keeps the model from converging too quickly to a solution that may be specific to the training data. The training of the model was conducted on NVIDIA GeForce RTX 2080 GPUs. These GPUs provided the computational power needed for faster processing and more efficient learning from the datasets. The model
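The warmup-then-decay schedule described above can be written as a function of the training step. This is a minimal sketch under assumptions: the name `linear_warmup_decay_lr` is hypothetical, the step counts in the example are arbitrary, and decaying all the way to zero at the final step is an assumption (the text only says the rate decreases linearly after warmup). The peak rate of 2e-5 is the learning rate given in the text.

```python
def linear_warmup_decay_lr(step, total_steps, warmup_steps, peak_lr=2e-5):
    """Learning rate at a given step: linear warmup, then linear decay."""
    if step < warmup_steps:
        # Warmup: rise linearly from 0 to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Decay: fall linearly from peak_lr to 0 at total_steps (assumption).
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Example with hypothetical step counts: 100 warmup steps out of 1000.
schedule = [linear_warmup_decay_lr(s, 1000, 100) for s in (0, 50, 100, 550, 1000)]
```

In this example the rate starts at zero, reaches the peak at step 100, is back to half the peak at step 550, and returns to zero at step 1000.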