Introduction
Distributed denial-of-service (DDoS) attacks are a type of cybersecurity threat in which attackers compromise multiple systems using malware and direct them at a target. These attacks typically overwhelm a target server with a high volume of requests, leading to severe service disruptions. By exhausting bandwidth and computational resources, DDoS attacks render systems unavailable to legitimate users. Their effects include service interruptions, revenue loss, reputational damage, and increased operational costs, which makes detecting and mitigating DDoS attacks a critical priority for organizations.
To address these challenges, we introduce a robust framework for DDoS detection. Our approach combines the following processes:
- generating synthetic data using a variational autoencoder (VAE) synthesizer
- capturing real data from a virtual network consisting of a server and two clients
- balancing data with the Synthetic Minority Oversampling Technique (SMOTE) and Tomek links (SMOTETomek)
- optimizing features through recursive feature elimination (RFE)
This hybrid method achieved high accuracy rates, demonstrating its effectiveness in distinguishing DDoS attacks from normal traffic.
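To make the balancing step more concrete, the following is a minimal sketch of the idea behind SMOTETomek, written with the Python standard library only. It is not the authors' implementation: the function names (`smote`, `remove_tomek_links`, `smote_tomek`), the toy two-feature data, and the choice of `k` are all assumptions made for illustration. SMOTE creates new minority samples by interpolating between a minority sample and one of its nearest minority neighbours; Tomek-link cleaning then removes pairs of opposite-class samples that are each other's nearest neighbour.

```python
import math
import random

def nearest(idx, points, candidates):
    """Index of the candidate closest to points[idx], excluding idx itself."""
    return min((j for j in candidates if j != idx),
               key=lambda j: math.dist(points[idx], points[j]))

def smote(minority, n_new, k=3, rng=None):
    """Create n_new synthetic points, each on the segment between a minority
    sample and one of its k nearest minority neighbours (needs >= 2 samples)."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # relative position between base and other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic

def remove_tomek_links(X, y):
    """Drop both members of every Tomek link: two samples of opposite
    classes that are mutual nearest neighbours."""
    all_idx = range(len(X))
    nn = [nearest(i, X, all_idx) for i in all_idx]
    linked = {i for i in all_idx if nn[nn[i]] == i and y[i] != y[nn[i]]}
    keep = [i for i in all_idx if i not in linked]
    return [X[i] for i in keep], [y[i] for i in keep]

def smote_tomek(X, y, minority_label=1, k=3, seed=0):
    """Oversample the minority class to parity, then clean Tomek links."""
    minority = [x for x, lab in zip(X, y) if lab == minority_label]
    n_major = len(X) - len(minority)
    new = smote(minority, n_major - len(minority), k, random.Random(seed))
    X_all = list(X) + new
    y_all = list(y) + [minority_label] * len(new)
    return remove_tomek_links(X_all, y_all)

# Hypothetical toy data: 8 "normal" samples (label 0) and 3 "attack" samples (label 1).
majority = [(i * 0.1, (i % 3) * 0.1) for i in range(8)]
minority = [(10.0, 10.0), (10.5, 10.2), (10.2, 10.6)]
X = majority + minority
y = [0] * len(majority) + [1] * len(minority)

X_bal, y_bal = smote_tomek(X, y)
print("class counts after balancing:", y_bal.count(0), y_bal.count(1))
```

In practice a library implementation would normally be used instead of hand-written code, for example the `SMOTETomek` class in the imbalanced-learn package; the text here does not state which implementation was used.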
By integrating synthetic and real data, balancing skewed datasets, and leveraging feature elimination techniques, we provide a scalable, reliable framework for detecting malicious network activity. The findings affirm the validity of this approach and underscore its potential to mitigate cyberattacks that can cause significant operational and financial losses. This work contributes to the field by offering an innovative pipeline for anomaly detection and infrastructure protection on high-dimensional datasets. The rest of the paper presents a literature review, an overview of our methodology, a discussion of the results with an interpretation of the findings, and a conclusion with recommendations for future work.
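As a concrete picture of the recursive feature elimination step named above, the sketch below repeatedly fits a simple model, discards the feature with the smallest absolute weight, and refits on the remaining features. Everything in it is an assumption for illustration: the estimator (a small logistic regression trained by gradient descent), the function names `fit_logistic` and `rfe`, and the synthetic three-feature data. The text does not say which estimator the authors paired with RFE.

```python
import math
import random

def fit_logistic(X, y, epochs=200, lr=0.5):
    """Fit logistic-regression weights by batch gradient descent.
    No intercept is used, so the features are assumed to be roughly centred."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        grad = [0.0] * d
        for row, label in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, row))
            z = max(-30.0, min(30.0, z))  # keep exp() numerically safe
            p = 1.0 / (1.0 + math.exp(-z))
            for j in range(d):
                grad[j] += (p - label) * row[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

def rfe(X, y, n_keep):
    """Recursive feature elimination: refit, drop the feature with the
    smallest |weight|, and repeat until n_keep features remain.
    Returns the original column indices of the surviving features."""
    remaining = list(range(len(X[0])))
    while len(remaining) > n_keep:
        X_sub = [[row[j] for j in remaining] for row in X]
        w = fit_logistic(X_sub, y)
        weakest = min(range(len(remaining)), key=lambda i: abs(w[i]))
        del remaining[weakest]
    return remaining

# Hypothetical data: column 1 carries the class signal, columns 0 and 2 are noise.
rng = random.Random(42)
X, y = [], []
for i in range(200):
    label = i % 2
    signal = (1.0 if label else -1.0) + rng.gauss(0, 0.1)
    X.append([rng.gauss(0, 1), signal, rng.gauss(0, 1)])
    y.append(label)

selected = rfe(X, y, n_keep=1)
print("selected feature indices:", selected)
```

Ranking features by weight magnitude only makes sense when the features are on comparable scales, so real traffic features would typically be standardized first. Library implementations exist as well, such as `RFE` in scikit-learn's `sklearn.feature_selection` module, which wraps any estimator that exposes coefficients or feature importances.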
Literature Review
Advances in machine learning, deep learning, and large language models have produced open-source libraries and tools that significantly enhance the ability to detect and mitigate cyberattacks. Several studies have demonstrated the potential of synthetic data generated by generative adversarial networks (GANs) and VAEs to augment datasets when real data are scarce, imbalanced, unreliable, or skewed (Khakurel et al., 2022; Mehrabi et al., 2021). Synthetic data generated from labeled data can be used to train robust models and improve classification outcomes. Some studies (Chalé & Bastian, 2022; Nikolov, 2023) have shown that combining synthetic and real data can achieve results comparable to using real data alone, whereas models trained only on synthetic data tend to underperform. However, other researchers (Halvorsen & Gebremedhin, 2024; Llugiqi & Mayer, 2022) have reported that models trained exclusively on synthetic data perform as well as, or in some cases better than, models trained on real data. Enhanced feature extraction has also been shown to improve anomaly detection speed and accuracy (Patil et al., 2022; Wang et al., 2022).
Machine learning algorithms are commonly used to evaluate the accuracy of methods for detecting various types of cybercrime. For instance, Kilincer et al. (2022) and Oneto and Chiappa (2020) used Light Gradient-Boosting Machine (LightGBM) and Extreme Gradient Boosting (XGBoost) on the Comprehensive Cyber Security Intrusion Detection Dataset (CCiDD) and its subsets, CCiDD_A and CCiDD_B. Their findings revealed that LightGBM outperformed XGBoost in detecting cyberattacks within these datasets. Similarly, Louk and Tama (2023) and Chen et al. (2023) reported that ensemble methods such as gradient boosting machine, XGBoost, LightGBM, and CatBoost were effective for intrusion detection. Among these, CatBoost consistently achieved superior performance in identifying cyberattacks.