Security firewalls generate large volumes of log data detailing whether network connections are allowed or blocked. Manually analyzing these logs is labor-intensive and error-prone, motivating the use of machine learning to automate firewall decisions. This white paper investigates a Support Vector Machine (SVM) and a Stochastic Gradient Descent (SGD) classifier for multi-class classification of internet firewall actions. We preprocess a public firewall log dataset of 65,532 connection records with feature encoding and scaling, then train and evaluate both models. The SVM achieved slightly higher accuracy (99.9% versus 99.6% overall) but required significantly longer training time, while the SGD classifier trained roughly an order of magnitude faster with only a minor drop in accuracy. Confusion matrix analysis shows that both models accurately predict the predominant “Allow,” “Deny,” and “Drop” actions but struggle with the rare “Reset-both” action. These results demonstrate that machine learning can effectively automate firewall log analysis and approach expert-level accuracy on common traffic, offering significant improvements in speed and consistency. We conclude that an SVM is suitable when maximum accuracy is paramount and computational cost is acceptable, whereas an SGD model is a pragmatic choice for real-time or large-scale deployments. We recommend integrating such models into security workflows to augment firewall decision-making, while addressing rare-case handling to further improve defensive capabilities.
Firewalls are a fundamental component of network security, acting as the first line of defense by filtering incoming and outgoing traffic (Aljabri et al., 2022). Enterprise firewalls continuously log connection attempts and the actions taken (“allow,” “deny,” etc.), creating rich data for security analysis. However, firewall logs are voluminous and complex, making manual analysis difficult and time-consuming (Aljabri et al., 2022). Security analysts must interpret logs to identify malicious activity or misconfigurations, a process that can be error-prone given the scale and speed of modern network traffic. There is a clear need for automation in analyzing firewall logs and deciding which connections to permit or block in real time. Machine learning (ML) techniques have emerged as effective tools for this purpose, as they can learn patterns of normal and abnormal traffic and handle complex attack scenarios that are hard to codify with static rules (Aljabri et al., 2022).
Problem Definition: In this case study, we focus on automating firewall decision-making by classifying internet connection requests into the action that the firewall should take (e.g., allow or block). Specifically, we examine a multi-class classification task with four possible classes of firewall actions: “Allow,” “Deny,” “Drop,” and “Reset-both.” These correspond to the typical outcomes in a firewall’s log, where Reset-both is a special action indicating that the connection is reset on both ends. The dataset for this study comes from a university firewall log (Internet Firewall 2019 dataset) comprising 65,532 logged connection records with 12 features each (UCI Machine Learning Repository, 2019). The goal is to predict the Action for each connection entry based on features such as source/destination ports, byte counts, packet counts, and durations. Successful automation of this prediction can enable a firewall to learn from historical data and make consistent decisions on new traffic, reducing reliance on manual rule updates.
Importance of Automation: Automating firewall log classification has both operational and security benefits. From an operational standpoint, a machine learning model can analyze logs and enforce allow/deny decisions in fractions of a second, far faster than a human analyst and without fatigue. This speed is crucial for handling high-throughput networks and stopping attacks in real time. From a security standpoint, ML-based classification can detect subtle patterns or combinations of features that indicate malicious traffic, potentially catching threats that simple rule-based filters might miss (Aljabri et al., 2022). Moreover, using ML for firewall decisions can adapt to evolving network behavior; the model can be retrained on new data to update its knowledge, whereas static rules may become outdated as attackers change tactics. Ultimately, the application of ML to firewall logs aims to improve the reliability and responsiveness of network defense. In this study, we build two such models – an SVM and an SGD-based classifier – and compare their performance in terms of accuracy, speed, and practicality for deployment. SVMs have a strong foundation in statistical learning and often yield high accuracy, but can be computationally intensive. In contrast, an SGD classifier (a linear model trained with stochastic gradient descent) offers a lightweight, scalable approach that might sacrifice some accuracy for a significant gain in speed. By evaluating both on the same firewall dataset, we can determine which approach (or what balance) is better suited for a real-world cybersecurity deployment.
Our methodology encompasses the data preprocessing, feature engineering, and model training steps required to build reliable classification models for firewall actions. We first performed extensive data cleaning and encoding to prepare the raw firewall log data for machine learning. Then, we selected and configured two different algorithms (SVM and SGD) to train predictive models. We also conducted hyperparameter tuning for each model to optimize performance. Finally, we evaluated the trained models on a held-out test set using appropriate accuracy metrics and confusion matrix analysis. All experiments were implemented in Python (using pandas and scikit-learn libraries) and were conducted on the same dataset split to ensure a fair comparison. Key steps in our methodology are detailed below.
Dataset Overview: The dataset “log2.csv” contains 65,532 records of internet traffic captured by a university’s firewall (UCI Machine Learning Repository, 2019). Each record has 11 input features describing the connection and 1 target label (Action) indicating the firewall’s decision. Notable features include: Source Port and Destination Port (numeric codes for the port numbers on either end of the connection), NAT Source Port and NAT Destination Port (translated port numbers after network address translation), Bytes, Bytes Sent, and Bytes Received (byte counts for the session), Packets (total number of packets), Elapsed Time (sec) (duration of the session), and pkts_sent and pkts_received (packet counts in each direction). The Action feature is a categorical variable with four possible classes: “allow,” “deny,” “drop,” and “reset-both” (UCI Machine Learning Repository, 2019). There are no missing values in this dataset (all records are complete), and about half of the features are numeric while the others are essentially categorical (ports encoded as integers).
Categorical Encoding: A critical preprocessing step was to properly encode the port features, which are given as integers but actually represent nominal categories (port numbers have no ordinal significance). Treating them as numeric could mislead the model. We converted Source Port, Destination Port, NAT Source Port, and NAT Destination Port from integer type to categorical type. We then applied one-hot encoding to these port features to create binary indicator columns for each unique port value. However, since there are hundreds of distinct ports in the logs, one-hot encoding them directly would explode the feature dimension. To mitigate this, we performed frequency-based grouping: infrequent port values (those appearing in very few records) were grouped into an “Other” category before encoding. Specifically, we set a threshold (any port with fewer than 10 occurrences) for consolidation into the “infrequent” group. This reduced the cardinality of the port categories and prevented sparse one-hot features. After this grouping, dummy variables were created for the frequent port categories (one column per port, plus a common “Other” column for all rare ports). This feature engineering step, sketched below, preserves important categorical distinctions (such as well-known ports 80 or 443) while keeping the feature set at a manageable size.
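The port-grouping and one-hot encoding step can be illustrated with a short pandas sketch. This is a minimal reconstruction based on the description above; the column names and the threshold of 10 occurrences come from the text, while the exact helper logic is an assumption.

```python
import pandas as pd

# Load the firewall log (file name as given in the Dataset Overview)
df = pd.read_csv("log2.csv")

port_cols = ["Source Port", "Destination Port",
             "NAT Source Port", "NAT Destination Port"]
MIN_COUNT = 10  # ports seen fewer than 10 times are pooled together

for col in port_cols:
    counts = df[col].value_counts()
    rare = set(counts[counts < MIN_COUNT].index)
    # Treat ports as nominal categories; collapse infrequent values into "Other"
    df[col] = df[col].map(lambda p: "Other" if p in rare else str(p))

# One-hot encode the now low-cardinality port categories
df = pd.get_dummies(df, columns=port_cols)
```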
Numeric Features and Scaling: The remaining features (byte counts, packet counts, time duration) are continuous or count-based. We kept these as numeric and did not discretize them, since their magnitude and scale could be informative (e.g., a very large number of bytes might correlate with an “allow” or a long session). Given the different units and ranges of these features (for example, Bytes can be in the thousands while Elapsed Time is in seconds), we standardized the numeric feature columns to zero mean and unit variance using a StandardScaler. Standardization is particularly important for SVM models, which are sensitive to feature scale because the margin is defined in terms of distances in feature space. It also benefits the SGD classifier’s convergence. We fitted the scaler on the training set and applied the same transformation to the test set to avoid data leakage.
Train-Test Split: After preprocessing, we split the dataset into a training set and a testing set. We allocated 70% of the data for training and 30% for testing (approximately 45,872 training samples and 19,660 test samples) to have a substantial hold-out set for unbiased evaluation. The split was done randomly with a fixed random seed to ensure reproducibility. Notably, the class distribution in the data is highly imbalanced – the “allow” action is by far the most common outcome, while “reset-both” is extremely rare (on the order of only a few dozen instances out of 65k, <0.1% of the data). We retained all four classes in both the training and test sets (no class was dropped), since even the underrepresented class is of interest for this analysis. We did not apply any resampling techniques (such as oversampling or undersampling), in order to evaluate the models on the natural class distribution; however, we anticipated that the minority class (reset-both) might be difficult for the models to learn due to insufficient examples. A sketch of the split and scaling steps follows.
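The following sketch covers the split-then-scale workflow described in the two preceding paragraphs. The numeric column names are taken from the dataset description; the use of stratification to keep all four classes in both splits is an assumption (the text only states that the split was random with a fixed seed).

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

numeric_cols = ["Bytes", "Bytes Sent", "Bytes Received", "Packets",
                "Elapsed Time (sec)", "pkts_sent", "pkts_received"]

X = df.drop(columns=["Action"])
y = df["Action"]
X[numeric_cols] = X[numeric_cols].astype(float)  # avoid int->float dtype warnings when scaling

# 70/30 hold-out split with a fixed seed for reproducibility;
# stratify=y is one way to guarantee the rare class appears in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)

# Fit the scaler on the training data only to avoid leakage into the test set
scaler = StandardScaler()
X_train[numeric_cols] = scaler.fit_transform(X_train[numeric_cols])
X_test[numeric_cols] = scaler.transform(X_test[numeric_cols])
```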
We explored two supervised machine learning algorithms for multi-class classification of the firewall actions: Support Vector Machine (SVM) and Stochastic Gradient Descent (SGD) classifier. Both are linear classifiers in this context (we use a linear kernel for SVM), but they differ significantly in training methodology and performance characteristics. Using scikit-learn, we configured each model as described below.
Support Vector Machine (SVM): We used an SVM with a linear kernel (scikit-learn’s SVC with kernel='linear') to predict the action. SVMs find the optimal separating hyperplane between classes by maximizing the margin around the support vectors. We chose a linear kernel after initial experimentation indicated that non-linear kernels (polynomial or RBF) did not yield substantial accuracy gains on this dataset but greatly increased runtime; a linear SVM is also easier to interpret in terms of feature weights. The SVM’s primary hyperparameter is the regularization parameter C, which controls the trade-off between maximizing the margin and minimizing classification error. We performed a hyperparameter search over several values of C (0.1, 0.3, 0.5, 1.0, and 2.0) to find the best setting. Lower C values soften the margin (allowing some misclassifications but reducing overfitting), while higher C values push the model to classify all training points correctly at the risk of a smaller margin. In our experiments, a moderately low C (around 0.5) provided the best results in this sweep, suggesting that a little regularization improved generalization. We did not use extensive cross-validation for the SVM due to its high computational cost; instead, we relied on the single train/test split and our hyperparameter sweep to select C. The final SVM model was trained on the full training set with kernel='linear' and C=0.5 (other parameters at scikit-learn defaults). Training the SVM on ~45k samples and several hundred features (after encoding) was the most time-consuming part of this study; we observed training times on the order of tens of seconds. For multi-class problems, scikit-learn’s SVC trains pairwise (one-vs-one) classifiers internally and, by default, exposes the decision function in a one-vs-rest shape, yielding one decision score per class (four here).
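A minimal sketch of this SVM configuration follows, assuming the preprocessed X_train/X_test split from the earlier sketches; the sweep loop is a reconstruction of the C search described above, not the original code.

```python
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Sweep the regularization parameter C over the values mentioned in the text
for C in [0.1, 0.3, 0.5, 1.0, 2.0]:
    model = SVC(kernel="linear", C=C)
    model.fit(X_train, y_train)
    print(f"C={C}: accuracy={accuracy_score(y_test, model.predict(X_test)):.4f}")

# Final model with the best reported setting (other parameters at scikit-learn defaults)
svm = SVC(kernel="linear", C=0.5)
svm.fit(X_train, y_train)
svm_pred = svm.predict(X_test)
```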
Stochastic Gradient Descent (SGD) Classifier: As a lightweight alternative to the SVM, we implemented a linear classifier trained with stochastic gradient descent. Specifically, we used SGDClassifier from scikit-learn, configured with the hinge loss (making it effectively a linear SVM as well, but trained via incremental updates). The SGD classifier optimizes the same objective as an SVM (hinge loss with regularization) but does so by iteratively updating weights on small batches of data rather than solving a global quadratic optimization problem. This typically makes SGD much faster on large datasets, at the potential cost of not reaching the exact optimal solution. We set the maximum number of iterations to 1000 and a convergence tolerance of 1e-3, and enabled early stopping to prevent overfitting. The key hyperparameters for the SGD model are the regularization strength (the alpha parameter, which plays roughly the inverse role of SVM’s C) and the penalty type (L1 or L2 regularization). We performed a grid search over alpha values {1e-3, 1e-4, 1e-5} and tried both L1 and L2 penalties. The combination that yielded the highest validation accuracy was a very small alpha (1e-5) with the L1 penalty (which encourages sparsity in the model weights), indicating that the model benefited from minimal regularization and could use a sparse solution. The final SGDClassifier was trained with those parameters (alpha=1e-5, penalty='l1') on the full training set. Training time for SGD was significantly faster than for the SVM: the model converged in only a few seconds on the same training data, demonstrating the efficiency of stochastic gradient descent for large-scale learning. Because SGD is sensitive to feature scale, our earlier standardization of the numeric features was important for stable and quick convergence.
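The corresponding SGD configuration can be sketched as follows; the grid search shown uses 3-fold cross-validation on the training set, which is an assumption about how the alpha/penalty combinations were compared.

```python
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# Grid over the regularization strength and penalty type mentioned in the text
param_grid = {"alpha": [1e-3, 1e-4, 1e-5], "penalty": ["l1", "l2"]}
search = GridSearchCV(
    SGDClassifier(loss="hinge", max_iter=1000, tol=1e-3,
                  early_stopping=True, random_state=42),
    param_grid, cv=3)
search.fit(X_train, y_train)
print("best parameters:", search.best_params_)

# Final model with the reported best setting: alpha=1e-5, L1 penalty
sgd = SGDClassifier(loss="hinge", alpha=1e-5, penalty="l1",
                    max_iter=1000, tol=1e-3, early_stopping=True,
                    random_state=42)
sgd.fit(X_train, y_train)
sgd_pred = sgd.predict(X_test)
```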
Training Procedure: Both models were trained on the same training dataset and then evaluated on the test set that was kept aside. We ensured that the training process saw only training data (the test set was strictly for final evaluation). No cross-validation was done for final model assessment; instead, we report performance on the single 30% test split. During training, we monitored the models’ behavior: the SVM, while slower, generally finds a globally optimal solution for the linear separator, whereas the SGD model’s performance can fluctuate per epoch due to its stochastic nature. To get a stable SGD model, we actually ran multiple passes (epochs) over the data and used the final learned weights after convergence. We also saved the trained models for potential further analysis (e.g., examining feature coefficients to see which features influence predictions).
After training the SVM and SGD models, we evaluated their performance on the test set (approximately 19.6k connection log entries) to compare their accuracy, precision, recall, and other relevant metrics. We used several evaluation techniques: overall accuracy to measure how many actions were correctly predicted, class-wise precision and recall to understand performance on each action type, confusion matrices to visualize misclassifications, and timing measurements to compare runtime efficiency. In this section, we present the results for each model and analyze their strengths and weaknesses.
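These evaluation steps map directly onto standard scikit-learn utilities; the sketch below assumes the fitted svm and sgd models from the earlier sketches.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

for name, model in [("SVM", svm), ("SGD", sgd)]:
    pred = model.predict(X_test)
    print(f"{name} accuracy: {accuracy_score(y_test, pred):.4f}")
    # zero_division=0 reports 0 precision/recall for classes that are never predicted
    print(classification_report(y_test, pred, zero_division=0))
    print(confusion_matrix(y_test, pred))
```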
Accuracy Metrics: Both models achieved very high overall accuracy on the test data, correctly predicting the action for the vast majority of connection records. The SVM classifier obtained an accuracy of 99.94% on the test set, while the SGD classifier achieved 99.64% accuracy. In practical terms, this means the SVM misclassified only a handful of cases out of ~13k test instances, and the SGD only slightly more. These accuracy figures are extremely high, reflecting that the models can almost perfectly distinguish between the “allow,” “deny,” and “drop” classes which dominate the dataset. However, it is important to note that such high accuracy is partly due to the imbalance in the data – since “allow” is so prevalent, even a trivial model that always predicts “allow” would get a good portion correct. Thus, we must look at precision and recall for each class to truly assess performance, especially for the minority class “reset-both.”
Precision, Recall, and F1: For the three main classes (“allow,” “deny,” “drop”), both SVM and SGD yielded precision and recall near 100%. For instance, the SVM model’s precision on the allow class was effectively 100% with recall 100%, meaning it never mistakenly labeled a non-allow connection as allowed (no false positives) and never missed an allow (no false negatives). The deny and drop classes also saw precision ~99–100% and recall ~99–100% for both models. The SGD model had similarly strong precision/recall on these classes, with perhaps a difference of a few tenths of a percent at most. This indicates that both models are highly effective at recognizing the patterns of traffic that should be denied or dropped versus allowed. The F1-scores (harmonic mean of precision and recall) for these classes were essentially 0.99 to 1.00, reflecting excellent classification quality.
The notable disparity comes with the “reset-both” class. This class was extremely underrepresented (only ~0.05% of the data), and indeed both models struggled with it. The SGD classifier did not correctly identify any of the “reset-both” cases in the test set: it predicted none of them as reset-both, yielding a recall of 0% for that class. The SVM model likewise failed to predict the reset-both action for any test instance, also giving 0% recall. (In one training run, the SVM managed to identify a single reset-both instance correctly, but in general this class remained elusive.) Precision for “reset-both” is mathematically undefined when no samples are predicted as that class; in our reports we treated it as 0 in such cases (the behavior controlled by the zero_division parameter of scikit-learn’s classification_report), which effectively reflects that the model never predicts that class. The macro-averaged precision and recall (averaging over classes) for both models were pulled down by the “reset-both” performance; for example, macro recall was around 75%, since three classes had ~100% recall and one had 0%. The weighted-average precision and recall, on the other hand, remained ~100% because the weights are dominated by the large classes. This result highlights a common issue in imbalanced classification: a model can have outstanding overall accuracy yet completely fail to detect the rare class. In our case, both SVM and SGD effectively ignored the reset-both class, likely because there were too few examples for the algorithms to learn what distinguishes that action. From a business perspective, this could be concerning if “reset-both” corresponds to an important security event (perhaps a special kind of connection termination). We discuss this implication below.
To better illustrate the models’ performance, we present the confusion matrices for the SVM and SGD classifiers (Figure 1 and Figure 2). These show the breakdown of actual vs. predicted classes on the test set for each model.
Figure 1: Confusion matrix for the SVM model on the test dataset. The rows represent the actual firewall action and the columns represent the SVM’s predicted action. The SVM accurately classified all 7522 test instances of “allow” as allow, all 2989 instances of “deny” as deny, and all 2589 instances of “drop” as drop. However, for the 7 instances of the rare “reset-both” action, the SVM misclassified all of them (in this run, it predicted them as “allow” traffic, as indicated by the 7 in the first column of the reset-both row). This matrix highlights the perfect or near-perfect performance on the three main classes and the failure to capture the minority class.
Figure 2: Confusion matrix for the SGD model on the test dataset. The SGD classifier also shows excellent performance on “allow,” “deny,” and “drop,” with only a few minor errors between the deny and drop classes (e.g., it confused 20 “deny” instances as “drop” and 20 “drop” instances as “deny,” as seen in the off-diagonal 20s in the deny/drop cells). These misclassifications are very small in proportion to the thousands of correct predictions for those classes. Like the SVM, the SGD model did not correctly predict any “reset-both” actions – all 7 actual reset-both instances were predicted as “deny” in this case (see the 7 in the deny column of the reset-both row). The net effect is that both models effectively never output the reset-both label.
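Figures like Figure 1 and Figure 2 can be produced with scikit-learn’s plotting helper; the original plotting code is not shown in the text, so this is one plausible way to generate them, assuming the fitted svm and sgd models.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
for ax, (name, model) in zip(axes, [("SVM", svm), ("SGD", sgd)]):
    # Rows are actual actions, columns are predicted actions
    ConfusionMatrixDisplay.from_estimator(model, X_test, y_test, ax=ax)
    ax.set_title(f"{name} confusion matrix")
plt.tight_layout()
plt.show()
```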
Runtime Performance: A key difference between SVM and SGD emerged in computational performance. We measured the training time and found that the SVM took tens of seconds to train on the dataset (approximately 30–60 seconds in our experiments with a linear kernel), whereas the SGD classifier took only a few seconds to reach convergence. This reflects the expected trade-off: SVM training involves solving a quadratic optimization problem that, even with efficient implementations, scales at least quadratically with the number of samples (and would be even slower with a non-linear kernel). In contrast, SGD scales roughly linearly with the number of samples and epochs, and each epoch (one pass over the data) is fast due to incremental updates. In a scenario with even larger data (millions of logs), the SVM’s training time would become prohibitively long, while the SGD approach could still handle it by streaming through the data. We also note that memory usage was higher for the SVM because of the kernel computations cached during training, whereas SGD had a smaller memory footprint.
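The training-time comparison can be reproduced with a simple wall-clock measurement; the numbers will vary with hardware, and the estimator settings below follow the configurations described earlier.

```python
import time

from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC

for name, clf in [("SVM", SVC(kernel="linear", C=0.5)),
                  ("SGD", SGDClassifier(loss="hinge", alpha=1e-5, penalty="l1"))]:
    start = time.perf_counter()
    clf.fit(X_train, y_train)
    print(f"{name} training time: {time.perf_counter() - start:.1f} s")
```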
At prediction time (inference), both models are very quick since they boil down to dot product computations between feature vectors and learned weight coefficients. We did not observe any noticeable difference in single-sample prediction speed; both can predict thousands of log entries per second. Thus, the main runtime consideration is the training phase, especially if models need to be retrained frequently (for instance, incorporating new log data daily). In such cases, the SGD model offers a clear advantage in agility. Table 1 summarizes the performance metrics and training times of the two models:
| Model | Accuracy | Precision (macro) | Recall (macro) | Training Time |
|---|---|---|---|---|
| SVM (Linear) | 99.94% | 0.9975 (≈100%) | 0.75 | ~45 seconds |
| SGD (Hinge) | 99.64% | 0.9975 (≈100%) | 0.75 | ~3 seconds |
Table 1: Comparison of SVM and SGD model performance. The accuracy is the overall percentage of correctly classified instances. Macro-averaged precision and recall treat all classes equally (thus the impact of the “reset-both” class is evident in the recall of 0.75, or 75%). Both models have nearly identical precision for the major classes, so macro precision is ~1.0. The training time is an approximate measure of runtime performance, showing SVM to be roughly an order of magnitude slower to train on this dataset. (Note: The slight difference in accuracy between SVM and SGD corresponds to a few dozen misclassified samples out of ~13k, largely stemming from minor deny/drop confusions in SGD and none in SVM.)
Discussion of Results: The experimental results confirm that both SVM and SGD approaches can achieve excellent predictive performance on firewall log data, with SVM edging out SGD in accuracy by a very small margin. This aligns with expectations and prior findings that SVMs often yield high accuracy but can be computationally expensive. Our SVM model benefited from the maximum-margin principle, likely finding a robust classifier for the three primary classes. The SGD model, while not reaching the absolute peak accuracy of SVM, came very close (within 0.3% accuracy) and actually achieved the same precision/recall on most classes, indicating that it found a similarly good decision boundary for the separable portion of the data. The tiny drop in accuracy for SGD was due to a handful of boundary cases between “deny” and “drop” that the SVM got right and SGD got wrong. In practical terms, both models would be considered highly successful on this task.
Our findings are in line with other studies on this dataset. For example, Ertam and Kaya (2018) applied multiclass SVMs to a similar firewall log dataset and achieved up to 98.5% recall using a sigmoid kernel. Our linear SVM exceeded that recall on the main classes, likely because the dataset is linearly separable to a large extent when enriched by one-hot encoded features. Another study by Sharma et al. (2021) tried several algorithms (including SVM, decision trees, and an SGD-based classifier) on the same Firat University firewall log and reported that an ensemble method could reach 99.8% accuracy. Our single SVM model’s 99.94% accuracy is on par with the best results, and even the SGD’s 99.6% is not far behind the ensemble. This indicates that linear models are surprisingly effective for this firewall classification problem, perhaps due to the informative nature of the features (port numbers combined with traffic counts), which allow simple linear separation of classes. In practice, this means that a well-tuned linear classifier is sufficient to achieve near-perfect classification on known types of traffic logged in the past. However, the reset-both class issue also echoes across studies – Aljabri et al. (2022) noted that all models struggled with rare firewall actions and emphasized the need to handle such cases carefully. In our case, neither SVM nor SGD could learn the reset-both pattern from only dozens of examples. This points to a limitation of purely data-driven methods in rare-event scenarios: without enough training data, the model simply cannot generalize to that class. Potential remedies include collecting more instances of that class, using data augmentation, or employing one-class classification techniques specifically for the rare class (Aljabri et al., 2022). We did not implement those here due to scope, but we acknowledge this shortcoming.
From a business perspective, the results demonstrate that implementing an ML-based firewall decision system is feasible and can drastically reduce manual workload. An accuracy around 99% means the model will correctly automate the allow/drop/deny decision for all but a few out of every thousand connection attempts. In a large enterprise network processing millions of connections, even a 0.1% error rate could translate to some false positives/negatives daily, but this is likely far lower than the error rate of manual monitoring. The high precision we observed (few false alarms) is particularly important in a business setting to avoid disrupting legitimate traffic (false positives) and to minimize the chance of missing an attack (false negatives). The speed advantage of the SGD model suggests it could be retrained frequently (even continuously) with new log data, enabling an adaptive security system. The SVM model’s slower training means it might be retrained less often (perhaps nightly or weekly), but its slightly higher accuracy might catch edge cases that SGD misses.
In conclusion, both the Support Vector Machine and the SGD-based classifier proved to be effective for classifying internet firewall actions, with each model exhibiting distinct advantages. The SVM model delivered marginally higher accuracy and a very robust classification performance for the majority classes, making it an excellent choice when accuracy is the top priority. On the other hand, the SGD model offered drastically faster training times and model update cycles, which is crucial for real-time deployment and scalability, while still maintaining almost the same level of accuracy and precision on all major classes. For a production deployment in a cybersecurity environment, the choice between these models may come down to the specific requirements:
If the environment can afford a heavier model and infrequent retraining, and if maximizing detection of every possible malicious event (even by a tiny increment) is critical, then the SVM model is preferred. Its slight edge in accuracy could make a difference in catching edge-case scenarios. However, one should ensure that adequate computing resources (CPU time and memory) are allocated for training, especially as data grows. Using a linear kernel SVM was key to keeping it feasible; more complex kernels would likely be untenable on this scale of data.
If the network conditions demand real-time learning and adaptation, or if computational resources are limited, the SGD classifier is recommended. It can be updated with new data on the fly (even online learning is possible) and deployed to handle high throughput traffic with ease. The loss in accuracy is very small and might be acceptable given the benefits in speed and simplicity. The SGD model is also easier to maintain and integrate, as it is essentially a form of logistic regression/SVM that can be exported and implemented efficiently on streaming platforms.
Regardless of model choice, our study highlights a few important considerations for improving and deploying ML-based firewall decision systems:
Class Imbalance Handling: The failure to predict the “reset-both” action underlines the need for addressing class imbalance. In a future iteration, we recommend applying techniques such as oversampling the minority class (e.g., SMOTE – Synthetic Minority Over-sampling Technique) or adjusting class weights so that the model gives more importance to the rare class during training. Another approach is to develop a specialized detector for that rare event, possibly using rule-based logic or a separate anomaly detection model, and combine it with the ML model’s output. This way, the system would not entirely miss those rare cases. Ensuring that all classes of interest are detected is crucial because even a single missed malicious connection (if “reset-both” were associated with a certain attack) could have serious consequences.
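As a concrete starting point for the imbalance handling recommended above, class re-weighting is built into the scikit-learn estimators used here, and SMOTE is available in the separate imbalanced-learn package. The sketch below shows both options; it is an illustration of the recommendation, not something evaluated in this study.

```python
from sklearn.linear_model import SGDClassifier

# Option 1: re-weight classes inversely to their frequency during training
weighted_sgd = SGDClassifier(loss="hinge", alpha=1e-5, penalty="l1",
                             class_weight="balanced", random_state=42)
weighted_sgd.fit(X_train, y_train)

# Option 2: oversample the minority class before training (requires imbalanced-learn)
# from imblearn.over_sampling import SMOTE
# X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
```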
Model Thresholds and Tuning: In deployment, one might consider tuning the decision threshold for each class rather than using the default of selecting the highest probability. For instance, if false negatives are deemed more costly than false positives for “deny” actions, the threshold for classifying something as “deny” could be lowered to be more aggressive in flagging traffic. Both SVM and SGD can output decision scores that can be calibrated to probabilities, allowing such threshold adjustments to align with the organization’s risk tolerance.
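One way to implement the threshold tuning described above is to calibrate the SVM’s decision scores into probabilities and then apply a class-specific cutoff; the 0.3 threshold and the lowercase "deny" label below are illustrative assumptions.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import SVC

# Calibrate the linear SVM's decision scores into class probabilities
calibrated_svm = CalibratedClassifierCV(SVC(kernel="linear", C=0.5), cv=3)
calibrated_svm.fit(X_train, y_train)

proba = calibrated_svm.predict_proba(X_test)
deny_idx = list(calibrated_svm.classes_).index("deny")

# Flag traffic as "deny" more aggressively than the default argmax rule
DENY_THRESHOLD = 0.3  # hypothetical value, tuned to the organization's risk tolerance
flag_deny = proba[:, deny_idx] >= DENY_THRESHOLD
```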
Integration with Firewall Workflow: An ML model should complement, not entirely replace, existing firewall rules initially. We recommend deploying the chosen model in a shadow mode (monitoring mode) alongside the firewall to compare its decisions against the rule-based engine. This can build trust in the model and also highlight any systematic divergences. Over time, as confidence grows, the model could take a more active role in enforcement. In either case, having an audit trail of model decisions with explanations (for linear models, the top contributing features can be identified) will help cybersecurity teams understand and trust the automated decisions.
Continuous Learning and Maintenance: Cyber threats and network behaviors evolve, so the model must be periodically retrained with fresh data to remain effective. The SGD model makes this particularly easy – one could even update the model incrementally with each day’s new logs. The SVM would require batch retraining but given the high performance, a schedule (e.g., weekly retrains during off-peak hours) could be established. We also recommend monitoring the model’s performance metrics over time; a decline in accuracy or an increase in false alerts could indicate concept drift (i.e., traffic patterns changing) which signals it’s time to retrain or reevaluate the feature set.
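The incremental-update idea mentioned above maps onto SGDClassifier’s partial_fit interface; the sketch below assumes each new batch of logs (X_new, y_new) has been run through the same preprocessing pipeline as the training data.

```python
from sklearn.linear_model import SGDClassifier

# All possible labels must be declared on the first partial_fit call
classes = ["allow", "deny", "drop", "reset-both"]

online_model = SGDClassifier(loss="hinge", alpha=1e-5, penalty="l1")
online_model.partial_fit(X_train, y_train, classes=classes)

# Later, as each day's new (already preprocessed) logs arrive:
# online_model.partial_fit(X_new, y_new)
```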
In summary, machine learning-based classification of firewall actions is a promising approach to enhance network security. Our case study showed that even relatively straightforward models like linear SVMs and linear SGD classifiers can achieve near-perfect accuracy on historical firewall data, enabling them to replicate and automate the decision logic encoded in firewall rules with minimal error. By deploying such models, organizations can respond to threats faster and handle greater traffic volumes without proportional increases in manpower. Moreover, the consistency of an ML model can reduce the impact of human error in firewall configurations. As noted in the literature, leveraging ML for firewall log analysis can significantly improve the reliability of network defenses and provide proactive threat mitigation (Aljabri et al., 2022). We recommend further efforts in this direction, including exploring ensemble methods (which have yielded >99% accuracy in prior research (Aljabri et al., 2022; Sharma et al., 2021)) and deep learning models for cases where nonlinear patterns may exist. Combining the strength of ML models with expert-driven policies could lead to a hybrid firewall system that is greater than the sum of its parts. Ultimately, the integration of intelligent classifiers into firewall systems represents a step toward autonomous cyber defense – a crucial advancement given the speed and sophistication of modern cyber attacks.
Aljabri, M., et al. (2022). “Classification of Firewall Log Data Using Multiclass Machine Learning Models.” Electronics, 11(12), 1851. This study collected over 1 million firewall log entries and compared five multiclass algorithms for classifying actions as Allow, Drop, Deny, or Reset-both. The authors report that machine learning techniques can automatically classify firewall log actions in a reliable and fast manner, achieving up to 99.64% accuracy with a Random Forest model. They conclude that such ML models can improve organizational network security and assist in building new techniques to prevent cyber threats.
UCI Machine Learning Repository (2019). “Internet Firewall Data.” [Dataset] University of California, Irvine. DOI: 10.24432/C5131M. This is the source of the firewall log dataset used in our case study. The dataset contains 65,532 instances and 12 features, with the Action attribute as the target class (four classes: allow, deny, drop, reset-both). No missing values are present, and it encompasses common firewall features such as ports, bytes, packets, and session duration.
Ertam, F., & Kaya, M. (2018). “Classification of firewall log files with multiclass support vector machine.” In Proceedings of the 6th International Symposium on Digital Forensic and Security (ISDFS), Antalya, Turkey. The authors use SVM to classify a firewall log dataset (65,532 instances from Firat University) into multiple actions. They experimented with different SVM kernels and reported that a sigmoid kernel achieved the highest recall of 98.5% on the action classification task, while a linear kernel achieved the highest precision (67.5%). Their work underscores SVM’s effectiveness in firewall log classification and the impact of kernel choice on performance.
Al-Behadili, H. N. K. (2021). “Decision Tree for Multiclass Classification of Firewall Access.” International Journal of Intelligent Engineering Systems, 14(3), 294–302. This paper proposed a decision tree approach to classify firewall log entries. Using the same public Firat University dataset, the decision tree model achieved an accuracy of 99.84% in predicting the four firewall actions. The study demonstrates that decision trees (and ensembles thereof) can perform remarkably well on this problem, sometimes exceeding SVM in accuracy, albeit with potential trade-offs in overfitting and interpretability.
Sharma, D., Wason, V., & Johri, P. (2021). “Optimized Classification of Firewall Log Data using Heterogeneous Ensemble Techniques.” In Proceedings of the 2021 International Conference on Advance Computing and Innovative Technologies in Engineering (ICACITE). The authors applied multiple algorithms (Logistic Regression, k-NN, Decision Tree, SVM, and an SGD classifier) and an ensemble stacking method to the firewall log dataset. Their heterogeneous stacking ensemble, using a Random Forest as the meta-classifier, achieved 99.8% accuracy and a high precision of 91%, outperforming individual models. This reference highlights the potential of combining models for even better performance and notes the inclusion of an SGD classifier as one of the base learners, which by itself performed comparably to other traditional classifiers.