Predicting Internet Connection Requests Using SVM and SGD

Scope and Purpose of the Research

Our client has tasked us with a cybersecurity project: analyze firewall interactions and help determine which requests should be allowed access. The objective is to automate the access process and reduce the human hours previously spent managing it manually. The client has provided historical data with several fields used to determine whether access should be granted. The project is expected to filter incoming requests, classify them, and automatically accept or deny each request. The client needs a highly accurate model that can function at speed. We must provide several methods with performance summaries so that the client can decide whether this project will meet their current needs.

Importing necessary Python packages.

We will use scikit-learn (sklearn) for the train/test split, grid search, and the Stochastic Gradient Descent and Support Vector Machine algorithms. We will use seaborn and matplotlib for visualizations.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.linear_model import SGDClassifier
from itertools import product
import matplotlib.pyplot as plt
import seaborn as sns

Loading the data

Historical data provided by client in CSV format.

data = pd.read_csv(r'C:\Users\fidel\Desktop\School\Quantifying the World\Case Study 5\log2.csv')

Exploratory Data Analysis

Methodology of Data Collection and Analysis

Summary statistics for the dataset provided are shown below. There are 65,532 records and 12 columns. The first 4 columns (Source Port, Destination Port, NAT Source Port, NAT Destination Port) are stored as integers, but we view them as addresses rather than quantitative values. The target variable is ‘Action’, which has 4 classes (allow, deny, drop, reset-both). The data does not have any missing or null values.
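If the port columns should be handled as labels rather than magnitudes, one option is to cast them to strings before computing numeric summaries. The sketch below is only an illustration: the `sample` frame reuses the dataset's column names, but its values are made up, and the analysis in this report does not apply this cast.

```python
import pandas as pd

# Hypothetical miniature frame using the dataset's column names;
# the values are made up for illustration only.
sample = pd.DataFrame({
    "Source Port": [57222, 56258, 6881],
    "Destination Port": [53, 3389, 50321],
    "NAT Source Port": [54587, 56258, 43265],
    "NAT Destination Port": [53, 3389, 50321],
    "Bytes": [177, 4768, 238],
    "Action": ["allow", "allow", "deny"],
})

port_cols = ["Source Port", "Destination Port",
             "NAT Source Port", "NAT Destination Port"]

# Cast the port columns to strings so they are treated as labels, not magnitudes
sample[port_cols] = sample[port_cols].astype(str)

# Only the genuinely quantitative columns remain numeric
numeric_cols = sample.select_dtypes(include=["number"]).columns.tolist()
print(numeric_cols)
```

After the cast, `describe()` and `corr()` on the numeric subset would skip the port columns.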

# Display the first few rows of the dataset
print("First few rows of the dataset:")

## First few rows of the dataset:

print(data.head())

##    Source Port  Destination Port  ...  pkts_sent  pkts_received
## 0        57222                53  ...          1              1
## 1        56258              3389  ...         10              9
## 2         6881             50321  ...          1              1
## 3        50553              3389  ...          8              7
## 4        50002               443  ...         13             18
## 
## [5 rows x 12 columns]

# Display the dataset's structure
print("\nDataset structure:")

## 
## Dataset structure:

print(data.info())

## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 65532 entries, 0 to 65531
## Data columns (total 12 columns):
##  #   Column                Non-Null Count  Dtype 
## ---  ------                --------------  ----- 
##  0   Source Port           65532 non-null  int64 
##  1   Destination Port      65532 non-null  int64 
##  2   NAT Source Port       65532 non-null  int64 
##  3   NAT Destination Port  65532 non-null  int64 
##  4   Action                65532 non-null  object
##  5   Bytes                 65532 non-null  int64 
##  6   Bytes Sent            65532 non-null  int64 
##  7   Bytes Received        65532 non-null  int64 
##  8   Packets               65532 non-null  int64 
##  9   Elapsed Time (sec)    65532 non-null  int64 
##  10  pkts_sent             65532 non-null  int64 
##  11  pkts_received         65532 non-null  int64 
## dtypes: int64(11), object(1)
## memory usage: 6.0+ MB
## None

# Summary statistics for numeric columns
print("\nSummary statistics for numeric columns:")

## 
## Summary statistics for numeric columns:

print(data.describe())

##         Source Port  Destination Port  ...      pkts_sent  pkts_received
## count  65532.000000      65532.000000  ...   65532.000000   65532.000000
## mean   49391.969343      10577.385812  ...      41.399530      61.466505
## std    15255.712537      18466.027039  ...    3218.871288    2223.332271
## min        0.000000          0.000000  ...       1.000000       0.000000
## 25%    49183.000000         80.000000  ...       1.000000       0.000000
## 50%    53776.500000        445.000000  ...       1.000000       1.000000
## 75%    58638.000000      15000.000000  ...       3.000000       2.000000
## max    65534.000000      65535.000000  ...  747520.000000  327208.000000
## 
## [8 rows x 11 columns]

# Check for missing values
print("\nMissing values per column:")

## 
## Missing values per column:

print(data.isnull().sum())

## Source Port             0
## Destination Port        0
## NAT Source Port         0
## NAT Destination Port    0
## Action                  0
## Bytes                   0
## Bytes Sent              0
## Bytes Received          0
## Packets                 0
## Elapsed Time (sec)      0
## pkts_sent               0
## pkts_received           0
## dtype: int64

EDA Visualizations

We can visualize the breakdown of the 4 classes of Action; the ‘allow’ class makes up more than 50% of the Action column. The majority of the source ports are in the 50,000+ range, whereas the destination, NAT source, and NAT destination ports are in the 1-10,000 range. Boxplots are also provided to show the distribution of values from a different view. We also created a correlation matrix of the numeric columns. Because the port columns are stored as integers they appear in the matrix, but their correlations should be disregarded since ports are identifiers rather than quantities.

# Distribution of categorical data if 'Action' is your categorical column
if 'Action' in data.columns:
    print("\nDistribution of 'Action' categories:")
    print(data['Action'].value_counts())
    sns.countplot(x='Action', data=data)
    plt.title('Distribution of Action Categories')
    plt.show()
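To put a number on the class imbalance, the counts can be converted to proportions. The sketch below uses a hypothetical `actions` series in place of `data['Action']`; the counts are illustrative and are not the dataset's real distribution.

```python
import pandas as pd

# Hypothetical labels standing in for data['Action']; counts are illustrative only
actions = pd.Series(
    ["allow"] * 6 + ["deny"] * 2 + ["drop"] * 1 + ["reset-both"] * 1,
    name="Action",
)

# normalize=True returns each class's share of the rows instead of raw counts
class_share = actions.value_counts(normalize=True)
print(class_share.round(3))

# Accuracy of always predicting the most frequent class (a naive baseline)
majority_baseline = class_share.max()
print(f"Majority-class baseline accuracy: {majority_baseline:.2%}")
```

The majority-class share is a useful floor when reading the model accuracies reported later: a model is only informative to the extent that it beats this baseline.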

# Histograms for all numeric columns
numeric_cols = data.select_dtypes(include=['number']).columns
num_plots = len(numeric_cols)
num_pages = (num_plots - 1) // 6 + 1  # Calculate how many pages needed, 6 plots per page

print("\nAdjusted Histograms for all numeric columns (6 per page):")

## 
## Adjusted Histograms for all numeric columns (6 per page):

for page in range(num_pages):
    plt.figure(figsize=(15, 10))
    for i in range(6):
        plot_number = page * 6 + i
        if plot_number < num_plots:
            plt.subplot(2, 3, i+1)  # 2 rows, 3 cols, position i+1
            sns.histplot(data[numeric_cols[plot_number]], kde=True, bins=15)
            plt.title(numeric_cols[plot_number])
    plt.tight_layout()
    plt.show()

# Boxplots for all numeric columns to check for outliers
print("\nAdjusted Boxplots for all numeric columns to check for outliers (6 per page):")

## 
## Adjusted Boxplots for all numeric columns to check for outliers (6 per page):

for page in range(num_pages):
    plt.figure(figsize=(15, 10))
    for i in range(6):
        plot_number = page * 6 + i
        if plot_number < num_plots:
            plt.subplot(2, 3, i+1)  # 2 rows, 3 cols, position i+1
            sns.boxplot(y=data[numeric_cols[plot_number]])
            plt.title(numeric_cols[plot_number])
    plt.tight_layout()
    plt.show()

# Correlation heatmap for numeric variables
numeric_data = data.select_dtypes(include=['number'])  # Select only numeric columns
correlation_matrix = numeric_data.corr()  # Compute the correlation matrix

plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap='coolwarm')
plt.title('Correlation Heatmap for Numeric Variables')
plt.show()

Splitting the data into test/train sets

We did not find any data issues that require scaling or transforming, so we move on to splitting the data into training and test sets. We split the data into an 80% training set and a 20% test set. We also define a parameter grid to search for the optimal hyperparameters for our models.

# Creating the data splits
X = data.drop(columns = ['Action'])
y = data['Action']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
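Because the Action classes are imbalanced, a stratified split may be worth considering so that rare classes appear in both partitions in the same proportion. This is a sketch of an alternative, not the split used in this report; `X_demo` and `y_demo` are hypothetical stand-ins for `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 'allow' rows and 10 'reset-both' rows
y_demo = np.array(["allow"] * 90 + ["reset-both"] * 10)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo keeps the class proportions the same in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Count the rare-class rows that landed in the test set
rare_in_test = int((y_te == "reset-both").sum())
print(rare_in_test)
```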

# Grid Search
param_grid = {
    'C': [0.1],
    'kernel': ['rbf', 'linear'],
}
param_combinations = list(product(param_grid['C'], param_grid['kernel']))
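The same grid could also be searched with scikit-learn's `GridSearchCV`, which adds cross-validation on the training set. The following is a minimal sketch under stated assumptions: it uses synthetic data from `make_classification` in place of the firewall log, and it adds a `StandardScaler` step that the analysis in this report does not use.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 4-class data standing in for the firewall log (illustration only)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=11, n_informative=6,
    n_classes=4, n_clusters_per_class=1, random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Scaling lives inside the pipeline so it is re-fit on each training fold
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])

# Same C and kernel candidates as the manual grid; 'svc__' targets the SVC step
grid = {"svc__C": [0.1], "svc__kernel": ["rbf", "linear"]}

search = GridSearchCV(pipe, grid, cv=3, scoring="accuracy")
search.fit(X_tr, y_tr)

print(search.best_params_)
held_out_accuracy = search.score(X_te, y_te)
print(f"Held-out accuracy: {held_out_accuracy:.3f}")
```

Selecting hyperparameters by cross-validation keeps the test set untouched until the final evaluation, which gives a less optimistic accuracy estimate than choosing the best configuration on the test set itself.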

SVM Model

The initial model evaluated was the Support Vector Machine (SVM). With a linear kernel and a regularization parameter of C = 0.1, this model achieved an accuracy of 99.14%. To introduce a non-linear approach, the Radial Basis Function (RBF) kernel was also explored. The feature importance chart indicates that the most informative attributes for determining the system’s action are the number of packets, packets sent, elapsed time, and packets received. The confusion matrix for the SVM model with a linear kernel gives a visual representation of the model’s accuracy, with the vast majority of predictions being correct.

The model with an RBF kernel returned an accuracy of 57.56%. The model with a linear kernel returned an accuracy of 99.144%.

svm_model = SVC()
accuracies = []
param_combinations = list(product(param_grid['C'], param_grid['kernel']))
for i, (C, kernel) in enumerate(param_combinations):
    print(f"Grid Search Iteration {i+1}/{len(param_combinations)}")
    print(f"Hyperparameters: C={C}, kernel={kernel}")

    # Set the hyperparameters
    svm_model.set_params(C=C, kernel=kernel)

    # Fit the model
    svm_model.fit(X_train, y_train)

    # Make predictions
    y_pred = svm_model.predict(X_test)

    # Evaluate the model's accuracy
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)

SVC(C=0.1, kernel='linear')


print(f'Accuracy: {accuracy * 100:.2f}%')

## Accuracy: 99.15%

print("-" * 50)

## --------------------------------------------------

c_values = [param[0] for param in param_combinations]
kernel_values = [param[1] for param in param_combinations]

# Create a bar chart of accuracy for each hyperparameter combination
plt.bar(range(len(accuracies)), accuracies,
    tick_label=[f'C={C}, Kernel={kernel}' for C, kernel in param_combinations])
plt.xlabel('Hyperparameter Combinations')
plt.ylabel('Accuracy')
plt.xticks(rotation=45, ha='right')


plt.title('Accuracy for Different Hyperparameter Combinations')

# Show the plot
plt.tight_layout()
plt.show()

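# For a multi-class linear SVC, coef_ has one row per pair of classes;
# coef_[0] is the weight vector for the first pair only
feature_importance = svm_model.coef_[0]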
print(feature_importance)

## [-1.84061663e-03 -1.98739390e-02  8.94575757e-03  4.17249621e-02
##   1.91238988e-01 -9.17681129e-02  2.83007101e-01  6.10877797e+00
##   2.13092359e+01  4.56250606e+00  1.54627192e+00]
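With four classes, a linear `SVC` fits one hyperplane per pair of classes, so `coef_` has six rows and `coef_[0]` describes only the first pair. One possible way to summarize all pairs is to average the absolute coefficients per feature. The sketch below assumes synthetic data and standardized features (so that magnitudes are comparable); neither is part of the analysis above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 4-class data standing in for the firewall log (illustration only)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=11, n_informative=6,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

# Standardize so coefficient magnitudes are comparable across features
X_scaled = StandardScaler().fit_transform(X_demo)

model = SVC(kernel="linear", C=0.1).fit(X_scaled, y_demo)

# One row per pair of classes: 4 classes give 6 one-vs-one hyperplanes
print(model.coef_.shape)

# Average absolute weight per feature across all class pairs
mean_abs_coef = np.abs(model.coef_).mean(axis=0)

# Feature indices ordered from highest to lowest average weight
ranking = np.argsort(mean_abs_coef)[::-1]
print(ranking[:4])
```

On unscaled inputs, a feature measured in large units (such as bytes) tends to receive a small coefficient regardless of its predictive value, so raw coefficient size should be read with caution.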

SVM visualizations

The SVM model selected Packets, Elapsed Time (sec), pkts_sent, and pkts_received as the most important features. This is in line with our expectations. We did not expect the byte-count features (the sizes of requests sent and received) to be the most important, and the features the model identified are more in line with principles of network engineering.

# Visualize feature importances
plt.figure(figsize=(10, 6))
plt.bar(range(len(feature_importance)), feature_importance)
plt.xticks(range(len(feature_importance)), X.columns, rotation=45, ha='right')


plt.xlabel('Features')
plt.ylabel('Feature Importance (Coefficient)')
plt.title('Feature Importance for SVM Model')
plt.tight_layout()
plt.show()

# Get the unique class labels
class_labels = sorted(y.unique())

# Pass labels explicitly so the matrix rows and columns match the tick labels
conf_matrix = confusion_matrix(y_test, y_pred, labels=class_labels)

# Create a custom confusion matrix visualization
plt.figure(figsize=(10, 8))
plt.imshow(conf_matrix, interpolation='nearest', cmap=plt.cm.Blues)
plt.title('Confusion Matrix')
plt.colorbar()


tick_marks = range(len(class_labels))
plt.xticks(tick_marks, class_labels, rotation=45)


plt.yticks(tick_marks, class_labels)


for i in range(len(conf_matrix)):
    for j in range(len(conf_matrix[i])):
        plt.text(j, i, f'{conf_matrix[i][j]}', ha='center', va='center',
         color='white' if conf_matrix[i][j] > conf_matrix.max() / 2 else 'black')

plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.show()
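Overall accuracy can hide poor performance on rare classes such as reset-both, so per-class precision and recall are a useful complement to the confusion matrix. The sketch below uses scikit-learn's `classification_report` on hypothetical labels; `y_true_demo` and `y_pred_demo` are made up and do not reflect the model's actual predictions.

```python
from sklearn.metrics import classification_report

# Hypothetical true and predicted labels (illustration only)
y_true_demo = ["allow"] * 6 + ["deny"] * 3 + ["reset-both"] * 1
y_pred_demo = ["allow"] * 6 + ["deny"] * 3 + ["allow"] * 1

# output_dict=True returns nested dicts; zero_division=0 avoids warnings
# for classes that are never predicted
report = classification_report(
    y_true_demo, y_pred_demo, output_dict=True, zero_division=0
)

print(f"Accuracy: {report['accuracy']:.2f}")
print(f"reset-both recall: {report['reset-both']['recall']:.2f}")
```

In this toy case the accuracy is high even though the rare class is never predicted, which is the situation the per-class view is meant to expose.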

SGD Model fitting

Following our initial exploration with the SVM model to categorize internet request actions, we developed a Stochastic Gradient Descent (SGD) model to offer an alternative analysis of the issue. This SGD model achieved a 98.19% accuracy. The feature importance analysis highlighted a notable overlap between the two models: Elapsed Time emerged as a critical variable in both instances. However, the SGD model differentiated itself by identifying Bytes, Bytes Sent, Bytes Received, and Elapsed Time as the key factors for determining the action on an internet connection, thereby presenting a different approach to addressing the problem.

sgd_model = SGDClassifier(max_iter=1000, random_state=42)
param_grid = {
    'alpha': [0.0001, 0.001, 0.01, 0.1],
    'penalty': ['l2', 'l1', 'elasticnet'],
    'loss': ['hinge', 'modified_huber', 'squared_hinge', 'perceptron'],
}
# Initialize variables to store the best model and best accuracy
best_model = None
best_accuracy = 0

# Manual verbose control
verbose = True

# Perform the grid search manually
for alpha in param_grid['alpha']:
    for penalty in param_grid['penalty']:
        for loss in param_grid['loss']:
            if verbose:
                print(f"Training with alpha={alpha}, penalty={penalty}, loss={loss}")

            # Create an SGD Classifier with the current hyperparameters
            sgd_model = SGDClassifier(max_iter=1000, random_state=42, alpha=alpha, penalty=penalty, loss=loss)

            # Fit the model
            sgd_model.fit(X_train, y_train)

            # Make predictions
            y_pred = sgd_model.predict(X_test)

            # Calculate accuracy
            accuracy = accuracy_score(y_test, y_pred)

            if verbose:
                print(f'Accuracy: {accuracy * 100:.2f}%')

            # Check if the current model is the best so far
            if accuracy > best_accuracy:
                best_accuracy = accuracy
                best_model = sgd_model

SGDClassifier(alpha=0.1, loss='perceptron', penalty='elasticnet',
              random_state=42)

# Correctly print the best model accuracy outside the loop
if verbose:
    print(f"Best Model Accuracy: {best_accuracy * 100:.2f}%")

## Best Model Accuracy: 98.19%
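A variant of the manual search keeps the hyperparameters alongside each fitted model, so the best configuration can be reported and inspected later without relying on the loop variable. The sketch below makes several assumptions: it runs on synthetic data, uses a reduced grid, and standardizes the features inside a pipeline (SGD is generally sensitive to feature scale), none of which is part of the search above.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 4-class data standing in for the firewall log (illustration only)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=11, n_informative=6,
    n_classes=4, n_clusters_per_class=1, random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Reduced grid to keep the example fast
grid = {
    "alpha": [0.0001, 0.01],
    "penalty": ["l2", "elasticnet"],
    "loss": ["hinge", "perceptron"],
}

results = []
for alpha, penalty, loss in product(grid["alpha"], grid["penalty"], grid["loss"]):
    params = {"alpha": alpha, "penalty": penalty, "loss": loss}
    # A fresh pipeline per combination so each stored model is independent
    model = make_pipeline(
        StandardScaler(),
        SGDClassifier(max_iter=1000, random_state=42, **params),
    )
    model.fit(X_tr, y_tr)
    acc = accuracy_score(y_te, model.predict(X_te))
    results.append((acc, params, model))

# Pick the entry with the highest accuracy; params and model travel with it
best_acc, best_params, best_model = max(results, key=lambda r: r[0])
print(best_params, f"{best_acc:.3f}")
```

As in the loop above, the selection here is made on the test split; holding out a separate validation set would give a cleaner estimate of the chosen model's accuracy.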


y_pred = best_model.predict(X_test)
conf_matrix = confusion_matrix(y_test, y_pred)

# Get the unique class labels
class_labels = sorted(y.unique())

SGD Visualization

Here we visualize the SGD model’s confusion matrix and the features it weights most heavily. The SGD model places most of its weight on the byte-count features, which may not align with current principles of network engineering.

plt.figure(figsize=(10, 8))
plt.imshow(conf_matrix, interpolation='nearest', cmap=plt.cm.Blues)
plt.title('Confusion Matrix')
plt.colorbar()


tick_marks = range(len(class_labels))
plt.xticks(tick_marks, class_labels, rotation=45)


plt.yticks(tick_marks, class_labels)


for i in range(len(conf_matrix)):
    for j in range(len(conf_matrix[i])):
        plt.text(j, i, f'{conf_matrix[i][j]}', ha='center', va='center', color='white' if conf_matrix[i][j] > conf_matrix.max() / 2 else 'black')

plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.show()

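# Note: sgd_model is the last model fitted in the grid search loop, which is not
# necessarily best_model; coef_[0] holds the weights for the first class only
feature_importance = sgd_model.coef_[0]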
print(feature_importance)

## [-0.12337887 -0.18689351  1.17010968  0.54045514  5.27583253 -3.29085875
##   9.61478434  0.          3.95621064  0.          0.        ]

# Create a custom variable importance graph
plt.figure(figsize=(10, 6))
plt.bar(range(len(feature_importance)), feature_importance)
plt.xticks(range(len(feature_importance)), X.columns, rotation=45, ha='right')


plt.xlabel('Features')
plt.ylabel('Feature Importance (Coefficient)')
plt.title('Feature Importance for SGD Classifier')
plt.tight_layout()
plt.show()
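Since the client needs a model that can function at speed, training and prediction time could be reported alongside accuracy. The sketch below is one way to measure this with `time.perf_counter`; `time_model` is a hypothetical helper, the data is synthetic, and the timings it prints say nothing about performance on the real firewall log.

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC

# Synthetic 4-class data standing in for the firewall log (illustration only)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=11, n_informative=6,
    n_classes=4, n_clusters_per_class=1, random_state=42,
)


def time_model(model, X, y):
    """Return (fit_seconds, predict_seconds) for one fit/predict cycle."""
    start = time.perf_counter()
    model.fit(X, y)
    fit_seconds = time.perf_counter() - start

    start = time.perf_counter()
    model.predict(X)
    predict_seconds = time.perf_counter() - start
    return fit_seconds, predict_seconds


timings = {
    "SVC (linear)": time_model(SVC(kernel="linear", C=0.1), X_demo, y_demo),
    "SGDClassifier": time_model(
        SGDClassifier(max_iter=1000, random_state=42), X_demo, y_demo
    ),
}

for name, (fit_s, pred_s) in timings.items():
    print(f"{name}: fit {fit_s:.4f}s, predict {pred_s:.4f}s")
```

A single run is noisy, so repeating the measurement several times and reporting the median would give a steadier comparison.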

Conclusion

While both models demonstrated high accuracy, the SVM model stands out for its logical coherence and higher accuracy (99.14% versus 98.19% for SGD). Its emphasis on the number of packets sent and received, rather than the byte counts of those packets, not only yielded a more effective model but also aligns more closely with principles of network engineering.

Looking ahead, we aim to extend our experimentation to a broader dataset and to evaluate the model’s performance in a real-world setting. Additionally, exploring alternative modeling approaches, including Neural Networks, will be part of our future research efforts.