The rise of the internet has fundamentally changed how we gather information, communicate, and do business, creating a global network that gives individuals and organizations access to an expansive and varied digital realm. With the benefits of this accessibility come significant challenges, particularly in safeguarding users, networks, and confidential information against cyber threats and unsuitable content.
The implementation of internet content filtering is now a critical task in the modern, digitally connected era. This method employs specific rules and policies to oversee the exchange of data and information into and out of a network. Its importance spans cybersecurity, information protection, and the promotion of a secure and efficient online environment. This paper examines the significance of filtering internet requests and its broad impact on individuals, enterprises, and the wider community.
Our client has tasked us with a cybersecurity project to analyze firewall interactions and help determine which requests should be allowed access. The objective is to help our client automate the access process and reduce the human hours previously spent managing it manually. The client has provided historical data with several fields used to determine whether access should be granted. The project is expected to filter the incoming requests, classify them, and automatically accept or deny each request. The client needs a highly accurate model that can function at speed. We must provide several methods with their performance summaries so that our client can decide whether this project will meet their current needs.
We will use scikit-learn, which provides train/test splitting, grid search, and the Stochastic Gradient Descent and Support Vector Machine algorithms. We will use seaborn and matplotlib for visualizations.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.linear_model import SGDClassifier
from itertools import product
import matplotlib.pyplot as plt
import seaborn as sns
The client provided the historical data in CSV format.
data = pd.read_csv(r'C:\Users\fidel\Desktop\School\Quantifying the World\Case Study 5\log2.csv')
Summary statistics for the dataset are shown below. There are 65,532 records and 12 columns. The first four columns (Source Port, Destination Port, NAT Source Port, NAT Destination Port) are stored as integers, but we view them as addresses rather than quantitative values. The target variable is 'Action', which has four classes (allow, deny, drop, reset-both). The data does not have any missing or null values.
# Display the first few rows of the dataset
print("First few rows of the dataset:")
## First few rows of the dataset:
print(data.head())
## Source Port Destination Port ... pkts_sent pkts_received
## 0 57222 53 ... 1 1
## 1 56258 3389 ... 10 9
## 2 6881 50321 ... 1 1
## 3 50553 3389 ... 8 7
## 4 50002 443 ... 13 18
##
## [5 rows x 12 columns]
# Display the dataset's structure
print("\nDataset structure:")
##
## Dataset structure:
print(data.info())
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 65532 entries, 0 to 65531
## Data columns (total 12 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 Source Port 65532 non-null int64
## 1 Destination Port 65532 non-null int64
## 2 NAT Source Port 65532 non-null int64
## 3 NAT Destination Port 65532 non-null int64
## 4 Action 65532 non-null object
## 5 Bytes 65532 non-null int64
## 6 Bytes Sent 65532 non-null int64
## 7 Bytes Received 65532 non-null int64
## 8 Packets 65532 non-null int64
## 9 Elapsed Time (sec) 65532 non-null int64
## 10 pkts_sent 65532 non-null int64
## 11 pkts_received 65532 non-null int64
## dtypes: int64(11), object(1)
## memory usage: 6.0+ MB
## None
# Summary statistics for numeric columns
print("\nSummary statistics for numeric columns:")
##
## Summary statistics for numeric columns:
print(data.describe())
## Source Port Destination Port ... pkts_sent pkts_received
## count 65532.000000 65532.000000 ... 65532.000000 65532.000000
## mean 49391.969343 10577.385812 ... 41.399530 61.466505
## std 15255.712537 18466.027039 ... 3218.871288 2223.332271
## min 0.000000 0.000000 ... 1.000000 0.000000
## 25% 49183.000000 80.000000 ... 1.000000 0.000000
## 50% 53776.500000 445.000000 ... 1.000000 1.000000
## 75% 58638.000000 15000.000000 ... 3.000000 2.000000
## max 65534.000000 65535.000000 ... 747520.000000 327208.000000
##
## [8 rows x 11 columns]
# Check for missing values
print("\nMissing values per column:")
##
## Missing values per column:
print(data.isnull().sum())
## Source Port 0
## Destination Port 0
## NAT Source Port 0
## NAT Destination Port 0
## Action 0
## Bytes 0
## Bytes Sent 0
## Bytes Received 0
## Packets 0
## Elapsed Time (sec) 0
## pkts_sent 0
## pkts_received 0
## dtype: int64
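The point that port columns are addresses rather than quantities can be made explicit in pandas. The following is a minimal sketch on a toy frame, not the client data: the column names mirror the dataset, while the values and the variable names `toy` and `port_cols` are made up for illustration.

```python
import pandas as pd

# Toy frame: column names mirror the dataset, values are invented.
toy = pd.DataFrame({
    "Source Port": [57222, 56258, 6881],
    "Destination Port": [53, 3389, 50321],
    "Bytes": [100, 200, 300],
})

port_cols = ["Source Port", "Destination Port"]

# Cast each port column to string and then to the 'category' dtype so that
# numeric summaries and correlations no longer treat ports as quantities.
for col in port_cols:
    toy[col] = toy[col].astype(str).astype("category")

# Only genuinely quantitative columns remain in the numeric selection.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
print(toy.dtypes)
print(numeric_cols)
```

A model would still need these categories encoded (for example, one-hot encoded) before fitting; whether that is worthwhile for tens of thousands of distinct ports is a separate design question.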
We can visualize the breakdown of the four classes of Action; the 'allow' class makes up more than 50% of the Action column. The majority of the source ports are in the 50,000+ range, whereas the destination, NAT source, and NAT destination ports are mostly in the 1-10,000 range. Boxplots are also provided to show the distributions from a different view. We also created a correlation matrix of the numeric columns. The port columns appear in the matrix because they are stored as integers, but their correlations should be disregarded since ports are identifiers rather than quantities.
# Distribution of categorical data if 'Action' is your categorical column
if 'Action' in data.columns:
    print("\nDistribution of 'Action' categories:")
    print(data['Action'].value_counts())
    sns.countplot(x='Action', data=data)
    plt.title('Distribution of Action Categories')
    plt.show()
# Histograms for all numeric columns
numeric_cols = data.select_dtypes(include=['number']).columns
num_plots = len(numeric_cols)
num_pages = (num_plots - 1) // 6 + 1 # Calculate how many pages needed, 6 plots per page
print("\nAdjusted Histograms for all numeric columns (6 per page):")
##
## Adjusted Histograms for all numeric columns (6 per page):
for page in range(num_pages):
    plt.figure(figsize=(15, 10))
    for i in range(6):
        plot_number = page * 6 + i
        if plot_number < num_plots:
            plt.subplot(2, 3, i+1)  # 2 rows, 3 cols, position i+1
            sns.histplot(data[numeric_cols[plot_number]], kde=True, bins=15)
            plt.title(numeric_cols[plot_number])
    plt.tight_layout()
    plt.show()
# Boxplots for all numeric columns to check for outliers
print("\nAdjusted Boxplots for all numeric columns to check for outliers (6 per page):")
##
## Adjusted Boxplots for all numeric columns to check for outliers (6 per page):
for page in range(num_pages):
    plt.figure(figsize=(15, 10))
    for i in range(6):
        plot_number = page * 6 + i
        if plot_number < num_plots:
            plt.subplot(2, 3, i+1)  # 2 rows, 3 cols, position i+1
            sns.boxplot(y=data[numeric_cols[plot_number]])
            plt.title(numeric_cols[plot_number])
    plt.tight_layout()
    plt.show()
# Correlation heatmap for numeric variables
numeric_data = data.select_dtypes(include=['number']) # Select only numeric columns
correlation_matrix = numeric_data.corr() # Compute the correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap='coolwarm')
plt.title('Correlation Heatmap for Numeric Variables')
plt.show()
We did not find any data issues that require scaling or transforming, so we move on to splitting the data into training and test sets, using an 80% training set and a 20% test set. We also set up a grid of hyperparameters to search for the optimal settings for our models.
# Creating the data splits
X = data.drop(columns = ['Action'])
y = data['Action']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Grid Search
param_grid = {
    'C': [0.1],
    'kernel': ['rbf', 'linear'],
}
param_combinations = list(product(param_grid['C'], param_grid['kernel']))
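The split and parameter grid above can also be expressed with `GridSearchCV`, which is imported earlier but not used in the code shown. The sketch below is one possible arrangement, not the procedure used for the reported results: it runs on synthetic data from `make_classification` (an assumption standing in for the firewall log), adds `stratify` to preserve class proportions, and widens the `C` grid purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the firewall data (assumption: 4 classes, 11 features).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=11, n_informative=6,
    n_classes=4, n_clusters_per_class=1, random_state=42,
)

# stratify keeps the class proportions the same in the train and test splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Illustrative grid; the values of C are hypothetical.
demo_grid = {"C": [0.1, 1.0], "kernel": ["rbf", "linear"]}

# 3-fold cross-validation on the training split only; the test split
# is touched once, after the best combination has been chosen.
search = GridSearchCV(SVC(), demo_grid, cv=3, scoring="accuracy")
search.fit(X_tr, y_tr)

test_accuracy = search.score(X_te, y_te)
print(search.best_params_)
print(f"Held-out accuracy: {test_accuracy:.3f}")
```

Selecting hyperparameters by cross-validation keeps the test set as an unbiased estimate of performance, at the cost of fitting each combination several times.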
The first model evaluated was the Support Vector Machine (SVM). With a linear kernel and a regularization parameter of C = 0.1, it achieved an accuracy of 99.14%. To introduce a non-linear approach, the Radial Basis Function (RBF) kernel was also explored. The feature importance chart indicates that the most informative attributes for determining the system's action are the number of packets, packets sent, elapsed time, and packets received. The confusion matrix for the linear-kernel SVM shows that the vast majority of predictions fall on the diagonal, that is, they match the true action.
The model with an RBF kernel returned an accuracy of 57.56%. The model with a linear kernel returned an accuracy of 99.14%.
svm_model = SVC()
accuracies = []
param_combinations = list(product(param_grid['C'], param_grid['kernel']))
for i, (C, kernel) in enumerate(param_combinations):
    print(f"Grid Search Iteration {i+1}/{len(param_combinations)}")
    print(f"Hyperparameters: C={C}, kernel={kernel}")
    # Set the hyperparameters
    svm_model.set_params(C=C, kernel=kernel)
    # Fit the model
    svm_model.fit(X_train, y_train)
    # Make predictions
    y_pred = svm_model.predict(X_test)
    # Evaluate the model's accuracy
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
print(f'Accuracy: {accuracy * 100:.2f}%')
## Accuracy: 99.15%
print("-" * 50)
## --------------------------------------------------
c_values = [param[0] for param in param_combinations]
kernel_values = [param[1] for param in param_combinations]
# Create a bar chart of accuracy for each hyperparameter combination
plt.bar(range(len(accuracies)), accuracies,
        tick_label=[f'C={C}, Kernel={kernel}' for C, kernel in param_combinations])
plt.xlabel('Hyperparameter Combinations')
plt.ylabel('Accuracy')
plt.xticks(rotation=45, ha='right')
plt.title('Accuracy for Different Hyperparameter Combinations')
# Show the plot
plt.tight_layout()
plt.show()
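The gap between the RBF and linear kernels reported above is large. One possible factor, which this analysis does not investigate, is that RBF kernels depend on distances between samples and are therefore sensitive to features measured on very different scales. The sketch below shows how such a comparison could be set up with a `StandardScaler` pipeline; it uses synthetic data with one artificially inflated column (an assumption, not the client data) and makes no claim about what the result would be on the firewall log.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data (assumption); all names here are illustrative.
X_s, y_s = make_classification(
    n_samples=400, n_features=6, n_informative=4,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
# Inflate one column to mimic a large counter sitting next to small counts.
X_s[:, 0] *= 1e5

X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X_s, y_s, test_size=0.2, random_state=0, stratify=y_s
)

# RBF SVM on the raw features.
raw_rbf = SVC(C=0.1, kernel="rbf").fit(X_tr_s, y_tr_s)

# Same model, but each feature is standardized first; the scaler is fitted
# on the training split only because it lives inside the pipeline.
scaled_rbf = make_pipeline(StandardScaler(), SVC(C=0.1, kernel="rbf"))
scaled_rbf.fit(X_tr_s, y_tr_s)

raw_acc = raw_rbf.score(X_te_s, y_te_s)
scaled_acc = scaled_rbf.score(X_te_s, y_te_s)
print(f"RBF without scaling: {raw_acc:.3f}")
print(f"RBF with scaling:    {scaled_acc:.3f}")
```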
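# Note: with four classes, coef_ has one row per pair of classes (one-vs-one);
# coef_[0] is the weight vector of the first pairwise classifier only.
feature_importance = svm_model.coef_[0]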
print(feature_importance)
## [-1.84061663e-03 -1.98739390e-02 8.94575757e-03 4.17249621e-02
## 1.91238988e-01 -9.17681129e-02 2.83007101e-01 6.10877797e+00
## 2.13092359e+01 4.56250606e+00 1.54627192e+00]
The SVM model weighted Packets, Elapsed Time (sec), pkts_sent, and pkts_received most heavily. This is in line with our expectations: we did not expect the byte counts (Bytes, Bytes Sent, Bytes Received) to be the most important features, and the features the model identified are more in line with principles of network engineering.
# Visualize feature importances
plt.figure(figsize=(10, 6))
plt.bar(range(len(feature_importance)), feature_importance)
plt.xticks(range(len(feature_importance)), X.columns, rotation=45, ha='right')
plt.xlabel('Features')
plt.ylabel('Feature Importance (Coefficient)')
plt.title('Feature Importance for SVM Model')
plt.tight_layout()
plt.show()
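For a multi-class linear SVM, scikit-learn's `SVC` stores one weight vector per pair of classes, so a single row of `coef_` describes only one pairwise decision. The sketch below shows one way to summarize importance across all pairs by averaging absolute weights. It is an illustrative alternative on synthetic data (an assumption), not the computation behind the chart above, and the averaging rule itself is a design choice rather than a standard.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic 4-class problem (assumption) with 5 features.
X_m, y_m = make_classification(
    n_samples=300, n_features=5, n_informative=4, n_redundant=1,
    n_classes=4, n_clusters_per_class=1, random_state=1,
)

linear_svm = SVC(C=0.1, kernel="linear").fit(X_m, y_m)

# 4 classes -> 4 * 3 / 2 = 6 pairwise classifiers, one row of weights each.
print(linear_svm.coef_.shape)  # (6, 5)

# Average the absolute weight of each feature over all pairwise classifiers.
overall_importance = np.abs(linear_svm.coef_).mean(axis=0)

# Feature indices ordered from most to least important under this summary.
ranking = np.argsort(overall_importance)[::-1]
print(overall_importance)
print(ranking)
```

Coefficient magnitudes are only comparable across features when the features share a scale, so a summary like this is most meaningful on standardized inputs.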
conf_matrix = confusion_matrix(y_test, y_pred)
# Class labels in sorted order, matching the default ordering of confusion_matrix
class_labels = sorted(y.unique())
# Create a custom confusion matrix visualization
plt.figure(figsize=(10, 8))
plt.imshow(conf_matrix, interpolation='nearest', cmap=plt.cm.Blues)
plt.title('Confusion Matrix')
plt.colorbar()
tick_marks = range(len(class_labels))
plt.xticks(tick_marks, class_labels, rotation=45)
plt.yticks(tick_marks, class_labels)
for i in range(len(conf_matrix)):
    for j in range(len(conf_matrix[i])):
        plt.text(j, i, f'{conf_matrix[i][j]}', ha='center', va='center',
                 color='white' if conf_matrix[i][j] > conf_matrix.max() / 2 else 'black')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.show()
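Because the 'allow' class makes up more than half of the records, overall accuracy can hide weak performance on the rarer classes. Per-class recall and precision can be read directly off a confusion matrix. The sketch below uses a small hand-made set of labels (invented for illustration, not taken from the model output) to show the arithmetic.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

labels = ["allow", "deny", "drop", "reset-both"]

# Hand-made example labels (hypothetical, for illustration only).
y_true_demo = ["allow", "allow", "allow", "deny", "deny", "drop", "drop", "reset-both"]
y_pred_demo = ["allow", "allow", "deny", "deny", "deny", "drop", "allow", "allow"]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=labels)

row_totals = cm.sum(axis=1)  # number of true samples per class
col_totals = cm.sum(axis=0)  # number of predictions per class

# Recall: share of each true class that was predicted correctly.
recall = np.divide(np.diag(cm), row_totals,
                   out=np.zeros(len(labels)), where=row_totals > 0)
# Precision: share of each predicted class that was correct
# (left at 0 when a class is never predicted).
precision = np.divide(np.diag(cm), col_totals,
                      out=np.zeros(len(labels)), where=col_totals > 0)

for name, r, p in zip(labels, recall, precision):
    print(f"{name}: recall={r:.2f}, precision={p:.2f}")
```

In this toy example the overall accuracy is 5/8, yet the single 'reset-both' record is missed entirely, which is the kind of detail a per-class view surfaces.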
Following our initial exploration with the SVM model to categorize internet request actions, we developed a Stochastic Gradient Descent (SGD) model to offer an alternative analysis of the issue. This SGD model achieved a 98.19% accuracy. The feature importance analysis highlighted a notable overlap between the two models: Elapsed Time emerged as a critical variable in both instances. However, the SGD model differentiated itself by identifying Bytes, Bytes Sent, Bytes Received, and Elapsed Time as the key factors for determining the action on an internet connection, thereby presenting a different approach to addressing the problem.
sgd_model = SGDClassifier(max_iter=1000, random_state=42)
param_grid = {
    'alpha': [0.0001, 0.001, 0.01, 0.1],
    'penalty': ['l2', 'l1', 'elasticnet'],
    'loss': ['hinge', 'modified_huber', 'squared_hinge', 'perceptron'],
}
# Initialize variables to store the best model and best accuracy
best_model = None
best_accuracy = 0
# Manual verbose control
verbose = True
# Perform the grid search manually
for alpha in param_grid['alpha']:
    for penalty in param_grid['penalty']:
        for loss in param_grid['loss']:
            if verbose:
                print(f"Training with alpha={alpha}, penalty={penalty}, loss={loss}")
            # Create an SGD Classifier with the current hyperparameters
            sgd_model = SGDClassifier(max_iter=1000, random_state=42, alpha=alpha, penalty=penalty, loss=loss)
            # Fit the model
            sgd_model.fit(X_train, y_train)
            # Make predictions
            y_pred = sgd_model.predict(X_test)
            # Calculate accuracy
            accuracy = accuracy_score(y_test, y_pred)
            if verbose:
                print(f'Accuracy: {accuracy * 100:.2f}%')
            # Check if the current model is the best so far
            if accuracy > best_accuracy:
                best_accuracy = accuracy
                best_model = sgd_model
# Print the best model accuracy outside the loop
if verbose:
    print(f"Best Model Accuracy: {best_accuracy * 100:.2f}%")
## Best Model Accuracy: 98.19%
y_pred = best_model.predict(X_test)
conf_matrix = confusion_matrix(y_test, y_pred)
# Get the unique class labels
class_labels = sorted(y.unique())
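The manual search above keeps the combination with the highest test-set accuracy. A common variant is to choose hyperparameters on a separate validation split and score the test split only once at the end, so the reported accuracy is not influenced by the selection. The sketch below illustrates that variant on synthetic data with a reduced grid; the data, the 60/20/20 proportions, and all variable names are assumptions for illustration, not part of the original analysis.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

# Synthetic 4-class data (assumption) standing in for the firewall log.
X_v, y_v = make_classification(
    n_samples=600, n_features=8, n_informative=5,
    n_classes=4, n_clusters_per_class=1, random_state=7,
)

# 60% train, 20% validation, 20% test.
X_tmp, X_test_v, y_tmp, y_test_v = train_test_split(
    X_v, y_v, test_size=0.2, random_state=7, stratify=y_v
)
X_train_v, X_val_v, y_train_v, y_val_v = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=7, stratify=y_tmp
)

# Reduced, illustrative grid.
small_grid = {
    "alpha": [0.0001, 0.01],
    "penalty": ["l2", "l1"],
    "loss": ["hinge", "perceptron"],
}

best_params, best_val_acc, best_clf = None, -1.0, None
for alpha, penalty, loss in product(
    small_grid["alpha"], small_grid["penalty"], small_grid["loss"]
):
    clf = SGDClassifier(max_iter=1000, random_state=42,
                        alpha=alpha, penalty=penalty, loss=loss)
    clf.fit(X_train_v, y_train_v)
    # Model selection uses the validation split, never the test split.
    val_acc = clf.score(X_val_v, y_val_v)
    if val_acc > best_val_acc:
        best_params = {"alpha": alpha, "penalty": penalty, "loss": loss}
        best_val_acc, best_clf = val_acc, clf

# The test split is scored once, with the selected model.
test_acc = best_clf.score(X_test_v, y_test_v)
print(best_params)
print(f"Validation accuracy: {best_val_acc:.3f}, test accuracy: {test_acc:.3f}")
```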
Here we visualize the SGD model's confusion matrix and the features it weights most heavily. The SGD model emphasizes the byte-count features, which may not be aligned with current principles of network engineering.
plt.figure(figsize=(10, 8))
plt.imshow(conf_matrix, interpolation='nearest', cmap=plt.cm.Blues)
plt.title('Confusion Matrix')
plt.colorbar()
tick_marks = range(len(class_labels))
plt.xticks(tick_marks, class_labels, rotation=45)
plt.yticks(tick_marks, class_labels)
for i in range(len(conf_matrix)):
    for j in range(len(conf_matrix[i])):
        plt.text(j, i, f'{conf_matrix[i][j]}', ha='center', va='center',
                 color='white' if conf_matrix[i][j] > conf_matrix.max() / 2 else 'black')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.show()
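# Note: sgd_model is the last model fitted in the grid loop, which is not necessarily best_model.
# coef_ has one row per class (one-vs-rest); coef_[0] is the row for the first class.
feature_importance = sgd_model.coef_[0]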
print(feature_importance)
## [-0.12337887 -0.18689351 1.17010968 0.54045514 5.27583253 -3.29085875
## 9.61478434 0. 3.95621064 0. 0. ]
# Create a custom variable importance graph
plt.figure(figsize=(10, 6))
plt.bar(range(len(feature_importance)), feature_importance)
plt.xticks(range(len(feature_importance)), X.columns, rotation=45, ha='right')
plt.xlabel('Features')
plt.ylabel('Feature Importance (Coefficient)')
plt.title('Feature Importance for SGD Classifier')
plt.tight_layout()
plt.show()
While both models demonstrated high accuracy, the SVM model stands out for its logical coherence and higher accuracy (99.14% versus 98.19%). Its emphasis on the number of packets sent and received, rather than the byte counts of those packets, not only yielded a more effective model but also aligns more closely with principles of network engineering.
Looking ahead, we aim to extend our experimentation to a broader dataset and to evaluate the model’s performance in a real-world setting. Additionally, exploring alternative modeling approaches, including Neural Networks, will be part of our future research efforts.