Description

The CICIDS2017 dataset was generated by the Canadian Institute for Cybersecurity at the University of New Brunswick. It is a labeled, bidirectional, flow-based dataset containing both benign and malicious instances of computer network activity. It includes the results of network traffic analysis performed with CICFlowMeter, with each flow labeled according to its timestamp, source and destination IPs, source and destination ports, protocol, and attack type.

Find the dataset here: https://iscxdownloads.cs.unb.ca/iscxdownloads/CIC-IDS-2017/#CI-IDS-2017

*NOTE:

  1. All non-benign data will be referred to as “malicious”.

  2. R-Markdown produces an error when trying to render plot #2 (“Compare Forward and Backward Packet Length”). I have included a saved copy of the visualization and commented out the code that produces the plot.

  3. Some pre-processing of the data is not shown here, such as combining the multiple files into one .csv file, removing incomplete data, and creating the three “port_category” features (a rough sketch of these steps appears after the removed-data breakdown below).

  4. Certain network flow instances in the dataset were “incomplete”, meaning one or more of their measurements was less than 0.0. I removed these instances. Here is a breakdown of the removed data by label:

REMOVED DATA

  • BENIGN (1321662)
  • DoS Hulk (67219)
  • DDoS (46547)
  • DoS Slowhttptest (3172)
  • DoS GoldenEye (2584)
  • DoS slowloris (1642)
  • FTP-Patator (1498)
  • PortScan (64)
  • Web Attack – XSS (17)
  • SSH-Patator (14)
  • Infiltration (4)
  • Heartbleed (4)

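For reference, the omitted pre-processing roughly follows the steps below. This is only a minimal sketch, not the exact script I used: the input file pattern, the snake_case column renaming, the output file name, and the port-range boundaries are assumptions based on the notes above.

import glob
import numpy as np
import pandas as pd

# combine the per-day CICIDS2017 CSV files into a single DataFrame
# (the file pattern is a placeholder -- use whatever files you downloaded)
files = glob.glob("CIC-IDS-2017/*.csv")
df = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)

# normalize column names to snake_case,
# e.g. " Fwd Packet Length Mean" -> "fwd_packet_length_mean"
df.columns = [c.strip().lower().replace(" ", "_").replace("/", "_") for c in df.columns]

# drop "incomplete" flows: any row with a numeric measurement below 0.0 (or missing)
numeric_cols = df.select_dtypes(include=[np.number]).columns
df = df[(df[numeric_cols] >= 0).all(axis=1)]

# create the three one-hot "port_category" features from the destination port
df["port_category_0:1023"] = (df["destination_port"] <= 1023).astype(int)
df["port_category_1024:49151"] = df["destination_port"].between(1024, 49151).astype(int)
df["port_category_49152:65535"] = (df["destination_port"] >= 49152).astype(int)

df.to_csv("ids_data-formatted.csv", index=False)
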
Objective

To gain a sense of the feasibility of discriminating benign network activity from malicious network activity.

Library imports

import numpy as np
import pandas as pd
from sklearn.decomposition import KernelPCA
import matplotlib.pyplot as plt
import warnings

warnings.filterwarnings("ignore")

Read in and modify data set

np.random.seed(1234)

PATH = "/Users/Micho/Desktop/Computer_Science/DS795-Data_Science_Project_Design/ids_data-formatted.csv"

df = pd.read_csv(PATH)

# balance the proportion of benign & malicious data by downsampling the benign class
malicious_indexes = df[df.label != "BENIGN"].index
benign_indexes = df[df.label == "BENIGN"].index
# randomly select the surplus benign rows so that benign and malicious counts match
extra_indexes = np.random.choice(benign_indexes, size=(len(benign_indexes)-len(malicious_indexes)), replace=False)
# feature columns: exclude the raw port and label columns (the one-hot port_category_* features are kept)
data_cols = list(filter(lambda col: col not in ['destination_port', 'port', 'port_category', 'label'], df.columns))
# set the surplus benign rows aside, then drop them from the working DataFrame
extra_benign_data = df.loc[extra_indexes, data_cols].copy()
extra_benign_labels = df.loc[extra_indexes, df.columns == 'label'].copy()
df.drop(extra_indexes, axis=0, inplace=True)

data = df.loc[:, data_cols].copy()
labels = df.loc[:, df.columns == 'label'].copy()

Data Set Shape - (Number of Rows, Number of Columns)

print(data.shape)
## (869762, 80)

View Network Activity Breakdown

# get name of each network activity type
x_coords = labels.label.value_counts().index
# get frequency of each network activity type
y_coords = labels.label.value_counts().values

fig = plt.figure(figsize=(12,12))
ax = fig.add_subplot(111)
_ = ax.barh(x_coords, y_coords)

# add text to end of each bar
for index, value in enumerate(y_coords):
    _ = ax.text(value + 1000, index, format(value, ","), color='black')
_ = plt.title("Distribution of Network Activity in CICIDS2017 Dataset", fontsize=18)
_ = plt.xlabel("Number of Instances", fontsize=14)
_ = plt.ylabel("Network Activity Label", fontsize=14)
# set x-tick positions
_ = plt.xticks(list(range(0, 600_000, 100_000)))
# format x-tick labels with commas for readability
_ = ax.set_xticklabels(["{:,}".format(x) for x in ax.get_xticks()])
#plt.show()

The distribution plot above reveals that certain attack types are highly underrepresented. A dataset in which the classes are not equally represented is called “imbalanced”. However, by treating all non-benign data as “malicious” and downsampling the benign class, the dataset becomes balanced. The downsampling performed in the section titled “Read in and modify data set” ensures an even split between the two class labels (i.e. “benign” and “malicious”). A quick check of this balance is sketched below.
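
As a small sanity check (not part of the original pipeline), the binary class balance can be verified directly from the labels DataFrame built above:

# count benign vs. non-benign (malicious) rows after the balancing step
is_malicious = labels.label != "BENIGN"
print(is_malicious.value_counts())
# the two counts should be equal and sum to data.shape[0]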

Compare Forward and Backward Packet Length

def get_outlier_bounds(x):
    # bounds are set at 1 * IQR beyond the quartiles (a stricter cutoff than the usual 1.5 * IQR)
    Q1, Q3 = np.quantile(x, [0.25, 0.75])
    IQR = Q3 - Q1
    lower_bound = Q1 - IQR
    upper_bound = Q3 + IQR

    return (lower_bound, upper_bound)

benign_indexes = labels[labels.label == "BENIGN"].index
malicious_indexes = labels[labels.label != "BENIGN"].index

benign_fwd_lower, benign_fwd_upper = get_outlier_bounds(data.loc[benign_indexes, "fwd_packet_length_mean"])
benign_fwd_filtered = data[(data.fwd_packet_length_mean >= benign_fwd_lower) &
                           (data.fwd_packet_length_mean <= benign_fwd_upper)]

malicious_fwd_lower, malicious_fwd_upper = get_outlier_bounds(data.loc[malicious_indexes, "fwd_packet_length_mean"])
malicious_fwd_filtered = data[(data.fwd_packet_length_mean >= malicious_fwd_lower) &
                              (data.fwd_packet_length_mean <= malicious_fwd_upper)]

benign_bwd_lower, benign_bwd_upper = get_outlier_bounds(data.loc[benign_indexes, "bwd_packet_length_mean"])
benign_bwd_filtered = data[(data.bwd_packet_length_mean >= benign_bwd_lower) &
                           (data.bwd_packet_length_mean <= benign_bwd_upper)]

malicious_bwd_lower, malicious_bwd_upper = get_outlier_bounds(data.loc[malicious_indexes, "bwd_packet_length_mean"])
malicious_bwd_filtered = data[(data.bwd_packet_length_mean >= malicious_bwd_lower) &
                              (data.bwd_packet_length_mean <= malicious_bwd_upper)]

#fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, sharey='row', figsize=(16,10))

# reduce horizontal space between plots
#plt.subplots_adjust(wspace=0.05)

#ax1.hist(benign_fwd_filtered.loc[benign_indexes, "fwd_packet_length_mean"],
#             label="Benign Forward", ec="black", color="aliceblue", bins=15)
#ax2.hist(malicious_fwd_filtered.loc[malicious_indexes, "fwd_packet_length_mean"],
#             label="Malicious Forward", ec="black", color="mistyrose", bins=15)
#ax3.hist(benign_bwd_filtered.loc[benign_indexes, "bwd_packet_length_mean"],
#             label="Benign Backward", ec="black", color="steelblue", bins=15)
#ax4.hist(malicious_bwd_filtered.loc[malicious_indexes, "bwd_packet_length_mean"],
#             label="Malicious Backward", ec="black", color="indianred", bins=15)

# format y-tick labels with commas for readability
#ax1.set_yticklabels(["{:,}".format(int(x)) for x in ax1.get_yticks()])
#ax3.set_yticklabels(["{:,}".format(int(x)) for x in ax3.get_yticks()])

# include grids for added readability
#ax1.grid()
#ax2.grid()
#ax3.grid()
#ax4.grid()

#fig.suptitle('HISTOGRAM COMPARISON:\nAverage Forward & Backward Packet Length\nBenign Activity vs. Malicious Activity', fontsize=20)
#fig.text(0.5, 0.05, 'Average Packet Length (in bytes)', ha='center', fontsize=16)
#fig.text(0.05, 0.5, 'Frequency', ha='center', fontsize=16, rotation=90)
#fig.legend(fontsize=12, ncol=2)
#plt.show()

Prior to generating this visualization, I removed outliers for each measurement. According to the plot, there is no obvious distinction between benign and malicious activity: both classes are right-skewed for each of the two features. The main difference is that malicious activity looks slightly bimodal, while benign activity is unimodal and right-skewed for both average forward and backward packet length.
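
The “right-skewed” observation can also be checked numerically. Below is a minimal sketch (not part of the original notebook, and assuming SciPy is available) that computes the sample skewness of the benign forward packet lengths within the outlier bounds; the same pattern applies to the other three distributions.

from scipy.stats import skew

# benign forward packet lengths, restricted to the outlier bounds computed above
benign_fwd = data.loc[benign_indexes, "fwd_packet_length_mean"]
benign_fwd = benign_fwd[(benign_fwd >= benign_fwd_lower) & (benign_fwd <= benign_fwd_upper)]

# a positive value indicates right skew
print(skew(benign_fwd))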

Port Usage Comparison

fig = plt.figure(figsize=(10,10))
ax = fig.add_subplot(111)

port_cols = ["port_category_0:1023", "port_category_1024:49151", "port_category_49152:65535"]
benign_ports = data.loc[benign_indexes, port_cols]
malicious_ports = data.loc[malicious_indexes, port_cols]

# get rid of rows with missing values
benign_ports.dropna(inplace=True)
malicious_ports.dropna(inplace=True)

# sum each port category column
benign_ports = benign_ports.sum(axis=0).astype("int")
malicious_ports = malicious_ports.sum(axis=0).astype("int")

indexes = np.arange(len(port_cols))
width = 0.4
rect1 = ax.bar(indexes, benign_ports.values, width, color="steelblue", label="benign")
rect2 = ax.bar(indexes + width, malicious_ports.values, width, color="indianred", label="malicious")

def add_text(rect):
    # add text to top of each bar
    for r in rect:
        h = r.get_height()
        ax.text(r.get_x() + r.get_width()/2, h*1.01, s=format(h, ",") ,fontsize=8, ha='center', va='bottom')

add_text(rect1)
add_text(rect2)
_ = ax.set_xticks(indexes + width / 2)
_ = ax.set_xticklabels(["1 - 1,023", "1,024 - 49,151", "49,152 - 65,535"])
_ = ax.set_yticklabels(["{:,}".format(int(x)) for x in ax.get_yticks()])

_ = fig.suptitle('Distribution of Port Usage\nAccording to Network Activity Type', fontsize=20)
_ = fig.text(0.5, 0.05, 'Port Range', ha='center', fontsize=16)
_ = fig.text(0.01, 0.5, 'Frequency', ha='center', fontsize=16, rotation=90)
_ = fig.legend(fontsize=12)

plt.show()

There is a drastic difference in the use of the last two port categories (“1,024 - 49,151” and “49,152 - 65,535”) when comparing benign and malicious activity. Malicious activity occurs on ports 1,024 - 49,151 more than twice as often as benign activity, while benign activity occurs on ports 49,152 - 65,535 over ten times as often as malicious activity. These ratios can be computed directly from the per-category sums built for the plot, as sketched below.
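
A minimal sketch of that calculation, using the benign_ports and malicious_ports Series from the previous code block:

# ratio of malicious to benign flows in each port category
print(malicious_ports / benign_ports)
# ratio of benign to malicious flows in each port category
print(benign_ports / malicious_ports)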

Two-dimensional Visualization of CICIDS2017 Dataset

# NOTE: data transformation will depend on the value passed to np.random.seed()

# select random sample from data
random_indexes = list(np.random.choice(data.index, size=2000, replace=False))
# transform selected data using KPCA
KPCA = KernelPCA(n_components=2, kernel="linear")
data_transformed = KPCA.fit_transform(data.loc[random_indexes, :])
# grab the respective labels and reset index
random_labels = labels.loc[random_indexes, :]
random_labels.reset_index(drop=True, inplace=True)

# determine which of the transformed data is benign and which is malicious
random_benign_indexes = random_labels[random_labels.label == "BENIGN"].index
random_malicious_indexes = random_labels[random_labels.label != "BENIGN"].index
benign_data_transformed = data_transformed[random_benign_indexes]
malicious_data_transformed = data_transformed[random_malicious_indexes]

BENIGN = {0: benign_data_transformed[:, 0],
          1: benign_data_transformed[:, 1]}

MALICIOUS = {0: malicious_data_transformed[:, 0],
             1: malicious_data_transformed[:, 1]}

fig = plt.figure(figsize=(12,12))
ax = fig.add_subplot(111)
_ = ax.scatter(BENIGN[0], BENIGN[1], label="Benign")
_ = ax.scatter(MALICIOUS[0], MALICIOUS[1], label="Malicious")
_ = fig.suptitle('Scatter Plot of Random Data Subset\nAfter Kernel PCA Dimensionality Reduction', fontsize=20)
_ = fig.text(0.5, 0.05, 'Principal Component 1\n(scale: 10^8)', ha='center', fontsize=16)
_ = fig.text(0.02, 0.5, 'Principal Component 2\n(scale: 10^8)', ha='center', fontsize=16, rotation=90)
_ = fig.legend(fontsize=16)

plt.show()

The plot above shows some evident separability between benign and malicious activity, which supports the feasibility of classifying network activity as benign or malicious. Kernel PCA (here with a linear kernel, which is equivalent to standard PCA) lets us visualize a sample of the 80-dimensional dataset in two dimensions. However, this visualization uses only a 2,000-instance sample and may not be representative of the entire dataset. One rough way to quantify the apparent separability is sketched below.
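
A simple classifier fit on the two principal components gives a rough measure of that separability. This is only an illustrative sketch (not part of the original analysis) that reuses the data_transformed array and random_labels DataFrame from above; accuracy on such a small sample should be read loosely.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# binary target for the same 2,000-instance sample: 1 = malicious, 0 = benign
y = (random_labels.label != "BENIGN").astype(int).values

X_train, X_test, y_train, y_test = train_test_split(
    data_transformed, y, test_size=0.3, random_state=1234, stratify=y)

# scale the components before fitting, since they are on the order of 10^8
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X_train, y_train)
# accuracy well above 0.5 would suggest the classes are separable in this 2-D projection
print("test accuracy:", clf.score(X_test, y_test))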


Conclusion

The visualizations reveal that benign and malicious data have noticeable differences, supporting the suspicion that the two classes can be discriminated better than random guessing.