Malicious URLs have existed for much of the internet's history and remain a highly effective attack vector today. The prevalence of hyperlinks allows these URLs to be concealed from casual observers, and even when they are visible, the structure of URLs is not widely understood by the general public. Attackers exploit our assumptions about URL format to mimic popular sites and entrap unaware users. Given the highly structured nature of URLs, detecting malicious links is a problem well suited to machine learning techniques. We will explore the application of random forests and neural networks to this problem space.
The dataset consists of just shy of 50,000 URLs, each labeled as "benign", "defacement", "phishing", or "malware". While the exact nature of an attack is interesting when analysing common attack vectors, a predictive model is more useful if it can simply determine whether a link is benign or malicious. For this reason we reclassify the data into benign and malicious links and treat this as a binary classification problem.
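The code in this section assumes the following packages are attached, presumably in a setup chunk not shown here; they are listed as a best guess for reference:

library(tidyverse)   # readr, dplyr, forcats, stringr, ggplot2, tibble
library(urltools)    # scheme(), domain(), path(), port()
library(caret)       # createDataPartition(), train(), nearZeroVar(), varImp()
library(patchwork)   # wrap_plots() and plot composition with /
library(kableExtra)  # kbl(), kable_styling(), cell_spec()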
# Read the raw CSV: an integer id, the URL as character, and the label as a factor
raw_data <- read_delim("malicious_url.csv", delim = ",", col_types = "icf") |>
  rename(id = 1)
After reclassifying the data we can see that while there is a class imbalance, it is not particularly extreme. We will keep this in mind while training and evaluating models and address it if necessary.
b_class_data <- raw_data |>
  mutate(type = fct_collapse(type,
    "benign" = "benign",
    "malicious" = c("phishing", "defacement", "malware")
  ))

ggplot(b_class_data, aes(x = type, after_stat(count))) +
  geom_bar() +
  labs(title = "URL Class Distribution")
As the input data is raw text, we will need to perform feature engineering before training the models. Sophisticated architectures such as convolutional or transformer networks can learn features directly from raw text; for this project, however, a simple feed-forward neural network with a single hidden layer will be used, both for simplicity and so that the two models can be compared on the same engineered features.
We will therefore construct the following features, on which both models will be trained:
extract_url_features <- function(urls) {
  # Helper: take everything after the last "." of the domain as the TLD
  get_tld <- function(url) {
    domain_part <- domain(url)
    if (is.na(domain_part) || domain_part == "") return(NA)
    parts <- strsplit(domain_part, "\\.")[[1]]
    if (length(parts) > 0) return(parts[length(parts)])
    return(NA)
  }

  features <- data.frame(
    # Simple character counts
    url_length = nchar(urls),
    num_dots = str_count(urls, "\\."),
    num_slashes = str_count(urls, "/"),
    num_hyphens = str_count(urls, "-"),
    num_question_marks = str_count(urls, "\\?"),
    num_equals = str_count(urls, "="),
    num_ampersands = str_count(urls, "&"),
    num_semicolon = str_count(urls, ";"),
    # Structural features parsed with urltools and regular expressions
    scheme = scheme(urls),
    has_ip = as.integer(grepl("^https?://\\d+\\.\\d+\\.\\d+\\.\\d+", urls)),
    has_fragment = as.integer(grepl("#", urls)),
    domain_length = nchar(domain(urls)),
    subdomain_count = lengths(regmatches(urls, gregexpr("\\.", gsub("^https?://", "", urls)))),
    path_length = nchar(path(urls)),
    port_number = port(urls),
    num_params = lengths(regmatches(urls, gregexpr("&|;", urls))),
    avg_token_length = apply(str_split(urls, "/", simplify = TRUE), 1, function(x) mean(nchar(x[x != ""]))),
    tld = sapply(urls, get_tld)
  )

  return(features)
}
feature_matrix <- extract_url_features(b_class_data$url)
# Combine features with labels
model_data <- cbind(feature_matrix, type = as.factor(b_class_data$type))
After creating the features, we can see that several have large numbers of missing values. A new level of scheme will be added to indicate that none was present, missing path lengths will be set to 0, and URLs with a missing port_number will have that value set to -1, since 0 is technically a valid port.
model_data |>
summarise(across(everything(), ~ sum(is.na(.x))))
## url_length num_dots num_slashes num_hyphens num_question_marks num_equals
## 1 0 0 0 0 0 0
## num_ampersands num_semicolon scheme has_ip has_fragment domain_length
## 1 0 0 35192 0 0 0
## subdomain_count path_length port_number num_params avg_token_length tld type
## 1 0 6708 49736 0 0 0 0
Checking for the number of distinct values we can see that there are reasonable ranges for all predictors, and none have just a single class or value.
model_data |>
summarise(across(everything(), ~ n_distinct(.x)))
## url_length num_dots num_slashes num_hyphens num_question_marks num_equals
## 1 195 19 18 34 5 15
## num_ampersands num_semicolon scheme has_ip has_fragment domain_length
## 1 14 8 3 2 2 88
## subdomain_count path_length port_number num_params avg_token_length tld type
## 1 19 188 10 16 820 208 2
model_data_clean <- model_data |>
  mutate(
    scheme = replace_na(scheme, "none"),
    path_length = replace_na(path_length, 0),
    port_number = replace_na(port_number, "-1") # 0 is a valid port number
  )
Validating that all missing values have been appropriately addressed:
model_data_clean |>
summarise(across(everything(), ~ sum(is.na(.x))))
## url_length num_dots num_slashes num_hyphens num_question_marks num_equals
## 1 0 0 0 0 0 0
## num_ampersands num_semicolon scheme has_ip has_fragment domain_length
## 1 0 0 0 0 0 0
## subdomain_count path_length port_number num_params avg_token_length tld type
## 1 0 0 0 0 0 0 0
The scheme, port_number, and tld fields should be converted to factors. In each of these cases the field acts as a label, and is therefore better represented as a factor.
model_data_clean$scheme <- as.factor(model_data_clean$scheme)
model_data_clean$port_number <- as.factor(model_data_clean$port_number)
model_data_clean$tld <- as.factor(model_data_clean$tld)
Several predictors exhibit near-zero variance. These predictors would cause issues when training the neural network and will be excluded from that model, but they will be retained for the random forest.
nzv <- nearZeroVar(model_data_clean, saveMetrics = TRUE)
print(nzv[nzv$nzv, ])
## freqRatio percentUnique zeroVar nzv
## num_ampersands 24.13725 0.028140704 FALSE TRUE
## num_semicolon 751.71212 0.016080402 FALSE TRUE
## has_ip 2925.47059 0.004020101 FALSE TRUE
## has_fragment 16582.33333 0.004020101 FALSE TRUE
## port_number 9947.20000 0.020100503 FALSE TRUE
## num_params 24.18662 0.032160804 FALSE TRUE
Looking at the distributions of the categorical features with meaningful variance, we can see that a large number of URLs in the dataset have no scheme specified. This is unexpected and may be a byproduct of the data collection methods. When no scheme is specified, http has traditionally been the default, but many modern browsers now upgrade such requests to https. As there is no single standard applied in practice, the missingness will be retained as its own level.
model_data_clean |>
  group_by(scheme) |>
  count() |>
  ggplot(aes(x = scheme, y = n)) +
  geom_bar(stat = "identity") +
  labs(y = "Count", x = "Scheme")
There are over 200 top-level domains present, but the most common ones account for the vast majority of the data. That said, there is a long tail, with roughly a fifth of the entries falling outside the five most common domains. While it may at first appear sensible to reduce the number of levels, we will not do so (see the sketch after the plot below): the selected model types handle the feature gracefully, and if these models were applied in practice and trained on a larger dataset, the feature would only become more informative.
tld_dist <- model_data_clean |>
  count(tld, name = "count") |>
  arrange(desc(count)) |>
  mutate(tld = if_else(row_number() <= 5,
                       as.character(tld),
                       "Other")) |>
  group_by(tld) |>
  summarise(count = sum(count)) |>
  arrange(desc(count)) |>
  mutate(tld = factor(tld, levels = c(tld[tld != "Other"], "Other")))

ggplot(tld_dist, aes(x = tld, y = count)) +
  geom_bar(stat = "identity") +
  labs(y = "count", x = "Top Level Domain")
The distributions of the remaining variables largely fall in line with what one would expect from a text data source, with long tails in each field. The bump at the upper end of url_length likely indicates that the data collection method capped the length of the URLs selected.
remaining_dist <- model_data_clean |>
  select(!c(tld, scheme, num_ampersands, num_semicolon, has_ip, has_fragment, port_number, num_params, type))

# Build one histogram per remaining numeric feature
hist_plots <- lapply(names(remaining_dist), function(col) {
  ggplot(remaining_dist, aes(x = .data[[col]])) +
    geom_histogram() +
    theme_minimal() +
    labs(y = "")
})

wrap_plots(hist_plots, ncol = 2)
We split the data into an 80/20 train/test split and validate that the distribution of the target class matches that of the original dataset.
set.seed(8675309)

train_index <- createDataPartition(model_data_clean$type, p = 0.8, list = FALSE)
train_data <- model_data_clean[train_index, ]
test_data <- model_data_clean[-train_index, ]

og_data <- ggplot(model_data_clean, aes(x = type, after_stat(count))) +
  geom_bar() +
  labs(title = "Full Dataset",
       x = NULL)

train_split <- ggplot(train_data, aes(x = type, after_stat(count))) +
  geom_bar() +
  labs(title = "Training Split",
       x = NULL)

test_split <- ggplot(test_data, aes(x = type, after_stat(count))) +
  geom_bar() +
  labs(title = "Testing Split")

dist_combined <- og_data / train_split / test_split +
  plot_layout(axis_titles = "collect", axes = "collect")

dist_combined
The Random Forest model is trained with the following hyperparameter ranges:
set.seed(8675309)

model_control <- trainControl(
  method = "cv",
  number = 3,
  savePredictions = "final",
  classProbs = TRUE,
  summaryFunction = twoClassSummary,
  verboseIter = TRUE
)

rf_grid <- expand.grid(
  mtry = c(2, 4, 8, 16),
  splitrule = c("gini", "extratrees"),
  min.node.size = c(1, 5, 10)
)

rf_model <- train(
  type ~ .,
  data = train_data,
  method = "ranger",
  trControl = model_control,
  metric = "ROC",
  tuneGrid = rf_grid,
  importance = "impurity"
)

saveRDS(rf_model, "rf_model_temp.rds")
The optimal model had the parameters mtry = 16, splitrule = gini, and min.node.size = 10.
rf_model <- readRDS("rf_model_temp.rds")
print(rf_model)
## Random Forest
##
## 39801 samples
## 18 predictor
## 2 classes: 'malicious', 'benign'
##
## No pre-processing
## Resampling: Cross-Validated (3 fold)
## Summary of sample sizes: 26534, 26534, 26534
## Resampling results across tuning parameters:
##
## mtry splitrule min.node.size ROC Sens Spec
## 2 gini 1 0.9554029 0.06816690 0.9999657
## 2 gini 5 0.9556690 0.06601031 0.9999657
## 2 gini 10 0.9529803 0.13530239 0.9998970
## 2 extratrees 1 0.9370825 0.00000000 1.0000000
## 2 extratrees 5 0.9395915 0.00000000 1.0000000
## 2 extratrees 10 0.9462491 0.00000000 1.0000000
## 4 gini 1 0.9823937 0.57627754 0.9887768
## 4 gini 5 0.9825748 0.57121425 0.9890513
## 4 gini 10 0.9826722 0.57018284 0.9897721
## 4 extratrees 1 0.9647774 0.56502579 0.9871293
## 4 extratrees 5 0.9648596 0.54861697 0.9878501
## 4 extratrees 10 0.9625595 0.54542897 0.9880217
## 8 gini 1 0.9961882 0.83938115 0.9921746
## 8 gini 5 0.9962418 0.86657290 0.9923806
## 8 gini 10 0.9959880 0.84528833 0.9920030
## 8 extratrees 1 0.9834999 0.76024379 0.9838688
## 8 extratrees 5 0.9836320 0.75227379 0.9843493
## 8 extratrees 10 0.9836129 0.74533521 0.9841433
## 16 gini 1 0.9990598 0.97571496 0.9959500
## 16 gini 5 0.9990241 0.97402719 0.9956068
## 16 gini 10 0.9990629 0.97562119 0.9956755
## 16 extratrees 1 0.9952287 0.87229255 0.9930327
## 16 extratrees 5 0.9954534 0.87201125 0.9928611
## 16 extratrees 10 0.9952647 0.87426160 0.9929297
##
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 16, splitrule = gini
## and min.node.size = 10.
rf_predictions <- predict(rf_model, newdata = test_data)
rf_probs <- predict(rf_model, newdata = test_data, type = "prob")
Examining the training results provides several interesting insights. Models splitting with the Gini criterion consistently outperformed those using extra trees. The number of variables considered at each split (mtry) had a pronounced impact on performance, while the minimum node size had relatively little effect, and what effect it had diminished as more variables were included. ROC started high and remained very good, whereas sensitivity began almost non-existent and improved dramatically as mtry increased. Specificity showed some unusual behavior, starting with a few perfect scores before dipping slightly and recovering. Note that the y-axis ranges for ROC and specificity are very narrow, as neither metric changed much from model to model.
rf_training_results <- rf_model$results |>
  mutate(min.node.size = factor(min.node.size, levels = c("1", "5", "10"))) # Needed to allow for discrete scaling in visualization

rf_roc <- ggplot(rf_training_results, aes(x = mtry, y = ROC)) +
  geom_point(aes(size = min.node.size, fill = splitrule), position = position_dodge(width = 0.5), shape = 21, color = "black", alpha = 0.8) +
  scale_size_manual(values = c("1" = 1, "5" = 2, "10" = 3), name = "Min Node Size") +
  scale_fill_discrete(name = "Split Rule") +
  labs(title = "Random Forest Training Performance",
       x = NULL) +
  theme(legend.position = "bottom")

rf_sens <- ggplot(rf_training_results, aes(x = mtry, y = Sens)) +
  geom_point(aes(size = min.node.size, fill = splitrule), position = position_dodge(width = 0.5), shape = 21, color = "black", alpha = 0.8) +
  scale_size_manual(values = c("1" = 1, "5" = 2, "10" = 3), name = "Min Node Size") +
  scale_fill_discrete(name = "Split Rule") +
  labs(x = NULL, y = "Sensitivity") +
  theme(legend.position = "none")

rf_spec <- ggplot(rf_training_results, aes(x = mtry, y = Spec)) +
  geom_point(aes(size = min.node.size, fill = splitrule), position = position_dodge(width = 0.5), shape = 21, color = "black", alpha = 0.8) +
  scale_size_manual(values = c("1" = 1, "5" = 2, "10" = 3), name = "Min Node Size") +
  scale_fill_discrete(name = "Split Rule") +
  labs(x = "mtry (Number of variables)", y = "Specificity") +
  theme(legend.position = "none")

combined_plot <- rf_roc / rf_sens / rf_spec +
  plot_layout(guides = "collect") &
  theme(legend.position = "bottom")

combined_plot +
  plot_annotation(
    caption = "Models using Gini impurity consistently outperformed those using extra trees. Node size had a relatively small impact on performance, especially as the number of variables increased."
  )
# Assemble predictions, observed labels, and class probabilities for evaluation.
# imbalanced_summary() is a custom helper defined earlier in the report (not shown in this section).
rf_eval_data <- data.frame(pred = rf_predictions, obs = test_data$type, benign = rf_probs$benign, malicious = rf_probs$malicious)
rf_summary <- imbalanced_summary(data = rf_eval_data)
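As a rough sketch of the kind of metrics imbalanced_summary() reports (AUC-ROC, AUC-PR, recall, specificity, precision, F1, and G-mean), an equivalent could be assembled from caret and MLmetrics as below. The function name and package choices here are assumptions, not the report's actual implementation:

# Sketch only: one way to compute the reported metrics for an imbalanced binary problem.
# Assumes `data` has columns pred and obs (factors) plus a probability column named
# after the positive class, mirroring rf_eval_data above.
imbalanced_summary_sketch <- function(data, positive = "malicious") {
  cm <- caret::confusionMatrix(data$pred, data$obs, positive = positive)
  truth <- as.integer(data$obs == positive)
  probs <- data[[positive]]
  sens <- unname(cm$byClass["Sensitivity"])
  spec <- unname(cm$byClass["Specificity"])
  c(
    `AUC-ROC`   = MLmetrics::AUC(y_pred = probs, y_true = truth),
    `AUC-PR`    = MLmetrics::PRAUC(y_pred = probs, y_true = truth),
    Recall      = sens,
    Specificity = spec,
    Precision   = unname(cm$byClass["Precision"]),
    F1          = unname(cm$byClass["F1"]),
    GMean       = sqrt(sens * spec)
  )
}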
Model performance on the test set was impressive, with the excellent training performance carrying over. This gives us confidence that the performance is genuine and not a product of overfitting.
rf_summary |>
t() |>
as.data.frame() |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kbl(caption = "Random Forest Performance on Test Set") |>
kable_styling(bootstrap_options = "condensed")
| AUC-ROC | AUC-PR | Recall | Specificity | Precision | F1 | GMean |
|---|---|---|---|---|---|---|
| 0.999 | 0.9995 | 0.9966 | 0.9812 | 0.9932 | 0.9949 | 0.9889 |
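For the raw counts behind these rates, caret's confusionMatrix() can be applied directly to the test-set predictions (a quick check; output not shown here):

confusionMatrix(rf_predictions, test_data$type, positive = "malicious")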
While the same features are used for both models, the neural network requires additional pre-processing: the numeric data is centered and scaled, and near-zero-variance predictors are removed, as they caused failures during model training. Hyperparameter tuning was conducted over the following values:
set.seed(8675309)

preprocess_steps <- c("center", "scale", "nzv")

nn_grid <- expand.grid(
  size = c(5, 10, 15),
  decay = c(0.001, 0.01, 0.1)
)

nn_model <- train(
  type ~ .,
  data = train_data,
  method = "nnet",
  preProcess = preprocess_steps,
  trControl = model_control,
  metric = "ROC",
  tuneGrid = nn_grid,
  maxit = 500
)

saveRDS(nn_model, "nn_model.rds")
The best-performing model had size = 15 and a decay of 0.1.
nn_model <- readRDS("nn_model.rds")
print(nn_model)
## Neural Network
##
## 39801 samples
## 18 predictor
## 2 classes: 'malicious', 'benign'
##
## Pre-processing: centered (13), scaled (13), remove (220)
## Resampling: Cross-Validated (3 fold)
## Summary of sample sizes: 26534, 26534, 26534
## Resampling results across tuning parameters:
##
## size decay ROC Sens Spec
## 5 0.001 0.9981025 0.9712143 0.9891200
## 5 0.010 0.9981742 0.9708392 0.9917284
## 5 0.100 0.9978598 0.9718706 0.9917284
## 10 0.001 0.9979660 0.9764651 0.9936848
## 10 0.010 0.9985744 0.9788092 0.9950233
## 10 0.100 0.9987725 0.9783404 0.9943026
## 15 0.001 0.9985812 0.9797468 0.9946458
## 15 0.010 0.9986305 0.9801219 0.9954009
## 15 0.100 0.9989196 0.9814346 0.9948861
##
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were size = 15 and decay = 0.1.
nn_predictions <- predict(nn_model, newdata = test_data)
nn_probs <- predict(nn_model, newdata = test_data, type = "prob")
Model performance for the neural network varied little across the tuning grid, with the metrics essentially identical until the third decimal place. As a general trend, the size of the hidden layer had more effect than the decay rate, but as noted, neither had much impact overall.
nn_training_results <- nn_model$results |>
  mutate(decay = factor(decay, levels = c("0.001", "0.01", "0.1"))) # Needed to allow for discrete scaling in visualization

nn_roc <- ggplot(nn_training_results, aes(x = size, y = ROC)) +
  geom_point(aes(size = decay), position = position_dodge(width = 0.5), color = "black", alpha = 0.8) +
  scale_size_manual(values = c("0.001" = 1, "0.01" = 2, "0.1" = 3), name = "Decay Rate") +
  labs(title = "Neural Network Training Performance",
       x = NULL) +
  theme(legend.position = "bottom")

nn_sens <- ggplot(nn_training_results, aes(x = size, y = Sens)) +
  geom_point(aes(size = decay), position = position_dodge(width = 0.5), color = "black", alpha = 0.8) +
  scale_size_manual(values = c("0.001" = 1, "0.01" = 2, "0.1" = 3), name = "Decay Rate") +
  labs(x = NULL, y = "Sensitivity") +
  theme(legend.position = "none")

nn_spec <- ggplot(nn_training_results, aes(x = size, y = Spec)) +
  geom_point(aes(size = decay), position = position_dodge(width = 0.5), color = "black", alpha = 0.8) +
  scale_size_manual(values = c("0.001" = 1, "0.01" = 2, "0.1" = 3), name = "Decay Rate") +
  labs(x = "Size of Hidden Layer") +
  theme(legend.position = "none")

nn_combined_plot <- nn_roc / nn_sens / nn_spec +
  plot_layout(guides = "collect") &
  theme(legend.position = "bottom")

nn_combined_plot +
  plot_annotation(
    caption = "Model performance varied very little as tuning parameters changed, with 3-4 decimal places needed before differences appear."
  )
nn_eval_data <- data.frame(pred = nn_predictions, obs = test_data$type, benign = nn_probs$benign, malicious = nn_probs$malicious)
nn_summary <- imbalanced_summary(data = nn_eval_data)
Performance again carried over to the test set, where the neural network performed about as well as the random forest.
nn_summary |>
t() |>
as.data.frame() |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kbl(caption = "Neural Network Performance on Test Set") |>
kable_styling(bootstrap_options = "condensed")
| AUC-ROC | AUC-PR | Recall | Specificity | Precision | F1 | GMean |
|---|---|---|---|---|---|---|
| 0.9992 | 0.9997 | 0.9962 | 0.9782 | 0.9921 | 0.9941 | 0.9872 |
Both models performed extremely well on the dataset: the neural network was slightly better on both AUC metrics, while the random forest was slightly better on the remaining metrics. For all practical purposes the models performed virtually identically, and either would be an excellent choice for identifying malicious URLs.
results_df <- data.frame(rf_summary, nn_summary) |>
  rename(`Neural Network` = nn_summary, `Random Forest` = rf_summary)

# Highlight the best value for each metric across the two models
results_df |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  t() |>
  as.data.frame() |>
  rownames_to_column(var = "Model") |>
  mutate(across(-Model, \(x) cell_spec(
    x,
    background = ifelse(x == max(x, na.rm = TRUE), "#FFFF99", "white")
  ))) |>
  kbl(caption = "Performance of Optimal Models on the Test Set", escape = FALSE) |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| Model | AUC-ROC | AUC-PR | Recall | Specificity | Precision | F1 | GMean |
|---|---|---|---|---|---|---|---|
| Random Forest | 0.999 | 0.9995 | 0.9966 | 0.9812 | 0.9932 | 0.9949 | 0.9889 |
| Neural Network | 0.9992 | 0.9997 | 0.9962 | 0.9782 | 0.9921 | 0.9941 | 0.9872 |
One benefit of the random forest model is its well-established method of determining variable importance. Examining the importances shows that most of the weight sits in the top 10 variables, with a rapid decay afterwards. The most important variables are largely numeric, tracking lengths and counts of URL components. Among the categorical predictors, the absence of a scheme is by far the strongest, with a tld of "com" being the only other categorical level of note.
importance <- varImp(rf_model, scale = TRUE)

importance$importance |>
  arrange(desc(Overall)) |>
  top_n(n = 15, wt = Overall) |>
  kbl() |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| Overall | |
|---|---|
| schemenone | 100.000000 |
| num_slashes | 60.869693 |
| num_dots | 23.464153 |
| subdomain_count | 22.235384 |
| path_length | 20.363533 |
| url_length | 17.028934 |
| num_params | 15.136738 |
| num_equals | 13.150670 |
| num_ampersands | 11.649123 |
| tldcom | 10.899555 |
| domain_length | 10.534880 |
| avg_token_length | 8.749679 |
| num_question_marks | 5.307672 |
| num_hyphens | 5.225182 |
| tldinfo | 3.857897 |
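The same importances can also be visualized directly; a brief sketch using the varImp object above:

# Sketch: bar chart of the top 15 scaled importances from the fitted random forest
importance$importance |>
  rownames_to_column(var = "feature") |>
  slice_max(Overall, n = 15) |>
  ggplot(aes(x = Overall, y = reorder(feature, Overall))) +
  geom_col() +
  labs(x = "Scaled importance", y = NULL, title = "Top 15 Predictors (Random Forest)")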
Given that both models performed extremely well, the random forest is the recommended model. Its simpler, more interpretable nature makes it more valuable, providing insights into the nature of attacks that can be used to engineer further relevant features. As these additional insights come at no meaningful loss of performance, the choice of model is an easy one.
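In a practical deployment, scoring a new URL would mean running it through the same feature pipeline and aligning the factor levels with those seen in training. The URL below is purely hypothetical and the level alignment is a sketch, not production code:

# Hypothetical example URL; not taken from the dataset
new_url <- "http://secure-login.example-payments.com/account/update?id=123&token=abc"

new_features <- extract_url_features(new_url) |>
  mutate(
    # Apply the same NA handling as above, then align factor levels with the training data
    scheme = factor(replace_na(scheme, "none"), levels = levels(model_data_clean$scheme)),
    path_length = replace_na(path_length, 0),
    port_number = factor(replace_na(port_number, "-1"), levels = levels(model_data_clean$port_number)),
    tld = factor(tld, levels = levels(model_data_clean$tld))
  )

# Class probabilities from the recommended random forest model
predict(rf_model, newdata = new_features, type = "prob")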