Malicious URLs have existed for much of the internet's history and remain a highly effective attack vector today. The prevalence of hyperlinks allows these URLs to be concealed from casual observers, and even when they are visible, the structure of URLs is not widely understood by the general public. Attackers exploit our assumptions about URL format to mimic popular sites and entrap unaware users. Given the highly structured nature of URLs, detecting malicious links is a problem well suited to machine learning techniques. We will explore the application of random forests and neural networks to this problem space.
The dataset consists of just shy of 50,000 URLs, each labeled as "benign", "defacement", "phishing", or "malware". While the exact nature of an attack is interesting when analysing common attack vectors, a predictive model is more useful if it can simply determine whether a link is benign or malicious. For this reason we reclassify the data into benign and malicious links and treat this as a binary classification problem.
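The code in this section assumes the following packages are attached, presumably in a setup chunk not shown here; they are listed as a best guess for reference:

library(tidyverse)   # readr, dplyr, forcats, stringr, ggplot2, tibble
library(urltools)    # scheme(), domain(), path(), port()
library(caret)       # createDataPartition(), train(), nearZeroVar(), varImp()
library(patchwork)   # wrap_plots() and plot composition with /
library(kableExtra)  # kbl(), kable_styling(), cell_spec()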
# Read the raw CSV: an integer id, the URL as character, and the label as a factor
raw_data <- read_delim("malicious_url.csv", delim = ",", col_types = "icf") |>
  rename(id = 1)
After reclassifying the data we can see that while there is a class imbalance, it is not particularly extreme. We will keep this in mind while training and evaluating models and address it if necessary.
b_class_data <- raw_data |>
  mutate(type = fct_collapse(type,
    "benign" = "benign",
    "malicious" = c("phishing", "defacement", "malware")
  ))

ggplot(b_class_data, aes(x = type, after_stat(count))) +
  geom_bar() +
  labs(title = "URL Class Distribution")
As the input data is raw text, we will need to perform feature engineering before training the models. Sophisticated architectures such as convolutional or transformer networks can learn features directly from raw text; for this project, however, a simple feed-forward neural network with a single hidden layer will be used, both for simplicity and so that the two models can be compared on the same engineered features.
We will therefore construct the following features, on which both models will be trained:
extract_url_features <- function(urls) {
  # Helper: take everything after the last "." of the domain as the TLD
  get_tld <- function(url) {
    domain_part <- domain(url)
    if (is.na(domain_part) || domain_part == "") return(NA)
    parts <- strsplit(domain_part, "\\.")[[1]]
    if (length(parts) > 0) return(parts[length(parts)])
    return(NA)
  }

  features <- data.frame(
    # Simple character counts
    url_length = nchar(urls),
    num_dots = str_count(urls, "\\."),
    num_slashes = str_count(urls, "/"),
    num_hyphens = str_count(urls, "-"),
    num_question_marks = str_count(urls, "\\?"),
    num_equals = str_count(urls, "="),
    num_ampersands = str_count(urls, "&"),
    num_semicolon = str_count(urls, ";"),
    # Structural features parsed with urltools and regular expressions
    scheme = scheme(urls),
    has_ip = as.integer(grepl("^https?://\\d+\\.\\d+\\.\\d+\\.\\d+", urls)),
    has_fragment = as.integer(grepl("#", urls)),
    domain_length = nchar(domain(urls)),
    subdomain_count = lengths(regmatches(urls, gregexpr("\\.", gsub("^https?://", "", urls)))),
    path_length = nchar(path(urls)),
    port_number = port(urls),
    num_params = lengths(regmatches(urls, gregexpr("&|;", urls))),
    avg_token_length = apply(str_split(urls, "/", simplify = TRUE), 1, function(x) mean(nchar(x[x != ""]))),
    tld = sapply(urls, get_tld)
  )

  return(features)
}
feature_matrix <- extract_url_features(b_class_data$url)
# Combine features with labels
model_data <- cbind(feature_matrix, type = as.factor(b_class_data$type))
After creating the features, we can see that several have large numbers of missing values. A new level of scheme will be added to indicate that none was present, missing path lengths will be set to 0, and URLs with a missing port_number will have that value set to -1, since 0 is technically a valid port.
model_data |>
summarise(across(everything(), ~ sum(is.na(.x))))
## url_length num_dots num_slashes num_hyphens num_question_marks num_equals
## 1 0 0 0 0 0 0
## num_ampersands num_semicolon scheme has_ip has_fragment domain_length
## 1 0 0 35192 0 0 0
## subdomain_count path_length port_number num_params avg_token_length tld type
## 1 0 6708 49736 0 0 0 0
Checking for the number of distinct values we can see that there are reasonable ranges for all predictors, and none have just a single class or value.
model_data |>
summarise(across(everything(), ~ n_distinct(.x)))
## url_length num_dots num_slashes num_hyphens num_question_marks num_equals
## 1 195 19 18 34 5 15
## num_ampersands num_semicolon scheme has_ip has_fragment domain_length
## 1 14 8 3 2 2 88
## subdomain_count path_length port_number num_params avg_token_length tld type
## 1 19 188 10 16 820 208 2
model_data_clean <- model_data |>
  mutate(
    scheme = replace_na(scheme, "none"),
    path_length = replace_na(path_length, 0),
    port_number = replace_na(port_number, "-1") # 0 is a valid port number
  )
Validating that all missing values have been appropriately addressed:
model_data_clean |>
summarise(across(everything(), ~ sum(is.na(.x))))
## url_length num_dots num_slashes num_hyphens num_question_marks num_equals
## 1 0 0 0 0 0 0
## num_ampersands num_semicolon scheme has_ip has_fragment domain_length
## 1 0 0 0 0 0 0
## subdomain_count path_length port_number num_params avg_token_length tld type
## 1 0 0 0 0 0 0 0
The scheme, port_number, and tld fields should be converted to factors. In each of these cases the field acts as a label, and is therefore better represented as a factor.
model_data_clean$scheme <- as.factor(model_data_clean$scheme)
model_data_clean$port_number <- as.factor(model_data_clean$port_number)
model_data_clean$tld <- as.factor(model_data_clean$tld)
Several predictors exhibit near-zero variance. These predictors would cause issues when training the neural network and will be excluded from that model, but they will be retained for the random forest.
nzv <- nearZeroVar(model_data_clean, saveMetrics = TRUE)
print(nzv[nzv$nzv, ])
## freqRatio percentUnique zeroVar nzv
## num_ampersands 24.13725 0.028140704 FALSE TRUE
## num_semicolon 751.71212 0.016080402 FALSE TRUE
## has_ip 2925.47059 0.004020101 FALSE TRUE
## has_fragment 16582.33333 0.004020101 FALSE TRUE
## port_number 9947.20000 0.020100503 FALSE TRUE
## num_params 24.18662 0.032160804 FALSE TRUE
Looking at the distributions of the categorical features with meaningful variance, we can see that a large number of URLs in the dataset have no scheme specified. This is unexpected and may be a byproduct of the data collection methods. When no scheme is specified, http has traditionally been the default, but many modern browsers now upgrade such requests to https. As there is no single standard applied in practice, the missingness will be retained as its own level.
model_data_clean |>
  group_by(scheme) |>
  count() |>
  ggplot(aes(x = scheme, y = n)) +
  geom_bar(stat = "identity") +
  labs(y = "Count", x = "Scheme")
There are over 200 top-level domains present, but the most common ones account for the vast majority of the data. That said, there is a long tail, with roughly a fifth of the entries falling outside the five most common domains. While it may at first appear sensible to reduce the number of levels, we will not do so (see the sketch after the plot below): the selected model types handle the feature gracefully, and if these models were applied in practice and trained on a larger dataset, the feature would only become more informative.
tld_dist <- model_data_clean |>
  count(tld, name = "count") |>
  arrange(desc(count)) |>
  mutate(tld = if_else(row_number() <= 5,
                       as.character(tld),
                       "Other")) |>
  group_by(tld) |>
  summarise(count = sum(count)) |>
  arrange(desc(count)) |>
  mutate(tld = factor(tld, levels = c(tld[tld != "Other"], "Other")))

ggplot(tld_dist, aes(x = tld, y = count)) +
  geom_bar(stat = "identity") +
  labs(y = "count", x = "Top Level Domain")
The distributions of the remaining variables largely fall in line with what one would expect from a text data source, with long tails in each field. The bump at the upper end of url_length likely indicates that the data collection method capped the length of the URLs selected.
remaining_dist <- model_data_clean |>
  select(!c(tld, scheme, num_ampersands, num_semicolon, has_ip, has_fragment, port_number, num_params, type))

# Build one histogram per remaining numeric feature
hist_plots <- lapply(names(remaining_dist), function(col) {
  ggplot(remaining_dist, aes(x = .data[[col]])) +
    geom_histogram() +
    theme_minimal() +
    labs(y = "")
})

wrap_plots(hist_plots, ncol = 2)
We split the data into an 80/20 train/test split and validate that the distribution of the target class matches that of the original dataset.
set.seed(8675309)

train_index <- createDataPartition(model_data_clean$type, p = 0.8, list = FALSE)
train_data <- model_data_clean[train_index, ]
test_data <- model_data_clean[-train_index, ]

og_data <- ggplot(model_data_clean, aes(x = type, after_stat(count))) +
  geom_bar() +
  labs(title = "Full Dataset",
       x = NULL)

train_split <- ggplot(train_data, aes(x = type, after_stat(count))) +
  geom_bar() +
  labs(title = "Training Split",
       x = NULL)

test_split <- ggplot(test_data, aes(x = type, after_stat(count))) +
  geom_bar() +
  labs(title = "Testing Split")

dist_combined <- og_data / train_split / test_split +
  plot_layout(axis_titles = "collect", axes = "collect")

dist_combined
The Random Forest model is trained with the following hyperparameter ranges:
set.seed(8675309)

model_control <- trainControl(
  method = "cv",
  number = 3,
  savePredictions = "final",
  classProbs = TRUE,
  summaryFunction = twoClassSummary,
  verboseIter = TRUE
)

rf_grid <- expand.grid(
  mtry = c(2, 4, 8, 16),
  splitrule = c("gini", "extratrees"),
  min.node.size = c(1, 5, 10)
)

rf_model <- train(
  type ~ .,
  data = train_data,
  method = "ranger",
  trControl = model_control,
  metric = "ROC",
  tuneGrid = rf_grid,
  importance = "impurity"
)

saveRDS(rf_model, "rf_model_temp.rds")
The optimal model had the parameters mtry = 16, splitrule = gini, and min.node.size = 10.
rf_model <- readRDS("rf_model_temp.rds")
print(rf_model)
## Random Forest
##
## 39801 samples
## 18 predictor
## 2 classes: 'malicious', 'benign'
##
## No pre-processing
## Resampling: Cross-Validated (3 fold)
## Summary of sample sizes: 26534, 26534, 26534
## Resampling results across tuning parameters:
##
## mtry splitrule min.node.size ROC Sens Spec
## 2 gini 1 0.9554029 0.06816690 0.9999657
## 2 gini 5 0.9556690 0.06601031 0.9999657
## 2 gini 10 0.9529803 0.13530239 0.9998970
## 2 extratrees 1 0.9370825 0.00000000 1.0000000
## 2 extratrees 5 0.9395915 0.00000000 1.0000000
## 2 extratrees 10 0.9462491 0.00000000 1.0000000
## 4 gini 1 0.9823937 0.57627754 0.9887768
## 4 gini 5 0.9825748 0.57121425 0.9890513
## 4 gini 10 0.9826722 0.57018284 0.9897721
## 4 extratrees 1 0.9647774 0.56502579 0.9871293
## 4 extratrees 5 0.9648596 0.54861697 0.9878501
## 4 extratrees 10 0.9625595 0.54542897 0.9880217
## 8 gini 1 0.9961882 0.83938115 0.9921746
## 8 gini 5 0.9962418 0.86657290 0.9923806
## 8 gini 10 0.9959880 0.84528833 0.9920030
## 8 extratrees 1 0.9834999 0.76024379 0.9838688
## 8 extratrees 5 0.9836320 0.75227379 0.9843493
## 8 extratrees 10 0.9836129 0.74533521 0.9841433
## 16 gini 1 0.9990598 0.97571496 0.9959500
## 16 gini 5 0.9990241 0.97402719 0.9956068
## 16 gini 10 0.9990629 0.97562119 0.9956755
## 16 extratrees 1 0.9952287 0.87229255 0.9930327
## 16 extratrees 5 0.9954534 0.87201125 0.9928611
## 16 extratrees 10 0.9952647 0.87426160 0.9929297
##
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 16, splitrule = gini
## and min.node.size = 10.
rf_predictions <- predict(rf_model, newdata = test_data)
rf_probs <- predict(rf_model, newdata = test_data, type = "prob")
Examining the training results provides several interesting insights. Models splitting with the Gini criterion consistently outperformed those using extra trees. The number of variables considered at each split (mtry) had a pronounced impact on performance, while the minimum node size had relatively little effect, and what effect it had diminished as more variables were included. ROC started high and remained very good, whereas sensitivity began almost non-existent and improved dramatically as mtry increased. Specificity showed some unusual behavior, starting with a few perfect scores before dipping slightly and recovering. Note that the y-axis ranges for ROC and specificity are very narrow, as neither metric changed much from model to model.
rf_training_results <- rf_model$results |>
  mutate(min.node.size = factor(min.node.size, levels = c("1", "5", "10"))) # Needed to allow for discrete scaling in visualization

rf_roc <- ggplot(rf_training_results, aes(x = mtry, y = ROC)) +
  geom_point(aes(size = min.node.size, fill = splitrule), position = position_dodge(width = 0.5), shape = 21, color = "black", alpha = 0.8) +
  scale_size_manual(values = c("1" = 1, "5" = 2, "10" = 3), name = "Min Node Size") +
  scale_fill_discrete(name = "Split Rule") +
  labs(title = "Random Forest Training Performance",
       x = NULL) +
  theme(legend.position = "bottom")

rf_sens <- ggplot(rf_training_results, aes(x = mtry, y = Sens)) +
  geom_point(aes(size = min.node.size, fill = splitrule), position = position_dodge(width = 0.5), shape = 21, color = "black", alpha = 0.8) +
  scale_size_manual(values = c("1" = 1, "5" = 2, "10" = 3), name = "Min Node Size") +
  scale_fill_discrete(name = "Split Rule") +
  labs(x = NULL, y = "Sensitivity") +
  theme(legend.position = "none")

rf_spec <- ggplot(rf_training_results, aes(x = mtry, y = Spec)) +
  geom_point(aes(size = min.node.size, fill = splitrule), position = position_dodge(width = 0.5), shape = 21, color = "black", alpha = 0.8) +
  scale_size_manual(values = c("1" = 1, "5" = 2, "10" = 3), name = "Min Node Size") +
  scale_fill_discrete(name = "Split Rule") +
  labs(x = "mtry (Number of variables)", y = "Specificity") +
  theme(legend.position = "none")

combined_plot <- rf_roc / rf_sens / rf_spec +
  plot_layout(guides = "collect") &
  theme(legend.position = "bottom")

combined_plot +
  plot_annotation(
    caption = "Models using Gini impurity consistently outperformed those using extra trees. Node size had a relatively small impact on performance, especially as the number of variables increased."
  )
# Assemble predictions, observed labels, and class probabilities for evaluation.
# imbalanced_summary() is a custom helper defined earlier in the report (not shown in this section).
rf_eval_data <- data.frame(pred = rf_predictions, obs = test_data$type, benign = rf_probs$benign, malicious = rf_probs$malicious)
rf_summary <- imbalanced_summary(data = rf_eval_data)
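As a rough sketch of the kind of metrics imbalanced_summary() reports (AUC-ROC, AUC-PR, recall, specificity, precision, F1, and G-mean), an equivalent could be assembled from caret and MLmetrics as below. The function name and package choices here are assumptions, not the report's actual implementation:

# Sketch only: one way to compute the reported metrics for an imbalanced binary problem.
# Assumes `data` has columns pred and obs (factors) plus a probability column named
# after the positive class, mirroring rf_eval_data above.
imbalanced_summary_sketch <- function(data, positive = "malicious") {
  cm <- caret::confusionMatrix(data$pred, data$obs, positive = positive)
  truth <- as.integer(data$obs == positive)
  probs <- data[[positive]]
  sens <- unname(cm$byClass["Sensitivity"])
  spec <- unname(cm$byClass["Specificity"])
  c(
    `AUC-ROC`   = MLmetrics::AUC(y_pred = probs, y_true = truth),
    `AUC-PR`    = MLmetrics::PRAUC(y_pred = probs, y_true = truth),
    Recall      = sens,
    Specificity = spec,
    Precision   = unname(cm$byClass["Precision"]),
    F1          = unname(cm$byClass["F1"]),
    GMean       = sqrt(sens * spec)
  )
}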
Model performance on the test set was impressive, with the excellent training performance carrying over. This gives us confidence that the performance is genuine and not a product of overfitting.
rf_summary |>
t() |>
as.data.frame() |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kbl(caption = "Random Forest Performance on Test Set") |>
kable_styling(bootstrap_options = "condensed")
| AUC-ROC | AUC-PR | Recall | Specificity | Precision | F1 | GMean |
|---|---|---|---|---|---|---|
| 0.999 | 0.9995 | 0.9966 | 0.9812 | 0.9932 | 0.9949 | 0.9889 |
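For the raw counts behind these rates, caret's confusionMatrix() can be applied directly to the test-set predictions (a quick check; output not shown here):

confusionMatrix(rf_predictions, test_data$type, positive = "malicious")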
While the same features are used for both models, the neural network requires additional pre-processing: the numeric data is centered and scaled, and near-zero-variance predictors are removed, as they caused failures during model training. Hyperparameter tuning was conducted over the following values:
set.seed(8675309)

preprocess_steps <- c("center", "scale", "nzv")

nn_grid <- expand.grid(
  size = c(5, 10, 15),
  decay = c(0.001, 0.01, 0.1)
)

nn_model <- train(
  type ~ .,
  data = train_data,
  method = "nnet",
  preProcess = preprocess_steps,
  trControl = model_control,
  metric = "ROC",
  tuneGrid = nn_grid,
  maxit = 500
)

saveRDS(nn_model, "nn_model.rds")
The best-performing model had size = 15 and a decay of 0.1.
nn_model <- readRDS("nn_model.rds")
print(nn_model)
## Neural Network
##
## 39801 samples
## 18 predictor
## 2 classes: 'malicious', 'benign'
##
## Pre-processing: centered (13), scaled (13), remove (220)
## Resampling: Cross-Validated (3 fold)
## Summary of sample sizes: 26534, 26534, 26534
## Resampling results across tuning parameters:
##
## size decay ROC Sens Spec
## 5 0.001 0.9981025 0.9712143 0.9891200
## 5 0.010 0.9981742 0.9708392 0.9917284
## 5 0.100 0.9978598 0.9718706 0.9917284
## 10 0.001 0.9979660 0.9764651 0.9936848
## 10 0.010 0.9985744 0.9788092 0.9950233
## 10 0.100 0.9987725 0.9783404 0.9943026
## 15 0.001 0.9985812 0.9797468 0.9946458
## 15 0.010 0.9986305 0.9801219 0.9954009
## 15 0.100 0.9989196 0.9814346 0.9948861
##
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were size = 15 and decay = 0.1.
nn_predictions <- predict(nn_model, newdata = test_data)
nn_probs <- predict(nn_model, newdata = test_data, type = "prob")
Model performance for the neural network varied little across the tuning grid, with the metrics essentially identical until the third decimal place. As a general trend, the size of the hidden layer had more effect than the decay rate, but as noted, neither had much impact overall.
nn_training_results <- nn_model$results |>
  mutate(decay = factor(decay, levels = c("0.001", "0.01", "0.1"))) # Needed to allow for discrete scaling in visualization

nn_roc <- ggplot(nn_training_results, aes(x = size, y = ROC)) +
  geom_point(aes(size = decay), position = position_dodge(width = 0.5), color = "black", alpha = 0.8) +
  scale_size_manual(values = c("0.001" = 1, "0.01" = 2, "0.1" = 3), name = "Decay Rate") +
  labs(title = "Neural Network Training Performance",
       x = NULL) +
  theme(legend.position = "bottom")

nn_sens <- ggplot(nn_training_results, aes(x = size, y = Sens)) +
  geom_point(aes(size = decay), position = position_dodge(width = 0.5), color = "black", alpha = 0.8) +
  scale_size_manual(values = c("0.001" = 1, "0.01" = 2, "0.1" = 3), name = "Decay Rate") +
  labs(x = NULL, y = "Sensitivity") +
  theme(legend.position = "none")

nn_spec <- ggplot(nn_training_results, aes(x = size, y = Spec)) +
  geom_point(aes(size = decay), position = position_dodge(width = 0.5), color = "black", alpha = 0.8) +
  scale_size_manual(values = c("0.001" = 1, "0.01" = 2, "0.1" = 3), name = "Decay Rate") +
  labs(x = "Size of Hidden Layer") +
  theme(legend.position = "none")

nn_combined_plot <- nn_roc / nn_sens / nn_spec +
  plot_layout(guides = "collect") &
  theme(legend.position = "bottom")

nn_combined_plot +
  plot_annotation(
    caption = "Model performance varied very little as tuning parameters changed, with 3-4 decimal places needed before differences appear."
  )
nn_eval_data <- data.frame(pred = nn_predictions, obs = test_data$type, benign = nn_probs$benign, malicious = nn_probs$malicious)
nn_summary <- imbalanced_summary(data = nn_eval_data)
Performance again carried over to the test set, where the neural network performed about as well as the random forest.
nn_summary |>
t() |>
as.data.frame() |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kbl(caption = "Neural Network Performance on Test Set") |>
kable_styling(bootstrap_options = "condensed")
| AUC-ROC | AUC-PR | Recall | Specificity | Precision | F1 | GMean |
|---|---|---|---|---|---|---|
| 0.9992 | 0.9997 | 0.9962 | 0.9782 | 0.9921 | 0.9941 | 0.9872 |
Both models performed extremely well on the dataset: the neural network was slightly better on both AUC metrics, while the random forest was slightly better on the remaining metrics. For all practical purposes the models performed virtually identically, and either would be an excellent choice for identifying malicious URLs.
results_df <- data.frame(rf_summary, nn_summary) |>
  rename(`Neural Network` = nn_summary, `Random Forest` = rf_summary)

# Highlight the best value for each metric across the two models
results_df |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  t() |>
  as.data.frame() |>
  rownames_to_column(var = "Model") |>
  mutate(across(-Model, \(x) cell_spec(
    x,
    background = ifelse(x == max(x, na.rm = TRUE), "#FFFF99", "white")
  ))) |>
  kbl(caption = "Performance of Optimal Models on the Test Set", escape = FALSE) |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| Model | AUC-ROC | AUC-PR | Recall | Specificity | Precision | F1 | GMean |
|---|---|---|---|---|---|---|---|
| Random Forest | 0.999 | 0.9995 | 0.9966 | 0.9812 | 0.9932 | 0.9949 | 0.9889 |
| Neural Network | 0.9992 | 0.9997 | 0.9962 | 0.9782 | 0.9921 | 0.9941 | 0.9872 |
One benefit of the random forest model is its well-established method of determining variable importance. Examining the importances shows that most of the weight sits in the top 10 variables, with a rapid decay afterwards. The most important variables are largely numeric, tracking lengths and counts of URL components. Among the categorical predictors, the absence of a scheme is by far the strongest, with a tld of "com" being the only other categorical level of note.
importance <- varImp(rf_model, scale = TRUE)

importance$importance |>
  arrange(desc(Overall)) |>
  top_n(n = 15, wt = Overall) |>
  kbl() |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| Overall | |
|---|---|
| schemenone | 100.000000 |
| num_slashes | 60.869693 |
| num_dots | 23.464153 |
| subdomain_count | 22.235384 |
| path_length | 20.363533 |
| url_length | 17.028934 |
| num_params | 15.136738 |
| num_equals | 13.150670 |
| num_ampersands | 11.649123 |
| tldcom | 10.899555 |
| domain_length | 10.534880 |
| avg_token_length | 8.749679 |
| num_question_marks | 5.307672 |
| num_hyphens | 5.225182 |
| tldinfo | 3.857897 |
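The same importances can also be visualized directly; a brief sketch using the varImp object above:

# Sketch: bar chart of the top 15 scaled importances from the fitted random forest
importance$importance |>
  rownames_to_column(var = "feature") |>
  slice_max(Overall, n = 15) |>
  ggplot(aes(x = Overall, y = reorder(feature, Overall))) +
  geom_col() +
  labs(x = "Scaled importance", y = NULL, title = "Top 15 Predictors (Random Forest)")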
Given that both models performed extremely well, the random forest is the recommended model. Its simpler, more interpretable nature makes it more valuable, providing insights into the nature of attacks that can be used to engineer further relevant features. As these additional insights come at no meaningful loss of performance, the choice of model is an easy one.
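In a practical deployment, scoring a new URL would mean running it through the same feature pipeline and aligning the factor levels with those seen in training. The URL below is purely hypothetical and the level alignment is a sketch, not production code:

# Hypothetical example URL; not taken from the dataset
new_url <- "http://secure-login.example-payments.com/account/update?id=123&token=abc"

new_features <- extract_url_features(new_url) |>
  mutate(
    # Apply the same NA handling as above, then align factor levels with the training data
    scheme = factor(replace_na(scheme, "none"), levels = levels(model_data_clean$scheme)),
    path_length = replace_na(path_length, 0),
    port_number = factor(replace_na(port_number, "-1"), levels = levels(model_data_clean$port_number)),
    tld = factor(tld, levels = levels(model_data_clean$tld))
  )

# Class probabilities from the recommended random forest model
predict(rf_model, newdata = new_features, type = "prob")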