library(caret)
library(DataExplorer)
library(ggcorrplot)
library(knitr)
library(MASS)
# MASS also exports select(); make sure dplyr's version is the one used below
select <- dplyr::select
library(tidyverse)
library(RColorBrewer)
library(randomForest)
library(rpart)
library(rpart.plot)
library(psych)
We load a
dataset of Web sites labeled either Phishing or
Legitimate. Below are the first 10 observations in the
dataset, and for the sake of readability, only the first 12 columns are
displayed.
cur_theme <- theme_set(theme_classic())
my_url <- "https://raw.githubusercontent.com/geedoubledee/data622_homework2/main/web-page-phishing.csv"
phishing_df <- read.csv(my_url, skip = 1)
rem <- c("phishing", "n_hastag", "n_hypens")
phishing_df <- phishing_df |>
mutate(LABEL = factor(phishing, labels = c("Legitimate", "Phishing")),
n_hashtag = n_hastag,
n_hyphens = n_hypens) |>
relocate(LABEL, .before = url_length) |>
relocate(n_hashtag, .before = n_dollar) |>
select(-all_of(rem))
kable(phishing_df[1:10, 1:12], format = "simple")
| LABEL | url_length | n_dots | n_underline | n_slash | n_questionmark | n_equal | n_at | n_and | n_exclamation | n_space | n_tilde |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Legitimate | 37 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Phishing | 77 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Phishing | 126 | 4 | 2 | 0 | 1 | 3 | 0 | 2 | 0 | 0 | 0 |
| Legitimate | 18 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Legitimate | 55 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Phishing | 32 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Legitimate | 19 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Phishing | 81 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Legitimate | 42 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Legitimate | 104 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
The first column is the response variable that we will attempt to
predict: a binary factor named LABEL. In addition to the
response variable, there are 19 integer predictor variables:
classes <- as.data.frame(unlist(lapply(phishing_df, class))) |>
rownames_to_column()
cols <- c("Variable", "Class")
colnames(classes) <- cols
classes_summary <- classes |>
group_by(Class) |>
summarize(Count = n(),
Variables = paste(sort(unique(Variable)),collapse=", ")) |>
filter(Class == "integer")
kable(classes_summary, format = "simple")
| Class | Count | Variables |
|---|---|---|
| integer | 19 | n_and, n_asterisk, n_at, n_comma, n_dollar, n_dots, n_equal, n_exclamation, n_hashtag, n_hyphens, n_percent, n_plus, n_questionmark, n_redirection, n_slash, n_space, n_tilde, n_underline, url_length |
All of these predictor variables except url_length
and n_redirection represent counts of specific punctuation
characters within the Web sites' URLs. The former is the total number of
characters in the URL, and the latter is the number of redirects within
the URL.
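As an illustration of what these predictors measure, the same kinds of counts could be derived from a raw URL string with stringr (loaded with the tidyverse). This is only a sketch, not the pipeline that produced the dataset, and the example URL below is made up.
# Sketch only: how count-style features like these could be computed from a
# raw URL string. The example URL is hypothetical.
example_url <- "http://example.com/login/verify?user=me&session=1"
tibble(
  url_length = str_length(example_url),
  n_dots = str_count(example_url, fixed(".")),
  n_slash = str_count(example_url, fixed("/")),
  n_questionmark = str_count(example_url, fixed("?")),
  n_equal = str_count(example_url, fixed("=")),
  n_and = str_count(example_url, fixed("&"))
)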
We check for any missing values within the dataset.
rem <- c("discrete_columns", "continuous_columns",
"total_observations", "memory_usage")
completeness <- introduce(phishing_df) |>
select(-all_of(rem))
knitr::kable(t(completeness), format = "simple")
| Metric | Value |
|---|---|
| rows | 100077 |
| columns | 20 |
| all_missing_columns | 0 |
| total_missing_values | 0 |
| complete_rows | 100077 |
None of the 100,000+ observations contain missing values that we need to address for any of the variables.
We check the distribution of the response variable to see if there’s
a class imbalance between Phishing Web sites and
Legitimate Web sites.
pal <- brewer.pal(n = 12, name = "Paired")
cols <- pal[c(2, 8)]
names(cols) <- c("Legitimate", "Phishing")
obs = nrow(phishing_df)
p1 <- phishing_df |>
ggplot(aes(x = LABEL)) +
geom_bar(aes(color = LABEL, fill = LABEL)) +
geom_text(stat = "count", aes(label = paste0(round(
after_stat(count) / obs * 100, 1), "%")),
size = 5, color = "white", vjust = 2, fontface = "bold") +
scale_color_manual(values = cols) +
scale_fill_manual(values = cols) +
scale_y_continuous(labels = scales::comma) +
labs(title = "Distribution of Phishing & Legitimate Web sites",
y = "COUNT") +
theme(legend.position = "none")
p1
The ratio of Legitimate to Phishing Web
sites is not quite 2:1, so the classes are only marginally imbalanced
here, and no oversampling corrections will be required later.
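Had the imbalance been more pronounced, an oversampling correction could be applied to the training data; a minimal sketch using caret::upSample (not applied anywhere in this analysis) follows.
# Sketch only -- not applied in this analysis. upSample resamples the minority
# class with replacement until the classes are balanced; in practice it should
# only ever be applied to the training split, never to the test data.
balanced <- upSample(x = select(phishing_df, -LABEL),
                     y = phishing_df$LABEL,
                     yname = "LABEL")
table(balanced$LABEL)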
We summarize the distributions of the predictor variables.
rem <- c("vars", "n", "trimmed", "mad", "skew", "kurtosis", "se")
excl <- c("LABEL*")
describe <- describe(phishing_df) |>
select(-all_of(rem))
describe <- describe |>
filter(!rownames(describe) %in% excl)
knitr::kable(describe, format = "simple")
| Variable | mean | sd | median | min | max | range |
|---|---|---|---|---|---|---|
| url_length | 39.1776832 | 47.9718467 | 24 | 4 | 4165 | 4161 |
| n_dots | 2.2243972 | 1.2550458 | 2 | 1 | 24 | 23 |
| n_underline | 0.1377240 | 0.7239953 | 0 | 0 | 21 | 21 |
| n_slash | 1.1353858 | 1.8285255 | 0 | 0 | 44 | 44 |
| n_questionmark | 0.0243912 | 0.1677892 | 0 | 0 | 9 | 9 |
| n_equal | 0.2158338 | 0.9598019 | 0 | 0 | 23 | 23 |
| n_at | 0.0221429 | 0.2683926 | 0 | 0 | 43 | 43 |
| n_and | 0.1433296 | 0.9136555 | 0 | 0 | 26 | 26 |
| n_exclamation | 0.0026080 | 0.0822074 | 0 | 0 | 10 | 10 |
| n_space | 0.0048762 | 0.1445694 | 0 | 0 | 18 | 18 |
| n_tilde | 0.0036172 | 0.0784998 | 0 | 0 | 5 | 5 |
| n_comma | 0.0023782 | 0.0795583 | 0 | 0 | 11 | 11 |
| n_plus | 0.0024681 | 0.1043821 | 0 | 0 | 19 | 19 |
| n_asterisk | 0.0040968 | 0.2840456 | 0 | 0 | 60 | 60 |
| n_hashtag | 0.0004497 | 0.0580279 | 0 | 0 | 13 | 13 |
| n_dollar | 0.0018985 | 0.0974124 | 0 | 0 | 10 | 10 |
| n_percent | 0.1092858 | 1.6953268 | 0 | 0 | 174 | 174 |
| n_redirection | 0.3615316 | 0.7754916 | 0 | -1 | 17 | 18 |
| n_hyphens | 0.4051880 | 1.2854651 | 0 | 0 | 43 | 43 |
The median for many of these predictor variables is 0, which suggests
their distributions are nearly degenerate. One predictor,
n_redirection, also has a nonsensical range that includes -1, so we set
those values to NA. The decision tree models we build later won't have
any issues handling these missing values, since rpart routes observations
with missing predictor values through surrogate splits. However, we will
revert the NA values to their original values for the random forest model
we build, because random forest models can't handle missing values unless
they've been imputed, and we favor the original values over imputation
here.
phishing_df <- phishing_df |>
mutate(n_redirection = ifelse(n_redirection == -1, NA, n_redirection))
We check for any near-zero variance predictors to confirm the degenerate distributions we suspect.
nzv <- nearZeroVar(phishing_df, names = TRUE, saveMetrics = FALSE)
nzv
## [1] "n_underline" "n_questionmark" "n_equal" "n_at"
## [5] "n_and" "n_exclamation" "n_space" "n_tilde"
## [9] "n_comma" "n_plus" "n_asterisk" "n_hashtag"
## [13] "n_dollar" "n_percent"
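For reference, nearZeroVar can also return the diagnostics behind these flags rather than just the column names; a brief sketch of that view (output not shown) follows.
# Sketch: saveMetrics = TRUE returns the frequency ratio and percent-unique
# statistics nearZeroVar uses to flag near-zero-variance predictors.
nzv_metrics <- nearZeroVar(phishing_df, saveMetrics = TRUE)
nzv_metrics[nzv_metrics$nzv, c("freqRatio", "percentUnique")]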
Out of the 19 predictors, 14 demonstrate near-zero variance and would only add noise to our models, so we remove them.
phishing_df <- phishing_df |>
select(-all_of(nzv))
We take a look at the distributions of the remaining predictor variables.
skip <- c("LABEL")
phishing_piv <- phishing_df |>
pivot_longer(cols = !all_of(skip), names_to = "PREDICTOR",
values_to = "VALUE")
p2 <- phishing_piv |>
ggplot(aes(x = VALUE, color = LABEL, fill = LABEL)) +
geom_histogram(data = subset(phishing_piv, LABEL == "Legitimate"),
alpha = 0.5) +
geom_histogram(data = subset(phishing_piv, LABEL == "Phishing"),
alpha = 0.5) +
scale_color_manual(values = cols) +
scale_fill_manual(values = cols) +
scale_y_continuous(labels = scales::comma) +
facet_wrap(PREDICTOR ~ ., ncol = 2, scales = "free_x") +
labs(title = "Distribution of Remaining Predictor Variables",
y = "COUNT") +
theme(legend.position = "top")
p2
The distributions of the remaining predictor variables are all right-skewed. We considered whether shifting and transforming the data might improve the performance of the models we build later, but decision tree models are insensitive to monotonic transformations of the predictors: splits depend only on the ordering of each predictor's values. Accordingly, transformed data did not produce different decision boundaries, so we decided against it.
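As a quick check of that invariance (a sketch assuming log1p is a representative monotonic transformation), a tree fit on log-transformed counts should select the same split variables as one fit on the raw counts.
# Sketch: rpart splits depend only on the ordering of each predictor's values,
# so a monotonic transformation such as log1p should produce an equivalent
# tree (same split variables, different threshold values).
tree_raw <- rpart(LABEL ~ ., method = "class", data = phishing_df)
phishing_log <- phishing_df |>
  mutate(across(where(is.numeric), log1p))
tree_log <- rpart(LABEL ~ ., method = "class", data = phishing_log)
tree_raw$frame$var
tree_log$frame$var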
We visualize correlations between the response variable and the remaining predictors, as well as any predictor-predictor correlations. In the interest of ignoring clutter, only correlations greater than 0.1 (in absolute value) are displayed.
# Plot correlations whose absolute value falls within [mn, mx], optionally
# excluding variables named in excl.
plot_corr_range <- function(df, mn = 0.1, mx = 1.0, excl = c(NA)){
  palette <- brewer.pal(n = 7, name = "RdBu")[c(1, 4, 7)]
  tit <- sprintf("Correlations Between %s and %s (Absolute Value)", mn, mx)
  r <- model.matrix(~0+., data = df) |>
    cor() |>
    round(digits = 2)
  # Blank out correlations outside the requested absolute-value range
  is.na(r) <- abs(r) > mx
  is.na(r) <- abs(r) < mn
  if (!all(is.na(excl))){
    r <- as.data.frame(r) |>
      select(-all_of(excl)) |>
      filter(!rownames(r) %in% excl)
  }
  p <- r |>
    ggcorrplot(show.diag = FALSE, type = "lower", lab = TRUE,
               lab_size = 3, tl.cex = 10, tl.srt = 90,
               colors = palette, outline.color = "white") +
    labs(title = tit) +
    theme(plot.title.position = "plot")
  p
}
excl <- c("LABELLegitimate")
p3 <- plot_corr_range(df = phishing_df, excl = excl)
p3
We see that n_slash is strongly positively correlated
with Phishing Web sites, and url_length is
moderately positively correlated with Phishing Web sites.
So the more slashes a url contains, and the longer the url is, the more
likely it is that the url belongs to a Phishing Web site.
No other predictor variable is as strongly correlated with the response
variable as either of these.
We also see that url_length is moderately correlated with most of the
other predictors, including n_slash. Correlated predictors, however, do
not substantially reduce the predictive power of decision tree models.
We split the data into train and test sets.
set.seed(816)
sample <- sample(nrow(phishing_df),
round(nrow(phishing_df) * 0.7),
replace = FALSE)
train_df <- phishing_df[sample, ]
test_df <- phishing_df[-sample, ]
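The split above is a simple random sample; for reference, a stratified split that preserves the class proportions by construction could be drawn with caret::createDataPartition instead (a sketch, not used in this analysis).
# Sketch (not used here): a stratified 70/30 split that samples within each
# LABEL class, so the train/test class proportions match by design.
set.seed(816)
in_train <- createDataPartition(phishing_df$LABEL, p = 0.7, list = FALSE)
train_df_strat <- phishing_df[in_train, ]
test_df_strat <- phishing_df[-in_train, ]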
We confirm the class distributions are similar in the original, train, and test sets.
dist1 <- as.data.frame(round(prop.table(table(select(phishing_df, LABEL))), 2))
colnames(dist1) <- c("LABEL", "Original Freq")
dist2 <- as.data.frame(round(prop.table(table(select(train_df, LABEL))), 2))
colnames(dist2) <- c("LABEL", "Train Freq")
dist3 <- as.data.frame(round(prop.table(table(select(test_df, LABEL))), 2))
colnames(dist3) <- c("LABEL", "Test Freq")
class_dist <- dist1 |>
left_join(dist2, by = join_by(LABEL)) |>
left_join(dist3, by = join_by(LABEL))
kable(class_dist, format = "simple")
| LABEL | Original Freq | Train Freq | Test Freq |
|---|---|---|---|
| Legitimate | 0.64 | 0.64 | 0.63 |
| Phishing | 0.36 | 0.36 | 0.37 |
The class distributions are all very similar.
Now we are ready to build our decision tree models. We'll exclude
url_length from the first model and n_slash from the second. These
predictors have the highest correlations with the response variable, but
they are also fairly strongly correlated with each other, so a decision
tree trained on both features might not be able to use the information
from both of them anyway. Once one of two correlated predictors has been
used for an early split, the other typically adds little to later splits
because it largely reinforces what the tree already knows, so it would
probably be passed over in favor of a different, more informative
feature.
dtree_mod1 <- rpart(LABEL ~ . - url_length, method = "class", data = train_df)
rpart.plot(dtree_mod1, box.palette = "BuOr")
Decision Tree Model 1 includes a root node and two decision nodes
using two variables, n_slash and n_hyphens,
resulting in four leaf nodes.
dtree1Imp <- varImp(dtree_mod1)
dtree1Imp <- dtree1Imp |>
rownames_to_column()
cols <- c("Predictor", "Importance")
colnames(dtree1Imp) <- cols
dtree1Imp <- dtree1Imp |>
arrange(desc(Importance))
kable(dtree1Imp, format = "simple")
| Predictor | Importance |
|---|---|
| n_slash | 14581.7585 |
| n_hyphens | 3410.3099 |
| n_dots | 2048.8300 |
| n_redirection | 616.5381 |
The feature importance estimates for Decision Tree Model 1 are in
line with our expectations, with n_slash and
n_hyphens having scored highest.
dtree_mod2 <- rpart(LABEL ~ . - n_slash, method = "class", data = train_df)
rpart.plot(dtree_mod2, box.palette = "BuOr")
Decision Tree Model 2 also uses only two variables,
url_length and n_dots, and has the same
assortment of nodes as Decision Tree Model 1.
dtree2Imp <- varImp(dtree_mod2)
dtree2Imp <- dtree2Imp |>
rownames_to_column()
cols <- c("Predictor", "Importance")
colnames(dtree2Imp) <- cols
dtree2Imp <- dtree2Imp |>
arrange(desc(Importance))
kable(dtree2Imp, format = "simple")
| Predictor | Importance |
|---|---|
| url_length | 14115.9074 |
| n_hyphens | 3101.7858 |
| n_dots | 2643.0505 |
| n_redirection | 540.6571 |
Looking at the feature importance estimates for Decision Tree Model 2,
n_hyphens scored higher than n_dots, so at first glance it's surprising
that n_hyphens does not appear in any splits while n_dots does. However,
this importance measure also credits close competing splits, so it simply
means that n_hyphens was often a strong competitor for the best variable
to split on; we don't see it in Decision Tree Model 2's splits because it
never actually won one.
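To see those competing splits directly, rpart records them per node; a sketch of how to inspect them (output omitted here) follows.
# Sketch: summary() on an rpart object lists, for each node, the primary split
# plus the competing and surrogate splits that were considered; the model's
# own importance vector is also available directly.
dtree_mod2$variable.importance
summary(dtree_mod2)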
Now we build our random forest model. Since this type of model can't
handle missing values, we revert the NA values we introduced for
n_redirection to their original value of -1. We set the number of
candidate features considered at each split (mtry) to two.
train_df_alt <- train_df |>
mutate(n_redirection = ifelse(is.na(n_redirection), -1, n_redirection))
test_df_alt <- test_df |>
mutate(n_redirection = ifelse(is.na(n_redirection), -1, n_redirection))
rf_mod <- train(LABEL ~ ., data = train_df_alt, metric = "Accuracy",
method = "rf", trControl = trainControl(method = "none"),
tuneGrid = expand.grid(.mtry = 2))
rfImp <- varImp(rf_mod, scale = TRUE)
rfImp <- rfImp$importance |>
rownames_to_column()
cols <- c("Predictor", "Importance")
colnames(rfImp) <- cols
rfImp <- rfImp |>
arrange(desc(Importance))
kable(rfImp, format = "simple")
| Predictor | Importance |
|---|---|
| url_length | 100.000000 |
| n_slash | 85.694406 |
| n_dots | 9.400319 |
| n_hyphens | 9.284066 |
| n_redirection | 0.000000 |
Unsurprisingly, the random forest model estimates
url_length and n_slash to be the two most
important features for predicting the response variable. Because the
importances are scaled to a 0-100 range, the 0 for n_redirection marks it
as the least important predictor, contributing little beyond the others.
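To check how n_redirection was actually used, we can look at the unscaled importances and at how often each variable was selected across the forest; a sketch (output not shown) follows.
# Sketch: unscaled mean decrease in Gini, plus a count of how many times each
# predictor was used in a split anywhere in the forest.
varImp(rf_mod, scale = FALSE)$importance
randomForest::varUsed(rf_mod$finalModel, count = TRUE)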
We make predictions on the test data using all three models, and we construct confusion matrices and calculate a variety of performance metrics for them.
First, we look at the confusion matrices for each of the models.
#Decision Tree Model 1: predictions/confusion matrix
pred_dtree_mod1 <- predict(dtree_mod1, test_df, type = "class")
dt1cm_complete <- confusionMatrix(pred_dtree_mod1, test_df$LABEL,
positive = "Phishing")
dt1cm <- as.data.frame(dt1cm_complete$table)
dt1cm$Reference <- factor(dt1cm$Reference, levels = rev(levels(dt1cm$Reference)))
dt1cm <- dt1cm |>
mutate(
Label = case_when(
Prediction == "Legitimate" & Reference == "Legitimate" ~ "TN",
Prediction == "Phishing" & Reference == "Phishing" ~ "TP",
Prediction == "Legitimate" & Reference == "Phishing" ~ "FN",
Prediction == "Phishing" & Reference == "Legitimate" ~ "FP"),
Model = "Decision Tree Model 1")
#Decision Tree Model 2: predictions/confusion matrix
pred_dtree_mod2 <- predict(dtree_mod2, test_df, type = "class")
dt2cm_complete <- confusionMatrix(pred_dtree_mod2, test_df$LABEL,
positive = "Phishing")
dt2cm <- as.data.frame(dt2cm_complete$table)
dt2cm$Reference <- factor(dt2cm$Reference, levels = rev(levels(dt2cm$Reference)))
dt2cm <- dt2cm |>
mutate(
Label = case_when(
Prediction == "Legitimate" & Reference == "Legitimate" ~ "TN",
Prediction == "Phishing" & Reference == "Phishing" ~ "TP",
Prediction == "Legitimate" & Reference == "Phishing" ~ "FN",
Prediction == "Phishing" & Reference == "Legitimate" ~ "FP"),
Model = "Decision Tree Model 2")
#Random Forest Model: predictions/confusion matrix
pred_rf_mod <- predict(rf_mod, test_df_alt)
rfcm_complete <- confusionMatrix(pred_rf_mod, test_df_alt$LABEL,
positive = "Phishing")
rfcm <- as.data.frame(rfcm_complete$table)
rfcm$Reference <- factor(rfcm$Reference, levels = rev(levels(rfcm$Reference)))
rfcm <- rfcm |>
mutate(
Label = case_when(
Prediction == "Legitimate" & Reference == "Legitimate" ~ "TN",
Prediction == "Phishing" & Reference == "Phishing" ~ "TP",
Prediction == "Legitimate" & Reference == "Phishing" ~ "FN",
Prediction == "Phishing" & Reference == "Legitimate" ~ "FP"),
Model = "Random Forest Model")
cm <- bind_rows(dt1cm, dt2cm, rfcm)
p4 <- cm |>
ggplot(aes(x = Reference, y = Prediction, fill = Freq)) +
geom_tile(col = "black") +
geom_text(aes(label = Freq)) +
geom_text(aes(label = Label), vjust = 3) +
scale_fill_gradient(low = "white", high = pal[4]) +
scale_x_discrete(position = "top") +
facet_wrap(Model ~ ., ncol = 3, strip.position = "bottom") +
labs(title = "Confusion Matrices for All Models") +
theme(axis.line.x = element_blank(),
axis.line.y = element_blank(),
axis.text.y = element_text(angle = 90, hjust = 0.5),
axis.ticks = element_blank(),
legend.position = "bottom",
strip.placement = "outside")
p4
We immediately note some differences between the models. Decision Tree Model 1 has fewer false positives than false negatives, whereas the opposite is true for Decision Tree Model 2. The Random Forest Model, in contrast, has a very balanced ratio of false positives to false negatives.
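To quantify that visual impression, the false positive and false negative rates can be computed directly from the combined confusion-matrix data frame assembled above; a short sketch follows.
# Sketch: false positive rate = FP / (FP + TN); false negative rate =
# FN / (FN + TP), computed per model from the cm data frame built above.
cm |>
  select(Model, Label, Freq) |>
  pivot_wider(names_from = Label, values_from = Freq) |>
  mutate(FPR = round(FP / (FP + TN), 3),
         FNR = round(FN / (FN + TP), 3)) |>
  select(Model, FPR, FNR)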
Next, we look at various performance measures.
metrics <- as.data.frame(cbind(rbind(dt1cm_complete$byClass,
dt2cm_complete$byClass,
rfcm_complete$byClass),
rbind(dt1cm_complete$overall,
dt2cm_complete$overall,
rfcm_complete$overall)))
rownames(metrics) <- c("Decision Tree Model 1",
"Decision Tree Model 2",
"Random Forest Model")
keep <- c("Accuracy", "Kappa", "Precision", "Recall", "F1", "Specificity")
metrics <- metrics |>
select(all_of(keep)) |>
round(3)
kable(metrics, format = "simple")
| Model | Accuracy | Kappa | Precision | Recall | F1 | Specificity |
|---|---|---|---|---|---|---|
| Decision Tree Model 1 | 0.839 | 0.639 | 0.840 | 0.691 | 0.758 | 0.925 |
| Decision Tree Model 2 | 0.837 | 0.654 | 0.754 | 0.821 | 0.786 | 0.846 |
| Random Forest Model | 0.888 | 0.759 | 0.848 | 0.846 | 0.847 | 0.913 |
Confirming what we assessed visually, Decision Tree Model 1 has the worst recall, whereas Decision Tree Model 2 has the worst precision, though the two are similarly accurate overall. The Random Forest Model is more accurate than either, and it balances precision and recall well, giving it the highest F1 score.
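As a reminder of how the F1 column relates to precision and recall (F1 is their harmonic mean), it can be recomputed directly from the table; a quick sketch follows.
# Sketch: F1 = 2 * Precision * Recall / (Precision + Recall), the harmonic
# mean of precision and recall; this should reproduce the F1 column above.
metrics |>
  mutate(F1_check = round(2 * Precision * Recall / (Precision + Recall), 3)) |>
  select(Precision, Recall, F1, F1_check)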
Because our dataset included two good predictors of
Phishing Web sites that were fairly strongly correlated with one another,
the decision tree model built using only one of those predictors was
almost as accurate as the model built using only the other. However, each
model suffered more from one kind of classification error than the other.
If we are concerned about limiting the number of irrelevant
Phishing alerts, Decision Tree Model 1 outperforms Decision Tree Model 2
because relatively few of the positive alerts it generates are false
positives. If, instead, we are concerned about capturing most of the Web
sites that actually are Phishing Web sites in our alerts,
Decision Tree Model 2 outperforms Decision Tree Model 1 because it
classifies relatively few Phishing Web sites as
Legitimate.
If we had to choose between the decision tree models only, there is a
clear trade-off in how the models perform, and our choice would need to
be informed by the business’s needs. In many instances, our preference
would probably be for Decision Tree Model 2. It is the safer of the two,
in that deploying it would prevent employees from being exposed to more
Phishing Web sites. Although it might annoy employees
because a lot of Legitimate Web sites would be blocked as
well, it’s easier to over-censor at first and slowly add Web sites to a
“cleared” list than it is to under-censor at first and incur more
harm.
By aggregating many decision trees, each of which considers a random
subset of two predictors at every split, the Random Forest Model
overcomes some of the limitations of the individual decision tree models.
The increased accuracy and the balance between precision and recall we
observed make it an excellent Phishing Web site classifier.
What's even more impressive is that the random forest model could
probably have identified the important predictors and generated good
predictions on this dataset even if we hadn't done all of the exploratory
data analysis and trimmed the feature space ourselves. While we should
always do the work of removing any noise we can identify, random forest
models are simply less sensitive to it than many other models: they can
discount unimportant features on their own, as evidenced by
n_redirection receiving the lowest importance estimate in our Random
Forest Model.
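As a hedged sketch of how that claim could be tested (we did not run this as part of the analysis), the same random forest could be trained on the untrimmed feature set and compared with the trimmed model above.
# Sketch only -- not run as part of this analysis. Refit the same forest on
# all 19 raw predictors (before our near-zero-variance trimming) and compare
# test accuracy and importances against the trimmed model.
raw_df <- read.csv(my_url, skip = 1) |>
  mutate(LABEL = factor(phishing, labels = c("Legitimate", "Phishing"))) |>
  select(-phishing)
set.seed(816)
idx <- sample(nrow(raw_df), round(nrow(raw_df) * 0.7))
rf_full <- train(LABEL ~ ., data = raw_df[idx, ], metric = "Accuracy",
                 method = "rf", trControl = trainControl(method = "none"),
                 tuneGrid = expand.grid(.mtry = 2))
confusionMatrix(predict(rf_full, raw_df[-idx, ]), raw_df$LABEL[-idx],
                positive = "Phishing")
varImp(rf_full)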