Packages

library(caret)
library(DataExplorer)
library(e1071)
library(ggcorrplot)
library(knitr)
library(MASS)
select <- dplyr::select # MASS also exports select(); keep dplyr's version for the pipelines below
library(png)
library(psych)
library(RColorBrewer)
library(tidyverse)

Introduction

We load the dataset of Web sites labeled either Phishing or Legitimate that we used in Homework 2. As a reminder, below are the first 10 observations in the dataset, and for the sake of readability, only the first 12 columns are displayed.

cur_theme <- theme_set(theme_classic())
pal <- brewer.pal(n = 12, name = "Paired")
my_url <- "https://raw.githubusercontent.com/geedoubledee/data622_homework3/main/web-page-phishing.csv"
phishing_df <- read.csv(my_url, skip = 1)
rem <- c("phishing", "n_hastag", "n_hypens")
phishing_df <- phishing_df |>
    mutate(LABEL = factor(phishing, labels = c("Legitimate", "Phishing")),
           n_hashtag = n_hastag,
           n_hyphens = n_hypens) |>
    relocate(LABEL, .before = url_length) |>
    relocate(n_hashtag, .before = n_dollar) |>
    select(-all_of(rem))
kable(phishing_df[1:10, 1:12], format = "simple")
LABEL url_length n_dots n_underline n_slash n_questionmark n_equal n_at n_and n_exclamation n_space n_tilde
Legitimate 37 3 0 0 0 0 0 0 0 0 0
Phishing 77 1 0 0 0 0 0 0 0 0 0
Phishing 126 4 2 0 1 3 0 2 0 0 0
Legitimate 18 2 0 0 0 0 0 0 0 0 0
Legitimate 55 2 0 0 0 0 0 0 0 0 0
Phishing 32 3 0 0 0 0 0 0 0 0 0
Legitimate 19 2 0 0 0 0 0 0 0 0 0
Phishing 81 2 0 0 0 0 0 0 0 0 0
Legitimate 42 2 0 0 0 0 0 0 0 0 0
Legitimate 104 1 0 0 0 0 0 0 0 0 0

The first column is the response variable that we will again attempt to predict: a binary factor named LABEL. In addition to the response variable, there are 19 integer predictor variables:

classes <- as.data.frame(unlist(lapply(phishing_df, class))) |>
    rownames_to_column()
cols <- c("Variable", "Class")
colnames(classes) <- cols
classes_summary <- classes |>
    group_by(Class) |>
    summarize(Count = n(),
              Variables = paste(sort(unique(Variable)),collapse=", ")) |>
    filter(Class == "integer")
kable(classes_summary, format = "simple")
Class Count Variables
integer 19 n_and, n_asterisk, n_at, n_comma, n_dollar, n_dots, n_equal, n_exclamation, n_hashtag, n_hyphens, n_percent, n_plus, n_questionmark, n_redirection, n_slash, n_space, n_tilde, n_underline, url_length

All of these predictor variables except for url_length and n_redirection represent counts of specific punctuation characters within the Web sites’ urls. The former is the count of all characters within the url, and the latter is the count of redirects within the url.
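
The counts arrive precomputed in the dataset, but to make the feature definitions concrete, here is a quick sketch of how such counts could be derived with stringr (loaded via tidyverse) for a made-up url; example_url and the tibble below are purely illustrative:

# Hypothetical url, counted the same way the predictors are defined
example_url <- "http://login.example-site.com/account/verify?id=123&token=abc"
tibble(url_length = nchar(example_url),
       n_dots = str_count(example_url, fixed(".")),
       n_hyphens = str_count(example_url, fixed("-")),
       n_slash = str_count(example_url, fixed("/")),
       n_questionmark = str_count(example_url, fixed("?")),
       n_equal = str_count(example_url, fixed("=")),
       n_and = str_count(example_url, fixed("&")))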

The exploratory data analysis that follows is largely the same as the analysis performed during Homework 2.

Exploratory Data Analysis

We check for any missing values within the dataset.

rem <- c("discrete_columns", "continuous_columns",
            "total_observations", "memory_usage")
completeness <- introduce(phishing_df) |>
    select(-all_of(rem))
knitr::kable(t(completeness), format = "simple")
rows 100077
columns 20
all_missing_columns 0
total_missing_values 0
complete_rows 100077

None of the 100,000+ observations contain missing values for any of the variables, so there is nothing to address here.

We check the distribution of the response variable to see if there’s a class imbalance between Phishing Web sites and Legitimate Web sites.

pal <- brewer.pal(n = 12, name = "Paired")
cols <- pal[c(2, 8)]
names(cols) <- c("Legitimate", "Phishing")
obs = nrow(phishing_df)
p1 <- phishing_df |>
    ggplot(aes(x = LABEL)) +
    geom_histogram(aes(color = LABEL, fill = LABEL), stat = "count") +
    geom_text(stat = "count", aes(label = paste0(round(
        after_stat(count) / obs * 100, 1), "%")),
              size = 5, color = "white", vjust = 2, fontface = "bold") + 
    scale_color_manual(values = cols) +
    scale_fill_manual(values = cols) +
    scale_y_continuous(labels = scales::comma) +
    labs(title = "Distribution of Phishing & Legitimate Web sites",
         y = "COUNT") +
    theme(legend.position = "none")
p1

The ratio of Legitimate to Phishing Web sites is a little under 2:1, so the class imbalance is mild, and no oversampling correction will be required later.
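
Had the imbalance been more severe, a resampling correction could have been applied before modeling. Below is a minimal sketch using caret's upSample() (downSample() works the same way); the balanced_df object is hypothetical, and we do not run this for our analysis:

# In practice this would be applied to the training split only, so that
# duplicated minority-class rows never leak into the test set.
balanced_df <- upSample(x = select(phishing_df, -LABEL),
                        y = phishing_df$LABEL,
                        yname = "LABEL")
table(balanced_df$LABEL)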

We summarize the distributions of the predictor variables.

rem <- c("vars", "n", "trimmed", "mad", "skew", "kurtosis", "se")
excl <- c("LABEL*")
describe <- describe(phishing_df) |>
    select(-all_of(rem))
describe <- describe |>
    filter(!rownames(describe) %in% excl)
knitr::kable(describe, format = "simple")
Predictor mean sd median min max range
url_length 39.1776832 47.9718467 24 4 4165 4161
n_dots 2.2243972 1.2550458 2 1 24 23
n_underline 0.1377240 0.7239953 0 0 21 21
n_slash 1.1353858 1.8285255 0 0 44 44
n_questionmark 0.0243912 0.1677892 0 0 9 9
n_equal 0.2158338 0.9598019 0 0 23 23
n_at 0.0221429 0.2683926 0 0 43 43
n_and 0.1433296 0.9136555 0 0 26 26
n_exclamation 0.0026080 0.0822074 0 0 10 10
n_space 0.0048762 0.1445694 0 0 18 18
n_tilde 0.0036172 0.0784998 0 0 5 5
n_comma 0.0023782 0.0795583 0 0 11 11
n_plus 0.0024681 0.1043821 0 0 19 19
n_asterisk 0.0040968 0.2840456 0 0 60 60
n_hashtag 0.0004497 0.0580279 0 0 13 13
n_dollar 0.0018985 0.0974124 0 0 10 10
n_percent 0.1092858 1.6953268 0 0 174 174
n_redirection 0.3615316 0.7754916 0 -1 17 18
n_hyphens 0.4051880 1.2854651 0 0 43 43

The median for many of these predictor variables is 0, which suggests their distributions are nearly degenerate, concentrated almost entirely at zero. One predictor, n_redirection, also has a nonsensical range that includes -1. We leave these values as-is: recoding them as missing would require imputation, since the Support Vector Machine (SVM) models we will be building cannot handle missing values, and we prefer the original values to imputed ones here.
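
For reference, the extent of the questionable -1 values can be checked directly (a quick sketch; we do not act on the result):

phishing_df |>
    filter(n_redirection == -1) |>
    count(LABEL)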

Out of these predictors, 14 demonstrate near-zero variance:

nzv <- nearZeroVar(phishing_df, names = TRUE, saveMetrics = FALSE)
nzv
##  [1] "n_underline"    "n_questionmark" "n_equal"        "n_at"          
##  [5] "n_and"          "n_exclamation"  "n_space"        "n_tilde"       
##  [9] "n_comma"        "n_plus"         "n_asterisk"     "n_hashtag"     
## [13] "n_dollar"       "n_percent"
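
For reference, the criteria behind these flags can be inspected by rerunning nearZeroVar() with saveMetrics = TRUE, which reports each predictor's frequency ratio (most common value versus second most common) and percentage of unique values; a quick sketch:

nzv_metrics <- nearZeroVar(phishing_df, saveMetrics = TRUE)
# Flagged when freqRatio exceeds 95/5 and percentUnique falls below the 10% cutoff
# (the defaults), or when a predictor has zero variance
nzv_metrics[nzv_metrics$nzv, c("freqRatio", "percentUnique")]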

Since they would only serve as noise in our models, we remove them.

phishing_df <- phishing_df |>
    select(-all_of(nzv))

We take a look at the distributions of the remaining predictor variables.

skip <- c("LABEL")
phishing_piv <- phishing_df |>
    pivot_longer(cols = !all_of(skip), names_to = "PREDICTOR",
                 values_to = "VALUE")
p2 <- phishing_piv |>
    ggplot(aes(x = VALUE, color = LABEL, fill = LABEL)) +
    geom_histogram(data = subset(phishing_piv, LABEL == "Legitimate"),
                   alpha = 0.5) +
    geom_histogram(data = subset(phishing_piv, LABEL == "Phishing"),
                   alpha = 0.5) +
    scale_color_manual(values = cols) +
    scale_fill_manual(values = cols) +
    scale_y_continuous(labels = scales::comma) +
    facet_wrap(PREDICTOR ~ ., ncol = 2, scales = "free_x") +
    labs(title = "Distribution of Remaining Predictor Variables",
         y = "COUNT") +
    theme(legend.position = "top")
p2

The distributions of the remaining predictor variables are all right-skewed. We forgo any transformations since they weren't appropriate during Homework 2, and we'll later be comparing the SVM models we build here to the tree models we developed then.
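
The degree of skew can be quantified directly with skewness() from e1071 (a quick sketch; values well above 1 indicate strong right skew):

phishing_df |>
    select(-LABEL) |>
    summarize(across(everything(), e1071::skewness)) |>
    pivot_longer(everything(), names_to = "PREDICTOR", values_to = "SKEWNESS")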

We visualize correlations between the response variable and the remaining predictors, as well as any predictor-predictor correlations. To reduce clutter, only correlations greater than 0.1 in absolute value are displayed.

plot_corr_range <- function(df, mn=0.1, mx=1.0, excl=c(NA)){
    palette <- brewer.pal(n = 7, name = "RdBu")[c(1, 4, 7)]
    tit = sprintf("Correlations Between %s and %s (Absolute Value)", mn, mx)
    r <- model.matrix(~0+., data = df) |>
        cor() |>
        round(digits=2)
    is.na(r) <- abs(r) > mx
    is.na(r) <- abs(r) < mn
    if (!all(is.na(excl))){ # excl may be a character vector; avoid a length > 1 condition
        r <- as.data.frame(r) |>
            select(-all_of(excl)) |>
            filter(!rownames(r) %in% excl)
    }
    p <- r |>
        ggcorrplot(show.diag = FALSE, type = "lower", lab = TRUE,
                   lab_size = 3, tl.cex = 10, tl.srt = 90,
                   colors = palette, outline.color = "white") +
        labs(title = tit) +
        theme(plot.title.position = "plot")
    p
}
excl <- c("LABELLegitimate")
p3 <- plot_corr_range(df = phishing_df, excl = excl)
p3

We see that n_slash is strongly positively correlated with Phishing Web sites, and url_length is moderately positively correlated with Phishing Web sites. So the more slashes a url contains, and the longer the url is, the more likely it is that the url belongs to a Phishing Web site. No other predictor variable has as strong a correlation with the response variable as either of these.
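
The exact values behind those descriptions can be pulled straight from the correlation matrix (a quick sketch that reuses the model.matrix() encoding from plot_corr_range(), with the Phishing dummy standing in for the response):

r <- model.matrix(~ 0 + ., data = phishing_df) |>
    cor()
# Correlations of each predictor with the Phishing indicator, strongest first
sort(round(r[!rownames(r) %in% c("LABELLegitimate", "LABELPhishing"),
             "LABELPhishing"], 2), decreasing = TRUE)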

We also see that url_length is noticeably correlated with most of the other predictors, including n_slash. We generated a set of models that excluded either url_length or n_slash to compare performance, but they performed worse than models that included both predictors, so we ultimately decided not to exclude either.

Data Preparation

We then split the data into train and test sets. Because the SVM models we will be building can be slow to train on large datasets, and we have a large number of observations, we reduce the percentage of data used to train the model from the typical 70% to a more manageable 10%. This reduction seemed a little extreme at first, but we made several attempts to use larger percentages of the data to train the models, and the computation costs only became reasonable at this figure. We appreciate the gain in speed when building the SVM models, and we will see later that they perform very well despite using only a fraction of the data on which the tree models we built in Homework 2 were trained.

set.seed(816)
sample <- sample(nrow(phishing_df),
                 round(nrow(phishing_df) * 0.1),
                 replace = FALSE)
train_df <- phishing_df[sample, ]
test_df <- phishing_df[-sample, ]
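
The split above is a simple random sample rather than a stratified one, so as a sanity check (a quick sketch) we confirm that the class proportions are similar in the two splits:

# Proportion of Legitimate vs. Phishing in each split
round(rbind(train = prop.table(table(train_df$LABEL)),
            test = prop.table(table(test_df$LABEL))), 3)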

Model Building

We build both radial basis and linear SVM models, using only three-fold cross-validation to further reduce our computation costs.

SVM Model 1: Radial Basis

A summary of the best radial basis SVM model that we arrived at during tuning is below:

fn <- "svm1.rds"
if (!file.exists(fn)){
    ctrl <-  tune.control(sampling = "cross", cross = 3, nrepeat = 1)
    tune_grid <- list(cost = c(0.1, 1, 10, 100, 1000),
                  gamma = c(0.5, 1, 2, 3, 4))
    svm_tune1 <- tune(svm, LABEL ~ .,
                      data = train_df, kernel = "radial",
                      ranges = tune_grid, tunecontrol = ctrl)
    svm1 <- svm_tune1$best.model
    saveRDS(svm1, "svm1.rds")
}else{
    svm1 <- readRDS("svm1.rds")
}
summarize_svm <- function(svm_model){
    col1 <- c("call", "cost", "gamma", "num_classes", "classes",
              "support_vectors_total", "support_vectors_split")
    subset <- c("call", "cost", "gamma", "nclasses", "levels",
              "tot.nSV", "nSV")
    col2 <- svm_model[subset]
    copy <- col2
    for (i in 1:length(copy)){
        if (is.vector(copy[[i]])){
            col2[[i]] <- paste(col2[[i]], collapse = ", ")
        }
    }
    summ <- as.data.frame(cbind(col1, col2))
    rownames(summ) <- NULL
    colnames(summ) <- c("Parameter", "Value")
    summ
}
summ <- summarize_svm(svm1)
kable(summ, format = "simple")
Parameter Value
call best.tune(METHOD = svm, train.x = LABEL ~ ., data = train_df, , ranges = tune_grid, tunecontrol = ctrl, kernel = “radial”)
cost 1
gamma 4
num_classes 2
classes Legitimate, Phishing
support_vectors_total 3685
support_vectors_split 1514, 2171

It uses a cost of 1 and a gamma of 4.
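
For context, the radial basis kernel used by svm() scores the similarity of two observations as exp(-gamma * ||x - x'||^2), so a gamma of 4 makes each support vector's influence quite local, while a cost of 1 applies a moderate penalty to margin violations. (Note that svm() scales the predictors by default, so gamma applies to the standardized features.) A one-line sketch of the kernel, for illustration only:

# Radial basis kernel: similarity decays with squared distance, scaled by gamma
rbf_kernel <- function(x1, x2, gamma){
    exp(-gamma * sum((x1 - x2)^2))
}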

SVM Model 2: Linear

A summary of the best linear SVM model that we arrived at during tuning is below:

fn <- "svm2.rds"
if (!file.exists(fn)){
    ctrl <-  tune.control(sampling = "cross", cross = 3, nrepeat = 1)
    tune_grid <- list(cost = c(0.1, 1, 10, 100, 1000))
    svm_tune2 <- tune(svm, LABEL ~ .,
                       data = train_df, kernel = "linear",
                       ranges = tune_grid, tunecontrol = ctrl)
    svm2 <- svm_tune2$best.model
    saveRDS(svm2, "svm2.rds")
}else{
    svm2 <- readRDS("svm2.rds")
}
summ <- summarize_svm(svm2)
kable(summ, format = "simple")
Parameter Value
call best.tune(METHOD = svm, train.x = LABEL ~ ., data = train_df, , ranges = tune_grid, tunecontrol = ctrl, kernel = “linear”)
cost 0.1
gamma 0.2
num_classes 2
classes Legitimate, Phishing
support_vectors_total 3404
support_vectors_split 1702, 1702

It uses a cost of 0.1. (The reported gamma of 0.2 is just e1071's default of one divided by the number of predictors; the linear kernel does not use gamma, so it was not tuned.)

Model Evaluation

We make predictions on the test data using both models, and we construct confusion matrices for them.

pred_svm1 <- predict(svm1, test_df, type = "class")
svm1cm_complete <- confusionMatrix(pred_svm1, test_df$LABEL,
                                    positive = "Phishing")
svm1cm <- as.data.frame(svm1cm_complete$table)
svm1cm$Reference <- factor(svm1cm$Reference,
                           levels = rev(levels(svm1cm$Reference)))
svm1cm <- svm1cm |>
    mutate(
        Label = case_when(
            Prediction == "Legitimate" & Reference == "Legitimate" ~ "TN",
            Prediction == "Phishing" & Reference == "Phishing" ~ "TP",
            Prediction == "Legitimate" & Reference == "Phishing" ~ "FN",
            Prediction == "Phishing" & Reference == "Legitimate" ~ "FP"),
        Model = "SVM Model 1: Radial Basis")
pred_svm2 <- predict(svm2, test_df, type = "class")
svm2cm_complete <- confusionMatrix(pred_svm2, test_df$LABEL,
                                    positive = "Phishing")
svm2cm <- as.data.frame(svm2cm_complete$table)
svm2cm$Reference <- factor(svm2cm$Reference,
                           levels = rev(levels(svm2cm$Reference)))
svm2cm <- svm2cm |>
    mutate(
        Label = case_when(
            Prediction == "Legitimate" & Reference == "Legitimate" ~ "TN",
            Prediction == "Phishing" & Reference == "Phishing" ~ "TP",
            Prediction == "Legitimate" & Reference == "Phishing" ~ "FN",
            Prediction == "Phishing" & Reference == "Legitimate" ~ "FP"),
        Model = "SVM Model 2: Linear")

cm <- bind_rows(svm1cm, svm2cm)
p4 <- cm |>
    ggplot(aes(x = Reference, y = Prediction, fill = Freq)) +
    geom_tile(col = "black") +
    geom_text(aes(label = Freq)) +
    geom_text(aes(label = Label), vjust = 3) + 
    scale_fill_gradient(low = "white", high = pal[8]) +
    scale_x_discrete(position = "top") +
    facet_wrap(Model ~ ., ncol = 2, strip.position = "bottom") +
    labs(title = "Confusion Matrices for SVM Models") +
    theme(axis.line.x = element_blank(),
          axis.line.y = element_blank(),
          axis.text.y = element_text(angle = 90, hjust = 0.5),
          axis.ticks = element_blank(),
          legend.position = "right",
          strip.placement = "outside")
p4

We can see that the radial basis SVM model has a fairly even mix of False Positives and False Negatives, whereas the linear SVM model suffers much more from False Negatives than False Positives. The linear SVM model does have fewer False Positives than the radial basis SVM model, but it still makes more total classification errors because of its large number of False Negatives.

For the sake of comparison, we also load the confusion matrices for the tree models we built in Homework 2.

p5 <- readRDS("data622_hw2_p4.rds")
p5

Recall that the SVM models were tested on a larger number of observations than the tree models because the tree models used more of the total observations as training data, so direct comparisons of the raw counts would be misleading. We can, however, state generally that the Random Forest model and the radial basis SVM model achieved similar levels of balance between False Positives and False Negatives, whereas the linear SVM model and the Decision Tree models all produced far more of one kind of error than the other.
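
One way to put the matrices on a more comparable footing is to express each cell as a share of that model's test observations; a quick sketch for the SVM models using the cm data frame built above (the Homework 2 matrices would need the same treatment):

cm |>
    group_by(Model) |>
    mutate(Share = round(Freq / sum(Freq), 3)) |>
    ungroup() |>
    select(Model, Label, Share) |>
    arrange(Model, Label)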

We calculate the performance metrics for both SVM models, and we load the performance metrics for the tree models we built in Homework 2 for comparison.

metrics <- as.data.frame(cbind(rbind(svm1cm_complete$byClass,
                                     svm2cm_complete$byClass),
                               rbind(svm1cm_complete$overall,
                                     svm2cm_complete$overall)))
rownames(metrics) <- c("SVM Model 1: Radial Basis",
                       "SVM Model 2: Linear")
my_url2 <- "https://raw.githubusercontent.com/geedoubledee/data622_homework3/main/data622_hw2_metrics.csv"
metrics_prev <- read.csv(my_url2, row.names = 1)
keep <- c("Accuracy", "Kappa", "Precision", "Recall", "F1", "Specificity")
metrics <- metrics |>
    select(all_of(keep)) |>
    round(3) |>
    bind_rows(metrics_prev)
kable(metrics, format = "simple")
Model Accuracy Kappa Precision Recall F1 Specificity
SVM Model 1: Radial Basis 0.871 0.721 0.826 0.818 0.822 0.901
SVM Model 2: Linear 0.850 0.664 0.858 0.705 0.774 0.933
Decision Tree Model 1 0.839 0.639 0.840 0.691 0.758 0.925
Decision Tree Model 2 0.837 0.654 0.754 0.821 0.786 0.846
Random Forest Model 0.888 0.759 0.848 0.846 0.847 0.913

The radial basis SVM model is more accurate than the linear SVM model, and it has a better balance between precision and recall, as we saw visually in the confusion matrices. Both SVM models were more accurate than the two Decision Tree models, but neither was as accurate as the Random Forest model. The Random Forest model also has the best balance between precision and recall, as well as the highest recall of any model, but it is worth noting that the linear SVM model beats it on precision.
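
As a quick arithmetic check on the table, F1 is the harmonic mean of precision and recall, so it can be recomputed from the reported columns (a sketch; the hypothetical F1_check column should match F1 up to rounding):

metrics |>
    mutate(F1_check = round(2 * Precision * Recall / (Precision + Recall), 3)) |>
    select(Precision, Recall, F1, F1_check)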

Conclusion

Although the Random Forest model is still the best performer, the radial basis SVM model is not far behind it, and it required far less training data to achieve comparable results. It also generalized very well to the test data, which is a hallmark of SVM models that find large margins: they are relatively robust to noise and less prone to overfitting, so they tend to keep classification error low.

Had the SVM models been able to utilize more data during training, there's a decent chance they could have outperformed the Random Forest model. We have worked with SVM models before, and training them using the tune function from the e1071 library seemed slower than our previous attempts using the train function from the caret library. When we have more time, we would like to run an experiment that compares the computation times. If the choice of library doesn't explain the difference, we would like to find out what does: we know SVM models can become slow to train on large datasets, but we didn't think our dataset was especially large in the grand scheme of things.
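
A minimal sketch of the timing experiment we have in mind is below. It assumes the kernlab package is installed (caret's svmRadial method wraps kernlab::ksvm, whose sigma parameter plays the role of gamma), and it times only a small illustrative grid rather than the full tuning runs above; the grid and timing object names are placeholders:

small_grid_e1071 <- list(cost = c(1, 10), gamma = c(0.5, 1))
time_e1071 <- system.time(
    tune(svm, LABEL ~ ., data = train_df, kernel = "radial",
         ranges = small_grid_e1071,
         tunecontrol = tune.control(sampling = "cross", cross = 3))
)
small_grid_caret <- expand.grid(C = c(1, 10), sigma = c(0.5, 1))
time_caret <- system.time(
    train(LABEL ~ ., data = train_df, method = "svmRadial",
          trControl = trainControl(method = "cv", number = 3),
          tuneGrid = small_grid_caret)
)
# Elapsed times for the same grid and resampling scheme in each library
rbind(e1071 = time_e1071, caret = time_caret)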