library(caret)
library(DataExplorer)
library(e1071)
library(ggcorrplot)
library(knitr)
library(MASS)
# MASS also exports select(), so pin select() to the dplyr version
select <- dplyr::select
library(png)
library(psych)
library(RColorBrewer)
library(tidyverse)
We load the
dataset of Web sites labeled either Phishing or
Legitimate that we used in Homework 2. As a reminder, below
are the first 10 observations in the dataset, and for the sake of
readability, only the first 12 columns are displayed.
cur_theme <- theme_set(theme_classic())
pal <- brewer.pal(n = 12, name = "Paired")
my_url <- "https://raw.githubusercontent.com/geedoubledee/data622_homework3/main/web-page-phishing.csv"
phishing_df <- read.csv(my_url, skip = 1)
rem <- c("phishing", "n_hastag", "n_hypens")
phishing_df <- phishing_df |>
mutate(LABEL = factor(phishing, labels = c("Legitimate", "Phishing")),
n_hashtag = n_hastag,
n_hyphens = n_hypens) |>
relocate(LABEL, .before = url_length) |>
relocate(n_hashtag, .before = n_dollar) |>
select(-all_of(rem))
kable(phishing_df[1:10, 1:12], format = "simple")
| LABEL | url_length | n_dots | n_underline | n_slash | n_questionmark | n_equal | n_at | n_and | n_exclamation | n_space | n_tilde |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Legitimate | 37 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Phishing | 77 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Phishing | 126 | 4 | 2 | 0 | 1 | 3 | 0 | 2 | 0 | 0 | 0 |
| Legitimate | 18 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Legitimate | 55 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Phishing | 32 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Legitimate | 19 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Phishing | 81 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Legitimate | 42 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Legitimate | 104 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
The first column is the response variable that we will again attempt
to predict: a binary factor named LABEL. In addition to the
response variable, there are 19 integer predictor variables:
classes <- as.data.frame(unlist(lapply(phishing_df, class))) |>
rownames_to_column()
cols <- c("Variable", "Class")
colnames(classes) <- cols
classes_summary <- classes |>
group_by(Class) |>
summarize(Count = n(),
Variables = paste(sort(unique(Variable)),collapse=", ")) |>
filter(Class == "integer")
kable(classes_summary, format = "simple")
| Class | Count | Variables |
|---|---|---|
| integer | 19 | n_and, n_asterisk, n_at, n_comma, n_dollar, n_dots, n_equal, n_exclamation, n_hashtag, n_hyphens, n_percent, n_plus, n_questionmark, n_redirection, n_slash, n_space, n_tilde, n_underline, url_length |
All of these predictor variables except for url_length
and n_redirection represent counts of specific punctuation
characters within the Web sites’ urls. The former is the count of all
characters within the url, and the latter is the count of redirects
within the url.
The exploratory data analysis that follows is largely the same as was performed during Homework 2.
We check for any missing values within the dataset.
rem <- c("discrete_columns", "continuous_columns",
"total_observations", "memory_usage")
completeness <- introduce(phishing_df) |>
select(-all_of(rem))
knitr::kable(t(completeness), format = "simple")
| Metric | Value |
|---|---|
| rows | 100077 |
| columns | 20 |
| all_missing_columns | 0 |
| total_missing_values | 0 |
| complete_rows | 100077 |
Of the 100,000+ observations, none contain missing values we need to address for any of the variables.
We check the distribution of the response variable to see if there’s
a class imbalance between Phishing Web sites and
Legitimate Web sites.
pal <- brewer.pal(n = 12, name = "Paired")
cols <- pal[c(2, 8)]
names(cols) <- c("Legitimate", "Phishing")
obs = nrow(phishing_df)
p1 <- phishing_df |>
ggplot(aes(x = LABEL)) +
geom_bar(aes(color = LABEL, fill = LABEL)) +
geom_text(stat = "count", aes(label = paste0(round(
after_stat(count) / obs * 100, 1), "%")),
size = 5, color = "white", vjust = 2, fontface = "bold") +
scale_color_manual(values = cols) +
scale_fill_manual(values = cols) +
scale_y_continuous(labels = scales::comma) +
labs(title = "Distribution of Phishing & Legitimate Web sites",
y = "COUNT") +
theme(legend.position = "none")
p1
The ratio of Legitimate to Phishing Web
sites is not quite 2:1, so the classes are only marginally imbalanced
here, and no oversampling corrections will be required later.
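The exact counts behind the plot can also be confirmed directly; the proportions should match the percentage labels above:
# Raw counts and proportions per class; Legitimate to Phishing is a bit
# under 2:1.
table(phishing_df$LABEL)
round(prop.table(table(phishing_df$LABEL)), 3)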
We summarize the distributions of the predictor variables.
rem <- c("vars", "n", "trimmed", "mad", "skew", "kurtosis", "se")
excl <- c("LABEL*")
describe <- describe(phishing_df) |>
select(-all_of(rem))
describe <- describe |>
filter(!rownames(describe) %in% excl)
knitr::kable(describe, format = "simple")
| | mean | sd | median | min | max | range |
|---|---|---|---|---|---|---|
| url_length | 39.1776832 | 47.9718467 | 24 | 4 | 4165 | 4161 |
| n_dots | 2.2243972 | 1.2550458 | 2 | 1 | 24 | 23 |
| n_underline | 0.1377240 | 0.7239953 | 0 | 0 | 21 | 21 |
| n_slash | 1.1353858 | 1.8285255 | 0 | 0 | 44 | 44 |
| n_questionmark | 0.0243912 | 0.1677892 | 0 | 0 | 9 | 9 |
| n_equal | 0.2158338 | 0.9598019 | 0 | 0 | 23 | 23 |
| n_at | 0.0221429 | 0.2683926 | 0 | 0 | 43 | 43 |
| n_and | 0.1433296 | 0.9136555 | 0 | 0 | 26 | 26 |
| n_exclamation | 0.0026080 | 0.0822074 | 0 | 0 | 10 | 10 |
| n_space | 0.0048762 | 0.1445694 | 0 | 0 | 18 | 18 |
| n_tilde | 0.0036172 | 0.0784998 | 0 | 0 | 5 | 5 |
| n_comma | 0.0023782 | 0.0795583 | 0 | 0 | 11 | 11 |
| n_plus | 0.0024681 | 0.1043821 | 0 | 0 | 19 | 19 |
| n_asterisk | 0.0040968 | 0.2840456 | 0 | 0 | 60 | 60 |
| n_hashtag | 0.0004497 | 0.0580279 | 0 | 0 | 13 | 13 |
| n_dollar | 0.0018985 | 0.0974124 | 0 | 0 | 10 | 10 |
| n_percent | 0.1092858 | 1.6953268 | 0 | 0 | 174 | 174 |
| n_redirection | 0.3615316 | 0.7754916 | 0 | -1 | 17 | 18 |
| n_hyphens | 0.4051880 | 1.2854651 | 0 | 0 | 43 | 43 |
The median for many of these predictor variables is 0, which suggests
their distributions are nearly degenerate, with almost all of their mass
at zero. One predictor, n_redirection, also has a nonsensical range that
includes -1. We leave these values as-is: recoding them as missing would
force us to impute, since the Support Vector Machine (SVM) models we will
be building can't handle missing values, and we favor the original values
over imputed ones.
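As a quick check (not part of the original modeling pipeline), the anomalous values can be counted directly:
# How many observations carry the nonsensical -1 redirection count?
sum(phishing_df$n_redirection == -1)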
Fourteen of these predictors demonstrate near-zero variance:
nzv <- nearZeroVar(phishing_df, names = TRUE, saveMetrics = FALSE)
nzv
## [1] "n_underline" "n_questionmark" "n_equal" "n_at"
## [5] "n_and" "n_exclamation" "n_space" "n_tilde"
## [9] "n_comma" "n_plus" "n_asterisk" "n_hashtag"
## [13] "n_dollar" "n_percent"
Since they would only serve as noise in our models, we remove them.
phishing_df <- phishing_df |>
select(-all_of(nzv))
We take a look at the distributions of the remaining predictor variables.
skip <- c("LABEL")
phishing_piv <- phishing_df |>
pivot_longer(cols = !all_of(skip), names_to = "PREDICTOR",
values_to = "VALUE")
p2 <- phishing_piv |>
ggplot(aes(x = VALUE, color = LABEL, fill = LABEL)) +
geom_histogram(data = subset(phishing_piv, LABEL == "Legitimate"),
alpha = 0.5) +
geom_histogram(data = subset(phishing_piv, LABEL == "Phishing"),
alpha = 0.5) +
scale_color_manual(values = cols) +
scale_fill_manual(values = cols) +
scale_y_continuous(labels = scales::comma) +
facet_wrap(PREDICTOR ~ ., ncol = 2, scales = "free_x") +
labs(title = "Distribution of Remaining Predictor Variables",
y = "COUNT") +
theme(legend.position = "top")
p2
The distributions of the remaining predictor variables are all right-skewed. We forgo any transformations, since they weren't appropriate during Homework 2 and we'll later be comparing the SVM models we build here with the tree models we developed then.
We visualize correlations between the response variable and the remaining predictors, as well as any predictor-predictor correlations. To reduce clutter, only correlations greater than 0.1 in absolute value are displayed.
plot_corr_range <- function(df, mn=0.1, mx=1.0, excl=c(NA)){
palette <- brewer.pal(n = 7, name = "RdBu")[c(1, 4, 7)]
tit = sprintf("Correlations Between %s and %s (Absolute Value)", mn, mx)
r <- model.matrix(~0+., data = df) |>
cor() |>
round(digits=2)
is.na(r) <- abs(r) > mx
is.na(r) <- abs(r) < mn
# drop excluded dummy-coded columns/rows (e.g., the redundant factor level)
if (!all(is.na(excl))){
r <- as.data.frame(r) |>
select(-all_of(excl)) |>
filter(!rownames(r) %in% excl)
}
p <- r |>
ggcorrplot(show.diag = FALSE, type = "lower", lab = TRUE,
lab_size = 3, tl.cex = 10, tl.srt = 90,
colors = palette, outline.color = "white") +
labs(title = tit) +
theme(plot.title.position = "plot")
p
}
excl <- c("LABELLegitimate")
p3 <- plot_corr_range(df = phishing_df, excl = excl)
p3
We see that n_slash is strongly positively correlated with
Phishing Web sites, and url_length is moderately positively
correlated with Phishing Web sites. So the more slashes a url
contains, and the longer the url is, the more likely it is that the url
belongs to a Phishing Web site. No other predictor variable has as
strong a correlation with the response variable as either of these.
We also see that url_length is moderately correlated with most
of the other predictors, including n_slash. We generated a set
of models that excluded either url_length or n_slash to
compare performance, but they performed worse than models that included
both predictors, so we ultimately decided not to exclude either.
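As a quick check of that overlap, the pairwise correlation between the two strongest predictors can be computed directly:
# Pairwise correlation between the two predictors most associated with
# the Phishing label.
cor(phishing_df$url_length, phishing_df$n_slash)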
We then split the data into train and test sets. Because the SVM models we will be building can be slow to train on large datasets, and we have a large number of observations, we reduce the percentage of data used for training from the typical 70% to a more manageable 10%. This reduction seemed a little extreme at first, but after several attempts to train the models on larger percentages of the data, the computation costs only became reasonable at this figure. We appreciate the gain in speed when building the SVM models, and we will see later that they perform very well despite using only a fraction of the data on which the tree models we built in Homework 2 were trained.
set.seed(816)
sample <- sample(nrow(phishing_df),
round(nrow(phishing_df) * 0.1),
replace = FALSE)
train_df <- phishing_df[sample, ]
test_df <- phishing_df[-sample, ]
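As a sanity check (not in the original write-up), we confirm the random 10% sample roughly preserves the class proportions; if it didn't, a stratified split via caret's createDataPartition() would be a reasonable alternative:
# Class proportions in the training sample; these should be close to the
# roughly 2:1 Legitimate-to-Phishing split in the full dataset.
round(prop.table(table(train_df$LABEL)), 3)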
We build both radial basis and linear SVM models, using only three-fold cross-validation to further reduce our computation costs.
A summary of the best radial basis SVM model that we arrived at during tuning is below:
fn <- "svm1.rds"
if (!file.exists(fn)){
ctrl <- tune.control(sampling = "cross", cross = 3, nrepeat = 1)
tune_grid <- list(cost = c(0.1, 1, 10, 100, 1000),
gamma = c(0.5, 1, 2, 3, 4))
svm_tune1 <- tune(svm, LABEL ~ .,
data = train_df, kernel = "radial",
ranges = tune_grid, tunecontrol = ctrl)
svm1 <- svm_tune1$best.model
saveRDS(svm1, "svm1.rds")
}else{
svm1 <- readRDS("svm1.rds")
}
summarize_svm <- function(svm_model){
col1 <- c("call", "cost", "gamma", "num_classes", "classes",
"support_vectors_total", "support_vectors_split")
subset <- c("call", "cost", "gamma", "nclasses", "levels",
"tot.nSV", "nSV")
col2 <- svm_model[subset]
copy <- col2
for (i in 1:length(copy)){
if (is.vector(copy[[i]])){
col2[[i]] <- paste(col2[[i]], collapse = ", ")
}
}
summ <- as.data.frame(cbind(col1, col2))
rownames(summ) <- NULL
colnames(summ) <- c("Parameter", "Value")
summ
}
summ <- summarize_svm(svm1)
kable(summ, format = "simple")
| Parameter | Value |
|---|---|
| call | best.tune(METHOD = svm, train.x = LABEL ~ ., data = train_df, , ranges = tune_grid, tunecontrol = ctrl, kernel = “radial”) |
| cost | 1 |
| gamma | 4 |
| num_classes | 2 |
| classes | Legitimate, Phishing |
| support_vectors_total | 3685 |
| support_vectors_split | 1514, 2171 |
It uses a cost of 1 and a gamma of 4.
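When the model is tuned in-session rather than loaded from disk, the full tuning object is also available, and the cross-validated error for every cost/gamma combination can be inspected; a sketch, assuming svm_tune1 is still in memory:
# Only available when tuning ran in this session (not when svm1.rds was
# loaded): per-combination cross-validation error and the best settings.
if (exists("svm_tune1")){
    print(svm_tune1$performances)
    print(svm_tune1$best.parameters)
}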
A summary of the best linear SVM model that we arrived at during tuning is below:
fn <- "svm2.rds"
if (!file.exists(fn)){
ctrl <- tune.control(sampling = "cross", cross = 3, nrepeat = 1)
tune_grid <- list(cost = c(0.1, 1, 10, 100, 1000))
svm_tune2 <- tune(svm, LABEL ~ .,
data = train_df, kernel = "linear",
ranges = tune_grid, tunecontrol = ctrl)
svm2 <- svm_tune2$best.model
saveRDS(svm2, "svm2.rds")
}else{
svm2 <- readRDS("svm2.rds")
}
summ <- summarize_svm(svm2)
kable(summ, format = "simple")
| Parameter | Value |
|---|---|
| call | best.tune(METHOD = svm, train.x = LABEL ~ ., data = train_df, , ranges = tune_grid, tunecontrol = ctrl, kernel = “linear”) |
| cost | 0.1 |
| gamma | 0.2 |
| num_classes | 2 |
| classes | Legitimate, Phishing |
| support_vectors_total | 3404 |
| support_vectors_split | 1702, 1702 |
It uses a cost of 0.1. (The gamma shown is not a tuned value: the linear kernel does not use gamma, so the 0.2 reported is just the e1071 default of 1 divided by the number of predictors, here 1/5.)
We make predictions on the test data using both models, and we construct confusion matrices for them.
pred_svm1 <- predict(svm1, test_df, type = "class")
svm1cm_complete <- confusionMatrix(pred_svm1, test_df$LABEL,
positive = "Phishing")
svm1cm <- as.data.frame(svm1cm_complete$table)
svm1cm$Reference <- factor(svm1cm$Reference,
levels = rev(levels(svm1cm$Reference)))
svm1cm <- svm1cm |>
mutate(
Label = case_when(
Prediction == "Legitimate" & Reference == "Legitimate" ~ "TN",
Prediction == "Phishing" & Reference == "Phishing" ~ "TP",
Prediction == "Legitimate" & Reference == "Phishing" ~ "FN",
Prediction == "Phishing" & Reference == "Legitimate" ~ "FP"),
Model = "SVM Model 1: Radial Basis")
pred_svm2 <- predict(svm2, test_df, type = "class")
svm2cm_complete <- confusionMatrix(pred_svm2, test_df$LABEL,
positive = "Phishing")
svm2cm <- as.data.frame(svm2cm_complete$table)
svm2cm$Reference <- factor(svm2cm$Reference,
levels = rev(levels(svm2cm$Reference)))
svm2cm <- svm2cm |>
mutate(
Label = case_when(
Prediction == "Legitimate" & Reference == "Legitimate" ~ "TN",
Prediction == "Phishing" & Reference == "Phishing" ~ "TP",
Prediction == "Legitimate" & Reference == "Phishing" ~ "FN",
Prediction == "Phishing" & Reference == "Legitimate" ~ "FP"),
Model = "SVM Model 2: Linear")
cm <- bind_rows(svm1cm, svm2cm)
p4 <- cm |>
ggplot(aes(x = Reference, y = Prediction, fill = Freq)) +
geom_tile(col = "black") +
geom_text(aes(label = Freq)) +
geom_text(aes(label = Label), vjust = 3) +
scale_fill_gradient(low = "white", high = pal[8]) +
scale_x_discrete(position = "top") +
facet_wrap(Model ~ ., ncol = 2, strip.position = "bottom") +
labs(title = "Confusion Matrices for SVM Models") +
theme(axis.line.x = element_blank(),
axis.line.y = element_blank(),
axis.text.y = element_text(angle = 90, hjust = 0.5),
axis.ticks = element_blank(),
legend.position = "right",
strip.placement = "outside")
p4
We can see that the radial basis SVM model has a pretty even mix of False Positives and False Negatives, whereas the linear SVM model suffers much more from False Negatives than False Positives. The linear SVM model also has fewer False Positives than the radial basis SVM model, but still results in more total classification errors because of its large number of False Negatives.
For the sake of comparison, we also load the confusion matrices for the tree models we built in Homework 2.
p5 <- readRDS("data622_hw2_p4.rds")
p5
Recall that the SVM models were tested on a larger number of observations than the tree models, because the tree models used more of the total observations as training data, so direct comparisons of the raw numbers would be misleading. However, we can state generally that the Random Forest model and the radial basis SVM model achieved similar levels of balance between False Positives and False Negatives, whereas the linear SVM model and the Decision Tree models all produced much more of one kind of error than the other.
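To make the raw counts comparable despite the different test-set sizes, the confusion matrices can be rescaled to proportions of their respective test sets; for example, for the radial basis SVM:
# Re-express the radial basis SVM confusion matrix as proportions of the
# 90,000+ test observations rather than raw counts.
round(prop.table(svm1cm_complete$table), 3)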
We calculate the performance metrics for both SVM models, and we load the performance metrics for the tree models we built in Homework 2 for comparison.
metrics <- as.data.frame(cbind(rbind(svm1cm_complete$byClass,
svm2cm_complete$byClass),
rbind(svm1cm_complete$overall,
svm2cm_complete$overall)))
rownames(metrics) <- c("SVM Model 1: Radial Basis",
"SVM Model 2: Linear")
my_url2 <- "https://raw.githubusercontent.com/geedoubledee/data622_homework3/main/data622_hw2_metrics.csv"
metrics_prev <- read.csv(my_url2, row.names = 1)
keep <- c("Accuracy", "Kappa", "Precision", "Recall", "F1", "Specificity")
metrics <- metrics |>
select(all_of(keep)) |>
round(3) |>
bind_rows(metrics_prev)
kable(metrics, format = "simple")
| | Accuracy | Kappa | Precision | Recall | F1 | Specificity |
|---|---|---|---|---|---|---|
| SVM Model 1: Radial Basis | 0.871 | 0.721 | 0.826 | 0.818 | 0.822 | 0.901 |
| SVM Model 2: Linear | 0.850 | 0.664 | 0.858 | 0.705 | 0.774 | 0.933 |
| Decision Tree Model 1 | 0.839 | 0.639 | 0.840 | 0.691 | 0.758 | 0.925 |
| Decision Tree Model 2 | 0.837 | 0.654 | 0.754 | 0.821 | 0.786 | 0.846 |
| Random Forest Model | 0.888 | 0.759 | 0.848 | 0.846 | 0.847 | 0.913 |
The radial basis SVM model is more accurate than the linear SVM model, and it has a better balance between precision and recall, as we observed in the confusion matrices. Both SVM models were more accurate than the two Decision Tree models, but neither was as accurate as the Random Forest model. The Random Forest model also has the best balance between precision and recall overall, as well as the best recall outright, but it is worth noting that the linear SVM model beats it on precision.
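As a quick arithmetic check, each F1 value in the table is the harmonic mean of the corresponding precision and recall; for example, for the radial basis SVM:
# F1 = 2 * Precision * Recall / (Precision + Recall)
round(2 * 0.826 * 0.818 / (0.826 + 0.818), 3)
## [1] 0.822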
Although the Random Forest model is still the best performer, the radial basis SVM model is not far behind it, and it required far less training data to achieve these respectable results than the Random Forest model used. It generalized very well to the test data, which is a hallmark of SVM models that find large margins: they are relatively insensitive to noise and less prone to overfitting, so they tend to make relatively few classification errors.
Had the SVM models been able to utilize more data during training,
there's a decent chance they could have outperformed the Random Forest
model. We have worked with SVM models before, and training them with the
tune function from the e1071 library seemed slower than our
previous attempts using the train function from the caret
library. When we have more time, we plan to run an experiment comparing
the computation times. If the choice of library doesn't make a
difference, we would like to find out what does: we know SVM models can
become slow to train on large datasets, but we didn't think this dataset
was especially large in the grand scheme of things.
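A minimal sketch of the timing experiment we have in mind is below, assuming a single, comparable parameter combination per library and three-fold cross-validation; note that caret's svmRadial method is backed by the kernlab package (which must be installed), where sigma plays the role that gamma plays in e1071.
# Sketch only: time one comparable radial-kernel tuning run per library.
t_e1071 <- system.time(
    tune(svm, LABEL ~ ., data = train_df, kernel = "radial",
         ranges = list(cost = 1, gamma = 4),
         tunecontrol = tune.control(sampling = "cross", cross = 3))
)
t_caret <- system.time(
    train(LABEL ~ ., data = train_df, method = "svmRadial",
          trControl = trainControl(method = "cv", number = 3),
          tuneGrid = data.frame(sigma = 4, C = 1))
)
# user/system/elapsed times side by side
rbind(e1071 = t_e1071, caret = t_caret)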