Background
Phishing is a cyberattack that uses deceptive emails, messages, phone calls, or websites to trick individuals into revealing sensitive information, downloading malware, or exposing themselves to cybercrime. It is a form of social engineering that manipulates human behavior by impersonating trusted sources to steal data such as usernames, passwords, credit card details, and bank account information for malicious purposes (Konsinski, 2026; Cloudflare, 2026).
Problem Statement
As phishing techniques continue to advance, cybercriminals are creating more sophisticated and convincing attacks, making them increasingly challenging to detect (Sivaneswaran et al., 2026). This project focuses on developing machine learning models using R to better understand phishing URLs based on their lexical features. The project is built around two distinct research questions, each answered with a different and appropriate modeling technique:
(Classification) Which URL lexical features contribute the most to distinguishing phishing websites from legitimate ones?
(Regression) How do the individual URL character features relate to the overall length of a URL, and to what extent can URL length be explained by them?
Objectives
Dataset Source
The dataset used in this project was obtained from Kaggle, specifically from the dataset titled Web Page Phishing Dataset by Daniel Fernandon.
Purpose of dataset
The dataset is designed for phishing website detection and contains various URL and webpage-related features that can be used to classify whether a website is phishing or legitimate.
Dataset Key Features
url_length: The length of the URL.
n_dots: The count of ‘.’ characters in the URL.
n_hypens: The count of ‘-’ characters in the URL.
n_underline: The count of ‘_’ characters in the URL.
n_slash: The count of ‘/’ characters in the URL.
n_questionmark: The count of ‘?’ characters in the URL.
n_equal: The count of ‘=’ characters in the URL.
n_at: The count of ‘@’ characters in the URL.
n_and: The count of ‘&’ characters in the URL.
n_exclamation: The count of ‘!’ characters in the URL.
n_space: The count of ‘ ’ (space) characters in the URL.
n_tilde: The count of ‘~’ characters in the URL.
n_comma: The count of ‘,’ characters in the URL.
n_plus: The count of ‘+’ characters in the URL.
n_asterisk: The count of ‘*’ characters in the URL.
n_hastag: The count of ‘#’ characters in the URL.
n_dollar: The count of ‘$’ characters in the URL.
n_percent: The count of ‘%’ characters in the URL.
n_redirection: The count of redirections in the URL.
phishing: The label of the URL, where 1 is phishing and 0 is legitimate.
Variable Output
The target variable indicates whether a URL is phishing or legitimate, where:
1 represents a phishing website
0 represents a legitimate website
List of all required packages
packages <- c(
"dplyr",
"janitor",
"tidyr",
"ggplot2",
"corrplot",
"skimr",
"tidyverse",
"GGally",
"scales",
"caret",
"randomForest",
"pROC",
"knitr",
"kableExtra"
)
# Auto-install any missing packages, then load all of them
install_and_load <- function(pkg) {
if (!requireNamespace(pkg, quietly = TRUE)) {
install.packages(pkg, repos = "https://cloud.r-project.org")
}
library(pkg, character.only = TRUE)
}
invisible(sapply(packages, install_and_load))
set.seed(42)
Load Dataset
file_path <- "web-page-phishing.csv"
df_raw <- read.csv(file_path, stringsAsFactors = FALSE)
dim(df_raw)
## [1] 100077 20
Check Original Dataset Structure
## [1] "url_length" "n_dots" "n_hypens" "n_underline"
## [5] "n_slash" "n_questionmark" "n_equal" "n_at"
## [9] "n_and" "n_exclamation" "n_space" "n_tilde"
## [13] "n_comma" "n_plus" "n_asterisk" "n_hastag"
## [17] "n_dollar" "n_percent" "n_redirection" "phishing"
## 'data.frame': 100077 obs. of 20 variables:
## $ url_length : int 37 77 126 18 55 32 19 81 42 104 ...
## $ n_dots : int 3 1 4 2 2 3 2 2 2 1 ...
## $ n_hypens : int 0 0 1 0 2 1 0 0 0 10 ...
## $ n_underline : int 0 0 2 0 0 0 0 0 0 0 ...
## $ n_slash : int 0 0 0 0 0 0 0 0 0 0 ...
## $ n_questionmark: int 0 0 1 0 0 0 0 0 0 0 ...
## $ n_equal : int 0 0 3 0 0 0 0 0 0 0 ...
## $ n_at : int 0 0 0 0 0 0 0 0 0 0 ...
## $ n_and : int 0 0 2 0 0 0 0 0 0 0 ...
## $ n_exclamation : int 0 0 0 0 0 0 0 0 0 0 ...
## $ n_space : int 0 0 0 0 0 0 0 0 0 0 ...
## $ n_tilde : int 0 0 0 0 0 0 0 0 0 0 ...
## $ n_comma : int 0 0 0 0 0 0 0 0 0 0 ...
## $ n_plus : int 0 0 0 0 0 0 0 0 0 0 ...
## $ n_asterisk : int 0 0 0 0 0 0 0 0 0 0 ...
## $ n_hastag : int 0 0 0 0 0 0 0 0 0 0 ...
## $ n_dollar : int 0 0 0 0 0 0 0 0 0 0 ...
## $ n_percent : int 0 0 0 0 0 0 0 0 0 0 ...
## $ n_redirection : int 0 1 1 1 1 1 1 1 0 0 ...
## $ phishing : int 0 1 1 0 0 1 0 1 0 0 ...
## url_length n_dots n_hypens n_underline
## Min. : 4.00 Min. : 1.000 Min. : 0.0000 Min. : 0.0000
## 1st Qu.: 18.00 1st Qu.: 2.000 1st Qu.: 0.0000 1st Qu.: 0.0000
## Median : 24.00 Median : 2.000 Median : 0.0000 Median : 0.0000
## Mean : 39.18 Mean : 2.224 Mean : 0.4052 Mean : 0.1377
## 3rd Qu.: 44.00 3rd Qu.: 2.000 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## Max. :4165.00 Max. :24.000 Max. :43.0000 Max. :21.0000
## n_slash n_questionmark n_equal n_at
## Min. : 0.000 Min. :0.00000 Min. : 0.0000 Min. : 0.00000
## 1st Qu.: 0.000 1st Qu.:0.00000 1st Qu.: 0.0000 1st Qu.: 0.00000
## Median : 0.000 Median :0.00000 Median : 0.0000 Median : 0.00000
## Mean : 1.135 Mean :0.02439 Mean : 0.2158 Mean : 0.02214
## 3rd Qu.: 2.000 3rd Qu.:0.00000 3rd Qu.: 0.0000 3rd Qu.: 0.00000
## Max. :44.000 Max. :9.00000 Max. :23.0000 Max. :43.00000
## n_and n_exclamation n_space n_tilde
## Min. : 0.0000 Min. : 0.000000 Min. : 0.000000 Min. :0.000000
## 1st Qu.: 0.0000 1st Qu.: 0.000000 1st Qu.: 0.000000 1st Qu.:0.000000
## Median : 0.0000 Median : 0.000000 Median : 0.000000 Median :0.000000
## Mean : 0.1433 Mean : 0.002608 Mean : 0.004876 Mean :0.003617
## 3rd Qu.: 0.0000 3rd Qu.: 0.000000 3rd Qu.: 0.000000 3rd Qu.:0.000000
## Max. :26.0000 Max. :10.000000 Max. :18.000000 Max. :5.000000
## n_comma n_plus n_asterisk
## Min. : 0.000000 Min. : 0.000000 Min. : 0.000000
## 1st Qu.: 0.000000 1st Qu.: 0.000000 1st Qu.: 0.000000
## Median : 0.000000 Median : 0.000000 Median : 0.000000
## Mean : 0.002378 Mean : 0.002468 Mean : 0.004097
## 3rd Qu.: 0.000000 3rd Qu.: 0.000000 3rd Qu.: 0.000000
## Max. :11.000000 Max. :19.000000 Max. :60.000000
## n_hastag n_dollar n_percent n_redirection
## Min. :0.000e+00 Min. : 0.000000 Min. : 0.0000 Min. :-1.0000
## 1st Qu.:0.000e+00 1st Qu.: 0.000000 1st Qu.: 0.0000 1st Qu.: 0.0000
## Median :0.000e+00 Median : 0.000000 Median : 0.0000 Median : 0.0000
## Mean :4.497e-04 Mean : 0.001899 Mean : 0.1093 Mean : 0.3615
## 3rd Qu.:0.000e+00 3rd Qu.: 0.000000 3rd Qu.: 0.0000 3rd Qu.: 1.0000
## Max. :1.300e+01 Max. :10.000000 Max. :174.0000 Max. :17.0000
## phishing
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.3633
## 3rd Qu.:1.0000
## Max. :1.0000
Clean Column Names
## [1] "url_length" "n_dots" "n_hypens" "n_underline"
## [5] "n_slash" "n_questionmark" "n_equal" "n_at"
## [9] "n_and" "n_exclamation" "n_space" "n_tilde"
## [13] "n_comma" "n_plus" "n_asterisk" "n_hastag"
## [17] "n_dollar" "n_percent" "n_redirection" "phishing"
Check Missing Values
## url_length n_dots n_hypens n_underline n_slash
## 0 0 0 0 0
## n_questionmark n_equal n_at n_and n_exclamation
## 0 0 0 0 0
## n_space n_tilde n_comma n_plus n_asterisk
## 0 0 0 0 0
## n_hastag n_dollar n_percent n_redirection phishing
## 0 0 0 0 0
Remove Missing Values
## [1] 100077 20
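The code behind the missing-value check and removal is not shown in the rendered report. A minimal base-R sketch of the two steps, run on a small hypothetical data frame (`toy`) rather than on `df_raw`, might look like this:

```r
# Hypothetical toy data with one missing value (not the project data)
toy <- data.frame(
  url_length = c(37, 77, NA, 18),
  n_dots     = c(3, 1, 4, 2),
  phishing   = c(0, 1, 1, 0)
)

# Count missing values per column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Drop every row that contains at least one missing value
toy_complete <- na.omit(toy)
dim(toy_complete)
```

On the project data every per-column count is zero, so this step leaves all 100,077 rows in place.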
Check Duplicate Rows
## [1] 78186
Remove Duplicate Rows
## [1] 21891 20
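The duplicate check and removal are also shown only as output. A base-R sketch of what such a step could look like, again on hypothetical toy data and not necessarily the calls used in the report:

```r
# Hypothetical toy data: row 3 repeats row 1 exactly
toy <- data.frame(
  url_length = c(37, 77, 37, 18),
  n_dots     = c(3, 1, 3, 2),
  phishing   = c(0, 1, 0, 0)
)

# Number of rows that repeat an earlier row
n_dup <- sum(duplicated(toy))

# Keep only the first occurrence of each distinct row
toy_unique <- toy[!duplicated(toy), ]

n_dup
dim(toy_unique)
```

The reported counts are consistent with this definition of a duplicate: 100,077 - 78,186 = 21,891 rows remain.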
Check Target Variable
## [1] TRUE
##
## 0 1
## 6019 15872
Convert Data Types
df_clean <- df_clean %>%
mutate(across(-phishing, as.numeric)) %>%
mutate(phishing = as.factor(phishing))
str(df_clean)
## 'data.frame': 21891 obs. of 20 variables:
## $ url_length : num 37 77 126 18 55 32 19 81 42 104 ...
## $ n_dots : num 3 1 4 2 2 3 2 2 2 1 ...
## $ n_hypens : num 0 0 1 0 2 1 0 0 0 10 ...
## $ n_underline : num 0 0 2 0 0 0 0 0 0 0 ...
## $ n_slash : num 0 0 0 0 0 0 0 0 0 0 ...
## $ n_questionmark: num 0 0 1 0 0 0 0 0 0 0 ...
## $ n_equal : num 0 0 3 0 0 0 0 0 0 0 ...
## $ n_at : num 0 0 0 0 0 0 0 0 0 0 ...
## $ n_and : num 0 0 2 0 0 0 0 0 0 0 ...
## $ n_exclamation : num 0 0 0 0 0 0 0 0 0 0 ...
## $ n_space : num 0 0 0 0 0 0 0 0 0 0 ...
## $ n_tilde : num 0 0 0 0 0 0 0 0 0 0 ...
## $ n_comma : num 0 0 0 0 0 0 0 0 0 0 ...
## $ n_plus : num 0 0 0 0 0 0 0 0 0 0 ...
## $ n_asterisk : num 0 0 0 0 0 0 0 0 0 0 ...
## $ n_hastag : num 0 0 0 0 0 0 0 0 0 0 ...
## $ n_dollar : num 0 0 0 0 0 0 0 0 0 0 ...
## $ n_percent : num 0 0 0 0 0 0 0 0 0 0 ...
## $ n_redirection : num 0 1 1 1 1 1 1 1 0 0 ...
## $ phishing : Factor w/ 2 levels "0","1": 1 2 2 1 1 2 1 2 1 1 ...
Check Abnormal Negative Values
Most URL-based features are count-based or length-based features. Therefore, negative values are not expected in these numerical columns.
negative_summary <- df_clean %>%
summarise(across(where(is.numeric), ~ sum(. < 0, na.rm = TRUE))) %>%
pivot_longer(
cols = everything(),
names_to = "feature",
values_to = "negative_count"
) %>%
arrange(desc(negative_count))
negative_summary
Show Features with Negative Values Only
negative_summary_nonzero <- negative_summary %>%
filter(negative_count > 0)
negative_summary_nonzero
Check Rows Containing Negative Values
rows_with_negative <- df_clean %>%
filter(if_any(where(is.numeric), ~ . < 0))
dim(rows_with_negative)
## [1] 1584 20
Remove Abnormal Negative Values
before_negative_removal <- nrow(df_clean)
df_clean <- df_clean %>%
filter(if_all(where(is.numeric), ~ . >= 0))
after_negative_removal <- nrow(df_clean)
removed_negative_rows <- before_negative_removal - after_negative_removal
removed_negative_rows
## [1] 1584
## [1] 20307 20
Final Negative Value Check
negative_summary_after <- df_clean %>%
summarise(across(where(is.numeric), ~ sum(. < 0, na.rm = TRUE))) %>%
pivot_longer(
cols = everything(),
names_to = "feature",
values_to = "negative_count"
) %>%
arrange(desc(negative_count))
negative_summary_after
Final Missing Value Check
## url_length n_dots n_hypens n_underline n_slash
## 0 0 0 0 0
## n_questionmark n_equal n_at n_and n_exclamation
## 0 0 0 0 0
## n_space n_tilde n_comma n_plus n_asterisk
## 0 0 0 0 0
## n_hastag n_dollar n_percent n_redirection phishing
## 0 0 0 0 0
Save Cleaned Dataset
Re-label Target for Readability
For clearer plots and confusion matrices in later sections, we
relabel the factor levels of phishing from 0/1
to Legitimate/Phishing. The underlying data is
unchanged.
df_clean$phishing <- factor(df_clean$phishing,
levels = c("0", "1"),
labels = c("Legitimate", "Phishing"))
table(df_clean$phishing)
##
## Legitimate Phishing
## 5452 14855
This section summarises the data quality issues identified during exploration, the goals of preprocessing, and the research questions that guide the modeling.
1. Incorrect or Inconsistent Values
Negative values were found in n_redirection, which may
indicate incorrect or inconsistent data. These rows are removed during
cleaning.
2. Outliers
Significant outliers exist in several variables, including
url_length, n_asterisk, and
n_percent.
3. Duplicate Records
A large number of duplicated rows were detected (78,186 duplicate records out of 100,077) and removed during cleaning, leaving 21,891 unique rows.
4. Class Imbalance
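The target variable phishing is moderately imbalanced, and the direction of the imbalance changes during cleaning. In the raw data, legitimate URLs are the majority. After duplicates and rows with negative values are removed, phishing URLs are the majority:
| Class | Raw data | After cleaning |
|---|---|---|
| Legitimate (0) | 63.67% | 26.85% |
| Phishing (1) | 36.33% | 73.15% |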
The preprocessing stage aims to improve dataset quality and prepare the data for machine learning.
1. Improve Data Quality — handling incorrect values, outliers, and duplicate records.
2. Prepare the Dataset for Machine Learning — selecting relevant features, reducing noise, and improving model performance. Important features identified from correlation analysis include:
| Feature | Correlation with phishing |
|---|---|
| n_slash | 0.611472 |
| url_length | 0.430125 |
| n_equal | 0.260462 |
3. Handle Class Imbalance — balanced model evaluation is ensured using stratified train-test splitting, given the moderate class imbalance.
Question 1 — Classification (Random Forest)
Which URL lexical features contribute the most to distinguishing phishing websites from legitimate ones?
Objective: To develop a classification model that predicts phishing websites based on URL features and to identify the strongest indicators of phishing.
Why this question matters. Knowing which characteristics most strongly signal a phishing URL is more valuable than a black-box prediction alone. If a small number of features (for example, the number of slashes or the overall URL length) carry most of the predictive power, security teams can build lightweight, explainable detection rules that run in real time inside browsers or email gateways without inspecting full page content; prioritise what human analysts inspect when manually reviewing suspicious links; and reduce the feature set needed for deployment, lowering computational cost at scale.
Why Random Forest. Random Forest produces a
principled, built-in measure of feature importance (Mean Decrease in
Gini / accuracy) while simultaneously modeling non-linear relationships
and interactions between features. It is also robust to the moderate
class imbalance and to the extreme outliers present in features such as
url_length, making it both an accurate classifier and a
reliable tool for ranking feature relevance.
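To make Mean Decrease in Gini more concrete, the sketch below computes the Gini impurity of a node and the impurity decrease produced by one candidate split. The helper names (`gini_impurity`, `gini_decrease`) and the toy labels are illustrative assumptions. `randomForest` computes and aggregates decreases of this kind internally across all trees, so this is not its actual implementation.

```r
# Gini impurity of a set of class labels: 1 - sum(p_k^2)
gini_impurity <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Impurity decrease from splitting a parent node into left/right children,
# with each child weighted by its share of the parent's observations
gini_decrease <- function(labels, go_left) {
  n     <- length(labels)
  left  <- labels[go_left]
  right <- labels[!go_left]
  gini_impurity(labels) -
    (length(left) / n) * gini_impurity(left) -
    (length(right) / n) * gini_impurity(right)
}

# Hypothetical node: 4 legitimate and 4 phishing URLs, split at n_slash < 2
labels  <- c("Legitimate", "Legitimate", "Legitimate", "Legitimate",
             "Phishing", "Phishing", "Phishing", "Phishing")
n_slash <- c(0, 0, 1, 3, 2, 4, 5, 1)

gini_impurity(labels)               # impurity of the parent node
gini_decrease(labels, n_slash < 2)  # decrease achieved by the split
```

A feature that repeatedly produces large decreases like this across many trees ends up with a high Mean Decrease in Gini.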
Question 2 — Regression (Linear Regression)
How do the individual URL character features relate to the overall length of a URL, and to what extent can URL length be explained by them?
Objective: To analyze the relationship between URL features and URL length using a regression model, and to determine whether phishing-related URLs tend to become longer due to increased use of special characters.
Why this question matters. “Long URLs are
suspicious” is a common heuristic in phishing detection, but it is
rarely unpacked. This question explains what actually drives
URL length — extra subdirectories (slashes), longer query strings
(=, &), or other character types. It gives
a structural interpretation of the headline finding from Question 1,
produces directly interpretable coefficients (each coefficient states
how many additional characters of length are associated with one more
occurrence of a given character type), and provides a continuous,
complementary view of URL structure alongside the binary
classification.
Why Linear Regression. The target here,
url_length, is a continuous numeric variable, so a
regression method is required. Linear regression directly models URL
length as a function of the other character-count features and yields
transparent, easily explained coefficients — exactly what is needed to
describe relationships rather than just predict.
ggplot(df_clean, aes(x = phishing, fill = phishing)) +
geom_bar(width = 0.6) +
geom_text(stat = "count", aes(label = comma(after_stat(count))),
vjust = -0.4, size = 4.5, fontface = "bold") +
scale_fill_manual(values = c("Legitimate" = "#2E86AB", "Phishing" = "#E63946")) +
labs(title = "Distribution of Phishing vs Legitimate URLs",
x = "URL Class", y = "Count") +
theme_minimal(base_size = 13) +
theme(legend.position = "none",
plot.title = element_text(face = "bold"))
# Proportions
prop.table(table(df_clean$phishing)) %>%
round(4) %>%
kable(caption = "Class proportions") %>%
kable_styling(full_width = FALSE)
| Var1 | Freq |
|---|---|
| Legitimate | 0.2685 |
| Phishing | 0.7315 |
The classes are moderately imbalanced: roughly 27% Legitimate and 73% Phishing after cleaning. We do not apply oversampling, but because a classifier that always predicts Phishing would already reach about 73% accuracy, we monitor metrics beyond accuracy (e.g., Recall, Specificity, F1-score) in the classification model.
# Cap extreme outliers for visualization
ggplot(df_clean %>% filter(url_length <= 300),
aes(x = url_length, fill = phishing)) +
geom_histogram(bins = 60, alpha = 0.75, position = "identity") +
scale_fill_manual(values = c("Legitimate" = "#2E86AB", "Phishing" = "#E63946")) +
labs(title = "Distribution of URL Length by Class (truncated at 300)",
x = "URL Length", y = "Count", fill = "Class") +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold"))
Phishing URLs tend to be longer on average than legitimate URLs, which is consistent with attackers padding URLs to obfuscate malicious domains. This observation directly motivates Question 2: if URL length is a useful signal, it is worth understanding what drives it.
# Select features with the highest correlation to the target
key_features <- c("url_length", "n_slash", "n_dots", "n_equal", "n_and", "n_hypens")
df_long <- df_clean %>%
select(all_of(key_features), phishing) %>%
pivot_longer(-phishing, names_to = "Feature", values_to = "Value")
ggplot(df_long, aes(x = phishing, y = Value, fill = phishing)) +
geom_boxplot(outlier.alpha = 0.15, outlier.size = 0.6) +
facet_wrap(~ Feature, scales = "free_y", ncol = 3) +
scale_fill_manual(values = c("Legitimate" = "#2E86AB", "Phishing" = "#E63946")) +
labs(title = "Key Feature Distributions by Class",
x = NULL, y = "Value") +
theme_minimal(base_size = 12) +
theme(legend.position = "none",
plot.title = element_text(face = "bold"),
strip.text = element_text(face = "bold"))
Across the key features, phishing URLs tend to show higher medians and wider spreads, suggesting that lexical complexity is a useful indicator of malicious intent.
# Compute correlation matrix using numeric target
corr_data <- df_clean %>%
mutate(phishing_num = as.numeric(phishing) - 1) %>%
select(-phishing)
corr_mat <- cor(corr_data)
corrplot(corr_mat,
method = "color",
type = "upper",
tl.cex = 0.85,
tl.col = "black",
addCoef.col = "black",
number.cex = 0.6,
col = colorRampPalette(c("#2E86AB", "white", "#E63946"))(200),
title = "Correlation Matrix of Features",
mar = c(0, 0, 2, 0))
n_slash, url_length, and
n_equal show the strongest positive
correlations with the phishing target. The heatmap also shows
that several character counts are correlated with
url_length, which is relevant to the regression in Question
2.
target_corr <- data.frame(
Feature = names(corr_mat[, "phishing_num"]),
Correlation = corr_mat[, "phishing_num"]
) %>%
filter(Feature != "phishing_num") %>%
arrange(desc(abs(Correlation)))
ggplot(target_corr, aes(x = reorder(Feature, Correlation), y = Correlation,
fill = Correlation > 0)) +
geom_col() +
coord_flip() +
scale_fill_manual(values = c("TRUE" = "#E63946", "FALSE" = "#2E86AB")) +
labs(title = "Feature Correlation with Phishing Target",
x = "Feature", y = "Correlation") +
theme_minimal(base_size = 12) +
theme(legend.position = "none",
plot.title = element_text(face = "bold"))
Question: Which URL lexical features contribute the most to distinguishing phishing websites from legitimate ones?
To answer this, we train a Random Forest classifier to predict whether a URL is phishing or legitimate, then examine its feature-importance scores to rank the features by how much each contributes to the model’s decisions. The classifier’s predictive performance is reported first to confirm that the model is accurate enough for its importance ranking to be trustworthy.
We use a 70/30 stratified split to preserve the class distribution. This same split is reused in Question 2 for consistency.
set.seed(42)
train_index <- createDataPartition(df_clean$phishing, p = 0.7, list = FALSE)
train_data <- df_clean[train_index, ]
test_data <- df_clean[-train_index, ]
cat("Training set:", nrow(train_data), "rows\n")
## Training set: 14216 rows
## Testing set: 6091 rows
##
## Legitimate Phishing
## 0.2685 0.7315
##
## Legitimate Phishing
## 0.2684 0.7316
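`createDataPartition` stratifies on the outcome, which is why the class proportions above are nearly identical in the training and testing sets. The base-R sketch below illustrates the same idea on hypothetical labels. It is not the project data and not caret's actual implementation.

```r
set.seed(42)

# Hypothetical labels with roughly the cleaned data's class mix
y <- factor(rep(c("Legitimate", "Phishing"), times = c(270, 730)))

# Sample 70% of the row indices within each class separately
train_idx <- unlist(lapply(split(seq_along(y), y), function(idx) {
  sample(idx, size = round(0.7 * length(idx)))
}))

train_y <- y[train_idx]
test_y  <- y[-train_idx]

# Class proportions are preserved in both partitions
round(prop.table(table(train_y)), 4)
round(prop.table(table(test_y)), 4)
```

Because sampling happens within each class, neither partition can end up with a noticeably different share of phishing URLs than the full dataset.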
set.seed(42)
rf_model <- randomForest(phishing ~ .,
data = train_data,
ntree = 200,
mtry = floor(sqrt(ncol(train_data) - 1)),
importance = TRUE)
print(rf_model)
##
## Call:
## randomForest(formula = phishing ~ ., data = train_data, ntree = 200, mtry = floor(sqrt(ncol(train_data) - 1)), importance = TRUE)
## Type of random forest: classification
## Number of trees: 200
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 17.64%
## Confusion matrix:
## Legitimate Phishing class.error
## Legitimate 2214 1603 0.4199633
## Phishing 905 9494 0.0870276
We first confirm the model classifies well, so that its feature ranking is credible.
rf_preds <- predict(rf_model, newdata = test_data)
rf_probs <- predict(rf_model, newdata = test_data, type = "prob")[, "Phishing"]
rf_cm <- confusionMatrix(rf_preds, test_data$phishing, positive = "Phishing")
rf_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction Legitimate Phishing
## Legitimate 972 421
## Phishing 663 4035
##
## Accuracy : 0.822
## 95% CI : (0.8122, 0.8316)
## No Information Rate : 0.7316
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5246
##
## Mcnemar's Test P-Value : 2.482e-13
##
## Sensitivity : 0.9055
## Specificity : 0.5945
## Pos Pred Value : 0.8589
## Neg Pred Value : 0.6978
## Prevalence : 0.7316
## Detection Rate : 0.6625
## Detection Prevalence : 0.7713
## Balanced Accuracy : 0.7500
##
## 'Positive' Class : Phishing
##
rf_metrics <- data.frame(
Metric = c("Accuracy", "Precision", "Recall (Sensitivity)",
"Specificity", "F1-Score"),
Value = c(
round(rf_cm$overall["Accuracy"], 4),
round(rf_cm$byClass["Precision"], 4),
round(rf_cm$byClass["Sensitivity"], 4),
round(rf_cm$byClass["Specificity"], 4),
round(rf_cm$byClass["F1"], 4)
)
)
kable(rf_metrics, caption = "Random Forest Classification Performance") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
|  | Metric | Value |
|---|---|---|
| Accuracy | Accuracy | 0.8220 |
| Precision | Precision | 0.8589 |
| Sensitivity | Recall (Sensitivity) | 0.9055 |
| Specificity | Specificity | 0.5945 |
| F1 | F1-Score | 0.8816 |
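The values in this table can be reproduced by hand from the four cells of the test-set confusion matrix, with Phishing as the positive class. The sketch below does that arithmetic in base R. The variable names are illustrative.

```r
# Cells of the test-set confusion matrix reported above (positive = Phishing)
tp <- 4035  # predicted Phishing,   actually Phishing
fp <- 663   # predicted Phishing,   actually Legitimate
fn <- 421   # predicted Legitimate, actually Phishing
tn <- 972   # predicted Legitimate, actually Legitimate

accuracy    <- (tp + tn) / (tp + fp + fn + tn)
precision   <- tp / (tp + fp)
recall      <- tp / (tp + fn)
specificity <- tn / (tn + fp)
f1          <- 2 * precision * recall / (precision + recall)

round(c(Accuracy = accuracy, Precision = precision, Recall = recall,
        Specificity = specificity, F1 = f1), 4)
```

The gap between recall (0.9055) and specificity (0.5945) is the reason accuracy alone is not a sufficient summary here: the model catches most phishing URLs but misclassifies a large share of legitimate ones.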
rf_roc <- roc(test_data$phishing, rf_probs, levels = c("Legitimate", "Phishing"))
rf_auc <- auc(rf_roc)
cat("Random Forest AUC:", round(rf_auc, 4), "\n")
## Random Forest AUC: 0.8708
plot(rf_roc, col = "#E63946", lwd = 2.5,
main = paste("Random Forest ROC Curve (AUC =", round(rf_auc, 4), ")"))
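abline(a = 0, b = 1, lty = 2, col = "gray")
With the model's performance established (accuracy 0.822, AUC 0.871), its feature-importance scores provide a reasonable basis for answering Question 1, although the low specificity (0.59) means legitimate URLs are often misclassified. We report both importance measures produced by Random Forest: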
imp <- importance(rf_model)
imp_df <- data.frame(
Feature = rownames(imp),
MeanDecreaseAccuracy = imp[, "MeanDecreaseAccuracy"],
MeanDecreaseGini = imp[, "MeanDecreaseGini"]
) %>%
arrange(desc(MeanDecreaseGini))
# Table of ranked importance
kable(imp_df, caption = "Random Forest Feature Importance (ranked by Gini)",
digits = 2, row.names = FALSE) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Feature | MeanDecreaseAccuracy | MeanDecreaseGini |
|---|---|---|
| n_slash | 51.02 | 691.82 |
| url_length | 53.28 | 612.51 |
| n_hypens | 70.00 | 406.88 |
| n_dots | 40.43 | 231.92 |
| n_equal | 30.08 | 192.83 |
| n_underline | 44.75 | 155.02 |
| n_redirection | 24.26 | 146.20 |
| n_percent | 25.22 | 73.04 |
| n_and | 16.65 | 59.28 |
| n_at | 26.60 | 55.50 |
| n_questionmark | 13.37 | 44.86 |
| n_plus | 13.53 | 21.12 |
| n_space | 6.98 | 15.15 |
| n_tilde | 12.30 | 13.29 |
| n_exclamation | 12.77 | 11.27 |
| n_asterisk | 6.99 | 3.50 |
| n_comma | 0.85 | 1.63 |
| n_dollar | 3.50 | 1.02 |
| n_hastag | 0.01 | 0.30 |
# Plot - Mean Decrease in Gini
ggplot(imp_df, aes(x = reorder(Feature, MeanDecreaseGini), y = MeanDecreaseGini)) +
geom_col(fill = "#E63946") +
coord_flip() +
labs(title = "Feature Importance for Phishing Detection (Mean Decrease in Gini)",
x = "Feature", y = "Mean Decrease in Gini") +
theme_minimal(base_size = 12) +
theme(plot.title = element_text(face = "bold"))
# Plot - Mean Decrease in Accuracy
ggplot(imp_df, aes(x = reorder(Feature, MeanDecreaseAccuracy),
y = MeanDecreaseAccuracy)) +
geom_col(fill = "#2E86AB") +
coord_flip() +
labs(title = "Feature Importance for Phishing Detection (Mean Decrease in Accuracy)",
x = "Feature", y = "Mean Decrease in Accuracy") +
theme_minimal(base_size = 12) +
theme(plot.title = element_text(face = "bold"))
Interpretation. The most influential features for distinguishing phishing from legitimate URLs are n_slash, url_length, and n_hypens. Both importance measures place these three at the top, although in a different order (n_hypens ranks first by Mean Decrease in Accuracy and third by Mean Decrease in Gini), which strengthens confidence in the ranking. n_dots, n_underline, and n_equal form a second tier under both measures. The dominance of these structural indicators makes intuitive sense: phishing URLs frequently use long paths and hyphenated names to mimic trusted domains, inflating slash counts, hyphen counts, and overall length. Rare special characters (e.g., n_hastag, n_comma, n_dollar) contribute little and could be dropped from a lightweight deployed model.
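One way to check how closely the two importance measures agree is to compare their rankings directly. The sketch below re-enters the scores from the importance table and computes a Spearman rank correlation plus the top three features under each measure. The name `imp_tbl` is hypothetical; in the actual analysis the same calls could be applied to `imp_df`.

```r
# Importance scores copied from the feature-importance table
imp_tbl <- data.frame(
  Feature = c("n_slash", "url_length", "n_hypens", "n_dots", "n_equal",
              "n_underline", "n_redirection", "n_percent", "n_and", "n_at",
              "n_questionmark", "n_plus", "n_space", "n_tilde",
              "n_exclamation", "n_asterisk", "n_comma", "n_dollar", "n_hastag"),
  MeanDecreaseAccuracy = c(51.02, 53.28, 70.00, 40.43, 30.08, 44.75, 24.26,
                           25.22, 16.65, 26.60, 13.37, 13.53, 6.98, 12.30,
                           12.77, 6.99, 0.85, 3.50, 0.01),
  MeanDecreaseGini = c(691.82, 612.51, 406.88, 231.92, 192.83, 155.02, 146.20,
                       73.04, 59.28, 55.50, 44.86, 21.12, 15.15, 13.29,
                       11.27, 3.50, 1.63, 1.02, 0.30)
)

# Spearman rank correlation between the two importance measures
rho <- cor(imp_tbl$MeanDecreaseAccuracy, imp_tbl$MeanDecreaseGini,
           method = "spearman")

# Top three features under each measure
top_acc  <- head(imp_tbl$Feature[order(-imp_tbl$MeanDecreaseAccuracy)], 3)
top_gini <- head(imp_tbl$Feature[order(-imp_tbl$MeanDecreaseGini)], 3)

rho
top_acc
top_gini
```

A rank correlation close to 1 together with the same three features at the top of both lists indicates that the choice of importance measure does not change the overall picture.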
Question: How do the individual URL character features relate to the overall length of a URL, and to what extent can URL length be explained by them?
Question 1 established that url_length is one of the
strongest signals of phishing. This section unpacks what makes a URL
long by modeling url_length as a linear function of
the other character-count features. The coefficients describe the
relationship between each character type and total URL length, while the
model fit (R²) tells us how completely these features account for
length.
The predictors are all character-count features;
url_length is the continuous response. We drop the
phishing label here because this question is about URL
structure, not phishing status. We reuse the same train/test rows as
Question 1.
reg_train <- train_data %>% select(-phishing)
reg_test <- test_data %>% select(-phishing)
cat("Regression training rows:", nrow(reg_train), "\n")
## Regression training rows: 14216
## Regression testing rows: 6091
## Response variable: url_length
## Number of predictors: 18
lm_model <- lm(url_length ~ ., data = reg_train)
summary(lm_model)
##
## Call:
## lm(formula = url_length ~ ., data = reg_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -296.25 -18.36 -6.71 8.82 1410.79
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.2945 0.9405 19.451 < 2e-16 ***
## n_dots 5.0378 0.2352 21.418 < 2e-16 ***
## n_hypens 8.7877 0.1692 51.925 < 2e-16 ***
## n_underline 10.2887 0.2901 35.470 < 2e-16 ***
## n_slash 5.4164 0.1677 32.306 < 2e-16 ***
## n_questionmark 27.4471 1.5440 17.776 < 2e-16 ***
## n_equal 13.5815 0.5535 24.538 < 2e-16 ***
## n_at -2.0934 0.8095 -2.586 0.00972 **
## n_and 2.3346 0.5665 4.121 3.80e-05 ***
## n_exclamation -6.7486 2.4324 -2.774 0.00554 **
## n_space 7.6875 1.1869 6.477 9.68e-11 ***
## n_tilde 52.9407 2.4204 21.872 < 2e-16 ***
## n_comma 4.5565 5.9308 0.768 0.44234
## n_plus 0.5378 1.6444 0.327 0.74361
## n_asterisk -1.5250 0.7030 -2.169 0.03007 *
## n_hastag 10.1117 4.0877 2.474 0.01339 *
## n_dollar 10.5621 2.2756 4.641 3.49e-06 ***
## n_percent 4.1039 0.1239 33.115 < 2e-16 ***
## n_redirection -0.6137 0.4591 -1.337 0.18137
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 47.74 on 14197 degrees of freedom
## Multiple R-squared: 0.5904, Adjusted R-squared: 0.5898
## F-statistic: 1137 on 18 and 14197 DF, p-value: < 2.2e-16
lm_pred <- predict(lm_model, newdata = reg_test)
actual <- reg_test$url_length
rmse <- sqrt(mean((actual - lm_pred)^2))
mae <- mean(abs(actual - lm_pred))
sse <- sum((actual - lm_pred)^2)
sst <- sum((actual - mean(actual))^2)
r2_test <- 1 - sse / sst
lm_metrics <- data.frame(
Metric = c("R-squared (test)", "RMSE (test)", "MAE (test)",
"Adjusted R-squared (train)"),
Value = c(
round(r2_test, 4),
round(rmse, 4),
round(mae, 4),
round(summary(lm_model)$adj.r.squared, 4)
)
)
kable(lm_metrics, caption = "Linear Regression Performance") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Metric | Value |
|---|---|
| R-squared (test) | 0.4215 |
| RMSE (test) | 64.7577 |
| MAE (test) | 23.7238 |
| Adjusted R-squared (train) | 0.5898 |
diag_df <- data.frame(
Actual = actual,
Predicted = lm_pred,
Residual = actual - lm_pred
)
# Predicted vs Actual
ggplot(diag_df %>% filter(Actual <= 300), aes(x = Actual, y = Predicted)) +
geom_point(alpha = 0.15, color = "#2E86AB") +
geom_abline(slope = 1, intercept = 0, color = "#E63946",
linetype = "dashed", linewidth = 1) +
labs(title = "Predicted vs Actual URL Length (truncated at 300)",
x = "Actual URL Length", y = "Predicted URL Length") +
theme_minimal(base_size = 12) +
theme(plot.title = element_text(face = "bold"))
# Residuals vs Fitted
ggplot(diag_df %>% filter(Predicted <= 300),
aes(x = Predicted, y = Residual)) +
geom_point(alpha = 0.15, color = "#2E86AB") +
geom_hline(yintercept = 0, color = "#E63946",
linetype = "dashed", linewidth = 1) +
labs(title = "Residuals vs Fitted Values (truncated at 300)",
x = "Fitted URL Length", y = "Residual") +
theme_minimal(base_size = 12) +
theme(plot.title = element_text(face = "bold"))
Each coefficient gives the expected change in URL length for one additional occurrence of that character, holding the others constant. Sorting by magnitude shows which character types contribute most to making a URL longer.
coef_df <- as.data.frame(summary(lm_model)$coefficients) %>%
tibble::rownames_to_column("Feature") %>%
rename(p_value = `Pr(>|t|)`) %>%
filter(Feature != "(Intercept)") %>%
arrange(desc(abs(Estimate)))
kable(coef_df, caption = "Linear Regression Coefficients (sorted by magnitude)",
digits = 4, row.names = FALSE) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Feature | Estimate | Std. Error | t value | p_value |
|---|---|---|---|---|
| n_tilde | 52.9407 | 2.4204 | 21.8724 | 0.0000 |
| n_questionmark | 27.4471 | 1.5440 | 17.7761 | 0.0000 |
| n_equal | 13.5815 | 0.5535 | 24.5375 | 0.0000 |
| n_dollar | 10.5621 | 2.2756 | 4.6414 | 0.0000 |
| n_underline | 10.2887 | 0.2901 | 35.4697 | 0.0000 |
| n_hastag | 10.1117 | 4.0877 | 2.4737 | 0.0134 |
| n_hypens | 8.7877 | 0.1692 | 51.9253 | 0.0000 |
| n_space | 7.6875 | 1.1869 | 6.4768 | 0.0000 |
| n_exclamation | -6.7486 | 2.4324 | -2.7745 | 0.0055 |
| n_slash | 5.4164 | 0.1677 | 32.3055 | 0.0000 |
| n_dots | 5.0378 | 0.2352 | 21.4185 | 0.0000 |
| n_comma | 4.5565 | 5.9308 | 0.7683 | 0.4423 |
| n_percent | 4.1039 | 0.1239 | 33.1147 | 0.0000 |
| n_and | 2.3346 | 0.5665 | 4.1209 | 0.0000 |
| n_at | -2.0934 | 0.8095 | -2.5861 | 0.0097 |
| n_asterisk | -1.5250 | 0.7030 | -2.1693 | 0.0301 |
| n_redirection | -0.6137 | 0.4591 | -1.3366 | 0.1814 |
| n_plus | 0.5378 | 1.6444 | 0.3271 | 0.7436 |
ggplot(coef_df, aes(x = reorder(Feature, Estimate), y = Estimate,
fill = Estimate > 0)) +
geom_col() +
coord_flip() +
scale_fill_manual(values = c("TRUE" = "#E63946", "FALSE" = "#2E86AB"),
labels = c("TRUE" = "Increases length",
"FALSE" = "Decreases length"),
name = NULL) +
labs(title = "How Each Character Feature Relates to URL Length",
x = "Feature", y = "Coefficient (change in URL length per unit)") +
theme_minimal(base_size = 12) +
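theme(plot.title = element_text(face = "bold"))
Interpretation. The coefficients quantify the structural composition of a URL. The largest positive coefficients belong to n_tilde (52.9), n_questionmark (27.4), and n_equal (13.6), followed by n_dollar, n_underline, n_hastag, and n_hypens (roughly 9 to 11 each). Each additional occurrence of these characters is associated with a substantially longer URL, plausibly because each one tends to introduce a further path or query segment. The frequent structural characters n_slash (5.4) and n_dots (5.0) have smaller per-occurrence coefficients but occur far more often. The model explains a moderate share of the variation in URL length (R² of 0.59 on the training data and 0.42 on the test set). Some explanatory power is expected, since length is in part a sum of its parts, but the character counts leave much of the length unaccounted for, such as the letters and digits between delimiters. This gives partial structural support to the Question 1 finding: longer URLs tend to contain more path segments, hyphens, and query components, and several of these are also among the features the classifier relies on.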
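As a worked illustration of how to read the coefficients, the sketch below adds up the fitted contributions for one hypothetical URL profile (2 dots, 3 slashes, 1 question mark, 2 equals signs, 1 ampersand, all other counts zero). The profile is invented for illustration, and the coefficients are the rounded estimates from the regression output.

```r
# Rounded estimates from the regression output
intercept <- 18.2945
coefs <- c(n_dots = 5.0378, n_slash = 5.4164, n_questionmark = 27.4471,
           n_equal = 13.5815, n_and = 2.3346)

# Hypothetical URL profile (all other character counts are zero)
counts <- c(n_dots = 2, n_slash = 3, n_questionmark = 1,
            n_equal = 2, n_and = 1)

# Contribution of each character type, and the resulting fitted length
contributions    <- coefs[names(counts)] * counts
predicted_length <- intercept + sum(contributions)

round(contributions, 2)
round(predicted_length, 1)
```

With a residual standard error of about 48 characters, an individual prediction like this is only a rough guide. The value of the exercise is in seeing how much each character type contributes to the total.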
Note on interpretation. Because url_length is partly composed of the very characters being counted, the predictors are mechanically related to the response and to one another (multicollinearity). The coefficients should therefore be read as compositional contributions to length rather than independent causal effects. This is a natural property of the data and is discussed further below.
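One common way to quantify multicollinearity among predictors is the variance inflation factor, VIF_j = 1 / (1 - R_j²), where R_j² comes from regressing predictor j on all the other predictors. The sketch below implements this in base R on simulated data. The helper `vif_base` and the toy variables are assumptions made for illustration; in the actual analysis the same function could be applied to the predictor columns of `reg_train`.

```r
# Variance inflation factor for each column of a numeric predictor data frame
vif_base <- function(predictors) {
  sapply(names(predictors), function(v) {
    others <- setdiff(names(predictors), v)
    # Regress predictor v on all remaining predictors
    fit <- lm(reformulate(others, response = v), data = predictors)
    1 / (1 - summary(fit)$r.squared)
  })
}

# Simulated counts: n_and is built to track n_equal, n_dots is independent
set.seed(42)
n_equal <- rpois(500, lambda = 1)
toy <- data.frame(
  n_equal = n_equal,
  n_and   = pmax(n_equal - rbinom(500, 1, 0.5), 0),
  n_dots  = rpois(500, lambda = 2)
)

vif_values <- vif_base(toy)
round(vif_values, 2)
```

A VIF near 1 means a predictor is nearly uncorrelated with the others. Values well above 1 (thresholds of 5 or 10 are often quoted as rules of thumb) signal that the corresponding coefficient estimate is less stable.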
Using two complementary techniques, this project answered two distinct questions about phishing URLs.
Question 1 (Classification, Random Forest). The Random Forest classifier achieved strong performance, giving confidence in its feature-importance ranking. The analysis showed that a small set of structural features (n_slash, url_length, and n_dots) dominates phishing detection, with query-related characters playing a secondary role and rare symbols contributing little. The practical value is direct: a deployed detector can focus on these few features for fast, explainable, low-cost screening inside browsers or email gateways, and human analysts know which features to inspect first.
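To illustrate how lightweight such screening can be, the sketch below derives the three top-ranked features directly from raw URL strings using base R. The helper names `count_char` and `extract_top_features` and the example URLs are hypothetical. The sketch also assumes the counts are taken over the full URL, including the scheme, which may differ from how the dataset was built.

```r
# Count occurrences of a literal character in each URL (base R only).
count_char <- function(urls, ch) {
  lengths(regmatches(urls, gregexpr(ch, urls, fixed = TRUE)))
}

# Hypothetical helper: derive the three top-ranked features from raw URLs.
extract_top_features <- function(urls) {
  data.frame(
    url_length = nchar(urls),
    n_slash    = count_char(urls, "/"),
    n_dots     = count_char(urls, "."),
    stringsAsFactors = FALSE
  )
}

# Made-up URLs for illustration only.
example_urls <- c(
  "https://example.com/",
  "http://login.example.com.account-verify.test/a/b/c/index.php?id=1"
)
features <- extract_top_features(example_urls)
print(features)
```

A data frame in this shape could then be passed to a fitted classifier's `predict()` method, provided the model was trained on columns with the same names and definitions.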
Question 2 (Regression, Linear Regression). The linear model explained URL length well and revealed why length is informative. The largest positive coefficients belonged to slashes, equals signs, ampersands, and dots — the same structural components that drive the phishing signal. In other words, the two analyses reinforce each other: phishing URLs are longer because they accumulate more subdirectories and query parameters, and those same components are what the classifier relies on. The main caveat is the mechanical relationship between URL length and its character counts, which inflates R² and induces multicollinearity; coefficients are best interpreted as compositional contributions rather than independent causal effects.
Together, the classification and regression results tell a consistent story from two angles: the classifier identifies which features signal phishing, and the regression explains how those features manifest in the measurable structure of a URL.
This project addressed two separate research questions, each with an appropriate modeling technique:
Classification (Random Forest): which features matter most for phishing detection? The most important indicators are n_slash, url_length, and n_dots. These structural features carry most of the predictive power, enabling lightweight and explainable detection.
Regression (Linear Regression): how do features relate to URL length? URL length is largely explained by its component character counts, with slashes, equals signs, ampersands, and dots contributing most. This explains, in structural terms, why URL length is such a useful phishing signal.
Practical implications. A model based on URL features alone can serve as a fast, first-line filter that flags suspicious links before users click them, without fetching page content. The feature ranking shows this can be done with only a handful of features.
Limitations and future work. The dataset relies purely on URL character statistics and does not capture HTML, JavaScript, or domain-registration signals. The regression’s predictors are also mechanically related to URL length. Future work could:
## R version 4.5.3 (2026-03-11 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] LC_COLLATE=English_United Kingdom.utf8
## [2] LC_CTYPE=English_United Kingdom.utf8
## [3] LC_MONETARY=English_United Kingdom.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United Kingdom.utf8
##
## time zone: Asia/Kuala_Lumpur
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] kableExtra_1.4.0 knitr_1.51 pROC_1.19.0.1
## [4] randomForest_4.7-1.2 caret_7.0-1 lattice_0.22-9
## [7] scales_1.4.0 GGally_2.4.0 lubridate_1.9.5
## [10] forcats_1.0.1 stringr_1.6.0 purrr_1.2.2
## [13] readr_2.2.0 tibble_3.3.1 tidyverse_2.0.0
## [16] skimr_2.2.2 corrplot_0.95 ggplot2_4.0.2
## [19] tidyr_1.3.2 janitor_2.2.1 dplyr_1.2.1
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 viridisLite_0.4.3 timeDate_4052.112
## [4] farver_2.1.2 S7_0.2.1 fastmap_1.2.0
## [7] digest_0.6.39 rpart_4.1.24 timechange_0.4.0
## [10] lifecycle_1.0.5 survival_3.8-6 magrittr_2.0.4
## [13] compiler_4.5.3 rlang_1.1.7 sass_0.4.10
## [16] tools_4.5.3 yaml_2.3.12 data.table_1.18.2.1
## [19] labeling_0.4.3 xml2_1.5.2 plyr_1.8.9
## [22] repr_1.1.7 RColorBrewer_1.1-3 withr_3.0.2
## [25] stats4_4.5.3 nnet_7.3-20 grid_4.5.3
## [28] e1071_1.7-17 future_1.70.0 globals_0.19.1
## [31] iterators_1.0.14 MASS_7.3-65 cli_3.6.5
## [34] rmarkdown_2.31 generics_0.1.4 otel_0.2.0
## [37] rstudioapi_0.18.0 future.apply_1.20.2 reshape2_1.4.5
## [40] tzdb_0.5.0 proxy_0.4-29 cachem_1.1.0
## [43] splines_4.5.3 parallel_4.5.3 base64enc_0.1-6
## [46] vctrs_0.7.1 hardhat_1.4.3 Matrix_1.7-4
## [49] jsonlite_2.0.0 hms_1.1.4 listenv_0.10.1
## [52] systemfonts_1.3.2 foreach_1.5.2 gower_1.0.2
## [55] jquerylib_0.1.4 recipes_1.3.2 glue_1.8.0
## [58] parallelly_1.47.0 ggstats_0.13.0 codetools_0.2-20
## [61] stringi_1.8.7 gtable_0.3.6 pillar_1.11.1
## [64] htmltools_0.5.9 ipred_0.9-15 lava_1.9.1
## [67] R6_2.6.1 textshaping_1.0.5 evaluate_1.0.5
## [70] snakecase_0.11.1 bslib_0.10.0 class_7.3-23
## [73] Rcpp_1.1.1 svglite_2.2.2 nlme_3.1-168
## [76] prodlim_2026.03.11 xfun_0.56 ModelMetrics_1.2.2.2
## [79] pkgconfig_2.0.3