1 Introduction

Background

Phishing is a cyberattack that uses deceptive emails, messages, phone calls, or websites to trick individuals into revealing sensitive information, downloading malware, or exposing themselves to cybercrime. It is a form of social engineering that manipulates human behavior by impersonating trusted sources to steal data such as usernames, passwords, credit card details, and bank account information for malicious purposes (Konsinski, 2026; Cloudflare, 2026).

Problem Statement

As phishing techniques continue to advance, cybercriminals are creating more sophisticated and convincing attacks, making them increasingly challenging to detect (Sivaneswaran et al., 2026). This project focuses on developing machine learning models using R to better understand phishing URLs based on their lexical features. The project is built around two distinct research questions, each answered with a different and appropriate modeling technique:

(Classification) Which URL lexical features contribute the most to distinguishing phishing websites from legitimate ones?
(Regression) How do the individual URL character features relate to the overall length of a URL, and to what extent can URL length be explained by them?

Objectives

To conduct exploratory data analysis (EDA) to examine the distribution and patterns of URL features in the dataset.
To clean and preprocess the dataset in order to prepare it for machine learning modeling.
To apply two complementary modeling approaches, each addressing a distinct research question:
- A classification model (Random Forest) to identify which URL features matter most for phishing detection.
- A regression model (Linear Regression) to explain how individual URL character features relate to overall URL length.
To interpret and evaluate each model, drawing practical conclusions for cybersecurity applications.

2 Dataset Description

Dataset Source

The dataset used in this project is the Web Page Phishing Dataset by Daniel Fernandon, obtained from Kaggle.

Purpose of Dataset

The dataset is designed for phishing website detection and contains various URL and webpage-related features that can be used to classify whether a website is phishing or legitimate.

Dataset Key Features

url_length: The length of the URL.
n_dots: The count of ‘.’ characters in the URL.
n_hypens: The count of ‘-’ characters in the URL.
n_underline: The count of ‘_’ characters in the URL.
n_slash: The count of ‘/’ characters in the URL.
n_questionmark: The count of ‘?’ characters in the URL.
n_equal: The count of ‘=’ characters in the URL.
n_at: The count of ‘@’ characters in the URL.
n_and: The count of ‘&’ characters in the URL.
n_exclamation: The count of ‘!’ characters in the URL.
n_space: The count of ‘ ’ (space) characters in the URL.
n_tilde: The count of ‘~’ characters in the URL.
n_comma: The count of ‘,’ characters in the URL.
n_plus: The count of ‘+’ characters in the URL.
n_asterisk: The count of ‘*’ characters in the URL.
n_hastag: The count of ‘#’ characters in the URL.
n_dollar: The count of ‘$’ characters in the URL.
n_percent: The count of ‘%’ characters in the URL.
n_redirection: The count of redirections in the URL.
phishing: The label of the URL, where 1 is phishing and 0 is legitimate.
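
As an illustration of what these lexical features measure, the sketch below counts a few of them for a single made-up URL in base R. The helper `count_char` and the example URL are hypothetical; the dataset author's exact extraction rules (for instance, whether the scheme is included in the length) are not documented here, so this is only an approximation of the idea.

```r
# Hypothetical helper: count literal occurrences of one character in each URL.
count_char <- function(urls, ch) {
  # fixed = TRUE treats `ch` as a literal, so ".", "?" and "+" are not regex operators
  matches <- gregexpr(ch, urls, fixed = TRUE)
  lengths(regmatches(urls, matches))
}

url <- "http://a.b.com/x?y=1"  # made-up example URL

features <- c(
  url_length     = nchar(url),
  n_dots         = count_char(url, "."),
  n_slash        = count_char(url, "/"),
  n_questionmark = count_char(url, "?"),
  n_equal        = count_char(url, "="),
  n_at           = count_char(url, "@")
)
print(features)
```

Because `gregexpr` and `nchar` are vectorised, the same helper can be applied to a whole character vector of URLs at once.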

Target Variable

The target variable indicates whether a URL is phishing or legitimate, where:

1 represents a phishing website
0 represents a legitimate website

3 Data Cleaning

List of all required packages

packages <- c(
  "dplyr",
  "janitor",
  "tidyr",
  "ggplot2",
  "corrplot",
  "skimr",
  "tidyverse",
  "GGally",
  "scales",
  "caret",
  "randomForest",
  "pROC",
  "knitr",
  "kableExtra"
)

# Auto-install any missing packages, then load all of them
install_and_load <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, repos = "https://cloud.r-project.org")
  }
  library(pkg, character.only = TRUE)
}

invisible(sapply(packages, install_and_load))

set.seed(42)

Load Dataset

file_path <- "web-page-phishing.csv"

df_raw <- read.csv(file_path, stringsAsFactors = FALSE)

dim(df_raw)

## [1] 100077     20

head(df_raw)

Check Original Dataset Structure

colnames(df_raw)

##  [1] "url_length"     "n_dots"         "n_hypens"       "n_underline"   
##  [5] "n_slash"        "n_questionmark" "n_equal"        "n_at"          
##  [9] "n_and"          "n_exclamation"  "n_space"        "n_tilde"       
## [13] "n_comma"        "n_plus"         "n_asterisk"     "n_hastag"      
## [17] "n_dollar"       "n_percent"      "n_redirection"  "phishing"

str(df_raw)

## 'data.frame':    100077 obs. of  20 variables:
##  $ url_length    : int  37 77 126 18 55 32 19 81 42 104 ...
##  $ n_dots        : int  3 1 4 2 2 3 2 2 2 1 ...
##  $ n_hypens      : int  0 0 1 0 2 1 0 0 0 10 ...
##  $ n_underline   : int  0 0 2 0 0 0 0 0 0 0 ...
##  $ n_slash       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ n_questionmark: int  0 0 1 0 0 0 0 0 0 0 ...
##  $ n_equal       : int  0 0 3 0 0 0 0 0 0 0 ...
##  $ n_at          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ n_and         : int  0 0 2 0 0 0 0 0 0 0 ...
##  $ n_exclamation : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ n_space       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ n_tilde       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ n_comma       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ n_plus        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ n_asterisk    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ n_hastag      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ n_dollar      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ n_percent     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ n_redirection : int  0 1 1 1 1 1 1 1 0 0 ...
##  $ phishing      : int  0 1 1 0 0 1 0 1 0 0 ...

summary(df_raw)

##    url_length          n_dots          n_hypens        n_underline     
##  Min.   :   4.00   Min.   : 1.000   Min.   : 0.0000   Min.   : 0.0000  
##  1st Qu.:  18.00   1st Qu.: 2.000   1st Qu.: 0.0000   1st Qu.: 0.0000  
##  Median :  24.00   Median : 2.000   Median : 0.0000   Median : 0.0000  
##  Mean   :  39.18   Mean   : 2.224   Mean   : 0.4052   Mean   : 0.1377  
##  3rd Qu.:  44.00   3rd Qu.: 2.000   3rd Qu.: 0.0000   3rd Qu.: 0.0000  
##  Max.   :4165.00   Max.   :24.000   Max.   :43.0000   Max.   :21.0000  
##     n_slash       n_questionmark       n_equal             n_at         
##  Min.   : 0.000   Min.   :0.00000   Min.   : 0.0000   Min.   : 0.00000  
##  1st Qu.: 0.000   1st Qu.:0.00000   1st Qu.: 0.0000   1st Qu.: 0.00000  
##  Median : 0.000   Median :0.00000   Median : 0.0000   Median : 0.00000  
##  Mean   : 1.135   Mean   :0.02439   Mean   : 0.2158   Mean   : 0.02214  
##  3rd Qu.: 2.000   3rd Qu.:0.00000   3rd Qu.: 0.0000   3rd Qu.: 0.00000  
##  Max.   :44.000   Max.   :9.00000   Max.   :23.0000   Max.   :43.00000  
##      n_and         n_exclamation          n_space             n_tilde        
##  Min.   : 0.0000   Min.   : 0.000000   Min.   : 0.000000   Min.   :0.000000  
##  1st Qu.: 0.0000   1st Qu.: 0.000000   1st Qu.: 0.000000   1st Qu.:0.000000  
##  Median : 0.0000   Median : 0.000000   Median : 0.000000   Median :0.000000  
##  Mean   : 0.1433   Mean   : 0.002608   Mean   : 0.004876   Mean   :0.003617  
##  3rd Qu.: 0.0000   3rd Qu.: 0.000000   3rd Qu.: 0.000000   3rd Qu.:0.000000  
##  Max.   :26.0000   Max.   :10.000000   Max.   :18.000000   Max.   :5.000000  
##     n_comma              n_plus            n_asterisk       
##  Min.   : 0.000000   Min.   : 0.000000   Min.   : 0.000000  
##  1st Qu.: 0.000000   1st Qu.: 0.000000   1st Qu.: 0.000000  
##  Median : 0.000000   Median : 0.000000   Median : 0.000000  
##  Mean   : 0.002378   Mean   : 0.002468   Mean   : 0.004097  
##  3rd Qu.: 0.000000   3rd Qu.: 0.000000   3rd Qu.: 0.000000  
##  Max.   :11.000000   Max.   :19.000000   Max.   :60.000000  
##     n_hastag            n_dollar           n_percent        n_redirection    
##  Min.   :0.000e+00   Min.   : 0.000000   Min.   :  0.0000   Min.   :-1.0000  
##  1st Qu.:0.000e+00   1st Qu.: 0.000000   1st Qu.:  0.0000   1st Qu.: 0.0000  
##  Median :0.000e+00   Median : 0.000000   Median :  0.0000   Median : 0.0000  
##  Mean   :4.497e-04   Mean   : 0.001899   Mean   :  0.1093   Mean   : 0.3615  
##  3rd Qu.:0.000e+00   3rd Qu.: 0.000000   3rd Qu.:  0.0000   3rd Qu.: 1.0000  
##  Max.   :1.300e+01   Max.   :10.000000   Max.   :174.0000   Max.   :17.0000  
##     phishing     
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.3633  
##  3rd Qu.:1.0000  
##  Max.   :1.0000

Clean Column Names

df <- df_raw %>%
  clean_names()

colnames(df)

##  [1] "url_length"     "n_dots"         "n_hypens"       "n_underline"   
##  [5] "n_slash"        "n_questionmark" "n_equal"        "n_at"          
##  [9] "n_and"          "n_exclamation"  "n_space"        "n_tilde"       
## [13] "n_comma"        "n_plus"         "n_asterisk"     "n_hastag"      
## [17] "n_dollar"       "n_percent"      "n_redirection"  "phishing"

Check Missing Values

missing_values <- colSums(is.na(df))

missing_values

##     url_length         n_dots       n_hypens    n_underline        n_slash 
##              0              0              0              0              0 
## n_questionmark        n_equal           n_at          n_and  n_exclamation 
##              0              0              0              0              0 
##        n_space        n_tilde        n_comma         n_plus     n_asterisk 
##              0              0              0              0              0 
##       n_hastag       n_dollar      n_percent  n_redirection       phishing 
##              0              0              0              0              0

Remove Missing Values

df_clean <- df %>%
  drop_na()

dim(df_clean)

## [1] 100077     20

Check Duplicate Rows

duplicate_count <- sum(duplicated(df_clean))

duplicate_count

## [1] 78186

Remove Duplicate Rows

df_clean <- df_clean %>%
  distinct()

dim(df_clean)

## [1] 21891    20
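
The row counts above are consistent: 100,077 - 78,186 = 21,891. The small sketch below, on made-up data, shows how `duplicated()` and de-duplication relate. Note that the dataset contains only feature columns and no raw URL string, so a "duplicate" here means an identical feature row, which is not necessarily the same URL.

```r
# Made-up toy data: rows 2 and 4 repeat row 1; row 5 differs from row 3 only in the label
toy <- data.frame(
  url_length = c(18, 18, 37, 18, 37),
  n_dots     = c(2, 2, 3, 2, 3),
  phishing   = c(0, 0, 1, 0, 0)
)

n_dup      <- sum(duplicated(toy))  # rows that repeat an earlier row
toy_unique <- unique(toy)           # base R equivalent of dplyr::distinct()

cat("Duplicates:", n_dup, " Remaining rows:", nrow(toy_unique), "\n")
```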

Check Target Variable

"phishing" %in% colnames(df_clean)

## [1] TRUE

table(df_clean$phishing)

## 
##     0     1 
##  6019 15872

Convert Data Types

df_clean <- df_clean %>%
  mutate(across(-phishing, as.numeric)) %>%
  mutate(phishing = as.factor(phishing))

str(df_clean)

## 'data.frame':    21891 obs. of  20 variables:
##  $ url_length    : num  37 77 126 18 55 32 19 81 42 104 ...
##  $ n_dots        : num  3 1 4 2 2 3 2 2 2 1 ...
##  $ n_hypens      : num  0 0 1 0 2 1 0 0 0 10 ...
##  $ n_underline   : num  0 0 2 0 0 0 0 0 0 0 ...
##  $ n_slash       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ n_questionmark: num  0 0 1 0 0 0 0 0 0 0 ...
##  $ n_equal       : num  0 0 3 0 0 0 0 0 0 0 ...
##  $ n_at          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ n_and         : num  0 0 2 0 0 0 0 0 0 0 ...
##  $ n_exclamation : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ n_space       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ n_tilde       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ n_comma       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ n_plus        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ n_asterisk    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ n_hastag      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ n_dollar      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ n_percent     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ n_redirection : num  0 1 1 1 1 1 1 1 0 0 ...
##  $ phishing      : Factor w/ 2 levels "0","1": 1 2 2 1 1 2 1 2 1 1 ...

Check Abnormal Negative Values

Most URL-based features are count-based or length-based features. Therefore, negative values are not expected in these numerical columns.

negative_summary <- df_clean %>%
  summarise(across(where(is.numeric), ~ sum(. < 0, na.rm = TRUE))) %>%
  pivot_longer(
    cols = everything(),
    names_to = "feature",
    values_to = "negative_count"
  ) %>%
  arrange(desc(negative_count))

negative_summary

Show Features with Negative Values Only

negative_summary_nonzero <- negative_summary %>%
  filter(negative_count > 0)

negative_summary_nonzero

Check Rows Containing Negative Values

rows_with_negative <- df_clean %>%
  filter(if_any(where(is.numeric), ~ . < 0))

dim(rows_with_negative)

## [1] 1584   20

head(rows_with_negative)

Remove Abnormal Negative Values

before_negative_removal <- nrow(df_clean)

df_clean <- df_clean %>%
  filter(if_all(where(is.numeric), ~ . >= 0))

after_negative_removal <- nrow(df_clean)

removed_negative_rows <- before_negative_removal - after_negative_removal

removed_negative_rows

## [1] 1584

dim(df_clean)

## [1] 20307    20

Final Negative Value Check

negative_summary_after <- df_clean %>%
  summarise(across(where(is.numeric), ~ sum(. < 0, na.rm = TRUE))) %>%
  pivot_longer(
    cols = everything(),
    names_to = "feature",
    values_to = "negative_count"
  ) %>%
  arrange(desc(negative_count))

negative_summary_after

Final Missing Value Check

colSums(is.na(df_clean))

##     url_length         n_dots       n_hypens    n_underline        n_slash 
##              0              0              0              0              0 
## n_questionmark        n_equal           n_at          n_and  n_exclamation 
##              0              0              0              0              0 
##        n_space        n_tilde        n_comma         n_plus     n_asterisk 
##              0              0              0              0              0 
##       n_hastag       n_dollar      n_percent  n_redirection       phishing 
##              0              0              0              0              0

Save Cleaned Dataset

write.csv(df_clean, "cleaned_web_page_phishing.csv", row.names = FALSE)

Re-label Target for Readability

For clearer plots and confusion matrices in later sections, we relabel the factor levels of phishing from 0/1 to Legitimate/Phishing. The underlying data is unchanged.

df_clean$phishing <- factor(df_clean$phishing,
                            levels = c("0", "1"),
                            labels = c("Legitimate", "Phishing"))

table(df_clean$phishing)

## 
## Legitimate   Phishing 
##       5452      14855
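
The counts above can be turned into class shares directly. The sketch below simply re-enters the two printed counts by hand; the larger share is also the accuracy a classifier would reach by always predicting the majority class.

```r
# Counts copied from the table(df_clean$phishing) output above
counts <- c(Legitimate = 5452, Phishing = 14855)

shares <- round(prop.table(counts), 4)
print(shares)

# Accuracy of a trivial "always predict the majority class" baseline
majority_baseline <- max(shares)
```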

4 Dataset Analysis

This section summarises the data quality issues identified during exploration, the goals of preprocessing, and the research questions that guide the modeling.

4.1 Identified Data Quality Issues

1. Incorrect or Inconsistent Values

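Negative values (-1) were found in n_redirection, the only feature with a minimum below zero. Since a redirection count cannot be negative, these values indicate incorrect or inconsistent data. The 1,584 affected rows are removed during cleaning.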

2. Outliers

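Several variables contain extreme outliers in the raw data. For example, url_length has a median of 24 but a maximum of 4,165, while n_percent and n_asterisk reach 174 and 60 respectively even though their third quartiles are 0.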

3. Duplicate Records

A large number of duplicated rows were detected (78,186 of the 100,077 raw records) and removed during cleaning, leaving 21,891 distinct rows.

4. Class Imbalance

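In the raw data, the target variable phishing is moderately imbalanced:

Class	Percentage (raw data)
Legitimate (0)	63.67%
Phishing (1)	36.33%

After duplicates and rows with negative values are removed, the balance reverses to 26.85% Legitimate (5,452 rows) and 73.15% Phishing (14,855 rows), which is the distribution the models are trained and evaluated on.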

4.2 Objectives / Goals of Data Preprocessing

The preprocessing stage aims to improve dataset quality and prepare the data for machine learning.

1. Improve Data Quality: handling incorrect values, outliers, and duplicate records.

2. Prepare the Dataset for Machine Learning: selecting relevant features, reducing noise, and improving model performance. Important features identified from correlation analysis include:

Feature	Correlation with `phishing`
`n_slash`	0.611472
`url_length`	0.430125
`n_equal`	0.260462

3. Handle Class Imbalance: a stratified train-test split keeps the class proportions the same in the training and test sets, and the classifier is evaluated with metrics beyond accuracy (e.g., Recall, F1-score).
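
To make the idea of stratified splitting concrete, here is a minimal base R sketch on made-up labels. It samples 70% of the rows within each class separately, so both partitions keep the original class shares. This only illustrates the principle; it is not a description of how any particular package implements it.

```r
set.seed(42)

# Made-up labels: 30 Legitimate, 70 Phishing
labels <- factor(rep(c("Legitimate", "Phishing"), times = c(30, 70)))

# Sample 70% of the row indices within each class separately
idx_by_class <- split(seq_along(labels), labels)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  sample(idx, size = round(0.7 * length(idx)))
}))

train_share <- prop.table(table(labels[train_idx]))
test_share  <- prop.table(table(labels[-train_idx]))

print(train_share)
print(test_share)
```

With these toy sizes, both partitions end up with exactly 30% Legitimate and 70% Phishing, which a purely random split would only achieve on average.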

4.3 Research Questions and Objectives

Question 1 — Classification (Random Forest)

Which URL lexical features contribute the most to distinguishing phishing websites from legitimate ones?

Objective: To develop a classification model that predicts phishing websites based on URL features and to identify the strongest indicators of phishing.

Why this question matters. Knowing which characteristics most strongly signal a phishing URL is more valuable than a black-box prediction alone. If a small number of features (for example, the number of slashes or the overall URL length) carry most of the predictive power, security teams can build lightweight, explainable detection rules that run in real time inside browsers or email gateways without inspecting full page content; prioritise what human analysts inspect when manually reviewing suspicious links; and reduce the feature set needed for deployment, lowering computational cost at scale.

Why Random Forest. Random Forest produces built-in measures of feature importance (Mean Decrease in Gini and Mean Decrease in Accuracy) while simultaneously modeling non-linear relationships and interactions between features. Because tree splits depend only on the ordering of a feature's values, it is also largely insensitive to the extreme outliers present in features such as url_length, and it generally copes reasonably well with moderate class imbalance. This makes it both a capable classifier and a practical tool for ranking feature relevance.
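
The Gini-based importance mentioned above rests on a simple quantity: how much a split reduces node impurity. The sketch below works through one split on made-up class counts. The function name `gini` and the counts are illustrative only; a Random Forest accumulates such decreases over all splits on a feature across all trees.

```r
# Gini impurity of a node, given the class counts inside it
gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

# Made-up node: 10 URLs split into two child nodes of 5 each
parent <- c(Legitimate = 5, Phishing = 5)
left   <- c(Legitimate = 4, Phishing = 1)
right  <- c(Legitimate = 1, Phishing = 4)

# Child impurities are weighted by the share of rows they receive
weighted_children <- (sum(left) * gini(left) + sum(right) * gini(right)) / sum(parent)
gini_decrease <- gini(parent) - weighted_children

cat("Parent:", gini(parent), " Children:", weighted_children,
    " Decrease:", gini_decrease, "\n")
```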

Question 2 — Regression (Linear Regression)

How do the individual URL character features relate to the overall length of a URL, and to what extent can URL length be explained by them?

Objective: To analyze the relationship between URL features and URL length using a regression model, and to determine whether phishing-related URLs tend to become longer due to increased use of special characters.

Why this question matters. “Long URLs are suspicious” is a common heuristic in phishing detection, but it is rarely unpacked. This question explains what actually drives URL length: extra subdirectories (slashes), longer query strings (=, &), or other character types. It gives a structural interpretation of the headline finding from Question 1, produces directly interpretable coefficients (each coefficient states how many additional characters of length are associated with one more occurrence of a given character type, holding the other counts fixed), and provides a continuous, complementary view of URL structure alongside the binary classification.

Why Linear Regression. The target here, url_length, is a continuous numeric variable, so a regression method is required. Linear regression directly models URL length as a function of the other character-count features and yields transparent, easily explained coefficients — exactly what is needed to describe relationships rather than just predict.
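
As a small illustration of how such coefficients are read, the sketch below builds made-up data in which URL length is exactly 10 + 5 per slash + 3 per dot, then checks that `lm()` recovers those values. The data and the generating rule are assumptions chosen for the example, not properties of the phishing dataset.

```r
# Made-up counts for six hypothetical URLs
toy <- data.frame(
  n_slash = c(0, 1, 2, 3, 1, 4),
  n_dots  = c(1, 2, 1, 3, 2, 2)
)

# Assumed generating rule: base length 10, +5 per slash, +3 per dot
toy$url_length <- 10 + 5 * toy$n_slash + 3 * toy$n_dots

fit <- lm(url_length ~ n_slash + n_dots, data = toy)
print(round(coef(fit), 4))
```

Each slope is the change in expected length for one additional character of that type while the other count is held constant, which is the interpretation used for the real model later in the report.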

5 Exploratory Data Analysis (EDA)

5.1 Target Class Distribution

ggplot(df_clean, aes(x = phishing, fill = phishing)) +
  geom_bar(width = 0.6) +
  geom_text(stat = "count", aes(label = comma(after_stat(count))),
            vjust = -0.4, size = 4.5, fontface = "bold") +
  scale_fill_manual(values = c("Legitimate" = "#2E86AB", "Phishing" = "#E63946")) +
  labs(title = "Distribution of Phishing vs Legitimate URLs",
       x = "URL Class", y = "Count") +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none",
        plot.title = element_text(face = "bold"))

# Proportions
prop.table(table(df_clean$phishing)) %>%
  round(4) %>%
  kable(caption = "Class proportions") %>%
  kable_styling(full_width = FALSE)

Class proportions
Var1	Freq
Legitimate	0.2685
Phishing	0.7315

After cleaning, the classes are moderately imbalanced: roughly 27% Legitimate and 73% Phishing. This is mild enough that we do not apply oversampling, but we monitor metrics beyond accuracy (e.g., Recall, Specificity, F1-score) in the classification model.

5.2 Distribution of URL Length

# Cap extreme outliers for visualization
ggplot(df_clean %>% filter(url_length <= 300),
       aes(x = url_length, fill = phishing)) +
  geom_histogram(bins = 60, alpha = 0.75, position = "identity") +
  scale_fill_manual(values = c("Legitimate" = "#2E86AB", "Phishing" = "#E63946")) +
  labs(title = "Distribution of URL Length by Class (truncated at 300)",
       x = "URL Length", y = "Count", fill = "Class") +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))

Phishing URLs tend to be longer on average than legitimate URLs — a common tactic used by attackers to obfuscate malicious domains. This observation directly motivates Question 2: if URL length is a useful signal, it is worth understanding what drives it.

5.3 Boxplots of Key Features by Class

# Select features with the highest correlation to the target
key_features <- c("url_length", "n_slash", "n_dots", "n_equal", "n_and", "n_hypens")

df_long <- df_clean %>%
  select(all_of(key_features), phishing) %>%
  pivot_longer(-phishing, names_to = "Feature", values_to = "Value")

ggplot(df_long, aes(x = phishing, y = Value, fill = phishing)) +
  geom_boxplot(outlier.alpha = 0.15, outlier.size = 0.6) +
  facet_wrap(~ Feature, scales = "free_y", ncol = 3) +
  scale_fill_manual(values = c("Legitimate" = "#2E86AB", "Phishing" = "#E63946")) +
  labs(title = "Key Feature Distributions by Class",
       x = NULL, y = "Value") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none",
        plot.title = element_text(face = "bold"),
        strip.text = element_text(face = "bold"))

Across the key features, phishing URLs tend to show higher medians and wider spreads, which suggests that lexical complexity is a useful indicator of malicious intent.

5.4 Correlation Heatmap

# Compute correlation matrix using numeric target
corr_data <- df_clean %>%
  mutate(phishing_num = as.numeric(phishing) - 1) %>%
  select(-phishing)

corr_mat <- cor(corr_data)

corrplot(corr_mat,
         method = "color",
         type = "upper",
         tl.cex = 0.85,
         tl.col = "black",
         addCoef.col = "black",
         number.cex = 0.6,
         col = colorRampPalette(c("#2E86AB", "white", "#E63946"))(200),
         title = "Correlation Matrix of Features",
         mar = c(0, 0, 2, 0))

n_slash, url_length, and n_equal show the strongest positive correlations with the phishing target. The heatmap also shows that several character counts are correlated with url_length, which is relevant to the regression in Question 2.
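
The correlation with the target relies on recoding the two-level factor as 0/1. The toy sketch below, on made-up values, shows why 1 is subtracted after `as.numeric()` and how the resulting Pearson correlation is obtained.

```r
# Made-up example: four URLs, two per class
cls <- factor(c("Legitimate", "Legitimate", "Phishing", "Phishing"),
              levels = c("Legitimate", "Phishing"))
n_slash <- c(1, 2, 3, 4)

print(as.numeric(cls))            # internal factor codes are 1 and 2
cls_num <- as.numeric(cls) - 1    # recode to 0 = Legitimate, 1 = Phishing

r <- cor(n_slash, cls_num)
print(round(r, 4))
```

Pearson correlation is unchanged by shifting a variable, so subtracting 1 affects only the readability of the coding, not the coefficient itself.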

5.5 Feature Correlation with Target (Ranked)

target_corr <- data.frame(
  Feature = names(corr_mat[, "phishing_num"]),
  Correlation = corr_mat[, "phishing_num"]
) %>%
  filter(Feature != "phishing_num") %>%
  arrange(desc(abs(Correlation)))

ggplot(target_corr, aes(x = reorder(Feature, Correlation), y = Correlation,
                        fill = Correlation > 0)) +
  geom_col() +
  coord_flip() +
  scale_fill_manual(values = c("TRUE" = "#E63946", "FALSE" = "#2E86AB")) +
  labs(title = "Feature Correlation with Phishing Target",
       x = "Feature", y = "Correlation") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none",
        plot.title = element_text(face = "bold"))

6 Question 1 — Classification: Identifying Key Phishing Indicators (Random Forest)

Question: Which URL lexical features contribute the most to distinguishing phishing websites from legitimate ones?

To answer this, we train a Random Forest classifier to predict whether a URL is phishing or legitimate, then examine its feature-importance scores to rank the features by how much each contributes to the model’s decisions. The classifier’s predictive performance is reported first to confirm that the model is accurate enough for its importance ranking to be trustworthy.

6.1 Train-Test Split

We use a 70/30 stratified split to preserve the class distribution. This same split is reused in Question 2 for consistency.

set.seed(42)
train_index <- createDataPartition(df_clean$phishing, p = 0.7, list = FALSE)
train_data <- df_clean[train_index, ]
test_data  <- df_clean[-train_index, ]

cat("Training set:", nrow(train_data), "rows\n")

## Training set: 14216 rows

cat("Testing set: ", nrow(test_data), "rows\n")

## Testing set:  6091 rows

# Verify stratification
prop.table(table(train_data$phishing)) %>% round(4)

## 
## Legitimate   Phishing 
##     0.2685     0.7315

prop.table(table(test_data$phishing))  %>% round(4)

## 
## Legitimate   Phishing 
##     0.2684     0.7316

6.2 Train the Random Forest

set.seed(42)
rf_model <- randomForest(phishing ~ .,
                         data = train_data,
                         ntree = 200,
                         mtry = floor(sqrt(ncol(train_data) - 1)),
                         importance = TRUE)
print(rf_model)

## 
## Call:
##  randomForest(formula = phishing ~ ., data = train_data, ntree = 200,      mtry = floor(sqrt(ncol(train_data) - 1)), importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 200
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 17.64%
## Confusion matrix:
##            Legitimate Phishing class.error
## Legitimate       2214     1603   0.4199633
## Phishing          905     9494   0.0870276

6.3 Model Performance

We first confirm the model classifies well, so that its feature ranking is credible.

rf_preds <- predict(rf_model, newdata = test_data)
rf_probs <- predict(rf_model, newdata = test_data, type = "prob")[, "Phishing"]

rf_cm <- confusionMatrix(rf_preds, test_data$phishing, positive = "Phishing")
rf_cm

## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   Legitimate Phishing
##   Legitimate        972      421
##   Phishing          663     4035
##                                           
##                Accuracy : 0.822           
##                  95% CI : (0.8122, 0.8316)
##     No Information Rate : 0.7316          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5246          
##                                           
##  Mcnemar's Test P-Value : 2.482e-13       
##                                           
##             Sensitivity : 0.9055          
##             Specificity : 0.5945          
##          Pos Pred Value : 0.8589          
##          Neg Pred Value : 0.6978          
##              Prevalence : 0.7316          
##          Detection Rate : 0.6625          
##    Detection Prevalence : 0.7713          
##       Balanced Accuracy : 0.7500          
##                                           
##        'Positive' Class : Phishing        
##

rf_metrics <- data.frame(
  Metric = c("Accuracy", "Precision", "Recall (Sensitivity)",
             "Specificity", "F1-Score"),
  Value = c(
    round(rf_cm$overall["Accuracy"], 4),
    round(rf_cm$byClass["Precision"], 4),
    round(rf_cm$byClass["Sensitivity"], 4),
    round(rf_cm$byClass["Specificity"], 4),
    round(rf_cm$byClass["F1"], 4)
  )
)

# row.names = FALSE drops the redundant row-name column inherited from the named vector
kable(rf_metrics, row.names = FALSE,
      caption = "Random Forest Classification Performance") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Random Forest Classification Performance
Metric	Value
Accuracy	0.8220
Precision	0.8589
Recall (Sensitivity)	0.9055
Specificity	0.5945
F1-Score	0.8816
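
Each value in this table can be traced back to the four cells of the test-set confusion matrix reported above, with Phishing as the positive class. The sketch below re-enters those four counts by hand and recomputes the metrics from their definitions.

```r
# Cells copied from the confusion matrix (positive class = Phishing)
tp <- 4035  # predicted Phishing,   actually Phishing
fn <- 421   # predicted Legitimate, actually Phishing
fp <- 663   # predicted Phishing,   actually Legitimate
tn <- 972   # predicted Legitimate, actually Legitimate

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
precision   <- tp / (tp + fp)
recall      <- tp / (tp + fn)   # sensitivity
specificity <- tn / (tn + fp)
f1          <- 2 * precision * recall / (precision + recall)
balanced    <- (recall + specificity) / 2

print(round(c(accuracy = accuracy, precision = precision, recall = recall,
              specificity = specificity, f1 = f1, balanced = balanced), 4))
```

The gap between recall and specificity shows that most errors are legitimate URLs flagged as phishing (663 of 1,635), which a single accuracy figure would hide.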

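# levels = c(control, case); direction = "<" states that controls score lower than cases
rf_roc <- roc(test_data$phishing, rf_probs,
              levels = c("Legitimate", "Phishing"), direction = "<")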
rf_auc <- auc(rf_roc)
cat("Random Forest AUC:", round(rf_auc, 4), "\n")

## Random Forest AUC: 0.8708

plot(rf_roc, col = "#E63946", lwd = 2.5,
     main = paste("Random Forest ROC Curve (AUC =", round(rf_auc, 4), ")"))
abline(a = 0, b = 1, lty = 2, col = "gray")
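
One way to read the AUC is as the probability that a randomly chosen phishing URL receives a higher score than a randomly chosen legitimate one. The sketch below computes this by brute force on six made-up scores; it illustrates the definition and is not how the value above was produced.

```r
# Made-up predicted phishing probabilities
pos <- c(0.9, 0.8, 0.4)  # scores of actual phishing URLs
neg <- c(0.7, 0.3, 0.2)  # scores of actual legitimate URLs

# Compare every phishing score with every legitimate score
wins <- outer(pos, neg, ">")
ties <- outer(pos, neg, "==")

# Share of pairs ranked correctly, counting ties as half
auc_manual <- (sum(wins) + 0.5 * sum(ties)) / (length(pos) * length(neg))
print(round(auc_manual, 4))
```

Here 8 of the 9 pairs are ordered correctly; only the phishing URL scored 0.4 falls below the legitimate URL scored 0.7.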

6.4 Answering the Question: Feature Importance

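The model clearly beats the no-information rate (accuracy 0.822 vs 0.7316, AUC 0.8708), although its specificity is low (0.5945): about 4 in 10 legitimate URLs in the test set are flagged as phishing. With that caveat in mind, its feature-importance scores give a reasonable answer to Question 1. We report both importance measures produced by Random Forest: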

Mean Decrease in Gini — how much each feature improves the purity of the splits.
Mean Decrease in Accuracy — how much model accuracy drops when the feature’s values are randomly permuted.
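
The permutation idea behind Mean Decrease in Accuracy can be sketched in a few lines. The example below uses made-up data and a fixed decision rule as a stand-in for a trained model; the names `signal`, `noise`, `predict_rule`, and `permutation_drop` are hypothetical. The randomForest implementation works on out-of-bag samples per tree and scales the result, so its numbers are not directly comparable to these raw drops.

```r
set.seed(42)
n <- 500

# Made-up data: `signal` determines the class, `noise` is irrelevant
toy   <- data.frame(signal = rnorm(n), noise = rnorm(n))
truth <- toy$signal > 0

# Stand-in "model": a fixed rule that only looks at `signal`
predict_rule <- function(d) d$signal > 0
accuracy     <- function(d) mean(predict_rule(d) == truth)

# Shuffle one column, re-score, and report the loss in accuracy
permutation_drop <- function(feature) {
  shuffled <- toy
  shuffled[[feature]] <- sample(shuffled[[feature]])
  accuracy(toy) - accuracy(shuffled)
}

drops <- sapply(c("signal", "noise"), permutation_drop)
print(round(drops, 3))
```

Shuffling the informative column destroys the link between feature and label, so accuracy falls towards chance, while shuffling the irrelevant column changes nothing.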

imp <- importance(rf_model)
imp_df <- data.frame(
  Feature = rownames(imp),
  MeanDecreaseAccuracy = imp[, "MeanDecreaseAccuracy"],
  MeanDecreaseGini = imp[, "MeanDecreaseGini"]
) %>%
  arrange(desc(MeanDecreaseGini))

# Table of ranked importance
kable(imp_df, caption = "Random Forest Feature Importance (ranked by Gini)",
      digits = 2, row.names = FALSE) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Random Forest Feature Importance (ranked by Gini)
Feature	MeanDecreaseAccuracy	MeanDecreaseGini
n_slash	51.02	691.82
url_length	53.28	612.51
n_hypens	70.00	406.88
n_dots	40.43	231.92
n_equal	30.08	192.83
n_underline	44.75	155.02
n_redirection	24.26	146.20
n_percent	25.22	73.04
n_and	16.65	59.28
n_at	26.60	55.50
n_questionmark	13.37	44.86
n_plus	13.53	21.12
n_space	6.98	15.15
n_tilde	12.30	13.29
n_exclamation	12.77	11.27
n_asterisk	6.99	3.50
n_comma	0.85	1.63
n_dollar	3.50	1.02
n_hastag	0.01	0.30

# Plot - Mean Decrease in Gini
ggplot(imp_df, aes(x = reorder(Feature, MeanDecreaseGini), y = MeanDecreaseGini)) +
  geom_col(fill = "#E63946") +
  coord_flip() +
  labs(title = "Feature Importance for Phishing Detection (Mean Decrease in Gini)",
       x = "Feature", y = "Mean Decrease in Gini") +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))

# Plot - Mean Decrease in Accuracy
ggplot(imp_df, aes(x = reorder(Feature, MeanDecreaseAccuracy),
                   y = MeanDecreaseAccuracy)) +
  geom_col(fill = "#2E86AB") +
  coord_flip() +
  labs(title = "Feature Importance for Phishing Detection (Mean Decrease in Accuracy)",
       x = "Feature", y = "Mean Decrease in Accuracy") +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))

Interpretation. Ranked by Mean Decrease in Gini, the most influential features are n_slash, url_length, and n_hypens, followed by n_dots and n_equal. Mean Decrease in Accuracy reorders the top of the list (n_hypens first, then url_length and n_slash, with n_underline fourth), but both measures agree that slashes, overall length, and hyphens dominate, which strengthens confidence in the ranking. The dominance of these structural indicators makes intuitive sense: phishing URLs frequently nest deceptive subdomains and long paths to mimic trusted domains, inflating slash counts, dot counts, and overall length. The rarest special characters (e.g., n_hastag, n_dollar, n_comma, n_asterisk) contribute little under either measure and could be dropped from a lightweight deployed model.

7 Question 2 — Regression: Explaining URL Length from Character Features (Linear Regression)

Question: How do the individual URL character features relate to the overall length of a URL, and to what extent can URL length be explained by them?

Question 1 established that url_length is one of the strongest signals of phishing. This section unpacks what makes a URL long by modeling url_length as a linear function of the other character-count features. The coefficients describe the relationship between each character type and total URL length, while the model fit (R²) tells us how completely these features account for length.

7.1 Prepare the Regression Data

The predictors are the 17 character-count features plus n_redirection (18 in total); url_length is the continuous response. We drop the phishing label here because this question is about URL structure, not phishing status. We reuse the same train/test rows as Question 1.

reg_train <- train_data %>% select(-phishing)
reg_test  <- test_data  %>% select(-phishing)

cat("Regression training rows:", nrow(reg_train), "\n")

## Regression training rows: 14216

cat("Regression testing rows: ", nrow(reg_test), "\n")

## Regression testing rows:  6091

cat("Response variable: url_length\n")

## Response variable: url_length

cat("Number of predictors:", ncol(reg_train) - 1, "\n")

## Number of predictors: 18

7.2 Fit the Linear Regression Model

lm_model <- lm(url_length ~ ., data = reg_train)
summary(lm_model)

## 
## Call:
## lm(formula = url_length ~ ., data = reg_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -296.25  -18.36   -6.71    8.82 1410.79 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     18.2945     0.9405  19.451  < 2e-16 ***
## n_dots           5.0378     0.2352  21.418  < 2e-16 ***
## n_hypens         8.7877     0.1692  51.925  < 2e-16 ***
## n_underline     10.2887     0.2901  35.470  < 2e-16 ***
## n_slash          5.4164     0.1677  32.306  < 2e-16 ***
## n_questionmark  27.4471     1.5440  17.776  < 2e-16 ***
## n_equal         13.5815     0.5535  24.538  < 2e-16 ***
## n_at            -2.0934     0.8095  -2.586  0.00972 ** 
## n_and            2.3346     0.5665   4.121 3.80e-05 ***
## n_exclamation   -6.7486     2.4324  -2.774  0.00554 ** 
## n_space          7.6875     1.1869   6.477 9.68e-11 ***
## n_tilde         52.9407     2.4204  21.872  < 2e-16 ***
## n_comma          4.5565     5.9308   0.768  0.44234    
## n_plus           0.5378     1.6444   0.327  0.74361    
## n_asterisk      -1.5250     0.7030  -2.169  0.03007 *  
## n_hastag        10.1117     4.0877   2.474  0.01339 *  
## n_dollar        10.5621     2.2756   4.641 3.49e-06 ***
## n_percent        4.1039     0.1239  33.115  < 2e-16 ***
## n_redirection   -0.6137     0.4591  -1.337  0.18137    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 47.74 on 14197 degrees of freedom
## Multiple R-squared:  0.5904, Adjusted R-squared:  0.5898 
## F-statistic:  1137 on 18 and 14197 DF,  p-value: < 2.2e-16

7.3 Model Performance on Test Set

lm_pred <- predict(lm_model, newdata = reg_test)

actual <- reg_test$url_length
rmse <- sqrt(mean((actual - lm_pred)^2))
mae  <- mean(abs(actual - lm_pred))
sse  <- sum((actual - lm_pred)^2)
sst  <- sum((actual - mean(actual))^2)
r2_test <- 1 - sse / sst

lm_metrics <- data.frame(
  Metric = c("R-squared (test)", "RMSE (test)", "MAE (test)",
             "Adjusted R-squared (train)"),
  Value = c(
    round(r2_test, 4),
    round(rmse, 4),
    round(mae, 4),
    round(summary(lm_model)$adj.r.squared, 4)
  )
)

kable(lm_metrics, caption = "Linear Regression Performance") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Linear Regression Performance
Metric	Value
R-squared (test)	0.4215
RMSE (test)	64.7577
MAE (test)	23.7238
Adjusted R-squared (train)	0.5898
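
The metric definitions used above can be wrapped in a small helper so that the same formulas are applied to training and test rows alike. The following is a minimal sketch on simulated data; `regression_metrics`, the toy data frame, and its column names are hypothetical and are not part of the original analysis.

```r
# Hypothetical helper: R-squared, RMSE, and MAE from two numeric vectors.
regression_metrics <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  sse <- sum((actual - predicted)^2)
  sst <- sum((actual - mean(actual))^2)
  c(r_squared = 1 - sse / sst,
    rmse      = sqrt(mean((actual - predicted)^2)),
    mae       = mean(abs(actual - predicted)))
}

# Toy data (assumption): a response built from two count-like predictors plus noise.
set.seed(42)
toy <- data.frame(n_a = rpois(200, 3), n_b = rpois(200, 1))
toy$len <- 20 + 5 * toy$n_a + 10 * toy$n_b + rnorm(200, sd = 8)

# First 140 rows for fitting, remaining 60 rows held out.
train_idx <- seq_len(140)
toy_fit <- lm(len ~ ., data = toy[train_idx, ])

train_metrics <- regression_metrics(toy$len[train_idx], fitted(toy_fit))
test_metrics  <- regression_metrics(toy$len[-train_idx],
                                    predict(toy_fit, newdata = toy[-train_idx, ]))
print(round(rbind(train = train_metrics, test = test_metrics), 4))
```

Reporting both rows side by side makes a gap between training and test fit, such as the one in the table above (0.5898 adjusted R-squared on training rows versus 0.4215 R-squared on test rows), easy to spot.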

7.4 Diagnostic Plots

diag_df <- data.frame(
  Actual = actual,
  Predicted = lm_pred,
  Residual = actual - lm_pred
)

# Predicted vs Actual
ggplot(diag_df %>% filter(Actual <= 300), aes(x = Actual, y = Predicted)) +
  geom_point(alpha = 0.15, color = "#2E86AB") +
  geom_abline(slope = 1, intercept = 0, color = "#E63946",
              linetype = "dashed", linewidth = 1) +
  labs(title = "Predicted vs Actual URL Length (truncated at 300)",
       x = "Actual URL Length", y = "Predicted URL Length") +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))

# Residuals vs Fitted
ggplot(diag_df %>% filter(Predicted <= 300),
       aes(x = Predicted, y = Residual)) +
  geom_point(alpha = 0.15, color = "#2E86AB") +
  geom_hline(yintercept = 0, color = "#E63946",
             linetype = "dashed", linewidth = 1) +
  labs(title = "Residuals vs Fitted Values (truncated at 300)",
       x = "Fitted URL Length", y = "Residual") +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))

7.5 Answering the Question: Coefficient Relationships

Each coefficient gives the expected change in URL length for one additional occurrence of that character, holding the others constant. Sorting by absolute magnitude shows which character types are associated with the largest change in length per occurrence; it does not account for how often each character appears.

coef_df <- as.data.frame(summary(lm_model)$coefficients) %>%
  tibble::rownames_to_column("Feature") %>%
  rename(p_value = `Pr(>|t|)`) %>%
  filter(Feature != "(Intercept)") %>%
  arrange(desc(abs(Estimate)))

kable(coef_df, caption = "Linear Regression Coefficients (sorted by magnitude)",
      digits = 4, row.names = FALSE) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Linear Regression Coefficients (sorted by magnitude)
Feature	Estimate	Std. Error	t value	p_value
n_tilde	52.9407	2.4204	21.8724	0.0000
n_questionmark	27.4471	1.5440	17.7761	0.0000
n_equal	13.5815	0.5535	24.5375	0.0000
n_dollar	10.5621	2.2756	4.6414	0.0000
n_underline	10.2887	0.2901	35.4697	0.0000
n_hastag	10.1117	4.0877	2.4737	0.0134
n_hypens	8.7877	0.1692	51.9253	0.0000
n_space	7.6875	1.1869	6.4768	0.0000
n_exclamation	-6.7486	2.4324	-2.7745	0.0055
n_slash	5.4164	0.1677	32.3055	0.0000
n_dots	5.0378	0.2352	21.4185	0.0000
n_comma	4.5565	5.9308	0.7683	0.4423
n_percent	4.1039	0.1239	33.1147	0.0000
n_and	2.3346	0.5665	4.1209	0.0000
n_at	-2.0934	0.8095	-2.5861	0.0097
n_asterisk	-1.5250	0.7030	-2.1693	0.0301
n_redirection	-0.6137	0.4591	-1.3366	0.1814
n_plus	0.5378	1.6444	0.3271	0.7436
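
To make the per-unit reading concrete, the fitted equation can be applied by hand to a single URL. The sketch below copies the intercept and three coefficients from the table above; the URL profile (2 dots, 3 slashes, 1 hyphen, every other count zero) is a hypothetical example chosen for illustration.

```r
# Coefficients copied from the regression output above (4 decimal places).
coefs <- c(intercept = 18.2945, n_dots = 5.0378, n_slash = 5.4164, n_hypens = 8.7877)

# Hypothetical URL profile: every count not listed here is assumed to be zero.
url_counts <- c(n_dots = 2, n_slash = 3, n_hypens = 1)

# Predicted length = intercept + sum(coefficient * count).
predicted_length <- coefs[["intercept"]] +
  sum(coefs[names(url_counts)] * url_counts)
predicted_length
```

For this profile the model predicts about 53.4 characters. Adding one more hyphen would raise the prediction by 8.7877, which is exactly the n_hypens coefficient.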

ggplot(coef_df, aes(x = reorder(Feature, Estimate), y = Estimate,
                    fill = Estimate > 0)) +
  geom_col() +
  coord_flip() +
  scale_fill_manual(values = c("TRUE" = "#E63946", "FALSE" = "#2E86AB"),
                    labels = c("TRUE" = "Increases length",
                               "FALSE" = "Decreases length"),
                    name = NULL) +
  labs(title = "How Each Character Feature Relates to URL Length",
       x = "Feature", y = "Coefficient (change in URL length per unit)") +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))

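Interpretation. The coefficients quantify the structural composition of a URL. The largest per-occurrence coefficients belong to n_tilde (52.94), n_questionmark (27.45), n_equal (13.58), n_dollar (10.56), and n_underline (10.29): each additional occurrence of one of these characters is associated with a URL that is, on average, that many characters longer. The common structural characters have smaller but very precisely estimated coefficients: n_hypens (8.79), n_slash (5.42), n_dots (5.04), and n_percent (4.10) all have t values above 20. By contrast, n_and has a small coefficient (2.33), and n_comma, n_plus, and n_redirection are not statistically significant at the 5% level. The few negative coefficients (n_exclamation, n_at, n_asterisk) cannot mean that adding a character shortens a URL; they more likely reflect correlations among the predictors. The fit is moderate rather than high: adjusted R² is 0.5898 on the training rows and R² falls to 0.4215 on the test rows, so these counts account for a substantial part of URL length but not all of it. The unexplained part likely reflects the letters and digits that none of the predictors count. This is consistent with the Question 1 finding: longer URLs tend to contain more of these structural components (deeper paths and richer query strings).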

Note on interpretation. Because url_length is partly composed of the very characters being counted, the predictors are mechanically related to the response. Several of them may also be correlated with one another (multicollinearity), which makes individual coefficients less stable. The coefficients should therefore be read as compositional contributions to length rather than independent causal effects. This is a natural property of the data and is discussed further below.
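
One common way to check how strongly predictors overlap is the variance inflation factor (VIF). For predictor j it equals 1 / (1 - R²_j), where R²_j comes from regressing that predictor on all the others. The sketch below computes it in base R on simulated data; `compute_vif` and the toy columns are hypothetical, the correlation between the toy n_and and n_equal columns is an assumption, and the original analysis does not report VIFs.

```r
# Hypothetical helper: VIF for every column of a numeric predictor data frame.
compute_vif <- function(predictors) {
  sapply(names(predictors), function(col) {
    # Regress this predictor on all remaining predictors.
    fit <- lm(reformulate(setdiff(names(predictors), col), response = col),
              data = predictors)
    1 / (1 - summary(fit)$r.squared)
  })
}

# Toy data (assumption): n_equal tracks n_and closely, n_dots is independent.
set.seed(1)
n_and <- rpois(500, 2)
toy_x <- data.frame(
  n_and   = n_and,
  n_equal = n_and + rbinom(500, 1, 0.5),
  n_dots  = rpois(500, 3)
)
vif_values <- compute_vif(toy_x)
round(vif_values, 2)
```

A VIF close to 1 indicates a predictor that is nearly uncorrelated with the rest, while larger values indicate overlap. Applied to the real predictors (for example `reg_train %>% select(-url_length)`), this would show which coefficients are most affected.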

8 Discussion

Using two complementary techniques, this project answered two distinct questions about phishing URLs.

Question 1 (Classification, Random Forest). The Random Forest classifier achieved strong performance, giving confidence in its feature-importance ranking. The analysis showed that a small set of structural features — n_slash, url_length, and n_dots — dominate phishing detection, with query-related characters playing a secondary role and rare symbols contributing little. The practical value is direct: a deployed detector can focus on these few features for fast, explainable, low-cost screening inside browsers or email gateways, and human analysts know what to look at first.

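Question 2 (Regression, Linear Regression). The linear model explained a moderate share of URL length (adjusted R² of 0.5898 on the training rows, R² of 0.4215 on the test rows) and helps show why length is informative. The largest per-occurrence coefficients belonged to tildes, question marks, and equals signs, while the more common slashes, hyphens, dots, and percent signs had smaller but highly significant coefficients. The two analyses are therefore consistent with each other: URLs grow longer as they accumulate more path segments and query parameters, and n_slash and n_dots are among the components the classifier relies on. The main caveat is the mechanical relationship between URL length and its character counts, which makes part of the fit definitional; together with correlations among the counts, this means coefficients are best interpreted as compositional contributions rather than independent causal effects. The drop in R² from training to test rows, the gap between RMSE (64.76) and MAE (23.72), and residuals as large as 1,410.79 characters also suggest that a small number of very long URLs are poorly described by a linear model.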

Together, the classification and regression results tell a consistent story from two angles: the classifier identifies which features signal phishing, and the regression explains how those features manifest in the measurable structure of a URL.

9 Conclusion

This project addressed two separate research questions, each with an appropriate modeling technique:

Classification (Random Forest): which features matter most for phishing detection? The most important indicators are n_slash, url_length, and n_dots. These structural features carry most of the predictive power, enabling lightweight and explainable detection.
Regression (Linear Regression): how do features relate to URL length? URL length is partly explained by its component character counts (test R² of 0.4215). Tildes, question marks, and equals signs show the largest per-occurrence coefficients, while slashes, hyphens, and dots contribute smaller but highly significant amounts. This helps explain, in structural terms, why URL length is a useful phishing signal.

Practical implications. A model based on URL features alone can serve as a fast, first-line filter that flags suspicious links before users click them, without fetching page content. The feature ranking shows this can be done with only a handful of features.

Limitations and future work. The dataset relies purely on URL character statistics and does not capture HTML, JavaScript, or domain-registration signals. The regression’s predictors are also mechanically related to URL length. Future work could:

Incorporate domain-based features (WHOIS data, SSL certificate validity, domain age).
Apply gradient boosting methods (XGBoost, LightGBM) for further classification gains.
Use standardized or regularized regression (e.g., Lasso) to better isolate each feature’s contribution to URL length.
Address the mild class imbalance using SMOTE or class-weighted training.
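
The standardized-regression idea in the list above can be sketched in base R by scaling each predictor to unit variance before fitting, so that coefficients are expressed per standard deviation rather than per character. This is a minimal illustration on simulated data with hypothetical column names (`n_rare`, `n_common`); it is not a result from the project dataset.

```r
# Toy data (assumption): a rare character with a large per-unit effect
# and a common character with a small per-unit effect.
set.seed(7)
toy <- data.frame(
  n_rare   = rbinom(1000, 1, 0.02),  # present in roughly 2% of rows
  n_common = rpois(1000, 4)
)
toy$len <- 20 + 50 * toy$n_rare + 5 * toy$n_common + rnorm(1000, sd = 5)

# Raw fit: coefficients are in characters per occurrence.
raw_fit <- lm(len ~ n_rare + n_common, data = toy)

# Scale the predictors only; the response stays in characters.
toy_std <- toy
toy_std[c("n_rare", "n_common")] <-
  as.data.frame(scale(toy[c("n_rare", "n_common")]))
std_fit <- lm(len ~ n_rare + n_common, data = toy_std)

coef_compare <- cbind(raw = coef(raw_fit)[-1], standardized = coef(std_fit)[-1])
round(coef_compare, 2)
```

In this toy setup the rare character has the larger raw coefficient, but the common character has the larger standardized one because it varies far more across rows. A feature such as n_tilde might behave similarly if it is rare in the real data, though that would need to be checked against the dataset.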

10 References

Cloudflare. (2026). What is phishing? Cloudflare. https://www.cloudflare.com/learning/access-management/phishing-attack/
Konsinski, M. (2026). What is Phishing? IBM. https://www.ibm.com/think/topics/phishing
Sivaneswaran, D., Hewage, C. T. E. R., Herath, H. M. K. K. M. B., Rathore, R. S., Singh, V. K., & Jiang, W. (2026). A systematic literature review of large language models in phishing attack generation and detection. Array, 30. https://doi.org/10.1016/j.array.2026.100775

11 Session Info

sessionInfo()

## R version 4.5.3 (2026-03-11 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=English_United Kingdom.utf8 
## [2] LC_CTYPE=English_United Kingdom.utf8   
## [3] LC_MONETARY=English_United Kingdom.utf8
## [4] LC_NUMERIC=C                           
## [5] LC_TIME=English_United Kingdom.utf8    
## 
## time zone: Asia/Kuala_Lumpur
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] kableExtra_1.4.0     knitr_1.51           pROC_1.19.0.1       
##  [4] randomForest_4.7-1.2 caret_7.0-1          lattice_0.22-9      
##  [7] scales_1.4.0         GGally_2.4.0         lubridate_1.9.5     
## [10] forcats_1.0.1        stringr_1.6.0        purrr_1.2.2         
## [13] readr_2.2.0          tibble_3.3.1         tidyverse_2.0.0     
## [16] skimr_2.2.2          corrplot_0.95        ggplot2_4.0.2       
## [19] tidyr_1.3.2          janitor_2.2.1        dplyr_1.2.1         
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1     viridisLite_0.4.3    timeDate_4052.112   
##  [4] farver_2.1.2         S7_0.2.1             fastmap_1.2.0       
##  [7] digest_0.6.39        rpart_4.1.24         timechange_0.4.0    
## [10] lifecycle_1.0.5      survival_3.8-6       magrittr_2.0.4      
## [13] compiler_4.5.3       rlang_1.1.7          sass_0.4.10         
## [16] tools_4.5.3          yaml_2.3.12          data.table_1.18.2.1 
## [19] labeling_0.4.3       xml2_1.5.2           plyr_1.8.9          
## [22] repr_1.1.7           RColorBrewer_1.1-3   withr_3.0.2         
## [25] stats4_4.5.3         nnet_7.3-20          grid_4.5.3          
## [28] e1071_1.7-17         future_1.70.0        globals_0.19.1      
## [31] iterators_1.0.14     MASS_7.3-65          cli_3.6.5           
## [34] rmarkdown_2.31       generics_0.1.4       otel_0.2.0          
## [37] rstudioapi_0.18.0    future.apply_1.20.2  reshape2_1.4.5      
## [40] tzdb_0.5.0           proxy_0.4-29         cachem_1.1.0        
## [43] splines_4.5.3        parallel_4.5.3       base64enc_0.1-6     
## [46] vctrs_0.7.1          hardhat_1.4.3        Matrix_1.7-4        
## [49] jsonlite_2.0.0       hms_1.1.4            listenv_0.10.1      
## [52] systemfonts_1.3.2    foreach_1.5.2        gower_1.0.2         
## [55] jquerylib_0.1.4      recipes_1.3.2        glue_1.8.0          
## [58] parallelly_1.47.0    ggstats_0.13.0       codetools_0.2-20    
## [61] stringi_1.8.7        gtable_0.3.6         pillar_1.11.1       
## [64] htmltools_0.5.9      ipred_0.9-15         lava_1.9.1          
## [67] R6_2.6.1             textshaping_1.0.5    evaluate_1.0.5      
## [70] snakecase_0.11.1     bslib_0.10.0         class_7.3-23        
## [73] Rcpp_1.1.1           svglite_2.2.2        nlme_3.1-168        
## [76] prodlim_2026.03.11   xfun_0.56            ModelMetrics_1.2.2.2
## [79] pkgconfig_2.0.3

A Data-Driven Approach To Identifying Phishing Websites Using URL Features

WQD7004 Programming for Data Science - Group Project

Group 2

2026-06-10