The problem addressed in this project is the automated detection of phishing emails. Phishing emails are malicious messages designed to deceive recipients into sharing sensitive information such as passwords, personal data, or financial details. Because organizations receive large volumes of emails every day, manually reviewing each one is impractical. A machine learning–based classification system can automatically identify phishing attempts, improving security and reducing risk.
The goal of this project is to build and compare two machine learning models—one traditional algorithm (from Chapters 1–10 of the textbook) and one advanced method (from Chapters 11–15)—to determine which model performs best at identifying phishing emails using unstructured text data.
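The dataset used is Phishing_Email.csv, containing 18,650 emails with the following fields:
- X: a zero-based row index
- Email.Text: the raw text of the email
- Email.Type: the original class label ("Safe Email" or "Phishing Email")
Preprocessing cleaned the text, converted it to a Document-Term Matrix, and recoded the label as a two-level factor (legitimate, phishing). These steps transformed the raw, unstructured text into a structured dataset suitable for machine learning algorithms.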
Two machine learning algorithms were applied:
- Naive Bayes: a probabilistic classifier that assumes independence between features. It is simple and efficient for short text classification but struggles with long, complex email bodies.
- Random Forest: an ensemble classifier that builds many decision trees and aggregates their predictions. It handles high-dimensional data, captures non-linear patterns, and is robust to noise, making it well suited for text classification with many features.
Both models were trained on the cleaned Document-Term Matrix and evaluated using confusion matrices and accuracy scores; an ROC curve was also produced for Random Forest.
The purpose of this analysis is to determine which machine learning
model is most effective for predicting whether an email is phishing or
legitimate based solely on its text content.
This includes:
The broader objective is to reduce cybersecurity threats and improve email filtering accuracy.
#install.packages("tm") # text mining
#install.packages("e1071") # Naive Bayes
#install.packages("randomForest")# Random Forest
#install.packages("ggplot2") # simple plots (EDA)
library(tm)
library(e1071)
library(randomForest)
library(ggplot2)
- e1071: provides the naiveBayes() function used for probabilistic classification.
- randomForest: provides the randomForest() function for building ensemble models.
# Load Dataset
url <- "https://raw.githubusercontent.com/uzmabb182/Data_622/refs/heads/main/final_project_data_622/Phishing_Email.csv"
phishing_email_df <- read.csv(url, stringsAsFactors = FALSE)
head(phishing_email_df)
## X
## 1 0
## 2 1
## 3 2
## 4 3
## 5 4
## 6 5
## Email.Text
## 1 re : 6 . 1100 , disc : uniformitarianism , re : 1086 ; sex / lang dick hudson 's observations on us use of 's on ' but not 'd aughter ' as a vocative are very thought-provoking , but i am not sure that it is fair to attribute this to " sons " being " treated like senior relatives " . for one thing , we do n't normally use ' brother ' in this way any more than we do 'd aughter ' , and it is hard to imagine a natural class comprising senior relatives and 's on ' but excluding ' brother ' . for another , there seem to me to be differences here . if i am not imagining a distinction that is not there , it seems to me that the senior relative terms are used in a wider variety of contexts , e . g . , calling out from a distance to get someone 's attention , and hence at the beginning of an utterance , whereas 's on ' seems more natural in utterances like ' yes , son ' , ' hand me that , son ' than in ones like ' son ! ' or ' son , help me ! ' ( although perhaps these latter ones are not completely impossible ) . alexis mr
## 2 the other side of * galicismos * * galicismo * is a spanish term which names the improper introduction of french words which are spanish sounding and thus very deceptive to the ear . * galicismo * is often considered to be a * barbarismo * . what would be the term which designates the opposite phenomenon , that is unlawful words of spanish origin which may have crept into french ? can someone provide examples ? thank you joseph m kozono < kozonoj @ gunet . georgetown . edu >
## 3 re : equistar deal tickets are you still available to assist robert with entering the new deal tickets for equistar ? after talking with bryan hull and anita luong , kyle and i decided we only need 1 additional sale ticket and 1 additional buyback ticket set up . - - - - - - - - - - - - - - - - - - - - - - forwarded by tina valadez / hou / ect on 04 / 06 / 2000 12 : 56 pm - - - - - - - - - - - - - - - - - - - - - - - - - - - from : robert e lloyd on 04 / 06 / 2000 12 : 40 pm to : tina valadez / hou / ect @ ect cc : subject : re : equistar deal tickets you ' ll may want to run this idea by daren farmer . i don ' t normally add tickets into sitara . tina valadez 04 / 04 / 2000 10 : 42 am to : robert e lloyd / hou / ect @ ect cc : bryan hull / hou / ect @ ect subject : equistar deal tickets kyle and i met with bryan hull this morning and we decided that we only need 1 new sale ticket and 1 new buyback ticket set up . the time period for both tickets should be july 1999 - forward . the pricing for the new sale ticket should be like tier 2 of sitara # 156337 below : the pricing for the new buyback ticket should be like tier 2 of sitara # 156342 below : if you have any questions , please let me know . thanks , tina valadez 3 - 7548
## 4 \nHello I am your hot lil horny toy.\n I am the one you dream About,\n I am a very open minded person,\n Love to talk about and any subject.\n Fantasy is my way of life, \n Ultimate in sex play. Ummmmmmmmmmmmmm\n I am Wet and ready for you. It is not your looks but your imagination that matters most,\n With My sexy voice I can make your dream come true...\n \n Hurry Up! call me let me Cummmmm for you..........................\nTOLL-FREE: 1-877-451-TEEN (1-877-451-8336)For phone billing: 1-900-993-2582\n-- \n_______________________________________________\nSign-up for your own FREE Personalized E-mail at Mail.com\nhttp://www.mail.com/?sr=signup
## 5 software at incredibly low prices ( 86 % lower ) . drapery seventeen term represent any sing . feet wild break able build . tail , send subtract represent . job cow student inch gave . let still warm , family draw , land book . glass plan include . sentence is , hat silent nothing . order , wild famous long their . inch such , saw , person , save . face , especially sentence science . certain , cry does . two depend yes , written carry .
## 6 global risk management operations sally congratulations on your new role . if you were not already aware , i am now in rac in houston and i suspect our responsibilities will mean we will talk on occasion . i look forward to that . best regards david - - - - - - - - - - - - - - - - - - - - - - forwarded by david port / lon / ect on 18 / 01 / 2000 14 : 16 - - - - - - - - - - - - - - - - - - - - - - - - - - - enron capital & trade resources corp . from : rick causey @ enron 18 / 01 / 2000 00 : 04 sent by : enron announcements @ enron to : all enron worldwide cc : subject : global risk management operations recognizing enron \001 , s increasing worldwide presence in the wholesale energy business and the need to insure outstanding internal controls for all of our risk management activities , regardless of location , a global risk management operations function has been created under the direction of sally w . beck , vice president . in this role , sally will report to rick causey , executive vice president and chief accounting officer . sally \001 , s responsibilities with regard to global risk management operations will mirror those of other recently created enron global functions . in this role , sally will work closely with all enron geographic regions and wholesale companies to insure that each entity receives individualized regional support while also focusing on the following global responsibilities : 1 . enhance communication among risk management operations professionals . 2 . assure the proliferation of best operational practices around the globe . 3 . facilitate the allocation of human resources . 4 . provide training for risk management operations personnel . 5 . coordinate user requirements for shared operational systems . 6 . oversee the creation of a global internal control audit plan for risk management activities . 7 . 
establish procedures for opening new risk management operations offices and create key benchmarks for measuring on - going risk controls . each regional operations team will continue its direct reporting relationship within its business unit , and will collaborate with sally in the delivery of these critical items . the houston - based risk management operations team under sue frusco \001 , s leadership , which currently supports risk management activities for south america and australia , will also report directly to sally . sally retains her role as vice president of energy operations for enron north america , reporting to the ena office of the chairman . she has been in her current role over energy operations since 1997 , where she manages risk consolidation and reporting , risk management administration , physical product delivery , confirmations and cash management for ena \001 , s physical commodity trading , energy derivatives trading and financial products trading . sally has been with enron since 1992 , when she joined the company as a manager in global credit . prior to joining enron , sally had four years experience as a commercial banker and spent seven years as a registered securities principal with a regional investment banking firm . she also owned and managed a retail business for several years . please join me in supporting sally in this additional coordination role for global risk management operations .
## Email.Type
## 1 Safe Email
## 2 Safe Email
## 3 Safe Email
## 4 Phishing Email
## 5 Phishing Email
## 6 Safe Email
# Inspect the structure of the data frame
str(phishing_email_df)
## 'data.frame': 18650 obs. of 3 variables:
## $ X : int 0 1 2 3 4 5 6 7 8 9 ...
## $ Email.Text: chr "re : 6 . 1100 , disc : uniformitarianism , re : 1086 ; sex / lang dick hudson 's observations on us use of 's o"| __truncated__ "the other side of * galicismos * * galicismo * is a spanish term which names the improper introduction of frenc"| __truncated__ "re : equistar deal tickets are you still available to assist robert with entering the new deal tickets for equi"| __truncated__ "\nHello I am your hot lil horny toy.\n I am the one you dream About,\n I am a very open minded person,\n "| __truncated__ ...
## $ Email.Type: chr "Safe Email" "Safe Email" "Safe Email" "Phishing Email" ...
names(phishing_email_df)
## [1] "X" "Email.Text" "Email.Type"
phishing_email_df$Email_Length <- nchar(phishing_email_df$Email.Text)
filtered_df <- phishing_email_df[phishing_email_df$Email_Length < 10000, ]
ggplot(filtered_df, aes(x = Email.Type, y = Email_Length, fill = Email.Type)) +
geom_boxplot(alpha = 0.7) +
labs(
title = "Email Length by Class (Boxplot)",
x = "Email Type",
y = "Email Length (characters)"
) +
theme_minimal()
The x-axis shows Email.Type, which is either "Safe Email" or "Phishing Email". This boxplot compares the lengths of legitimate vs. phishing emails after removing extreme outliers (emails of 10,000 characters or more). It provides insight into how message size differs between the two categories.
Although legitimate emails are often longer, phishing emails can still appear in the medium-length range.
This overlap indicates:
This also explains why:
Label Column for Modeling
The current label, Email.Type, contains text values: "Safe Email" and "Phishing Email".
For modeling, we convert this into a new factor variable called Label with these values:
- "legitimate"
- "phishing"
Distribution of Email.Type:
table(phishing_email_df$Email.Type)
##
## Phishing Email Safe Email
## 7328 11322
Label Factor Column
phishing_email_df$Label <- factor(
phishing_email_df$Email.Type,
levels = c("Safe Email", "Phishing Email"),
labels = c("legitimate", "phishing")
)
# Check the new Label column
table(phishing_email_df$Label)
##
## legitimate phishing
## 11322 7328
str(phishing_email_df$Label)
## Factor w/ 2 levels "legitimate","phishing": 1 1 1 2 2 1 1 2 2 1 ...
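As a minimal sketch of the recoding step above, the following toy vector (illustrative values, not rows from the dataset) shows how factor() pairs each entry of levels with the entry of labels at the same position. The name email_type is hypothetical.

```r
# Toy vector standing in for Email.Type (illustrative values only)
email_type <- c("Safe Email", "Phishing Email", "Safe Email")

# levels lists the original values; labels gives their replacements, in order
label <- factor(
  email_type,
  levels = c("Safe Email", "Phishing Email"),
  labels = c("legitimate", "phishing")
)

print(as.character(label))  # recoded text values
print(as.integer(label))    # underlying integer codes (1 = first level)
```

Because "legitimate" is listed first, it becomes level 1, which matches the integer codes shown in the str() output above.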
- phishing_email_df$Email.Type contains the original labels ("Safe Email" / "Phishing Email").
- factor(..., levels = ..., labels = ...) maps each original value to its new label, in the order given.
- The result is stored in a new Label column rather than overwriting Email.Type, so we retain the original text label for reference.
We now convert the Email.Text column (raw email content) into a corpus, which is a structured collection of text documents used for text mining.
library(tm) # make sure this is loaded
corpus <- VCorpus(VectorSource(phishing_email_df$Email.Text))
VectorSource(phishing_email_df$Email.Text)
This tells R to treat each row of Email.Text as a
separate document within the corpus.
VCorpus(...)
Creates a volatile corpus, meaning it exists in memory
and can be cleaned, transformed, and processed using the tm
package.
# Look at the first email
corpus[[1]]$content
## [1] "re : 6 . 1100 , disc : uniformitarianism , re : 1086 ; sex / lang dick hudson 's observations on us use of 's on ' but not 'd aughter ' as a vocative are very thought-provoking , but i am not sure that it is fair to attribute this to \" sons \" being \" treated like senior relatives \" . for one thing , we do n't normally use ' brother ' in this way any more than we do 'd aughter ' , and it is hard to imagine a natural class comprising senior relatives and 's on ' but excluding ' brother ' . for another , there seem to me to be differences here . if i am not imagining a distinction that is not there , it seems to me that the senior relative terms are used in a wider variety of contexts , e . g . , calling out from a distance to get someone 's attention , and hence at the beginning of an utterance , whereas 's on ' seems more natural in utterances like ' yes , son ' , ' hand me that , son ' than in ones like ' son ! ' or ' son , help me ! ' ( although perhaps these latter ones are not completely impossible ) . alexis mr"
# Look at, say, the 10th email
corpus[[10]]$content
## [1] "re : coastal deal - with exxon participation under the project agreement thanks for the info ! as greg mentioned in the staff meeting today , the intent is that this restructured deal is papered effective 4 / 1 / 00 . the impact is potentially that the gas is not pathed properly by counterparty or on the appropriate transport / gathering agreements , etc . if any rates are changing , then those need to be changed in our systems also . there may be other areas of changes also - i ' m not attempting to list them all . rather i just want to make people aware that retroactive deals can have impacts on the daily operations . thanks for the information . pat / daren : can you get with mike and / or brian to determine the potential impact , if any ? thanks . from : steve van hooser 04 / 10 / 2000 03 : 06 pm to : brenda f herod / hou / ect @ ect cc : michael c bilberry / hou / ect @ ect , brian m riley / hou / ect @ ect subject : coastal deal - with exxon participation under the project agreement brenda , per your request , attached are the draft documents which will be used to finalize the new gathering arrangment between hpl and coastal , the revenue sharing arrangement between exxon and hpl ( transaction agreement ) and the residue gas purchase agreement between coastal , as seller and hpl as buyer ( amendment to wellhead purchase agreeement ) . i do not have a copy of the processing agreement between exxon and coastal , as such agreement does not involve us ( and i believe it is far from finalized . the only other document that i plan to prepare is a termination agreement relative to the current liquifiables purchase agreement between exxon as purchaser and hpl as seller - - this termination will be affective as of 4 / 1 / 2000 . if i can be of any further assistance , please let me know . steve"
corpus[[1]]$content returns the raw text of the first document.
We now apply a sequence of transformations to standardize and prepare the email text for analysis.
corpus_clean <- tm_map(corpus, content_transformer(tolower))
Without normalization, a model would treat “Free”, “FREE”, and “free” as three different tokens.
Converting all text to lowercase prevents this duplication and reduces noise in the dataset.
corpus_clean <- tm_map(corpus_clean, removeNumbers)
Many numbers do not contribute to detecting phishing (unless
performing more advanced NLP),
so they are typically removed during preprocessing.
corpus_clean <- tm_map(corpus_clean, removePunctuation)
Punctuation does not carry meaningful information for most traditional machine learning models and is therefore removed during preprocessing.
corpus_clean <- tm_map(corpus_clean, removeWords, stopwords("en"))
Stopwords are very common words such as:
the, and, or, is, to, be, of, at
They appear so frequently that they do not help with classification and
are removed during preprocessing.
corpus_clean <- tm_map(corpus_clean, stripWhitespace)
This step removes unnecessary spaces that remain after deleting stopwords, punctuation, or other text elements.
corpus_clean[[1]]$content
## [1] "re disc uniformitarianism re sex lang dick hudson s observations us use s d aughter vocative thoughtprovoking sure fair attribute sons treated like senior relatives one thing nt normally use brother way d aughter hard imagine natural class comprising senior relatives s excluding brother another seem differences imagining distinction seems senior relative terms used wider variety contexts e g calling distance get someone s attention hence beginning utterance whereas s seems natural utterances like yes son hand son ones like son son help although perhaps latter ones completely impossible alexis mr"
At this point, we can see a simplified and cleaned version of the original email text.
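To make the effect of each cleaning step easier to see, here is a rough base-R approximation of the same sequence applied to a single made-up sentence. It is a sketch, not tm's implementation: clean_text is a hypothetical helper, and its stopword list is only the short sample quoted above rather than the full stopwords("en") list.

```r
# Hypothetical helper that mimics the cleaning order used above:
# lowercase -> remove numbers -> remove punctuation -> remove stopwords -> collapse whitespace
clean_text <- function(x,
                       stopwords = c("the", "and", "or", "is", "to", "be", "of", "at")) {
  x <- tolower(x)
  x <- gsub("[0-9]+", "", x)                 # drop digits
  x <- gsub("[[:punct:]]+", "", x)           # drop punctuation
  pattern <- paste0("\\b(", paste(stopwords, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)     # drop whole-word stopwords
  x <- gsub("\\s+", " ", x)                  # collapse runs of whitespace
  trimws(x)                                  # extra step: trim the ends
}

print(clean_text("FREE offer: call 1-800 to be the WINNER!"))
```

Note that removing punctuation before stopwords means contractions such as "don't" become "dont" and are no longer matched by a stopword list that contains the apostrophe form.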
This step converts the cleaned text into numerical form (word counts), allowing machine learning models to analyze the data.
dtm <- DocumentTermMatrix(corpus_clean)
dtm
## <<DocumentTermMatrix (documents: 18650, terms: 165489)>>
## Non-/sparse entries: 1944727/3084425123
## Sparsity : 100%
## Maximal term length: 1173
## Weighting : term frequency (tf)
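The structure of a Document-Term Matrix can be illustrated on three made-up documents using base R only. This is a sketch under simplifying assumptions (whitespace tokenization, hypothetical names docs and toy_dtm), not what DocumentTermMatrix() does internally.

```r
# Three toy documents, already cleaned (illustrative only)
docs <- c("free prize now", "meeting now", "free free offer")

# Split on single spaces and collect the vocabulary
tokens <- strsplit(docs, " ", fixed = TRUE)
vocab <- sort(unique(unlist(tokens)))

# Count each vocabulary term per document; rows = documents, columns = terms
toy_dtm <- t(vapply(
  tokens,
  function(tk) as.integer(table(factor(tk, levels = vocab))),
  integer(length(vocab))
))
dimnames(toy_dtm) <- list(paste0("doc", seq_along(docs)), vocab)

print(toy_dtm)
# Share of zero cells, the quantity reported as "Sparsity" above
print(mean(toy_dtm == 0))
```

Even in this tiny case more than half of the cells are zero, which hints at why the full matrix with 165,489 terms is almost entirely empty.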
Rare words clutter the model and slow processing.
To improve performance, we keep only words that appear in at
least 1% of emails.
dtm_sparse <- removeSparseTerms(dtm, 0.99)
dtm_sparse
## <<DocumentTermMatrix (documents: 18650, terms: 1868)>>
## Non-/sparse entries: 1029229/33808971
## Sparsity : 97%
## Maximal term length: 47
## Weighting : term frequency (tf)
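The filtering rule behind removeSparseTerms(dtm, 0.99) can be approximated by counting how many documents contain each term. The sketch below uses a made-up 200-document matrix with hypothetical term names; how tm treats a term sitting exactly on the threshold is an assumption here, so the example avoids that boundary.

```r
# 200 toy documents and three hypothetical terms with different document frequencies
n_docs <- 200
toy <- cbind(
  common     = rep(1L, n_docs),               # present in every document
  occasional = c(rep(1L, 10), rep(0L, 190)),  # present in 5% of documents
  rare       = c(1L, rep(0L, 199))            # present in 0.5% of documents
)

sparse_threshold <- 0.99

# Number of documents in which each term occurs at least once
doc_freq <- colSums(toy > 0)

# Keep terms whose document frequency exceeds (1 - threshold) of all documents
keep <- doc_freq > n_docs * (1 - sparse_threshold)

print(doc_freq)
print(colnames(toy)[keep])
```

Here the term found in only 0.5% of documents is dropped, while the other two survive.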
A sparsity threshold of 0.99 keeps terms that appear in at least 1% of emails.
emails_words <- as.data.frame(as.matrix(dtm_sparse))
# str(emails_words)
Machine learning algorithms such as Naive Bayes and
Random Forest require a standard R data
frame, not a DocumentTermMatrix.
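This step converts the DTM into a familiar structure where each row is an email, each column is a term, and each cell holds the number of times that term occurs in the email.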
Some words contain characters that are not valid in R column names, so they must be sanitized before modeling.
emails_words <- setNames(emails_words, make.names(names(emails_words)))
make.names() converts invalid column names into
syntactically valid ones.
For example, a term like "1-800" becomes
"X1.800", which R can safely use as a column name.
Label Column
emails_words$Label <- phishing_email_df$Label
str(emails_words$Label)
## Factor w/ 2 levels "legitimate","phishing": 1 1 1 2 2 1 1 2 2 1 ...
table(emails_words$Label)
##
## legitimate phishing
## 11322 7328
We copy the previously created Label factor
(legitimate / phishing) into the new data
frame.
Now each row contains:
- the email’s word counts
- the correct label for modeling
To avoid modeling errors, all column names in
emails_words must be unique.
We then:
1. Fix duplicated names
2. Recreate train_data and test_data
3. Retry the Naive Bayes model
any(duplicated(names(emails_words)))
## [1] FALSE
names(emails_words)[duplicated(names(emails_words))]
## character(0)
names(emails_words) <- make.names(names(emails_words), unique = TRUE)
make.names(..., unique = TRUE) not only cleans invalid
column names but also ensures all names are
unique.
If duplicates exist, R will automatically append suffixes such as
.1, .2, etc.
Example:
If two columns were originally named "list.", they
become:
- "list."
- "list..1"
Now every column name in emails_words is valid and
unique.
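The two behaviours of make.names() described above can be checked on a small vector of made-up column names (raw_names is hypothetical):

```r
# Made-up column names: one starts with a digit, one has punctuation, two collide
raw_names <- c("1-800", "free!", "list.", "list.")

# Syntactic cleanup only: duplicates are left in place
valid_only <- make.names(raw_names)

# Syntactic cleanup plus de-duplication with numeric suffixes
valid_unique <- make.names(raw_names, unique = TRUE)

print(valid_only)
print(valid_unique)
print(anyDuplicated(valid_unique) == 0)  # TRUE when every name is unique
```

Calling it with unique = TRUE is harmless when no duplicates exist, which is why the check above returning FALSE does not make the call an error.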
set.seed(123)
n <- nrow(emails_words)
train_index <- sample(1:n, size = 0.7 * n)
train_data <- emails_words[train_index, ]
test_data <- emails_words[-train_index, ]
any(duplicated(names(train_data)))
## [1] FALSE
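A quick way to sanity-check a random 70/30 split like the one above is to confirm that the two parts are disjoint and to look at the class balance in each. The sketch below does this on a made-up 100-row data frame; toy and its id column are hypothetical and only stand in for emails_words.

```r
set.seed(123)

# Toy stand-in for emails_words: 60 legitimate and 40 phishing rows
toy <- data.frame(
  id = 1:100,
  Label = factor(rep(c("legitimate", "phishing"), times = c(60, 40)))
)

n <- nrow(toy)
train_index <- sample(seq_len(n), size = round(0.7 * n))
train <- toy[train_index, ]
test <- toy[-train_index, ]

# The two parts should not share rows and should cover the whole data set
print(c(train = nrow(train), test = nrow(test)))
print(length(intersect(train$id, test$id)))

# Class proportions in each part; a simple random split keeps them only roughly equal
print(round(prop.table(table(train$Label)), 2))
print(round(prop.table(table(test$Label)), 2))
```

If the proportions drift far apart on real data, a stratified split (sampling within each class) is one common alternative; that is a suggestion, not something the analysis above does.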
Naive Bayes is a simple yet effective algorithm for text classification, especially when working with document-term matrices.
x / y Instead of Formula
Create the feature matrix train_x and the label vector train_y.
# All predictors: every column except Label
train_x <- train_data[ , !(names(train_data) %in% "Label")]
# Target variable
train_y <- train_data$Label
# str(train_x)
# str(train_y)
train_x: a big data frame of integers (word counts)
train_y: a factor with levels legitimate and
phishing
Train Naive Bayes with x / y interface
library(e1071)
model_nb <- naiveBayes(x = train_x, y = train_y)
# model_nb
Predict and evaluate Naive Bayes
Prepare test features
We want the same predictor columns in the test set
test_x <- test_data[ , colnames(train_x)]
test_y <- test_data$Label # true labels
pred_nb <- predict(model_nb, newdata = test_x)
cm_nb <- table(Predicted = pred_nb, Actual = test_y)
cm_nb
## Actual
## Predicted legitimate phishing
## legitimate 330 63
## phishing 3039 2163
accuracy_nb <- sum(diag(cm_nb)) / sum(cm_nb)
accuracy_nb
## [1] 0.4455764
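The Naive Bayes model shows an overall accuracy of about 44.6%. It catches 2,163 of the 2,226 phishing emails in the test set, but it also flags 3,039 of the 3,369 legitimate emails as phishing.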
The dataset contains real corporate emails (Enron) mixed with phishing spam.
After text cleaning and sparsity reduction, many important signal words were removed, mainly by the removeSparseTerms() function.
This step removes any word that does not appear in at least 1% of all emails.
As a result:
Naive Bayes performs best when:
This is why Naive Bayes works well for SMS spam datasets, but not for long-form corporate email.
Therefore, the poor Naive Bayes performance observed here is normal and expected.
library(randomForest)
set.seed(123)
model_rf <- randomForest(
x = train_x,
y = train_y,
ntree = 100, # number of trees
mtry = sqrt(ncol(train_x)) # features per split
)
model_rf
##
## Call:
## randomForest(x = train_x, y = train_y, ntree = 100, mtry = sqrt(ncol(train_x)))
## Type of random forest: classification
## Number of trees: 100
## No. of variables tried at each split: 43
##
## OOB estimate of error rate: 4.41%
## Confusion matrix:
## legitimate phishing class.error
## legitimate 7567 386 0.04853514
## phishing 190 4912 0.03724030
- ntree = 100 builds 100 decision trees.
- mtry = sqrt(number_of_features) is a common and effective default for text classification.
Random Forest handles high-dimensional data, non-linear patterns, and noise much better than Naive Bayes.
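The out-of-bag figures printed by model_rf above can be re-derived by hand from its confusion matrix. The sketch below re-enters the four printed counts into a plain matrix (oob_cm is a hypothetical name) and recomputes the per-class error and the overall OOB error rate.

```r
# Counts copied from the OOB confusion matrix printed above (rows = actual class)
oob_cm <- matrix(
  c(7567, 386,
    190, 4912),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    Actual = c("legitimate", "phishing"),
    Predicted = c("legitimate", "phishing")
  )
)

# Per-class error: share of each actual class that was misclassified
class_error <- 1 - diag(oob_cm) / rowSums(oob_cm)

# Overall OOB error: share of all training rows that were misclassified
oob_error <- 1 - sum(diag(oob_cm)) / sum(oob_cm)

print(round(class_error, 4))
print(round(100 * oob_error, 2))  # percent
```

The recomputed values agree with the class.error column and the 4.41% OOB estimate in the printed model summary.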
| Actual → | legitimate | phishing | class.error |
|---|---|---|---|
| legitimate | 7567 | 386 | 0.0485 (≈ 4.9% misclassified) |
| phishing | 190 | 4912 | 0.0372 (≈ 3.7% misclassified) |
This shows:
Comparison:
pred_rf <- predict(model_rf, newdata = test_x)
cm_rf <- table(Predicted = pred_rf, Actual = test_y)
cm_rf
## Actual
## Predicted legitimate phishing
## legitimate 3228 118
## phishing 141 2108
accuracy_rf <- sum(diag(cm_rf)) / sum(cm_rf)
accuracy_rf
## [1] 0.9537087
Total correct predictions:
Correct = 3228 + 2108 = 5336
Total number of predictions:
Total = 5595
Accuracy calculation:
\[ \text{Accuracy} = \frac{3228 + 2108}{5595} = 0.9537 \]
Accuracy = 95.37%
This is extremely strong performance and a major improvement over Naive Bayes (~44.6%).
| Model | Accuracy | Notes |
|---|---|---|
| Naive Bayes | 44.6% | Fails to model long-form text; misclassifies most legitimate emails |
| Random Forest | 95.4% | High accuracy; correctly detects both phishing and legitimate emails |
Naive Bayes struggled on this dataset, which is normal for long-form email classification.
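Accuracy alone hides how differently the two models fail. As a sketch, the test-set confusion matrices reported above can be re-entered by hand and turned into precision, recall, and specificity for the phishing class. The helper names metrics and make_cm, and the object names nb_cm and rf_cm, are hypothetical.

```r
# Build a 2x2 matrix laid out like the tables above (rows = predicted, columns = actual)
make_cm <- function(v) {
  matrix(
    v, nrow = 2, byrow = TRUE,
    dimnames = list(
      Predicted = c("legitimate", "phishing"),
      Actual = c("legitimate", "phishing")
    )
  )
}

# Derive common metrics, treating "phishing" as the positive class
metrics <- function(cm, positive = "phishing") {
  negative <- setdiff(rownames(cm), positive)
  tp <- cm[positive, positive]  # predicted phishing, actually phishing
  fp <- cm[positive, negative]  # predicted phishing, actually legitimate
  fn <- cm[negative, positive]  # predicted legitimate, actually phishing
  tn <- cm[negative, negative]  # predicted legitimate, actually legitimate
  c(
    accuracy = (tp + tn) / sum(cm),
    precision = tp / (tp + fp),
    recall = tp / (tp + fn),
    specificity = tn / (tn + fp)
  )
}

# Counts copied from the cm_nb and cm_rf outputs above
nb_cm <- make_cm(c(330, 63, 3039, 2163))
rf_cm <- make_cm(c(3228, 118, 141, 2108))

print(round(rbind(naive_bayes = metrics(nb_cm), random_forest = metrics(rf_cm)), 4))
```

Read this way, Naive Bayes has high recall but very low specificity, meaning it labels nearly everything as phishing, while Random Forest is balanced across both classes.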
This visualization shows which words contributed most strongly to classification decisions made by the Random Forest model.
library(randomForest)
# Get importance scores
importance_scores <- importance(model_rf)
importance_df <- data.frame(
Word = rownames(importance_scores),
Importance = importance_scores[, 1]
)
# Take top 20 most important features
top20 <- importance_df[order(-importance_df$Importance), ][1:20, ]
# Plot
ggplot(top20, aes(x = reorder(Word, Importance), y = Importance)) +
geom_bar(stat = "identity", fill = "darkgreen") +
coord_flip() +
labs(title = "Top 20 Most Important Words (Random Forest)",
x = "Word",
y = "Importance Score") +
theme_minimal()
The ROC curve below is built from the Random Forest's predicted phishing probabilities on the test set.
library(pROC)
## Warning: package 'pROC' was built under R version 4.3.3
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
# Get predicted probabilities for phishing class
rf_probs <- predict(model_rf, newdata = test_x, type = "prob")[, "phishing"]
# Create ROC object
roc_obj <- roc(test_y, rf_probs)
## Setting levels: control = legitimate, case = phishing
## Setting direction: controls < cases
# Plot ROC
plot(roc_obj, col = "blue", lwd = 3, main = "ROC Curve - Random Forest")
# Add AUC to plot
auc_value <- auc(roc_obj)
legend("bottomright", legend = paste("AUC =", round(auc_value, 4)),
col = "blue", lwd = 3)
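For intuition about what the AUC measures, it can be computed without pROC as the probability that a randomly chosen phishing email receives a higher score than a randomly chosen legitimate one. The sketch below uses the rank-based (Mann-Whitney) form of that idea on six made-up scores; auc_rank is a hypothetical helper, and this is not a description of how pROC computes its value internally.

```r
# Rank-based AUC: share of (phishing, legitimate) pairs ranked in the right order
auc_rank <- function(scores, labels, positive = "phishing") {
  is_pos <- labels == positive
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  r <- rank(scores)  # ties receive average ranks
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up predicted probabilities and true labels (illustrative only)
scores <- c(0.9, 0.8, 0.35, 0.6, 0.2, 0.1)
labels <- c("phishing", "phishing", "phishing",
            "legitimate", "legitimate", "legitimate")

print(auc_rank(scores, labels))
```

In this toy case 8 of the 9 phishing/legitimate pairs are ordered correctly, so the result is 8/9; only the phishing email scored 0.35 falls below a legitimate one (0.6).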
| Model | Accuracy | Comments |
|---|---|---|
| Naive Bayes | ~44.6% | Incorrectly classified many legitimate emails; not suitable for long-form text |
| Random Forest | ~95.4% | Performed extremely well with high precision and recall |
Overall, the Random Forest model reached about 95.4% accuracy on the held-out test set, making it the more reliable of the two models for organizations seeking to enhance their email security infrastructure.