The objective of this project is to classify the emails in a data set as either spam or ham. Spam emails are unsolicited junk messages that may contain phishing or other harmful links designed to trick recipients into exposing sensitive personal information, whereas ham emails are legitimate, intended messages that belong in a mailbox.
I obtained the Spam Ham data from Kaggle: https://www.kaggle.com/datasets/venky73/spam-mails-dataset/
library(tidyverse)
library(openintro)
library(dplyr)
library(NLP)
#install.packages("tm")
library(tm)
## Warning: package 'tm' was built under R version 4.3.3
url <- "https://raw.githubusercontent.com/pujaroy280/DATA607Project4/main/spam_ham_dataset.csv"
df_spam_ham <- read.csv(url)
head(df_spam_ham)
##      X label
## 1 605 ham
## 2 2349 ham
## 3 3624 ham
## 4 4685 spam
## 5 2030 ham
## 6 2949 ham
## text
## 1 Subject: enron methanol ; meter # : 988291\nthis is a follow up to the note i gave you on monday , 4 / 3 / 00 { preliminary\nflow data provided by daren } .\nplease override pop ' s daily volume { presently zero } to reflect daily\nactivity you can obtain from gas control .\nthis change is needed asap for economics purposes .
## 2 Subject: hpl nom for january 9 , 2001\n( see attached file : hplnol 09 . xls )\n- hplnol 09 . xls
## 3 Subject: neon retreat\nho ho ho , we ' re around to that most wonderful time of the year - - - neon leaders retreat time !\ni know that this time of year is extremely hectic , and that it ' s tough to think about anything past the holidays , but life does go on past the week of december 25 through january 1 , and that ' s what i ' d like you to think about for a minute .\non the calender that i handed out at the beginning of the fall semester , the retreat was scheduled for the weekend of january 5 - 6 . but because of a youth ministers conference that brad and dustin are connected with that week , we ' re going to change the date to the following weekend , january 12 - 13 . now comes the part you need to think about .\ni think we all agree that it ' s important for us to get together and have some time to recharge our batteries before we get to far into the spring semester , but it can be a lot of trouble and difficult for us to get away without kids , etc . so , brad came up with a potential alternative for how we can get together on that weekend , and then you can let me know which you prefer .\nthe first option would be to have a retreat similar to what we ' ve done the past several years . this year we could go to the heartland country inn ( www . . com ) outside of brenham . it ' s a nice place , where we ' d have a 13 - bedroom and a 5 - bedroom house side by side . it ' s in the country , real relaxing , but also close to brenham and only about one hour and 15 minutes from here . we can golf , shop in the antique and craft stores in brenham , eat dinner together at the ranch , and spend time with each other . we ' d meet on saturday , and then return on sunday morning , just like what we ' ve done in the past .\nthe second option would be to stay here in houston , have dinner together at a nice restaurant , and then have dessert and a time for visiting and recharging at one of our homes on that saturday evening . this might be easier , but the trade off would be that we wouldn ' t have as much time together . i ' ll let you decide .\nemail me back with what would be your preference , and of course if you ' re available on that weekend . the democratic process will prevail - - majority vote will rule ! let me hear from you as soon as possible , preferably by the end of the weekend . and if the vote doesn ' t go your way , no complaining allowed ( like i tend to do ! )\nhave a great weekend , great golf , great fishing , great shopping , or whatever makes you happy !\nbobby
## 4 Subject: photoshop , windows , office . cheap . main trending\nabasements darer prudently fortuitous undergone\nlighthearted charm orinoco taster\nrailroad affluent pornographic cuvier\nirvin parkhouse blameworthy chlorophyll\nrobed diagrammatic fogarty clears bayda\ninconveniencing managing represented smartness hashish\nacademies shareholders unload badness\ndanielson pure caffein\nspaniard chargeable levin\n
## 5 Subject: re : indian springs\nthis deal is to book the teco pvr revenue . it is my understanding that teco\njust sends us a check , i haven ' t received an answer as to whether there is a\npredermined price associated with this deal or if teco just lets us know what\nwe are giving . i can continue to chase this deal down if you need .
## 6 Subject: ehronline web address change\nthis message is intended for ehronline users only .\ndue to a recent change to ehronline , the url ( aka " web address " ) for accessing ehronline needs to be changed on your computer . the change involves adding the letter " s " to the " http " reference in the url . the url for accessing ehronline should be : https : / / ehronline . enron . com .\nthis change should be made by those who have added the url as a favorite on the browser .
## label_num
## 1 0
## 2 0
## 3 0
## 4 1
## 5 0
## 6 0
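# Note: the call that produced the class counts below is not shown in the original
# output; a frequency table of the numeric label column (0 = ham, 1 = spam) is assumed.
table(df_spam_ham$label_num)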
##
## 0 1
## 3672 1499
# Create a corpus from the text data in the 'text' column of the dataframe df_spam_ham
corpus = VCorpus(VectorSource(df_spam_ham$text))
# Convert all text to lowercase
corpus = tm_map(corpus, content_transformer(tolower))
# Convert all text to plain text documents
corpus = tm_map(corpus, PlainTextDocument)
# Remove punctuation from the text
corpus = tm_map(corpus, removePunctuation)
# Remove English stopwords (e.g., 'a', 'an', 'the') from the text
corpus = tm_map(corpus, removeWords, stopwords("en"))
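# Note: the chunk that built the document-term matrix is not shown in the original
# output; the summary below is assumed to come from constructing and printing a
# DocumentTermMatrix from the cleaned corpus.
dtm = DocumentTermMatrix(corpus)
dtm
## <<DocumentTermMatrix (documents: 5171, terms: 49690)>>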
## Non-/sparse entries: 332421/256614569
## Sparsity : 100%
## Maximal term length: 24
## Weighting : term frequency (tf)
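# Note: the chunk that removed sparse terms is also not shown in the original; the
# 6-term matrix summarized below is assumed to come from removeSparseTerms(), and
# the sparsity threshold used here (0.95) is an assumption rather than a known value.
spdtm = removeSparseTerms(dtm, 0.95)
spdtm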
## <<DocumentTermMatrix (documents: 5171, terms: 6)>>
## Non-/sparse entries: 13158/17868
## Sparsity : 58%
## Maximal term length: 7
## Weighting : term frequency (tf)
# Convert the sparse Document-Term Matrix (spdtm) to a data frame
emails_Sparse = as.data.frame(as.matrix(spdtm))
# Calculate the sum of each column to determine the total frequency of each word
# kept in the sparse matrix, sorted in increasing order (colSums() returns a named
# vector, so the words appear as the names of the frequencies)
sort(colSums(emails_Sparse))
## thanks please will 2000 enron subject
## 1898 3198 4132 4386 6555 8060
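# Note: two steps needed below are not shown in the original output. First, the
# outcome column must be attached to the word-frequency data frame; the column name
# 'spam' and its source (the 0/1 label_num column) are assumed from their use below.
emails_Sparse$spam = df_spam_ham$label_num
# Second, the caTools package (which provides sample.split) must be loaded; the
# warning message below indicates it was attached at this point.
library(caTools)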
## Warning: package 'caTools' was built under R version 4.3.3
# Set the random seed for reproducibility
set.seed(123)
# Split the data into training and testing sets using a 95-5 split ratio
spl = sample.split(emails_Sparse$spam, .95)
# Subset the data into a training set using the split
train = subset(emails_Sparse, spl == TRUE)
# Subset the data into a testing set using the split
test = subset(emails_Sparse, spl == FALSE)
# Train a logistic regression model to predict 'spam' using all predictors
spamlog = glm(spam ~ ., data = train, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
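# Note: the call that printed the regression summary below is not shown in the
# original; summary() on the fitted model is assumed.
summary(spamlog)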
##
## Call:
## glm(formula = spam ~ ., family = "binomial", data = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.16044 0.10258 1.564 0.1178
## `2000` -1.22525 0.11542 -10.616 <2e-16 ***
## enron -17.10993 268.90246 -0.064 0.9493
## please -0.16300 0.05135 -3.175 0.0015 **
## subject 0.07698 0.08814 0.873 0.3825
## thanks -1.89713 0.12623 -15.029 <2e-16 ***
## will -0.04864 0.02815 -1.728 0.0840 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5914.7 on 4911 degrees of freedom
## Residual deviance: 4028.6 on 4905 degrees of freedom
## AIC: 4042.6
##
## Number of Fisher Scoring iterations: 21
# Predict probabilities of 'spam' for the training set using the trained logistic regression model
predTrainLog = predict(spamlog, type = "response")
# Create a contingency table comparing actual 'spam' labels with predicted labels based on a threshold of 0.5
table(train$spam, predTrainLog > 0.5)
##
## FALSE TRUE
## 0 2560 928
## 1 214 1210
# Calculate the accuracy of the model on the training set
# The numerator is the number of correctly classified emails (true negatives plus
# true positives, taken from the confusion table above)
# The denominator is the total number of emails in the training set
accuracy_train <- (2560 + 1210) / nrow(train)
print(accuracy_train)
## [1] 0.7675081
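# A more robust alternative (not part of the original analysis) is to compute the
# accuracy directly from the predictions rather than retyping counts by hand; this
# should reproduce the value printed above.
mean((predTrainLog > 0.5) == train$spam)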
# Predict probabilities of 'spam' for the test set using the trained logistic regression model
predTestLog = predict(spamlog, newdata = test, type = "response")
# Create a contingency table comparing actual 'spam' labels with predicted labels based on a threshold of 0.5
table(test$spam, predTestLog > 0.5)
##
## FALSE TRUE
## 0 129 55
## 1 7 68
# Calculate the accuracy of the model on the test set
# The numerator is the number of correctly classified emails (true negatives plus
# true positives, taken from the confusion table above)
# The denominator is the total number of emails in the test set
accuracy_test <- (129 + 68) / nrow(test)
print(accuracy_test)
## [1] 0.7606178
In conclusion, for this project I used a logistic regression model to classify emails as spam or ham based on their content. The model learned to predict the probability that an email is spam from the frequencies of a small set of common words retained in the document-term matrix. Evaluated against the confusion matrices above, the model classifies approximately 76.8% of the training emails and 76.1% of the test emails correctly. This is only somewhat better than always predicting ham (about 71% of the data set), so there is clear room for improvement: keeping more terms from the document-term matrix, tuning the classification threshold, or trying more advanced machine learning algorithms such as CART or random forests could all be explored, as sketched below.
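As an illustration of one such alternative, the sketch below (not part of the original analysis) fits a CART decision tree on the same training data using the rpart package; it assumes the train and test data frames built above are still in memory, and it reuses the 0.5 probability cutoff applied to the logistic regression model.
# Sketch only: an alternative classifier, assuming the objects created above exist
library(rpart)
library(rpart.plot)
# Fit a classification tree predicting the 0/1 spam label from the word frequencies
spamCART = rpart(spam ~ ., data = train, method = "class")
# Plot the fitted tree to see which words drive the splits
prp(spamCART)
# Predicted probability of spam (second column of the probability matrix) on the test set
predTestCART = predict(spamCART, newdata = test)[, 2]
# Confusion matrix and accuracy at a 0.5 cutoff, for comparison with the logistic model
table(test$spam, predTestCART > 0.5)
mean((predTestCART > 0.5) == test$spam)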