In this project, we build a Naive Bayes model that classifies emails as spam or ham (legitimate mail).
library(readr)
library(dplyr)
library(tidyr)
library(tidyverse)
library(wordcloud)
library(tm)
library(naivebayes)
library(e1071)
library(RTextTools)
library(caret)
library(quanteda)
library(rsample)
url <- "https://raw.githubusercontent.com/kristinlussi/DATA_607/main/Project4/spam_ham_dataset.csv"
spamham <- read_csv(url, show_col_types = FALSE) %>%
  as.data.frame()
head(spamham)
## ...1 label
## 1 605 ham
## 2 2349 ham
## 3 3624 ham
## 4 4685 spam
## 5 2030 ham
## 6 2949 ham
## text
## 1 Subject: enron methanol ; meter # : 988291\r\nthis is a follow up to the note i gave you on monday , 4 / 3 / 00 { preliminary\r\nflow data provided by daren } .\r\nplease override pop ' s daily volume { presently zero } to reflect daily\r\nactivity you can obtain from gas control .\r\nthis change is needed asap for economics purposes .
## 2 Subject: hpl nom for january 9 , 2001\r\n( see attached file : hplnol 09 . xls )\r\n- hplnol 09 . xls
## 3 Subject: neon retreat\r\nho ho ho , we ' re around to that most wonderful time of the year - - - neon leaders retreat time !\r\ni know that this time of year is extremely hectic , and that it ' s tough to think about anything past the holidays , but life does go on past the week of december 25 through january 1 , and that ' s what i ' d like you to think about for a minute .\r\non the calender that i handed out at the beginning of the fall semester , the retreat was scheduled for the weekend of january 5 - 6 . but because of a youth ministers conference that brad and dustin are connected with that week , we ' re going to change the date to the following weekend , january 12 - 13 . now comes the part you need to think about .\r\ni think we all agree that it ' s important for us to get together and have some time to recharge our batteries before we get to far into the spring semester , but it can be a lot of trouble and difficult for us to get away without kids , etc . so , brad came up with a potential alternative for how we can get together on that weekend , and then you can let me know which you prefer .\r\nthe first option would be to have a retreat similar to what we ' ve done the past several years . this year we could go to the heartland country inn ( www . . com ) outside of brenham . it ' s a nice place , where we ' d have a 13 - bedroom and a 5 - bedroom house side by side . it ' s in the country , real relaxing , but also close to brenham and only about one hour and 15 minutes from here . we can golf , shop in the antique and craft stores in brenham , eat dinner together at the ranch , and spend time with each other . we ' d meet on saturday , and then return on sunday morning , just like what we ' ve done in the past .\r\nthe second option would be to stay here in houston , have dinner together at a nice restaurant , and then have dessert and a time for visiting and recharging at one of our homes on that saturday evening . 
this might be easier , but the trade off would be that we wouldn ' t have as much time together . i ' ll let you decide .\r\nemail me back with what would be your preference , and of course if you ' re available on that weekend . the democratic process will prevail - - majority vote will rule ! let me hear from you as soon as possible , preferably by the end of the weekend . and if the vote doesn ' t go your way , no complaining allowed ( like i tend to do ! )\r\nhave a great weekend , great golf , great fishing , great shopping , or whatever makes you happy !\r\nbobby
## 4 Subject: photoshop , windows , office . cheap . main trending\r\nabasements darer prudently fortuitous undergone\r\nlighthearted charm orinoco taster\r\nrailroad affluent pornographic cuvier\r\nirvin parkhouse blameworthy chlorophyll\r\nrobed diagrammatic fogarty clears bayda\r\ninconveniencing managing represented smartness hashish\r\nacademies shareholders unload badness\r\ndanielson pure caffein\r\nspaniard chargeable levin\r\n
## 5 Subject: re : indian springs\r\nthis deal is to book the teco pvr revenue . it is my understanding that teco\r\njust sends us a check , i haven ' t received an answer as to whether there is a\r\npredermined price associated with this deal or if teco just lets us know what\r\nwe are giving . i can continue to chase this deal down if you need .
## 6 Subject: ehronline web address change\r\nthis message is intended for ehronline users only .\r\ndue to a recent change to ehronline , the url ( aka " web address " ) for accessing ehronline needs to be changed on your computer . the change involves adding the letter " s " to the " http " reference in the url . the url for accessing ehronline should be : https : / / ehronline . enron . com .\r\nthis change should be made by those who have added the url as a favorite on the browser .
## label_num
## 1 0
## 2 0
## 3 0
## 4 1
## 5 0
## 6 0
# Build a corpus from the raw text and normalise it step by step
spamham_corpus <- VCorpus(VectorSource(spamham$text)) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removePunctuation) %>%
  tm_map(removeNumbers) %>%
  tm_map(removeWords, stopwords("en")) %>%
  tm_map(stripWhitespace) %>%
  # Caution: "." is a regex wildcard, so ".com" matches any character followed
  # by "com"; this also cuts " com" out of phrases such as "now comes"
  tm_map(content_transformer(function(x) gsub(".com", "", x))) %>%
  tm_map(content_transformer(function(x) gsub("http\\S+|www\\S+", "", x))) %>%
  tm_map(content_transformer(function(x) gsub("subject", "", x))) %>%
  tm_map(content_transformer(function(x) gsub("http", "", x)))
spamham_df <- data.frame(text = sapply(spamham_corpus, as.character), stringsAsFactors = FALSE)
spamham_clean <- data.frame(spamham_df, spamham$label) %>%
  rename("label" = spamham.label)
head(spamham_clean)
## text
## 1 enron methanol meter follow note gave monday preliminary flow data provided daren please override pop s daily volume presently zero reflect daily activity can obtain gas control change needed asap economics purposes
## 2 hpl nom january see attached file hplnol xls hplnol xls
## 3 neon retreat ho ho ho re around wonderful time year neon leaders retreat time know time year extremely hectic s tough think anything past holidays life go past week december january s d like think minute calender handed beginning fall semester retreat scheduled weekend january youth ministers conference brad dustin connected week re going change date following weekend january nowes part need think think agree s important us get together time recharge batteries get far spring semester can lot trouble difficult us get away without kids etc brad came potential alternative can get together weekend can let know prefer first option retreat similar ve done past several years year go heartland country inn www outside brenham s nice place d bedroom bedroom house side side s country real relaxing also close brenham one hour minutes can golf shop antique craft stores brenham eat dinner together ranch spend time d meet saturday return sunday morning just like ve done past second option stay houston dinner together nice restaurant dessert time visiting recharging one homes saturday evening might easier trade wouldn t much time together ll let decide email back preference course re available weekend democratic process prevail majority vote rule let hear soon possible preferably end weekend vote doesn t go wayplaining allowed like tend great weekend great golf great fishing great shopping whatever makes happy bobby
## 4 photoshop windows office cheap main trending abasements darer prudently fortuitous undergone lighthearted charm orinoco taster railroad affluent pornographic cuvier irvin parkhouse blameworthy chlorophyll robed diagrammatic fogarty clears bayda inconveniencing managing represented smartness hashish academies shareholders unload badness danielson pure caffein spaniard chargeable levin
## 5 re indian springs deal book teco pvr revenue understanding teco just sends us check haven t received answer whether predermined price associated deal teco just lets us know giving can continue chase deal need
## 6 ehronline web address change message intended ehronline users due recent change ehronline url aka web address accessing ehronline needs changedputer change involves adding letter s reference url url accessing ehronline ehronline enron change made added url favorite browser
## label
## 1 ham
## 2 ham
## 3 ham
## 4 spam
## 5 ham
## 6 ham
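The cleaned text above contains fused tokens such as "nowes" (from "now comes") and "changedputer". These come from the ".com" pattern, where the unescaped dot matches any character. The sketch below contrasts that pattern with a literal match via `fixed = TRUE`. The strings are made up for illustration and are not taken from the dataset. In the pipeline itself, punctuation is removed before this step, so a literal ".com" would no longer be present by then; URL stripping would probably need to run before `removePunctuation`.

```r
# Hypothetical example strings (not taken from the dataset)
x <- c("now comes the part", "visit example.com today")

# Regex match: "." is a wildcard, so " com" inside "now comes" is removed too
regex_result <- gsub(".com", "", x)

# Literal match: only the exact characters ".com" are removed
literal_result <- gsub(".com", "", x, fixed = TRUE)

print(regex_result)
print(literal_result)
```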
set.seed(1234)
# split data
splitIndex <- initial_split(spamham_clean, strata = label)
# training set
train_set <- training(splitIndex)
# test set
test_set <- testing(splitIndex)
# extract the training and test labels
train_labels <- train_set$label
test_labels <- test_set$label
# Proportion for training set
prop.table(table(train_labels))
## train_labels
## ham spam
## 0.7101599 0.2898401
# proportion for test set
prop.table(table(test_labels))
## test_labels
## ham spam
## 0.7099768 0.2900232
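The near-identical class proportions above are what stratification is meant to produce. As a minimal sketch on synthetic data (the `toy` data frame is an assumption, not the project's emails), the same call can be written with the split ratio made explicit; 0.75 is the `rsample` default for `prop`, which is what the call above relies on implicitly.

```r
library(rsample)

set.seed(1234)

# Synthetic labels with roughly the same imbalance as the email data
toy <- data.frame(
  id    = 1:1000,
  label = rep(c("ham", "spam"), times = c(700, 300))
)

# Stratified 75/25 split: sampling is done within each label
toy_split <- initial_split(toy, prop = 0.75, strata = label)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Class shares should be close to 0.7 / 0.3 in both partitions
train_prop <- prop.table(table(toy_train$label))
test_prop  <- prop.table(table(toy_test$label))
print(train_prop)
print(test_prop)
```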
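# Naive Bayes
# Caution: train_set holds both the text and label columns, so the label
# itself is passed to the model as a predictor
model_classifier <- naiveBayes(train_set, train_labels)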
# Predict the test set
test_pred <- predict(model_classifier, newdata = test_set)
# confusion matrix
confusionMatrix(as.factor(test_pred), as.factor(test_labels), positive = "spam",
                dnn = c("Prediction", "Actual"))
## Confusion Matrix and Statistics
##
## Actual
## Prediction ham spam
## ham 918 0
## spam 0 375
##
## Accuracy : 1
## 95% CI : (0.9972, 1)
## No Information Rate : 0.71
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.00
## Specificity : 1.00
## Pos Pred Value : 1.00
## Neg Pred Value : 1.00
## Prevalence : 0.29
## Detection Rate : 0.29
## Detection Prevalence : 0.29
## Balanced Accuracy : 1.00
##
## 'Positive' Class : spam
##
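The headline statistics can be recomputed by hand from the four cell counts, which helps when checking how each figure is defined. The sketch below uses only base R and the counts printed above; it assumes the accuracy interval is an exact binomial (Clopper-Pearson) interval, which reproduces the reported bounds.

```r
# Counts copied from the confusion matrix printed above
cm <- matrix(c(918, 0, 0, 375), nrow = 2,
             dimnames = list(Prediction = c("ham", "spam"),
                             Actual     = c("ham", "spam")))

accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["spam", "spam"] / sum(cm[, "spam"])  # true positive rate
specificity <- cm["ham", "ham"] / sum(cm[, "ham"])     # true negative rate
prevalence  <- sum(cm[, "spam"]) / sum(cm)             # share of actual spam

# Exact binomial 95% interval for the accuracy
ci <- binom.test(sum(diag(cm)), sum(cm))$conf.int

print(round(c(accuracy = accuracy, lower = ci[1], upper = ci[2]), 4))
```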
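The confusion matrix reports 100% accuracy on the 1,293 test emails, with a 95% confidence interval of (0.9972, 1). This figure should not be read as real-world performance: train_set and test_set both still contain the label column, so the classifier was handed the answer as one of its predictors. A meaningful estimate requires fitting the model on features derived from the email text alone.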
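One way to keep the label out of the predictors is to turn the text into a document-term matrix and pass the label only as the response. The following is a minimal sketch under stated assumptions, not the project's pipeline: the six toy emails, the helper `to_presence()`, and the choice of `laplace = 1` are all illustrative. The test matrix reuses the training vocabulary through the `dictionary` control option so both sets share the same columns.

```r
library(tm)
library(e1071)

# Hypothetical toy emails (not the project's data)
toy <- data.frame(
  text = c("cheap pills buy now", "meeting schedule attached",
           "buy cheap software now", "please review attached report",
           "cheap offer buy today", "schedule review meeting today"),
  label = factor(c("spam", "ham", "spam", "ham", "spam", "ham")),
  stringsAsFactors = FALSE
)
train <- toy[1:4, ]
test  <- toy[5:6, ]

# Document-term matrices; the test matrix reuses the training vocabulary
train_dtm <- DocumentTermMatrix(VCorpus(VectorSource(train$text)))
test_dtm  <- DocumentTermMatrix(VCorpus(VectorSource(test$text)),
                                control = list(dictionary = Terms(train_dtm)))

# Hypothetical helper: turn term counts into present/absent factors
to_presence <- function(dtm) {
  counts <- as.data.frame(as.matrix(dtm))
  as.data.frame(lapply(counts, function(n) {
    factor(n > 0, levels = c(FALSE, TRUE), labels = c("no", "yes"))
  }))
}

# Predictors come from the text only; the label is passed separately as y
toy_model <- naiveBayes(to_presence(train_dtm), train$label, laplace = 1)
toy_pred  <- predict(toy_model, to_presence(test_dtm))
print(toy_pred)
```

Applied to the real data, the same structure would use the cleaned corpus and the stratified split; the resulting accuracy would then reflect what the words in an email reveal about its class.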
# spam word cloud
spam_indices <- which(spamham_clean$label == "spam")
wordcloud(spamham_corpus[spam_indices], min.freq = 40)
# ham word cloud
ham_indices <- which(spamham_clean$label == "ham")
wordcloud(spamham_corpus[ham_indices], min.freq = 40)