Assignment
It can be useful to be able to classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.
For this project, you can start with a spam/ham dataset, then predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder).
Load the required packages
suppressWarnings(suppressMessages(library(readr)))
suppressWarnings(suppressMessages(library(knitr)))
suppressWarnings(suppressMessages(library(tidytext)))
suppressWarnings(suppressMessages(library(tidyr)))
suppressWarnings(suppressMessages(library(dplyr)))
suppressWarnings(suppressMessages(library(tm)))
suppressWarnings(suppressMessages(library(stringr)))
suppressWarnings(suppressMessages(library(RCurl)))
Set the working directory to the file location which houses my spam samples
setwd("~/Google Drive/CUNY SPRING 19/COURSES/data 607/ASSIGNMENTS/607 proj4/spam")
Define the path to the location of my spam samples
spam.path <- "/Users/cayre/Google Drive/CUNY SPRING 19/COURSES/data 607/ASSIGNMENTS/607 proj4/spam/"
# credit https://view.officeapps.live.com/op/view.aspx?src=https%3A%2F%2Fqualityandinnovation.files.wordpress.com%2F2012%2F09%2Ftext-analysis-75-925.doc
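get.msg <- function(path) {
  # Open the file in text mode and read every line
  con <- file(path, open="rt")
  text <- readLines(con)
  # The body starts after the first blank line, which ends the headers
  msg <- text[seq(which(text=="")[1]+1, length(text))]
  close(con)
  # Return the body as a single newline-separated string
  return(paste(msg, collapse="\n"))
}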
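As a minimal sketch of what get.msg() does, the base R snippet below writes a made-up message to a temporary file and keeps only the lines after the first blank line. The message text and the variable names (raw, tmp, body) are hypothetical, and returning an empty body when no blank line is found is an assumption that the original function does not make.

```r
# Hypothetical message: two header lines, a blank line, then the body.
raw <- c("From: someone@example.com",
         "Subject: hello",
         "",
         "First body line",
         "Second body line")

tmp <- tempfile(fileext = ".txt")
writeLines(raw, tmp)

# Same idea as get.msg(): drop everything up to the first empty line.
lines <- readLines(tmp)
first_blank <- which(lines == "")[1]

# Assumption: a message with no blank line is treated as having no body.
body_lines <- if (is.na(first_blank)) character(0) else lines[-seq_len(first_blank)]
body <- paste(body_lines, collapse = "\n")

unlink(tmp)
cat(body, "\n")
```

Using a negative index avoids the backwards sequence that seq(a + 1, b) produces when the blank line is the last line of the file.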
Sys.setlocale('LC_ALL','C')
## [1] "C/C/C/C/C/en_US.UTF-8"
spam.docs <- dir(spam.path)
spam.docs <- spam.docs[which(spam.docs!="cmds")]
all.spam <- sapply(spam.docs, function(p)get.msg(paste(spam.path,p,sep="")))
spam_list <- do.call(rbind,lapply(all.spam, read_file))
spam_df <- data.frame(emails=sample(spam_list, 500, replace=FALSE))
spamtestdata <-data.frame(rep(NA, 400))
spamholddata <-data.frame(rep(NA, 100))
spamtestdata$emails <- spam_df$emails[-(401:500)]
spamholddata$emails <- spam_df$emails[-(1:400)]
Set the working directory to the file location which houses my ham samples
setwd("~/Google Drive/CUNY SPRING 19/COURSES/data 607/ASSIGNMENTS/607 proj4/ham")
Define the path to the location of my ham samples
ham.path <- "/Users/cayre/Google Drive/CUNY SPRING 19/COURSES/data 607/ASSIGNMENTS/607 proj4/ham/"
Create data frames to test/withhold
ham.docs <- dir(ham.path)
ham.docs <- ham.docs[which(ham.docs!="cmds")]
all.ham <- sapply(ham.docs, function(p)get.msg(paste(ham.path,p,sep="")))
ham_list <- do.call(rbind,lapply(all.ham, read_file))
ham_df <- data.frame(emails=sample(ham_list, 2551, replace=FALSE))
hamtestdata <-data.frame(rep(NA, 2449))
hamtestdata$emails <- ham_df$emails[-(2450:2551)]
hamwithholddata <- data.frame(rep(NA, 102))
hamwithholddata$emails <- ham_df$emails[-(1:2449)]
hamtestdata$emails <- as.character(hamtestdata$emails)
Tidying the data
Tidying Spam and Ham Emails
Words are separated, stop words are removed, only words 3 letters and longer are kept, and words are counted
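spamtestdata$emails <- as.character(spamtestdata$emails)
# Splitting on runs of word characters yields roughly one piece per word,
# so the number of pieces approximates the word count of each email
wordnumspam <- vapply(strsplit(spamtestdata$emails, "\\w+"), length, integer(1))
summary(wordnumspam)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    14.0   159.8   337.0   623.3   794.0  5111.0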
spamtidy_df <- spamtestdata %>%
  unnest_tokens(word, emails) %>%
  anti_join(stop_words) %>%
  filter(str_detect(word, "[[:alpha:]]{3,}"))
## Joining, by = "word"
spamwords <- spamtidy_df %>%
  count(word, sort=TRUE)
spamwords
## # A tibble: 13,535 x 2
## word n
## <chr> <int>
## 1 font 8201
## 2 size 3196
## 3 nbsp 2665
## 4 color 2376
## 5 width 2150
## 6 http 1877
## 7 align 1569
## 8 arial 1510
## 9 center 1171
## 10 table 1069
## # ... with 13,525 more rows
hamtidy_df <- hamtestdata %>%
  unnest_tokens(word, emails) %>%
  anti_join(stop_words) %>%
  filter(str_detect(word, "[[:alpha:]]{3,}"))
## Joining, by = "word"
hamtidy_df %>%
  count(word, sort=TRUE)
## # A tibble: 27,061 x 2
## word n
## <chr> <int>
## 1 http 3901
## 2 list 2190
## 3 rpm 1196
## 4 listinfo 984
## 5 spamassassin 973
## 6 exmh 938
## 7 wrote 922
## 8 time 911
## 9 people 902
## 10 users 898
## # ... with 27,051 more rows
Spam emails tend to have more words than ham emails. The spam emails have a median length of 337 words (see the summary above), while ham emails have a median length of 144 words.
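The tidying steps used in this section (split into words, drop stop words, keep words of three or more letters, count) can be sketched in base R without tidytext. The two sample strings and the short stops vector are made up for illustration; the report itself uses the much larger stop_words table from tidytext.

```r
# Toy input; in the report the text comes from spamtestdata$emails.
emails <- c("Click here for FREE font sizes and free offers!",
            "The list of users wrote to the list on time.")

# Hypothetical mini stop-word list standing in for tidytext::stop_words.
stops <- c("the", "for", "and", "of", "to", "on", "here")

# Lower-case the text and split on anything that is not a letter.
tokens <- unlist(strsplit(tolower(emails), "[^a-z]+"))

# Keep tokens of at least three letters that are not stop words.
tokens <- tokens[nchar(tokens) >= 3 & !(tokens %in% stops)]

# Count occurrences, most frequent first.
word_counts <- sort(table(tokens), decreasing = TRUE)
print(word_counts)
```

Because every token here consists only of letters, the nchar() check is equivalent to the "[[:alpha:]]{3,}" filter used in the report.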
Finding Sentiment of Spam and Ham Emails
# count() returns the sentiment rows in alphabetical order,
# so n[1] is the negative count and n[2] is the positive count
spamsentiment <- spamtidy_df %>%
  inner_join(get_sentiments("bing")) %>%
  count(sentiment)
## Joining, by = "word"
spamsentimentpercentage <- (spamsentiment$n[2]-spamsentiment$n[1])/(spamsentiment$n[2]+spamsentiment$n[1])
spamsentimentpercentage
## [1] 0.4235439
hamsentiment <- hamtidy_df %>%
  inner_join(get_sentiments("bing")) %>%
  count(sentiment)
## Joining, by = "word"
hamsentimentpercentage <- (hamsentiment$n[2]-hamsentiment$n[1])/(hamsentiment$n[2]+hamsentiment$n[1])
hamsentimentpercentage
## [1] -0.1073637
Spam emails tend to be positive while ham emails tend to be negative. The score computed above is the number of positive words minus the number of negative words, divided by the total number of sentiment words. For spam emails the score is about 42%; for ham emails it is about -11%. The negative sign indicates that the sentiment words in ham emails are more often negative than positive.
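To make the score concrete, here is a small sketch of the same ratio as a function. The name net_sentiment and the example counts are hypothetical, and returning NA when there are no sentiment words is an assumption added for safety.

```r
# Net sentiment: (positive - negative) / (positive + negative).
net_sentiment <- function(n_positive, n_negative) {
  total <- n_positive + n_negative
  # Assumption: with no sentiment words at all the score is undefined.
  if (total == 0) return(NA_real_)
  (n_positive - n_negative) / total
}

# Made-up counts: 70 positive and 30 negative words.
mostly_positive <- net_sentiment(70, 30)
# Made-up counts: 40 positive and 60 negative words.
mostly_negative <- net_sentiment(40, 60)

print(c(mostly_positive, mostly_negative))
```

The score ranges from -1 (only negative words) to 1 (only positive words), with 0 meaning the two are balanced.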
Predicting whether an Email is Spam or Ham
To predict whether an email will be spam or ham, the number of words in the email will be calculated and the sentiment of the email will be determined. The data used for testing are emails from the original collection of spam and ham emails that were withheld from the previous analysis. The most definitive way to determine whether an email is spam or ham is based on the sentiment analysis. I chose to test whether the percentage of positive words minus the percentage of negative words is greater than 0.25. If so, the email is categorized as spam. If not, the email undergoes another check based on its length: if it is less than 400 words, it is classified as ham. Otherwise it is classified as spam.
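The two-step rule described above can be sketched as a standalone function. The name classify_email and its argument names are hypothetical; the default cutoffs (0.25 and 400) are the values used in the code in this report. Sending a missing sentiment score to the length check is an assumption of this sketch, not the behavior of the loops in this report.

```r
# Two-step rule: sentiment first, then length.
classify_email <- function(sentiment, n_words,
                           sentiment_cutoff = 0.25, length_cutoff = 400) {
  if (!is.na(sentiment) && sentiment > sentiment_cutoff) {
    "spam"   # strongly positive wording
  } else if (n_words < length_cutoff) {
    "ham"    # not strongly positive and short
  } else {
    "spam"   # not strongly positive but long
  }
}

# Made-up inputs covering each branch.
results <- c(classify_email(0.40, 1000),
             classify_email(0.10, 200),
             classify_email(0.10, 800),
             classify_email(NA, 100))
print(results)
```

Using if/else rather than ifelse() keeps the rule to one scalar decision per email and makes the handling of a missing score explicit.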
decision <- list()
for (i in 1:length(spamholddata$rep.NA..100.)){
  # Wrap the i-th withheld spam email in a one-row data frame
  unknown <- data.frame(rep(NA, 1))
  unknown$emails <- spamholddata$emails[i]
  unknown$emails <- as.character(unknown$emails)
  # Tokenize, drop stop words, keep words with at least three letters
  tidy_df <- unknown %>%
    unnest_tokens(word, emails) %>%
    anti_join(stop_words) %>%
    filter(str_detect(word, "[[:alpha:]]{3,}"))
  # Approximate word count: number of spaces plus one
  wordnum <- sum(sapply(gregexpr(" ", spamholddata$emails[i]), length)+1)
  # With both classes present, n[1] is negative and n[2] is positive
  unknownsentiment <- tidy_df %>%
    inner_join(get_sentiments("bing")) %>%
    count(sentiment)
  sentimentpercentage <- (unknownsentiment$n[2]-unknownsentiment$n[1])/(unknownsentiment$n[2]+unknownsentiment$n[1])
  # If either class is missing the score is NA; ifelse() then evaluates
  # neither branch, so no decision is recorded for that email
  ifelse (sentimentpercentage > .25, decision<-c(decision, "spam"), {
    ifelse (wordnum <400, decision <- c(decision,"ham"), decision<-c(decision, "spam"))
  })
}
## Joining, by = "word"
length(decision[decision=="spam"])/length(decision)
## [1] 0.6933333
decisionh <- list()
for (i in 1:length(hamwithholddata$rep.NA..102.)){
  # Wrap the i-th withheld ham email in a one-row data frame
  unknown <- data.frame(rep(NA, 1))
  unknown$emails <- hamwithholddata$emails[i]
  unknown$emails <- as.character(unknown$emails)
  # Tokenize, drop stop words, keep words with at least three letters
  tidy_df <- unknown %>%
    unnest_tokens(word, emails) %>%
    anti_join(stop_words) %>%
    filter(str_detect(word, "[[:alpha:]]{3,}"))
  # Approximate word count: number of spaces plus one
  wordnum <- sum(sapply(gregexpr(" ", hamwithholddata$emails[i]), length)+1)
  # With both classes present, n[1] is negative and n[2] is positive
  unknownsentiment <- tidy_df %>%
    inner_join(get_sentiments("bing")) %>%
    count(sentiment)
  sentimentpercentage <- (unknownsentiment$n[2]-unknownsentiment$n[1])/(unknownsentiment$n[2]+unknownsentiment$n[1])
  # If either class is missing the score is NA; ifelse() then evaluates
  # neither branch, so no decision is recorded for that email
  ifelse (sentimentpercentage > .25, decisionh<-c(decisionh, "spam"), {
    ifelse (wordnum < 400, decisionh <- c(decisionh,"ham"), decisionh<-c(decisionh, "spam"))
  })
}
## Joining, by = "word"
length(decisionh[decisionh=="ham"])/length(decisionh)
## [1] 0.7407407
In the run shown above, the algorithm correctly identified spam emails about 69% of the time and correctly identified ham emails about 74% of the time. The greater the accuracy I am able to get in predicting one type of email, the lower the accuracy I am able to get in predicting the other type of email.
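As a rough sketch of how the two accuracy figures relate, the per-class hit rate can be computed from a vector of true labels and a vector of predictions. The eight labels below are invented purely for illustration and are not results from this report.

```r
# Made-up labels: four spam and four ham emails.
truth     <- c("spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham")
predicted <- c("spam", "spam", "spam", "ham",  "ham", "ham", "spam", "spam")

# Share of correct predictions within each true class.
per_class <- tapply(predicted == truth, truth, mean)

# Share of correct predictions over all emails.
overall <- mean(predicted == truth)

print(per_class)
print(overall)
```

Reporting the two classes side by side makes the trade-off visible: a rule that labels more emails as spam raises the spam hit rate and lowers the ham hit rate.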