DATA-607 Project 4

Chris Ayre

April 10th, 2019

Assignment

It can be useful to be able to classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.

For this project, you can start with a spam/ham dataset, then predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder).

Load the required packages

suppressWarnings(suppressMessages(library(readr)))
suppressWarnings(suppressMessages(library(knitr)))
suppressWarnings(suppressMessages(library(tidytext)))
suppressWarnings(suppressMessages(library(tidyr)))
suppressWarnings(suppressMessages(library(dplyr)))
suppressWarnings(suppressMessages(library(tm)))
suppressWarnings(suppressMessages(library(stringr)))
suppressWarnings(suppressMessages(library(RCurl)))

Set the working directory to the folder that houses my spam samples

setwd("~/Google Drive/CUNY SPRING 19/COURSES/data 607/ASSIGNMENTS/607 proj4/spam")

Define the path to the location of my spam samples

spam.path <- "/Users/cayre/Google Drive/CUNY SPRING 19/COURSES/data 607/ASSIGNMENTS/607 proj4/spam/"
# credit https://view.officeapps.live.com/op/view.aspx?src=https%3A%2F%2Fqualityandinnovation.files.wordpress.com%2F2012%2F09%2Ftext-analysis-75-925.doc

get.msg <- function(path) {
   # Read a raw e-mail file and return only the message body,
   # i.e. everything after the first blank line that ends the header block
   con <- file(path, open="rt")
   text <- readLines(con)
   msg <- text[seq(which(text=="")[1]+1, length(text))]
   close(con)
   return(paste(msg, collapse="\n"))
}

Sys.setlocale('LC_ALL','C')
## [1] "C/C/C/C/C/en_US.UTF-8"
spam.docs <- dir(spam.path)
spam.docs <- spam.docs[which(spam.docs!="cmds")]
all.spam <- sapply(spam.docs, function(p)get.msg(paste(spam.path,p,sep="")))

spam_list <- do.call(rbind,lapply(all.spam, read_file))
spam_df <- data.frame(emails=sample(spam_list, 500, replace=FALSE))



# Split the 500 sampled spam e-mails: 400 for analysis, 100 withheld for prediction
spamtestdata <- data.frame(rep(NA, 400))
spamholddata <- data.frame(rep(NA, 100))

spamtestdata$emails <- spam_df$emails[-(401:500)]   # first 400 e-mails
spamholddata$emails <- spam_df$emails[-(1:400)]     # last 100 e-mails
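
Because sample() is called above without a seed, the 400/100 split changes every time the document is knit. A minimal sketch of a seeded, index-based split (same spam_df as above; spamtestdata2 and spamholddata2 are hypothetical names and are not used later):

set.seed(607)                                      # fix the random draw so the split is repeatable
test_idx <- sample(seq_len(nrow(spam_df)), 400)    # pick 400 rows for analysis
spamtestdata2 <- spam_df[test_idx, , drop=FALSE]   # analysis set
spamholddata2 <- spam_df[-test_idx, , drop=FALSE]  # withheld 100 rows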

Set the working directory to the folder that houses my ham samples

setwd("~/Google Drive/CUNY SPRING 19/COURSES/data 607/ASSIGNMENTS/607 proj4/ham")

Define the path to the location of my ham samples

ham.path <- "/Users/cayre/Google Drive/CUNY SPRING 19/COURSES/data 607/ASSIGNMENTS/607 proj4/ham/"

Create data frames to test/withhold

ham.docs <- dir(ham.path)
ham.docs <- ham.docs[which(ham.docs!="cmds")]
all.ham <- sapply(ham.docs, function(p)get.msg(paste(ham.path,p,sep="")))

ham_list <- do.call(rbind,lapply(all.ham, read_file))

ham_df <- data.frame(emails=sample(ham_list, 2551, replace=FALSE))

# Split the 2551 sampled ham e-mails: 2449 for analysis, 102 withheld for prediction
hamtestdata <- data.frame(rep(NA, 2449))
hamtestdata$emails <- ham_df$emails[-(2450:2551)]   # first 2449 e-mails

hamwithholddata <- data.frame(rep(NA, 102))
hamwithholddata$emails <- ham_df$emails[-(1:2449)]  # last 102 e-mails

hamtestdata$emails <- as.character(hamtestdata$emails)

Tidying the data

Tidying Spam and Ham Emails

The emails are tokenized into words, stop words are removed, only words containing at least three consecutive letters are kept, and the remaining words are counted.

spamtestdata$emails <- as.character(spamtestdata$emails)

# Approximate word count of each spam e-mail (split on runs of word characters)
wordnumspam <- vapply(strsplit(spamtestdata$emails, "\\w+"), length, integer(1))
summary(wordnumspam)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    14.0   159.8   337.0   623.3   794.0  5111.0
spamtidy_df <- spamtestdata %>% 
  unnest_tokens(word, emails) %>%
  anti_join(stop_words) %>%
  filter(str_detect(word, "[[:alpha:]]{3,}"))
## Joining, by = "word"
spamwords <- spamtidy_df %>% 
  count(word, sort=TRUE)
spamwords
## # A tibble: 13,535 x 2
##    word       n
##    <chr>  <int>
##  1 font    8201
##  2 size    3196
##  3 nbsp    2665
##  4 color   2376
##  5 width   2150
##  6 http    1877
##  7 align   1569
##  8 arial   1510
##  9 center  1171
## 10 table   1069
## # ... with 13,525 more rows
hamtidy_df <- hamtestdata %>% 
  unnest_tokens(word, emails) %>%
  anti_join(stop_words) %>%
  filter(str_detect(word, "[[:alpha:]]{3,}"))
## Joining, by = "word"
hamtidy_df %>%
  count(word, sort=TRUE) 
## # A tibble: 27,061 x 2
##    word             n
##    <chr>        <int>
##  1 http          3901
##  2 list          2190
##  3 rpm           1196
##  4 listinfo       984
##  5 spamassassin   973
##  6 exmh           938
##  7 wrote          922
##  8 time           911
##  9 people         902
## 10 users          898
## # ... with 27,051 more rows

Spam emails tend to have more words than ham emails: the spam emails have a median word count of 337 (see the summary above), while the ham emails have a median word count of 144.
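
For comparison, the ham word counts can be summarized the same way (a sketch using the same split-on-word-characters approximation as wordnumspam above; the ham median quoted here comes from this summary):

# Approximate word count of each ham e-mail
wordnumham <- vapply(strsplit(hamtestdata$emails, "\\w+"), length, integer(1))
summary(wordnumham)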

Finding Sentiment of Spam and Ham Emails

spamsentiment <- spamtidy_df %>%
  inner_join(get_sentiments("bing")) %>%
  count(sentiment) 
## Joining, by = "word"
spamsentimentpercentage <- (spamsentiment$n[2]-spamsentiment$n[1])/(spamsentiment$n[2]+spamsentiment$n[1])
spamsentimentpercentage
## [1] 0.4235439
hamsentiment <- hamtidy_df %>%
  inner_join(get_sentiments("bing")) %>%
  count(sentiment) 
## Joining, by = "word"
hamsentimentpercentage <- (hamsentiment$n[2]-hamsentiment$n[1])/(hamsentiment$n[2]+hamsentiment$n[1])
hamsentimentpercentage
## [1] -0.1073637

Spam emails tend to be positive while ham emails tend to be negative. The net sentiment score, (positive - negative) / (positive + negative), is about 0.42 for the spam emails and about -0.11 for the ham emails; the negative sign indicates that a ham email is more likely to contain negative words than positive ones.
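
The same net-sentiment formula is computed twice above; a small helper function (a sketch only, not part of the original analysis) makes it explicit and reusable:

# Net sentiment of a tidy one-word-per-row data frame, using the bing lexicon:
# (number of positive words - number of negative words) / (total scored words)
net_sentiment <- function(tidy_words) {
  counts <- tidy_words %>%
    inner_join(get_sentiments("bing"), by = "word") %>%
    count(sentiment)
  pos <- sum(counts$n[counts$sentiment == "positive"])
  neg <- sum(counts$n[counts$sentiment == "negative"])
  (pos - neg) / (pos + neg)
}

net_sentiment(spamtidy_df)   # should match spamsentimentpercentage above
net_sentiment(hamtidy_df)    # should match hamsentimentpercentage above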

Predicting whether an Email is Spam or Ham

To predict whether an email is spam or ham, the number of words in the email is calculated and the sentiment of the email is determined. The test data are the documents withheld from the original collection of spam and ham emails used in the previous analysis. The most discriminating signal is the sentiment analysis: if the net sentiment score (the share of positive words minus the share of negative words, out of all scored words) is greater than 0.25, the email is classified as spam. Otherwise the email is checked by length: if it is shorter than 400 words it is classified as ham, and otherwise as spam.
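
That decision rule can be written as a single function (a sketch only; classify_email is a hypothetical helper and is not used by the loops below, which implement the same logic inline):

# Hypothetical helper expressing the decision rule:
# net sentiment > 0.25 -> spam; otherwise < 400 words -> ham, else spam
classify_email <- function(email_text) {
  email_text <- as.character(email_text)

  tidy_words <- data.frame(emails = email_text, stringsAsFactors = FALSE) %>%
    unnest_tokens(word, emails) %>%
    anti_join(stop_words, by = "word") %>%
    filter(str_detect(word, "[[:alpha:]]{3,}"))

  counts <- tidy_words %>%
    inner_join(get_sentiments("bing"), by = "word") %>%
    count(sentiment)
  pos <- sum(counts$n[counts$sentiment == "positive"])
  neg <- sum(counts$n[counts$sentiment == "negative"])
  net <- (pos - neg) / (pos + neg)

  wordnum <- sum(sapply(gregexpr(" ", email_text), length) + 1)  # rough word count

  if (!is.na(net) && net > 0.25) {
    "spam"
  } else if (wordnum < 400) {
    "ham"
  } else {
    "spam"
  }
}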

# Classify each withheld spam e-mail with the sentiment and word-count rule
decision <- list()
for (i in 1:length(spamholddata$rep.NA..100.)){
  unknown <- data.frame(rep(NA, 1))
  unknown$emails <- as.character(spamholddata$emails[i])

  # Tokenize, drop stop words, and keep words with at least three consecutive letters
  tidy_df <- unknown %>% 
    unnest_tokens(word, emails) %>%
    anti_join(stop_words) %>%
    filter(str_detect(word, "[[:alpha:]]{3,}"))

  # Rough word count of the raw e-mail
  wordnum <- sum(sapply(gregexpr(" ", spamholddata$emails[i]), length)+1)

  # Count positive and negative words using the bing lexicon
  unknownsentiment <- tidy_df %>%
    inner_join(get_sentiments("bing")) %>%
    count(sentiment)

  sentimentpercentage <- (unknownsentiment$n[2]-unknownsentiment$n[1])/(unknownsentiment$n[2]+unknownsentiment$n[1])

  # Net sentiment above 0.25 -> spam; otherwise fewer than 400 words -> ham, else spam
  ifelse (sentimentpercentage > .25, decision<-c(decision, "spam"), {
   ifelse (wordnum <400, decision <- c(decision,"ham"), decision<-c(decision, "spam"))
  })
}
## Joining, by = "word"
# Fraction of withheld spam e-mails classified as spam
length(decision[decision=="spam"])/length(decision)
## [1] 0.6933333
# Classify each withheld ham e-mail with the same rule
decisionh <- list()
for (i in 1:length(hamwithholddata$rep.NA..102.)){
  unknown <- data.frame(rep(NA, 1))
  unknown$emails <- as.character(hamwithholddata$emails[i])

  # Tokenize, drop stop words, and keep words with at least three consecutive letters
  tidy_df <- unknown %>% 
    unnest_tokens(word, emails) %>%
    anti_join(stop_words) %>%
    filter(str_detect(word, "[[:alpha:]]{3,}"))

  # Rough word count of the raw e-mail
  wordnum <- sum(sapply(gregexpr(" ", hamwithholddata$emails[i]), length)+1)

  # Count positive and negative words using the bing lexicon
  unknownsentiment <- tidy_df %>%
    inner_join(get_sentiments("bing")) %>%
    count(sentiment)

  sentimentpercentage <- (unknownsentiment$n[2]-unknownsentiment$n[1])/(unknownsentiment$n[2]+unknownsentiment$n[1])

  # Net sentiment above 0.25 -> spam; otherwise fewer than 400 words -> ham, else spam
  ifelse (sentimentpercentage > .25, decisionh<-c(decisionh, "spam"), {
   ifelse (wordnum < 400, decisionh <- c(decisionh,"ham"), decisionh<-c(decisionh, "spam"))
  })
}
## Joining, by = "word"
# Fraction of withheld ham e-mails classified as ham
length(decisionh[decisionh=="ham"])/length(decisionh)
## [1] 0.7407407

Based on the results above, the rule correctly identified about 69% of the withheld spam emails and about 74% of the withheld ham emails. There is a clear trade-off: tuning the thresholds to raise the accuracy on one class tends to lower the accuracy on the other.
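
A small summary table (a sketch, assuming the decision and decisionh lists produced above) collects both hold-out results in one place:

# Hypothetical summary of the two hold-out runs
results <- data.frame(
  class    = c("spam", "ham"),
  n_scored = c(length(decision), length(decisionh)),
  accuracy = c(mean(unlist(decision) == "spam"),
               mean(unlist(decisionh) == "ham"))
)
results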