INTRODUCTION:
I downloaded two folders from the public corpus directory (ham and spam) and unzipped them. The task was to start from this ham/spam data set and build a model that can predict the class of new documents. I loaded the emails into their respective ham and spam data frames, adding a spam column to each: 1 for every row of the spam data frame and 0 for every row of the ham data frame. I then merged the two data frames into a single data frame for prediction. Next, I removed sparse words, punctuation, white space, and other symbols which are not needed and which slow processing. I shuffled the spam and ham emails in the data frame to improve the efficacy of the model I was about to build, then split the data set into a training set (65%) and a testing set (35%). Finally, I fit a Gaussian generalized linear model, classifying an email as spam when its predicted response exceeded 0.6, with the goal of an accuracy above 60% on both the training and testing sets.
Loading the needed libraries
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.2 --
## v ggplot2 3.3.6 v purrr 0.3.4
## v tibble 3.1.8 v dplyr 1.0.9
## v tidyr 1.2.0 v stringr 1.4.1
## v readr 2.1.2 v forcats 0.5.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(tm)
## Warning: package 'tm' was built under R version 4.2.2
## Loading required package: NLP
##
## Attaching package: 'NLP'
##
## The following object is masked from 'package:ggplot2':
##
## annotate
library(magrittr)
##
## Attaching package: 'magrittr'
##
## The following object is masked from 'package:purrr':
##
## set_names
##
## The following object is masked from 'package:tidyr':
##
## extract
library(data.table)
##
## Attaching package: 'data.table'
##
## The following objects are masked from 'package:dplyr':
##
## between, first, last
##
## The following object is masked from 'package:purrr':
##
## transpose
library(e1071)
## Warning: package 'e1071' was built under R version 4.2.2
library(caret)
## Warning: package 'caret' was built under R version 4.2.2
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(caTools)
## Warning: package 'caTools' was built under R version 4.2.2
library(prediction)
## Warning: package 'prediction' was built under R version 4.2.2
Retrieving the spam and ham folders and loading them into their respective data frames. All emails in the spam folder are labeled 1 and all emails in the ham folder are labeled 0. We display the Email, the Email Body, the class of the email (“Ham” for emails in the ham folder and “Spam” for emails in the spam folder), and a spam column which, depending on the folder, contains 0 (ham folder) or 1 (spam folder).
#directory where all the ham emails are stored
ham_email_directory = "C:/Users/Staff/Dropbox/MS in Data Science/DATA 607 - Data Acquisition and Management/Week 12 - Project 4 (Document Classification)/easy_ham/"
#Adding all of the emails in the ham directory to a list
ham_list <- list.files(path = ham_email_directory,full.names = TRUE)
#Create ham data frame
ham_email_df = list.files(ham_email_directory) %>%
  as.data.frame() %>%
  set_colnames("Email") %>%
  mutate(EmailBody = lapply(ham_list, read_lines)) %>%
  unnest(c(EmailBody)) %>%
  mutate(class = "Ham",
         spam = 0) %>%
  group_by(Email) %>%
  mutate(EmailBody = paste(EmailBody, collapse = " ")) %>%
  ungroup() %>%
  distinct()
#directory where all the spam emails are stored
spam_email_directory = "C:/Users/Staff/Dropbox/MS in Data Science/DATA 607 - Data Acquisition and Management/Week 12 - Project 4 (Document Classification)/spam_2/"
#Adding all of the emails in the spam directory to a list
spam_list <- list.files(path = spam_email_directory,full.names = TRUE)
#Create spam data frame
spam_email_df = list.files(spam_email_directory) %>%
  as.data.frame() %>%
  set_colnames("Email") %>%
  mutate(EmailBody = lapply(spam_list, read_lines)) %>%
  unnest(c(EmailBody)) %>%
  mutate(class = "Spam",
         spam = 1) %>%
  group_by(Email) %>%
  mutate(EmailBody = paste(EmailBody, collapse = " ")) %>%
  ungroup() %>%
  distinct()
## Warning: One or more parsing issues, see `problems()` for details
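The ham and spam pipelines above are identical apart from the folder and the labels, so the duplication could be factored into a helper function. A minimal sketch, assuming the same directories as above (the function name load_email_df is my own, hypothetical):
# hypothetical helper: read every file in a folder into one data frame,
# collapsing each email's lines into a single EmailBody string
load_email_df <- function(directory, class_label, spam_flag) {
  files <- list.files(path = directory, full.names = TRUE)
  data.frame(Email = basename(files)) %>%
    mutate(EmailBody = sapply(files, function(f) paste(read_lines(f), collapse = " ")),
           class = class_label,
           spam = spam_flag)
}
# usage: ham_email_df  <- load_email_df(ham_email_directory, "Ham", 0)
#        spam_email_df <- load_email_df(spam_email_directory, "Spam", 1)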
I will now combine the ham data frame and the spam data frame into one called “Ham_Spam_df” in order to build a classifier that determines whether an email is “spam” or “ham”.
# combining both the "Ham" and "Spam" data frames
Ham_Spam_df <- rbind(ham_email_df,spam_email_df)
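Before cleaning, a quick sanity check (a sketch; output omitted) tabulates the class labels to confirm both ham and spam emails made it into the combined data frame:
# how many ham vs. spam emails are in the combined data frame?
table(Ham_Spam_df$class)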
I will now tidy the EmailBody column of the Ham_Spam_df data frame to remove sparse words, unnecessary punctuation, white space, etc. This will make the data set quicker to process when we build the model.
#virtual corpus
virtual_corpus = VCorpus(VectorSource(Ham_Spam_df$EmailBody))
#convert text to plain document
virtual_corpus = tm_map(virtual_corpus, PlainTextDocument)
#remove unnecessary punctuation
virtual_corpus= tm_map(virtual_corpus, removePunctuation)
#collapse repeated white space into single spaces
virtual_corpus= tm_map(virtual_corpus, stripWhitespace)
# To remove the words which appear least often, first convert the corpus into a document-term matrix
document_term_matrix = DocumentTermMatrix(virtual_corpus)
#remove terms with sparsity above 0.90, i.e. keep only terms appearing in at least 10% of documents
document_term_matrix = removeSparseTerms(document_term_matrix, 0.90)
#convert the reduced document-term matrix into a data frame
EmailBody_new = as.data.frame(as.matrix(document_term_matrix))
#Adding the spam column (1 if spam, 0 if ham) to the EmailBody_new data frame
EmailBody_new$spam = Ham_Spam_df$spam
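The cleaning above keeps capitalization and common stop words. As an optional sketch of further shrinking the document-term matrix (not applied in this project), tm can also lowercase the text and drop English stop words before the matrix is built:
# optional extra cleaning: lowercase all text and remove common English stop words
virtual_corpus = tm_map(virtual_corpus, content_transformer(tolower))
virtual_corpus = tm_map(virtual_corpus, removeWords, stopwords("en"))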
I will now create a training set and a test set from the combined data set. We will fit the model on the training set and evaluate it by making predictions on the held-out test set, which gives a fairer estimate of how the model generalizes to new emails.
# mix the spam and ham emails for better efficacy of the prediction model
EmailBody_new <- EmailBody_new[sample(1:nrow(EmailBody_new)),]
# I will split the data set, 65% training set and 35% test set
split = sample.split(EmailBody_new$spam, SplitRatio = 0.65)
train = subset(EmailBody_new, split == TRUE)
test = subset(EmailBody_new, split == FALSE)
#viewing the number of rows in the test and training sets
nrow(test)
## [1] 1364
nrow(train)
## [1] 2534
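Note that both the shuffle and sample.split draw random numbers, so the exact rows assigned to each set can vary between runs. A minimal sketch for a reproducible split (the seed value 123 is arbitrary, my own choice):
set.seed(123) # fix the RNG so the shuffle and split are repeatable
EmailBody_new <- EmailBody_new[sample(1:nrow(EmailBody_new)), ]
split = sample.split(EmailBody_new$spam, SplitRatio = 0.65)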
I will fit a Gaussian generalized linear model (equivalent to ordinary least squares on the 0/1 spam indicator) and classify an email as spam when its predicted response exceeds 0.6. The goal is for both the training and testing accuracy to exceed 60%.
spam_predict = glm(spam~., data = train, family = "gaussian")
# predicting the training set with the fitted model
training_prediction = predict(spam_predict, type = "response")
# confusion table: rows are the actual classes, columns whether the prediction exceeds the 0.6 cutoff
table(train$spam, training_prediction > 0.6)
##
## FALSE TRUE
## 0 1626 0
## 1 21 887
# predicting the testing set with the model fitted on the training set
testing_prediction = predict(spam_predict, newdata = test, type = "response")
table(test$spam, testing_prediction > 0.6)
##
## FALSE TRUE
## 0 875 0
## 1 10 479
#Displaying the accuracy of the training set model: accuracy = (correct predictions / number of rows in set) * 100
((1626 + 887)/2534) * 100
## [1] 99.17127
#Displaying the accuracy of the testing set model: accuracy = (correct predictions / number of rows in set) * 100
((875 + 479)/1364) * 100
## [1] 99.26686
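As a cross-check on the hand computation above, the caret package (already loaded) can summarize the same confusion matrix along with sensitivity and specificity. A minimal sketch for the test set, assuming testing_prediction from the previous chunk:
# build factors with matching levels, then let caret report accuracy and related metrics
predicted_class <- factor(as.integer(testing_prediction > 0.6), levels = c(0, 1))
actual_class <- factor(test$spam, levels = c(0, 1))
confusionMatrix(predicted_class, actual_class, positive = "1")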
CONCLUSION:
Based on the confusion tables and accuracy results above, my training model classifies emails correctly about 99.17% of the time and my testing model about 99.27% of the time, comfortably above the 60% goal. In both sets the only errors are false negatives (spam emails predicted as ham); no ham email was flagged as spam.