As a group, we worked with two sets of files, one containing spam and one containing ham, to predict whether a document is spam. Using our ‘training’ documents, we were able to classify the ‘test’ documents. We communicated via Zoom meetings and collaborated through GitHub.
For this project, we started with a spam/ham dataset and then predicted the class of new documents (either withheld from the training dataset or taken from another source, such as a personal spam folder). We were provided with the corpus (https://spamassassin.apache.org/old/publiccorpus/) and instructions on how to download the ham and spam files.
We started by loading our libraries, including tidyverse, tm, and magrittr. The first step was to store the paths to the corpus directories in ‘spam_folder’ and ‘ham_folder’. To check the contents, we called ‘list.files’ on ‘spam_folder’, which returns a character vector of the names of the files or directories in the named directory. We then converted that vector into a data frame and set its column name. Finally, we used ‘lapply’ to read each file: it takes a list, vector, or data frame as input, applies a function to every element, and returns a list of the same length as the input.
library(tidyverse)
library(tm)
## Warning: package 'tm' was built under R version 4.0.3
## Warning: package 'NLP' was built under R version 4.0.3
library(magrittr)
library(qdap)
## Warning: package 'qdap' was built under R version 4.0.3
## Warning: package 'qdapDictionaries' was built under R version 4.0.3
## Warning: package 'qdapRegex' was built under R version 4.0.3
## Warning: package 'qdapTools' was built under R version 4.0.3
library(data.table)
library(e1071)
## Warning: package 'e1071' was built under R version 4.0.3
library(caret)
## Warning: package 'caret' was built under R version 4.0.3
We used two folders for our project: spam and easy_ham_2.
spam_folder <- 'C:/data/607project4/spam/'
ham_folder <- 'C:/data/607project4/easy_ham_2/'
length(list.files(path = spam_folder))
## [1] 500
spam_files <- list.files(path = spam_folder, full.names = TRUE)
ham_files <- list.files(path = ham_folder, full.names = TRUE)
# Build one row per spam message: file name, full text, and class labels
spam <- list.files(path = spam_folder) %>%
  as.data.frame() %>%
  set_colnames("file") %>%
  mutate(text = lapply(spam_files, read_lines)) %>%  # list column: one character vector per file
  unnest(c(text)) %>%                                # one row per line of each file
  mutate(class = "spam",
         spam = 1) %>%
  group_by(file) %>%
  mutate(text = paste(text, collapse = " ")) %>%     # rejoin each file's lines into one string
  ungroup() %>%
  distinct()                                         # keep one row per file
# Same steps for the ham messages
ham <- list.files(path = ham_folder) %>%
  as.data.frame() %>%
  set_colnames("file") %>%
  mutate(text = lapply(ham_files, read_lines)) %>%
  unnest(c(text)) %>%
  mutate(class = "ham",
         spam = 0) %>%
  group_by(file) %>%
  mutate(text = paste(text, collapse = " ")) %>%
  ungroup() %>%
  distinct()
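The file-reading pattern above can be illustrated on a throwaway directory. This is a minimal base-R sketch, not the project code: the directory, file names, and message contents are hypothetical, and it uses `readLines` and `vapply` in place of the tidyverse pipeline.

```r
# Create a unique temporary directory with two small, made-up "messages".
demo_dir <- tempfile("demo_mail")
dir.create(demo_dir)
writeLines(c("Subject: hello", "See you at noon"), file.path(demo_dir, "msg1.txt"))
writeLines(c("Subject: WIN", "Click here now"), file.path(demo_dir, "msg2.txt"))

# list.files() returns a character vector of file names, sorted alphabetically.
paths <- list.files(demo_dir, full.names = TRUE)

# lapply() returns a list with one element per file; each element holds that file's lines.
lines_per_file <- lapply(paths, readLines)

# Collapse each file's lines into a single string, giving one row per message.
demo <- data.frame(
  file  = basename(paths),
  text  = vapply(lines_per_file, paste, character(1), collapse = " "),
  class = "spam",
  stringsAsFactors = FALSE
)
print(demo)
```

Collapsing the lines directly, rather than unnesting and regrouping, is one way to reach the same one-row-per-file shape.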
This was perhaps the section we found most challenging. The next step was to combine the contents of ‘ham’ and ‘spam’ with rbind, which joins vectors, matrices, or data frames by rows. We then used select(), which keeps only the variables named, to keep class, spam, file, and text. Another handy tidying function was str_replace, which we used to remove carriage-return, newline, and tab characters from the text of ‘ham_spam’; note that str_replace replaces only the first matching run in each string, whereas str_replace_all replaces every one. We also found content_transformer helpful. It wraps a function that modifies the content of an R object so that tm_map can apply it to a corpus; we used it to replace punctuation with a space.
During this project we also learned to work with a document-term matrix (or its transpose, a term-document matrix), which describes the frequency of terms in a collection of documents. In a document-term matrix, each row represents one document, each column represents one term (word), and each value (typically) is the number of times that term appears in that document. We also experimented with removeSparseTerms, which drops terms that appear in too few documents. Here we set the threshold so that terms appearing in fewer than about 10 documents were removed.
ham_spam <- rbind(ham, spam) %>%
  select(class, spam, file, text)
ham_spam$text <- ham_spam$text %>%
  str_replace(., "[\\r\\n\\t]+", "")
replacePunctuation <- content_transformer(function(x) {
  return(gsub("[[:punct:]]", " ", x))
})
corpus <- Corpus(VectorSource(ham_spam$text)) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removeWords, stopwords("english")) %>%
  tm_map(replacePunctuation) %>%
  tm_map(removeNumbers) %>%
  tm_map(stripWhitespace)
## Warning in tm_map.SimpleCorpus(., content_transformer(tolower)): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(., removeWords, stopwords("english")):
## transformation drops documents
## Warning in tm_map.SimpleCorpus(., replacePunctuation): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(., removeNumbers): transformation drops documents
## Warning in tm_map.SimpleCorpus(., stripWhitespace): transformation drops
## documents
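To show what the cleaning steps do to a single message, here is a minimal base-R sketch. It is an illustration only: `clean_text` is a hypothetical helper, stop-word removal is omitted, and it does not use the tm transformations applied above.

```r
# Hypothetical helper that mimics lower-casing, punctuation replacement,
# number removal, and whitespace stripping on a character vector.
clean_text <- function(x) {
  x <- tolower(x)                       # lower-case everything
  x <- gsub("[[:punct:]]", " ", x)      # replace each punctuation character with a space
  x <- gsub("[[:digit:]]+", "", x)      # drop runs of digits
  x <- gsub("[[:space:]]+", " ", x)     # collapse tabs and repeated spaces into one space
  trimws(x)                             # remove leading and trailing whitespace
}

cleaned <- clean_text("FREE!!! Win $1000 now,\tclick   here.")
print(cleaned)
# "free win now click here"
```

Replacing punctuation with a space, rather than deleting it, keeps words such as "now,click" from being fused into a single token.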
dtm <- DocumentTermMatrix(corpus)
dtm <- removeSparseTerms(dtm,1-(10/length(corpus)))
#test<-as.matrix(dtm)
inspect(dtm)
## <<DocumentTermMatrix (documents: 1900, terms: 3816)>>
## Non-/sparse entries: 282538/6967862
## Sparsity : 96%
## Maximal term length: 19
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs aug com font http linux list localhost net org received
## 1317 0 4 1 242 17 16 13 254 5 10
## 1345 12 56 0 38 71 10 6 0 9 5
## 1380 11 18 0 5 5 18 8 0 3 9
## 1637 0 36 149 22 0 1 5 1 8 6
## 1645 2 4 0 7 0 1 5 17 12 5
## 1682 0 5 542 1 0 2 5 1 3 4
## 1805 0 6 542 1 0 3 5 2 3 4
## 1835 0 13 126 6 0 3 5 1 6 3
## 1876 0 8 542 1 0 2 5 4 6 6
## 693 0 24 0 10 0 6 7 13 7 5
dim(dtm)
## [1] 1900 3816
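The document-term matrix and the sparse-term filter can be made concrete with a small hand-built example. This is a base-R sketch on three made-up documents; the names `toy_dtm`, `doc_freq`, and `min_docs` are hypothetical, and the filter is written as a plain document-frequency cutoff rather than a call to removeSparseTerms.

```r
# Three tiny, made-up documents (already cleaned and lower-cased).
docs <- c(d1 = "free money free offer",
          d2 = "meeting list linux",
          d3 = "free linux list")

# Split each document into terms and collect the vocabulary.
tokens <- strsplit(docs, " ", fixed = TRUE)
vocab  <- sort(unique(unlist(tokens)))

# One row per document, one column per term, cells hold term counts.
toy_dtm <- t(vapply(
  tokens,
  function(tk) as.integer(table(factor(tk, levels = vocab))),
  integer(length(vocab))
))
colnames(toy_dtm) <- vocab
print(toy_dtm)

# Document frequency: in how many documents does each term appear?
doc_freq <- colSums(toy_dtm > 0)

# Keep only terms that appear in at least `min_docs` documents.
min_docs  <- 2
toy_small <- toy_dtm[, doc_freq >= min_docs, drop = FALSE]
print(colnames(toy_small))
# "free" "linux" "list"
```

Here "free" appears twice in d1 but counts once toward its document frequency, which is the quantity a sparsity threshold acts on.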
email_dtm <- dtm %>%
  as.matrix() %>%
  as.data.frame() %>%
  sapply(., as.numeric) %>%
  as.data.frame() %>%
  mutate(class = ham_spam$class) %>%
  select(class, everything())
email_dtm$class <- as.factor(email_dtm$class)
#Training & Test set
sample_size <- floor(0.8 * nrow(email_dtm))
set.seed(1564)
index <- sample(seq_len(nrow(email_dtm)), size = sample_size)
dtm_train <- email_dtm[index, ]
dtm_test <- email_dtm[-index, ]
#Training & Test class labels
train_labels <- dtm_train$class
test_labels <- dtm_test$class
#Proportion for training & test Spam
prop.table(table(train_labels))
## train_labels
## ham spam
## 0.7434211 0.2565789
prop.table(table(test_labels))
## test_labels
## ham spam
## 0.7105263 0.2894737
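The 80/20 split above follows a common pattern: fix a seed, sample row indices for the training set, and use the remaining rows for testing. A minimal sketch on a hypothetical ten-row data frame (the `toy` data and variable names are assumptions for illustration):

```r
# Hypothetical labelled data: 7 ham and 3 spam rows.
toy <- data.frame(
  id    = 1:10,
  class = rep(c("ham", "spam"), times = c(7, 3))
)

set.seed(1564)                                   # make the random split reproducible
n_train <- floor(0.8 * nrow(toy))                # 80% of the rows go to training
idx     <- sample(seq_len(nrow(toy)), size = n_train)

train <- toy[idx, ]                              # sampled rows
test  <- toy[-idx, ]                             # everything that was not sampled

# Class proportions in each part; with simple random sampling these can differ.
print(prop.table(table(train$class)))
print(prop.table(table(test$class)))
```

Because the rows are sampled without regard to class, the spam share can differ between the two sets, as it does above (about 26% versus 29%). If closer proportions matter, a stratified split such as caret's `createDataPartition` is one option.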
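# Convert term counts to presence/absence ("Yes"/"No").
# Note: email_dtm has 3817 columns (class plus 3816 terms), so 2:3816
# leaves the last term column as a raw count.
dtm_train[ , 2:3816] <- ifelse(dtm_train[ , 2:3816] == 0, "No", "Yes")
dtm_test[ , 2:3816] <- ifelse(dtm_test[ , 2:3816] == 0, "No", "Yes")
# Note: dtm_train and dtm_test still contain the class column here,
# so the label itself is among the predictors passed to the model.
model_classifier <- naiveBayes(dtm_train, train_labels)
test_pred <- predict(model_classifier, dtm_test)
confusionMatrix(test_pred, test_labels, positive = "spam",
                dnn = c("Prediction", "Actual"))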
## Confusion Matrix and Statistics
##
## Actual
## Prediction ham spam
## ham 269 4
## spam 1 106
##
## Accuracy : 0.9868
## 95% CI : (0.9696, 0.9957)
## No Information Rate : 0.7105
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9678
##
## Mcnemar's Test P-Value : 0.3711
##
## Sensitivity : 0.9636
## Specificity : 0.9963
## Pos Pred Value : 0.9907
## Neg Pred Value : 0.9853
## Prevalence : 0.2895
## Detection Rate : 0.2789
## Detection Prevalence : 0.2816
## Balanced Accuracy : 0.9800
##
## 'Positive' Class : spam
##
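The naiveBayes call above receives a predictor table that still includes the class column. As a point of comparison, here is a minimal sketch of fitting with the label dropped from the predictors. It uses hypothetical toy data with two term-presence features, and the `laplace = 1` smoothing is an assumption for the example, not a setting used in the project.

```r
library(e1071)

# Hypothetical toy data: a label plus two term-presence features.
toy <- data.frame(
  class = factor(c("spam", "spam", "spam", "ham", "ham", "ham")),
  free  = factor(c("Yes", "Yes", "Yes", "No", "No", "No"), levels = c("No", "Yes")),
  list  = factor(c("No", "No", "Yes", "Yes", "Yes", "No"), levels = c("No", "Yes"))
)

# Keep every column except the label as a predictor.
predictors <- toy[, setdiff(names(toy), "class"), drop = FALSE]

# Fit the classifier on the predictors only; the label is passed separately.
model <- naiveBayes(predictors, toy$class, laplace = 1)

# Predict on the same predictor columns (here the training rows, for brevity).
pred <- predict(model, predictors)
print(table(Prediction = pred, Actual = toy$class))
```

With the label excluded, the model's conditional probability tables cover only the term features, so any accuracy it reports comes from the text alone.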
Using the Naive Bayes classifier from the e1071 package, we correctly classified roughly 99% of the test emails (accuracy 0.9868). Sensitivity was about 96%, meaning that 106 of the 110 spam emails were classified correctly, and specificity was about 99.6%, meaning that 269 of the 270 ham emails were classified correctly.
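The summary statistics in the confusion matrix output follow directly from its four cell counts. A short worked calculation, treating spam as the positive class and taking the counts from the table above (the variable names are ours):

```r
# Cell counts from the confusion matrix (positive class = spam).
tp <- 106  # spam predicted as spam
fn <- 4    # spam predicted as ham
fp <- 1    # ham predicted as spam
tn <- 269  # ham predicted as ham

metrics <- c(
  accuracy    = (tp + tn) / (tp + tn + fp + fn),  # share of all emails classified correctly
  sensitivity = tp / (tp + fn),                   # share of spam caught
  specificity = tn / (tn + fp),                   # share of ham kept
  ppv         = tp / (tp + fp),                   # share of spam predictions that were spam
  npv         = tn / (tn + fn)                    # share of ham predictions that were ham
)
print(round(metrics, 4))
#    accuracy sensitivity specificity         ppv         npv
#      0.9868      0.9636      0.9963      0.9907      0.9853
```

These values match the Accuracy, Sensitivity, Specificity, Pos Pred Value, and Neg Pred Value lines reported by confusionMatrix.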