We will create a program that can classify a text document using training documents that have already been classified. This program will classify email as ‘spam’, i.e., unwanted email, or ‘ham’, i.e., wanted email.
# install.packages("tm")
# install.packages("caTools")
# install.packages("caret")
# install.packages("kernlab")
# install.packages("R.utils")
# install.packages("topicmodels")
# install.packages("quanteda")
# install.packages("naivebayes")
# install.packages("e1071") # provides naiveBayes(), used below
# Load the required packages (startup and masking messages condensed for brevity)
library(tidyverse) # dplyr, readr, stringr, tidyr, purrr, ggplot2, ...
library(tm)        # loads NLP; NLP::annotate masks ggplot2::annotate
library(wordcloud) # loads RColorBrewer
library(caret)     # loads lattice; caret::lift masks purrr::lift
library(kernlab)   # kernlab::cross and kernlab::alpha mask purrr::cross and ggplot2::alpha
library(magrittr)  # set_colnames(); masks purrr::set_names and tidyr::extract
library(e1071)     # provides naiveBayes(), used below
The spam and ham emails were downloaded from http://spamassassin.apache.org/old/publiccorpus/ per Prof. Catlin’s video. I chose “20021010_spam.tar.bz2” for the spam data and “20030228_easy_ham_2.tar.bz2” for the ham data. Both were downloaded and extracted into folders on my local computer.
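For reproducibility, the download and extraction can also be scripted. A minimal sketch, assuming the file names above and a local SpamHam destination folder (base R’s untar() handles .tar.bz2 archives directly):
# Hypothetical scripted download/extraction (destination paths are assumptions)
url_spam <- "http://spamassassin.apache.org/old/publiccorpus/20021010_spam.tar.bz2"
url_ham <- "http://spamassassin.apache.org/old/publiccorpus/20030228_easy_ham_2.tar.bz2"
download.file(url_spam, destfile = "20021010_spam.tar.bz2")
download.file(url_ham, destfile = "20030228_easy_ham_2.tar.bz2")
untar("20021010_spam.tar.bz2", exdir = "SpamHam") # extracts a spam/ folder
untar("20030228_easy_ham_2.tar.bz2", exdir = "SpamHam") # extracts an easy_ham_2/ folder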
# Folder paths for the ham and spam data
ham_folder <- "C:/Users/Home/Documents/SpamHam/easy_ham2"
spam_folder <- "C:/Users/Home/Documents/SpamHam/spam"
length(list.files(path = ham_folder))
## [1] 1401
length(list.files(path = spam_folder))
## [1] 501
ham_files <- list.files(path = ham_folder, full.names = TRUE)
spam_files <- list.files(path = spam_folder, full.names = TRUE)
# Create ham data frame
ham <- list.files(path = ham_folder) %>%
  as.data.frame() %>%
  set_colnames("file") %>%
  mutate(text = lapply(ham_files, read_lines)) %>%
  unnest(c(text)) %>%
  mutate(class = "ham",
         type = 0) %>% # categorizes ham emails as type 0
  group_by(file) %>%
  mutate(text = paste(text, collapse = " ")) %>%
  ungroup() %>%
  distinct()
# Create spam data frame
spam <- list.files(path = spam_folder) %>%
  as.data.frame() %>%
  set_colnames("file") %>%
  mutate(text = lapply(spam_files, read_lines)) %>%
  unnest(c(text)) %>%
  mutate(class = "spam",
         type = 1) %>% # categorizes spam emails as type 1
  group_by(file) %>%
  mutate(text = paste(text, collapse = " ")) %>%
  ungroup() %>%
  distinct()
spamham_df <- rbind(spam, ham) %>%
  select(class, type, file, text)
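A quick sanity check that the merge kept one row per email (output omitted; the counts should match the folder sizes above):
spamham_df %>%
  count(class) # expect ham: 1401, spam: 501, for 1,902 rows in total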
Here I will tidy the corpus from both folders by removing numbers, punctuation, and stopwords, i.e., common non-content words such as “to”, “and”, and “the”, which have no predictive value. Excess white space will also be removed.
# I kept getting strange errors like "Error in FUN(content(x), ...) : invalid multibyte string 1" so I searched the web and found this suggestion at Stackoverflow.com. After I used it my code ran fine.
Sys.setlocale("LC_ALL", "C")
## [1] "C"
Taking a look at spamham_df reveals that additional tidying is necessary. Here I will use the ‘tm’ package to remove white space and punctuation and to transform the text into a suitable corpus.
spamham_df$text <- spamham_df$text %>%
  str_replace_all("[\\r\\n\\t]+", "") # str_replace() would only strip the first run of breaks/tabs
clean_corpus <- Corpus(VectorSource(spamham_df$text))
cleanCorpus <- clean_corpus %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation) %>%
  tm_map(removeWords, stopwords()) %>%
  tm_map(stripWhitespace)
cleanCorpus
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 1902
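To spot-check the cleaning, one can peek at a single document (a hypothetical inspection step; output not shown):
substr(content(cleanCorpus[[1]]), 1, 200) # first 200 characters of the first cleaned email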
The document-term matrix (DTM) is the mathematical matrix that describes the frequency of terms occurring in a collection of documents. I will create one from the combined corpus.
dtm <- DocumentTermMatrix(cleanCorpus)
# Remove very rare terms, i.e. words appearing in less than ~1% of documents
dtm.99 <- removeSparseTerms(dtm, sparse = 0.99)
inspect(dtm.99)
## <<DocumentTermMatrix (documents: 1902, terms: 2327)>>
## Non-/sparse entries: 228825/4197129
## Sparsity : 95%
## Maximal term length: 66
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs aug esmtp iluglinuxie jul localhost mon postfix received tue wed
## 1194 0 3 0 7 4 3 3 5 4 0
## 1314 0 5 0 9 4 0 5 7 9 0
## 1818 0 6 0 10 3 0 6 10 0 10
## 182 0 2 0 0 3 7 2 15 0 0
## 1846 12 2 0 16 3 0 2 5 6 0
## 1881 11 1 0 0 5 2 1 9 4 4
## 197 3 2 0 0 3 3 1 15 0 0
## 245 2 3 0 1 2 1 1 5 1 0
## 405 0 1 0 0 2 0 1 4 0 0
## 422 0 4 6 0 3 0 4 15 0 0
Now that we have ~95% sparsity and account for 2,327 terms, we can visualize our corpus. I chose to use a wordcloud.
# Spamham word cloud (reference: "How to Generate Word Clouds in R", cited at end)
set.seed(5678) # to ensure reproducibility
wordcloud(cleanCorpus, min.freq = 1000,
          max.words = 150, random.order = FALSE,
          colors = brewer.pal(8, "Dark2")) # from RColorBrewer package
The wordcloud illustrates words from our corpus in order of prominence by font size. Here we see that the most frequently used words are “received”, “aug”, “esmtp”, “localhost”, and “jul”.
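The same ranking can be checked numerically from the document-term matrix; a small sketch (output omitted):
term_freq <- sort(colSums(as.matrix(dtm.99)), decreasing = TRUE)
head(term_freq, 5) # top terms should match the wordcloud's largest words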
I reshuffled the data frame to ensure randomization.
# reshuffle the data frame
set.seed(5678)
rows <- sample(nrow(spamham_df))
spamhamdf2 <- spamham_df[rows, ]
80% of the data is partitioned for training, and 20% is partitioned for testing (the hold-out set).
# Split the data set into the Training set and Test set
trainIndex <- createDataPartition(spamhamdf2$type, p=0.80, list=FALSE)
dataTrain <- as.data.frame(spamhamdf2[trainIndex,]) # training
dataTest <- spamhamdf2[-trainIndex,] # testing
summary(dataTrain)
## class type file text
## Length:1522 Min. :0.0000 Length:1522 Length:1522
## Class :character 1st Qu.:0.0000 Class :character Class :character
## Mode :character Median :0.0000 Mode :character Mode :character
## Mean :0.2674
## 3rd Qu.:1.0000
## Max. :1.0000
summary(dataTest)
## class type file text
## Length:380 Min. :0.0000 Length:380 Length:380
## Class :character 1st Qu.:0.0000 Class :character Class :character
## Mode :character Median :0.0000 Mode :character Mode :character
## Mean :0.2474
## 3rd Qu.:0.0000
## Max. :1.0000
We began with a corpus of 1,902 rows. The 80% training split yields 1,522 rows (0.80 × 1,902 = 1,521.6, rounded up), and the remaining 20% yields 380 rows for the test set.
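A quick check that the partition sizes match (values per the summaries above):
nrow(dataTrain) # 1522
nrow(dataTest) # 380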
# Create training and test corpus. We will use the 'cleanCorpus' previously created.
train_corpus <- cleanCorpus[1:1522] #for rows 1 to 1522
test_corpus <- cleanCorpus[1523:1902] #for rows 1523 to 1902
The DTM for our cleanCorpus was previously defined, so we will use it to create a “train_dtm” and a “test_dtm”. (Note that cleanCorpus was built from the unshuffled data frame, while the partition above indexes the shuffled one; we return to this mismatch in the conclusion.)
train_dtm <- dtm.99[1:1522,] #pass rows 1-1522 from training set
test_dtm <- dtm.99[1523:1902,] #pass rows 1523-1902 from testing set
train_dtm
## <<DocumentTermMatrix (documents: 1522, terms: 2327)>>
## Non-/sparse entries: 181237/3360457
## Sparsity : 95%
## Maximal term length: 66
## Weighting : term frequency (tf)
test_dtm
## <<DocumentTermMatrix (documents: 380, terms: 2327)>>
## Non-/sparse entries: 47588/836672
## Sparsity : 95%
## Maximal term length: 66
## Weighting : term frequency (tf)
five_words <- findFreqTerms(train_dtm, 5) # terms appearing at least 5 times
five_words[1:5]
## [1] "access" "aligndcenterbfont" "andor"
## [4] "aug" "best"
findFreqTerms(train_dtm, 5) returns every term that appears at least five times in our training set; the first five, alphabetically, are “access”, “aligndcenterbfont”, “andor”, “aug”, and “best”. We will use these frequent terms as the dictionary to train our model.
email_train <- DocumentTermMatrix(train_corpus, control = list(dictionary = five_words))
email_test <- DocumentTermMatrix(test_corpus, control = list(dictionary = five_words))
Naive Bayes classification needs present-or-absent information on each word in a message, but we have counts of occurrences, so we convert the document-term matrices.
# Convert count info to "Yes" or "No"
convert_count <- function(x) {
  ifelse(x > 0, "Yes", "No")
}
# Convert document-term matrices:
email_train <- apply(email_train, 2, convert_count)
email_test <- apply(email_test, 2, convert_count)
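A glance at a corner of the converted matrix confirms the Yes/No coding (a hypothetical inspection; output not shown):
email_train[1:3, 1:4] # a few documents x dictionary terms
table(email_train[, "aug"]) # "No"/"Yes" tally for one dictionary term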
# Create naive Bayes classifier object
email_classifier <- naiveBayes(email_train, factor(dataTrain$type))
class(email_classifier) # verify the class of this classifier
## [1] "naiveBayes"
# Predictions on test data
email_pred <- predict(email_classifier, newdata=email_test)
table(email_pred, dataTest$type)
##
## email_pred 0 1
## 0 183 65
## 1 103 29
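caret’s confusionMatrix() gives a fuller summary of the same table, including accuracy and sensitivity (a hedged follow-up; output not shown):
confusionMatrix(email_pred, factor(dataTest$type), positive = "1") # "1" = spam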
Based on the email_pred results, the model I built did not perform well. It correctly identified only 64% (183/286) of the ham emails and 31% (29/94) of the spam emails. Given more time, I would review and rewrite my approach in search of a better outcome.
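A likely culprit is the mismatch flagged earlier: train_corpus and train_dtm were sliced positionally from the unshuffled corpus (spam files first, then ham), while dataTrain$type comes from the shuffled data frame, so documents and labels are misaligned. A sketch of one possible fix, rebuilding the corpus from the shuffled data frame so rows line up, with Laplace smoothing added (an assumption, not re-run here):
# Rebuild the corpus from the shuffled data frame so document order matches the labels
shuffled_corpus <- Corpus(VectorSource(spamhamdf2$text)) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation) %>%
  tm_map(removeWords, stopwords()) %>%
  tm_map(stripWhitespace)
idx <- as.vector(trainIndex)
train2 <- apply(DocumentTermMatrix(shuffled_corpus[idx],
                                   control = list(dictionary = five_words)), 2, convert_count)
test2 <- apply(DocumentTermMatrix(shuffled_corpus[-idx],
                                  control = list(dictionary = five_words)), 2, convert_count)
# laplace = 1 smooths zero-count cells, which often helps sparse text features
classifier2 <- naiveBayes(train2, factor(spamhamdf2$type[idx]), laplace = 1)
table(predict(classifier2, newdata = test2), spamhamdf2$type[-idx])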