Objective:
Build a spam classifier using labeled training data and predict on test data
library(tm) # text mining: corpus handling and document-term matrices
library(wordcloud) # visualize the most frequent words in each class
suppressWarnings(library(e1071)) # naive Bayes classifier
library(gmodels) # CrossTable() for the confusion matrix
suppressWarnings(library(SnowballC)) # word stemming
library(tidyverse)
filename <- '/Users/euniceok/PycharmProjects/cuny/spring2019/Week10Text/data/sms_spam.csv'
spam <- read.csv(filename, stringsAsFactors = FALSE, encoding="UTF-8")
spam$type <- factor(spam$type) # convert type to factor
table(spam$type) # see how many ham vs spam messages are in the text dataset
##
## ham spam
## 4827 747
spam_messages <- subset(spam, type == "spam")
ham_messages <- subset(spam, type == "ham")
suppressWarnings(wordcloud(spam_messages$text, max.words = 100, scale = c(3, 0.5)))
suppressWarnings(wordcloud(ham_messages$text, max.words = 100, scale = c(3,0.5)))
Clearly, distinct sets of the most frequent words emerge in each wordcloud.
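That impression can be checked numerically. A minimal sketch that counts terms via a throwaway document term matrix (this uses tm's default cleaning, not the pipeline built below):
# top 10 terms in the spam messages, counted from a quick throwaway DTM
spam_dtm <- DocumentTermMatrix(VCorpus(VectorSource(spam_messages$text)))
head(sort(colSums(as.matrix(spam_dtm)), decreasing = TRUE), 10)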
# generate a corpus, a collection of text documents
corpus <- VCorpus(VectorSource(spam$text))
corpus
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 5574
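To sanity-check the corpus contents, individual messages can be pulled out with as.character():
# view the first two raw messages to confirm the corpus was built correctly
lapply(corpus[1:2], as.character)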
# generate document term matrix, in which row = message and col = word
# words are lowercased, numbers and punctuation are removed and stemming is performed
dtm <- DocumentTermMatrix(corpus, control = list(
  tolower = TRUE,
  removeNumbers = TRUE,
  removePunctuation = TRUE,
  stemming = TRUE
))
dtm
## <<DocumentTermMatrix (documents: 5574, terms: 7022)>>
## Non-/sparse entries: 57133/39083495
## Sparsity : 100%
## Maximal term length: 40
## Weighting : term frequency (tf)
Note: the matrix is extremely sparse (nearly every entry is zero, since most words appear in only a few messages), and the maximal term length is 40 characters, which seems plausible for stemmed SMS tokens.
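tm's inspect() prints a small slice of the matrix if you want to see its structure directly (the column range here is arbitrary):
# view a 5-document x 6-term slice of the DTM; expect mostly zeros
inspect(dtm[1:5, 100:105])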
# Split the dataset into 75% training and 25% testing subsets.
# 75% of sample size
smp_size <- floor(0.75 * nrow(spam)) # 4180 messages
# set seed to make the partition reproducible
set.seed(123)
# randomly select training indices covering 75% of the dataset
train_ind <- sample(seq_len(nrow(spam)), size = smp_size)
trainLabels <- spam[train_ind,]$type
testLabels <- spam[-train_ind,]$type
# check that proportions of ham/spam are fairly similar between two datasets
prop.table(table(trainLabels))
## trainLabels
## ham spam
## 0.8660287 0.1339713
prop.table(table(testLabels))
## testLabels
## ham spam
## 0.8658537 0.1341463
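The proportions match closely, so the simple random split suffices here. If they ever diverged badly, a stratified split would guarantee matching class proportions; a sketch using caret::createDataPartition (caret is not used elsewhere in this analysis and is assumed to be installed):
# stratified 75/25 split on the class label
library(caret)
set.seed(123)
strat_ind <- createDataPartition(spam$type, p = 0.75, list = FALSE)
prop.table(table(spam$type[strat_ind])) # proportions match by construction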
# split data on the document term matrix (which has been pre-processed)
train <- dtm[train_ind,]
test <- dtm[-train_ind,]
# check that the dimensions of the subsets make sense; paste() combines the
# two dimension vectors element-wise, so the first string below pairs the
# row counts (train, test) and the second the column counts
print(paste(dim(train), dim(test)))
## [1] "4180 1394" "7022 7022"
# keep only words that appear at least 5 times in the training data,
# so the model is not fit to terms too rare to be informative
# note: this filtering must be done on the document term matrix
freqWords <- findFreqTerms(train, 5)
freqTrain <- train[,freqWords]
freqTest <- test[,freqWords]
freqTrain
## <<DocumentTermMatrix (documents: 4180, terms: 1241)>>
## Non-/sparse entries: 35542/5151838
## Sparsity : 99%
## Maximal term length: 19
## Weighting : term frequency (tf)
freqTest
## <<DocumentTermMatrix (documents: 1394, terms: 1241)>>
## Non-/sparse entries: 11974/1717980
## Sparsity : 99%
## Maximal term length: 19
## Weighting : term frequency (tf)
Note that the filtered matrices are slightly less sparse and the maximal term length drops from 40 to 19 characters.
# the DTM stores word counts, but this naive Bayes classifier expects
# categorical features, so convert each count to "Yes" (word present) or
# "No" (word absent), applied to every column (i.e. MARGIN = 2)
convert_counts <- function(x) {
  ifelse(x > 0, "Yes", "No")
}
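A quick check of the conversion on a toy vector:
convert_counts(c(0, 1, 3)) # returns "No" "Yes" "Yes"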
trained <- apply(freqTrain, MARGIN = 2, convert_counts)
tested <- apply(freqTest, MARGIN = 2, convert_counts)
# train model
classifier <- naiveBayes(trained, trainLabels)
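As an aside, naiveBayes() also takes a laplace argument; Laplace smoothing keeps a word that never appears in one class of the training data from forcing that class's probability to zero. A variant worth comparing (a sketch; its performance is not evaluated here):
# same model with Laplace smoothing of 1
classifier2 <- naiveBayes(trained, trainLabels, laplace = 1)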
# inspect the learned conditional probabilities for the word "call":
# it appears in roughly 45% of spam messages but under 6% of ham messages
classifier$tables$call
## call
## trainLabels No Yes
## ham 0.94447514 0.05552486
## spam 0.54642857 0.45357143
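Combining these conditionals with the class priors shows why "call" is evidence of spam. A rough single-word posterior, using the ~0.866/~0.134 ham/spam priors from the split check above (this ignores every other word, so it is only illustrative):
p_ham <- 0.866 * 0.0555 # prior x P("call" | ham)
p_spam <- 0.134 * 0.4536 # prior x P("call" | spam)
p_spam / (p_spam + p_ham) # ~0.56: "call" alone tips a message toward spam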
# evaluate the performance of the classifier on the held-out test set
testPredict <- predict(classifier, tested)
CrossTable(testPredict, testLabels,
prop.chisq = FALSE, prop.t = FALSE,
dnn = c('predicted', 'actual'))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1394
##
##
## | actual
## predicted | ham | spam | Row Total |
## -------------|-----------|-----------|-----------|
## ham | 1203 | 14 | 1217 |
## | 0.988 | 0.012 | 0.873 |
## | 0.997 | 0.075 | |
## -------------|-----------|-----------|-----------|
## spam | 4 | 173 | 177 |
## | 0.023 | 0.977 | 0.127 |
## | 0.003 | 0.925 | |
## -------------|-----------|-----------|-----------|
## Column Total | 1207 | 187 | 1394 |
## | 0.866 | 0.134 | |
## -------------|-----------|-----------|-----------|
##
##
According to the confusion matrix, 14 + 4 = 18 of the 1394 test messages were classified incorrectly, an error rate of about 1.3%. Users typically care most about legitimate messages being routed to spam, and only 4 ham messages were misclassified that way (false positives, treating spam as the positive class). Whether the model is usable in practice depends on what rate of lost legitimate mail users consider acceptable.
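For reference, the standard summary metrics implied by the table, treating spam as the positive class:
tp <- 173; tn <- 1203; fp <- 4; fn <- 14 # counts from the CrossTable above
(tp + tn) / (tp + tn + fp + fn) # accuracy: ~0.987
tp / (tp + fp) # precision: ~0.977
tp / (tp + fn) # recall: ~0.925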
References:
Source data: http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/
Tutorial: http://www.dbenson.co.uk/Rparts/subpages/spamR/