Objective:
Build a spam classifier using labeled training data and predict on test data
library(tm) # text mining: corpus handling and document-term matrices
library(wordcloud) # visualize the most frequent words in each class
suppressWarnings(library(e1071)) # naive Bayes classifier
library(gmodels) # CrossTable() for the confusion matrix
suppressWarnings(library(SnowballC)) # word stemming
library(tidyverse)
filename <- '/Users/euniceok/PycharmProjects/cuny/spring2019/Week10Text/data/sms_spam.csv'
spam <- read.csv(filename, stringsAsFactors = FALSE, encoding="UTF-8")
spam$type <- factor(spam$type) # convert type to factor
table(spam$type) # see how many ham vs spam messages are in the text dataset
##
## ham spam
## 4827 747
spam_messages <- subset(spam, type == "spam")
ham_messages <- subset(spam, type == "ham")
suppressWarnings(wordcloud(spam_messages$text, max.words = 100, scale = c(3, 0.5)))
suppressWarnings(wordcloud(ham_messages$text, max.words = 100, scale = c(3,0.5)))
Clearly, distinct sets of the most frequent words emerge in each wordcloud.
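That impression can be checked numerically. A minimal sketch that counts terms via a throwaway document term matrix (this uses tm's default cleaning, not the pipeline built below):
# top 10 terms in the spam messages, counted from a quick throwaway DTM
spam_dtm <- DocumentTermMatrix(VCorpus(VectorSource(spam_messages$text)))
head(sort(colSums(as.matrix(spam_dtm)), decreasing = TRUE), 10)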
# generate a corpus, a collection of text documents
corpus <- VCorpus(VectorSource(spam$text))
corpus
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 5574
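To sanity-check the corpus contents, individual messages can be pulled out with as.character():
# view the first two raw messages to confirm the corpus was built correctly
lapply(corpus[1:2], as.character)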
# generate document term matrix, in which row = message and col = word
# words are lowercased, numbers and punctuation are removed and stemming is performed
dtm <- DocumentTermMatrix(corpus, control = list(
  tolower = TRUE,
  removeNumbers = TRUE,
  removePunctuation = TRUE,
  stemming = TRUE
))
dtm
## <<DocumentTermMatrix (documents: 5574, terms: 7022)>>
## Non-/sparse entries: 57133/39083495
## Sparsity : 100%
## Maximal term length: 40
## Weighting : term frequency (tf)
Note: the matrix is extremely sparse (nearly every entry is zero, since most words appear in only a few messages), and the maximal term length is 40 characters, which seems plausible for stemmed SMS tokens.
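tm's inspect() prints a small slice of the matrix if you want to see its structure directly (the column range here is arbitrary):
# view a 5-document x 6-term slice of the DTM; expect mostly zeros
inspect(dtm[1:5, 100:105])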
# Split the dataset into 75% training and 25% testing subsets.
# 75% of sample size
smp_size <- floor(0.75 * nrow(spam)) # 4180 messages
# set seed to make the partition reproducible
set.seed(123)
# randomly select training indices covering 75% of the dataset
train_ind <- sample(seq_len(nrow(spam)), size = smp_size)
trainLabels <- spam[train_ind,]$type
testLabels <- spam[-train_ind,]$type
# check that proportions of ham/spam are fairly similar between two datasets
prop.table(table(trainLabels))
## trainLabels
## ham spam
## 0.8660287 0.1339713
prop.table(table(testLabels))
## testLabels
## ham spam
## 0.8658537 0.1341463
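The proportions match closely, so the simple random split suffices here. If they ever diverged badly, a stratified split would guarantee matching class proportions; a sketch using caret::createDataPartition (caret is not used elsewhere in this analysis and is assumed to be installed):
# stratified 75/25 split on the class label
library(caret)
set.seed(123)
strat_ind <- createDataPartition(spam$type, p = 0.75, list = FALSE)
prop.table(table(spam$type[strat_ind])) # proportions match by construction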
# split data on the document term matrix (which has been pre-processed)
train <- dtm[train_ind,]
test <- dtm[-train_ind,]
# check that the dimensions of the subsets make sense; paste() combines the
# two dimension vectors element-wise, so the first string below pairs the
# row counts (train, test) and the second the column counts
print(paste(dim(train), dim(test)))
## [1] "4180 1394" "7022 7022"
# keep only words that appear at least 5 times in the training data,
# so the model is not fit to terms too rare to be informative
# note: this filtering must be done on the document term matrix
freqWords <- findFreqTerms(train, 5)
freqTrain <- train[,freqWords]
freqTest <- test[,freqWords]
freqTrain
## <<DocumentTermMatrix (documents: 4180, terms: 1241)>>
## Non-/sparse entries: 35542/5151838
## Sparsity : 99%
## Maximal term length: 19
## Weighting : term frequency (tf)
freqTest
## <<DocumentTermMatrix (documents: 1394, terms: 1241)>>
## Non-/sparse entries: 11974/1717980
## Sparsity : 99%
## Maximal term length: 19
## Weighting : term frequency (tf)
Note that the filtered matrices are slightly less sparse and the maximal term length drops from 40 to 19 characters.
# the DTM stores word counts, but this naive Bayes classifier expects
# categorical features, so convert each count to "Yes" (word present) or
# "No" (word absent), applied to every column (i.e. MARGIN = 2)
convert_counts <- function(x) {
  ifelse(x > 0, "Yes", "No")
}
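A quick check of the conversion on a toy vector:
convert_counts(c(0, 1, 3)) # returns "No" "Yes" "Yes"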
trained <- apply(freqTrain, MARGIN = 2, convert_counts)
tested <- apply(freqTest, MARGIN = 2, convert_counts)
# train model
classifier <- naiveBayes(trained, trainLabels)
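As an aside, naiveBayes() also takes a laplace argument; Laplace smoothing keeps a word that never appears in one class of the training data from forcing that class's probability to zero. A variant worth comparing (a sketch; its performance is not evaluated here):
# same model with Laplace smoothing of 1
classifier2 <- naiveBayes(trained, trainLabels, laplace = 1)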
# inspect the learned conditional probabilities for the word "call":
# it appears in roughly 45% of spam messages but under 6% of ham messages
classifier$tables$call
## call
## trainLabels No Yes
## ham 0.94447514 0.05552486
## spam 0.54642857 0.45357143
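Combining these conditionals with the class priors shows why "call" is evidence of spam. A rough single-word posterior, using the ~0.866/~0.134 ham/spam priors from the split check above (this ignores every other word, so it is only illustrative):
p_ham <- 0.866 * 0.0555 # prior x P("call" | ham)
p_spam <- 0.134 * 0.4536 # prior x P("call" | spam)
p_spam / (p_spam + p_ham) # ~0.56: "call" alone tips a message toward spam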
# evaluate the performance of the classifier on the held-out test set
testPredict <- predict(classifier, tested)
CrossTable(testPredict, testLabels,
prop.chisq = FALSE, prop.t = FALSE,
dnn = c('predicted', 'actual'))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1394
##
##
## | actual
## predicted | ham | spam | Row Total |
## -------------|-----------|-----------|-----------|
## ham | 1203 | 14 | 1217 |
## | 0.988 | 0.012 | 0.873 |
## | 0.997 | 0.075 | |
## -------------|-----------|-----------|-----------|
## spam | 4 | 173 | 177 |
## | 0.023 | 0.977 | 0.127 |
## | 0.003 | 0.925 | |
## -------------|-----------|-----------|-----------|
## Column Total | 1207 | 187 | 1394 |
## | 0.866 | 0.134 | |
## -------------|-----------|-----------|-----------|
##
##
According to the confusion matrix, 14 + 4 = 18 of the 1394 test messages were classified incorrectly, an error rate of about 1.3%. Users typically care most about legitimate messages being routed to spam, and only 4 ham messages were misclassified that way (false positives, treating spam as the positive class). Whether the model is usable in practice depends on what rate of lost legitimate mail users consider acceptable.
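For reference, the standard summary metrics implied by the table, treating spam as the positive class:
tp <- 173; tn <- 1203; fp <- 4; fn <- 14 # counts from the CrossTable above
(tp + tn) / (tp + tn + fp + fn) # accuracy: ~0.987
tp / (tp + fp) # precision: ~0.977
tp / (tp + fn) # recall: ~0.925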
References:
Source data: http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/
Tutorial: http://www.dbenson.co.uk/Rparts/subpages/spamR/