Is it Spam or Ham?

We begin by loading the relevant packages:

library(tm)
library(wordcloud)
library(e1071)
library(gmodels)
library(SnowballC)

The data are loaded and examined. The dataset, which can be found on Kaggle among other places, contains 2 columns and 5,574 rows, each an SMS message labelled as either spam or ham.

spam <- read.csv('sms_spam.csv')
str(spam)
## 'data.frame':    5574 obs. of  2 variables:
##  $ type: Factor w/ 2 levels "ham","spam": 1 1 2 1 1 2 1 1 2 2 ...
##  $ text: Factor w/ 5160 levels " &lt;#&gt;  in mca. But not conform.",..: 1151 3227 1052 4252 2876 1078 986 428 4757 1286 ...
table(spam$type)
## 
##  ham spam 
## 4827  747

To find the most frequent words, we create a bag of words by breaking the messages apart into individual words; converting them to lower case; removing numbers, punctuation, and stopwords; and stemming each word down to its root. The TermDocumentMatrix function performs all of these steps at once through its control list.

# create bag of words from the dataset: convert to lowercase; remove numbers, punctuation, and stopwords; stem the words
corpus <- VCorpus(VectorSource(spam$text)) 
tdm <- TermDocumentMatrix(corpus, control = list(
    tolower = TRUE,
    removeNumbers = TRUE,
    removePunctuation = TRUE,
    stemming = TRUE,
    stopwords = TRUE
))

str(tdm)
## List of 6
##  $ i       : int [1:44123] 218 408 788 790 1061 1271 2374 2410 3071 4168 ...
##  $ j       : int [1:44123] 1 1 1 1 1 1 1 1 1 1 ...
##  $ v       : num [1:44123] 1 1 1 1 1 1 1 1 1 1 ...
##  $ nrow    : int 6945
##  $ ncol    : int 5574
##  $ dimnames:List of 2
##   ..$ Terms: chr [1:6945] "…thank" "“harri" "£award" "£call" ...
##   ..$ Docs : chr [1:5574] "1" "2" "3" "4" ...
##  - attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
##  - attr(*, "weighting")= chr [1:2] "term frequency" "tf"
inspect(tdm)
## <<TermDocumentMatrix (terms: 6945, documents: 5574)>>
## Non-/sparse entries: 44123/38667307
## Sparsity           : 100%
## Maximal term length: 40
## Weighting          : term frequency (tf)
## Sample             :
##       Docs
## Terms  1086 1580 1864 2159 2371 2381 2435 2850 3018 5107
##   call    0    0    0    0    0    0    0    0    0    0
##   can     0    0    0    0    0    1    3    0    0    0
##   come    1    0    0    0    0    0    0    0    0    0
##   dont    1    0    3    0    0    0    0    0    0    0
##   free    0    0    0    0    1    0    1    0    0    1
##   get     1    0    0    0    0    0    1    0    0    0
##   just    0    0    0    0    0    0    0    0    0    0
##   ltgt    0   18    0    0    0    1    6    0    2    0
##   now     0    0    0    0    0    0    0    0    0    0
##   will    9    0    0    0    0    0    0    0    0    0

Then we convert the term-document matrix to an ordinary matrix, sum the frequency of each word across all messages, and create a data frame pairing each word with its frequency. Now we can display a word cloud of the 50 most frequently used words in the dataset.

#Convert TDM to matrix, collect word frequencies, create a data frame with each word & freq
m <- as.matrix(tdm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)

#Create word cloud of top 50 words
wordcloud(d$word,d$freq, scale=c(4,1), max.words = 50, random.order = FALSE)

Next we split the data into spam and ham subsets so that we can examine the most frequent words in each group independently.

Ham WordCloud

#separate data into spam and ham
spam_messages <- subset(spam,type=="spam")
ham_messages <- subset(spam, type=="ham")

# create ham bag of words
corpus <- VCorpus(VectorSource(ham_messages$text)) 
tdm_ham <- TermDocumentMatrix(corpus, control = list(
    tolower = TRUE,
    removeNumbers = TRUE,
    removePunctuation = TRUE,
    stemming = TRUE,
    stopwords = TRUE
))

str(tdm_ham)
## List of 6
##  $ i       : int [1:34463] 179 355 683 684 906 1089 2056 2086 2667 3605 ...
##  $ j       : int [1:34463] 1 1 1 1 1 1 1 1 1 1 ...
##  $ v       : num [1:34463] 1 1 1 1 1 1 1 1 1 1 ...
##  $ nrow    : int 5914
##  $ ncol    : int 4827
##  $ dimnames:List of 2
##   ..$ Terms: chr [1:5914] "…thank" "aah" "aaniy" "aaooooright" ...
##   ..$ Docs : chr [1:4827] "1" "2" "3" "4" ...
##  - attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
##  - attr(*, "weighting")= chr [1:2] "term frequency" "tf"
inspect(tdm_ham)
## <<TermDocumentMatrix (terms: 5914, documents: 4827)>>
## Non-/sparse entries: 34463/28512415
## Sparsity           : 100%
## Maximal term length: 38
## Weighting          : term frequency (tf)
## Sample             :
##       Docs
## Terms  1356 1604 1857 2039 2047 2095 2459 2604 4416 922
##   call    0    0    0    0    0    0    0    0    0   0
##   can     0    0    0    0    1    3    0    0    0   0
##   come    0    0    0    0    0    0    0    0    0   1
##   dont    0    3    0    0    0    0    0    0    0   1
##   get     0    0    0    0    0    1    0    0    0   1
##   just    0    0    0    0    0    0    0    0    0   0
##   know    0    1    0    0    0    0    0    0    0   0
##   ltgt   18    0    0    0    1    6    0    2    0   0
##   now     0    0    0    0    0    0    0    0    0   0
##   will    0    0    0    0    0    0    0    0    0   9
#Convert tdm_ham to matrix, collect word frequencies, create a data frame with each word & freq
m <- as.matrix(tdm_ham)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)

#Create word cloud of top 50 ham words
wordcloud(d$word,d$freq, scale=c(4,.5), max.words = 50, random.order = FALSE)

Some of the most common ham words are love, want, got, just, will, now, can, like, call, get, dont, day, know, and time. These are expressive, personal words, and they suggest a closer relationship between the sender and receiver of the message.
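If we want exact counts rather than eyeballing the cloud, we can also peek at the head of the ham frequency data frame d built above (a small optional check; output not shown):

# top ham stems by raw frequency (optional check)
head(d, 15)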

Spam WordCloud

# spam bag of words
corpus <- VCorpus(VectorSource(spam_messages$text)) 
tdm_spam <- TermDocumentMatrix(corpus, control = list(
    tolower = TRUE,
    removeNumbers = TRUE,
    removePunctuation = TRUE,
    stemming = TRUE,
    stopwords = TRUE
))

str(tdm_spam)
## List of 6
##  $ i       : int [1:9660] 75 309 357 466 515 545 896 1207 1219 1237 ...
##  $ j       : int [1:9660] 1 1 1 1 1 1 1 1 1 1 ...
##  $ v       : num [1:9660] 1 1 1 2 1 1 1 1 1 1 ...
##  $ nrow    : int 1843
##  $ ncol    : int 747
##  $ dimnames:List of 2
##   ..$ Terms: chr [1:1843] "“harri" "£award" "£call" "£ea" ...
##   ..$ Docs : chr [1:747] "1" "2" "3" "4" ...
##  - attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
##  - attr(*, "weighting")= chr [1:2] "term frequency" "tf"
inspect(tdm_spam)
## <<TermDocumentMatrix (terms: 1843, documents: 747)>>
## Non-/sparse entries: 9660/1367061
## Sparsity           : 99%
## Maximal term length: 40
## Weighting          : term frequency (tf)
## Sample             :
##        Docs
## Terms   209 291 364 371 372 529 622 67 69 744
##   call    0   0   0   0   1   1   0  1  0   0
##   claim   0   0   0   0   0   0   0  1  0   0
##   free    1   0   0   0   2   2   0  1  0   1
##   mobil   0   1   1   1   0   0   1  0  0   1
##   now     1   1   1   2   1   1   1  2  0   0
##   prize   0   0   0   0   0   0   0  0  0   0
##   repli   1   0   0   2   0   0   0  0  0   0
##   stop    1   1   0   1   0   0   1  0  0   0
##   text    0   1   0   0   0   0   1  0  0   0
##   txt     1   1   1   1   1   1   1  0  0   0
#Convert tdm_spam to matrix, collect word frequencies, create a data frame with each word & freq
m <- as.matrix(tdm_spam)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)

#Create word cloud of top 50 spam words
wordcloud(d$word,d$freq, scale=c(5,1), max.words = 50, random.order = FALSE)

Some of the most common words in the spam word cloud are call, text, now, urgent, prize, cash, get, free, mobil, stop, and week. These words are aimed at creating a sense of urgency. Interestingly, call shows up in both the spam and ham clouds; perhaps it is simply a word that is used frequently in everyday language.
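One way to check that hunch is to compare the share of ham and spam messages whose text contains the stem call. The snippet below is a rough sketch using the two term-document matrices built above, not part of the original analysis:

# proportion of ham vs. spam messages containing the stem "call" (rough sketch)
ham_m  <- as.matrix(tdm_ham)
spam_m <- as.matrix(tdm_spam)
mean(ham_m["call", ] > 0)
mean(spam_m["call", ] > 0)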

Modeling spam vs ham

To build a model that predicts whether a message is spam or ham, we first split the data into a training set containing 75% of the observations and a test set containing the remaining 25%. The proportion tables below show that the spam/ham mix is similar in the two sets, so the split is representative.

n <- nrow(spam)
trainLabels <- spam[1:floor(n * 0.75),]$type
testLabels <- spam[(floor(n * 0.75)+1):n,]$type
prop.table(table(trainLabels))
## trainLabels
##       ham      spam 
## 0.8648325 0.1351675
prop.table(table(testLabels))
## testLabels
##       ham      spam 
## 0.8694405 0.1305595
dtm <- t(tdm)
ndtm <- nrow(dtm)
dtmTrain <- dtm[1:floor(ndtm * 0.75),]
dtmTest <- dtm[(floor(ndtm * 0.75)+1):ndtm,]

We can also reduce the number of features by keeping only words that appear at least five times in the training data. Because Naive Bayes works with categorical features, we also define a helper function that converts each word count to a simple "Yes"/"No" indicator of whether the word appears in a message.

freqWords <- findFreqTerms(dtmTrain,5)
freqTrain <- dtmTrain[,freqWords]
freqTest <- dtmTest[,freqWords]

convert_counts <- function(x) {
    ifelse(x > 0, "Yes", "No")
}
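As a quick sanity check, convert_counts turns raw counts into presence indicators; for example, on a small hypothetical vector:

# hypothetical example: counts greater than zero become "Yes"
convert_counts(c(0, 2, 1, 0))
## [1] "No"  "Yes" "Yes" "No"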

And finally we are ready to train the model, which in this case is a Naive Bayes classifier from the e1071 package.

First we apply convert_counts to both the training and test matrices, then train the model on the training data:

train <- apply(freqTrain, MARGIN = 2,
               convert_counts)
test <- apply(freqTest, MARGIN = 2,
              convert_counts)

classifier <- naiveBayes(train, trainLabels)

From the word clouds, we saw that the word call shows up in both spam and ham messages; the classifier's conditional probability table for call confirms that its prevalence is much greater in the spam set.

classifier[2]$tables$call
##            call
## trainLabels         No        Yes
##        ham  0.94356846 0.05643154
##        spam 0.56283186 0.43716814

Then we test our model on the withheld test data and display the results in a cross table:

testPredict <- predict(classifier, test)
CrossTable(testPredict, testLabels,
           prop.chisq = FALSE, prop.t = FALSE,
           dnn = c('predicted', 'actual'))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1394 
## 
##  
##              | actual 
##    predicted |       ham |      spam | Row Total | 
## -------------|-----------|-----------|-----------|
##          ham |      1203 |        20 |      1223 | 
##              |     0.984 |     0.016 |     0.877 | 
##              |     0.993 |     0.110 |           | 
## -------------|-----------|-----------|-----------|
##         spam |         9 |       162 |       171 | 
##              |     0.053 |     0.947 |     0.123 | 
##              |     0.007 |     0.890 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |      1212 |       182 |      1394 | 
##              |     0.869 |     0.131 |           | 
## -------------|-----------|-----------|-----------|
## 
## 

Out of 1,394 test messages, 9 ham messages were misclassified as spam and 20 spam messages slipped through as ham, giving an overall accuracy of roughly 97.9%, which is not perfect but not bad either.
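That accuracy figure can also be computed directly from the predictions with a one-liner (a quick sketch; it should match (1203 + 162) / 1394 from the table above):

# overall accuracy on the held-out test set
mean(testPredict == testLabels)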

We could potentially improve the model's performance by tuning the laplace argument of naiveBayes to apply Laplace smoothing, which keeps words that never appear in one class from having an outsized effect on the classification.
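As a sketch of what that tuning step might look like (assuming the same train, test, and label objects as above; the names classifier2 and laplacePredict are just placeholders), we would retrain with laplace = 1 and rebuild the cross table:

# retrain with Laplace smoothing and re-evaluate (sketch)
classifier2 <- naiveBayes(train, trainLabels, laplace = 1)
laplacePredict <- predict(classifier2, test)
CrossTable(laplacePredict, testLabels,
           prop.chisq = FALSE, prop.t = FALSE,
           dnn = c('predicted', 'actual'))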