This example is adapted from the book Machine Learning with R by Brett Lantz.

1. Collecting Data

The data are adapted from the SMS Spam Collection at http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/. The data set consists of SMS messages, each labeled as either spam or ham (legitimate). A Naive Bayes classifier will learn the word patterns that distinguish the two classes and use them to label SMS messages.

2. Exploring and Preparing the Data

We first read the data into a data frame. There are 5,559 messages (4,812 ham, 747 spam) with two attributes (type and text) in the data set.

# read the sms data into the sms data frame
sms_raw <- read.csv("sms_spam.csv", stringsAsFactors = FALSE)

# examine the structure of the sms data
str(sms_raw)
## 'data.frame':    5559 obs. of  2 variables:
##  $ type: chr  "ham" "ham" "ham" "spam" ...
##  $ text: chr  "Hope you are having a good week. Just checking in" "K..give back my thanks." "Am also doing in cbe only. But have to pay." "complimentary 4 STAR Ibiza Holiday or ÂŁ10,000 cash needs your URGENT collection. 09066364349 NOW from Landline not to lose out!"| __truncated__ ...
# convert spam/ham to factor.
sms_raw$type <- factor(sms_raw$type)

# examine the type variable more carefully
str(sms_raw$type)
##  Factor w/ 2 levels "ham","spam": 1 1 1 2 2 1 1 1 2 1 ...
table(sms_raw$type)
## 
##  ham spam 
## 4812  747
Cleaning and Standardizing Text Data

We load the text mining package (tm) to clean the messages: breaking sentences up into individual words and removing unhelpful words, punctuation, and numbers.

First we need to create a collection of text documents, known as a corpus.

# build a corpus using the text mining (tm) package
library(tm)
sms_corpus <- VCorpus(VectorSource(sms_raw$text))
# examine the sms corpus
print(sms_corpus)
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 5559
inspect(sms_corpus[1:2])
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 2
## 
## [[1]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 49
## 
## [[2]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 23
as.character(sms_corpus[[1]])
## [1] "Hope you are having a good week. Just checking in"
lapply(sms_corpus[1:2], as.character)
## $`1`
## [1] "Hope you are having a good week. Just checking in"
## 
## $`2`
## [1] "K..give back my thanks."

First, we convert all letters to lowercase.

# clean up the corpus using tm_map()
sms_corpus_clean <- tm_map(sms_corpus, content_transformer(tolower))

# show the difference between sms_corpus and corpus_clean
as.character(sms_corpus[[1]])
## [1] "Hope you are having a good week. Just checking in"
as.character(sms_corpus_clean[[1]])
## [1] "hope you are having a good week. just checking in"
sms_corpus_clean <- tm_map(sms_corpus_clean, removeNumbers) # remove numbers
sms_corpus_clean <- tm_map(sms_corpus_clean, removeWords, stopwords()) # remove stop words
sms_corpus_clean <- tm_map(sms_corpus_clean, removePunctuation) # remove punctuation
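
The stopwords() call above supplies tm's built-in English stopword list (highly common, low-information words such as "to", "and", and "but"). If you are curious, the list can be inspected directly; output is omitted here since its exact contents depend on the tm version:

# peek at tm's default English stopword list
head(stopwords())
length(stopwords())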

Word stemming is then applied to reduce words to their base form so that, for example, "learned", "learning", and "learns" are all treated as "learn".
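
As a small illustration of what the stemmer does, the wordStem() function from the SnowballC package (the same function stemDocument() uses under the hood) maps related word forms to a common root:

# illustration of word stemming
library(SnowballC)
wordStem(c("learn", "learned", "learning", "learns"))
## [1] "learn" "learn" "learn" "learn"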

# apply stemming to the entire corpus
sms_corpus_clean <- tm_map(sms_corpus_clean, stemDocument)
sms_corpus_clean <- tm_map(sms_corpus_clean, stripWhitespace) # eliminate unneeded whitespace

# compare messages before and after cleaning
lapply(sms_corpus[1:3], as.character)
## $`1`
## [1] "Hope you are having a good week. Just checking in"
## 
## $`2`
## [1] "K..give back my thanks."
## 
## $`3`
## [1] "Am also doing in cbe only. But have to pay."
lapply(sms_corpus_clean[1:3], as.character)
## $`1`
## [1] "hope good week just check"
## 
## $`2`
## [1] "kgive back thank"
## 
## $`3`
## [1] "also cbe pay"
Splitting Text Documents into Words

The DocumentTermMatrix() function transforms the corpus into a sparse matrix called a document-term matrix (DTM), in which rows represent messages and columns represent words.

# create a document-term sparse matrix
sms_dtm <- DocumentTermMatrix(sms_corpus_clean)
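
As a quick sanity check, the dimensions of the DTM show one row per message and one column per distinct term (output omitted, since the exact term count depends on the cleaning steps):

# dimensions of the DTM: messages x distinct terms
dim(sms_dtm)
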
Creating test and training datasets

The messages are already in random order, so we simply take the first 75% (4,169 messages) for training and the remaining 25% (1,390 messages) for testing.

# creating training and test datasets
sms_dtm_train <- sms_dtm[1:4169, ]
sms_dtm_test  <- sms_dtm[4170:5559, ]

# also save the labels
sms_train_labels <- sms_raw[1:4169, ]$type
sms_test_labels  <- sms_raw[4170:5559, ]$type
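
The sequential split works here because the messages are already in random order. If they were not, a random split would be safer; a minimal sketch (not run in this tutorial; the seed value is arbitrary):

# alternative: a random 75/25 split of the DTM rows and labels together
set.seed(123)  # arbitrary seed, for reproducibility
train_idx <- sample(nrow(sms_dtm), round(0.75 * nrow(sms_dtm)))
sms_dtm_train    <- sms_dtm[train_idx, ]
sms_dtm_test     <- sms_dtm[-train_idx, ]
sms_train_labels <- sms_raw$type[train_idx]
sms_test_labels  <- sms_raw$type[-train_idx]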

# check that the proportion of spam is similar
prop.table(table(sms_train_labels))
## sms_train_labels
##       ham      spam 
## 0.8647158 0.1352842
prop.table(table(sms_test_labels))
## sms_test_labels
##       ham      spam 
## 0.8683453 0.1316547
Visualizing - Word Clouds

A word cloud visually depicts how frequently words appear in the text: more frequent words are drawn in a larger font. Comparing the clouds for spam and ham gives a first hint of whether the two classes use distinct vocabularies.

library(wordcloud)
wordcloud(sms_corpus_clean, min.freq = 50, random.order = FALSE)

# subset the raw data into spam and ham groups
spam <- subset(sms_raw, type == "spam")
ham  <- subset(sms_raw, type == "ham")
Visualizing - Spam Word Cloud
wordcloud(spam$text, max.words = 50, scale = c(3, 1))

Visualizing - Ham Word Cloud
wordcloud(ham$text, max.words = 50, scale = c(3, 0.5))

Creating indicator features for frequent words

The sparse matrix has more than 6,500 columns, each representing a different word. We remove infrequent words, since they are unlikely to be useful for classification. Here we keep only words that appear at least five times in the training data and restrict both the training and test sets to these frequent words.

# save frequently-appearing terms to a character vector
sms_freq_words <- findFreqTerms(sms_dtm_train, 5)
str(sms_freq_words)
##  chr [1:1136] "abiola" "abl" "abt" "accept" "access" ...
# create DTMs with only the frequent terms
sms_dtm_freq_train <- sms_dtm_train[ , sms_freq_words]
sms_dtm_freq_test <- sms_dtm_test[ , sms_freq_words]

Since the Naive Bayes classifier is typically trained on categorical features, we convert the numeric counts in the sparse matrix into "Yes"/"No" categorical values, where "Yes" indicates that the word appears in the message.

# convert counts to "Yes"/"No" indicators
convert_counts <- function(x) {
  ifelse(x > 0, "Yes", "No")
}
# apply convert_counts() to the columns of the train/test data
sms_train <- apply(sms_dtm_freq_train, MARGIN = 2, convert_counts)
sms_test  <- apply(sms_dtm_freq_test, MARGIN = 2, convert_counts)
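
The result of apply() here is a character matrix of "Yes"/"No" indicators, with one row per message and one column per frequent term; a quick peek (output omitted):

# rows = messages, columns = frequent terms, cells = "Yes"/"No"
sms_train[1:3, 1:5]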

3. Training a model on the data

We train the Naive Bayes classifier with the naiveBayes() function from the e1071 package.

library(e1071)
sms_classifier <- naiveBayes(sms_train, sms_train_labels)
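
The fitted e1071 model stores the class distribution and one conditional probability table per word, which can be inspected directly (output omitted):

# class counts from the training data, plus the first two per-word tables
sms_classifier$apriori
sms_classifier$tables[1:2]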

4. Evaluating Model Performance

We store the Naive Bayes predictions in sms_test_pred and then compare them with the true labels.

sms_test_pred <- predict(sms_classifier, sms_test)

head(sms_test_pred)
## [1] ham  ham  ham  ham  spam ham 
## Levels: ham spam
library(gmodels)
CrossTable(sms_test_pred, sms_test_labels,
           prop.chisq = FALSE, prop.t = FALSE, prop.r = FALSE,
           dnn = c('predicted', 'actual'))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1390 
## 
##  
##              | actual 
##    predicted |       ham |      spam | Row Total | 
## -------------|-----------|-----------|-----------|
##          ham |      1201 |        30 |      1231 | 
##              |     0.995 |     0.164 |           | 
## -------------|-----------|-----------|-----------|
##         spam |         6 |       153 |       159 | 
##              |     0.005 |     0.836 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |      1207 |       183 |      1390 | 
##              |     0.868 |     0.132 |           | 
## -------------|-----------|-----------|-----------|
## 
## 

The overall accuracy of the model is 0.974. Six legitimate messages were classified as spam, and 30 spam messages went unrecognized. Legitimate messages labeled as spam (false positives) are the more serious problem, since people may miss important information. We will try to improve on this performance.
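
The accuracy can be verified directly from the table above, since correct predictions lie on the diagonal of the confusion matrix:

# accuracy = (1201 + 153) / 1390
conf_mat <- table(predicted = sms_test_pred, actual = sms_test_labels)
sum(diag(conf_mat)) / sum(conf_mat)
## [1] 0.9741007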

5. Improving Model Performance

We add a Laplace estimator to the model to improve performance. Without it, a word that appears in zero spam (or zero ham) training messages receives a conditional probability of zero, giving that single word an indisputable say in the classification; adding one to each word count prevents any lone word from vetoing a prediction.

sms_classifier2 <- naiveBayes(sms_train, sms_train_labels, laplace = 1)
sms_test_pred2 <- predict(sms_classifier2, sms_test)
CrossTable(sms_test_pred2, sms_test_labels,
           prop.chisq = FALSE, prop.t = FALSE, prop.r = FALSE,
           dnn = c('predicted', 'actual'))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1390 
## 
##  
##              | actual 
##    predicted |       ham |      spam | Row Total | 
## -------------|-----------|-----------|-----------|
##          ham |      1202 |        28 |      1230 | 
##              |     0.996 |     0.153 |           | 
## -------------|-----------|-----------|-----------|
##         spam |         5 |       155 |       160 | 
##              |     0.004 |     0.847 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |      1207 |       183 |      1390 | 
##              |     0.868 |     0.132 |           | 
## -------------|-----------|-----------|-----------|
## 
## 

In the revised model, the accuracy improved to 97.6%, and the number of false positives dropped from 6 to 5 messages.
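
The same check confirms the improvement for the revised model:

# accuracy = (1202 + 155) / 1390
conf_mat2 <- table(predicted = sms_test_pred2, actual = sms_test_labels)
sum(diag(conf_mat2)) / sum(conf_mat2)
## [1] 0.976259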