The objective of this project is to classify SMS messages as spam or ham (not spam) using a Naive Bayes classifier. This example is taken from Chapter 4 of Machine Learning with R, Second Edition.
An example of the conditional probability that will be computed is P(Spam|Hospital) = P(Hospital|Spam)P(Spam)/P(Hospital), which gives the probability that a message is spam given that it contains the word “Hospital”.
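To make the formula concrete, the sketch below plugs made-up counts into Bayes' rule in base R. The numbers and variable names are purely illustrative assumptions; they are not taken from the SMS data.

```r
# Hypothetical counts, chosen only to illustrate Bayes' rule
n_messages      <- 1000   # total messages (assumed)
n_spam          <- 200    # messages that are spam (assumed)
n_hospital      <- 50     # messages containing "hospital" (assumed)
n_spam_hospital <- 10     # spam messages containing "hospital" (assumed)

p_spam                <- n_spam / n_messages        # P(Spam)
p_hospital            <- n_hospital / n_messages    # P(Hospital)
p_hospital_given_spam <- n_spam_hospital / n_spam   # P(Hospital|Spam)

# Bayes' rule: P(Spam|Hospital) = P(Hospital|Spam) * P(Spam) / P(Hospital)
p_spam_given_hospital <- p_hospital_given_spam * p_spam / p_hospital
p_spam_given_hospital
```

With these counts the result equals the direct ratio n_spam_hospital / n_hospital, which is a quick sanity check on the formula.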
1. Loading Data
Load the data into R.
setwd("/Users/sauce/desktop/R")
sms_raw <- read.csv("SMSSpamCollection", header = FALSE, sep = "\t", quote = "", col.names = c("type","text"), stringsAsFactors = FALSE)
The data frame has 5574 observations of 2 variables: “type” (either spam or ham) and “text” (the message itself).
str(sms_raw)
As seen above, the “type” element is a character vector. Change it to a factor for the analysis.
sms_raw$type <- factor(sms_raw$type)
str(sms_raw$type)
table(sms_raw$type)
Now the “type” variable is a factor with 2 levels. Of the 5574 messages, 747 are spam, which is about 13.4%.
2. Text Mining
For Naive Bayes to run effectively, the text data needs to be transformed. This begins with using the “tm” package in R to create a volatile corpus that contains the “text” vector from our data frame.
install.packages("tm")
library(tm)
sms_corpus <- VCorpus(VectorSource(sms_raw$text))
print(sms_corpus)
Check out the first few messages in the new corpus, which is basically a list that can be manipulated with list operations.
inspect(sms_corpus[1:3])
Use the “as.character” function to see what a single message looks like.
as.character(sms_corpus[[3]])
In order to standardize the messages, the text must be transformed to all lower case letters, so that the words “Free”, “free”, and “FREE” are all treated as the same word. Use the “tm_map” function to apply a transformation to every message, and wrap “tolower” in the “content_transformer” function so that it can be used on the corpus.
sms_corpus_clean <- tm_map(sms_corpus, content_transformer(tolower))
Remove words that appear often but don’t contribute to our objective. These stop words include “to”, “and”, “but”, and “or”.
sms_corpus_clean <- tm_map(sms_corpus_clean, removeWords, stopwords())
Remove punctuation as well using the “removePunctuation” function.
sms_corpus_clean <- tm_map(sms_corpus_clean, removePunctuation)
as.character(sms_corpus_clean[[3]])
Perform “stemming” on the text data to strip the suffixes from words, so that “jumping”, “jumps”, and “jumped” are all transformed into “jump”. Stemming can be performed using the “tm” package with help from the “SnowballC” package.
install.packages("SnowballC")
library(SnowballC)
sms_corpus_clean <- tm_map(sms_corpus_clean, stemDocument)
The final step in cleaning the text is to collapse the extra white space left behind by the previous steps.
sms_corpus_clean <- tm_map(sms_corpus_clean, stripWhitespace)
as.character(sms_corpus_clean[[3]])
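To see roughly what the cleaning steps do to a single message, here is a minimal sketch in base R. It uses a made-up message and plain string functions rather than the tm functions above, and it skips stop word removal and stemming, so it only approximates the pipeline.

```r
# A made-up message, used only for illustration
msg <- "FREE entry!!   Call   NOW, win a prize."

msg <- tolower(msg)                   # lower case
msg <- gsub("[[:punct:]]", "", msg)   # drop punctuation
msg <- gsub("\\s+", " ", msg)         # collapse runs of white space
msg <- trimws(msg)                    # trim leading and trailing spaces
msg
```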
Perform tokenization using the “DocumentTermMatrix” function. This creates a matrix in which the rows indicate documents (SMS messages in this case) and the columns indicate words. It should be noted that the “DocumentTermMatrix” function can also perform all of the text cleaning above in one command, through its control argument.
sms_dtm <- DocumentTermMatrix(sms_corpus_clean)
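As a sketch of the one-command route, the cleaning options can be passed to DocumentTermMatrix through a control list. The two messages and the names toy_corpus and toy_dtm are made up for illustration. The resulting matrix may differ slightly from the step-by-step version because the cleaning steps are not necessarily applied in the same order.

```r
library(tm)  # stemming = TRUE also relies on the SnowballC package

# Two made-up messages, used only to keep the example small
toy_corpus <- VCorpus(VectorSource(c("FREE entry! Call now.",
                                     "Are you free tonight?")))

# Ask DocumentTermMatrix to do the cleaning while it tokenizes
toy_dtm <- DocumentTermMatrix(toy_corpus, control = list(
  tolower           = TRUE,
  removePunctuation = TRUE,
  stopwords         = TRUE,
  stemming          = TRUE
))

inspect(toy_dtm)
```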
3. Data Preparation
Split our data into training and testing sets, so that after the Naive Bayes spam filter is built it can be applied to unseen data. Divide the data set into 75% training and 25% testing.
.75 * 5574
[1] 4180.5
.25 * 5574
[1] 1393.5
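Because the messages are already in random order, the first 4180 can be used for the training set; there’s no need to randomize the data first. Note that the testing rows below stop at 5559, so the last 15 of the 5574 messages are left out of both sets.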
sms_dtm_train <- sms_dtm[1:4180, ]
sms_dtm_test <- sms_dtm[4181:5559, ]
Save the labels for the rows of the training and testing sets.
sms_train_labels <- sms_raw[1:4180, ]$type
sms_test_labels <- sms_raw[4181:5559,]$type
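The split above relies on the messages already being in random order. If that were not the case, a random split could be drawn with sample(). The following is a sketch in base R; the seed and the names n_rows, train_idx, and test_idx are illustrative and not part of the original analysis.

```r
set.seed(123)                       # arbitrary seed, for reproducibility
n_rows    <- 5574                   # number of messages reported earlier
train_idx <- sample(n_rows, size = floor(0.75 * n_rows))  # 75% of the row numbers
test_idx  <- setdiff(seq_len(n_rows), train_idx)          # the remaining 25%

length(train_idx)
length(test_idx)

# These indices could then replace the fixed ranges, e.g.
# sms_dtm_train    <- sms_dtm[train_idx, ]
# sms_dtm_test     <- sms_dtm[test_idx, ]
# sms_train_labels <- sms_raw$type[train_idx]
# sms_test_labels  <- sms_raw$type[test_idx]
```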
Make sure that the proportion of spam is similar in the training and testing sets.
prop.table(table(sms_train_labels))
sms_train_labels
ham spam
0.8648325 0.1351675
prop.table(table(sms_test_labels))
sms_test_labels
ham spam
0.8694706 0.1305294
Each set has approx. 13% spam.
4. Visualization
Create a wordcloud of the frequency of the words in the dataset using the package “wordcloud”.
install.packages("wordcloud")
install.packages("RColorBrewer")
library(wordcloud)
wordcloud(sms_corpus_clean, max.words = 50, random.order = FALSE)

Compare wordclouds between spam and ham.
spam <- subset(sms_raw, type == "spam")
ham <- subset(sms_raw, type == "ham")
wordcloud(spam$text, max.words = 50, scale = c(3, 0.5))

wordcloud(ham$text, max.words = 50, scale = c(3, 0.5))

5. Preparation for Naive Bayes
Remove words from the matrix that appear fewer than 5 times in the training data.
sms_freq_words <- findFreqTerms(sms_dtm_train, 5)
str(sms_freq_words)
chr [1:1161] "abiola" "abl" "abt" "accept" ...
Limit our document-term matrices to only include the words in the sms_freq_words vector. We want all the rows, but only the columns for these frequent words.
sms_dtm_freq_train <- sms_dtm_train[ , sms_freq_words]
sms_dtm_freq_test <- sms_dtm_test[ , sms_freq_words]
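The idea behind this filtering step can be sketched in base R on an ordinary count matrix: add up each column and keep the words whose total meets the threshold. The matrix below is made up, and the sketch is only meant to mirror what findFreqTerms and the column subsetting do, not to replace them.

```r
# Made-up counts: 3 messages (rows) by 3 words (columns)
toy_counts <- matrix(c(3, 0, 1,
                       2, 1, 0,
                       1, 0, 0),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(NULL, c("call", "prize", "abiola")))

word_totals <- colSums(toy_counts)                 # total count of each word
toy_freq_words <- names(word_totals)[word_totals >= 5]  # words seen at least 5 times
toy_freq_words

# Keep all rows, but only the columns for the frequent words
toy_counts_freq <- toy_counts[, toy_freq_words, drop = FALSE]
dim(toy_counts_freq)
```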
The Naive Bayes classifier works with categorical features, so we need to convert the counts in the matrix to “Yes” and “No” categorical values. To do this we’ll build a convert_counts function and apply it to our data.
convert_counts <- function(x) {
  ifelse(x > 0, "Yes", "No")
}
This replaces values greater than 0 with “Yes” and all other values with “No”. Let’s apply it to our data.
sms_train <- apply(sms_dtm_freq_train, MARGIN = 2, convert_counts)
sms_test <- apply(sms_dtm_freq_test, MARGIN = 2, convert_counts)
The resulting matrices will be character type, with each cell holding “Yes” or “No” to indicate whether the word represented by the column appears in the message represented by the row.
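A small worked example may help here. The sketch below applies the same conversion to a made-up count matrix (toy_counts is an illustrative name), so the effect of apply with MARGIN = 2 can be checked by eye.

```r
convert_counts <- function(x) {
  ifelse(x > 0, "Yes", "No")
}

# A made-up count matrix: 3 messages (rows) by 2 words (columns)
toy_counts <- matrix(c(0, 2, 1,    # counts for "free"
                       1, 0, 0),   # counts for "call"
                     nrow = 3,
                     dimnames = list(NULL, c("free", "call")))

# MARGIN = 2 applies the function column by column
toy_yes_no <- apply(toy_counts, MARGIN = 2, convert_counts)
toy_yes_no
```

The result is a character matrix with the same shape and column names as the input.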
6. Train Model on the Data
Use the e1071 package to implement the Naive Bayes algorithm on the data, and predict whether a message is likely to be spam or ham.
install.packages("e1071")
library(e1071)
sms_classifier <- naiveBayes(sms_train, sms_train_labels)
7. Predict and Evaluate the Model
sms_test_pred <- predict(sms_classifier, sms_test)
Compare the predictions with the actual labels using a cross table from the gmodels package.
library(gmodels)
CrossTable(sms_test_pred, sms_test_labels, prop.chisq = FALSE, prop.t = FALSE, dnn = c('predicted', 'actual'))
   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|-------------------------|

Total Observations in Table:  1379

             | actual
   predicted |       ham |      spam | Row Total |
-------------|-----------|-----------|-----------|
         ham |      1190 |        20 |      1210 |
             |     0.983 |     0.017 |     0.877 |
             |     0.992 |     0.111 |           |
-------------|-----------|-----------|-----------|
        spam |         9 |       160 |       169 |
             |     0.053 |     0.947 |     0.123 |
             |     0.008 |     0.889 |           |
-------------|-----------|-----------|-----------|
Column Total |      1199 |       180 |      1379 |
             |     0.869 |     0.131 |           |
-------------|-----------|-----------|-----------|
29/1379
[1] 0.02102973
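The error rate is only one way to summarize the table. As a sketch, the four cell counts can be turned into accuracy, precision, and recall for the spam class in base R. The counts are copied from the table; the variable names are illustrative.

```r
# Counts copied from the CrossTable output
true_ham   <- 1190  # ham predicted as ham
false_ham  <- 20    # spam predicted as ham
false_spam <- 9     # ham predicted as spam
true_spam  <- 160   # spam predicted as spam
total <- true_ham + false_ham + false_spam + true_spam

accuracy  <- (true_ham + true_spam) / total        # share classified correctly
precision <- true_spam / (true_spam + false_spam)  # predicted spam that really is spam
recall    <- true_spam / (true_spam + false_ham)   # actual spam that was caught

round(c(accuracy = accuracy, precision = precision, recall = recall), 3)
```

The precision and recall values correspond to the row and column proportions in the spam/spam cell of the table (0.947 and 0.889).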
As shown in the table, only 29 of the 1379 messages were classified incorrectly (20 spam messages predicted as ham and 9 ham messages predicted as spam). This means that the algorithm classified the testing set as spam or ham with approx. 98% accuracy. That’s impressive. To improve the model, one might adjust the Laplace value, collect more SMS data, or try splitting the dataset randomly into training and testing sets.
I suspect the accuracy would increase as the dataset gets bigger. The more data there is to train the algorithm, the more effective it should be at predicting spam or ham.
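To illustrate what the Laplace value does, here is a minimal sketch of additive smoothing in base R. The counts, class totals, and the helper name smoothed are all made up for illustration; the point is only that a word never seen in one class gets a probability of exactly 0 without smoothing, and a small positive probability with it.

```r
# Made-up counts: how many spam and ham messages contain each word
word_counts <- matrix(c(30,  0,
                         5, 40),
                      nrow = 2, byrow = TRUE,
                      dimnames = list(c("spam", "ham"), c("free", "tonight")))
class_totals <- c(spam = 100, ham = 400)   # messages per class (assumed)

# Estimate P(word present | class) with a Laplace value added to each count.
# Each feature has two levels ("Yes"/"No"), so the denominator grows by 2 * laplace.
smoothed <- function(laplace) {
  (word_counts + laplace) / (class_totals + 2 * laplace)
}

smoothed(0)  # "tonight" never appears in spam, so its probability is exactly 0
smoothed(1)  # with laplace = 1 the zero becomes a small positive value

# In the analysis itself, the value would be passed to naiveBayes(), e.g.
# sms_classifier2 <- naiveBayes(sms_train, sms_train_labels, laplace = 1)
```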
Show the 5 most frequent words in the SMS data:
sms_tdm <- TermDocumentMatrix(sms_corpus_clean)  # rows are terms, columns are messages
term_counts <- as.matrix(sms_tdm)
term_freq <- sort(rowSums(term_counts), decreasing = TRUE)  # total count per term
freq_table <- data.frame(word = names(term_freq), freq = term_freq)
head(freq_table, 5)
And the 5 most frequent words from each class:
wordcloud(spam$text, max.words = 5, scale = c(3, 0.5))

wordcloud(ham$text, max.words = 5, scale = c(3, 0.5))

As shown in the word clouds, the most frequent words in the spam messages are “call”, “free”, “now”, “mobile”, and “text”. In the ham messages, the 5 most frequent words are “can”, “get”, “just”, “will”, and “now”.
