Used Libraries

library(tm)        # text mining
library(wordcloud) # creating word clouds
library(e1071)     # naive Bayes implementation
library(gmodels)   # cross tables
library(ROCR)      # visualizing ROC curves
library(caret)     # classification training and evaluation
library(vcd)       # Kappa statistic
library(irr)       # another package for the Kappa statistic

Load Data

sms_raw <- read.csv("sms_spam.csv", stringsAsFactors = FALSE) 
str(sms_raw)
## 'data.frame':    5574 obs. of  2 variables:
##  $ type: chr  "ham" "ham" "spam" "ham" ...
##  $ text: chr  "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..." "Ok lar... Joking wif u oni..." "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C"| __truncated__ "U dun say so early hor... U c already then say..." ...
sms_raw$type <- factor(sms_raw$type) 
str(sms_raw$type)
##  Factor w/ 2 levels "ham","spam": 1 1 2 1 1 2 1 1 2 2 ...

Data Pre-Processing

I first create the corpus, which contains 5,574 text messages. Before splitting it into words, I clean it by converting all letters to lowercase, removing numbers, punctuation, and stop words, and stripping extra white space between words.

sms_corpus <- Corpus(VectorSource(sms_raw$text)) 
print(sms_corpus) 
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 5574
#inspect(sms_corpus[1:3]) 
lapply(sms_corpus[1:3], as.character)
## $`1`
## [1] "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
## 
## $`2`
## [1] "Ok lar... Joking wif u oni..."
## 
## $`3`
## [1] "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"
# data pre-processing:
corpus_clean <- tm_map(sms_corpus, content_transformer(tolower)) 
corpus_clean <- tm_map(corpus_clean, removeNumbers) 
corpus_clean <- tm_map(corpus_clean, removeWords, stopwords())
corpus_clean <- tm_map(corpus_clean, removePunctuation)
corpus_clean <- tm_map(corpus_clean, stripWhitespace)

#inspect(corpus_clean[1:3])
lapply(corpus_clean[1:3], as.character)
## $`1`
## [1] "go jurong point crazy available bugis n great world la e buffet cine got amore wat"
## 
## $`2`
## [1] "ok lar joking wif u oni"
## 
## $`3`
## [1] "free entry wkly comp win fa cup final tkts st may text fa receive entry questionstd txt ratetcs apply s"
sms_dtm <- DocumentTermMatrix(corpus_clean) # create a sparse matrix of tokens
sms_dtm
## <<DocumentTermMatrix (documents: 5574, terms: 7957)>>
## Non-/sparse entries: 43114/44309204
## Sparsity           : 100%
## Maximal term length: 40
## Weighting          : term frequency (tf)

After cleaning, as the examples show, the messages have been reduced to the most informative words, and punctuation and capitalization have been removed. The messages were then tokenized, and the sparse document-term matrix sms_dtm was created.
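As a quick sanity check (not shown in the original output), a small slice of the document-term matrix can be printed; the row and column ranges below are arbitrary:

inspect(sms_dtm[1:5, 10:15]) # a few documents and terms of the sparse matrix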

Data Preparation

Creating Training and Testing Sets

I divide the data into a training set (the first 75% of messages) and a test set (the remaining 25%).

sms_raw_train <- sms_raw[1:4169, ]
sms_raw_test  <- sms_raw[4170:5574, ]

sms_dtm_train <- sms_dtm[1:4169, ] 
sms_dtm_test  <- sms_dtm[4170:5574, ] 

sms_corpus_train <- corpus_clean[1:4169] 
sms_corpus_test  <- corpus_clean[4170:5574]

To confirm that the subsets are representative of the complete set of SMS data, I compare the proportion of spam in the training and test data frames:

prop.table(table(sms_raw_train$type))
## 
##       ham      spam 
## 0.8647158 0.1352842
prop.table(table(sms_raw_test$type))
## 
##       ham      spam 
## 0.8697509 0.1302491

Both the training data and test data contain about 13 percent spam. This suggests that the spam messages were divided evenly between the two datasets.
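Since the split above relies on the messages already being in reasonably random order, an alternative (not used further here) would be a stratified split with caret's createDataPartition; a minimal sketch, with an arbitrary seed:

set.seed(123)                                                         # arbitrary seed
in_train <- createDataPartition(sms_raw$type, p = 0.75, list = FALSE) # stratified 75% sample of row indices
prop.table(table(sms_raw$type[in_train]))                             # spam proportion in the training indices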

Visualizing Word Clouds

I create a word cloud from the sms_corpus_train corpus. The word placement is not random here: higher-frequency words appear closer to the center.

Next, I compare the clouds for SMS spam and ham. The spam words are on the left, and the ham words on the right.

col <- brewer.pal(6, "Dark2") # colour palette for the word clouds
wordcloud(sms_corpus_train, min.freq = 40, random.order = FALSE, colors=col) 

spam <- subset(sms_raw_train, type == "spam") 
ham <- subset(sms_raw_train, type == "ham")

par(mfrow=c(1,2))
wordcloud(spam$text, max.words = 40, scale = c(3, 0.5), colors = col)
wordcloud(ham$text, max.words = 40, scale = c(3, 0.5), colors = col)

Spam SMS messages include words such as urgent, free, mobile, call, claim, and stop; these terms do not appear in the ham cloud at all. Ham messages instead use words such as can, sorry, need, and time. These stark differences suggest that the naive Bayes model will have strong keywords to differentiate between the classes.
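A rough way to quantify this (a sketch; "free" is just an illustrative keyword) is to compare how often a term occurs in the raw spam and ham training messages:

mean(grepl("free", tolower(spam$text))) # share of spam training messages containing "free"
mean(grepl("free", tolower(ham$text)))  # share of ham training messages containing "free"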

Creating Indicator Features for Frequent Words

Now I transform the sparse matrix into a data structure that can be used to train a naive Bayes classifier. Currently, the sparse matrix contains 7,957 features: one for every word that appears in at least one SMS message. It is very unlikely that all of these are useful for classification. To reduce the number of features, I eliminate any word that appears in fewer than five SMS messages.

head(findFreqTerms(sms_dtm_train, 5), 25) # appear at least 5 times
##  [1] "â<U+0080><U+0093>"       "abiola"    "able"      "abt"       "accept"   
##  [6] "access"    "account"   "across"    "activate"  "actually" 
## [11] "add"       "address"   "admirer"   "advance"   "aft"      
## [16] "afternoon" "aftr"      "age"       "ago"       "ahead"    
## [21] "aight"     "aint"      "air"       "aiyah"     "alex"
Dictionary <- function(x) {
    # helper: a "dictionary" here is simply a character vector of terms
    if( is.character(x) ) {
        return (x)
    }
    stop('x is not a character vector')
}

# save this list of frequent terms for use later:
sms_dict <- Dictionary(findFreqTerms(sms_dtm_train, 5)) 

A dictionary allows me to specify which words should appear in a document-term matrix. Next, I limit the training and test matrices to only the words in the preceding dictionary:

sms_train <- DocumentTermMatrix(sms_corpus_train, list(dictionary = sms_dict)) 
sms_test  <- DocumentTermMatrix(sms_corpus_test, list(dictionary = sms_dict))

The training and test data now include roughly 1,200 features, corresponding only to words appearing in at least five messages.
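The exact feature count can be verified directly (a quick check, not part of the original output):

ncol(sms_train) # number of terms kept after applying the dictionary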

The naive Bayes classifier is typically trained on data with categorical features. In the sparse matrix, I therefore convert the count of the times a word appears in a message into a categorical variable with values Yes/No, indicating whether the word appears at all.

convert_counts <- function(x) {
  x <- ifelse(x > 0, 1, 0)                                  # 1 if the word appears at all, 0 otherwise
  x <- factor(x, levels = c(0, 1), labels = c("No", "Yes")) # recode as a No/Yes factor
  return(x)
}

sms_train <- apply(sms_train, MARGIN = 2, convert_counts) # MARGIN = 2: apply column-wise (per word)
sms_test  <- apply(sms_test, MARGIN = 2, convert_counts)

Training the Model on the Data

Now, I build the model on the sms_train matrix using a naive Bayes method:

sms_classifier <- naiveBayes(sms_train, sms_raw_train$type) 
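The fitted object stores, among other things, one conditional probability table per feature. A minimal peek, assuming the term "free" made it into the dictionary:

sms_classifier$tables[["free"]] # P(free = No/Yes | ham) and P(free = No/Yes | spam)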

Evaluating Model Performance

In order to evaluate the model, I first run the predictions and then produce a CrossTable and a confusion matrix. To be sure about the results, I additionally calculate the Kappa statistic.

sms_test_pred <- predict(sms_classifier, sms_test) 
predicted_prob <- predict(sms_classifier, sms_test, type = "raw")

sms_results <- data.frame(actual_type = sms_raw_test$type,
                          predict_type = sms_test_pred,
                          prob_spam = predicted_prob[,2])

CrossTable(sms_test_pred, 
           sms_raw_test$type,
           prop.chisq = FALSE, 
           prop.t = FALSE,
           dnn = c('predicted', 'actual')) 
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1405 
## 
##  
##              | actual 
##    predicted |       ham |      spam | Row Total | 
## -------------|-----------|-----------|-----------|
##          ham |      1216 |        28 |      1244 | 
##              |     0.977 |     0.023 |     0.885 | 
##              |     0.995 |     0.153 |           | 
## -------------|-----------|-----------|-----------|
##         spam |         6 |       155 |       161 | 
##              |     0.037 |     0.963 |     0.115 | 
##              |     0.005 |     0.847 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |      1222 |       183 |      1405 | 
##              |     0.870 |     0.130 |           | 
## -------------|-----------|-----------|-----------|
## 
## 
confusionMatrix(sms_results$predict_type,
                sms_results$actual_type,
                positive = "spam")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  ham spam
##       ham  1216   28
##       spam    6  155
##                                           
##                Accuracy : 0.9758          
##                  95% CI : (0.9663, 0.9832)
##     No Information Rate : 0.8698          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8874          
##  Mcnemar's Test P-Value : 0.0003164       
##                                           
##             Sensitivity : 0.8470          
##             Specificity : 0.9951          
##          Pos Pred Value : 0.9627          
##          Neg Pred Value : 0.9775          
##              Prevalence : 0.1302          
##          Detection Rate : 0.1103          
##    Detection Prevalence : 0.1146          
##       Balanced Accuracy : 0.9210          
##                                           
##        'Positive' Class : spam            
## 
Kappa(table(sms_results$actual_type, sms_results$predict_type))
##             value     ASE     z Pr(>|z|)
## Unweighted 0.8874 0.01898 46.76        0
## Weighted   0.8874 0.01898 46.76        0
kappa2(sms_results[1:2])
##  Cohen's Kappa for 2 Raters (Weights: unweighted)
## 
##  Subjects = 1405 
##    Raters = 2 
##     Kappa = 0.887 
## 
##         z = 33.4 
##   p-value = 0

The cross table shows that around 0.5 percent of ham messages were incorrectly classified as spam, which results in a high specificity. About 15 percent of the actual spam messages were incorrectly classified as ham, which results in a somewhat lower sensitivity. Overall, the results of the model are quite good, with an accuracy of about 0.98 and a high kappa value of 0.89.
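The same sensitivity and specificity can be obtained directly from caret's helper functions (a quick cross-check of the confusion matrix output):

sensitivity(sms_results$predict_type, sms_results$actual_type, positive = "spam")
specificity(sms_results$predict_type, sms_results$actual_type, negative = "ham")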

Visualizing Performance

The ROC curve shows that the classifier is much closer to a perfect classifier than to the dashed diagonal, which indicates a classifier with no predictive value.

pred <- prediction(predictions = sms_results$prob_spam, 
                   labels = sms_results$actual_type)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf, main = "ROC curve for SMS spam filter",
     col = "blue", lwd = 2)
abline(a = 0, b = 1, lwd = 1, lty = 2)

perf.auc <- performance(pred, measure = "auc")
unlist(perf.auc@y.values)
## [1] 0.9945556

The area under the ROC curve (AUC) indicates an “outstanding” performance of the classifier. Nevertheless, I will look for ways to improve on the current prediction.

Improving Model Performance

In order to improve the model performance, I now add the Laplace estimator. It adds a small count (here 1) to each word/class combination, so that a word that never appeared in one class during training does not force the estimated probability of that class to zero.

sms_classifier2 <- naiveBayes(sms_train, sms_raw_train$type, laplace = 1) 
sms_test_pred2 <- predict(sms_classifier2, sms_test)
CrossTable(sms_test_pred2, 
           sms_raw_test$type,
           prop.chisq = FALSE,
           prop.t = FALSE, 
           prop.r = FALSE,
           dnn = c('predicted', 'actual'))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1405 
## 
##  
##              | actual 
##    predicted |       ham |      spam | Row Total | 
## -------------|-----------|-----------|-----------|
##          ham |      1217 |        30 |      1247 | 
##              |     0.996 |     0.164 |           | 
## -------------|-----------|-----------|-----------|
##         spam |         5 |       153 |       158 | 
##              |     0.004 |     0.836 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |      1222 |       183 |      1405 | 
##              |     0.870 |     0.130 |           | 
## -------------|-----------|-----------|-----------|
## 
## 
sms_results <- cbind(sms_results, predict_type2 = sms_test_pred2)

confusionMatrix(sms_results$predict_type2,
                sms_results$actual_type,
                positive = "spam")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  ham spam
##       ham  1217   30
##       spam    5  153
##                                           
##                Accuracy : 0.9751          
##                  95% CI : (0.9655, 0.9826)
##     No Information Rate : 0.8698          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8833          
##  Mcnemar's Test P-Value : 4.976e-05       
##                                           
##             Sensitivity : 0.8361          
##             Specificity : 0.9959          
##          Pos Pred Value : 0.9684          
##          Neg Pred Value : 0.9759          
##              Prevalence : 0.1302          
##          Detection Rate : 0.1089          
##    Detection Prevalence : 0.1125          
##       Balanced Accuracy : 0.9160          
##                                           
##        'Positive' Class : spam            
## 

Thus, I have managed to reduce the number of false positives (ham messages erroneously classified as spam) from 6 to 5. Although this is a small improvement, it is important, because it lowers the risk that legitimate messages are lost when the model filters spam too aggressively.
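To see what the Laplace estimator changes under the hood, the conditional probability tables of the two models can be compared for a single term; a sketch, with "tomorrow" as an arbitrary example term assumed to be in the dictionary:

sms_classifier$tables[["tomorrow"]]  # without Laplace smoothing
sms_classifier2$tables[["tomorrow"]] # with laplace = 1: estimates shifted slightly toward uniform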

predicted_prob <- predict(sms_classifier2, sms_test, type = "raw")

sms_results <- cbind(sms_results, prob_spam2 = predicted_prob[,2])

Visualizing Comparative Performance

Comparing the two ROC curves (blue for the first classifier, red for the second one with the Laplace estimator) shows that the second classifier is slightly closer to a perfect classifier than the first.

pred <- prediction(predictions = sms_results$prob_spam, 
                   labels = sms_results$actual_type)
pred2 <- prediction(predictions = sms_results$prob_spam2, 
                   labels = sms_results$actual_type)

perf <- performance(pred, measure = "tpr", x.measure = "fpr")
perf2 <- performance(pred2, measure = "tpr", x.measure = "fpr")

plot(perf, main = "ROC curves for SMS spam filter",
     col = "blue", lwd = 2)
plot(perf2, col = "red", lwd = 1, add = TRUE)
abline(a = 0, b = 1, lwd = 1, lty = 2)

perf.auc <- performance(pred, measure = "auc")
unlist(perf.auc@y.values)
## [1] 0.9945556
perf.auc2 <- performance(pred2, measure = "auc")
unlist(perf.auc2@y.values)
## [1] 0.994815

The comparison of AUC values confirms this: the value for the second classifier is slightly higher. This suggests that the second classifier, with the Laplace estimator, performs somewhat better overall. However, one should not forget that the number of correctly detected spam messages (true positives) decreased with the second classifier; the improvement comes mostly from the increase in specificity.