Introduction

This example is from the book Machine Learning with R by Brett Lantz, Chapter 4.

A link to the book: https://bit.ly/3gsf2e0

This project is for educational purposes only.

The aim of this project is to identify spam SMS text messages while taking into account the shorthand lingo common in texting. We will transform the data into a bag-of-words format.
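
As a toy illustration (hypothetical messages, not from this dataset), a bag-of-words representation turns each message into word counts and discards word order:

#Toy bag-of-words example (hypothetical messages):
#  msg1: "free prize call now"    msg2: "call me later"
#
#        call  free  later  me  now  prize
#  msg1     1     1      0   0    1      1
#  msg2     1     0      1   1    0      0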

Required packages

We will use the packages tm, SnowballC, wordcloud, e1071, and gmodels.
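
If any of these packages are not yet installed, they can be fetched from CRAN first (one-time setup):

#Install the required packages if needed
install.packages(c("tm", "SnowballC", "wordcloud", "e1071", "gmodels"))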

library(tm)
library(SnowballC)
library(wordcloud)
library(e1071)
library(gmodels)

Step 1 - collecting data

#Reading the csv file
sms_raw <- read.csv("sms_spam.csv", stringsAsFactors = FALSE)

#Exploring structure of the data
str(sms_raw)
## 'data.frame':    5559 obs. of  2 variables:
##  $ type: chr  "ham" "ham" "ham" "spam" ...
##  $ text: chr  "Hope you are having a good week. Just checking in" "K..give back my thanks." "Am also doing in cbe only. But have to pay." "complimentary 4 STAR Ibiza Holiday or £10,000 cash needs your URGENT collection. 09066364349 NOW from Landline"| __truncated__ ...
table(sms_raw$type)
## 
##  ham spam 
## 4812  747
#Convert the type column to a factor and inspect its structure
sms_raw$type <- factor(sms_raw$type)
str(sms_raw$type)
##  Factor w/ 2 levels "ham","spam": 1 1 1 2 2 1 1 1 2 1 ...

Step 2 - exploring and preparing the data

Data preparation - cleaning and standardizing text data

We will use the tm package to strip the text of numbers, punctuation, stop words, and other noise.
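
The tm package bundles a set of ready-made transformation functions, and getTransformations() lists what is available:

#List the transformations bundled with tm
getTransformations()
#returns removeNumbers, removePunctuation, removeWords, stemDocument, stripWhitespace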

#Use VectorSource() to create a source object, then feed it to VCorpus() to create a volatile corpus stored in memory
sms_corpus <- VCorpus(VectorSource(sms_raw$text))

#Print the corpus
print(sms_corpus)
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 5559
#Inspect the first two SMS messages
inspect(sms_corpus[1:2])
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 2
## 
## [[1]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 49
## 
## [[2]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 23
#View a specific message
as.character(sms_corpus[[1]])
## [1] "Hope you are having a good week. Just checking in"
#View multiple messages

lapply(sms_corpus[1:3], as.character)
## $`1`
## [1] "Hope you are having a good week. Just checking in"
## 
## $`2`
## [1] "K..give back my thanks."
## 
## $`3`
## [1] "Am also doing in cbe only. But have to pay."

Now we will clean the corpus by applying a series of transformations with tm_map().

#Convert all letters to lower case using the tolower() function
sms_corpus_clean <- tm_map(sms_corpus, content_transformer(tolower))

#Inspect first message
as.character(sms_corpus_clean[[1]])
## [1] "hope you are having a good week. just checking in"
#Remove numbers
sms_corpus_clean<- tm_map(sms_corpus_clean, removeNumbers)

Next we will remove stop words and punctuation

#Remove stop words
sms_corpus_clean <- tm_map(sms_corpus_clean, removeWords, stopwords())

#Remove punctuation
sms_corpus_clean <- tm_map(sms_corpus_clean, removePunctuation)
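
Note that removePunctuation() deletes punctuation characters outright, which can fuse adjacent words together, a caveat worth checking on your own data:

#removePunctuation strips characters completely, which can join words
removePunctuation("hello...world")
#returns "helloworld" rather than "hello world"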

The next data preparation step is stemming, which strips suffixes from words to reduce them to their base form.
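
For example, the SnowballC package's wordStem() function, which stemDocument() relies on, maps several variants of a word to one root:

#wordStem() from SnowballC reduces related word forms to a common root
wordStem(c("learn", "learned", "learning", "learns"))
#returns "learn" "learn" "learn" "learn"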

#Apply the stemDocument() function to the corpus
sms_corpus_clean <- tm_map(sms_corpus_clean, stemDocument)

Now it is time to strip the extra white space left behind by the earlier steps.

#Strip extra white space
sms_corpus_clean <- tm_map(sms_corpus_clean, stripWhitespace)

#Inspect SMS messages after transformation
lapply(sms_corpus_clean[1:3], as.character)
## $`1`
## [1] "hope good week just check"
## 
## $`2`
## [1] "kgive back thank"
## 
## $`3`
## [1] "also cbe pay"

Data preparation - splitting text documents into words

Now we will create a document-term matrix (DTM), in which each row is an SMS message and each column is a word; this splitting process is called tokenization.

#Create a DTM sparse matrix 
sms_dtm <- DocumentTermMatrix(sms_corpus_clean)

sms_dtm
## <<DocumentTermMatrix (documents: 5559, terms: 6582)>>
## Non-/sparse entries: 42182/36547156
## Sparsity           : 100%
## Maximal term length: 40
## Weighting          : term frequency (tf)
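
As an aside, DocumentTermMatrix() can perform the same cleanup in a single call via a control list; the result may differ slightly from ours because the one-step route applies the operations in a different order:

#Alternative: clean and tokenize the raw corpus in one step
sms_dtm2 <- DocumentTermMatrix(sms_corpus, control = list(
    tolower = TRUE,
    removeNumbers = TRUE,
    stopwords = TRUE,
    removePunctuation = TRUE,
    stemming = TRUE
))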

Data preparation - creating training and test datasets

#Split sms_dtm into training (75%) and test (25%) datasets
sms_dtm_train <- sms_dtm[1:4169, ]
sms_dtm_test <- sms_dtm[4170:5559, ]

#Creating labels
sms_train_labels <- sms_raw[1:4169, ]$type
sms_test_labels <- sms_raw[4170:5559, ]$type
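
This sequential split is fine here because the SMS messages arrive in essentially random order; if the data were sorted, a random split would be safer (a minimal sketch, assuming a reproducible seed):

#Hypothetical random split (not needed for this dataset)
set.seed(123)
train_idx <- sample(nrow(sms_dtm), floor(0.75 * nrow(sms_dtm)))
#sms_dtm_train <- sms_dtm[train_idx, ]; sms_train_labels <- sms_raw$type[train_idx]
#sms_dtm_test  <- sms_dtm[-train_idx, ]; sms_test_labels <- sms_raw$type[-train_idx]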

#Check that the train and test labels have a similar ratio of spam to ham
prop.table(table(sms_train_labels))
## sms_train_labels
##       ham      spam 
## 0.8647158 0.1352842
prop.table(table(sms_test_labels))
## sms_test_labels
##       ham      spam 
## 0.8683453 0.1316547

Visualizing text data - word clouds

#We will use the wordcloud package to create a word cloud directly from the cleaned corpus
wordcloud(sms_corpus_clean, min.freq = 50, random.order = FALSE)

Now we build separate word clouds for the spam and ham messages, using the raw data.

spam <- subset(sms_raw, type == "spam")
ham <- subset(sms_raw, type == "ham")

#Create word clouds for each class. Due to errors while knitting the R Markdown file, the spam and ham word clouds are not printed here

#wordcloud(spam$text, max.words = 40, scale = c(3, 0.5))
#wordcloud(ham$text, max.words = 40, scale = c(3, 0.5))

Data preparation - creating indicator features for frequent words

To minimize the number of features, we will remove words that appear fewer than five times in the training data.

#Find words appearing at least 5 times in the training data; this returns a character vector
sms_freq_words <- findFreqTerms(sms_dtm_train, 5)

#Inspect the vector
str(sms_freq_words)
##  chr [1:1139] "€â\200œ" "â£wk" "abiola" "abl" "abt" "accept" "access" ...
#Filter both DTMs to include only the words in the frequent-words vector

sms_dtm_freq_train <- sms_dtm_train[, sms_freq_words]
sms_dtm_freq_test <- sms_dtm_test[, sms_freq_words]

#Naive Bayes works with categorical features, so convert the counts into a Yes/No indicator of whether each word appears

convert_counts <- function(x) {
    ifelse(x > 0, "Yes", "No")
}
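
A quick check on a toy vector shows what the function does:

#Sanity check on a hypothetical vector of counts
convert_counts(c(0, 1, 3))
#returns "No" "Yes" "Yes"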

sms_train <- apply(sms_dtm_freq_train, MARGIN = 2, convert_counts)
sms_test <- apply(sms_dtm_freq_test, MARGIN = 2, convert_counts)

Step 3 - training a model on the data

Building the model

We will use the naiveBayes() function from the e1071 package to build the model.

sms_classifier <- naiveBayes(sms_train, sms_train_labels)
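
The fitted object stores a conditional probability table per word; we can peek at one of them (assuming "call" survived the frequency filter, which it typically does in this dataset):

#Inspect the conditional probabilities for one word (assumes "call" is a feature)
sms_classifier$tables[["call"]]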

Step 4 - evaluating model performance

Perform predictions

Now we will use the classifier to predict the class of each SMS message in the sms_test dataset.

sms_test_pred <- predict(sms_classifier, sms_test)
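
predict() can also return posterior probabilities rather than hard labels by passing type = "raw", which is useful if you want to tune a custom spam threshold:

#Posterior probabilities instead of class labels
sms_test_prob <- predict(sms_classifier, sms_test, type = "raw")
head(sms_test_prob)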

Evaluate the model

CrossTable(sms_test_pred, sms_test_labels, 
           prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE, 
           dnn = c("predicted", "actual"))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1390 
## 
##  
##              | actual 
##    predicted |       ham |      spam | Row Total | 
## -------------|-----------|-----------|-----------|
##          ham |      1201 |        30 |      1231 | 
##              |     0.864 |     0.022 |           | 
## -------------|-----------|-----------|-----------|
##         spam |         6 |       153 |       159 | 
##              |     0.004 |     0.110 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |      1207 |       183 |      1390 | 
## -------------|-----------|-----------|-----------|
## 
## 

Treating spam as the positive class, the table shows 30 false negatives (spam labeled as ham) and 6 false positives (ham labeled as spam). The 6 legitimate messages wrongly filtered as spam may be the more critical error, since users could miss important texts.
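
From the table, overall accuracy is the sum of the diagonal cells over the total number of observations:

#Accuracy from the confusion matrix above
(1201 + 153) / 1390
#about 0.9741, i.e. roughly 97.4% of messages classified correctly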

Step 5 - improving model performance

We will apply a Laplace estimator by setting the laplace parameter of the classifier.

I set laplace = 0.1, which reduced the number of ham messages misclassified as spam from 6 to 5 and the number of missed spam messages from 30 to 26. When I tried laplace = 1, the ham misclassifications rose to 18 messages, so I kept laplace as small as I could.
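
The Laplace estimator adds a small pseudo-count to every word/class combination so that a word never seen in one class does not force its conditional probability to zero. A sketch of the estimate for a single word (my notation, not from the book):

#P(word | class) with Laplace smoothing, for a word seen `count` times
#among `total` messages of that class (2 levels: "Yes"/"No"):
#  (count + laplace) / (total + 2 * laplace)
(0 + 0.1) / (1000 + 2 * 0.1)
#about 0.0001 instead of exactly 0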

#Build the model
sms_classifier2 <- naiveBayes(sms_train, sms_train_labels, laplace = 0.1)

#Predict the labels on the sms_test data set
sms_test_pred2 <- predict(sms_classifier2, sms_test)

#construct the confusion matrix
CrossTable(sms_test_pred2, sms_test_labels, 
           prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE, 
           dnn = c("predicted", "actual"))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1390 
## 
##  
##              | actual 
##    predicted |       ham |      spam | Row Total | 
## -------------|-----------|-----------|-----------|
##          ham |      1202 |        26 |      1228 | 
##              |     0.865 |     0.019 |           | 
## -------------|-----------|-----------|-----------|
##         spam |         5 |       157 |       162 | 
##              |     0.004 |     0.113 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |      1207 |       183 |      1390 | 
## -------------|-----------|-----------|-----------|
## 
## 

Conclusion

Using the Naive Bayes algorithm, this project classifies SMS messages with an accuracy above 97% ((1202 + 157) / 1390, roughly 97.8%, for the improved model).