This example is from the book Machine Learning with R by Brett Lantz, Chapter 4.
A link to the book: https://bit.ly/3gsf2e0
This project is for educational purposes only.
The aim of this project is to identify spam SMS text messages, taking into account the shorthand lingo used in texting. We will transform the data into a bag-of-words representation.
We will use the packages tm, SnowballC, wordcloud, e1071, and gmodels.
library(tm)
library(SnowballC)
library(wordcloud)
library(e1071)
library(gmodels)
#Read the CSV file
sms_raw <- read.csv("sms_spam.csv", stringsAsFactors = FALSE)
#Explore the structure of the data
str(sms_raw)
## 'data.frame': 5559 obs. of 2 variables:
## $ type: chr "ham" "ham" "ham" "spam" ...
## $ text: chr "Hope you are having a good week. Just checking in" "K..give back my thanks." "Am also doing in cbe only. But have to pay." "complimentary 4 STAR Ibiza Holiday or £10,000 cash needs your URGENT collection. 09066364349 NOW from Landline"| __truncated__ ...
table(sms_raw$type)
##
## ham spam
## 4812 747
#Convert the type column to a factor and inspect its structure
sms_raw$type <- factor(sms_raw$type)
str(sms_raw$type)
## Factor w/ 2 levels "ham","spam": 1 1 1 2 2 1 1 1 2 1 ...
We will use the tm package to clean the text by removing numbers, punctuation, stop words, and other noise.
#Use VectorSource() to create a source object, then feed it to VCorpus() to create a volatile corpus stored in memory
sms_corpus <- VCorpus(VectorSource(sms_raw$text))
#Print the corpus
print(sms_corpus)
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 5559
#Inspect the first two SMS messages
inspect(sms_corpus[1:2])
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 2
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 49
##
## [[2]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 23
#View a specific message
as.character(sms_corpus[[1]])
## [1] "Hope you are having a good week. Just checking in"
#View multiple messages
lapply(sms_corpus[1:3], as.character)
## $`1`
## [1] "Hope you are having a good week. Just checking in"
##
## $`2`
## [1] "K..give back my thanks."
##
## $`3`
## [1] "Am also doing in cbe only. But have to pay."
Now we will transform the data by cleaning and standardizing the text.
#Convert all letters to lowercase using the tolower() function
sms_corpus_clean <- tm_map(sms_corpus, content_transformer(tolower))
#Inspect first message
as.character(sms_corpus_clean[[1]])
## [1] "hope you are having a good week. just checking in"
#Remove numbers
sms_corpus_clean <- tm_map(sms_corpus_clean, removeNumbers)
Next we will remove stop words and punctuation.
#Remove stop words
sms_corpus_clean <- tm_map(sms_corpus_clean, removeWords, stopwords())
#Remove punctuation
sms_corpus_clean <- tm_map(sms_corpus_clean, removePunctuation)
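Note that removePunctuation() deletes punctuation outright, which is why "K..give" becomes "kgive" in the cleaned output below. If that side effect is undesirable, a custom transformer could substitute spaces for punctuation instead; a minimal sketch, not applied here:
#Hypothetical alternative: replace punctuation with spaces rather than deleting it
replace_punctuation <- content_transformer(function(x) gsub("[[:punct:]]+", " ", x))
#sms_corpus_clean <- tm_map(sms_corpus_clean, replace_punctuation)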
The next data preparation step is stemming, which strips suffixes from words to reduce them to their base form.
#Apply the stemDocument() function to the corpus
sms_corpus_clean <- tm_map(sms_corpus_clean, stemDocument)
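To see what stemming does, SnowballC's wordStem() can be applied directly to a character vector:
#Quick illustration of stemming on a standalone vector
wordStem(c("learn", "learned", "learning", "learns"))  # all four reduce to "learn"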
Finally, we remove the extra whitespace left behind by the previous steps.
#Strip extra whitespace
sms_corpus_clean <- tm_map(sms_corpus_clean, stripWhitespace)
#Inspect SMS messages after transformation
lapply(sms_corpus_clean[1:3], as.character)
## $`1`
## [1] "hope good week just check"
##
## $`2`
## [1] "kgive back thank"
##
## $`3`
## [1] "also cbe pay"
Now we will create a document-term matrix (DTM), in which each SMS message is a row and each word is a column. Splitting the messages into individual words is called tokenization.
#Create a DTM sparse matrix
sms_dtm <- DocumentTermMatrix(sms_corpus_clean)
sms_dtm
## <<DocumentTermMatrix (documents: 5559, terms: 6582)>>
## Non-/sparse entries: 42182/36547156
## Sparsity : 100%
## Maximal term length: 40
## Weighting : term frequency (tf)
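As an aside, tm can apply roughly the same preprocessing in a single call through the control argument. The internal order of operations differs slightly from the step-by-step version, so the resulting DTM may not match exactly:
#One-step alternative: clean and tokenize directly from the raw corpus
sms_dtm2 <- DocumentTermMatrix(sms_corpus, control = list(
  tolower = TRUE,
  removeNumbers = TRUE,
  stopwords = TRUE,
  removePunctuation = TRUE,
  stemming = TRUE
))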
#Split sms_dtm into training (75%) and test (25%) sets
sms_dtm_train <- sms_dtm[1:4169, ]
sms_dtm_test <- sms_dtm[4170:5559, ]
#Create the matching labels for the train and test sets
sms_train_labels <- sms_raw[1:4169, ]$type
sms_test_labels <- sms_raw[4170:5559, ]$type
#Check that the train and test labels have a similar ratio of ham to spam
prop.table(table(sms_train_labels))
## sms_train_labels
## ham spam
## 0.8647158 0.1352842
prop.table(table(sms_test_labels))
## sms_test_labels
## ham spam
## 0.8683453 0.1316547
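This positional split works because the SMS messages arrive in essentially random order; for ordered data, a random split would be safer. A minimal sketch, with an arbitrary seed:
#Hypothetical random 75/25 split instead of the positional one above
set.seed(123)
train_idx <- sample(nrow(sms_dtm), floor(0.75 * nrow(sms_dtm)))
#sms_dtm_train <- sms_dtm[train_idx, ]; sms_dtm_test <- sms_dtm[-train_idx, ]
#sms_train_labels <- sms_raw$type[train_idx]; sms_test_labels <- sms_raw$type[-train_idx]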
#We will use the wordcloud package to create a word cloud directly from the cleaned corpus
wordcloud(sms_corpus_clean, min.freq = 50, random.order = FALSE)
Now we build separate word clouds for the spam and ham messages from the raw data.
spam <- subset(sms_raw, type == "spam")
ham <- subset(sms_raw, type == "ham")
#Create word clouds; because of errors while knitting the R Markdown file, the word clouds for the spam and ham subsets are not printed here
#wordcloud(spam$text, max.words = 40, scale = c(3, 0.5))
#wordcloud(ham$text, max.words = 40, scale = c(3, 0.5))
To reduce the number of features, we will keep only words that appear at least five times in the training data.
#Find frequent words (appearing at least 5 times, roughly 0.1% of messages); findFreqTerms() returns a character vector
sms_freq_words <- findFreqTerms(sms_dtm_train, 5)
#Inspect the vector
str(sms_freq_words)
## chr [1:1139] "€â\200œ" "â£wk" "abiola" "abl" "abt" "accept" "access" ...
#Now we need to filter our DTM to include only the words in the frequent words vector
sms_dtm_freq_train <- sms_dtm_train[, sms_freq_words]
sms_dtm_freq_test <- sms_dtm_test[, sms_freq_words]
#Naive Bayes works with categorical features, so we create a function to convert the counts into a "Yes"/"No" indicator
convert_counts <- function(x) {
  x <- ifelse(x > 0, "Yes", "No")
}
sms_train <- apply(sms_dtm_freq_train, MARGIN = 2, convert_counts)
sms_test <- apply(sms_dtm_freq_test, MARGIN = 2, convert_counts)
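As a quick sanity check, convert_counts() can be applied to a toy vector. An explicit print() is needed because the function's last expression is an assignment, whose value is returned invisibly:
print(convert_counts(c(0, 1, 3)))  # expected: "No" "Yes" "Yes"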
We will use the naiveBayes() function from the e1071 package to build the model.
sms_classifier <- naiveBayes(sms_train, sms_train_labels)
Now we will use the classifier to predict the class of each SMS message in the sms_test dataset.
sms_test_pred <- predict(sms_classifier, sms_test)
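predict() also accepts type = "raw" to return posterior class probabilities instead of hard labels, which can be useful for examining borderline messages:
#Optional: inspect posterior probabilities rather than class labels
sms_test_prob <- predict(sms_classifier, sms_test, type = "raw")
head(sms_test_prob)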
CrossTable(sms_test_pred, sms_test_labels,
prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
dnn = c("predicted", "actual"))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 1390
##
##
## | actual
## predicted | ham | spam | Row Total |
## -------------|-----------|-----------|-----------|
## ham | 1201 | 30 | 1231 |
## | 0.864 | 0.022 | |
## -------------|-----------|-----------|-----------|
## spam | 6 | 153 | 159 |
## | 0.004 | 0.110 | |
## -------------|-----------|-----------|-----------|
## Column Total | 1207 | 183 | 1390 |
## -------------|-----------|-----------|-----------|
##
##
We can see 6 false positives (ham messages classified as spam) and 30 false negatives (spam messages classified as ham). The 6 legitimate messages misclassified as spam may be the most costly errors, since important messages could be filtered out.
Setting laplace = 0.1 reduced the number of ham messages misclassified as spam from 6 to 5. I noticed that with laplace = 1 that number increased to 18, so I kept laplace as small as I could.
#Build the model
sms_classifier2 <- naiveBayes(sms_train, sms_train_labels, laplace = 0.1)
#Predict the labels on the sms_test data set
sms_test_pred2 <- predict(sms_classifier2, sms_test)
#Construct the confusion matrix
CrossTable(sms_test_pred2, sms_test_labels,
prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
dnn = c("predicted", "actual"))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 1390
##
##
## | actual
## predicted | ham | spam | Row Total |
## -------------|-----------|-----------|-----------|
## ham | 1202 | 26 | 1228 |
## | 0.865 | 0.019 | |
## -------------|-----------|-----------|-----------|
## spam | 5 | 157 | 162 |
## | 0.004 | 0.113 | |
## -------------|-----------|-----------|-----------|
## Column Total | 1207 | 183 | 1390 |
## -------------|-----------|-----------|-----------|
##
##
Using the Naive Bayes algorithm, this project classifies SMS messages with an accuracy of more than 97%.
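For reference, the accuracies follow directly from the confusion matrices above:
#Accuracy = correct predictions / total observations
(1201 + 153) / 1390  # first model, ~0.974
(1202 + 157) / 1390  # laplace = 0.1 model, ~0.978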