Naive Bayes (Part 1)

Part 1 of this project focuses on classifying SMS messages as HAM or SPAM, where HAM denotes legitimate messages and SPAM denotes unsolicited ones. The primary focus is to use the Naive Bayes classifier and to improve the trained model using the Laplace estimator.

The following steps will be taken:

  1. Getting Dataset
  2. Loading Dataset
  3. Data Exploration and Observation
  4. Data Cleansing and Standardization
  5. Word Cloud (Optional)
  6. Creating DTM Sparse Matrix
  7. Creating Training and Test Data
  8. Reducing the Word Frequency
  9. Observe the Process (so far)
  10. Apply the Naive Bayes Algorithm
  11. Improving Model
  12. Remark

1. Getting Dataset

The dataset was downloaded to a desired folder from the link - SMS Spam Collection v. 1.

2. Loading Dataset

ham_spam_SMS_messages <- read.table("SMSSpamCollection.txt", sep = "\t", stringsAsFactors = F, quote = "")
head(ham_spam_SMS_messages)
##     V1
## 1  ham
## 2  ham
## 3 spam
## 4  ham
## 5  ham
## 6 spam
##                                                                                                                                                            V2
## 1                                             Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
## 2                                                                                                                               Ok lar... Joking wif u oni...
## 3 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
## 4                                                                                                           U dun say so early hor... U c already then say...
## 5                                                                                               Nah I don't think he goes to usf, he lives around here though
## 6        FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv
names(ham_spam_SMS_messages) <- c("Type", "Text")
head(ham_spam_SMS_messages)
##   Type
## 1  ham
## 2  ham
## 3 spam
## 4  ham
## 5  ham
## 6 spam
##                                                                                                                                                          Text
## 1                                             Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
## 2                                                                                                                               Ok lar... Joking wif u oni...
## 3 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
## 4                                                                                                           U dun say so early hor... U c already then say...
## 5                                                                                               Nah I don't think he goes to usf, he lives around here though
## 6        FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv

3. Data Exploration and Observation

In this section, the data exploration inspects the structure of the dataset by using the str() function.

str(ham_spam_SMS_messages)
## 'data.frame':    5574 obs. of  2 variables:
##  $ Type: chr  "ham" "ham" "spam" "ham" ...
##  $ Text: chr  "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..." "Ok lar... Joking wif u oni..." "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question("| __truncated__ "U dun say so early hor... U c already then say..." ...

Since we are dealing with a classification problem, it is important that we convert the Type field of the dataset to a FACTOR (ham = 1 and spam = 2).

ham_spam_SMS_messages$Type <- factor(ham_spam_SMS_messages$Type)

Then, view the result of the converted field as follows:

str(ham_spam_SMS_messages$Type)
##  Factor w/ 2 levels "ham","spam": 1 1 2 1 1 2 1 1 2 2 ...

Afterwards, check the count for the unique factors.

table(ham_spam_SMS_messages$Type)
## 
##  ham spam 
## 4827  747

4. Data Cleansing and Standardization

To do the data cleansing and standardization, it is important that the following are done, mostly using the tm_map() function:

  • Install packages where necessary - tm and slam
  • Create a Volatile Corpus (VCorpus) for the data
  • tolower(): Make all characters lowercase
  • removePunctuation(): Remove all punctuation marks
  • removeNumbers(): Remove numbers
  • stripWhitespace(): Remove excess whitespace
  • stopwords(): Remove stopwords/filler words such as to, and, but et cetera
  • stemDocument() for whole document and wordStem() for single words
library(tm)
## Loading required package: NLP

NOTE:

  1. The “Corpus” is a collection of text documents.
  2. There are two types of corpus data structures: the Permanent Corpus (PCorpus) and the Volatile Corpus (VCorpus). The difference between the two corpora is in the way they are stored on the computer.
  3. For this task, a VCorpus is used because it is held in the computer’s Random Access Memory (it is destroyed when the R object containing it is destroyed), while a PCorpus is stored on disk (outside the memory, for example in a database). So for memory efficiency, the VCorpus is preferred.
  4. The VCorpus is part of the tm package used for natural language processing (NLP). The text must be converted to a corpus before the tm cleansing functions can be applied.
  5. Also note that tolower is part of base R, while the other cleansing functions are part of the tm package.
ham_spam_SMS_messages_corpus <- VCorpus(VectorSource(ham_spam_SMS_messages$Text))

Printing the VCorpus object only shows a brief summary, which does not give much useful information.

ham_spam_SMS_messages_corpus
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 5574

Therefore, to see more useful details, we can inspect a brief summary of the VCorpus by using inspect() and as.character(). The traditional head() will not yield the desired output, as it only displays metadata.

inspect(ham_spam_SMS_messages_corpus[1:3])
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 3
## 
## [[1]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 111
## 
## [[2]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 29
## 
## [[3]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 155
as.character(ham_spam_SMS_messages_corpus[[1]])
## [1] "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."

Additionally, the lapply() function can be used with as.character() to show the content of several documents at once, as shown below:

lapply(ham_spam_SMS_messages_corpus[1:6], as.character)
## $`1`
## [1] "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
## 
## $`2`
## [1] "Ok lar... Joking wif u oni..."
## 
## $`3`
## [1] "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"
## 
## $`4`
## [1] "U dun say so early hor... U c already then say..."
## 
## $`5`
## [1] "Nah I don't think he goes to usf, he lives around here though"
## 
## $`6`
## [1] "FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv"

As part of the process of preparing the data, the next steps are to change all the (text) characters into lowercase; remove all punctuation marks; remove numbers; remove excess whitespace; remove stopwords/filler words; and stem the words accordingly.

  • tolower(): Make all characters lowercase; this is ideal in Natural Language Processing. Since tolower() comes from base R rather than tm, it must be wrapped in content_transformer() (see the sketch after this list).
ham_spam_SMS_corpus_clean <- tm_map(ham_spam_SMS_messages_corpus, content_transformer(tolower))
  • removePunctuation(): Remove all punctuation marks.
ham_spam_SMS_corpus_clean <- tm_map(ham_spam_SMS_corpus_clean, removePunctuation)
  • removeNumbers(): Remove numbers
ham_spam_SMS_corpus_clean <- tm_map(ham_spam_SMS_corpus_clean, removeNumbers)
  • stripWhitespace(): Remove excess whitespace.
ham_spam_SMS_corpus_clean <- tm_map(ham_spam_SMS_corpus_clean, stripWhitespace)
  • stopwords(): Remove stopwords/filler words such as to, and, but, et cetera.
ham_spam_SMS_corpus_clean <- tm_map(ham_spam_SMS_corpus_clean, removeWords, stopwords()) # stopwords("en")
  • stemDocument() stems whole documents and wordStem() stems single words. Stemming requires the SnowballC library (the package should be installed should the library be missing).
library(SnowballC)
ham_spam_SMS_corpus_clean <- tm_map(ham_spam_SMS_corpus_clean, stemDocument)
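As noted in the first bullet, tolower() comes from base R, so it must be wrapped in content_transformer() before tm_map() can apply it; the same wrapper works for any custom string function. A minimal sketch (removeURL is a hypothetical helper shown only to illustrate the pattern, not part of the pipeline above):

# Hypothetical helper: strip URLs with a base-R gsub, wrapped for use with tm_map().
removeURL <- function(x) gsub("http[[:alnum:][:punct:]]*", "", x)
demo_corpus <- tm_map(ham_spam_SMS_messages_corpus, content_transformer(removeURL))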

5. Word Cloud (Optional)

This is an optional task: we can view a WORD CLOUD to visualize the prevalence of certain words. To do this, the wordcloud package needs to be installed and the wordcloud library loaded accordingly.

library(wordcloud)
## Loading required package: RColorBrewer

The output of the wordcloud is given below for observation:

wordcloud(ham_spam_SMS_corpus_clean, scale=c(2,.5), min.freq = 10, max.words = 300,
          random.order = FALSE, rot.per = .5, 
          colors= palette())
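Separate clouds for each class can also be drawn from the raw messages, since wordcloud() accepts a plain text vector directly (an optional sketch, not part of the run above):

# Optional sketch: one cloud per class, built from the raw (uncleaned) text.
spam_text <- subset(ham_spam_SMS_messages, Type == "spam")$Text
ham_text  <- subset(ham_spam_SMS_messages, Type == "ham")$Text
wordcloud(spam_text, max.words = 40, scale = c(3, 0.5))
wordcloud(ham_text,  max.words = 40, scale = c(3, 0.5))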

6. Creating DTM Sparse Matrix

The Document-Term Matrix (DTM) or Term-Document Matrix (TDM) is a mathematical matrix that describes the frequency of terms occurring in a collection of documents. For a DTM, the rows correspond to the documents while the columns correspond to terms in the collection. The TDM is the transpose of the DTM, meaning the rows correspond to terms and the columns to documents. The TDM is ideal for cases where the number of documents is small while the word list is large. For this task, the DTM is created.
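The transpose relationship can be verified on a toy corpus (a quick illustration, separate from the SMS data):

# Toy illustration: the TDM is the transpose of the DTM.
toy_corpus <- VCorpus(VectorSource(c("win a prize now", "call now to win", "see you soon")))
toy_dtm <- DocumentTermMatrix(toy_corpus)  # rows = documents, columns = terms
toy_tdm <- TermDocumentMatrix(toy_corpus)  # rows = terms, columns = documents
all(as.matrix(toy_tdm) == t(as.matrix(toy_dtm)))  # TRUE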

ham_spam_SMS_corpus_dtm <- DocumentTermMatrix(ham_spam_SMS_corpus_clean)
ham_spam_SMS_corpus_dtm
## <<DocumentTermMatrix (documents: 5574, terms: 6827)>>
## Non-/sparse entries: 42495/38011203
## Sparsity           : 100%
## Maximal term length: 40
## Weighting          : term frequency (tf)

To see more detailed DTM output, the as.matrix() function is used. The result, shown below, provides insight into the frequency of words in the documents (this view covers only the first 10 rows and first 10 columns).

ham_spam_SMS_corpus_dtmMAtrix <- as.matrix(ham_spam_SMS_corpus_dtm)
ham_spam_SMS_corpus_dtmMAtrix[1:10, 1:10]
##     Terms
## Docs ‘morrow ‘rent “harri ££ £award £call £ea £k £million
##   1        0     0      0  0      0     0   0  0        0
##   2        0     0      0  0      0     0   0  0        0
##   3        0     0      0  0      0     0   0  0        0
##   4        0     0      0  0      0     0   0  0        0
##   5        0     0      0  0      0     0   0  0        0
##   6        0     0      0  0      0     0   0  0        0
##   7        0     0      0  0      0     0   0  0        0
##   8        0     0      0  0      0     0   0  0        0
##   9        0     0      0  0      0     0   0  0        0
##   10       0     0      0  0      0     0   0  0        0
##     Terms
## Docs £minmobsmorelkpoboxhpfl
##   1                        0
##   2                        0
##   3                        0
##   4                        0
##   5                        0
##   6                        0
##   7                        0
##   8                        0
##   9                        0
##   10                       0

7. Creating Training and Test Data

To create the training and test data, a split ratio is used, with the training set getting the larger share. In this task, there are 5574 messages; a reasonable ratio is 70%:30%. This gives [1:3901, ] for training and [3901:5574, ] for testing (strictly, the test range should start at row 3902, since row 3901 is otherwise included in both sets).

ham_spam_SMS_training_set <- ham_spam_SMS_corpus_dtm[1:3901, ]
ham_spam_SMS_test_set <- ham_spam_SMS_corpus_dtm[3901:5574, ]
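Note that this is a sequential split, which works here; if the messages were ordered by class, a random split would be safer. A minimal alternative sketch (not used in the run above; the seed and variable names are arbitrary choices):

# Alternative sketch: a random 70/30 split via sampled row indices.
set.seed(123)  # arbitrary seed, for reproducibility
train_idx <- sample(nrow(ham_spam_SMS_corpus_dtm), size = floor(0.7 * nrow(ham_spam_SMS_corpus_dtm)))
random_training_set <- ham_spam_SMS_corpus_dtm[train_idx, ]
random_test_set     <- ham_spam_SMS_corpus_dtm[setdiff(seq_len(nrow(ham_spam_SMS_corpus_dtm)), train_idx), ]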

Afterwards, the Type labels are extracted from the raw data in order to check how the HAM and SPAM messages are distributed; prop.table() is used to see the distribution.

ham_spam_SMS_training_set_Labels <- ham_spam_SMS_messages[1:3901, ]$Type
ham_spam_SMS_test_set_Labels <- ham_spam_SMS_messages[3901:5574, ]$Type
prop.table(table(ham_spam_SMS_training_set_Labels))
## ham_spam_SMS_training_set_Labels
##       ham      spam 
## 0.8669572 0.1330428
prop.table(table(ham_spam_SMS_test_set_Labels))
## ham_spam_SMS_test_set_Labels
##       ham      spam 
## 0.8637993 0.1362007

With the result showing about 87% Ham and 13% Spam for the training set labels, and about 86% Ham and 14% Spam for the test set labels, the class distribution is similar across the two sets, so we have a dataset we can work with.

8. Reducing the Word Frequency

At this point, words with very low frequencies are trimmed out of the training data to improve performance. In this case, terms (words) appearing fewer than 10 times are removed.

ham_spam_SMS_freq_words <- findFreqTerms(ham_spam_SMS_training_set, 10)
str(ham_spam_SMS_freq_words)
##  chr [1:609] "…" "abiola" "abl" "abt" "account" "actual" "address" ...

The frequent terms from the training set are then used to subset both DTMs down to the columns for those terms.

ham_spam_SMS_training_set_freq10 <- ham_spam_SMS_training_set[, ham_spam_SMS_freq_words]
ham_spam_SMS_test_set_freq10 <- ham_spam_SMS_test_set[, ham_spam_SMS_freq_words]

View the sample output as shown below:

str(ham_spam_SMS_training_set_freq10)
## List of 6
##  $ i       : int [1:19906] 1 1 1 1 1 1 1 2 2 3 ...
##  $ j       : int [1:19906] 36 112 209 210 401 567 593 267 581 27 ...
##  $ v       : num [1:19906] 1 1 1 1 1 1 1 1 1 1 ...
##  $ nrow    : int 3901
##  $ ncol    : int 609
##  $ dimnames:List of 2
##   ..$ Docs : chr [1:3901] "1" "2" "3" "4" ...
##   ..$ Terms: chr [1:609] "…" "abiola" "abl" "abt" ...
##  - attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
##  - attr(*, "weighting")= chr [1:2] "term frequency" "tf"
str(ham_spam_SMS_test_set_freq10)
## List of 6
##  $ i       : int [1:8594] 1 1 1 1 2 3 3 4 4 4 ...
##  $ j       : int [1:8594] 259 312 445 535 363 282 539 270 432 452 ...
##  $ v       : num [1:8594] 1 1 1 1 1 1 1 1 1 1 ...
##  $ nrow    : int 1674
##  $ ncol    : int 609
##  $ dimnames:List of 2
##   ..$ Docs : chr [1:1674] "3901" "3902" "3903" "3904" ...
##   ..$ Terms: chr [1:609] "…" "abiola" "abl" "abt" ...
##  - attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
##  - attr(*, "weighting")= chr [1:2] "term frequency" "tf"

The output above records how often each term appears. For this classifier, however, the priority is not how many times a word is repeated; what matters is whether the word is present or not. To capture this, a simple Boolean conversion function is created that returns YES if the value is more than 0 and NO if it is 0.

convert_counts <- function(x) { ifelse(x > 0, "YES", "NO") }
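A quick sanity check of the function on a toy vector:

convert_counts(c(0, 1, 3))  # returns "NO" "YES" "YES"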

The Boolean function is then applied, column by column (MARGIN = 2), to the training and test sets of frequent terms as follows:

ham_spam_SMS_train <- apply(ham_spam_SMS_training_set_freq10, MARGIN = 2, convert_counts)
ham_spam_SMS_test <- apply(ham_spam_SMS_test_set_freq10, MARGIN = 2, convert_counts)

9. Observe the Process (so far)

str(ham_spam_SMS_train)
##  chr [1:3901, 1:609] "NO" "NO" "NO" "NO" "NO" "NO" "NO" "NO" "NO" "NO" ...
##  - attr(*, "dimnames")=List of 2
##   ..$ Docs : chr [1:3901] "1" "2" "3" "4" ...
##   ..$ Terms: chr [1:609] "…" "abiola" "abl" "abt" ...
str(ham_spam_SMS_test)
##  chr [1:1674, 1:609] "NO" "NO" "NO" "NO" "NO" "NO" "NO" "NO" "NO" "NO" ...
##  - attr(*, "dimnames")=List of 2
##   ..$ Docs : chr [1:1674] "3901" "3902" "3903" "3904" ...
##   ..$ Terms: chr [1:609] "…" "abiola" "abl" "abt" ...

10. Apply the Naive Bayes Algorithm

The next step is to train the model using the Naive Bayes Algorithm.

# install.packages("e1071")
library(e1071)
ham_spam_NB_ModelClassifier <- naiveBayes(ham_spam_SMS_train, ham_spam_SMS_training_set_Labels) # Creates the Model Classifier
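Before evaluating, it can help to peek at what the classifier learned. The fitted object stores the class priors and the per-feature conditional probability tables (a brief sketch; "free" is assumed here to be among the 609 frequent terms):

ham_spam_NB_ModelClassifier$apriori          # class counts behind the priors P(ham) and P(spam)
ham_spam_NB_ModelClassifier$tables[["free"]] # P(free = YES/NO | ham) and P(free = YES/NO | spam), assuming "free" is a retained term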

Predictions on the test set are then generated as follows:

ham_spam_NB_Predict <- predict(ham_spam_NB_ModelClassifier, ham_spam_SMS_test) # Tests the Model Classifier
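The predict() function can also return posterior probabilities rather than class labels, which is useful for inspecting borderline messages (a sketch; output not shown):

ham_spam_NB_Predict_raw <- predict(ham_spam_NB_ModelClassifier, ham_spam_SMS_test, type = "raw")
head(ham_spam_NB_Predict_raw)  # per-message posterior probabilities for ham and spam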

The effectiveness of the predictions is then assessed using the gmodels package.

# install.packages("gmodels")
library(gmodels)
CrossTable(ham_spam_NB_Predict, ham_spam_SMS_test_set_Labels, prop.chisq = FALSE, prop.t = FALSE, dnn = c('NB Prediction', 'Actual'))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1674 
## 
##  
##               | Actual 
## NB Prediction |       ham |      spam | Row Total | 
## --------------|-----------|-----------|-----------|
##           ham |      1436 |        32 |      1468 | 
##               |     0.978 |     0.022 |     0.877 | 
##               |     0.993 |     0.140 |           | 
## --------------|-----------|-----------|-----------|
##          spam |        10 |       196 |       206 | 
##               |     0.049 |     0.951 |     0.123 | 
##               |     0.007 |     0.860 |           | 
## --------------|-----------|-----------|-----------|
##  Column Total |      1446 |       228 |      1674 | 
##               |     0.864 |     0.136 |           | 
## --------------|-----------|-----------|-----------|
## 
## 

From the prediction performance test result above (Cell Contents table), the model correctly predicted 1436 HAM messages, misclassified 32 SPAM messages as HAM, misclassified 10 HAM messages as SPAM, and correctly predicted 196 SPAM messages.

The result shows:

  1. HAM: 97.8% correct predictions, and 2.2% wrong predictions
  2. SPAM: 95.1% correct predictions, and 4.9% wrong predictions
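From the same table, the overall accuracy can also be computed directly (a quick check based on the counts above):

# Overall accuracy: correct HAM + correct SPAM over all 1674 test messages
(1436 + 196) / 1674  # ~0.975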

11. Improving Model

This result appears very good, but to improve it, the LAPLACE estimator is used. The default value of the LAPLACE estimator is 0, which disables Laplace smoothing; that is the case for the model created above.

The earlier model could equally be written as ham_spam_NB_ModelClassifier <- naiveBayes(ham_spam_SMS_train, ham_spam_SMS_training_set_Labels, laplace = 0), but now the LAPLACE estimator will be set to 1 and then 2 to check whether improved performance can be obtained.
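To see why this helps: with categorical features, each conditional probability is estimated (in the standard formulation) as (count + laplace) / (class total + laplace × number of feature levels), so a term never seen with a class no longer yields a zero that vetoes the whole product. A toy illustration with hypothetical counts:

# Hypothetical counts: a term seen 0 times among 500 ham messages; each feature has 2 levels (YES/NO).
count <- 0; n <- 500; k <- 2
count / n                  # unsmoothed estimate: 0, which zeroes out any message containing the term
(count + 1) / (n + 1 * k)  # with laplace = 1: 1/502, small but non-zero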

  1. LAPLACE estimator as 1:
ham_spam_NB_ModelClassifier_L1 <- naiveBayes(ham_spam_SMS_train, ham_spam_SMS_training_set_Labels, laplace = 1)
ham_spam_NB_Predict_L1 <- predict(ham_spam_NB_ModelClassifier_L1, ham_spam_SMS_test) # Tests the Model Classifier

Checking the effectiveness of the model once again (but now with a laplace estimator of 1).

CrossTable(ham_spam_NB_Predict_L1, ham_spam_SMS_test_set_Labels, prop.chisq = FALSE, prop.t = FALSE, dnn = c('NB Prediction', 'Actual'))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1674 
## 
##  
##               | Actual 
## NB Prediction |       ham |      spam | Row Total | 
## --------------|-----------|-----------|-----------|
##           ham |      1440 |        30 |      1470 | 
##               |     0.980 |     0.020 |     0.878 | 
##               |     0.996 |     0.132 |           | 
## --------------|-----------|-----------|-----------|
##          spam |         6 |       198 |       204 | 
##               |     0.029 |     0.971 |     0.122 | 
##               |     0.004 |     0.868 |           | 
## --------------|-----------|-----------|-----------|
##  Column Total |      1446 |       228 |      1674 | 
##               |     0.864 |     0.136 |           | 
## --------------|-----------|-----------|-----------|
## 
## 

The result shows:

  1. HAM: 98% correct predictions, and 2% wrong predictions
  2. SPAM: 97.1% correct predictions, and 2.9% wrong predictions

This shows a clear improvement: the model correctly predicted 1440 HAM and 198 SPAM messages, misclassifying only 30 SPAM messages as HAM and 6 HAM messages as SPAM.
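The overall accuracy can again be computed from the table:

# Overall accuracy with laplace = 1
(1440 + 198) / 1674  # ~0.978, up from ~0.975 without smoothing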

  2. LAPLACE estimator as 2:
ham_spam_NB_ModelClassifier_L2 <- naiveBayes(ham_spam_SMS_train, ham_spam_SMS_training_set_Labels, laplace = 2)
ham_spam_NB_Predict_L2 <- predict(ham_spam_NB_ModelClassifier_L2, ham_spam_SMS_test) # Tests the Model Classifier

Checking the effectiveness of the model once again (but now with a laplace estimator of 2).

CrossTable(ham_spam_NB_Predict_L2, ham_spam_SMS_test_set_Labels, prop.chisq = FALSE, prop.t = FALSE, dnn = c('NB Prediction', 'Actual'))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1674 
## 
##  
##               | Actual 
## NB Prediction |       ham |      spam | Row Total | 
## --------------|-----------|-----------|-----------|
##           ham |      1440 |        30 |      1470 | 
##               |     0.980 |     0.020 |     0.878 | 
##               |     0.996 |     0.132 |           | 
## --------------|-----------|-----------|-----------|
##          spam |         6 |       198 |       204 | 
##               |     0.029 |     0.971 |     0.122 | 
##               |     0.004 |     0.868 |           | 
## --------------|-----------|-----------|-----------|
##  Column Total |      1446 |       228 |      1674 | 
##               |     0.864 |     0.136 |           | 
## --------------|-----------|-----------|-----------|
## 
## 

12. Remark

It is observed that the LAPLACE estimator created an improved model whose performance then held steady (in this case): smoothing with a value of 1 or 2 gave the same result. There may be cases where the Laplace estimator needs to be tuned further until an appropriate performance is achieved.