The sms_spam.csv dataset was obtained from the course webpage. It is a data frame with 5559 observations (one per SMS) and two columns: the “type” column, which indicates whether the SMS is spam (junk) or ham (legitimate), and the “text” column, which contains the message content.
Both columns of sms_spam.csv are read in as character data. So that the “type” column can later serve as the classifier label, it was converted into a factor. Calling table() on the “type” feature shows that 4812 messages are labeled ham and 747 are labeled spam.
setwd("C:/Users/Emily/Desktop/GRADUATE PROGRAM COURSES/STAT6620 Machine Learning with R/Machine Learning with R, Second Edition_Code/Chapter 04")
sms_raw <- read.csv("sms_spam.csv", stringsAsFactors = FALSE)
str(sms_raw)
'data.frame': 5559 obs. of 2 variables:
$ type: chr "ham" "ham" "ham" "spam" ...
$ text: chr "Hope you are having a good week. Just checking in" "K..give back my thanks." "Am also doing in cbe only. But have to pay." "complimentary 4 STAR Ibiza Holiday or £10,000 cash needs your URGENT collection. 09066364349 NOW from Landline not to lose out"| __truncated__ ...
sms_raw$type <- factor(sms_raw$type)
str(sms_raw$type)
Factor w/ 2 levels "ham","spam": 1 1 1 2 2 1 1 1 2 1 ...
table(sms_raw$type)
ham spam
4812 747
Text data cannot be analyzed directly in a data frame, so it needs to be converted into a volatile corpus (a collection of text documents, one per SMS, held in memory for the current session only). Printing the corpus shows its metadata and the number of documents (the same as the number of messages); inspect() gives the same summary for a subset of documents, and as.character() shows the content (the actual text) of a message.
library(tm)
sms_corpus <- VCorpus(VectorSource(sms_raw$text))
print(sms_corpus)
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 5559
inspect(sms_corpus[1:2])
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 2
[[1]]
<<PlainTextDocument>>
Metadata: 7
Content: chars: 49
[[2]]
<<PlainTextDocument>>
Metadata: 7
Content: chars: 23
as.character(sms_corpus[[1]])
[1] "Hope you are having a good week. Just checking in"
lapply(sms_corpus[1:2], as.character)
$`1`
[1] "Hope you are having a good week. Just checking in"
$`2`
[1] "K..give back my thanks."
Within the corpus, each SMS is still in its raw form. The messages need to be processed before a machine learning algorithm, here Naive Bayes classification, can be applied. The cleaning steps are: convert all letters to lower case; remove numbers, which indicate neither spam nor ham; remove stop words such as “a”, “an”, “the”, “for”, “is”…, which likewise indicate neither class; remove punctuation; and reduce each word to its root form (stemming). After cleaning, the corpus is transformed into a Document-Term Matrix (DTM) in which each row represents one SMS from the original dataset and each column represents one unique word that appears anywhere in the collection of 5559 messages after cleaning. The DTM still has 5559 rows, one per SMS, and 6576 columns, each a unique word that serves as a feature.
replacePunctuation <- function(x) { gsub("[[:punct:]]+", " ", x) }
library(SnowballC)
sms_dtm <- DocumentTermMatrix(sms_corpus, control = list(
tolower = TRUE,
removeNumbers = TRUE,
stopwords = function(x) { removeWords(x, stopwords()) },
removePunctuation = TRUE,
stemming = TRUE
))
sms_dtm
<<DocumentTermMatrix (documents: 5559, terms: 6576)>>
Non-/sparse entries: 42173/36513811
Sparsity : 100%
Maximal term length: 40
Weighting : term frequency (tf)
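To make the structure of a DTM concrete, here is a minimal base R sketch that builds one by hand. The three messages, the tiny stop-word list, and the tokenize() helper are all made up for this illustration; stemming is left out, and the tm package does considerably more than this.

```r
# Toy messages (hypothetical, not taken from sms_spam.csv)
msgs <- c("Free prize! Call now",
          "Call me when you are free",
          "WIN a FREE prize now!!!")

# Hypothetical, deliberately tiny stop-word list for the illustration
stop_words <- c("a", "me", "when", "you", "are")

# Hypothetical helper: clean one message and split it into words
tokenize <- function(x) {
  x <- tolower(x)                       # lower-case everything
  x <- gsub("[[:digit:]]+", " ", x)     # drop numbers
  x <- gsub("[[:punct:]]+", " ", x)     # drop punctuation
  words <- strsplit(trimws(x), "[[:space:]]+")[[1]]
  words[!(words %in% stop_words) & nzchar(words)]
}

tokens <- lapply(msgs, tokenize)
vocab  <- sort(unique(unlist(tokens)))

# One row per message, one column per unique term, cells hold term counts
toy_dtm <- t(sapply(tokens, function(w) table(factor(w, levels = vocab))))
rownames(toy_dtm) <- seq_along(msgs)
toy_dtm
```

Most cells in even this small matrix are zero, which is the same sparsity that the sms_dtm summary reports on a much larger scale.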
The DTM, which can be indexed like a data frame, is then split into a training set containing the first 75% of the messages (rows 1 to 4169) and a test set containing the remaining 25% (rows 4170 to 5559).
sms_dtm_train <- sms_dtm[1:4169, ]
sms_dtm_test <- sms_dtm[4170:5559, ]
str(sms_dtm_train)
List of 6
$ i : int [1:31641] 1 1 1 1 1 2 2 2 3 3 ...
$ j : int [1:31641] 962 2278 2578 2945 6224 430 2997 5636 183 911 ...
$ v : num [1:31641] 1 1 1 1 1 1 1 1 1 1 ...
$ nrow : int 4169
$ ncol : int 6576
$ dimnames:List of 2
..$ Docs : chr [1:4169] "1" "2" "3" "4" ...
..$ Terms: chr [1:6576] "鈧é""| __truncated__ "鈧éharri""| __truncated__ "鈧鈥?"| __truncated__ "鈥榤orrow""| __truncated__ ...
- attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
- attr(*, "weighting")= chr [1:2] "term frequency" "tf"
str(sms_dtm_test)
List of 6
$ i : int [1:10532] 1 1 1 2 2 2 2 2 2 2 ...
$ j : int [1:10532] 1107 1486 2558 852 1107 2816 4215 4268 4731 6226 ...
$ v : num [1:10532] 1 1 1 1 1 1 1 1 1 1 ...
$ nrow : int 1390
$ ncol : int 6576
$ dimnames:List of 2
..$ Docs : chr [1:1390] "4170" "4171" "4172" "4173" ...
..$ Terms: chr [1:6576] "鈧é""| __truncated__ "鈧éharri""| __truncated__ "鈧鈥?"| __truncated__ "鈥榤orrow""| __truncated__ ...
- attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
- attr(*, "weighting")= chr [1:2] "term frequency" "tf"
Similarly, training and test label vectors are created for the Naive Bayes classifier. Since the DTM was built only from the “text” column of the original data, the labels are taken from the “type” column, split at exactly the same rows as the DTM.
We can use prop.table() to convert the spam/ham counts in the training and test label vectors into proportions and confirm that the two sets have a similar class balance. If the proportion of spam in the training set differed markedly from that in the test set, the Naive Bayes classification performance could suffer badly, and the rows would need to be randomized before splitting into training and test sets.
sms_train_labels <- sms_raw[1:4169, ]$type
sms_test_labels <- sms_raw[4170:5559, ]$type
prop.table(table(sms_train_labels))
sms_train_labels
ham spam
0.8647158 0.1352842
prop.table(table(sms_test_labels))
sms_test_labels
ham spam
0.8683453 0.1316547
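Here the proportions already match closely, so no shuffling was needed. For a dataset where they do not, the randomization mentioned above could be sketched as follows. The label vector is a toy stand-in that is deliberately sorted by class, and the seed and variable names are arbitrary choices for this illustration.

```r
set.seed(123)  # arbitrary seed, only so the illustration is repeatable

# Toy label vector standing in for sms_raw$type, deliberately sorted by class
labels <- factor(rep(c("ham", "spam"), times = c(870, 130)))

n         <- length(labels)
n_train   <- floor(0.75 * n)
shuffled  <- sample(n)                    # random permutation of row indices
train_idx <- shuffled[1:n_train]
test_idx  <- shuffled[(n_train + 1):n]

# Class proportions should now be close in both partitions
round(prop.table(table(labels[train_idx])), 3)
round(prop.table(table(labels[test_idx])), 3)
```

The same train_idx and test_idx would have to be used to subset both the DTM rows and the label vector, so that each message stays paired with its label.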
We can create a wordcloud from the cleaned corpus (after all cleaning steps have been applied) to see the most frequent words in the overall “bag of words”. The wordcloud is set to show only words that appear at least 60 times, a number slightly above 1% of the 5559 messages, and, because random.order = FALSE, more frequent words are placed closer to the center than less frequent ones.
library(wordcloud)
wordcloud(sms_corpus_clean, min.freq = 60, random.order = FALSE)
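The wordcloud() call above uses sms_corpus_clean, an object that the code shown so far never creates. One way it could have been produced, assuming the standard tm_map() transformations from the tm package rather than the author's actual code, is sketched below. The clean_corpus() helper and the two toy messages are hypothetical.

```r
library(tm)
library(SnowballC)   # supplies the stemmer used by stemDocument()

# Hypothetical helper: applies the cleaning steps described earlier
clean_corpus <- function(corpus) {
  corpus <- tm_map(corpus, content_transformer(tolower))  # lower case
  corpus <- tm_map(corpus, removeNumbers)                 # drop digits
  corpus <- tm_map(corpus, removeWords, stopwords())      # drop stop words
  corpus <- tm_map(corpus, removePunctuation)             # drop punctuation
  corpus <- tm_map(corpus, stemDocument)                  # reduce to word stems
  tm_map(corpus, stripWhitespace)                         # collapse extra spaces
}

# Toy corpus (made-up messages) to show the effect
toy_corpus <- VCorpus(VectorSource(c("Hope you are having a GOOD week.",
                                     "Call 09066364349 NOW!")))
toy_clean <- clean_corpus(toy_corpus)
lapply(toy_clean, as.character)
```

Under that assumption, sms_corpus_clean <- clean_corpus(sms_corpus) would supply the object the wordcloud() call expects.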
A wordcloud for spam or ham alone can be made by first splitting the original dataset into a spam subset, where the “type” feature is “spam”, and a ham subset, where it is “ham”, and then drawing a wordcloud for each. Both wordclouds are limited to the 50 most frequent words, with more frequent words in a larger font and less frequent words in a smaller font.
spam <- subset(sms_raw, type == "spam")
ham <- subset(sms_raw, type == "ham")
wordcloud(spam$text, max.words = 50, scale = c(5, 0.5))
The wordclouds let us compare the frequent words in the spam and ham subsets. For example, the spam wordcloud shows that words such as “call”, “now”, “free”, “text” and “mobile” appear often in spam, while the ham wordcloud shows frequent words such as “will”, “get”, “now”, “just” and “can”. Since most of the words in one wordcloud differ from those in the other, we can make a preliminary guess that each class has its own distinctive vocabulary and that the Naive Bayes algorithm should do well at separating spam from ham based on the words in an incoming SMS.
We can also see that the spam wordcloud contains words at two extremes of frequency, since a few words are very large and many are very small, whereas the words in the ham wordcloud have more similar frequencies, since their font sizes are closer together. Font size is scaled by each word's frequency of occurrence within its own subset.
wordcloud(ham$text, max.words = 50, scale = c(3.5, 0.5))
After visualizing the data with wordclouds, we can process it further into the final training and test datasets and labels. Since the DTM consists mostly of empty (sparse) cells, we want to eliminate the least frequent words, which carry too little information to be useful for Naive Bayes. Calling findFreqTerms() on the training DTM with a threshold of 5 returns a vector of the words that appear at least 5 times in the training DTM; there are 1139 such words. Using this vector, we keep only these frequent words in both datasets. The number of columns in the training and test datasets therefore shrinks from 6576 words to 1139 words, and these are the words used for training.
sms_freq_words <- findFreqTerms(sms_dtm_train, 5)
str(sms_freq_words)
chr [1:1139] "鈧鈥?"| __truncated__ "abiola" "abl" "abt" "accept" "access" ...
sms_dtm_freq_train <- sms_dtm_train[ , sms_freq_words]
sms_dtm_freq_test <- sms_dtm_test[ , sms_freq_words]
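The idea behind the frequency filter can be shown on an ordinary matrix in base R. This is only a sketch of the principle with made-up counts; it does not reproduce how findFreqTerms() works internally on a sparse DTM.

```r
# Toy count matrix (hypothetical values): rows are messages, columns are terms
counts <- matrix(c(2, 0, 1,
                   3, 0, 0,
                   1, 1, 0,
                   0, 0, 1),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(NULL, c("call", "abiola", "prize")))

min_total <- 5   # same threshold as the findFreqTerms() call above

# Total occurrences of each term across all messages
term_totals <- colSums(counts)
freq_terms  <- names(term_totals[term_totals >= min_total])

# Keep only the frequent columns; drop = FALSE preserves the matrix shape
counts_freq <- counts[, freq_terms, drop = FALSE]
counts_freq
```

Note that the vocabulary is chosen from the training data only and then used to select columns in both the training and the test matrix, so the test set does not influence which features are kept.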
Both the training and test DTMs now contain numeric values giving the number of times each word (column) occurs in each SMS (row). Because this Naive Bayes classifier is to be trained on categorical features, we create a function that converts a cell's count to “No” if the value is 0 and to “Yes” if it is greater than 0. Applying this numeric-to-categorical function to every column with apply() gives the final training and test matrices, in which each cell is either “Yes” or “No”.
# Recode word counts as a categorical indicator: "Yes" if the word occurs, "No" otherwise
convert_counts <- function(x) {
  ifelse(x > 0, "Yes", "No")
}
sms_train <- apply(sms_dtm_freq_train, MARGIN = 2, convert_counts)
sms_test <- apply(sms_dtm_freq_test, MARGIN = 2, convert_counts)
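As a quick check of what the conversion does, the same function can be applied to a small made-up count matrix. The toy_counts values and dimension names are hypothetical.

```r
convert_counts <- function(x) {
  ifelse(x > 0, "Yes", "No")
}

# Toy count matrix (hypothetical values): 3 messages, 2 terms
toy_counts <- matrix(c(0, 2,
                       1, 0,
                       3, 0),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(c("1", "2", "3"), c("call", "free")))

# MARGIN = 2 applies the function column by column
toy_yes_no <- apply(toy_counts, MARGIN = 2, convert_counts)
toy_yes_no
```

The result is a character matrix of the same shape, where a count of 3 and a count of 1 both become “Yes”, so the number of occurrences within a single message is deliberately discarded.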
Now that we have the final training and test datasets and their corresponding label vectors, we can use naiveBayes() from the e1071 package to train a model on the training dataset and its labels. The model stores, for each of the final 1139 words, the likelihood of that word appearing in spam and in ham.
library(e1071)
sms_classifier <- naiveBayes(sms_train, sms_train_labels)
The model is then applied to the test dataset with predict(), which outputs a vector predicting whether each test SMS is spam or ham based on the words it contains.
The prediction vector is then compared with the test label vector (the actual class of each test message) using CrossTable() to assess the accuracy of the Naive Bayes classification.
sms_test_pred <- predict(sms_classifier, sms_test)
The result shows 6 false positives (legitimate messages incorrectly identified as spam) and 30 false negatives (spam messages incorrectly identified as legitimate). The accuracy in this case is 97.4% ((1390-36)/1390*100). Ideally, we want to improve the Naive Bayes classifier by raising the prediction accuracy, that is, by lowering both the false negatives and the false positives.
In practice, lowering the false positives matters more than lowering the false negatives: missing an important message because a legitimate SMS was misclassified as spam is much worse than having to deal manually with a spam message that was misclassified as legitimate.
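The accuracy figure can be reproduced from the reported error counts. The sketch below rebuilds the confusion matrix from the 6 false positives, the 30 false negatives, and the 1390 test messages. The number of spam messages in the test set is not stated directly; it is derived here from the test-label proportion printed earlier (0.1316547), so treat it as an inference. Precision and recall are added only to illustrate the false positive versus false negative trade-off.

```r
n_test <- 1390
fp <- 6      # ham flagged as spam (stated in the text)
fn <- 30     # spam let through as ham (stated in the text)

# Derived, not stated in the text: the test-label proportions imply
# about 0.1316547 * 1390 = 183 spam messages in the test set
n_spam <- round(0.1316547 * n_test)
n_ham  <- n_test - n_spam

tp <- n_spam - fn   # spam correctly flagged
tn <- n_ham - fp    # ham correctly passed

confusion <- matrix(c(tn, fp,
                      fn, tp),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(actual = c("ham", "spam"),
                                    predicted = c("ham", "spam")))
confusion

accuracy  <- (tp + tn) / n_test
precision <- tp / (tp + fp)   # share of flagged messages that really are spam
recall    <- tp / (tp + fn)   # share of spam messages that were caught
round(c(accuracy = accuracy, precision = precision, recall = recall), 3)
```

Few false positives show up as high precision, while the 30 missed spam messages pull recall down, which is the preferable direction of error for a spam filter.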
To improve the Naive Bayes model, we pass the option “laplace = 1” to naiveBayes(), which adds one to every word count so that each word is treated as if it had appeared at least once in each class level (here the class levels are “spam” and “ham”). This eliminates the problem of a word having zero probability in a class in which it never appeared. Since the probabilities in the Naive Bayes formula are multiplied together in a chain, a single zero would effectively overrule all of the other evidence. We want to avoid this situation: the fact that a word has never appeared in a ham SMS does not mean that a message containing it, along with other words, should automatically be classified as spam.
With laplace = 1, both the false positives and the false negatives decrease, and the accuracy rises to 97.6% ((1390-33)/1390*100).
sms_classifier2 <- naiveBayes(sms_train, sms_train_labels, laplace = 1)
sms_test_pred2 <- predict(sms_classifier2, sms_test)
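The effect of the Laplace correction on a single word can be illustrated with the textbook add-one formula. The counts below are made up, and the sketch does not reproduce the internal computation of the e1071 package; it only shows why a zero likelihood disappears.

```r
# Hypothetical counts of training messages that contain one particular word
n_with_word <- c(ham = 0, spam = 12)
n_messages  <- c(ham = 100, spam = 20)

laplace  <- 1
n_levels <- 2   # each feature takes one of two values, "Yes" or "No"

# Without smoothing: the word never occurred in ham, so its likelihood is 0
p_raw <- n_with_word / n_messages

# With add-one smoothing: every count is increased by laplace
p_smoothed <- (n_with_word + laplace) / (n_messages + laplace * n_levels)

p_raw
p_smoothed
```

With the raw estimate, any message containing this word gets a ham score of exactly zero regardless of its other words; with smoothing, the ham likelihood is small but positive, so the remaining evidence still counts.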
Conclusion: The Naive Bayes classification algorithm is effective for categorical data and text mining. It rests on Bayes' theorem, which gives the posterior probability of a class given the observed evidence. It is called “naive” because it assumes that all features in the dataset (here, the words) are equally important and independent of one another. Using the Naive Bayes classifier, we reach more than 97% accuracy in predicting whether a message is spam based on the words it contains.
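For reference, the posterior calculation that the naive independence assumption makes possible can be sketched in a few lines of base R. The prior and the per-word likelihoods below are hypothetical numbers chosen for illustration, not values estimated from the SMS data.

```r
# Hypothetical class priors
prior <- c(ham = 0.87, spam = 0.13)

# Hypothetical P(word present | class) for two words
lik <- rbind(free = c(ham = 0.02, spam = 0.30),
             call = c(ham = 0.10, spam = 0.40))

# A message contains both words. Under the naive independence assumption
# the per-word likelihoods are simply multiplied within each class.
unnormalized <- prior * apply(lik, 2, prod)

# Normalize so the two class posteriors sum to 1
posterior <- unnormalized / sum(unnormalized)
round(posterior, 3)
```

Even though ham has the much larger prior, the two words together shift most of the posterior mass to spam in this toy case.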