In this task, we will use Naive Bayes algorithem to try to detect whither a Short Message Service (SMS) is a spam or a ham (legitimte).
The dataset will be used is simply the sms content and a response variable which indicate that the message is a spam or ham. The data was collected via a SMS Spam Collection can be found in this link http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/.
Text data are usually challenging to deal with. That is because it is hard for the machine to read and understand an unstructure data. We will use a technique called bag-of-words, which basically ignores the ordering of the word and just tells you whether a specific word appears in this message or not.
We first start by reading thedata using read.csv() function. The data has 5559 observations and 2 variables. The str() function is always a good way to see the structur of your dataset. We note that both variables are stored as character; however,it is always prefered to store your response as a factor. This can be done by using factor()function.
sms_raw <- read.csv("./sms_spam.csv", stringsAsFactors = FALSE)
str(sms_raw)
## 'data.frame': 5559 obs. of 2 variables:
## $ type: chr "ham" "ham" "ham" "spam" ...
## $ text: chr "Hope you are having a good week. Just checking in" "K..give back my thanks." "Am also doing in cbe only. But have to pay." "complimentary 4 STAR Ibiza Holiday or £10,000 cash needs your URGENT collection. 09066364349 NOW from Landline not to lose out"| __truncated__ ...
sms_raw$type <-factor(sms_raw$type)
We can apply the function table() on the reponse to see the frequency of each class.
table(sms_raw$type)
##
## ham spam
## 4812 747
round(prop.table(table(sms_raw$type))*100,digits = 2)
##
## ham spam
## 86.56 13.44
Next we need to process our data to make it ready for analysis. To do so, we will use a powerfull package designed specially for text mining called tm. In text mining, acorpus is defined as a collection of text documents. a document could be an entire book or simply a text message. The VCorpus() function is used to create a corpus. Note that the function VectorSource()is usedto specify an R vecotr source of data,since our data is already loaded in R. The function inspect() can be used to view a sample of the corpus. In addition, the function as.character() can be used to view the actual content of any message.
if (!require(tm)){
install.packages("tm")
library(tm)
}
## Loading required package: tm
## Warning: package 'tm' was built under R version 3.1.3
## Loading required package: NLP
## Warning: package 'NLP' was built under R version 3.1.3
sms_corpus <- VCorpus(VectorSource(sms_raw$text))
inspect(sms_corpus[1:2])
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 2
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 49
##
## [[2]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 23
as.character(sms_corpus[[1]])
## [1] "Hope you are having a good week. Just checking in"
lapply(X = sms_corpus[1:2],FUN = as.character)
## $`1`
## [1] "Hope you are having a good week. Just checking in"
##
## $`2`
## [1] "K..give back my thanks."
Next we will need to clean the texts by removing panctuations, tranforming or mapping te text into lower cases and removing stop words such as the, a … etc.
sms_corpus_clean <- tm_map(sms_corpus, content_transformer(tolower))
as.character(sms_corpus_clean[[1]])
## [1] "hope you are having a good week. just checking in"
sms_corpus_clean <- tm_map(sms_corpus_clean, removeNumbers)
sms_corpus_clean <- tm_map(sms_corpus_clean, removeWords, stopwords())
sms_corpus_clean <- tm_map(sms_corpus_clean, removePunctuation)
Another important process in cleaning the data is called stemming, which takes a word and transform it to its origin for example Learning or learned will be transformed to learn. We can do that using the stemDocument function in tm package. Note the package SnowballC might be required to apply the stemming function. Next we will remove the leading and trailing white spaces.
if(!require(SnowballC)){install.packages("SnowballC");library(SnowballC)}
## Loading required package: SnowballC
## Warning: package 'SnowballC' was built under R version 3.1.3
sms_corpus_clean <- tm_map(sms_corpus_clean, stemDocument)
sms_corpus_clean <- tm_map(sms_corpus_clean, stripWhitespace)
Now we will split the text into words using the Document Term Matrix (DTM).
sms_dtm <- DocumentTermMatrix(sms_corpus_clean)
sms_dtm
## <<DocumentTermMatrix (documents: 5559, terms: 6576)>>
## Non-/sparse entries: 42173/36513811
## Sparsity : 100%
## Maximal term length: 40
## Weighting : term frequency (tf)
Now our data is ready for analysis, but before that, we will split the data into training and testing datasets. we will devide the data into 70% training and 30% testing. We aslo would need to save both response for later use. Note both datasets contains about 13% of spam.
sms_dtm_train <- sms_dtm[1:3891, ]
sms_dtm_test <- sms_dtm[3892:5559, ]
sms_train_labels <- sms_raw[1:3891, ]$type
sms_test_labels <- sms_raw[3892:5559, ]$type
prop.table(table(sms_train_labels))
## sms_train_labels
## ham spam
## 0.8648162 0.1351838
prop.table(table(sms_test_labels))
## sms_test_labels
## ham spam
## 0.867506 0.132494
A good way to visualize a text data is to create a words cloud. the wordscloud package has a function called wordcloud that can be applyed on the corpus. It would be better to observe both the spam and ham word clouds seperatly.
if(!require(wordcloud)){install.packages("wordcloud");library(wordcloud)}
## Loading required package: wordcloud
## Warning: package 'wordcloud' was built under R version 3.1.3
## Loading required package: RColorBrewer
## Warning: package 'RColorBrewer' was built under R version 3.1.3
wordcloud(sms_corpus_clean, min.freq = 50, random.order = FALSE)
The DTM has 6500 features of words and some of them are not useful specially those with lower frequency. Therefore, we will use the function findFreqTerms() to only keep the words appear at least 5 times in different SMS.
sms_freq_words <- findFreqTerms(sms_dtm_train, 5)
sms_dtm_freq_train<- sms_dtm_train[ , sms_freq_words]
sms_dtm_freq_test <- sms_dtm_test[ , sms_freq_words]
convert_counts <- function(x) {
x <- ifelse(x > 0, "Yes", "No")
}
sms_train <- apply(sms_dtm_freq_train, MARGIN = 2, convert_counts)
sms_test <- apply(sms_dtm_freq_test, MARGIN = 2, convert_counts)
Now we will apply Naive Bayes to our clean data. To do so, we will use the library e1071 which includes many machine learning algorithms. The function NaiveBayes taks the training set and its label as an input and train the model. After that the function predict use the test data to apply the model on the tested data.
if(!require(e1071)){install.packages("e1071");library(e1071)}
## Loading required package: e1071
## Warning: package 'e1071' was built under R version 3.1.3
sms_classifier <- naiveBayes(sms_train, sms_train_labels)
sms_test_pred <- predict(sms_classifier, sms_test)
To present our results, we use the function CrossTable() from the gmodels library, which displays the confusion matix below.
if(!require(gmodels)){install.packages(gmodels);library(gmodels)}
## Loading required package: gmodels
## Warning: package 'gmodels' was built under R version 3.1.3
CrossTable(sms_test_pred, sms_test_labels,
prop.chisq = FALSE, prop.t = FALSE,
dnn = c('predicted', 'actual'))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1668
##
##
## | actual
## predicted | ham | spam | Row Total |
## -------------|-----------|-----------|-----------|
## ham | 1440 | 34 | 1474 |
## | 0.977 | 0.023 | 0.884 |
## | 0.995 | 0.154 | |
## -------------|-----------|-----------|-----------|
## spam | 7 | 187 | 194 |
## | 0.036 | 0.964 | 0.116 |
## | 0.005 | 0.846 | |
## -------------|-----------|-----------|-----------|
## Column Total | 1447 | 221 | 1668 |
## | 0.868 | 0.132 | |
## -------------|-----------|-----------|-----------|
##
##
It can be see that the accuracy rate is 97.41% which is very good.
Next we will try to improve the model. We will use the idea of laplace estimator. This can be done by assigning the value 1 to the laplace parameter.
sms_classifier2 <- naiveBayes(sms_train, sms_train_labels, laplace = 1)
sms_test_pred2 <- predict(sms_classifier2, sms_test)
CrossTable(sms_test_pred2, sms_test_labels, prop.chisq = FALSE, prop.t = FALSE, prop.r = FALSE, dnn = c('predicted', 'actual'))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1668
##
##
## | actual
## predicted | ham | spam | Row Total |
## -------------|-----------|-----------|-----------|
## ham | 1442 | 34 | 1476 |
## | 0.997 | 0.154 | |
## -------------|-----------|-----------|-----------|
## spam | 5 | 187 | 192 |
## | 0.003 | 0.846 | |
## -------------|-----------|-----------|-----------|
## Column Total | 1447 | 221 | 1668 |
## | 0.868 | 0.132 | |
## -------------|-----------|-----------|-----------|
##
##
Note that the false spam classified SMS’s were reduced from 30 to 28 and the false ham classified also reduced from 6 to 5.
In this taskwe will try to predict whither a review posted on Amazon Fine Food product is helpfull or not. Good reviews can be put on top so that custmers can be benefit from them.
The Amazon Fine Food Reviews dataset consists of 568,454 food reviews Amazon users left up to October 2012. The data can be found in Kaggle through this Link. The data set contains the following variables: * ProductId - unique identifier for the product * UserId - unqiue identifier for the user * ProfileName * HelpfulnessNumerator - number of users who found the review helpful * HelpfulnessDenominator - number of users who indicated whether they found the review helpful * Score - rating between 1 and 5 * Time - timestamp for the review * Summary - brief summary of the review * Text - text of the review
First we downloaded and loaded the data in R using read.csv() function since the data are stored in a csv file. Then we will get red off some of the variables that we wont use such as Product and User ID, ProfileName, Time and Summary.
Next we will create our response variable which is a drived variable that indicates whither a review is useful or not. We will devide HelpfulnessNumerator by HelpfulnessDenominator and then if the ratio is above 50% then we concider the review is helfull otherwise its not helpful. The response variable helpfull is a factor with two levels Y and N.
rev <- read.csv("./Reviews.csv")
rev <-rev[,-c(1,2,3,8,9)]
rev <- subset(rev,rev$HelpfulnessDenominator != 0)
rev$Text <- as.character(rev$Text)
rev$ProfileName <- as.character(rev$ProfileName)
rev$len <- vapply(strsplit(rev$Text, "\\W+"), length, integer(1))
for (i in 1:dim(rev)[1] ){
if ( rev$HelpfulnessNumerator[i]/rev$HelpfulnessDenominator[i] > 0.5){rev$helpfull[i] = "Y"}
else {rev$helpfull[i] ="N"}
}
rev$helpfull<-factor(rev$helpfull)
table(rev$helpfull)
##
## N Y
## 4466 13350
round(prop.table(table(rev$helpfull))*100,2)
##
## N Y
## 25.07 74.93
We can see that the number of observation has dropped to 17816 after excluding all records that has a value of 0 in the HelpfulnessDenominator variable. In addition, the table() function shows that there are 25% of not helpfull reviews and 75% of helpfull reviews.
We then create a corpus for our reviews and inspect it
rev_corpus <- VCorpus(VectorSource(rev$Text))
inspect(rev_corpus[1:2])
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 2
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 263
##
## [[2]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 509
as.character(rev_corpus[[1]])
## [1] "I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than most."
The corpus is then cleaned using the function tm_map() just like the sms data. After that, a Decument Term Matrix is created. Following that, due to a memory limitation we will sample 5000 observation to run our model on. Therefore, our training set will have 3500 observations and our test data will have 1500 observations.
rev_corpus <- tm_map(rev_corpus, content_transformer(tolower))
rev_corpus <- tm_map(rev_corpus, removeNumbers)
rev_corpus <- tm_map(rev_corpus, removeWords, stopwords())
rev_corpus <- tm_map(rev_corpus, removePunctuation)
rev_corpus <- tm_map(rev_corpus, stemDocument)
rev_corpus <- tm_map(rev_corpus, stripWhitespace)
rev_dtm <- DocumentTermMatrix(rev_corpus)
rev_train <- rev_dtm[1:12471, ]
rev_test <- rev_dtm[12472:17816, ]
#sample from the dataset because they are too big
set.seed(1)
rev_train <- rev_train[sample(1:nrow(rev_train), 3500,replace=FALSE),]
set.seed(2)
rev_test <- rev_test[sample(1:nrow(rev_test), 1500,replace=FALSE),]
rev_train_labels <- rev[1:12471, ]$helpfull
rev_test_labels <- rev[12472:17816, ]$helpfull
set.seed(1)
rev_train_labels <- rev_train_labels[sample(1:nrow(rev_train), 3500,replace=FALSE)]
set.seed(2)
rev_test_labels <- rev_test_labels[sample(1:nrow(rev_test), 1500,replace=FALSE)]
prop.table(table(rev_train_labels))
## rev_train_labels
## N Y
## 0.2542857 0.7457143
prop.table(table(rev_test_labels))
## rev_test_labels
## N Y
## 0.22 0.78
It would help to create a wordscloud to visualize the dataset.
suppressMessages(suppressWarnings(wordcloud(rev_corpus, min.freq = 100, random.order = FALSE, colors=brewer.pal(8, "Dark2"), random.color=TRUE)))
A separate word cloud is created for both usefull and not usefull reviews. Note that there are not much difference interms of content.
It would help to keep only words that are more frequent; therefore, we will drop any word that appears less than 5 times in all reviews.
rev_freq_words <- findFreqTerms(rev_train, 5)
rev_train<- rev_train[ , rev_freq_words]
rev_test <- rev_test[ , rev_freq_words]
convert_counts <- function(x) {
x <- ifelse(x > 0, "Yes", "No")
}
rev_train <- apply(rev_train, MARGIN = 2, convert_counts)
rev_test <- apply(rev_test, MARGIN = 2, convert_counts)
Now we use naiveBayes() function to train our model. Then we use the predict() function to test our model.
rev_classifier <- naiveBayes(rev_train, rev_train_labels)
rev_test_pred <- predict(rev_classifier, rev_test)
CrossTable(rev_test_pred, rev_test_labels,
prop.chisq = FALSE, prop.t = FALSE,
dnn = c('predicted', 'actual'))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1500
##
##
## | actual
## predicted | N | Y | Row Total |
## -------------|-----------|-----------|-----------|
## N | 50 | 196 | 246 |
## | 0.203 | 0.797 | 0.164 |
## | 0.152 | 0.168 | |
## -------------|-----------|-----------|-----------|
## Y | 280 | 974 | 1254 |
## | 0.223 | 0.777 | 0.836 |
## | 0.848 | 0.832 | |
## -------------|-----------|-----------|-----------|
## Column Total | 330 | 1170 | 1500 |
## | 0.220 | 0.780 | |
## -------------|-----------|-----------|-----------|
##
##
From the confusion matrix, it is noted that the model accuracy is about 68%.
To improve our model we will try to use Laplace indicator just like the sms data.
rev_classifier <- naiveBayes(rev_train, rev_train_labels,laplace = 1)
rev_test_pred <- predict(rev_classifier, rev_test)
CrossTable(rev_test_pred, rev_test_labels,
prop.chisq = FALSE, prop.t = FALSE,
dnn = c('predicted', 'actual'))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1500
##
##
## | actual
## predicted | N | Y | Row Total |
## -------------|-----------|-----------|-----------|
## N | 35 | 99 | 134 |
## | 0.261 | 0.739 | 0.089 |
## | 0.106 | 0.085 | |
## -------------|-----------|-----------|-----------|
## Y | 295 | 1071 | 1366 |
## | 0.216 | 0.784 | 0.911 |
## | 0.894 | 0.915 | |
## -------------|-----------|-----------|-----------|
## Column Total | 330 | 1170 | 1500 |
## | 0.220 | 0.780 | |
## -------------|-----------|-----------|-----------|
##
##
The model now has an accuracy rate of 73.7% which is 5% increase. However, this is not a very good model. There might be better model that will provide better accuracy.
Upon searching for a Google Sheet add-ons for sentiment analsysi I came across Text Analysis offered by aylien.com. After searching in their websites. I could not find any mention of what algorithm they use.