#install.packages('tm')
#install.packages('wordcloud')
#install.packages('e1071')
#install.packages('gmodels')
#install.packages('SnowballC')
library("tm") #clean and organize data
## Warning: package 'tm' was built under R version 3.5.3
## Loading required package: NLP
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 3.5.3
library("SnowballC") #clean and organize data
library("stringr")
library("wordcloud") #display more frequent words
## Warning: package 'wordcloud' was built under R version 3.5.3
## Loading required package: RColorBrewer
library("gmodels")
## Warning: package 'gmodels' was built under R version 3.5.3
library("e1071") #Naive Bayes Classifier
## Warning: package 'e1071' was built under R version 3.5.3
library("tidyr")
library("DT")
## Warning: package 'DT' was built under R version 3.5.3
library("ggplot2")
## Warning: package 'ggplot2' was built under R version 3.5.3
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
Sources:

- https://towardsdatascience.com/naive-bayes-explained-9d2b96f4a9c0
- https://www.analyticsvidhya.com/blog/2017/09/naive-bayes-explained/
- http://blog.aylien.com/naive-bayes-for-dummies-a-simple-explanation/
Naive Bayes is a classification technique based on Bayes' Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.
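Formally, for a class $C$ and features $x_1, \dots, x_n$, Bayes' Theorem combined with the independence ("naive") assumption gives:

$$
P(C \mid x_1, \dots, x_n) = \frac{P(C) \cdot P(x_1, \dots, x_n \mid C)}{P(x_1, \dots, x_n)} \propto P(C) \prod_{i=1}^{n} P(x_i \mid C)
$$

Since the denominator is the same for every class, the predicted class is simply the one that maximizes the numerator.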
A simple example best explains the application of Naive Bayes for classification. Many write-ups of Naive Bayes in action are either too complicated or mix in other related algorithms, but there is a really simple example from StackOverflow that explains the concept well and runs through the simple maths behind it without getting too technical. So, let's say we have data on 1000 pieces of fruit. Each fruit is a Banana, an Orange or some Other fruit, and we know 3 features of each fruit: whether it's Long or not, Sweet or not and Yellow or not, as displayed in the table below:

| Class | Long | Sweet | Yellow | Total |
|---|---|---|---|---|
| Banana | 400 | 350 | 450 | 500 |
| Orange | 0 | 150 | 300 | 300 |
| Other | 100 | 150 | 50 | 200 |
| Total | 500 | 650 | 800 | 1000 |
So from the table, what do we already know? 50% of the fruits are bananas, 30% are oranges and 20% are other fruits. Based on our training set we can also say the following:

- Of 500 bananas, 400 (0.8) are Long, 350 (0.7) are Sweet and 450 (0.9) are Yellow
- Of 300 oranges, 0 are Long, 150 (0.5) are Sweet and 300 (1.0) are Yellow
- Of the remaining 200 fruits, 100 (0.5) are Long, 150 (0.75) are Sweet and 50 (0.25) are Yellow

This should provide enough evidence to predict the class of another fruit as it's introduced. So let's say we're given the features of a piece of fruit and need to predict its class. If we're told that the additional fruit is Long, Sweet and Yellow, we can classify it using the following formula, substituting in the values for each outcome, whether it's a Banana, an Orange or Other fruit. The one with the highest probability (score) is the winner.

Banana:

$$
P(Banana \mid Long, Sweet, Yellow) = \frac{P(Long \mid Banana) \cdot P(Sweet \mid Banana) \cdot P(Yellow \mid Banana) \cdot P(Banana)}{P(Long) \cdot P(Sweet) \cdot P(Yellow)} = \frac{0.8 \times 0.7 \times 0.9 \times 0.5}{P(evidence)} = \frac{0.252}{P(evidence)}
$$

Orange (since no orange in the training set is Long, $P(Long \mid Orange) = 0$):

$$
P(Orange \mid Long, Sweet, Yellow) = 0
$$

Other fruit:

$$
P(Other \mid Long, Sweet, Yellow) = \frac{P(Long \mid Other) \cdot P(Sweet \mid Other) \cdot P(Yellow \mid Other) \cdot P(Other)}{P(Long) \cdot P(Sweet) \cdot P(Yellow)} = \frac{0.5 \times 0.75 \times 0.25 \times 0.2}{P(evidence)} = \frac{0.01875}{P(evidence)}
$$

In this case, based on the higher score ($0.01875 < 0.252$), we can conclude that this Long, Sweet and Yellow fruit is, in fact, a Banana.

Now that we've seen a basic example of Naive Bayes in action, you can easily see how it can be applied to text classification problems such as spam detection, sentiment analysis and categorization. By treating documents as sets of words, which represent the features, and labels (e.g. "spam" and "ham" in the case of spam detection) as the classes, we can start to classify documents and text automatically.
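As a quick sanity check, here is a minimal R sketch of the arithmetic above (the vectors below just restate the fruit table and are for illustration only):

priors   <- c(banana = 0.50, orange = 0.30, other = 0.20)  #class priors
p_long   <- c(banana = 0.80, orange = 0.00, other = 0.50)  #P(Long | class)
p_sweet  <- c(banana = 0.70, orange = 0.50, other = 0.75)  #P(Sweet | class)
p_yellow <- c(banana = 0.90, orange = 1.00, other = 0.25)  #P(Yellow | class)
#Unnormalized scores for a Long, Sweet, Yellow fruit; the shared
#denominator P(evidence) can be ignored when ranking classes.
scores <- priors * p_long * p_sweet * p_yellow
scores
##  banana  orange   other
## 0.25200 0.00000 0.01875
names(which.max(scores))
## [1] "banana"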
Pros:

- Even though the naive assumption is rarely true, the algorithm performs surprisingly well in many cases
- Handles high-dimensional data well
- Easy to parallelize and handles big data well
- Performs better than more complicated models when the data set is small

Cons:

- The estimated probabilities are often inaccurate because of the naive assumption
- Not ideal for regression or probability estimation
- When data is abundant, other more complicated models tend to outperform Naive Bayes
data <- read.csv(file="https://raw.githubusercontent.com/AjayArora35/Data-607-Project-4/master/sms_spam.csv", header=TRUE)
head(data) %>% kable() %>% kable_styling()
| type | text |
|---|---|
| ham | Go until jurong point, crazy.. Available only in bugis n great world la e buffet… Cine there got amore wat… |
| ham | Ok lar… Joking wif u oni… |
| spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C’s apply 08452810075over18’s |
| ham | U dun say so early hor… U c already then say… |
| ham | Nah I don’t think he goes to usf, he lives around here though |
| spam | FreeMsg Hey there darling it’s been 3 week’s now and no word back! I’d like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv |
ham <- length(which(data$type == "ham"))
spam <- length(which(data$type == "spam"))
all <- data.frame("ham" = c(ham), "spam" = c(spam))
all %>% kable() %>% kable_styling()
| ham | spam |
|---|---|
| 4827 | 747 |
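The classes are quite imbalanced: roughly 87% of the messages are ham. A quick check of the proportions:

round(prop.table(table(data$type)), 3)
##   ham  spam
## 0.866 0.134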
spam_messages <- subset(data, data$type == "spam")
ham_messages <- subset(data, data$type == "ham")
head(spam_messages$text)
## [1] Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
## [2] FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv
## [3] WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.
## [4] Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030
## [5] SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, 6days, 16+ TsandCs apply Reply HL 4 info
## [6] URGENT! You have won a 1 week FREE membership in our £100,000 Prize Jackpot! Txt the word: CLAIM to No: 81010 T&C www.dbuk.net LCCLTD POBOX 4403LDNW1A7RW18
## 5160 Levels: 'An Amazing Quote'' - "Sometimes in life its difficult to decide whats wrong!! a lie that brings a smile or the truth that brings a tear...." ...
head(ham_messages$text)
## [1] Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
## [2] Ok lar... Joking wif u oni...
## [3] U dun say so early hor... U c already then say...
## [4] Nah I don't think he goes to usf, he lives around here though
## [5] Even my brother is not like to speak with me. They treat me like aids patent.
## [6] As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune
## 5160 Levels: 'An Amazing Quote'' - "Sometimes in life its difficult to decide whats wrong!! a lie that brings a smile or the truth that brings a tear...." ...
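Note the "5160 Levels" line above: read.csv read the text column in as a factor (stringsAsFactors defaults to TRUE in this version of R). Converting it to plain character up front avoids factor-related surprises in the text processing below:

data$text <- as.character(data$text)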
#Clean the data before representing it as a visual. E.g. '£1000'
words <- Corpus(VectorSource(spam_messages$text))
words <- tm_map(words, stripWhitespace)
words <- tm_map(words, content_transformer(tolower))
words <- tm_map(words, removeNumbers)
words <- tm_map(words, removePunctuation)
words <- tm_map(words, removeWords, stopwords("english"))
words <- tm_map(words, stemDocument)
wordcloud(words, max.words = 100, random.order=FALSE, rot.per=0.30,
colors=brewer.pal(7, "Dark2"))
words1 <- Corpus(VectorSource(ham_messages$text))
words1 <- tm_map(words1, stripWhitespace)
words1 <- tm_map(words1, content_transformer(tolower))
words1 <- tm_map(words1, removeNumbers)
words1 <- tm_map(words1, removePunctuation)
words1 <- tm_map(words1, removeWords, stopwords("english"))
words1 <- tm_map(words1, stemDocument)
wordcloud(words1, max.words = 100, random.order=FALSE, rot.per=0.30,
colors=brewer.pal(7, "Dark2"), scale = c(3,0.5))
#Convert the messages into a collection of text documents known as a "corpus".
#Create a "document term matrix" in which rows represent each message and columns represent each word.
#Words are converted to lower case, and numbers and punctuation are removed. Stemming is also performed:
#it removes suffixes from words, making analysis easier by combining words with similar meanings.
#'Calling', 'Calls' and 'Called', for example, would all be reduced to 'Call'.
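A quick illustration of the stemming step, using SnowballC's wordStem() (loaded above):

wordStem(c("calling", "calls", "called"))
## [1] "call" "call" "call"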
corpus <- VCorpus(VectorSource(data$text))
#Note: stop word removal is requested via the 'stopwords' control option;
#whitespace is handled by the tokenizer, so no separate option is needed.
dtm <- DocumentTermMatrix(corpus, control = list(
tolower = TRUE,
removeNumbers = TRUE,
removePunctuation = TRUE,
stopwords = TRUE,
stemming = TRUE
))
#Conventionally, 75-80% of the data is set aside for training the model and 20-25% is used to test the model.
spam_train_set <- data[1:4000,]$type
spam_test_set <- data[4001:5559,]$type
length(spam_train_set)
## [1] 4000
length(spam_test_set)
## [1] 1559
dtmTrain <- dtm[1:4000,]
dtmTest <- dtm[4001:5559,]
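Note that this split is positional: the first 4000 rows train and the remainder test, which is only safe if the rows are not already ordered by class. A random split is generally preferable; a minimal sketch (the seed value and the dtmTrainR/dtmTestR names are just illustrative):

set.seed(123)                          #arbitrary seed, for reproducibility
train_idx <- sample(nrow(data), 4000)  #random 4000-message training sample
#dtmTrainR <- dtm[train_idx, ]; dtmTestR <- dtm[-train_idx, ]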
#Find the terms to keep. With a threshold of 0, findFreqTerms() keeps every
#term; a higher threshold (e.g. 5) would drop rarely used words.
freqWords <- findFreqTerms(dtmTrain,0)
freqTrain <- dtmTrain[,freqWords]
freqTest <- dtmTest[,freqWords]
#Naive Bayes in e1071 works with categorical features, so convert each
#word count to a "Yes"/"No" presence indicator.
convert_counts <- function(x) {
ifelse(x > 0, "Yes", "No")
}
train <- apply(freqTrain, MARGIN = 2, convert_counts)
test <- apply(freqTest, MARGIN = 2, convert_counts)
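For example, a row of counts becomes a row of presence indicators:

convert_counts(c(0, 1, 3))
## [1] "No"  "Yes" "Yes"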
#Now, we train the classifier on the training set that was set aside.
classifier <- naiveBayes(train, spam_train_set)
#classifier[2]$tables$call
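An optional refinement worth knowing about: naiveBayes() accepts a laplace argument. With Laplace smoothing, a word that never appears in one class of the training data no longer forces that class's probability to zero. A sketch (not used for the results below):

classifier_laplace <- naiveBayes(train, spam_train_set, laplace = 1)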
#Now we apply the trained classifier to the held-out test set to see how well it performs.
testPredict <- predict(classifier, test)
CrossTable(testPredict, spam_test_set,
prop.chisq = FALSE, prop.t = FALSE,
dnn = c('predicted', 'actual'))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1559
##
##
## | actual
## predicted | ham | spam | Row Total |
## -------------|-----------|-----------|-----------|
## ham | 1338 | 23 | 1361 |
## | 0.983 | 0.017 | 0.873 |
## | 0.993 | 0.109 | |
## -------------|-----------|-----------|-----------|
## spam | 10 | 188 | 198 |
## | 0.051 | 0.949 | 0.127 |
## | 0.007 | 0.891 | |
## -------------|-----------|-----------|-----------|
## Column Total | 1348 | 211 | 1559 |
## | 0.865 | 0.135 | |
## -------------|-----------|-----------|-----------|
##
##
The misclassification rate is (10 + 23) / 1559 ≈ 0.021, i.e., about 2% of messages were misclassified: 10 ham messages were incorrectly flagged as spam and 23 spam messages slipped through as ham. Some of this error is probably attributable to the fact that we are not removing the least frequently used words (freqWords <- findFreqTerms(dtmTrain,0)).
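Equivalently, the overall accuracy computed from the confusion matrix above:

(1338 + 188) / 1559  #correctly classified ham + spam
## [1] 0.9788326

Let's see how this sample data does with an SVM classifier.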
#SVM
#https://www.svm-tutorial.com/2014/11/svm-classify-text-r/
library(caret)
#Train an SVM on the numeric term counts from the same document-term matrix
#(the "Yes"/"No" matrix built for Naive Bayes is not suitable for svm()).
svmclassify <- svm(x = as.matrix(freqTrain), y = spam_train_set)
svmresult <- predict(svmclassify, as.matrix(freqTest))
CrossTable(svmresult, spam_test_set,
prop.chisq = FALSE, prop.t = FALSE,
dnn = c('predicted', 'actual'))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1559
##
##
## | actual
## predicted | ham | spam | Row Total |
## -------------|-----------|-----------|-----------|
## ham | 1323 | 175 | 1498 |
## | 0.883 | 0.117 | 0.961 |
## | 0.981 | 0.829 | |
## -------------|-----------|-----------|-----------|
## spam | 25 | 36 | 61 |
## | 0.410 | 0.590 | 0.039 |
## | 0.019 | 0.171 | |
## -------------|-----------|-----------|-----------|
## Column Total | 1348 | 211 | 1559 |
## | 0.865 | 0.135 | |
## -------------|-----------|-----------|-----------|
##
##
Classifying the same data with an SVM produced the output above. The misclassification rate is (175 + 25) / 1559 ≈ 0.128, i.e., about 12.8% of messages were misclassified: 25 ham messages were flagged as spam and 175 spam messages were missed. The SVM did not perform better than Naive Bayes in this test on the same data set.
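For comparison, the SVM's overall accuracy from the confusion matrix above:

(1323 + 36) / 1559  #correctly classified ham + spam
## [1] 0.8717126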