Sentiment analysis is the process of determining whether a piece of writing is positive, negative, or neutral. It is also known as opinion mining: deriving the opinion or attitude of a speaker. With ever-increasing data sizes, it is no longer feasible to read text manually and judge its emotion; instead, an algorithm can extract sentiment from thousands of text documents in seconds. Given a labelled training data set, a classifier such as Naive Bayes (NB) can be used to classify text reviews, and that is what we will walk through here.
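To build intuition for what the classifier does, here is a toy illustration of Bayes' rule for a single word. The counts are made up purely for illustration; Naive Bayes simply combines this kind of per-word evidence across all the words in a review.
# toy example: P(positive | review contains "great"), with made-up counts
p_word_given_pos = 40/100  # suppose "great" appears in 40 of 100 positive reviews
p_word_given_neg = 5/100   # and in 5 of 100 negative reviews
p_pos = 0.5                # assume both classes are equally likely a priori
p_neg = 0.5
p_pos_given_word = (p_word_given_pos * p_pos) /
  (p_word_given_pos * p_pos + p_word_given_neg * p_neg)
p_pos_given_word # ~0.89, so this word pushes the review towards the positive class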
Basic Setup
#set working directory
setwd("C:/Users/awani/Desktop/50daysofAnalytics")
#load libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(twitteR, wordcloud, tm, tidyr, tidytext, syuzhet, ngram, NLP, RColorBrewer, RTextTools, e1071, caret, knitr)
#read data
sentiment = read.csv("movie_review_sent.csv", stringsAsFactors = F)
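Before cleaning, it is worth taking a quick look at the loaded file; based on the code below, we expect a character column text with the review and a 0/1 column Positive with the label.
#quick sanity check of the loaded data
str(sentiment)          # expect columns "text" (character) and "Positive" (0/1)
head(sentiment$text, 2) # peek at a couple of raw reviews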
Data Cleaning
The data will contain plenty of content that conveys no sentiment, such as names, links, punctuation and digits. It is better to remove these before moving forward. Some custom cleaning might still be required, but the code below takes care of the most common cleaning steps.
### clean data ####
text = sentiment$text # copy the review text to a separate vector "text"
text = gsub("(RT|via)((?:\\b\\W*@\\w+)+)","",text) #remove retweet markers and the handles that follow them
text = gsub("http[^[:blank:]]+","",text) #remove http links
text = gsub("@\\w+","",text) #remove @ mentions
text = gsub("[[:punct:]]","",text) #remove punctuation
text = trimws(text, which = "both") # remove leading and trailing whitespace
text = gsub('[[:digit:]]+', '', text) # remove digits
text = gsub("[\r\n]", "", text) # remove line breaks
text = iconv(text, to = "ASCII//TRANSLIT") # transliterate accented characters to ASCII
text = iconv(text, "ASCII", "UTF-8", sub="") # drop any remaining non-ASCII characters
text = tolower(text) # lower case
sentiment$text = text
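To see what these substitutions do, here is the same sequence applied to a single made-up review (illustration only, not part of the data):
#illustration only: the cleaning steps applied to a made-up review
demo = "LOVED it!! Best film of 2019, 10/10 http://example.com :)"
demo = gsub("http[^[:blank:]]+","",demo)
demo = gsub("[[:punct:]]","",demo)
demo = gsub('[[:digit:]]+', '', demo)
demo = tolower(trimws(demo))
demo # roughly "loved it best film of" - links, punctuation, digits and case are gone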
Exploratory Analysis
A word cloud is a quick and effective way to visualize word frequency and identify the most repeated words in the corpus. One must, however, be vigilant when making word clouds: if not used properly, they often paint a misleading picture.
# dependent variable
kable(table(sentiment$Positive), col.names = c("Positive", "Frequency"), align = "l")
| Positive | Frequency |
|---|---|
| 0 | 586 |
| 1 | 983 |
#independent variable - word cloud
corpus = Corpus(VectorSource(text)) # convert the cleaned reviews to a corpus
#some more cleaning
corpus = tm_map(corpus, removeWords, stopwords("english")) #remove stopwords like "and", "the" and "that"
corpus = tm_map(corpus, stripWhitespace) # remove extra whitespace
# word frequency
uniqwords = as.matrix(TermDocumentMatrix(corpus)) # convert corpus to a term-document matrix
wordfreq = sort(rowSums(uniqwords),decreasing=TRUE) # total frequency of each word across the corpus
WCinput = data.frame(word = names(wordfreq),freq=wordfreq) # word frequency dataframe
#generate the wordcloud
wordcloud(words = WCinput$word, freq = WCinput$freq, min.freq = 2,
max.words=200, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
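If the cloud is hard to read, a plain frequency table or bar chart built from the same WCinput data frame gives a less ambiguous view of the top terms (a quick sketch):
#top 10 most frequent words, as a table and a bar chart
head(WCinput, 10)
barplot(WCinput$freq[1:10], names.arg = as.character(WCinput$word[1:10]),
        las = 2, col = brewer.pal(8, "Dark2"), main = "Top 10 terms")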
Data Preparation
Before we move on to training the classifier, we need to split the data into training and validation sets. We will create a document-term matrix, keep only the unique words that appear at least five times in the training reviews, and use them as independent variables in our classifier. We also need to ensure the data is in the correct format.
#covert dependent to factor type
sentiment$Positive = as.factor(sentiment$Positive)
#create corpus
corpus = Corpus(VectorSource(sentiment$text))
#create document-term matrix
NBinput = DocumentTermMatrix(corpus)
# Partition the data into training and validation sets
Index = sample(1:nrow(sentiment), size = round(0.7*nrow(sentiment)), replace=FALSE)
text.train = sentiment[Index ,]
text.test = sentiment[-Index ,]
# document-term matrices for each split
doc.train = NBinput[Index,] #document matrix for training
doc.test = NBinput[-Index,] #document matrix for test
#corpus
corpus.train = corpus[Index] #corpus for training
corpus.test = corpus[-Index] # corpus for test
## There are many unique words; for NB, let's only use words which occur at least five times in the training documents
fivefreq = findFreqTerms(doc.train, 5) # generate list of words which occur 5 times or more
# restrict the document to have only those words which occur five times or more
doc.train.nb = DocumentTermMatrix(corpus.train, control=list(dictionary = fivefreq))
doc.test.nb = DocumentTermMatrix(corpus.test, control=list(dictionary = fivefreq))
#Convert frequency count to "No" or "Yes"
convert_count = function(x) {
y = ifelse(x > 0, 1,0)
y = factor(y, levels=c(0,1), labels=c("No", "Yes"))
y
}
# create training and validation features
trainNB = apply(doc.train.nb, 2, convert_count)
testNB = apply(doc.test.nb, 2, convert_count)
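A couple of quick checks that the conversion behaved as expected (the dimensions should match the split sizes and the length of fivefreq, and the entries should only be "No"/"Yes"):
#sanity checks on the binarised features
convert_count(c(0, 1, 3))  # No Yes Yes
dim(trainNB); dim(testNB)  # rows = reviews in each split, columns = length(fivefreq)
table(trainNB[, 1])        # "No"/"Yes" counts for the first term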
Train NB Classifier
Now that the data is in the shape we need, training the model is simple: we call the naiveBayes function (from the e1071 package) on the trainNB features and the training labels.
#run naive bayes
classifier = naiveBayes(trainNB, factor(text.train$Positive), laplace = 1)
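The fitted object stores the class priors and, for every term, the conditional probability of seeing it in a positive versus a negative review. Peeking at one of these tables is a useful sanity check; the term "good" used here is only an assumption and works only if it made it into the fivefreq dictionary.
#peek inside the fitted model
classifier$apriori            # class counts in the training data
classifier$tables[["good"]]   # P(term present | class) for the term "good", if present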
Prediction and Confusion Matrix
To understand how good the classifier is, we predict sentiments on the test data set. A confusion matrix then gives us metrics such as overall accuracy, sensitivity and specificity. The model's performance according to these metrics is satisfactory.
#predict using Naive Bayes
pred = predict(classifier, newdata=testNB)
#confusion matrix
confusionMatrix(pred, text.test$Positive)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 135 126
## 1 39 171
##
## Accuracy : 0.6497
## 95% CI : (0.6047, 0.6928)
## No Information Rate : 0.6306
## P-Value [Acc > NIR] : 0.209
##
## Kappa : 0.3186
## Mcnemar's Test P-Value : 2.155e-11
##
## Sensitivity : 0.7759
## Specificity : 0.5758
## Pos Pred Value : 0.5172
## Neg Pred Value : 0.8143
## Prevalence : 0.3694
## Detection Rate : 0.2866
## Detection Prevalence : 0.5541
## Balanced Accuracy : 0.6758
##
## 'Positive' Class : 0
##
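Beyond the aggregate metrics, it can help to line predictions up against the actual labels, and to look at the posterior probabilities naiveBayes assigns to a few test reviews (a quick sketch using the objects created above):
#predictions vs actual labels for a few test reviews
head(data.frame(actual = text.test$Positive, predicted = pred))
#posterior class probabilities instead of hard class labels
head(predict(classifier, newdata = testNB, type = "raw"))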