Sentiment analysis is the process of determining whether a piece of writing is positive, negative or neutral. It is also known as opinion mining: deriving the opinion or attitude of a speaker. With ever-increasing data volumes it is no longer feasible to read text manually to gauge its emotion; instead, an algorithm can extract sentiment from thousands of text documents in seconds. If we have a labelled training data set, a classifier such as Naive Bayes can be used to classify text reviews. This post walks through how NB can be used to do exactly that in R.

Basic Setup

#set working directory 
setwd("C:/Users/awani/Desktop/50daysofAnalytics")

#load libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(twitteR, wordcloud, tm, tidyr, tidytext, syuzhet, ngram, NLP, RColorBrewer, RTextTools, e1071, caret, knitr)

#read data
sentiment = read.csv("movie_review_sent.csv", stringsAsFactors = F)
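
The rest of the walkthrough assumes the CSV holds one review per row, with a free-text column text and a 0/1 label column Positive (both are referenced below). A quick structural check of the import:

#quick sanity check on the imported data
str(sentiment)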

 

Data Cleaning

The raw text contains plenty of content that carries no sentiment, such as user handles, links, punctuation and digits. It is better to remove these before moving forward. Some custom cleaning might still be required, but the code below takes care of the most common cleaning steps.

### clean data ####

text = sentiment$text   # copy the review text into a separate vector "text"

text = gsub("(RT|via)((?:\\b\\W*@\\w+)+)","",text)         #remove names
text = gsub("http[^[:blank:]]+","",text)                   #remove html links
text = gsub("@\\w+","",text)                               #remove people names
text = gsub("[[:punct:]]","",text)                         #remove punctuations
text = trimws(text, which = c("both", "left", "right"))    # remove whitespace

text = gsub('[[:digit:]]+', '', text)                      # remove digits
text = gsub("[\r\n]", "", text)                            # remove line breaks
text = iconv(text, to = "ASCII//TRANSLIT")                 # remove not readable standard text
text = iconv(text, "ASCII", "UTF-8", sub="")               # remove not readable standard text
text = tolower(text)                                       # lower case

sentiment$text = text
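
A quick spot check of a few cleaned reviews (output omitted here) is a cheap way to confirm the regular expressions behaved as intended:

#spot-check a few cleaned reviews
head(sentiment$text, 3)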

 

Exploratory Analysis

A word cloud is a quick and effective way to visualize word frequency and identify the most repeated words in the corpus. One must, however, be vigilant when making word clouds: used carelessly, they can paint a misleading picture.

# dependent variable
kable(table(sentiment$Positive), col.names = c("Positive", "Frequency"),  align = "l")
Positive   Frequency
0          586
1          983
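
The same table expressed as proportions makes the class balance explicit: roughly 63% of the reviews are positive, which is the baseline any classifier should beat.

#class balance as proportions (983 / (586 + 983) is roughly 0.63)
round(prop.table(table(sentiment$Positive)), 3)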
#independent variable - word cloud
corpus = Corpus(VectorSource(text))                          # convert the cleaned reviews to a corpus

#some more cleaning
corpus = tm_map(corpus, removeWords, stopwords("english"))   # remove stopwords such as "and", "the" and "that"
corpus = tm_map(corpus, stripWhitespace)                     # collapse extra whitespace

# word frequency
uniqwords = as.matrix(TermDocumentMatrix(corpus))            # convert corpus to a term-document matrix
wordfreq = sort(rowSums(uniqwords),decreasing=TRUE)          # total frequency of each word across all reviews
WCinput = data.frame(word = names(wordfreq),freq=wordfreq)   # word frequency data frame

#generate the wordcloud
wordcloud(words = WCinput$word, freq = WCinput$freq, min.freq = 2,
          max.words=200, random.order=FALSE, rot.per=0.35,
          colors=brewer.pal(8, "Dark2"))
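
Because a word cloud scales words in a way that is easy to over-read, a plain bar chart of the top terms is a useful cross-check. A minimal sketch built from the same WCinput data frame (the top-20 cut-off is arbitrary):

#bar chart of the 20 most frequent terms as a cross-check on the word cloud
barplot(WCinput$freq[1:20], names.arg = as.character(WCinput$word[1:20]),
        las = 2, col = brewer.pal(8, "Dark2"))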

 

Data Preparation

Before we move on to training the classifier, we need to split the data into training and validation sets. We will create a document-term matrix, keep only the unique words that appear at least five times in the reviews, and use those words as independent variables in our classifier. We also need to ensure the data is in the correct format.

Data Processing

#convert the dependent variable to a factor
sentiment$Positive = as.factor(sentiment$Positive)

#create corpus
corpus = Corpus(VectorSource(sentiment$text)) 

#create the document-term matrix
NBinput = DocumentTermMatrix(corpus)

# partition the data into training (70%) and validation (30%) sets
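# optional: fix the random seed so the split is reproducible (the seed value here is arbitrary)
set.seed(123)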
Index = sample(1:nrow(sentiment), size = round(0.7*nrow(sentiment)), replace=FALSE)
text.train = sentiment[Index ,]
text.test = sentiment[-Index ,]

# document-term matrices
doc.train = NBinput[Index,]      #document matrix for training
doc.test = NBinput[-Index,]     #document matrix for test

#corpus
corpus.train = corpus[Index]     #corpus for training
corpus.test = corpus[-Index]     # corpus for test

Limit Independent Variables in the Model

## There are many unique words; for NB we keep only those that occur at least five times in the training documents

fivefreq = findFreqTerms(doc.train, 5)          # generate list of words which occur 5 times or more

# restrict the document-term matrices to words that occur five times or more
doc.train.nb = DocumentTermMatrix(corpus.train, control=list(dictionary = fivefreq))
doc.test.nb = DocumentTermMatrix(corpus.test, control=list(dictionary = fivefreq))
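# sanity check: the filtered matrices should keep far fewer terms than the full NBinput matrix
dim(doc.train.nb)
dim(doc.test.nb)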


#Convert frequency count to "No" or "Yes"
convert_count = function(x) {
  y = ifelse(x > 0, 1,0)
  y = factor(y, levels=c(0,1), labels=c("No", "Yes"))
  y
}

# create training and validation features
trainNB = apply(doc.train.nb, 2, convert_count)
testNB = apply(doc.test.nb, 2, convert_count)
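
To make the encoding explicit, here is convert_count applied to a toy vector of counts (illustration only, not part of the pipeline): zero counts become "No", any positive count becomes "Yes".

convert_count(c(0, 2, 1, 0))
## [1] No  Yes Yes No 
## Levels: No Yes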

 

Train NB Classifier

Now that the data is in the shape we need, training the model is straightforward: we simply call the naiveBayes function from e1071 on the trainNB feature set, with the training labels as the target.

#run naive bayes
classifier = naiveBayes(trainNB, factor(text.train$Positive), laplace = 1)
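
The laplace = 1 argument applies add-one smoothing, so a term that never appears with one class does not zero out that class's probability. The fitted e1071 object stores the class counts in classifier$apriori and, for every term, a table of conditional probabilities in classifier$tables. A quick way to peek at them (the term "movie" is only an example and may not be in your vocabulary):

#class counts underlying the prior
classifier$apriori

#conditional probability table for a single term (term chosen purely for illustration)
classifier$tables[["movie"]]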

 

Prediction and Confusion Matrix

To understand how good the classifier is, we predict sentiments on the test data set. A confusion matrix then gives us metrics such as overall accuracy, sensitivity and specificity. The model's performance according to these metrics is satisfactory.

#predict using Naive Bayes
pred = predict(classifier, newdata=testNB)

#confusion matrix
confusionMatrix(pred, text.test$Positive)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 135 126
##          1  39 171
##                                           
##                Accuracy : 0.6497          
##                  95% CI : (0.6047, 0.6928)
##     No Information Rate : 0.6306          
##     P-Value [Acc > NIR] : 0.209           
##                                           
##                   Kappa : 0.3186          
##  Mcnemar's Test P-Value : 2.155e-11       
##                                           
##             Sensitivity : 0.7759          
##             Specificity : 0.5758          
##          Pos Pred Value : 0.5172          
##          Neg Pred Value : 0.8143          
##              Prevalence : 0.3694          
##          Detection Rate : 0.2866          
##    Detection Prevalence : 0.5541          
##       Balanced Accuracy : 0.6758          
##                                           
##        'Positive' Class : 0               
##
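
As a sanity check, the headline numbers can be reproduced directly from the counts in the matrix above (note that class 0 is treated as the positive class here):

#reproduce the headline metrics from the confusion-matrix counts
(135 + 171) / (135 + 126 + 39 + 171)   # accuracy    ~ 0.6497
135 / (135 + 39)                       # sensitivity ~ 0.7759
171 / (126 + 171)                      # specificity ~ 0.5758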