Preliminary

We will use one new package, the syuzhet package. If you do not already have this package installed, first install it using the install.packages() function.

install.packages("syuzhet")

We will also use the tm and caret packages. Next, we load these packages for use in the session.

library(tm)
library(caret)
library(syuzhet)

In the lessons that follow, we use the imdb_reviews.csv file, which contains 1000 movie reviews from IMDB and an assigned polarity value (positive_flag) indicating the sentiment of the review (0 = negative, 1 = positive). Each review has a unique identifier, doc_id, and the review text (text).

For sentiment analysis using lexicons, we use the original data from the imdb_reviews.csv file. For sentiment analysis using classification, we use the prepared data in the imdb_df.csv file.

We use the read.csv() function to import the CSV files into R. We set stringsAsFactors = FALSE to keep character columns as character vectors rather than converting them to factors. We also use the na.strings argument to specify which character strings (in the text column/variable) should be treated as NA, or missing, values. We use na.strings = c("", " ") so that empty text documents ("") and documents containing only white space (" ") are converted to NA values.

imdb <- read.csv(file = "imdb_reviews.csv",
                 stringsAsFactors = FALSE,
                 na.strings = c("", " "))

imdb_df <- read.csv(file = "imdb_df.csv",
                    stringsAsFactors = FALSE)

Sentiment Analysis: Lexicon

Data Exploration & Preparation

For sentiment analysis using lexicons, we will use the original data, our imdb dataframe. We can examine the structure of the data and preview the variables.

str(imdb)
## 'data.frame':    1000 obs. of  3 variables:
##  $ doc_id       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ text         : chr  "A very, very, very slow-moving, aimless movie about a distressed, drifting young man.  " "Not sure who was more lost - the flat characters or the audience, nearly half of whom walked out.  " "Attempting artiness with black & white and clever camera angles, the movie disappointed - became even more ridi"| __truncated__ "Very little music or anything to speak of.  " ...
##  $ positive_flag: int  0 0 0 0 1 0 0 1 0 1 ...

The positive_flag variable is our variable of interest. First, we convert it to a nominal (factor) variable. For compatibility with the sign-converted lexicon scores we create later, we relabel the 0 level as -1.

imdb$positive_flag <- factor(x = imdb$positive_flag,
                             labels = c(-1,1))

We can plot the distribution of our positive_flag variable.

plot(imdb$positive_flag, # sentiment
     main = "Review Sentiment", # plot title
     xlab = "Positive Flag") # x-axis label

Sentiment Scoring

Next, we use the get_sentiment() function from the syuzhet package to obtain sentiment scores for our documents. We will add columns to our imdb dataframe with the sentiment scores.

  1. Jockers
imdb$jockers <- get_sentiment(char_v = imdb$text, # text data
                              method = "syuzhet") # Jockers 'syuzhet' lexicon
  2. Bing
imdb$bing <- get_sentiment(char_v = imdb$text, # text data
                           method = "bing") # Bing lexicon
  3. AFINN
imdb$afinn <- get_sentiment(char_v = imdb$text, # text data
                            method = "afinn") # AFINN lexicon
  4. NRC
imdb$nrc <- get_sentiment(char_v = imdb$text, # text data
                          method = "nrc") # NRC lexicon

For NRC, we can also get emotion information. We can get values for 8 emotions (anger, anticipation, disgust, fear, joy, sadness, surprise, trust) using the get_nrc_sentiment() function. Then, we can add up the instances of these emotions using the colSums() function to get total emotion information for the document collection.

nrc_emot <- get_nrc_sentiment(char_v = imdb$text)
colSums(nrc_emot[ ,1:8])
##        anger anticipation      disgust         fear          joy      sadness 
##          290          394          311          339          414          351 
##     surprise        trust 
##          241          473
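
For example, we can visualize these emotion totals with a bar plot (a minimal sketch using base R graphics):

barplot(height = colSums(nrc_emot[ ,1:8]), # total count per emotion
        main = "NRC Emotion Totals", # plot title
        las = 2) # rotate category labels for readability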

We can visualize the distributions of our sentiment scores for our four lexicons.
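
For example, a 2-by-2 grid of histograms, one per lexicon, is a minimal sketch using base R graphics:

par(mfrow = c(2, 2)) # 2 x 2 plotting grid
hist(x = imdb$jockers, main = "Jockers", xlab = "Sentiment Score")
hist(x = imdb$bing, main = "Bing", xlab = "Sentiment Score")
hist(x = imdb$afinn, main = "AFINN", xlab = "Sentiment Score")
hist(x = imdb$nrc, main = "NRC", xlab = "Sentiment Score")
par(mfrow = c(1, 1)) # reset plotting grid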

Next, we can convert our sentiment assignments to be compatible with our positive_flag variable. We will create a new dataframe to hold the sentiment, named sents_sub, which is a duplicate of the last 4 columns of our imdb dataframe (the columns containing the 4 sentiment scores).

sents_sub <- imdb[ ,(ncol(imdb)-3):ncol(imdb)]

We convert each sentiment score to its sign (-1, 0, or 1) using the sign() function.

sents_sub <- data.frame(lapply(X = sents_sub,
                               FUN = sign))

We will consider neutral sentiment to be positive, so we replace 0 values with 1s.

sents_sub[sents_sub == 0] <- 1

Finally, we convert the columns to factors, so that we can compare the sentiment (-1,1) to our positive_flag variable.

sents_sub <- data.frame(lapply(X = sents_sub, 
                               FUN = as.factor))

Now, we can plot the distributions of our compatible assignments.
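
As with the raw scores, one option is a 2-by-2 grid of bar plots (a sketch in base R; calling plot() on a factor produces a bar plot):

par(mfrow = c(2, 2)) # 2 x 2 plotting grid
plot(sents_sub$jockers, main = "Jockers", xlab = "Assigned Sentiment")
plot(sents_sub$bing, main = "Bing", xlab = "Assigned Sentiment")
plot(sents_sub$afinn, main = "AFINN", xlab = "Assigned Sentiment")
plot(sents_sub$nrc, main = "NRC", xlab = "Assigned Sentiment")
par(mfrow = c(1, 1)) # reset plotting grid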

We can use the confusionMatrix() function to compare our assigned sentiment to the actual sentiment (positive_flag). We will compare only one lexicon's results here (Jockers), but each of the lexicon sentiments could be compared against the actual sentiment; a sketch for all four follows the output below.

jockers_conf <- confusionMatrix(data = sents_sub$jockers, # assigned sentiment
                                reference = imdb$positive_flag, # actual
                                positive = "1",
                                mode = "everything")
jockers_conf
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  -1   1
##         -1 308  46
##         1  192 454
##                                           
##                Accuracy : 0.762           
##                  95% CI : (0.7344, 0.7881)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.524           
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9080          
##             Specificity : 0.6160          
##          Pos Pred Value : 0.7028          
##          Neg Pred Value : 0.8701          
##               Precision : 0.7028          
##                  Recall : 0.9080          
##                      F1 : 0.7923          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4540          
##    Detection Prevalence : 0.6460          
##       Balanced Accuracy : 0.7620          
##                                           
##        'Positive' Class : 1               
## 

As shown, the lexicon does a much better job of identifying the positive reviews than the negative reviews.
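
If we wanted to repeat the comparison for all four lexicons, one approach (a sketch, using the sents_sub columns created above) is to loop over the columns and collect each lexicon's accuracy:

sapply(X = sents_sub, # one column of assignments per lexicon
       FUN = function(assigned) {
         confusionMatrix(data = assigned, # assigned sentiment
                         reference = imdb$positive_flag, # actual
                         positive = "1")$overall["Accuracy"]
       })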

Sentiment Analysis: Classification

For sentiment analysis using classification methods, we will use the imdb_df.csv file that we prepared in the TextMiningI.R script file and imported into our current session as imdb_df. The dataframe contains a column/variable for each of our terms and the target variable, positive_flag. The values in the term variables are the TF-IDF-weighted values.

Again, the positive_flag variable is our variable of interest in a predictive model. First, we can convert it to a nominal variable.

imdb_df$positive_flag <- factor(imdb_df$positive_flag)

Training & Testing Sets

We use the createDataPartition() function from the caret package to identify the row numbers that we will include in our training set. Then, all other rows will be put in our testing set. We split the data using an 85/15 split (85% in training and 15% in testing). By using createDataPartition() we preserve the distribution of our outcome (Y) variable (positive_flag). Since the function takes a random sample, we initialize a random seed first for reproducibility. We use the imdb_df dataframe to create our train and test sets.

set.seed(831) 
sub <- createDataPartition(y = imdb_df$positive_flag, # target variable
                           p = 0.85, # % in training
                           list = FALSE)

Next, we subset the rows of the imdb_df dataframe to include the row numbers in the sub object to create the train dataframe. We use all observations not in the sub object to create the test dataframe.

train <- imdb_df[sub, ] 
test <- imdb_df[-sub, ]
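
To confirm that the split preserved the distribution of positive_flag, we can compare the class proportions in the two sets (a quick check in base R):

prop.table(table(train$positive_flag)) # class proportions in training
prop.table(table(test$positive_flag)) # class proportions in testing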

Classification Analysis (Bagging)

We will train a Bagging ensemble classification model using the train() function from the caret package, performing repeated 5-fold cross-validation to estimate the generalization error.

First, we set up our control object for input in the train() function for the trControl argument.

ctrl <- trainControl(method = "repeatedcv", # repeated cross-validation
                     number = 5, # k = 5 folds
                     repeats = 3) # repeated 3 times

Next, we initialize a random seed for our resampling.

set.seed(831)

Then, we use the train() function to train the Bagging (method = "treebag") model using 5-fold cross-validation repeated 3 times.

bagMod <- train(x = train[ ,-ncol(train)], # use all terms as predictors
                y = train$positive_flag, # predict positive_flag
                trControl = ctrl, # control object
                method = "treebag") # bagging model

We can view the average Accuracy and Kappa values across our resamples.

bagMod
## Bagged CART 
## 
## 794 samples
## 226 predictors
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 635, 635, 635, 635, 636, 635, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.7069872  0.4142908

Testing Performance

We use the predict() function to generate class predictions for our testing data set and evaluate model performance.

bag.preds <- predict(object = bagMod,
                     newdata = test)

Next, we get performance measures using the confusionMatrix() function.

bag_conf <- confusionMatrix(data = bag.preds, # predictions
                            reference = test$positive_flag, # actual
                            positive = "1",
                            mode = "everything")
bag_conf
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 64 28
##          1  5 42
##                                           
##                Accuracy : 0.7626          
##                  95% CI : (0.6831, 0.8306)
##     No Information Rate : 0.5036          
##     P-Value [Acc > NIR] : 3.373e-10       
##                                           
##                   Kappa : 0.5263          
##                                           
##  Mcnemar's Test P-Value : 0.0001283       
##                                           
##             Sensitivity : 0.6000          
##             Specificity : 0.9275          
##          Pos Pred Value : 0.8936          
##          Neg Pred Value : 0.6957          
##               Precision : 0.8936          
##                  Recall : 0.6000          
##                      F1 : 0.7179          
##              Prevalence : 0.5036          
##          Detection Rate : 0.3022          
##    Detection Prevalence : 0.3381          
##       Balanced Accuracy : 0.7638          
##                                           
##        'Positive' Class : 1               
## 

As shown, the classification analysis is better able to predict the negative reviews than the positive reviews.

Comparing Lexicon & Classification

We can compare the results of our lexicon (Jockers) and classification (Bagging) based sentiment analyses using the cbind() function on our confusionMatrix() output.

Overall Measures

cbind(Lexicon = jockers_conf$overall,
      Classification = bag_conf$overall)
##                     Lexicon Classification
## Accuracy       7.620000e-01   7.625899e-01
## Kappa          5.240000e-01   5.262832e-01
## AccuracyLower  7.343563e-01   6.830570e-01
## AccuracyUpper  7.880916e-01   8.305956e-01
## AccuracyNull   5.000000e-01   5.035971e-01
## AccuracyPValue 8.479288e-65   3.372839e-10
## McnemarPValue  5.510818e-21   1.282952e-04

Class-Level Measures

cbind(Lexicon = jockers_conf$byClass,
      Classification = bag_conf$byClass)
##                        Lexicon Classification
## Sensitivity          0.9080000      0.6000000
## Specificity          0.6160000      0.9275362
## Pos Pred Value       0.7027864      0.8936170
## Neg Pred Value       0.8700565      0.6956522
## Precision            0.7027864      0.8936170
## Recall               0.9080000      0.6000000
## F1                   0.7923211      0.7179487
## Prevalence           0.5000000      0.5035971
## Detection Rate       0.4540000      0.3021583
## Detection Prevalence 0.6460000      0.3381295
## Balanced Accuracy    0.7620000      0.7637681

As shown, the two methods have similar overall performance. However, based on the sensitivity/recall and specificity values, the lexicon-based method is better able to assign the correct class to the positive reviews, while the classification-based approach is better able to predict the negative reviews.