Introduction

Sentiment Analysis, or Opinion Mining, was introduced in the early 2000s as a method for understanding and analyzing opinions and feelings (Dave, Lawrence and Pennock 2003; Liu 2012; Nasukawa and Yi 2003).

Why do opinions/sentiment matter to businesses?

Lexicon-based Sentiment Analysis uses previously scored words and/or phrases to assign a sentiment value to new text data. Each word or phrase in the new text that matches an entry in the lexicon is assigned that entry's value.
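The matching step can be sketched in base R. This is a minimal illustration, not the method used by any particular package: the lexicon values, the helper name score_text() and the tokenization rule are all assumptions made for the example.

```r
# Toy lexicon: the words and values below are invented for illustration
toy_lexicon <- data.frame(
  word  = c("love", "great", "poor", "terrible"),
  value = c(3, 2, -2, -3),
  stringsAsFactors = FALSE
)

# Hypothetical helper: sum the lexicon values of the tokens found in `text`
score_text <- function(text, lexicon) {
  # Lower-case, then split on anything that is not a letter or apostrophe
  tokens <- unlist(strsplit(tolower(text), "[^a-z']+"))
  tokens <- tokens[nzchar(tokens)]
  # Look up each token; tokens absent from the lexicon come back as NA
  matched <- lexicon$value[match(tokens, lexicon$word)]
  # Only matched tokens contribute to the document score
  sum(matched, na.rm = TRUE)
}

score_text("I love this dress, but the fabric is poor.", toy_lexicon)
## [1] 1
```

Here "love" (3) and "poor" (-2) are the only matches, so the document score is 1; every other token is ignored.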

Sentiment lexicons demonstrated in the analysis that follows include:

rbind(head(tidytext::get_sentiments("afinn"), n = 5), tail(tidytext::get_sentiments("afinn"), n = 5))
## # A tibble: 10 x 2
##    word      value
##    <chr>     <dbl>
##  1 abandon      -2
##  2 abandoned    -2
##  3 abandons     -2
##  4 abducted     -2
##  5 abduction    -2
##  6 yucky        -2
##  7 yummy         3
##  8 zealot       -2
##  9 zealots      -2
## 10 zealous       2
lexicon::hash_sentiment_huliu
##                x  y
##    1:     a plus  1
##    2:   abnormal -1
##    3:    abolish -1
##    4: abominable -1
##    5: abominably -1
##   ---              
## 6870:  zealously -1
## 6871:     zenith  1
## 6872:       zest  1
## 6873:      zippy  1
## 6874:     zombie -1
lexicon::hash_sentiment_jockers
##                  x     y
##     1:     abandon -0.75
##     2:   abandoned -0.50
##     3:   abandoner -0.25
##     4: abandonment -0.25
##     5:    abandons -1.00
##    ---                  
## 10734:     zealous  0.40
## 10735:      zenith  0.40
## 10736:        zest  0.50
## 10737:      zombie -0.25
## 10738:     zombies -0.25
lexicon::hash_sentiment_nrc
##                 x  y
##    1:     abandon -1
##    2:   abandoned -1
##    3: abandonment -1
##    4:        abba  1
##    5:   abduction -1
##   ---               
## 5464:       youth  1
## 5465:        zeal  1
## 5466:     zealous  1
## 5467:        zest  1
## 5468:         zip -1

Below is a sampling of sentiment classifications that are the same across many lexicons.

Sentiment Analysis

Preliminary & Preprocessing

library(tm) # text mining
library(lexicon) # sentiment lexicons
library(wordcloud) # visualization
library(irr) # inter-rater reliability for lexicons
library(textstem) # stemming and lemmatization (example purposes only)
library(syuzhet) # sentiment lexicons and analysis
load("SA_lexicons.RData")

For Sentiment Analysis using lexicons, the following preprocessing steps are typically undertaken (if the selected algorithm does not perform preprocessing):

  • case conversion
  • number removal
  • punctuation removal*
  • white space removal
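The four steps above can be sketched with base R string functions. This is only an illustration under assumptions: preprocess_text() is a hypothetical helper, and replacing punctuation with a space (rather than deleting it) is a choice made here, not a requirement. The tm package loaded above offers removeNumbers(), removePunctuation() and stripWhitespace() for the same purpose.

```r
# Hypothetical helper illustrating the four preprocessing steps
preprocess_text <- function(x) {
  x <- tolower(x)                    # case conversion
  x <- gsub("[[:digit:]]+", "", x)   # number removal
  x <- gsub("[[:punct:]]+", " ", x)  # punctuation removal (replaced by a space)
  x <- gsub("[[:space:]]+", " ", x)  # collapse repeated white space
  trimws(x)                          # drop leading and trailing white space
}

preprocess_text("  LOVED it!!  Bought 2 more...  ")
## [1] "loved it bought more"
```

Replacing punctuation with a space keeps words apart when they are joined only by punctuation, but it also splits contractions such as "don't" into two tokens, which matters if the chosen lexicon contains contractions.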

Why no stop word removal?

Many stop word lists, including the default English and SMART lists, contain sentiment-bearing words. Since we are matching our terms to a lexicon, removing stop words offers no benefit and risks discarding these words.

English ("english" or "en"):

stopwords("en")[stopwords("en") %in% lexicons$word]
##  [1] "this"    "these"   "being"   "have"    "do"      "ought"   "as"     
##  [8] "by"      "for"     "about"   "against" "through" "above"   "up"     
## [15] "down"    "out"     "on"      "off"     "over"    "under"   "again"  
## [22] "further" "then"    "once"    "all"     "each"    "some"    "such"   
## [29] "no"      "own"     "same"    "too"

SMART ("SMART"):

stopwords("SMART")[stopwords("SMART") %in% lexicons$word]
##   [1] "able"          "about"         "above"         "according"    
##   [5] "accordingly"   "actually"      "again"         "against"      
##   [9] "all"           "allow"         "allows"        "alone"        
##  [13] "already"       "always"        "anyway"        "apart"        
##  [17] "appear"        "appreciate"    "appropriate"   "around"       
##  [21] "as"            "aside"         "ask"           "available"    
##  [25] "away"          "awfully"       "become"        "becoming"     
##  [29] "behind"        "being"         "believe"       "beside"       
##  [33] "best"          "better"        "beyond"        "brief"        
##  [37] "by"            "cause"         "certain"       "clearly"      
##  [41] "come"          "consider"      "contain"       "course"       
##  [45] "currently"     "described"     "despite"       "different"    
##  [49] "do"            "down"          "downwards"     "each"         
##  [53] "eight"         "else"          "elsewhere"     "enough"       
##  [57] "entirely"      "even"          "every"         "everything"   
##  [61] "everywhere"    "exactly"       "example"       "far"          
##  [65] "fifth"         "first"         "five"          "following"    
##  [69] "for"           "former"        "four"          "further"      
##  [73] "get"           "go"            "going"         "gone"         
##  [77] "greetings"     "have"          "hello"         "help"         
##  [81] "hence"         "hereupon"      "hopefully"     "ignored"      
##  [85] "immediate"     "indicate"      "inner"         "inward"       
##  [89] "just"          "keep"          "kept"          "know"         
##  [93] "last"          "less"          "let"           "like"         
##  [97] "liked"         "likely"        "look"          "looking"      
## [101] "mainly"        "many"          "may"           "maybe"        
## [105] "mean"          "merely"        "name"          "namely"       
## [109] "near"          "nearly"        "necessary"     "need"         
## [113] "needs"         "nevertheless"  "new"           "next"         
## [117] "nine"          "no"            "non"           "nothing"      
## [121] "novel"         "now"           "nowhere"       "obviously"    
## [125] "off"           "often"         "ok"            "okay"         
## [129] "old"           "on"            "once"          "one"          
## [133] "otherwise"     "ought"         "out"           "outside"      
## [137] "over"          "overall"       "own"           "perhaps"      
## [141] "please"        "plus"          "possible"      "probably"     
## [145] "rather"        "reasonably"    "regardless"    "relatively"   
## [149] "right"         "same"          "saw"           "say"          
## [153] "second"        "see"           "self"          "sensible"     
## [157] "seven"         "several"       "six"           "some"         
## [161] "somehow"       "sometime"      "soon"          "sorry"        
## [165] "specify"       "still"         "sub"           "such"         
## [169] "sup"           "take"          "taken"         "tell"         
## [173] "thank"         "thanks"        "then"          "thereupon"    
## [177] "these"         "think"         "third"         "this"         
## [181] "thorough"      "thoroughly"    "three"         "through"      
## [185] "together"      "too"           "toward"        "try"          
## [189] "trying"        "twice"         "two"           "under"        
## [193] "unfortunately" "unlikely"      "unto"          "up"           
## [197] "upon"          "us"            "use"           "used"         
## [201] "useful"        "using"         "value"         "want"         
## [205] "way"           "welcome"       "well"          "whatever"     
## [209] "whither"       "whole"         "willing"       "wish"         
## [213] "wonder"        "yes"           "yet"           "zero"

Why not stemming or lemmatizing?

Let’s consider an illustrative example of why neither stemming nor lemmatization should be performed (contrary to popular belief) prior to lexicon-based Sentiment Analysis. Consider the terms: love, loves, loved, and loving.

Stemming will result in:

stem_strings(c("love", "loves", "loved", "loving"))
## [1] "love" "love" "love" "love"

Lemmatization will result in:

lemmatize_strings(c("love", "loves", "loved", "loving"))
## [1] "love" "love" "love" "love"

Stemming and lemmatization produce the same result here: the word love (note: this will not typically be the case). If we applied either method before matching to lexicons, every form would receive the score for love. The sentiment scores for these terms in 7 lexicons are:

lexicons[lexicons$word %in% c("love", "loves", "loved", "loving"),]
##        word afinn bing jockers loughran nrc senticnet sentiwordnet
## 1471   love     3    1    0.75       NA   1     0.655        0.375
## 1472  loved     3    1    0.50       NA  NA     0.658           NA
## 1475 loving     2    1    0.75       NA   1     0.383           NA
## 6363  loves    NA    1    1.00       NA  NA        NA           NA

As shown, the sentiment scores assigned to the root word, love, are not necessarily the same as those assigned to its inflected forms, and some forms are missing from a lexicon entirely. If we apply stemming or lemmatization first, the sentiment values assigned to the terms in our document collection will therefore not always reflect the lexicon's polarity values.
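A small sketch makes the effect concrete. It uses the AFINN values from the table above; the two-token "document" and the rule that unmatched tokens contribute nothing are assumptions made for the illustration.

```r
# AFINN values copied from the table above; "loves" has no AFINN entry (NA)
afinn <- c(love = 3, loved = 3, loving = 2, loves = NA)

# A hypothetical two-token document
tokens <- c("loving", "loves")

# Score the tokens as written; the unmatched token contributes nothing
as_written <- sum(afinn[tokens], na.rm = TRUE)

# Score after collapsing both tokens to the root "love"
collapsed <- sum(afinn[rep("love", length(tokens))], na.rm = TRUE)

c(as_written = as_written, collapsed = collapsed)
## as_written  collapsed 
##          2          6
```

Collapsing to the root triples the score in this case: "loving" is upgraded from 2 to 3, and "loves", which AFINN does not score at all, is treated as a 3.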

After a particular lexicon is chosen, lemmatization or stemming can be used, where appropriate.

Analysis

We can use the get_sentiment() function in the syuzhet package to compute sentiment scores for 4 lexicons: afinn, bing, jockers (syuzhet, default) and nrc. Since the algorithm for get_sentiment() handles tokenization and cleaning, the text variable can be used directly as input and the scores can be saved as new columns in the data.

cr$jockers <- get_sentiment(cr$text, method = "syuzhet")
cr$bing <- get_sentiment(cr$text, method = "bing")
cr$afinn <- get_sentiment(cr$text, method = "afinn")
cr$nrc <- get_sentiment(cr$text, method = "nrc")

Internal Validity

In addition to the aggregated numerical scores, we can also assign categorical scores of -1, 0 and 1, to represent negative, neutral and positive, respectively.

sents_sub <- cr[ , (ncol(cr)-3):ncol(cr)]
sents_sub <- data.frame(lapply(sents_sub, sign))
sents_sub <- data.frame(lapply(sents_sub, as.factor))

We can apply the table() function to the sentiment scores to view the sentiment distribution across the lexicons.

lapply(sents_sub, table)
## $jockers
## 
##   -1    0    1 
##  190   32 4286 
## 
## $bing
## 
##   -1    0    1 
##  219  244 4045 
## 
## $afinn
## 
##   -1    0    1 
##  139   90 4279 
## 
## $nrc
## 
##   -1    0    1 
##  586  676 3246

In evaluating the results of the lexicon-based analysis, we can consider internal and external validation measures; both can help us choose which lexicon to use. For internal validation, inter-rater reliability (IRR) can be used to assess the agreement across lexicons.

We can assess the overall IRR for the sentiment scores using Fleiss’ Kappa, since we have categorical data and more than 2 ‘raters’.
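To show what Fleiss' Kappa measures, the statistic can be computed by hand in base R. This is a sketch of the standard formula, not the code inside the irr package; fleiss_kappa_sketch() and the toy ratings are assumptions made for the illustration, and the result should agree with kappam.fleiss() under its default settings.

```r
# Toy ratings, invented for illustration: 5 documents (rows), 3 lexicons (columns)
toy_ratings <- data.frame(
  lex_a = c(1, 1, -1,  0, 1),
  lex_b = c(1, 1, -1,  1, 1),
  lex_c = c(1, 0, -1, -1, 1)
)

# Hypothetical helper implementing the usual Fleiss' Kappa formula
fleiss_kappa_sketch <- function(ratings) {
  ratings    <- as.matrix(ratings)
  n_subjects <- nrow(ratings)
  n_raters   <- ncol(ratings)
  categories <- sort(unique(as.vector(ratings)))

  # counts[i, j]: how many raters put subject i in category j
  counts <- sapply(categories, function(k) rowSums(ratings == k))

  # Observed agreement per subject, then averaged across subjects
  p_i   <- (rowSums(counts^2) - n_raters) / (n_raters * (n_raters - 1))
  p_bar <- mean(p_i)

  # Agreement expected by chance from the overall category proportions
  p_j <- colSums(counts) / (n_subjects * n_raters)
  p_e <- sum(p_j^2)

  (p_bar - p_e) / (1 - p_e)
}

fleiss_kappa_sketch(toy_ratings)  # 49/124, about 0.395
```

Kappa rescales the observed agreement so that 0 corresponds to chance-level agreement and 1 to perfect agreement.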

Below are some recommendations for evaluating Kappa values:

kappam.fleiss(sents_sub)
##  Fleiss' Kappa for m Raters
## 
##  Subjects = 4508 
##    Raters = 4 
##     Kappa = 0.219 
## 
##         z = 46.4 
##   p-value = 0

We find poor IRR across all of the lexicons. Since we have ordinal values, we can use weighted (squared) Kappa values to compare the lexicons pairwise.

We can use a custom function, irr_vals(), to generate pairwise IRR values for our lexicons.
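The definition of irr_vals() is not shown here. The sketch below is one way such a function could be written in base R, assuming it computes Cohen's Kappa with squared weights between lexicon i and each of the others. The names weighted_kappa_sq() and irr_vals_sketch(), the extra ratings argument and the toy data are all hypothetical. The irr package's kappa2() function, with weight = "squared", is an alternative to the hand-written helper.

```r
# Hypothetical helper: Cohen's Kappa with squared weights for two raters
weighted_kappa_sq <- function(a, b, levels = c(-1, 0, 1)) {
  a <- factor(a, levels = levels)
  b <- factor(b, levels = levels)
  # Observed joint proportions and the proportions expected by chance
  observed <- table(a, b) / length(a)
  expected <- outer(rowSums(observed), colSums(observed))
  # Disagreement weights grow with the squared distance between categories
  idx     <- seq_along(levels)
  weights <- outer(idx, idx, function(i, j) (i - j)^2)
  1 - sum(weights * observed) / sum(weights * expected)
}

# Hypothetical stand-in for irr_vals(): compare column i with every column
irr_vals_sketch <- function(i, ratings) {
  out <- sapply(seq_along(ratings), function(j) {
    if (j == i) NA else weighted_kappa_sq(ratings[[i]], ratings[[j]])
  })
  names(out) <- names(ratings)
  out
}

# Toy sentiment categories, invented for illustration
toy <- data.frame(
  jockers = c(1,  1, -1, 0,  1, -1),
  bing    = c(1,  0, -1, 1,  1, -1),
  nrc     = c(1, -1,  0, 1, -1,  1)
)

irr_vals_sketch(1, toy)
```

With squared weights, a negative-versus-positive disagreement is penalized four times as heavily as a disagreement between adjacent categories, which suits the ordinal -1, 0, 1 coding.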

Looping over our first 3 lexicons, we can output the pair-wise IRR results as a list object.

irrl <- list()
for (i in 1:3){
  irrl[[i]] <- irr_vals(i)
}
names(irrl) <- names(sents_sub)[1:3]
irrl
## $jockers
##   jockers      bing     afinn       nrc 
##        NA 0.6564727 0.5403925 0.2452963 
## 
## $bing
##   jockers      bing     afinn       nrc 
## 0.6564727        NA 0.4983843 0.2761485 
## 
## $afinn
##   jockers      bing     afinn       nrc 
## 0.5403925 0.4983843        NA 0.1909130

As shown, Jockers, AFINN and Bing have moderate IRR and are the most consistent. Using Fleiss’ Kappa, we can confirm this for the 3 lexicons.

kappam.fleiss(sents_sub[,c("jockers", "bing", "afinn")])
##  Fleiss' Kappa for m Raters
## 
##  Subjects = 4508 
##    Raters = 3 
##     Kappa = 0.427 
## 
##         z = 62.7 
##   p-value = 0

We can combine our data with sentiment scores and the dataframe containing the sentiment labels.

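# duplicated column names are made unique, so the categorical columns
# are stored as jockers.1, bing.1, afinn.1 and nrc.1
cr <- data.frame(cr, sents_sub,
                 stringsAsFactors = FALSE)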

We will use the Bing lexicon in the example that follows. First, we obtain descriptive statistics and visualize the Bing sentiment scores, both continuous and categorical.

Bing Sentiment Score (Continuous)

summary(cr$bing)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -5.000   2.000   3.000   3.485   5.000  13.000
hist(cr$bing,
     xlab = "",
     main = "Sentiment (bing)",
     col = "steelblue")

Bing Sentiment Score (Categorical)

table(cr$bing.1)
## 
##   -1    0    1 
##  219  244 4045
barplot(table(cr$bing.1),
        names.arg = c("Negative", "Neutral", "Positive"),
        col = c("darkred", "darkgoldenrod", "darkgreen"),
        main = "Sentiment (bing)")

We can use a custom function, lexicon_wordcloud(), to produce side-by-side wordclouds displaying the most frequent positive and negative terms in the document collection based on a chosen lexicon. The function takes as arguments:

  • dtm: a DocumentTermMatrix() object
  • df: the original dataframe that the DTM was created from.
  • lexicon: the lexicon to use to create the wordcloud. Valid values include: “afinn” (default), “bing”, “nrc”, “jockers”, “loughran”, “senticnet”, “sentiwordnet”.
  • seed: the seed value to use in set.seed(). Defaults to 831.

lexicon_wordcloud(dtm = cr_DTM_SA,
                  df = cr, 
                  lexicon = "bing")

External Validity

After sentiment scores are assigned to documents and internal reliability is assessed, we want to evaluate our sentiments with respect to external validity, if labeled data is available.

Rating

In this dataset, the closest variable to a sentiment variable is the Rating. We can evaluate the correlation between the Rating and our sentiment categories.

We can perform a Chi-Square Test for Independence to test for an association between our two categorical variables. Note: Chi-Square is valid for categorical variables and for ordinal variables with few categories.

If p < 0.05, we have evidence of an association between the categorical variables.
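To make the test statistic less of a black box, it can be reproduced by hand on a small table. The counts below are invented for illustration and are not taken from the review data.

```r
# Toy contingency table, invented for illustration
toy_tab <- matrix(c(20, 5, 10, 15, 10, 40), nrow = 2,
                  dimnames = list(Rating = c("low", "high"),
                                  Sentiment = c("-1", "0", "1")))

# Counts expected under independence: (row total * column total) / grand total
expected <- outer(rowSums(toy_tab), colSums(toy_tab)) / sum(toy_tab)

# Pearson's statistic: squared deviations scaled by the expected counts
stat <- sum((toy_tab - expected)^2 / expected)

# Degrees of freedom and the upper-tail p-value
df_toy <- (nrow(toy_tab) - 1) * (ncol(toy_tab) - 1)
p_val  <- pchisq(stat, df = df_toy, lower.tail = FALSE)

c(statistic = stat, df = df_toy, p.value = p_val)

# The hand computation should match the built-in test
chisq.test(toy_tab)$statistic
```

The statistic grows as the observed counts move away from those expected under independence, so a large value (and a small p-value) points to an association between the two variables.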

Rating_bing <- table(Rating = cr$Rating,
                     Sentiment = cr$bing.1)

chisq.test(Rating_bing)
## 
##  Pearson's Chi-squared test
## 
## data:  Rating_bing
## X-squared = 798.79, df = 8, p-value < 2.2e-16

We can use Spearman rank correlation to obtain the correlation between the two ordinal variables (Rating and Sentiment (bing.1)).

cor(as.numeric(cr$Rating), 
    as.numeric(cr$bing.1), 
    method = "spearman")
## [1] 0.3457032

Next, we can evaluate aggregate sentiment score information across some of the other variables in our data.

aggregate(bing ~ Rating, 
          data = cr,
          FUN = "mean")
##   Rating      bing
## 1      1 0.2732919
## 2      2 1.4967532
## 3      3 2.2094718
## 4      4 3.2899590
## 5      5 4.2883850
boxplot(bing ~ Rating, 
        data = cr,
        ylab = "Sentiment (bing)",
        col = cm.colors(n = 5))

We can perform an ANOVA test to test for statistically significant differences in mean sentiment across ratings and Tukey’s HSD to identify significant pair-wise differences.

an_bing_rat <- aov(bing ~ Rating, data = cr)
summary(an_bing_rat)
##               Df Sum Sq Mean Sq F value Pr(>F)    
## Rating         4   5431  1357.8   265.8 <2e-16 ***
## Residuals   4503  23007     5.1                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on the ANOVA summary output, at least one of the group means differs from the others. We use Tukey’s HSD to identify which pairs of group means differ.

TukeyHSD(an_bing_rat)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = bing ~ Rating, data = cr)
## 
## $Rating
##          diff       lwr      upr    p adj
## 2-1 1.2234613 0.6235883 1.823334 3.00e-07
## 3-1 1.9361798 1.3833503 2.489009 0.00e+00
## 4-1 3.0166671 2.4919761 3.541358 0.00e+00
## 5-1 4.0150931 3.5136432 4.516543 0.00e+00
## 3-2 0.7127185 0.2735916 1.151845 9.47e-05
## 4-2 1.7932058 1.3900773 2.196334 0.00e+00
## 5-2 2.7916318 2.4192558 3.164008 0.00e+00
## 4-3 1.0804872 0.7514197 1.409555 0.00e+00
## 5-3 2.0789133 1.7883332 2.369493 0.00e+00
## 5-4 0.9984260 0.7657957 1.231056 0.00e+00

Sentiment Analysis in Context

Sentiment across covariate data should be explored for patterns and visualized (where appropriate).

Recommended.IND

We can evaluate whether there is a statistically significant difference in average sentiment between those who recommend the product and those who do not.

boxplot(bing ~ Recommended.IND, 
        data = cr,
        ylab = "Sentiment (bing)",
        xlab = "Recommended",
        col = c("darkred", "darkgreen"))

t.test(bing ~ Recommended.IND, 
       data = cr)
## 
##  Welch Two Sample t-test
## 
## data:  bing by Recommended.IND
## t = -25.535, df = 1103.6, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.627629 -2.252635
## sample estimates:
## mean in group 0 mean in group 1 
##        1.473485        3.913617

As shown, we find a statistically significant difference in the average sentiment for those that do and do not recommend the product.

Age

We can create age groups automatically using the cut() function to evaluate whether sentiment differs across age groups.

AgeFac <- cut(cr$Age, 
              breaks = 7, 
              dig.lab=2)
barplot(table(cr$bing.1, AgeFac), 
        beside=TRUE,
        col = c("darkred", "darkgoldenrod", "darkgreen"),
        main = "Sentiment (bing) Across Age Group")
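For readers unfamiliar with cut(), the short sketch below shows what passing a single number as breaks does. The ages are invented for illustration.

```r
# Toy ages, invented for illustration
ages <- c(18, 25, 33, 41, 58, 67, 99)

# A single number for `breaks` asks for that many equal-width intervals
# spanning the observed range of the data
age_groups <- cut(ages, breaks = 7, dig.lab = 2)

nlevels(age_groups)  # number of age groups created
table(age_groups)    # how many ages fall into each group
```

Because the intervals have equal width rather than equal counts, groups at the extremes of the age range can be sparse or empty, which is worth checking before comparing sentiment across them.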

What are some variables in the J&J data that can provide important sentiment insights?

Sentiment Analysis Extensions

Uses for the output from sentiment analysis include: