Sentiment Analysis

About the Assignment

The assignment is to perform sentiment analysis by re-creating and analyzing the primary code from Chapter 2 of Text Mining with R by Julia Silge and David Robinson, applying it to a different corpus and an additional lexicon, and to make recommendations based on the findings.

Overview of Approach

I read the movie review data set, which has 5,000 observations and 2 variables, from https://raw.githubusercontent.com/nnaemeka-git/global-datasets/main/sentdata.csv to make the assignment reproducible. I created row numbers to track each row of the data set and tokenized the text into the one-word-per-row format. I then sampled some joy words in the movie reviews and analysed how the sentiment of the words varies across reviewers at different points in the data set. I also compared the results of the sentiment analysis across four different lexicons.
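The tokenization step described above can be sketched on a tiny made-up data set. The two reviews and the name `toy_reviews` are illustrative assumptions, not rows from the real file; the column names mirror the real data.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Two made-up reviews standing in for the real data set (assumption)
toy_reviews <- tibble(
  text  = c("This movie is truly awful", "The best movie I have ever seen!"),
  label = c(0, 1)
)

# Tag each review with its row number, then split the text so that
# every row holds exactly one lower-cased word
toy_tokens <- toy_reviews %>%
  mutate(linenumber = row_number()) %>%
  unnest_tokens(word, text)

toy_tokens
```

Because `linenumber` is created before tokenizing, every word keeps a pointer back to the review it came from.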

Read Movie reviews

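library(tidyverse)   # read_csv, dplyr verbs, pivot_wider, ggplot2
library(tidytext)    # unnest_tokens, get_sentiments, stop_words
library(kableExtra)  # kbl, kable_styling, kable_paper
url <- "https://raw.githubusercontent.com/nnaemeka-git/global-datasets/main/sentdata.csv"
text_df <- read_csv(url)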
## Rows: 5000 Columns: 2
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): text
## dbl (1): label
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(text_df)
## # A tibble: 6 x 2
##   text                                                                     label
##   <chr>                                                                    <dbl>
## 1 "It's been about 14 years since Sharon Stone awarded viewers a leg-cros~     0
## 2 "someone needed to make a car payment... this is truly awful... makes j~     0
## 3 "The Guidelines state that a comment must contain a minimum of four lin~     0
## 4 "This movie is a muddled mish-mash of clichés from recent cinema. There~     0
## 5 "Before Stan Laurel became the smaller half of the all-time greatest co~     0
## 6 "This is the best movie I've ever seen! <br /><br />Maybe it's because ~     1
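The sample rows contain HTML line breaks (`<br />`), which is why `br` turns up as a token and is filtered out by hand further down. One possible alternative, sketched here on a single made-up row, is to strip the tags with stringr before tokenizing. The regular expression and the name `clean_tokens` are assumptions for illustration, not part of the original analysis.

```r
library(dplyr)
library(stringr)
library(tibble)
library(tidytext)

# One made-up row modelled on the sample output above (assumption)
toy_reviews <- tibble(
  text = "This is the best movie I've ever seen! <br /><br />Maybe it's because"
)

# Replace <br>, <br/> and <br /> with a space, then tokenize
clean_tokens <- toy_reviews %>%
  mutate(text = str_replace_all(text, "<br\\s*/?>", " ")) %>%
  unnest_tokens(word, text)

# Check whether the stray "br" token survived the cleaning
"br" %in% clean_tokens$word
```

Cleaning the text once at the start avoids having to remember the `word != "br"` filter in every later step.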

Create row numbers to track rows and tokenize

text_df <- text_df %>%
  mutate(linenumber = row_number()) %>%
  unnest_tokens(word, text)
text_df
## # A tibble: 1,162,943 x 3
##    label linenumber word   
##    <dbl>      <int> <chr>  
##  1     0          1 it's   
##  2     0          1 been   
##  3     0          1 about  
##  4     0          1 14     
##  5     0          1 years  
##  6     0          1 since  
##  7     0          1 sharon 
##  8     0          1 stone  
##  9     0          1 awarded
## 10     0          1 viewers
## # ... with 1,162,933 more rows

Sample afinn lexicon

kbl(head(get_sentiments("afinn"), n = 20)) %>%
  kable_styling() %>% kable_paper("hover", full_width = FALSE)
word value
abandon -2
abandoned -2
abandons -2
abducted -2
abduction -2
abductions -2
abhor -3
abhorred -3
abhorrent -3
abhors -3
abilities 2
ability 2
aboard 1
absentee -1
absentees -1
absolve 2
absolved 2
absolves 2
absolving 2
absorbed 1

Sample bing lexicon

kbl(head(get_sentiments("bing"), n = 20)) %>%
  kable_styling() %>% kable_paper("hover", full_width = FALSE)
word sentiment
2-faces negative
abnormal negative
abolish negative
abominable negative
abominably negative
abominate negative
abomination negative
abort negative
aborted negative
aborts negative
abound positive
abounds positive
abrade negative
abrasive negative
abrupt negative
abruptly negative
abscond negative
absence negative
absent-minded negative
absentee negative

Sample nrc lexicon

kbl(head(get_sentiments("nrc"), n = 20)) %>%
  kable_styling() %>% kable_paper("hover", full_width = FALSE)
word sentiment
abacus trust
abandon fear
abandon negative
abandon sadness
abandoned anger
abandoned fear
abandoned negative
abandoned sadness
abandonment anger
abandonment fear
abandonment negative
abandonment sadness
abandonment surprise
abba positive
abbot trust
abduction fear
abduction negative
abduction sadness
abduction surprise
aberrant negative

Sample loughran lexicon

kbl(head(get_sentiments("loughran"), n = 20)) %>%
  kable_styling() %>% kable_paper("hover", full_width = FALSE)
word sentiment
abandon negative
abandoned negative
abandoning negative
abandonment negative
abandonments negative
abandons negative
abdicated negative
abdicates negative
abdicating negative
abdication negative
abdications negative
aberrant negative
aberration negative
aberrational negative
aberrations negative
abetting negative
abnormal negative
abnormalities negative
abnormality negative
abnormally negative

The most common joy words in movie reviews

nrc_joy <- get_sentiments("nrc") %>% 
  filter(sentiment == "joy")
dim(nrc_joy)
## [1] 687   2
text_joy <- text_df %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
# Subset of joy words

kbl(head(text_joy, n = 30)) %>%
  kable_styling() %>% kable_paper("hover", full_width = FALSE)
word n
good 2886
love 1206
pretty 663
music 585
kind 562
fun 525
found 519
money 482
true 455
excellent 454
special 442
beautiful 413
star 404
enjoy 357
wonderful 337
sex 335
mother 330
hope 309
laugh 303
finally 298
friend 293
perfect 287
favorite 249
entertaining 247
feeling 238
child 236
brilliant 225
god 221
daughter 220
art 218
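The join-then-count pattern behind this table can be sketched with a made-up token list and a three-word stand-in for the NRC joy subset. Both `toy_tokens` and `toy_joy` are hypothetical; the real lexicon has 687 joy words.

```r
library(dplyr)
library(tibble)

# Hypothetical tokens and a hypothetical three-word "joy" lexicon
toy_tokens <- tibble(word = c("good", "plot", "love", "good", "the", "fun"))
toy_joy    <- tibble(word = c("good", "love", "fun"), sentiment = "joy")

# inner_join keeps only tokens that appear in the lexicon;
# count then tallies how often each surviving word occurs
toy_joy_counts <- toy_tokens %>%
  inner_join(toy_joy, by = "word") %>%
  count(word, sort = TRUE)

toy_joy_counts
```

Words missing from the lexicon ("plot" and "the" here) simply drop out, so the counts reflect lexicon coverage as much as the reviews themselves.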

To estimate sentiment changes or differences

text_sentiment <- text_df %>%
  inner_join(get_sentiments("bing"))
## Joining, by = "word"
text_sentiment
## # A tibble: 96,706 x 4
##    label linenumber word       sentiment
##    <dbl>      <int> <chr>      <chr>    
##  1     0          1 awarded    positive 
##  2     0          1 twisted    negative 
##  3     0          1 smash      negative 
##  4     0          1 sexy       positive 
##  5     0          1 vulnerable negative 
##  6     0          1 fans       positive 
##  7     0          1 painful    negative 
##  8     0          1 mediocre   negative 
##  9     0          1 plot       negative 
## 10     0          1 breaks     negative 
## # ... with 96,696 more rows
# text_sentiment already carries the Bing labels, so no second join is needed
text_sentiment_wider <- text_sentiment %>%
  count(word, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% 
  mutate(sentiment = positive - negative)
text_sentiment_wider
## # A tibble: 43,554 x 5
##    word       index negative positive sentiment
##    <chr>      <dbl>    <int>    <int>     <int>
##  1 abnormal      35        1        0        -1
##  2 abolish        9        1        0        -1
##  3 abominable    15        1        0        -1
##  4 abominable    22        2        0        -2
##  5 abominable    29        2        0        -2
##  6 abominable    39        1        0        -1
##  7 abominable    40        1        0        -1
##  8 abominable    50        1        0        -1
##  9 abominably    10        1        0        -1
## 10 abominably    11        1        0        -1
## # ... with 43,544 more rows
ggplot(text_sentiment_wider, aes(index, sentiment)) +
  geom_col(show.legend = FALSE, fill = "#CF7F1A")
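The code above counts by word as well as by index, so each block of 80 reviews contributes many rows to the plot. If a single net score per block is wanted, one possible variant is to count by index and sentiment only. The sketch below uses made-up data and a block size of 2 instead of 80; `toy_sentiment` and `toy_net` are hypothetical names.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Made-up sentiment-tagged tokens from three reviews (assumption)
toy_sentiment <- tibble(
  linenumber = c(1, 1, 1, 2, 3, 3),
  sentiment  = c("positive", "positive", "negative",
                 "negative", "negative", "positive")
)

# Group reviews into blocks with integer division, count each
# sentiment per block, then take positive minus negative
toy_net <- toy_sentiment %>%
  count(index = linenumber %/% 2, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)

toy_net
```

Here block 0 (review 1) nets to 1 and block 1 (reviews 2 and 3) nets to -1, one row per block.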

Comparing the four sentiment dictionaries

afinn <- text_df %>% 
  inner_join(get_sentiments("afinn")) %>% 
  group_by(index = linenumber %/% 80) %>% 
  summarise(sentiment = sum(value)) %>% 
  mutate(method = "AFINN")
## Joining, by = "word"
bing_and_nrc <- bind_rows(
  text_df %>% 
    inner_join(get_sentiments("bing")) %>%
    mutate(method = "Bing et al."),
  text_df %>% 
    inner_join(get_sentiments("loughran")) %>%
    mutate(method = "loughran"),
  text_df %>% 
    inner_join(get_sentiments("nrc") %>% 
                 filter(sentiment %in% c("positive", 
                                         "negative"))
    ) %>%
    mutate(method = "NRC")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>% 
  mutate(sentiment = positive - negative)
## Joining, by = "word"
## Joining, by = "word"
## Joining, by = "word"
bind_rows(afinn, 
          bing_and_nrc) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")
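The comparison mixes two scoring styles: AFINN sums a numeric value per word, while the other lexicons count positive and negative labels. A minimal sketch of why the two can disagree, using made-up lexicons (`toy_scored` and `toy_binary` are hypothetical, and their values are chosen for illustration rather than taken from the real lexicons):

```r
library(dplyr)
library(tidyr)
library(tibble)

toy_tokens <- tibble(word = c("superb", "bad", "good", "awful"))

# Hypothetical numeric lexicon (AFINN style) and label lexicon (Bing style)
toy_scored <- tibble(word  = c("superb", "bad", "good", "awful"),
                     value = c(5, -3, 3, -3))
toy_binary <- tibble(word      = c("superb", "bad", "good", "awful"),
                     sentiment = c("positive", "negative",
                                   "positive", "negative"))

# Numeric style: add up the word values
scored_total <- toy_tokens %>%
  inner_join(toy_scored, by = "word") %>%
  summarise(sentiment = sum(value)) %>%
  pull(sentiment)

# Label style: count labels, then positive minus negative
binary_total <- toy_tokens %>%
  inner_join(toy_binary, by = "word") %>%
  count(sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(net = positive - negative) %>%
  pull(net)

c(scored = scored_total, binary = binary_total)
```

The numeric style gives 2 and the label style gives 0 for the same four words, because only the former rewards intensity. This is one reason the panels use `scales = "free_y"`: the absolute heights are not directly comparable across methods.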

Most common positive and negative words

bing_word_counts <- text_df %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
## Joining, by = "word"
bing_word_counts
## # A tibble: 4,325 x 3
##    word   sentiment     n
##    <chr>  <chr>     <int>
##  1 like   positive   3924
##  2 good   positive   2886
##  3 well   positive   2129
##  4 great  positive   1831
##  5 bad    negative   1799
##  6 best   positive   1291
##  7 love   positive   1206
##  8 plot   negative   1200
##  9 better positive   1117
## 10 funny  negative    989
## # ... with 4,315 more rows
bing_word_counts %>%
  filter(word != "br") %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>% 
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Contribution to sentiment",
       y = NULL)
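Some words that Bing labels negative, such as plot and funny, may be neutral or even positive in movie reviews. Whether to exclude them is a judgment call; if so, one possible approach is a small custom stop list applied with `anti_join`. The counts below are copied from the output above, while `custom_stop` and its contents are an assumption for illustration.

```r
library(dplyr)
library(tibble)

# A few rows copied from the bing_word_counts output above
toy_counts <- tibble(
  word      = c("like", "bad", "plot", "funny"),
  sentiment = c("positive", "negative", "negative", "negative"),
  n         = c(3924, 1799, 1200, 989)
)

# Hypothetical list of domain words to exclude (assumption)
custom_stop <- tibble(word = c("plot", "funny"))

# anti_join drops every row whose word appears in the stop list
filtered_counts <- toy_counts %>%
  anti_join(custom_stop, by = "word")

filtered_counts
```

Any such list should be reported alongside the results, since removing high-count words shifts the balance between positive and negative totals.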

Wordclouds

library(wordcloud)
## Warning: package 'wordcloud' was built under R version 4.1.1
## Loading required package: RColorBrewer
text_df %>%
  filter(word != "br") %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
## Joining, by = "word"

library(reshape2)  # provides acast()
text_df %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"),
                   max.words = 100)
## Joining, by = "word"

Conclusion

Sentiment analysis provides a way to understand the attitudes and opinions expressed in texts (Julia Silge and David Robinson, July 18, 2017). In this analysis, words such as films, story, movie, character and people are the most prevalent in the reviews. Under the Bing lexicon, the most common negative words are bad, plot, funny, hard, worst, death, stupid, awful and wrong, while the most common positive words are like, good, well, great, best, love, better, work, enough and pretty. The Bing et al. lexicon appeared to be the most useful for analysing movie reviews because it displayed both negative and positive sentiments, as indicated in the sentiment plot above.

References