Sentiment Analysis

About the Assignment

The assignment is to perform sentiment analysis by re-creating and analyzing the primary code from Chapter 2 of Text Mining with R by Julia Silge and David Robinson, applying it to a different corpus and an additional lexicon, and making recommendations based on the findings.

Overview of Approach

I read a movie review data set with 5,000 observations and 2 variables (text and label) from https://raw.githubusercontent.com/nnaemeka-git/global-datasets/main/sentdata.csv so that the assignment is reproducible. I added row numbers to track each review and tokenized the text into the one-word-per-row format. I then sampled the joy words in the movie reviews and analyzed how the sentiment of the words differs across reviews written by different reviewers. Finally, I compared the results of the sentiment analysis across four lexicons: AFINN, Bing, NRC, and Loughran.

Read movie reviews

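library(tidyverse)  # read_csv(), dplyr verbs, tidyr, ggplot2
library(tidytext)   # unnest_tokens(), get_sentiments(), stop_words
url <- "https://raw.githubusercontent.com/nnaemeka-git/global-datasets/main/sentdata.csv"
text_df <- read_csv(url)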

## Rows: 5000 Columns: 2

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): text
## dbl (1): label

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(text_df)

## # A tibble: 6 x 2
##   text                                                                     label
##   <chr>                                                                    <dbl>
## 1 "It's been about 14 years since Sharon Stone awarded viewers a leg-cros~     0
## 2 "someone needed to make a car payment... this is truly awful... makes j~     0
## 3 "The Guidelines state that a comment must contain a minimum of four lin~     0
## 4 "This movie is a muddled mish-mash of clichés from recent cinema. There~     0
## 5 "Before Stan Laurel became the smaller half of the all-time greatest co~     0
## 6 "This is the best movie I've ever seen! <br /><br />Maybe it's because ~     1
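The column specification message above can be avoided by declaring the column types explicitly. The following is a minimal sketch, assuming readr 2.0 or later; the inline CSV and its two rows are made up for illustration and stand in for the real URL.

```r
library(readr)

# Inline stand-in for the real CSV (assumption: two made-up rows, same columns).
csv_text <- I("text,label\n\"truly awful\",0\n\"the best movie\",1\n")

# Declaring the column types up front silences the specification message
# and makes the read fail loudly if the file layout ever changes.
reviews <- read_csv(
  csv_text,
  col_types = cols(text = col_character(), label = col_double())
)
reviews
```

With the real data, the same `col_types` argument could be passed alongside `url`.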

Create row numbers to track reviews and tokenize the text

text_df <- text_df %>%
  mutate(linenumber = row_number()) %>%
  unnest_tokens(word, text)
text_df

## # A tibble: 1,162,943 x 3
##    label linenumber word   
##    <dbl>      <int> <chr>  
##  1     0          1 it's   
##  2     0          1 been   
##  3     0          1 about  
##  4     0          1 14     
##  5     0          1 years  
##  6     0          1 since  
##  7     0          1 sharon 
##  8     0          1 stone  
##  9     0          1 awarded
## 10     0          1 viewers
## # ... with 1,162,933 more rows
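The raw reviews contain HTML line breaks such as `<br />`, which tokenize into the word "br" (later chunks filter it out by hand). One possible alternative, sketched here on made-up text, is to strip tags before tokenizing. The toy tibble and the regular expression are assumptions, not part of the original analysis.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Made-up reviews in the same shape as the data set (text, label).
toy <- tibble(
  text  = c("This is the best movie!<br /><br />Maybe it's the cast.",
            "Truly awful."),
  label = c(1, 0)
)

tokens <- toy %>%
  mutate(
    linenumber = row_number(),
    # Replace anything that looks like an HTML tag with a space.
    text = str_replace_all(text, "<[^>]+>", " ")
  ) %>%
  unnest_tokens(word, text)

tokens
```

Cleaning the text at this stage would make the later `filter(word != "br")` steps unnecessary.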

Sample afinn lexicon

library(kableExtra)  # kbl(), kable_styling(), kable_paper()
kbl(head(get_sentiments("afinn"), n = 20)) %>%
  kable_styling() %>%
  kable_paper("hover", full_width = FALSE)

word	value
abandon	-2
abandoned	-2
abandons	-2
abducted	-2
abduction	-2
abductions	-2
abhor	-3
abhorred	-3
abhorrent	-3
abhors	-3
abilities	2
ability	2
aboard	1
absentee	-1
absentees	-1
absolve	2
absolved	2
absolves	2
absolving	2
absorbed	1

Sample bing lexicon

kbl(head(get_sentiments("bing"), n = 20)) %>%
  kable_styling() %>%
  kable_paper("hover", full_width = FALSE)

word	sentiment
2-faces	negative
abnormal	negative
abolish	negative
abominable	negative
abominably	negative
abominate	negative
abomination	negative
abort	negative
aborted	negative
aborts	negative
abound	positive
abounds	positive
abrade	negative
abrasive	negative
abrupt	negative
abruptly	negative
abscond	negative
absence	negative
absent-minded	negative
absentee	negative

Sample nrc lexicon

kbl(head(get_sentiments("nrc"), n = 20)) %>%
  kable_styling() %>%
  kable_paper("hover", full_width = FALSE)

word	sentiment
abacus	trust
abandon	fear
abandon	negative
abandon	sadness
abandoned	anger
abandoned	fear
abandoned	negative
abandoned	sadness
abandonment	anger
abandonment	fear
abandonment	negative
abandonment	sadness
abandonment	surprise
abba	positive
abbot	trust
abduction	fear
abduction	negative
abduction	sadness
abduction	surprise
aberrant	negative

Sample loughran lexicon

kbl(head(get_sentiments("loughran"), n = 20)) %>%
  kable_styling() %>%
  kable_paper("hover", full_width = FALSE)

word	sentiment
abandon	negative
abandoned	negative
abandoning	negative
abandonment	negative
abandonments	negative
abandons	negative
abdicated	negative
abdicates	negative
abdicating	negative
abdication	negative
abdications	negative
aberrant	negative
aberration	negative
aberrational	negative
aberrations	negative
abetting	negative
abnormal	negative
abnormalities	negative
abnormality	negative
abnormally	negative

The most common joy words in the movie reviews

nrc_joy <- get_sentiments("nrc") %>% 
  filter(sentiment == "joy")
dim(nrc_joy)

## [1] 687   2

text_joy <- text_df %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)

## Joining, by = "word"

#Subset of joy words

kbl(head(text_joy, n = 30)) %>%
  kable_styling() %>%
  kable_paper("hover", full_width = FALSE)

word	n
good	2886
love	1206
pretty	663
music	585
kind	562
fun	525
found	519
money	482
true	455
excellent	454
special	442
beautiful	413
star	404
enjoy	357
wonderful	337
sex	335
mother	330
hope	309
laugh	303
finally	298
friend	293
perfect	287
favorite	249
entertaining	247
feeling	238
child	236
brilliant	225
god	221
daughter	220
art	218
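The counts above pool all reviews together. Since the data set also carries a label column, the same join could be split by label to see which joy words appear in which group. This is only a sketch on toy data: `toy_tokens` and `toy_joy` are hypothetical stand-ins for `text_df` and `nrc_joy`.

```r
library(dplyr)

# Toy tokens in the same shape as text_df (label, linenumber, word).
toy_tokens <- tibble(
  label      = c(1, 1, 1, 0, 0, 0),
  linenumber = c(1, 1, 2, 3, 3, 4),
  word       = c("good", "love", "good", "good", "money", "plot")
)

# Hypothetical stand-in for nrc_joy.
toy_joy <- tibble(word = c("good", "love", "money"), sentiment = "joy")

joy_by_label <- toy_tokens %>%
  inner_join(toy_joy, by = "word") %>%  # explicit key, so no join message
  count(label, word, sort = TRUE)

joy_by_label
```

Words that are absent from the lexicon, such as "plot" here, simply drop out of the inner join.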

Estimating sentiment changes across the reviews

text_sentiment <- text_df %>%
  inner_join(get_sentiments("bing"))

## Joining, by = "word"

text_sentiment

## # A tibble: 96,706 x 4
##    label linenumber word       sentiment
##    <dbl>      <int> <chr>      <chr>    
##  1     0          1 awarded    positive 
##  2     0          1 twisted    negative 
##  3     0          1 smash      negative 
##  4     0          1 sexy       positive 
##  5     0          1 vulnerable negative 
##  6     0          1 fans       positive 
##  7     0          1 painful    negative 
##  8     0          1 mediocre   negative 
##  9     0          1 plot       negative 
## 10     0          1 breaks     negative 
## # ... with 96,696 more rows
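The two row counts reported so far give a rough idea of how much of the text the Bing lexicon actually scores. The arithmetic below uses only those two numbers; it counts joined rows, which may slightly overstate matched tokens if a word carries more than one label in the lexicon.

```r
matched_rows <- 96706    # rows in text_sentiment after the Bing join
total_tokens <- 1162943  # rows in text_df after tokenization

coverage <- matched_rows / total_tokens
round(100 * coverage, 1)  # about 8.3 (percent)
```

In other words, roughly one token in twelve contributes to the Bing scores; the rest are ignored by the join.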

# text_sentiment already carries the Bing labels, so no second join is needed.
# Counting by word as well as index gives one row per word per block of 80 reviews.
text_sentiment_wider <- text_sentiment %>%
  count(word, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% 
  mutate(sentiment = positive - negative)

text_sentiment_wider

## # A tibble: 43,554 x 5
##    word       index negative positive sentiment
##    <chr>      <dbl>    <int>    <int>     <int>
##  1 abnormal      35        1        0        -1
##  2 abolish        9        1        0        -1
##  3 abominable    15        1        0        -1
##  4 abominable    22        2        0        -2
##  5 abominable    29        2        0        -2
##  6 abominable    39        1        0        -1
##  7 abominable    40        1        0        -1
##  8 abominable    50        1        0        -1
##  9 abominably    10        1        0        -1
## 10 abominably    11        1        0        -1
## # ... with 43,544 more rows
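The index column comes from integer division of the review number by 80. A short worked example, using a few review numbers chosen for illustration, shows how reviews map to index values:

```r
# Review numbers picked to show the block boundaries.
linenumber <- c(1, 79, 80, 159, 160, 5000)

index <- linenumber %/% 80
index
# 0 0 1 1 2 62
```

With 5,000 reviews the index therefore runs from 0 to 62. The first block holds only reviews 1 to 79, and the last block holds reviews 4960 to 5000.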

ggplot(text_sentiment_wider, aes(index, sentiment)) +
  geom_col(show.legend = FALSE, fill = "#CF7F1A")
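Because text_sentiment_wider has one row per word per index, geom_col() stacks many small bars at each index. A possible variant, sketched here on toy data, leaves word out of count() so that each index gets a single net score. The toy rows and `chunk_size` are assumptions chosen to keep the example small.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for text_sentiment (already joined to a sentiment lexicon).
toy_sentiment <- tibble(
  linenumber = c(1, 1, 2, 3, 3, 4),
  word       = c("good", "bad", "great", "awful", "worst", "love"),
  sentiment  = c("positive", "negative", "positive",
                 "negative", "negative", "positive")
)

chunk_size <- 2  # the analysis above uses 80

net_by_chunk <- toy_sentiment %>%
  count(index = linenumber %/% chunk_size, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)

net_by_chunk
```

Each row of `net_by_chunk` is then one bar in the plot, which makes the height of a bar directly readable as positive minus negative word counts.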

Comparing the four sentiment dictionaries

afinn <- text_df %>% 
  inner_join(get_sentiments("afinn")) %>% 
  group_by(index = linenumber %/% 80) %>% 
  summarise(sentiment = sum(value)) %>% 
  mutate(method = "AFINN")

## Joining, by = "word"

# Bing, Loughran, and NRC label words by category, so after keeping only the
# positive and negative labels they can be counted and scored the same way.
bing_and_nrc <- bind_rows(
  text_df %>% 
    inner_join(get_sentiments("bing")) %>%
    mutate(method = "Bing et al."),
  text_df %>% 
    inner_join(get_sentiments("loughran") %>% 
                 filter(sentiment %in% c("positive", 
                                         "negative"))
    ) %>%
    mutate(method = "Loughran"),
  text_df %>% 
    inner_join(get_sentiments("nrc") %>% 
                 filter(sentiment %in% c("positive", 
                                         "negative"))
    ) %>%
    mutate(method = "NRC")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>% 
  mutate(sentiment = positive - negative)

## Joining, by = "word"
## Joining, by = "word"
## Joining, by = "word"

bind_rows(afinn, 
          bing_and_nrc) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")
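The comparison mixes two scoring styles: AFINN sums numeric values, while the other lexicons count positive words minus negative words. The sketch below applies both styles to the same toy tokens. The mini-lexicons and their values are made up for illustration and are not the real lexicon entries.

```r
library(dplyr)

toy_tokens <- tibble(word = c("good", "awful", "love", "bad", "plot"))

# Hypothetical mini-lexicons (values and labels are illustrative only).
toy_afinn <- tibble(
  word  = c("good", "awful", "love", "bad"),
  value = c(3, -3, 3, -1)
)
toy_bing <- tibble(
  word      = c("good", "awful", "love", "bad", "plot"),
  sentiment = c("positive", "negative", "positive", "negative", "negative")
)

# AFINN style: add up the numeric scores.
afinn_score <- toy_tokens %>%
  inner_join(toy_afinn, by = "word") %>%
  summarise(sentiment = sum(value)) %>%
  pull(sentiment)

# Bing style: positive word count minus negative word count.
bing_score <- toy_tokens %>%
  inner_join(toy_bing, by = "word") %>%
  summarise(sentiment = sum(sentiment == "positive") -
                        sum(sentiment == "negative")) %>%
  pull(sentiment)

c(afinn = afinn_score, bing = bing_score)
```

Here the same five tokens score 2 under the toy AFINN values and -1 under the toy Bing labels, which is why the facets use free y scales and why absolute heights should not be compared across methods.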

Most common positive and negative words

bing_word_counts <- text_df %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

## Joining, by = "word"

bing_word_counts

## # A tibble: 4,325 x 3
##    word   sentiment     n
##    <chr>  <chr>     <int>
##  1 like   positive   3924
##  2 good   positive   2886
##  3 well   positive   2129
##  4 great  positive   1831
##  5 bad    negative   1799
##  6 best   positive   1291
##  7 love   positive   1206
##  8 plot   negative   1200
##  9 better positive   1117
## 10 funny  negative    989
## # ... with 4,315 more rows

bing_word_counts %>%
  filter(word != "br") %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>% 
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Contribution to sentiment",
       y = NULL)
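The counts above show "plot" and "funny" among the top negative words, which may not reflect how those words are used in movie reviews. One way to handle such cases, sketched here, is to drop a custom list of words before summarising. The `domain_words` list is a hypothetical choice; the counts are copied from the output above.

```r
library(dplyr)

# A few rows copied from bing_word_counts above.
toy_counts <- tibble(
  word      = c("good", "bad", "plot", "funny", "great"),
  sentiment = c("positive", "negative", "negative", "negative", "positive"),
  n         = c(2886, 1799, 1200, 989, 1831)
)

# Hypothetical list of words whose lexicon label seems wrong for this corpus.
domain_words <- tibble(word = c("plot", "funny"))

filtered <- toy_counts %>%
  anti_join(domain_words, by = "word")

filtered %>%
  group_by(sentiment) %>%
  summarise(total = sum(n))
```

Whether a word belongs on such a list is a judgment call that depends on the corpus, so the list should be kept short and documented.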

Wordclouds

library(wordcloud)

## Warning: package 'wordcloud' was built under R version 4.1.1

## Loading required package: RColorBrewer

text_df %>%
  filter(word != "br") %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))

## Joining, by = "word"

library(reshape2)  # acast()

text_df %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"),
                   max.words = 100)

## Joining, by = "word"
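comparison.cloud() expects a matrix with one row per word and one column per group, which is what the acast() step produces. A small sketch on made-up counts shows the shape of that matrix; the words and numbers are illustrative only.

```r
library(reshape2)

# Made-up word counts in the same shape as the count() output.
toy_counts <- data.frame(
  word      = c("good", "bad", "great", "awful"),
  sentiment = c("positive", "negative", "positive", "negative"),
  n         = c(5, 4, 3, 2)
)

# Rows become words, columns become sentiments, missing cells become 0.
m <- acast(toy_counts, word ~ sentiment, value.var = "n", fill = 0)
m
```

Each word ends up with a count in one column and 0 in the other, which is what lets the cloud place it on the negative or the positive side.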

Conclusion

Sentiment analysis provides a way to understand the attitudes and opinions expressed in texts (Silge and Robinson, 2017). In my analysis, words such as films, story, movie, character, and people are the most prevalent in the reviews. According to the Bing lexicon, bad, plot, funny, hard, worst, death, stupid, awful, and wrong are the most common negative words, while like, good, well, great, best, love, better, work, enough, and pretty are the most common positive words. Some of these labels should be read with caution: in movie reviews, plot is usually a neutral term and funny is often praise, yet Bing scores both as negative. The Bing et al. lexicon appeared to be the most useful for analyzing movie reviews because it displayed both negative and positive sentiment, as indicated in the comparison plot above.
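One way to put the usefulness of a lexicon on firmer ground would be to compare per-review net scores against the label column. The sketch below does this on toy numbers only. It assumes that label 1 marks a positive review and 0 a negative one, which the sample rows suggest but the data set does not state, and the `net` values are made up.

```r
library(dplyr)

# Toy per-review net scores (positive minus negative word counts).
toy_reviews <- tibble(
  linenumber = 1:4,
  label      = c(0, 0, 1, 1),
  net        = c(-3, 2, 4, 1)
)

agreement <- toy_reviews %>%
  mutate(predicted = if_else(net > 0, 1, 0)) %>%  # sign of the net score
  summarise(share = mean(predicted == label)) %>%
  pull(share)

agreement  # 0.75 for these toy numbers
```

Running the same comparison for each lexicon on the real reviews would give a single agreement figure per method to set beside the plots.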

References

Silge, Julia, and David Robinson (July 18, 2017). Text Mining with R. Chapter 2: Sentiment analysis with tidy data.
https://www.tidytextmining.com/sentiment.html