About the Assignment
The assignment is to perform sentiment analysis by re-creating and analyzing the primary code from Chapter 2 of Text Mining with R by Julia Silge and David Robinson, applying it to a different corpus and an additional lexicon, and to make recommendations based on the findings.
Overview of Approach
I read a movie review data set with 5,000 observations and 2 variables from https://raw.githubusercontent.com/nnaemeka-git/global-datasets/main/sentdata.csv so that the assignment is reproducible. I created row numbers to track each review and tokenized the text into the one-word-per-row format. I then sampled the joy words in the movie reviews and analysed how the sentiment of the words differs across reviews written by different reviewers. Finally, I compared the results of the sentiment analysis across four different lexicons.
Read Movie reviews
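library(tidyverse)  # read_csv(), dplyr verbs, tidyr, ggplot2
library(tidytext)   # unnest_tokens(), get_sentiments()
url <- "https://raw.githubusercontent.com/nnaemeka-git/global-datasets/main/sentdata.csv"
text_df <- read_csv(url)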
## Rows: 5000 Columns: 2
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): text
## dbl (1): label
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 6 x 2
## text label
## <chr> <dbl>
## 1 "It's been about 14 years since Sharon Stone awarded viewers a leg-cros~ 0
## 2 "someone needed to make a car payment... this is truly awful... makes j~ 0
## 3 "The Guidelines state that a comment must contain a minimum of four lin~ 0
## 4 "This movie is a muddled mish-mash of clichés from recent cinema. There~ 0
## 5 "Before Stan Laurel became the smaller half of the all-time greatest co~ 0
## 6 "This is the best movie I've ever seen! <br /><br />Maybe it's because ~ 1
Create row numbers to track rows
text_df <- text_df %>%
  mutate(linenumber = row_number()) %>%  # one id per review, assigned before tokenizing
  unnest_tokens(word, text)              # one lower-cased word per row
text_df
## # A tibble: 1,162,943 x 3
## label linenumber word
## <dbl> <int> <chr>
## 1 0 1 it's
## 2 0 1 been
## 3 0 1 about
## 4 0 1 14
## 5 0 1 years
## 6 0 1 since
## 7 0 1 sharon
## 8 0 1 stone
## 9 0 1 awarded
## 10 0 1 viewers
## # ... with 1,162,933 more rows
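The tokenized reviews still contain `br` tokens left over from the `<br />` HTML tags visible in review 6 above, which is why later steps filter on `word != "br"`. As a minimal sketch, assuming one prefers to strip the tags before tokenizing, the toy example below uses `str_replace_all()` from stringr. The `reviews` tibble and the regular expression are illustrative assumptions and are not part of the original analysis.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Hypothetical one-row data set, modelled on the visible start of review 6
reviews <- tibble(
  label = 1,
  text  = "This is the best movie I've ever seen! <br /><br />Maybe it's because"
)

tokens <- reviews %>%
  mutate(linenumber = row_number(),
         # Replace <br>, <br/> and <br /> with a space before tokenizing
         text = str_replace_all(text, "<br\\s*/?>", " ")) %>%
  unnest_tokens(word, text)

tokens
```

With the tags removed up front, no `br` token reaches the word counts or the word clouds, so the later `filter(word != "br")` calls would no longer be needed.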
Sample afinn lexicon
library(kableExtra)  # kbl(), kable_styling(), kable_paper()
kbl(head(get_sentiments("afinn"), n = 20)) %>%
  kable_styling() %>% kable_paper("hover", full_width = F)
| word | value |
| --- | --- |
| abandon | -2 |
| abandoned | -2 |
| abandons | -2 |
| abducted | -2 |
| abduction | -2 |
| abductions | -2 |
| abhor | -3 |
| abhorred | -3 |
| abhorrent | -3 |
| abhors | -3 |
| abilities | 2 |
| ability | 2 |
| aboard | 1 |
| absentee | -1 |
| absentees | -1 |
| absolve | 2 |
| absolved | 2 |
| absolves | 2 |
| absolving | 2 |
| absorbed | 1 |
Sample bing lexicon
kbl(head(get_sentiments("bing"), n = 20)) %>%
  kable_styling() %>% kable_paper("hover", full_width = F)
| word | sentiment |
| --- | --- |
| 2-faces | negative |
| abnormal | negative |
| abolish | negative |
| abominable | negative |
| abominably | negative |
| abominate | negative |
| abomination | negative |
| abort | negative |
| aborted | negative |
| aborts | negative |
| abound | positive |
| abounds | positive |
| abrade | negative |
| abrasive | negative |
| abrupt | negative |
| abruptly | negative |
| abscond | negative |
| absence | negative |
| absent-minded | negative |
| absentee | negative |
Sample nrc lexicon
kbl(head(get_sentiments("nrc"), n = 20)) %>%
  kable_styling() %>% kable_paper("hover", full_width = F)
| word | sentiment |
| --- | --- |
| abacus | trust |
| abandon | fear |
| abandon | negative |
| abandon | sadness |
| abandoned | anger |
| abandoned | fear |
| abandoned | negative |
| abandoned | sadness |
| abandonment | anger |
| abandonment | fear |
| abandonment | negative |
| abandonment | sadness |
| abandonment | surprise |
| abba | positive |
| abbot | trust |
| abduction | fear |
| abduction | negative |
| abduction | sadness |
| abduction | surprise |
| aberrant | negative |
Sample loughran lexicon
kbl(head(get_sentiments("loughran"), n = 20)) %>%
  kable_styling() %>% kable_paper("hover", full_width = F)
| word | sentiment |
| --- | --- |
| abandon | negative |
| abandoned | negative |
| abandoning | negative |
| abandonment | negative |
| abandonments | negative |
| abandons | negative |
| abdicated | negative |
| abdicates | negative |
| abdicating | negative |
| abdication | negative |
| abdications | negative |
| aberrant | negative |
| aberration | negative |
| aberrational | negative |
| aberrations | negative |
| abetting | negative |
| abnormal | negative |
| abnormalities | negative |
| abnormality | negative |
| abnormally | negative |
The most common joy words in movie reviews
nrc_joy <- get_sentiments("nrc") %>%
  filter(sentiment == "joy")
dim(nrc_joy)
## [1] 687 2
text_joy <- text_df %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
# The 30 most frequent joy words
kbl(head(text_joy, n = 30)) %>%
  kable_styling() %>% kable_paper("hover", full_width = F)
| word | n |
| --- | --- |
| good | 2886 |
| love | 1206 |
| pretty | 663 |
| music | 585 |
| kind | 562 |
| fun | 525 |
| found | 519 |
| money | 482 |
| true | 455 |
| excellent | 454 |
| special | 442 |
| beautiful | 413 |
| star | 404 |
| enjoy | 357 |
| wonderful | 337 |
| sex | 335 |
| mother | 330 |
| hope | 309 |
| laugh | 303 |
| finally | 298 |
| friend | 293 |
| perfect | 287 |
| favorite | 249 |
| entertaining | 247 |
| feeling | 238 |
| child | 236 |
| brilliant | 225 |
| god | 221 |
| daughter | 220 |
| art | 218 |
Estimating sentiment changes across reviews
text_sentiment <- text_df %>%
  inner_join(get_sentiments("bing"))
## Joining, by = "word"
## # A tibble: 96,706 x 4
## label linenumber word sentiment
## <dbl> <int> <chr> <chr>
## 1 0 1 awarded positive
## 2 0 1 twisted negative
## 3 0 1 smash negative
## 4 0 1 sexy positive
## 5 0 1 vulnerable negative
## 6 0 1 fans positive
## 7 0 1 painful negative
## 8 0 1 mediocre negative
## 9 0 1 plot negative
## 10 0 1 breaks negative
## # ... with 96,696 more rows
# text_sentiment already carries the Bing labels, so no second join is needed
text_sentiment_wider <- text_sentiment %>%
  count(word, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)
## # A tibble: 43,554 x 5
## word index negative positive sentiment
## <chr> <dbl> <int> <int> <int>
## 1 abnormal 35 1 0 -1
## 2 abolish 9 1 0 -1
## 3 abominable 15 1 0 -1
## 4 abominable 22 2 0 -2
## 5 abominable 29 2 0 -2
## 6 abominable 39 1 0 -1
## 7 abominable 40 1 0 -1
## 8 abominable 50 1 0 -1
## 9 abominably 10 1 0 -1
## 10 abominably 11 1 0 -1
## # ... with 43,544 more rows
ggplot(text_sentiment_wider, aes(index, sentiment)) +
  geom_col(show.legend = FALSE, fill = "#CF7F1A")
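Because `linenumber` was assigned before tokenizing, it identifies a review rather than a line of text, so `linenumber %/% 80` places every 80 consecutive reviews in the same bin. The integer division can be illustrated in base R as follows (the sample review numbers are arbitrary and chosen only to show the bin edges):

```r
# Arbitrary sample review numbers, chosen to show where bins change
linenumber <- c(1, 79, 80, 159, 160, 5000)

# Integer division by 80 maps each review number to a bin index
index <- linenumber %/% 80
index
# 0 0 1 1 2 62
```

With 5,000 reviews the bins run from 0 to 62. Bin 0 holds reviews 1 to 79 and bin 62 holds reviews 4960 to 5000, so the first and last bins cover fewer reviews than the others.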

Comparing the four sentiment dictionaries
afinn <- text_df %>%
  inner_join(get_sentiments("afinn")) %>%
  group_by(index = linenumber %/% 80) %>%
  summarise(sentiment = sum(value)) %>%
  mutate(method = "AFINN")
## Joining, by = "word"
# Despite its name, bing_and_nrc also holds the Loughran results
bing_and_nrc <- bind_rows(
  text_df %>%
    inner_join(get_sentiments("bing")) %>%
    mutate(method = "Bing et al."),
  text_df %>%
    inner_join(get_sentiments("loughran")) %>%
    mutate(method = "loughran"),
  text_df %>%
    inner_join(get_sentiments("nrc") %>%
                 filter(sentiment %in% c("positive", "negative"))) %>%
    mutate(method = "NRC")
) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
## Joining, by = "word"
## Joining, by = "word"
bind_rows(afinn, bing_and_nrc) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")
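The two scoring schemes used above differ in kind: AFINN sums a numeric value per word, while the categorical lexicons count positive words and subtract the count of negative words. The sketch below shows both calculations on toy data. The `tokens` tibble and the two mini-lexicons (`toy_afinn`, `toy_bing`) are hypothetical stand-ins built by hand so the example runs without downloading any lexicon; their values are assumptions for illustration only.

```r
library(dplyr)
library(tidyr)

# Hypothetical tokens from two reviews, already one word per row
tokens <- tibble(
  linenumber = c(1, 1, 1, 2, 2),
  word       = c("good", "awful", "plot", "love", "great")
)

# Hypothetical mini-lexicons shaped like AFINN (numeric) and Bing (categorical)
toy_afinn <- tibble(word  = c("good", "awful", "love", "great"),
                    value = c(3, -3, 3, 3))
toy_bing  <- tibble(word      = c("good", "awful", "plot", "love", "great"),
                    sentiment = c("positive", "negative", "negative",
                                  "positive", "positive"))

# AFINN-style score: sum of the word values per review
afinn_score <- tokens %>%
  inner_join(toy_afinn, by = "word") %>%
  group_by(linenumber) %>%
  summarise(sentiment = sum(value))

# Bing-style score: number of positive words minus number of negative words
bing_score <- tokens %>%
  inner_join(toy_bing, by = "word") %>%
  count(linenumber, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)

afinn_score
bing_score
```

In this toy case the first review scores 0 under the AFINN-style sum but -1 under the Bing-style count, because "plot" appears only in the categorical lexicon. Differences of this kind are one reason the facets above use `scales = "free_y"` instead of a shared axis.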

Most common positive and negative words
bing_word_counts <- text_df %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
## Joining, by = "word"
## # A tibble: 4,325 x 3
## word sentiment n
## <chr> <chr> <int>
## 1 like positive 3924
## 2 good positive 2886
## 3 well positive 2129
## 4 great positive 1831
## 5 bad negative 1799
## 6 best positive 1291
## 7 love positive 1206
## 8 plot negative 1200
## 9 better positive 1117
## 10 funny negative 989
## # ... with 4,315 more rows
bing_word_counts %>%
  filter(word != "br") %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Contribution to sentiment",
       y = NULL)
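The counts above show that Bing tags "plot" and "funny" as negative, although in movie reviews these words may well be neutral or positive. As a minimal sketch, assuming one wanted to exclude such domain-specific words before plotting, an `anti_join()` against a custom word list would do it. The `toy_counts` tibble reuses five rows from the output above, and the choice of words in `custom_neutral` is a hypothetical judgment call, not something the original analysis does.

```r
library(dplyr)

# Five rows copied from the bing_word_counts output above
toy_counts <- tibble(
  word      = c("like", "good", "bad", "plot", "funny"),
  sentiment = c("positive", "positive", "negative", "negative", "negative"),
  n         = c(3924, 2886, 1799, 1200, 989)
)

# Hypothetical list of words to treat as neutral in movie reviews
custom_neutral <- tibble(word = c("plot", "funny"))

# Drop every row whose word appears in the custom list
filtered_counts <- toy_counts %>%
  anti_join(custom_neutral, by = "word")

filtered_counts
```

Whether a given word belongs on such a list depends on the corpus, so any exclusion is best checked against a sample of the reviews that use it.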

Wordclouds
library(wordcloud)
## Warning: package 'wordcloud' was built under R version 4.1.1
## Loading required package: RColorBrewer
text_df %>%
  filter(word != "br") %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
## Joining, by = "word"

library(reshape2)  # acast()
text_df %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"),
                   max.words = 100)
## Joining, by = "word"

Conclusion
Sentiment analysis provides a way to understand the attitudes and opinions expressed in texts (Julia Silge and David Robinson, July 18, 2017). In my analysis, words such as films, story, movie, character and people are the most prevalent in the reviews. Bad, plot, funny, hard, worst, death, stupid, awful and wrong are the most common negative words, while like, good, well, great, best, love, better, work, enough and pretty are the most common positive words. Of the four lexicons, Bing et al. appeared to be the most useful for analysing movie reviews because it displayed both negative and positive sentiment, as the comparison plot above indicates.