library(tidyverse)
library(tidytext)
library(janeaustenr)
library(dplyr)
library(stringr)
library(readr)
This is assignment 8 from week ten of the fall 2024 edition of DATA 607.The assignment is as stated below, lightly edited for length and clarity:
You should start by getting the primary example code from chapter 2 of Text Mining with R on sentiment anlysis working in an R Markdown document. You should provide a citation to this base code. You’re then asked to extend the code in two ways:
The primary code example comes from the tidy-text-mining Github repo (Robinson and Silge), which is the official repository for the above noted book. The relevant Rmd parts are as follows:
“As discussed above, there are a variety of methods and dictionaries that exist for evaluating the opinion or emotion in text. The tidytext package provides access to several sentiment lexicons. Three general-purpose lexicons are
AFINN
from Finn
Årup Nielsen,bing
from Bing Liu
and collaborators, andnrc
from Saif
Mohammad and Peter Turney.All three of these lexicons are based on unigrams, i.e., single
words. These lexicons contain many English words and the words are
assigned scores for positive/negative sentiment, and also possibly
emotions like joy, anger, sadness, and so forth. The nrc
lexicon categorizes words in a binary fashion (“yes”/“no”) into
categories of positive, negative, anger, anticipation, disgust, fear,
joy, sadness, surprise, and trust. The bing
lexicon
categorizes words in a binary fashion into positive and negative
categories. The AFINN
lexicon assigns words with a score
that runs between -5 and 5, with negative scores indicating negative
sentiment and positive scores indicating positive sentiment.
The function get_sentiments()
allows us to get specific
sentiment lexicons with the appropriate measures for each one.
library(tidytext)
library(tidyverse)
library(janeaustenr)
library(dplyr)
library(stringr)
library(ggplot2)
afinn <- get_sentiments("afinn")
bing <- get_sentiments("bing")
nrc <- get_sentiments("nrc")
# Prepare the text data in tidy format
tidy_books <- austen_books() %>%
group_by(book) %>%
mutate(
linenumber = row_number(),
chapter = cumsum(str_detect(text, regex("^chapter [\\\\divxlc]", ignore_case = TRUE)))
) %>%
ungroup() %>%
unnest_tokens(word, text)
# Join with NRC lexicon for sentiment "joy" in "Emma"
nrc_joy <- get_sentiments("nrc") %>%
filter(sentiment == "joy")
tidy_books %>%
filter(book == "Emma") %>%
inner_join(nrc_joy) %>%
count(word, sort = TRUE)
## # A tibble: 301 × 2
## word n
## <chr> <int>
## 1 good 359
## 2 friend 166
## 3 hope 143
## 4 happy 125
## 5 love 117
## 6 deal 92
## 7 found 92
## 8 present 89
## 9 kind 82
## 10 happiness 76
## # ℹ 291 more rows
nrc_joy <- nrc %>%
filter(sentiment == "joy")
tidy_books %>%
filter(book == "Emma") %>%
inner_join(nrc_joy) %>%
count(word, sort = TRUE)
## # A tibble: 301 × 2
## word n
## <chr> <int>
## 1 good 359
## 2 friend 166
## 3 hope 143
## 4 happy 125
## 5 love 117
## 6 deal 92
## 7 found 92
## 8 present 89
## 9 kind 82
## 10 happiness 76
## # ℹ 291 more rows
jane_austen_sentiment <- tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(book, index = linenumber %/% 80, sentiment) %>%
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
mutate(sentiment = positive - negative)
## Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 435434 of `x` matches multiple rows in `y`.
## ℹ Row 5051 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
geom_col(show.legend = FALSE) +
facet_wrap(~book, ncol = 2, scales = "free_x")
pride_prejudice <- tidy_books %>%
filter(book == "Pride & Prejudice")
afinn <- pride_prejudice %>%
inner_join(afinn) %>%
group_by(index = linenumber %/% 80) %>%
summarise(sentiment = sum(value)) %>%
mutate(method = "AFINN")
bing_and_nrc <- bind_rows(
pride_prejudice %>%
inner_join(get_sentiments("bing")) %>%
mutate(method = "Bing et al."),
pride_prejudice %>%
inner_join(nrc %>%
filter(sentiment %in% c("positive",
"negative"))
) %>%
mutate(method = "NRC")) %>%
count(method, index = linenumber %/% 80, sentiment) %>%
pivot_wider(names_from = sentiment,
values_from = n,
values_fill = 0) %>%
mutate(sentiment = positive - negative)
## Warning in inner_join(., nrc %>% filter(sentiment %in% c("positive", "negative"))): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 215 of `x` matches multiple rows in `y`.
## ℹ Row 5178 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
bind_rows(afinn,
bing_and_nrc) %>%
ggplot(aes(index, sentiment, fill = method)) +
geom_col(show.legend = FALSE) +
facet_wrap(~method, ncol = 1, scales = "free_y")
get_sentiments("nrc") %>%
filter(sentiment %in% c("positive", "negative")) %>%
count(sentiment)
## # A tibble: 2 × 2
## sentiment n
## <chr> <int>
## 1 negative 3316
## 2 positive 2308
nrc %>%
filter(sentiment %in% c("positive", "negative")) %>%
count(sentiment)
## # A tibble: 2 × 2
## sentiment n
## <chr> <int>
## 1 negative 3316
## 2 positive 2308
get_sentiments("bing") %>%
count(sentiment)
## # A tibble: 2 × 2
## sentiment n
## <chr> <int>
## 1 negative 4781
## 2 positive 2005
” ## Overview of chosen data and analysis
I’ve chosen the App Store Reviews for a Mobile App from Kaggle, which contains the app store reviews for a mobile app, broken out by platform, contry, and device. I exported the data and then uploaded it to my GCP instance for import below.
For my additional Lexicon, I have chosen the Loughran-McDonald Lexicon. According to the University of Notre Dame’s Software Repository for Accounting and Finance overview:
“The dictionary reports counts, proportion of total, average proportion per document, standard deviation of proportion per document, document count (i.e., number of documents containing at least one occurrence of the word), seven sentiment category identifiers, complexity, number of syllables, and source for each word (source is either 12of12inf or the year in which the word was added).
The sentiment categories are: negative, positive, uncertainty, litigious, strong modal, weak modal, and constraining.”(Loughran and McDonald)
apps_url = 'https://storage.googleapis.com/data_science_masters_files/2024_fall/data_607_data_management/week_ten/app_store_reviews.csv'
app_reviews_df <- read_delim(apps_url, delim = ";")
loughran_mc_lex <- get_sentiments("loughran")
head(app_reviews_df)
## # A tibble: 6 × 10
## date platform country review star user_id issue_flag likes_count
## <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <dbl>
## 1 7.07.2023 iOS Australia Love the i… 5 13c954… No 1
## 2 12.08.2023 Android India The premiu… 4 945725… No 2
## 3 12.09.2023 iOS UK I can't sh… 5 e3d956… No 5
## 4 12.07.2023 Android Brazil The price … 3 1fa559… No 0
## 5 24.09.2023 iOS India Smooth boo… 3 679346… No 2
## 6 20.09.2023 Android USA Premium ac… 2 ea1c97… No 4
## # ℹ 2 more variables: dislike_count <dbl>, label <lgl>
Here is ny implementation of the above code fronm the authors, with a component added that uses the Loughran method. This does the analysis by country, platform (operating system), and star reviews. Unsurprisingly, the higher the start reviews, the more positive the sentiment of the reviews. Not really any divergences by country or platform, either. It seems we are all, at the end of the day, remarkably similar in certain respects.
loughran_mc <- get_sentiments("loughran")
tidy_reviews_apps <- app_reviews_df %>%
mutate(row_id = row_number()) %>%
unnest_tokens(word, review)
nrc_joy_apps <- nrc %>%
filter(sentiment == "joy")
tidy_reviews_apps %>%
filter(country %in% c("USA", "India", "Brazil")) %>%
inner_join(nrc_joy_apps,relationship = 'many-to-many') %>%
count(word, sort = TRUE)
## # A tibble: 18 × 2
## word n
## <chr> <int>
## 1 love 16
## 2 helpful 6
## 3 good 5
## 4 happy 4
## 5 holiday 4
## 6 inspire 4
## 7 share 4
## 8 perfect 3
## 9 finally 2
## 10 fun 2
## 11 improvement 2
## 12 pay 2
## 13 companion 1
## 14 deal 1
## 15 enjoy 1
## 16 excellent 1
## 17 found 1
## 18 pretty 1
reviews_sentiment_apps <- tidy_reviews_apps %>%
filter(country %in% c("USA", "India", "Brazil"), platform %in% c("iOS", "Android")) %>%
inner_join(bing, relationship = 'many-to-many') %>%
count(country, platform, row_id, sentiment) %>%
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
mutate(sentiment_score = positive - negative)
ggplot(reviews_sentiment_apps, aes(row_id, sentiment_score, fill = country)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ platform + country, ncol = 2, scales = "free_x") +
labs(title = "Sentiment Analysis of App Reviews by Country and Platform")
afinn_apps <- tidy_reviews_apps %>%
inner_join(get_sentiments("afinn")) %>%
group_by(country, platform, index = row_id %/% 80) %>%
summarise(sentiment_score = sum(value)) %>%
mutate(method = "AFINN")
bing_and_nrc_apps <- bind_rows(
tidy_reviews_apps %>%
inner_join(bing, relationship = "many-to-many") %>%
mutate(method = "Bing et al."),
tidy_reviews_apps %>%
inner_join(nrc %>% filter(sentiment %in% c("positive", "negative"))) %>%
mutate(method = "NRC")
) %>%
count(method, country, platform, index = row_id %/% 80, sentiment) %>%
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
mutate(sentiment_score = positive - negative)
## Warning in inner_join(., nrc %>% filter(sentiment %in% c("positive", "negative"))): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 2714 of `x` matches multiple rows in `y`.
## ℹ Row 1992 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
loughran_sentiment_apps <- tidy_reviews_apps %>%
inner_join(loughran_mc %>% filter(sentiment %in% c("positive", "negative"))) %>%
count(country, platform, index = row_id %/% 80, sentiment) %>%
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
mutate(sentiment_score = positive - negative, method = "Loughran")
sentiment_combined_apps <- bind_rows(afinn_apps, bing_and_nrc_apps, loughran_sentiment_apps)
ggplot(sentiment_combined_apps, aes(index, sentiment_score, fill = method)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ method, ncol = 1, scales = "free_y") +
labs(title = "Sentiment Comparison across Methods")