Week 10 Homework
Overview
In Text Mining with R, Chapter 2 looks at Sentiment Analysis. In this assignment, you should start by getting the primary example code from chapter 2 working in an R Markdown document. You should provide a citation to this base code. You’re then asked to extend the code in two ways:
Work with a different corpus of your choosing, and incorporate at least one additional sentiment lexicon (possibly from another R package that you’ve found through research). As usual, please submit links to both an .Rmd file posted in your GitHub repository and to your code on rpubs.com. You may work on a small team on this assignment.
Setup
Load Libraries
The example code referenced below requires a number of libraries, which are preemptively loaded here.
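If any of these packages are missing, they can be installed first; a minimal sketch (package names taken from the library() calls below):
#one-time setup: install any required packages that are not yet present
pkgs <- c("tidytext", "janeaustenr", "dplyr", "stringr", "wordcloud",
          "reshape2", "ggplot2", "tidyr", "sentimentr", "magrittr")
install.packages(setdiff(pkgs, rownames(installed.packages())))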
library(tidytext)
library(janeaustenr)
library(dplyr)
library(stringr)
library(wordcloud)
library(reshape2)
library(ggplot2)
library(tidyr)
library(sentimentr)
library(magrittr)
Assignment Solution
The corpus I chose to work with is the Trip Advisor Reviews Data obtained from Kaggle1.
The data consists of the text from Trip Advisor reviews and the associated numerical rating that the user gave. I am going to attempt to compare the sentiment of the words in the reviews with the ratings of the user to see if they match.
First, load the data.
dataUrl <- 'https://raw.githubusercontent.com/nolivercuny/data607/master/homework7/tripadvisor_hotel_reviews.csv'
reviews <- read.csv(dataUrl)
glimpse(reviews)
## Rows: 20,491
## Columns: 2
## $ Review <chr> "nice hotel expensive parking got good deal stay hotel annivers…
## $ Rating <int> 4, 2, 3, 5, 5, 5, 5, 4, 5, 5, 2, 4, 4, 3, 4, 1, 2, 5, 5, 3, 5, …
Below is a box plot showing the distribution of review word counts (after stop-word removal) for each rating.
I noticed that there are a large number of extreme outliers. To pare down the data set, I will only analyze reviews shorter than 500 characters.
reviewsCounts <- reviews %>%
mutate(row = row_number()) %>%
unnest_tokens(word, Review) %>%
anti_join(stop_words) %>%
group_by(row, Rating) %>%
tally(sort=TRUE)
## Joining, by = "word"
reviewsCounts %>%
ggplot(aes(y=n, x=factor(Rating), fill=Rating)) +
geom_boxplot()
Filter the reviews to remove the extreme outliers, then tokenize them into a long dataset where each row is a single word from a review together with its rating.
reviewsTokenized <- reviews %>%
filter(nchar(Review) < 500) %>%
mutate(row = row_number()) %>%
unnest_tokens(word, Review)
Categorize and display the words from the reviews in a positive/negative word cloud using the bing lexicon.
reviewsTokenized %>%
anti_join(stop_words) %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("red", "green"),
max.words = 100)
## Joining, by = "word"
## Joining, by = "word"
I decided to count the number of positive and negative words in each rating bracket. My assumption was that lower-rated reviews would contain more negative sentiment words than positive ones, and that as the ratings got higher the balance would flip toward more positive than negative sentiment words.
Here I break the dataset down to generate negative and positive sentiment word counts grouped by rating, then generate overall word counts for each rating.
Finally, I plot a bar chart by rating with the negative and positive sentiment proportions side-by-side.
As you can see, my assumption was validated by this analysis.
#get negative sentiment words from bing
bingnegative <- get_sentiments("bing") %>%
filter(sentiment == "negative")
#get positive sentiment words from bing
bingpositive <- get_sentiments("bing") %>%
filter(sentiment == "positive")
#count all negative words across every review grouped by rating
negativeReviews <- reviewsTokenized %>%
anti_join(stop_words) %>%
semi_join(bingnegative) %>%
group_by(Rating) %>%
summarize(negativewords = n()) %>%
ungroup()
## Joining, by = "word"
## Joining, by = "word"
#count all positive words across every review grouped by rating
positiveReviews <- reviewsTokenized %>%
anti_join(stop_words) %>%
semi_join(bingpositive) %>%
group_by(Rating) %>%
summarize(positivewords = n()) %>%
ungroup()
## Joining, by = "word"
## Joining, by = "word"
#count all words across every review grouped by rating
reviewsWordCount <- reviewsTokenized %>%
anti_join(stop_words) %>%
group_by(Rating) %>%
summarize(count = n()) %>%
ungroup()
## Joining, by = "word"
#join all three dataframes together
#compute the proportion of negative and positive words per rating
reviewsJoined <- negativeReviews %>%
inner_join(positiveReviews) %>%
inner_join(reviewsWordCount) %>%
mutate(positivepercent = positivewords / count) %>%
mutate(negativepercent = negativewords / count)
## Joining, by = "Rating"
## Joining, by = "Rating"
#plot the data
reviewsJoined %>%
select(Rating, positivepercent, negativepercent) %>%
pivot_longer(!Rating) %>%
ggplot(aes(fill=name, y=value, x=Rating)) +
geom_bar(position="dodge", stat="identity")
For my secondary lexicon I decided to use the sentimentr library. The author of the library implemented a unique algorithm that accounts for “valence shifters” in the analysis. Valence shifters are things like negators, which prevent labeling a sentence like “I was not happy with the service” as positive just because the word “happy” appears in it. The library also performs its analysis on sentences rather than on individual words, as I did in my analysis above.
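To illustrate valence shifting, here is a minimal one-off check of my own (not part of the assignment analysis). A bag-of-words lexicon would see only the positive word “happy”, but sentimentr scores the negated sentence as negative:
#sentiment() scores each sentence; the negator "not" flips the polarity of "happy"
sentiment("I was not happy with the service.")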
Here I use the get_sentences() function from the sentimentr library to tokenize the reviews into sentences, then use the sentiment_by() function to get sentiment per rating, and finally plot the result using the library’s built-in plot method.
This library produces very different results from my analysis with the bing lexicon. The only rating that shows majority-negative sentiment is the 1-star rating; every other rating comes out positive.
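For a rough apples-to-apples comparison, the bing analysis above can be collapsed to a single net score per rating; a small extra step of my own using the reviewsJoined dataframe built earlier:
#net bing sentiment per rating: positive share minus negative share
reviewsJoined %>%
  mutate(net = positivepercent - negativepercent) %>%
  select(Rating, net)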
reviewsSentAgg <- reviews %>%
filter(nchar(Review) < 500) %>%
get_sentences() %$%
sentiment_by(Review, list(Rating))
reviewsSentAgg
##    Rating word_count        sd ave_sentiment
## 1:      1      28444 0.4004791   -0.12687296
## 2:      2      29342 0.4001992    0.08851533
## 3:      3      41846 0.4396274    0.33837032
## 4:      4     125388 0.4799364    0.59373899
## 5:      5     202476 0.5278442    0.73934188
plot(reviewsSentAgg)
Finally, the sentimentr library offers a feature that lets you visually see which parts of the text were labeled as positive versus negative.
reviews %>%
mutate(row = row_number()) %>%
filter(row %in% sample(unique(row), 5)) %>%
mutate(review_sentence = get_sentences(Review)) %$%
sentiment_by(review_sentence, row) %>%
highlight(open=FALSE, file="file.html")
## Saved in file.html
3006: -.844
loved punta cana hated hotel place not care quality going cheap, love punta cana plan year not stay, stay ritz cap cana, way spread, read reviews thought bad really oh, food terrible, fine hotel decide quality not quanity,
5631: +1.096
nice location, stayed calderon 2 nights weeks ago happy hotel, room 10th floor excellent view barcelona, room decorated nicely dark wood mirrors, desk pleasant helpful lobby pretty modern, pool floor excellent views barcelona did not swim, location great 7 minute walk las ramblas right placa cataluna restaurants lots action, spain 2 years previously friends staying loved took advice tried new happy,
11863: +.754
fine right price vibe located easy walking distance things cbd short stroll central station, clean comfortable public areas quite funky n't expecting 4.5 star luxury, aircon little hard right desk service hit miss did secret hotel 126/night good value, disappointed standard paid rack rate 330/night does actually pay,
14685: +.475
good hotel, stayed 3 nights january, room warm windows opened ventilation, small kitchen great boiling water tea storing beer things fridge gas stove oven immaculate, fancied cook, room big decor good bit dodgy places not spoiled, bathroom nice water pressure shower bit toooo, mini bar big tv loads amenities business people cd player built alarm clock fax printer internet access, stayed 32nd floor reasonable view just faciniating watching new york hotel location n't bad prepared walking did n't bother think best way explore healthy, blockheads mexican restaurant block right hotel great place youngsters llike drunk eat nice food, come staff friendly helpful did n't bother, barking dog good food drinks smokers like n't bad going outside sneaky cigarette temperature, great, 6 stars,
18654: +.736
pretty good price husband stayed majestic 7 days end june, happy stay, resort kept clean manicured, room beautiful, bathroom weird adjusted reviews correct beds hard rock pillows sad, requests brought pillows 8, complaint, got room tv n't hooked thought oh no service man 10 min, fixed right, room beautiful no mold smell whatsoever. the food ok. plenty eat left restaurant hungry, best food earth no fine, breakfast buffet best far, steakhouse good french restaurant good small portions, end week decided try ordering extra menu items seeing work, staff brought ordered no complaints, didn’t expecting gourmet food especially not steal price got wasn’t gourmet, things fresh presented good variety, husband nor stomach problems all. the language barrier problem not unmanageable, staff ok not overly friendly can’t say blame, best spa, best massages lives, absolutely fabulous, pool beach beautiful, time day plenty lounge chairs, pool beach water soooooo warm, great overall great time majestic anytime wanted good cheap vacation,
Example Code
Example code was taken from chapter 2 of the book Text mining with R: A tidy approach2.
I have chosen to exclude the get_sentiments code as it is trivial and self-explanatory.
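For reference, the excluded book code simply fetches each lexicon as a tidy dataframe; a minimal sketch (note that the afinn and nrc lexicons may prompt for a one-time download via the textdata package):
#each call returns a tidy dataframe of words and their sentiment labels
get_sentiments("afinn")
get_sentiments("bing")
get_sentiments("nrc")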
This code loads the works of Jane Austen and tokenizes them into a long dataframe with one word per row.
tidy_books <- austen_books() %>%
group_by(book) %>%
mutate(
linenumber = row_number(),
chapter = cumsum(str_detect(text,
regex("^chapter [\\divxlc]",
ignore_case = TRUE)))) %>%
ungroup() %>%
unnest_tokens(word, text)
This code loads the nrc sentiment dataset from Saif Mohammad and Peter Turney3. It filters the sentiment dataset to only joy words, filters the Jane Austen dataframe to only the text of the book Emma, then joins the two data sets and counts the occurrence of each word in the result of the join.
nrc_joy <- get_sentiments("nrc") %>%
filter(sentiment == "joy")
tidy_books %>%
filter(book == "Emma") %>%
inner_join(nrc_joy) %>%
count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 301 × 2
## word n
## <chr> <int>
## 1 good 359
## 2 friend 166
## 3 hope 143
## 4 happy 125
## 5 love 117
## 6 deal 92
## 7 found 92
## 8 present 89
## 9 kind 82
## 10 happiness 76
## # … with 291 more rows
Here they load a new sentiment dataset, chunk each book into 80-line segments, compute the sentiment of each chunk, and then plot the result. This gives a graphical picture of how the sentiment changes with the narrative of each book.
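As an aside of my own before the book’s code: the 80-line chunking relies on R’s integer division operator, %/%. A quick worked example of the indexing it produces:
#lines 0-79 fall into chunk 0, lines 80-159 into chunk 1, and so on
c(0, 79, 80, 159, 160) %/% 80
## [1] 0 0 1 1 2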
jane_austen_sentiment <- tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(book, index = linenumber %/% 80, sentiment) %>%
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
mutate(sentiment = positive - negative)
## Joining, by = "word"
ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
geom_col(show.legend = FALSE) +
facet_wrap(~book, ncol = 2, scales = "free_x")
This section is about comparing how the different sentiment datasets map to the text of a single book. They do the same plotting of the data as in the previous code snippet, but using a single book and multiple sentiment data sets.
pride_prejudice <- tidy_books %>%
filter(book == "Pride & Prejudice")
#pride_prejudice - I have intentionally excluded this to keep the assignment cleaner
afinn <- pride_prejudice %>%
inner_join(get_sentiments("afinn")) %>%
group_by(index = linenumber %/% 80) %>%
summarise(sentiment = sum(value)) %>%
mutate(method = "AFINN")
## Joining, by = "word"
bing_and_nrc <- bind_rows(
pride_prejudice %>%
inner_join(get_sentiments("bing")) %>%
mutate(method = "Bing et al."),
pride_prejudice %>%
inner_join(get_sentiments("nrc") %>%
filter(sentiment %in% c("positive",
"negative"))
) %>%
mutate(method = "NRC")) %>%
count(method, index = linenumber %/% 80, sentiment) %>%
pivot_wider(names_from = sentiment,
values_from = n,
values_fill = 0) %>%
mutate(sentiment = positive - negative)
## Joining, by = "word"
## Joining, by = "word"
bind_rows(afinn,
bing_and_nrc) %>%
ggplot(aes(index, sentiment, fill = method)) +
geom_col(show.legend = FALSE) +
facet_wrap(~method, ncol = 1, scales = "free_y")
Here the authors continue their analysis by counting which positive and negative words contributed the most to the sentiment scores.
bing_word_counts <- tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
## Joining, by = "word"
# bing_word_counts - I have intentionally modified this from its original source
bing_word_counts %>%
group_by(sentiment) %>%
slice_max(n, n = 10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(n, word, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(x = "Contribution to sentiment",
y = NULL)
As a follow-up to the previous section, the authors demonstrate adding custom “stop words” that are specific to the given analysis, pointing out that miss was one of the most common negative sentiment words, but in the context of a Jane Austen novel, miss is frequently used as a way of identifying a female character.
custom_stop_words <- bind_rows(tibble(word = c("miss"),
lexicon = c("custom")),
stop_words)
custom_stop_words
## # A tibble: 1,150 × 2
## word lexicon
## <chr> <chr>
## 1 miss custom
## 2 a SMART
## 3 a's SMART
## 4 able SMART
## 5 about SMART
## 6 above SMART
## 7 according SMART
## 8 accordingly SMART
## 9 across SMART
## 10 actually SMART
## # … with 1,140 more rows
Here the authors demonstrate a word cloud as another way of visualizing the data.
tidy_books %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 100))
## Joining, by = "word"
## Warning in wordcloud(word, n, max.words = 100): miss could not be fit on page.
## It will not be plotted.
tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("gray20", "gray80"),
max.words = 100)## Joining, by = "word"
The final section discusses using whole sentences, rather than individual words, to perform the sentiment analysis.
#demonstrate how to tokenize by sentence
p_and_p_sentences <- tibble(text = prideprejudice) %>%
unnest_tokens(sentence, text, token = "sentences")
p_and_p_sentences$sentence[2]
## [1] "by jane austen"
# counting the chapters of each book
austen_chapters <- austen_books() %>%
group_by(book) %>%
unnest_tokens(chapter, text, token = "regex",
pattern = "Chapter|CHAPTER [\\dIVXLC]") %>%
ungroup()
austen_chapters %>%
group_by(book) %>%
summarise(chapters = n())
## # A tibble: 6 × 2
## book chapters
## <fct> <int>
## 1 Sense & Sensibility 51
## 2 Pride & Prejudice 62
## 3 Mansfield Park 49
## 4 Emma 56
## 5 Northanger Abbey 32
## 6 Persuasion 25
# Find which chapters are the most negative in each book
bingnegative <- get_sentiments("bing") %>%
filter(sentiment == "negative")
wordcounts <- tidy_books %>%
group_by(book, chapter) %>%
summarize(words = n())
## `summarise()` has grouped output by 'book'. You can override using the `.groups` argument.
tidy_books %>%
semi_join(bingnegative) %>%
group_by(book, chapter) %>%
summarize(negativewords = n()) %>%
left_join(wordcounts, by = c("book", "chapter")) %>%
mutate(ratio = negativewords/words) %>%
filter(chapter != 0) %>%
slice_max(ratio, n = 1) %>%
ungroup()
## Joining, by = "word"
## `summarise()` has grouped output by 'book'. You can override using the `.groups` argument.
## # A tibble: 6 × 5
## book chapter negativewords words ratio
## <fct> <int> <int> <int> <dbl>
## 1 Sense & Sensibility 43 161 3405 0.0473
## 2 Pride & Prejudice 34 111 2104 0.0528
## 3 Mansfield Park 46 173 3685 0.0469
## 4 Emma 15 151 3340 0.0452
## 5 Northanger Abbey 21 149 2982 0.0500
## 6 Persuasion 4 62 1807 0.0343
References
Alam, M. H., Ryu, W.-J., & Lee, S. (2016). Joint multi-grain topic sentiment: Modeling semantic aspects for online reviews. Information Sciences, 339, 206–223.
Silge, J., & Robinson, D. (2017). Chapter 2: Sentiment analysis with tidy data. In Text mining with R: A tidy approach. O’Reilly Media.
Mohammad, S. M., & Turney, P. D. (2013). Crowdsourcing a word-emotion association lexicon. Computational Intelligence, 29(3), 436–465.