Week 10 Homework
Overview
In Text Mining with R, Chapter 2 looks at Sentiment Analysis. In this assignment, you should start by getting the primary example code from chapter 2 working in an R Markdown document. You should provide a citation to this base code. You’re then asked to extend the code in two ways:
Work with a different corpus of your choosing, and incorporate at least one additional sentiment lexicon (possibly from another R package that you’ve found through research). As usual, please submit links to both an .Rmd file posted in your GitHub repository and to your code on rpubs.com. You may work on a small team on this assignment.
Setup
Load Libraries
The example code referenced below requires a number of libraries, which are preemptively loaded here.
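If any of these packages are missing, they can be installed first; a minimal sketch (package names taken from the library() calls below):
#one-time setup: install any required packages that are not yet present
pkgs <- c("tidytext", "janeaustenr", "dplyr", "stringr", "wordcloud",
          "reshape2", "ggplot2", "tidyr", "sentimentr", "magrittr")
install.packages(setdiff(pkgs, rownames(installed.packages())))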
library(tidytext)
library(janeaustenr)
library(dplyr)
library(stringr)
library(wordcloud)
library(reshape2)
library(ggplot2)
library(tidyr)
library(sentimentr)
library(magrittr)
Assignment Solution
The corpus I chose to work with is the Trip Advisor Reviews Data obtained from Kaggle1.
The data consists of the text from Trip Advisor reviews and the associated numerical rating that the user gave. I am going to attempt to compare the sentiment of the words in the reviews with the ratings of the user to see if they match.
First, load the data.
dataUrl <- 'https://raw.githubusercontent.com/nolivercuny/data607/master/homework7/tripadvisor_hotel_reviews.csv'
reviews <- read.csv(dataUrl)
glimpse(reviews)
## Rows: 20,491
## Columns: 2
## $ Review <chr> "nice hotel expensive parking got good deal stay hotel annivers…
## $ Rating <int> 4, 2, 3, 5, 5, 5, 5, 4, 5, 5, 2, 4, 4, 3, 4, 1, 2, 5, 5, 3, 5, …
Below is a box plot showing the distribution of review word counts (after stop-word removal) for each rating.
I noticed that there are a large number of extreme outliers. To pare down the data set, I will only analyze reviews shorter than 500 characters.
reviewsCounts <- reviews %>%
mutate(row = row_number()) %>%
unnest_tokens(word, Review) %>%
anti_join(stop_words) %>%
group_by(row, Rating) %>%
tally(sort=TRUE)
## Joining, by = "word"
reviewsCounts %>%
ggplot(aes(y=n, x=factor(Rating), fill=Rating)) +
geom_boxplot()
Filter the reviews to remove the extreme outliers, then tokenize them into a long dataset where each row is a single word from a review together with its rating.
reviewsTokenized <- reviews %>%
filter(nchar(Review) < 500) %>%
mutate(row = row_number()) %>%
unnest_tokens(word, Review)
Categorize and display the words from the reviews in a positive/negative word cloud using the bing lexicon.
reviewsTokenized %>%
anti_join(stop_words) %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("red", "green"),
max.words = 100)
## Joining, by = "word"
## Joining, by = "word"
I decided to count the number of positive and negative words in each rating bracket. My assumption was that lower-rated reviews would contain more negative sentiment words than positive ones, and that as the ratings got higher the balance would flip toward more positive than negative sentiment words.
Here I break the dataset down to generate negative and positive sentiment word counts grouped by rating, then generate overall word counts for each rating.
Finally, I plot a bar chart by rating with the negative and positive sentiment proportions side-by-side.
As you can see, my assumption was validated by this analysis.
#get negative sentiment words from bing
bingnegative <- get_sentiments("bing") %>%
filter(sentiment == "negative")
#get positive sentiment words from bing
bingpositive <- get_sentiments("bing") %>%
filter(sentiment == "positive")
#count all negative words across every review grouped by rating
negativeReviews <- reviewsTokenized %>%
anti_join(stop_words) %>%
semi_join(bingnegative) %>%
group_by(Rating) %>%
summarize(negativewords = n()) %>%
ungroup()
## Joining, by = "word"
## Joining, by = "word"
#count all positive words across every review grouped by rating
positiveReviews <- reviewsTokenized %>%
anti_join(stop_words) %>%
semi_join(bingpositive) %>%
group_by(Rating) %>%
summarize(positivewords = n()) %>%
ungroup()
## Joining, by = "word"
## Joining, by = "word"
#count all words across every review grouped by rating
reviewsWordCount <- reviewsTokenized %>%
anti_join(stop_words) %>%
group_by(Rating) %>%
summarize(count = n()) %>%
ungroup()
## Joining, by = "word"
#join all three dataframes together
#compute the proportion of negative and positive words per rating
reviewsJoined <- negativeReviews %>%
inner_join(positiveReviews) %>%
inner_join(reviewsWordCount) %>%
mutate(positivepercent = positivewords / count) %>%
mutate(negativepercent = negativewords / count)
## Joining, by = "Rating"
## Joining, by = "Rating"
#plot the data
reviewsJoined %>%
select(Rating, positivepercent, negativepercent) %>%
pivot_longer(!Rating) %>%
ggplot(aes(fill=name, y=value, x=Rating)) +
geom_bar(position="dodge", stat="identity")
For my secondary lexicon I decided to use the sentimentr library. The author of the library implemented a unique algorithm that accounts for “valence shifters” in the analysis. Valence shifters are things like negators, which prevent labeling a sentence like “I was not happy with the service” as positive just because the word “happy” appears in it. The library also performs its analysis on sentences rather than on individual words, as I did in my analysis above.
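To illustrate valence shifting, here is a minimal one-off check of my own (not part of the assignment analysis). A bag-of-words lexicon would see only the positive word “happy”, but sentimentr scores the negated sentence as negative:
#sentiment() scores each sentence; the negator "not" flips the polarity of "happy"
sentiment("I was not happy with the service.")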
Here I use the get_sentences() function from the sentimentr library to tokenize the reviews into sentences, then use the sentiment_by() function to get sentiment per rating, and finally plot the result using the library’s built-in plot method.
This library produces very different results from my analysis with the bing lexicon. The only rating that shows majority-negative sentiment is the 1-star rating; every other rating comes out positive.
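For a rough apples-to-apples comparison, the bing analysis above can be collapsed to a single net score per rating; a small extra step of my own using the reviewsJoined dataframe built earlier:
#net bing sentiment per rating: positive share minus negative share
reviewsJoined %>%
  mutate(net = positivepercent - negativepercent) %>%
  select(Rating, net)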
reviewsSentAgg <- reviews %>%
filter(nchar(Review) < 500) %>%
get_sentences() %$%
sentiment_by(Review, list(Rating))
reviewsSentAgg
##    Rating word_count        sd ave_sentiment
## 1:      1      28444 0.4004791   -0.12687296
## 2:      2      29342 0.4001992    0.08851533
## 3:      3      41846 0.4396274    0.33837032
## 4:      4     125388 0.4799364    0.59373899
## 5:      5     202476 0.5278442    0.73934188
plot(reviewsSentAgg)
Finally, the sentimentr library offers a feature that lets you visually see which parts of the text were labeled as positive versus negative.
reviews %>%
mutate(row = row_number()) %>%
filter(row %in% sample(unique(row), 5)) %>%
mutate(review_sentence = get_sentences(Review)) %$%
sentiment_by(review_sentence, row) %>%
highlight(open=FALSE, file="file.html")
## Saved in file.html
3006: -.844
loved punta cana hated hotel place not care quality going cheap, love punta cana plan year not stay, stay ritz cap cana, way spread, read reviews thought bad really oh, food terrible, fine hotel decide quality not quanity,
5631: +1.096
nice location, stayed calderon 2 nights weeks ago happy hotel, room 10th floor excellent view barcelona, room decorated nicely dark wood mirrors, desk pleasant helpful lobby pretty modern, pool floor excellent views barcelona did not swim, location great 7 minute walk las ramblas right placa cataluna restaurants lots action, spain 2 years previously friends staying loved took advice tried new happy,
11863: +.754
fine right price vibe located easy walking distance things cbd short stroll central station, clean comfortable public areas quite funky n't expecting 4.5 star luxury, aircon little hard right desk service hit miss did secret hotel 126/night good value, disappointed standard paid rack rate 330/night does actually pay,
14685: +.475
good hotel, stayed 3 nights january, room warm windows opened ventilation, small kitchen great boiling water tea storing beer things fridge gas stove oven immaculate, fancied cook, room big decor good bit dodgy places not spoiled, bathroom nice water pressure shower bit toooo, mini bar big tv loads amenities business people cd player built alarm clock fax printer internet access, stayed 32nd floor reasonable view just faciniating watching new york hotel location n't bad prepared walking did n't bother think best way explore healthy, blockheads mexican restaurant block right hotel great place youngsters llike drunk eat nice food, come staff friendly helpful did n't bother, barking dog good food drinks smokers like n't bad going outside sneaky cigarette temperature, great, 6 stars,
18654: +.736
pretty good price husband stayed majestic 7 days end june, happy stay, resort kept clean manicured, room beautiful, bathroom weird adjusted reviews correct beds hard rock pillows sad, requests brought pillows 8, complaint, got room tv n't hooked thought oh no service man 10 min, fixed right, room beautiful no mold smell whatsoever. the food ok. plenty eat left restaurant hungry, best food earth no fine, breakfast buffet best far, steakhouse good french restaurant good small portions, end week decided try ordering extra menu items seeing work, staff brought ordered no complaints, didn’t expecting gourmet food especially not steal price got wasn’t gourmet, things fresh presented good variety, husband nor stomach problems all. the language barrier problem not unmanageable, staff ok not overly friendly can’t say blame, best spa, best massages lives, absolutely fabulous, pool beach beautiful, time day plenty lounge chairs, pool beach water soooooo warm, great overall great time majestic anytime wanted good cheap vacation,
Example Code
Example code was taken from chapter 2 of the book Text mining with R: A tidy approach2.
I have chosen to exclude the get_sentiments code as it is trivial and self-explanatory.
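For reference, the excluded book code simply fetches each lexicon as a tidy dataframe; a minimal sketch (note that the afinn and nrc lexicons may prompt for a one-time download via the textdata package):
#each call returns a tidy dataframe of words and their sentiment labels
get_sentiments("afinn")
get_sentiments("bing")
get_sentiments("nrc")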
This code loads the works of Jane Austen and tokenizes them into a long dataframe with one word per row.
tidy_books <- austen_books() %>%
group_by(book) %>%
mutate(
linenumber = row_number(),
chapter = cumsum(str_detect(text,
regex("^chapter [\\divxlc]",
ignore_case = TRUE)))) %>%
ungroup() %>%
unnest_tokens(word, text)
This code loads the nrc sentiment dataset from Saif Mohammad and Peter Turney3. It filters the sentiment dataset to only joy words, filters the Jane Austen dataframe to only the text of the book Emma, then joins the two data sets and counts the occurrence of each word in the result of the join.
nrc_joy <- get_sentiments("nrc") %>%
filter(sentiment == "joy")
tidy_books %>%
filter(book == "Emma") %>%
inner_join(nrc_joy) %>%
count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 301 × 2
## word n
## <chr> <int>
## 1 good 359
## 2 friend 166
## 3 hope 143
## 4 happy 125
## 5 love 117
## 6 deal 92
## 7 found 92
## 8 present 89
## 9 kind 82
## 10 happiness 76
## # … with 291 more rows
Here they load a new sentiment dataset, chunk each book into 80-line segments, compute the sentiment of each chunk, and then plot the result. This gives a graphical picture of how the sentiment changes with the narrative of each book.
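As an aside of my own before the book’s code: the 80-line chunking relies on R’s integer division operator, %/%. A quick worked example of the indexing it produces:
#lines 0-79 fall into chunk 0, lines 80-159 into chunk 1, and so on
c(0, 79, 80, 159, 160) %/% 80
## [1] 0 0 1 1 2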
jane_austen_sentiment <- tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(book, index = linenumber %/% 80, sentiment) %>%
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
mutate(sentiment = positive - negative)
## Joining, by = "word"
ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
geom_col(show.legend = FALSE) +
facet_wrap(~book, ncol = 2, scales = "free_x")
This section is about comparing how the different sentiment datasets map to the text of a single book. They do the same plotting of the data as in the previous code snippet, but using a single book and multiple sentiment data sets.
pride_prejudice <- tidy_books %>%
filter(book == "Pride & Prejudice")
#pride_prejudice - I have intentionally excluded this to keep the assignment cleaner
afinn <- pride_prejudice %>%
inner_join(get_sentiments("afinn")) %>%
group_by(index = linenumber %/% 80) %>%
summarise(sentiment = sum(value)) %>%
mutate(method = "AFINN")
## Joining, by = "word"
bing_and_nrc <- bind_rows(
pride_prejudice %>%
inner_join(get_sentiments("bing")) %>%
mutate(method = "Bing et al."),
pride_prejudice %>%
inner_join(get_sentiments("nrc") %>%
filter(sentiment %in% c("positive",
"negative"))
) %>%
mutate(method = "NRC")) %>%
count(method, index = linenumber %/% 80, sentiment) %>%
pivot_wider(names_from = sentiment,
values_from = n,
values_fill = 0) %>%
mutate(sentiment = positive - negative)
## Joining, by = "word"
## Joining, by = "word"
bind_rows(afinn,
bing_and_nrc) %>%
ggplot(aes(index, sentiment, fill = method)) +
geom_col(show.legend = FALSE) +
facet_wrap(~method, ncol = 1, scales = "free_y")
Here the authors continue their analysis by counting which positive and negative words contributed the most to the sentiment scores.
bing_word_counts <- tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
## Joining, by = "word"
# bing_word_counts - I have intentionally modified this from its original source
bing_word_counts %>%
group_by(sentiment) %>%
slice_max(n, n = 10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(n, word, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(x = "Contribution to sentiment",
y = NULL)
As a follow-up to the previous section, the authors demonstrate adding custom “stop words” that are specific to the given analysis, pointing out that miss was one of the most common negative sentiment words, but in the context of a Jane Austen novel, miss is frequently used as a way of identifying a female character.
custom_stop_words <- bind_rows(tibble(word = c("miss"),
lexicon = c("custom")),
stop_words)
custom_stop_words
## # A tibble: 1,150 × 2
## word lexicon
## <chr> <chr>
## 1 miss custom
## 2 a SMART
## 3 a's SMART
## 4 able SMART
## 5 about SMART
## 6 above SMART
## 7 according SMART
## 8 accordingly SMART
## 9 across SMART
## 10 actually SMART
## # … with 1,140 more rows
Here the authors demonstrate a word cloud as another way of visualizing the data.
tidy_books %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 100))
## Joining, by = "word"
## Warning in wordcloud(word, n, max.words = 100): miss could not be fit on page.
## It will not be plotted.
tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("gray20", "gray80"),
max.words = 100)## Joining, by = "word"
The final section discusses using whole sentences, rather than individual words, to perform the sentiment analysis.
#demonstrate how to tokenize by sentence
p_and_p_sentences <- tibble(text = prideprejudice) %>%
unnest_tokens(sentence, text, token = "sentences")
p_and_p_sentences$sentence[2]
## [1] "by jane austen"
# counting the chapters of each book
austen_chapters <- austen_books() %>%
group_by(book) %>%
unnest_tokens(chapter, text, token = "regex",
pattern = "Chapter|CHAPTER [\\dIVXLC]") %>%
ungroup()
austen_chapters %>%
group_by(book) %>%
summarise(chapters = n())
## # A tibble: 6 × 2
## book chapters
## <fct> <int>
## 1 Sense & Sensibility 51
## 2 Pride & Prejudice 62
## 3 Mansfield Park 49
## 4 Emma 56
## 5 Northanger Abbey 32
## 6 Persuasion 25
# Find which chapters are the most negative in each book
bingnegative <- get_sentiments("bing") %>%
filter(sentiment == "negative")
wordcounts <- tidy_books %>%
group_by(book, chapter) %>%
summarize(words = n())
## `summarise()` has grouped output by 'book'. You can override using the `.groups` argument.
tidy_books %>%
semi_join(bingnegative) %>%
group_by(book, chapter) %>%
summarize(negativewords = n()) %>%
left_join(wordcounts, by = c("book", "chapter")) %>%
mutate(ratio = negativewords/words) %>%
filter(chapter != 0) %>%
slice_max(ratio, n = 1) %>%
ungroup()
## Joining, by = "word"
## `summarise()` has grouped output by 'book'. You can override using the `.groups` argument.
## # A tibble: 6 × 5
## book chapter negativewords words ratio
## <fct> <int> <int> <int> <dbl>
## 1 Sense & Sensibility 43 161 3405 0.0473
## 2 Pride & Prejudice 34 111 2104 0.0528
## 3 Mansfield Park 46 173 3685 0.0469
## 4 Emma 15 151 3340 0.0452
## 5 Northanger Abbey 21 149 2982 0.0500
## 6 Persuasion 4 62 1807 0.0343
References
Alam, M. H., Ryu, W.-J., & Lee, S. (2016). Joint multi-grain topic sentiment: Modeling semantic aspects for online reviews. Information Sciences, 339, 206–223.
Silge, J., & Robinson, D. (2017). Chapter 2: Sentiment analysis with tidy data. In Text mining with R: A tidy approach. O’Reilly Media.
Mohammad, S. M., & Turney, P. D. (2013). Crowdsourcing a word-emotion association lexicon. Computational Intelligence, 29(3), 436–465.