SPS-DATA607-WEEK10 - Sentiment Analysis

Tage N Singh

2021-04-19


Setup

library(tidytext)
library(janeaustenr)
library(stringr)
library(dplyr)
library(gutenbergr)
library(wordcloud)
library(tidyr)
library(corpus)
library(ggplot2)  # ggplot(), geom_col(), facet_wrap(), as_labeller()
# get_sentiments("afinn"), get_sentiments("nrc") and get_sentiments("loughran")
# also need the textdata package installed

=============================================================================

The focus of this assignment is to improve proficiency with the “tidytext” package for sentiment analysis. The “tidytext” package and the accompanying book “Text Mining with R” are the work of data scientists Julia Silge and David Robinson; the book is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 United States License.

=============================================================================

My response to the assignment uses the “gutenbergr” package for R and the works of Henry Wadsworth Longfellow, a favorite of my late father.

=============================================================================

The following is a recreation of the Chapter 2 code from the textbook Text Mining with R.

Silge, Julia, and David Robinson. Text Mining with R: A Tidy Approach. O'Reilly Media, June 2017.

### the "afinn" sentiment Lexicon

get_sentiments("afinn")
## # A tibble: 2,477 x 2
##    word       value
##    <chr>      <dbl>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # ... with 2,467 more rows
### the "bing" sentiment Lexicon

get_sentiments("bing")
## # A tibble: 6,786 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faces     negative 
##  2 abnormal    negative 
##  3 abolish     negative 
##  4 abominable  negative 
##  5 abominably  negative 
##  6 abominate   negative 
##  7 abomination negative 
##  8 abort       negative 
##  9 aborted     negative 
## 10 aborts      negative 
## # ... with 6,776 more rows
### the "nrc" sentiment Lexicon

get_sentiments("nrc")
## # A tibble: 13,901 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # ... with 13,891 more rows
### extracting the austen books from the library and "tokenizing" the words
### Putting in a  - tidy format - 

tidy_books_austen <- austen_books() %>%
  group_by(book) %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text, 
                                regex("^chapter [\\divxlc]", 
                                      ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)
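
To make the tokenizing step easier to follow, here is a small sketch that runs the same pipeline on four made-up lines of text. `toy_text` and `toy_tidy` are hypothetical names and the text is not from the Austen corpus; the regular expression and the calls are the ones used above.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Hypothetical toy text standing in for one book
toy_text <- tibble(
  book = "Toy Book",
  text = c("CHAPTER I", "A happy day it was", "Chapter II", "The storm came")
)

toy_tidy <- toy_text %>%
  group_by(book) %>%
  mutate(
    linenumber = row_number(),
    # every chapter heading adds 1, so the lines after it share its number
    chapter = cumsum(str_detect(text,
                                regex("^chapter [\\divxlc]",
                                      ignore_case = TRUE)))
  ) %>%
  ungroup() %>%
  unnest_tokens(word, text)  # one lower-cased word per row

toy_tidy
```

Each word keeps the `linenumber` and `chapter` of the line it came from, which is what later allows the text to be grouped into sections.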

### Using the nrc lexicon and filter() for the - joy - words

nrc_joy <- get_sentiments("nrc") %>% 
  filter(sentiment == "joy")

### Applying the "joy" filter to the Book "Emma"

tidy_books_austen %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 303 x 2
##    word        n
##    <chr>   <int>
##  1 good      359
##  2 young     192
##  3 friend    166
##  4 hope      143
##  5 happy     125
##  6 love      117
##  7 deal       92
##  8 found      92
##  9 present    89
## 10 kind       82
## # ... with 293 more rows
### using pivot_wider() so that we have negative and positive sentiment in 
### separate columns, and lastly calculate a net sentiment 
### (positive - negative)


jane_austen_sentiment <- tidy_books_austen %>%
  inner_join(get_sentiments("nrc")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% 
  mutate(sentiment = positive - negative)
## Joining, by = "word"
### Plotting these sentiment scores across the plot trajectory of each novel


ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")
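
The net-sentiment calculation can be traced by hand on a tiny example. The sketch below uses made-up tokens and a made-up two-class lexicon (`toy_tokens` and `toy_lexicon` are hypothetical and not part of tidytext) so that the effect of `linenumber %/% 80` and `pivot_wider()` is visible row by row.

```r
library(dplyr)
library(tidyr)

# Hypothetical tokens with the line number each word came from
toy_tokens <- tibble(
  linenumber = c(1, 40, 79, 80, 120, 159, 160),
  word       = c("good", "happy", "sad", "bad", "awful", "kind", "grim")
)

# Hypothetical lexicon with one positive/negative label per word
toy_lexicon <- tibble(
  word      = c("good", "happy", "kind", "sad", "bad", "awful", "grim"),
  sentiment = c("positive", "positive", "positive",
                "negative", "negative", "negative", "negative")
)

toy_sentiment <- toy_tokens %>%
  inner_join(toy_lexicon, by = "word") %>%
  # integer division puts lines 0-79 in section 0, 80-159 in section 1, ...
  count(index = linenumber %/% 80, sentiment) %>%
  # one column per sentiment; sections with no match in a class get 0
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)

toy_sentiment
```

Section 0 holds two positive words and one negative word, so its net score is 1; sections 1 and 2 both come out at -1.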

=============================================================================

For our sentiment analysis we will use the Project Gutenberg library, through the “gutenbergr” package, and examine some of Henry Wadsworth Longfellow’s work.

### The Gutenberg author id for Longfellow is 16

hwl_works <- gutenberg_metadata %>%
  filter(gutenberg_author_id == 16)

### The titles in the Gutenberg library for Henry Wadsworth Longfellow are :

hwl_works[c("gutenberg_id","title")]
## # A tibble: 13 x 2
##    gutenberg_id title                                                      
##           <int> <chr>                                                      
##  1           19 "The Song of Hiawatha"                                     
##  2         1365 "The Complete Poetical Works of Henry Wadsworth Longfellow"
##  3         2039 "Evangeline: A Tale of Acadie"                             
##  4         5436 "Hyperion"                                                 
##  5         9080 "The Children's Own Longfellow"                            
##  6        10490 "The Golden Legend"                                        
##  7        13830 "The Wreck of the Hesperus"                                
##  8        15390 "Evangeline\nwith Notes and Plan of Study"                 
##  9        20894 "Evangeline: Traduction du poème Acadien de Longfellow"    
## 10        23332 "Greetings from Longfellow"                                
## 11        25153 "Tales of a Wayside Inn"                                   
## 12        30795 "The Song of Hiawatha: An Epic Poem"                       
## 13        44398 "Poems on Slavery"
hwl_books <- hwl_works[c("gutenberg_id")]

hwl_books
## # A tibble: 13 x 1
##    gutenberg_id
##           <int>
##  1           19
##  2         1365
##  3         2039
##  4         5436
##  5         9080
##  6        10490
##  7        13830
##  8        15390
##  9        20894
## 10        23332
## 11        25153
## 12        30795
## 13        44398
### setting up the conversion from id to titles

book_titles <- as_labeller(
  c(`19` = "Hiawatha", `1365` = "Poetical Works", `2039` = "Acadie",
    `5436` = "Hyperion", `9080` = "Children's Own", `10490` = "Golden Legend",
    `13830` = "Hesperus", `15390` = "Evangeline", `20894` = "Traduction du poème",
    `23332` = "Greetings", `25153` = "Wayside Inn", `30795` = "Epic Poem",
    `44398` = "Slavery"))
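
As a rough illustration of how such a labeller is used, the sketch below builds a two-entry lookup and hands it to `facet_wrap()` as an object. `toy_scores`, `toy_titles`, and `toy_plot` are hypothetical names and the scores are made up; only the two ids and their short titles come from the mapping above.

```r
library(ggplot2)

# Hypothetical scores for two of the Gutenberg ids listed above
toy_scores <- data.frame(
  gutenberg_id = c(19, 19, 2039, 2039),
  index        = c(0, 1, 0, 1),
  sentiment    = c(3, -1, 2, 4)
)

# Names are the facet values (as text), values are the strip labels to show
toy_titles <- as_labeller(c(`19` = "Hiawatha", `2039` = "Acadie"))

toy_plot <- ggplot(toy_scores, aes(index, sentiment)) +
  geom_col() +
  # the labeller is passed as an object, without quotes
  facet_wrap(~gutenberg_id, labeller = toy_titles)

toy_plot
```

The facet strips then read "Hiawatha" and "Acadie" in place of the numeric ids.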


### Setting up the "nrc" Lexicon

get_sentiments("nrc")
## # A tibble: 13,901 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # ... with 13,891 more rows
nrc_sentiment <- get_sentiments("nrc") %>% 
  filter(sentiment == "joy")

### Downloading the Longfellow books.

hwl_books_download <- gutenberg_download(hwl_books,
                                         mirror = NULL,
                                         strip = TRUE,
                                         verbose = TRUE,
                                         files = NULL)
## Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org
tidy_books_hwl <- hwl_books_download %>%
  group_by(gutenberg_id)%>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text, 
                                regex("^chapter [\\divxlc]", 
                                      ignore_case = TRUE))))%>%
  ungroup() %>%
  unnest_tokens(word, text)

### Applying the "nrc" Lexicon for "joy" words

tidy_books_hwl %>%
  inner_join(nrc_sentiment) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 505 x 2
##    word          n
##    <chr>     <int>
##  1 love        662
##  2 white       562
##  3 god         558
##  4 good        517
##  5 sun         501
##  6 art         411
##  7 sweet       369
##  8 beautiful   363
##  9 young       308
## 10 tree        289
## # ... with 495 more rows
hwl_sentiment <- tidy_books_hwl %>%
  inner_join(get_sentiments("nrc")) %>%
  count(gutenberg_id, index = linenumber %/% 100, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% 
  mutate(sentiment = positive - negative)
## Joining, by = "word"
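### fill is a fixed colour, so it is set in geom_col() and not inside aes();
### the labeller is passed as an object, not as a quoted name

ggplot(hwl_sentiment, aes(index, sentiment)) +
  geom_col(fill = "red", show.legend = FALSE) +
  facet_wrap(~gutenberg_id, ncol = 10, scales = "free_x", labeller = book_titles)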

==========================================================================

We will now apply the “loughran” sentiment lexicon to our selection of Longfellow’s work.

lhr_sentiment <- get_sentiments("loughran")

tidy_books_hwl %>%
  inner_join(lhr_sentiment) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 1,048 x 2
##    word          n
##    <chr>     <int>
##  1 shall      1021
##  2 great       874
##  3 may         528
##  4 good        517
##  5 could       404
##  6 unto        388
##  7 beautiful   363
##  8 might       243
##  9 strong      235
## 10 fear        230
## # ... with 1,038 more rows
lhr_sentiment <- tidy_books_hwl %>%
  inner_join(get_sentiments("loughran")) %>%
  count(gutenberg_id, index = linenumber %/% 100, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% 
  mutate(sentiment = positive - negative)
## Joining, by = "word"
### fill is mapped to the book id (as a factor) so each book gets its own colour

ggplot(lhr_sentiment, aes(index, sentiment, fill = factor(gutenberg_id))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~gutenberg_id, ncol = 10, scales = "free_x", labeller = book_titles)
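
One way to put a number on how two lexicons differ is to tally the positive and negative matches for each. The sketch below is only an illustration under stated assumptions: `compare_lexicons()` is a hypothetical helper, not part of tidytext, and the tokens and both lexicons are made-up toy data, so the figures say nothing about the actual "nrc" and "loughran" results.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: positive and negative match counts for each lexicon
compare_lexicons <- function(tokens, lexicons) {
  bind_rows(lapply(names(lexicons), function(name) {
    # keep only the two classes that both lexicons share
    lexicon <- lexicons[[name]] %>%
      filter(sentiment %in% c("positive", "negative"))
    tokens %>%
      inner_join(lexicon, by = "word") %>%
      count(sentiment) %>%
      mutate(lexicon = name)
  })) %>%
    pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
    mutate(negative_share = negative / (positive + negative))
}

# Toy tokens and two toy lexicons (assumed data, for illustration only)
toy_tokens <- tibble(word = c("love", "fear", "doubt", "good", "loss"))
toy_lex_a  <- tibble(word      = c("love", "good", "fear", "fear"),
                     sentiment = c("positive", "positive", "negative", "fear"))
toy_lex_b  <- tibble(word      = c("good", "fear", "doubt", "loss"),
                     sentiment = c("positive", "negative", "negative", "negative"))

toy_comparison <- compare_lexicons(toy_tokens, list(a = toy_lex_a, b = toy_lex_b))
toy_comparison
```

Assuming the objects from this report are available, the same helper could be called as `compare_lexicons(tidy_books_hwl, list(nrc = get_sentiments("nrc"), loughran = get_sentiments("loughran")))`.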

=========================================================================

Conclusion:

The “tidytext” package is a capable tool for sentiment analysis, as the examples above show. In this exercise, after reproducing the code from the Text Mining with R book, we used the “gutenbergr” package to extract the works of Henry Wadsworth Longfellow, a total of 13 books in the library. We then compared sentiment analysis of this corpus under two lexicons, “nrc” and “loughran”. As our analysis shows, the “loughran” lexicon produces a markedly larger number of negative sentiments than the “nrc” lexicon. One possible reason is that “loughran” was developed for financial documents, so its word list may fit poetry less well.

While this exercise is by no means a complete analysis of Longfellow’s work, it does demonstrate the power of the “tidytext” package.