output:
  prettydoc::html_pretty:
    theme: architect
    highlight: github
Setup
library(tidytext)
library(janeaustenr)
library(stringr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(gutenbergr)
library(wordcloud)
## Loading required package: RColorBrewer
library(tidyr)
library(corpus)
library(ggmap)
## Loading required package: ggplot2
## Google's Terms of Service: https://cloud.google.com/maps-platform/terms/.
## Please cite ggmap if you use it! See citation("ggmap") for details.
#library(rvest)
#library(leaflet)
#library(RColorBrewer)
#library(scrapeR)
=============================================================================
The focus of this assignment is to improve proficiency with the “tidytext” package for sentiment analysis. The “tidytext” package and its accompanying book, “Text Mining with R”, are the work of data scientists Julia Silge and David Robinson; the book is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 United States License.
=============================================================================
My response to the assignment uses the “gutenbergr” package for R and the works of Henry Wadsworth Longfellow, a favorite of my late father.
=============================================================================
The following recreates the Chapter 2 code from the textbook “Text Mining with R”:
Silge, Julia, and David Robinson. Text Mining with R: A Tidy Approach. O'Reilly Media, June 2017.
### the "afinn" sentiment Lexicon
get_sentiments("afinn")
## # A tibble: 2,477 x 2
## word value
## <chr> <dbl>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
## # ... with 2,467 more rows
### the "bing" sentiment Lexicon
get_sentiments("bing")
## # A tibble: 6,786 x 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
## # ... with 6,776 more rows
### the "nrc" sentiment Lexicon
get_sentiments("nrc")
## # A tibble: 13,901 x 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # ... with 13,891 more rows
### extracting the austen books from the library and "tokenizing" the words
### Putting in a - tidy format -
tidy_books_austen <- austen_books() %>%
  group_by(book) %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text,
                                regex("^chapter [\\divxlc]",
                                      ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)
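To make the tokenizing step concrete, here is a minimal sketch on a four-line toy text. The `toy_book` data and the `toy_tidy` name are hypothetical and only for illustration; the pipeline itself mirrors the one above.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Hypothetical four-line "book", used only to illustrate the pipeline
toy_book <- tibble(
  book = "Toy",
  text = c("CHAPTER I", "It was a fine day", "CHAPTER II", "Rain fell all night")
)

toy_tidy <- toy_book %>%
  group_by(book) %>%
  mutate(
    linenumber = row_number(),
    # every line that looks like a chapter heading bumps the running count
    chapter = cumsum(str_detect(text,
                                regex("^chapter [\\divxlc]",
                                      ignore_case = TRUE)))) %>%
  ungroup() %>%
  # one row per word; words are lower-cased and the text column is dropped
  unnest_tokens(word, text)

toy_tidy
```

Each word keeps the `linenumber` and `chapter` of the line it came from, which is what lets the later steps group words into sections of text.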
### Using the nrc lexicon and filter() for the - joy - words
nrc_joy <- get_sentiments("nrc") %>%
filter(sentiment == "joy")
### Applying the "joy" filter to the Book "Emma"
tidy_books_austen %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 303 x 2
## word n
## <chr> <int>
## 1 good 359
## 2 young 192
## 3 friend 166
## 4 hope 143
## 5 happy 125
## 6 love 117
## 7 deal 92
## 8 found 92
## 9 present 89
## 10 kind 82
## # ... with 293 more rows
### using pivot_wider() so that we have negative and positive sentiment in
### separate columns, and lastly calculate a net sentiment
### (positive - negative)
jane_austen_sentiment <- tidy_books_austen %>%
  inner_join(get_sentiments("nrc")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
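The binning and net-score arithmetic can be checked by hand on a small example. The sketch below assumes a hypothetical table `toy_matches` of words that have already been matched to a lexicon; the line numbers and labels are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical lexicon matches: one row per matched word
toy_matches <- tibble(
  book       = "Toy",
  linenumber = c(3, 10, 45, 79, 80, 120, 159, 160),
  sentiment  = c("positive", "negative", "positive", "positive",
                 "negative", "negative", "positive", "positive")
)

toy_sentiment <- toy_matches %>%
  # integer division puts lines 0-79 in index 0, lines 80-159 in index 1, and so on
  count(book, index = linenumber %/% 80, sentiment) %>%
  # one column per sentiment label; sections with no match for a label get 0
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)

toy_sentiment
```

With the "nrc" lexicon the wide table also has columns for the emotion categories (joy, fear, and so on), but only `positive` and `negative` enter the net score.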
### Plotting these sentiment scores across the plot trajectory of each novel
ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")
=============================================================================
For our sentiment analysis, we will use the Project Gutenberg collection (through the “gutenbergr” package) and examine some of Henry Wadsworth Longfellow’s work.
### The Gutenberg author id for Longfellow is 16
hwl_works <- gutenberg_metadata %>%
  filter(gutenberg_author_id == 16)
### The titles in the Gutenberg library for Henry Wadsworth Longfellow are :
hwl_works[c("gutenberg_id","title")]
## # A tibble: 13 x 2
## gutenberg_id title
## <int> <chr>
## 1 19 "The Song of Hiawatha"
## 2 1365 "The Complete Poetical Works of Henry Wadsworth Longfellow"
## 3 2039 "Evangeline: A Tale of Acadie"
## 4 5436 "Hyperion"
## 5 9080 "The Children's Own Longfellow"
## 6 10490 "The Golden Legend"
## 7 13830 "The Wreck of the Hesperus"
## 8 15390 "Evangeline\nwith Notes and Plan of Study"
## 9 20894 "Evangeline: Traduction du poème Acadien de Longfellow"
## 10 23332 "Greetings from Longfellow"
## 11 25153 "Tales of a Wayside Inn"
## 12 30795 "The Song of Hiawatha: An Epic Poem"
## 13 44398 "Poems on Slavery"
hwl_books <- hwl_works[c("gutenberg_id")]
hwl_books
## # A tibble: 13 x 1
## gutenberg_id
## <int>
## 1 19
## 2 1365
## 3 2039
## 4 5436
## 5 9080
## 6 10490
## 7 13830
## 8 15390
## 9 20894
## 10 23332
## 11 25153
## 12 30795
## 13 44398
### setting up the conversion from id to titles
book_titles <- as_labeller(
  c(`19` = "Hiawatha", `1365` = "Poetical Works", `2039` = "Acadie",
    `5436` = "Hyperion", `9080` = "Children's Own", `10490` = "Golden Legend",
    `13830` = "Hesperus", `15390` = "Evangeline", `20894` = "Traduction du poème",
    `23332` = "Greetings", `25153` = "Wayside Inn", `30795` = "Epic Poem",
    `44398` = "Slavery"))
### Setting up the "nrc" Lexicon
get_sentiments("nrc")
## # A tibble: 13,901 x 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # ... with 13,891 more rows
nrc_sentiment <- get_sentiments("nrc") %>%
  filter(sentiment == "joy")
### Downloading the Longfellow books.
hwl_books_download <- gutenberg_download(hwl_books,
                                         mirror = NULL,
                                         strip = TRUE,
                                         verbose = TRUE,
                                         files = NULL)
## Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org
tidy_books_hwl <- hwl_books_download %>%
  group_by(gutenberg_id) %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text,
                                regex("^chapter [\\divxlc]",
                                      ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)
### Applying the "nrc" Lexicon for "joy" words
tidy_books_hwl %>%
  inner_join(nrc_sentiment) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 505 x 2
## word n
## <chr> <int>
## 1 love 662
## 2 white 562
## 3 god 558
## 4 good 517
## 5 sun 501
## 6 art 411
## 7 sweet 369
## 8 beautiful 363
## 9 young 308
## 10 tree 289
## # ... with 495 more rows
hwl_sentiment <- tidy_books_hwl %>%
  inner_join(get_sentiments("nrc")) %>%
  count(gutenberg_id, index = linenumber %/% 100, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
### A constant colour is set outside aes(); the labeller is passed as an object
ggplot(hwl_sentiment, aes(index, sentiment)) +
  geom_col(fill = "red", show.legend = FALSE) +
  facet_wrap(~gutenberg_id, ncol = 10, scales = "free_x", labeller = book_titles)
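Two details of this kind of faceted plot are easy to get subtly wrong: where a constant fill colour goes, and how the id-to-title lookup is handed to `facet_wrap()`. The following is a minimal sketch under stated assumptions: the scores in `toy_scores` are hypothetical, and `toy_titles` is a cut-down stand-in for the full title lookup.

```r
library(ggplot2)

# Hypothetical net sentiment scores for two of the Gutenberg ids listed above
toy_scores <- data.frame(
  gutenberg_id = c(19, 19, 2039, 2039),
  index        = c(0, 1, 0, 1),
  sentiment    = c(4, -2, 1, 3)
)

# Lookup from facet value to display title
toy_titles <- as_labeller(c(`19` = "Hiawatha", `2039` = "Acadie"))

p <- ggplot(toy_scores, aes(index, sentiment)) +
  # a constant colour belongs outside aes(); inside aes() the string "red"
  # would be treated as a data value and mapped through the default palette
  geom_col(fill = "red", show.legend = FALSE) +
  # pass the labeller object itself rather than its name as a string
  facet_wrap(~gutenberg_id, scales = "free_x", labeller = toy_titles)

# Inspect the fill that ggplot2 actually assigns to the bars
unique(ggplot_build(p)$data[[1]]$fill)
```

Calling `print(p)` draws the plot; `ggplot_build()` is used here only to check the computed fill without opening a graphics device.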
==========================================================================
We will apply the “loughran” sentiment lexicon to our selection of Longfellow’s work.
lhr_sentiment <- get_sentiments("loughran")
tidy_books_hwl %>%
  inner_join(lhr_sentiment) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 1,048 x 2
## word n
## <chr> <int>
## 1 shall 1021
## 2 great 874
## 3 may 528
## 4 good 517
## 5 could 404
## 6 unto 388
## 7 beautiful 363
## 8 might 243
## 9 strong 235
## 10 fear 230
## # ... with 1,038 more rows
### Note: this reuses the name lhr_sentiment, replacing the lexicon with the scores
lhr_sentiment <- tidy_books_hwl %>%
  inner_join(get_sentiments("loughran")) %>%
  count(gutenberg_id, index = linenumber %/% 100, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
### Colour each book by its id; the labeller is passed as an object
ggplot(lhr_sentiment, aes(index, sentiment, fill = factor(gutenberg_id))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~gutenberg_id, ncol = 10, scales = "free_x", labeller = book_titles)
=========================================================================
Conclusion:
The “tidytext” package is a capable tool for sentiment analysis, as the examples above show. In this exercise, after reproducing the code from the “tidytext” book, we loaded the “gutenbergr” package and extracted the works of Henry Wadsworth Longfellow, a total of 13 books in the library. We then ran sentiment analysis on this corpus with two lexicons, “nrc” and “loughran”. As our analysis shows, the “loughran” lexicon produces a significantly larger number of negative sentiments than the “nrc” lexicon.
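One way to put a number on such a lexicon comparison is the share of matched words that each lexicon labels negative. The sketch below assumes two hypothetical mini-lexicons, `lexicon_a` and `lexicon_b`, and a hypothetical token list; none of the entries are taken from the real “nrc” or “loughran” lexicons.

```r
library(dplyr)

# Hypothetical tokens and mini-lexicons, for illustration only
toy_tokens <- tibble(word = c("love", "fear", "loss", "good", "doubt", "sun"))

lexicon_a <- tibble(
  word      = c("love", "fear", "good", "sun"),
  sentiment = c("positive", "negative", "positive", "positive")
)
lexicon_b <- tibble(
  word      = c("fear", "loss", "good", "doubt"),
  sentiment = c("negative", "negative", "positive", "negative")
)

negative_share <- bind_rows(a = lexicon_a, b = lexicon_b, .id = "lexicon") %>%
  # keep only lexicon entries that occur in the text
  inner_join(toy_tokens, by = "word") %>%
  # ignore any other categories a lexicon may carry
  filter(sentiment %in% c("positive", "negative")) %>%
  group_by(lexicon) %>%
  summarise(
    matched        = n(),
    negative       = sum(sentiment == "negative"),
    share_negative = negative / matched
  )

negative_share
```

Applied to the real data, the same summary could be computed from `tidy_books_hwl` joined to each lexicon, which would make the comparison independent of how many words each lexicon happens to match.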
While this exercise is by no means a complete analysis of Longfellow’s work, it does demonstrate the power of the “tidytext” package.