MLA citation: Silge, Julia, and David Robinson. Text Mining with R: A Tidy Approach. 2017. Internet resource.
In this assignment, we perform a sentiment analysis on a corpus of four books by the English writer H.G. Wells (1866–1946): The Time Machine, The War of the Worlds, The Invisible Man, and The Island of Doctor Moreau. The books were obtained from Project Gutenberg using the gutenbergr R package.
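We use four sentiment lexicons, all accessed through get_sentiments() in the tidytext package: AFINN, Bing, NRC, and Loughran. AFINN assigns each word a numeric score, while Bing, NRC, and Loughran assign words to categories such as positive and negative. The Loughran lexicon was developed for financial text, so it is the least tailored to fiction.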
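To show what a categorical lexicon looks like, here is a minimal sketch that inspects the Bing lexicon. It assumes tidytext and dplyr are installed. Bing is bundled with tidytext; as far as we know, recent tidytext versions fetch AFINN, NRC, and Loughran through the textdata package on first use, which is why only Bing appears here. The name `bing_counts` is introduced for illustration.

```r
library(dplyr)
library(tidytext)

# The Bing lexicon ships with tidytext, so no download is needed
bing <- get_sentiments("bing")

# One row per word, each with a categorical sentiment label
head(bing)

# How many words carry each label
bing_counts <- count(bing, sentiment)
bing_counts
```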
To perform the sentiment analysis, we first load the necessary packages and download the four books with gutenbergr. We then tokenize the text into one word per row with unnest_tokens(), which also strips punctuation and converts everything to lowercase, and remove stop words with an anti-join against tidytext’s stop_words list.
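The cleaning step can be sketched on a two-line toy text rather than the Wells corpus. This assumes only that tidytext and dplyr are installed; the `toy` tibble and its sentences are invented for illustration.

```r
library(dplyr)
library(tidytext)

# Hypothetical two-line text, invented for illustration only
toy <- tibble(
  line = 1:2,
  text = c("The Martians were coming!", "I heard a terrible, terrible sound.")
)

tidy_toy <- toy |>
  unnest_tokens(word, text) |>        # one lowercase word per row, punctuation dropped
  anti_join(stop_words, by = "word")  # remove common stop words such as "the"

tidy_toy
```

Passing `by = "word"` explicitly avoids the "Joining, by" message that appears in the output of this report.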
Next, we join each lexicon to the cleaned text so that every matched word carries a sentiment label or score. We then aggregate these into chunks of 80 matched words: for AFINN we sum the word scores in each chunk, and for the other lexicons we take the number of positive words minus the number of negative words. Chunks capture the sentiment of a larger unit of text, since a single word or sentence rarely provides enough context.
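The chunking arithmetic can be sketched on toy data. Integer division (`%/%`) of a running word position by the chunk size gives each word a chunk index. The sketch uses a chunk size of 3 instead of 80 and a small hand-made lexicon (`toy_lexicon`, invented here and not one of the real lexicons), so it runs without any downloads.

```r
library(dplyr)

# Hypothetical word stream and lexicon, invented for illustration
words <- tibble(word = c("happy", "dark", "fear", "joy", "good", "dead", "bright"))
toy_lexicon <- tibble(
  word = c("happy", "joy", "good", "bright", "dark", "fear", "dead"),
  sentiment = c("positive", "positive", "positive", "positive",
                "negative", "negative", "negative")
)

chunk_size <- 3  # the analysis in this report uses 80

chunk_scores <- words |>
  inner_join(toy_lexicon, by = "word") |>
  mutate(index = (row_number() - 1) %/% chunk_size) |>  # chunk index: 0,0,0,1,1,1,2
  group_by(index) |>
  summarise(
    positive = sum(sentiment == "positive"),
    negative = sum(sentiment == "negative"),
    net = positive - negative  # net sentiment of the chunk
  )

chunk_scores
```

Subtracting 1 before dividing makes the first chunk full-sized. The code in this report divides `row_number()` directly, so its first chunk holds 79 words rather than 80, which makes little practical difference at that chunk size.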
Finally, we plot the score for each chunk with ggplot2. This shows the overall sentiment of the corpus and any notable shifts in sentiment over the course of each book.
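As a rough sketch of the plotting step, the following draws one bar per chunk and one panel per book. The `toy_scores` data frame and its values are made up for illustration and are not results from the Wells corpus.

```r
library(ggplot2)

# Hypothetical per-chunk net scores for two made-up books
toy_scores <- data.frame(
  book = rep(c("Book A", "Book B"), each = 4),
  index = rep(0:3, times = 2),
  sentiment = c(2, -1, -3, 1, -2, -4, 0, 3)
)

p <- ggplot(toy_scores, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +                  # one bar per chunk
  facet_wrap(~book, ncol = 2, scales = "free_x")   # one panel per book

p
```

Bars below zero mark chunks where negative words outnumber positive ones.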
library(tidytext)
library(janeaustenr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(stringr)
library(gutenbergr)
library(wordcloud)
## Loading required package: RColorBrewer
library(lexicon)
nrc_joy <- get_sentiments("nrc") |>
  filter(sentiment == "joy")
hgwells <- gutenberg_download(c(35, 36, 5230, 159))
## Determining mirror for Project Gutenberg from https://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org
# Tokenize to one word per row, drop stop words, and label each book by title.
# Project Gutenberg ID 159 is The Island of Doctor Moreau; 5230 is The Invisible Man.
tidy_hgwells <- hgwells |>
  unnest_tokens(word, text) |>
  anti_join(stop_words) |>
  mutate(gutenberg_id = case_when(
    gutenberg_id == 35   ~ "The Time Machine",
    gutenberg_id == 36   ~ "The War of the Worlds",
    gutenberg_id == 159  ~ "The Island of Doctor Moreau",
    gutenberg_id == 5230 ~ "The Invisible Man"
  ))
## Joining, by = "word"
tidy_hgwells |>
  count(word, sort = TRUE)
## # A tibble: 11,811 × 2
##    word       n
##    <chr>  <int>
##  1 time     461
##  2 people   302
##  3 door     260
##  4 heard    249
##  5 black    232
##  6 stood    229
##  7 white    224
##  8 hand     218
##  9 kemp     213
## 10 eyes     210
## # … with 11,801 more rows
tidy_hgwells |>
  inner_join(nrc_joy) |>
  count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 309 × 2
##    word        n
##    <chr>   <int>
##  1 found     200
##  2 green     104
##  3 beach      75
##  4 sun        73
##  5 save       68
##  6 food       67
##  7 feeling    49
##  8 dawn       38
##  9 rising     35
## 10 god        34
## # … with 299 more rows
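# Net Bing sentiment (positive minus negative) per 80-word chunk
hgwells_sentiment <- tidy_hgwells |>
  inner_join(get_sentiments("bing")) |>
  group_by(gutenberg_id) |>
  mutate(index = row_number() %/% 80) |>  # restart chunk numbering for each book
  ungroup() |>
  count(gutenberg_id, index, sentiment) |>
  rename(book_id = gutenberg_id) |>
  spread(sentiment, n, fill = 0) |>
  mutate(sentiment = positive - negative)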
## Joining, by = "word"
ggplot(hgwells_sentiment, aes(index, sentiment, fill = book_id)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book_id, ncol = 2, scales = "free_x")
the_invisible_man <- tidy_hgwells |>
  filter(gutenberg_id == "The Invisible Man")
# AFINN: sum the numeric word scores within each 80-word chunk
afinn <- the_invisible_man |>
  inner_join(get_sentiments("afinn")) |>
  group_by(index = row_number() %/% 80) |>
  summarise(sentiment = sum(value)) |>
  mutate(method = "AFINN")
## Joining, by = "word"
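# Bing and NRC: positive minus negative word counts per 80-word chunk.
# Chunks are numbered within each method so both series start at index 0.
bing_and_nrc <- bind_rows(
  the_invisible_man |>
    inner_join(get_sentiments("bing")) |>
    mutate(method = "Bing et al."),
  the_invisible_man |>
    inner_join(get_sentiments("nrc") |>
      filter(sentiment %in% c("positive", "negative"))) |>
    mutate(method = "NRC")) |>
  group_by(method) |>
  mutate(index = row_number() %/% 80) |>
  ungroup() |>
  count(method, index, sentiment) |>
  spread(sentiment, n, fill = 0) |>
  mutate(sentiment = positive - negative)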
## Joining, by = "word"
## Joining, by = "word"
bind_rows(afinn, bing_and_nrc) |>
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")
bing_word_counts <- tidy_hgwells |>
  inner_join(get_sentiments("bing")) |>
  count(word, sentiment, sort = TRUE) |>
  ungroup()
## Joining, by = "word"
bing_word_counts |>
  group_by(sentiment) |>
  top_n(10) |>
  ungroup() |>
  mutate(word = reorder(word, n)) |>
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment",
       x = NULL) +
  coord_flip()
## Selecting by n
tidy_hgwells |>
  anti_join(stop_words) |>
  count(word) |>
  with(wordcloud(word, n, max.words = 100))
## Joining, by = "word"
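# Net Loughran sentiment (positive minus negative) per 80-word chunk
loughran_hgwells <- tidy_hgwells |>
  inner_join(get_sentiments("loughran")) |>
  group_by(gutenberg_id) |>
  mutate(index = row_number() %/% 80) |>  # restart chunk numbering for each book
  ungroup() |>
  count(gutenberg_id, index, sentiment) |>
  rename(book_id = gutenberg_id) |>
  spread(sentiment, n, fill = 0) |>
  mutate(sentiment = positive - negative)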
## Joining, by = "word"
ggplot(loughran_hgwells, aes(index, sentiment, fill = book_id)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book_id, ncol = 2, scales = "free_x")
loughran_word_counts <- tidy_hgwells |>
  inner_join(get_sentiments("loughran")) |>
  count(word, sentiment, sort = TRUE) |>
  ungroup()
## Joining, by = "word"
loughran_word_counts |>
  group_by(sentiment) |>
  top_n(10) |>
  ungroup() |>
  mutate(word = reorder(word, n)) |>
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment",
       x = NULL) +
  coord_flip()
## Selecting by n
The sentiment analysis of the four H.G. Wells novels showed predominantly negative sentiment across the NRC, Bing, Loughran, and AFINN lexicons. It is tempting to attribute this to Wells having lived through two world wars, but all four books were published between 1895 and 1898, well before either war, so that explanation does not hold for this corpus. A more likely reason may be the subject matter: stories of invasion, pursuit, and violent experiment lean on vocabulary that the lexicons score as negative. Historical and societal events can still shape an author’s work, and sentiment analysis offers one way to examine the emotions and attitudes a text conveys. It has clear limitations, though: word-level lexicons ignore context, negation, and irony, so the results should be read alongside other analytical tools. Overall, this analysis is an example of how text mining and sentiment analysis can offer useful insight into literature and language.