Citation

MLA citation: Silge, Julia, and David Robinson. Text Mining with R: A Tidy Approach. O'Reilly Media, 2017. Internet resource.

Introduction

In this assignment, we will perform a sentiment analysis on a corpus of four books by H.G. Wells, an English writer of the late 19th and early 20th centuries: The Time Machine, The War of the Worlds, The Invisible Man, and The Island of Doctor Moreau. The books were obtained from Project Gutenberg using the gutenbergr R package.
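For reference, the Project Gutenberg IDs used below can be looked up in gutenbergr's bundled catalog metadata. A quick sketch (gutenberg_works() filters the catalog to distinct, downloadable English-language works; the exact author string shown here follows Project Gutenberg's cataloging convention and is an assumption worth verifying):

library(gutenbergr)
library(dplyr)

# list H.G. Wells titles and their Gutenberg IDs
gutenberg_works(author == "Wells, H. G. (Herbert George)") |> 
    select(gutenberg_id, title)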

We will be using three sentiment lexicons available through the tidytext package in R: AFINN, Bing, and NRC. In addition, we will incorporate a fourth, the Loughran lexicon.
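All four are retrieved with tidytext's get_sentiments() function. A minimal sketch (the AFINN, NRC, and Loughran lexicons are distributed via the textdata package and may prompt a one-time download on first use):

library(tidytext)

get_sentiments("afinn")    # word + integer score from -5 to 5
get_sentiments("bing")     # word + positive/negative label
get_sentiments("nrc")      # word + eight emotions plus positive/negative
get_sentiments("loughran") # word + six finance-oriented categories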

To perform the sentiment analysis, we first load the necessary packages and download the corpus of H.G. Wells' books with the gutenbergr package. We then tidy the text with unnest_tokens(), which splits it into one word per row, stripping punctuation and converting the words to lowercase, and remove common stop words with an anti_join() against tidytext's stop_words data set.

Next, we join each lexicon to the tidied text to attach a sentiment score or label to every matching word. We then aggregate sentiment by grouping the words into chunks of 80 consecutive words: a single word or sentence often provides too little context, so scoring a larger unit of text gives a more stable signal, as sketched below.
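The chunking itself is just integer division of each word's row number by 80. A toy illustration (assuming the tidyverse is loaded, as in the next section):

# row_number() %/% 80 maps rows 1-79 to chunk 0, rows 80-159 to
# chunk 1, and so on, so each index covers (up to) 80 consecutive words
tibble(word = paste0("w", 1:200)) |> 
    mutate(index = row_number() %/% 80) |> 
    count(index)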

Finally, we plot the per-chunk sentiment scores with ggplot2 to visualize patterns and trends. This gives insight into the overall sentiment of the corpus and highlights notable shifts in sentiment across the course of each narrative.

Loading in Libraries

library(tidytext)
library(janeaustenr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(stringr)
library(gutenbergr)
library(wordcloud)
## Loading required package: RColorBrewer
library(lexicon)

Using a Different Corpus

# words the NRC lexicon tags with the emotion "joy"
nrc_joy <- get_sentiments("nrc") |> 
    filter(sentiment == "joy")
# 35 = The Time Machine, 36 = The War of the Worlds,
# 5230 = The Invisible Man, 159 = The Island of Doctor Moreau
hgwells <- gutenberg_download(c(35, 36, 5230, 159))
## Determining mirror for Project Gutenberg from https://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org
tidy_hgwells <- hgwells |> 
    unnest_tokens(word, text) |> 
    anti_join(stop_words) |> 
    # replace the numeric Gutenberg IDs with the book titles
    mutate(gutenberg_id = case_when(gutenberg_id == 35   ~ "The Time Machine",
                                    gutenberg_id == 36   ~ "The War of the Worlds",
                                    gutenberg_id == 159  ~ "The Island of Doctor Moreau",
                                    gutenberg_id == 5230 ~ "The Invisible Man"))
## Joining, by = "word"
tidy_hgwells |> 
    count(word, sort = TRUE)
## # A tibble: 11,811 × 2
##    word       n
##    <chr>  <int>
##  1 time     461
##  2 people   302
##  3 door     260
##  4 heard    249
##  5 black    232
##  6 stood    229
##  7 white    224
##  8 hand     218
##  9 kemp     213
## 10 eyes     210
## # … with 11,801 more rows
tidy_hgwells |> 
    inner_join(nrc_joy) |> 
    count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 309 × 2
##    word        n
##    <chr>   <int>
##  1 found     200
##  2 green     104
##  3 beach      75
##  4 sun        73
##  5 save       68
##  6 food       67
##  7 feeling    49
##  8 dawn       38
##  9 rising     35
## 10 god        34
## # … with 299 more rows
hgwells_sentiment <- tidy_hgwells |> 
    inner_join(get_sentiments("bing")) |> 
    # compute the 80-word chunk index within each book, so the index
    # restarts at 0 for every novel and no chunk spans two books
    group_by(gutenberg_id) |> 
    mutate(index = row_number() %/% 80) |> 
    ungroup() |> 
    count(gutenberg_id, index, sentiment) |> 
    rename(book_id = gutenberg_id) |> 
    spread(sentiment, n, fill = 0) |> 
    mutate(sentiment = positive - negative) 
## Joining, by = "word"
ggplot(hgwells_sentiment, aes(index, sentiment, fill = book_id)) +
    geom_col(show.legend = FALSE) +
    facet_wrap(~book_id, ncol = 2, scales = "free_x")

the_invisible_man <- tidy_hgwells |> 
    filter(gutenberg_id == "The Invisible Man")
afinn <- the_invisible_man |> 
  inner_join(get_sentiments("afinn")) |> 
  group_by(index = row_number() %/% 80) |> 
  # AFINN scores each word from -5 to 5, so chunk sentiment is a sum
  summarise(sentiment = sum(value)) |> 
  mutate(method = "AFINN")
## Joining, by = "word"
bing_and_nrc <- bind_rows(
    the_invisible_man |> 
        inner_join(get_sentiments("bing")) |> 
        mutate(method = "Bing et al."),
    the_invisible_man |> 
        inner_join(get_sentiments("nrc") |> 
                       filter(sentiment %in% c("positive", "negative"))) |> 
        mutate(method = "NRC")) |> 
    # the chunk index must restart for each method; a bare row_number()
    # inside count() would keep counting across the bound rows, so the
    # NRC chunks would start where the Bing chunks end
    group_by(method) |> 
    mutate(index = row_number() %/% 80) |> 
    ungroup() |> 
    count(method, index, sentiment) |> 
    spread(sentiment, n, fill = 0) |> 
    mutate(sentiment = positive - negative)
## Joining, by = "word"
## Joining, by = "word"
bind_rows(afinn, bing_and_nrc) |> 
    ggplot(aes(index, sentiment, fill = method)) +
    geom_col(show.legend = FALSE) +
    facet_wrap(~method, ncol = 1, scales = "free_y")
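The three methods track broadly similar swings but differ in absolute magnitude, partly because the lexicons differ in size and in their balance of negative versus positive entries. A quick check, following Silge and Robinson:

# compare the number of positive and negative entries in each lexicon
get_sentiments("nrc") |> 
    filter(sentiment %in% c("positive", "negative")) |> 
    count(sentiment)
get_sentiments("bing") |> 
    count(sentiment)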

bing_word_counts <- tidy_hgwells |> 
    inner_join(get_sentiments("bing")) |> 
    count(word, sentiment, sort = TRUE) |> 
    ungroup()
## Joining, by = "word"
bing_word_counts |> 
    group_by(sentiment) |> 
    top_n(10) |> 
    ungroup() |> 
    mutate(word = reorder(word, n)) |> 
    ggplot(aes(word, n, fill = sentiment)) +
    geom_col(show.legend = FALSE) +
    facet_wrap(~sentiment, scales = "free_y") +
    labs(y = "Contribution to sentiment",
         x = NULL) +
    coord_flip()
## Selecting by n

tidy_hgwells |> 
    count(word) |> 
    with(wordcloud(word, n, max.words = 100))
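A related visualization from Silge and Robinson is a comparison cloud, which splits the cloud by Bing sentiment class. A sketch (assuming the reshape2 package is installed; it is not loaded above):

library(reshape2)

# cast word counts to a matrix with one column per sentiment, then
# size each word by its frequency within its sentiment class
tidy_hgwells |> 
    inner_join(get_sentiments("bing")) |> 
    count(word, sentiment, sort = TRUE) |> 
    acast(word ~ sentiment, value.var = "n", fill = 0) |> 
    comparison.cloud(colors = c("gray20", "gray80"), max.words = 100)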

Incorporating a New Lexicon: Loughran

loughran_hgwells <- tidy_hgwells |> 
    inner_join(get_sentiments("loughran")) |> 
    # as before, compute the 80-word chunk index within each book
    group_by(gutenberg_id) |> 
    mutate(index = row_number() %/% 80) |> 
    ungroup() |> 
    count(gutenberg_id, index, sentiment) |> 
    rename(book_id = gutenberg_id) |> 
    spread(sentiment, n, fill = 0) |> 
    mutate(sentiment = positive - negative)
## Joining, by = "word"
ggplot(loughran_hgwells, aes(index, sentiment, fill = book_id)) +
    geom_col(show.legend = FALSE) +
    facet_wrap(~book_id, ncol = 2, scales = "free_x")

loughran_word_counts <- tidy_hgwells |> 
    inner_join(get_sentiments("loughran")) |> 
    count(word, sentiment, sort = TRUE) |> 
    ungroup()
## Joining, by = "word"
loughran_word_counts |> 
    group_by(sentiment) |> 
    top_n(10) |> 
    ungroup() |> 
    mutate(word = reorder(word, n)) |> 
    ggplot(aes(word, n, fill = sentiment)) +
    geom_col(show.legend = FALSE) +
    facet_wrap(~sentiment, scales = "free_y") +
    labs(y = "Contribution to sentiment",
         x = NULL) +
    coord_flip()
## Selecting by n
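Because the Loughran lexicon was designed for financial documents, it assigns words to six categories (negative, positive, uncertainty, litigious, constraining, and superfluous) rather than a simple binary, so only the positive and negative columns feed the net-sentiment score above. A quick tally shows how the matched words fall across the categories (a supplementary sketch, not part of the plots above):

# distribution of matched words across the six Loughran categories
tidy_hgwells |> 
    inner_join(get_sentiments("loughran")) |> 
    count(sentiment, sort = TRUE)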

Conclusion

The sentiment analysis of the four H.G. Wells novels showed predominantly negative sentiment under all four lexicons: AFINN, Bing, NRC, and Loughran. It is tempting to attribute this to the turbulence of Wells's lifetime, but all four novels were published in the 1890s, decades before either world war; the dark, speculative subject matter of the books themselves (alien invasion, vivisection, social isolation) is a more direct explanation. The exercise still illustrates how sentiment analysis can surface the emotions and attitudes conveyed in an author's writing. It is important to note, however, that lexicon-based sentiment analysis has real limitations: it scores words out of context and misses negation, irony, and domain-specific usage, so it should be used alongside other analytical tools to gain a more comprehensive understanding of a text. Overall, this analysis serves as an example of how text mining and sentiment analysis can provide valuable insights into literature and language.