Assignment

Use the primary example code from Chapter 2 of “Text Mining with R” by Julia Silge & David Robinson and extend the code in two ways: apply it to a different corpus and to a different sentiment lexicon.

Text Mining with R

Example code taken from “Text Mining with R: A Tidy Approach” by Julia Silge and David Robinson: https://www.tidytextmining.com/sentiment.html

Load Libraries

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.2 --
## v ggplot2 3.4.0      v purrr   0.3.4 
## v tibble  3.1.6      v dplyr   1.0.10
## v tidyr   1.2.1      v stringr 1.4.0 
## v readr   2.1.2      v forcats 0.5.1 
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(tidytext)
library(janeaustenr)
library(dplyr)
library(stringr)
library(tidyr)
library(ggplot2)
library(hcandersenr)
library(SentimentAnalysis)
## 
## Attaching package: 'SentimentAnalysis'
## 
## The following object is masked from 'package:base':
## 
##     write
library(textdata)

Example Code from Chapter 2

Loading Jane Austen collection

tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text,
                                     regex("^chapter [\\divxlc]",
                                           ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)

Examining Joy Words in the book Emma

nrc_joy <- get_sentiments("nrc") %>% 
  filter(sentiment == "joy")

tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 301 x 2
##    word          n
##    <chr>     <int>
##  1 good        359
##  2 friend      166
##  3 hope        143
##  4 happy       125
##  5 love        117
##  6 deal         92
##  7 found        92
##  8 present      89
##  9 kind         82
## 10 happiness    76
## # ... with 291 more rows

Using “bing” Lexicon for Sentiment Analysis

jane_austen_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% 
  mutate(sentiment = positive - negative)
## Joining, by = "word"
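
The book follows this step by plotting each novel’s sentiment trajectory across the 80-line chunks; a sketch of that plot, taken from the same chapter:

ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")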

Comparing Sentiment Counts in the “nrc” & “bing” Lexicons

get_sentiments("nrc") %>% 
  filter(sentiment %in% c("positive", "negative")) %>% 
  count(sentiment)
## # A tibble: 2 x 2
##   sentiment     n
##   <chr>     <int>
## 1 negative   3316
## 2 positive   2308
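
To complete the comparison the heading promises, the same count can be run on the “bing” lexicon; the exact numbers depend on the installed lexicon version:

get_sentiments("bing") %>%
  count(sentiment)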

Most Common Positive and Negative Words

bing_word_counts <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
## Joining, by = "word"
bing_word_counts
## # A tibble: 2,585 x 3
##    word     sentiment     n
##    <chr>    <chr>     <int>
##  1 miss     negative   1855
##  2 well     positive   1523
##  3 good     positive   1380
##  4 great    positive    981
##  5 like     positive    725
##  6 better   positive    639
##  7 enough   positive    613
##  8 happy    positive    534
##  9 love     positive    495
## 10 pleasure positive    462
## # ... with 2,575 more rows
bing_word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>% 
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Contribution to sentiment",
       y = NULL)
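
Note that “miss” dominates the negative list above even though, in Austen’s novels, it is mostly a title for young unmarried women. The book handles this anomaly by adding “miss” to a custom stop-word list; a sketch of that fix from the same chapter:

custom_stop_words <- bind_rows(tibble(word = c("miss"),
                                      lexicon = c("custom")),
                               stop_words)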


My Analysis

I chose to explore a package that contains all 157 of H.C. Andersen’s fairy tales. The dataset contains his work in five languages: Danish, German, English, Spanish, and French. To begin my analysis, I filtered the dataset for the English versions of the texts.

hcandersenr Dataset

Credit to Emil Hvitfeldt for the hcandersenr package: https://github.com/emilhvitfeldt/hcandersenr

Downloading Dataset

# filter for the English version of the texts
hcane <- hca_fairytales() %>%
  group_by(book) %>%
  filter(language == "English") %>%
  mutate(linenumber = row_number()) %>%
  ungroup()

Tokenization

Now that we have our text in the language of interest, we begin the process of tokenization.

tidy_hcane <- hcane %>%
  unnest_tokens(word, text)

Removal of Stop Words

data(stop_words)
tidy_hcane <- tidy_hcane %>%
  anti_join(stop_words)
## Joining, by = "word"
tidy_hcane %>%
  group_by(book) %>%
  count(word, sort = TRUE)
## # A tibble: 74,611 x 3
## # Groups:   book [156]
##    book                        word        n
##    <chr>                       <chr>   <int>
##  1 The ice maiden              rudy      194
##  2 The snow queen              gerda     114
##  3 Little Claus and big Claus  claus     100
##  4 The ice maiden              babette    88
##  5 A story from the sand dunes jörgen     87
##  6 The shadow                  shadow     81
##  7 The bottle neck             bottle     76
##  8 The fir tree                tree       76
##  9 The gate key                key        72
## 10 The marsh king's daughter   stork      72
## # ... with 74,601 more rows
tidy_hcane
## # A tibble: 136,919 x 4
##    book           language linenumber word     
##    <chr>          <chr>         <int> <chr>    
##  1 The tinder-box English           1 soldier  
##  2 The tinder-box English           1 marching 
##  3 The tinder-box English           1 road     
##  4 The tinder-box English           1 left     
##  5 The tinder-box English           1 left     
##  6 The tinder-box English           2 knapsack 
##  7 The tinder-box English           2 sword    
##  8 The tinder-box English           2 wars     
##  9 The tinder-box English           3 returning
## 10 The tinder-box English           3 home     
## # ... with 136,909 more rows

Downloading Lexicon for Assignment

For my analysis, I decided to use the DictionaryGI lexicon from the SentimentAnalysis package. Upon inspection, I can see that DictionaryGI is a list of two character vectors: 2005 negative words and 1637 positive words. The unequal lengths pose a problem if I wish to combine them into a dataframe.

data(DictionaryGI)
str(DictionaryGI)
## List of 2
##  $ negative: chr [1:2005] "abandon" "abandonment" "abate" "abdicate" ...
##  $ positive: chr [1:1637] "abide" "ability" "able" "abound" ...

Cleaning Lexicon Dataframe

To avoid errors when turning the list into a dataframe, I padded the shorter element so that both vectors have the same length.

length(DictionaryGI$positive) <- length(DictionaryGI$negative)
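
For context, assigning a vector a longer length pads the new positions with NA, which is where the NA values cleaned up later come from. A toy illustration:

x <- c("a", "b")
length(x) <- 4
x
## [1] "a" "b" NA  NA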

I then turned the list object into a dataframe.

DictionaryGI_df <- as.data.frame(DictionaryGI)

I realized I needed to reshape the dataframe for my analysis later: the lexicon should be in long format, with the words in one column and the matching sentiment label in another (the same shape get_sentiments() returns). My approach was to separate the words by sentiment, add a column recording the sentiment for each group, rename the columns, and then bind the two together. Once bound, I removed the NA values introduced by the padding.

negative <- DictionaryGI_df$negative
negative <- as.data.frame(negative)
negative <- negative %>% 
  mutate(sentiment = "negative") %>%
  rename("word"="negative")
positive <- DictionaryGI_df$positive
positive <- as.data.frame(positive)
positive <- positive %>% 
  mutate(sentiment="positive") %>%
  rename("word"="positive")
Lex_DictionaryGI <- bind_rows(positive, negative)

Lex_DictionaryGI <- Lex_DictionaryGI %>%
  na.omit()
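
As a design note, the same long-format lexicon can be built in one step from the freshly loaded list, with no padding or NA cleanup needed; a sketch (re-running data(DictionaryGI) so both vectors keep their true lengths):

data(DictionaryGI)
Lex_DictionaryGI_alt <- bind_rows(
  tibble(word = DictionaryGI$negative, sentiment = "negative"),
  tibble(word = DictionaryGI$positive, sentiment = "positive")
)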

Final Dataset for Analysis

books_sentiment <- tidy_hcane %>%
  inner_join(Lex_DictionaryGI) %>%
  group_by(book) %>%
  count(sentiment)
## Joining, by = "word"

Top 5 Books with Most Negative & Positive Words

Before creating the visuals, I separated the counts for H.C. Andersen’s texts by sentiment value, arranged them in descending order, and limited the results to the top 5 for each sentiment.

books_pos <- books_sentiment %>%
  filter(sentiment=="positive") %>%
  arrange(desc(n)) %>%
  head(5)
books_neg <- books_sentiment %>%
  filter(sentiment == "negative") %>%
  arrange(desc(n)) %>%
  head(5)
top_five <- bind_rows(books_pos, books_neg)
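
An equivalent, more compact route to both top-5 lists is slice_max() within sentiment groups; a sketch (note that slice_max() keeps ties by default, so it can return more than 5 rows per group):

top_five_alt <- books_sentiment %>%
  group_by(sentiment) %>%
  slice_max(n, n = 5) %>%
  ungroup()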

An interesting observation: setting aside the bottom two entries, the top 3 books containing the most negative words also appear, in the same order, among the books containing the most positive words.

ggplot(top_five, mapping = aes(x = reorder(book, desc(n)), y = n)) +
  geom_col() +
  facet_grid(~sentiment, scales = "free") +
  theme(axis.text.x = element_text(size = 6))

The DictionaryGI lexicon itself skews negative: it contains more negative entries than positive ones.

Lex_DictionaryGI %>%
  count(sentiment)
##   sentiment    n
## 1  negative 2005
## 2  positive 1637

Looking at the most frequent words that appeared in the text alongside their paired sentiment values, I was surprised by some of the labels, such as “stood” being assigned as positive and “hand” being listed under both sentiments. I am curious about the contexts in which these words appear; that context likely shaped how the lexicon’s authors assigned their sentiments.

dictGI_count <- tidy_hcane %>%
  inner_join(Lex_DictionaryGI) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
## Joining, by = "word"
dictGI_count
## # A tibble: 1,654 x 3
##    word  sentiment     n
##    <chr> <chr>     <int>
##  1 stood positive    532
##  2 home  positive    435
##  3 heart positive    355
##  4 lay   negative    347
##  5 poor  negative    336
##  6 hand  negative    285
##  7 hand  positive    285
##  8 light positive    257
##  9 round positive    250
## 10 dead  negative    243
## # ... with 1,644 more rows

We can see in the visuals that, despite the lexicon skewing negative, a positive word (“stood”) appears the most frequently of all when comparing both sentiments.

dictGI_count %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>% 
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Sentiment Analysis",
       y = NULL)