Using the primary example code from Chapter 2 of “Text Mining with R” by Julia Silge & David Robinson, I extended the code in two ways:
Work with a different corpus of your choosing
Incorporate at least one additional sentiment lexicon
Example code taken from “Text Mining with R: A Tidy Approach” by Julia Silge and David Robinson, https://www.tidytextmining.com/sentiment.html
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.2 --
## v ggplot2 3.4.0 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.10
## v tidyr 1.2.1 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(tidytext)
library(janeaustenr)
library(dplyr)
library(stringr)
library(tidyr)
library(ggplot2)
library(hcandersenr)
library(SentimentAnalysis)
##
## Attaching package: 'SentimentAnalysis'
##
## The following object is masked from 'package:base':
##
## write
library(textdata)
tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text,
                                     regex("^chapter [\\divxlc]",
                                           ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)
nrc_joy <- get_sentiments("nrc") %>%
  filter(sentiment == "joy")
tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 301 x 2
## word n
## <chr> <int>
## 1 good 359
## 2 friend 166
## 3 hope 143
## 4 happy 125
## 5 love 117
## 6 deal 92
## 7 found 92
## 8 present 89
## 9 kind 82
## 10 happiness 76
## # ... with 291 more rows
jane_austen_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
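In the book, this data frame feeds directly into a plot of net sentiment across the narrative arc of each novel; the accompanying plotting code from the same chapter is:
ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")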
get_sentiments("nrc") %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  count(sentiment)
## # A tibble: 2 x 2
## sentiment n
## <chr> <int>
## 1 negative 3316
## 2 positive 2308
bing_word_counts <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
## Joining, by = "word"
bing_word_counts
## # A tibble: 2,585 x 3
## word sentiment n
## <chr> <chr> <int>
## 1 miss negative 1855
## 2 well positive 1523
## 3 good positive 1380
## 4 great positive 981
## 5 like positive 725
## 6 better positive 639
## 7 enough positive 613
## 8 happy positive 534
## 9 love positive 495
## 10 pleasure positive 462
## # ... with 2,575 more rows
bing_word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Contribution to sentiment",
       y = NULL)
I chose to explore the hcandersenr package, which contains H.C. Andersen’s fairy tale works in five languages: Danish, German, English, Spanish, and French (the English data covers 156 distinct tales, as the grouped counts below show). To begin my analysis, I filtered the dataset for the English version of the texts.
Credit to Emil Hvitfeldt for the hcandersenr package: https://github.com/emilhvitfeldt/hcandersenr
# filter for the English version of the texts
hcane <- hca_fairytales() %>%
  group_by(book) %>%
  filter(language == "English") %>%
  mutate(linenumber = row_number()) %>%
  ungroup()
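As a quick sanity check (a small sketch; it assumes hca_fairytales() returns one row per line of text with book, language, and text columns, as the pipeline above implies), we can confirm how many distinct tales survive the language filter:
# number of distinct English-language tales
hcane %>%
  distinct(book) %>%
  nrow()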
Now that we have our text in the language of interest, we tokenize it and remove common stop words.
tidy_hcane <- hcane %>%
  unnest_tokens(word, text)
data(stop_words)
tidy_hcane <- tidy_hcane %>%
  anti_join(stop_words)
## Joining, by = "word"
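To gauge how much the stop-word filter removes, a quick before-and-after token count works; a minimal sketch (output not shown):
# tokens before vs. after removing stop words
n_before <- hcane %>% unnest_tokens(word, text) %>% nrow()
n_after <- nrow(tidy_hcane)
n_before - n_after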
tidy_hcane %>%
  group_by(book) %>%
  count(word, sort = TRUE)
## # A tibble: 74,611 x 3
## # Groups: book [156]
## book word n
## <chr> <chr> <int>
## 1 The ice maiden rudy 194
## 2 The snow queen gerda 114
## 3 Little Claus and big Claus claus 100
## 4 The ice maiden babette 88
## 5 A story from the sand dunes jörgen 87
## 6 The shadow shadow 81
## 7 The bottle neck bottle 76
## 8 The fir tree tree 76
## 9 The gate key key 72
## 10 The marsh king's daughter stork 72
## # ... with 74,601 more rows
tidy_hcane
## # A tibble: 136,919 x 4
## book language linenumber word
## <chr> <chr> <int> <chr>
## 1 The tinder-box English 1 soldier
## 2 The tinder-box English 1 marching
## 3 The tinder-box English 1 road
## 4 The tinder-box English 1 left
## 5 The tinder-box English 1 left
## 6 The tinder-box English 2 knapsack
## 7 The tinder-box English 2 sword
## 8 The tinder-box English 2 wars
## 9 The tinder-box English 3 returning
## 10 The tinder-box English 3 home
## # ... with 136,909 more rows
For my analysis, I decided to use the DictionaryGI lexicon from the SentimentAnalysis package. Upon inspection, I can see that DictionaryGI is stored as a list of two character vectors. There are 2,005 negative words and 1,637 positive words; the unequal lengths pose a problem if I wish to combine them into a data frame.
data(DictionaryGI)
str(DictionaryGI)
## List of 2
## $ negative: chr [1:2005] "abandon" "abandonment" "abate" "abdicate" ...
## $ positive: chr [1:1637] "abide" "ability" "able" "abound" ...
To avoid errors when turning the list into a data frame, I padded the shorter positive vector with NA values so both vectors have the same length.
length(DictionaryGI$positive) <- length(DictionaryGI$negative)
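This pads the positive vector with trailing NA values, which we can verify: given the lengths in the str() output above, 2005 - 1637 = 368 NAs should now be present.
# count the NA padding added to the positive vector
sum(is.na(DictionaryGI$positive))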
I then turned the list object into a data frame.
DictionaryGI_df <- as.data.frame(DictionaryGI)
I realized I needed to reshape the data frame for my analysis later: the words needed to sit in one column, with their sentiment (“positive” or “negative”) in another. My approach was to separate the words by sentiment, add a column recording the sentiment of each group, rename the columns, and then bind the two together. Once joined, I removed the NA values introduced by the padding.
negative <- DictionaryGI_df$negative
negative <- as.data.frame(negative)
negative <- negative %>%
  mutate(sentiment = "negative") %>%
  rename("word" = "negative")
positive <- DictionaryGI_df$positive
positive <- as.data.frame(positive)
positive <- positive %>%
  mutate(sentiment = "positive") %>%
  rename("word" = "positive")
Lex_DictionaryGI <- bind_rows(positive, negative)
Lex_DictionaryGI <- Lex_DictionaryGI %>%
  na.omit()
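As an aside, the same tidy lexicon can be built in one step directly from the list, with no intermediate data frames; a minimal sketch (the _alt name is mine), which relies on na.omit() to drop the padding rows if run after the length() assignment above:
Lex_DictionaryGI_alt <- tibble(
  word = c(DictionaryGI$positive, DictionaryGI$negative),
  sentiment = rep(c("positive", "negative"),
                  times = c(length(DictionaryGI$positive),
                            length(DictionaryGI$negative)))
) %>%
  na.omit()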
books_sentiment <- tidy_hcane %>%
  inner_join(Lex_DictionaryGI) %>%
  group_by(book) %>%
  count(sentiment)
## Joining, by = "word"
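Following the same pattern as the Austen example, we could also compute a net sentiment score per tale; a sketch (the books_net name and net column are mine):
books_net <- books_sentiment %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(net = positive - negative)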
Before creating the visuals, I separated H.C. Andersen’s tales by sentiment and arranged them in descending order, limiting the results to the top five books for each sentiment.
books_pos <- books_sentiment %>%
  filter(sentiment == "positive") %>%
  arrange(desc(n)) %>%
  head(5)
books_neg <- books_sentiment %>%
  filter(sentiment == "negative") %>%
  arrange(desc(n)) %>%
  head(5)
top_five <- bind_rows(books_pos, books_neg)
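An equivalent, slightly more idiomatic route to the same ten rows is slice_max(); a sketch (with_ties = FALSE mimics head(5) when counts tie):
top_five_alt <- books_sentiment %>%
  ungroup() %>%
  group_by(sentiment) %>%
  slice_max(n, n = 5, with_ties = FALSE) %>%
  ungroup()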
An interesting observation: with the exception of the bottom two, the top three books containing the most negative words also appear, in the same order, among the books containing the most positive words.
ggplot(top_five, mapping = aes(x = reorder(book, desc(n)), y = n)) +
  geom_col() +
  facet_grid(~sentiment, scales = "free") +
  theme(axis.text.x = element_text(size = 6))
The DictionaryGI lexicon itself skews negative: it contains more negative entries than positive ones.
Lex_DictionaryGI %>%
  count(sentiment)
## sentiment n
## 1 negative 2005
## 2 positive 1637
Looking at the most frequent words that appeared in the text along with their paired sentiment values, I was surprised by some of the labeling, such as “stood” being assigned as positive and “hand” as negative (it in fact appears under both sentiments). I’m interested in the context in which these words appear; a unigram lexicon cannot account for context, which likely explains some of these assignments.
dictGI_count <- tidy_hcane %>%
  inner_join(Lex_DictionaryGI) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
## Joining, by = "word"
dictGI_count
## # A tibble: 1,654 x 3
## word sentiment n
## <chr> <chr> <int>
## 1 stood positive 532
## 2 home positive 435
## 3 heart positive 355
## 4 lay negative 347
## 5 poor negative 336
## 6 hand negative 285
## 7 hand positive 285
## 8 light positive 257
## 9 round positive 250
## 10 dead negative 243
## # ... with 1,644 more rows
We can see in the visuals that, despite the lexicon skewing negative, a positive word (“stood”) appears most frequently in the text when comparing both sentiments.
dictGI_count %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Contribution to sentiment",
       y = NULL)
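For comparison, the same word-level counts could be rerun with the bing lexicon used in the Austen example; a minimal sketch (the bing_hcane_counts name is mine). Contrasting the two lexicons’ top words would show how lexicon-specific labels like “stood” are.
bing_hcane_counts <- tidy_hcane %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE)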