In Text Mining with R, Chapter 2 looks at Sentiment Analysis. In this assignment, you should start by getting the primary example code from chapter 2 working in an R Markdown document. You should provide a citation to this base code. You’re then asked to extend the code in two ways:
1. Work with a different corpus of your choosing, and
2. Incorporate at least one additional sentiment lexicon (possibly from another R package that you’ve found through research).
As usual, please submit links to both an .Rmd file posted in your GitHub repository and to your code on rpubs.com. You may work in a small team on this assignment.
#install.packages("tidytext")
#install.packages("janeaustenr")
library(tidytext)
library(janeaustenr)
library(dplyr)
library(ggplot2)
library(stringr)
library(tidyr)
# Tidy the Austen corpus: one row per word, keeping track of the line
# number and chapter each word came from.
tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)
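Just as a quick sanity check (not part of the book’s code), a peek at the result confirms the one-token-per-row structure that the joins below rely on:

head(tidy_books)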
bing <- get_sentiments("bing")

# Net sentiment per 80-line chunk of each book, using the Bing lexicon.
bing_sentiment <- tidy_books %>%
  inner_join(bing, by = "word") %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)
ggplot(bing_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, scales = "free_x") +
  labs(title = "Sentiment analysis using Bing lexicon")
This code example is adapted from Text Mining with R by Silge and
Robinson (2017). Full text available at https://www.tidytextmining.com.
I had a lot of trouble getting the data. I tried to use gutenbergr, but the package fought me on installation, and once I finally got it installed, it just wouldn’t download any of the data. I ended up pulling down the .txt file, hosting it on my GitHub, and reading it in from there. Sorry it’s a little inelegant, but it works.
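For reference, this is roughly what I was attempting with gutenbergr before giving up (ID 84 is Frankenstein in the Project Gutenberg catalog); it kept coming back empty for me, but it may work in your environment:

# install.packages("gutenbergr")
# library(gutenbergr)
# frankenstein <- gutenberg_download(84)  # returned zero rows for me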
library(readr)

# Read the raw Frankenstein text (Project Gutenberg #84) from my GitHub
# repo, one line of text per row.
url <- "https://raw.githubusercontent.com/tcgraham-data/data-607-week-10/refs/heads/main/pg84.txt"
text_lines <- read_lines(url)
text_df <- data.frame(line = seq_along(text_lines), text = text_lines)
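One optional cleanup step, sketched here under the assumption that the file follows the usual Project Gutenberg layout (licensing boilerplate before a "*** START OF ..." marker and after a "*** END OF ..." marker): trim those lines before tokenizing so the boilerplate doesn’t leak into the sentiment counts. I’ve left it commented out so the results below match what I actually ran.

# start <- grep("\\*\\*\\* START OF", text_lines)
# end   <- grep("\\*\\*\\* END OF", text_lines)
# if (length(start) == 1 && length(end) == 1) {
#   text_lines <- text_lines[(start + 1):(end - 1)]
#   text_df <- data.frame(line = seq_along(text_lines), text = text_lines)
# }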
# One row per word, keeping the original line number so the chunks below
# really are 80 lines of text (not 80 words). unnest_tokens() lowercases
# tokens by default, so no separate tolower() step is needed.
tidy_text <- text_df %>%
  unnest_tokens(word, text) %>%
  rename(linenumber = line)
bing <- get_sentiments("bing")

# Net Bing sentiment per 80-line chunk of the novel.
bing_sentiment <- tidy_text %>%
  inner_join(bing, by = "word") %>%
  count(index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0)
if (!"positive" %in% names(bing_sentiment)) {
bing_sentiment$positive <- 0
}
if (!"negative" %in% names(bing_sentiment)) {
bing_sentiment$negative <- 0
}
bing_sentiment <- bing_sentiment %>%
mutate(sentiment = positive - negative)
ggplot(bing_sentiment, aes(index, sentiment)) +
  geom_col(fill = "darkred") +
  labs(
    title = "Sentiment in Frankenstein (from GitHub .txt)",
    x = "Index (every 80 lines)",
    y = "Net Sentiment"
  )
# install.packages("textdata")

# The AFINN lexicon (via the textdata package) scores words from -5 to +5
# instead of tagging them positive/negative.
library(textdata)
afinn <- get_sentiments("afinn")

# Since AFINN assigns each word an integer score, summing the scores per
# 80-line chunk (via count()'s wt argument) gives net sentiment directly.
afinn_sentiment <- tidy_text %>%
  inner_join(afinn, by = "word") %>%
  count(index = linenumber %/% 80, wt = value, name = "sentiment")
ggplot(afinn_sentiment, aes(index, sentiment)) +
  geom_col(fill = "steelblue") +
  labs(
    title = "AFINN Sentiment in Frankenstein",
    x = "Index (every 80 lines)",
    y = "Net Sentiment Score"
  )
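To see how the two lexicons compare on the same novel, here’s a rough side-by-side (just a sketch reusing the bing_sentiment and afinn_sentiment objects built above; this isn’t part of the original chapter 2 code) that stacks the per-chunk scores and facets by lexicon:

# Put both per-chunk scores in one data frame, labeled by lexicon.
comparison <- bind_rows(
  bing_sentiment %>% select(index, sentiment) %>% mutate(lexicon = "Bing"),
  afinn_sentiment %>% mutate(lexicon = "AFINN")
)

ggplot(comparison, aes(index, sentiment, fill = lexicon)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~lexicon, ncol = 1, scales = "free_y") +
  labs(title = "Bing vs. AFINN sentiment in Frankenstein",
       x = "Index (every 80 lines)", y = "Net sentiment")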
For what appeared to be a pretty straightforward assignment, I ran into some surprising issues. I did some quick research and learned about the gutenbergr package, which sources open-source literature. At first I had a tough time loading it due to some internet issues. Then, once I did get it to load, everything I downloaded came in as an empty dataset. Ultimately, I had to download the .txt file and host it on my GitHub in order to complete the assignment. But it’s all here now!