The approach to extending the code from the chapter will be:
Get the code working as in the chapter, with the lexicons included in tidytext and the provided Jane Austen corpus.
Load the lexicon R package and use a few of its included lexicons.
Load onto a cloud server a few books in .txt format, which we will use as the corpus to analyze.
For this assignment we will use “The Scarlet Letter”, but we also downloaded and tested “War and Peace”, “Ulysses”, and “The Great Gatsby”.
rm(list=ls())
library(janeaustenr)
library(tidytext)
library(tidyverse)
# dplyr, stringr, tidyr, and ggplot2 are already attached via tidyverse,
# so the individual calls below are kept only for reference.
#library(dplyr)
#library(stringr)
#library(tidyr)
#library(ggplot2)
We will take a look at the three lexicons available through tidytext:
head(get_sentiments("afinn"))
## # A tibble: 6 x 2
## word value
## <chr> <dbl>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
head(get_sentiments("bing"))
## # A tibble: 6 x 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
head(get_sentiments("nrc"))
## # A tibble: 6 x 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
The three offer somewhat different approaches to rating a word and its sentiment. We need to consider each one’s range of values and categories when analyzing data.
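To make those differences concrete, here is a quick look at each lexicon’s scale (a minimal sketch using only get_sentiments(); output not shown):
range(get_sentiments("afinn")$value)       # integer scores from -5 to +5
unique(get_sentiments("bing")$sentiment)   # two labels: "negative", "positive"
unique(get_sentiments("nrc")$sentiment)    # ten labels: eight emotions plus "negative"/"positive"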
Let’s do some basic tidying on the corpus of Jane Austen’s books. The last step is to tokenize each word in the corpus.
tidy_books <- austen_books() %>%
group_by(book) %>%
mutate(
linenumber = row_number(),
chapter = cumsum(str_detect(text,
regex("^chapter [\\divxlc]",
ignore_case = TRUE)))) %>%
ungroup() %>%
unnest_tokens(word, text)
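As a quick check (not part of the chapter code), a peek at the tokenized result shows one row per word, carrying along the book, line number, and running chapter count:
head(tidy_books)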
We will use the R package lexicon from Tyler Rinker, available at https://github.com/trinker/lexicon
library(lexicon)
## Warning: package 'lexicon' was built under R version 4.1.3
# available_data('English')
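To browse what the package ships with, available_data() lists its datasets; the sketch below assumes the search string simply filters that listing (output not shown):
# Assumption: available_data() accepts a search term to narrow the listing
available_data("sentiment")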
We have uploaded a few books to our server in the cloud.
For this assignment we will use The Scarlet Letter.
First we define the URL and the available .txt files on our server.
url_1 <- "http://3.86.40.38/data607/"
url_2a <- "scarlet.txt"
url_2b <- "gatsby.txt"
url_2c <- "ulysses.txt"
url_2d <- "warandpeace.txt"
#We will start with Scarlet Letter
url12 <- paste0(url_1,url_2a)
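Switching to one of the other uploaded titles only requires changing the file name; for example (illustrative only, the rest of this report sticks with The Scarlet Letter):
# url12 <- paste0(url_1, url_2d)   # would point at warandpeace.txt instead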
Let’s read in the selected book and convert it into a tidy-format data frame.
corpus_txt <- read_lines(url12)
numberoflines <- length(corpus_txt)
corpus_df <- tibble(line = 1:numberoflines, text = corpus_txt)
tidy_corpus <- corpus_df %>%
mutate(
linenumber = row_number(),
chapter = cumsum(str_detect(text,
regex("^chapter [\\divxlc]",
ignore_case = TRUE)))) %>%
unnest_tokens(word, text)
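A small sanity check on the result (a sketch, not in the chapter code): how many chapter headings the regex detected and how many tokens the corpus contains.
tidy_corpus %>%
  summarise(chapters = max(chapter), tokens = n())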
Let’s test the new corpus using the standard Bing lexicon.
corpus_sentiment <- tidy_corpus %>%
inner_join(get_sentiments("bing")) %>%
count(index = linenumber %/% 80, sentiment) %>%
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
mutate(sentiment = positive - negative)
## Joining, by = "word"
Let’s plot the results
library(ggplot2)
ggplot(corpus_sentiment, aes(index, sentiment)) +
geom_col(show.legend = FALSE)
Now let’s use the NRC sentiment data set to assess the different sentiments that are represented across the new corpus.
tidy_corpus %>%
right_join(get_sentiments("nrc")) %>%
filter(!is.na(sentiment)) %>%
count(sentiment, sort = TRUE)
## Joining, by = "word"
## # A tibble: 10 x 2
## sentiment n
## <chr> <int>
## 1 positive 6767
## 2 negative 6038
## 3 trust 3708
## 4 fear 3108
## 5 anticipation 3011
## 6 sadness 2934
## 7 joy 2728
## 8 anger 2357
## 9 disgust 1970
## 10 surprise 1457
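Beyond the totals, we can also ask which words drive a single NRC category; a minimal follow-up sketch for “fear” (the choice of emotion is arbitrary; output not shown):
tidy_corpus %>%
  inner_join(get_sentiments("nrc") %>% filter(sentiment == "fear"),
             by = "word") %>%
  count(word, sort = TRUE) %>%
  head(10)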
Let’s take a look at the lexicons provided in the lexicon package. It offers several lexicons with words rated for sentiment.
#-1 OR +1
head(hash_sentiment_huliu)
## x y
## 1: a plus 1
## 2: abnormal -1
## 3: abolish -1
## 4: abominable -1
## 5: abominably -1
## 6: abominate -1
# head(hash_sentiment_nrc)
# From -1 to +1 in decimals
# head(hash_sentiment_senticnet)
#head(hash_sentiment_sentiword)
head(hash_sentiment_jockers)
## x y
## 1: abandon -0.75
## 2: abandoned -0.50
## 3: abandoner -0.25
## 4: abandonment -0.25
## 5: abandons -1.00
## 6: abducted -1.00
As you can see above, the lexicons are divided into two categories: lexicons that rate words simply as -1 or +1, and lexicons that give a decimal rating between -1 and +1.
For this exercise we will use hash_sentiment_huliu for comparisons to other lexicons that also assign only a positive or negative label, like Bing.
We will use hash_sentiment_jockers for comparisons to AFINN, which has a scale of ratings for sentiment.
hash_words_Scale <- hash_sentiment_jockers %>%
rename(word = x, sentiment = y)
hash_words_PosNeg <- hash_sentiment_huliu %>%
rename(word = x, sentiment = y) %>%
mutate(sentiment = replace(sentiment, sentiment == 1, "positive")) %>%
mutate(sentiment = replace(sentiment, sentiment == -1, "negative"))
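A quick sanity check on the recoding (a sketch, not part of the original chapter code): every Hu & Liu entry should now carry a “positive” or “negative” label.
hash_words_PosNeg %>%
  count(sentiment)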
Let’s take a look at the Jockers lexicon (referred to as “joker” in the code below) using the new corpus.
joker_word_counts <- tidy_corpus %>%
inner_join(hash_words_Scale) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
## Joining, by = "word"
joker_word_counts
## # A tibble: 2,756 x 3
## word sentiment n
## <chr> <dbl> <int>
## 1 child 0.6 197
## 2 like 0.5 146
## 3 good 0.75 121
## 4 new 0.8 96
## 5 well 0.8 77
## 6 better 0.8 65
## 7 physician -0.25 65
## 8 great 0.5 61
## 9 work 0.25 60
## 10 young 0.4 58
## # ... with 2,746 more rows
Let’s take a look at the Hu & Liu (huliu) lexicon using the new corpus.
huliu_word_counts <- tidy_corpus %>%
inner_join(hash_words_PosNeg) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
## Joining, by = "word"
huliu_word_counts
## # A tibble: 1,609 x 3
## word sentiment n
## <chr> <chr> <int>
## 1 like positive 146
## 2 good positive 121
## 3 well positive 77
## 4 better positive 65
## 5 great positive 61
## 6 work positive 60
## 7 smile positive 53
## 8 wild negative 51
## 9 strange negative 48
## 10 poor negative 44
## # ... with 1,599 more rows
Now let’s compare the Bing and Hu & Liu lexicons side by side on the new corpus, using the same 80-line index.
newcorpus_sentiment <- tidy_corpus %>%
inner_join(get_sentiments("bing")) %>%
count(index = linenumber %/% 80, sentiment) %>%
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
mutate(sentiment = positive - negative)
## Joining, by = "word"
ggplot(newcorpus_sentiment, aes(index, sentiment)) +
geom_col(show.legend = FALSE)
newcorpus_sentiment <- tidy_corpus %>%
inner_join(hash_words_PosNeg) %>%
count(index = linenumber %/% 80, sentiment) %>%
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
mutate(sentiment = positive - negative)
## Joining, by = "word"
ggplot(newcorpus_sentiment, aes(index, sentiment)) +
geom_col(show.legend = FALSE)
What we can see is that the results are very similar; a small difference between the plots still exists, but it is small enough that I would suspect one of these lexicons used the other as a basis.
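To put a number on that impression, here is a rough sketch correlating the two per-index series (bing_idx and huliu_idx are names introduced here, not from the chapter):
bing_idx <- tidy_corpus %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(bing = positive - negative) %>%
  select(index, bing)
huliu_idx <- tidy_corpus %>%
  inner_join(hash_words_PosNeg, by = "word") %>%
  count(index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(huliu = positive - negative) %>%
  select(index, huliu)
# A correlation near 1 would support the "very similar" observation above.
inner_join(bing_idx, huliu_idx, by = "index") %>%
  summarise(correlation = cor(bing, huliu))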
First, let’s reproduce the chapter’s comparison of AFINN, Bing, and NRC on Pride & Prejudice, adding the HULIU and JOKER methods.
pride_prejudice <- tidy_books %>%
filter(book == "Pride & Prejudice")
afinn <- pride_prejudice %>%
inner_join(get_sentiments("afinn")) %>%
group_by(index = linenumber %/% 80) %>%
summarise(sentiment = sum(value)) %>%
mutate(method = "AFINN")
## Joining, by = "word"
bing_and_nrc <- bind_rows(
pride_prejudice %>%
inner_join(get_sentiments("bing")) %>%
mutate(method = "Bing et al."),
pride_prejudice %>%
inner_join(get_sentiments("nrc") %>%
filter(sentiment %in% c("positive",
"negative"))
) %>%
mutate(method = "NRC")) %>%
count(method, index = linenumber %/% 80, sentiment) %>%
pivot_wider(names_from = sentiment,
values_from = n,
values_fill = 0) %>%
mutate(sentiment = positive - negative)
## Joining, by = "word"
## Joining, by = "word"
huliu <- pride_prejudice %>%
inner_join(hash_words_PosNeg) %>%
count(index = linenumber %/% 80, sentiment) %>%
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
mutate(sentiment = positive - negative, method="HULIU")
## Joining, by = "word"
joker <- pride_prejudice %>%
inner_join(hash_words_Scale) %>%
group_by(index = linenumber %/% 80) %>%
summarise(sentiment = sum(sentiment)) %>%
mutate(method = "JOKER")
## Joining, by = "word"
Let’s plot the whole thing now.
bind_rows(afinn,
bing_and_nrc,joker,huliu) %>%
ggplot(aes(index, sentiment, fill = method)) +
geom_col(show.legend = FALSE) +
facet_wrap(~method, ncol = 1, scales = "free_y")
Now let’s run the same four-method comparison on the new corpus (The Scarlet Letter).
afinn <- tidy_corpus %>%
inner_join(get_sentiments("afinn")) %>%
group_by(index = linenumber %/% 80) %>%
summarise(sentiment = sum(value)) %>%
mutate(method = "AFINN")
## Joining, by = "word"
bing_and_nrc <- bind_rows(
tidy_corpus %>%
inner_join(get_sentiments("bing")) %>%
mutate(method = "Bing et al."),
tidy_corpus %>%
inner_join(get_sentiments("nrc") %>%
filter(sentiment %in% c("positive",
"negative"))
) %>%
mutate(method = "NRC")) %>%
count(method, index = linenumber %/% 80, sentiment) %>%
pivot_wider(names_from = sentiment,
values_from = n,
values_fill = 0) %>%
mutate(sentiment = positive - negative)
## Joining, by = "word"
## Joining, by = "word"
huliu <- tidy_corpus %>%
inner_join(hash_words_PosNeg) %>%
count(index = linenumber %/% 80, sentiment) %>%
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
mutate(sentiment = positive - negative, method="HULIU")
## Joining, by = "word"
joker <- tidy_corpus %>%
inner_join(hash_words_Scale) %>%
group_by(index = linenumber %/% 80) %>%
summarise(sentiment = sum(sentiment)) %>%
mutate(method = "JOKER")
## Joining, by = "word"
Let’s plot the whole thing now.
bind_rows(afinn,
bing_and_nrc,joker,huliu) %>%
ggplot(aes(index, sentiment, fill = method)) +
geom_col(show.legend = FALSE) +
facet_wrap(~method, ncol = 1, scales = "free_y")
Let’s check the top contributing words under JOKER (the Jockers lexicon).
joker_word_counts <- tidy_corpus %>%
inner_join(hash_words_Scale) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
## Joining, by = "word"
joker_word_counts %>%
group_by(sentiment) %>%
slice_max(n, n = 10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(n, word, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(x = "Contribution to sentiment",
y = NULL)
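Note that the Jockers scores are numeric, so the facet above produces one panel per distinct score value. A rough alternative (a sketch; the direction column is introduced here, not part of the original analysis) bins the scores into positive/negative first:
joker_word_counts %>%
  mutate(direction = if_else(sentiment >= 0, "positive", "negative")) %>%
  group_by(direction) %>%
  slice_max(n, n = 10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = direction)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~direction, scales = "free_y") +
  labs(x = "Contribution to sentiment", y = NULL)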
Let’s check the top contributing words under HULIU (the Hu & Liu lexicon).
huliu_word_counts <- tidy_corpus %>%
inner_join(hash_words_PosNeg) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
## Joining, by = "word"
huliu_word_counts %>%
group_by(sentiment) %>%
slice_max(n, n = 10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(n, word, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(x = "Contribution to sentiment",
y = NULL)
Finally, let’s build a word cloud of the most frequent words (excluding stop words), followed by a comparison cloud of positive vs. negative words based on the Hu & Liu labels.
library(wordcloud)
tidy_corpus %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 100))
library(reshape2)
tidy_corpus %>%
inner_join(hash_words_PosNeg) %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("gray20", "gray80"),
max.words = 100)