This Week 10 assignment is based on Chapter 2 of ‘Text Mining with R’, which covers sentiment analysis.
We start by getting the primary example code from Chapter 2 working in an R Markdown document.
The assignment asks for a citation to this base code and for the code to be extended in two ways:
later in the demonstration we work with a different corpus of our choosing, and we incorporate at least one additional sentiment lexicon.
Source: AFINN from Finn Årup Nielsen; bing from Bing Liu and collaborators; nrc from Saif Mohammad and Peter Turney.
The tidytext package draws upon three main lexicons for sentiment analysis: “Bing,” “AFINN,” and “NRC.”
library(tidytext)
#The AFINN lexicon grades words between -5 and 5 (positive scores indicate positive sentiments).
get_sentiments("afinn")
## # A tibble: 2,477 x 2
## word value
## <chr> <dbl>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
## # ... with 2,467 more rows
#The NRC lexicon categorizes sentiment words into positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise and trust
get_sentiments("nrc")
## # A tibble: 13,901 x 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # ... with 13,891 more rows
#The Bing lexicon sorts words into two categories, positive or negative
get_sentiments("bing")
## # A tibble: 6,786 x 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
## # ... with 6,776 more rows
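The three lexicons differ in shape, which changes how a text is scored. The sketch below illustrates this with two toy lexicons; `toy_afinn`, `toy_bing`, and the token list are hypothetical stand-ins, not rows taken from the real lexicons. A numeric lexicon is summed, while a categorical lexicon is counted per category.

```r
library(dplyr)

# Toy stand-ins for the real lexicons (hypothetical words and scores).
toy_afinn <- tibble(word = c("happy", "awful", "good"),
                    value = c(3, -3, 3))
toy_bing  <- tibble(word = c("happy", "awful", "good"),
                    sentiment = c("positive", "negative", "positive"))

# One token per row, as unnest_tokens() would produce.
tokens <- tibble(word = c("a", "good", "day", "turned", "awful", "then", "happy"))

# AFINN-style: sum the numeric scores of the matched words.
afinn_score <- tokens %>%
  inner_join(toy_afinn, by = "word") %>%
  summarise(score = sum(value)) %>%
  pull(score)

# Bing-style: count the matched words in each category.
bing_counts <- tokens %>%
  inner_join(toy_bing, by = "word") %>%
  count(sentiment)
```

Words that are missing from a lexicon (here "a", "day", "turned", "then") simply drop out of the join and contribute nothing to either score.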
The janeaustenr package provides access to the full texts of Jane Austen’s 6 completed, published novels. The UTF-8 plain text for each novel was sourced from Project Gutenberg, processed a bit, and is ready for text analysis. Each text is in a character vector with elements of about 70 characters.
Below we use dplyr to organize and tidy the data for analysis: each line is tagged with its line number and chapter, and the text is split into one word per row. We then count the most common NRC “joy” words in Emma.
Emma, by Jane Austen, is a novel about youthful hubris and romantic misunderstandings. It is set in the fictional country village of Highbury and the surrounding estates of Hartfield, Randalls and Donwell Abbey, and involves the relationships among people from a small number of families.
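# Packages used in the rest of the analysis
library(janeaustenr)
library(dplyr)
library(stringr)
library(tidyr)
library(ggplot2)

# Tag each line with its line number and chapter, then split into one word per row
tidy_books123 <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)
# Keep only the NRC words tagged as "joy"
nrc_joy <- get_sentiments("nrc") %>%
  filter(sentiment == "joy")
# Most common joy words in Emma
tidy_books123 %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE) %>%
  head()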
## Joining, by = "word"
## # A tibble: 6 x 2
## word n
## <chr> <int>
## 1 good 359
## 2 young 192
## 3 friend 166
## 4 hope 143
## 5 happy 125
## 6 love 117
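The chapter column in tidy_books123 comes from a running count of chapter headings. The sketch below shows the idea on a few hypothetical lines of text, using base R's `grepl()` with `perl = TRUE` in place of `str_detect()`; the sample lines are made up for illustration.

```r
# Hypothetical lines of text standing in for a novel.
text <- c("EMMA", "", "CHAPTER I",
          "Emma Woodhouse, handsome, clever, and rich,",
          "seemed to unite some of the best blessings",
          "CHAPTER II",
          "Mr. Weston was a native of Highbury.")

# TRUE on lines that start with "chapter" followed by a digit or a Roman numeral letter.
is_heading <- grepl("^chapter [\\divxlc]", text, ignore.case = TRUE, perl = TRUE)

# Running total of headings: 0 before the first chapter, then 1, 2, ...
chapter <- cumsum(is_heading)

data.frame(linenumber = seq_along(text), chapter, text)
```

Lines before the first heading get chapter 0, which matches the chapter 0 shown for title-page lines later in this document.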
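Next we create jane_austen_sentiment1 using the Bing lexicon, get_sentiments("bing"). We count the positive and negative words in each 80-line section of every novel and take the difference as the net sentiment.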
jane_austen_sentiment1 <- tidy_books123 %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
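The two key steps above are the integer division that assigns lines to sections and the subtraction that produces a net score. The sketch below reproduces both on a hypothetical set of matched words; it uses `summarise()` rather than `spread()`, so it illustrates the arithmetic and is not the chapter's exact code.

```r
library(dplyr)

# Hypothetical matched words: the line each came from and its Bing-style label.
matched <- tibble(
  linenumber = c(5, 40, 79, 80, 95, 159, 160),
  sentiment  = c("positive", "negative", "positive", "negative",
                 "negative", "positive", "positive")
)

by_section <- matched %>%
  # Integer division: lines 0-79 -> section 0, 80-159 -> section 1, 160-239 -> section 2.
  mutate(index = linenumber %/% 80) %>%
  group_by(index) %>%
  summarise(positive = sum(sentiment == "positive"),
            negative = sum(sentiment == "negative"),
            net = positive - negative)

by_section
```

A section with more negative than positive matches gets a negative net score, which is what appears as a bar below zero in the plot.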
The ggplot call below plots jane_austen_sentiment1, showing where net sentiment rises and falls over the course of each novel.
ggplot(jane_austen_sentiment1, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")
Next we use dplyr’s filter() to isolate Pride & Prejudice. The head() call shows the first few rows, with the line number, chapter, and word for each token.
Pride and Prejudice is a novel of manners by Jane Austen, first published in 1813. The story follows the main character, Elizabeth Bennet, as she deals with issues of manners, upbringing, morality, education, and marriage in the society of the landed gentry of the British Regency.
pride_prejudice1 <- tidy_books123 %>%
  filter(book == "Pride & Prejudice")
pride_prejudice1 %>%
  head()
## # A tibble: 6 x 4
## book linenumber chapter word
## <fct> <int> <int> <chr>
## 1 Pride & Prejudice 1 0 pride
## 2 Pride & Prejudice 1 0 and
## 3 Pride & Prejudice 1 0 prejudice
## 4 Pride & Prejudice 3 0 by
## 5 Pride & Prejudice 3 0 jane
## 6 Pride & Prejudice 3 0 austen
We then use inner_join, mutate, and bind_rows to compute the net sentiment (positive minus negative) of each 80-line section under all three lexicons. The final count looks at the NRC lexicon itself, not the novel: it contains 3,324 negative words and 2,312 positive words, so it lists more negative than positive terms.
afinn1 <- pride_prejudice1 %>%
  inner_join(get_sentiments("afinn")) %>%
  group_by(index = linenumber %/% 80) %>%
  summarise(sentiment = sum(value)) %>%
  mutate(method = "AFINN")
## Joining, by = "word"
## `summarise()` ungrouping output (override with `.groups` argument)
bing_and_nrc1 <- bind_rows(
  pride_prejudice1 %>%
    inner_join(get_sentiments("bing")) %>%
    mutate(method = "Bing et al."),
  pride_prejudice1 %>%
    inner_join(get_sentiments("nrc") %>%
                 filter(sentiment %in% c("positive",
                                         "negative"))) %>%
    mutate(method = "NRC")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
## Joining, by = "word"
get_sentiments("nrc") %>%
  filter(sentiment %in% c("positive",
                          "negative")) %>%
  count(sentiment)
## # A tibble: 2 x 2
## sentiment n
## <chr> <int>
## 1 negative 3324
## 2 positive 2312
Finally, we use bind_rows to combine afinn1 and bing_and_nrc1, then ggplot with facet_wrap to display the levels of sentiment for each lexicon.
bind_rows(afinn1,
          bing_and_nrc1) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")
Extend analysis to new corpus and new lexicon
We identified and implemented a different corpus for sentiment analysis: Philosopher’s Stone.
We identified and implemented an additional lexicon for sentiment analysis.
Below we inspect the structure of philosophers_stone and wrap it in a tibble. It is a character vector with one element per chapter, 17 in total.
str(philosophers_stone)
## chr [1:17] "THE BOY WHO LIVED Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfect"| __truncated__ ...
tibble(philosophers_stone)
## # A tibble: 17 x 1
## philosophers_stone
## <chr>
## 1 "THE BOY WHO LIVED Mr. and Mrs. Dursley, of number four, Privet Drive, were~
## 2 "THE VANISHING GLASS Nearly ten years had passed since the Dursleys had wok~
## 3 "THE LETTERS FROM NO ONE The escape of the Brazilian boa constrictor earned~
## 4 "THE KEEPER OF THE KEYS BOOM. They knocked again. Dudley jerked awake. \"Wh~
## 5 "DIAGON ALLEY Harry woke early the next morning. Although he could tell it ~
## 6 "THE JOURNEY FROM PLATFORM NINE AND THREE-QUARTERS Harry's last month with ~
## 7 "THE SORTING HAT The door swung open at once. A tall, black-haired witch in~
## 8 "THE POTIONS MASTER There, look.\" \"Where?\" \"Next to the tall kid with~
## 9 "THE MIDNIGHT DUEL Harry had never believed he would meet a boy he hated mo~
## 10 "HALLOWEEN Malfoy couldn't believe his eyes when he saw that Harry and Ron ~
## 11 "QUIDDITCH As they entered November, the weather turned very cold. The moun~
## 12 "THE MIRROR OF ERISED Christmas was coming. One morning in mid-December, Ho~
## 13 "NICOLAS FLAMEL Dumbledore had convinced Harry not to go looking for the Mi~
## 14 "NORBERT THE NORWEGIAN RIDGEBACK Quirrell, however, must have been braver t~
## 15 "THE FORIBIDDEN FOREST Things couldn't have been worse. Filch took them do~
## 16 "THROUGH THE TRAPDOOR In years to come, Harry would never quite remember ho~
## 17 "THE MAN WITH TWO FACES It was Quirrell. \"You!\" gasped Harry. Quirrell ~
Next we reshape philosophers_stone into one word per row, keeping each word’s chapter and its order in the novel.
titles1 <- c("philosophers_stone")
books1 <- list(philosophers_stone)
series1 <- tibble()
for(i in seq_along(titles1)) {
  temp1 <- tibble(chapter = seq_along(books1[[i]]),
                  text = books1[[i]]) %>%
    unnest_tokens(word, text) %>%
    mutate(book = titles1[i]) %>%
    select(book, everything())
  series1 <- rbind(series1, temp1)
}
# set factor to keep books in order of publication
series1$book <- factor(series1$book, levels = rev(titles1))
series1
## # A tibble: 77,875 x 3
## book chapter word
## <fct> <int> <chr>
## 1 philosophers_stone 1 the
## 2 philosophers_stone 1 boy
## 3 philosophers_stone 1 who
## 4 philosophers_stone 1 lived
## 5 philosophers_stone 1 mr
## 6 philosophers_stone 1 and
## 7 philosophers_stone 1 mrs
## 8 philosophers_stone 1 dursley
## 9 philosophers_stone 1 of
## 10 philosophers_stone 1 number
## # ... with 77,865 more rows
afinn1 <- series1 %>%
  group_by(book) %>%
  mutate(word_count = 1:n(),
         index = word_count %/% 500 + 1) %>%
  inner_join(get_sentiments("afinn")) %>%
  group_by(book, index) %>%
  summarise(sentiment = sum(value)) %>%
  mutate(method = "AFINN")
## Joining, by = "word"
## `summarise()` regrouping output by 'book' (override with `.groups` argument)
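Because this corpus has no line numbers, the index is built from word position instead. The sketch below works through the same formula on a hypothetical 12-word text with a chunk size of 5 standing in for 500. It also shows an alternative formula, which is an assumption on our part and not part of the code above, that makes the first chunk full-sized.

```r
# Position of each word in a hypothetical 12-word text.
word_count <- 1:12
chunk_size <- 5  # stand-in for the 500 used above

# Same formula as above: the first chunk comes out one word short.
index_as_written <- word_count %/% chunk_size + 1
table(index_as_written)

# Alternative (not the source's code): shift by one before dividing
# so that every chunk except possibly the last is full-sized.
index_even <- (word_count - 1) %/% chunk_size + 1
table(index_even)
```

With a chunk size of 500 the first chunk holds 499 words under the original formula, so the effect on the sentiment totals is likely negligible.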
Final compilation and organization of the content, displaying the sentiment score and method for each 500-word index
afinn1
## # A tibble: 156 x 4
## # Groups: book [1]
## book index sentiment method
## <fct> <dbl> <dbl> <chr>
## 1 philosophers_stone 1 11 AFINN
## 2 philosophers_stone 2 4 AFINN
## 3 philosophers_stone 3 7 AFINN
## 4 philosophers_stone 4 12 AFINN
## 5 philosophers_stone 5 3 AFINN
## 6 philosophers_stone 6 21 AFINN
## 7 philosophers_stone 7 -7 AFINN
## 8 philosophers_stone 8 8 AFINN
## 9 philosophers_stone 9 17 AFINN
## 10 philosophers_stone 10 13 AFINN
## # ... with 146 more rows
library(wordcloud)
# Word cloud of the most frequent words after removing stop words
series1 %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 105))
## Joining, by = "word"
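The additional lexicon is Loughran & McDonald’s (2016) lexicon, available through get_sentiments("loughran"). It was built for financial text and labels words as negative, positive, litigious, uncertainty, constraining, or superfluous.
The tibble below counts the words under each Loughran sentiment. Because the code uses right_join(), every lexicon entry is kept, including words that never occur in the novel, so these counts are higher than the number of matching words in the text. An inner_join() would count only words that appear in the novel.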
loughran1 <- series1 %>%
  right_join(get_sentiments("loughran")) %>%
  filter(!is.na(sentiment)) %>%
  count(sentiment, sort = TRUE)
## Joining, by = "word"
loughran1
## # A tibble: 6 x 2
## sentiment n
## <chr> <int>
## 1 negative 3027
## 2 litigious 919
## 3 uncertainty 888
## 4 positive 767
## 5 constraining 205
## 6 superfluous 56
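The counts above come from a right_join(), which keeps every row of the lexicon whether or not the word occurs in the text. The sketch below contrasts it with inner_join() on hypothetical tokens and a tiny made-up lexicon; the words and labels are illustrative only.

```r
library(dplyr)

# Hypothetical tokens from a text and a tiny Loughran-style lexicon.
tokens  <- tibble(word = c("doubt", "doubt", "risk", "stone"))
lexicon <- tibble(word = c("doubt", "risk", "lawsuit"),
                  sentiment = c("uncertainty", "uncertainty", "litigious"))

# right_join keeps "lawsuit" even though it never occurs in the text.
right_counts <- tokens %>%
  right_join(lexicon, by = "word") %>%
  count(sentiment, sort = TRUE)

# inner_join counts only words that occur in the text.
inner_counts <- tokens %>%
  inner_join(lexicon, by = "word") %>%
  count(sentiment, sort = TRUE)
```

In this toy case the right join reports one "litigious" word that the text does not contain, while the inner join reports only the three "uncertainty" matches.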
loughran1 <- series1 %>%
  group_by(book) %>%
  mutate(word_count = 1:n(),
         index = word_count %/% 500 + 1) %>%
  inner_join(get_sentiments("loughran") %>%
               filter(sentiment %in% c("positive", "negative"))) %>%
  mutate(method = "Loughran") %>%
  count(book, method, index, sentiment) %>%
  ungroup() %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative) %>%
  select(book, index, method, sentiment)
## Joining, by = "word"
The plot below compares afinn1 and loughran1. The highs and lows vary between methods: AFINN shows much more pronounced swings in sentiment across the index, while Loughran shows a more compact distribution of sentiment values.
bind_rows(afinn1,
          loughran1) %>%
  ungroup() %>%
  mutate(book = factor(book, levels = titles1)) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
  facet_grid(book ~ method)
Austen, J., & Stafford, F. (2003). Emma (Penguin Classics) (Reissue ed.). Penguin Classics.
Austen, J., & Tanner, T. (2002). Pride and Prejudice (Reprint. ed.). Penguin Books.
Rowling, J. K. (2018). Harry Potter and the Philosopher’s Stone: Slytherin Edition; Black and Green (Anniversary ed.). Educa Books.
In conclusion, sentiment analysis helps us make sense of qualitative data such as novels, tweets, product reviews, and support tickets, and extract insights from it. By detecting positive, neutral, and negative opinions within text, we can gauge the general feeling about a novel, brand, product, or service and make data-driven decisions. The lexicons differ in what they measure. The NRC lexicon, for example, associates English words with eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive).