Overview

Today’s goal is to compare several sentiment lexicons across a few worked examples. We hope to learn how the lexicons differ in practice and whether those differences change from one corpus to another.

Sentiment analysis

The sentiment analysis is split into two parts: the example code from Chapter 2 of the textbook “Text Mining with R”, and an extension that mirrors the same process with another text and another sentiment lexicon. Toggle between the sections using their respective tabs.

Text Mining with R Chapter 2 Code

The base code from the textbook begins by showing what each tidytext lexicon provides. Each lexicon pairs a word with either a sentiment label (bing, nrc) or a numeric sentiment score (afinn).

library(tidytext)
## Warning: package 'tidytext' was built under R version 4.4.3
library(textdata)
## Warning: package 'textdata' was built under R version 4.4.3
get_sentiments('afinn')
## # A tibble: 2,477 × 2
##    word       value
##    <chr>      <dbl>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # ℹ 2,467 more rows
get_sentiments('bing')
## # A tibble: 6,786 × 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faces     negative 
##  2 abnormal    negative 
##  3 abolish     negative 
##  4 abominable  negative 
##  5 abominably  negative 
##  6 abominate   negative 
##  7 abomination negative 
##  8 abort       negative 
##  9 aborted     negative 
## 10 aborts      negative 
## # ℹ 6,776 more rows
get_sentiments('nrc')
## # A tibble: 13,872 × 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # ℹ 13,862 more rows

Get the texts of Jane Austen’s novels and tidy them into a one-word-per-row corpus for sentiment analysis. This first analysis counts the nrc joy words in the book “Emma”.

library(janeaustenr)
## Warning: package 'janeaustenr' was built under R version 4.4.3
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(stringr)

tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text, 
                                regex("^chapter [\\divxlc]", 
                                      ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)

nrc_joy <- get_sentiments("nrc") %>% 
  filter(sentiment == "joy")

tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
## Joining with `by = join_by(word)`
## # A tibble: 301 × 2
##    word          n
##    <chr>     <int>
##  1 good        359
##  2 friend      166
##  3 hope        143
##  4 happy       125
##  5 love        117
##  6 deal         92
##  7 found        92
##  8 present      89
##  9 kind         82
## 10 happiness    76
## # ℹ 291 more rows

Good, friend, and hope are the most common joy words found.

Compare the net sentiment across 80-line sections of each book using the bing sentiment lexicon.

library(tidyr)

jane_austen_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% 
  mutate(sentiment = positive - negative)
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 435434 of `x` matches multiple rows in `y`.
## ℹ Row 5051 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
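
The indexing and net sentiment steps in the chunk above can be illustrated on a small scale. The following is a minimal sketch that assumes a made-up table of matched words (`toy_matches` is hypothetical and not part of the original analysis); it only shows how integer division groups line numbers into 80-line sections and how the positive and negative counts are combined.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: one row per matched word, with its source line number
toy_matches <- tibble(
  linenumber = c(3, 45, 79, 80, 120, 159, 160),
  sentiment  = c("positive", "negative", "positive",
                 "negative", "negative", "positive", "positive")
)

toy_net <- toy_matches %>%
  # Integer division maps lines 0-79 to index 0, 80-159 to index 1, and so on
  count(index = linenumber %/% 80, sentiment) %>%
  # One column per sentiment label; sections with no matches get a count of 0
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  # Net sentiment is the positive count minus the negative count
  mutate(sentiment = positive - negative)

toy_net
```

Each row of `toy_net` is one 80-line section, so the section size controls how smooth or noisy the plotted trajectory looks.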

These sentiments can now be plotted.

library(ggplot2)

ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")

The plots show how net sentiment rises and falls over the course of each novel.

Take the words from only “Pride & Prejudice” and obtain the net sentiments based on afinn, bing, and nrc.

pride_prejudice <- tidy_books %>% 
  filter(book == "Pride & Prejudice")

afinn <- pride_prejudice %>% 
  inner_join(get_sentiments("afinn")) %>% 
  group_by(index = linenumber %/% 80) %>% 
  summarise(sentiment = sum(value)) %>% 
  mutate(method = "AFINN")
## Joining with `by = join_by(word)`
bing_and_nrc <- bind_rows(
  pride_prejudice %>% 
    inner_join(get_sentiments("bing")) %>%
    mutate(method = "Bing et al."),
  pride_prejudice %>% 
    inner_join(get_sentiments("nrc") %>% 
                 filter(sentiment %in% c("positive", 
                                         "negative"))
    ) %>%
    mutate(method = "NRC")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>% 
  mutate(sentiment = positive - negative)
## Joining with `by = join_by(word)`
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("nrc") %>% filter(sentiment %in% : Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 215 of `x` matches multiple rows in `y`.
## ℹ Row 5178 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
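
The many-to-many warning above indicates that at least one word appears more than once on both sides of the join. A minimal sketch of how such a case could be diagnosed is shown below, assuming a made-up lexicon in which one word carries two labels (`toy_lexicon` and `toy_tokens` are hypothetical). The `relationship` argument is the one named in the warning text itself.

```r
library(dplyr)

# Hypothetical toy lexicon in which "envy" carries two labels
toy_lexicon <- tibble(
  word      = c("good", "bad", "envy", "envy"),
  sentiment = c("positive", "negative", "positive", "negative")
)

# Words that have more than one entry in the lexicon
duplicated_words <- toy_lexicon %>%
  count(word, name = "entries") %>%
  filter(entries > 1)

# Toy text in which the duplicated word occurs twice
toy_tokens <- tibble(word = c("envy", "good", "envy"))

# Declaring the relationship silences the warning; the row duplication remains
toy_joined <- toy_tokens %>%
  inner_join(toy_lexicon, by = "word", relationship = "many-to-many")

nrow(toy_joined)
```

In this toy case three tokens become five joined rows, because each occurrence of the duplicated word matches both of its lexicon entries. Whether that double counting matters for the real lexicons would need to be checked against the actual duplicated words.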

Plot the sentiments according to each sentiment lexicon.

bind_rows(afinn, 
          bing_and_nrc) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")

The sentiment trajectories are similar for each lexicon, but have individual nuances as well.
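
One source of these nuances is that the lexicons are scored differently: the AFINN series sums numeric values, while the Bing and NRC series subtract a count of negative words from a count of positive words. The sketch below assumes three made-up matched words with invented scores (`toy_scored` is hypothetical and does not use real lexicon values) to show how the two scoring styles can disagree in magnitude for the same words.

```r
library(dplyr)

# Hypothetical scores and labels for the same three matched words
toy_scored <- tibble(
  word  = c("superb", "bad", "good"),
  value = c(5, -3, 3),                         # made-up AFINN-style scores
  label = c("positive", "negative", "positive")
)

toy_comparison <- toy_scored %>%
  summarise(
    # AFINN-style: add up the signed scores
    afinn_style = sum(value),
    # Bing/NRC-style: positive count minus negative count
    count_style = sum(label == "positive") - sum(label == "negative")
  )

toy_comparison
```

This is why the comparison plot uses `scales = "free_y"`: the absolute heights of the bars are not comparable across methods, only the shapes of the trajectories.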

We can study each lexicon individually to understand why there may be differences.

get_sentiments("nrc") %>% 
  filter(sentiment %in% c("positive", "negative")) %>% 
  count(sentiment)
## # A tibble: 2 × 2
##   sentiment     n
##   <chr>     <int>
## 1 negative   3316
## 2 positive   2308
get_sentiments("bing") %>% 
  count(sentiment)
## # A tibble: 2 × 2
##   sentiment     n
##   <chr>     <int>
## 1 negative   4781
## 2 positive   2005
bing_word_counts <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 435434 of `x` matches multiple rows in `y`.
## ℹ Row 5051 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.

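Both lexicons contain more negative than positive words, but the imbalance is larger in Bing (4,781 negative to 2,005 positive) than in NRC (3,316 negative to 2,308 positive). All else being equal, this tends to give Bing lower net sentiment scores.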
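
The size of the imbalance can be worked out directly from the counts printed above. This short sketch copies those four counts into a data frame (the object name `lexicon_sizes` is an assumption made for illustration) and computes the ratio of negative to positive words and the negative share for each lexicon.

```r
# Counts copied from the two count(sentiment) outputs above
lexicon_sizes <- data.frame(
  lexicon  = c("NRC", "Bing"),
  negative = c(3316, 4781),
  positive = c(2308, 2005)
)

# Negative words per positive word
lexicon_sizes$neg_per_pos <- lexicon_sizes$negative / lexicon_sizes$positive

# Share of all positive/negative entries that are negative
lexicon_sizes$neg_share <- lexicon_sizes$negative /
  (lexicon_sizes$negative + lexicon_sizes$positive)

round(lexicon_sizes[, c("neg_per_pos", "neg_share")], 2)
```

By these counts NRC has about 1.44 negative words for every positive one, while Bing has about 2.38, so a word drawn from Bing is noticeably more likely to be negative.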

Build a few more bar plots to see which words contribute the most to each sentiment.

bing_word_counts
## # A tibble: 2,585 × 3
##    word     sentiment     n
##    <chr>    <chr>     <int>
##  1 miss     negative   1855
##  2 well     positive   1523
##  3 good     positive   1380
##  4 great    positive    981
##  5 like     positive    725
##  6 better   positive    639
##  7 enough   positive    613
##  8 happy    positive    534
##  9 love     positive    495
## 10 pleasure positive    462
## # ℹ 2,575 more rows
bing_word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>% 
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Contribution to sentiment",
       y = NULL)

custom_stop_words <- bind_rows(tibble(word = c("miss"),  
                                      lexicon = c("custom")), 
                               stop_words)

custom_stop_words
## # A tibble: 1,150 × 2
##    word        lexicon
##    <chr>       <chr>  
##  1 miss        custom 
##  2 a           SMART  
##  3 a's         SMART  
##  4 able        SMART  
##  5 about       SMART  
##  6 above       SMART  
##  7 according   SMART  
##  8 accordingly SMART  
##  9 across      SMART  
## 10 actually    SMART  
## # ℹ 1,140 more rows

Miss is by far the most common negative sentiment word. The top positive sentiment words are spread more evenly, led by well, good, and great. In these novels, miss most likely refers to young, unmarried women rather than to anything negative, so we can add it to a custom stop word list to prevent it from affecting our results.
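
The custom stop word list only takes effect once it is removed from the tokens with an anti-join. The sketch below shows that step on made-up data (`toy_tokens` and `toy_custom_stop_words` are hypothetical stand-ins for `tidy_books` and `custom_stop_words`), so it illustrates the mechanism rather than reproducing the real result.

```r
library(dplyr)

# Hypothetical stand-ins for the real token table and stop word list
toy_tokens <- tibble(word = c("miss", "bates", "was", "happy", "miss", "poor"))
toy_custom_stop_words <- tibble(
  word    = c("miss", "was"),
  lexicon = c("custom", "SMART")
)

# anti_join() keeps only the tokens that are absent from the stop word list
toy_filtered <- toy_tokens %>%
  anti_join(toy_custom_stop_words, by = "word")

toy_filtered$word
```

Running the anti-join before the sentiment join would keep miss out of the negative counts entirely.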

Build word clouds to see the most common words in another way.

library(wordcloud)
## Warning: package 'wordcloud' was built under R version 4.4.3
## Loading required package: RColorBrewer
tidy_books %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
## Joining with `by = join_by(word)`

library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"),
                   max.words = 100)
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 435434 of `x` matches multiple rows in `y`.
## ℹ Row 5051 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.

p_and_p_sentences <- tibble(text = prideprejudice) %>% 
  unnest_tokens(sentence, text, token = "sentences")

p_and_p_sentences$sentence[2]
## [1] "by jane austen"
austen_chapters <- austen_books() %>%
  group_by(book) %>%
  unnest_tokens(chapter, text, token = "regex", 
                pattern = "Chapter|CHAPTER [\\dIVXLC]") %>%
  ungroup()

austen_chapters %>% 
  group_by(book) %>% 
  summarise(chapters = n())
## # A tibble: 6 × 2
##   book                chapters
##   <fct>                  <int>
## 1 Sense & Sensibility       51
## 2 Pride & Prejudice         62
## 3 Mansfield Park            49
## 4 Emma                      56
## 5 Northanger Abbey          32
## 6 Persuasion                25
bingnegative <- get_sentiments("bing") %>% 
  filter(sentiment == "negative")

wordcounts <- tidy_books %>%
  group_by(book, chapter) %>%
  summarize(words = n())
## `summarise()` has grouped output by 'book'. You can override using the
## `.groups` argument.
tidy_books %>%
  semi_join(bingnegative) %>%
  group_by(book, chapter) %>%
  summarize(negativewords = n()) %>%
  left_join(wordcounts, by = c("book", "chapter")) %>%
  mutate(ratio = negativewords/words) %>%
  filter(chapter != 0) %>%
  slice_max(ratio, n = 1) %>% 
  ungroup()
## Joining with `by = join_by(word)`
## `summarise()` has grouped output by 'book'. You can override using the
## `.groups` argument.
## # A tibble: 6 × 5
##   book                chapter negativewords words  ratio
##   <fct>                 <int>         <int> <int>  <dbl>
## 1 Sense & Sensibility      43           161  3405 0.0473
## 2 Pride & Prejudice        34           111  2104 0.0528
## 3 Mansfield Park           46           173  3685 0.0469
## 4 Emma                     15           151  3340 0.0452
## 5 Northanger Abbey         21           149  2982 0.0500
## 6 Persuasion                4            62  1807 0.0343
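
The ratio in the table above is the number of negative words in a chapter divided by the total number of words in that chapter. A minimal sketch of the same steps on made-up data follows (`toy_tokens` and `toy_negative` are hypothetical). It uses `semi_join()`, as the chunk above does, which filters rows without adding columns or duplicating matches.

```r
library(dplyr)

# Hypothetical tokens for two short chapters and a tiny negative word list
toy_tokens <- tibble(
  chapter = c(1, 1, 1, 2, 2),
  word    = c("bad", "good", "poor", "bad", "fine")
)
toy_negative <- tibble(word = c("bad", "poor"))

# Total words per chapter
toy_wordcounts <- count(toy_tokens, chapter, name = "words")

toy_ratio <- toy_tokens %>%
  # Keep only tokens found in the negative word list
  semi_join(toy_negative, by = "word") %>%
  count(chapter, name = "negativewords") %>%
  left_join(toy_wordcounts, by = "chapter") %>%
  mutate(ratio = negativewords / words)

toy_ratio
```

Dividing by chapter length matters because a long chapter would otherwise look more negative simply for containing more words.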

In the word clouds, a number of words stand out as common, such as lady, time, emma, and dear. Among the sentiment words, miss is notably common, but as discussed above, its negative label is probably wrong for these texts. Otherwise, good and well appear to be the next most common words, while poor stands out among the negative sentiment words.

Silge, Julia, and David Robinson. “2 Sentiment Analysis with Tidy Data.” Text Mining with R, www.tidytextmining.com/sentiment.html.

Extra Practice with “A Tale of Two Cities” and Quanteda Sentiment

The corpus chosen is the text of “A Tale of Two Cities” by Charles Dickens. Using the gutenbergr package, download the text with gutenberg_download(98), where 98 is the book’s Project Gutenberg ID. Tidy the corpus in the same way as in the Jane Austen example. Since only one book is used here, there is no need for the book column this time.

remotes::install_github('quanteda/quanteda.sentiment')
## Using GitHub PAT from the git credential store.
## Skipping install of 'quanteda.sentiment' from a github remote, the SHA1 (934c1e1f) has not changed since last install.
##   Use `force = TRUE` to force installation
library(quanteda.sentiment)
## Loading required package: quanteda
## Warning: package 'quanteda' was built under R version 4.4.3
## Package version: 4.2.0
## Unicode version: 15.1
## ICU version: 74.1
## Parallel computing: 16 of 16 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda.sentiment'
## The following object is masked from 'package:quanteda':
## 
##     data_dictionary_LSD2015
library(gutenbergr)
## Warning: package 'gutenbergr' was built under R version 4.4.3
# 98 is the Project Gutenberg ID of the book, found on the Project Gutenberg site.
dickens_books <- gutenberg_download(98)
## Determining mirror for Project Gutenberg from https://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org
tidy_dickens_text <- dickens_books %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text, 
                                regex("^chapter [\\divxlc]", 
                                      ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)

As in the earlier example, we use an inner join to find the words that match the nrc joy lexicon.

tidy_dickens_text %>% inner_join(nrc_joy) %>% count(word, sort = TRUE)
## Joining with `by = join_by(word)`
## # A tibble: 333 × 2
##    word         n
##    <chr>    <int>
##  1 good       217
##  2 child       89
##  3 hope        84
##  4 friend      76
##  5 daughter    62
##  6 found       61
##  7 love        56
##  8 saint       56
##  9 mother      43
## 10 true        42
## # ℹ 323 more rows

The most frequent nrc joy word is good, with 217 occurrences, followed by child, hope, friend, and daughter. Interestingly, good, friend, and hope are also the three most common joy words in the book “Emma”.
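
The overlap between the two top-ten lists can be checked directly. This sketch copies the ten words from each printed output above into character vectors (the vector names are assumptions made for illustration) and intersects them.

```r
# Top ten nrc joy words, copied from the "Emma" output earlier in the report
emma_top10 <- c("good", "friend", "hope", "happy", "love",
                "deal", "found", "present", "kind", "happiness")

# Top ten nrc joy words, copied from the "A Tale of Two Cities" output above
dickens_top10 <- c("good", "child", "hope", "friend", "daughter",
                   "found", "love", "saint", "mother", "true")

# Words that appear in both top-ten lists, in the order they appear for "Emma"
shared_joy_words <- intersect(emma_top10, dickens_top10)

shared_joy_words
```

Five of the ten words are shared: good, friend, hope, love, and found. The words unique to the Dickens list (child, daughter, saint, mother, true) hint at how the subject matter of the two books differs.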

We will use the bing sentiments to compare the positive and negative sentiments throughout the text.

dickens_sentiment <- tidy_dickens_text %>%
  inner_join(get_sentiments("bing")) %>%
  count(index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% 
  mutate(sentiment = positive - negative)
## Joining with `by = join_by(word)`

Repeat the process with the HuLiu sentiment lexicon (data_dictionary_HuLiu) from the quanteda.sentiment package.

quant_sent_hu <- rep('positive', each = length(data_dictionary_HuLiu$positive))
quant_sent_hu <- append(quant_sent_hu, rep('negative', each = length(data_dictionary_HuLiu$negative)))
quant_df <- data.frame(word = unlist(data_dictionary_HuLiu, use.names = FALSE), sentiment = quant_sent_hu)

dickens_quanteda_sentiment <- tidy_dickens_text %>%
  inner_join(quant_df) %>%
  count(index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining with `by = join_by(word)`

The HuLiu sentiment lexicon needed some tidying before it could be used with the same functions. I collected its words into one column and paired each word with the appropriate sentiment label in a data frame.
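
The same reshaping, from a list of word vectors to a two-column table, can also be sketched with base R’s `stack()`. The example below assumes a plain named list standing in for the dictionary (`toy_dictionary` is hypothetical). The real quanteda dictionary object would first need to be converted to a plain named list, which is not shown or tested here.

```r
# Hypothetical stand-in for a dictionary: a named list of character vectors
toy_dictionary <- list(
  positive = c("good", "great"),
  negative = c("bad", "poor", "awful")
)

# stack() returns one row per word, with the list name in the `ind` column
toy_lexicon <- stack(toy_dictionary)

# Rename to match the column names used by the tidytext lexicons
names(toy_lexicon) <- c("word", "sentiment")

# stack() stores the labels as a factor; convert to character for joining
toy_lexicon$sentiment <- as.character(toy_lexicon$sentiment)

toy_lexicon
```

Compared with building the label vector by hand with `rep()`, this keeps each word and its label together, so the pairing does not depend on the order in which the list is flattened.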

Next, check the sentiment plots to see how they compare.

ggplot(dickens_sentiment, aes(index, sentiment)) +
  geom_col(show.legend = FALSE)

ggplot(dickens_quanteda_sentiment, aes(index, sentiment)) +
  geom_col(show.legend = FALSE)

Oddly, the sentiment plots look identical for Bing and HuLiu. According to this article, https://medium.com/@laurenflynn1211/comparing-sentiment-analysis-dictionaries-in-r-c695fca64326, this surprising result is to be expected: the two are nearly identical sentiment lexicons. That is consistent with their names, since tidytext’s bing lexicon is named after Bing Liu, the Liu in HuLiu.

Compare all four lexicons.

afinn_dickens <- tidy_dickens_text %>% 
  inner_join(get_sentiments("afinn")) %>% 
  group_by(index = linenumber %/% 80) %>% 
  summarise(sentiment = sum(value)) %>% 
  mutate(method = "AFINN")
## Joining with `by = join_by(word)`
bing_and_nrc_dickens <- bind_rows(
  tidy_dickens_text %>% 
    inner_join(get_sentiments("bing")) %>%
    mutate(method = "Bing et al."),
  tidy_dickens_text %>% 
    inner_join(get_sentiments("nrc") %>% 
                 filter(sentiment %in% c("positive", 
                                         "negative"))
    ) %>%
    mutate(method = "NRC")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>% 
  mutate(sentiment = positive - negative)
## Joining with `by = join_by(word)`
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("nrc") %>% filter(sentiment %in% : Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 11 of `x` matches multiple rows in `y`.
## ℹ Row 2003 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
dickens_quanteda_sentiment <- dickens_quanteda_sentiment %>%
    mutate(method = "HuLiu")

bind_rows(afinn_dickens, 
          bing_and_nrc_dickens,
          dickens_quanteda_sentiment) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")

NRC skews much more positive than the other lexicons, especially in the earlier part of the story. The textbook example suggested the reason: the balance of positive and negative words is tilted further toward negative in Bing, and therefore in HuLiu, than in NRC.

bing_word_counts <- tidy_dickens_text %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
## Joining with `by = join_by(word)`
bing_word_counts
## # A tibble: 1,870 × 3
##    word     sentiment     n
##    <chr>    <chr>     <int>
##  1 miss     negative    233
##  2 good     positive    217
##  3 like     positive    214
##  4 well     positive    179
##  5 great    positive    161
##  6 prisoner negative    115
##  7 better   positive     90
##  8 dark     negative     89
##  9 work     positive     88
## 10 poor     negative     87
## # ℹ 1,860 more rows
huliu_word_counts <- tidy_dickens_text %>%
  inner_join(quant_df) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
## Joining with `by = join_by(word)`
huliu_word_counts
## # A tibble: 1,870 × 3
##    word     sentiment     n
##    <chr>    <chr>     <int>
##  1 miss     negative    233
##  2 good     positive    217
##  3 like     positive    214
##  4 well     positive    179
##  5 great    positive    161
##  6 prisoner negative    115
##  7 better   positive     90
##  8 dark     negative     89
##  9 work     positive     88
## 10 poor     negative     87
## # ℹ 1,860 more rows

Once again, miss is very common and is probably not meant negatively in Dickens’s work. Additionally, the HuLiu results are again identical to Bing’s.

Plot the word contributions for each lexicon.

bing_word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>% 
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Contribution to sentiment",
       y = NULL)

huliu_word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>% 
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Contribution to sentiment",
       y = NULL)

Since the two lexicons still show no difference, the more interesting observations concern the text itself. Miss is still the most common negative word, but its lead is not as extreme as in Jane Austen’s works. Prisoner appears fairly often. Good, rather than well, is the most common positive word.

Let’s check the word cloud as before.

tidy_dickens_text %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
## Joining with `by = join_by(word)`
## Warning in wordcloud(word, n, max.words = 100): defarge could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): cruncher could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): time could not be fit on page.
## It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): doctor could not be fit on
## page. It will not be plotted.

The assortment of words differs from the textbook example. Miss is still very common for the same reasons. Other standout words include madame, day, night, doctor, and time. Words such as carton and darnay are understandably common and specific to this text, as they are the names of main characters in “A Tale of Two Cities”.

Let’s redo the example word cloud while tagging positive and negative words.

set.seed(123)

tidy_dickens_text %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"),
                   max.words = 100)
## Joining with `by = join_by(word)`

set.seed(123)

tidy_dickens_text %>%
  inner_join(quant_df) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"),
                   max.words = 100)
## Joining with `by = join_by(word)`

Each of the two word clouds was preceded by the same seed to show that they give identical results. Word cloud layout has a degree of randomness, which can make the results look different when they actually are not.
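
The effect of the seed can be seen in isolation with any random draw. This is a minimal sketch using `sample()` rather than a word cloud. It shows only that resetting the same seed reproduces the same random numbers, which is the property the two word cloud chunks rely on.

```r
# Reset the seed before each draw so both draws use the same random stream
set.seed(123)
first_draw <- sample(1:100, 5)

set.seed(123)
second_draw <- sample(1:100, 5)

# Without the second set.seed() call, the two draws would generally differ
identical(first_draw, second_draw)
```

Because the seed fixes only the random layout, two clouds drawn with the same seed will match only if their input word counts also match.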

bingnegative <- get_sentiments("bing") %>% 
  filter(sentiment == "negative")

wordcounts <- tidy_dickens_text %>%
  group_by(chapter) %>%
  summarize(words = n())

tidy_dickens_text %>%
  semi_join(bingnegative) %>%
  group_by(chapter) %>%
  summarize(negativewords = n()) %>%
  left_join(wordcounts, by = c("chapter")) %>%
  mutate(ratio = negativewords/words) %>%
  filter(chapter != 0) %>%
  slice_max(ratio, n = 1) %>% 
  ungroup()
## Joining with `by = join_by(word)`
## # A tibble: 1 × 4
##   chapter negativewords words  ratio
##     <int>         <int> <int>  <dbl>
## 1      44           231  4665 0.0495
huliunegative <- quant_df %>% 
  filter(sentiment == "negative")

wordcounts <- tidy_dickens_text %>%
  group_by(chapter) %>%
  summarize(words = n())

tidy_dickens_text %>%
  semi_join(huliunegative) %>%
  group_by(chapter) %>%
  summarize(negativewords = n()) %>%
  left_join(wordcounts, by = c("chapter")) %>%
  mutate(ratio = negativewords/words) %>%
  filter(chapter != 0) %>%
  slice_max(ratio, n = 1) %>% 
  ungroup()
## Joining with `by = join_by(word)`
## # A tibble: 1 × 4
##   chapter negativewords words  ratio
##     <int>         <int> <int>  <dbl>
## 1      44           231  4665 0.0495

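Chapter 44, counting chapter headings cumulatively across the whole novel, has the highest proportion of negative words in “A Tale of Two Cities” according to both lexicons, which makes it the saddest chapter by this measure.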
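
The chapter numbers reported here come from the running counter built with `cumsum()` and `str_detect()` when the text was tidied. The sketch below reproduces that idea in base R on a few made-up lines (`toy_text` is hypothetical), using `grepl()` with `perl = TRUE` as an assumed stand-in for the stringr call.

```r
# Hypothetical lines of text containing three chapter headings
toy_text <- c("A TALE", "CHAPTER I", "It was the best of times",
              "Chapter II", "The mail", "chapter iii", "night shadows")

# TRUE where a line starts with "chapter" followed by a digit or Roman numeral
is_heading <- grepl("^chapter [\\divxlc]", toy_text,
                    ignore.case = TRUE, perl = TRUE)

# The running total of headings gives each line its chapter number;
# lines before the first heading get chapter 0
toy_chapter <- cumsum(is_heading)

toy_chapter
```

Because the counter never resets, a novel divided into several books with restarting chapter numbers ends up with one continuous sequence, and any front matter before the first heading is labeled chapter 0 (which is why the analysis filters out `chapter != 0`).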

Let’s conduct one final test to see what was going on with the sentiment lexicons.

length(quant_df$sentiment)
## [1] 6789
length(get_sentiments('bing')$word)
## [1] 6786

HuLiu has 6,789 entries and Bing has 6,786. There are apparently differences between them, but very minor ones.
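
A natural follow-up would be to list which entries differ. The sketch below shows one way this could be done, assuming two made-up lexicons (`toy_a` and `toy_b` are hypothetical, not the real HuLiu and Bing tables). It compares word and sentiment pairs with `setdiff()` and also counts exact duplicate rows, since `setdiff()` ignores duplicates.

```r
# Hypothetical toy lexicons that differ by a single entry
toy_a <- data.frame(
  word      = c("good", "bad", "envy", "envy"),
  sentiment = c("positive", "negative", "positive", "negative")
)
toy_b <- data.frame(
  word      = c("good", "bad", "envy"),
  sentiment = c("positive", "negative", "negative")
)

# Combine word and label into one key so that pairs can be compared
key_a <- paste(toy_a$word, toy_a$sentiment)
key_b <- paste(toy_b$word, toy_b$sentiment)

only_in_a <- setdiff(key_a, key_b)  # pairs present in toy_a but not toy_b
only_in_b <- setdiff(key_b, key_a)  # pairs present in toy_b but not toy_a

# Exact duplicate rows would also change the length without changing the set
duplicate_rows_a <- sum(duplicated(key_a))

only_in_a
```

Applied to the real lexicons, the same comparison would show whether the three extra rows are distinct words, extra labels for existing words, or repeated entries.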

Conclusions

When picking out a new corpus and an extra sentiment lexicon, I expected to see more differences from the textbook example than I did. Selecting another prominent author who wrote a little later than Jane Austen did not change the prominence of the word miss. Additionally, I did not realize that HuLiu was so similar to Bing until performing my analysis.

Sentiment analysis is a powerful tool for quantifying the feelings that words convey in a text. Its results are heavily affected by the choice of lexicon used to score each text. I believe that matching sentiment lexicons to the era and type of work would lead to better analyses of the texts.