Let’s apply our text analysis to works of literature - I’ll look for patterns in the greatest works by Dostoevsky, but feel free to use the same technique to explore any other author or text.

Keep in mind, the more content you analyze, the more accurate your overall results should be. If I looked at the three greatest books by Dostoevsky, I’d potentially find a lot - but my methodology would only allow me to draw conclusions about those three books, not any others. If I want to make grand claims about how Dostoevsky writes, I need to analyze as much of his work as possible. Ideally, we would set a clear standard for our primary sources, one that others could follow when duplicating our research: something like the ‘10 most popular books by Dostoevsky.’ How do we determine that?

Well, Feodor wrote 11 novels in his life, and five seem to be recognized as his greatest works, according to many lists online. Do we analyze all 11, or just the 5 ‘best?’ Do we include his short stories and novellas? Again, the more data the better - and also, after initially loading the data, there’s really no extra work involved in adding more books to the project.

And why Dostoevsky? Why not analyze the Harry Potter books? Well, Feodor is dead, and his works are in the public domain - which means we can connect to the website gutenberg.org and download copies of his books for free. We can’t do that with Harry Potter, as those books are copyrighted. (We could still attempt to find and load that data, but chances are it’d be very messy and hard to analyze if someone did the text conversion themselves.)

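# install.packages(c('gutenbergr', 'tidyverse', 'tidytext'))
library(tidyverse)   # dplyr, tidyr, ggplot2 and the %>% pipe
library(tidytext)    # unnest_tokens(), stop_words, get_sentiments()
library(gutenbergr)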

If you explore gutenberg.org, you’ll find tens of thousands of books in the public domain for analysis. Each book has a number associated with it, most easily found in the URL when looking at a particular novel. I’m going to load Dostoevsky’s books using the gutenbergr package and these numbers.

gutenberg_works(author == "Dostoyevsky, Fyodor") %>%  View()
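View() only opens in an interactive session such as RStudio. As a minimal sketch, assuming the stringr package is installed, the same lookup can be printed to the console by matching on part of the author’s name, which also avoids having to know the exact spelling the catalogue uses:

```r
library(dplyr)
library(stringr)
library(gutenbergr)

# Match any catalogue author containing "Dostoyevsky" and keep the
# two columns needed to pick books: the numeric ID and the title.
# Rows with a missing author are dropped by the filter.
dostoyevsky_ids <- gutenberg_works(str_detect(author, "Dostoyevsky")) %>%
  select(gutenberg_id, title)

print(dostoyevsky_ids)
```

The rows returned depend on the version of the metadata bundled with gutenbergr, so the count may differ from what is described here.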

OK, there they are! While there are 12 results, not all of them are novels - we also have some short story collections. Let’s include them all.

Now to download the full text of each book; gutenbergr loads each one straight into R as a data frame with one row per line of text. Note that I use the ‘mutate’ function of dplyr to add a column with the name of each book - this is so, when we merge all of the books together into one big ‘corpus,’ we can still figure out which book the text came from.

This is a lot of code, but we’re just loading all of these books into R:

# Every download uses the same mirror, so store the URL once
mirror <- "http://mirrors.xmission.com/gutenberg/"

crime     <- gutenberg_download(2554,  mirror = mirror) %>% mutate(title = "Crime & Punishment")
brothers  <- gutenberg_download(28054, mirror = mirror) %>% mutate(title = "The Brothers Karamazov")
notes     <- gutenberg_download(600,   mirror = mirror) %>% mutate(title = "Notes from the Underground")
idiot     <- gutenberg_download(2638,  mirror = mirror) %>% mutate(title = "The Idiot")
demons    <- gutenberg_download(8117,  mirror = mirror) %>% mutate(title = "The Possessed")
gambler   <- gutenberg_download(2197,  mirror = mirror) %>% mutate(title = "The Gambler")
poor      <- gutenberg_download(2302,  mirror = mirror) %>% mutate(title = "Poor Folk")
white     <- gutenberg_download(36034, mirror = mirror) %>% mutate(title = "White Nights and Other Stories")
house     <- gutenberg_download(37536, mirror = mirror) %>% mutate(title = "The House of the Dead")
uncle     <- gutenberg_download(38241, mirror = mirror) %>% mutate(title = "Uncle's Dream; and The Permanent Husband")
short     <- gutenberg_download(40745, mirror = mirror) %>% mutate(title = "Short Stories")
grand     <- gutenberg_download(8578,  mirror = mirror) %>% mutate(title = "The Grand Inquisitor")
stavrogin <- gutenberg_download(57050, mirror = mirror) %>% mutate(title = "Stavrogin's Confession and The Plan of The Life of a Great Sinner")

Now, let’s merge all of the books into one huge corpus. We’ll do this using the function rbind(), which is the easiest way to merge data frames with identical columns, as we have here.

dostoyevsky <- rbind(crime, brothers, notes, idiot, demons, gambler, poor, white, house, uncle, short, grand, stavrogin)
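If typing one download line per book feels repetitive, the download-label-merge steps can be folded into a loop. The sketch below is one possible way to do it, not the approach used in this post: download_titled() is a hypothetical helper I’m introducing for illustration, and only three of the books are listed to keep it short.

```r
library(dplyr)

# Named vector: names are the titles to attach, values are Gutenberg IDs.
# Only three books are listed here; the rest follow the same pattern.
book_ids <- c(
  "Crime & Punishment"         = 2554,
  "The Brothers Karamazov"     = 28054,
  "Notes from the Underground" = 600
)

# download_titled() is a hypothetical helper, not part of gutenbergr.
# `fetch` defaults to gutenberg_download(); extra arguments such as
# `mirror` are passed straight through to it.
download_titled <- function(ids, fetch = gutenbergr::gutenberg_download, ...) {
  books <- lapply(names(ids), function(book_title) {
    fetch(ids[[book_title]], ...) %>%
      mutate(title = book_title)
  })
  bind_rows(books)  # stack the per-book data frames into one corpus
}

# Usage (needs an internet connection, so it is left commented out):
# dostoyevsky <- download_titled(book_ids, mirror = "http://mirrors.xmission.com/gutenberg/")
```

Adding another book then means adding one entry to book_ids rather than a new variable and a longer rbind() call.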

So, what are the most common words used by Dostoevsky?

dostoyevsky %>% 
  unnest_tokens(word, text) -> d_words

d_words %>% 
  anti_join(stop_words) %>% 
  group_by(title) %>% 
  count(word, sort = TRUE) %>% 
  head(20) %>% 
  knitr::kable()

## Joining, by = "word"

title	word	n
The Idiot	prince	1787
The Brothers Karamazov	alyosha	1183
The Brothers Karamazov	mitya	820
The Brothers Karamazov	don’t	789
The Brothers Karamazov	it’s	767
The Brothers Karamazov	father	730
Crime & Punishment	raskolnikov	725
The Brothers Karamazov	ivan	682
The Brothers Karamazov	time	678
The Brothers Karamazov	that’s	610
The Possessed	time	587
The Brothers Karamazov	suddenly	585
The Possessed	pyotr	529
The Possessed	stepan	524
The Idiot	don’t	509
The Brothers Karamazov	day	498
The Possessed	stepanovitch	490
The Brothers Karamazov	cried	476
The Possessed	trofimovitch	476
Uncle’s Dream; and The Permanent Husband	velchaninoff	469
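Notice that contractions such as “don’t,” “it’s” and “that’s” made it past the stop-word filter. A likely cause, though it is an assumption worth checking against the raw text, is that these books use curly apostrophes while the stop_words list uses straight ones, so the two never match. A minimal sketch on a toy sentence, assuming the stringr package, shows one way to straighten the apostrophes before filtering:

```r
library(dplyr)
library(stringr)
library(tidytext)

# Toy input; \u2019 is the curly apostrophe
toy <- tibble(title = "Toy", text = "I don\u2019t trust the prince")

cleaned <- toy %>%
  unnest_tokens(word, text) %>%
  # Straighten curly apostrophes so contractions can match stop_words
  mutate(word = str_replace_all(word, "\u2019", "'")) %>%
  anti_join(stop_words, by = "word")

print(cleaned)
```

The same mutate() line could be placed right after unnest_tokens() when building d_words.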

bigrams

# Split the text into two-word sequences (bigrams), then drop any pair
# that contains a stop word or a missing value before counting
dostoyevsky %>% 
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>% 
  separate(bigram, c("word1", "word2"), sep = " ") %>% 
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word,
         !is.na(word1),
         !is.na(word2)) %>% 
  count(word1, word2, sort = TRUE) -> d_bigrams
  
d_bigrams %>%  
  filter(word2 == "fellow")

## # A tibble: 105 × 3
##    word1     word2      n
##    <chr>     <chr>  <int>
##  1 dear      fellow    61
##  2 poor      fellow    44
##  3 queer     fellow    10
##  4 nice      fellow     9
##  5 clever    fellow     8
##  6 capital   fellow     5
##  7 funny     fellow     5
##  8 low       fellow     5
##  9 decent    fellow     4
## 10 excellent fellow     4
## # … with 95 more rows
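To see what each step of the bigram pipeline does, it can help to run it on a single made-up sentence. This is only a toy sketch (the sentence is invented, not taken from the books): every overlapping pair of words becomes a row, and pairs containing a stop word such as “the” or “a” are dropped.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# One invented sentence standing in for the whole corpus
toy <- tibble(text = "the poor fellow met a dear fellow")

toy_bigrams <- toy %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%  # "the poor", "poor fellow", ...
  separate(bigram, c("word1", "word2"), sep = " ") %>%      # one column per word
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%                   # drop pairs with a stop word
  count(word1, word2, sort = TRUE)

# Which words come directly before "fellow"?
toy_bigrams %>%
  filter(word2 == "fellow")
```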

word clouds

library(wordcloud2)

d_words %>% 
  anti_join(stop_words) %>% 
  count(word, sort = TRUE) %>% 
  wordcloud2()

## Joining, by = "word"

Sentiment!

d_words %>% 
  anti_join(stop_words) %>% 
  inner_join(get_sentiments("bing")) %>% 
  group_by(sentiment) %>% 
  count()

## Joining, by = "word"
## Joining, by = "word"

## # A tibble: 2 × 2
## # Groups:   sentiment [2]
##   sentiment     n
##   <chr>     <int>
## 1 negative  63883
## 2 positive  35217

d_words %>% 
  anti_join(stop_words) %>% 
  inner_join(get_sentiments("afinn")) %>% 
  group_by(value) %>% 
  count(word, sort=TRUE)

## Joining, by = "word"
## Joining, by = "word"

## # A tibble: 1,748 × 3
## # Groups:   value [10]
##    value word        n
##    <dbl> <chr>   <int>
##  1    -2 cried    1828
##  2     3 love     1389
##  3     1 god      1013
##  4     2 dear      852
##  5     1 matter    778
##  6    -1 strange   747
##  7    -2 poor      730
##  8    -2 afraid    700
##  9     2 true      649
## 10     1 feeling   633
## # … with 1,738 more rows

# Average AFINN score per book; a lower average means more negative language
d_words %>% 
  anti_join(stop_words) %>% 
  inner_join(get_sentiments("afinn")) %>% 
  group_by(title) %>% 
  summarize(average = mean(value)) %>% 
  ggplot(aes(reorder(title, -average), average, fill = average)) +
  geom_col() +
  coord_flip() +
  labs(title = "Most Depressing Dostoyevsky Books", x = NULL, y = "Average AFINN score")

## Joining, by = "word"
## Joining, by = "word"
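The per-book average is easier to trust once you’ve seen it on a few words. The sketch below uses four AFINN scores copied from the output earlier in this section and two invented “books”; toy_words is a hypothetical stand-in for d_words. Each matched word contributes its score, and words missing from the lexicon are simply ignored by the inner join.

```r
library(dplyr)

# Four AFINN scores copied from the output earlier in this section
mini_afinn <- tibble(
  word  = c("cried", "love", "poor", "dear"),
  value = c(-2, 3, -2, 2)
)

# Hypothetical token table standing in for d_words
toy_words <- tibble(
  title = c("Book A", "Book A", "Book A", "Book B", "Book B", "Book B", "Book B"),
  word  = c("love", "dear", "cried", "cried", "poor", "love", "prince")
)

toy_scores <- toy_words %>%
  inner_join(mini_afinn, by = "word") %>%   # "prince" has no score and drops out
  group_by(title) %>%
  summarize(average = mean(value),          # mean score of the matched words
            matched = n())                  # how many words were scored

print(toy_scores)
```

Keeping the matched count next to the average is a cheap check on how many words each book’s score actually rests on.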

d_words %>% 
  anti_join(stop_words) %>% 
  inner_join(get_sentiments("bing")) %>% 
  group_by(title) %>% 
  count(sentiment, sort=TRUE)

## Joining, by = "word"
## Joining, by = "word"

## # A tibble: 26 × 3
## # Groups:   title [13]
##    title                                    sentiment     n
##    <chr>                                    <chr>     <int>
##  1 The Brothers Karamazov                   negative  14083
##  2 The Possessed                            negative   9492
##  3 The Idiot                                negative   8361
##  4 Crime & Punishment                       negative   8257
##  5 The Brothers Karamazov                   positive   7903
##  6 The Idiot                                positive   4943
##  7 The Possessed                            positive   4936
##  8 White Nights and Other Stories           negative   4840
##  9 The House of the Dead                    negative   4736
## 10 Uncle's Dream; and The Permanent Husband negative   3578
## # … with 16 more rows
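Raw counts like these mostly track book length: The Brothers Karamazov tops both the negative and the positive rows simply because it is long. A fairer comparison is the share of sentiment words that are negative. As a sketch, here is that calculation for the three books whose positive and negative counts both appear in the rows shown, with the numbers typed in by hand from the output:

```r
library(dplyr)
library(tidyr)

# Counts copied by hand from the first ten rows of the output
bing_counts <- tibble(
  title     = c("The Brothers Karamazov", "The Possessed", "The Idiot",
                "The Brothers Karamazov", "The Idiot", "The Possessed"),
  sentiment = c("negative", "negative", "negative",
                "positive", "positive", "positive"),
  n         = c(14083, 9492, 8361, 7903, 4943, 4936)
)

bing_share <- bing_counts %>%
  pivot_wider(names_from = sentiment, values_from = n) %>%      # one row per book
  mutate(share_negative = negative / (negative + positive)) %>%
  arrange(desc(share_negative))

print(bing_share)
```

On the full data, the same pivot_wider() and mutate() steps can be appended directly to the count(sentiment) result.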

d_words %>% 
  anti_join(stop_words) %>% 
  inner_join(get_sentiments("nrc")) %>% 
  # group_by(title) %>% 
  count(sentiment,  sort=TRUE)

## Joining, by = "word"
## Joining, by = "word"

## # A tibble: 10 × 2
##    sentiment        n
##    <chr>        <int>
##  1 positive     70314
##  2 negative     64890
##  3 trust        41583
##  4 fear         34716
##  5 anticipation 34419
##  6 sadness      32940
##  7 joy          30529
##  8 anger        29825
##  9 disgust      22150
## 10 surprise     19810
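One thing to keep in mind when reading the NRC numbers: the lexicon can attach several labels to the same word, so the ten counts overlap and add up to more than the number of words matched. The sketch below illustrates this with a hypothetical stand-in lexicon; its rows are made up for the example and are not real NRC entries.

```r
library(dplyr)

# Hypothetical stand-in in the same shape as get_sentiments("nrc"):
# one row per word-label pair. Illustrative rows, not real NRC entries.
mini_lexicon <- tibble(
  word      = c("love", "love", "afraid", "afraid"),
  sentiment = c("positive", "joy", "negative", "fear")
)

# Invented tokens: three of the four appear in the lexicon
toy_words <- tibble(word = c("love", "love", "afraid", "prince"))

label_counts <- toy_words %>%
  count(word) %>%                            # one row per distinct word
  inner_join(mini_lexicon, by = "word") %>%  # each word picks up all its labels
  group_by(sentiment) %>%
  summarize(n = sum(n)) %>%                  # total occurrences per label
  arrange(desc(n))

print(label_counts)
```

Three tokens matched the lexicon, yet the label totals sum to six, because each matched word is counted once per label.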

Case Study: Gutenbergr

Brian Walsh

2022-06-01
