Let’s use our text analysis on works of literature - I’ll look for patterns in the greatest works by Dostoevsky, but feel free to use the same technique to explore any other author or text.
Keep in mind, the more content you analyze, the more representative your overall results should be. If I looked at the three greatest books by Dostoevsky, I’d potentially find a lot - but my methodology would only allow me to draw conclusions about those three books, not any others. If I want to make grand claims about how Dostoevsky writes, I need to analyze as much of his work as possible. Ideally, we would set a clear standard for choosing our primary sources, which others could also use if duplicating our research: something like the ‘10 most popular books by Dostoevsky.’ How do we determine that?
Well, Feodor wrote 11 novels in his life, and five seem to be recognized as his greatest works, according to many lists online. Do we analyze all 11, or just the 5 ‘best’? Do we include his short stories and novellas? Again, the more data the better - and once the first few books are loaded, there’s really no extra work involved in adding more books to the project.
And why Dostoevsky? Why not analyze the Harry Potter books? Well, Feodor is dead, and his works are in the public domain - which means we can connect to the website gutenberg.org and download copies of his books for free. We can’t do that with Harry Potter, as those books are copyrighted. (We could still attempt to find and load that data, but chances are it’d be very messy and hard to analyze if someone did the text conversion themselves.)
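# install.packages(c('gutenbergr', 'tidyverse', 'tidytext'))
library(gutenbergr)
library(tidyverse)  # dplyr, tidyr and ggplot2 are used below
library(tidytext)   # unnest_tokens(), stop_words, get_sentiments()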
If you explore gutenberg.org, you’ll find thousands of books in the public domain for analysis. Each book has a number associated with it, most easily found in the URL when looking at a particular novel. I’m going to load Dostoevsky’s books using the gutenbergr package and these numbers.
gutenberg_works(author == "Dostoyevsky, Fyodor") %>% View()
OK, there they are! While there are 12 results, not all of them are novels - we also have some short story collections. Let’s include them all.
Now to download the text of each book. gutenberg_download() returns a data frame with one row per line of text. Note that I use the ‘mutate’ function of dplyr to add a column with the name of each book - this is so, when we merge all of the books together into one big ‘corpus,’ we can still figure out which book the text came from.
This is a lot of code, but we’re just loading all of these books into R:
crime <-gutenberg_download(2554, mirror = "http://mirrors.xmission.com/gutenberg/") %>% mutate('title' = "Crime & Punishment")
brothers <-gutenberg_download(28054, mirror = "http://mirrors.xmission.com/gutenberg/") %>% mutate('title' = "The Brothers Karamazov")
notes <-gutenberg_download(600, mirror = "http://mirrors.xmission.com/gutenberg/") %>% mutate('title' = "Notes from the Underground")
idiot <-gutenberg_download(2638, mirror = "http://mirrors.xmission.com/gutenberg/") %>% mutate('title' = "The Idiot")
demons <-gutenberg_download(8117, mirror = "http://mirrors.xmission.com/gutenberg/") %>% mutate('title' = "The Possessed")
gambler <-gutenberg_download(2197, mirror = "http://mirrors.xmission.com/gutenberg/") %>% mutate('title' = "The Gambler")
poor <-gutenberg_download(2302, mirror = "http://mirrors.xmission.com/gutenberg/") %>% mutate('title' = "Poor Folk")
white <-gutenberg_download(36034, mirror = "http://mirrors.xmission.com/gutenberg/") %>% mutate('title' = "White Nights and Other Stories")
house <-gutenberg_download(37536, mirror = "http://mirrors.xmission.com/gutenberg/") %>% mutate('title' = "The House of the Dead")
uncle <-gutenberg_download(38241, mirror = "http://mirrors.xmission.com/gutenberg/") %>% mutate('title' = "Uncle's Dream; and The Permanent Husband")
short <-gutenberg_download(40745, mirror = "http://mirrors.xmission.com/gutenberg/") %>% mutate('title' = "Short Stories")
grand <-gutenberg_download(8578, mirror = "http://mirrors.xmission.com/gutenberg/") %>% mutate('title' = "The Grand Inquisitor")
stavrogin <-gutenberg_download(57050, mirror = "http://mirrors.xmission.com/gutenberg/") %>% mutate('title' = "Stavrogin's Confession and The Plan of The Life of a Great Sinner")
Now, let’s merge all of the books into one huge corpus. We’ll do this using the function rbind(), which stacks data frames on top of one another and is the easiest way to combine data frames with identical columns, as we have here.
dostoyevsky <- rbind(crime,brothers, notes, idiot, demons, gambler, poor, white, house, uncle, short, grand, stavrogin)
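As a small illustration of what rbind() is doing here, the sketch below stacks two toy data frames. The contents of book_a and book_b are invented for the example; only the column layout is meant to resemble what gutenberg_download() plus our title column gives us.

```r
# Toy stand-ins for two downloaded books (invented text, hypothetical names)
book_a <- data.frame(gutenberg_id = 1L,
                     text = c("first line", "second line"),
                     title = "Book A",
                     stringsAsFactors = FALSE)
book_b <- data.frame(gutenberg_id = 2L,
                     text = "only line",
                     title = "Book B",
                     stringsAsFactors = FALSE)

# rbind() stacks the rows; both data frames must have the same column names
corpus <- rbind(book_a, book_b)

nrow(corpus)         # rows of book_a plus rows of book_b
table(corpus$title)  # the title column still records where each row came from
```

If the columns ever differ between books, dplyr's bind_rows() is a possible alternative, since it fills missing columns with NA instead of stopping with an error.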
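So, what are the most common words used by Dostoevsky? The code below counts words within each book, so every row of the result is a word paired with the title it appears in.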
dostoyevsky %>%
unnest_tokens(word, text) -> d_words
d_words %>%
anti_join(stop_words) %>%
group_by(title) %>%
count(word, sort = TRUE) %>%
head(20) %>%
knitr::kable()
## Joining, by = "word"
title | word | n |
---|---|---|
The Idiot | prince | 1787 |
The Brothers Karamazov | alyosha | 1183 |
The Brothers Karamazov | mitya | 820 |
The Brothers Karamazov | don’t | 789 |
The Brothers Karamazov | it’s | 767 |
The Brothers Karamazov | father | 730 |
Crime & Punishment | raskolnikov | 725 |
The Brothers Karamazov | ivan | 682 |
The Brothers Karamazov | time | 678 |
The Brothers Karamazov | that’s | 610 |
The Possessed | time | 587 |
The Brothers Karamazov | suddenly | 585 |
The Possessed | pyotr | 529 |
The Possessed | stepan | 524 |
The Idiot | don’t | 509 |
The Brothers Karamazov | day | 498 |
The Possessed | stepanovitch | 490 |
The Brothers Karamazov | cried | 476 |
The Possessed | trofimovitch | 476 |
Uncle’s Dream; and The Permanent Husband | velchaninoff | 469 |
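
Because the table above ranks title-and-word pairs together, the longest novels tend to crowd out the shorter ones. The sketch below shows two other views on a toy word table: overall counts across the whole corpus, and the top word within each title. The toy_words data is invented for illustration, and slice_max() assumes a reasonably recent version of dplyr.

```r
library(dplyr)

# Toy word table (invented), shaped like d_words after stop words are removed
toy_words <- tibble(
  title = c("A", "A", "A", "A", "B", "B", "B"),
  word  = c("prince", "prince", "time", "prince", "time", "father", "father")
)

# Overall frequency across the whole corpus, ignoring which book a word is from
overall <- toy_words %>%
  count(word, sort = TRUE)

# The single most frequent word within each title
per_title <- toy_words %>%
  count(title, word) %>%
  group_by(title) %>%
  slice_max(n, n = 1, with_ties = FALSE) %>%
  ungroup()

overall
per_title
```

Swapping toy_words for the stop-word-filtered d_words, and raising n = 1 to something like n = 5, should give a short list of characteristic words for every book rather than only the longest ones.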
#bigrams!
dostoyevsky %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  # drop pairs containing a stop word, and rows where no bigram was produced
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word,
         !is.na(word1), !is.na(word2)) %>%
  count(word1, word2, sort = TRUE) -> d_bigrams
d_bigrams %>%
filter(word2 == "fellow")
## # A tibble: 105 × 3
## word1 word2 n
## <chr> <chr> <int>
## 1 dear fellow 61
## 2 poor fellow 44
## 3 queer fellow 10
## 4 nice fellow 9
## 5 clever fellow 8
## 6 capital fellow 5
## 7 funny fellow 5
## 8 low fellow 5
## 9 decent fellow 4
## 10 excellent fellow 4
## # … with 95 more rows
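
To see what the bigram tokenizer is doing, here is a minimal sketch on two invented lines of text. It assumes the tidytext and tibble packages are installed. The thing to notice is that bigrams are built within each row, so a pair of words split across two lines of the downloaded text is not counted.

```r
library(tibble)
library(tidytext)

# Two invented lines, standing in for two rows of the downloaded text
toy_text <- tibble(text = c("my dear fellow", "poor fellow"))

# Each row is tokenized on its own into overlapping two-word sequences
toy_bigrams <- toy_text %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

toy_bigrams$bigram
```

No "fellow poor" pair appears, because those two words sit on different rows. For the real corpus this means the bigram counts are likely a slight undercount.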
word clouds
library(wordcloud2)
d_words %>%
anti_join(stop_words) %>%
count(word, sort = TRUE) %>%
wordcloud2()
## Joining, by = "word"
Sentiment!
d_words %>%
anti_join(stop_words) %>%
inner_join(get_sentiments("bing")) %>%
group_by(sentiment) %>%
count()
## Joining, by = "word"
## Joining, by = "word"
## # A tibble: 2 × 2
## # Groups: sentiment [2]
## sentiment n
## <chr> <int>
## 1 negative 63883
## 2 positive 35217
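
Raw counts are easier to compare as shares. The short sketch below reuses the two totals printed above and converts them to proportions; bing_counts and shares are just illustrative names.

```r
# Totals copied from the bing output above
bing_counts <- c(negative = 63883, positive = 35217)

# Share of all sentiment-bearing words that fall in each category
shares <- bing_counts / sum(bing_counts)

round(shares, 3)
```

So roughly 64% of the words that the bing lexicon recognizes in this corpus are tagged negative.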
d_words %>%
anti_join(stop_words) %>%
inner_join(get_sentiments("afinn")) %>%
group_by(value) %>%
count(word, sort=TRUE)
## Joining, by = "word"
## Joining, by = "word"
## # A tibble: 1,748 × 3
## # Groups: value [10]
## value word n
## <dbl> <chr> <int>
## 1 -2 cried 1828
## 2 3 love 1389
## 3 1 god 1013
## 4 2 dear 852
## 5 1 matter 778
## 6 -1 strange 747
## 7 -2 poor 730
## 8 -2 afraid 700
## 9 2 true 649
## 10 1 feeling 633
## # … with 1,738 more rows
d_words %>%
anti_join(stop_words) %>%
inner_join(get_sentiments("afinn")) %>%
group_by(title) %>%
#count(word, value, sort=TRUE) %>%
summarize(average = mean(value)) %>%
ggplot(aes(reorder(title, -average), average, fill=average)) + geom_col() + coord_flip() +
ggtitle("Most Depressing Dostoyevsky Books")
## Joining, by = "word"
## Joining, by = "word"
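
It helps to see what the average in that chart is an average of. The sketch below repeats the same join-and-summarize steps on toy data. toy_lexicon is a hypothetical three-word lexicon (its values match the AFINN scores shown earlier for those words), and toy_words is invented.

```r
library(dplyr)

# Hypothetical mini-lexicon; the real AFINN lexicon has far more entries
toy_lexicon <- tibble(word  = c("cried", "love", "poor"),
                      value = c(-2, 3, -2))

# Invented words from two invented books
toy_words <- tibble(
  title = c("A", "A", "A", "B", "B"),
  word  = c("cried", "poor", "table", "love", "cried")
)

# inner_join() keeps only words found in the lexicon, so "table" drops out
toy_scores <- toy_words %>%
  inner_join(toy_lexicon, by = "word") %>%
  group_by(title) %>%
  summarize(average = mean(value), scored_words = n())

toy_scores
```

The average covers only the words the lexicon recognizes, not every word in the book, so it is worth keeping a count like scored_words alongside it when comparing titles of very different lengths.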
d_words %>%
anti_join(stop_words) %>%
inner_join(get_sentiments("bing")) %>%
group_by(title) %>%
count(sentiment, sort=TRUE)
## Joining, by = "word"
## Joining, by = "word"
## # A tibble: 26 × 3
## # Groups: title [13]
## title sentiment n
## <chr> <chr> <int>
## 1 The Brothers Karamazov negative 14083
## 2 The Possessed negative 9492
## 3 The Idiot negative 8361
## 4 Crime & Punishment negative 8257
## 5 The Brothers Karamazov positive 7903
## 6 The Idiot positive 4943
## 7 The Possessed positive 4936
## 8 White Nights and Other Stories negative 4840
## 9 The House of the Dead negative 4736
## 10 Uncle's Dream; and The Permanent Husband negative 3578
## # … with 16 more rows
d_words %>%
anti_join(stop_words) %>%
inner_join(get_sentiments("nrc")) %>%
# group_by(title) %>%
count(sentiment, sort=TRUE)
## Joining, by = "word"
## Joining, by = "word"
## # A tibble: 10 × 2
## sentiment n
## <chr> <int>
## 1 positive 70314
## 2 negative 64890
## 3 trust 41583
## 4 fear 34716
## 5 anticipation 34419
## 6 sadness 32940
## 7 joy 30529
## 8 anger 29825
## 9 disgust 22150
## 10 surprise 19810