[You do not have to include any graphs here, but you can for your own notes if you wish!]
Download 3+ books by some author on Project Gutenberg. Jane Austen, Victor Hugo, Lucy Maud Montgomery, Arthur Conan Doyle, Mark Twain, Henry David Thoreau, Fyodor Dostoyevsky, Leo Tolstoy. Anyone. Just make sure it’s all from the same author. If you are able to find downloadable books elsewhere (like the Harry Potter example from class) and prefer to do that, that’s OK too! You just might have to do a bit more start up work to get the data into a tidy format.
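Make these two plots and describe what each tells you about your author’s books: the top 10 most unique words in each book (by tf-idf) and the top 10 most frequent bigrams in each book.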
You can knit your .Rmd file to any file format you wish!
# Packages used by the code below
library(tidyverse)
library(tidytext)
library(gutenbergr)
# Download your books here -- the chunk options
jane_austen_raw <- gutenberg_download(c(1342, 158, 161),
  mirror = "http://mirrors.xmission.com/gutenberg/", meta_fields = "title")
head(jane_austen_raw)
## # A tibble: 6 x 3
## gutenberg_id text title
## <int> <chr> <chr>
## 1 158 "EMMA" Emma
## 2 158 "" Emma
## 3 158 "By Jane Austen" Emma
## 4 158 "" Emma
## 5 158 "" Emma
## 6 158 "" Emma
ja_words <- jane_austen_raw %>%
  drop_na(text) %>%
  unnest_tokens(word, text)
top_words_ja <- ja_words %>%
  # Remove stop words
  anti_join(stop_words) %>%
  # Count all the words in each book
  count(title, word, sort = TRUE) %>%
  # Keep top 10 in each book
  group_by(title) %>%
  top_n(10) %>%
  ungroup() %>%
  # Make the words an ordered factor so they plot in order
  mutate(word = fct_inorder(word))
top_words_ja
## # A tibble: 30 x 3
## title word n
## <chr> <fct> <int>
## 1 Emma emma 786
## 2 Sense and Sensibility elinor 622
## 3 Emma miss 599
## 4 Pride and Prejudice elizabeth 597
## 5 Sense and Sensibility marianne 492
## 6 Emma harriet 415
## 7 Emma weston 389
## 8 Pride and Prejudice darcy 373
## 9 Emma knightley 356
## 10 Emma elton 319
## # … with 20 more rows
ja_words_filtered <- ja_words %>%
  # Remove stop words
  anti_join(stop_words) %>%
  # Count all the words in each book
  count(title, word, sort = TRUE)
## Joining, by = "word"
# Add the tf-idf values to the counts
ja_tf_idf <- ja_words_filtered %>%
  bind_tf_idf(word, title, n)
# Get the 10 most unique words (highest tf-idf) in each book
ja_tf_idf_plot <- ja_tf_idf %>%
  arrange(desc(tf_idf)) %>%
  group_by(title) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = fct_inorder(word))
## Selecting by tf_idf
# lab12_theme is assumed to be defined earlier in the lab
ggplot(ja_tf_idf_plot, aes(y = fct_rev(word), x = tf_idf, fill = title)) +
  geom_col() +
  guides(fill = "none") +
  # The x-axis shows tf-idf scores, not counts; the y-axis lists the words
  labs(x = "tf-idf", y = NULL,
       title = "Top 10 most unique words in Jane Austen's novels") +
  facet_wrap(vars(title), scales = "free_y") +
  lab12_theme +
  scale_fill_viridis_d(option = "C", end = .75)
The top 10 most unique words in Emma, Pride and Prejudice, and Sense and Sensibility are all names. This makes sense: a character’s name appears often in one novel and rarely or never in the others, so it earns a high tf-idf score, and Austen’s novels usually feature many characters who all have some sort of relationship with each other. The top-ranked name in each book is most likely the main character, since that name comes up most often throughout the story.
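To see why names rise to the top, here is a minimal sketch of the tf-idf calculation done by hand on made-up counts. The table `toy_counts` and its values are hypothetical, and the sketch assumes the usual definition: term frequency is a word's share of its book, and inverse document frequency is the natural log of (number of books / number of books containing the word).

```r
library(dplyr)

# Hypothetical toy counts: one row per (book, word) pair
toy_counts <- tibble(
  title = c("Book A", "Book A", "Book B", "Book B"),
  word  = c("emma", "letter", "darcy", "letter"),
  n     = c(3, 1, 2, 2)
)

n_books <- n_distinct(toy_counts$title)

toy_tf_idf <- toy_counts %>%
  # Term frequency: share of a book's words taken up by this word
  group_by(title) %>%
  mutate(tf = n / sum(n)) %>%
  # Inverse document frequency: log(total books / books containing the word)
  group_by(word) %>%
  mutate(idf = log(n_books / n_distinct(title))) %>%
  ungroup() %>%
  mutate(tf_idf = tf * idf)

# "letter" appears in both books, so its idf (and tf-idf) is 0;
# "emma" and "darcy" each appear in one book only, so they score above 0
toy_tf_idf
```

A word that shows up in every book gets a score of zero no matter how often it is used, which is why common vocabulary drops out and book-specific names remain.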
ja_bigrams <- jane_austen_raw %>%
  drop_na(text) %>%
  # n = 2 here means bigrams. We could also make trigrams (n = 3) or any type of n-gram
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  # Split the bigrams into two words so we can remove stopwords
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  # Put the two word columns back together
  unite(bigram, word1, word2, sep = " ")
ja_bigrams
## # A tibble: 28,716 x 3
## gutenberg_id title bigram
## <int> <chr> <chr>
## 1 158 Emma NA NA
## 2 158 Emma NA NA
## 3 158 Emma jane austen
## 4 158 Emma NA NA
## 5 158 Emma NA NA
## 6 158 Emma NA NA
## 7 158 Emma NA NA
## 8 158 Emma NA NA
## 9 158 Emma NA NA
## 10 158 Emma NA NA
## # … with 28,706 more rows
top_bigrams <- ja_bigrams %>%
  # Drop the empty bigrams produced by blank lines
  filter(bigram != "NA NA") %>%
  # Count all the bigrams in each book
  count(title, bigram, sort = TRUE) %>%
  # Keep top 10 in each book
  group_by(title) %>%
  top_n(10) %>%
  ungroup() %>%
  # Make the bigrams an ordered factor so they plot in order
  mutate(bigram = fct_inorder(bigram))
## Selecting by n
# lab12_theme is assumed to be defined earlier in the lab
ggplot(top_bigrams, aes(y = fct_rev(bigram), x = n, fill = title)) +
  geom_col() +
  guides(fill = "none") +
  # Counts are on the x-axis; the y-axis lists the bigrams
  labs(x = "Count", y = NULL,
       title = "10 most frequent bigrams in Jane Austen's novels") +
  facet_wrap(vars(title), scales = "free") +
  lab12_theme +
  scale_fill_viridis_d(option = "C", end = .75)
The top 10 bigrams across Jane Austen’s novels are mostly names with honorifics attached. This suggests that her stories take place in Britain in an earlier century (most of her novels were published in the 1810s). As mentioned before, the stories are also driven by the relationships between characters, as shown by all the names in the plot above.
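As a small illustration of how the bigram pipeline works, the sketch below runs the same split, filter, and unite steps on a single made-up sentence. The sentence and the three-word `toy_stop_words` vector are hypothetical stand-ins; the analysis above uses the full `stop_words` table from tidytext instead.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical one-line text and a tiny stand-in stop word list
toy_text <- tibble(title = "Toy", text = "Miss Bates read the letter")
toy_stop_words <- c("the", "a", "of")

toy_bigrams <- toy_text %>%
  # Overlapping two-word windows, lowercased by default
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  # Split each bigram so both halves can be checked against the stop words
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% toy_stop_words,
         !word2 %in% toy_stop_words) %>%
  # Put the two word columns back together
  unite(bigram, word1, word2, sep = " ")

# "read the" and "the letter" are dropped because they contain "the"
toy_bigrams$bigram
```

Because a bigram is removed when either word is a stop word, the pairs that survive tend to be a title plus a surname or two content words, which matches the pattern in the plot.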