Part 1

[You do not have to include any graphs here, but you can for your own notes if you wish!]

Part 2: Your turn!

Download 3+ books by some author on Project Gutenberg. Jane Austen, Victor Hugo, Lucy Maud Montgomery, Arthur Conan Doyle, Mark Twain, Henry David Thoreau, Fyodor Dostoyevsky, Leo Tolstoy. Anyone. Just make sure it’s all from the same author. If you are able to find downloadable books elsewhere (like the Harry Potter example from class) and prefer to do that, that’s OK too! You just might have to do a bit more start up work to get the data into a tidy format.

Make these two plots and describe what each one tells you about your author’s books:

  1. Top 10 most frequent words in each book OR top 10 most unique words in each book (i.e. tf-idf)
  2. Any other plot of your choice

You can knit your .Rmd file to any file format you wish!

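# Packages used below (skip if they are already loaded in a setup chunk)
library(tidyverse)
library(tidytext)
library(gutenbergr)

# Download your books here -- the chunk options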
jane_austen_raw <- gutenberg_download(c(1342,158,161), mirror = "http://mirrors.xmission.com/gutenberg/", meta_fields = "title")

head(jane_austen_raw)
## # A tibble: 6 x 3
##   gutenberg_id text             title
##          <int> <chr>            <chr>
## 1          158 "EMMA"           Emma 
## 2          158 ""               Emma 
## 3          158 "By Jane Austen" Emma 
## 4          158 ""               Emma 
## 5          158 ""               Emma 
## 6          158 ""               Emma
ja_words <- jane_austen_raw %>% 
  drop_na(text) %>% 
  unnest_tokens(word, text)


top_words_ja <- ja_words %>% 
  # Remove stop words
  anti_join(stop_words) %>% 
  # Count all the words in each book
  count(title, word, sort = TRUE) %>% 
  # Keep top 10 in each book
  group_by(title) %>% 
  top_n(10) %>% 
  ungroup() %>% 
  # Make the words an ordered factor so they plot in order
  mutate(word = fct_inorder(word))
top_words_ja
## # A tibble: 30 x 3
##    title                 word          n
##    <chr>                 <fct>     <int>
##  1 Emma                  emma        786
##  2 Sense and Sensibility elinor      622
##  3 Emma                  miss        599
##  4 Pride and Prejudice   elizabeth   597
##  5 Sense and Sensibility marianne    492
##  6 Emma                  harriet     415
##  7 Emma                  weston      389
##  8 Pride and Prejudice   darcy       373
##  9 Emma                  knightley   356
## 10 Emma                  elton       319
## # … with 20 more rows
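
top_n() is superseded in recent dplyr releases in favor of slice_max(). As a minimal sketch on made-up counts (the toy_counts table and its numbers are hypothetical, not taken from the novels), the same per-book selection could be written like this:

```r
library(dplyr)

# Hypothetical word counts for two made-up books
toy_counts <- tibble(
  title = c("Book A", "Book A", "Book A", "Book B", "Book B", "Book B"),
  word  = c("emma", "miss", "harriet", "darcy", "elizabeth", "bennet"),
  n     = c(5, 3, 1, 4, 6, 2)
)

# Keep the 2 most frequent words in each book
toy_top <- toy_counts %>%
  group_by(title) %>%
  slice_max(n, n = 2) %>%
  ungroup()

toy_top
```

Like top_n(), slice_max() keeps ties by default, so a book can come back with more rows than requested when several words share the cutoff count.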

Plot 1:

ja_words_filtered <- ja_words %>% 
  # Remove stop words
  anti_join(stop_words) %>% 
  # Count all the words in each book
  count(title, word, sort = TRUE) 
## Joining, by = "word"
# Add the tf-idf values to the counts
ja_tf_idf <- ja_words_filtered %>% 
  bind_tf_idf(word, title, n)

# Get the top 10 most unique words (highest tf-idf) in each book
ja_tf_idf_plot <- ja_tf_idf %>% 
  arrange(desc(tf_idf)) %>% 
  group_by(title) %>% 
  top_n(10) %>% 
  ungroup() %>% 
  mutate(word = fct_inorder(word))
## Selecting by tf_idf
ggplot(ja_tf_idf_plot, aes(y = fct_rev(word), x = tf_idf, fill = title)) + 
  geom_col() + 
  # "none" replaces the deprecated guides(fill = FALSE)
  guides(fill = "none") +
  # tf-idf is on the x-axis; the y-axis just lists the words
  labs(x = "tf-idf", y = NULL, 
       title = "Top 10 most unique words in Jane Austen's novels") +
  facet_wrap(vars(title), scales = "free_y") +
  lab12_theme +
  scale_fill_viridis_d(option = "C", end = .75)

The top 10 most unique words (highest tf-idf) in Emma, Pride and Prejudice, and Sense and Sensibility are all names. This makes sense: a character’s name shows up often in one novel and rarely or never in the others, and Austen’s novels usually involve many characters who all have some sort of relationship with each other. The top-ranked name in each book is most likely the main character, since that name would come up the most throughout the story.
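
To see why names score so high, here is a small hand calculation of tf-idf, assuming the usual definition with a natural-log idf (which should match what bind_tf_idf() computes). The toy_counts table is hypothetical and only there for illustration:

```r
library(dplyr)

# Hypothetical counts: "house" appears in both books, the names in only one
toy_counts <- tibble(
  title = c("Book A", "Book A", "Book B", "Book B"),
  word  = c("emma", "house", "darcy", "house"),
  n     = c(3, 1, 2, 2)
)

n_books <- n_distinct(toy_counts$title)

toy_tf_idf <- toy_counts %>%
  # tf: share of the book's words that are this word
  group_by(title) %>%
  mutate(tf = n / sum(n)) %>%
  # idf: log of (number of books / number of books containing the word)
  group_by(word) %>%
  mutate(idf = log(n_books / n_distinct(title))) %>%
  ungroup() %>%
  mutate(tf_idf = tf * idf)

toy_tf_idf
```

A word that appears in every book gets an idf of log(1) = 0, so its tf-idf is 0 no matter how frequent it is. A name that appears in only one book keeps a positive idf, which is why names float to the top.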

Plot 2:

ja_bigrams <- jane_austen_raw %>% 
  drop_na(text) %>% 
  # n = 2 here means bigrams. We could also make trigrams (n = 3) or any type of n-gram
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>% 
  # Split the bigrams into two words so we can remove stopwords
  separate(bigram, c("word1", "word2"), sep = " ") %>% 
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>% 
  # Put the two word columns back together
  unite(bigram, word1, word2, sep = " ")
ja_bigrams
## # A tibble: 28,716 x 3
##    gutenberg_id title bigram     
##           <int> <chr> <chr>      
##  1          158 Emma  NA NA      
##  2          158 Emma  NA NA      
##  3          158 Emma  jane austen
##  4          158 Emma  NA NA      
##  5          158 Emma  NA NA      
##  6          158 Emma  NA NA      
##  7          158 Emma  NA NA      
##  8          158 Emma  NA NA      
##  9          158 Emma  NA NA      
## 10          158 Emma  NA NA      
## # … with 28,706 more rows
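
The "NA NA" rows most likely come from lines with fewer than two words (blank lines, for example), where no bigram can be formed and unite() then pastes the two missing values together as text. A minimal sketch of an alternative, using a hypothetical toy_bigrams table, is to drop the missing values before uniting instead of filtering on the string "NA NA" afterwards:

```r
library(dplyr)
library(tidyr)

# Hypothetical bigram column with one missing value, as a blank line would give
toy_bigrams <- tibble(bigram = c("jane austen", NA, "miss bates"))

toy_clean <- toy_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  # Drop rows where either half of the bigram is missing
  filter(!is.na(word1), !is.na(word2)) %>%
  unite(bigram, word1, word2, sep = " ")

toy_clean
```

This avoids ever creating the "NA NA" strings, at the cost of one extra filter() step.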
top_bigrams <- ja_bigrams %>% 
  filter(bigram != "NA NA") %>%
  # Count all the bigrams in each book
  count(title, bigram, sort = TRUE) %>% 
  # Keep top 10 in each book
  group_by(title) %>% 
  top_n(10) %>% 
  ungroup() %>% 
  # Make the bigrams an ordered factor so they plot in order
  mutate(bigram = fct_inorder(bigram))
## Selecting by n
ggplot(top_bigrams, aes(y = fct_rev(bigram), x = n, fill = title)) + 
  geom_col() + 
  # "none" replaces the deprecated guides(fill = FALSE)
  guides(fill = "none") +
  # The counts are on the x-axis; the y-axis just lists the bigrams
  labs(x = "Count", y = NULL, 
       title = "10 most frequent bigrams in Jane Austen's novels") +
  facet_wrap(vars(title), scales = "free") + 
  lab12_theme + 
  scale_fill_viridis_d(option = "C", end = .75)

The top 10 bigrams in each of Jane Austen’s novels consist of names with honorifics attached to them. The honorifics tell us that Austen’s stories take place in Britain during an earlier century (most of her novels were published in the 1810s). As mentioned before, the stories are also very driven by character relationships, which is evident from all the names listed in the plot above.