Week 10 Assignment – Text Mining

Import Libraries

library(tidytext)

## Warning: package 'tidytext' was built under R version 4.1.3

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Emily Dickinson wrote this text:

text <- c("Because I could not stop for Death -",
          "He kindly stopped for me -",
          "The Carriage held but just Ourselves -",
          "and Immortality")

To turn this text into a tidy text dataset, we first need to put it into a data frame.

library(dplyr)
text_df <- tibble(line = 1:4, text = text)

text_df

## # A tibble: 4 x 2
##    line text                                  
##   <int> <chr>                                 
## 1     1 Because I could not stop for Death -  
## 2     2 He kindly stopped for me -            
## 3     3 The Carriage held but just Ourselves -
## 4     4 and Immortality

To break the text into individual tokens (a process called tokenization) and transform it into a tidy data structure, we use the unnest_tokens() function from the tidytext package.

library(tidytext)

text_df %>%
  unnest_tokens(word, text)

## # A tibble: 20 x 2
##     line word       
##    <int> <chr>      
##  1     1 because    
##  2     1 i          
##  3     1 could      
##  4     1 not        
##  5     1 stop       
##  6     1 for        
##  7     1 death      
##  8     2 he         
##  9     2 kindly     
## 10     2 stopped    
## 11     2 for        
## 12     2 me         
## 13     3 the        
## 14     3 carriage   
## 15     3 held       
## 16     3 but        
## 17     3 just       
## 18     3 ourselves  
## 19     4 and        
## 20     4 immortality
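As a minimal sketch of what unnest_tokens() does by default, the example below tokenizes the first two lines of the poem twice: once with the defaults, and once with to_lower = FALSE to keep the original capitalization. The object names tokens_lower and tokens_cased are ours, chosen only for illustration.

```r
library(dplyr)
library(tidytext)

# First two lines of the poem, one row per line
text_df <- tibble(line = 1:2,
                  text = c("Because I could not stop for Death -",
                           "He kindly stopped for me -"))

# Default behaviour: punctuation is stripped and tokens are lowercased
tokens_lower <- text_df %>%
  unnest_tokens(word, text)

# Same tokenization, but keep the original capitalization
tokens_cased <- text_df %>%
  unnest_tokens(word, text, to_lower = FALSE)

tokens_lower
tokens_cased
```

The other columns (here, line) are retained for every token, so each word can still be traced back to the line it came from.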

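Next we tidy the works of Jane Austen. We add a linenumber column to keep track of lines in the original format, and a chapter column (found with a regular expression) to record which chapter each line belongs to.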

library(janeaustenr)

## Warning: package 'janeaustenr' was built under R version 4.1.3

library(dplyr)
library(stringr)

## Warning: package 'stringr' was built under R version 4.1.2

original_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, 
                                     regex("^chapter [\\divxlc]",
                                           ignore_case = TRUE)))) %>%
  ungroup()
 

original_books

## # A tibble: 73,422 x 4
##    text                    book                linenumber chapter
##    <chr>                   <fct>                    <int>   <int>
##  1 "SENSE AND SENSIBILITY" Sense & Sensibility          1       0
##  2 ""                      Sense & Sensibility          2       0
##  3 "by Jane Austen"        Sense & Sensibility          3       0
##  4 ""                      Sense & Sensibility          4       0
##  5 "(1811)"                Sense & Sensibility          5       0
##  6 ""                      Sense & Sensibility          6       0
##  7 ""                      Sense & Sensibility          7       0
##  8 ""                      Sense & Sensibility          8       0
##  9 ""                      Sense & Sensibility          9       0
## 10 "CHAPTER 1"             Sense & Sensibility         10       1
## # ... with 73,412 more rows
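The chapter column relies on a small trick: str_detect() returns TRUE on each chapter heading, and cumsum() turns that logical vector into a running chapter count. The sketch below shows the idea on a handful of made-up lines (the lines vector is a toy example, not taken from austen_books()).

```r
library(stringr)

# Toy input: a title, two chapter headings in different styles, and body text
lines <- c("SENSE AND SENSIBILITY",
           "",
           "CHAPTER 1",
           "The family of Dashwood had long been settled in Sussex.",
           "CHAPTER II",
           "more text",
           "Chapter 3")

# TRUE where a line starts with "chapter" followed by a digit or roman numeral
is_heading <- str_detect(lines, regex("^chapter [\\divxlc]", ignore_case = TRUE))

# Running count of headings seen so far; lines before the first heading get 0
chapter <- cumsum(is_heading)

data.frame(lines, is_heading, chapter)
```

This is why the front matter of each book shows chapter 0 in the output above: no heading has been seen yet at that point.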

library(tidytext)

tidy_books <- original_books %>%
  unnest_tokens(word, text)

tidy_books

## # A tibble: 725,055 x 4
##    book                linenumber chapter word       
##    <fct>                    <int>   <int> <chr>      
##  1 Sense & Sensibility          1       0 sense      
##  2 Sense & Sensibility          1       0 and        
##  3 Sense & Sensibility          1       0 sensibility
##  4 Sense & Sensibility          3       0 by         
##  5 Sense & Sensibility          3       0 jane       
##  6 Sense & Sensibility          3       0 austen     
##  7 Sense & Sensibility          5       0 1811       
##  8 Sense & Sensibility         10       1 chapter    
##  9 Sense & Sensibility         10       1 1          
## 10 Sense & Sensibility         13       1 the        
## # ... with 725,045 more rows

We can remove stop words (kept in the tidytext dataset stop_words) with an anti_join().

data(stop_words)

tidy_books <- tidy_books %>%
  anti_join(stop_words)

## Joining, by = "word"

tidy_books

## # A tibble: 217,609 x 4
##    book                linenumber chapter word       
##    <fct>                    <int>   <int> <chr>      
##  1 Sense & Sensibility          1       0 sense      
##  2 Sense & Sensibility          1       0 sensibility
##  3 Sense & Sensibility          3       0 jane       
##  4 Sense & Sensibility          3       0 austen     
##  5 Sense & Sensibility          5       0 1811       
##  6 Sense & Sensibility         10       1 chapter    
##  7 Sense & Sensibility         10       1 1          
##  8 Sense & Sensibility         13       1 family     
##  9 Sense & Sensibility         13       1 dashwood   
## 10 Sense & Sensibility         13       1 settled    
## # ... with 217,599 more rows
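To see what anti_join() is doing here, consider the following sketch. It uses a tiny, hypothetical stop-word list (toy_stops) rather than the real stop_words dataset, and passes by = "word" explicitly, which also silences the "Joining, by" message.

```r
library(dplyr)

# Toy tokens and a toy stop-word list (both invented for illustration)
words     <- tibble(word = c("the", "family", "of", "dashwood", "had", "settled"))
toy_stops <- tibble(word = c("the", "of", "had"))

# Keep only the rows of `words` that have no match in `toy_stops`
kept <- anti_join(words, toy_stops, by = "word")

kept
```

Only "family", "dashwood", and "settled" remain, in their original order.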

We can also use dplyr’s count() to find the most common words in all the books as a whole.

tidy_books %>%
  count(word, sort = TRUE)

## # A tibble: 13,914 x 2
##    word       n
##    <chr>  <int>
##  1 miss    1855
##  2 time    1337
##  3 fanny    862
##  4 dear     822
##  5 lady     817
##  6 sir      806
##  7 day      797
##  8 emma     787
##  9 sister   727
## 10 house    699
## # ... with 13,904 more rows
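For readers less familiar with count(), the sketch below shows that count(word, sort = TRUE) behaves like a group_by(), summarise(), and arrange() chain. The words in toy are invented and the object names are ours.

```r
library(dplyr)

# Toy tokens (invented for illustration)
toy <- tibble(word = c("miss", "time", "miss", "dear", "miss", "time"))

# Shorthand: count occurrences and sort descending
counted <- toy %>%
  count(word, sort = TRUE)

# The same result spelled out step by step
manual <- toy %>%
  group_by(word) %>%
  summarise(n = n()) %>%
  arrange(desc(n))

counted
manual
```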

Because we are using tidy tools, our word counts are stored in a tidy data frame. This allows us to pipe them directly into ggplot2 to plot the most common words.

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.1.2

tidy_books %>%
  count(word, sort = TRUE) %>%
  filter(n > 600) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word)) +
  geom_col() +
  labs(y = NULL)
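The mutate(word = reorder(word, n)) step is what sorts the bars by frequency. As a small sketch, using three of the counts from the table above, reorder() converts the words to a factor whose levels are ordered by n, and ggplot2 draws factor levels in that order.

```r
# Three rows taken from the word counts shown earlier
counts <- data.frame(word = c("miss", "time", "fanny"),
                     n    = c(1855, 1337, 862))

# reorder() returns a factor with levels sorted by ascending n
f <- reorder(counts$word, counts$n)

levels(f)
```

Without this step, the words would be treated as plain character values and plotted in alphabetical order.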

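The gutenbergr package

The gutenbergr package provides access to public-domain works from Project Gutenberg. We download four H.G. Wells novels (The Time Machine, The War of the Worlds, The Invisible Man, and The Island of Doctor Moreau) using their Project Gutenberg ID numbers.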

library(gutenbergr)

## Warning: package 'gutenbergr' was built under R version 4.1.3

hgwells <- gutenberg_download(c(35, 36, 5230, 159))

## Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest

## Using mirror http://aleph.gutenberg.org

hgwells

## # A tibble: 20,020 x 2
##    gutenberg_id text              
##           <int> <chr>             
##  1           35 "The Time Machine"
##  2           35 ""                
##  3           35 "An Invention"    
##  4           35 ""                
##  5           35 "by H. G. Wells"  
##  6           35 ""                
##  7           35 ""                
##  8           35 "CONTENTS"        
##  9           35 ""                
## 10           35 " I Introduction" 
## # ... with 20,010 more rows

tidy_hgwells <- hgwells %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)

## Joining, by = "word"

The most common words in these H.G. Wells novels are:

tidy_hgwells %>%
  count(word, sort = TRUE)

## # A tibble: 11,811 x 2
##    word       n
##    <chr>  <int>
##  1 time     461
##  2 people   302
##  3 door     260
##  4 heard    249
##  5 black    232
##  6 stood    229
##  7 white    224
##  8 hand     218
##  9 kemp     213
## 10 eyes     210
## # ... with 11,801 more rows

Next, we get some works by the Brontë sisters. We again use the Project Gutenberg ID number for each novel and access the texts using gutenberg_download().

bronte <- gutenberg_download(c(1260, 768, 969, 9182, 767))

tidy_bronte <- bronte %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)

## Joining, by = "word"

The most common words in these novels of the Brontë sisters are shown below. Note that “time”, “eyes”, “hand”, and “door” are in the top 10 for both H.G. Wells and the Brontë sisters.

tidy_bronte %>%
  count(word, sort = TRUE)

## # A tibble: 23,297 x 2
##    word       n
##    <chr>  <int>
##  1 time    1065
##  2 miss     854
##  3 day      825
##  4 hand     767
##  5 eyes     714
##  6 don’t    666
##  7 night    648
##  8 heart    638
##  9 looked   601
## 10 door     591
## # ... with 23,287 more rows

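Now we calculate the frequency of each word in the works of Jane Austen, the Brontë sisters, and H.G. Wells by binding the data frames together. We can use pivot_wider() and pivot_longer() from tidyr to reshape our data frame so that Jane Austen's proportions sit in their own column, ready to be compared against each of the other two authors.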

library(tidyr)

frequency <- bind_rows(mutate(tidy_bronte, author = "Brontë Sisters"),
                       mutate(tidy_hgwells, author = "H.G. Wells"), 
                       mutate(tidy_books, author = "Jane Austen")) %>% 
  mutate(word = str_extract(word, "[a-z']+")) %>%
  count(author, word) %>%
  group_by(author) %>%
  mutate(proportion = n / sum(n)) %>% 
  select(-n) %>% 
  pivot_wider(names_from = author, values_from = proportion) %>%
  pivot_longer(`Brontë Sisters`:`H.G. Wells`,
               names_to = "author", values_to = "proportion")

frequency

## # A tibble: 57,116 x 4
##    word      `Jane Austen` author          proportion
##    <chr>             <dbl> <chr>                <dbl>
##  1 a            0.00000919 Brontë Sisters  0.0000587 
##  2 a            0.00000919 H.G. Wells      0.0000147 
##  3 aback       NA          Brontë Sisters  0.00000391
##  4 aback       NA          H.G. Wells      0.0000147 
##  5 abaht       NA          Brontë Sisters  0.00000391
##  6 abaht       NA          H.G. Wells     NA         
##  7 abandon     NA          Brontë Sisters  0.0000313 
##  8 abandon     NA          H.G. Wells      0.0000147 
##  9 abandoned    0.00000460 Brontë Sisters  0.0000900 
## 10 abandoned    0.00000460 H.G. Wells      0.000177  
## # ... with 57,106 more rows
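The reshaping steps can be hard to follow on the full data, so here is a minimal sketch on an invented dataset with three placeholder authors (A, B, and C). Author A plays the role of Jane Austen: its proportions stay in a column of their own, while B and C are gathered back into author/proportion pairs. We add an explicit ungroup() before reshaping as a precaution.

```r
library(dplyr)
library(tidyr)

# Toy tokens for three placeholder authors (invented for illustration)
toy <- tibble(
  author = c("A", "A", "A", "B", "B", "C"),
  word   = c("time", "time", "door", "time", "eyes", "time")
)

freq <- toy %>%
  count(author, word) %>%
  group_by(author) %>%
  mutate(proportion = n / sum(n)) %>%   # share of each word within an author
  ungroup() %>%
  select(-n) %>%
  # one column per author; words an author never used become NA
  pivot_wider(names_from = author, values_from = proportion) %>%
  # gather B and C back into long form, leaving A as the reference column
  pivot_longer(B:C, names_to = "author", values_to = "proportion")

freq
```

Each word now appears once per comparison author, with NA wherever a word is missing from one side. Those NA rows are the ones that get dropped later when plotting and when computing correlations.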

We use str_extract() here because the UTF-8 encoded texts from Project Gutenberg have some examples of words with underscores around them to indicate emphasis (like italics). The tokenizer treated these as words, but we don’t want to count “_any_” separately from “any”, as we saw in our initial data exploration before choosing to use str_extract().
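A quick sketch of how this pattern behaves on a few hand-picked tokens (the tokens vector is invented for illustration):

```r
library(stringr)

# Toy tokens: an emphasized word, its plain form, a contraction with a
# straight apostrophe, one with a typographic apostrophe, and a number
tokens <- c("_any_", "any", "don't", "don\u2019t", "1811")

# Keep the first run of lowercase letters and straight apostrophes
cleaned <- str_extract(tokens, "[a-z']+")

cleaned
```

The underscores are dropped, so both forms of "any" are counted together, and tokens with no lowercase letters (such as "1811") become NA. One caveat worth keeping in mind: a typographic apostrophe is not part of the character class, so a token written that way would be cut off at the apostrophe.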

Comparing the word frequencies of Jane Austen, the Brontë sisters, and H.G. Wells

library(scales)

# expect a warning about rows with missing values being removed
ggplot(frequency, aes(x = proportion, y = `Jane Austen`, 
                      color = abs(`Jane Austen` - proportion))) +
  geom_abline(color = "gray40", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.001), 
                       low = "darkslategray4", high = "gray75") +
  facet_wrap(~author, ncol = 2) +
  theme(legend.position="none") +
  labs(y = "Jane Austen", x = NULL)

## Warning: Removed 40758 rows containing missing values (geom_point).

## Warning: Removed 40760 rows containing missing values (geom_text).

Let’s quantify how similar and different these sets of word frequencies are using a correlation test. How correlated are the word frequencies between Austen and the Brontë sisters, and between Austen and Wells?

cor.test(data = frequency[frequency$author == "Brontë Sisters",],
         ~ proportion + `Jane Austen`)

## 
##  Pearson's product-moment correlation
## 
## data:  proportion and Jane Austen
## t = 111.06, df = 10346, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7285370 0.7461189
## sample estimates:
##       cor 
## 0.7374529

cor.test(data = frequency[frequency$author == "H.G. Wells",], 
         ~ proportion + `Jane Austen`)

## 
##  Pearson's product-moment correlation
## 
## data:  proportion and Jane Austen
## t = 35.229, df = 6008, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3925914 0.4345047
## sample estimates:
##       cor 
## 0.4137673
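To make the cor.test() output easier to read, the sketch below runs the same formula interface on a small invented data frame with one missing value. The column names x and y and the numbers are ours, not from the assignment data.

```r
# Toy data: six pairs, one of which has a missing y value
toy <- data.frame(x = c(1, 2, 3, 4, 5, 6),
                  y = c(2, 4, 5, 4, 5, NA))

# Formula interface, as used above; incomplete rows are dropped by default
result <- cor.test(~ x + y, data = toy)

r          <- unname(result$estimate)    # Pearson correlation coefficient
df         <- unname(result$parameter)   # degrees of freedom
n_complete <- sum(complete.cases(toy))   # rows actually used

c(r = r, df = df, n_complete = n_complete)
```

For a Pearson test the degrees of freedom equal the number of complete pairs minus 2. By the same logic, df = 10346 in the Brontë comparison would correspond to 10,348 words that have a frequency for both authors, and df = 6008 in the Wells comparison to 6,010 words.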

Just as we saw in the plots, the word frequencies are more strongly correlated between the Austen and Brontë novels (about 0.74) than between Austen and H.G. Wells (about 0.41).


Assigned Task:

Textbook citation (MLA format):

Silge, Julia, and David Robinson. Text Mining with R: A Tidy Approach. O’Reilly Media, 2017.
