Content adapted from Text Mining with R by Julia Silge & David Robinson.
Simple rules of tidy text:
- each variable is a column
- each observation is a row (one token per row)
- each type of observational unit is a table
Common ways text is stored for text mining:
String: Text can, of course, be stored as strings, i.e., character vectors, within R, and often text data is first read into memory in this form.
Corpus: These types of objects typically contain raw strings annotated with additional metadata and details.
Document-term matrix: This is a sparse matrix describing a collection (i.e., a corpus) of documents, with one row for each document and one column for each term. The value in the matrix is typically a word count or tf-idf.
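As a quick illustration, a tidy table of per-document word counts can be cast into a document-term matrix with tidytext's cast_dtm() (a minimal sketch with made-up counts; requires the tm package to be installed):
library(dplyr)
library(tidytext)
# toy per-document word counts, invented for illustration
word_counts = tibble(
  document = c(1, 1, 2),
  word = c("password", "hack", "password"),
  n = c(2, 1, 3)
)
# one row per document, one column per term, values are the counts
word_counts %>%
  cast_dtm(document, word, n)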
Start with a raw text vector:
text_raw = c(
  "Cyber Security and Text Analysis is the future wave",
  "Analysis of passwords and phrases is interesting",
  "Using a password manager is a smart thing to do",
  "Have more than one way to authenticate",
  "Hackers will hack the planet and hack all the things",
  "Hacking passwords is what hackers do"
)
text_raw
## [1] "Cyber Security and Text Analysis is the future wave"
## [2] "Analysis of passwords and phrases is interesting"
## [3] "Using a password manager is a smart thing to do"
## [4] "Have more than one way to authenticate"
## [5] "Hackers will hack the planet and hack all the things"
## [6] "Hacking passwords is what hackers do"
Turn text_raw into a data frame for analysis.
library(dplyr)
text_df = tibble(
  line = 1:6,
  text = text_raw
)
text_df
## # A tibble: 6 × 2
## line text
## <int> <chr>
## 1 1 Cyber Security and Text Analysis is the future wave
## 2 2 Analysis of passwords and phrases is interesting
## 3 3 Using a password manager is a smart thing to do
## 4 4 Have more than one way to authenticate
## 5 5 Hackers will hack the planet and hack all the things
## 6 6 Hacking passwords is what hackers do
A token is a meaningful unit of text, most often a word, that we are interested in using for further analysis, and tokenization is the process of splitting text into tokens.
Within our tidy text framework, we need to both break the text into individual tokens (a process called tokenization) and transform it to a tidy data structure. To do this, we use tidytext’s unnest_tokens() function.
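library(tidytext)
token_text_df = text_df %>%
  unnest_tokens(word, text)
token_text_df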
## # A tibble: 49 × 2
## line word
## <int> <chr>
## 1 1 cyber
## 2 1 security
## 3 1 and
## 4 1 text
## 5 1 analysis
## 6 1 is
## 7 1 the
## 8 1 future
## 9 1 wave
## 10 2 analysis
## # ℹ 39 more rows
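Note that unnest_tokens() lowercases tokens and strips punctuation by default; a minimal sketch of keeping the original case with its to_lower argument (output not shown):
text_df %>%
  unnest_tokens(word, text, to_lower = FALSE)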
Now remove stop words by anti-joining against tidytext's stop_words data set:
clean_text_df = token_text_df %>%
  anti_join(stop_words)
clean_text_df
## # A tibble: 20 × 2
## line word
## <int> <chr>
## 1 1 cyber
## 2 1 security
## 3 1 text
## 4 1 analysis
## 5 1 future
## 6 1 wave
## 7 2 analysis
## 8 2 passwords
## 9 2 phrases
## 10 3 password
## 11 3 manager
## 12 3 smart
## 13 4 authenticate
## 14 5 hackers
## 15 5 hack
## 16 5 planet
## 17 5 hack
## 18 6 hacking
## 19 6 passwords
## 20 6 hackers
Notice we went from 49 rows to 20.
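The stop_words data set bundled with tidytext combines several lexicons. If domain-specific words should also be dropped, the list can be extended before the anti-join (a minimal sketch; the extra words are arbitrary examples):
custom_stop_words = bind_rows(
  stop_words,
  tibble(word = c("cyber", "planet"), lexicon = "custom")
)
token_text_df %>%
  anti_join(custom_stop_words)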
Now we can count and sort the most common words:
clean_text_df %>%
  count(word, sort = T)
## # A tibble: 16 × 2
## word n
## <chr> <int>
## 1 analysis 2
## 2 hack 2
## 3 hackers 2
## 4 passwords 2
## 5 authenticate 1
## 6 cyber 1
## 7 future 1
## 8 hacking 1
## 9 manager 1
## 10 password 1
## 11 phrases 1
## 12 planet 1
## 13 security 1
## 14 smart 1
## 15 text 1
## 16 wave 1
On to the data visualization.
library(tidyverse)  # loads ggplot2 and forcats (for fct_reorder)
clean_text_df %>%
  count(word, sort = T) %>%
  mutate(word = fct_reorder(word, n)) %>%  # order bars by frequency
  ggplot(aes(x = n, y = word)) +
  geom_col()
An example using a book from Project Gutenberg: H. G. Wells's The Time Machine.
library(gutenbergr)
hgwells = gutenberg_download(35, strip = T)
## Determining mirror for Project Gutenberg from https://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org
head(hgwells)
## # A tibble: 6 × 2
## gutenberg_id text
## <int> <chr>
## 1 35 "The Time Machine"
## 2 35 ""
## 3 35 "An Invention"
## 4 35 ""
## 5 35 "by H. G. Wells"
## 6 35 ""
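If the Gutenberg ID is not known in advance, gutenbergr's bundled metadata can be searched with gutenberg_works() (a minimal sketch):
gutenberg_works(title == "The Time Machine")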
Tidy the text:
tidy_hgwells = hgwells %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
head(tidy_hgwells)
## # A tibble: 6 × 2
## gutenberg_id word
## <int> <chr>
## 1 35 time
## 2 35 machine
## 3 35 invention
## 4 35 contents
## 5 35 introduction
## 6 35 ii
Word counts:
tidy_hgwells %>%
  count(word, sort = T)
## # A tibble: 4,172 × 2
## word n
## <chr> <int>
## 1 time 207
## 2 machine 88
## 3 white 61
## 4 traveller 57
## 5 hand 49
## 6 morlocks 48
## 7 people 46
## 8 weena 46
## 9 found 44
## 10 light 43
## # ℹ 4,162 more rows
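The same bar-chart recipe from earlier applies here; a minimal sketch limited to the 15 most frequent words (the cutoff is arbitrary, and ties may add a few rows):
tidy_hgwells %>%
  count(word, sort = T) %>%
  slice_max(n, n = 15) %>%                 # keep the top 15 counts
  mutate(word = fct_reorder(word, n)) %>%  # order bars by frequency
  ggplot(aes(x = n, y = word)) +
  geom_col()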
Next, tokenize the text into bigrams (pairs of consecutive words), dropping the NA rows that blank lines produce.
hgwells %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram))
## # A tibble: 29,983 × 2
## gutenberg_id bigram
## <int> <chr>
## 1 35 the time
## 2 35 time machine
## 3 35 an invention
## 4 35 by h
## 5 35 h g
## 6 35 g wells
## 7 35 i introduction
## 8 35 ii the
## 9 35 the machine
## 10 35 iii the
## # ℹ 29,973 more rows
Count and sort the bigrams:
hgwells_bigrams = hgwells %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%
  count(bigram, sort = T)
hgwells_bigrams
## # A tibble: 18,630 × 2
## bigram n
## <chr> <int>
## 1 of the 291
## 2 in the 161
## 3 i had 120
## 4 i was 107
## 5 and the 101
## 6 the time 97
## 7 it was 95
## 8 to the 82
## 9 as i 78
## 10 of a 71
## # ℹ 18,620 more rows
We need to apply the stop word list to both halves of each bigram, so split the bigram into word1 and word2 columns with tidyr's separate():
hgwells_clean_bigrams = hgwells_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)
hgwells_clean_bigrams
## # A tibble: 2,328 × 3
## word1 word2 n
## <chr> <chr> <int>
## 1 time traveller 54
## 2 time machine 33
## 3 white sphinx 10
## 4 green porcelain 9
## 5 time traveller’s 8
## 6 time travelling 7
## 7 looked round 6
## 8 golden age 5
## 9 bronze doors 4
## 10 fourth dimension 4
## # ℹ 2,318 more rows
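If a single bigram column is needed again later, tidyr's unite() reverses the separate() step (a minimal sketch):
hgwells_clean_bigrams %>%
  unite(bigram, word1, word2, sep = " ")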
Counting again on this table tallies distinct (word1, word2) rows rather than summing n, so every pair comes out as 1:
hgwells_clean_bigrams %>%
  count(word1, word2, sort = T)
## # A tibble: 2,328 × 3
## word1 word2 n
## <chr> <chr> <int>
## 1 _can_ move 1
## 2 _four_ directions 1
## 3 _instantaneous_ cube 1
## 4 _pall mall 1
## 5 _philosophical transactions_ 1
## 6 _the land 1
## 7 _the time 1
## 8 _three_ dimensions 1
## 9 abandoned ruins 1
## 10 abominable desolation 1
## # ℹ 2,318 more rows
Filter to the bigrams that begin with "time":
time_bigrams = hgwells_clean_bigrams %>%
  filter(word1 == "time") %>%
  count(word1, word2, sort = T)  # re-counting a counted table collapses every n to 1
time_bigrams
## # A tibble: 14 × 3
## word1 word2 n
## <chr> <chr> <int>
## 1 time _there 1
## 2 time answered 1
## 3 time brightening 1
## 4 time dimension 1
## 5 time exclaimed 1
## 6 time fifty 1
## 7 time geology 1
## 8 time machine 1
## 9 time machines 1
## 10 time necessity 1
## 11 time that’s 1
## 12 time traveller 1
## 13 time traveller’s 1
## 14 time travelling 1
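Because hgwells_clean_bigrams already carries the counts, the true frequencies of the "time" bigrams can be recovered by filtering without re-counting (a minimal sketch; the top rows would match the earlier output, e.g. time traveller at 54):
hgwells_clean_bigrams %>%
  filter(word1 == "time")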