Content adapted from Text Mining with R by Julia Silge & David Robinson.
Simple rules of tidy text:
- each variable is a column
- each observation is a row (one token per row)
- each type of observational unit is a table
Common ways text is stored for text mining:
String: Text can, of course, be stored as strings, i.e., character vectors, within R, and often text data is first read into memory in this form.
Corpus: These types of objects typically contain raw strings annotated with additional metadata and details.
Document-term matrix: This is a sparse matrix describing a collection (i.e., a corpus) of documents, with one row for each document and one column for each term. The value in the matrix is typically a word count or tf-idf.
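As a quick illustration, a tidy table of per-document word counts can be cast into a document-term matrix with tidytext's cast_dtm() (a minimal sketch with made-up counts; requires the tm package to be installed):
library(dplyr)
library(tidytext)
# toy per-document word counts, invented for illustration
word_counts = tibble(
  document = c(1, 1, 2),
  word = c("password", "hack", "password"),
  n = c(2, 1, 3)
)
# one row per document, one column per term, values are the counts
word_counts %>%
  cast_dtm(document, word, n)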
Start with a raw text vector:
text_raw = c(
  "Cyber Security and Text Analysis is the future wave",
  "Analysis of passwords and phrases is interesting",
  "Using a password manager is a smart thing to do",
  "Have more than one way to authenticate",
  "Hackers will hack the planet and hack all the things",
  "Hacking passwords is what hackers do"
)
text_raw
## [1] "Cyber Security and Text Analysis is the future wave"
## [2] "Analysis of passwords and phrases is interesting"
## [3] "Using a password manager is a smart thing to do"
## [4] "Have more than one way to authenticate"
## [5] "Hackers will hack the planet and hack all the things"
## [6] "Hacking passwords is what hackers do"
Turn text_raw into a data frame for analysis.
library(dplyr)
text_df = tibble(
  line = 1:6,
  text = text_raw
)
text_df
## # A tibble: 6 × 2
## line text
## <int> <chr>
## 1 1 Cyber Security and Text Analysis is the future wave
## 2 2 Analysis of passwords and phrases is interesting
## 3 3 Using a password manager is a smart thing to do
## 4 4 Have more than one way to authenticate
## 5 5 Hackers will hack the planet and hack all the things
## 6 6 Hacking passwords is what hackers do
A token is a meaningful unit of text, most often a word, that we are interested in using for further analysis, and tokenization is the process of splitting text into tokens.
Within our tidy text framework, we need to both break the text into individual tokens (a process called tokenization) and transform it to a tidy data structure. To do this, we use tidytext’s unnest_tokens() function.
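library(tidytext)
token_text_df = text_df %>%
  unnest_tokens(word, text)
token_text_df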
## # A tibble: 49 × 2
## line word
## <int> <chr>
## 1 1 cyber
## 2 1 security
## 3 1 and
## 4 1 text
## 5 1 analysis
## 6 1 is
## 7 1 the
## 8 1 future
## 9 1 wave
## 10 2 analysis
## # ℹ 39 more rows
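Note that unnest_tokens() lowercases tokens and strips punctuation by default; a minimal sketch of keeping the original case with its to_lower argument (output not shown):
text_df %>%
  unnest_tokens(word, text, to_lower = FALSE)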
Now remove stop words by anti-joining against tidytext's stop_words data set:
clean_text_df = token_text_df %>%
  anti_join(stop_words)
clean_text_df
## # A tibble: 20 × 2
## line word
## <int> <chr>
## 1 1 cyber
## 2 1 security
## 3 1 text
## 4 1 analysis
## 5 1 future
## 6 1 wave
## 7 2 analysis
## 8 2 passwords
## 9 2 phrases
## 10 3 password
## 11 3 manager
## 12 3 smart
## 13 4 authenticate
## 14 5 hackers
## 15 5 hack
## 16 5 planet
## 17 5 hack
## 18 6 hacking
## 19 6 passwords
## 20 6 hackers
Notice we went from 49 rows to 20.
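The stop_words data set bundled with tidytext combines several lexicons. If domain-specific words should also be dropped, the list can be extended before the anti-join (a minimal sketch; the extra words are arbitrary examples):
custom_stop_words = bind_rows(
  stop_words,
  tibble(word = c("cyber", "planet"), lexicon = "custom")
)
token_text_df %>%
  anti_join(custom_stop_words)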
Now we can count and sort the most common words:
clean_text_df %>%
  count(word, sort = T)
## # A tibble: 16 × 2
## word n
## <chr> <int>
## 1 analysis 2
## 2 hack 2
## 3 hackers 2
## 4 passwords 2
## 5 authenticate 1
## 6 cyber 1
## 7 future 1
## 8 hacking 1
## 9 manager 1
## 10 password 1
## 11 phrases 1
## 12 planet 1
## 13 security 1
## 14 smart 1
## 15 text 1
## 16 wave 1
On to the data visualization.
library(tidyverse)  # loads ggplot2 and forcats (for fct_reorder)
clean_text_df %>%
  count(word, sort = T) %>%
  mutate(word = fct_reorder(word, n)) %>%  # order bars by frequency
  ggplot(aes(x = n, y = word)) +
  geom_col()
An example using a book from Project Gutenberg: H. G. Wells's The Time Machine.
library(gutenbergr)
hgwells = gutenberg_download(35, strip = T)
## Determining mirror for Project Gutenberg from https://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org
head(hgwells)
## # A tibble: 6 × 2
## gutenberg_id text
## <int> <chr>
## 1 35 "The Time Machine"
## 2 35 ""
## 3 35 "An Invention"
## 4 35 ""
## 5 35 "by H. G. Wells"
## 6 35 ""
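If the Gutenberg ID is not known in advance, gutenbergr's bundled metadata can be searched with gutenberg_works() (a minimal sketch):
gutenberg_works(title == "The Time Machine")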
Tidy the text:
tidy_hgwells = hgwells %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
head(tidy_hgwells)
## # A tibble: 6 × 2
## gutenberg_id word
## <int> <chr>
## 1 35 time
## 2 35 machine
## 3 35 invention
## 4 35 contents
## 5 35 introduction
## 6 35 ii
Word counts:
tidy_hgwells %>%
  count(word, sort = T)
## # A tibble: 4,172 × 2
## word n
## <chr> <int>
## 1 time 207
## 2 machine 88
## 3 white 61
## 4 traveller 57
## 5 hand 49
## 6 morlocks 48
## 7 people 46
## 8 weena 46
## 9 found 44
## 10 light 43
## # ℹ 4,162 more rows
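The same bar-chart recipe from earlier applies here; a minimal sketch limited to the 15 most frequent words (the cutoff is arbitrary, and ties may add a few rows):
tidy_hgwells %>%
  count(word, sort = T) %>%
  slice_max(n, n = 15) %>%                 # keep the top 15 counts
  mutate(word = fct_reorder(word, n)) %>%  # order bars by frequency
  ggplot(aes(x = n, y = word)) +
  geom_col()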
Next, tokenize the text into bigrams (pairs of consecutive words), dropping the NA rows that blank lines produce.
hgwells %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram))
## # A tibble: 29,983 × 2
## gutenberg_id bigram
## <int> <chr>
## 1 35 the time
## 2 35 time machine
## 3 35 an invention
## 4 35 by h
## 5 35 h g
## 6 35 g wells
## 7 35 i introduction
## 8 35 ii the
## 9 35 the machine
## 10 35 iii the
## # ℹ 29,973 more rows
Count and sort the bigrams:
hgwells_bigrams = hgwells %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%
  count(bigram, sort = T)
hgwells_bigrams
## # A tibble: 18,630 × 2
## bigram n
## <chr> <int>
## 1 of the 291
## 2 in the 161
## 3 i had 120
## 4 i was 107
## 5 and the 101
## 6 the time 97
## 7 it was 95
## 8 to the 82
## 9 as i 78
## 10 of a 71
## # ℹ 18,620 more rows
We need to apply the stop word list to both halves of each bigram, so split the bigram into word1 and word2 columns with tidyr's separate():
hgwells_clean_bigrams = hgwells_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)
hgwells_clean_bigrams
## # A tibble: 2,328 × 3
## word1 word2 n
## <chr> <chr> <int>
## 1 time traveller 54
## 2 time machine 33
## 3 white sphinx 10
## 4 green porcelain 9
## 5 time traveller’s 8
## 6 time travelling 7
## 7 looked round 6
## 8 golden age 5
## 9 bronze doors 4
## 10 fourth dimension 4
## # ℹ 2,318 more rows
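If a single bigram column is needed again later, tidyr's unite() reverses the separate() step (a minimal sketch):
hgwells_clean_bigrams %>%
  unite(bigram, word1, word2, sep = " ")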
Counting again on this table tallies distinct (word1, word2) rows rather than summing n, so every pair comes out as 1:
hgwells_clean_bigrams %>%
  count(word1, word2, sort = T)
## # A tibble: 2,328 × 3
## word1 word2 n
## <chr> <chr> <int>
## 1 _can_ move 1
## 2 _four_ directions 1
## 3 _instantaneous_ cube 1
## 4 _pall mall 1
## 5 _philosophical transactions_ 1
## 6 _the land 1
## 7 _the time 1
## 8 _three_ dimensions 1
## 9 abandoned ruins 1
## 10 abominable desolation 1
## # ℹ 2,318 more rows
Filter to the bigrams that begin with "time":
time_bigrams = hgwells_clean_bigrams %>%
  filter(word1 == "time") %>%
  count(word1, word2, sort = T)  # re-counting a counted table collapses every n to 1
time_bigrams
## # A tibble: 14 × 3
## word1 word2 n
## <chr> <chr> <int>
## 1 time _there 1
## 2 time answered 1
## 3 time brightening 1
## 4 time dimension 1
## 5 time exclaimed 1
## 6 time fifty 1
## 7 time geology 1
## 8 time machine 1
## 9 time machines 1
## 10 time necessity 1
## 11 time that’s 1
## 12 time traveller 1
## 13 time traveller’s 1
## 14 time travelling 1
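Because hgwells_clean_bigrams already carries the counts, the true frequencies of the "time" bigrams can be recovered by filtering without re-counting (a minimal sketch; the top rows would match the earlier output, e.g. time traveller at 54):
hgwells_clean_bigrams %>%
  filter(word1 == "time")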