tidytext package

Packages used: the tidytext package provides the tokenization and text mining tools; magrittr is used for the piping syntax; dplyr for the transformation functions; ggplot2 and wordcloud for plotting and visualization; stringr for text manipulation (regex).
require(tidytext)
require(magrittr)
require(dplyr)
require(ggplot2)
require(wordcloud) # also attaches RColorBrewer, needed for brewer.pal
require(stringr)
visualizeWordcloud <- function(term, freq, title = "", min.freq = 50, max.words = 200){
  mypal <- brewer.pal(8, "Dark2")
  wordcloud(words = term,
            freq = freq,
            colors = mypal,
            scale = c(8, .3),
            rot.per = .15,
            min.freq = min.freq, max.words = max.words,
            random.order = FALSE)
}
The idea is to play around with the tidytext package and perform some common text mining operations on a few documents, using the available vignettes and the “Tidy Text Mining with R” book as guides (see References).
The tidytext package makes it possible to apply tidy data principles to unstructured text, so that the tidyverse ecosystem can be used for text mining.
Tidy text format is defined as ‘a table with one-term-per-row’.
unnest_tokens function

The supporting document is a character vector with one element made of 3 sentences. This dataset is not yet compatible with tidy tools (it is not compliant with tidy data principles).
document <- paste("Using tidy data principles is important.",
"In this package, we provide functions for tidy formats.",
"The novels of Jane Austen can be so tidy!")
df <- data.frame(text = document)
The unnest_tokens function splits a text column (input) into tokens (e.g. sentences, words, n-grams, etc.) using the tokenizers package.
Tokenize into lines…
# to_lower = F preserves the original casing of the sentences
document_lines <- unnest_tokens(df, input = text, output = line, token = "sentences", to_lower = F)
document_lines$lineNo <- seq_along(document_lines$line)
head(document_lines)
## # A tibble: 3 × 2
## line lineNo
## <chr> <int>
## 1 Using tidy data principles is important. 1
## 2 In this package, we provide functions for tidy formats. 2
## 3 The novels of Jane Austen can be so tidy! 3
Tokenize into words (unigrams)…
df_text_to_word_tidy <- document_lines %>%
unnest_tokens(output = word, input = line, token = "words")
head(df_text_to_word_tidy)
## # A tibble: 6 × 2
## lineNo word
## <int> <chr>
## 1 1 using
## 2 1 tidy
## 3 1 data
## 4 1 principles
## 5 1 is
## 6 1 important
Tokenize into bigrams…
df_text_to_bigrams_tidy <- document_lines %>%
unnest_tokens(output = bigram, input = line, token = "ngrams", n = 2)
head(df_text_to_bigrams_tidy)
## # A tibble: 6 × 2
## lineNo bigram
## <int> <chr>
## 1 1 using tidy
## 2 1 tidy data
## 3 1 data principles
## 4 1 principles is
## 5 1 is important
## 6 2 in this
Tokenize into trigrams…
df_text_to_trigrams_tidy <- document_lines %>%
unnest_tokens(output = trigram, input = line, token = "ngrams", n = 3)
head(df_text_to_trigrams_tidy)
## # A tibble: 6 × 2
## lineNo trigram
## <int> <chr>
## 1 1 using tidy data
## 2 1 tidy data principles
## 3 1 data principles is
## 4 1 principles is important
## 5 2 in this package
## 6 2 this package we
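Besides sentences, words and n-grams, unnest_tokens can delegate to other tokenizers from the tokenizers package. A minimal sketch using a regex tokenizer split on whitespace (an illustrative choice, assuming the installed tidytext version exposes the regex option; note that punctuation stays attached to the tokens):
df_text_to_regex_tidy <- document_lines %>%
  unnest_tokens(output = word, input = line, token = "regex", pattern = "\\s+")
head(df_text_to_regex_tidy)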
A new data structure has been created that complies with the tidy data principles and can be used with the standard set of tidy tools (the tidyverse ecosystem).
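For example, the usual dplyr verbs now apply directly to the tokenized data; a quick sketch counting how many words each sentence contains:
df_text_to_word_tidy %>%
  count(lineNo)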
stop_words dataset and the anti_join function

The tidytext package offers a dataset containing English stopwords from 3 different lexicons (the onix, SMART and snowball sets) that, optionally, can be used to remove the most common, least informative words from the text under examination.
head(stop_words)
## # A tibble: 6 × 2
## word lexicon
## <chr> <chr>
## 1 a SMART
## 2 a's SMART
## 3 able SMART
## 4 about SMART
## 5 above SMART
## 6 according SMART
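To see how the stopwords are distributed across the three lexicons, a quick count (dplyr's count function is discussed below):
stop_words %>%
  count(lexicon)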
The anti_join(x, y) function from the dplyr package can be used to remove a list of words (e.g. stopwords). The function returns all rows from x for which there is no matching value in y (keeping all the columns of x).
df_tmp <- df_text_to_word_tidy %>%
anti_join(stop_words, by = c("word" = "word"))
head(df_tmp)
## # A tibble: 6 × 2
## lineNo word
## <int> <chr>
## 1 1 tidy
## 2 2 tidy
## 3 3 tidy
## 4 1 data
## 5 1 principles
## 6 2 package
count function

The count function in the dplyr package can be used on a tidy dataset to count observations.
df_tmp %>%
count(word, sort = TRUE)
## # A tibble: 10 × 2
## word n
## <chr> <int>
## 1 tidy 3
## 2 austen 1
## 3 data 1
## 4 formats 1
## 5 functions 1
## 6 jane 1
## 7 novels 1
## 8 package 1
## 9 principles 1
## 10 provide 1
Using the janeaustenr package, which contains the text of Jane Austen's 6 published novels.
Prepare the raw dataset by adding chapter and line-number features for each book.
require(janeaustenr)
str(austen_books())
## Classes 'tbl_df', 'tbl' and 'data.frame': 73422 obs. of 2 variables:
## $ text: chr "SENSE AND SENSIBILITY" "" "by Jane Austen" "" ...
## $ book: Factor w/ 6 levels "Sense & Sensibility",..: 1 1 1 1 1 1 1 1 1 1 ...
original_books <- austen_books() %>%
group_by(book) %>%
mutate(line = row_number(),
chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
ignore_case = TRUE)))) %>%
ungroup()
head(original_books)
## # A tibble: 6 × 4
## text book line chapter
## <chr> <fctr> <int> <int>
## 1 SENSE AND SENSIBILITY Sense & Sensibility 1 0
## 2 Sense & Sensibility 2 0
## 3 by Jane Austen Sense & Sensibility 3 0
## 4 Sense & Sensibility 4 0
## 5 (1811) Sense & Sensibility 5 0
## 6 Sense & Sensibility 6 0
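As a quick sanity check on the chapter regex, the cumulative sum should end at each novel's last chapter, so the per-book maximum gives the number of chapters (a sketch; the exact counts depend on the janeaustenr edition):
original_books %>%
  group_by(book) %>%
  summarise(chapters = max(chapter))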
Transform into a tidy dataset…
tidy_books <- original_books %>%
unnest_tokens(input = text, output = word, token = "words")
head(tidy_books)
## # A tibble: 6 × 4
## book line chapter word
## <fctr> <int> <int> <chr>
## 1 Sense & Sensibility 1 0 sense
## 2 Sense & Sensibility 1 0 and
## 3 Sense & Sensibility 1 0 sensibility
## 4 Sense & Sensibility 3 0 by
## 5 Sense & Sensibility 3 0 jane
## 6 Sense & Sensibility 3 0 austen
Remove the stopwords…
tidy_books <- tidy_books %>%
anti_join(y = stop_words, by = c("word" = "word"))
Find the most common words…
word_frequencies <- tidy_books %>%
count(word, sort = TRUE)
head(word_frequencies)
## # A tibble: 6 × 2
## word n
## <chr> <int>
## 1 miss 1855
## 2 time 1337
## 3 fanny 862
## 4 dear 822
## 5 lady 817
## 6 sir 806
word_frequencies %>%
filter(n > 400) %>%
mutate(word = reorder(word, n)) %>%
ggplot(mapping = aes(x = word, y = n)) +
geom_col() +
coord_flip()
visualizeWordcloud(term = word_frequencies$word, freq = word_frequencies$n)
The gutenbergr package provides access to the Project Gutenberg collection. The package contains tools for downloading books and for finding works of interest.
require(gutenbergr)
gutenberg_works(str_detect(author,"Wells, H. G."))$title[1:5]
## [1] "The Time Machine"
## [2] "The War of the Worlds"
## [3] "The Island of Doctor Moreau"
## [4] "The Door in the Wall, and Other Stories"
## [5] "Ann Veronica: A Modern Love Story"
ids <- gutenberg_works(str_detect(author,"Wells, H. G."))$gutenberg_id[1:3]
# Download 'The Time Machine', 'The War of the Worlds' and 'The Island of Doctor Moreau'
hgwells <- gutenbergr::gutenberg_download(ids)
hgwells <- hgwells %>%
group_by(gutenberg_id) %>%
mutate(line = row_number()) %>%
ungroup()
Transform into a tidy dataset…
hgwells_tidy <- hgwells %>%
unnest_tokens(output = word, input = text, token = "words")
Remove stopwords…
hgwells_tidy <- hgwells_tidy %>%
anti_join(stop_words, by = c("word" = "word"))
Calculate word frequencies…
hgwell_word_freqs_by_book <- hgwells_tidy %>%
group_by(gutenberg_id) %>%
count(word, sort = TRUE) %>%
ungroup()
Visualize the words with frequency greater than 50 for each book…
hgwell_word_freqs_by_book %>%
filter(n > 50) %>%
ggplot(mapping = aes(x = word, y = n)) +
geom_col() +
coord_flip() + facet_wrap(facets = ~ gutenberg_id)
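The shared word axis above gets crowded because every book contributes different words. A possible refinement is to keep only each book's most frequent words; a sketch using dplyr's top_n (ties may keep more than 10 words per panel, and reorder() sorts words globally rather than within each panel):
hgwell_word_freqs_by_book %>%
  group_by(gutenberg_id) %>%
  top_n(10, n) %>%
  ungroup() %>%
  ggplot(mapping = aes(x = reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +
  facet_wrap(facets = ~ gutenberg_id, scales = "free_y")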
Visualize the words with frequency greater than 50 for each book using wordclouds (Gutenberg IDs 35, 36 and 159, respectively)…
par(mfrow = c(1,3), mar = c(0,0,0,0))
tmp <- hgwell_word_freqs_by_book[hgwell_word_freqs_by_book$gutenberg_id == 35,]
visualizeWordcloud(term = tmp$word, freq = tmp$n)
tmp <- hgwell_word_freqs_by_book[hgwell_word_freqs_by_book$gutenberg_id == 36,]
visualizeWordcloud(term = tmp$word, freq = tmp$n)
tmp <- hgwell_word_freqs_by_book[hgwell_word_freqs_by_book$gutenberg_id == 159,]
visualizeWordcloud(term = tmp$word, freq = tmp$n)
References

“Introduction to tidytext”, tidytext package vignette
“Tidy Text Mining with R”, Julia Silge and David Robinson
sessionInfo()
## R version 3.3.1 (2016-06-21)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
##
## locale:
## [1] LC_COLLATE=Norwegian (Bokmål)_Norway.1252
## [2] LC_CTYPE=Norwegian (Bokmål)_Norway.1252
## [3] LC_MONETARY=Norwegian (Bokmål)_Norway.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=Norwegian (Bokmål)_Norway.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] scales_0.4.1 gutenbergr_0.1.2 janeaustenr_0.1.4
## [4] wordcloud_2.5 RColorBrewer_1.1-2 stringr_1.1.0
## [7] ggplot2_2.2.0 dplyr_0.5.0 magrittr_1.5
## [10] tidytext_0.1.2
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.7 formatR_1.4 plyr_1.8.4 tokenizers_0.1.4
## [5] tools_3.3.1 digest_0.6.10 evaluate_0.10 tibble_1.2
## [9] nlme_3.1-128 gtable_0.2.0 lattice_0.20-34 Matrix_1.2-7.1
## [13] psych_1.6.9 DBI_0.5-1 curl_2.1 yaml_2.1.13
## [17] parallel_3.3.1 knitr_1.14 triebeard_0.3.0 grid_3.3.1
## [21] R6_2.2.0 foreign_0.8-67 rmarkdown_1.1 readr_1.0.0
## [25] purrr_0.2.2 reshape2_1.4.1 tidyr_0.6.0 urltools_1.6.0
## [29] SnowballC_0.5.1 htmltools_0.3.5 assertthat_0.1 mnormt_1.5-5
## [33] colorspace_1.2-7 labeling_0.3 stringi_1.1.2 lazyeval_0.2.0
## [37] munsell_0.4.3 slam_0.1-38 broom_0.4.1