The goal of this project is to analyze a large corpus of text documents to discover the structure in the data and how words are put together. In this milestone report I will cover loading, cleaning, and exploring the text data. I will describe building and sampling what will become the foundation of a training dataset for a predictive text model developed over the remainder of this course. Finally, using the model-ready dataset presented here, I will lay out my plans to build a predictive text data product.
# load required R packages
library(knitr)
library(magrittr)
library(stringi)
library(dplyr)
library(ggplot2)
library(scales)
library(tm)
library(RWeka)
library(wordcloud)
library(slam)
# input files directory
file_dir <- '../data/raw'
We begin by loading three large files of English text data collected from Twitter, blogs, and news articles:
txt_raw <-
  sapply(
    # loop over the three data sources
    c('twitter', 'blogs', 'news'),
    # open each source file and read its lines into a character vector
    function(txt_source){
      con <-
        txt_source %>%
        paste0(file_dir, '/en_US.', ., '.txt') %>%
        file(open = 'rb', encoding = 'UTF-8')
      txt <- readLines(con = con, encoding = 'UTF-8', skipNul = TRUE)
      close(con)
      return(txt)
    },
    # return a named list, one element per source
    simplify = FALSE, USE.NAMES = TRUE
  )
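Before computing summary statistics, a quick peek at the first line of each source helps confirm that the files were read correctly (an optional check, not part of the original analysis; the 80-character preview length is arbitrary).

# optional sanity check: preview the first 80 characters of the first line of each source
lapply(txt_raw, function(x) stri_sub(x[1], 1, 80))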
Next, we can explore the properties of these text files with some summary statistics. The table below shows that the Twitter file contains the largest number of lines (tweets) but, not surprisingly, the fewest characters.
txt_raw %>%
  lapply(function(x) x %>% stri_stats_general %>% t %>% as.data.frame) %>%
  bind_rows(.id = 'Source') %>%
  kable(digits = 0)
| Source | Lines | LinesNEmpty | Chars | CharsNWhite |
|---|---|---|---|---|
| twitter | 2360148 | 2360148 | 162096241 | 134082806 |
| blogs | 899288 | 899288 | 206824382 | 170389539 |
| news | 1010242 | 1010242 | 203223154 | 169860866 |
Next, we examine the number of words per line of text. Again, compared to the other sources, tweets have the lowest average and maximum word counts. News lines have the highest median, while blogs have the highest average and a long right tail (a maximum of 6,726 words per line).
txt_word_counts <- lapply(txt_raw, stri_count_words)
txt_word_counts %>%
  lapply(function(x) x %>% summary %>% as.matrix %>% t %>% as.data.frame) %>%
  bind_rows(.id = 'Source') %>%
  kable
| Source | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|
| twitter | 1 | 7 | 12 | 12.75 | 18 | 47 |
| blogs | 0 | 9 | 28 | 41.75 | 60 | 6726 |
| news | 1 | 19 | 32 | 34.41 | 46 | 1796 |
txt_word_counts %>%
  lapply(function(x) data.frame(wpl = x)) %>%
  bind_rows(.id = 'Source') %>%
  ggplot(aes(wpl, fill = Source)) +
  facet_wrap(~Source, nrow = 3, scales = 'free_y') +
  geom_histogram(binwidth = 3, color = 'white', alpha = 0.50) +
  scale_y_continuous(labels = scales::comma) +
  coord_cartesian(xlim = c(0, 150)) +
  theme_minimal() +
  labs(title = 'Distribution of Words per Line',
       x = 'Words per text line', y = 'Count')
The goal of the project is to use the above datasets to train a model that predicts the next word given the one, two, or three words preceding it. In order to achieve that goal, we will need to calculate frequency counts of all two-, three-, and four-word combinations that occur in our text corpus. Combinations of n consecutive words are called n-grams.
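As a small illustration (a toy example, not part of the analysis pipeline), the 2-grams of a short sentence are simply all pairs of adjacent words:

# toy example: build the 2-grams (bigrams) of a short sentence with base R
words <- strsplit('thanks for the follow back', ' ')[[1]]
paste(head(words, -1), tail(words, -1))
# returns: "thanks for" "for the" "the follow" "follow back"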
As shown above, our text dataset is quite large. Calculating the frequencies of all possible 2-, 3-, and 4-word combinations involves building large term-document matrices (TDMs) that can quickly exceed available computer memory (RAM).
txt_raw %>% object.size %>% format(units='MiB')
## [1] "799.5 MiB"
To conserve memory, we will take a random sample of lines from each of the three text sources and use it to calculate our n-gram frequency counts.
sample_size <- 1000L
set.seed(1106)
sample_corpus <-
  txt_raw %>%
  # loop through each source and take a random sample of lines
  lapply(function(x){
    x[sample.int(n = length(x), size = sample_size, replace = FALSE)]
  }) %>%
  # collapse the sampled lines into one long text string
  unlist() %>% paste(collapse = '\n') %>%
  # transliterate to plain ASCII
  stri_trans_general('latin-ascii') %>%
  # create corpus
  VectorSource() %>%
  Corpus(readerControl = list(language = "en_US", encoding = "UTF-8")) %>%
  # clean up corpus: lower-case, drop numbers and punctuation, squash whitespace
  tm_map(content_transformer(stri_trans_tolower)) %>%
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation) %>%
  tm_map(stripWhitespace)
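As a quick sanity check (optional, and not required for anything that follows), we can confirm that the corpus holds a single cleaned document and preview its contents:

# optional check: the corpus should contain one cleaned document
length(sample_corpus)
# preview the first 100 characters of the cleaned text
stri_sub(content(sample_corpus[[1]]), 1, 100)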
To calculate these TDMs, we must first provide functions that tokenize the text into 2-, 3-, and 4-grams.
# n-gram tokenizers
token_2 <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
token_3 <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
token_4 <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))
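A quick check on a toy string (illustrative only; the string and the expected results noted in the comments are my own) shows what these tokenizers produce:

# illustrative check of the tokenizers on a toy string
token_2('to be or not to be')  # expected: consecutive 2-grams such as "to be", "be or", "or not", ...
token_3('to be or not to be')  # expected: consecutive 3-grams such as "to be or", "be or not", ...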
# function to plot the top_n most frequent terms in a TDM
plot_top_terms <-
  function(tdm, top_n = 50){
    tdm %>%
      removeSparseTerms(0.1) %>%
      as.matrix %>%
      # sum term frequencies across documents
      rowSums() %>%
      sort(decreasing = TRUE) %>%
      head(top_n) %>%
      data_frame(ngram = names(.), freq = .) %>%
      # order terms by frequency for the horizontal bar chart
      arrange(freq) %>%
      mutate(ngram = ngram %>% factor(levels = ngram)) %>%
      ggplot(aes(ngram, freq)) +
      geom_bar(stat = 'identity', fill = 'steelblue') +
      coord_flip() +
      theme_minimal()
  }
# compute the bigram (2-gram) term-document matrix
tdm_2 <- TermDocumentMatrix(sample_corpus, control = list(tokenize = token_2))
# plot the 50 most frequent bigrams
plot_top_terms(tdm_2, 50)
# compute the trigram (3-gram) term-document matrix
tdm_3 <- TermDocumentMatrix(sample_corpus, control = list(tokenize = token_3))
# plot the 50 most frequent trigrams
plot_top_terms(tdm_3, 50)
# compute the 4-gram term-document matrix
tdm_4 <- TermDocumentMatrix(sample_corpus, control = list(tokenize = token_4))
# plot the 50 most frequent 4-grams
plot_top_terms(tdm_4, 50)
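For the prediction model itself, these TDMs can be collapsed into simple frequency lookup tables keyed on the observed prefix. The sketch below illustrates the idea for bigrams; the object and column names (bigram_freq, bigram_table, prefix, next_word) are my own and not part of the analysis above.

# sketch: collapse the bigram TDM into a sorted frequency vector
bigram_freq <- sort(slam::row_sums(tdm_2), decreasing = TRUE)
# split each bigram into the observed word (prefix) and the word to predict
bigram_table <- data.frame(
  prefix    = sub(' \\S+$', '', names(bigram_freq)),
  next_word = sub('^.* ', '', names(bigram_freq)),
  freq      = unname(bigram_freq),
  stringsAsFactors = FALSE
)
head(bigram_table)

Analogous tables built from the 3- and 4-gram TDMs would supply the longer contexts.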
Now that we have calculated the n-gram frequency counts, the next steps in building a word prediction model include: