Introduction

The goal of this project is to analyze a large corpus of text documents to discover the structure in the data and how words are put together. In this milestone report I will cover cleaning and analyzing the text data, and I will describe building and sampling what will become the foundation of a training dataset for a predictive text model developed over the remainder of this course. Finally, using the model-ready dataset presented here, I will lay out my plans to build a predictive text data product.

Initial Setup

# load required R packages
library(knitr)
library(magrittr)
library(stringi)
library(dplyr)
library(ggplot2)
library(scales)
library(tm)
library(RWeka)
library(wordcloud)
library(slam)


# input files directory
file_dir <- '../data/raw'

Loading the Data

We begin by loading three large files of English text data collected from:

  1. News articles,
  2. Blogs, and
  3. Twitter

txt_raw <-
    sapply(
        # loop over three datasets
        c('twitter', 'blogs','news'),
        # read each source file into a character vector of lines
        function(txt_source){
            con <- 
                txt_source %>% 
                paste0(file_dir, '/en_US.', ., '.txt') %>% 
                file(open = 'rb', encoding = 'UTF-8')
            txt <- readLines(con = con, encoding = 'UTF-8', skipNul = TRUE)
            close(con)
            return(txt)
        },
        # return a named list for each source
        simplify = F, USE.NAMES = T
    )

Next, we can explore the properties of these text files with some summary statistics. The table below shows that the Twitter file contains the most lines (tweets) but, not surprisingly, the fewest characters.

txt_raw %>% 
    lapply(function(x) x %>% stri_stats_general %>% t %>% as.data.frame) %>% 
    bind_rows(.id = 'Source') %>% 
    kable(digits = 0)
Source     Lines  LinesNEmpty      Chars  CharsNWhite
-------  -------  -----------  ---------  -----------
twitter  2360148      2360148  162096241    134082806
blogs     899288       899288  206824382    170389539
news     1010242      1010242  203223154    169860866

Exploratory Analysis

Next, we examine the number of words per line of text. Again, compared to the other sources, Tweets have the lowest average and maximum number of words. News lines have the highest median word count, while Blogs have a long right tail, with the highest mean and maximum word counts.

txt_word_counts <- lapply(txt_raw, stri_count_words)

txt_word_counts %>% 
    lapply(function(x) x %>% summary %>% as.matrix %>% t %>% as.data.frame) %>% 
    bind_rows(.id = 'Source') %>% 
    kable
Source   Min.  1st Qu.  Median   Mean  3rd Qu.  Max.
-------  ----  -------  ------  -----  -------  ----
twitter     1        7      12  12.75       18    47
blogs       0        9      28  41.75       60  6726
news        1       19      32  34.41       46  1796
txt_word_counts %>%
    lapply(function(x) data.frame(wpl=x)) %>% 
    bind_rows(.id = 'Source') %>% 
    ggplot(aes(wpl, fill=Source))+
    facet_wrap(~Source, nrow = 3, scales = 'free_y')+
    geom_histogram(binwidth = 3, color='white', alpha=0.50)+
    scale_y_continuous(labels=scales::comma)+
    coord_cartesian(xlim = c(0,150))+
    theme_minimal()+
    labs(title = 'Distribution of Words per Line',
         x = 'Words per text line', y = 'Count')

Building Predictive Model Dataset

N-Gram Frequencies

The goal of the project is to use the above datasets to train a model that predicts the next word given the one, two, or three words that precede it. In order to achieve that goal, we will need to calculate frequency counts of all 2-, 3-, and 4-word combinations that occur in our text corpus. Sequences of N consecutive words are called n-grams.
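
As a concrete illustration, tokenizing a single short line into bi-grams with RWeka::NGramTokenizer (the same tokenizer used later in this report) shows what these n-grams look like. The sentence here is just an example of my own, not drawn from the corpus.

# illustration only: the bi-grams ('2-grams') of one short line of text
NGramTokenizer('thanks for the follow', Weka_control(min = 2, max = 2))
# should yield: "thanks for"  "for the"  "the follow"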

Sampling

Our text dataset is quite large, as shown above. Calculating the frequencies of all possible 2-, 3-, and 4-word combinations involves setting up large term-document matrices that can quickly exceed available computer memory (RAM).

txt_raw %>%  object.size %>% format(units='MiB')
## [1] "799.5 MiB"

To conserve memory, we will take a random sample from each of the text sources and use it to calculate our n-gram frequency counts.

sample_size <- 1000L
set.seed(1106)

sample_corpus <- 
    txt_raw %>% 
    # loop through each source and take a random sample
    lapply(function(x){
        x[sample.int(n = length(x), size = sample_size, replace=FALSE)]
    }) %>% 
    # convert list of documents into one long text string
    unlist() %>% paste(collapse = '\n') %>%
    stri_trans_general('latin-ascii') %>% 
    # create corpus
    VectorSource() %>% 
    Corpus(readerControl = list(language="en_US", encoding ="UTF-8")) %>% 
    # clean up corpus
    tm_map(content_transformer(stri_trans_tolower)) %>% 
    tm_map(removeNumbers) %>% 
    tm_map(removePunctuation) %>% 
    tm_map(stripWhitespace)

Term-Document Matrix (TDM)

To calculate our TDMs, we must first provide functions to tokenize the text into 2-, 3-, and 4-grams.

# n-gram tokenizers
token_2 <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
token_3 <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
token_4 <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))


# function to plot the most frequent terms from a TDM
plot_top_terms <- 
    function(tdm, top_n = 50){
        tdm %>% 
            # drop very sparse terms to keep the dense matrix small
            removeSparseTerms(0.1) %>% 
            as.matrix %>%
            # total frequency of each n-gram across documents
            rowSums() %>% 
            sort(decreasing = TRUE) %>% 
            head(top_n) %>% 
            data_frame(ngram = names(.), freq = .) %>% 
            arrange(freq) %>% 
            # order the factor levels by frequency so the bars plot in order
            mutate(ngram = ngram %>% factor(levels = ngram)) %>% 
            ggplot(aes(ngram, freq)) + 
            geom_bar(stat = 'identity', fill = 'steelblue') +
            coord_flip() +
            theme_minimal()
    }

Bi-grams

# Compute TDM
tdm_2  <- TermDocumentMatrix(sample_corpus, control = list(tokenize = token_2))
# plot the 50 most frequent bi-grams
plot_top_terms(tdm_2, 50)

Tri-grams

# Compute TDM
tdm_3  <- TermDocumentMatrix(sample_corpus, control = list(tokenize = token_3))
# plot the 50 most frequent tri-grams
plot_top_terms(tdm_3, 50)

4-grams

# Compute TDM
tdm_4  <- TermDocumentMatrix(sample_corpus, control = list(tokenize = token_4))
# plot the 50 most frequent 4-grams
plot_top_terms(tdm_4, 50)
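
Because tm stores the TDMs above as sparse simple_triplet_matrix objects, the n-gram frequency counts can also be pulled out into a plain table without densifying the matrix, using the slam package loaded earlier. The helper ngram_freq_table and the object freq_4 below are my own illustrative names, a minimal sketch rather than part of the pipeline above.

# sketch: extract an n-gram frequency table from a sparse TDM
ngram_freq_table <- function(tdm){
    # slam::row_sums works directly on the sparse matrix returned by tm
    freqs <- slam::row_sums(tdm)
    data.frame(ngram = names(freqs), freq = as.integer(freqs),
               stringsAsFactors = FALSE) %>%
        arrange(desc(freq))
}

# e.g. a frequency table of the sampled 4-grams
freq_4 <- ngram_freq_table(tdm_4)
head(freq_4)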

Going Further

Now that we have calculated the n-gram frequency counts, the next steps in building a word prediction model include:

  1. Increase the sample size to make the sample more representative
  2. Filter out profanity from the training text sample
  3. Research different word prediction models
  4. Write several algorithms and evaluate their out-of-sample accuracy (a minimal frequency-lookup sketch follows this list)
  5. Develop an interactive prediction Shiny app
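
As a preview of step 4, the simplest shape such an algorithm could take is a direct frequency lookup against an n-gram table like freq_4 sketched above. The function below is only an illustration under that assumption; a usable model would at least back off to the tri- and bi-gram tables when no 4-gram matches and handle unseen words.

# illustration only: return candidate next words for a 3-word prefix by
# looking up the most frequent matching 4-grams. Assumes a data frame
# `freq_4` with columns ngram and freq, as in the sketch above.
predict_next_word <- function(prefix, freq_table, top_n = 3){
    prefix <- stri_trans_tolower(prefix)
    hits <- freq_table %>%
        filter(startsWith(ngram, paste0(prefix, ' '))) %>%
        arrange(desc(freq)) %>%
        head(top_n)
    # the prediction is the last word of each matching 4-gram
    stri_extract_last_words(hits$ngram)
}

# example call (results depend on the random sample drawn above):
# predict_next_word('thanks for the', freq_4)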