The goal of this project is to analyze a large corpus of text documents to discover the structure in the data and how words are put together. In this milestone report I will cover loading, cleaning, and exploring the text data. I will describe building and sampling what will become the foundation of a training dataset for a predictive text model developed over the remainder of this course. Finally, using the model-ready dataset presented here, I will lay out my plans to build a predictive text data product.
# load required R packages
library(knitr)
library(magrittr)
library(stringi)
library(dplyr)
library(ggplot2)
library(scales)
library(tm)
library(RWeka)
library(wordcloud)
library(slam)
# input files directory
file_dir <- '../data/raw'
We begin by loading three large files of English text data collected from Twitter, blogs, and news articles:
txt_raw <-
  sapply(
    # loop over the three data sources
    c('twitter', 'blogs', 'news'),
    # open each source file and read its lines into a character vector
    function(txt_source){
      con <-
        txt_source %>%
        paste0(file_dir, '/en_US.', ., '.txt') %>%
        file(open = 'rb', encoding = 'UTF-8')
      txt <- readLines(con = con, encoding = 'UTF-8', skipNul = TRUE)
      close(con)
      return(txt)
    },
    # return a named list, one element per source
    simplify = FALSE, USE.NAMES = TRUE
  )
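Before computing summary statistics, a quick peek at the first line of each source helps confirm that the files were read correctly (an optional check, not part of the original analysis; the 80-character preview length is arbitrary).

# optional sanity check: preview the first 80 characters of the first line of each source
lapply(txt_raw, function(x) stri_sub(x[1], 1, 80))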
Next, we can explore the properties of these text files with some summary statistics. The table below shows that the Twitter file contains the largest number of lines (tweets) but, not surprisingly, the fewest characters.
txt_raw %>%
  lapply(function(x) x %>% stri_stats_general %>% t %>% as.data.frame) %>%
  bind_rows(.id = 'Source') %>%
  kable(digits = 0)
| Source | Lines | LinesNEmpty | Chars | CharsNWhite |
|---|---|---|---|---|
| twitter | 2360148 | 2360148 | 162096241 | 134082806 |
| blogs | 899288 | 899288 | 206824382 | 170389539 |
| news | 1010242 | 1010242 | 203223154 | 169860866 |
Next, we examine the number of words per line of text. Again, compared to the other sources, tweets have the lowest average and maximum word counts. News lines have the highest median, while blogs have the highest average and a long right tail (a maximum of 6,726 words per line).
txt_word_counts <- lapply(txt_raw, stri_count_words)
txt_word_counts %>%
  lapply(function(x) x %>% summary %>% as.matrix %>% t %>% as.data.frame) %>%
  bind_rows(.id = 'Source') %>%
  kable
| Source | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|
| twitter | 1 | 7 | 12 | 12.75 | 18 | 47 |
| blogs | 0 | 9 | 28 | 41.75 | 60 | 6726 |
| news | 1 | 19 | 32 | 34.41 | 46 | 1796 |
txt_word_counts %>%
  lapply(function(x) data.frame(wpl = x)) %>%
  bind_rows(.id = 'Source') %>%
  ggplot(aes(wpl, fill = Source)) +
  facet_wrap(~Source, nrow = 3, scales = 'free_y') +
  geom_histogram(binwidth = 3, color = 'white', alpha = 0.50) +
  scale_y_continuous(labels = scales::comma) +
  coord_cartesian(xlim = c(0, 150)) +
  theme_minimal() +
  labs(title = 'Distribution of Words per Line',
       x = 'Words per text line', y = 'Count')
The goal of the project is to use the above datasets to train a model that predicts the next word given the one, two, or three words preceding it. In order to achieve that goal, we will need to calculate frequency counts of all two-, three-, and four-word combinations that occur in our text corpus. Combinations of n consecutive words are called n-grams.
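As a small illustration (a toy example, not part of the analysis pipeline), the 2-grams of a short sentence are simply all pairs of adjacent words:

# toy example: build the 2-grams (bigrams) of a short sentence with base R
words <- strsplit('thanks for the follow back', ' ')[[1]]
paste(head(words, -1), tail(words, -1))
# returns: "thanks for" "for the" "the follow" "follow back"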
As shown above, our text dataset is quite large. Calculating the frequencies of all possible 2-, 3-, and 4-word combinations involves building large term-document matrices (TDMs) that can quickly exceed available computer memory (RAM).
txt_raw %>% object.size %>% format(units='MiB')
## [1] "799.5 MiB"
To conserve memory, we will take a random sample of lines from each of the three text sources and use it to calculate our n-gram frequency counts.
sample_size <- 1000L
set.seed(1106)
sample_corpus <-
  txt_raw %>%
  # loop through each source and take a random sample of lines
  lapply(function(x){
    x[sample.int(n = length(x), size = sample_size, replace = FALSE)]
  }) %>%
  # collapse the sampled lines into one long text string
  unlist() %>% paste(collapse = '\n') %>%
  # transliterate to plain ASCII
  stri_trans_general('latin-ascii') %>%
  # create corpus
  VectorSource() %>%
  Corpus(readerControl = list(language = "en_US", encoding = "UTF-8")) %>%
  # clean up corpus: lower-case, drop numbers and punctuation, squash whitespace
  tm_map(content_transformer(stri_trans_tolower)) %>%
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation) %>%
  tm_map(stripWhitespace)
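As a quick sanity check (optional, and not required for anything that follows), we can confirm that the corpus holds a single cleaned document and preview its contents:

# optional check: the corpus should contain one cleaned document
length(sample_corpus)
# preview the first 100 characters of the cleaned text
stri_sub(content(sample_corpus[[1]]), 1, 100)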
To calculate these TDMs, we must first provide functions that tokenize the text into 2-, 3-, and 4-grams.
# n-gram tokenizers
token_2 <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
token_3 <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
token_4 <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))
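A quick check on a toy string (illustrative only; the string and the expected results noted in the comments are my own) shows what these tokenizers produce:

# illustrative check of the tokenizers on a toy string
token_2('to be or not to be')  # expected: consecutive 2-grams such as "to be", "be or", "or not", ...
token_3('to be or not to be')  # expected: consecutive 3-grams such as "to be or", "be or not", ...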
# function to plot the top_n most frequent terms in a TDM
plot_top_terms <-
  function(tdm, top_n = 50){
    tdm %>%
      removeSparseTerms(0.1) %>%
      as.matrix %>%
      # sum term frequencies across documents
      rowSums() %>%
      sort(decreasing = TRUE) %>%
      head(top_n) %>%
      data_frame(ngram = names(.), freq = .) %>%
      # order terms by frequency for the horizontal bar chart
      arrange(freq) %>%
      mutate(ngram = ngram %>% factor(levels = ngram)) %>%
      ggplot(aes(ngram, freq)) +
      geom_bar(stat = 'identity', fill = 'steelblue') +
      coord_flip() +
      theme_minimal()
  }
# compute the bigram (2-gram) term-document matrix
tdm_2 <- TermDocumentMatrix(sample_corpus, control = list(tokenize = token_2))
# plot the 50 most frequent bigrams
plot_top_terms(tdm_2, 50)
# compute the trigram (3-gram) term-document matrix
tdm_3 <- TermDocumentMatrix(sample_corpus, control = list(tokenize = token_3))
# plot the 50 most frequent trigrams
plot_top_terms(tdm_3, 50)
# compute the 4-gram term-document matrix
tdm_4 <- TermDocumentMatrix(sample_corpus, control = list(tokenize = token_4))
# plot the 50 most frequent 4-grams
plot_top_terms(tdm_4, 50)
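For the prediction model itself, these TDMs can be collapsed into simple frequency lookup tables keyed on the observed prefix. The sketch below illustrates the idea for bigrams; the object and column names (bigram_freq, bigram_table, prefix, next_word) are my own and not part of the analysis above.

# sketch: collapse the bigram TDM into a sorted frequency vector
bigram_freq <- sort(slam::row_sums(tdm_2), decreasing = TRUE)
# split each bigram into the observed word (prefix) and the word to predict
bigram_table <- data.frame(
  prefix    = sub(' \\S+$', '', names(bigram_freq)),
  next_word = sub('^.* ', '', names(bigram_freq)),
  freq      = unname(bigram_freq),
  stringsAsFactors = FALSE
)
head(bigram_table)

Analogous tables built from the 3- and 4-gram TDMs would supply the longer contexts.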
Now that we have calculated the n-gram frequency counts, the next steps in building a word prediction model include: