Executive Summary

This report describes preliminary findings from analyzing a corpus of Twitter, blog, and news files that will ultimately be used to design a model that predicts the next word following a typed sequence. We begin by reviewing the data loading and preprocessing steps. Next, we visualize some of the most frequent words and word combinations to explore the data set. We conclude by discussing next steps for model development.

Data Loading

Our first task is to load the data and generate summary statistics for each file. We examine the number of lines in each file, the average and standard deviation of line length (in characters), and the average and standard deviation of the number of words per line.

## twitter
con.t <- file("./Corpus/en_US.twitter.txt", "r") # create text connection
tmp.t <- readLines(con.t)

linecnt.t <- length(tmp.t) # grab linecount of file
lenlines.t <- sapply(tmp.t,nchar) # grab length of each line in characters
numwords.t <- sapply(strsplit(tmp.t," "),length) # grab number of words in each line

close(con.t); rm(con.t)  # close and remove connections


## blog
con.b <- file("./Corpus/en_US.blogs.txt", "r") # create text connection

tmp.b <- readLines(con.b)

linecnt.b <- length(tmp.b) # grab linecount of file
lenlines.b <- sapply(tmp.b,nchar) # grab length of each line in characters
numwords.b <- sapply(strsplit(tmp.b," "),length) # grab number of words in each line

close(con.b); rm(con.b)  # close and remove connections


## news
con.n <- file("./Corpus/en_US.news.txt", "r") # create text connection

tmp.n <- readLines(con.n)

linecnt.n <- length(tmp.n) # grab linecount of file
lenlines.n <- sapply(tmp.n,nchar) # grab length of each line in characters
numwords.n <- sapply(strsplit(tmp.n," "),length) # grab number of words in each line

close(con.n); rm(con.n)  # close and remove connections

File Source   Total Lines   Avg. Line Length (chars)   Std. Line Length (chars)   Avg. Words per Line   Std. Words per Line
Twitter         2,360,148                       68.8                      37.27                 12.87                  6.90
Blogs             899,288                      231.7                     260.36                 41.52                 46.27
News               77,259                      203.0                     134.46                 34.22                 22.82
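
For reference, the table above can be assembled from the per-file vectors computed earlier; the following is a minimal sketch (the data frame name file.stats is illustrative):

file.stats <- data.frame(
  File.Source        = c("Twitter", "Blogs", "News"),
  Total.Lines        = c(linecnt.t, linecnt.b, linecnt.n),
  Avg.Line.Length    = c(mean(lenlines.t), mean(lenlines.b), mean(lenlines.n)),
  Std.Line.Length    = c(sd(lenlines.t), sd(lenlines.b), sd(lenlines.n)),
  Avg.Words.per.Line = c(mean(numwords.t), mean(numwords.b), mean(numwords.n)),
  Std.Words.per.Line = c(sd(numwords.t), sd(numwords.b), sd(numwords.n))
)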

Preprocessing

Given the size of each file and computing/memory constraints, the remainder of our analysis and model development is based on a sample of the overall dataset. We construct a corpus from a randomly selected 5% sample of the lines in each file, balancing completeness against processing capacity. These sample files have been saved along with the corresponding line indices to ensure repeatability.
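
A minimal sketch of how such a sample might be drawn for the Twitter file, using the objects created during data loading; the seed, output file names, and .rds index file below are illustrative rather than the exact procedure used:

set.seed(1234)                                                        # illustrative seed so the sample can be reproduced
idx.t <- sample(seq_len(linecnt.t), size = round(0.05 * linecnt.t))   # 5% of Twitter line indices
writeLines(tmp.t[idx.t], "./Corpus/sample/en_US.twitter.sample.txt")  # save the sampled lines (path is illustrative)
saveRDS(idx.t, "./Corpus/sample/en_US.twitter.sample.idx.rds")        # save the indices for repeatability

The same pattern is repeated for the blog and news files, and the saved sample directory is then read into a document corpus: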

library(tm)  # text mining package used to build and transform the corpus

src <- "~/Data Science WD/Capstone Project/Corpus/sample"
corp <- Corpus(DirSource(src), readerControl = list(reader = readPlain, language = "en", load = TRUE))

Our next step involves cleaning the document corpus to render it suitable for tokenization, the process that standardizes terms and enables us to perform frequency and relationship analysis. This step is not shown here (out of scope), but the standardization includes removing punctuation, vulgar terms, common linking terms (i.e., "stopwords"), excess white space, and numbers. We also transform all text to lower case and reduce words to their "stem", in order to improve our ability to count the frequency of words sharing a common root (e.g., "runs" vs. "running" vs. "run").
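
A sketch of how these transformations might be applied with tm's built-in cleaning functions; the profanity list path is a placeholder, and the SnowballC package is assumed for stemming:

library(SnowballC)  # stemming backend used by stemDocument

profanity <- readLines("./Corpus/profanity.txt")          # placeholder path for a vulgar-term word list
corp <- tm_map(corp, content_transformer(tolower))        # lower-case all text
corp <- tm_map(corp, removePunctuation)                   # strip punctuation
corp <- tm_map(corp, removeNumbers)                       # strip numbers
corp <- tm_map(corp, removeWords, stopwords("en"))        # drop common linking terms (stopwords)
corp <- tm_map(corp, removeWords, profanity)              # drop vulgar terms
corp <- tm_map(corp, stripWhitespace)                     # collapse excess white space
corp <- tm_map(corp, stemDocument)                        # reduce words to a common stem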

Exploratory Visualization

After processing, we construct frequency tables for each individual word in our sample, as well as for word pairs (bigrams) and groups of three words (trigrams). From these tables, we can see that a relatively small subset of frequently used terms accounts for the vast majority of the words in our corpus sample. The plots below highlight this point: in the left-most plot, roughly 20,000 unique words, about 17% of the total unique words in our corpus, cover over 91% of the content. Note that the charts below plot only words, word pairs, and word groups that appear more than once in our sample.
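
As a rough illustration of how these frequency tables can be built, the sketch below tokenizes on white space and ignores line boundaries; it assumes txt is a character vector holding the cleaned text extracted from the corpus above:

# txt: character vector of cleaned lines, assumed to come from the processed corpus above
words <- unlist(strsplit(txt, "\\s+"))                     # tokenize on white space
words <- words[words != ""]                                # drop empty tokens

unigram.freq <- sort(table(words), decreasing = TRUE)      # single-word frequencies
coverage <- cumsum(unigram.freq) / sum(unigram.freq)       # cumulative share of content covered by top-ranked words

bigrams <- paste(head(words, -1), tail(words, -1))         # adjacent word pairs
bigram.freq <- sort(table(bigrams), decreasing = TRUE)

trigrams <- paste(head(words, -2), words[2:(length(words) - 1)], tail(words, -2))  # three-word groups
trigram.freq <- sort(table(trigrams), decreasing = TRUE)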

Lastly, we examine the most frequently observed (top 15) words, word pairs, and three-word groups. This enables us to refine the list of words that may not be useful for prediction (i.e., stopwords) and to begin conceptualizing the prediction model.
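
For example, the top 15 single words can be pulled straight from the frequency table above and plotted with base R (a minimal sketch; the same pattern applies to the bigram and trigram tables):

top15.words <- head(unigram.freq, 15)                      # 15 most frequent words
barplot(top15.words, las = 2, main = "Top 15 Words", ylab = "Frequency")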

Model Development: Next Steps

Our next steps will involve refining the model training set to exclude terms with lower predictive value (e.g., some of those at the bottom of our frequency tables). We plan to build a Markov chain-based model that relies on our preprocessing transformations and n-gram analysis, and ultimately to deploy it in a web-based Shiny application. The application will allow users to input a term or phrase and will output a predicted next word based on our trained model.
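
As a rough illustration of the intended prediction step (not the final model), the trigram frequency table built earlier could back a simple lookup in which the last two words of the user's input select the most frequent observed completion; the function name and logic below are illustrative only:

predict.next.word <- function(phrase, trigram.freq) {
  toks <- unlist(strsplit(tolower(phrase), "\\s+"))         # tokenize the input phrase
  prefix <- paste(tail(toks, 2), collapse = " ")            # keep the last two words as the Markov state
  cand <- trigram.freq[startsWith(names(trigram.freq), paste0(prefix, " "))]  # matching trigrams
  if (length(cand) == 0) return(NA_character_)              # no matching trigram observed
  tail(strsplit(names(cand)[1], " ")[[1]], 1)               # last word of the most frequent matching trigram
}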