Initial Findings from SwiftKey Dataset

A Submission for the Milestone Report Portion of the Data Science Specialization Capstone Project

Explanation Of Data

Source

The data used for this project were generously provided by SwiftKey as part of the Coursera Data Science Specialization taught by Professors Jeff Leek, Roger Peng, and Brian Caffo. The data consist of blocks of English text drawn from news articles, blog posts, and Twitter.

File Statistics

  file                size(MB)  num.of.lines  longest.line  num.of.words
1 en_US.blogs.txt       200.42        899288        483415      37334441
2 en_US.news.txt        196.28         77259         14556       2643972
3 en_US.twitter.txt     159.36       2360148       1484357      30373792
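
For reference, a minimal sketch of how statistics like these can be computed in R. The data/ paths are assumptions about where the unzipped SwiftKey files live, and the word count here relies on stringi's word counter rather than a full tokenizer, so exact figures may differ.

library(stringi)

# Hypothetical location of the unzipped SwiftKey files
files <- c("data/en_US.blogs.txt", "data/en_US.news.txt", "data/en_US.twitter.txt")

file_stats <- do.call(rbind, lapply(files, function(f) {
  lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  data.frame(
    file         = basename(f),
    size.MB      = round(file.size(f) / 1024^2, 2),
    num.of.lines = length(lines),
    longest.line = max(nchar(lines)),
    num.of.words = sum(stri_count_words(lines))
  )
}))
file_stats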

Exploratory Analysis

Using a subset of 5,000 news excerpts, 5,000 blog posts, and 5,000 tweets, we can explore the dataset with R's quanteda package. I generated frequency distributions of individual words ("unigrams") and of word combinations ("n-grams") to get a sense of how large a vocabulary the final model will need to cover.
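
A minimal sketch of that sampling and n-gram counting, assuming the three files have already been read into character vectors named blogs, news, and tweets (those names are assumptions):

library(quanteda)
set.seed(1234)

# Sample 5,000 lines from each source
samp <- c(sample(blogs, 5000), sample(news, 5000), sample(tweets, 5000))

toks <- tokens(samp, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_tolower(toks)

# Trigram frequencies (stopwords kept)
tri_dfm <- dfm(tokens_ngrams(toks, n = 3))
topfeatures(tri_dfm, 10)

# Unigram frequencies with stopwords removed
uni_dfm <- dfm(tokens_remove(toks, stopwords("en")))
topfeatures(uni_dfm, 10)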

[Figure: Top Unigrams (sans stopwords), frequency plot]
Top Trigrams (with stopwords)

  trigram        frequency
  one_of_the           154
  a_lot_of             150
  going_to_be           78
  to_be_a               77
  i_don't               74
  it_was_a              67
  i_want_to             65
  the_end_of            63
  some_of_the           62
  out_of_the            60

Approach for Final Project

The final project is to create a Shiny app, deployed to shinyapps.io, that will (a minimal interface sketch follows this list):
* Receive a string of text
* Predict the next word in the string
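
A minimal sketch of that interface; predict_next_word() and the ngram2/ngram3/ngram4 tables are placeholder names for objects sketched in the next section, not the final implementation:

library(shiny)

# ngram2/ngram3/ngram4 and predict_next_word() are assumed to be built
# offline and loaded when the app starts (see the sketches below).
ui <- fluidPage(
  textInput("phrase", "Enter a phrase:"),
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    req(input$phrase)
    predict_next_word(input$phrase, ngram4, ngram3, ngram2)
  })
}

shinyApp(ui, server)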

My basic approach will be a 4-gram back-off model, but I also plan to experiment with some machine-learning techniques based on a 4-gram data frame. All document-feature matrices (DFMs) will be created with the quanteda package, which has fewer features than the gold-standard tm package but is much, much faster, making a 4-gram model feasible.
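
A minimal sketch of the back-off lookup, under these assumptions: ngram4, ngram3, and ngram2 are data frames with columns prefix, word, and freq, built offline from the quanteda DFMs, where prefix is the underscore-joined context and word is the observed continuation. This is a simplified back-off without smoothing or discounting, not the final model:

predict_next_word <- function(phrase, ngram4, ngram3, ngram2) {
  toks <- tolower(unlist(strsplit(trimws(phrase), "\\s+")))
  tables <- list(ngram2, ngram3, ngram4)  # indexed by prefix length

  # Try the longest available context first, then back off to shorter ones
  for (n in 3:1) {
    if (length(toks) < n) next
    prefix <- paste(tail(toks, n), collapse = "_")
    hits <- tables[[n]][tables[[n]]$prefix == prefix, ]
    if (nrow(hits) > 0) return(hits$word[which.max(hits$freq)])
  }
  "the"  # crude fallback: a very common unigram
}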

Regardless of the algorithmic approach, I plan to precompute the R objects used for prediction, save them on the server, and load the model and DFM objects at runtime rather than building them on the fly, in order to improve performance.
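
A minimal sketch of that precompute-and-load pattern using base R's saveRDS()/readRDS(); the file names and table names are placeholders:

# Offline, after building the n-gram tables from the quanteda DFMs:
saveRDS(ngram4, "ngram4.rds")
saveRDS(ngram3, "ngram3.rds")
saveRDS(ngram2, "ngram2.rds")

# At the top of the Shiny app (runs once per session, not per request):
ngram4 <- readRDS("ngram4.rds")
ngram3 <- readRDS("ngram3.rds")
ngram2 <- readRDS("ngram2.rds")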