Capstone Project

CB

September 1, 2016

Overview:

This report presents an exploratory analysis of a set of sample data used in the creation of a natural language processing (NLP) ‘next word’ prediction algorithm.

Data were collected from three online sources: blogs, news articles, and Twitter feeds.

Data were stored in separate files according to language.

For the sake of simplicity, only the English language files were loaded and analyzed while building the initial model. In addition, due to the very large file sizes, the data sets were sampled to reduce the model's prediction times.

Loading the Data

The original Blog data set contains 38,156,768 words, the News data set contains 2,694,073 words, and the Twitter data set contains 30,221,979 words.
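As a point of reference, below is a minimal R sketch of how the raw files could be loaded and their words counted. The file paths and the use of the stringi package are assumptions, not the report's actual code.

library(stringi)

# Read each raw English file; paths follow the standard en_US layout (assumed)
blogs   <- readLines("final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

# Total word count per source
sum(stri_count_words(blogs))
sum(stri_count_words(news))
sum(stri_count_words(twitter))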

The data were then processed to clean and prepare the text for analysis, as described below.

Preprocess the Data

After processing and sampling 0.05 percent of the original data, the blogs (English) file now contains 939,125 words, the Twitter file contains 254,968 words, and the News file contains 73,027 words.
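A sketch of how the sampling and cleaning could be carried out with the tm package follows, using the objects from the previous sketch. The sampling helper, random seed, and the specific cleaning steps shown are assumptions rather than the report's exact procedure.

library(tm)

# Keep a random fraction of lines from each source (rate shown is illustrative)
set.seed(1234)
sample_lines <- function(x, rate = 0.05) x[rbinom(length(x), 1, rate) == 1]

blogs_s   <- sample_lines(blogs)
news_s    <- sample_lines(news)
twitter_s <- sample_lines(twitter)

# Build a three-document corpus (one document per source) and clean it
corpus <- VCorpus(VectorSource(c(paste(blogs_s,   collapse = " "),
                                 paste(news_s,    collapse = " "),
                                 paste(twitter_s, collapse = " "))))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)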

Individual term counts were examined for each of the three sampled data sources.

##       Docs
## Terms  Blog News Twitter
##   fun   671   22     346
##   lost  372   31     111
##   tree  293    8      27

For example, the table above shows the counts for the terms fun, lost, and tree.
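Counts like these can be read off a term-document matrix built from the cleaned corpus. The sketch below assumes the corpus from the previous step and simply relabels the three documents to match the table.

# Term-document matrix over the three source documents
tdm <- TermDocumentMatrix(corpus)

# Counts for selected terms in each source
idx <- which(Terms(tdm) %in% c("fun", "lost", "tree"))
counts <- as.matrix(tdm[idx, ])
colnames(counts) <- c("Blog", "News", "Twitter")
counts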

The most common words found in the sampled data were examined in a table, a simple word cloud (see Appendix, Figure-1), and a bar graph (see Appendix, Figure-2).
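The figures themselves appear in the Appendix; a sketch of how such a word cloud and bar graph could be generated from the term-document matrix is shown below. The wordcloud package, the word limits, and the use of base graphics are assumptions.

library(wordcloud)

# Overall term frequencies across the three sources
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

# Word cloud of the most common words (cf. Figure-1)
wordcloud(names(freq), freq, max.words = 100, random.order = FALSE)

# Bar graph of the 20 most frequent terms (cf. Figure-2)
barplot(freq[1:20], las = 2, ylab = "Count")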

The relative occurrence of, and common relationships between, some of these terms are displayed in a cluster dendrogram (see Appendix, Figure-3). A cluster plot (see Appendix, Figure-4) gives a general idea of possible groupings of some of these common words.
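A sketch of how the dendrogram and cluster plot could be produced from the same term-document matrix follows. The sparsity threshold, distance measure, and choice of k = 3 clusters are assumptions.

library(cluster)

# Keep terms that occur in all three sources; a stricter cutoff may be
# needed in practice to keep the dendrogram readable
m <- scale(as.matrix(removeSparseTerms(tdm, 0.1)))

# Cluster dendrogram of the remaining terms (cf. Figure-3)
plot(hclust(dist(m), method = "ward.D2"))

# k-means cluster plot (cf. Figure-4)
km <- kmeans(m, centers = 3)
clusplot(m, km$cluster, color = TRUE, shade = TRUE, lines = 0)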

Executive Summary

The initial approach was to divide the sampled text into smaller phrases (n-grams), then determine the most likely next word in a given line of text based upon the overall popularity of matching phrases within the set of sampled phrases.

Taking the short phrase ‘case of’ results in a next-word prediction of ‘just’. This is obviously not the best approach, because the choices offered (i.e., in Quiz 2, Question 1) should have been soda, cheese, beer, or pretzels.
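For illustration, below is a minimal base-R sketch of this simple frequency-based approach: it counts trigrams in the sampled lines from the earlier sketch and returns the most common word that follows a given two-word phrase. The predict_next helper is illustrative and not the report's actual implementation.

# Predict the most frequent word that follows a two-word prefix
predict_next <- function(lines, prefix) {
  words <- unlist(strsplit(tolower(paste(lines, collapse = " ")), "[^a-z']+"))
  words <- words[words != ""]
  n <- length(words)
  # all trigrams in the sample
  tri <- paste(words[-c(n - 1, n)], words[-c(1, n)], words[-c(1, 2)])
  # keep trigrams starting with the prefix and count the third word
  hits <- tri[startsWith(tri, paste0(tolower(prefix), " "))]
  if (length(hits) == 0) return(NA_character_)
  thirds <- sapply(strsplit(hits, " "), `[`, 3)
  names(sort(table(thirds), decreasing = TRUE))[1]
}

predict_next(c(blogs_s, news_s, twitter_s), "case of")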

Next Steps:

Additional effort will be needed to develop a more sophisticated and accurate approach, while maintaining a reasonable processing time for the end user.

After an accurate prediction model has been designed, an application will be created that allows users to input a phrase and be provided with a reasonable suggestion for the probable next word.

Appendix