Exploratory analysis on a set of sample data used for the creation of a natural language processing (NLP) ‘next word’ prediction algorithm
Data were collected from three online sources: blogs, news, and Twitter.
Data were stored in separate files according to language.
For the sake of simplicity, only the English language files were loaded and analyzed while building the initial model. In addition, because the files are very large, the data sets were sampled to reduce the prediction time of the model.
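The sampling step can be illustrated with a minimal sketch. It assumes each file is read as a list of lines and that every line is kept independently with a fixed probability; the function name `sample_lines`, the seed, and the toy corpus are hypothetical and are not taken from the original analysis.

```python
import random

def sample_lines(lines, fraction, seed=42):
    """Keep each line independently with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return [line for line in lines if rng.random() < fraction]

# Toy stand-in for a large corpus file (hypothetical data).
corpus = [f"line {i}" for i in range(10_000)]
sample = sample_lines(corpus, fraction=0.05)
print(len(sample), "of", len(corpus), "lines kept")
```

Because each line is kept at random, the sample size only approximates the requested fraction. Fixing the seed makes repeated runs return the same sample.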
The original Blog data set contains 38,156,768 words, the News data set contains 2,694,073 words, and the Twitter data set contains 30,221,979 words.
The data were processed in order to:
After processing and sampling 0.05 percent of the original data, the blogs (English) file now contains 939,125 words, the Twitter file contains 254,968 words, and the News file contains 73,027 words.
Individual counts for terms were examined from each of the three data sources sampled.
##       Docs
## Terms  Blog News Twitter
##   fun   671   22     346
##   lost  372   31     111
##   tree  293    8      27
For example, the counts above are for the terms fun, lost, and tree.
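A per-source term count of this kind can be sketched as follows. This is only an illustration under simple assumptions: text is lowercased and split into runs of letters and apostrophes, and the three short strings are invented stand-ins for the sampled files, so the numbers it prints are unrelated to the counts reported above. The helper name `term_counts` is hypothetical.

```python
import re
from collections import Counter

def term_counts(texts_by_source, terms):
    """Return {term: {source: count}} for the requested terms."""
    counts = {}
    for source, text in texts_by_source.items():
        # Assumed tokenization: lowercase runs of letters and apostrophes.
        tokens = re.findall(r"[a-z']+", text.lower())
        counts[source] = Counter(tokens)
    # Counter returns 0 for terms that never occur in a source.
    return {t: {s: counts[s][t] for s in texts_by_source} for t in terms}

# Invented toy documents, one per source.
docs = {
    "Blog": "It was fun. We lost the tree, but it was fun.",
    "News": "The team lost.",
    "Twitter": "so much fun",
}
table = term_counts(docs, ["fun", "lost", "tree"])
for term, row in table.items():
    print(term, row)
```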
The most common words found in the new data were examined in a table, a simple word cloud (see Appendix, Figure-1), and a bar graph (see Appendix, Figure-2).
The relative occurrence of some of these terms, and the relationships between them, are displayed in a cluster dendrogram (see Appendix, Figure-3). A cluster plot (see Appendix, Figure-4) gives a general idea of possible groupings of some of these common words.
The initial approach was to divide the sampled text into smaller phrases (n-grams), then determine the most likely next word in a given line of text based on overall popularity within the set of sampled phrases.
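The popularity-based idea can be sketched roughly as below. It is not the model built for this report. It assumes whitespace tokenization and trigrams (so the last two words of the input form the context), and the training sentences are invented. The names `build_model` and `predict_next` are hypothetical.

```python
from collections import Counter, defaultdict

def build_model(sentences, n=3):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            model[context][tokens[i + n - 1]] += 1
    return model

def predict_next(model, phrase, n=3):
    """Return the most frequent follower of the last n-1 words, or None."""
    context = tuple(phrase.lower().split()[-(n - 1):])  # assumes n >= 2
    followers = model.get(context)
    if not followers:
        return None  # context never seen in the sampled phrases
    return followers.most_common(1)[0][0]

# Invented toy sentences standing in for the sampled text.
sentences = [
    "a case of beer",
    "a case of beer please",
    "a case of soda",
]
model = build_model(sentences)
print(predict_next(model, "case of"))
```

In this sketch the prediction depends entirely on which continuations happen to be most frequent in the sample, and an unseen context yields no prediction at all.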
Given the short phrase ‘case of’, the model predicts ‘just’ as the next word. This is clearly not the best approach, because the choices offered (i.e., Quiz 2, Question 1) were soda, cheese, beer, or pretzels.
Additional effort is needed to develop a more sophisticated and accurate approach that still returns predictions to the end user within a reasonable processing time.
After designing an accurate prediction model, an application will be created allowing users to input a phrase and then be provided with a reasonable suggestion for the probable next word.