This is our first update on the Text Prediction Project. Our goal is to create a Shiny R app that predicts the next word after a user inputs one or more words. The corpus we are using to build these predictions comes from a dataset of blogs, news articles, and tweets provided by Coursera and SwiftKey.
On reading the data, we removed non-standard characters and null lines. After this cleaning, we did some simple exploration with R to determine the size and content of each file.
| File | Size (MB) | Lines | Words | Max WPL | Min WPL | Average WPL |
|---|---|---|---|---|---|---|
| blogs | 210.16 | 899,288 | 37,599,515 | 6,668 | 1 | 41.81 |
| news | 205.81 | 1,010,242 | 34,790,784 | 1,796 | 1 | 34.44 |
| twitter | 167.11 | 2,360,148 | 30,234,913 | 56 | 1 | 12.81 |
The blog and news files are of similar size, with just under and just over a million records, respectively. The twitter file is smaller in size but has more records, as tweets are quite short. Each file contains between 30 and 40 million words. Blog entries are the longest, averaging about 42 words per line, whereas tweets are the shortest at about 13. Every file contains records of only one word, and the longest record is a blog post of over 6,000 words. These files seem large enough to be a decent sampling of online language and thus appropriate for predicting the next word of someone writing online.
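For reference, here is a minimal sketch of the kind of cleaning and per-file summary described above. The file names and the non-ASCII stripping approach are assumptions for illustration, not the exact code used.

```r
library(stringi)

summarise_file <- function(path) {
  # Read lines (skipping embedded nulls), strip non-standard characters,
  # and drop empty lines
  txt <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  txt <- iconv(txt, from = "UTF-8", to = "ASCII", sub = "")
  txt <- txt[nzchar(txt)]
  wpl <- stri_count_words(txt)                 # words per line
  data.frame(size_mb = file.size(path) / 1024^2,
             lines   = length(txt),
             words   = sum(wpl),
             max_wpl = max(wpl),
             min_wpl = min(wpl),
             avg_wpl = mean(wpl))
}

files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
do.call(rbind, lapply(files, summarise_file))
```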
N-grams will be used as the basis of next-word prediction for this project. We used the R package quanteda to tokenize the text documents and examine the frequencies of single words, word pairs, and word triplets (unigrams, bigrams, and trigrams). Symbols, numbers, punctuation, and profanity were removed first. The graphs below show the 20 most common of each.
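A sketch of this tokenization step is below. The function names follow the current quanteda API rather than the 0.9.x version cited at the end of this report, `clean_text` is assumed to be the cleaned character vector from the step above, and `profanity_list` stands in for an external profanity word list.

```r
library(quanteda)

# Tokenize and remove symbols, numbers, punctuation, and profanity
toks <- tokens(corpus(clean_text),
               remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, pattern = profanity_list)   # assumed external word list

uni_dfm <- dfm(toks)                           # single words
bi_dfm  <- dfm(tokens_ngrams(toks, n = 2))     # word pairs
tri_dfm <- dfm(tokens_ngrams(toks, n = 3))     # word triplets

topfeatures(uni_dfm, 20)   # 20 most common unigrams; likewise for bi_dfm, tri_dfm
```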
An interesting question is how many words are needed to cover 50%, or even 90%, of the language used. If we assume this corpus represents the body of online language, the number is smaller than one might imagine. The graph below shows the cumulative coverage as each word is added, in order of decreasing frequency.
Surprisingly, for the entire corpus only 235 words are needed to cover 50% of the language, and just under 10,000 to cover 90%. These counts may actually be overestimates, because misspellings, non-words, and non-English words have not yet been removed from the corpus and inflate the denominator in these calculations. Removing these extraneous words will be an important next step before building a prediction model.
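One way these coverage counts can be computed from the unigram frequencies (reusing the `uni_dfm` object from the tokenization sketch above):

```r
# Word counts sorted from most to least frequent
word_freq <- sort(colSums(uni_dfm), decreasing = TRUE)
cum_share <- cumsum(word_freq) / sum(word_freq)

which(cum_share >= 0.5)[1]   # words needed for 50% coverage (235 in this corpus)
which(cum_share >= 0.9)[1]   # words needed for 90% coverage (just under 10,000)
```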
Many n-grams of all sizes occur only once, even in files as large as these, and the cumulative percentages above suggest that not all of them are needed for robust prediction. The table below shows the proportion of n-grams that appear only once versus those that appear multiple times. Again, some of these single occurrences may be due to misspellings, non-words, and non-English words. Naturally, the proportion of single occurrences increases with n-gram size; this could be an issue for the 5-grams we hope to use.
| N-gram Type | Occurrence Type | Count |
|---|---|---|
| All words | Multiple occurrences | 327,229 |
| All words | Single occurrence | 495,679 |
| Bigrams | Multiple occurrences | 10,052,278 |
| Bigrams | Single occurrence | 4,193,555 |
| Trigrams | Multiple occurrences | 7,485,488 |
| Trigrams | Single occurrence | 37,457,208 |
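The split between single and multiple occurrences can be read directly from the document-feature matrices; for example, for trigrams (a sketch reusing `tri_dfm` from the tokenization sketch above):

```r
tri_counts <- colSums(tri_dfm)   # total count of each distinct trigram
sum(tri_counts == 1)             # trigrams seen exactly once
sum(tri_counts > 1)              # trigrams seen more than once
```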
Coursera / Johns Hopkins University Data Science Specialization. https://www.coursera.org/specializations/jhu-data-science
Benoit, Kenneth, et al. "quanteda: Quantitative Analysis of Textual Data." R package version 0.9.9.67. http://quanteda.io.