This is our first update on the Text Prediction Project. Our goal is to create a Shiny R app that predicts the next word after a user inputs one or more words. The corpus we are using to build these predictions comes from a dataset of blogs, news articles, and tweets provided by Coursera and SwiftKey.
On reading the data, we removed non-standard characters and null lines. After this cleaning, we did some simple exploration with R to determine the size and content of each file.
| File | Size (MB) | Lines | Words | Max WPL | Min WPL | Average WPL |
|---|---|---|---|---|---|---|
| blogs | 210.16 | 899,288 | 37,599,515 | 6,668 | 1 | 41.81 |
| news | 205.81 | 1,010,242 | 34,790,784 | 1,796 | 1 | 34.44 |
| twitter | 167.11 | 2,360,148 | 30,234,913 | 56 | 1 | 12.81 |
The blog and news files are of similar size, with just under and just over a million records, respectively. The twitter file is smaller in size but has more records, as tweets are quite short. Each file contains between 30 and 40 million words. Blog entries are the longest, averaging about 42 words per line, whereas tweets are the shortest at about 13. Every file contains records of only one word, and the longest record is a blog post of over 6,000 words. These files seem large enough to be a decent sampling of online language and thus appropriate for predicting the next word of someone writing online.
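For reference, here is a minimal sketch of the kind of cleaning and per-file summary described above. The file names and the non-ASCII stripping approach are assumptions for illustration, not the exact code used.

```r
library(stringi)

summarise_file <- function(path) {
  # Read lines (skipping embedded nulls), strip non-standard characters,
  # and drop empty lines
  txt <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  txt <- iconv(txt, from = "UTF-8", to = "ASCII", sub = "")
  txt <- txt[nzchar(txt)]
  wpl <- stri_count_words(txt)                 # words per line
  data.frame(size_mb = file.size(path) / 1024^2,
             lines   = length(txt),
             words   = sum(wpl),
             max_wpl = max(wpl),
             min_wpl = min(wpl),
             avg_wpl = mean(wpl))
}

files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
do.call(rbind, lapply(files, summarise_file))
```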
N-grams will be used as the basis of next-word prediction for this project. We used the R package quanteda to tokenize the text documents and examine the frequencies of single words, word pairs, and word triplets (unigrams, bigrams, and trigrams). Symbols, numbers, punctuation, and profanity were removed first. The graphs below show the 20 most common of each.
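A sketch of this tokenization step is below. The function names follow the current quanteda API rather than the 0.9.x version cited at the end of this report, `clean_text` is assumed to be the cleaned character vector from the step above, and `profanity_list` stands in for an external profanity word list.

```r
library(quanteda)

# Tokenize and remove symbols, numbers, punctuation, and profanity
toks <- tokens(corpus(clean_text),
               remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, pattern = profanity_list)   # assumed external word list

uni_dfm <- dfm(toks)                           # single words
bi_dfm  <- dfm(tokens_ngrams(toks, n = 2))     # word pairs
tri_dfm <- dfm(tokens_ngrams(toks, n = 3))     # word triplets

topfeatures(uni_dfm, 20)   # 20 most common unigrams; likewise for bi_dfm, tri_dfm
```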
An interesting question is how many words are needed to cover 50%, or even 90%, of the language used. If we assume this corpus represents the body of online language, the number is smaller than one might imagine. The graph below shows the cumulative coverage as each word is added, in order of decreasing frequency.
Surprisingly, for the entire corpus only 235 words are needed to cover 50% of the language, and just under 10,000 to cover 90%. These counts may actually be overestimates, because misspellings, non-words, and non-English words have not yet been removed from the corpus and inflate the denominator in these calculations. Removing these extraneous words will be an important next step before building a prediction model.
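One way these coverage counts can be computed from the unigram frequencies (reusing the `uni_dfm` object from the tokenization sketch above):

```r
# Word counts sorted from most to least frequent
word_freq <- sort(colSums(uni_dfm), decreasing = TRUE)
cum_share <- cumsum(word_freq) / sum(word_freq)

which(cum_share >= 0.5)[1]   # words needed for 50% coverage (235 in this corpus)
which(cum_share >= 0.9)[1]   # words needed for 90% coverage (just under 10,000)
```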
Many n-grams of all sizes occur only once, even in files as large as these, and the cumulative percentages above suggest that not all of them are needed for robust prediction. The table below shows the proportion of n-grams that appear only once versus those that appear multiple times. Again, some of these single occurrences may be due to misspellings, non-words, and non-English words. Naturally, the proportion of single occurrences increases with n-gram size; this could be an issue for the 5-grams we hope to use.
| N-gram Type | Occurrence Type | Count |
|---|---|---|
| All words | Multiple occurrences | 327,229 |
| All words | Single occurrence | 495,679 |
| Bigrams | Multiple occurrences | 10,052,278 |
| Bigrams | Single occurrence | 4,193,555 |
| Trigrams | Multiple occurrences | 7,485,488 |
| Trigrams | Single occurrence | 37,457,208 |
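The split between single and multiple occurrences can be read directly from the document-feature matrices; for example, for trigrams (a sketch reusing `tri_dfm` from the tokenization sketch above):

```r
tri_counts <- colSums(tri_dfm)   # total count of each distinct trigram
sum(tri_counts == 1)             # trigrams seen exactly once
sum(tri_counts > 1)              # trigrams seen more than once
```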
Coursera / Johns Hopkins University Data Science Specialization. https://www.coursera.org/specializations/jhu-data-science
Benoit, Kenneth, et al. "quanteda: Quantitative Analysis of Textual Data." R package version 0.9.9.67. http://quanteda.io.