The objective of the Johns Hopkins University Data Science capstone project is to create a “next word” predictor, similar to the way a mobile phone keyboard or Google search suggests the next word as you type.
The Next Word Predictor is a web-based application through which a user can interact with the prediction model.
A structured approach is taken to build the Next Word Predictor.
To build the prediction model, extracts of three English text sources are used to create the words and phrases on which the model is based (a sampling sketch follows the list):

- news feeds
- blogs
- Twitter conversations
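
As a minimal sketch of the sampling step, the following Python reads each source and keeps roughly 10% of its lines. The file names and the `sample_lines` helper are illustrative assumptions, not the project's actual code (the capstone itself is typically built in R).

```python
import random

# Illustrative file names; any plain-text files with one document per line work.
SOURCES = ["en_US.news.txt", "en_US.blogs.txt", "en_US.twitter.txt"]
SAMPLE_RATE = 0.10  # keep roughly 10% of the lines, matching the table below

def sample_lines(path, rate, seed=42):
    """Return a reproducible ~`rate` sample of the lines in `path`."""
    rng = random.Random(seed)
    with open(path, encoding="utf-8", errors="ignore") as f:
        return [line.strip() for line in f if rng.random() < rate]

corpus = []
for path in SOURCES:
    corpus.extend(sample_lines(path, SAMPLE_RATE))
```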
The table below shows a brief analysis of the text extracts:
| Source | Size (MB) | Total lines | Sample % | Sampled lines | Tokens | Unique tokens | Tokens (cleaned) | Unique (cleaned) |
|---|---|---|---|---|---|---|---|---|
| News | 196.3 | 1,010,242 | 10% | 101,024 | 3,657,373 | 67,302 | 1,971,816 | 58,251 |
| Blogs | 200.4 | 899,288 | 10% | 89,929 | 3,976,838 | 69,431 | 1,941,122 | 57,704 |
| Twitter | 159.4 | 2,360,148 | 10% | 236,015 | 3,521,649 | 73,458 | 1,690,589 | 64,881 |
| TOTAL | 556.1 | 4,269,678 | 10% | 426,968 | 11,155,860 | 141,661 | 5,603,527 | 121,382 |
For the combined (“TOTAL”) corpus, a small number of tokens accounts for a large percentage of all occurrences, as seen in the table below:
Using text-analysis tools, the top 20 tokens (“words”) in each data source have been identified.
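
As a rough illustration of how such token counts can be produced, the sketch below uses a deliberately simple regex tokenizer; the `tokenize` helper is an assumption, not the project's actual text-analysis tooling, and `corpus` is the sampled line list from the earlier sketch.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case and split into runs of letters/apostrophes; intentionally simple."""
    return re.findall(r"[a-z']+", text.lower())

token_counts = Counter()
for line in corpus:  # `corpus` comes from the sampling sketch above
    token_counts.update(tokenize(line))

for token, n in token_counts.most_common(20):  # the top 20 tokens, per the text
    print(f"{token:>12}  {n:,}")
```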
N-grams are sequences of tokens (words) that occur within the text extracts. 2-gram and 3-gram sequences have been generated; the resulting top sequences (words separated by a “/”) are displayed in the table below.
The corpus of all the text sources is so large that we subset it, keeping only words that occur at least 10 times. The following figure shows the top 20 2-grams and 3-grams:
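
A sketch of the n-gram step, reusing `tokenize`, `corpus`, and `token_counts` from the sketches above, might look like this. The word-frequency cutoff of 10 comes from the text; everything else is an illustrative assumption.

```python
from collections import Counter

MIN_WORD_COUNT = 10  # keep only words occurring at least 10 times, as stated above
vocab = {w for w, c in token_counts.items() if c >= MIN_WORD_COUNT}

def ngrams(tokens, n):
    """Yield every length-n sliding window over a token list."""
    return zip(*(tokens[i:] for i in range(n)))

bigram_counts, trigram_counts = Counter(), Counter()
for line in corpus:
    # Dropping out-of-vocabulary words is a simplification: it can pair
    # words that were not actually adjacent in the original text.
    toks = [t for t in tokenize(line) if t in vocab]
    bigram_counts.update(ngrams(toks, 2))
    trigram_counts.update(ngrams(toks, 3))

print(bigram_counts.most_common(20))   # top 20 2-grams
print(trigram_counts.most_common(20))  # top 20 3-grams
```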
The current plan is to use a Markov model as the prediction engine. Using the probabilities associated with the n-grams, the prediction algorithm will suggest the top 3 “next words” (see figure below).
Illustrative Markov Model
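
As a minimal sketch of the idea (not the app's actual implementation), a first-order Markov predictor can be built directly from the bigram counts above; a fuller version would back off from 3-grams to 2-grams when the longer context is unseen.

```python
from collections import Counter, defaultdict

# First-order Markov table: previous word -> Counter of possible next words.
transitions = defaultdict(Counter)
for (w1, w2), count in bigram_counts.items():  # from the n-gram sketch above
    transitions[w1][w2] += count

def predict_next(word, k=3):
    """Return the k most probable next words after `word`, with probabilities."""
    followers = transitions.get(word.lower())
    if not followers:
        return []
    total = sum(followers.values())
    return [(w, c / total) for w, c in followers.most_common(k)]

print(predict_next("thank"))  # top 3 suggested next words after "thank"
```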