Brendan Powers
March 12th, 2018
Capstone project: Data Science Specialization
The purpose of the application is to predict the next word that the user wants.
1.The user types a phrase which is processed into lowercase tokens.
2. Every time the input in the text box changes. The tokens are sent to the stupid backoff function.
3. The stupid backoff function returns the predicted values based on the score.
4. These are displayed in a menu. The user can select a word from the menu to add to the text.
A pseudo-randomly selected training set (60% of corpus) is used to to calculate n-gram counts. A score is calculated for each n-gram with more than two tokens.
\[ \begin{align} S(w_i | w^{i-1}_{k+1}) = \begin{cases}\frac{count(w^i_{i-k+1} )}{count(w^{i-1}_{i-k+1})} if count(w^i_{i-k+1}) > 0 \\ \lambda S(w_i | w^{i-1}_{i-k+2}) otherwise \end{cases} \end{align} \] \[ S(w_i) = \frac{count(w_i)}{N} \]
The last four tokens from the user phrase are compared for matches against the ngrams. The last term of the highest scoring ngrams are returned. If there are no matching ngrams, the 5 most frequently occurring unigrams are returned.
| Min.Occurances | RAM.Ngrams.Mb | Prediction.Time |
|---|---|---|
| 3 | 130.0 | 0.120 |
| 10 | 22.0 | 0.058 |
| 30 | 4.4 | 0.047 |
| 60 | 1.6 | 0.046 |
20% of Corpus reserved for testing.
22 % accuracy based on tests from 60000 lines selected from the testing dataset with 10 predictions.
To get started click the link above. After the app. loads, type part of a phrase. The highest scoring prediction is at the top of the menu.
2. The Coursera-Swiftkey Corpus. No citation information found
3. Brendan Powers. 2018. Char8Ball