Char8Ball: A text prediction application based on the stupid backoff method

Brendan Powers
March 12th, 2018

Capstone project: Data Science Specialization

Approach

The purpose of the application is to predict the next word the user intends to type.

1. The user types a phrase, which is processed into lowercase tokens.
2. Every time the input in the text box changes, the tokens are sent to the stupid backoff function.
3. The stupid backoff function returns predicted words ranked by their scores.
4. The predictions are displayed in a menu, and the user can select a word from the menu to add to the text (see the sketch after this list).
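
A minimal Shiny sketch of this flow is shown below. It is illustrative only: the helper names (tokenize, predict_backoff) and the UI controls are assumptions rather than the app's actual code, and the predictor is stubbed out so the sketch runs on its own.

```r
library(shiny)

# Step 1: lowercase tokenizer -- a simplified stand-in for the app's own version
tokenize <- function(phrase) {
  tokens <- unlist(strsplit(tolower(phrase), "[^a-z']+"))
  tokens[tokens != ""]
}

# Stub predictor so this sketch runs standalone; the real app scores candidates
# with stupid backoff (a fuller sketch appears later in this document)
predict_backoff <- function(tokens, ...) c("the", "to", "and", "a", "of")

ui <- fluidPage(
  textInput("phrase", "Type a phrase:"),
  selectInput("pick", "Predicted next word:", choices = character(0)),
  actionButton("add", "Add word")
)

server <- function(input, output, session) {
  # Steps 2-3: re-run the predictor whenever the text box changes
  observeEvent(input$phrase, {
    words <- predict_backoff(tokenize(input$phrase))
    updateSelectInput(session, "pick", choices = words)
  })
  # Step 4: append the selected word to the phrase
  observeEvent(input$add, {
    updateTextInput(session, "phrase",
                    value = trimws(paste(input$phrase, input$pick)))
  })
}

shinyApp(ui, server)
```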

A pseudo-randomly selected training set (60% of the corpus) is used to calculate n-gram counts. A score is calculated for each n-gram with more than two tokens.
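
The counting step might look like the sketch below, which reuses the tokenize() helper from the Shiny sketch above. The corpus_lines vector, the seed, and the helper names are illustrative assumptions, not the project's actual code.

```r
# Pseudo-random 60% training split; corpus_lines is assumed to be a character
# vector with one corpus line per element, and the seed is only illustrative
set.seed(1234)
train_idx   <- sample(length(corpus_lines),
                      size = floor(0.6 * length(corpus_lines)))
train_lines <- corpus_lines[train_idx]

# Count n-grams of a given order, line by line so windows never cross lines
count_ngrams <- function(lines, n) {
  grams <- unlist(lapply(lines, function(line) {
    tokens <- tokenize(line)
    if (length(tokens) < n) return(character(0))
    vapply(seq_len(length(tokens) - n + 1), function(i) {
      paste(tokens[i:(i + n - 1)], collapse = " ")
    }, character(1))
  }))
  tab <- table(grams)
  setNames(as.integer(tab), names(tab))   # named integer vector of counts
}

# One count vector per n-gram order, e.g. unigrams through 5-grams
ngram_counts <- lapply(1:5, function(n) count_ngrams(train_lines, n))
```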

\[ S(w_i \mid w^{i-1}_{i-k+1}) = \begin{cases} \dfrac{\operatorname{count}(w^{i}_{i-k+1})}{\operatorname{count}(w^{i-1}_{i-k+1})} & \text{if } \operatorname{count}(w^{i}_{i-k+1}) > 0 \\ \lambda \, S(w_i \mid w^{i-1}_{i-k+2}) & \text{otherwise} \end{cases} \]
\[ S(w_i) = \frac{\operatorname{count}(w_i)}{N} \]

The last four tokens of the user's phrase are matched against the n-grams. The last term of each of the highest-scoring n-grams is returned. If there are no matching n-grams, the 5 most frequently occurring unigrams are returned.
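
To make the scoring concrete, here is a hedged sketch of the lookup, assuming the ngram_counts list built in the counting sketch above. The app precomputes a score for each stored n-gram; this version computes scores at query time for brevity, and predict_backoff and its arguments are illustrative names rather than the project's actual code.

```r
predict_backoff <- function(tokens, ngram_counts, lambda = 0.4, top_n = 5) {
  max_order <- length(ngram_counts)
  history   <- tail(tokens, max_order - 1)    # e.g. the last four user tokens
  scores    <- numeric(0)

  # Walk from the longest available context down to a single word of context,
  # multiplying by lambda for every backoff step (stupid backoff)
  for (k in rev(seq_along(history))) {
    context <- paste(tail(history, k), collapse = " ")
    counts  <- ngram_counts[[k + 1]]                      # (k+1)-gram counts
    hits    <- counts[startsWith(names(counts), paste0(context, " "))]
    context_count <- ngram_counts[[k]][context]
    if (length(hits) == 0 || is.na(context_count)) next

    cand   <- sub(".* ", "", names(hits))                 # last word of each hit
    s      <- lambda^(length(history) - k) * hits / as.numeric(context_count)
    keep   <- !(cand %in% names(scores))                  # keep highest-order score
    scores <- c(scores, setNames(as.numeric(s[keep]), cand[keep]))
  }

  if (length(scores) == 0) {
    # No matching n-grams: fall back to the most frequent unigrams
    return(head(names(sort(ngram_counts[[1]], decreasing = TRUE)), top_n))
  }
  head(names(sort(scores, decreasing = TRUE)), top_n)
}
```

For example, predict_backoff(tokenize("thanks for the"), ngram_counts) would return up to five candidate next words, falling back to the top unigrams when nothing in the count tables matches.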

Results after validation

  • Varying \( \lambda \) has almost no impact on accuracy
  • Removing less frequently occurring terms will:
    • improve accuracy when fewer than 6 predictions are shown
    • reduce accuracy when more than 6 predictions are shown

| Min. occurrences | N-gram RAM (MB) | Prediction time |
|-----------------:|----------------:|----------------:|
|                3 |           130.0 |           0.120 |
|               10 |            22.0 |           0.058 |
|               30 |             4.4 |           0.047 |
|               60 |             1.6 |           0.046 |


Application

  • Selected: \( \lambda = 0.4 \) (the standard value)
  • Minimum n-gram count = 60
  • 20% of the corpus reserved for testing.

  • 22% accuracy on tests over 60,000 lines selected from the testing dataset, with 10 predictions shown (see the sketch after this list).
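
The reported accuracy could be reproduced with a check along these lines, reusing the tokenize() and predict_backoff() sketches above. The test_lines vector and the helper name are assumptions, and the 22% figure comes from the author's own tests, not from this sketch.

```r
# Top-k accuracy: for every word in each held-out line, predict it from the
# preceding tokens and count a hit if it appears among the top n_pred words
top_k_accuracy <- function(test_lines, ngram_counts, n_pred = 10) {
  hits  <- 0
  total <- 0
  for (line in test_lines) {
    tokens <- tokenize(line)
    if (length(tokens) < 2) next
    for (i in 2:length(tokens)) {
      preds <- predict_backoff(head(tokens, i - 1), ngram_counts, top_n = n_pred)
      hits  <- hits + (tokens[i] %in% preds)
      total <- total + 1
    }
  }
  hits / total
}

# Something like top_k_accuracy(sample(test_lines, 60000), ngram_counts, n_pred = 10)
# would mirror the 60,000-line, 10-prediction setup described above
```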

Check out the Shiny app: Char8Ball

To get started, click the link above. After the app loads, type part of a phrase. The highest-scoring prediction is at the top of the menu.

References