Char8Ball: A text prediction application based on the stupid backoff method

Brendan Powers
March 12th, 2018

Capstone project: Data Science Specialization

Approach

The purpose of the application is to predict the next word the user intends to type.

1. The user types a phrase, which is processed into lowercase tokens.
2. Every time the input in the text box changes, the tokens are sent to the stupid backoff function.
3. The stupid backoff function returns predicted words ranked by their scores.
4. The predictions are displayed in a menu, and the user can select a word from the menu to add to the text (see the sketch after this list).
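
A minimal Shiny sketch of this flow is shown below. It is illustrative only: the helper names (tokenize, predict_backoff) and the UI controls are assumptions rather than the app's actual code, and the predictor is stubbed out so the sketch runs on its own.

```r
library(shiny)

# Step 1: lowercase tokenizer -- a simplified stand-in for the app's own version
tokenize <- function(phrase) {
  tokens <- unlist(strsplit(tolower(phrase), "[^a-z']+"))
  tokens[tokens != ""]
}

# Stub predictor so this sketch runs standalone; the real app scores candidates
# with stupid backoff (a fuller sketch appears later in this document)
predict_backoff <- function(tokens, ...) c("the", "to", "and", "a", "of")

ui <- fluidPage(
  textInput("phrase", "Type a phrase:"),
  selectInput("pick", "Predicted next word:", choices = character(0)),
  actionButton("add", "Add word")
)

server <- function(input, output, session) {
  # Steps 2-3: re-run the predictor whenever the text box changes
  observeEvent(input$phrase, {
    words <- predict_backoff(tokenize(input$phrase))
    updateSelectInput(session, "pick", choices = words)
  })
  # Step 4: append the selected word to the phrase
  observeEvent(input$add, {
    updateTextInput(session, "phrase",
                    value = trimws(paste(input$phrase, input$pick)))
  })
}

shinyApp(ui, server)
```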

A pseudo-randomly selected training set (60% of the corpus) is used to calculate n-gram counts. A score is calculated for each n-gram with more than two tokens.
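
The counting step might look like the sketch below, which reuses the tokenize() helper from the Shiny sketch above. The corpus_lines vector, the seed, and the helper names are illustrative assumptions, not the project's actual code.

```r
# Pseudo-random 60% training split; corpus_lines is assumed to be a character
# vector with one corpus line per element, and the seed is only illustrative
set.seed(1234)
train_idx   <- sample(length(corpus_lines),
                      size = floor(0.6 * length(corpus_lines)))
train_lines <- corpus_lines[train_idx]

# Count n-grams of a given order, line by line so windows never cross lines
count_ngrams <- function(lines, n) {
  grams <- unlist(lapply(lines, function(line) {
    tokens <- tokenize(line)
    if (length(tokens) < n) return(character(0))
    vapply(seq_len(length(tokens) - n + 1), function(i) {
      paste(tokens[i:(i + n - 1)], collapse = " ")
    }, character(1))
  }))
  tab <- table(grams)
  setNames(as.integer(tab), names(tab))   # named integer vector of counts
}

# One count vector per n-gram order, e.g. unigrams through 5-grams
ngram_counts <- lapply(1:5, function(n) count_ngrams(train_lines, n))
```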

\[ S(w_i \mid w^{i-1}_{i-k+1}) = \begin{cases} \dfrac{\operatorname{count}(w^{i}_{i-k+1})}{\operatorname{count}(w^{i-1}_{i-k+1})} & \text{if } \operatorname{count}(w^{i}_{i-k+1}) > 0 \\ \lambda \, S(w_i \mid w^{i-1}_{i-k+2}) & \text{otherwise} \end{cases} \]
\[ S(w_i) = \frac{\operatorname{count}(w_i)}{N} \]

The last four tokens of the user's phrase are matched against the n-grams. The last term of each of the highest-scoring n-grams is returned. If there are no matching n-grams, the 5 most frequently occurring unigrams are returned.
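
To make the scoring concrete, here is a hedged sketch of the lookup, assuming the ngram_counts list built in the counting sketch above. The app precomputes a score for each stored n-gram; this version computes scores at query time for brevity, and predict_backoff and its arguments are illustrative names rather than the project's actual code.

```r
predict_backoff <- function(tokens, ngram_counts, lambda = 0.4, top_n = 5) {
  max_order <- length(ngram_counts)
  history   <- tail(tokens, max_order - 1)    # e.g. the last four user tokens
  scores    <- numeric(0)

  # Walk from the longest available context down to a single word of context,
  # multiplying by lambda for every backoff step (stupid backoff)
  for (k in rev(seq_along(history))) {
    context <- paste(tail(history, k), collapse = " ")
    counts  <- ngram_counts[[k + 1]]                      # (k+1)-gram counts
    hits    <- counts[startsWith(names(counts), paste0(context, " "))]
    context_count <- ngram_counts[[k]][context]
    if (length(hits) == 0 || is.na(context_count)) next

    cand   <- sub(".* ", "", names(hits))                 # last word of each hit
    s      <- lambda^(length(history) - k) * hits / as.numeric(context_count)
    keep   <- !(cand %in% names(scores))                  # keep highest-order score
    scores <- c(scores, setNames(as.numeric(s[keep]), cand[keep]))
  }

  if (length(scores) == 0) {
    # No matching n-grams: fall back to the most frequent unigrams
    return(head(names(sort(ngram_counts[[1]], decreasing = TRUE)), top_n))
  }
  head(names(sort(scores, decreasing = TRUE)), top_n)
}
```

For example, predict_backoff(tokenize("thanks for the"), ngram_counts) would return up to five candidate next words, falling back to the top unigrams when nothing in the count tables matches.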

Results after validation

  • Varying \( \lambda \) has almost no impact on accuracy
  • Removing less frequently occurring terms will:
    • improve accuracy when fewer than 6 predictions are shown
    • reduce accuracy when more than 6 predictions are shown

| Min. occurrences | N-gram RAM (MB) | Prediction time |
|-----------------:|----------------:|----------------:|
|                3 |           130.0 |           0.120 |
|               10 |            22.0 |           0.058 |
|               30 |             4.4 |           0.047 |
|               60 |             1.6 |           0.046 |


Application

  • Selected: \( \lambda = 0.4 \) (the standard value)
  • Minimum n-gram count = 60
  • 20% of the corpus reserved for testing.

  • 22% accuracy on tests over 60,000 lines selected from the testing dataset, with 10 predictions shown (see the sketch after this list).
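
The reported accuracy could be reproduced with a check along these lines, reusing the tokenize() and predict_backoff() sketches above. The test_lines vector and the helper name are assumptions, and the 22% figure comes from the author's own tests, not from this sketch.

```r
# Top-k accuracy: for every word in each held-out line, predict it from the
# preceding tokens and count a hit if it appears among the top n_pred words
top_k_accuracy <- function(test_lines, ngram_counts, n_pred = 10) {
  hits  <- 0
  total <- 0
  for (line in test_lines) {
    tokens <- tokenize(line)
    if (length(tokens) < 2) next
    for (i in 2:length(tokens)) {
      preds <- predict_backoff(head(tokens, i - 1), ngram_counts, top_n = n_pred)
      hits  <- hits + (tokens[i] %in% preds)
      total <- total + 1
    }
  }
  hits / total
}

# Something like top_k_accuracy(sample(test_lines, 60000), ngram_counts, n_pred = 10)
# would mirror the 60,000-line, 10-prediction setup described above
```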

Check out the Shiny app: Char8Ball

To get started, click the link above. After the app loads, type part of a phrase. The highest-scoring prediction is at the top of the menu.

References