12/28/2020

Goal

The goal of this project is to create an algorithm capable of guessing the next word the user wants to type, similar to applications already in use, such as the smart keyboards on smartphones.

Furthermore, the capabilities of this algorithm should be accessible via an easy-to-use Shiny application.

In short: I succeeded with a working algorithm, though further improvement is needed to achieve a higher accuracy.

Algorithm

The algorithm is an N-gram algorithm with Bayesian smoothing for words not found in the N-grams.

  • Let \(\vec{\aleph}_1\) to \(\vec{\aleph}_5\) be the unigram to 5-gram count vectors over the vocabulary, where each index corresponds to one unique word
  • Assume that each word can be given a weight \(W\) based on a \(\mathrm{Beta}\) distribution, defined by:

\[ \begin{align*} \alpha_i & = 10^{\,i-1} \\ \mathit{shape}_1 & = 1 + \sum^5_{i = 1} \vec{\aleph}_i \times \alpha_i \\ \mathit{shape}_2 & = 1 + \sum \vec{\aleph}_1 - \mathit{shape}_1 \\ W & = \operatorname{qbeta}(0.05,\ \mathit{shape}_1,\ \mathit{shape}_2) \end{align*} \]

The words with the highest \(W_i\) are chosen as viable options; a minimal sketch of this scoring step follows.
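To make the scoring step concrete, here is a minimal R sketch of the weight computation under the definitions above. The candidate words, their counts and the vocabulary total are made-up illustrative values, not the project's actual tables:

```r
# Minimal sketch of the scoring step, assuming pre-computed count tables.
# All counts and the vocabulary total are illustrative, not the real data.
counts <- data.frame(
  word = c("you", "doing", "the"),
  n1   = c(376553, 120000, 900000),  # unigram counts
  n2   = c(150000,   3000,     10),  # bigram counts given the context
  n3   = c( 40000,    800,      0),  # trigram counts given the context
  n4   = c(     0,      0,      0),  # quadgram counts given the context
  n5   = c(     0,      0,      0)   # pentagram counts given the context
)

alpha          <- 10^(0:4)     # alpha_i = 10^(i - 1)
total_unigrams <- 37388773     # sum over the whole unigram table (assumed)

shape1   <- 1 + as.vector(as.matrix(counts[, paste0("n", 1:5)]) %*% alpha)
shape2   <- 1 + total_unigrams - shape1
counts$W <- qbeta(0.05, shape1, shape2)  # lower 5% quantile as the weight

# Words ranked by weight, highest first
counts[order(-counts$W), c("word", "W")]
```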

Visualization

Below, the weight distributions \(W\) of the word "you" are visualized for the context "hello what are":

We start with a flat \(W \sim \mathrm{Beta}(1, 1)\) that is refined to \(\mathrm{Beta}(376{,}554,\ 37{,}012{,}220)\) by the vocabulary. The appearance of the word in the bigram and the trigram increases \(W\) further, resulting in the distribution with the highest final weight in the vocabulary, \(\operatorname{qbeta}(0.05,\ 15{,}177{,}454,\ 22{,}211{,}320) = 0.41\).
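These stages can be reproduced directly in R; the shape parameters below are the ones reported above, only the table layout is added for illustration:

```r
# The three stages of W for the word "you" given the context "hello what are",
# using the shape parameters reported above.
stages <- data.frame(
  stage  = c("flat prior", "after vocabulary", "after bigram + trigram"),
  shape1 = c(1,    376554, 15177454),
  shape2 = c(1,  37012220, 22211320)
)

# W is the lower 5% quantile of each Beta distribution; the final stage
# yields roughly 0.41, as stated above.
stages$W <- qbeta(0.05, stages$shape1, stages$shape2)
stages
```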

The word does not appear in the quadgram or the pentagram given the previous words, so its weight does not change further in those steps.

Performance

  • Memory needed: < 60 MB¹
  • RAM needed: < 350 MB
  • Accuracy: 35–51 %²
  • Fairly efficient algorithm: < 0.619 s per guess³

Conclusion: Given the limited performance, I would rather recommend an up-to-date neural-network-based approach (LSTM or attention-based, such as GPT-2). However, I had no time to pursue this technique, and much greater computational resources are needed to train such a model.

¹ Could be increased by 235 MB by including the 5-grams and the unfiltered 4-grams, which would improve the accuracy slightly.

² Measured on 100 example sentences, sliced with a moving window of at most 3 words as context. A guess counts as a match when the target word appears among the TOP-10 chosen words. Accuracy could be raised considerably by training on user input and by experimenting with better \(\alpha_i\) values; a sketch of this measurement follows the footnotes.

³ The delay per query is slightly higher in the Shiny app due to the limited resources of shinyapps.io.
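As an illustration of the accuracy measurement described in footnote 2, here is a sketch of the evaluation loop. predict_top10() is a hypothetical stand-in for the algorithm's prediction function, not a function from the project:

```r
# Sketch of the accuracy measurement: slide a window of at most 3 context
# words over each sentence and count a hit when the target word is among
# the top 10 guesses. predict_top10() is a hypothetical stand-in.
evaluate <- function(sentences, predict_top10, max_context = 3) {
  hits  <- 0
  total <- 0
  for (s in sentences) {
    words <- strsplit(tolower(s), "\\s+")[[1]]
    for (i in seq_along(words)[-1]) {
      context <- words[max(1, i - max_context):(i - 1)]
      guesses <- predict_top10(paste(context, collapse = " "))
      hits    <- hits + (words[i] %in% guesses)
      total   <- total + 1
    }
  }
  hits / total
}
```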

Shiny application

A Shiny application has been developed to test the performance of this algorithm. It is a simple input box that automatically displays the probabilities of the next words, given the vocabulary, as soon as the text is typed in.

The TOP-100 results are shown in a neatly formatted table, making it possible to dig down to words with a much lower probability and to debug the model.
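The structure of the app can be sketched roughly as follows; this is not the deployed code, and predict_weights() is a hypothetical stand-in for the scoring described above:

```r
library(shiny)

# Minimal structural sketch of the app: a text input plus a table of the
# top candidates. predict_weights() is assumed to return a data frame of
# candidate words and their weights W, ordered by weight.
ui <- fluidPage(
  textInput("text", "Type your sentence:", width = "100%"),
  tableOutput("candidates")
)

server <- function(input, output) {
  output$candidates <- renderTable({
    req(nzchar(input$text))
    head(predict_weights(input$text), 100)  # show the top 100 candidates
  })
}

shinyApp(ui, server)
```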

The application can be tested HERE.

Thank you for reviewing my project!