Matt Cliff
16 March 2018
Capstone Project
Coursera Data Science Specialization by Johns Hopkins University
QK-Newt helps users who are typing sentences or other natural-language text by offering the most likely next word with single-key auto-completion.
Increasing the number of words a user can type accurately in natural-language interactions such as email, chat, or social media makes those interactions more efficient.
In our application, once the user completes a word and enters a space, the tool presents the most likely next words. The user can then select a button to auto-populate one of the top three words, view the top 10 scores, or view a word cloud.
View initial proposal in the Milestone Report
The tool is trained on text from three sources: news feeds, Twitter, and blogs (download). From this text it identifies the most likely next word in a similar context. This body of text (the corpus) is cleaned, and frequency tables of the n-grams \( {}^1 \) are generated and scored.
The Stupid Back-Off\( {}^2 \) algorithm matches the last \( n-1 \) tokens of the user input against the prefixes in the n-gram table. Scores are aggregated as we back off to shorter lengths \( n \); at each back-off step a factor \( \alpha \) is applied, which reduces the impact of shorter matches.
The n-grams are generated from the sample corpus by removing non-US characters, punctuation, and numbers, tokenizing, and then filtering out profane and non-dictionary words. Only n-grams with a frequency of 2 or more are kept (for \( n>2 \)).
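The cleaning and counting steps above can be sketched as follows. This is a minimal illustration, not the project's code: the names `tokenize` and `ngram_counts`, the tiny `PROFANE` and `DICTIONARY` sets, and the reading of "non US characters" as non-ASCII are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical word lists; the source does not say which lists it uses.
PROFANE = {"darn"}
DICTIONARY = {"the", "cat", "sat", "on", "a", "mat"}


def tokenize(line):
    """Lowercase a line, drop non-ASCII characters, punctuation and digits, then filter words."""
    # Assumption: "non US characters" is read here as non-ASCII characters.
    ascii_only = line.encode("ascii", "ignore").decode("ascii")
    letters_only = re.sub(r"[^a-z\s]", " ", ascii_only.lower())
    return [t for t in letters_only.split() if t in DICTIONARY and t not in PROFANE]


def ngram_counts(lines, n):
    """Count n-grams line by line; for n > 2 keep only those seen at least twice."""
    counts = Counter()
    for line in lines:
        tokens = tokenize(line)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    min_count = 2 if n > 2 else 1
    return {gram: c for gram, c in counts.items() if c >= min_count}


sample = ["The cat sat on the mat.", "The cat sat on a mat!", "the cat 42 darn sat"]
trigrams = ngram_counts(sample, 3)
unigrams = ngram_counts(sample, 1)
```

In this sketch a filtered word is simply dropped, so an n-gram can span the gap it leaves; whether the original pipeline does the same is not stated in the source.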
\[ S\left(w_n \middle\vert w_1^{n-1} \right) = \begin{cases} \frac{\text{Count}(w_{1}^n)}{\text{Count}(w_{1}^{n-1})}, & \text{if } \text{Count}(w_{1}^n) > 0\\ \alpha\, S(w_n \vert w_{2}^{n-1}), & \text{otherwise} \end{cases} \]
Where \( S(w_1) = \text{Count}(w_1)/\text{# Unique Words} \) is the terminating case.
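The recursion can be written out directly. The sketch below assumes a single hypothetical dictionary `counts` that maps token tuples to frequencies; the function name `stupid_backoff` and the default \( \alpha = 0.4 \) (the value commonly quoted for Stupid Back-Off) are illustrative choices, not values taken from the application.

```python
def stupid_backoff(word, context, counts, alpha=0.4):
    """Score `word` given `context` (a tuple of preceding tokens) by recursive back-off."""
    if not context:
        # Terminating case as stated in the text: Count(w) / number of unique words.
        unique_words = sum(1 for gram in counts if len(gram) == 1)
        return counts.get((word,), 0) / unique_words
    gram = context + (word,)
    if counts.get(gram, 0) > 0:
        # The full n-gram was observed: relative frequency given its prefix.
        return counts[gram] / counts[context]
    # Otherwise drop the oldest context token and discount by alpha.
    return alpha * stupid_backoff(word, context[1:], counts, alpha)


# Toy counts, made up for the example.
counts = {
    ("the",): 4, ("cat",): 3, ("sat",): 3, ("on",): 2, ("mat",): 2, ("a",): 1,
    ("the", "cat"): 3, ("cat", "sat"): 3, ("the", "mat"): 1,
    ("the", "cat", "sat"): 3,
}
seen = stupid_backoff("sat", ("the", "cat"), counts)        # trigram observed
one_step = stupid_backoff("mat", ("on", "the"), counts)     # backs off to the bigram
two_steps = stupid_backoff("on", ("the", "cat"), counts)    # backs off to the unigram
```

With these toy counts, `seen` is 3/3, `one_step` is \( \alpha \cdot 1/4 \), and `two_steps` is \( \alpha^2 \cdot 2/6 \), which shows how each back-off step shrinks the score.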
The scores are pre-calculated for every n-gram, and the tables are stored with the prefix and the tail in separate columns. When invoked, the tool scores the most likely candidates and returns the results.
Enter text in the Text Input field; the tool triggers the prediction algorithm once a space has been entered after a word.
The slider control adjusts the back-off weight, \( \alpha \). The table tab shows the 10 highest-scoring words, along with each word's score and the n-gram table it came from. The Word Cloud feature lets the user see a much larger set of likely next words.
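One way to picture the lookup against pre-scored prefix/tail tables is sketched below. The layout of `tables` (order \( n \) to prefix to tail to score), the function name `predict`, and the rule that a word keeps the score from its longest match are assumptions for illustration; the source does not spell out how scores from different orders are combined.

```python
def predict(text, tables, alpha=0.4, k=3):
    """Return up to k (word, score, n) candidates, longest matching prefix first."""
    tokens = text.lower().split()
    # Highest order usable with the tokens typed so far.
    start_n = min(max(tables), len(tokens) + 1)
    best = {}
    weight = 1.0
    for n in range(start_n, 0, -1):
        prefix = tuple(tokens[len(tokens) - (n - 1):]) if n > 1 else ()
        for tail, score in tables.get(n, {}).get(prefix, {}).items():
            # Keep the score from the longest match; shorter matches only add new words.
            best.setdefault(tail, (weight * score, n))
        weight *= alpha  # discount each further back-off step
    ranked = sorted(best.items(), key=lambda item: item[1][0], reverse=True)
    return [(word, score, n) for word, (score, n) in ranked[:k]]


# Toy pre-scored tables, made up for the example.
tables = {
    3: {("the", "cat"): {"sat": 1.0}},
    2: {("cat",): {"sat": 0.75, "ran": 0.25}},
    1: {(): {"the": 0.5, "cat": 0.3}},
}
top3 = predict("the cat ", tables)
short = predict("cat ", tables)
```

Each result carries the order \( n \) of the table it came from, and raising `k` to 10 gives the kind of ranked list a top-10 view would need.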
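The trigger condition can be sketched as a small check on the input string. The function name `context_if_ready` and the default maximum order of 3 are hypothetical; the application's actual event handling is not described in the source.

```python
def context_if_ready(text, max_n=3):
    """Return the last max_n - 1 tokens when text ends with a word plus one space, else None.

    Assumes max_n >= 2.
    """
    if len(text) < 2 or not text.endswith(" ") or text[-2].isspace():
        return None  # still mid-word, empty, or only whitespace was added
    tokens = text.lower().split()
    return tuple(tokens[-(max_n - 1):])


mid_word = context_if_ready("the cat")
ready = context_if_ready("I saw the cat ")
```

Here `mid_word` is `None` because no space has been typed yet, while `ready` holds the last two tokens that a trigram lookup would use as its prefix.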