Text Editor

A.K. Patel
11/20/2017

Text Editor -- The Problem

Typing on smartphones is cumbersome because of the small screen, which leads to mistyped words or shorthand that is not universally understood. Ideally, a computer would anticipate and propose the best word given the context. However, language has context and grammar that are not easily quantified, and modern usage can play havoc with language rules.

We would like to develop a text editor that meets these requirements:

The app should recognize user typing in English
The app should have a model that dynamically updates as the user is typing
The model should predict the most likely next word from the previously typed word(s) and/or a partially typed word
The app should present the four most likely words selected by the model as buttons that the user can click
After clicking a button, the focus should automatically go back to the text box so that the user can continue to type without interruption

Text Editor -- General Outline of Model

Our approach to building the model was to draw on a diverse dataset that captures the many usages of modern language.

We used text from Twitter, blogs, and news articles. Given the massive size of the data, we randomly selected 20% to build our train, holdout, and test datasets.
We applied filters to each source to remove inappropriate, incorrect, and low-value word occurrences (e.g. foreign text).
We use the Markov assumption to estimate the likelihood of a sentence by conditioning each word on the previous word (bi-gram) or the previous two words (tri-gram). In conjunction with the likelihood of an individual word (uni-gram), we use the three n-grams to build our predictive model.
We calculated the frequency of each word or word combination and stored the counts in data.tables, using keys for faster access.
Our predictive n-gram model uses simple linear interpolation, which assigns a weight (lambda) to each n-gram order; the weights are chosen to maximize predictive accuracy on our holdout dataset.
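
The counting and interpolation steps above can be sketched in a few lines. This is a minimal illustration, not the app's actual code: it uses Python dictionaries keyed by word tuples as a stand-in for keyed data.tables, the function names are hypothetical, and the lambda values are placeholders rather than the tuned weights.

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens, keyed by a tuple of words."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def interpolated_scores(prev2, prev1, uni, bi, tri, lambdas):
    """Score each known word w as
    l1 * P(w) + l2 * P(w | prev1) + l3 * P(w | prev2, prev1)."""
    l1, l2, l3 = lambdas
    total = sum(uni.values())
    scores = {}
    for (w,), c in uni.items():
        p1 = c / total
        # Counter returns 0 for unseen keys, so unseen contexts score 0.
        p2 = bi[(prev1, w)] / uni[(prev1,)] if uni[(prev1,)] else 0.0
        p3 = tri[(prev2, prev1, w)] / bi[(prev2, prev1)] if bi[(prev2, prev1)] else 0.0
        scores[w] = l1 * p1 + l2 * p2 + l3 * p3
    return scores

def top_words(scores, k=4):
    """Return the k best-scoring words; ties are broken alphabetically."""
    return [w for w, _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]

# Toy corpus; the real model is trained on the sampled text.
tokens = "i like tea . i like coffee . i like tea".split()
uni, bi, tri = (count_ngrams(tokens, n) for n in (1, 2, 3))

# Placeholder weights (uni, bi, tri) that sum to 1.
scores = interpolated_scores("i", "like", uni, bi, tri, lambdas=(0.1, 0.3, 0.6))
suggestions = top_words(scores)
```

With this toy corpus, "tea" ranks first after "i like" because it dominates both the bi-gram and tri-gram counts. The uni-gram term still lets words never seen in that context appear lower in the list.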

Text Editor -- Model Tweaks & Performance

Various tweaks to the model were necessary to fit the constraints of the processing environment.

The training n-gram dataset needed to be reduced because of excessive memory use. We found the largest benefit in the uni-grams, where approximately 25% of the words account for 98% of word occurrences. For bi-grams we used a 95% occurrence threshold, and for tri-grams 90%. As you would expect, each higher-order n-gram occurs only a few times, but reduction here was necessary because they take the most memory.
Weighting the n-grams with lambdas through simple linear interpolation improved our overall results. We optimized model performance by tuning against our holdout dataset, and settled on a lambda weighting of 65% tri-gram, 35% bi-gram, and 5% uni-gram for our final word guess.
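
The occurrence-threshold reduction described above can be illustrated as follows. This is a sketch under assumptions: the function name `prune_by_coverage` is hypothetical, the counts are made up, and the rule shown (keep the most frequent entries until the threshold share of all occurrences is covered) is one plausible reading of the thresholds in the text.

```python
from collections import Counter

def prune_by_coverage(counts, coverage):
    """Keep the most frequent entries until they account for at least
    `coverage` (a fraction between 0 and 1) of all occurrences."""
    total = sum(counts.values())
    kept = Counter()
    running = 0
    for item, c in counts.most_common():
        if running >= coverage * total:
            break  # threshold reached; drop the remaining rare entries
        kept[item] = c
        running += c
    return kept

# Made-up counts: three common words and two rare ones.
counts = Counter({"the": 50, "of": 30, "and": 15, "zyx": 3, "qwv": 2})
kept = prune_by_coverage(counts, 0.95)
```

Here the three common words cover 95 of 100 occurrences, so the two rare entries are dropped. The same function could be applied to bi-gram and tri-gram counts with their own thresholds.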

We tested our model by using the test dataset as a simulation of words the user would have typed. Feeding this text into our model sequentially, we calculated the percentage of times the next word was correctly guessed.

When the model had only the previous word(s) to go by, it predicted the next word correctly 26% of the time.
When it had the previous word(s) and the first letter of the next word, it was right 53% of the time.
When it had the previous word(s) and the first two letters of the next word, accuracy rose to 73%.
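
The simulation above can be sketched as a loop over the test text that reveals the first few letters of each upcoming word. This is a rough illustration only: a plain bi-gram predictor stands in for the interpolated model, a hit is assumed to mean the single top guess matches, and all names are hypothetical.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Map each word to a Counter of the words that follow it."""
    following = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        following[prev][nxt] += 1
    return following

def predict(following, prev, prefix=""):
    """Most frequent follower of `prev` that starts with `prefix`, or None."""
    candidates = [(w, c) for w, c in following.get(prev, Counter()).items()
                  if w.startswith(prefix)]
    if not candidates:
        return None
    # Highest count first; ties are broken alphabetically.
    return min(candidates, key=lambda wc: (-wc[1], wc[0]))[0]

def accuracy(following, test_tokens, prefix_len):
    """Share of next words guessed correctly when the first `prefix_len`
    letters of the next word are revealed to the predictor."""
    pairs = list(zip(test_tokens, test_tokens[1:]))
    hits = sum(predict(following, prev, actual[:prefix_len]) == actual
               for prev, actual in pairs)
    return hits / len(pairs)

# Toy data; the real evaluation uses the held-back test dataset.
model = train_bigram("the cat sat on the mat the cat ran".split())
test_tokens = "the cat sat on the mat".split()
acc_no_prefix = accuracy(model, test_tokens, prefix_len=0)
acc_one_letter = accuracy(model, test_tokens, prefix_len=1)
```

Even on this toy data, revealing one letter resolves the ambiguous cases, which mirrors the direction of the improvement reported above.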

Text Editor -- The App

The app characteristics:

Memory – after optimization, our app's memory usage is 285 megabytes.
Load time – it takes approximately 7 seconds to initially load on a standard laptop.
Word recall – the time it takes for the model to guess is negligible.

Text Editor -- Usage and Features

After loading the program:

User clicks on the text box to begin typing.
The model is dynamic: it predicts as the user types, even when only part of a word has been typed.
The four buttons at the top are populated from the model's predictions.
Clicking a button updates the text box with the suggested word. If a partial word was entered in the text box, the partial entry is replaced by the clicked word.
The focus is automatically returned from the selected button back to the text box, so that the user can continue typing without having to select the textbox again.
The model knows to capitalize the first word and any word following a period, exclamation mark, or question mark.
The reset button allows the user to clear the text box and start over.
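
The button behaviour listed above (replacing a partial word and capitalizing at the start of a sentence) can be sketched as a pure text transformation. The function names are hypothetical, and appending a trailing space after the inserted word is an assumption, not something the description states.

```python
def starts_sentence(head):
    """True if the text before the new word is empty or ends a sentence."""
    stripped = head.rstrip()
    return stripped == "" or stripped[-1] in ".!?"

def apply_suggestion(text, word):
    """Return the text box content after the user clicks `word`.

    Any partial word at the end of `text` is replaced by `word`.
    """
    # Everything after the last space or newline is the partial word.
    cut = max(text.rfind(" "), text.rfind("\n")) + 1
    head = text[:cut]
    if starts_sentence(head):
        word = word[:1].upper() + word[1:]
    # Assumption: add a space so the user can keep typing the next word.
    return head + word + " "

examples = [
    apply_suggestion("I like co", "coffee"),  # partial word is replaced
    apply_suggestion("I like ", "tea"),       # nothing to replace
    apply_suggestion("Done. th", "the"),      # capitalized after a period
    apply_suggestion("", "hello"),            # capitalized as first word
]
```

Keeping this logic separate from the user interface would make it easy to check independently of the focus handling.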