Andrea Vallebueno
April 24, 2019
WordCast is a text prediction app that forecasts the three words most likely to follow a phrase entered by the user.
WordCast is user-friendly, efficient and accurate. The user simply types a phrase and clicks the “Predict” button, and three words appear in order of likelihood.
The app is built on an n-gram language model, which uses the frequencies of single words and of word combinations (called bigrams, trigrams, etc., according to the number of words they contain) to determine the most likely continuations of a phrase.
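As a rough illustration of the frequency tables such a model relies on, the sketch below counts unigrams, bigrams and trigrams in a toy sentence. The function name `ngram_counts` and the sample text are hypothetical and are not taken from the WordCast code.

```python
from collections import Counter

def ngram_counts(tokens, max_n=3):
    """Count every n-gram of order 1..max_n in a list of tokens."""
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of width n across the token list.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Toy corpus, for illustration only.
tokens = "i love new york and i love new music".split()
counts = ngram_counts(tokens)

print(counts[("i",)])                   # unigram frequency
print(counts[("i", "love")])            # bigram frequency
print(counts[("love", "new", "york")])  # trigram frequency
```

Keying every n-gram by a tuple of words keeps all orders in a single table, which makes it easy to look up a shorter context when a longer one has not been seen.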
The corpora used to train this model come from three types of text: blogs, Twitter posts and news articles.
The original text went through several pre-processing steps before it was used to train the final model, including lowercasing and contraction expansion.
Because a text predictor has to suggest words exactly as a user would type them, stopwords were kept and no stemming or lemmatization was performed.
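The two pre-processing steps named above can be sketched as follows. The contraction table is a deliberately tiny, hypothetical stand-in (a real one would be much larger), and the tokenization rule is an assumption rather than the app's actual pipeline.

```python
import re

# Hypothetical mini contraction table, for illustration only.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "don't": "do not",
    "i'm": "i am",
    "it's": "it is",
}

def preprocess(text):
    """Lowercase, expand contractions, and split into word tokens."""
    # Lowercase and normalize curly apostrophes to straight ones.
    text = text.lower().replace("\u2019", "'")
    for short, full in CONTRACTIONS.items():
        text = re.sub(r"\b" + re.escape(short) + r"\b", full, text)
    # Stopwords are kept and words are neither stemmed nor lemmatized.
    return re.findall(r"[a-z']+", text)

print(preprocess("I'm sure it's fine, don't worry"))
```

Note that words such as "i", "am" and "not" survive this step, since they are exactly the kind of word a user expects to see predicted.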
Both Stupid Back-Off and Kneser-Ney models were trained on varying fractions of the corpora, with different n-gram orders and pre-processing choices. The Stupid Back-Off model selected as the final model offered a better combination of accuracy, efficiency and speed, all of which are essential in an interactive app. This makes WordCast particularly user-friendly and allows it to perform well across devices, including mobile phones.
When evaluated on a test set drawn from the Twitter and blogs corpora, the final model produced the following results:
Overall top-1 precision: 10.74 %
Overall top-3 precision: 18.73 %
Total memory used: 268.65 MB
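The source does not spell out how these precision figures were computed. One common reading, assumed here, is that top-k precision is the share of test cases in which the true next word appears among the first k predictions. The sketch below uses made-up predictions and targets purely to show the arithmetic.

```python
def top_k_precision(predictions, targets, k):
    """Share of cases whose true next word is among the first k predictions."""
    hits = sum(1 for preds, target in zip(predictions, targets)
               if target in preds[:k])
    return hits / len(targets)

# Made-up ranked predictions and true next words, for illustration only.
predictions = [
    ["the", "a", "my"],
    ["you", "it", "that"],
    ["day", "time", "year"],
    ["to", "for", "and"],
]
targets = ["a", "me", "day", "of"]

print(top_k_precision(predictions, targets, k=1))  # 0.25
print(top_k_precision(predictions, targets, k=3))  # 0.5
```

Under this definition top-3 precision can never be lower than top-1 precision, which is consistent with the figures reported above.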
The app is highly user-friendly and performs well across devices, quickly returning three predictions for any phrase.
The interface was kept as clean and simple as possible to evoke the feel of a texting app on a phone.