An App to Predict Next Word

Andy Tai
13 Jul 2019

First Slide

This project is part of the Data Science track by Johns Hopkins University on Coursera. It is a capstone project done in partnership with SwiftKey. The objective is to use the large corpus of unstructured text curated by SwiftKey to develop a lightweight app that performs next-word prediction.

Second Slide

The app predicts the next word based on the previous two words, using n-grams and a stupid backoff algorithm. The following steps were taken in developing the app:

  • Datasets from three sources (Twitter, blogs, and news) were taken from SwiftKey.

  • A random sample of 10% was extracted from each of these data sources. The three samples were then combined and randomly shuffled, and about 100,000 rows from the aggregated dataset were extracted to form a corpus. Shuffling the combined data gives every row from each of the three sources an equal chance of being selected, which in turn can help improve the quality of word prediction.
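The sampling step above can be sketched roughly as follows. This is an illustrative Python sketch, not the app's own code: the function name `build_corpus`, the fixed seed, and the toy data standing in for the three SwiftKey files are all assumptions made for the example.

```python
import random

def build_corpus(sources, fraction=0.10, max_rows=100_000, seed=42):
    """Sample a fraction of rows from each source, pool, shuffle, and truncate."""
    rng = random.Random(seed)  # fixed seed (an assumption) so the sample is reproducible
    pooled = []
    for lines in sources.values():
        k = int(len(lines) * fraction)       # 10% of this source by default
        pooled.extend(rng.sample(lines, k))  # sample without replacement
    rng.shuffle(pooled)                      # mix the three sources together
    return pooled[:max_rows]                 # keep at most max_rows rows

# Toy stand-ins for the Twitter, blogs, and news files (hypothetical data).
sources = {
    "twitter": [f"tweet {i}" for i in range(1000)],
    "blogs": [f"blog {i}" for i in range(500)],
    "news": [f"news {i}" for i in range(200)],
}
corpus = build_corpus(sources, max_rows=150)
```

With the toy data, 170 rows are sampled in total (100 + 50 + 20) and the first 150 after shuffling are kept.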

Third Slide

  • The corpus was then pre-processed to remove noise that is not essential for word prediction. This noise includes punctuation, non-ASCII characters, elongated words, upper-case characters, and so on. Functions from the “textclean” and “qdapRegex” packages were used for the pre-processing, rather than hand-written regular expressions, because they can perform more surgical text processing and hence improve the overall quality of the processed corpus.

  • The cleaned corpus was then tokenised into unigrams, bigrams, and trigrams, and their respective word frequencies were counted.

  • Only n-grams up to trigrams are used in the word prediction process. Although higher-order n-grams may improve prediction accuracy, they would also lead to larger n-gram tables and slower performance.
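As a rough illustration of the cleaning and tokenising steps, here is a minimal Python sketch. It deliberately uses a few crude regular expressions as a stand-in for the “textclean” and “qdapRegex” functions the app relies on, so it is much less careful than the real pre-processing; the function names `clean` and `count_ngrams` are hypothetical.

```python
import re
from collections import Counter

def clean(text):
    """Crude stand-in for the cleaning step; returns a list of word tokens."""
    text = text.lower()                                    # remove upper case
    text = text.encode("ascii", "ignore").decode("ascii")  # drop non-ASCII characters
    text = re.sub(r"(.)\1{2,}", r"\1", text)               # collapse runs of 3+ repeats: "soooo" -> "so"
    text = re.sub(r"[^a-z\s]", " ", text)                  # replace punctuation and digits with spaces
    return text.split()

def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(zip(*(tokens[i:] for i in range(n))))

tokens = clean("I am SOOOO happy!!! I am here.")
# tokens == ['i', 'am', 'so', 'happy', 'i', 'am', 'here']

unigrams = count_ngrams(tokens, 1)
bigrams = count_ngrams(tokens, 2)    # bigrams[('i', 'am')] == 2
trigrams = count_ngrams(tokens, 3)
```

Each additional n-gram order adds another frequency table to store and search, which is the size and speed trade-off behind stopping at trigrams.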

Fourth Slide

  • The app uses a backoff algorithm to predict the next word based on n-gram frequencies, and applies a penalty each time it backs off to a lower-order n-gram.

  • To use the app, the user just needs to type a phrase into the text input field, and the predicted next word is displayed along with its calculated probability. There is therefore no need for a submit button.

  • At the UI level, a simple algorithm detects whether the user has entered only one word or several words in the text input field.

  • If the user has entered more than two words, another algorithm extracts only the last two words and feeds them into the backoff algorithm to predict the next word.
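The input handling described above might look something like the following Python sketch. The helper name `last_two_words` is hypothetical, and splitting on whitespace is an assumption about how words are separated.

```python
def last_two_words(text):
    """Return the last (up to) two words of the input as a tuple."""
    words = text.lower().split()  # split on whitespace
    return tuple(words[-2:])      # 0, 1, or 2 words depending on the input

last_two_words("Thank you very much")  # ('very', 'much')
last_two_words("Hello")                # ('hello',)
last_two_words("")                     # ()
```

The length of the returned tuple tells the caller whether a trigram lookup is possible (two words) or whether prediction has to start at the bigram level (one word).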

Fifth Slide

  • If the algorithm cannot find a suitable word among the observed trigrams, it backs off and picks the word with the highest bigram frequency instead.

  • This process is repeated until the unigram level is reached, where the word with the highest unigram frequency is selected. A penalty is applied to discourage single “stopwords” from being selected.
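The backoff steps above can be sketched in Python as follows. This is only a sketch under stated assumptions, not the app's actual implementation: the penalty `ALPHA = 0.4` is the value commonly used with stupid backoff, while the stopword list, `STOPWORD_PENALTY`, and the function names `build_counts` and `predict` are hypothetical, since the slides do not specify them.

```python
from collections import Counter

ALPHA = 0.4                   # backoff penalty (assumed value)
STOPWORDS = {"the", "on"}     # illustrative stopword list (assumption)
STOPWORD_PENALTY = 0.1        # hypothetical extra penalty at the unigram level

def build_counts(tokens):
    """Count unigrams, bigrams, and trigrams in a token list."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    return uni, bi, tri

def best(candidates):
    """Return the (word, score) pair with the highest score."""
    return max(candidates.items(), key=lambda kv: kv[1])

def predict(context, uni, bi, tri):
    """Predict the next word from the last (up to) two words of context."""
    context = tuple(context[-2:])
    # 1. Trigram level: words seen after both context words.
    if len(context) == 2 and bi[context] > 0:
        cands = {t[2]: c / bi[context] for t, c in tri.items() if t[:2] == context}
        if cands:
            return best(cands)
    # 2. Bigram level: back off to the last word only, with a penalty.
    if context and uni[context[-1]] > 0:
        last = context[-1]
        cands = {b[1]: ALPHA * c / uni[last] for b, c in bi.items() if b[0] == last}
        if cands:
            return best(cands)
    # 3. Unigram level: most frequent word, with stopwords penalised further.
    if not uni:
        return None
    total = sum(uni.values())
    cands = {
        w: ALPHA ** 2 * c / total * (STOPWORD_PENALTY if w in STOPWORDS else 1.0)
        for w, c in uni.items()
    }
    return best(cands)

tokens = "the cat sat on the mat the cat ran".split()
uni, bi, tri = build_counts(tokens)

predict(("on", "the"), uni, bi, tri)     # trigram hit: "mat"
predict(("dog", "the"), uni, bi, tri)    # backs off to bigrams: "cat"
predict(("blue", "dog"), uni, bi, tri)   # backs off to unigrams: "cat", not "the"
```

In the last call, "the" is the most frequent word in the toy corpus, but the stopword penalty lets "cat" win instead. The scores are relative rankings rather than true probabilities, because the penalised values do not sum to one.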