Coursera Data Science Capstone

Marc Reitz
June 1, 2019

Word Prediction Application


The proliferation of mobile devices has made technology more accessible and unlocked significant increases in workplace productivity. One particular challenge has been adapting the human-computer interface to the small screen. Features that assist with typing text are now ubiquitous on mobile devices. This capstone project implements a methodology common to many mobile apps today: predicting the next word in a sentence given the preceding words. A simple web-based interface collects a phrase from the user. Based on what was entered, the system provides a ranked list of recommended next words.

For example, a common English phrase is “The best of both worlds”, which describes a situation where one may enjoy the advantages of two very different things at the same time. When a user types “The best of both”, the application presents a ranked set of options for the next word.

The application may be accessed from this link

Data Sources and Preparation

This implementation uses a dataset provided by the corporate partner SwiftKey, consisting of over 600 MB of text from public blog posts, news articles and Twitter.

The raw data was cleaned in several stages. Non-English characters, numbers, punctuation, URLs and Twitter hashtags were removed.
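The cleaning stages can be illustrated with a minimal Python sketch. This is not the project's own code: the function name `clean_text`, the regular expressions, the lowercasing step and the choice to delete apostrophes rather than replace them with spaces are all assumptions made for illustration.

```python
import re

# Hypothetical patterns; the project's actual cleaning rules are not shown in the source.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
HASHTAG_RE = re.compile(r"#\w+")
NON_ASCII_RE = re.compile(r"[^\x00-\x7F]+")
NON_LETTER_RE = re.compile(r"[^a-z\s]+")
SPACE_RE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Strip URLs, hashtags, non-ASCII characters, numbers and punctuation."""
    text = text.lower()                                # assumption: case is folded
    text = URL_RE.sub(" ", text)                       # URLs
    text = HASHTAG_RE.sub(" ", text)                   # Twitter hashtags
    text = text.replace("'", "").replace("\u2019", "") # assumption: "it's" becomes "its"
    text = NON_ASCII_RE.sub(" ", text)                 # non-English characters
    text = NON_LETTER_RE.sub(" ", text)                # numbers and remaining punctuation
    return SPACE_RE.sub(" ", text).strip()             # collapse whitespace


if __name__ == "__main__":
    print(clean_text("Check https://example.com NOW!! #winning 24/7 it's great"))
```

URLs and hashtags are removed before punctuation so that their fragments do not survive as stray tokens.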

The dataset was loaded into a corpus and “tokenized” into n-grams, from 1-grams to 6-grams, using the quanteda package. A threshold was applied to remove low-frequency n-grams: a minimum count of 5 for unigrams and bigrams, and 2 for higher-order n-grams.
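The project performed this step in R with quanteda. The sketch below only illustrates the idea in plain Python, assuming whitespace tokenization and reading the thresholds as "keep n-grams whose count is at least the minimum"; the names `ngrams`, `count_ngrams` and `prune` are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams in a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(sentences, max_n=6):
    """Count 1-grams through max_n-grams over already-cleaned sentences."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for sentence in sentences:
        tokens = sentence.split()  # assumption: whitespace tokenization
        for n in range(1, max_n + 1):
            counts[n].update(ngrams(tokens, n))
    return counts


def prune(counts, min_low=5, min_high=2):
    """Drop rare n-grams: min_low for unigrams and bigrams, min_high above that."""
    pruned = {}
    for n, counter in counts.items():
        threshold = min_low if n <= 2 else min_high
        pruned[n] = Counter({g: k for g, k in counter.items() if k >= threshold})
    return pruned


if __name__ == "__main__":
    corpus = ["the best of both worlds"] * 5 + ["the best of times"]
    tables = prune(count_ngrams(corpus, max_n=4))
    print(tables[2])
```

Pruning is applied after counting the whole corpus, since an n-gram's total frequency is only known once every sentence has been seen.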

Based on preliminary testing, only 1-grams through 4-grams were used in the final application in order to improve load and execution time. This reduced the tokenized data needed to run the application to 150 MB.

Algorithm and Result

The application generates next-word recommendations using Katz's back-off model. The system searches the n-gram lists from highest order to lowest order for a match. Once it finds a match, it returns all matching n-grams at that level, sorted by maximum likelihood estimate. Lower-order n-grams are not displayed.

The user can control the number of results returned. The default is 10, and the figure may be raised to a maximum of 500.
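The lookup and result limit described above can be sketched as follows. This is a simplified illustration, not the application's code: it backs off from the highest-order table to the lowest and ranks candidates by relative frequency, but it omits the discounting step that a full Katz model applies. The function `predict_next`, the layout of `counts` (a dict mapping n to a Counter of n-gram tuples) and the final unigram fallback are assumptions.

```python
from collections import Counter


def predict_next(phrase, counts, max_n=4, limit=10):
    """Return up to `limit` (word, estimate) pairs from the highest-order match."""
    tokens = phrase.lower().split()
    for n in range(max_n, 1, -1):
        context = tuple(tokens[-(n - 1):])
        if len(context) < n - 1:
            continue  # phrase too short for this order
        matches = {g[-1]: k for g, k in counts.get(n, {}).items()
                   if g[:-1] == context}
        if matches:
            # The context count is approximated by the sum of its continuations.
            total = sum(matches.values())
            ranked = sorted(matches.items(), key=lambda kv: (-kv[1], kv[0]))
            return [(word, k / total) for word, k in ranked[:limit]]
    # Assumption: with no match at any order, fall back to the most frequent unigrams.
    unigrams = counts.get(1, {})
    total = sum(unigrams.values())
    ranked = sorted(unigrams.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(g[0], k / total) for g, k in ranked[:limit]]


if __name__ == "__main__":
    toy = {
        4: Counter({("best", "of", "both", "worlds"): 3,
                    ("best", "of", "both", "sides"): 1}),
        2: Counter({("both", "worlds"): 3, ("both", "sides"): 2}),
        1: Counter({("the",): 10, ("best",): 4}),
    }
    print(predict_next("The best of both", toy))
```

Because the function returns as soon as one order yields a match, lower-order candidates never appear in the result.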

The results are shown in a data table that may be sorted by any column. The table also lets the user search for a particular word to see its relative rank.

A Simple User Interface

The user enters a sentence fragment on the left, sets the desired number of results with the slider below it, and clicks “Submit”.
The system searches for potential matches and returns the results.
The user may then review or subset the results using the features on the right side of the screen.