Data Science Specialization Capstone Project

July / August 2015

SwiftXey!

Martin Hediger, PhD

Predicting text input.



Basic Functionality and Usage

As the user types, the application predicts the most likely next word. It is trained on 100,000 lines each of blog, Twitter and news text from the HC Corpus.
Message Input: Type your message here; the prediction updates immediately.
Basic Output: The predicted word.
Advanced Output: Detailed information about the predictions (counts from documents and likelihood).
About: Short description of the application.

Diagram: A graphical representation of the calculated relative likelihoods for the candidate expressions. It can be used to get a feeling for how reliable the prediction is.
Check out the application here

Data Preparation and Algorithm

Data Preparation
Using the tm package, term-document matrices are constructed for 2-grams through 5-grams.

In constructing the term-document matrices, stopwords are removed, all text is converted to lower case, whitespace is stripped and punctuation and numbers are discarded.
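
The cleaning steps above can be illustrated with a minimal sketch. The application itself relies on the R package tm; Python is used here only for a compact, language-neutral illustration. The stopword list, the function names `clean` and `ngram_counts`, and the rule of deleting every character outside a-z are assumptions made for this example, not the app's actual configuration.

```python
import re
from collections import Counter

# Hypothetical stopword list for illustration; not the list used by the app.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "is"}

def clean(text, stopwords=STOPWORDS):
    """Lower-case, drop punctuation and numbers, strip whitespace, remove stopwords."""
    text = text.lower()
    # Assumption: delete everything that is not a-z or whitespace.
    text = re.sub(r"[^a-z\s]", "", text)
    # split() without arguments also collapses repeated whitespace.
    return [tok for tok in text.split() if tok not in stopwords]

def ngram_counts(lines, n):
    """Count every n-gram (as a tuple of words) over the cleaned lines."""
    counts = Counter()
    for line in lines:
        tokens = clean(line)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

bigrams = ngram_counts(["I like green tea", "I like green apples"], 2)
```

In this toy example the bigram `("like", "green")` is counted twice, once per line, while `("green", "tea")` is counted once.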

For performance, the look-up tables are implemented using data.table.
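
One way to picture such a look-up table is a mapping from an n-gram prefix to its observed continuations. The sketch below uses a plain Python dictionary as a stand-in for a keyed data.table; the function name `build_lookup` and the toy counts are hypothetical and serve only to show the idea.

```python
from collections import defaultdict

def build_lookup(ngram_counts):
    """Map each prefix (all words but the last) to its continuations, most frequent first."""
    table = defaultdict(list)
    for ngram, count in ngram_counts.items():
        prefix, word = ngram[:-1], ngram[-1]
        table[prefix].append((word, count))
    for prefix in table:
        # Sort by descending count; ties are broken alphabetically.
        table[prefix].sort(key=lambda wc: (-wc[1], wc[0]))
    return dict(table)

# Toy bigram counts, invented for this example.
counts = {("green", "tea"): 3, ("green", "apples"): 5, ("black", "tea"): 2}
lookup = build_lookup(counts)
```

With the table keyed by prefix, finding the candidates for a given context is a single dictionary access, for example `lookup[("green",)]`, instead of a scan over all n-grams.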

Algorithm
Next word prediction is based on non-interpolated n-gram frequency counts and stupid backoff. If no match is found, the most likely unigram is returned.

Specifications and Algorithm Details

Specifications
Data acquisition: ~ 13 hours on a MacBook Pro (2.66 GHz i7, 4 GB RAM).

Overall accuracy: ~ 10 % (blogs, twitter and news combined)

Required Memory on Shinyapps: ~ 100 MB

Algorithm Details

Considerable development time went into implementing the back-off model, so it is briefly described here. First, the number of words in the user input is determined. For example, if 7 words are entered, the last 4 are matched against the 5-grams in the look-up table. If no match is found, the last 3 words are matched against the 4-gram look-up table. This procedure is repeated with progressively shorter contexts until a match is found; otherwise the top unigrams are returned.
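
The back-off procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the app's R code: the function name `predict_next`, the layout of `tables` (one prefix-keyed dictionary per n-gram order), and all toy data are hypothetical. The sketch returns the candidates of the highest order that matches and applies no discount factor when backing off.

```python
def predict_next(words, tables, top_unigrams, max_order=5):
    """Return (matched order, candidate words) using simple back-off.

    tables[n] maps an (n-1)-word prefix tuple to a list of
    (word, count) pairs, most frequent first.
    """
    # Start with the longest usable context: at most max_order - 1 words.
    for n in range(min(max_order, len(words) + 1), 1, -1):
        prefix = tuple(words[-(n - 1):])
        candidates = tables.get(n, {}).get(prefix)
        if candidates:
            return n, [word for word, _ in candidates]
    # No n-gram matched at any order: fall back to the top unigrams.
    return 1, list(top_unigrams)

# Toy tables, invented for this example.
tables = {
    3: {("drinking", "green"): [("tea", 4), ("juice", 1)]},
    2: {("green",): [("tea", 6), ("apples", 2)]},
}
top_unigrams = ["time", "people", "day"]

hit_3gram = predict_next(["really", "like", "drinking", "green"], tables, top_unigrams)
hit_2gram = predict_next(["dark", "green"], tables, top_unigrams)
fallback = predict_next(["purple"], tables, top_unigrams)
```

In the first call the 5-gram and 4-gram look-ups find nothing, so the match comes from the 3-gram table. The second call backs off to the 2-gram table, and the third falls through to the unigram list.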

Conclusions and Outlook

The prediction app returns a suggested next word for all inputs tested and is very responsive.

Upcoming features: