Data Science Specialization Capstone Project
July / August 2015
SwiftXey!
Martin Hediger, PhD
Predicting text input.
Message Input: Type your message here; the prediction happens immediately.
Basic Output: The predicted word.
Advanced Output: Detailed information about the predictions (counts from the documents and likelihoods).
About: Short description of the application.
Diagram: A graphical representation of the calculated relative likelihoods of the candidate expressions. It can be used to get a feeling for how reliable the prediction is.
Data Preparation
Using the tm package, term-document matrices are constructed for 2- to 5-grams.
In constructing the term-document matrices, stopwords are removed, all text is converted to lower case, whitespace is stripped, and punctuation and numbers are discarded.
For performance, the look-up tables are implemented using data.table.
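The preparation steps above can be sketched in a few lines. This is only an illustration in plain Python, not the tm / data.table pipeline used by the app: the stopword list, the helper names `clean` and `ngram_counts`, and the two example sentences are all assumptions made for the sketch. A dictionary keyed by n-gram stands in for the keyed look-up table.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption); real stopword lists are much longer.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}

def clean(text):
    """Lower-case, discard punctuation and numbers, strip whitespace, drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # anything that is not a letter becomes a space
    return [tok for tok in text.split() if tok not in STOPWORDS]

def ngram_counts(docs, n):
    """Count every n-gram of order n over all cleaned documents."""
    counts = Counter()
    for doc in docs:
        tokens = clean(doc)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Hypothetical two-document corpus, just to exercise the functions.
docs = ["The cat sat on the mat.", "The cat sat on 2 chairs!"]

# One look-up table per n-gram order, from 2-grams to 5-grams.
tables = {n: ngram_counts(docs, n) for n in range(2, 6)}
```

Each table maps an n-gram to its count, so a look-up for a given context is a single hash access. After cleaning, the example documents are only four tokens long, so the 5-gram table stays empty.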
Algorithm
Next-word prediction is based on non-interpolated n-gram frequency counts and stupid backoff.
If no n-gram of any order matches the input, the most likely unigram is returned.
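A minimal sketch of stupid backoff scoring, again in plain Python rather than the app's own code. The back-off factor `ALPHA = 0.4` is an assumption (the value commonly quoted for stupid backoff; the app's actual factor is not stated here), and the toy corpus, the maximum order of 3, and the function names are hypothetical. Scores are raw relative frequencies with no smoothing or interpolation; each step down to a shorter context multiplies the score by `ALPHA`.

```python
from collections import Counter

ALPHA = 0.4  # back-off penalty per step; assumed value, not taken from the app

def build_counts(tokens, max_n):
    """Count all n-grams of order 1..max_n in a token list."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def score(word, context, counts, total):
    """Stupid-backoff score of `word` following `context` (a tuple of tokens)."""
    penalty = 1.0
    while context:
        seen = counts[context + (word,)]
        if seen > 0:
            # Relative frequency at the longest order that has a match.
            return penalty * seen / counts[context]
        context = context[1:]  # drop the oldest word and back off
        penalty *= ALPHA
    # No higher-order match at all: fall back to the unigram frequency.
    return penalty * counts[(word,)] / total

def predict(text, counts, vocab, total, max_n):
    """Return the highest-scoring next word for the typed text (max_n >= 2)."""
    context = tuple(text.lower().split())[-(max_n - 1):]
    return max(vocab, key=lambda w: score(w, context, counts, total))

# Hypothetical toy corpus.
tokens = ("i like green tea i like green apples "
          "i like green tea you like black coffee").split()
MAX_N = 3
counts = build_counts(tokens, MAX_N)
vocab = sorted(set(tokens))
total = len(tokens)

best_seen = predict("i like green", counts, vocab, total, MAX_N)
best_unseen = predict("unknown words", counts, vocab, total, MAX_N)
```

When the context was never observed, every candidate backs off all the way to its unigram frequency with the same penalty, so the most frequent unigram wins, which matches the fallback described above.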
Specifications
Data acquisition: ~ 13 hours on a MacBook Pro (2.66 GHz i7, 4 GB RAM).
Overall accuracy: ~ 10 % (blogs, Twitter and news combined)
Required Memory on Shinyapps: ~ 100 MB
Algorithm Details
The prediction app returns a suggested next word for all inputs tested and is very responsive.
Upcoming features: