Andria Hall
April 24, 2016
This application is a capstone project offered by Johns Hopkins University and made available through Coursera. It is a predictive text model that determines the next word when given the preceding words of a phrase.
The application is based on SwiftKey's smart keyboard, which predicts text as the user types. SwiftKey is also a partner with Johns Hopkins University in this capstone project.
The data for this application are available in English, German, Finnish, and Russian, but only the English datasets (en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt) were used. The data are available at: corpora.heliohost.org
| Summary | Twitter | News | Blogs |
|---|---|---|---|
| Lines | 2360148 | 1010242 | 899288 |
| Size (MB) | 159.3641 | 196.2775 | 159.3641 |
Because of memory and processor limitations, a 10 percent sample of each dataset was used to create the textSample.
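The sampling step can be sketched as follows. This is a minimal Python illustration, not the author's R code; the function name `sample_lines`, the fixed seed, and the toy input are assumptions made for the example.

```python
import random

def sample_lines(lines, fraction=0.10, seed=42):
    """Return a random sample holding about `fraction` of the input lines."""
    rng = random.Random(seed)          # fixed seed (assumed) keeps the sample reproducible
    k = int(len(lines) * fraction)     # number of lines to keep
    return rng.sample(lines, k)        # sampling without replacement

# Toy stand-in for one of the en_US files (hypothetical data).
toy_lines = [f"line {i}" for i in range(1000)]
text_sample = sample_lines(toy_lines)
print(len(text_sample))  # 100
```

In practice the same function would be applied to each of the three files and the results combined into a single sample.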
A corpus was created from the sample to observe word frequencies, from which the prediction model is derived. The corpus was cleaned by stripping extra white space, removing punctuation, converting text to lower case, and filtering profanity, producing a tidy dataset suitable for exploratory analysis.
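A rough sketch of these cleaning steps in Python is shown below. The author used R, so this only illustrates the idea; the `PROFANITY` set is a hypothetical placeholder, since the actual word list is not given, and keeping apostrophes inside words is an assumption.

```python
import re

# Hypothetical placeholder; the profanity list actually used is not stated.
PROFANITY = {"darn", "heck"}

def clean_line(line, banned=PROFANITY):
    """Lower-case a line, drop punctuation and extra white space, filter banned words."""
    line = line.lower()
    line = re.sub(r"[^\w\s']", " ", line)            # punctuation becomes a space
    tokens = [t.strip("'") for t in line.split()]    # split() also collapses white space
    return [t for t in tokens if t and t not in banned]

print(clean_line("Well, DARN it!  That's  great."))
# ['well', 'it', "that's", 'great']
```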
Using a term-document matrix from the R tm package, tables of 3-grams, 2-grams, and 1-grams were generated with their corresponding frequencies. A sample of the most frequently occurring n-grams was used in the model.
Applying Katz's Back-off Model
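To make the n-gram frequency tables concrete, here is a small Python sketch that counts n-grams with `collections.Counter`. It stands in for the tm-based tables described above rather than reproducing them; `ngram_counts` and the toy sentences are assumptions for illustration.

```python
from collections import Counter

def ngram_counts(tokenized_lines, n):
    """Count every run of n consecutive tokens across all lines."""
    counts = Counter()
    for tokens in tokenized_lines:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Toy corpus (hypothetical), already cleaned and tokenized.
toy = [["thanks", "for", "the", "follow"],
       ["thanks", "for", "the", "help"]]

unigrams = ngram_counts(toy, 1)
bigrams = ngram_counts(toy, 2)
trigrams = ngram_counts(toy, 3)
print(trigrams.most_common(1))  # [(('thanks', 'for', 'the'), 2)]
```

Keeping only the most frequent entries of each table, as the text describes, would correspond to calling `most_common(k)` with a chosen cutoff `k`.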
The SwiftKey Text Prediction application lets the user key words into the application and click a “Go” button. Following Katz's back-off model, the application first looks for the entered words in the tri-gram table and returns the word with the highest conditional probability.
If there is no match, the application “backs off” to the bi-gram table, and if still no match is found, the uni-gram table is used. Click the SwiftKey Text Application link to open the application.
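The back-off lookup can be sketched in Python as below. This is a simplified illustration under stated assumptions: it follows the conventional back-off order, starting from the longest context (tri-gram) and falling back to shorter ones, and it omits the discounting that a full Katz model applies. The names `build_tables` and `predict_next` and the toy corpus are hypothetical.

```python
from collections import Counter

def build_tables(tokenized_lines, max_n=3):
    """Build one frequency table per n-gram order, keyed by n."""
    tables = {n: Counter() for n in range(1, max_n + 1)}
    for tokens in tokenized_lines:
        for n in tables:
            for i in range(len(tokens) - n + 1):
                tables[n][tuple(tokens[i:i + n])] += 1
    return tables

def predict_next(words, tables, max_n=3):
    """Return the most likely next word, backing off to shorter contexts."""
    for n in range(max_n, 1, -1):
        context = tuple(words[-(n - 1):])
        if len(context) < n - 1:
            continue  # not enough typed words for this order
        # Within one context, the highest count is also the highest
        # conditional probability, since the denominator is shared.
        candidates = {g[-1]: c for g, c in tables[n].items() if g[:-1] == context}
        if candidates:
            return max(candidates, key=candidates.get)
    # Last resort: the most frequent single word, if any.
    if tables[1]:
        return tables[1].most_common(1)[0][0][0]
    return None

toy = [["thanks", "for", "the", "follow"],
       ["thanks", "for", "the", "help"],
       ["for", "the", "follow"],
       ["see", "you", "soon"]]
tables = build_tables(toy)
print(predict_next(["thanks", "for", "the"], tables))  # follow (tri-gram match)
print(predict_next(["hello", "you"], tables))          # soon (backs off to bi-gram)
```

Scanning the whole table for each lookup is fine for a toy corpus; a real application would more likely index each table by context so that a prediction is a single dictionary lookup.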