Data Science Capstone

Andria Hall
April 24, 2016

Introduction

This application is a capstone project offered by Johns Hopkins University and made available through Coursera. It is a predictive text model that determines the next word when given the preceding words in a phrase.

The application is based on SwiftKey's smart keyboard, which predicts text as the user types. SwiftKey is also a partner of Johns Hopkins University in this capstone project.

Data Acquisition and Summary

The data for this application are available in English, German, Finnish, and Russian, but only the English datasets (en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt) were used. The data are available at: corpora.heliohost.org

Summary	Twitter	News	Blogs
Lines	2360148	1010242	899288
Size (MB)	159.3641	196.2775	159.3641

Because of memory and processor limitations, a 10 percent sample of each dataset was used to create the textSample.
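The sampling step can be illustrated with a short sketch. It is written in Python for illustration only (the original work was done in R), and the function name `sample_lines`, the fixed seed, and the stand-in line lists are assumptions, not part of the original project.

```python
import random

def sample_lines(lines, fraction=0.10, seed=42):
    """Return a random sample holding `fraction` of the given lines."""
    rng = random.Random(seed)          # fixed seed so the sample is reproducible
    k = int(len(lines) * fraction)     # number of lines to keep
    # Sort the chosen indices so the sample keeps the original line order.
    chosen = sorted(rng.sample(range(len(lines)), k))
    return [lines[i] for i in chosen]

# Hypothetical stand-ins for the three English files; in practice each list
# would hold the lines read from en_US.blogs.txt, en_US.news.txt and
# en_US.twitter.txt.
blogs = [f"blog line {i}" for i in range(1000)]
news = [f"news line {i}" for i in range(1000)]
twitter = [f"tweet {i}" for i in range(1000)]

# Combine the three 10 percent samples into one text sample.
text_sample = sample_lines(blogs) + sample_lines(news) + sample_lines(twitter)
```

Sampling each source separately keeps the three sources represented in the same proportion as in the full data.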

Tokenization and Algorithm

A corpus was created to observe word frequencies, from which the text prediction model is derived. The corpus was cleaned by stripping extra white space, removing punctuation, converting text to lower case, and filtering profanity, producing a tidy dataset appropriate for exploratory analysis.
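A minimal sketch of these cleaning steps, in Python for illustration rather than the R code used in the project. The `PROFANITY` set is a placeholder, since the actual word list used for filtering is not stated, and `clean_line` is a hypothetical helper name.

```python
import string

# Placeholder block list (assumption); the real profanity list is not given.
PROFANITY = {"darn", "heck"}

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_line(line, banned=PROFANITY):
    """Lower-case a line, drop punctuation and banned words, return tokens."""
    text = line.lower().translate(_PUNCT_TABLE)
    tokens = text.split()              # split() also collapses extra white space
    return [t for t in tokens if t not in banned]

tokens = clean_line("  Hello,   WORLD!  What the heck? ")
```

Here `tokens` holds the lower-cased words with punctuation, extra white space, and the banned word removed.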

Using a term-document matrix from the R tm package, tables of 3-grams, 2-grams, and 1-grams were generated with their corresponding frequencies. A sample of the most frequently occurring n-grams was used in the model, to which Katz's back-off model was applied.
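The n-gram frequency tables can be sketched as follows. This is an illustrative Python version using plain counters, not the tm-based R code from the project; `ngram_counts` and the tiny corpus are assumptions made for the example.

```python
from collections import Counter

def ngram_counts(token_lines, n):
    """Count every run of n consecutive tokens within each tokenized line."""
    counts = Counter()
    for tokens in token_lines:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Tiny hypothetical corpus of already-cleaned, tokenized lines.
corpus = [
    ["thanks", "for", "the", "follow"],
    ["thanks", "for", "the", "help"],
    ["one", "of", "the", "best"],
]

unigrams = ngram_counts(corpus, 1)
bigrams = ngram_counts(corpus, 2)
trigrams = ngram_counts(corpus, 3)

# Keep only the most frequent entries, as a stand-in for the sampling of
# frequently occurring n-grams described above.
top_trigrams = trigrams.most_common(1)
```

Counting within each line, rather than across line boundaries, avoids creating n-grams that join the end of one text to the start of the next.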

Text Prediction Application

The SwiftKey Text Prediction application lets the user type words into the application and click a “Go” button. The application first looks in the Tri-gram table for the word with the highest conditional probability given the preceding words.

If there is no match, the application “backs off” to the Bi-gram table, and if still no match is found, the Uni-gram table is used. The application is available through the SwiftKey Text Application link.
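A back-off lookup in the usual order (longest context first, then shorter contexts) can be sketched as below. This is a simplified illustration in Python, not the application's R code: it picks the most frequent continuation at each level and omits the probability discounting that a full Katz back-off model applies. The names `build_tables` and `predict_next` and the small corpus are assumptions.

```python
from collections import Counter

def build_tables(token_lines, max_n=3):
    """Build one n-gram frequency table for each n from 1 to max_n."""
    tables = {n: Counter() for n in range(1, max_n + 1)}
    for tokens in token_lines:
        for n in tables:
            for i in range(len(tokens) - n + 1):
                tables[n][tuple(tokens[i:i + n])] += 1
    return tables

def predict_next(words, tables, max_n=3):
    """Predict the next word, backing off from longer to shorter contexts."""
    for n in range(max_n, 1, -1):
        context = tuple(words[-(n - 1):])
        if len(context) < n - 1:
            continue                   # not enough typed words for this table
        # Within a fixed context, the highest count is also the highest
        # conditional probability. A linear scan is fine for a sketch.
        candidates = {gram[-1]: count for gram, count in tables[n].items()
                      if gram[:-1] == context}
        if candidates:
            return max(candidates, key=candidates.get)
    # No context matched: fall back to the most frequent single word.
    if tables[1]:
        return tables[1].most_common(1)[0][0][0]
    return None

# Tiny hypothetical corpus of cleaned, tokenized lines.
corpus = [
    ["thanks", "for", "the", "follow"],
    ["thanks", "for", "the", "follow"],
    ["thanks", "for", "the", "help"],
    ["one", "of", "the", "best"],
]
tables = build_tables(corpus)

from_trigram = predict_next(["thanks", "for", "the"], tables)
from_bigram = predict_next(["see", "the"], tables)
from_unigram = predict_next(["hello"], tables)
```

In this example the first query is answered from the tri-gram table, the second backs off to the bi-gram table because "see the" was never observed, and the third falls through to the uni-gram table.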