KismetK
This is the final project for the Data Science Specialization Capstone course, offered by Johns Hopkins University in partnership with SwiftKey.
Background
Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, our corporate partner in this capstone and a leading software company, has built a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is its predictive text models.
Overview
The goal of this exercise is to create a product that highlights the prediction algorithm we have built and provides an interface that others can access. The product is a Shiny app that takes a phrase (multiple words) as input in a text box and outputs a prediction of the next word.
The Final Product
The next word prediction app is hosted on shinyapps.io:
https://kishi.shinyapps.io/predict-next-word/
Working Method
The dataset is provided by SwiftKey. We use the English corpus. To speed up data pre-processing, we built the models from random samples of the data. Often relatively few randomly selected rows or chunks need to be included to get an accurate approximation of the results that would be obtained using all the data.
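The sampling idea can be illustrated with a minimal sketch. It is written in Python purely for illustration (the project itself is built in R), and the function name, the 5% fraction and the seed are assumptions rather than the values used in this project.

```python
import random

def sample_lines(lines, fraction, seed=42):
    """Keep each line independently with probability `fraction`.

    A fixed seed makes the sample reproducible between runs.
    """
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]

# Stand-in for a large text file: 10,000 synthetic lines.
corpus = [f"line {i}" for i in range(10_000)]

# Keep roughly 5% of the lines (an illustrative fraction only).
sample = sample_lines(corpus, fraction=0.05)
print(len(sample), "of", len(corpus), "lines kept")
```

Sampling line by line like this avoids loading more text than needed into the later cleaning and n-gram steps.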
We use the R package tm (Text Mining) to clean up the data (tokenization and profanity filtering).
Tokenization - identifying appropriate tokens such as words, punctuation, and numbers.
Profanity filtering - removing profanity and other words you do not want to predict.
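The two cleaning steps above can be sketched as follows. This is a language-agnostic illustration in Python, not the tm-based code used in the project; the regular expression and the tiny `BLOCKED_WORDS` list are hypothetical placeholders.

```python
import re

# Hypothetical placeholder list; a real filter would load a fuller word list.
BLOCKED_WORDS = {"darn", "heck"}

# Words (with an optional apostrophe part), runs of digits, or basic punctuation.
TOKEN_PATTERN = re.compile(r"[a-z]+(?:'[a-z]+)?|\d+|[.,!?;]")

def tokenize(text):
    """Lower-case the text and split it into word, number and punctuation tokens."""
    return TOKEN_PATTERN.findall(text.lower())

def remove_blocked(tokens, blocked=BLOCKED_WORDS):
    """Drop every token that appears in the blocked-word set."""
    return [token for token in tokens if token not in blocked]

tokens = tokenize("Well, darn! I can't find my 2 keys.")
clean = remove_blocked(tokens)
print(clean)
```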
Quadgram, trigram and bigram n-grams are created. The objects are saved as compressed R files.
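As an illustration of what these n-grams look like, here is a minimal Python sketch (the project itself builds them in R). The helper name `ngrams` and the example sentence are assumptions made for this example.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "thanks for the follow back".split()

bigrams = ngrams(tokens, 2)    # pairs of adjacent words
trigrams = ngrams(tokens, 3)   # triples of adjacent words
quadgrams = ngrams(tokens, 4)  # four adjacent words

print(quadgrams)
```

A sentence shorter than n words simply yields no n-grams of that order.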
Build basic n-gram model
Understand frequencies of words and word pairs - build figures and tables to understand variation in the frequencies of words and word pairs in the data.
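A frequency table of this kind can be sketched in a few lines. The Python below is only an illustration with made-up sentences and a hypothetical helper `ngram_counts`; the project's own tables are built in R from the SwiftKey data.

```python
from collections import Counter

def ngram_counts(sentences, n):
    """Count how often each n-gram occurs across all sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts

# Made-up sentences standing in for the cleaned corpus.
sentences = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "a dog sat on the rug",
]

unigrams = ngram_counts(sentences, 1)  # single-word frequencies
bigrams = ngram_counts(sentences, 2)   # word-pair frequencies

# Show the most frequent word pairs with their counts.
for gram, count in bigrams.most_common(3):
    print(" ".join(gram), count)
```

Sorting such counts from most to least frequent is what makes the variation in word and word-pair frequencies visible in tables and plots.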
Algorithm used to make the prediction
Step 1 Type word(s) in the right box
Step 2 Suggested words may appear under the box; select one of them or type your own word
Step 3 Keep typing, or keep selecting words from the predictions, to build your sentence.
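This README does not spell out how the suggestions are looked up. One common approach with quadgram, trigram and bigram tables is a simple backoff: try the longest context first and fall back to shorter ones when there is no match. The Python sketch below illustrates that idea only; it is an assumption, not necessarily the method used in the app, and `build_tables`, `predict_next` and the training sentences are hypothetical.

```python
from collections import Counter, defaultdict

def build_tables(sentences, max_n=4):
    """For each n from 2 to max_n, map an (n-1)-word context to a Counter of next words."""
    tables = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for n in tables:
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                next_word = tokens[i + n - 1]
                tables[n][context][next_word] += 1
    return tables

def predict_next(phrase, tables, k=3):
    """Return up to k candidate next words, backing off from long to short contexts."""
    tokens = phrase.lower().split()
    for n in sorted(tables, reverse=True):
        context = tuple(tokens[-(n - 1):])
        if len(context) < n - 1:
            continue  # the phrase is too short for this n-gram order
        followers = tables[n].get(context)
        if followers:
            return [word for word, _ in followers.most_common(k)]
    return []  # no context matched at any order

# Made-up training sentences for illustration.
tables = build_tables([
    "thanks for the follow",
    "thanks for the follow back",
    "thanks for the help",
    "looking for the exit",
])

print(predict_next("thanks for the", tables))   # matched in the quadgram table
print(predict_next("waiting for the", tables))  # backs off to the trigram table
```

Returning several candidates rather than one matches the way the app offers multiple suggested words under the input box.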