My little homegrown tribute to SwiftKey - Coursera Data Science Capstone project
Dmitry Danilov
24 Apr 2016
Introduction
- The 'My little homegrown tribute to SwiftKey' application is implemented with Shiny, a web application framework for R.
- The main purpose of the application is to demonstrate next-word prediction based on the previously entered words.
- The application mimics certain features of the SwiftKey for Android application.
Main features
- The application predicts not only the next word from the whole words already entered, but also the word currently being typed from its first letters.
- The 3 most probable predicted words are displayed as the labels of 3 buttons located right above the input text field.
- The order in which the predicted words appear depends on their probability and is as follows: |w_p2|w_p1|w_p3| (the word with the highest probability is on the middle button, the 2nd highest on the left button and the 3rd highest on the right button).
- As the user types text in the input field, predicted words appear on the 3 buttons. The user can either press the button with the correct prediction to enter that word in the input field quickly, or continue typing.
- Predicted words are updated instantly as the user types letters in the input field.
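The button behaviour described above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the application's R/Shiny code; the function names `top_candidates` and `button_layout` and the toy probabilities are assumptions made up for this example.

```python
def top_candidates(scores, prefix="", k=3):
    """Return up to k words that start with `prefix`, most probable first.

    `scores` maps candidate words to probabilities (toy values here).
    An empty prefix corresponds to next-word prediction; a non-empty
    prefix corresponds to completing the word being typed.
    """
    matches = [(w, p) for w, p in scores.items() if w.startswith(prefix)]
    matches.sort(key=lambda wp: (-wp[1], wp[0]))  # ties broken alphabetically
    return [w for w, _ in matches[:k]]


def button_layout(ranked):
    """Arrange ranked words as [left, middle, right] = [2nd, 1st, 3rd]."""
    top = list(ranked[:3])
    padded = top + [""] * (3 - len(top))  # blank label if fewer than 3 words
    return [padded[1], padded[0], padded[2]]


# Hypothetical probabilities for the next word in some context.
scores = {"the": 0.30, "that": 0.20, "to": 0.25, "a": 0.15, "this": 0.10}

print(button_layout(top_candidates(scores)))        # ['to', 'the', 'that']
print(button_layout(top_candidates(scores, "th")))  # ['that', 'the', 'this']
```

Re-running `top_candidates` with the current partial word on every keystroke is one simple way to get the "updated instantly" behaviour; the real application may do this differently.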
Important facts
- Prediction is based on the Kneser-Ney smoothing algorithm.
- The main idea of the Kneser-Ney algorithm: when estimating the probability of a word given a context, it takes into account not only the current context but also the number of distinct contexts in which the word appears.
- The model was built from a corpus containing 3 text documents in English: en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt
- The total size of the model is about 58 MB.
- The application uses a 4-gram prediction model.
- The model consists of a vocabulary in which each word is mapped to an index, and a set of data tables with higher-order n-grams that contain only the indexes of the words in the vocabulary. This keeps both the size of the model and the duration of lookup operations to a minimum.
- The model performance is estimated at around 24%. The estimate was obtained on a held-out test set extracted from the same source files.
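To make the Kneser-Ney idea and the index-based storage more concrete, here is a small sketch in Python. It is deliberately simplified and is not the application's implementation: it uses bigrams instead of 4-grams, a fixed discount of 0.75 (an assumed value), and a toy sentence instead of the real corpus. The names `build_model`, `p_kn` and `predict` are hypothetical.

```python
from collections import Counter, defaultdict


def build_model(tokens, discount=0.75):
    """Build an interpolated Kneser-Ney bigram model over integer word ids."""
    # Vocabulary: every distinct word gets an integer index.
    vocab = {}
    for w in tokens:
        vocab.setdefault(w, len(vocab))
    ids = [vocab[w] for w in tokens]

    # N-gram tables store only indexes, never the word strings themselves.
    bigrams = Counter(zip(ids, ids[1:]))
    context_total = Counter()     # how often context u is followed by any word
    followers = defaultdict(set)  # distinct words seen after context u
    preceders = defaultdict(set)  # distinct contexts seen before word w
    for (u, w), c in bigrams.items():
        context_total[u] += c
        followers[u].add(w)
        preceders[w].add(u)
    return {"vocab": vocab, "bigrams": bigrams, "context_total": context_total,
            "followers": followers, "preceders": preceders, "d": discount}


def p_kn(model, prev, word):
    """Interpolated Kneser-Ney estimate of P(word | prev)."""
    vocab = model["vocab"]
    if word not in vocab:
        return 0.0  # this sketch does not handle out-of-vocabulary words
    w = vocab[word]
    # Continuation probability: share of distinct bigram types ending in w.
    p_cont = len(model["preceders"].get(w, ())) / len(model["bigrams"])
    u = vocab.get(prev)
    total = model["context_total"].get(u, 0)
    if total == 0:
        return p_cont  # unseen context: fall back to the continuation term
    d = model["d"]
    discounted = max(model["bigrams"].get((u, w), 0) - d, 0) / total
    backoff_weight = d * len(model["followers"][u]) / total
    return discounted + backoff_weight * p_cont


def predict(model, prev, k=3):
    """Return the k most probable next words after `prev`."""
    ranked = sorted(model["vocab"], key=lambda w: (-p_kn(model, prev, w), w))
    return ranked[:k]


tokens = "the cat sat on the mat and the cat ran".split()
model = build_model(tokens)
print(predict(model, "the"))
```

The continuation term is what distinguishes Kneser-Ney from simpler back-off schemes: a word that follows many different contexts gets more of the redistributed probability mass than a word that is frequent but only ever appears after one particular word.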