SwiftKey Keyboard Simulator

This pitch supports an application built with Shiny and deployed on RStudio's Shiny server. The app, “Next word prediction”, predicts the next word of an input phrase using n-gram analysis.

The graphical user interface (GUI) consists of a sidebar and a main panel. The user controls a couple of prediction parameters. Instructions for using the app and additional documents are provided in a couple of tabs in the main panel.

Ngrams and TDM creation:

Creating corpora, tokenization and n-grams:

Term Document Matrix (TDM) Computation

A subsample of the combined training dataset is used to build the model.
The training data is cleaned using built-in functions of the tm and RWeka libraries: numbers, punctuation, non-English content and profanity are removed.
The data is tokenized using the ngram library, and a term-document matrix is computed that lists each term and its frequency of occurrence.
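
As a rough illustration of the tokenization and frequency-counting step above, the sketch below builds n-grams and counts them using base R only. It is not the project's code (which relies on tm, RWeka and the ngram library); the function name count_ngrams and the toy input lines are assumptions made for this example.

```r
# Minimal sketch: build all n-grams of order n from cleaned text lines
# and count their occurrences. Base R only; illustrative, not the app's code.
count_ngrams <- function(lines, n) {
  grams <- as.character(unlist(lapply(strsplit(lines, "\\s+"), function(words) {
    words <- words[nzchar(words)]              # drop empty tokens
    if (length(words) < n) return(character(0))
    starts <- seq_len(length(words) - n + 1)   # start index of each n-gram
    vapply(starts,
           function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  })))
  freq <- sort(table(grams), decreasing = TRUE)  # most frequent first
  data.frame(ngram = names(freq), freq = as.integer(freq),
             stringsAsFactors = FALSE)
}

# Hypothetical toy input standing in for the cleaned sample
toy_lines <- c("i love new york", "i love new ideas", "new york is big")
bigrams <- count_ngrams(toy_lines, 2)
print(bigrams)
```

The resulting data frame plays the role of one column of the term-document matrix: one row per n-gram, sorted by descending frequency.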

Load the data and perform the following:

Corpus Analytics

Corpus creation and cleaning follow this pattern:

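library(tm)    # VCorpus, VectorSource, tm_map and the remove*/strip* transformations
library(qdap)  # bracketX

# Sample each source; this is repeated for all datasets
en_blogs <- sampleText(en_b_tot, sample.size)

# Combined data
comb_data <- c(en_blogs, en_news, en_twitter)

# Volatile corpus (list() makes the whole sample a single document)
Corp <- VCorpus(VectorSource(list(comb_data)))

# Cleaning: lower-case, then remove numbers, punctuation, profanity,
# bracketed text and extra whitespace
Corp <- tm_map(Corp, content_transformer(tolower))
Corp <- tm_map(Corp, removeNumbers)
Corp <- tm_map(Corp, removePunctuation)
Corp <- tm_map(Corp, removeWords, profanityList)
Corp <- tm_map(Corp, content_transformer(bracketX))
Corp <- tm_map(Corp, stripWhitespace)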

Katz Back-off Implementation

A prediction model in NLP based on conditional probability:

This approach selects the next word based on maximum frequency.
The search starts with the highest-order n-gram model and looks for the input pattern; if no match is found, it backs off iteratively to lower orders, down to the smallest N (unigram).
N-grams are sorted by descending frequency, so the iterative search proceeds in that order as well.
If more than one candidate has the same frequency, the order of those candidates is chosen randomly.
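
The back-off search described above can be sketched in base R as follows. This is a simplified, frequency-only illustration: it leaves out the probability discounting of a full Katz model, and the table layout (columns prefix, word, freq), the toy counts and the function name predict_next are all assumptions made for this example rather than the app's actual data structures.

```r
# Hypothetical toy frequency tables, one data frame per n-gram order.
# Each row: the preceding words (prefix), the following word, and its count.
ngram_tables <- list(
  `3` = data.frame(prefix = c("i love", "i love"),
                   word   = c("new", "you"),
                   freq   = c(5L, 3L), stringsAsFactors = FALSE),
  `2` = data.frame(prefix = c("love", "new", "new"),
                   word   = c("you", "york", "ideas"),
                   freq   = c(7L, 4L, 4L), stringsAsFactors = FALSE),
  `1` = data.frame(prefix = c("", ""),
                   word   = c("the", "and"),
                   freq   = c(50L, 30L), stringsAsFactors = FALSE)
)

predict_next <- function(phrase, tables, max_n = 3) {
  words <- strsplit(tolower(trimws(phrase)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  # Try the highest usable order first, then back off to shorter contexts.
  for (n in seq(min(max_n, length(words) + 1), 1)) {
    prefix <- paste(tail(words, n - 1), collapse = " ")  # "" for unigrams
    tbl <- tables[[as.character(n)]]
    hits <- tbl[tbl$prefix == prefix, , drop = FALSE]
    if (nrow(hits) > 0) {
      best <- hits$word[hits$freq == max(hits$freq)]
      # Ties on frequency are broken at random.
      return(best[sample.int(length(best), 1)])
    }
  }
  NA_character_  # nothing matched at any order
}

print(predict_next("i love", ngram_tables))    # matched in the trigram table
print(predict_next("we love", ngram_tables))   # backs off to the bigram table
print(predict_next("xyz", ngram_tables))       # backs off to the unigram table
```

With these toy tables, "something new" matches two bigram candidates with equal counts, so either may be returned on a given call.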

Future Work

There are several areas of research and improvement for the accuracy and efficiency of the implemented framework:

Dividing the corpus and processing it in parallel on multiple cores
Smoothing and enhancing the model, for example using Markov chains
Including context and the meaning of the words of a phrase in the prediction
For unseen cases, improving the prediction by adding context and Kneser-Ney smoothing
Knowing whether a word is the last word or is followed by another word could also help