Text Prediction Tool Powered by the N-gram Algorithm

10/26/2018

Special Thanks

Special thanks to SwiftKey for providing the cleaned text data from Twitter, blogs, and news. I believe that no algorithm can run without a data set.
Special thanks to Johns Hopkins University and Coursera for providing technical support.
Lastly, special thanks to my friend Anthony for giving me valuable suggestions.
Limitation: Because of the limited capacity of my own computer, I used a sample of the raw data rather than loading the data in full. The training set is therefore quite small, so the current predictive power may not be as high as you expect.

The fundamental objective of this project is to mimic a predictive text input tool, such as the iPhone's text input, to make typing easier and reduce errors.
The project can be summarised in 3 steps:

Consolidate the raw data set (use website crawling to collect information). Thankfully, this has already been done by the SwiftKey team.
Design an algorithm to recognize patterns. Here, I am using the N-gram algorithm. Thankfully, R professionals have already compiled some useful R packages for us, such as quanteda and RWeka. We can use their functions directly rather than writing the program from scratch.
Lastly, tune the algorithm and design the user interface. Here, we are using a Shiny app.

In a nutshell, the N-gram model is a common natural language processing technique for prediction. Its assumption is that the next word depends solely on the few words immediately before it: a model of order N looks at the previous N-1 words and predicts the Nth. N can be 1, 2, 3, 4, etc.
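To make the assumption concrete, here is a minimal sketch of how such a model can estimate the probability of a next word from counts. It is written in plain Python for illustration and is not the R code behind this project; the function names `ngram_counts` and `next_word_probability` and the toy sentence are hypothetical.

```python
from collections import Counter, defaultdict


def ngram_counts(tokens, n):
    """Map each context of n-1 words to a Counter of the words that follow it."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])   # the preceding n-1 words
        next_word = tokens[i + n - 1]          # the word being predicted
        counts[context][next_word] += 1
    return counts


def next_word_probability(counts, context, word):
    """Relative frequency of `word` among everything seen after `context`."""
    followers = counts.get(tuple(context))
    if not followers:
        return 0.0                             # context never seen in training
    return followers[word] / sum(followers.values())


# Toy training text (illustrative only, not the SwiftKey data)
tokens = "i like green tea and i like black tea and i like coffee".split()
bigrams = ngram_counts(tokens, 2)

p_like_after_i = next_word_probability(bigrams, ["i"], "like")
p_green_after_like = next_word_probability(bigrams, ["like"], "green")
```

In this toy text "i" is always followed by "like", while "like" is followed by three different words equally often, so the model is far less certain about what comes after "like".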
Therefore, the general approach for this project is:

Get the data set and clean it (e.g., remove punctuation, remove numbers, and convert all letters to a consistent case).
Tokenize the data set. This extracts the features of the data set for analysis.
Use the N-gram model to predict the next word from the database.
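The three steps above can be sketched end to end as follows. This is an illustrative Python version using only the standard library, not the quanteda/RWeka code used in the project. The helper names (`clean`, `tokenize`, `build_table`, `predict_next`) are hypothetical, and falling back to a shorter context when a longer one has not been seen is an assumption of this sketch rather than something stated above.

```python
import re
from collections import Counter, defaultdict


def clean(text):
    """Lower-case the text and replace digits and punctuation with spaces."""
    text = text.lower()
    text = re.sub(r"[0-9]+", " ", text)        # remove numbers
    text = re.sub(r"[^a-z'\s]", " ", text)     # remove punctuation, keep apostrophes
    return text


def tokenize(text):
    """Split cleaned text into word tokens."""
    return clean(text).split()


def build_table(tokens, max_n=3):
    """Count which word follows each context of 1 .. max_n-1 words."""
    table = defaultdict(Counter)
    for n in range(2, max_n + 1):
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            table[context][tokens[i + n - 1]] += 1
    return table


def predict_next(table, phrase, max_n=3):
    """Return the most frequent follower of the longest known context, or None."""
    words = tokenize(phrase)
    # Assumption: try the longest context first, then fall back to shorter ones.
    for k in range(min(max_n - 1, len(words)), 0, -1):
        followers = table.get(tuple(words[-k:]))
        if followers:
            return followers.most_common(1)[0][0]
    return None


# Toy corpus (illustrative only, not the SwiftKey data)
corpus = "I'd like 2 cups of green tea. I'd like a cup of coffee! We like green tea, too."
table = build_table(tokenize(corpus))

print(predict_next(table, "We like green"))
print(predict_next(table, "a cup of"))
```

For simplicity the sketch lets n-grams run across sentence boundaries and returns None when no context matches; a real tool would likely split sentences first and fall back to the most common words overall.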