10/26/2018

Special Thanks

  • Special thanks to SwiftKey for providing the cleaned text data from Twitter, blogs and news. No algorithm can run without a data set.
  • Special Thanks to Johns Hopkins University and Coursera for providing technical support
  • Lastly, special thanks to my friend Anthony for giving me valuable suggestions
  • Limitation: Because of my computer's limited capacity, I trained on a sample of the raw data rather than loading the full data set. The training set is therefore quite small, so the current prediction accuracy may be lower than you expect.

Executive Summary

  • The fundamental objective of this project is to build a predictive text input tool, similar to the iPhone's text input, that makes typing easier and reduces errors
  • The project can be summarised in 3 steps:
  1. Consolidate the raw dataset (collected by web crawling). Thankfully, the SwiftKey team has already done this.
  2. Design an algorithm to recognize patterns in the text. Here, I use an n-gram model. Thankfully, the R community has already built useful packages for this, such as quanteda and RWeka, so we can use their functions directly rather than writing the program from scratch.
  3. Lastly, tune the algorithm and design the user interface. Here, I use a Shiny app.

About the algorithm: N-gram

  • In a nutshell, the N-gram model is a common natural language processing technique for prediction. Its assumption is that the next word depends only on the words immediately before it: an N-gram model looks at the previous N-1 words to predict the Nth. N can be 1, 2, 3, 4, etc.
  • Therefore, the general approach for this project is:
  1. Get the data set and clean it (e.g. remove punctuation, remove numbers, convert all letters to a consistent case)
  2. Tokenize the data set, that is, split the text into individual words so that features can be extracted for analysis
  3. Use the N-gram counts built from the data set to predict the next word
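
The three steps above can be sketched in a few lines. The project itself relies on the R packages quanteda and RWeka, so the Python below is not the author's implementation; it is only a minimal, language-neutral illustration of the same idea. The function names (clean, tokenize, build_ngram_counts, predict_next) and the tiny corpus are hypothetical, and the sketch assumes plain frequency counts with no smoothing or back-off.

```python
import re
from collections import Counter, defaultdict

def clean(text):
    """Step 1: lowercase, then replace digits and punctuation with spaces.
    Assumption: only a-z, apostrophes and spaces are kept, so accented
    letters are dropped as well."""
    return re.sub(r"[^a-z' ]+", " ", text.lower())

def tokenize(text):
    """Step 2: split the cleaned text into word tokens."""
    return clean(text).split()

def build_ngram_counts(sentences, n):
    """Count, for each context of n-1 words, which word follows it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = tokenize(sentence)
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            next_word = tokens[i + n - 1]
            counts[context][next_word] += 1
    return counts

def predict_next(counts, phrase, n):
    """Step 3: return the most frequent follower of the last n-1 words,
    or None when that context was never seen in the training data."""
    context = tuple(tokenize(phrase)[-(n - 1):]) if n > 1 else ()
    candidates = counts.get(context)
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]

# Hypothetical toy corpus standing in for the Twitter, blog and news sample.
corpus = [
    "I love New York.",
    "I love New Orleans!",
    "We love New York, 2 times a year.",
]
counts = build_ngram_counts(corpus, n=3)  # trigram model: 2 words of context
print(predict_next(counts, "They really love new", n=3))  # prints "york"
```

With a small training sample, many contexts are never seen and the sketch simply returns None for them. A practical predictor would usually fall back to a shorter context in that case, which is one reason a larger training set improves the results.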

Thank you for your attention