Next Word Predictor

Eric Lim B G
26 April 2015

Coursera, Data Science Specialisation
Capstone Project - SwiftKey (Rpres)

Background

Natural Language Processing (NLP) and Text Mining (TM) are fields of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages.

A shiny app has been developed to demonstrate how data science techniques from NLP and TM can be used to build a predictive model for next word prediction.

The shiny app is composed of:

  1. A Predictor with dynamic selections and display that allows the user to enter phrases for next word prediction
  2. An About page that describes the application and provides references to other related information
  3. A Help guide that helps the user get started with this shiny application

Underlying Training Data

Training data from the HC Corpora corpus is used to build the model. A 0.1% sample is drawn from each of the US-locale blogs, news and Twitter datasets, a size chosen with the app's required performance in mind.
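The sampling step can be illustrated with a minimal sketch. It is written in Python for brevity and is not the project's own R code. The function name `sample_lines`, the seed value and the toy corpus are assumptions made for illustration only.

```python
import random

def sample_lines(lines, fraction=0.001, seed=2015):
    """Keep each line independently with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return [line for line in lines if rng.random() < fraction]

# Toy stand-in for one corpus file; a real run would read the file's lines.
corpus = [f"line {i}" for i in range(100_000)]
sample = sample_lines(corpus)
print(len(sample))  # close to 100, i.e. about 0.1% of the input
```

Sampling line by line keeps memory use low, and the fixed seed means the same subset is drawn on every run.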

Summary information on the datasets is given below. Full exploratory analysis results are available in the project's milestone report.

     File   Lines     Chars CharsNWhite TotalWords
1   blogs  899288 206824382   170389539   37570839
2    news 1010242 203223154   169860866   34494539
3 twitter 2360148 162096031   134082634   30451128
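As a rough illustration of how such counts can be produced, here is a small Python sketch. It assumes that `CharsNWhite` means characters excluding whitespace and that `TotalWords` counts whitespace-separated tokens. The source does not state either definition, and the function name `summarise` is hypothetical.

```python
def summarise(lines):
    """Return line, character, non-whitespace character and word counts."""
    chars = sum(len(line) for line in lines)
    # Assumption: CharsNWhite counts every character that is not whitespace.
    non_white = sum(1 for line in lines for ch in line if not ch.isspace())
    # Assumption: a word is any whitespace-separated token.
    words = sum(len(line.split()) for line in lines)
    return {"Lines": len(lines), "Chars": chars,
            "CharsNWhite": non_white, "TotalWords": words}

toy = ["the quick brown fox", "jumps over", "the lazy dog"]
print(summarise(toy))
```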

Underlying Predictive Algorithm

The main algorithm used to build the model comes from the tm and RWeka packages. The following steps are performed after sampling each dataset:

  • Text Mining techniques are used to cleanse each corpus and remove unnecessary words (e.g. profanity)
  • Natural Language Processing techniques (e.g. tokenization) are used to generate the Term Document Matrix (TDM) for each corpus
  • Aggregation/Summation and Sorting are performed on each TDM to generate a search list ranked by term frequency
  • Key-Value Hashing is performed on each search list before it is used by the shiny app
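The steps above can be sketched end to end in a few lines. This is a Python illustration under stated assumptions, not the tm/RWeka code used for the app. The profanity list, the cleaning rule, the bigram size and the names `clean`, `ngrams` and `build_lookup` are all hypothetical, and plain n-gram counts stand in for the Term Document Matrix.

```python
import re
from collections import Counter, defaultdict

PROFANITY = {"darn"}  # hypothetical stand-in for a real profanity list

def clean(line):
    """Lower-case, keep letters and apostrophes only, drop listed words."""
    words = re.findall(r"[a-z']+", line.lower())
    return [w for w in words if w not in PROFANITY]

def ngrams(words, n):
    """All runs of n consecutive words."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def build_lookup(lines, n=2, top=3):
    """Count n-grams, rank by frequency, and hash prefix -> next words."""
    counts = Counter(g for line in lines for g in ngrams(clean(line), n))
    lookup = defaultdict(list)
    # most_common() yields n-grams in descending order of frequency.
    for gram, _freq in counts.most_common():
        prefix, nxt = " ".join(gram[:-1]), gram[-1]
        if len(lookup[prefix]) < top:
            lookup[prefix].append(nxt)
    return dict(lookup)

lines = ["I love data science", "I love darn coffee", "I love data",
         "You love coffee", "We love data"]
table = build_lookup(lines)
print(table["love"])  # ['data', 'coffee']
```

Storing only the top few next words per prefix keeps the lookup small, which suits an interactive app where each keystroke triggers a dictionary access.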

Next Word Predictor

The shiny app accepts an editorial profile and a phrase from the user, and outputs a prediction of the next word. More instructions are available under the application's “Help”.
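A minimal Python sketch of the lookup behind such a prediction follows. It assumes that each editorial profile maps to a lookup table built from one corpus, and that the predictor falls back to a shorter context when the longer one is not found. Neither detail is stated in the source, and the tables contain made-up entries purely for illustration.

```python
def predict(tables, profile, phrase, max_context=2):
    """Return ranked next-word candidates for `phrase` under `profile`."""
    table = tables[profile]
    words = phrase.lower().split()
    # Try the longest available context first, then shorter ones.
    for size in range(min(max_context, len(words)), 0, -1):
        key = " ".join(words[-size:])
        if key in table:
            return table[key]
    return []  # no context matched

# Hypothetical lookup tables, one per editorial profile.
tables = {
    "blogs": {"thank you": ["for", "so"], "you": ["are", "can"]},
    "twitter": {"thank you": ["so", "for"], "you": ["too", "are"]},
}
print(predict(tables, "blogs", "Thank you"))     # ['for', 'so']
print(predict(tables, "twitter", "I miss you"))  # ['too', 'are']
```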