Data Science Capstone: Ngram Text Predictor

Calvin Chin [calvin3663@hotmail.com]
April 2016

Project Overview

People are spending an increasing amount of time on their mobile devices. What if an application could make typing on those devices easier and faster by predicting the next word from the sentence entered so far? This simple Shiny application demonstrates the power of a text prediction model.

Data Mining and Term Frequency Analysis

The project begins by mining a large corpus of text sentences obtained from news, blogs, and Twitter data. The objective is to discover how words are put together, which forms the basis of the Term Frequency database used by the text prediction application.
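As a rough illustration of what a term frequency analysis of this kind involves, the sketch below counts n-grams (runs of n consecutive words) over a handful of sentences. It is written in Python for brevity and is not the project's R code; the function names `tokenize` and `ngram_counts`, the tokenization rule, and the sample sentences are all assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens (assumed rule)."""
    return re.findall(r"[a-z']+", sentence.lower())

def ngram_counts(sentences, n):
    """Count every run of n consecutive words across all sentences."""
    counts = Counter()
    for sentence in sentences:
        words = tokenize(sentence)
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

# Hypothetical sample corpus; the real project mines news, blog, and Twitter text.
sample = ["I love data science", "I love text mining", "We love data"]
bigrams = ngram_counts(sample, 2)

# The most frequent bigrams would be the "top terms" shown in a frequency plot.
for terms, count in bigrams.most_common(3):
    print(" ".join(terms), count)
```

Running the same function with n = 1, 2, and 3 would give the unigram, bigram, and trigram tables that an n-gram predictor typically draws on.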

Exploratory Analysis Report: http://rpubs.com/calvin3663/DataScienceCapstoneExploratoryAnalysis/

The graphs below show an excerpt of the initial data analysis of the Top Term Frequencies.

[Figure: Top Term Frequencies]

Reusable Text Prediction Model

The Text Prediction Model is essentially a reusable R function (FuncTextPrediction.R) that uses the Term Frequency Database created during the Data Mining and Term Frequency Analysis phase summarized in the previous section. The function takes a text sentence as input, extracts its terms, and applies a text matching algorithm to obtain a list of possible next words. Five suggestions are then returned to the calling application.
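The matching step can be pictured with a small sketch. The one below is written in Python and assumes a simple back-off strategy: look up the longest trailing context first, then fall back to shorter contexts until five suggestions are collected. The source does not specify which matching algorithm FuncTextPrediction.R uses, so the back-off approach, the names `build_table` and `predict_next`, and the sample corpus are all hypothetical.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase text and split it into word tokens (assumed rule)."""
    return re.findall(r"[a-z']+", text.lower())

def build_table(sentences, max_n=3):
    """Map each context (tuple of preceding words) to a Counter of next words."""
    table = defaultdict(Counter)
    for sentence in sentences:
        words = tokenize(sentence)
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                *context, next_word = words[i:i + n]
                table[tuple(context)][next_word] += 1
    return table

def predict_next(text, table, max_n=3, k=5):
    """Return up to k distinct suggestions, longest matching context first."""
    words = tokenize(text)
    suggestions = []
    # Back off from the longest usable context down to the empty context,
    # which holds plain word frequencies.
    for size in range(min(max_n - 1, len(words)), -1, -1):
        context = tuple(words[len(words) - size:])
        for word, _ in table.get(context, Counter()).most_common():
            if word not in suggestions:
                suggestions.append(word)
            if len(suggestions) == k:
                return suggestions
    return suggestions

# Hypothetical corpus standing in for the Term Frequency Database.
corpus = ["I love data science", "I love text mining", "I love data", "We love data"]
table = build_table(corpus)
result = predict_next("I love", table)
print(result)
```

Falling back to shorter contexts means the function can still fill its five slots when the exact phrase has rarely or never been seen, at the cost of less specific suggestions further down the list.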

The Text Prediction Function and the accompanying Term Frequency Database can be easily ported to different environments and integrated into mobile devices to extend the capabilities of existing mobile applications.

Unique Value Proposition

This Smart Text Prediction Algorithm has tremendous potential based on the following factors:

  • Fast. With predictive capabilities, you can write much faster and save time.
  • Efficient, with a small footprint, which makes the algorithm well suited to running on mobile devices.
  • Extremely wide application area, e.g. email, text chat/messaging, text-to-speech synthesis, etc.
  • More capabilities can be added with further development, e.g. multi-language support and learning from the user to improve its predictions.

LIVE DEMO @ Shinyapps.io

To run the application, use your browser to open the link below:

https://calvin3663.shinyapps.io/DataScienceCapstone_NgramTextPredictor/
