Shiny App for Predictive Text

JGG
2017

OVERVIEW AND PREVIEW

This capstone project builds a PREDICTIVE TEXT app in Shiny, an R package from RStudio for building interactive web pages.

Practical applications of predictive text include text messaging, email, search engines, customer management sites, and chat apps. Here is the PREVIEW; TRY the APP later…

ALGORITHM

In building the App, the following concepts and models were used:

  1. N-Gram: a sequence of N words (e.g. the 2-gram “beautiful life”, the 3-gram “I am home”)

  2. Markov Chain: the probability of the next word depends only on the previous N-1 words, not on the whole preceding text

  3. Stupid Backoff for smoothing: use the 4-gram model if it returns enough matches; otherwise back off to the 3-gram model, and then to the 2-gram model.
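
The three ideas above can be sketched together in a few lines. This is a minimal illustration in Python, not the app's own R code: the function names `build_ngram_tables` and `predict_next`, the toy corpus, and the rule "stop once k words are found" are all assumptions made for the example. It also ranks candidates by raw counts only; the fuller Stupid Backoff scheme additionally multiplies lower-order scores by a fixed discount, which is omitted here.

```python
from collections import Counter, defaultdict


def build_ngram_tables(tokens, max_n=4):
    """Map each (n-1)-word context to a Counter of next words, for n = 2..max_n."""
    tables = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    for n in range(2, max_n + 1):
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            next_word = tokens[i + n - 1]
            tables[n][context][next_word] += 1
    return tables


def predict_next(tables, phrase, k=3, max_n=4):
    """Return up to k candidate next words, backing off from 4-gram to 2-gram."""
    words = phrase.lower().split()
    suggestions = []
    for n in range(max_n, 1, -1):
        if len(words) < n - 1:
            continue  # phrase is too short to form this context
        # Markov assumption: only the last n-1 words matter
        context = tuple(words[-(n - 1):])
        # .get avoids inserting empty entries into the defaultdict
        counts = tables[n].get(context, Counter())
        for word, _ in counts.most_common():
            if word not in suggestions:
                suggestions.append(word)
            if len(suggestions) == k:
                return suggestions
    return suggestions


# Hypothetical toy corpus, already lowercased and tokenized
corpus = "i am home now . i am home alone . i am happy".split()
tables = build_ngram_tables(corpus)
print(predict_next(tables, "I am", k=2))  # prints ['home', 'happy']
```

Here "I am" is too short for a 4-gram context, so the lookup starts at the 3-gram table, where "home" follows "i am" twice and "happy" once.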

REFERENCE: D. Jurafsky & J. Martin (2014). Speech and Language Processing, Chapter 4: N-Grams

TRAINING AND TEST DATA SETS

LOADING AND PROCESSING THE DATA: The DATA SET was provided by SwiftKey, our corporate partner for this project.

  • Extract a training data set from the blogs, Twitter, and news data sets
  • Preprocess the data using the ngram package: remove punctuation, convert to lowercase, and fix spacing
  • Build a pruned 4-gram model for efficient storage; that is, keep only the 4-grams with high frequencies
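
The preprocessing and pruning steps above can be sketched as follows. This is a Python stand-in for illustration only; the app itself uses the R ngram package. The helper names `preprocess` and `pruned_four_grams`, the choice to keep apostrophes, and the `min_count` threshold are assumptions, since the slides do not state the actual cutoff.

```python
import re
from collections import Counter


def preprocess(line):
    """Lowercase, drop punctuation, and collapse repeated whitespace."""
    line = line.lower()
    # Replace anything that is not a letter, digit, apostrophe, or space
    line = re.sub(r"[^a-z0-9'\s]", " ", line)
    return re.sub(r"\s+", " ", line).strip()


def pruned_four_grams(lines, min_count=2):
    """Count 4-grams across lines and keep those seen at least min_count times."""
    counts = Counter()
    for line in lines:
        tokens = preprocess(line).split()
        for i in range(len(tokens) - 3):
            counts[tuple(tokens[i:i + 4])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}


# Hypothetical sample lines standing in for the blogs, Twitter, and news text
sample = [
    "Thanks for the follow!",
    "thanks   for the follow, friend",
    "See you at the game.",
]
print(pruned_four_grams(sample))
# prints {('thanks', 'for', 'the', 'follow'): 2}
```

Raising `min_count` shrinks the stored table at the cost of dropping rarer phrases, which is the storage versus accuracy trade-off behind pruning.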

RESULTS

   Source LineCount  WordCount Train_LineCt Train_WordCt Test_Ct Accuracy
1   Blogs   899,288 37,334,131       50,000    2,053,168   1,022      15%
2 Twitter 2,360,148 30,373,543      200,000    2,523,971     178      11%
3    News    77,259  2,643,969       54,081    1,843,581     592      15%

Number of stored 4-grams: 372,223. Accuracy can be improved by increasing the number of stored 4-grams.
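
The slides do not say how Accuracy was computed. One common approach, sketched here in Python as an assumption rather than the method actually used, is to hold out phrases with a known next word and count how often the top prediction matches. The names `next_word_accuracy`, `toy_predict`, and the lookup table are hypothetical.

```python
def next_word_accuracy(predict, test_cases):
    """Fraction of (phrase, actual_word) pairs where predict(phrase) is correct."""
    if not test_cases:
        return 0.0
    hits = sum(1 for phrase, actual in test_cases if predict(phrase) == actual)
    return hits / len(test_cases)


# Toy stand-in for the trained model: a fixed lookup on the last word
toy_model = {"am": "home", "the": "game"}


def toy_predict(phrase):
    words = phrase.lower().split()
    return toy_model.get(words[-1]) if words else None


cases = [
    ("i am", "home"),
    ("at the", "game"),
    ("see you", "later"),
    ("i am", "happy"),
]
print(next_word_accuracy(toy_predict, cases))  # 2 of 4 correct, prints 0.5
```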

FEATURES AND FUTURE

CURRENT FEATURES of the App are:

  • predicts the next word after the phrase you type in
  • lets you set the number of next-word selections
  • shows a cloud of words related to the phrase
  • user-friendly, with a tab on how to use the app (just enter the phrase and the number of selections, then submit)

CLICK HERE to try the App! CLICK HERE for the reproducible code!

FUTURE ENHANCEMENTS

  • improved speed and accuracy
  • support for languages other than English