TextSmart

Data Science Capstone Project
Ramana Sonti
June 17, 2017

Introduction/Executive Summary:

TextSmart is designed to predict the next word and/or the rest of the current word that's being typed.

  • The user interface has the following input/output components
    . the text area for inputting the message
    . three buttons whose labels get updated with the predicted words
    . the output message area below the buttons

  • The input from the text field is fed through the prediction routine after each character is typed
  • The prediction routine returns the three most probable words, one of which is expected to
    . autocomplete the rest of the word when a non-space character is typed or
    . match the next word when a space is typed
  • The three buttons are updated with the predicted words; the word on the first button has the highest probability
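
The two prediction modes described above can be sketched as follows. This is a minimal illustration in Python, not the app's actual R/Shiny code; the function names `split_input` and `top_three` and the candidate list format are assumptions made for the example.

```python
def split_input(text):
    """Return (context_words, partial_word) for the text typed so far.

    If the text is empty or ends with whitespace, partial_word is "" and the
    caller should predict the next word. Otherwise the last token is treated
    as an unfinished word to autocomplete.
    """
    words = text.lower().split()
    if not words or text[-1].isspace():
        return words, ""
    return words[:-1], words[-1]


def top_three(candidates, partial):
    """Pick up to three words, most probable first, that start with `partial`.

    `candidates` is a hypothetical list of (word, probability) pairs produced
    by some prediction routine. An empty `partial` matches every word.
    """
    matches = [(w, p) for w, p in candidates if w.startswith(partial)]
    matches.sort(key=lambda wp: wp[1], reverse=True)
    return [w for w, _ in matches[:3]]
```

Calling `split_input` after every keystroke gives the context words and the partial word; the three words returned by `top_three` would then become the button labels, with the most probable word first.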

Data Cleaning:

  • Built the corpora from about 4M lines of the provided blog, news, and Twitter text
  • Split the corpora into three parts by random sampling with the tm package
    . training (60%)
    . validation (20%)
    . testing (20%)
  • Further split the training set into 6 parts to process them in parallel on a Linux 16-CPU, 64 GB AWS HVM instance
  • Used Perl regular expressions to remove profanity from the input datasets
  • Used the quanteda package to remove
    . non-ASCII characters
    . punctuation
    . digits and white space
    . symbols and hyphens
    . URLs and separators
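
The split and cleaning steps above can be sketched as follows. This is a rough Python illustration using only the standard library; the project itself used the tm and quanteda packages in R. The regular expressions, the placeholder `BANNED` word list, and the function names are assumptions made for the example.

```python
import random
import re

URL_RE = re.compile(r"(?:https?://|www\.)\S+")
NON_ASCII_RE = re.compile(r"[^\x00-\x7F]+")
# Anything other than lowercase letters, apostrophes, and whitespace:
# punctuation, digits, symbols, and hyphens.
NON_LETTER_RE = re.compile(r"[^a-z'\s]+")
BANNED = {"darn", "heck"}  # placeholder list, not the project's profanity list


def clean_line(line, banned=BANNED):
    """Return the lowercase word tokens of one line after cleaning."""
    line = URL_RE.sub(" ", line)
    line = NON_ASCII_RE.sub(" ", line)
    line = NON_LETTER_RE.sub(" ", line.lower())
    tokens = (tok.strip("'") for tok in line.split())
    return [tok for tok in tokens if tok and tok not in banned]


def split_corpus(lines, seed=42):
    """Randomly split lines into training (60%), validation (20%), testing (20%)."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 6 // 10
    n_valid = len(shuffled) * 2 // 10
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_valid],
        shuffled[n_train + n_valid:],
    )
```

The fixed seed only makes the sketch reproducible. The training part could then be cut into 6 chunks and cleaned in parallel, one chunk per worker.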

Prediction Algorithm:

  • Generated 1- to 4-grams from each part of the training data using quanteda
  • Merged all n-grams into one final data table
  • Calculated probabilities for the last word of every n-gram via a copy of the n-gram frequency hash table
  • Merged the n-gram probabilities from all 6 training parts into one final table
  • Validation and test parts were put through similar processing steps
  • Tried interpolation on a 1% sample and found no major improvement in accuracy
  • Pruned the final set of n-grams to limit the size of the in-memory object to 86 MB
  • Used the back-off technique
    . it tries to match on 4-gram first if the input has at least three prior words
    . it returns the last words of the top three matching 4-grams that start with the input string passed
    . if there is no matching 4-gram, it tries 3-gram, then 2-gram, and finally 1-gram
  • Achieved a 46.39% success rate on the test data when n-grams with a frequency of 1 were discarded
  • Correctly predicted 50% of the words in Quiz 2 and 30% in Quiz 3
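
The n-gram counting, pruning, and back-off lookup described above can be sketched as follows. This is a simplified Python illustration, not the project's R implementation. It assumes the probability of a last word is its relative frequency among the surviving n-grams that share the same preceding words; the names `count_ngrams`, `build_model`, and `predict` are hypothetical.

```python
from collections import Counter, defaultdict

MAX_N = 4  # longest n-gram used, as in the 1- to 4-gram tables above


def count_ngrams(sentences, max_n=MAX_N):
    """Count every 1- to max_n-gram; each sentence is a list of tokens."""
    counts = Counter()
    for tokens in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def build_model(counts, min_count=2):
    """Map each context (preceding words) to its three most likely last words.

    N-grams seen fewer than min_count times are discarded, so the default
    drops frequency-1 n-grams. Probabilities are relative frequencies over
    the n-grams that survive pruning (an assumption of this sketch).
    """
    by_context = defaultdict(Counter)
    for ngram, count in counts.items():
        if count >= min_count:
            by_context[ngram[:-1]][ngram[-1]] = count
    model = {}
    for context, followers in by_context.items():
        total = sum(followers.values())
        model[context] = [(w, c / total) for w, c in followers.most_common(3)]
    return model


def predict(model, words, max_n=MAX_N):
    """Back off from the longest available context down to the 1-gram table."""
    for k in range(min(max_n - 1, len(words)), -1, -1):
        context = tuple(words[len(words) - k:])  # k == 0 gives the empty context
        if context in model:
            return [w for w, _ in model[context]]
    return []
```

For example, with a model built from a few short sentences, `predict(model, ["i", "love"])` first looks for n-grams starting with "i love" and falls back to shorter contexts only when nothing matches.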

Shiny App:

The TextSmart UI is built with the Shiny package and is hosted at:

https://sontivr.shinyapps.io/WordPrediction