Data Science Capstone Project
Ramana Sonti
June 17, 2017
Introduction/Executive Summary:
TextSmart is designed to predict the next word and/or the rest of the current word that's being typed.
The user interface has the following input/output components:
. the text area for inputting the message
. three buttons whose labels get updated with the predicted words
. the output message area below the buttons
The input from the text field is fed through the prediction routine after each character is typed
The prediction routine returns the three most probable words, one of which is expected to
. autocomplete the rest of the word when a non-space character is typed or
. match the next word when a space is typed
The three buttons are updated with the predicted words; the word on the first button has the highest probability
The user can hit the button that has the predicted word to add it to the input text and continue typing
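The two prediction modes can be told apart by the last character typed. A minimal Python sketch of that decision follows; the app itself is built in R with Shiny, and the function name `split_input` is hypothetical, used here only for illustration:

```python
def split_input(text):
    """Split typed text into (context_words, partial_word).

    If the text is empty or ends with whitespace, the last word is complete,
    so partial_word is "" and the task is next-word prediction.
    Otherwise the last token is an unfinished word to autocomplete.
    """
    words = text.lower().split()
    if not words or text[-1].isspace():
        return words, ""
    return words[:-1], words[-1]
```

With this split, "how are " yields the context `["how", "are"]` and an empty partial word, while "how ar" yields the context `["how"]` and the partial word "ar".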
Data Cleaning:
Built the corpora from about 4M lines of the blogs, news, and Twitter feeds provided
Split the corpora into three parts using the tm package with random sampling
. training (60%)
. validation (20%)
. testing (20%)
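The random 60/20/20 split can be sketched in Python as below. This is an illustration only, not the tm-based R code used in the project; the function name `split_corpus` and the fixed seed are assumptions made for reproducibility of the example:

```python
import random

def split_corpus(lines, seed=42):
    """Randomly split lines into training (60%), validation (20%) and testing (20%)."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)   # seeded shuffle, so the split is repeatable
    n_train = len(shuffled) * 60 // 100     # integer arithmetic avoids float rounding
    n_valid = len(shuffled) * 20 // 100
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]     # whatever remains goes to testing
    return train, valid, test
```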
Further split the training set into 6 parts to process them in parallel on a Linux AWS HVM instance with 16 CPUs and 64G
Used Perl regular expressions to remove profane words from the input datasets
Used the quanteda package to remove
. non-ascii characters
. punctuation
. digits and white space
. symbols and hyphens
. URLs and separators
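The cleaning steps above can be approximated with plain regular expressions. The sketch below is in Python and does not reproduce quanteda's tokenizer or the Perl profanity filter; the patterns, the `clean_line` name, the choice to keep apostrophes inside words, and the `profanity` set passed in by the caller are all assumptions for illustration:

```python
import re

URL_RE = re.compile(r"(?:https?://|www\.)\S+")   # URLs
NON_ASCII_RE = re.compile(r"[^\x00-\x7f]")       # non-ASCII characters
NON_LETTER_RE = re.compile(r"[^a-z' ]")          # digits, punctuation, symbols, hyphens

def clean_line(line, profanity=frozenset()):
    """Return the lowercase word tokens of one line after cleaning."""
    text = URL_RE.sub(" ", line)          # strip URLs before punctuation is removed
    text = NON_ASCII_RE.sub(" ", text)
    text = text.lower()
    text = NON_LETTER_RE.sub(" ", text)
    tokens = [t.strip("'") for t in text.split()]   # split() also collapses white space
    return [t for t in tokens if t and t not in profanity]
```

URLs are removed first because stripping punctuation earlier would break them into ordinary-looking words.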
Prediction Algorithm:
Generated 1- to 4-grams from each part of the training data using quanteda
Merged all ngrams into one final data table
Calculated the probability of the last word of every ngram via a copy of the ngram-frequency hash table
Merged the probabilities of ngrams from all 6 parts for training into one final table
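One way to picture the counting and probability steps is the Python sketch below. It assumes maximum-likelihood estimates, that is, the count of an ngram divided by the count of its prefix; the source does not spell out the exact formula, and the function names are hypothetical:

```python
from collections import Counter

def count_ngrams(sentences, max_n=4):
    """Count all 1- to max_n-grams in tokenized sentences; keys are tuples of words."""
    counts = Counter()
    for tokens in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts

def last_word_probabilities(counts):
    """Map each ngram to an estimate of P(last word | preceding words).

    For n >= 2: count(ngram) / count(its (n-1)-word prefix).
    For unigrams: count(word) / total number of word tokens.
    """
    total_words = sum(c for gram, c in counts.items() if len(gram) == 1)
    probs = {}
    for gram, c in counts.items():
        denom = total_words if len(gram) == 1 else counts[gram[:-1]]
        probs[gram] = c / denom
    return probs
```

Counts produced separately for each part of the data can be combined by adding the `Counter` objects before the probabilities are computed.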
Validation and test parts were put through similar processing steps
Tried interpolation on a 1% sample and found no major improvement in accuracy
Pruned the final set of ngrams to limit the size of the object in memory to 86 MB
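A simple form of pruning is a frequency cutoff. The Python sketch below drops rare ngrams from a count table; the `prune` name and the `min_count` parameter are hypothetical, and the project may have applied further pruning that is not described here:

```python
def prune(counts, min_count=2):
    """Keep only ngrams seen at least min_count times.

    min_count=2 discards every ngram whose frequency is 1.
    """
    return {gram: c for gram, c in counts.items() if c >= min_count}
```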
Used the back-off technique
. it tries to match a 4-gram first if the input has at least three prior words
. it returns the last words of the top three matching 4-grams that start with the input string passed
. if there is no matching 4-gram, it tries 3-gram, then 2-gram, and finally 1-gram
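The back-off lookup described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the project's R implementation: `probs` is assumed to map word tuples to the probability of their last word, the names `build_index` and `predict` are hypothetical, and ties are broken alphabetically:

```python
from collections import defaultdict

def build_index(probs):
    """Group ngrams by prefix: prefix tuple -> [(last_word, probability), ...], best first."""
    index = defaultdict(list)
    for gram, p in probs.items():
        index[gram[:-1]].append((gram[-1], p))
    for candidates in index.values():
        candidates.sort(key=lambda wp: (-wp[1], wp[0]))
    return dict(index)

def predict(index, context, partial="", k=3, max_n=4):
    """Return up to k words, trying the longest usable context first and backing off."""
    context = tuple(context)
    # n is the number of prior words used: 3 (4-gram), 2, 1, then 0 (unigrams)
    for n in range(min(max_n - 1, len(context)), -1, -1):
        prefix = context[len(context) - n:]
        matches = [w for w, _ in index.get(prefix, []) if w.startswith(partial)]
        if matches:
            return matches[:k]
    return []
```

Passing a non-empty `partial` restricts candidates to words that start with the characters typed so far, which covers the autocomplete case; an empty `partial` gives plain next-word prediction.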
Achieved a 46.39% success rate on the test data when ngrams with frequency==1 were discarded
Predicted 50% of the words from Quiz 2 and 30% from Quiz 3
Shiny App:
The UI has been built using the Shiny package and is hosted at TextSmart