Data Science Capstone Project: SwiftKey Text Prediction

Camille P
Aug 2015

  • The goal of this project is to create an application that gives the predicted next word after a user enters a phrase or partial sentence.
  • The corpus was provided by SwiftKey and includes text from Twitter, news sources, and blogs. It contains roughly 3 million lines of text. Because of processing limits and the size limit of the Shiny host, only about 15% of that data is currently used by the application.

Data Cleaning & Processing

  • The data was imported, each set was cleaned of profanity, and contractions were replaced with their full words.
  • Using the stylo package for R, the data was then combined and processed with the txt.to.words() function. This function removes punctuation, converts all text to lowercase, removes numbers, and cleans up the resulting white space. Initially each of these tasks was done separately, but the stylo function proved to be the most concise and efficient approach.
  • The app uses n-grams as the method to build the prediction algorithm. Tables of unigrams, bigrams, trigrams, and 4-grams were constructed after the cleaning and processing of the data.
  • When the user selects “Submit”, the input is cleaned in the same way so that it matches the format of the n-gram tables.
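
The cleaning and table-building steps above can be sketched roughly as follows. The app itself is written in R with stylo; this Python version is only an illustration of the idea, and the names clean_text, ngram_counts, and the small CONTRACTIONS map are assumptions made for the example, not the app's actual code. Profanity filtering is left out.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny contraction map; the app's real list is not shown.
CONTRACTIONS = {"can't": "cannot", "i'm": "i am", "it's": "it is", "don't": "do not"}

def clean_text(text):
    """Lowercase, expand contractions, and keep only alphabetic words."""
    text = text.lower().replace("’", "'")
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Anything that is not a letter (punctuation, digits, spaces) acts as a separator.
    return re.findall(r"[a-z]+", text)

def ngram_counts(lines, n):
    """Count every run of n consecutive words, line by line."""
    counts = Counter()
    for line in lines:
        words = clean_text(line)
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

corpus = ["I'm going to the store.", "It's 5 o'clock; going to the gym!"]
# One table per order: unigrams (1) through 4-grams (4).
tables = {n: ngram_counts(corpus, n) for n in range(1, 5)}
print(tables[3][("going", "to", "the")])  # 2
```

Running the same clean_text function on the user's input is what keeps the input and the tables in the same format.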

Models & Algorithm

  • The prediction algorithm is a basic back-off model. The idea is to match the phrase against the highest-order n-grams first. If no match is found, the algorithm backs off to the next lower-order n-gram, repeating until it reaches the unigram level. This approach is based on the Katz back-off model.
  • To reduce noise in the data, and to help shorten load time and reduce file size, only n-grams that occurred multiple times were included. For 4-grams and trigrams I kept those occurring three or more times. For unigrams and bigrams I included those that occurred more than five times.
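
As a rough illustration of the frequency cut-offs described above, the following Python sketch prunes a toy trigram table. The thresholds come from the text ("more than five" is read as a count of at least 6). The function names, the toy counts, and the idea of storing only the most frequent next word per context are assumptions for the example, not a description of the app's R code.

```python
from collections import Counter

# Minimum counts per n-gram order, as stated above:
# unigrams and bigrams need more than five occurrences, trigrams and 4-grams at least three.
MIN_COUNT = {1: 6, 2: 6, 3: 3, 4: 3}

def prune(counts, n):
    """Keep only the n-grams whose count reaches the threshold for order n."""
    return Counter({gram: c for gram, c in counts.items() if c >= MIN_COUNT[n]})

def best_next_word(counts):
    """Map each context (all words but the last) to its most frequent next word."""
    best = {}
    for gram, c in counts.items():
        context, word = gram[:-1], gram[-1]
        if context not in best or c > best[context][1]:
            best[context] = (word, c)
    return {context: word for context, (word, _) in best.items()}

# Toy trigram counts, made up for illustration.
trigrams = Counter({
    ("going", "to", "the"): 7,
    ("going", "to", "be"): 4,
    ("going", "to", "bed"): 2,   # below the threshold of 3, so it is dropped
})
kept = prune(trigrams, 3)
lookup = best_next_word(kept)
print(len(kept))                 # 2
print(lookup[("going", "to")])   # the
```

Dropping rare n-grams removes many one-off typos and shrinks the tables, at the cost of losing some legitimate but uncommon phrases.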

Implementation

  • Initial tests showed that users were very likely to enter input that had no match. As a result, if the input has more than one word, the algorithm removes the first word from the phrase and checks for a match with the remaining words. If there is still no match once only the last word remains, a message is shown indicating that there is no match and suggesting that the user check their spelling. Because of the limited file size available, it is highly likely that some words are simply not included. The data currently includes just under 400,000 unique words.
  • The user may input profanity, but since the data has been cleaned of all profanity, no match will be returned, only the aforementioned message.
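
The fallback behaviour described above can be sketched as a short loop. This is a minimal Python illustration under stated assumptions: the LOOKUP table, its entries, the predict function, and the exact wording of the no-match message are all hypothetical and stand in for the app's real n-gram tables and R code.

```python
# Hypothetical lookup table: context tuple -> predicted next word.
# Three-word keys would come from 4-grams, two-word keys from trigrams,
# and one-word keys from bigrams.
LOOKUP = {
    ("one", "of", "the"): "most",
    ("of", "the"): "year",
    ("the",): "first",
}

NO_MATCH = "No match found. Please check your spelling."

def predict(words, max_context=3):
    """Try the longest context first, then drop the first word and retry."""
    context = tuple(words[-max_context:])
    while context:
        if context in LOOKUP:
            return LOOKUP[context]
        context = context[1:]  # back off: remove the first word
    return NO_MATCH

print(predict(["one", "of", "the"]))   # most
print(predict(["best", "of", "the"]))  # year
print(predict(["beside", "the"]))      # first
print(predict(["xyzzy"]))              # No match found. Please check your spelling.
```

Because profanity was removed from the training data, a profane last word simply has no entry in the table and falls through to the same message.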

Instructions

  • The application takes a few seconds to load completely. I suggest allowing up to 10 seconds. It has not taken more than 3 or 4 seconds for me, but that may vary with internet speed.

  • After the app has completely loaded, there is very little lag in retrieving results. Generally there is just a quick flash before the results appear.

  • You may enter up to 3 words. If you use contractions, be aware that the algorithm replaces them with full words, so each contraction counts as two words once replaced.

  • The application may be found at https://camillersr.shinyapps.io/Capstone