The Text Prediction Application

Capstone Project for the Johns Hopkins-Coursera Data Science Series

Mark Culp, August 27, 2017

About the Data

The data set used for this project was provided by the SwiftKey Company, which is based in central London. The SwiftKey corpora used in the project comprised:

  • 899,288 blogs
  • 77,259 news articles
  • 2,360,148 tweets

The R “tm” and “RWeka” packages were used to mine the text and create a series of 2- and 3-term “n-grams.” These n-grams are used to predict the next word a user types, based on either the last word typed or the combination of the last and second-to-last words typed.
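
As a rough illustration of what 2- and 3-term n-gram counting involves, here is a minimal sketch. It is written in Python for brevity and is not the project's R code; the function names, the whitespace tokenizer, and the sample sentences are all assumptions made for the example.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngrams(lines, n):
    """Count n-grams line by line so that no n-gram spans two documents."""
    counts = Counter()
    for line in lines:
        # Assumption: lowercase + whitespace split stands in for real tokenizing.
        counts.update(ngrams(line.lower().split(), n))
    return counts

# Made-up sample text, not taken from the SwiftKey corpora.
sample = ["thanks for the follow", "thanks for the update"]
bigrams = count_ngrams(sample, 2)   # keyed by (last word, next word)
trigrams = count_ngrams(sample, 3)  # keyed by (second-to-last, last, next word)
```

In this sketch the last element of each tuple is the word to be predicted, and the preceding one or two elements are the words the user has already typed.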

Challenges Faced

In an effort to cover every 2- and 3-term n-gram contained in the corpora, the entire data set was processed in chunks of approximately 10 megabytes of text. The “merge.R” script then merged the 2- and 3-term n-grams from each chunk into a single data frame that summarized the n-gram word counts. The final, merged data frame contained approximately 36.8 million lines of text.
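
The chunk-then-merge idea can be sketched as follows. This is an illustrative Python outline, not the contents of “merge.R”: the helper names are hypothetical, and chunks are sized by number of lines here, not by megabytes, to keep the example small.

```python
from collections import Counter
from itertools import islice

def chunks(lines, size):
    """Yield successive lists of at most `size` lines."""
    it = iter(lines)
    while True:
        block = list(islice(it, size))
        if not block:
            return
        yield block

def count_bigrams(lines):
    """Count adjacent word pairs within each line."""
    counts = Counter()
    for line in lines:
        words = line.lower().split()
        counts.update(zip(words, words[1:]))
    return counts

def merge_counts(partials):
    """Sum the per-chunk counts into one table."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts, it does not overwrite
    return total

# Made-up sample text; each chunk is counted independently, then merged.
corpus = ["see you soon", "see you later", "talk to you soon"]
partials = [count_bigrams(block) for block in chunks(corpus, 2)]
merged = merge_counts(partials)
```

Because counts are simply added, the merged table matches what a single pass over the whole corpus would produce, while only one chunk needs to be held in memory during counting.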

An Alienware computer with 16 GB of RAM was used to process the data. The “tokenizer.R” script took between 60 and 90 minutes to filter and create each n-gram pair. The “merge.R” script took about 20 to 30 minutes to merge the n-gram pairs and add them to the final data frame.

Sorting and Filtering

The final data frame was split alphabetically into a series of 42 R object files (.rds) for quick retrieval by the text prediction application. These files preserve most of the unique word combinations discovered, with the following filtering applied:

  • Twitter hash tags were removed.
  • Common English contractions were converted to full words.
  • Objectionable words were removed.
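
The three filters listed above might look something like the sketch below. It is an illustrative Python outline, not the project's filtering code: the contraction table is a tiny hypothetical sample, the blocklist holds a harmless placeholder word, and the hashtag pattern is an assumption.

```python
import re

# Hypothetical, deliberately tiny lookup tables for illustration only.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "don't": "do not",
    "i'm": "i am",
}
BLOCKLIST = {"darn"}  # placeholder standing in for an objectionable-word list

def clean(text):
    """Drop hashtags, expand contractions, and remove blocklisted words."""
    text = text.lower()
    text = re.sub(r"#\w+", " ", text)  # assumption: a hashtag is '#' plus word characters
    kept = []
    for word in text.split():
        expanded = CONTRACTIONS.get(word, word)
        kept.extend(w for w in expanded.split() if w not in BLOCKLIST)
    return " ".join(kept)
```

For example, `clean("I'm so happy #blessed")` yields `"i am so happy"`. This sketch only matches straight apostrophes; text containing typographic apostrophes would need to be normalized first.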

Two final reduction scripts shrank the data set further, at some cost to the application's accuracy, while preserving the most popular words and word combinations:

  • Matching word combinations below certain count thresholds were removed.
  • Matching word sets were limited to the 4 most popular terms.
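
The two reduction steps above can be sketched as a single pruning pass. This is an illustrative Python outline, not the project's reduction scripts: the threshold value is hypothetical (the actual thresholds are not stated here), and ties are broken alphabetically as an arbitrary choice.

```python
from collections import defaultdict

def prune(counts, min_count=2, top_k=4):
    """Drop rare n-grams, then keep the top_k next words for each prefix.

    `counts` maps n-gram tuples to frequencies; `min_count` is a
    hypothetical threshold chosen only for this example.
    """
    by_prefix = defaultdict(list)
    for ngram, count in counts.items():
        if count >= min_count:
            # Prefix = the typed word(s); last element = the predicted word.
            by_prefix[ngram[:-1]].append((ngram[-1], count))
    return {
        prefix: sorted(cands, key=lambda wc: (-wc[1], wc[0]))[:top_k]
        for prefix, cands in by_prefix.items()
    }

# Made-up counts for illustration.
counts = {
    ("of", "the"): 9, ("of", "a"): 5, ("of", "my"): 4,
    ("of", "course"): 3, ("of", "this"): 2, ("of", "late"): 1,
}
pruned = prune(counts)
```

Here `("of", "late")` falls below the threshold and `("of", "this")` is cut by the top-4 limit, which illustrates the trade-off: a smaller table, but less frequent continuations can no longer be predicted.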

The Data Product

The final application predicts the most commonly used word combinations obtained from the SwiftKey corpora. The most likely next word is derived from the user's input and appended to it. Up to three alternative guesses are shown in a separate alternative-answer text box.
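
A lookup of this kind could be sketched as below. This is an illustrative Python outline, not the application's code: the table contents are made up, and trying the last two words before falling back to the last word alone is an assumption about how the two n-gram sizes are combined.

```python
def predict(text, table, max_alternatives=3):
    """Return (text with the best guess appended, list of alternative guesses).

    `table` maps a tuple of preceding words to candidate next words,
    ordered from most to least frequent.
    """
    words = text.lower().split()
    for n in (2, 1):  # assumption: try the last two words, then the last word
        if len(words) < n:
            continue
        candidates = table.get(tuple(words[-n:]))
        if candidates:
            best = candidates[0]
            alternatives = candidates[1:1 + max_alternatives]
            return text.rstrip() + " " + best, alternatives
    return text, []  # nothing matched: leave the input unchanged

# Made-up lookup table for illustration.
table = {
    ("for", "the"): ["first", "last", "best", "next"],
    ("the",): ["same", "first", "most", "world"],
}
completed, alternatives = predict("Thanks for the", table)
```

With at most four candidates stored per prefix, one becomes the appended guess and the remaining three fill the alternatives list.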

Text Predict App