Capstone Project: Word Predicting Application

Fenton Taylor
23 March, 2017


The Task

This application was created for the final assignment of the Johns Hopkins University Data Science Specialization Capstone Project, which was co-sponsored by SwiftKey.

The task was to create a model based on texts taken from online blogs, news articles, and Twitter posts (source) and use that model in an algorithm to predict the next word in a phrase.

The ultimate goal was to implement the model and prediction algorithm in a Shiny application that takes user input text and generates a prediction for the next word.

The Model

An N-gram model was created from the text sources (an N-gram is a contiguous sequence of n items from a given text). The tm package was used to build a corpus of the texts for the necessary pre-processing. Pre-processing included various methods of cleaning and preparing the text, such as converting to ASCII and lower case, filtering profanity and web text, and tagging the end of each sentence. For example:

[1] Raw Text: John gave her WHAT carat diamond?!?! #crazy
[1] Clean Text:  john gave her what carat diamond <EOS> 
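
The cleaning steps above can be sketched roughly as follows. This is an illustrative Python sketch, not the tm-based R code used in the project: the function name clean_text, the regular expressions, the placeholder token, and the profanity argument are all assumptions made for the example.

```python
import re
import unicodedata

EOS = "<EOS>"

def clean_text(raw, profanity=frozenset()):
    """Illustrative cleaning pipeline (hypothetical; not the project's tm code)."""
    # Convert to ASCII by stripping accents and dropping anything left over.
    text = unicodedata.normalize("NFKD", raw).encode("ascii", "ignore").decode("ascii")
    text = text.lower()
    # Web-text filtering (assumed rules): drop URLs, hashtags, and @mentions.
    text = re.sub(r"https?://\S+|www\.\S+|[#@]\w+", " ", text)
    # Mark each run of sentence-ending punctuation with a placeholder word.
    # (Assumes the literal word "eosmarker" never occurs in the input.)
    text = re.sub(r"[.!?]+", " eosmarker ", text)
    # Keep only letters, apostrophes, and spaces.
    text = re.sub(r"[^a-z' ]", " ", text)
    tokens = [t for t in text.split() if t not in profanity]
    cleaned = []
    for t in tokens:
        if t == "eosmarker":
            # Collapse consecutive end-of-sentence tags into one.
            if cleaned and cleaned[-1] == EOS:
                continue
            cleaned.append(EOS)
        else:
            cleaned.append(t)
    return " ".join(cleaned)

print(clean_text("John gave her WHAT carat diamond?!?! #crazy"))
# john gave her what carat diamond <EOS>
```

Under these assumed rules the sketch reproduces the clean text shown in the example above; the actual filters used in the project may differ.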

After the corpus was pre-processed, N-grams of lengths 1 to 6 were created using the quanteda package, and the frequency of each N-gram was counted. Finally, Simple Good-Turing smoothing was performed to adjust for unseen words and N-grams. Example:

           words  freq r_smooth      pr
1     one of the 10529  10527.5 0.00084
2       a lot of  8975   8973.5 0.00071
3 thanks for the  7242   7240.5 0.00058
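
As a rough illustration of the counting step, the Python sketch below tallies N-grams sentence by sentence and then applies the basic Turing estimate r* = (r + 1) N(r+1) / N(r), where N(r) is the number of distinct N-grams seen exactly r times. It is not the quanteda code used in the project, and it is not full Simple Good-Turing, which additionally smooths the N(r) values before applying the estimate. The function names and the fallback to the raw count are assumptions made for the example.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-token sequences as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngrams(sentences, n):
    """Count n-grams within each sentence (assumes n-grams do not span sentences)."""
    counts = Counter()
    for sentence in sentences:
        counts.update(ngrams(sentence.split(), n))
    return counts

def turing_adjusted(counts):
    """Basic Turing estimate of adjusted counts (a simplification of Simple Good-Turing)."""
    freq_of_freq = Counter(counts.values())  # N(r): how many n-grams occur r times
    adjusted = {}
    for gram, r in counts.items():
        n_r = freq_of_freq[r]
        n_r1 = freq_of_freq.get(r + 1, 0)
        # Fall back to the raw count when N(r+1) is zero (assumed behavior).
        adjusted[gram] = (r + 1) * n_r1 / n_r if n_r1 else float(r)
    return adjusted

trigrams = count_ngrams(["one of the best", "one of the few", "a lot of people"], 3)
print(trigrams.most_common(1))
# [('one of the', 2)]
```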

The Prediction Function

The prediction algorithm does the following:

  • Takes user text input
  • Pre-processes the text to match the format of the cleaned corpus text
  • Searches the appropriate highest-order N-gram list for the user’s text
  • If no match is found, backs off to successively shorter N-grams (Stupid Backoff) until a match is found
  • Returns up to the top 5 words that complete the N-gram
  • If no matches are found at any order, returns the 5 most common words
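
The steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the project's R implementation: the table layout, the function names build_tables and predict, and the back-off factor of 0.4 (the value commonly used with Stupid Backoff) are choices made for the example. The sketch returns candidates from the first order that matches rather than merging scores across orders, and it expects input that has already been cleaned.

```python
from collections import Counter, defaultdict

def build_tables(sentences, max_n=3):
    """tables[n][context] is a Counter of the words that follow the (n-1)-word context."""
    tables = {n: defaultdict(Counter) for n in range(1, max_n + 1)}
    for sentence in sentences:
        toks = sentence.split()
        for n in range(1, max_n + 1):
            for i in range(len(toks) - n + 1):
                context = " ".join(toks[i:i + n - 1])  # empty string for unigrams
                tables[n][context][toks[i + n - 1]] += 1
    return tables

def predict(text, tables, k=5, alpha=0.4):
    """Return up to k (word, score) pairs, backing off to shorter contexts as needed."""
    toks = text.lower().split()
    max_n = max(tables)
    penalty = 1.0
    # Try the longest usable context first, then shorten it one word at a time.
    for n in range(min(max_n, len(toks) + 1), 1, -1):
        context = " ".join(toks[len(toks) - (n - 1):])
        followers = tables[n].get(context)
        if followers:
            total = sum(followers.values())
            return [(w, penalty * c / total) for w, c in followers.most_common(k)]
        penalty *= alpha  # each back-off step discounts the score by alpha
    # No context matched at any order: fall back to the most common single words.
    unigrams = tables[1][""]
    total = sum(unigrams.values())
    return [(w, penalty * c / total) for w, c in unigrams.most_common(k)]

corpus = ["one of the best", "one of the best", "one of the few", "a lot of people"]
tables = build_tables(corpus, max_n=3)
print([w for w, _ in predict("one of the", tables)])
# ['best', 'few']
```

Storing followers keyed by context makes each lookup a single dictionary access per order, which is one plausible way to keep prediction time low; the project's actual data structures are not described in the source.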

Model Performance:

     model top_acc top3_acc top5_acc avg_time
1    Speed   0.170    0.250    0.293    0.103
2 Accuracy   0.186    0.258    0.298    0.652
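
One way figures like top_acc, top3_acc, and top5_acc could be computed is sketched below in Python. The evaluation procedure is not described in the source, so this is an assumption: the function name top_k_accuracy, the (context, actual next word) test pairs, and the predictor interface are hypothetical.

```python
def top_k_accuracy(predict_fn, test_pairs, ks=(1, 3, 5)):
    """Fraction of test cases whose true next word is among the top k predictions.

    predict_fn(context) is assumed to return a ranked list of candidate words.
    """
    if not test_pairs:
        return {k: 0.0 for k in ks}
    hits = {k: 0 for k in ks}
    for context, actual in test_pairs:
        ranked = predict_fn(context)
        for k in ks:
            if actual in ranked[:k]:
                hits[k] += 1
    return {k: hits[k] / len(test_pairs) for k in ks}

# Toy predictor that always returns the same ranking, for demonstration only.
def toy_predictor(context):
    return ["the", "to", "and", "a", "of"]

pairs = [("one of", "the"), ("a lot", "of"), ("thanks for", "coming")]
print(top_k_accuracy(toy_predictor, pairs))
```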

The App

The Shiny Application provides a simple user interface to interact with the prediction algorithm and the data. Notable features include:

  • Interactive command line to input text for prediction
  • Parameter selections for the algorithm: model choice, number of results
  • Interactive plot of most frequent N-grams
  • Interactive examples of pre/post-processed text
