Capstone Project: Text Prediction Application

Erich F Gruhn
September 01, 2018


The Task

This Shiny application was created for the Capstone assignment of the Johns Hopkins University Data Science Specialization. The project was co-sponsored by SwiftKey.

The assignment was to create a model from texts taken from online blogs, news articles, and Twitter posts using the following dataset: (Coursera-SwiftKey-Datasets)

The model was to be used to build an algorithm that may best predict the next word in a phrase based on the many phrases collected in the original data.

The requirement for this project was to create a Shiny application that responds to user input, takes the word(s) entered and predicts the next word.

The Model

The model is built on N-grams: contiguous sequences of “n” items taken from a given text, here drawn from the various text sources. The tm package was primarily used to create a corpus of the texts and to carry out pre-processing. Pre-processing cleaned and prepared the text through steps that included removal of profanity, web-text filtering, conversion to lower case and to ASCII, elimination of numbers, and end-of-sentence tagging. For example:

[1] Raw Text: These guys are playing WAY BEYOND their normal game!!! #gocavs
[1] Clean Text:  these guys are playing way beyond their normal game <EOS> 
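The cleaning steps listed above can be sketched as follows. This is a minimal Python illustration, not the tm-based R code used for the project: the function name clean_text, the regular expressions, and the one-word PROFANITY placeholder are all assumptions made for the example.

```python
import re
import unicodedata

# Hypothetical placeholder; the source does not show its profanity list.
PROFANITY = {"darn"}

def clean_text(raw):
    """Return a cleaned sentence ending with an <EOS> tag."""
    # Decompose accented letters, then drop anything that is not ASCII.
    text = unicodedata.normalize("NFKD", raw).encode("ascii", "ignore").decode("ascii")
    text = text.lower()
    # Web-text filtering: remove URLs, hashtags, and @mentions.
    text = re.sub(r"https?://\S+|www\.\S+|[#@]\w+", " ", text)
    # Remove digits, then anything that is not a letter, apostrophe, or space.
    text = re.sub(r"\d+", " ", text)
    text = re.sub(r"[^a-z' ]+", " ", text)
    words = [w for w in text.split() if w not in PROFANITY]
    return " ".join(words + ["<EOS>"])

print(clean_text("These guys are playing WAY BEYOND their normal game!!! #gocavs"))
# prints: these guys are playing way beyond their normal game <EOS>
```

Applied to the raw text above, the sketch produces the same words as the cleaned example, apart from surrounding whitespace.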

Once pre-processing of the corpus was complete, six sets of N-grams were created, of sizes 1, 2, 3, 4, 5, and 6 words respectively. This work was done with the quanteda package, and the frequency of each N-gram was then counted. The final step was to adjust for unseen words using simple Good-Turing smoothing. For example:

           words  freq r_smooth      pr
1     one of the 10529  10527.5 0.00084
2       a lot of  8975   8973.5 0.00071
3 thanks for the  7242   7240.5 0.00058
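To make the counting and smoothing steps concrete, here is a simplified Python sketch. It is not the quanteda code used for the project, and it covers only the log-linear part of simple Good-Turing: the frequencies of frequencies are averaged into Z_r, a line is fitted to log(Z_r) against log(r), and its slope b gives the adjusted count r* = r (1 + 1/r)^(b + 1). The function names are assumptions, and the fit needs at least two distinct counts.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count every contiguous run of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def fit_slope(counts):
    """Least-squares slope b of log(Z_r) against log(r)."""
    freq_of_freq = Counter(counts.values())  # N_r: how many n-grams occur r times
    rs = sorted(freq_of_freq)
    xs, ys = [], []
    for i, r in enumerate(rs):
        lo = rs[i - 1] if i > 0 else 0
        hi = rs[i + 1] if i + 1 < len(rs) else 2 * r - lo
        # Z_r spreads N_r over the gap to the neighbouring observed counts.
        z = 2 * freq_of_freq[r] / (hi - lo)
        xs.append(math.log(r))
        ys.append(math.log(z))
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx

def smoothed_count(r, b):
    """Adjusted count r* = r * (1 + 1/r) ** (b + 1)."""
    return r * (1 + 1 / r) ** (b + 1)

trigrams = ngram_counts("one of the best and one of the worst".split(), 3)
print(trigrams[("one", "of", "the")])  # prints: 2
```

For large counts the formula reduces to roughly r + (b + 1). The r_smooth values in the table above sit 1.5 below freq, which would be consistent with a fitted slope near -2.5, although the source does not state the slope.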

The Prediction Function

This prediction algorithm will do the following:

  • Allow the user to input text
  • Pre-process the text to align with the format of the cleaned corpus
  • Start a search of the associated highest-order N-gram list based on the user’s text
  • Should no match be found, the application will perform Katz Backoff as needed until a match is found
  • The application allows up to the top 5 words that might complete the N-gram to be returned
  • Should no match be found after all backoffs are complete, the application will return the top 5 most common words
  • The user has the option to run the application based on higher Accuracy or faster Speed - this is outlined further in the next section on Model Performance
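The search order described in the list above can be sketched in Python as follows. This is only an illustration of the back-off lookup: full Katz back-off also discounts the higher-order estimates and reweights the lower-order ones, which is omitted here. The names predict_next, tables, and top_words, and the shape of the lookup table, are assumptions made for the example.

```python
def predict_next(words, tables, top_words, k=5):
    """Return up to k candidate next words for the cleaned input words.

    tables maps a context tuple to a list of (next_word, score) pairs.
    """
    max_context = max((len(c) for c in tables), default=0)
    # Start with the longest usable context, then drop the oldest word.
    for size in range(min(len(words), max_context), 0, -1):
        context = tuple(words[-size:])
        candidates = tables.get(context)
        if candidates:
            ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
            return [word for word, _ in ranked[:k]]
    # No context matched at any order: fall back to the most common words.
    return top_words[:k]

# Toy lookup table with made-up scores, for illustration only.
tables = {
    ("one", "of"): [("the", 0.5), ("my", 0.1)],
    ("of",): [("a", 0.2), ("the", 0.3)],
}
common = ["the", "to", "and"]

print(predict_next(["he", "is", "one", "of"], tables, common))  # prints: ['the', 'my']
print(predict_next(["lots", "of"], tables, common))             # prints: ['the', 'a']
print(predict_next(["zebra"], tables, common))                  # prints: ['the', 'to', 'and']
```

The second call shows one back-off step: the two-word context is not in the table, so the lookup retries with the last word alone.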

Model Performance

     model top_acc top3_acc top5_acc avg_time
1    Speed   0.170    0.250    0.293    0.103
2 Accuracy   0.186    0.258    0.298    0.652

The App

The Shiny application presents a basic user interface to the prediction algorithm, which works from a phrase entered by the user. Features included in this application are:

  • Slider selections for number of results
  • Interactive command line to input text for prediction
  • Options for the algorithm model choice
  • Interactive examples of pre/post-processed text where the profanity removal may be demonstrated
  • Bar plots of most frequent N-grams that may be selected based on user options
