Capstone Project: Text Prediction Application

Erich F Gruhn
September 01, 2018


The Task

This Shiny application was created for the Capstone assignment of the Johns Hopkins University Data Science Specialization. The project was co-sponsored by SwiftKey.

The assignment was to create a model from texts taken from online blogs, news articles, and Twitter posts using the following dataset: (Coursera-SwiftKey-Datasets)

The model was then to be used to build an algorithm that best predicts the next word in a phrase, based on the many phrases collected in the original data.

The requirement for this project was to create a Shiny application that responds to user input, takes the word(s) entered and predicts the next word.

The Model

The model uses N-grams: contiguous sequences of “n” items taken from a given sequence of text, created here from the various text sources. The tm package was primarily used to create a corpus of the texts for pre-processing. Pre-processing applied several cleaning steps, including removal of profanity, web-text filtering, conversion to lower case and to ASCII, elimination of numbers, and end-of-sentence tagging. For example:

[1] Raw Text: These guys are playing WAY BEYOND their normal game!!! #gocavs

[1] Clean Text:  these guys are playing way beyond their normal game <EOS>
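The cleaning steps listed above can be sketched roughly as follows. This is a minimal Python illustration, not the tm-based R pipeline the project used: the function name `clean_text`, the regular expressions, and the one-word `PROFANITY` set are all assumptions made for the example.

```python
import re
import unicodedata

# Hypothetical stand-in; the profanity list used in the project is not given.
PROFANITY = {"darn"}

def clean_text(raw, profanity=PROFANITY):
    """Approximate the cleaning steps described in the text."""
    # Fold accented characters to ASCII and drop anything that cannot be folded.
    text = unicodedata.normalize("NFKD", raw).encode("ascii", "ignore").decode("ascii")
    text = text.lower()
    # Web-text filtering: URLs, hashtags, and @mentions.
    text = re.sub(r"https?://\S+|www\.\S+|[#@]\w+", " ", text)
    # Eliminate numbers.
    text = re.sub(r"\d+", " ", text)
    cleaned = []
    # Split at sentence-ending punctuation and tag each sentence with <EOS>.
    for sentence in re.split(r"[.!?]+", text):
        words = re.findall(r"[a-z]+(?:'[a-z]+)?", sentence)
        words = [w for w in words if w not in profanity]
        if words:
            cleaned.append(" ".join(words) + " <EOS>")
    return " ".join(cleaned)

example = clean_text("These guys are playing WAY BEYOND their normal game!!! #gocavs")
print(example)
```

Run on the raw text from the example above, this sketch yields the same cleaned string, although the real pipeline may order or implement the steps differently.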

Once pre-processing of the corpus was complete, six N-gram sets were created, of sizes 1, 2, 3, 4, 5, and 6 words respectively. This work was done with the quanteda package, and the frequency of each N-gram in the six sets was then counted. The final step was to adjust for unseen words using simple Good-Turing smoothing. For example:

           words  freq r_smooth      pr
1     one of the 10529  10527.5 0.00084
2       a lot of  8975   8973.5 0.00071
3 thanks for the  7242   7240.5 0.00058
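To make the counting and smoothing steps concrete, here is a small Python sketch (the project itself used quanteda in R). It counts N-grams, then applies the core idea of simple Good-Turing: fit a line to log(N_r) against log(r), where N_r is the number of distinct N-grams seen exactly r times, and replace each count r with r* = r(1 + 1/r)^(b+1), where b is the fitted slope. The function names and the synthetic frequency-of-frequency data are assumptions, and the sketch leaves out parts of the full method (averaging N_r over gaps, switching between raw and fitted estimates, renormalising). For large counts the formula gives roughly r + (b + 1), so a slope near -2.5 would produce the constant gap of 1.5 between freq and r_smooth seen in the table; that slope is an inference from the printed numbers, not a value the source reports.

```python
import math
from collections import Counter

def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def fit_log_slope(freq_of_freq):
    """Least-squares slope b of log(N_r) against log(r)."""
    xs = [math.log(r) for r in freq_of_freq]
    ys = [math.log(n_r) for n_r in freq_of_freq.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

def smoothed_count(r, slope):
    """Good-Turing adjusted count under the fitted power law N_r = a * r**slope."""
    return r * (1 + 1 / r) ** (slope + 1)

# Counting: trigrams in a toy token list.
tokens = "one of the best one of the few".split()
trigrams = count_ngrams(tokens, 3)
print(trigrams[("one", "of", "the")])

# Smoothing: synthetic frequency-of-frequency data (made up for illustration).
synthetic = {r: 1e6 * r ** -2.5 for r in range(1, 21)}
slope = fit_log_slope(synthetic)
print(round(smoothed_count(10529, slope), 1))
```

In a real run the frequency-of-frequency table would come from the corpus, for example `Counter(trigrams.values())`, rather than from a made-up power law.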

The Prediction Function

This prediction algorithm will do the following:

Allow the user to input text
Pre-process the text to align with the format of the cleaned corpus
Start a search of the associated highest-order N-gram list based on the user’s text
Should no match be found, the application performs Katz backoff to successively lower-order N-grams until a match is found
The application returns up to the top 5 words that might complete the N-gram
Should no match be found after all backoffs are complete, the application will return the top 5 most common words
The user has the option to run the application for higher Accuracy or faster Speed; this is outlined further in the next section on Model Performance
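The search-and-back-off steps above can be sketched as follows. This is a simplified Python illustration under stated assumptions: it backs off by plain lookup, trying the longest context first and dropping the leading word on a miss, and does not apply the discounting and backoff weights of full Katz backoff. The names `build_tables` and `predict_next`, the toy corpus, and the maximum N-gram size of 3 (rather than 6) are choices made for the example.

```python
from collections import Counter, defaultdict

def build_tables(tokens, max_n=3):
    """Map each context (tuple of preceding words) to a Counter of next words.

    The empty context () holds the plain unigram counts.
    """
    tables = defaultdict(Counter)
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            *context, word = tokens[i:i + n]
            tables[tuple(context)][word] += 1
    return tables

def predict_next(text, tables, max_n=3, top_k=5):
    """Return up to top_k candidate next words, backing off on a miss."""
    words = text.lower().split()
    # Try the longest usable context first, then shorter ones, then unigrams.
    for size in range(min(max_n - 1, len(words)), -1, -1):
        context = tuple(words[len(words) - size:])
        if context in tables:
            return [w for w, _ in tables[context].most_common(top_k)]
    return []

corpus = "one of the best one of the few a lot of the time".split()
tables = build_tables(corpus)

print(predict_next("one of", tables))     # matched at the highest order
print(predict_next("unseen of", tables))  # backs off to the context ("of",)
print(predict_next("zzz", tables))        # falls back to the most common words
```

Storing the unigram counts under the empty context keeps the final fallback (the most common words) inside the same loop as the higher-order lookups.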

Model Performance:

     model top_acc top3_acc top5_acc avg_time
1    Speed   0.170    0.250    0.293    0.103
2 Accuracy   0.186    0.258    0.298    0.652
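One plausible way to produce figures like those in the table is sketched below in Python. The source does not say how the metrics were computed or what unit avg_time is in, so this is an assumption: a prediction counts as a top-k hit when the true next word is among the first k suggestions, and time is averaged per prediction. The `evaluate` function, the stub predictor, and the test cases are hypothetical.

```python
import time

def evaluate(predict, cases, ks=(1, 3, 5)):
    """Compute top-k accuracy and average prediction time over (context, target) pairs."""
    hits = {k: 0 for k in ks}
    elapsed = 0.0
    for context, target in cases:
        start = time.perf_counter()
        guesses = predict(context)
        elapsed += time.perf_counter() - start
        for k in ks:
            # A hit at k means the true word is among the first k suggestions.
            if target in guesses[:k]:
                hits[k] += 1
    n = len(cases)
    result = {f"top{k}_acc": hits[k] / n for k in ks}
    result["avg_time"] = elapsed / n  # seconds per prediction
    return result

def stub_predict(context):
    """Hypothetical predictor that always returns the same five words."""
    return ["the", "a", "of", "to", "and"]

cases = [("one of", "the"), ("a lot", "of"), ("thanks for", "zzz"), ("this", "and")]
scores = evaluate(stub_predict, cases)
print(scores)
```

Any function that maps a phrase to a ranked list of words can be passed as `predict`, which makes it possible to compare a faster and a more accurate model on the same held-out cases.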

The App

The Shiny application presents a basic user interface to the prediction algorithm, which works from a phrase the user enters. Features included in this application are:

Slider selections for number of results
Interactive command line to input text for prediction
Options for the algorithm model choice
Interactive examples of pre/post-processed text where the profanity removal may be demonstrated
Bar plots of most frequent N-grams that may be selected based on user options
