Erich F Gruhn
September 01, 2018
This Shiny application was created for the Capstone assignment of the Johns Hopkins University Data Science Specialization. The project was co-sponsored by SwiftKey.
The assignment was to create a model from texts taken from online blogs, news articles, and Twitter posts using the following dataset: (Coursera-SwiftKey-Datasets)
The model was to be used to build an algorithm that may best predict the next word in a phrase based on the many phrases collected in the original data.
The requirement for this project was to create a Shiny application that responds to user input, takes the word(s) entered and predicts the next word.
The model is built on N-grams, contiguous sequences of “n” items from a given text, created from the various text sources. The tm package was primarily used to build a corpus of the texts for pre-processing. Pre-processing applied several cleaning steps, including removal of profanity, web-text filtering, conversion to lower case, conversion to ASCII, elimination of numbers, and end-of-sentence tagging. For example:
[1] Raw Text: These guys are playing WAY BEYOND their normal game!!! #gocavs
[1] Clean Text: these guys are playing way beyond their normal game <EOS>
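The cleaning steps listed above can be sketched roughly as follows. This is a minimal Python illustration, not the tm-based R code used for the project: the function name clean_text, the regular expressions, and the one-word PROFANITY set are all assumptions made for the example, since the source does not show the actual filters or profanity list.

```python
import re

EOS = "<EOS>"
# Hypothetical placeholder; the source does not say which profanity list was used.
PROFANITY = {"darn"}


def clean_text(raw: str) -> str:
    """Apply the cleaning steps named in the text, in one plausible order."""
    text = raw.lower()
    # Keep ASCII only by dropping any character that cannot be encoded.
    text = text.encode("ascii", "ignore").decode("ascii")
    # Web-text filtering (assumed to mean URLs, hashtags, and @mentions).
    text = re.sub(r"(https?://\S+|www\.\S+|[#@]\w+)", " ", text)
    # Eliminate numbers.
    text = re.sub(r"\d+", " ", text)
    # Tag sentence endings: any run of . ! ? becomes a single EOS token.
    text = re.sub(r"[.!?]+", f" {EOS} ", text)

    tokens = []
    for tok in text.split():
        if tok == EOS:
            # Collapse consecutive end-of-sentence tags into one.
            if tokens and tokens[-1] == EOS:
                continue
            tokens.append(tok)
            continue
        # Strip remaining punctuation, keeping letters and apostrophes.
        word = re.sub(r"[^a-z']", "", tok)
        if word and word not in PROFANITY:
            tokens.append(word)
    return " ".join(tokens)


raw = "These guys are playing WAY BEYOND their normal game!!! #gocavs"
print(clean_text(raw))
```

On the raw text from the example above, this sketch produces the same clean text shown there, though the real pipeline may order or implement the steps differently.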
Once pre-processing of the corpus was complete, six sets of N-grams were created, of sizes 1, 2, 3, 4, 5, and 6 words respectively. This work was completed using the quanteda package, and the frequency of each N-gram was then counted. The final step was to adjust for unseen words using simple Good-Turing smoothing. For example:
           words  freq r_smooth      pr
1     one of the 10529  10527.5 0.00084
2       a lot of  8975   8973.5 0.00071
3 thanks for the  7242   7240.5 0.00058
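To make the counting and smoothing step more concrete, here is a small Python sketch. It is an illustration under stated assumptions, not the quanteda-based code behind the table above. It uses the basic Turing estimate r* = (r + 1) * N(r+1) / N(r), where N(r) is the number of distinct N-grams seen r times. Simple Good-Turing additionally replaces N(r) with a smoothed log-linear fit, which is omitted here. The function names and the fallback to the raw count when N(r+1) is zero are choices made for this example, and the resulting probabilities are not renormalized.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every contiguous run of n tokens, joined into one string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def unseen_mass(counts):
    """Good-Turing estimate of the total probability of unseen N-grams: N1 / N."""
    total = sum(counts.values())
    singletons = sum(1 for r in counts.values() if r == 1)
    return singletons / total if total else 0.0


def good_turing(counts):
    """Return (ngram, freq, r_smooth, pr) rows using the basic Turing estimate.

    Falls back to the raw count when no N-gram was seen r + 1 times
    (an assumption of this sketch, not of simple Good-Turing itself).
    """
    freq_of_freq = Counter(counts.values())
    total = sum(counts.values())
    rows = []
    for gram, r in counts.most_common():
        n_r = freq_of_freq[r]
        n_r1 = freq_of_freq.get(r + 1, 0)
        r_smooth = (r + 1) * n_r1 / n_r if n_r1 else float(r)
        rows.append((gram, r, r_smooth, r_smooth / total))
    return rows


tokens = "a b a b a c".split()
bigram_counts = Counter(ngrams(tokens, 2))
for row in good_turing(bigram_counts):
    print(row)
print("unseen mass:", unseen_mass(bigram_counts))
```

The raw fallback is what makes this unsuitable for real data, where high counts are sparse and N(r+1) is often zero. The smoothed fit in simple Good-Turing is meant to address exactly that.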
This prediction algorithm will do the following:
Model Performance:
     model top_acc top3_acc top5_acc avg_time
1    Speed   0.170    0.250    0.293    0.103
2 Accuracy   0.186    0.258    0.298    0.652
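The source does not define the accuracy columns. One reading, assumed here, is that top_acc, top3_acc, and top5_acc are the fractions of test phrases whose true next word appears among the model's top 1, 3, and 5 ranked predictions. Under that assumption, the metric could be computed along these lines; the function name and the toy data are hypothetical.

```python
def top_k_accuracy(ranked_predictions, targets, k):
    """Fraction of cases where the true next word is in the first k predictions.

    ranked_predictions: one list of candidate words per test phrase, best first.
    targets: the true next word for each test phrase.
    """
    if len(ranked_predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(
        1 for preds, target in zip(ranked_predictions, targets)
        if target in preds[:k]
    )
    return hits / len(targets)


# Toy data, invented for illustration only.
predictions = [["the", "a", "of"], ["to", "for", "in"], ["you", "it", "me"]]
truth = ["the", "in", "us"]
for k in (1, 3):
    print(k, top_k_accuracy(predictions, truth, k))
```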
The Shiny application presents a basic user interface to the prediction algorithm, which works from a phrase entered by the user. Features included in this application are: