Data Science Capstone Project

Athanasios Stamatoukos
October 20, 2018

Introduction

For this project, we were tasked with making a text prediction algorithm that works similarly to the popular phone keyboard SwiftKey.

We were given samples of blog posts, news articles, and Twitter posts to analyze. I randomly sampled 40% of these posts and built a corpus from them using the 'quanteda' package in R.

Using this corpus, I then broke down each of the posts into 4-, 3-, 2-, and 1-grams and removed all n-grams that appeared fewer than 2 times. In the next slides I will explain how I use these n-grams to create the prediction model.
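The sketch below shows roughly how this step can be done with quanteda and data.table. The file names, sampling fraction details, and the helper name count_ngrams are illustrative assumptions, not my exact code.

library(quanteda)
library(data.table)

set.seed(8193)
lines   <- c(readLines("en_US.blogs.txt",   skipNul = TRUE),
             readLines("en_US.news.txt",    skipNul = TRUE),
             readLines("en_US.twitter.txt", skipNul = TRUE))
sampled <- sample(lines, round(0.40 * length(lines)))

toks <- tokens(corpus(sampled), remove_punct = TRUE,
               remove_numbers = TRUE, remove_symbols = TRUE)

# Count n-grams of a given order and drop any that appear fewer than 2 times
count_ngrams <- function(toks, n) {
  ng   <- tokens_ngrams(toks, n = n, concatenator = " ")
  freq <- colSums(dfm(ng))
  data.table(ngram = names(freq), count = as.integer(freq))[count >= 2]
}

fourgrams <- count_ngrams(toks, 4)
trigrams  <- count_ngrams(toks, 3)
bigrams   <- count_ngrams(toks, 2)
unigrams  <- count_ngrams(toks, 1)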

Creating the Model

I split each n-gram table built in the previous step into three columns: column 1 held the first n-1 terms of each n-gram, column 2 held the nth (final) term, and column 3 held the number of times the n-gram appeared.

For example, if the phrase "welcome to the jungle" appeared 38 times in the corpus, it would be represented in the data table as "welcome to the" / "jungle" / 38. This was done for every n-gram that appeared.
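Sticking with data.table, that split can be sketched as below; the regular expressions and the column names prefix, nextword, and count are assumptions made for illustration.

# Split each n-gram into its first n-1 words (prefix) and its last word (nextword)
split_ngrams <- function(dt) {
  dt[, prefix   := sub("\\s+\\S+$", "", ngram)]  # everything before the last word
  dt[, nextword := sub("^.*\\s",    "", ngram)]  # the last word only
  dt[, .(prefix, nextword, count)]
}

fourgram_table <- split_ngrams(fourgrams)
trigram_table  <- split_ngrams(trigrams)
bigram_table   <- split_ngrams(bigrams)
unigram_table  <- unigrams[, .(nextword = ngram, count)]  # 1-grams keep only word and count

# e.g. "welcome to the jungle" with count 38 becomes
#   prefix = "welcome to the", nextword = "jungle", count = 38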

These three-column data tables (two columns in the case of the 1-gram table) were then saved to an .RData file that the Shiny app loads at startup, so the very time-consuming corpus-creation process never has to be repeated.
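The save/load round trip looks something like this (file and object names assumed):

save(fourgram_table, trigram_table, bigram_table, unigram_table,
     file = "ngram_tables.RData")

# At the top of the Shiny app (e.g. in server.R), the tables are restored once:
load("ngram_tables.RData")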

NOTE: The finished app is hosted at https://athanasios8193.shinyapps.io/textprediction/, and all of my code can be found in my GitHub repository.

How the App Works

All the hard work has been done, so the app itself is quite simple at this point. When you type text into the box, I read the last 3 words you typed. (If you've typed only 1 or 2 words, I look at just those words.) I then take those last words and compare them to the n-gram tables.
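A minimal sketch of that step, assuming the input text is cleaned the same way the corpus was:

# Pull the last (up to) three words out of the typed text
last_words <- function(text, n = 3) {
  words <- unlist(strsplit(tolower(trimws(text)), "\\s+"))
  tail(words, n)
}

last_words("I think we should go to the")  # "go" "to" "the"
last_words("Hello there")                  # "hello" "there"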

If you typed 3 or more words, I first check the 4-gram table. If your last 3 words match entries in its first column, the app returns the top 3 results as predictions. If there are no matches, I check your last 2 words against the 3-gram table, and if there are still no matches, your last word against the 2-gram table. If nothing matches at any level, the 3 most used words in the entire corpus are returned as predictions. The same process, starting at the appropriate table, occurs if you entered only 1 or 2 words.
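The back-off lookup could be sketched as below, assuming the prefix/nextword/count tables from the earlier slides; predict_next is a hypothetical helper name, not necessarily the one in my code.

# Try the 4-gram, then 3-gram, then 2-gram table; fall back to the top unigrams
predict_next <- function(words) {
  tables   <- list(fourgram_table, trigram_table, bigram_table)
  prefixes <- c(3, 2, 1)                        # prefix lengths, longest first
  for (i in seq_along(prefixes)) {
    n <- prefixes[i]
    if (length(words) < n) next                 # not enough input for this table
    key  <- paste(tail(words, n), collapse = " ")
    hits <- tables[[i]][prefix == key][order(-count)]
    if (nrow(hits) > 0) return(head(hits$nextword, 3))
  }
  head(unigram_table[order(-count)]$nextword, 3)  # no match at any level
}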

If there are fewer than 3 matches for the text you entered, the missing predictions are filled in with words drawn at random from the 10 most used words in the entire corpus.
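That padding step, under the same assumptions as above, could look like this:

# Top up the prediction vector with random draws from the 10 most common words
pad_predictions <- function(preds, n = 3) {
  if (length(preds) >= n) return(preds[1:n])
  top10  <- head(unigram_table[order(-count)]$nextword, 10)
  filler <- sample(setdiff(top10, preds), n - length(preds))
  c(preds, filler)
}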

Conclusion/Acknowledgements

With more resources and skill, one could implement a model that learns each person's habits (much as SwiftKey does) and makes more personalized predictions. That is beyond the scope of this exercise and my skill level, but it could certainly be done. This model starts from random texts written by random people across the Internet, which is a good starting point; personalizing it to each user would make it even better.

I owe thanks to the forum posts on Coursera, the posts on Stack Overflow, and the developers of the quanteda package in R for making this project possible for me. Most importantly, thank you to Drs. Roger Peng, Brian Caffo, and Jeff Leek from the Johns Hopkins School of Public Health for developing this course. I learned so much from it and am grateful that you decided to share it with us.