Data Science Capstone Project

Geons
March 2017

  • This presentation introduces an application that uses Natural Language Processing techniques to predict the next word.
  • The Shiny application can be viewed at this link.
  • The Data Science Specialisation is jointly organised by Coursera and Johns Hopkins University.
  • Data for this Capstone project is kindly offered by SwiftKey.

OBJECTIVE

  • The goal was to develop a Shiny app that can predict the next word, not unlike the keyboard apps we use daily on our phones, implemented by companies like SwiftKey.

  • This app is meant to showcase what was learnt during the 9 courses of the Data Science specialisation.

  • The initial tasks like obtaining and cleaning data are documented in a Milestone Report. The creation of N-grams is also described there.

METHODS AND APPROACH

  • After preparation, the data was sampled and broken down (tokenized) into contiguous sequences of N words, the so-called N-grams; a minimal sketch of this step follows this list.

  • These N-grams were then analysed and form the basis of the predictive model. Considerable care was taken in cleaning and building them.

  • The predictive model uses the Katz back-off model. In essence this means that if the n-gram has been seen more than k times in training, the conditional probability of a word given its history is proportional to the maximum likelihood estimate of that n-gram. Otherwise, the conditional probability is equal to the back-off conditional probability of the “(n - 1)-gram” (Wikipedia). A simplified back-off sketch also follows this list.
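
The fragment below is a minimal, illustrative sketch of the tokenization step in base R: text is cleaned and then split into overlapping N-grams whose counts can be tabulated. The function names (clean_text, make_ngrams) and the tiny example corpus are assumptions for illustration, not the application's actual code.

    # Minimal sketch of N-gram creation in base R; names are illustrative.
    clean_text <- function(x) {
      x <- tolower(x)
      x <- gsub("[^a-z' ]", " ", x)   # drop punctuation and digits
      x <- gsub("\\s+", " ", x)       # collapse repeated whitespace
      trimws(x)
    }

    make_ngrams <- function(text, n = 2) {
      words <- unlist(strsplit(clean_text(text), " "))
      if (length(words) < n) return(character(0))
      sapply(seq_len(length(words) - n + 1),
             function(i) paste(words[i:(i + n - 1)], collapse = " "))
    }

    # Example: frequency table of bigrams from a tiny corpus sample
    sample_lines <- c("This is a small example", "this is another example")
    bigrams <- unlist(lapply(sample_lines, make_ngrams, n = 2))
    sort(table(bigrams), decreasing = TRUE)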
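
The next fragment illustrates only the back-off idea, reusing clean_text and the kind of count tables produced above: trigram counts are consulted first, then bigrams, then unigrams, with a threshold k. The function predict_next is hypothetical, and the sketch deliberately omits the Good-Turing discounting and normalisation that the full Katz model applies.

    # Simplified back-off lookup; trigrams/bigrams/unigrams are assumed to be
    # named count tables such as the one produced above. Illustration only.
    predict_next <- function(history, trigrams, bigrams, unigrams, k = 1) {
      words <- tail(unlist(strsplit(clean_text(history), " ")), 2)

      # Try trigrams that start with the last two words of the history
      if (length(words) == 2) {
        prefix <- paste(words, collapse = " ")
        hits <- trigrams[startsWith(names(trigrams), paste0(prefix, " "))]
        hits <- hits[hits > k]
        if (length(hits) > 0)
          return(sub(".* ", "", names(which.max(hits))))
      }

      # Back off to bigrams starting with the last word
      prefix <- tail(words, 1)
      hits <- bigrams[startsWith(names(bigrams), paste0(prefix, " "))]
      hits <- hits[hits > k]
      if (length(hits) > 0)
        return(sub(".* ", "", names(which.max(hits))))

      # Final fallback: the most frequent unigram
      names(which.max(unigrams))
    }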

THE SHINY APPLICATION

  • The Shiny application allows the prediction of the next possible word in a sentence.

  • When the application loads, the user is immediately presented with the input text box and a user guide alongside it.

  • After switching tabs, the user can see the most probable prediction and the runners-up.

  • In a third and final tab, more background information is given. A minimal sketch of this three-tab layout follows.
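
Below is a minimal sketch of a Shiny layout with the three tabs described above, wired to the illustrative predict_next function and n-gram tables from the earlier sketches; it is an assumption about how such an app could be structured, not the deployed application's source.

    library(shiny)

    # Minimal three-tab layout; predict_next and the n-gram tables are the
    # illustrative objects from the earlier sketches, assumed to be loaded.
    ui <- fluidPage(
      titlePanel("Next Word Prediction"),
      tabsetPanel(
        tabPanel("Input",
                 textInput("phrase", "Type a phrase:", value = ""),
                 helpText("Enter a few words, then open the Prediction tab.")),
        tabPanel("Prediction",
                 verbatimTextOutput("nextword")),
        tabPanel("About",
                 p("Background information on the model and the data."))
      )
    )

    server <- function(input, output) {
      output$nextword <- renderText({
        if (nchar(input$phrase) == 0) return("Waiting for input ...")
        predict_next(input$phrase, trigrams, bigrams, unigrams)
      })
    }

    shinyApp(ui = ui, server = server)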


ADDITIONAL COMMENTS AND LINKS