Coursera Data Science Capstone Project

Saugata Ghosh
March 24, 2018

This presentation pitches a Shiny application for predicting the three most probable next words after the user types a partial input sentence.

This prototype application has been developed as part of the capstone project for the Johns Hopkins University and Coursera Data Science Specialization in collaboration with SwiftKey.

The Methodology Behind Developing The App

Next word prediction is the task of suggesting the most probable word a user will type next. Current approaches are based on the empirical analysis of corpora (large text files) resulting in probability distributions over the different sequences that occur in the corpus. The resulting language models are then used for predicting the most likely next word.
The English-language corpus used to build the n-gram model was collected by a web crawler from publicly available sources: blogs, Twitter and news.
To reduce the size of the data and allow the prototype to run smoothly, about 10% of the data was sampled, cleaned and tokenized into 6-, 5-, 4-, 3-, 2- and 1-word n-grams.
After suitable data exploration, a 'Stupid Back-off' model was used to determine the three most probable next words based on the probabilities of occurrence of the relevant n-grams.
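The sampling and tokenization steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the code behind the app: the function names (`sample_lines`, `tokenize`, `count_ngrams`), the regular expression used for cleaning, and the toy corpus are all assumptions made for the example.

```python
import random
import re
from collections import Counter


def sample_lines(lines, fraction=0.10, seed=42):
    """Keep roughly `fraction` of the lines, chosen at random (hypothetical helper)."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]


def tokenize(line):
    """Lowercase a line and keep only runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", line.lower())


def count_ngrams(lines, max_order=6):
    """Return {n: Counter of n-word tuples} for n = 1..max_order."""
    counts = {n: Counter() for n in range(1, max_order + 1)}
    for line in lines:
        tokens = tokenize(line)
        for n in range(1, max_order + 1):
            # Slide a window of n words across the line.
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return counts


# Toy corpus; a real run would first pass the full corpus through sample_lines.
corpus = ["All roads lead to Rome.", "All roads lead to the sea."]
counts = count_ngrams(corpus)
print(counts[2][("roads", "lead")])  # 2
```

Storing each order in its own counter mirrors the separate n-gram lists described above, one list per n-gram length.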

The Stupid Back-off Model

The model was developed with 6-, 5-, 4-, 3-, 2- and 1-word n-grams: sequences of words that occur consecutively in the sampled corpus.
The input sentence typed in the app is first cleaned of special characters, numbers, punctuation, swear words, etc. Then the last five words in the sentence (or fewer if the sentence is shorter) are taken and matched against the n-gram list of the corresponding order (the 6-grams for a five-word input). If a match is found, the last word of the n-gram is returned as the predicted next word. If not, the input string is shortened by one word and the next lower-order n-gram list is searched.
For example, consider the input “All roads lead to…”. The model searches the list of all 5-word n-grams, and if a 5-word n-gram completing the sentence is found, its last word is returned as the predicted next word. If no such n-gram is found, the input is shortened to “roads lead to…” and the model searches the list of 4-grams. If a match is found, its last word is returned. If not, the input is shortened by another word and the next lower-order n-gram list is searched. Finally, if no match is found down to the bigram level, the model simply returns the most probable unigrams.
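The lookup described above can be sketched in Python as below. This is an illustrative sketch under stated assumptions, not the app's implementation: `clean`, `build_tables` and `predict` are hypothetical names, the training sentences are made up, swear-word filtering is omitted, and the sketch returns up to three candidates from the first n-gram order that matches, without applying any discount to lower-order matches.

```python
import re
from collections import Counter, defaultdict


def clean(text):
    """Lowercase the input and keep only runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def build_tables(sentences, max_order=6):
    """Map each context (tuple of n-1 words) to a Counter of following words."""
    tables = defaultdict(Counter)
    for sentence in sentences:
        tokens = clean(sentence)
        for n in range(1, max_order + 1):
            for i in range(len(tokens) - n + 1):
                *context, word = tokens[i:i + n]
                # The empty context () holds the unigram counts.
                tables[tuple(context)][word] += 1
    return tables


def predict(text, tables, k=3, max_order=6):
    """Back off from the longest context until some n-gram matches."""
    context = clean(text)[-(max_order - 1):]  # at most the last five words
    while True:
        candidates = tables.get(tuple(context))
        if candidates:
            return [word for word, _ in candidates.most_common(k)]
        if not context:
            return []  # empty model: nothing to fall back on
        context = context[1:]  # drop the oldest word and retry


tables = build_tables([
    "All roads lead to Rome",
    "Some roads lead to ruin",
    "Roads lead to Rome",
])
print(predict("All roads lead to", tables))   # ['rome']
print(predict("Many roads lead to", tables))  # ['rome', 'ruin']
```

The second call finds no 5-gram starting with "many", so it backs off to the context "roads lead to" and ranks the continuations by count. An input with no known words falls all the way through to the unigram table.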

Predictive Performance of the Model

- To measure its accuracy, the model was evaluated on 4173 predictions made over 100 lines each of text from blogs and tweets, using data and a benchmark function kindly made available at this GitHub repository. A screenshot of the resulting accuracy statistics is given below:

[Screenshot: benchmark accuracy statistics]

- Since the model was built from only about 10% of the data, with n-grams up to order 6, the resulting top-1 prediction accuracy of around 11% is in line with expectations.
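For readers unfamiliar with the metric, top-k accuracy is the share of test positions where the actual next word appears among the first k suggestions. The sketch below is a minimal Python illustration and is not the benchmark function referred to above; `top_k_accuracy`, the toy predictor and the test cases are hypothetical.

```python
def top_k_accuracy(predict_fn, test_cases, k=3):
    """Fraction of (context, actual_word) pairs whose word is in the top k."""
    if not test_cases:
        return 0.0
    hits = sum(
        1 for context, actual in test_cases
        if actual in predict_fn(context)[:k]
    )
    return hits / len(test_cases)


# Toy predictor with fixed suggestions, standing in for a real model.
toy_suggestions = {
    "all roads lead to": ["rome", "the", "a"],
    "thank you very": ["well", "much", "kindly"],
    "see you": ["soon", "there", "then"],
}


def toy_predict(context):
    return toy_suggestions.get(context, [])


cases = [
    ("all roads lead to", "rome"),
    ("thank you very", "much"),
    ("see you", "later"),
]
top1 = top_k_accuracy(toy_predict, cases, k=1)  # 1 hit out of 3
top3 = top_k_accuracy(toy_predict, cases, k=3)  # 2 hits out of 3
```

Top-1 accuracy counts only the first suggestion, so it can never exceed top-3 accuracy on the same test set.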

The Shiny App

The Shiny app is hosted here. A screenshot of the hosted app is appended below.
Because of the large data files involved, please allow the app about a minute to load.
All code files pertaining to the app are available at this GitHub repository.

- Thanks