Coursera Data Science Capstone Project Word Predictor

Michail Xenakis
8/01/2019

Overview of the Project

  • This presentation is a required deliverable for the capstone project and accompanies the Shiny application that predicts the next word of an input sentence.
  • The objective of the application is to predict the next word based on an input sentence.
  • All the scripts are available on GitHub via this link.
  • You can gain access to the shiny application via this link.
  • The following slides briefly present the full structure of the application in a single figure, the computation of the NGram frequency tables, and the prediction algorithm.
  • The Shiny app simply builds upon these scripts.

The full structure of the application

The following figure illustrates the structure of the application:

[Figure: full structure of the application]

NGram Frequency tables

  • In simple terms, we process the US blogs, news and Twitter datasets from this link to produce word frequency tables on which we base our prediction model.
  • Due to memory constraints, we worked on a very small sample of the datasets (0.4% of the total, roughly 17 thousand lines), from which we produced word frequency tables for unigrams, bigrams, trigrams and tetragrams (one, two, three and four words respectively).
  • We saved these NGram frequency tables in .Rds format (gramN_1.Rds, gramN_2.Rds, gramN_3.Rds, gramN_4.Rds) so the prediction algorithm can load them; a minimal sketch of this step follows the list.
  • The full code for the algorithm (NGramsFreq.R) is here.
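
The sketch below illustrates how such frequency tables can be built and saved. It is a minimal sketch, assuming the quanteda package and the standard en_US file names from the course dataset; the sampling helper, file paths and column names are illustrative and not the exact NGramsFreq.R code.

    library(quanteda)

    set.seed(1234)

    # Read a file and keep a small random sample of its lines (0.4% by default)
    read_sample <- function(path, rate = 0.004) {
      lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
      sample(lines, floor(length(lines) * rate))
    }

    corpus_sample <- c(
      read_sample("final/en_US/en_US.blogs.txt"),
      read_sample("final/en_US/en_US.news.txt"),
      read_sample("final/en_US/en_US.twitter.txt")
    )

    toks <- tokens(corpus_sample, remove_punct = TRUE,
                   remove_numbers = TRUE, remove_symbols = TRUE)

    # Build and save one frequency table per n-gram order (1 to 4)
    for (n in 1:4) {
      ngrams <- tokens_ngrams(toks, n = n, concatenator = " ")
      freq   <- sort(colSums(dfm(ngrams)), decreasing = TRUE)
      tab    <- data.frame(ngram = names(freq), freq = unname(freq),
                           stringsAsFactors = FALSE)
      saveRDS(tab, paste0("gramN_", n, ".Rds"))
    }

Saving each order to its own .Rds file keeps the Shiny app lightweight: it only loads four small lookup tables instead of reprocessing the raw corpus.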

Prediction Model Algorithm

  • The algorithm predicts the next word of an input sentence as follows.
  • It reads an input sentence, tokenizes it and counts the number of words (i.e. N).
  • Based on N, it searches the (N + 1)-gram frequency table for the phrase that starts with the input sentence and has the highest frequency.
  • If a matching phrase exists, it returns the word that follows the input sentence. If not, it drops the first word of the input sentence and repeats the search on the N-gram table instead of the (N + 1)-gram table. If no match is found at any level, it randomly returns one of the 10 most frequent unigrams (a back-off sketch follows this list).
  • The full code for the prediction model (PredictionModelling.R) is here.
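
A minimal sketch of this back-off lookup is shown below. It assumes the gramN_*.Rds tables from the previous slide, with one n-gram phrase and its frequency per row; the function name, column names and regular-expression matching are illustrative and not the exact PredictionModelling.R code.

    # Load the unigram to tetragram frequency tables (columns: ngram, freq)
    grams <- lapply(1:4, function(n) readRDS(paste0("gramN_", n, ".Rds")))

    predict_next_word <- function(sentence) {
      words <- tolower(unlist(strsplit(trimws(sentence), "\\s+")))
      words <- tail(words, 3)                        # at most a trigram context

      # Try the (N + 1)-gram table first, then back off by dropping the first word
      while (length(words) > 0) {
        tab     <- grams[[length(words) + 1]]
        pattern <- paste0("^", paste(words, collapse = " "), " ")
        hits    <- tab[grepl(pattern, tab$ngram), ]
        if (nrow(hits) > 0) {
          best <- hits$ngram[which.max(hits$freq)]
          return(tail(strsplit(best, " ")[[1]], 1))  # last word of the best match
        }
        words <- words[-1]                           # back off to a shorter context
      }

      # No match at any order: fall back to one of the 10 most frequent unigrams
      sample(head(grams[[1]]$ngram, 10), 1)
    }

    predict_next_word("I would like to")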

Resulting Shiny Application

The application looks like this:

[Screenshot: the Shiny word predictor interface]