Coursera Data Science Capstone Project Word Predictor

Michail Xenakis
8/01/2019

Overview of the Project

  • This presentation is a requirement for the capstone project and accompanies the Shiny application.
  • The objective of the application is to predict the next word based on an input sentence.
  • All the scripts are available on GitHub via this link.
  • You can access the Shiny application via this link.
  • In the following slides we briefly present the full structure of the application in a single figure, the computation of the NGram frequency tables, and the prediction algorithm.
  • The Shiny app simply builds upon these scripts.

The full structure of the application

[Figure: diagram of the full structure of the application]

NGram Frequency Tables

  • In simple terms, we process the US blogs, news, and Twitter datasets from this link to produce the word frequency tables on which the prediction model is based.
  • Due to memory constraints, we worked on a very small sample of the datasets (0.4% of the total, roughly 17 thousand lines), from which we produced frequency tables for unigrams, bigrams, trigrams, and tetragrams (one, two, three, and four words).
  • We saved these NGram frequency tables in .Rds format (gramN_1.Rds, gramN_2.Rds, gramN_3.Rds, gramN_4.Rds) so that the prediction algorithm can load them directly.
  • The full code for the algorithm (NGramsFreq.R) is here; a rough sketch of the pipeline follows below.
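As an illustration of this step, the following is a minimal sketch of how such tables could be built. The file paths, the random sampling via rbinom, and the simple regex tokenization are assumptions made for this example; the actual NGramsFreq.R script may differ.

```r
# Sketch: sample the corpora and build frequency-sorted NGram tables.
set.seed(1234)

read_sample <- function(path, rate = 0.004) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  lines[rbinom(length(lines), 1, rate) == 1]   # keep ~0.4% of the lines
}

corpus <- c(read_sample("final/en_US/en_US.blogs.txt"),
            read_sample("final/en_US/en_US.news.txt"),
            read_sample("final/en_US/en_US.twitter.txt"))

# Lower-case, keep only letters and apostrophes, then split into words.
tokens <- strsplit(gsub("[^a-z' ]", " ", tolower(corpus)), "\\s+")

ngram_freq <- function(tokens, n) {
  grams <- unlist(lapply(tokens, function(w) {
    w <- w[nzchar(w)]
    if (length(w) < n) return(character(0))
    vapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "), "")
  }))
  sort(table(grams), decreasing = TRUE)        # most frequent first
}

for (n in 1:4) {
  saveRDS(ngram_freq(tokens, n), sprintf("gramN_%d.Rds", n))
}
```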

Prediction Model

  • The algorithm predicts the next word of an input sentence as follows.
  • It reads the input sentence and splits it into words to determine its length N.
  • Using the corresponding (N + 1)-gram frequency table, it looks for the most frequent phrase whose first N words are identical to the input sentence.
  • If such a phrase exists, it returns the word that follows the input sentence. If not, it backs off: it drops the first word of the input and repeats the search in the N-gram table instead of the (N + 1)-gram table. If no match is found after all these iterations, it randomly returns one of the 10 most frequent unigrams.
  • The full code for the prediction model (PredictionModelling.R) is here; a sketch of the back-off loop follows below.
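For illustration, this is a minimal sketch of that back-off loop, assuming the gramN_*.Rds tables from the previous slide (named, frequency-sorted tables). The function name predict_word and the three-word context cap are choices made for this example, not necessarily those of PredictionModelling.R.

```r
# Load the unigram..tetragram tables built earlier.
grams <- lapply(1:4, function(n) readRDS(sprintf("gramN_%d.Rds", n)))

predict_word <- function(sentence) {
  words <- strsplit(tolower(trimws(sentence)), "\\s+")[[1]]
  words <- tail(words, 3)            # largest table is tetragrams, so N <= 3
  while (length(words) > 0) {
    n <- length(words)
    tab <- grams[[n + 1]]            # search the (N + 1)-gram table
    prefix <- paste0("^", paste(words, collapse = " "), " ")
    hits <- names(tab)[grepl(prefix, names(tab))]
    if (length(hits) > 0) {
      # Tables are frequency-sorted, so the first hit is the most frequent;
      # its last word is the prediction.
      return(tail(strsplit(hits[1], " ")[[1]], 1))
    }
    words <- words[-1]               # back off: drop the first word
  }
  # No match at any level: one of the 10 most frequent unigrams, at random.
  sample(names(grams[[1]])[1:10], 1)
}

predict_word("thanks for the")
```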

Resulting Shiny Application

The application looks like this:

[Figure: screenshot of the Shiny application's user interface]