Coursera Data Science Capstone Project Word Predictor

Michail Xenakis
8/01/2019

Overview of the Project

  • This presentation is a requirement for the capstone project and accompanies the Shiny application.
  • The objective of the application is to predict the next word based on an input sentence.
  • All the scripts are available on GitHub via this link.
  • You can access the Shiny application via this link.
  • In the following slides we briefly present the full structure of the application in a single figure, the computation of the NGram frequency tables, and the prediction algorithm.
  • The Shiny app simply builds upon these scripts.

The full structure of the application

[Figure: diagram of the full structure of the application]

NGram Frequency Tables

  • In simple terms, we process the US blogs, news, and Twitter datasets from this link to produce the word frequency tables on which the prediction model is based.
  • Due to memory constraints, we worked on a very small sample of the datasets (0.4% of the total, roughly 17 thousand lines), from which we produced frequency tables for unigrams, bigrams, trigrams, and tetragrams (one, two, three, and four words).
  • We saved these NGram frequency tables in .Rds format (gramN_1.Rds, gramN_2.Rds, gramN_3.Rds, gramN_4.Rds) so that the prediction algorithm can load them directly.
  • The full code for the algorithm (NGramsFreq.R) is here; a rough sketch of the pipeline follows below.
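As an illustration of this step, the following is a minimal sketch of how such tables could be built. The file paths, the random sampling via rbinom, and the simple regex tokenization are assumptions made for this example; the actual NGramsFreq.R script may differ.

```r
# Sketch: sample the corpora and build frequency-sorted NGram tables.
set.seed(1234)

read_sample <- function(path, rate = 0.004) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  lines[rbinom(length(lines), 1, rate) == 1]   # keep ~0.4% of the lines
}

corpus <- c(read_sample("final/en_US/en_US.blogs.txt"),
            read_sample("final/en_US/en_US.news.txt"),
            read_sample("final/en_US/en_US.twitter.txt"))

# Lower-case, keep only letters and apostrophes, then split into words.
tokens <- strsplit(gsub("[^a-z' ]", " ", tolower(corpus)), "\\s+")

ngram_freq <- function(tokens, n) {
  grams <- unlist(lapply(tokens, function(w) {
    w <- w[nzchar(w)]
    if (length(w) < n) return(character(0))
    vapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "), "")
  }))
  sort(table(grams), decreasing = TRUE)        # most frequent first
}

for (n in 1:4) {
  saveRDS(ngram_freq(tokens, n), sprintf("gramN_%d.Rds", n))
}
```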

Prediction Model

  • The algorithm predicts the next word of an input sentence as follows.
  • It reads the input sentence and splits it into words to determine its length N.
  • Using the corresponding (N + 1)-gram frequency table, it looks for the most frequent phrase whose first N words are identical to the input sentence.
  • If such a phrase exists, it returns the word that follows the input sentence. If not, it backs off: it drops the first word of the input and repeats the search in the N-gram table instead of the (N + 1)-gram table. If no match is found after all these iterations, it randomly returns one of the 10 most frequent unigrams.
  • The full code for the prediction model (PredictionModelling.R) is here; a sketch of the back-off loop follows below.
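For illustration, this is a minimal sketch of that back-off loop, assuming the gramN_*.Rds tables from the previous slide (named, frequency-sorted tables). The function name predict_word and the three-word context cap are choices made for this example, not necessarily those of PredictionModelling.R.

```r
# Load the unigram..tetragram tables built earlier.
grams <- lapply(1:4, function(n) readRDS(sprintf("gramN_%d.Rds", n)))

predict_word <- function(sentence) {
  words <- strsplit(tolower(trimws(sentence)), "\\s+")[[1]]
  words <- tail(words, 3)            # largest table is tetragrams, so N <= 3
  while (length(words) > 0) {
    n <- length(words)
    tab <- grams[[n + 1]]            # search the (N + 1)-gram table
    prefix <- paste0("^", paste(words, collapse = " "), " ")
    hits <- names(tab)[grepl(prefix, names(tab))]
    if (length(hits) > 0) {
      # Tables are frequency-sorted, so the first hit is the most frequent;
      # its last word is the prediction.
      return(tail(strsplit(hits[1], " ")[[1]], 1))
    }
    words <- words[-1]               # back off: drop the first word
  }
  # No match at any level: one of the 10 most frequent unigrams, at random.
  sample(names(grams[[1]])[1:10], 1)
}

predict_word("thanks for the")
```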

Resulting Shiny Application

The application looks like this:

[Figure: screenshot of the Shiny application's user interface]