4 November 2018

Capstone project

The goal of this exercise is to create a product to highlight the prediction algorithm that you have built and to provide an interface that can be accessed by others. For this project you must submit:

  1. A Shiny app that takes as input a phrase (multiple words) in a text box and outputs a prediction of the next word.
  2. A slide deck of no more than 5 slides, created with RStudio Presenter (https://support.rstudio.com/hc/en-us/articles/200486468-Authoring-R-Presentations), pitching your algorithm and app as if you were presenting to your boss or an investor.

Prediction algorithm

The project followed these steps:

  1. Downloaded the data and created the corpora.
  2. Cleaned and sampled the corpora.
  3. Tokenized the corpora and built N-gram frequency tables (2-grams, 3-grams, 4-grams and 5-grams) from the processed corpora.
  4. For each N-gram table, split the N-gram column into two columns: the first N-1 words (those preceding the last word) in the first column and the last word in the second column.
  5. The application takes the last N words of the input text (trying N = 4 first, then 3, then 2, then 1) and searches the (N+1)-gram table for the most frequent following word, plus the next few most frequent ones to offer as alternatives.
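
Steps 3 to 5 can be illustrated with a minimal sketch. It is written in Python for compactness and is not the app's actual R code. The function names (`tokenize`, `build_tables`, `predict`), the whitespace tokenizer and the toy sentences are all assumptions made for illustration; the real cleaning and sampling steps are not reproduced here.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a placeholder for the real cleaning step)."""
    return text.lower().split()


def build_tables(sentences, max_n=5):
    """Build one table per n: {first n-1 words -> Counter of last words}."""
    tables = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    for sentence in sentences:
        words = tokenize(sentence)
        for n in range(2, max_n + 1):
            for i in range(len(words) - n + 1):
                # Split each n-gram into its prefix and its last word (step 4).
                *prefix, last = words[i:i + n]
                tables[n][tuple(prefix)][last] += 1
    return tables


def predict(tables, phrase, k=3):
    """Try the longest usable prefix first, then back off to shorter ones (step 5)."""
    words = tokenize(phrase)
    longest_prefix = max(tables) - 1
    for size in range(min(longest_prefix, len(words)), 0, -1):
        prefix = tuple(words[-size:])
        followers = tables[size + 1].get(prefix)
        if followers:
            # Most frequent follower first, then up to k-1 alternatives.
            return [word for word, _ in followers.most_common(k)]
    return []  # no table contains any suffix of the phrase


# Hypothetical toy corpus, for illustration only.
sentences = [
    "thanks for the follow",
    "thanks for the help",
    "thanks for the follow back",
    "i love the weather",
]
tables = build_tables(sentences)
print(predict(tables, "many thanks for the"))  # ['follow', 'help']
```

In the example call, no 5-gram starts with "many thanks for the", so the lookup backs off to the last three words and finds them in the 4-gram table. An empty list means that even the last single word was never seen as a prefix.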

The application