Shiny Word Prediction App

Charles-Antoine de Thibault
19 March 2018

Executive Summary

The application is the Capstone Project for the Data Science Specialisation from Johns Hopkins University on Coursera. https://www.coursera.org/specializations/jhu-data-science

The objective was to process a raw dataset of text from blogs, news and Twitter and build a model that predicts the next word you are about to write, given the last words written.


The Process

The whole process is available on GitHub at https://github.com/charlesdethibault/DataScienceCaptone.

The work is divided into five steps.

  1. Corpus and N-grams: Creation of the corpus by structuring and cleaning the data
  2. Prediction Tables: Creation of probability tables to identify which words are the most frequent
  3. Table Reduction: Optimisation of the datasets
  4. Model Prediction: Creation of the model, including research
  5. Word Prediction: Creation of the prediction function based on the model
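
Steps 1 to 3 can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the code from the repository: the function names `tokenize` and `ngram_table`, the `keep` parameter and the tiny corpus are all assumptions. It cleans the text, counts which word follows each prefix, and keeps only the most frequent continuations as a simple form of table reduction.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lower-case the text and keep only runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def ngram_table(lines, n, keep=3):
    """Map each (n-1)-word prefix to its `keep` most frequent next words."""
    counts = defaultdict(Counter)
    for line in lines:
        tokens = tokenize(line)
        for i in range(len(tokens) - n + 1):
            prefix = tuple(tokens[i:i + n - 1])
            counts[prefix][tokens[i + n - 1]] += 1
    # Table reduction: drop everything except the top continuations.
    return {p: [w for w, _ in c.most_common(keep)] for p, c in counts.items()}

# Hypothetical three-line corpus, standing in for the blogs/news/Twitter data.
corpus = [
    "I love data science",
    "I love data analysis",
    "I love data science a lot",
]
trigrams = ngram_table(corpus, n=3, keep=1)
print(trigrams[("love", "data")])
```

With `keep=1`, each prefix retains a single continuation, which keeps the table small at the cost of offering only one suggestion per prefix.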

The Model

To create a model that is light enough to run on shinyapps.io with acceptable accuracy, I decided to use a back-off model, which compares the sequence of words typed against the same sequence of words in the initial database.

If the sequence does not exist, the model removes the first word of the sequence and compares the shortened sequence to the database.
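
The back-off lookup described above can be sketched as follows. This is a Python illustration under stated assumptions, not the app's actual implementation: the function `predict_next`, the `max_prefix` parameter and the contents of `table` are hypothetical, and the sketch simply returns the stored word for the longest matching sequence without any probability weighting.

```python
def predict_next(text, table, max_prefix=3):
    """Return (predicted_word, matched_prefix) using a simple back-off lookup.

    `table` maps a tuple of preceding words to the predicted next word.
    Returns (None, ()) if not even the empty prefix is present.
    """
    words = text.lower().split()
    # Start from the longest prefix the table is assumed to hold.
    start = max(0, len(words) - max_prefix)
    prefix = tuple(words[start:])
    while True:
        if prefix in table:
            return table[prefix], prefix
        if not prefix:
            return None, prefix
        # Back off: drop the first word and try the shorter sequence.
        prefix = prefix[1:]

# Hypothetical lookup table; the empty prefix holds a fallback word.
table = {
    ("thanks", "for", "the"): "follow",
    ("for", "the"): "first",
    ("the",): "same",
    (): "the",
}

print(predict_next("Thanks for the", table))
print(predict_next("looking for the", table))
print(predict_next("hello world", table))
```

Returning the matched prefix alongside the word makes it possible to show which n-gram produced the prediction.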

App Overview

The app is available at https://charlesdethibault.shinyapps.io/SwiftFinal/.

If you enter a word or sentence on the left, the predicted next word will appear on the right.

The n-gram used to predict the word will appear below the prediction.
