Coursera Final Project Presentation

Abdullah Albyati
11/25/2017

This presentation provides an overview of my algorithm for predicting the next word in a sentence. This is the capstone project for the Johns Hopkins University Data Science Specialization on coursera.org.

Objective

The goal of this capstone project is to build a Shiny application that is capable of predicting the next word based on user text input.

This project was completed in three phases:

  • Downloading and cleaning the text data
    • Before downloading the text data, the algorithm checks the current working directory to see whether the file already exists, to avoid downloading it again.
    • In this phase I process the text to remove numbers, profanity, and extra white space.
  • Exploratory Analysis

  • Prediction model and Shiny app creation
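The first phase can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the helper names `download_if_missing` and `clean_text`, and the idea of passing the profanity list as a plain character vector, are assumptions made for the example.

```r
# Hypothetical helper: download a file only if it is not already
# present at the destination path.
download_if_missing <- function(url, destfile) {
  if (!file.exists(destfile)) {
    download.file(url, destfile, mode = "wb")
  }
  invisible(destfile)
}

# Hypothetical helper: strip numbers, listed profane words, and
# extra white space from a character vector.
# Assumes `profanity` contains plain words without regex metacharacters.
clean_text <- function(x, profanity = character(0)) {
  x <- gsub("[0-9]+", "", x)                      # remove numbers
  if (length(profanity) > 0) {
    pattern <- paste0("\\b(", paste(profanity, collapse = "|"), ")\\b")
    x <- gsub(pattern, "", x, ignore.case = TRUE, perl = TRUE)
  }
  x <- gsub("\\s+", " ", x)                       # collapse runs of white space
  trimws(x)                                       # drop leading/trailing space
}

cleaned <- clean_text("I have 2  darn   cats ", profanity = "darn")
```

Checking `file.exists()` before calling `download.file()` keeps repeated runs of the script from fetching the same archive again.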

Prediction Algorithm

The prediction algorithm uses a back-off n-gram model (up to 6-grams). It was built on a subset of the data obtained by running the following code:

# Take a small (20%) sample of each text source to work with
set.seed(2017); ds.blogs  <- sample(blogs,   0.2 * length(blogs))
set.seed(2017); ds.news   <- sample(news,    0.2 * length(news))
set.seed(2017); ds.tweets <- sample(twitter, 0.2 * length(twitter))

The algorithm predicts the next word by trying the highest-order n-gram first, starting from 6, and working its way down to lower orders when no match is found.
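The back-off idea can be illustrated with a small base R sketch. It is not the project's implementation: the function names `count_ngrams` and `predict_next`, the toy corpus, and the use of raw counts for ranking are all assumptions, and the example stops at 3-grams to stay short (the same loop works for 6).

```r
# Count n-grams of order n in a character vector of cleaned sentences.
# Returns a named vector: names are "w1 w2 ... wn", values are counts.
count_ngrams <- function(sentences, n) {
  grams <- unlist(lapply(strsplit(sentences, "\\s+"), function(words) {
    if (length(words) < n) return(character(0))
    starts <- seq_len(length(words) - n + 1)
    vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  tab <- table(grams)
  counts <- as.vector(tab)
  names(counts) <- names(tab)
  counts
}

# models[[n]] holds the counts for n-grams of order n.
# Try the highest order first and back off to lower orders.
predict_next <- function(input, models, top = 5) {
  words <- strsplit(trimws(input), "\\s+")[[1]]
  for (n in rev(seq_len(length(models))[-1])) {
    k <- n - 1                                   # context length for order n
    counts <- models[[n]]
    if (length(words) < k || length(counts) == 0) next
    prefix <- paste0(paste(tail(words, k), collapse = " "), " ")
    hits <- counts[startsWith(names(counts), prefix)]
    if (length(hits) > 0) {
      hits <- sort(hits, decreasing = TRUE)
      return(sub(".* ", "", names(head(hits, top))))  # keep the last word
    }
  }
  # No context matched: fall back to the most frequent single words.
  names(head(sort(models[[1]], decreasing = TRUE), top))
}

# Toy corpus (made up for illustration only).
corpus <- c("i love new york", "i love new shoes",
            "i love new york city", "you love old york")
models <- lapply(1:3, function(n) count_ngrams(corpus, n))

from_trigram <- predict_next("they really love new", models)
from_bigram  <- predict_next("an old", models)
```

With this toy corpus, the first call matches the trigram context "love new" and returns "york" then "shoes"; the second finds no trigram for "an old" and backs off to the bigram context "old", returning "york".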

The Application

The application layout is as follows:

  • Sidebar (left side of the screen)
    • The left side panel contains the text input box where users can enter their words
    • Below the text input box is a slider where users can choose how many words to enter for prediction
  • Main Panel
    • The top part of the main panel shows the user's complete sentence verbatim
    • Below that are the top 5 predicted words
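A layout like the one described above could be wired up in Shiny roughly as follows. This is only a sketch: the input and output IDs, the labels, the slider range, and the placeholder `predict_words` function are assumptions for illustration, and the slider is read here as the number of trailing words passed to the predictor.

```r
library(shiny)

# Placeholder predictor (hypothetical): a real app would call the
# back-off n-gram model here. It ignores its inputs and returns
# fixed words so the layout can be tried on its own.
predict_words <- function(text, n_words) {
  c("the", "to", "and", "a", "of")
}

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  sidebarLayout(
    # Sidebar: text input with a slider below it
    sidebarPanel(
      textInput("user_text", "Enter your text:", value = ""),
      sliderInput("n_words", "Number of words to use for prediction:",
                  min = 1, max = 5, value = 3, step = 1)
    ),
    # Main panel: the sentence verbatim, then the top 5 predictions
    mainPanel(
      h4("Your sentence"),
      textOutput("sentence"),
      h4("Top 5 predicted words"),
      tableOutput("predictions")
    )
  )
)

server <- function(input, output) {
  output$sentence <- renderText(input$user_text)
  output$predictions <- renderTable({
    words <- head(predict_words(input$user_text, input$n_words), 5)
    data.frame(Rank = seq_along(words), Word = words)
  })
}

# Launch the app only when run from an interactive R session.
if (interactive()) runApp(shinyApp(ui, server))
```

Keeping the predictor behind a single function makes it straightforward to swap the placeholder for the real model without touching the layout code.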

Links and references