Coursera Final Project Presentation

Abdullah Albyati
11/25/2017

This presentation provides an overview of my algorithm for predicting the next word in a sentence. This is the capstone project for the Johns Hopkins University Data Science Specialization on coursera.org.

Objective

The goal of this capstone project is to build a Shiny application that is capable of predicting the next word based on user text input.

This project was completed in three phases:

Downloading and cleaning the text data
- Before downloading the text data, the algorithm checks the current working directory to see whether the file already exists, so that it is not downloaded again.
- In this phase I process the text to remove numbers, profanity, and white space.
Exploratory Analysis
Prediction model and Shiny App Creation
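
As a rough illustration of the first phase, the download check and the cleaning step could look something like the sketch below. The helper names download_if_missing and clean_text, their arguments, and the idea of passing the profanity list as a character vector are assumptions made for this example; they are not taken from the project code.

```r
# Hypothetical helper: download a file only if it is not already present.
# `url` and `dest` are supplied by the caller; no real location is assumed.
download_if_missing <- function(url, dest) {
  if (!file.exists(dest)) {
    download.file(url, destfile = dest, mode = "wb")
  }
  invisible(dest)
}

# Hypothetical cleaner: remove digits, remove listed profane words,
# then collapse repeated white space and trim the ends.
clean_text <- function(x, profanity = character(0)) {
  x <- gsub("[0-9]+", "", x, perl = TRUE)
  if (length(profanity) > 0) {
    # Assumes the profanity entries are plain words without regex characters.
    pattern <- paste0("\\b(", paste(profanity, collapse = "|"), ")\\b")
    x <- gsub(pattern, "", x, ignore.case = TRUE, perl = TRUE)
  }
  x <- gsub("\\s+", " ", x, perl = TRUE)
  trimws(x)
}
```

For example, clean_text("I  have 2 darn   cats ", profanity = "darn") returns "I have cats".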

Prediction Algorithm

The prediction algorithm was created using a back-off n-gram model (up to 6-grams). The algorithm used a subset of the data obtained by running the following code:

#Take a small sample (20%) of each text source to work with
set.seed( 2017 ); ds.blogs  <- sample(blogs,   0.2 * length(blogs))
set.seed( 2017 ); ds.tweets <- sample(twitter, 0.2 * length(twitter))
set.seed( 2017 ); ds.news   <- sample(news,    0.2 * length(news))

The algorithm predicts the next word by trying the highest-order n-gram first, starting from 6, and working its way down to lower orders when no match is found.
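
The back-off idea can be sketched on a toy lookup table as shown below. This is only an illustration under stated assumptions: the table contents are invented for the example, the names ngram_table, unigrams, and predict_next are hypothetical, and the real model would be built from n-gram counts in the sampled corpus.

```r
# Toy lookup table (invented data): names are the preceding words (context),
# values are candidate next words ordered from most to least frequent.
ngram_table <- list(
  "thanks for the" = c("follow", "support"),
  "for the"        = c("first", "record"),
  "the"            = c("same", "first")
)

# Toy fallback used when no context matches at any order.
unigrams <- c("the", "to", "and")

predict_next <- function(text, table, fallback, max_order = 6, top_n = 5) {
  words <- strsplit(trimws(text), "\\s+")[[1]]
  words <- words[nzchar(words)]
  # An n-gram of order k + 1 uses the last k words as its context.
  max_context <- min(max_order - 1, length(words))
  # Try the longest context first, then back off to shorter ones.
  for (k in rev(seq_len(max_context))) {
    context <- paste(tail(words, k), collapse = " ")
    candidates <- table[[context]]
    if (!is.null(candidates)) {
      return(head(candidates, top_n))
    }
  }
  head(fallback, top_n)
}
```

With this table, predict_next("looking for the", ngram_table, unigrams) finds no match for the three-word context, backs off to "for the", and returns "first" and "record".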

The Application

The application layout is as follows:

Sidebar (left side of the screen)
- On the left side panel is the text input box where users can input their words
- Below the text input box is the slider where users can choose how many words to enter for prediction
Main Panel
- The top part of the main panel shows the user's complete sentence verbatim
- Below that are the top 5 predicted words
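
A layout along these lines could be sketched in Shiny roughly as follows. This is a layout-only skeleton, not the deployed app: the input and output IDs (user_text, n_words, sentence, predictions), the labels, the slider range, and the empty placeholder table are all assumptions made for illustration.

```r
library(shiny)

# Layout-only sketch; all IDs, labels, and the slider range are hypothetical.
ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  sidebarLayout(
    sidebarPanel(
      # Text input box for the user's words
      textInput("user_text", "Enter your text:", value = ""),
      # Slider below the text box (range is an assumption)
      sliderInput("n_words", "Number of words:", min = 1, max = 5, value = 3, step = 1)
    ),
    mainPanel(
      h4("Your sentence"),
      textOutput("sentence"),
      h4("Top 5 predicted words"),
      tableOutput("predictions")
    )
  )
)

server <- function(input, output) {
  # Echo the user's sentence verbatim
  output$sentence <- renderText(input$user_text)
  # Placeholder: a real app would call the prediction model here
  output$predictions <- renderTable({
    data.frame(rank = integer(0), word = character(0))
  })
}

app <- shinyApp(ui = ui, server = server)
# runApp(app) would start the application locally.
```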

Links and references

The app is located here: https://albyati.shinyapps.io/text-prediction/
References
- https://rpubs.com/lmullen/nlp-chapter
- https://github.com/lgreski/datasciencectacontent/blob/master/markdown/capstone-ngramComputerCapacity.md