Final Project Submission

Elimane NDOYE
03/10/2019

First Slide

Introduction

Welcome to my final submission for the Capstone Project. The goal of this exercise is to create a product that highlights the prediction algorithm that you have built and to provide an interface that can be accessed by others. For this project you must submit:

  1. A Shiny app that takes as input a phrase (multiple words) in a text box and outputs a prediction of the next word.
  2. A slide deck of no more than 5 slides, created with RStudio Presenter (https://support.rstudio.com/hc/en-us/articles/200486468-Authoring-R-Presentations), pitching your algorithm and app as if you were presenting to your boss or an investor.

To see the Shiny app, go here: http://rpubs.com/Ellimann/534991

Slide 2

Prerequisites

Producing this Shiny app required:

Data: Coursera-SwiftKey dataset including News and Twitter examples to feed the model
Software: R (RStudio is optional)
Libraries: shiny, tm, data.table, stringr, dplyr
Internet connection
Disclaimer: due to limited resources, this is only a prototype. Only a selected sample of the tweets and news items was analysed, not the full dataset.

Slide 3

Summary of Project Steps

  • Loading Libraries - The first step is to load all the libraries needed to complete the tasks outlined in the introduction.
  • Loading Data - The project uses the Coursera-SwiftKey dataset listed under Prerequisites.
  • Summarizing Data Files - Create a basic overview of the data files: file names, file sizes, number of rows in each file, and word counts.
  • Creating a Data Sample - 1000 lines were sampled from each file, for a total sample size of 2000 lines.
  • Cleaning Data - Convert all text to lowercase and remove punctuation, numbers, extra whitespace and “english” stop words.
  • Creating the corresponding n-gram frequencies
  • Saving the n-grams as .rds files
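
The sampling, cleaning, counting and saving steps above could be sketched roughly as follows. This is only an illustration in base R, not the project's actual code: the toy input lines, the short `stop_words` list and the helper names (`sample_lines`, `clean_text`, `count_ngrams`) are assumptions made for the example, and the real project lists tm among its libraries for this kind of work.

```r
# Minimal sketch of sampling, cleaning, n-gram counting and saving.
# Assumption: toy in-memory lines stand in for the News and Twitter files.
set.seed(42)

news_lines    <- c("The market closed higher on Friday.",
                   "Officials said the new bridge will open in 2020.")
twitter_lines <- c("Can't wait for the weekend!!!",
                   "Thanks for the follow, have a great day")

# Sample up to n lines from one source (the slides mention 1000 per file).
sample_lines <- function(lines, n) sample(lines, min(n, length(lines)))

# Hypothetical short stop-word list; a real "english" list is much longer.
stop_words <- c("the", "a", "for", "in", "on", "will")

# Lowercase, strip punctuation and digits, drop stop words, squeeze spaces.
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]+", "", x)
  x <- gsub("[[:digit:]]+", "", x)
  tokens <- strsplit(trimws(x), "[[:space:]]+")
  tokens <- lapply(tokens, function(w) w[nzchar(w) & !(w %in% stop_words)])
  vapply(tokens, paste, character(1), collapse = " ")
}

# Count n-grams of order n and return them sorted by decreasing frequency.
count_ngrams <- function(texts, n) {
  grams <- unlist(lapply(strsplit(texts, " ", fixed = TRUE), function(w) {
    if (length(w) < n) return(character(0))
    starts <- seq_len(length(w) - n + 1)
    vapply(starts, function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  if (length(grams) == 0) {
    return(data.frame(ngram = character(0), freq = integer(0),
                      stringsAsFactors = FALSE))
  }
  freq <- sort(table(grams), decreasing = TRUE)
  data.frame(ngram = names(freq), freq = as.integer(freq),
             stringsAsFactors = FALSE)
}

sample_text <- c(sample_lines(news_lines, 1000),
                 sample_lines(twitter_lines, 1000))
cleaned <- clean_text(sample_text)

# One frequency table per n-gram order, from 1 to 4.
ngram_tables <- lapply(1:4, function(n) count_ngrams(cleaned, n))
names(ngram_tables) <- paste0("ngram", 1:4)

# Save each table as its own .rds file (a temp directory is used here).
out_dir <- tempdir()
for (name in names(ngram_tables)) {
  saveRDS(ngram_tables[[name]], file.path(out_dir, paste0(name, ".rds")))
}
```

Saving the tables once means a Shiny app can load them with `readRDS()` at start-up instead of recomputing the frequencies on every run.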

Slide 4

Algorithm

  • N-gram model used (1-grams to 4-grams)
  • If no match is found in any of the 4 n-gram tables, the algorithm indicates that the sample is too small
  • Stupid back-off strategy implemented
  • Important: the sample size was limited to 2000 lines because of the relatively limited processing power of the PC that was used
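
A back-off lookup of this kind might look like the sketch below. It is a simplified illustration under stated assumptions, not the project's code: the toy frequency tables, the function name `predict_next` and the discount `lambda = 0.4` (a value commonly used with stupid back-off) do not come from the slides. The sketch only uses the 2- to 4-gram tables, since the slides do not say how the 1-gram table enters the lookup, and it returns the best candidate at the first order that matches rather than scoring every candidate across all orders.

```r
# Sketch of a simplified back-off lookup over 4-, 3- and 2-gram tables.
# Assumption: toy tables keyed by n-gram order, split into prefix + word.
ngram_tables <- list(
  `4` = data.frame(prefix = "thanks for the", word = "follow", freq = 3,
                   stringsAsFactors = FALSE),
  `3` = data.frame(prefix = c("for the", "for the"),
                   word = c("follow", "weekend"), freq = c(3, 2),
                   stringsAsFactors = FALSE),
  `2` = data.frame(prefix = c("the", "the", "great"),
                   word = c("weekend", "follow", "day"), freq = c(5, 4, 6),
                   stringsAsFactors = FALSE)
)

predict_next <- function(phrase, tables = ngram_tables, lambda = 0.4) {
  words <- strsplit(tolower(trimws(phrase)), "[[:space:]]+")[[1]]
  words <- words[nzchar(words)]
  penalty <- 1
  for (n in 4:2) {
    k <- n - 1                          # context length for an n-gram
    if (length(words) < k) next         # phrase too short for this order
    prefix <- paste(tail(words, k), collapse = " ")
    hits <- tables[[as.character(n)]]
    hits <- hits[hits$prefix == prefix, ]
    if (nrow(hits) > 0) {
      best <- hits[which.max(hits$freq), ]
      # Relative frequency among continuations of this prefix, discounted
      # once for every order that was tried and failed.
      score <- penalty * best$freq / sum(hits$freq)
      return(list(word = best$word, score = score, order = n))
    }
    penalty <- penalty * lambda         # back off to a shorter context
  }
  list(word = NA_character_, score = 0, order = NA_integer_,
       message = "No match found: the sample is too small")
}

res_full    <- predict_next("thanks for the")   # matched by the 4-gram table
res_backoff <- predict_next("wait for the")     # falls back to the 3-gram table
res_none    <- predict_next("zebra")            # no match at any order
```

Because the score is multiplied by `lambda` at each failed order, a match from a shorter context is reported with lower confidence than a match from a longer one.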

Slide 5

Shiny App - How it works

  • The user enters a word or phrase into the app interface
  • The app passes the input to the prediction algorithm
  • The next word is predicted
  • The prediction searches the n-gram frequencies from the longest n-gram to the shortest
  • The prediction is displayed
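
The flow above maps onto a very small Shiny app. The sketch below is illustrative only and is not the deployed app: the input and output ids (`phrase`, `prediction`), the title, and the placeholder `predict_next_word` function with its two hard-coded phrases are all assumptions; in the real app the predictor would consult the saved n-gram tables.

```r
library(shiny)

# Hypothetical stand-in for the real predictor, which would look the
# phrase up in the saved n-gram tables.
predict_next_word <- function(phrase) {
  lookup <- c("thanks for the" = "follow", "have a great" = "day")
  key <- tolower(trimws(phrase))
  if (key %in% names(lookup)) lookup[[key]] else "(no match: sample too small)"
}

# Text box for the phrase, text output for the predicted word.
ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Enter a word or phrase:"),
  textOutput("prediction")
)

server <- function(input, output, session) {
  output$prediction <- renderText({
    req(input$phrase)                   # wait until something is typed
    predict_next_word(input$phrase)
  })
}

app <- shinyApp(ui = ui, server = server)

# Launch only in an interactive session, so the file can also be sourced.
if (interactive()) runApp(app)
```

Since `renderText()` is reactive, the displayed prediction updates whenever the text in the input box changes, with no submit button needed.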