Data Science Specialization

Frantz Moudoute
2018-02-13

This presentation briefly pitches an application that predicts the next word a user is likely to type.

The application is the capstone project for the Data Science Specialization offered by Johns Hopkins University, in cooperation with SwiftKey, on the Coursera online education platform.

Objective

The primary objective is to build a Shiny application that takes a phrase (multiple words) typed into a text box and outputs a prediction of the next word.

The following steps are necessary to reach the prediction data product: data extraction, cleansing, exploratory analysis, creation of a predictive model, and integration into a data product.

The data are texts used to build frequency dictionaries, from which the next word is predicted. They come from a corpus called HC Corpora.

Our work relies on standard R packages: stringr, dplyr, tidytext, ggplot2, stringi, RWeka, R.utils, tm, qdap and wordcloud.

Process

The data are initially stored in a text file; we read them with the readLines function through a connection (its con argument). They are then cleaned by converting to lowercase and removing punctuation, links, extra white space, numbers and stop words.
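The cleaning steps above can be sketched as follows. The application itself is written in R; this sketch uses Python purely for illustration. The function name clean_line, the regular expressions and the tiny stop-word list are all assumptions, not the app's actual code.

```python
import re

# Hypothetical, deliberately tiny stop-word list; the app's real list is not given.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in"}

def clean_line(line):
    """Lowercase a line, then drop links, numbers, punctuation, stop words and extra spaces."""
    text = line.lower()
    text = re.sub(r"(https?://|www\.)\S+", " ", text)  # links
    text = re.sub(r"\d+", " ", text)                   # numbers
    text = re.sub(r"[^a-z'\s]", " ", text)             # anything but a-z, apostrophe, space
    words = [w.strip("'") for w in text.split()]       # split() also collapses white space
    return " ".join(w for w in words if w and w not in STOP_WORDS)

cleaned = clean_line("Visit https://example.com in 2018, and READ the   docs!")
print(cleaned)  # visit read docs
```

Links are removed before punctuation so that a URL is dropped as a whole rather than being split into stray words.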

After cleansing, the data are tokenized using the NGramTokenizer function from the RWeka package, with a Weka_control setting for the order (n) of the n-grams that we wish to build.
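To illustrate what n-gram tokenization produces, here is a minimal sketch in Python (the application itself uses RWeka in R). The helper name ngrams is hypothetical, and it only approximates the idea of the tokenizer: it slides a window of n words over the cleaned text.

```python
def ngrams(text, n):
    """Return every run of n consecutive words in text, joined by single spaces."""
    words = text.split()
    # When the text has fewer than n words the range is empty and so is the result.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

bigrams = ngrams("thanks for the follow", 2)
trigrams = ngrams("thanks for the follow", 3)
print(bigrams)   # ['thanks for', 'for the', 'the follow']
print(trigrams)  # ['thanks for the', 'for the follow']
```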

The n-gram tables are then saved as RDS files, and four functions are created alongside them. These functions act as a router to the right n-gram dataset depending on the number of words entered by the user.
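One possible shape for the frequency tables and the routing step is sketched below, in Python for illustration only (the app is written in R and stores its tables as RDS files, a step omitted here). The names build_table, TABLES and route, the toy corpus and the cap of three context words are assumptions.

```python
from collections import Counter, defaultdict

def build_table(lines, n):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for line in lines:
        words = line.split()
        for i in range(len(words) - n + 1):
            context = " ".join(words[i:i + n - 1])
            table[context][words[i + n - 1]] += 1
    return table

# Hypothetical toy corpus standing in for the cleaned data.
corpus = ["thanks for the follow", "thanks for the help", "thanks for reading"]

# One table per n-gram order, keyed by the number of context words (1 to 3).
TABLES = {n - 1: build_table(corpus, n) for n in (2, 3, 4)}

def route(phrase):
    """Pick the table matching the number of input words, capped at 3 context words."""
    words = phrase.lower().split()
    k = min(len(words), max(TABLES))
    context = " ".join(words[-k:]) if k else ""
    return k, TABLES.get(k, {}).get(context, Counter())

order, candidates = route("thanks for")
print(order, candidates.most_common(1))  # 2 [('the', 2)]
```

Keying the tables by context length keeps the routing decision to a single dictionary lookup once the input has been split into words.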

Each function is equipped with an automatic back-off: if the system cannot find a suggestion at a given n-gram level, it falls back to a lower-order n-gram.
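The fallback behaviour can be sketched as a simple loop from the longest available context down to a single word. This is an illustrative Python sketch, not the app's R code; the toy TABLES, the function name predict and its max_suggestions parameter are assumptions.

```python
from collections import Counter

# Hypothetical toy tables: number of context words -> context -> next-word counts.
TABLES = {
    2: {"thanks for": Counter({"the": 2, "reading": 1})},
    1: {
        "for": Counter({"the": 2, "reading": 1}),
        "the": Counter({"follow": 1, "help": 1}),
    },
}

def predict(phrase, max_suggestions=3):
    """Try the longest context first and shorten it until a table has a match."""
    words = phrase.lower().split()
    for k in range(min(len(words), max(TABLES)), 0, -1):
        context = " ".join(words[-k:])
        counts = TABLES.get(k, {}).get(context)
        if counts:
            return [word for word, _ in counts.most_common(max_suggestions)]
    return []  # nothing matched at any level

direct = predict("thanks for")             # found at the 2-word level
backed_off = predict("many thanks to the")  # "to the" is unknown, falls back to "the"
```

Returning an empty list when every level misses leaves it to the interface to decide what to show in that case.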

Application

The user can access the application from a desktop or mobile device and enter one to n words.

The user enters the text in the text area.

A “text output” area on the right-hand side confirms the user's input.

The field below it sets the maximum number of suggestions that the system will offer (currently limited to 3).

Finally, the lower section contains a table listing the “NEXT WORD” suggestions.

Additional Information

The next word prediction app is hosted on shinyapps.io: https://fmoudoute.shinyapps.io/Capstone/

Learn more about the Coursera Data Science Specialization: https://www.coursera.org/specializations/jhu-data-science

My other project: https://fmoudoute.shinyapps.io/PollutionUltimate/