fp2

Alon Gur-Arie
03/02/2018

Coursera Data Science Capstone Project Shiny Application

This presentation briefly describes a Shiny application that predicts the next word of a text as it is typed. The application is the final part of the capstone project for the Coursera Data Science Specialization taught by professors from Johns Hopkins University. The project was carried out in cooperation with SwiftKey.

SwiftKey, Bloomberg & Coursera Logo

Project Objective

The main goal of this capstone project is to build a Shiny application that predicts the next word of an interactively typed text. The prediction is based on processing a very large body of text, which can be found at HC Corpora.

The entire cleaned version of the corpus is used in creation of the data used as the base of this application.

This exercise was divided into several subtasks:

1. Comprehensive data preparation
2. Exploratory analysis of the text corpus
3. Construction of a predictive model
4. Development of an interactive Shiny app based on the model
5. Several iterations of additional data tweaking
6. Optimization of the application
7. Creation and presentation of a pitch presentation

Models Used and Methods Applied During the Project

I drew a large sample from the HC Corpora corpus. This sample was prepared by converting it to lowercase and removing punctuation marks, links, extra white space, numbers and other kinds of special characters.
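The cleaning steps can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the code behind the app; the helper name `clean_text`, the regular expressions, and the choice to delete apostrophes (so that contractions stay a single token) are all assumptions.

```python
import re

def clean_text(text: str) -> str:
    """Hypothetical helper: normalize one line of raw corpus text."""
    text = text.lower()                                 # convert to lowercase
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # drop links
    text = re.sub(r"['’]", "", text)                    # assumption: "it's" becomes "its"
    text = re.sub(r"[^a-z\s]", " ", text)               # drop numbers, punctuation, special characters
    text = re.sub(r"\s+", " ", text)                    # collapse runs of white space
    return text.strip()

cleaned = clean_text("Visit https://example.com NOW!!  It's 2018...")
print(cleaned)
```

Note that the `[^a-z\s]` pattern also discards accented and other non-ASCII letters, which may or may not be desirable for a given corpus.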

This data sample was then tokenized into n-grams. The 2-gram, 3-gram and 4-gram term frequency matrices were processed into frequency dictionaries. Memory restrictions imposed by the current version of Shiny prevented me from working with text units larger than 4-grams.
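As a rough illustration of what such frequency dictionaries look like, the sketch below counts 2-, 3- and 4-grams over a toy token list. It is Python rather than the code used for the app; `ngram_counts`, the toy sentence, and the use of `collections.Counter` as the dictionary type are assumptions.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Hypothetical helper: count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Toy input standing in for the cleaned corpus sample.
tokens = "the cat sat on the mat and the cat ran".split()

# One frequency dictionary per n-gram order, keyed by a tuple of words.
tables = {n: ngram_counts(tokens, n) for n in (2, 3, 4)}

print(tables[2].most_common(1))
```

Each additional order adds another table with at least as many distinct keys on real text, which is why memory grows quickly as n increases.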

The next word is predicted by running the Stupid Backoff algorithm on the n-gram frequency tables to rank the candidate words and display the highest-ranked one. The following paper discusses the Stupid Backoff word prediction method comprehensively: Stupid Backoff
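The idea behind Stupid Backoff can be sketched as follows: a candidate word is scored by its relative frequency after the longest matching context, and every time the context has to be shortened the score is multiplied by a fixed factor. The sketch is in Python and is not the app's code. The names `build_counts` and `predict_next` are hypothetical, the factor 0.4 is the value suggested in the Stupid Backoff paper and is assumed here, and unigram counts are included only to serve as denominators for the 2-gram scores.

```python
from collections import Counter

ALPHA = 0.4  # assumed backoff factor (the value suggested in the Stupid Backoff paper)

def build_counts(tokens, max_n=4):
    """Hypothetical helper: counts of all 1- to max_n-grams, keyed by word tuples."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def predict_next(counts, context, max_n=4):
    """Return the highest-scoring next word, or None if no n-gram matches."""
    # Keep at most max_n - 1 words of context.
    context = tuple(context)[-(max_n - 1):] if max_n > 1 else ()
    scores = {}
    penalty = 1.0
    while context:
        ctx_count = counts.get(context, 0)
        if ctx_count:
            # Linear scan for clarity; a real app would index n-grams by context.
            for gram, c in counts.items():
                if len(gram) == len(context) + 1 and gram[:-1] == context:
                    # A word keeps the score from the longest context it was seen with.
                    scores.setdefault(gram[-1], penalty * c / ctx_count)
        context = context[1:]   # back off to a shorter context
        penalty *= ALPHA        # and discount everything found from here on
    return max(scores, key=scores.get) if scores else None

tokens = "i love new york i love new jersey i love new york".split()
counts = build_counts(tokens)
print(predict_next(counts, ["i", "love", "new"]))
```

This sketch returns `None` when even the last word of the context is unknown; falling back to the most frequent single word would be one way to always show a suggestion.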

Basic Usage of The Application

The user interface of this application was designed with mobile applications in mind. While the user enters the text (1), the field with the predicted next word (2) refreshes automatically and instantaneously, and the whole text input (3) is also updated automatically.

Application Screenshot

Additional Information