Data Science Capstone Project

Venkat Sri (vesr)
Feb 2018

Predicting Next Word(s) Project

Peer Graded Assignment

Overview

Scope and objective of this project for the Coursera Data Science Specialization.

The objective of the application is to implement a model that suggests the next word(s) for the phrase or text entered by the user. The input consists of three datasets (Twitter, news and blogs) from HC Corpora. The data was cleaned, and a subset is used as sample data in R data frames. A back-off algorithm is used for prediction, complemented by NLP techniques to create the n-grams. The UI layer was developed with the Shiny package, with additional libraries (such as DT, JavaScript and HTML rendering) to enhance the user experience.

Approach and Solution Steps

Here are the key steps taken to define, design and develop the application, based on the three data sources made available through SwiftKey.

Multiple tasks have been performed:

Defining the problem, downloading and cleaning the data;
Performing exploratory data analysis to understand the data;
Tokenizing words and predictive text mining;
Writing a milestone report and building a prediction model;
Developing a Shiny application;
Developing supporting documentation and the presentation.

Process: Input, Methods, and Output

Input: The data came from HC Corpora in three files (Blogs, News and Twitter). A sample was drawn from this large data set. The sample was converted to lower case, and punctuation, links, extra whitespace, numbers and profane words were removed.
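The cleaning steps described above can be sketched in base R roughly as follows. This is a minimal illustration, not the project's actual code: the function name clean_text, the profanity argument and the order of the steps are assumptions, and punctuation is simply deleted rather than replaced by a space.

```r
# Hypothetical helper: clean_text() and its arguments are illustrative only
# and are not taken from the project's code.
clean_text <- function(x, profanity = character(0)) {
  x <- tolower(x)
  # Remove links first, while the punctuation inside URLs is still intact.
  x <- gsub("(https?://|www\\.)[^[:space:]]+", " ", x)
  # Replace runs of digits with a space, then delete punctuation.
  x <- gsub("[[:digit:]]+", " ", x)
  x <- gsub("[[:punct:]]+", "", x)
  # Remove profane words as whole words only. The list is assumed to hold
  # plain words without regular-expression metacharacters.
  if (length(profanity) > 0) {
    pattern <- paste0("\\b(", paste(profanity, collapse = "|"), ")\\b")
    x <- gsub(pattern, " ", x, perl = TRUE)
  }
  # Collapse repeated whitespace and trim both ends.
  x <- gsub("[[:space:]]+", " ", x)
  trimws(x)
}

cleaned <- clean_text("Check THIS out: http://example.com/a?b=1 it's 100% darn good!!",
                      profanity = "darn")
print(cleaned)
```

Removing links before punctuation matters here: once the slashes and dots are gone, a URL can no longer be recognized as one.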

Model: The sample text was tokenized* into n-grams** to construct the predictive models (* Tokenization is the process of breaking a stream of text up into words or phrases. ** An n-gram is a contiguous sequence of n items from a given sequence of text). The final data (RDS) was created as described in the linked Milestone Report.

Output: A Shiny application was created to accept the input text and use the model to predict the next word. The results are displayed in multiple tabs for better organization.
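As an illustration of tokenization into n-grams, here is a small base R sketch. The helper names make_ngrams and ngram_table are hypothetical, and whitespace splitting stands in for whatever tokenizer the project used, which this document does not show.

```r
# Hypothetical helpers; the project's own tokenizer is not shown in this document.

# Split one line of cleaned text into words and return all n-grams of size n.
make_ngrams <- function(text, n) {
  words <- strsplit(text, "[[:space:]]+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Count how often each n-gram occurs across several lines.
ngram_table <- function(lines, n) {
  grams <- unlist(lapply(lines, make_ngrams, n = n))
  if (length(grams) == 0) {
    return(data.frame(ngram = character(0), freq = integer(0),
                      stringsAsFactors = FALSE))
  }
  counts <- sort(table(grams), decreasing = TRUE)
  data.frame(ngram = names(counts), freq = as.integer(counts),
             stringsAsFactors = FALSE)
}

bigrams <- ngram_table(c("a case of beer", "a case of the blues"), n = 2)
print(bigrams)
```

A frequency table like this could then be saved with saveRDS() and loaded by the application, in line with the RDS file mentioned above.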

Application: Steps to execute the program

Next Word Prediction Application

#1. Enter the word(s) or text and click Predict.
#2. Tab 1: Word Prediction Result displays the top 6 (configurable) predictions based on the n-gram model.
#3. Tab 2: Behind the Scenes displays the actual processing steps, time taken, possible words, etc.
#4. Tab 3: Documentation and References (links to docs).
#5. Possible next word: based on the frequency and probability.
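
To make the back-off idea concrete, the sketch below shows one simple way a top-N prediction could work: look up the longest available context first and fall back to shorter contexts when nothing matches. It is an assumption-laden illustration, not the application's implementation. The names build_model and predict_next, the data layout, and the use of relative frequency within the matched context as the "probability" are all hypothetical.

```r
# Hypothetical sketch; function names, columns and scoring are illustrative only.

# Store every n-gram (up to max_n words) as a context plus the word that follows.
build_model <- function(lines, max_n = 3) {
  context <- character(0)
  word <- character(0)
  for (line in lines) {
    w <- strsplit(line, "[[:space:]]+")[[1]]
    w <- w[nzchar(w)]
    for (n in seq_len(max_n)) {
      if (length(w) < n) next
      for (i in seq_len(length(w) - n + 1)) {
        # The first n - 1 words form the context; the n-th is the word to predict.
        # For n = 1 the context is the empty string (unigram fallback).
        context <- c(context, paste(w[i + seq_len(n - 1) - 1], collapse = " "))
        word <- c(word, w[i + n - 1])
      }
    }
  }
  data.frame(context = context, word = word, stringsAsFactors = FALSE)
}

# Return up to `top` candidate next words for a phrase.
predict_next <- function(model, phrase, top = 6, max_n = 3) {
  w <- strsplit(tolower(phrase), "[[:space:]]+")[[1]]
  w <- w[nzchar(w)]
  # Try the longest usable context, then back off one word at a time.
  for (k in seq(min(length(w), max_n - 1), 0)) {
    ctx <- paste(tail(w, k), collapse = " ")
    hits <- model$word[model$context == ctx]
    if (length(hits) > 0) {
      counts <- sort(table(hits), decreasing = TRUE)
      result <- data.frame(word = names(counts),
                           freq = as.integer(counts),
                           prob = as.numeric(counts) / length(hits),
                           ngram_order = k + 1,
                           stringsAsFactors = FALSE)
      return(head(result, top))
    }
  }
  data.frame(word = character(0), freq = integer(0),
             prob = numeric(0), ngram_order = numeric(0))
}

model <- build_model(c("you made my day", "you made my day",
                       "you made my week", "a case of beer"))
res <- predict_next(model, "You made my")
print(res)
```

The ngram_order column records which n-gram level produced the answer, which is the kind of detail a "Behind the Scenes" tab could display. Production back-off schemes usually discount scores from lower-order n-grams; this sketch does not.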

Viewing the Shiny App and Links

Sample Data (These phrases are picked from Quiz 2 and Quiz 3)

> Test Data
- You made (**my day**)
- and a case of (**the / beer**)
- make me the (**happiest**)

LINKS

Application: Shiny App (core deliverable of this project)

GitHub: repository with the code for this application

Exploratory Analysis: link to the Milestone Report

Data Store: link to the data used for this project