Text Prediction - N Gram

Dzaringana Mate
2016-01-23

Overview

https://dmatedatasciencecapstone.shinyapps.io/D_MATECAPSTONE/

Introduction

This presentation serves as an introduction to an application for the capstone project of the Coursera Data Science specialization by Johns Hopkins University in cooperation with SwiftKey.

The application was designed with the following goals in mind:

Satisfy the requirement of predicting the next word given an input n-gram
Ease of use
Exploring different prediction algorithms

User Interface

The main layout elements are a sidebar panel with prediction algorithm controls (which mutate based on selections to keep only relevant controls visible) and a main content panel with tabs for

'Prediction' (with input and output elements)
'Instructions' documenting the UI and prediction algorithms; see this tab for more detail than fits in this presentation.
'Bibliography' recording various sources I found useful
'About' containing authoring and versioning details.

About the Capstone Project

The capstone project is designed to let students create a usable, public data product that demonstrates their skills to potential employers. The project's data is drawn from real-world text. The goal of this exercise is to build a prediction algorithm and expose it through an app interface that others can use easily. The training data from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip was cleaned (converted to a canonical case, with numbers and punctuation removed) and {2,3,4}-grams were formed for use in prediction. The best match (ordered by n-gram length and then by prevalence) is used to predict the next word following the input n-gram.
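The "best match" rule described above can be sketched as follows. This is only an illustration in Python (the application itself is built in R), and the `NGRAM_COUNTS` table and the `predict_next` function are hypothetical names with toy data, not the application's actual code. The sketch tries the longest context first and, among n-grams of that length, picks the most frequent continuation.

```python
from collections import Counter

# Hypothetical toy n-gram counts; the real app derives its tables
# from the SwiftKey training corpus.
NGRAM_COUNTS = Counter({
    ("thanks", "for"): 5,
    ("for", "the"): 9,
    ("for", "your"): 4,
    ("thanks", "for", "the"): 3,
    ("thanks", "for", "your"): 2,
    ("many", "thanks", "for", "your"): 2,
})


def predict_next(words, counts=NGRAM_COUNTS, max_n=4):
    """Return the most frequent continuation of the longest matching context."""
    # Longest context first: 3 words for a 4-gram, then 2, then 1.
    for context_len in range(min(max_n - 1, len(words)), 0, -1):
        context = tuple(words[-context_len:])
        # Collect every n-gram whose leading words equal the context.
        candidates = {
            ngram[-1]: freq
            for ngram, freq in counts.items()
            if len(ngram) == context_len + 1 and ngram[:-1] == context
        }
        if candidates:
            # Within one n-gram length, prevalence decides.
            return max(candidates, key=candidates.get)
    return None  # no n-gram matches the input
```

With this toy table, `predict_next(["many", "thanks", "for"])` is decided by the single matching 4-gram, while `predict_next(["thanks", "for"])` falls to the 3-grams and picks the more frequent one.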

The Objective

The goal of the project is to build an application using real-world data to take a string of words and predict the next word.

The basis of the prediction algorithm is a set of three documents (corpus) containing text from blogs, news articles and tweets.

The data used to develop the dictionary for predicting the next word comes from the HC Corpora corpus.

For our corpora we used the following three files: en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt.

Data Analysis and Manipulation

After the corpus was created from the HC Corpora data, the analysis showed that data cleaning is necessary for the prediction algorithm to achieve a high success rate.

The sample data was transformed by removing extra whitespace, numbers, punctuation, and profanity, and by converting the text to lower case. R's natural language processing functions and techniques, chiefly the “tm” package, were used to process the data.
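The cleaning steps listed above can be illustrated with a short sketch. It is written in Python for illustration only and is not the “tm” pipeline used by the application. The `clean_text` function and the `PROFANITY` set are hypothetical; the source does not say which profanity list was used, so the set below is a mild placeholder.

```python
import re

# Hypothetical placeholder list; the actual profanity list is not
# given in the source.
PROFANITY = {"darn", "heck"}


def clean_text(text, banned=PROFANITY):
    """Lower-case text and strip numbers, punctuation, profanity, extra whitespace."""
    text = text.lower()                      # canonical case
    text = re.sub(r"\d+", "", text)          # remove numbers
    text = re.sub(r"[^\w\s]|_", "", text)    # remove punctuation
    # split() with no argument also collapses runs of whitespace
    words = [w for w in text.split() if w not in banned]
    return " ".join(words)
```

For example, `clean_text("Well, DARN it!  I have 2 cats.")` gives `"well it i have cats"`.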

The resulting dataset was split into three n-gram files: unigrams, bigrams, and trigrams.
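As a rough illustration of how such n-gram tables can be formed from cleaned text, the sketch below counts n-grams line by line. It is written in Python rather than R, and `ngrams` and `count_ngrams` are hypothetical helper names, not functions from the application.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(lines, n):
    """Count n-grams over already-cleaned lines, never crossing line boundaries."""
    counts = Counter()
    for line in lines:
        counts.update(ngrams(line.split(), n))
    return counts
```

Calling `count_ngrams(lines, 1)`, `count_ngrams(lines, 2)`, and `count_ngrams(lines, 3)` on the same cleaned lines yields the unigram, bigram, and trigram tables respectively.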

The Application

The application was designed to predict English words from English text. Its interactive interface refreshes the predicted word as text is entered.

To use the application, simply type in a word, phrase, or sentence. The app will show the top predicted next word. The user can enter additional words or change their entry, and the app will respond to the new input.

The application is hosted on shinyapps.io; to access it, use the link given in the Overview section.