Coursera Data Science Specialization Capstone Project

Kadri Umay on 5/18/2016

This short presentation provides brief information on the application for predicting the next word.

The Objective

The main objective of this capstone project is to build a Shiny application that predicts the next word(s) as the user enters a sentence, similar to the SwiftKey smart keyboard application.

Text data from the HC Corpora corpus, provided by SwiftKey, is used to create n-grams. The English (US) portion of the corpora is used in this project; it consists of three huge files of blog, news, and Twitter data.

All text mining and natural language processing was done mainly with the quanteda R package; tm was very slow, and initial tests to create the document-feature matrix (dfm) took almost a day. The data is not as clean as expected, so a lot of manual cleaning was done. N-grams are loaded into a Microsoft SQL Server database and read back using RODBC. The Shiny version of the application uses RSQLite.

Applied Methods & Models

After loading the three text files from the HC Corpora data, the text was first cleaned by converting to lowercase and removing punctuation, links, white space, numbers, and all kinds of special characters. Profanity is removed using a word list. Document-feature matrices for 2- to 4-grams were then generated with quanteda, as sketched below.
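A minimal sketch of this step, assuming the current quanteda tokens API (the 2016 API differed slightly), an illustrative file path, and a hypothetical profanity_words vector loaded elsewhere:

```r
library(quanteda)

# Read one of the corpus files (path is illustrative)
lines <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)

# Clean: lowercase, then drop punctuation, numbers, symbols, and URLs
toks <- tokens(char_tolower(lines),
               remove_punct   = TRUE,
               remove_numbers = TRUE,
               remove_symbols = TRUE,
               remove_url     = TRUE)

# Remove profanity (profanity_words is an assumed character vector)
toks <- tokens_remove(toks, profanity_words)

# Build 2- to 4-grams and aggregate them into a dfm
ngram_dfm <- dfm(tokens_ngrams(toks, n = 2:4, concatenator = " "))

# Total corpus-wide frequency of each n-gram
freqs <- colSums(ngram_dfm)
```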

Those aggregated bi-, tri-, and quadgram term-frequency matrices were transferred into frequency dictionaries on a Microsoft SQL Server for additional cleansing; in particular, there were lots of incomplete words and n-grams with non-ASCII characters.
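A minimal sketch of the transfer with RODBC, assuming the freqs vector from the previous sketch and a hypothetical DSN NgramDB and table ngram_freq:

```r
library(RODBC)

ch <- odbcConnect("NgramDB")  # assumed ODBC data source name

# Push the aggregated frequencies into SQL Server for cleansing
ngram_df <- data.frame(ngram = names(freqs),
                       freq  = as.integer(freqs),
                       stringsAsFactors = FALSE)
sqlSave(ch, ngram_df, tablename = "ngram_freq", rownames = FALSE)

# Illustrative cleansing pass: delete n-grams containing non-ASCII characters
sqlQuery(ch, "DELETE FROM ngram_freq WHERE ngram LIKE '%[^ -~]%'")

odbcClose(ch)
```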

Due to space and performance limitations, only the n-grams with more than 10 occurrences are kept; these are loaded into an embedded SQLite database and queried using the RSQLite package.
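A minimal sketch of the lookup, assuming a hypothetical schema ngram_freq(prefix, word, freq) in a file named ngrams.db:

```r
library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), "ngrams.db")

# Fetch the most frequent continuations of a given (n-1)-gram prefix
dbGetQuery(con,
  "SELECT word, freq FROM ngram_freq
   WHERE prefix = ? ORDER BY freq DESC LIMIT 3",
  params = list("thank you"))

dbDisconnect(con)
```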

Prediction Algorithm

A Markov chain is a random process that undergoes transitions from one state to another on a state space. It must possess a property that is usually characterized as "memorylessness": the next state depends only on the current state, not on the full history.
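In n-gram terms, this assumption means the next word depends only on the last few words; for a 4-gram model, for example:

$$P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-3}, w_{i-2}, w_{i-1})$$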

Katz back-off is a generative n-gram language model that estimates the conditional probability of a word given its history in the n-gram. It accomplishes this estimation by "backing off" to models with smaller histories under certain conditions.
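A minimal sketch of the back-off idea (simplified to frequency-based back-off rather than full Katz discounting), assuming a hypothetical lookup_ngram(prefix) helper such as the RSQLite query above:

```r
predict_next <- function(words, n_pred = 3) {
  if (length(words) == 0) return("the")  # no history: common unigram fallback
  # Try the longest available history first: 3 words, then 2, then 1
  for (k in min(3, length(words)):1) {
    prefix <- paste(tail(words, k), collapse = " ")
    hits <- lookup_ngram(prefix)  # assumed: returns words ordered by frequency
    if (length(hits) > 0) return(head(hits, n_pred))
  }
  "the"  # nothing matched at any order
}
```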

Usage of the Application

The application is designed using a very simple Shiny template. The user enters a sentence and the number of words to be predicted, and the application predicts the next words. A simple random sentence generator is included to provide a fun way of testing.
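A minimal sketch of the Shiny wiring, assuming the predict_next() helper above (input names are illustrative):

```r
library(shiny)

ui <- fluidPage(
  textInput("sentence", "Enter a sentence:"),
  numericInput("n", "Number of words to predict:", value = 3, min = 1, max = 5),
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    # Split the input sentence into lowercase words and predict continuations
    words <- strsplit(tolower(input$sentence), "\\s+")[[1]]
    paste(predict_next(words, input$n), collapse = " | ")
  })
}

shinyApp(ui, server)
```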