DATA SCIENCE CAPSTONE PROJECT PRESENTATION
NIRAV A. DESAI
JULY 2, 2016
INTRODUCTION
- The goal of this project is to develop a Natural Language Processing (NLP) based text prediction algorithm and a data product that showcases this algorithm
- This course is the final one in a series of courses on Data Science taught by Professors Roger Peng, Brian Caffo and Jeff Leek at Johns Hopkins University
- The course project was done in association with SwiftKey, which makes text prediction software for mobile phones
BACKGROUND
- tm (Text Mining) is an R package that provides functions for Natural Language Processing
- An important first step in text mining with R is to create a corpus of documents for analysis
- After the corpus is created, we pre-process it with a standard set of techniques:
- Convert all words to lower case
- Map similar words together, such as walk, walks, walking (stemming)
- Remove profanity (swear words)
- The pre-processed corpus is then ready for text mining analysis (a minimal code sketch follows)
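A sketch of this pre-processing workflow with the tm package is shown below. The directory path and the profanity word list file are placeholders, not the exact files used in the project.

  library(tm)
  library(SnowballC)   # provides the stemmer used by stemDocument

  # Build a corpus from plain-text files in a directory (path is a placeholder)
  corpus <- VCorpus(DirSource("data/en_US", encoding = "UTF-8"))

  # Standard pre-processing steps
  corpus <- tm_map(corpus, content_transformer(tolower))   # convert to lower case
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, stemDocument)                   # walk, walks, walking -> walk
  profanity <- readLines("profanity.txt")                  # assumed word list
  corpus <- tm_map(corpus, removeWords, profanity)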
PARSING THE CORPUS
- The corpus is then parsed into bigrams (groups of 2 words), trigrams (groups of 3 words) and quadrigrams (groups of 4 words)
- The RWeka library can be used for parsing into n-grams
- The n-grams are ranked by their frequencies, which serve as a measure of importance
- The most frequent n-grams are ordered first (sketched in code below)
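As an illustration, bigrams can be extracted and ranked by frequency with RWeka's NGramTokenizer. The corpus object is assumed to come from the pre-processing step above; this is a sketch under those assumptions, not the exact project code.

  library(tm)
  library(RWeka)

  # Tokenizer that produces bigrams; set min/max to 3 or 4 for tri-/quadrigrams
  BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

  # Term-document matrix of bigram counts over the pre-processed corpus
  tdm <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))

  # Rank bigrams by total frequency, most frequent first
  freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  head(freq)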
TEXT PREDICTION ALGORITHM
- The text prediction algorithm is based on building a vocabulary of trigrams and quadrigrams
- The parsed n-grams are arranged in descending order of their frequencies
- They are then split into 2 parts:
- The last word of the n-gram becomes the next (predicted) word
- The n-gram minus its last word becomes the given n-gram
- Input from the user is parsed using the same pre-processing steps to generate given n-grams
- The given n-grams are compared against the dictionary
- The first match (the one with the highest frequency) is returned as the matched n-gram
- The corresponding next word becomes the predicted word (sketched in code below)
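A minimal sketch of how such a lookup table could be built and queried is shown below. The ngram_freq input, its column names, and the clean_input() helper are assumptions for illustration, not the project's actual implementation.

  # ngram_freq: data frame with columns 'ngram' and 'freq',
  # already sorted in descending order of frequency (assumed input)
  split_ngrams <- function(ngram_freq) {
    words <- strsplit(ngram_freq$ngram, " ")
    data.frame(
      given     = sapply(words, function(w) paste(head(w, -1), collapse = " ")),
      predicted = sapply(words, function(w) tail(w, 1)),
      freq      = ngram_freq$freq,
      stringsAsFactors = FALSE
    )
  }

  predict_next_word <- function(input, dictionary) {
    # clean_input() stands in for the same pre-processing applied to the corpus
    given <- clean_input(input)
    matches <- dictionary[dictionary$given == given, ]
    if (nrow(matches) == 0) return(NA)   # no match found in the dictionary
    matches$predicted[1]                 # rows are sorted by frequency, so the first match wins
  }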