DATA SCIENCE CAPSTONE PROJECT PRESENTATION
NIRAV A. DESAI
JULY 2, 2016
INTRODUCTION
- The goal of this project is to develop a Natural Language Processing (NLP) based text prediction algorithm and a data product that showcases this algorithm
- This course is the final one in a series of courses on Data Science taught by Professors Roger Peng, Brian Caffo and Jeff Leek at Johns Hopkins University
- The course project was done in association with SwiftKey, which makes text prediction software for mobile phones
BACKGROUND
- tm (Text Mining) is an R package that provides functions for Natural Language Processing
- An important first step in text mining with R is to create a corpus of documents for analysis
- After the corpus is created, we pre-process it with a standard set of techniques:
- Convert all words to lower case
- Map similar words together, such as walk, walks, walking (stemming)
- Remove profanity (swear words)
- The pre-processed corpus is then ready for text mining analysis (a minimal code sketch follows)
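A sketch of this pre-processing workflow with the tm package is shown below. The directory path and the profanity word list file are placeholders, not the exact files used in the project.

  library(tm)
  library(SnowballC)   # provides the stemmer used by stemDocument

  # Build a corpus from plain-text files in a directory (path is a placeholder)
  corpus <- VCorpus(DirSource("data/en_US", encoding = "UTF-8"))

  # Standard pre-processing steps
  corpus <- tm_map(corpus, content_transformer(tolower))   # convert to lower case
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, stemDocument)                   # walk, walks, walking -> walk
  profanity <- readLines("profanity.txt")                  # assumed word list
  corpus <- tm_map(corpus, removeWords, profanity)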
PARSING THE CORPUS
- The corpus is then parsed into bigrams (groups of 2 words), trigrams (groups of 3 words) and quadrigrams (groups of 4 words)
- The RWeka library can be used for parsing into n-grams
- The n-grams are ranked by their frequencies, which serve as a measure of importance
- The most frequent n-grams are ordered first (sketched in code below)
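As an illustration, bigrams can be extracted and ranked by frequency with RWeka's NGramTokenizer. The corpus object is assumed to come from the pre-processing step above; this is a sketch under those assumptions, not the exact project code.

  library(tm)
  library(RWeka)

  # Tokenizer that produces bigrams; set min/max to 3 or 4 for tri-/quadrigrams
  BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

  # Term-document matrix of bigram counts over the pre-processed corpus
  tdm <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))

  # Rank bigrams by total frequency, most frequent first
  freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  head(freq)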
TEXT PREDICTION ALGORITHM
- The text prediction algorithm is based on building a vocabulary of trigrams and quadrigrams
- The parsed n-grams are arranged in descending order of their frequencies
- They are then split into 2 parts:
- The last word of the n-gram becomes the next (predicted) word
- The n-gram minus its last word becomes the given n-gram
- Input from the user is parsed using the same pre-processing steps to generate given n-grams
- The given n-grams are compared against the dictionary
- The first match (the one with the highest frequency) is returned as the matched n-gram
- The corresponding next word becomes the predicted word (sketched in code below)
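A minimal sketch of how such a lookup table could be built and queried is shown below. The ngram_freq input, its column names, and the clean_input() helper are assumptions for illustration, not the project's actual implementation.

  # ngram_freq: data frame with columns 'ngram' and 'freq',
  # already sorted in descending order of frequency (assumed input)
  split_ngrams <- function(ngram_freq) {
    words <- strsplit(ngram_freq$ngram, " ")
    data.frame(
      given     = sapply(words, function(w) paste(head(w, -1), collapse = " ")),
      predicted = sapply(words, function(w) tail(w, 1)),
      freq      = ngram_freq$freq,
      stringsAsFactors = FALSE
    )
  }

  predict_next_word <- function(input, dictionary) {
    # clean_input() stands in for the same pre-processing applied to the corpus
    given <- clean_input(input)
    matches <- dictionary[dictionary$given == given, ]
    if (nrow(matches) == 0) return(NA)   # no match found in the dictionary
    matches$predicted[1]                 # rows are sorted by frequency, so the first match wins
  }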