COURSERA - JOHNS HOPKINS UNIVERSITY - SWIFTKEY DATA SCIENCE SPECIALIZATION CAPSTONE PROJECT

DATA SCIENCE CAPSTONE PROJECT PRESENTATION

NIRAV A. DESAI
JULY 2, 2016

INTRODUCTION

  • The goal of this project is to develop a Natural Language Processing based Text Prediction Algorithm and a data product that showcases this algorithm
  • The course is the final one in a series of Data Science courses taught by Professors Roger Peng, Brian Caffo and Jeff Leek at Johns Hopkins University
  • The course project was done in association with SwiftKey, which makes text prediction software for mobile phones

BACKGROUND

  1. tm (Text Mining) is a library of functions available in the R language for the purpose of Natural Language Processing
  2. An important first step in text mining with R is to create a corpus of documents for the analysis
  3. After the corpus is created, we pre-process the corpus with a standard set of techniques
    • Convert all words to lower case
    • Map similar words together, such as walk, walks, walking (stemming)
    • Remove swearwords
  4. The pre-processed corpus is then ready for text mining analysis
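
The pre-processing steps above can be sketched in a few lines. This is a minimal illustration in plain Python, not the tm calls used in the project. The blocked-word list and the `crude_stem` helper are hypothetical stand-ins: a real pipeline would load a full profanity list and use a proper stemmer such as Porter's.

```python
import re

# Hypothetical placeholder list; a real project would load a full profanity list.
BLOCKED_WORDS = {"darn", "heck"}

def crude_stem(word):
    """Toy stemmer that strips a few common suffixes. Not the Porter stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, tokenize, drop blocked words, and stem each remaining token."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [crude_stem(t) for t in tokens if t not in BLOCKED_WORDS]

print(preprocess("Walking walks WALK darn"))  # ['walk', 'walk', 'walk']
```

All three forms of "walk" collapse to the same token, which is what lets later frequency counts treat them as one word.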

PARSING THE CORPUS

  1. The corpus is then parsed using bigrams (groups of 2 words), trigrams (groups of 3 words) and quadrigrams (groups of 4 words)
  2. The RWeka library can be used for parsing into n-grams
  3. The n-grams are ranked by frequency, which serves as the measure of their importance
  4. The most frequent n-grams are placed first
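
As a rough sketch of the parsing and ranking described above, the following plain-Python example counts n-grams and orders them by descending frequency. It does not use RWeka; the function names `ngrams` and `ranked_ngrams` and the sample sentence are illustrative assumptions.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ranked_ngrams(tokens, n):
    """Return (n-gram, count) pairs, most frequent first."""
    return Counter(ngrams(tokens, n)).most_common()

tokens = "the cat sat on the mat and the cat slept".split()
print(ranked_ngrams(tokens, 2)[0])  # (('the', 'cat'), 2)
```

The same two functions cover bigrams, trigrams and quadrigrams by changing `n`.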

TEXT PREDICTION ALGORITHM

  1. The text prediction algorithm is based on building a vocabulary of trigrams and quadrigrams
  2. The parsed n-grams are arranged in descending order of their frequencies
  3. Each n-gram is then split into 2 parts:
    • The last word of the n-gram becomes the next (predicted) word
    • The n-gram minus its last word becomes the given n-gram
  4. Input from the user is parsed using the same pre-processing steps to generate given n-grams
  5. The given n-grams are compared against the vocabulary
  6. The first match (the one with the highest frequency) is returned as the matched n-gram
  7. The corresponding next word becomes the predicted word
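
The lookup described in these steps might be sketched as follows, again in plain Python rather than the project's R code. The names `build_table` and `predict` are hypothetical. Trying the quadrigram table first and then falling back to the trigram table is an assumption about how the two vocabularies are combined, and the input is assumed to be already pre-processed into tokens.

```python
from collections import Counter

def build_table(tokens, n):
    """Map each (n-1)-word given n-gram to its most frequent next word."""
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    table = {}
    # most_common() yields n-grams in descending frequency, so the first
    # entry stored for a prefix is its highest-frequency match.
    for gram, _ in counts.most_common():
        table.setdefault(gram[:-1], gram[-1])
    return table

def predict(words, tables):
    """Try the longest given n-gram first, then shorter ones (assumed fallback)."""
    for n in sorted(tables, reverse=True):
        prefix = tuple(words[-(n - 1):])
        if len(prefix) == n - 1 and prefix in tables[n]:
            return tables[n][prefix]
    return None  # no match in any table

corpus = "i want to go home now i want to go out i want to sleep".split()
tables = {n: build_table(corpus, n) for n in (3, 4)}
print(predict("we want to".split(), tables))  # go
```

Here "we want to" has no quadrigram match, so the trigram table supplies "go", the most frequent continuation of "want to" in the sample corpus.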

DATA PRODUCT