Coursera/Johns Hopkins University
Data Science Specialization
SwiftKey Capstone Project
Pradeep K. Pant
October 5, 2016
Create an algorithm for predicting the next word given one or more words as input using NLP
A large corpus of blog, news, and Twitter data was loaded and analyzed
N-grams were extracted from the corpus and then used for building the predictive model
Various methods of improving the prediction accuracy and speed were explored
An N-gram model with a "Stupid Backoff" strategy was used
The dataset was cleaned by lower-casing the text and removing links, Twitter handles, punctuation, numbers, and extra whitespace
N-gram matrices, from 6-gram down to unigram, were extracted using RWeka
The model size was reduced by dropping the least frequent N-grams
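The cleaning, extraction, and pruning steps above can be sketched roughly as follows. This is an illustrative Python sketch, not the project's R/RWeka code. The function names (clean, ngram_counts, prune), the regular expressions, the choice to keep apostrophes, and the min_count threshold are all assumptions made for the example.

```python
import re
from collections import Counter

def clean(text):
    """Lower-case and strip links, Twitter handles, punctuation, digits, extra spaces."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # links
    text = re.sub(r"@\w+", " ", text)                    # Twitter handles
    # Assumption: apostrophes are kept so contractions like "it's" survive.
    text = re.sub(r"[^a-z' ]+", " ", text)               # punctuation and numbers
    return re.sub(r"\s+", " ", text).strip()             # extra whitespace

def ngram_counts(lines, max_n=6):
    """Return {n: Counter of n-word tuples} for n = 1..max_n."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for line in lines:
        tokens = clean(line).split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return counts

def prune(counts, min_count=2):
    """Drop N-grams seen fewer than min_count times (hypothetical threshold)."""
    return {
        n: Counter({gram: c for gram, c in table.items() if c >= min_count})
        for n, table in counts.items()
    }
```

With two input lines such as "the cat sat" and "the cat ran", pruning at min_count=2 keeps the bigram ("the", "cat") and discards every trigram, which is the kind of size reduction the last bullet describes.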
Provides a text input box for the user to type a word or phrase
Detects words typed and predicts the next word reactively
Iterates from longest N-gram (6-gram) to shortest (2-gram)
Predicts using the longest, most frequent, matching N-gram
Option to select the number of predictions displayed
Average response time is under 3 seconds
Application memory usage well under 150 MB
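The lookup described above (try the longest matching N-gram first, then back off to shorter ones, returning the most frequent continuations) might look something like this. It is a minimal Python sketch under assumptions, not the app's actual Shiny/R code: build_lookup, predict, and the parameter k are hypothetical names, and the input is assumed to be a dictionary mapping N-gram order to counted word tuples.

```python
from collections import Counter, defaultdict

def build_lookup(counts):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    lookup = defaultdict(Counter)
    for n, table in counts.items():
        if n < 2:
            continue  # unigrams have no context to match against
        for gram, c in table.items():
            lookup[gram[:-1]][gram[-1]] += c
    return lookup

def predict(lookup, phrase, k=3, max_n=6):
    """Return up to k next-word candidates from the longest matching N-gram."""
    tokens = phrase.lower().split()
    for n in range(max_n, 1, -1):          # 6-gram down to 2-gram
        context = tuple(tokens[-(n - 1):])
        if len(context) < n - 1:
            continue                        # phrase too short for this order
        if context in lookup:
            return [word for word, _ in lookup[context].most_common(k)]
    return []                               # no N-gram matched

# Tiny hand-made counts, for illustration only.
counts = {
    2: Counter({("the", "cat"): 3, ("the", "dog"): 1}),
    3: Counter({("i", "saw", "the"): 2}),
}
lookup = build_lookup(counts)
print(predict(lookup, "I saw the"))
```

Note that this sketch simply returns the matches from the first order that succeeds. The Stupid Backoff scheme as usually described also scores lower-order matches with a fixed discount factor, which is left out here for brevity.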
Application is running at: https://ppant.shinyapps.io/nextWordPredict/
GitHub link for the code files: https://github.com/ppant/Coursera-Data-Science-Capstone-Project
Code and app will be updated with any new features/improvements.