Capstone Project: Predict Next Word

Soumava Dey

December 26, 2018

Project Description

In the context of keyboard typing, predicting the next word is an interesting ‘Natural Language Processing’ (NLP) problem in data science.

The goal of this project is to analyze the Twitter, blogs, and news data sets and build a predictive model that predicts the next word of the input text. The next word prediction app has been built with Shiny.

Data has been collected from publicly available data sources.

The data cleansing and exploratory data analysis report is published here: (https://rpubs.com/soudey/449337)

N-gram (https://en.wikipedia.org/wiki/N-gram) tables were generated.

A 10% data sampling method was used to sample the data for building the training dataset.
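As an illustration of the sampling and n-gram table step, here is a minimal Python sketch. It is not the project's actual code: the function names (`sample_lines`, `tokenize`, `ngram_table`), the tokenization rule, and the toy corpus are all assumptions made for this example.

```python
import random
import re
from collections import Counter

def sample_lines(lines, fraction=0.10, seed=42):
    """Keep each line independently with probability `fraction` (hypothetical helper)."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]

def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes (assumed rule)."""
    return re.findall(r"[a-z']+", text.lower())

def ngram_table(lines, n):
    """Count every run of n consecutive tokens within each line."""
    counts = Counter()
    for line in lines:
        tokens = tokenize(line)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Toy corpus standing in for the Twitter, blogs, and news text.
corpus = ["I love data science", "I love data", "data science is fun"]
bigrams = ngram_table(corpus, 2)
trigrams = ngram_table(corpus, 3)
```

On real data, `sample_lines` would be applied to each source file first, and the tables would be built from the sampled lines only.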

Shiny App

[Figures: two images of the Shiny app]

Algorithm Build Process

The Stupid Backoff algorithm has been used to predict the next word.

### Inexpensive: It requires fewer resources than alternatives such as Katz’s Backoff Model because it doesn’t generate normalized probabilities

### Good Accuracy: It approaches the quality of Kneser-Ney Smoothing as the amount of training data grows

### Reference: Brants et al., Large Language Models in Machine Translation, Google Inc. (2007)
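For context, the Stupid Backoff score defined in the referenced paper can be written as follows, where f(·) is a raw n-gram count, N is the size of the training corpus, and α is a fixed backoff factor (the paper uses α = 0.4). Whether this app applies the α factor, or simply picks the top match at the longest matching order, is not stated here.

```latex
S(w_i \mid w_{i-k+1}^{i-1}) =
\begin{cases}
\dfrac{f(w_{i-k+1}^{i})}{f(w_{i-k+1}^{i-1})} & \text{if } f(w_{i-k+1}^{i}) > 0 \\[2ex]
\alpha \, S(w_i \mid w_{i-k+2}^{i-1}) & \text{otherwise}
\end{cases}
\qquad
S(w_i) = \frac{f(w_i)}{N}
```

Because the scores are relative frequencies scaled by α rather than discounted probabilities, they do not sum to one, which is why no normalization pass is needed.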

### Algorithm Steps:

  1. For prediction of the next word, Quadgram is first used (first three words of Quadgram are the last three words of the user provided sentence).

  2. If no Quadgram is found, back off to Trigram (first two words of Trigram are the last two words of the sentence).

  3. If no Trigram is found, back off to Bigram (first word of Bigram is the last word of the sentence).

  4. If no Bigram is found, the most frequent word, ‘the’, is returned.
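The four steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the app's implementation: `build_tables` and `predict_next` are hypothetical names, tokenization is a plain whitespace split, and the toy sentences stand in for the sampled training data.

```python
from collections import Counter

def build_tables(sentences, max_n=4):
    """Build one frequency table per n-gram order, keyed by token tuples."""
    tables = {n: Counter() for n in range(1, max_n + 1)}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for n in tables:
            for i in range(len(tokens) - n + 1):
                tables[n][tuple(tokens[i:i + n])] += 1
    return tables

def predict_next(text, tables, default="the"):
    """Try Quadgram, then Trigram, then Bigram; fall back to `default`."""
    tokens = text.lower().split()
    for n in (4, 3, 2):
        context = tuple(tokens[-(n - 1):])
        if len(context) < n - 1:
            continue  # input is too short for this n-gram order
        # Keep the n-grams whose first n-1 words match the end of the input.
        candidates = {gram: count for gram, count in tables[n].items()
                      if gram[:-1] == context}
        if candidates:
            best = max(candidates, key=candidates.get)
            return best[-1]
    return default

# Toy training sentences (assumed data, for illustration only).
sentences = [
    "thank you very much",
    "thank you very much",
    "thank you very kindly",
    "see you soon",
]
tables = build_tables(sentences)
```

With these toy tables, `predict_next("I thank you very", tables)` is resolved at the Quadgram level, while `predict_next("hello you", tables)` falls through to the Bigram table. The linear scan over each table keeps the sketch short; a real app would more likely index the tables by context for speed.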

Future Improvement

Yes, this app is fast but needs better accuracy!

Therefore, it requires:

more advanced knowledge of natural language processing
more appropriate text mining tools
more powerful computation capacity
more RAM and additional instances of an app server

This app was developed for the capstone project of the Data Science Specialization by Johns Hopkins University on Coursera.

Special thanks to all instructors and classmates for assisting me in this project!