Capstone Project: Next Word Prediction

Suganthi M
September 1, 2017

Presenting the Next Word Prediction app as part of the Data Science Specialization Capstone Project.

Overview

The goal of the project is to build a Shiny app that predicts the next word as the user types a sentence, similar to the way most smartphone keyboards work today using SwiftKey technology.

The techniques used include data cleansing, exploratory analysis and predictive modeling.

The predictive model is based on n-gram word sequences, a concept from NLP (natural language processing). Probabilities are estimated with Maximum Likelihood Estimation and smoothed using interpolation.

Data Preparation

20% of the sampled dataset is cleaned using the quanteda package: profane words are removed, the text is lower-cased, and links, Twitter handles, punctuation, numbers and extra whitespace are stripped.
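The app performs this cleaning with quanteda in R. As a rough illustration of the same steps, here is a minimal Python sketch using only regular expressions. The function name `clean_text`, the placeholder profanity list and the exact patterns are assumptions made for this example, not the project's code.

```python
import re

# Hypothetical placeholder list; the project's actual profanity list is not given.
PROFANITY = {"badword", "worseword"}

def clean_text(text, profanity=PROFANITY):
    """Lower-case the text and strip links, handles, punctuation, numbers and extra spaces."""
    text = text.lower()
    text = re.sub(r"(https?://|www\.)\S+", " ", text)  # links
    text = re.sub(r"@\w+", " ", text)                  # Twitter handles
    # Keep only ASCII letters, apostrophes and whitespace (this also drops accented letters).
    text = re.sub(r"[^a-z'\s]", " ", text)
    tokens = [t.strip("'") for t in text.split()]      # split() also collapses whitespace
    tokens = [t for t in tokens if t and t not in profanity]
    return " ".join(tokens)

cleaned = clean_text("Check THIS out @user: http://t.co/abc 123 badword, it's   great!!")
print(cleaned)
```

Links and handles are removed before punctuation so that their internal characters (`:`, `/`, `@`) do not leave stray fragments behind.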

An N-gram model (trigram, bigram and unigram) is created and sorted by frequency. Building the model with the quanteda package was far faster than with the tm package: computation time dropped from 2 hours to a few seconds.
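To make the counting step concrete, the following Python sketch counts unigrams, bigrams and trigrams over a toy token list and sorts them by frequency. It is only an illustration: the project builds its tables with quanteda, and the helper name `ngram_counts` and the sample sentence are made up here.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = "i like tea and i like coffee".split()
unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)
trigrams = ngram_counts(tokens, 3)

# most_common() returns the n-grams sorted by descending frequency.
print(bigrams.most_common(1))
```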

The size of the model was reduced by pruning the less frequent words.

The relative frequencies for the Maximum Likelihood Estimation are computed for each of the N-grams, and the resulting N-gram model with MLE estimates is stored locally in a data table with a key set on the columns.
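The pruning and MLE steps can be sketched as follows. The MLE of a word given its context is the n-gram count divided by the count of the context. In this Python illustration a dictionary keyed on the context stands in for the keyed data table; the function name `mle_table`, the `min_count` threshold and the toy counts are assumptions, since the source does not state the pruning cut-off.

```python
from collections import Counter

def mle_table(ngram_counts, context_counts, min_count=1):
    """Map context -> {next_word: relative frequency}, skipping n-grams rarer than min_count."""
    table = {}
    for ngram, count in ngram_counts.items():
        if count < min_count:
            continue  # pruning of infrequent n-grams
        context, word = ngram[:-1], ngram[-1]
        # MLE: count(context + word) / count(context)
        table.setdefault(context, {})[word] = count / context_counts[context]
    return table

# Toy counts, invented for the example.
bigram_counts = Counter({("i", "like"): 3, ("i", "am"): 1, ("like", "tea"): 2})
unigram_counts = Counter({("i",): 4, ("like",): 3, ("tea",): 2, ("am",): 1})

bigram_mle = mle_table(bigram_counts, unigram_counts, min_count=2)
print(bigram_mle[("i",)])
```

Here the denominator uses the unpruned context count, so the remaining probabilities are not renormalized after pruning; whether the project does the same is not stated.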

Prediction Model Algorithm

The model is an N-gram model with Maximum Likelihood Estimation, smoothed using Jelinek-Mercer smoothing (the interpolation method). Part-of-speech tagging (POST) supplies the default prediction if interpolation does not return a prediction.

The interpolation of the saved N-gram model is done with a fixed \( \lambda \) (lambda).
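For reference, one common way to write Jelinek-Mercer interpolation for a trigram model is shown below. The source does not give the values of the weights or say whether a single \( \lambda \) or one weight per order is used, so the three weights \( \lambda_1, \lambda_2, \lambda_3 \) are an assumption of this sketch.

\[ \hat{P}(w_i \mid w_{i-2}, w_{i-1}) = \lambda_3 \, P_{ML}(w_i \mid w_{i-2}, w_{i-1}) + \lambda_2 \, P_{ML}(w_i \mid w_{i-1}) + \lambda_1 \, P_{ML}(w_i), \qquad \lambda_1 + \lambda_2 + \lambda_3 = 1 \]

Each \( P_{ML} \) is a relative frequency, for example \( P_{ML}(w_i \mid w_{i-1}) = c(w_{i-1} w_i) / c(w_{i-1}) \), where \( c(\cdot) \) is the count of the word sequence in the training sample.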

To predict the next word, the trigrams are checked first: a trigram matches if its first two words are the last two words of the sentence. The bigrams and unigrams are checked in the same way, and a matrix of the matches is created, sorted by frequency.

If no match is found, the prediction defaults to one based on part-of-speech tagging.

The predicted word defaults to 'the' if none of the above works, and also if the user does not enter any text in the app.
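The lookup order described above could look roughly like the Python sketch below. It is not the app's code: the function name `predict_next`, the weights in `lambdas`, the toy probability tables and the choice to draw candidates only from trigram and bigram matches are all assumptions, and the part-of-speech fallback is left out, so the sketch goes straight to the default word.

```python
def predict_next(text, trigram_p, bigram_p, unigram_p,
                 lambdas=(0.6, 0.3, 0.1), default="the"):
    """Return the candidate word with the highest interpolated probability."""
    words = text.lower().split()
    if not words:
        return default  # no text entered
    l3, l2, l1 = lambdas  # hypothetical fixed weights for trigram, bigram, unigram
    tri = trigram_p.get(tuple(words[-2:]), {}) if len(words) >= 2 else {}
    bi = bigram_p.get(tuple(words[-1:]), {})
    candidates = set(tri) | set(bi)
    if not candidates:
        # The project tries part-of-speech tagging here; this sketch skips that step.
        return default

    def score(word):
        return (l3 * tri.get(word, 0.0)
                + l2 * bi.get(word, 0.0)
                + l1 * unigram_p.get(word, 0.0))

    # sorted() makes tie-breaking deterministic.
    return max(sorted(candidates), key=score)

# Toy probability tables, invented for the example.
trigram_p = {("i", "would"): {"like": 0.7, "love": 0.3}}
bigram_p = {("would",): {"like": 0.5, "be": 0.5}, ("like",): {"tea": 1.0}}
unigram_p = {"like": 0.2, "love": 0.1, "be": 0.3, "tea": 0.1}

print(predict_next("I would", trigram_p, bigram_p, unigram_p))
```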

App Info and User Instructions

App info:

Shiny app URL: https://suganthim.shinyapps.io/next_word_prediction/
Average response time: under 2-3 seconds, with app memory below 140 MB

User instructions:

Under “Enter Your text”, the user may enter a phrase or individual words.
The app detects the words typed and predicts the next word reactively.

The predicted word is displayed in the box as soon as the user finishes typing.