Data Science Capstone Project: Next Word Prediction

Rohidah Maskuri
15th April 2016

Introduction

The purpose of this project is to build a natural language model that predicts the next word given an input word or phrase.

Data Source & Data Cleaning

Three data sources were used to predict the next word: text extracted from Twitter, news and blogs. The data were cleaned using the dfm function from the quanteda library, which removes whitespace and punctuation and converts the characters to lower case. Removing this clutter is intended to make the predictions more accurate. The cleansed data from the three sources were then combined to form the training data.
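The cleaning steps described above can be sketched in a few lines. This is a minimal illustration in Python, not the quanteda code used in the project; the function names clean_text and build_training_data are hypothetical, and the sketch assumes each source is simply a list of text lines.

```python
import re
import string

def clean_text(text):
    """Lowercase, strip punctuation, and collapse whitespace (illustrative only)."""
    text = text.lower()
    # Remove ASCII punctuation characters (note: this also drops apostrophes)
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Collapse runs of whitespace into single spaces and trim both ends
    return re.sub(r"\s+", " ", text).strip()

def build_training_data(sources):
    """Clean every non-empty line of every source and combine them into one list."""
    combined = []
    for lines in sources:
        combined.extend(clean_text(line) for line in lines if line.strip())
    return combined

# Hypothetical stand-ins for the Twitter, news and blog samples
twitter = ["Thanks for the follow!!"]
news = ["Markets  rallied on Monday."]
blogs = ["", "My first post: hello, world"]
training = build_training_data([twitter, news, blogs])
```

Whether to drop apostrophes, numbers or non-ASCII characters is a design choice that the text does not specify, so the sketch takes the simplest option.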

N-grams

For our prediction, we use the N-gram method. An N-gram is a sequence of n consecutive words: a unigram is a single word, a bigram is a sequence of two words, a trigram is a sequence of three words, and so on. For this project, we build four types of n-grams: unigrams, bigrams, trigrams and quadgrams.
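As a small illustration of the definition, the sketch below extracts n-grams from a tokenized sentence. It is written in Python for illustration only and is not the project's own code; the helper name ngrams and the sample sentence are assumptions.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # If there are fewer than n tokens, the range is empty and so is the result
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "thank you for the follow".split()

unigrams = ngrams(tokens, 1)   # one tuple per word
bigrams = ngrams(tokens, 2)    # pairs of adjacent words
trigrams = ngrams(tokens, 3)   # triples of adjacent words
quadgrams = ngrams(tokens, 4)  # runs of four adjacent words
```

A sentence of five words yields five unigrams, four bigrams, three trigrams and two quadgrams.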

The training data created earlier were then grouped into collections of unigrams, bigrams, trigrams and quadgrams. These collections were saved as Rdata files. The preprocessed files are then used by the prediction model.

Algorithm

For prediction, I am using the Katz back-off model, which I consider the most straightforward algorithm to apply for this project.

With this approach, the model first checks for full matches of the last 3 words and selects the most frequently occurring quadgram starting with those 3 words. If no quadgram contains the phrase, the algorithm backs off to trigrams using the last 2 words; if there are still no matches, it backs off to bigrams using the last word; and if there are still no matches, it returns NA.
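The back-off steps above can be sketched as follows. This is a simplified, frequency-only illustration in Python, not the application's R code: the names build_tables and predict_next are hypothetical, None stands in for NA, and the sketch omits the count discounting and probability redistribution that a full Katz back-off model performs.

```python
from collections import Counter, defaultdict

def build_tables(sentences, max_n=4):
    """For n = 2..max_n, map each (n-1)-word context to a Counter of following words."""
    tables = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    for sentence in sentences:
        tokens = sentence.split()
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                tables[n][context][tokens[i + n - 1]] += 1
    return tables

def predict_next(tables, phrase, k=5, max_n=4):
    """Try the longest context first, then back off to shorter ones."""
    tokens = phrase.lower().split()
    for n in range(max_n, 1, -1):          # quadgram, trigram, bigram
        if len(tokens) < n - 1:
            continue                       # not enough words for this context
        context = tuple(tokens[-(n - 1):])
        followers = tables[n].get(context)
        if followers:
            # Up to k candidates, most frequent first
            return [word for word, _ in followers.most_common(k)]
    return None                            # no match at any level

# Tiny hypothetical corpus, already cleaned
corpus = ["i love you so much", "i love you too", "i love cats", "you are kind"]
tables = build_tables(corpus)
suggestions = predict_next(tables, "they really love")
```

Here neither the quadgram nor the trigram context has been seen, so the lookup falls through to the bigram context "love" and ranks its followers by frequency.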

The Shiny Application

The actual application is available at https://rohidah.shinyapps.io/MyPredictionApps/.

The user keys in the input and the app predicts up to 5 of the most frequently used words following the last word of the input. The user can choose any one of the predictions to append it to the input, and the app will continue to predict the next word. Below is a screenshot of the application.

[Screenshot of the application]