Coursera Data Science Capstone - next word prediction

Yuliang Wang
April 24, 2016

Next word prediction is widely used on mobile devices and in search engines, and has made our lives a lot easier.
This presentation will describe an application for predicting the next word.
The application is the capstone project for the Coursera Data Science Specialization, offered by Johns Hopkins University in partnership with SwiftKey.

Background and objective

The objective of this project is to build a Shiny application https://yuliangwang.shinyapps.io/shiny/ to predict the next word given user input phrases.

The project is divided into several sub-tasks, including exploratory analysis, model building and refinement, and Shiny app development.

The HC Corpora dataset is the basis of all n-gram calculations. The main R packages used are quanteda, data.table and tm.

Due to hardware limitations, a random 50% sample of the HC Corpora data was used. The dfm function in quanteda was used to convert the text to lower case, remove punctuation and numbers, and perform other clean-up.
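
The sampling and clean-up steps can be illustrated with a minimal Python sketch. This is not the project's code, which uses quanteda's dfm function in R; the function names, the fixed seed, and the choice to keep apostrophes are assumptions made for illustration only.

```python
import random
import re

def sample_lines(lines, fraction=0.5, seed=42):
    """Return a random subset of lines; fraction and seed are illustrative defaults."""
    rng = random.Random(seed)
    k = int(len(lines) * fraction)
    return rng.sample(lines, k)

def clean(text):
    """Lower-case the text, drop numbers and punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[0-9]+", " ", text)      # remove numbers
    text = re.sub(r"[^\w\s']", " ", text)    # remove punctuation, keep apostrophes (assumption)
    return " ".join(text.split())

corpus = ["Hello, World! 123 times", "Don't stop.", "A third line", "And a fourth"]
subset = sample_lines(corpus)                # half of the lines, chosen at random
cleaned = [clean(line) for line in subset]
```

Fixing the seed keeps the sample reproducible between runs, which helps when the n-gram tables are rebuilt.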

Methods and Models

According to Wikipedia:
“In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech.”

For example, “I am” is a bigram, “good at learning” is a trigram, etc.
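
As a small illustration of the definition, the sketch below extracts and counts n-grams from a tokenized sentence in Python. The helper name ngrams is hypothetical and is not taken from the project code.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "i am good at learning".split()
bigrams = ngrams(tokens, 2)          # starts with ("i", "am")
trigrams = ngrams(tokens, 3)         # ends with ("good", "at", "learning")
bigram_counts = Counter(bigrams)     # raw counts are the input to any smoothing method
```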

The modified interpolated Kneser-Ney method, a widely used and accurate smoothing technique, is used to calculate probabilities for all 2-, 3- and 4-grams.
To predict, the app first checks whether any 4-gram begins with the last 3 input words; if so, it returns the 3 most probable next words. If not, it backs off to 3-grams and then to 2-grams.
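
The lookup described above can be sketched in Python as follows. It assumes the probabilities have already been computed and stored in a dictionary keyed by context; the name predict_next, the table layout, and the toy probabilities are all hypothetical, and the Kneser-Ney calculation itself is not shown.

```python
def predict_next(words, model, k=3):
    """Return up to k most probable next words, backing off to shorter contexts.

    model maps a context tuple (1 to 3 words) to {next_word: probability}.
    """
    context = tuple(words[-3:])              # a 4-gram model uses at most 3 context words
    while context:
        candidates = model.get(context)
        if candidates:
            ranked = sorted(candidates, key=candidates.get, reverse=True)
            return ranked[:k]
        context = context[1:]                # drop the oldest word and try a shorter n-gram
    return []

# Toy table with made-up probabilities, for illustration only.
toy_model = {
    ("thanks", "for", "the"): {"help": 0.4, "support": 0.3, "update": 0.2, "info": 0.1},
    ("for", "the"): {"first": 0.5, "record": 0.3},
    ("the",): {"same": 0.6, "first": 0.4},
}

full_match = predict_next("thanks for the".split(), toy_model)
backed_off = predict_next("waiting for the".split(), toy_model)
```

Here the first query matches a 3-word context directly, while the second finds no 4-gram and falls back to the 2-word context.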

How to use the Shiny App

The user interface is very simple. The user enters a phrase and hits the “GO” button, and the 3 most probable next words are returned.
Since the pre-calculated model handles up to 4-grams, only the last 3 words of the input are used to predict the next word. The first prediction takes about 10 seconds, as the app needs to load the pre-computed models and required packages, but subsequent predictions are instantaneous.
Please try it out at https://yuliangwang.shinyapps.io/shiny/.
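
The truncation of the input to its last 3 words can be sketched in Python as below. How the app actually tokenizes the input is not described here, so the whitespace split and the name last_context are assumptions.

```python
def last_context(phrase, max_words=3):
    """Return the last max_words tokens of the phrase, lower-cased."""
    tokens = phrase.lower().split()          # simple whitespace tokenization (assumption)
    return tokens[-max_words:]

context = last_context("I would like to thank you")   # only "to thank you" is used
```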

References and more information

My app is hosted on shinyapps.io:
https://yuliangwang.shinyapps.io/shiny/
Stanford NLP course:
https://www.coursera.org/course/nlp
The code used in this application:
https://github.com/yuliangwang/next_word_predict.git
The slides are hosted at:
http://rpubs.com/wang341/175613