Next Word Prediction - Capstone

Peter Geers
May 2017

MOOC Data Science

Johns Hopkins University

Coursera.org

Intro

This document describes the Capstone Project, the final product of the Coursera Data Science Specialization.

The objective of the app is a predictive model that suggests words to continue the text entered by the user. The dataset used to train the application includes text from Twitter, news and blogs provided by SwiftKey. After data cleaning, sampling and sub-setting, all data is gathered in a data frame. Using text mining (TM) and NLP techniques, a set of word combinations (N-grams) is created. A Katz backoff algorithm predicts the next word.
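The N-gram creation step can be illustrated with a minimal sketch. It is written in Python for brevity and is not the app's own code (the app is built with Shiny). The function names `clean` and `ngram_counts` and the sample sentences are hypothetical, and the cleaning rules (lower-casing, dropping apostrophes so that "can't" becomes "cant") are an assumption inferred from the N-gram excerpts shown later.

```python
import re
from collections import Counter

def clean(text):
    """Lower-case the text, drop apostrophes and return the alphabetic tokens.

    Assumed cleaning rules: the real pipeline may differ.
    """
    text = text.lower().replace("'", "").replace("\u2019", "")
    return re.findall(r"[a-z]+", text)

def ngram_counts(lines, n):
    """Count every run of n consecutive tokens in each line."""
    counts = Counter()
    for line in lines:
        tokens = clean(line)
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

# Hypothetical sample input, standing in for the sampled corpus.
sample = [
    "Can't wait to see you",
    "I can't wait for the weekend",
    "Right now I can't wait",
]
bigrams = ngram_counts(sample, 2)
print(bigrams.most_common(1))
```

Calling `ngram_counts` once per order (2, 3, 4, ...) would give the frequency tables that a backoff predictor looks up.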

The Shiny App

Just type one or more words. The app shows what the user entered and a cleaned version of it. As the main result, the top n-gram predictions, based on the text entered, are displayed. The user can review and change the input, and the app will update its predictions. Another tab offers more documentation.

Access Shiny Word Predictor

Main steps for next word(s) predictions:

Load the data frame with n-gram combinations.
Read the user input (a word or sentence).
Clean the user input (convert to lower case, tokenize the input words).
Call the prediction model function, a backoff algorithm.
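The steps above can be sketched as follows. This is a simplified illustration in Python, not the app's Shiny code: it backs off from the longest matching context to shorter ones and ranks candidates by raw frequency, without the discounting and backoff weights that a full Katz backoff model applies. The names `clean_input` and `predict_next`, the toy table and its counts are all hypothetical.

```python
import re
from collections import Counter

def clean_input(text):
    """Lower-case, drop apostrophes and tokenize (assumed cleaning rules)."""
    text = text.lower().replace("'", "").replace("\u2019", "")
    return re.findall(r"[a-z]+", text)

def predict_next(text, table, max_order=4, top=3):
    """Return up to `top` next-word candidates using a simple backoff.

    `table` maps an n-gram string to its frequency. The longest usable
    context (max_order - 1 words) is tried first, then shorter ones.
    """
    tokens = clean_input(text)
    for k in range(min(max_order - 1, len(tokens)), 0, -1):
        prefix = " ".join(tokens[-k:]) + " "
        matches = Counter()
        for ngram, freq in table.items():
            # Keep only (k+1)-grams whose first k words equal the context.
            if ngram.startswith(prefix) and ngram.count(" ") == k:
                matches[ngram[len(prefix):]] += freq
        if matches:
            return [word for word, _ in matches.most_common(top)]
    return []  # nothing matched at any order

# Hypothetical toy table; the real app loads a much larger data frame.
toy_table = {
    "cant wait to see": 5,
    "cant wait to go": 2,
    "wait to be": 9,
    "to be": 20,
    "to see": 10,
    "right now": 4,
}

print(predict_next("I can't wait to", toy_table))
print(predict_next("going to", toy_table))
```

In the first call the three-word context "cant wait to" is found among the 4-grams. In the second call no longer context matches, so the lookup backs off to the single word "to".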

N-grams excerpts

Top 5 entries for some of the N-grams in the data frame loaded by the Shiny app.

word	freq
right now	423
cant wait	391
last night	305
feel like	243
dont know	237

word	freq
thanks for the follow	141
the end of the	102
at the end of	87
the rest of the	79
cant wait to see	77
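As a small illustration of how such excerpts could be pulled from a combined table, the sketch below filters rows by the number of words and sorts them by frequency. It is written in Python rather than the app's own language, the function name `top_by_order` is hypothetical, and the rows are copied from the excerpts above.

```python
def top_by_order(rows, n, k=5):
    """Return the k most frequent n-grams that consist of exactly n words."""
    selected = [(word, freq) for word, freq in rows if len(word.split()) == n]
    return sorted(selected, key=lambda row: row[1], reverse=True)[:k]

# Rows taken from the excerpts above (word, freq).
rows = [
    ("feel like", 243),
    ("thanks for the follow", 141),
    ("right now", 423),
    ("the end of the", 102),
    ("cant wait", 391),
    ("at the end of", 87),
    ("last night", 305),
    ("the rest of the", 79),
    ("dont know", 237),
    ("cant wait to see", 77),
]

for word, freq in top_by_order(rows, 2):
    print(word, freq)
```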

A Twitter Word cloud

Word clouds are made from the retrieved dataset to give an impression of its contents. Here is a Twitter example.