Next Word Prediction - Capstone

Peter Geers
May 2017


MOOC Data Science

Johns Hopkins University

Coursera.org

Intro

This document describes the Capstone Project, the final product of the Coursera Data Science Specialization.
The objective of the app is a predictive model that suggests which words could follow the words entered by the user. The dataset used to train the application includes text from Twitter, news and blogs, provided by SwiftKey. After data cleaning, sampling and subsetting, all data is gathered in a data frame. Text mining (TM) and NLP techniques are then applied to create a set of word combinations (n-grams). A Katz backoff algorithm predicts the next word.
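
As a rough illustration of the n-gram step described above, the following Python sketch counts n-grams in a handful of cleaned lines. It is not the app's own code: the choice of Python, the function names `tokenize` and `count_ngrams`, the cleaning rules and the sample lines are all assumptions made for this example.

```python
import re
from collections import Counter

def tokenize(line):
    """Lowercase a line, drop apostrophes, and keep alphabetic tokens only."""
    line = line.lower().replace("'", "")
    return re.findall(r"[a-z]+", line)

def count_ngrams(lines, n):
    """Count every run of n consecutive tokens within each line."""
    counts = Counter()
    for line in lines:
        tokens = tokenize(line)
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

# Hypothetical sample lines, not taken from the SwiftKey dataset.
sample = [
    "Can't wait to see you!",
    "I can't wait to see the game",
    "Right now I can't wait",
]
bigrams = count_ngrams(sample, 2)
fourgrams = count_ngrams(sample, 4)
print(bigrams.most_common(3))
print(fourgrams.most_common(1))
```

Counting within each line, rather than across line boundaries, keeps the last word of one tweet or sentence from being paired with the first word of the next.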

The Shiny App

Just type one or more words. The app shows what the user entered and a cleaned version of it. As the main result, the top n-gram predictions based on the entered text are displayed. The user can review and change the input, and the app then presents new suggestions. Another tab offers more documentation.

Access Shiny Word Predictor

Main steps for next word(s) predictions:

  1. Load the data frame with n-gram combinations.
  2. Read the user input (a word or sentence).
  3. Clean the user input (convert to lowercase, tokenize the input words).
  4. Call the prediction model function, a backoff algorithm.
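
The four steps above can be sketched as follows. This is a simplified backoff that falls back to shorter contexts using raw counts; it leaves out the discounting and backoff weights of a full Katz model. The counts are copied from the n-gram excerpts in this document, while the Python names (`NGRAMS`, `build_index`, `clean_input`, `predict`) and the cleaning rules are assumptions for illustration, not the app's actual code.

```python
import re
from collections import defaultdict

# Step 1: n-gram table (rows copied from the excerpts in this document).
NGRAMS = {
    "right now": 423,
    "cant wait": 391,
    "last night": 305,
    "feel like": 243,
    "dont know": 237,
    "thanks for the follow": 141,
    "the end of the": 102,
    "at the end of": 87,
    "the rest of the": 79,
    "cant wait to see": 77,
}

def build_index(ngrams):
    """Map each context (all words but the last) to {next_word: count}."""
    index = defaultdict(dict)
    for ngram, freq in ngrams.items():
        *context, last = ngram.split()
        index[tuple(context)][last] = freq
    return index

def clean_input(text):
    """Step 3: lowercase, drop apostrophes, split into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower().replace("'", ""))

def predict(text, index, max_context=3, top=3):
    """Step 4: try the longest context first, then back off to shorter ones."""
    tokens = clean_input(text)  # Step 2: `text` is the raw user input
    for size in range(min(max_context, len(tokens)), 0, -1):
        candidates = index.get(tuple(tokens[-size:]))
        if candidates:
            ranked = sorted(candidates, key=candidates.get, reverse=True)
            return ranked[:top]
    return []  # no matching context in the table

index = build_index(NGRAMS)
print(predict("I can't wait to", index))   # context "cant wait to" matches a 4-gram
print(predict("Until the end of", index))  # context "the end of" matches a 4-gram
print(predict("What a long last", index))  # backs off to the 1-word context "last"
```

Indexing the table by context up front makes each prediction a few dictionary lookups, which suits an interactive app that re-predicts on every change of the input.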

N-grams excerpts

The top 5 entries for some of the n-grams in the data frame loaded by the Shiny app.

2-grams:

word                     freq
right now                 423
cant wait                 391
last night                305
feel like                 243
dont know                 237

4-grams:

word                     freq
thanks for the follow     141
the end of the            102
at the end of              87
the rest of the            79
cant wait to see           77

A Twitter Word cloud

Word clouds were made from the retrieved dataset to give an impression of the data it contains. Here is a Twitter example.