Coursera Data Science Specialization Capstone

Thomas Guenther
April 2015

Natural Language Processing,
Predicting Next Words

The Problem

  • While a user types into a search field, wouldn't it be a good idea to predict what they are searching for?
  • While you write a message on a mobile device, wouldn't it be great to get a few likely next words and save some time?
  • While you talk to your computer, mobile device or even your car, wouldn't it be awesome if it could give you an appropriate answer or execute the given command?

You think that all this isn't possible?
I think it is…

The Solution

  • We concentrate on text prediction with our Shiny app
  • The base data has over 4 million lines of text collected from blogs, Twitter and news
  • We sampled only 10% of the data because of performance, memory usage and pre-processing time
  • The data was cleaned up: all characters except spaces and letters were removed
  • The data was aggregated and split into uni-, bi- and tri-grams, from which we created probability tables
  • For unseen words a simple Katz back-off smoothing was used
  • If the higher-order n-gram cannot be found, we back off to lower-order n-grams until we find probable words
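
The steps above (cleanup, n-gram tables, back-off lookup) can be illustrated with a minimal Python sketch. It is not the app's code: the function names clean, build_tables and predict are hypothetical, and the sketch uses plain relative frequencies with a simple fall-through to lower orders. It omits the discounting and back-off weights of full Katz back-off.

```python
import re
from collections import Counter, defaultdict


def clean(text):
    """Lowercase the text, keep only letters and spaces, and split into tokens."""
    return re.sub(r"[^a-z ]", "", text.lower()).split()


def build_tables(lines, max_n=3):
    """For each order n, map a context of n-1 words to counts of the next word."""
    tables = {n: defaultdict(Counter) for n in range(1, max_n + 1)}
    for line in lines:
        tokens = clean(line)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])  # empty tuple for uni-grams
                tables[n][context][tokens[i + n - 1]] += 1
    return tables


def predict(tables, text, k=3):
    """Return up to k (word, relative frequency) pairs.

    Tries the highest-order table first and backs off to lower orders
    when the context was never seen. Assumption: no discounting is applied.
    """
    tokens = clean(text)
    for n in sorted(tables, reverse=True):
        if len(tokens) < n - 1:
            continue  # not enough typed words for this order
        context = tuple(tokens[len(tokens) - (n - 1):])
        counts = tables[n].get(context)
        if counts:
            total = sum(counts.values())
            return [(word, count / total) for word, count in counts.most_common(k)]
    return []


if __name__ == "__main__":
    corpus = ["I love data science", "I love data analysis", "You love cats"]
    tables = build_tables(corpus)
    print(predict(tables, "I love data"))  # answered from the tri-gram table
    print(predict(tables, "we love"))      # backs off to the bi-gram table
```

Because the cleanup drops every character except letters and spaces, contractions such as "don't" become "dont" in this sketch.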

Simple Flow Chart

Here we see a simple flow chart showing how the implemented algorithm works in general

[Figure: text prediction flow chart]

Shiny Next Words Usage

  • The user types a sequence of at least two words into a text input field
  • While typing, the user benefits from auto-complete-like suggestions read from the uni-grams
  • After pressing a button, the user sees the predicted words after a short moment, with the most probable prediction on top
  • The user can choose from the predicted words and add one to the input sequence
  • Alerts are shown if something goes wrong or simply to provide information
  • Usage and background information are provided on a separate page
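
The auto-complete-like suggestions mentioned above could work roughly as in the following Python sketch. How the app actually implements them is not shown here, so the function name suggest and the ranking by uni-gram frequency are assumptions made for illustration.

```python
from collections import Counter


def suggest(unigram_counts, prefix, k=5):
    """Return up to k known words starting with prefix, most frequent first."""
    prefix = prefix.lower()
    if not prefix:
        return []  # nothing typed yet, nothing to complete
    matches = [(word, count) for word, count in unigram_counts.items()
               if word.startswith(prefix)]
    # Sort by descending count, then alphabetically to break ties.
    matches.sort(key=lambda pair: (-pair[1], pair[0]))
    return [word for word, _ in matches[:k]]


if __name__ == "__main__":
    counts = Counter({"the": 10, "that": 6, "this": 4, "dog": 3})
    print(suggest(counts, "th", k=2))
```

A linear scan like this is fine for a small vocabulary. For a large uni-gram table, a sorted word list or a prefix tree would likely respond faster.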

You can test the app at shinyapps.io