Coursera Data Science Capstone Project: NextWord

Tomas Gogorza
April 2016

WHY

The goal of this project is to create a typing helper that reduces typing time and effort by predicting and suggesting likely next words. The application was trained on thousands of phrases, covering ~270,000 distinct words and the relationships between them, drawn from three sources:

  • Blogs
  • News
  • Twitter

The data set used to build the language models contains ~15 million words in total.

WHAT

To improve prediction accuracy, the application combines three n-gram language models:

  • Bigrams
  • Trigrams
  • Fourgrams
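As a rough illustration of what these models store, the sketch below counts bigrams, trigrams and fourgrams in a tiny token list. It is not the app's actual code: the function name `ngram_counts` and the sample sentence are made up for this example, and the real models were built from the full data set.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens (hypothetical helper)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Toy corpus, invented for illustration only.
tokens = "the cat sat on the mat and the cat ran".split()

bigrams = ngram_counts(tokens, 2)    # pairs of consecutive words
trigrams = ngram_counts(tokens, 3)   # triples of consecutive words
fourgrams = ngram_counts(tokens, 4)  # runs of four consecutive words

# Show the most frequent bigram and how often it occurs.
print(bigrams.most_common(1))
```

Each higher-order model uses more context words, so it is more specific but matches fewer inputs.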

To handle unseen word combinations and cases where a model cannot make a prediction, two enhancements were added to the models:

  • Good-Turing discounting
  • Back-off model
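The two ideas can be sketched as follows, under simplifying assumptions. `good_turing_adjust` applies the standard Good-Turing adjusted count c* = (c + 1) * N_{c+1} / N_c, keeping the raw count when N_{c+1} is zero. `predict` tries the longest available context first and backs off to shorter ones. All names here are hypothetical, and this simplified version ranks candidates by raw counts rather than redistributing the discounted probability mass as a full back-off model would. The app's actual implementation may differ.

```python
from collections import Counter, defaultdict

def build_model(tokens, max_n=4):
    """Map each context (1 to max_n - 1 words) to a Counter of next words."""
    model = defaultdict(Counter)
    for n in range(2, max_n + 1):
        for i in range(len(tokens) - n + 1):
            *context, nxt = tokens[i:i + n]
            model[tuple(context)][nxt] += 1
    return model

def good_turing_adjust(counts):
    """Good-Turing adjusted counts: c* = (c + 1) * N_{c+1} / N_c.

    N_c is the number of items seen exactly c times. When N_{c+1} is zero
    the raw count is kept (a common simplification).
    """
    freq_of_freq = Counter(counts.values())
    adjusted = {}
    for item, c in counts.items():
        n_next = freq_of_freq.get(c + 1, 0)
        adjusted[item] = (c + 1) * n_next / freq_of_freq[c] if n_next else float(c)
    return adjusted

def predict(model, text, k=3, max_n=4):
    """Return up to k suggestions, backing off from long to short contexts."""
    words = text.lower().split()
    for size in range(min(max_n - 1, len(words)), 0, -1):
        candidates = model.get(tuple(words[-size:]))
        if candidates:
            # Highest count first; ties broken alphabetically for stable output.
            return sorted(candidates, key=lambda w: (-candidates[w], w))[:k]
    return []  # no context matched at any order

# Toy corpus, invented for illustration only.
model = build_model("the cat sat on the mat and the cat ran".split())
suggestions = predict(model, "I saw the")
```

In this toy example "I saw the" matches no three-word or two-word context, so the prediction backs off to the single word "the".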

HOW

The application is straightforward to use:

  1. Type or paste some text. NextWord predicts the next word and creates 3 buttons with the best suggestions.

  2. Additionally, you can see more candidate predictions in the word cloud that appears below the prediction buttons.

  3. Select your desired word by clicking on the corresponding button, or just type it.

NextWord Web App

You can access the Shiny app at http://tgogorza.shinyapps.io/WordPredictor