Data Science Capstone Project

Bipin Karunakaran
01/23/2016

This presentation outlines an application for next-word prediction, developed as part of the Coursera Data Science Capstone Project.

Summary

Objective

  • To create an application that predicts the next English word after a phrase of one or more words

Methodology

  • Uses n-gram frequency tables generated from a 5% sample of a corpus of US news, blogs and tweets
  • Prediction algorithm is based on the Stupid Backoff approach
  • Design trades off model complexity against performance
  • Accuracy is measured using out-of-sample phrases
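The sampling step can be sketched in a few lines. This is a minimal Python illustration, not the project's own code: the function name `sample_lines`, the fixed seed, and the stand-in corpus are all assumptions made for the example.

```python
import random

def sample_lines(lines, fraction=0.05, seed=42):
    """Keep each line independently with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return [line for line in lines if rng.random() < fraction]

# Hypothetical stand-in for the news, blog and tweet text.
corpus = [f"line {i}" for i in range(10_000)]
sample = sample_lines(corpus)
print(len(sample), "of", len(corpus), "lines kept")
```

Sampling line by line keeps memory use low, since the full corpus never has to be tokenized.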

Ngram Tokenization

[Image: Ngram tokenization]

In addition, memory use was managed by writing each n-gram data frame out to an Rdata file and then removing it from the workspace to free memory.
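The tokenize, count, save, and release cycle can be sketched as follows. This is a Python illustration under stated assumptions: the tokenizer rule, the file names, and the use of `pickle` in place of Rdata files are all choices made for the example, not details taken from the project.

```python
import pickle
import re
import tempfile
from collections import Counter
from pathlib import Path

def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def ngram_counts(lines, n):
    """Count every run of n consecutive tokens within each line."""
    counts = Counter()
    for line in lines:
        tokens = tokenize(line)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

lines = ["I love data science", "I love R"]  # toy stand-in for the sample
out_dir = Path(tempfile.mkdtemp())

for n in (1, 2, 3):
    table = ngram_counts(lines, n)
    with open(out_dir / f"ngram_{n}.pkl", "wb") as fh:
        pickle.dump(table, fh)  # write the table to disk
    del table  # drop the in-memory copy before building the next table

# Reload a single table only when it is needed.
with open(out_dir / "ngram_2.pkl", "rb") as fh:
    bigrams = pickle.load(fh)
print(bigrams[("i", "love")])  # 2
```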

Algorithm (Adapted from Stupid Backoff)

[Image: prediction algorithm, adapted from Stupid Backoff]
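For readers without the diagram, standard Stupid Backoff can be sketched as below. This Python sketch shows the textbook form of the method, with the commonly used backoff factor of 0.4; the adaptations made in this project are not described in the text, so the function names, the trigram limit, and the toy sentences are assumptions for illustration only.

```python
from collections import Counter

ALPHA = 0.4  # backoff factor commonly used with Stupid Backoff

def build_counts(sentences, max_n=3):
    """Count all n-grams of length 1..max_n in one shared table."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts

def predict_next(phrase, counts, max_n=3):
    """Return the vocabulary word with the highest Stupid Backoff score."""
    tokens = phrase.lower().split()
    vocab = [gram[0] for gram in counts if len(gram) == 1]
    total = sum(counts[(word,)] for word in vocab)

    def score(word, context):
        penalty = 1.0
        while context:
            seen = counts[context + (word,)]
            if seen > 0:
                # Relative frequency of the word after this context.
                return penalty * seen / counts[context]
            context = context[1:]  # drop the oldest word and back off
            penalty *= ALPHA
        return penalty * counts[(word,)] / total  # unigram fallback

    context = tuple(tokens[-(max_n - 1):]) if max_n > 1 else ()
    return max(vocab, key=lambda word: score(word, context))

counts = build_counts([
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "the dog ate the bone",
])
print(predict_next("the cat", counts))  # sat (trigram match)
print(predict_next("my dog", counts))   # ate (backs off to the bigram "dog ate")
```

The scores are not probabilities, since they are never normalized. That is what keeps the method cheap enough for an interactive application.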

Results

Link to word predictor application

Accuracy (tested against out-of-sample phrases from the same corpora)

# of words in phrase    Accuracy of predicted next word
4                       83%
3                       75%
2                       38%
1                       11%
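One way such an accuracy figure can be computed is sketched below in Python. The test protocol is not spelled out in the text, so this assumes the first k words of each held-out phrase are the context and word k+1 is the target; `toy_predict` and the phrases are hypothetical stand-ins, not the project's model or data.

```python
def accuracy(predict, phrases, context_len):
    """Share of phrases whose word after `context_len` words is predicted."""
    hits = total = 0
    for phrase in phrases:
        tokens = phrase.lower().split()
        if len(tokens) <= context_len:
            continue  # too short to supply both a context and a target
        context = " ".join(tokens[:context_len])
        target = tokens[context_len]
        hits += predict(context) == target
        total += 1
    return hits / total if total else 0.0

# Hypothetical predictor: a fixed lookup with a default guess.
toy_table = {"thanks for": "the", "happy": "birthday"}

def toy_predict(context):
    return toy_table.get(context, "the")

held_out = ["thanks for the follow", "happy new year", "thanks for coming"]
print(round(accuracy(toy_predict, held_out, 2), 2))  # 0.33
```

Running this once per context length yields one row of a table like the one above.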

Additional Information