Coursera Capstone Project

Claudia V
August 2015

About this project

The objective of this project is to build a text-prediction application from a corpus of news articles, tweets and blog posts. To make it work, we created frequency tables of the unigrams, bigrams, trigrams and quadrigrams (4-grams) that appear in the corpus.

An excerpt from a bigram frequency table looks like:

  • The dog - 10
  • dog is - 4
  • is happy - 2
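A frequency table like the one above can be built by sliding a window of n words over the tokenized text. The following is a minimal sketch, not the code used in the app: the function name `ngram_counts` and the toy sentence are assumptions made for illustration, and tokenization is a plain whitespace split.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens (hypothetical helper)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Toy corpus standing in for the news, Twitter and blog text.
tokens = "the dog is happy and the dog is here".split()

bigrams = ngram_counts(tokens, 2)
quadrigrams = ngram_counts(tokens, 4)

# Print the bigram table, most frequent first.
for gram, count in bigrams.most_common():
    print(" ".join(gram), "-", count)
```

The same function produces the unigram, trigram and quadrigram tables by changing `n`.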

Once we have all the n-grams and their frequencies, we can calculate their probabilities with the Maximum Likelihood Estimate (MLE). The plain MLE is used only for unigrams.
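For a unigram, the MLE is simply the word's count divided by the total number of words. A small sketch of that calculation, assuming a whitespace-tokenized toy sentence (the name `unigram_mle` is hypothetical):

```python
from collections import Counter

def unigram_mle(tokens):
    """MLE for unigrams: count(word) / total number of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

probs = unigram_mle("the dog is happy the dog".split())
print(probs["the"])  # "the" appears 2 times out of 6 tokens
```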

n-Grams & interpolation

MLE is a good starting point but does not deliver the best results on its own. To improve on it, we use linear interpolation to calculate the probability of each n-gram and then store that probability in a lookup table.

Linear Interpolation.
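Linear interpolation mixes the MLE estimates of several n-gram orders into one probability, each order weighted by a lambda, with the lambdas summing to 1. The sketch below shows the idea up to trigrams for brevity; the lambda values (0.1, 0.3, 0.6), the helper names and the toy corpus are assumptions for illustration, not the weights or code used in the app.

```python
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def mle(counts, context, word):
    """count(context + word) / count(context); unigrams divide by corpus size."""
    order = len(context) + 1
    numerator = counts[order][context + (word,)]
    if context:
        denominator = counts[order - 1][context]
    else:
        denominator = sum(counts[1].values())
    return numerator / denominator if denominator else 0.0

def interpolated(counts, context, word, lambdas):
    """lambdas[k] weights the estimate that uses the last k context words.

    Assumes len(lambdas) == len(context) + 1 and sum(lambdas) == 1.
    """
    total = 0.0
    for k, lam in enumerate(lambdas):
        shortened = context[len(context) - k:]  # last k words, () when k == 0
        total += lam * mle(counts, shortened, word)
    return total

tokens = "the dog is happy the dog is here the cat is happy".split()
counts = {n: ngram_counts(tokens, n) for n in (1, 2, 3)}

# Hypothetical weights for the unigram, bigram and trigram estimates.
LAMBDAS = (0.1, 0.3, 0.6)
p = interpolated(counts, ("dog", "is"), "happy", LAMBDAS)
print(round(p, 4))
```

Because the lower-order estimates always contribute, a word keeps a non-zero probability even when its trigram was never seen in the corpus.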

Stupid Backoff

To find the best candidate words we use stupid backoff, and we return the three best results for the query.

Stupid Backoff.
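Stupid backoff scores a candidate word with the longest n-gram that was actually seen; each time it has to fall back to a shorter context, the score is multiplied by a fixed factor. The sketch below works on raw counts with a factor of 0.4, a commonly cited value for this method, and stops at trigrams for brevity. The function names, the factor and the toy corpus are assumptions for illustration; the app itself may score candidates differently, for example from its interpolated lookup table.

```python
from collections import Counter

ALPHA = 0.4  # assumed backoff factor, not taken from the app

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def backoff_score(counts, context, word):
    """Relative frequency at the longest seen order, discounted per backoff step."""
    factor = 1.0
    while context:
        seen = counts[len(context) + 1][context + (word,)]
        if seen:
            return factor * seen / counts[len(context)][context]
        context = context[1:]  # drop the oldest context word
        factor *= ALPHA
    return factor * counts[1][(word,)] / sum(counts[1].values())

def predict(counts, text, k=3, max_order=3):
    """Return the k highest-scoring next words for the typed text."""
    context = tuple(text.lower().split()[-(max_order - 1):])
    vocabulary = [word for (word,) in counts[1]]
    ranked = sorted(vocabulary,
                    key=lambda w: backoff_score(counts, context, w),
                    reverse=True)
    return ranked[:k]

tokens = "the dog is happy the dog is here the cat is happy".split()
counts = {n: ngram_counts(tokens, n) for n in (1, 2, 3)}

top3 = predict(counts, "The dog is")
print(top3)
```

The scores are not true probabilities (they do not sum to 1), which is acceptable here because they are only used to rank candidates.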

How to use the app

Type the sentence whose next word you want predicted and click: Predict!

The three most likely words to come after your text will be displayed under: “Your prediction”

Under prediction data you will see the corresponding data from the corpus:

  • probA corresponds to the adjusted probability via interpolation
  • prob corresponds to MLE probability
