Claudia V
August 2015

About this project

This project's objective is to build a text prediction application based on a corpus of news articles, Twitter posts, and blogs. To make it work, we created files containing the frequency of each unigram, bigram, trigram, and quadrigram in the corpus.

For example, a bigram frequency file looks like:

The dog - 10
dog is - 4
is happy - 2
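
As a rough illustration of how such frequency files could be produced, here is a minimal Python sketch. The function name ngram_counts and the toy sentence are assumptions made for this example; they are not taken from the project's own code.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens (hypothetical helper)."""
    # zip over n shifted copies of the token list yields each n-gram as a tuple
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams)

# Toy input, invented for illustration only
tokens = "the dog is happy and the dog is here".split()
bigrams = ngram_counts(tokens, 2)

for gram, freq in bigrams.most_common(3):
    print(f"{gram} - {freq}")
```

The same helper with n set to 1, 3 or 4 would give the unigram, trigram and quadrigram counts.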

Once we have all the words and their frequencies, we can calculate probabilities using the Maximum Likelihood Estimate (MLE). The plain MLE is used only for unigrams.
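
For unigrams, the Maximum Likelihood Estimate is simply each word's count divided by the total number of words. A minimal Python sketch, with an assumed function name (unigram_mle) and made-up counts:

```python
def unigram_mle(counts):
    """Return count / total for every word (hypothetical helper)."""
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}

# Invented counts, for illustration only
counts = {"the": 10, "dog": 4, "is": 4, "happy": 2}
probs = unigram_mle(counts)

print(probs["the"])  # 10 of the 20 words are "the"
```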

n-Grams & interpolation

MLE is a good starting point but fails to deliver the best results. To improve them, we use linear interpolation to calculate each n-gram probability and then add this probability to a lookup table.

Linear Interpolation.
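
To make the idea concrete, here is a hedged Python sketch of linear interpolation for the bigram case only. The weight lam = 0.7, the function name and the counts are all assumptions chosen for illustration; the weights actually used by the app are not stated here.

```python
def interpolated_bigram_prob(prev, word, unigrams, bigrams, lam=0.7):
    """Mix the bigram MLE with the unigram MLE (lam is an assumed weight)."""
    total = sum(unigrams.values())
    p_uni = unigrams.get(word, 0) / total
    prev_count = unigrams.get(prev, 0)
    # The bigram MLE is undefined for an unseen context, so fall back to 0
    p_bi = bigrams.get((prev, word), 0) / prev_count if prev_count else 0.0
    return lam * p_bi + (1 - lam) * p_uni

# Invented counts, for illustration only
unigrams = {"the": 10, "dog": 4, "is": 4, "happy": 2}
bigrams = {("the", "dog"): 4, ("dog", "is"): 4, ("is", "happy"): 2}

seen = interpolated_bigram_prob("the", "dog", unigrams, bigrams)
unseen = interpolated_bigram_prob("the", "happy", unigrams, bigrams)
```

Because the unigram term is always included, a bigram never seen in the corpus (such as "the happy" above) still gets a small non-zero probability. Higher orders would add one weighted term per n-gram length, with weights that sum to 1.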

Stupid Backoff

To find the best candidate words we use stupid backoff, returning the top 3 results for the query.

Stupid Backoff.
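
The following Python sketch shows one way stupid backoff can rank candidates. It scores from raw counts, whereas the app may score from its lookup table; the discount alpha = 0.4, the function name and the counts are assumptions for illustration.

```python
def stupid_backoff_top(context, counts, k=3, alpha=0.4):
    """Return the k best next words for a context (hypothetical helper).

    counts maps token tuples of any length to their frequencies.
    """
    scores = {}
    penalty = 1.0
    context = tuple(context)
    total_unigrams = sum(c for g, c in counts.items() if len(g) == 1)
    while True:
        denom = counts.get(context, 0) if context else total_unigrams
        if denom:
            for gram, c in counts.items():
                if len(gram) == len(context) + 1 and gram[:-1] == context:
                    # setdefault keeps the score from the longest context
                    scores.setdefault(gram[-1], penalty * c / denom)
        if not context:
            break
        context = context[1:]  # back off: drop the oldest word
        penalty *= alpha       # discount each backoff step
    # Highest score first; ties broken alphabetically
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]

# Invented counts, for illustration only
counts = {
    ("the",): 10, ("dog",): 4, ("is",): 4, ("happy",): 2,
    ("the", "dog"): 4, ("dog", "is"): 3, ("is", "happy"): 2,
    ("the", "dog", "is"): 3,
}
top3 = stupid_backoff_top(["the", "dog"], counts)
```

Here "is" is scored from the trigram, and the remaining slots are filled from discounted unigram scores because no other longer match exists.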

How to use the app

Type the sentence whose next word you want predicted and hit: Predict!

The three most likely words to come after your text will be displayed under: “Your prediction”

In the prediction data you will see the underlying data from the corpus:

probA corresponds to the probability adjusted via interpolation.
prob corresponds to the MLE probability.

https://vclau.shinyapps.io/capstone