JHU Data Science Capstone Slide Deck

Gabriel Juarez. Logician, Computational Linguist, Data Scientist.
January 24, 2016

Corpus Based Word Prediction

Boss, I know what you're going to say-

…because that is EXACTLY what is made possible when we apply programming, statistics and linguistics to a large body of text.

We can predict what you are going to say.

The following describes the method I used to develop a web application that takes some sample text as input and generates a prediction of the most likely next word.

The DATA

For my language model, I used content from three very different sources of language data:

a million lines of news stories
a million lines of text taken from blogs, with all its slang and jargon
a million tweets from Twitter, with all its emoji and shortenings (omg, fwiw, j/k)

The MODELING

I had a LOT of data to process to build my language model. I used map/reduce techniques to generate counts of single terms, word pairs, word triples, and four-word sequences.
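To illustrate the counting step, here is a minimal sketch in Python (chosen for illustration only; it is not the code behind the app). It assumes lowercased, whitespace-split tokens, and the function names `ngrams`, `map_line` and `count_ngrams` are hypothetical. Each line is mapped to its own table of n-gram counts, and the tables are then reduced into one by adding them together.

```python
from collections import Counter
from functools import reduce


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def map_line(line, n):
    """Map step: count the n-grams found in a single line of text.

    Assumption: tokens are lowercased and split on whitespace only.
    """
    tokens = line.lower().split()
    return Counter(ngrams(tokens, n))


def count_ngrams(lines, n):
    """Reduce step: merge the per-line counts into one table."""
    per_line = (map_line(line, n) for line in lines)
    return reduce(lambda total, part: total + part, per_line, Counter())


if __name__ == "__main__":
    sample = ["The cat sat", "the cat ran"]
    for n in (1, 2, 3):
        print(n, dict(count_ngrams(sample, n)))
```

Because the per-line counts are independent of one another, the map step could be spread across chunks of the corpus and the partial tables merged afterwards.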

To prune the model down to something that could load quickly, I reduced it so that it holds entries for only the most likely terms.
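One way such pruning could work is sketched below in Python, for illustration only. It assumes that "most likely" means keeping, for each leading word sequence, just the single most frequent final word; the exact rule used in the app may differ. The function name `prune` and the sample counts are hypothetical.

```python
def prune(counts):
    """Keep only the most frequent final word for each n-gram prefix.

    `counts` maps n-gram tuples to frequencies. The result maps each
    prefix (all words but the last) to its single best continuation.
    Assumption: on a tie, the n-gram seen first is kept.
    """
    best = {}
    for ngram, count in counts.items():
        prefix, last = ngram[:-1], ngram[-1]
        if prefix not in best or count > best[prefix][1]:
            best[prefix] = (last, count)
    return {prefix: word for prefix, (word, _) in best.items()}


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    bigram_counts = {("the", "cat"): 3, ("the", "dog"): 1, ("a", "dog"): 2}
    print(prune(bigram_counts))
```

The pruned table has one entry per prefix instead of one per n-gram, which is what makes it small enough to load quickly.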

The PREDICTING

Given some text as input, the basic idea is that I take the last three words of the input, try to match them against the first three words of a four-word "quadragram", and offer the fourth term of the quadragram as a prediction. If this fails, I take the last two words of the input and match them against the first two words of the trigrams, offering the third term of the trigram as a prediction. If this doesn't work, I check the last word of the input against my word pairs. If all else fails, I offer "the" as a prediction.
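The back-off order described above can be sketched as follows, in Python for illustration only. The sketch assumes the model is stored as lookup tables keyed by prefix length (3 for quadragrams, 2 for trigrams, 1 for word pairs), each mapping a prefix to one predicted word. The function name `predict`, the table layout and the sample entries are hypothetical.

```python
def predict(text, models, default="the"):
    """Predict the next word by backing off from long to short prefixes.

    `models` maps a prefix length k to a table of {prefix tuple: word}.
    Assumption: the input is lowercased and split on whitespace.
    """
    tokens = text.lower().split()
    for k in (3, 2, 1):              # quadragrams, then trigrams, then pairs
        if len(tokens) >= k:
            prefix = tuple(tokens[-k:])
            word = models.get(k, {}).get(prefix)
            if word is not None:
                return word
    return default                   # nothing matched: fall back to "the"


if __name__ == "__main__":
    # Made-up entries, for illustration only.
    models = {
        3: {("i", "want", "to"): "go"},
        2: {("want", "to"): "be"},
        1: {("to",): "see"},
    }
    for sample in ("I want to", "you want to", "going to", "hello"):
        print(repr(sample), "->", predict(sample, models))
```

With these sample tables, "I want to" matches a quadragram prefix, "you want to" falls back to the trigram table, "going to" falls back to the word pairs, and "hello" matches nothing and so receives the default.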

The PROTOTYPE

The web app can be accessed online at:

https://gadaju.shinyapps.io/MyDataScienceCapstoneProject/