nextWord

Coursera / Data Science Capstone project on word prediction
January 2016

Introducing: nextWord

Imagine, if you will, that it is 2004 and you have been tasked with building an “app” for a mobile phone: one that is great at suggesting the next word you might type, which ultimately saves time for our customers and makes phone keys last longer thanks to fewer overall keypresses. Clearly the cutting edge of T9 technology!
  1. T9 History: https://en.wikipedia.org/wiki/T9_(predictive_text)

The Pitch

The scenario described on the first slide is a bit of a joke, since we have had text prediction on mobile phones since before the popular Motorola Razr flip phones. Those phones used T9 predictive text, which did a pretty decent job of guessing what people were trying to type. T9 prediction is really impressive considering it ran on hardware with so few resources.

Today Natural Language Processing (NLP) is used for many things, including:

  • Machine translation, for example from English to French
  • Word prediction in search engines, like Google's suggested searches list
  • Word prediction on smartphones with touch screens, serving a similar purpose to T9

This is not an exhaustive list, but we will focus on text prediction for smartphones: our purpose is to build a prototype application that can predict a single word.

The Approach

  • Get lots of data (provided)
  • Get to know the data (poke it with R-sticks, write an .Rmd analysis; see the sketch after this list)
  • Clean the data (Munging) * See the talk at the end!
  • Algorithms (try many, see what works)
  • Profit (???) * only if what we run is computationally easy and can fit on a mobile device.
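To get to know the data, the .Rmd analysis boils down to something like the sketch below. The file names and directory are assumptions about how the provided data is laid out (plain-text files, one document per line); adjust them to your download.

    # Assumed file names for the provided corpus; one document per line
    files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")

    for (f in files) {
      lines <- readLines(file.path("data", f), encoding = "UTF-8", skipNul = TRUE)
      words <- sum(lengths(strsplit(lines, "\\s+")))
      cat(sprintf("%-20s %9d lines %11d words\n", f, length(lines), words))
    }

    # Prototype on a small random sample; the full corpus is too big to iterate on quickly
    set.seed(1234)
    blogs <- readLines(file.path("data", files[1]), encoding = "UTF-8", skipNul = TRUE)
    sample_lines <- sample(blogs, 10000)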

I decided to approach the problem with Katz's back-off model, which sounded promising because it is "a generative n-gram language model that estimates the conditional probability of a word given its history in the n-gram. It accomplishes this estimation by 'backing off' to models with smaller histories under certain conditions. By doing so, the model with the most reliable information about a given history is used to provide the better results." (https://en.wikipedia.org/wiki/Katz%27s_back-off_model)
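For reference, the standard formulation from that article looks roughly like this, where C() is an n-gram count, k a count threshold, d a discount factor, and alpha the back-off weight that redistributes the left-over probability mass:

    P_{bo}(w_i \mid w_{i-n+1} \cdots w_{i-1}) =
      \begin{cases}
        d_{w_{i-n+1} \cdots w_i} \, \dfrac{C(w_{i-n+1} \cdots w_i)}{C(w_{i-n+1} \cdots w_{i-1})} & \text{if } C(w_{i-n+1} \cdots w_i) > k \\[1ex]
        \alpha_{w_{i-n+1} \cdots w_{i-1}} \, P_{bo}(w_i \mid w_{i-n+2} \cdots w_{i-1}) & \text{otherwise}
      \end{cases}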

Algorithms and Munging

LeCook book:

  1. Build a dictionary/corpus (cleaning)
  2. Build a frequency matrix, counting words from our data
  3. Build n-grams, 1 through 4
  4. Build n-skipgrams, 0 through 2
  5. Start over with 1., but perform more cleaning
  6. All of these steps are what is called munging the data. *You may do this over and over until something works.
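To make steps 3 and 4 concrete, here is a rough sketch of how the n-grams and skip-grams can be built with quanteda; it illustrates the idea rather than reproducing my actual scripts.

    library(quanteda)

    txt  <- "the quick brown fox jumps over the lazy dog"
    toks <- tokens_tolower(tokens(txt, remove_punct = TRUE))

    # Step 3: n-grams, 1 through 4
    ngrams <- tokens_ngrams(toks, n = 1:4)

    # Step 4: skip-grams, skipping 0 through 2 words between tokens
    skips  <- tokens_skipgrams(toks, n = 2, skip = 0:2)

    head(as.list(ngrams)[[1]], 10)
    head(as.list(skips)[[1]], 10)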

I tested both Quanteda and text2vec.

The processing pipeline:

  • Source texts -> Corpus (multiple documents)
  • Cleaning: white space removal, punctuation removal, lower-casing, stemming (gets to the root of a word, suffixes removed)
  • Term-Document Matrix (bag of words), sparse frequency counts
  • N-gram tokenizer (bag of N-neighbors), sparse frequency counts
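In quanteda terms the pipeline looks roughly like the sketch below; the two tiny documents are placeholders, and in practice the corpus comes from the sampled source texts.

    library(quanteda)

    # Source texts -> corpus (multiple documents)
    txt <- c(d1 = "The quick brown fox jumps over the lazy dog.",
             d2 = "The dog barked at the quick red fox.")
    corp <- corpus(txt)

    # Cleaning: punctuation removal, lower-casing, stemming
    # (white space is handled by the tokenizer itself)
    toks <- tokens_wordstem(tokens_tolower(tokens(corp, remove_punct = TRUE)))

    # Term-document matrix (bag of words), sparse frequency counts
    tdm <- dfm(toks)
    topfeatures(tdm, 10)

    # N-gram tokenizer (bag of N-neighbors), sparse frequency counts
    ngram_dfm <- dfm(tokens_ngrams(toks, n = 2:4))
    topfeatures(ngram_dfm, 10)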

Optimizing with ML: split the data into 75% train and 25% test, then use k-nearest neighbors.
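A minimal version of that split (the k-nearest-neighbors tuning itself is omitted; `docs` is a placeholder for the character vector of sampled documents):

    # `docs` stands in for the sampled documents
    docs <- sprintf("placeholder document %d", 1:100)

    # 75% train / 25% test split over documents
    set.seed(42)
    train_idx  <- sample(seq_along(docs), size = floor(0.75 * length(docs)))
    train_docs <- docs[train_idx]
    test_docs  <- docs[-train_idx]

    length(train_docs)  # 75
    length(test_docs)   # 25

One way to use the held-out 25% is to hide the last word of each test phrase and check whether the engine predicts it.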

Lookup table structure: Key = n-gram (1 to 3 words), Freq = count (sorted).
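The example rows, column names, and predict_next() helper below are my own illustration of that structure, not code from the app; note it backs off simply by dropping to shorter keys, without Katz's discounting and back-off weights.

    # Hypothetical rows; the real tables come from the n-gram frequency counts.
    # Each n-gram is split into a key (1-3 words) and the word that follows it.
    ng <- list(
      `3` = data.frame(key   = c("one of the", "at the end"),
                       nextw = c("most", "of"),
                       freq  = c(312L, 298L), stringsAsFactors = FALSE),
      `2` = data.frame(key   = c("of the", "in the"),
                       nextw = c("same", "world"),
                       freq  = c(1204L, 977L), stringsAsFactors = FALSE),
      `1` = data.frame(key   = c("the", "of"),
                       nextw = c("first", "course"),
                       freq  = c(5204L, 3100L), stringsAsFactors = FALSE)
    )

    # Hypothetical helper: try the longest key first, back off to shorter keys on a miss
    predict_next <- function(phrase, ng) {
      words <- tolower(unlist(strsplit(trimws(phrase), "\\s+")))
      for (k in 3:1) {
        if (length(words) < k) next
        key  <- paste(tail(words, k), collapse = " ")
        hits <- ng[[as.character(k)]]
        hits <- hits[hits$key == key, ]
        if (nrow(hits) > 0) return(hits$nextw[which.max(hits$freq)])
      }
      "the"   # last-resort fallback
    }

    predict_next("I spent one of the", ng)   # "most"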

Application Instructions

To use the app:

  • Start typing a phrase in the provided text field; when you stop typing, it will be sent through the prediction engine
  • Optionally, you can paste in a phrase or a partial sentence to get a prediction
  • Be amazed!
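For the curious, the skeleton of such an app is small. The sketch below is generic; predict_next() and ng are the hypothetical pieces from the earlier sketch, the 500 ms debounce stands in for "when you stop typing", and none of this is the code behind the deployed prototype.

    library(shiny)

    ui <- fluidPage(
      titlePanel("nextWord"),
      textInput("phrase", "Start typing a phrase:"),
      textOutput("prediction")
    )

    server <- function(input, output) {
      # Wait until the user pauses typing (~500 ms) before running the engine
      phrase <- debounce(reactive(input$phrase), 500)
      output$prediction <- renderText({
        if (nzchar(phrase())) predict_next(phrase(), ng) else ""
      })
    }

    shinyApp(ui, server)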

I ran into major problems and broke my app, so the link takes you to the basic prototype. -Jamin Ragle

nextWord https://zombieprocess.shinyapps.io/nextWord-app/

I found this talk on machine learning useful and highly recommend checking it out: Nathan Taggart on Machine Learning and Ponies https://www.youtube.com/watch?v=xeAB10QgDW8