nextWord

Coursera / Data Science Capstone project on word prediction
January 2016

Introducing: nextWord

Imagine, if you will, that it is 2004 and you have been tasked with building an “app” for a mobile phone: one that is great at suggesting the next word you might type, which ultimately saves time for our customers and makes phone keys last longer thanks to fewer overall keypresses. Clearly the cutting edge of T9 technology!
  1. T9 History: https://en.wikipedia.org/wiki/T9_(predictive_text)

The Pitch

The scenario described on the first slide is a bit of a joke, since we have had text prediction on mobile phones since before the popular Motorola Razr flip phones. Those phones used T9 predictive text, which did a pretty decent job of guessing what people were trying to type. T9 prediction is really impressive considering it ran on hardware with so few resources.

Today Natural Language Processing (NLP) is used for many things, including:

  • Machine translation, for example from English to French
  • Word prediction in search engines, like Google's suggested searches list
  • Word prediction on smartphones with touch screens, serving a similar purpose to T9

This is not an exhaustive list, but we will focus on text prediction for smartphones: our purpose is to build a prototype application that can predict a single word.

The Approach

  • Get lots of data (provided)
  • Get to know the data (poke it with R-sticks, write an .Rmd analysis; see the sketch after this list)
  • Clean the data (Munging) * See the talk at the end!
  • Algorithms (try many, see what works)
  • Profit (???) * only if what we run is computationally easy and can fit on a mobile device.
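To get to know the data, the .Rmd analysis boils down to something like the sketch below. The file names and directory are assumptions about how the provided data is laid out (plain-text files, one document per line); adjust them to your download.

    # Assumed file names for the provided corpus; one document per line
    files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")

    for (f in files) {
      lines <- readLines(file.path("data", f), encoding = "UTF-8", skipNul = TRUE)
      words <- sum(lengths(strsplit(lines, "\\s+")))
      cat(sprintf("%-20s %9d lines %11d words\n", f, length(lines), words))
    }

    # Prototype on a small random sample; the full corpus is too big to iterate on quickly
    set.seed(1234)
    blogs <- readLines(file.path("data", files[1]), encoding = "UTF-8", skipNul = TRUE)
    sample_lines <- sample(blogs, 10000)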

I decided to approach the problem with Katz's back-off model, which sounded promising because it is "a generative n-gram language model that estimates the conditional probability of a word given its history in the n-gram. It accomplishes this estimation by 'backing off' to models with smaller histories under certain conditions. By doing so, the model with the most reliable information about a given history is used to provide the better results." (https://en.wikipedia.org/wiki/Katz%27s_back-off_model)
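For reference, the standard formulation from that article looks roughly like this, where C() is an n-gram count, k a count threshold, d a discount factor, and alpha the back-off weight that redistributes the left-over probability mass:

    P_{bo}(w_i \mid w_{i-n+1} \cdots w_{i-1}) =
      \begin{cases}
        d_{w_{i-n+1} \cdots w_i} \, \dfrac{C(w_{i-n+1} \cdots w_i)}{C(w_{i-n+1} \cdots w_{i-1})} & \text{if } C(w_{i-n+1} \cdots w_i) > k \\[1ex]
        \alpha_{w_{i-n+1} \cdots w_{i-1}} \, P_{bo}(w_i \mid w_{i-n+2} \cdots w_{i-1}) & \text{otherwise}
      \end{cases}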

Algorithms and Munging

LeCook book:

  1. Build a dictionary/corpus (cleaning)
  2. Build a frequency matrix, counting words from our data
  3. Build n-grams, 1 through 4
  4. Build n-skipgrams, 0 through 2
  5. Start over with 1., but perform more cleaning
  6. All of these steps are what is called munging the data. *You may do this over and over until something works.
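To make steps 3 and 4 concrete, here is a rough sketch of how the n-grams and skip-grams can be built with quanteda; it illustrates the idea rather than reproducing my actual scripts.

    library(quanteda)

    txt  <- "the quick brown fox jumps over the lazy dog"
    toks <- tokens_tolower(tokens(txt, remove_punct = TRUE))

    # Step 3: n-grams, 1 through 4
    ngrams <- tokens_ngrams(toks, n = 1:4)

    # Step 4: skip-grams, skipping 0 through 2 words between tokens
    skips  <- tokens_skipgrams(toks, n = 2, skip = 0:2)

    head(as.list(ngrams)[[1]], 10)
    head(as.list(skips)[[1]], 10)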

I tested both Quanteda and text2vec.

The processing pipeline:

  • Source texts -> Corpus (multiple documents)
  • Cleaning: white space removal, punctuation removal, lower-casing, stemming (gets to the root of a word, suffixes removed)
  • Term-Document Matrix (bag of words), sparse frequency counts
  • N-gram tokenizer (bag of N-neighbors), sparse frequency counts
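In quanteda terms the pipeline looks roughly like the sketch below; the two tiny documents are placeholders, and in practice the corpus comes from the sampled source texts.

    library(quanteda)

    # Source texts -> corpus (multiple documents)
    txt <- c(d1 = "The quick brown fox jumps over the lazy dog.",
             d2 = "The dog barked at the quick red fox.")
    corp <- corpus(txt)

    # Cleaning: punctuation removal, lower-casing, stemming
    # (white space is handled by the tokenizer itself)
    toks <- tokens_wordstem(tokens_tolower(tokens(corp, remove_punct = TRUE)))

    # Term-document matrix (bag of words), sparse frequency counts
    tdm <- dfm(toks)
    topfeatures(tdm, 10)

    # N-gram tokenizer (bag of N-neighbors), sparse frequency counts
    ngram_dfm <- dfm(tokens_ngrams(toks, n = 2:4))
    topfeatures(ngram_dfm, 10)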

Optimizing with ML: split the data into 75% train and 25% test, then use k-nearest neighbors.
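A minimal version of that split (the k-nearest-neighbors tuning itself is omitted; `docs` is a placeholder for the character vector of sampled documents):

    # `docs` stands in for the sampled documents
    docs <- sprintf("placeholder document %d", 1:100)

    # 75% train / 25% test split over documents
    set.seed(42)
    train_idx  <- sample(seq_along(docs), size = floor(0.75 * length(docs)))
    train_docs <- docs[train_idx]
    test_docs  <- docs[-train_idx]

    length(train_docs)  # 75
    length(test_docs)   # 25

One way to use the held-out 25% is to hide the last word of each test phrase and check whether the engine predicts it.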

Lookup table structure: Key = n-gram (1 to 3 words), Freq = count (sorted).
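The example rows, column names, and predict_next() helper below are my own illustration of that structure, not code from the app; note it backs off simply by dropping to shorter keys, without Katz's discounting and back-off weights.

    # Hypothetical rows; the real tables come from the n-gram frequency counts.
    # Each n-gram is split into a key (1-3 words) and the word that follows it.
    ng <- list(
      `3` = data.frame(key   = c("one of the", "at the end"),
                       nextw = c("most", "of"),
                       freq  = c(312L, 298L), stringsAsFactors = FALSE),
      `2` = data.frame(key   = c("of the", "in the"),
                       nextw = c("same", "world"),
                       freq  = c(1204L, 977L), stringsAsFactors = FALSE),
      `1` = data.frame(key   = c("the", "of"),
                       nextw = c("first", "course"),
                       freq  = c(5204L, 3100L), stringsAsFactors = FALSE)
    )

    # Hypothetical helper: try the longest key first, back off to shorter keys on a miss
    predict_next <- function(phrase, ng) {
      words <- tolower(unlist(strsplit(trimws(phrase), "\\s+")))
      for (k in 3:1) {
        if (length(words) < k) next
        key  <- paste(tail(words, k), collapse = " ")
        hits <- ng[[as.character(k)]]
        hits <- hits[hits$key == key, ]
        if (nrow(hits) > 0) return(hits$nextw[which.max(hits$freq)])
      }
      "the"   # last-resort fallback
    }

    predict_next("I spent one of the", ng)   # "most"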

Application Instructions

To use the app:

  • Start typing a phrase in the provided text field; when you stop typing, it will be sent through the prediction engine
  • Optionally, you can paste in a phrase or a partial sentence to get a prediction
  • Be amazed!
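For the curious, the skeleton of such an app is small. The sketch below is generic; predict_next() and ng are the hypothetical pieces from the earlier sketch, the 500 ms debounce stands in for "when you stop typing", and none of this is the code behind the deployed prototype.

    library(shiny)

    ui <- fluidPage(
      titlePanel("nextWord"),
      textInput("phrase", "Start typing a phrase:"),
      textOutput("prediction")
    )

    server <- function(input, output) {
      # Wait until the user pauses typing (~500 ms) before running the engine
      phrase <- debounce(reactive(input$phrase), 500)
      output$prediction <- renderText({
        if (nzchar(phrase())) predict_next(phrase(), ng) else ""
      })
    }

    shinyApp(ui, server)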

I ran into major problems and broke my app, so the link takes you to the basic prototype. -Jamin Ragle

nextWord https://zombieprocess.shinyapps.io/nextWord-app/

I found this talk on machine learning useful and highly recommend checking it out: Nathan Taggart on Machine Learning and Ponies https://www.youtube.com/watch?v=xeAB10QgDW8