Coursera / Data Science Capstone project on word prediction
January 2016
Introducing: nextWord
The scenario described on the first slide is a bit of a joke: text prediction on mobile phones predates even the popular Motorola Razr flip phones, which used T9 predictive text and did a pretty decent job of guessing what people were trying to type. T9 is especially impressive considering that it ran on hardware with so few resources.
Today Natural Language Processing (NLP) is used for many things, including:
- Machine translation, for example from English to French
- Word prediction in search engines, like Google's suggested searches list
- Word prediction on smartphones with touch screens, which serves a similar purpose to T9
The list above is not exhaustive. Here we focus on text prediction for smartphones and on building a prototype application that predicts a single next word.
I decided to approach the problem with Katz's back-off model, which sounded nice because it is a "generative n-gram language model that estimates the conditional probability of a word given its history in the n-gram. It accomplishes this estimation by 'backing-off' to models with smaller histories under certain conditions. By doing so, the model with the most reliable information about a given history is used to provide the better results." (Source: https://en.wikipedia.org/wiki/Katz%27s_back-off_model)
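To make the back-off idea concrete, here is a minimal Python sketch, not the code behind the app. It backs off from bigrams to unigrams only, and it assumes a fixed absolute discount of 0.5 in place of the Good-Turing discounting that Katz's original model uses. The names `train`, `backoff_prob`, and `predict_next` are hypothetical.

```python
from collections import Counter

def train(tokens):
    """Count unigrams and bigrams from a list of tokens."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def backoff_prob(word, prev, unigrams, bigrams, discount=0.5):
    """Estimate P(word | prev), backing off to unigrams when needed.

    Assumption: a fixed absolute discount stands in for Good-Turing.
    """
    total = sum(unigrams.values())
    # Counts of every word that was observed right after `prev`.
    seen = {w2: c for (w1, w2), c in bigrams.items() if w1 == prev}
    if not seen:
        # The history was never observed: use the unigram estimate.
        return unigrams[word] / total
    hist_count = sum(seen.values())
    if word in seen:
        # Observed bigram: discounted relative frequency.
        return (seen[word] - discount) / hist_count
    # Unseen bigram: share the probability mass freed by discounting
    # among the unseen words, in proportion to their unigram counts.
    alpha = discount * len(seen) / hist_count
    unseen_mass = sum(c for w, c in unigrams.items() if w not in seen)
    if unseen_mass == 0:
        return 0.0
    return alpha * unigrams[word] / unseen_mass

def predict_next(prev, unigrams, bigrams):
    """Return the vocabulary word with the highest back-off probability."""
    return max(unigrams, key=lambda w: backoff_prob(w, prev, unigrams, bigrams))

unigrams, bigrams = train("the cat sat on the mat the cat ate".split())
print(predict_next("the", unigrams, bigrams))
```

On this toy corpus, "cat" follows "the" twice and "mat" once, so the sketch predicts "cat". The discounted mass is what lets words never seen after "the" still receive a small probability.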
I was testing both Quanteda and text2vec…
Source Texts -> Corpus (Multiple Documents) ->
- White space removal
- Punctuation removal
- Lower case letters
- Stemming (reduces each word to its root by removing suffixes)
Term-Document Matrix (Bag of Words), sparse frequency counts -> N-gram tokenizer (Bag of N-neighbors), sparse frequency counts ->
Optimizing with ML: split the data into 75% train and 25% test, and use k-nearest neighbors
Structure: Key, N-gram (1 to 3), Freq (sorted)
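The pipeline above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the app's actual code, which was being tested with Quanteda and text2vec. The function names are hypothetical, `crude_stem` is a rough stand-in for a real stemmer, and the table layout (key = the preceding words, value = candidate next words sorted by frequency) is one guess at the structure described. The train/test split and the k-nearest-neighbors step are left out because the slides do not say how they were applied.

```python
import re
from collections import Counter, defaultdict

def clean(text):
    """Lower-case, drop punctuation, and collapse white space into tokens."""
    text = text.lower().replace("'", "")
    text = re.sub(r"[^a-z\s]", " ", text)  # every non-letter becomes a space
    return text.split()                    # split() collapses runs of white space

def crude_stem(token):
    """Very rough suffix stripping (assumption: stands in for a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def build_table(tokens, max_n=3):
    """Map each history of 0..max_n-1 words to its next words, sorted by frequency."""
    counts = defaultdict(Counter)
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            *history, word = tokens[i:i + n]
            counts[tuple(history)][word] += 1
    return {key: c.most_common() for key, c in counts.items()}

def predict(phrase, table, max_n=3):
    """Try the longest matching history first, then shorter ones."""
    words = [crude_stem(t) for t in clean(phrase)]
    for k in range(max_n - 1, -1, -1):
        if k > len(words):
            continue
        key = tuple(words[len(words) - k:])
        if key in table:
            return table[key][0][0]  # most frequent next word for this history
    return None

corpus = "The cat sat on the mat. The cat ate!"
tokens = [crude_stem(t) for t in clean(corpus)]
table = build_table(tokens)
print(predict("on the", table))
```

With this toy corpus, the two-word history "on the" has been seen only before "mat", so that is the prediction. An unknown phrase falls through to the empty history, which returns the most frequent single word. Because the corpus is stemmed, predictions come back as stems rather than full words.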
To use the app:
- Start typing a phrase in the provided text field; when you stop typing, it is sent through the engine
- Optionally, paste in a phrase or a partial sentence to get a prediction
- Be amazed!
I ran into major problems and broke my app, so the link takes you to the basic prototype. -Jamin Ragle
https://zombieprocess.shinyapps.io/nextWord-app/
I found this talk on machine learning useful and highly recommend checking it out: Nathan Taggart on Machine Learning and Ponies https://www.youtube.com/watch?v=xeAB10QgDW8