SwiftKey Data Science Capstone Project
Rithesh Kumar
Sat Apr 18 14:25:38 2015
Introduction
The goal of this project is to build an application that takes a phrase typed by the user and predicts the word they are most likely to type next.
The primary use case for this application is text messaging on mobile phones.
The training data consists of millions of English tweets, blog posts, and news articles.
Milestone report link: Milestone Report
Application link: Shiny App - Next Word Prediction
GitHub link: Code
Text Prediction Algorithm
- Preprocessing the text (e.g. filter non-English words, symbols)
- Tokenization
- Build unigrams, bigrams, trigrams and quadgrams from the data
- Count the occurrences of each unique unigram, bigram, trigram and quadgram
- Calculate probabilities for each n-gram using maximum likelihood estimation and simple linear interpolation
- Get the text phrase from the user
- Extract the last three tokens (e.g. prev1, prev2, prev3) from the phrase. If the phrase is not long enough, extract the last two tokens or the last token.
- Return the top 3 matches with the highest probability.
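The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's actual code: the tokenizer regex, the function names, and the interpolation weights in LAMBDAS are all hypothetical, and a real model would tune the weights on held-out data.

```python
import re
from collections import Counter

# Hypothetical interpolation weights for unigram..quadgram estimates (sum to 1).
LAMBDAS = (0.1, 0.2, 0.3, 0.4)

def tokenize(text):
    """Lowercase the text and keep only runs of ASCII letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def count_ngrams(sentences, max_n=4):
    """Return counts[n], a Counter of n-gram tuples, for n = 1..max_n."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for sentence in sentences:
        tokens = tokenize(sentence)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return counts

def mle(counts, context, word):
    """Maximum likelihood estimate: count(context + word) / count(context)."""
    n = len(context) + 1
    # With an empty context the denominator is the total number of tokens.
    denom = counts[n - 1][context] if context else sum(counts[1].values())
    if denom == 0:
        return 0.0
    return counts[n][context + (word,)] / denom

def interpolated_score(counts, context, word, lambdas=LAMBDAS):
    """Weighted sum of the MLE estimates for context lengths 0, 1, 2, 3."""
    context = tuple(context[-3:])
    total = 0.0
    for k, lam in enumerate(lambdas):
        if k > len(context):
            break  # short phrase: skip orders we have no context for
        total += lam * mle(counts, context[len(context) - k:], word)
    return total

def predict_next(counts, phrase, top_k=3):
    """Score every known word after the phrase and return the top_k words."""
    context = tuple(tokenize(phrase)[-3:])
    scores = {gram[0]: interpolated_score(counts, context, gram[0])
              for gram in counts[1]}
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_k]]

corpus = ["i love new york", "i love new shoes", "i love new york city"]
counts = count_ngrams(corpus)
print(predict_next(counts, "I love new"))
```

When the phrase has fewer than three tokens, the higher-order terms are simply dropped, so the scores no longer sum to one but still rank the candidate words consistently.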
Shiny App - Next Word Prediction
- Screenshot Of The App
- Instructions
- Wait 10 seconds for the app to load
- Enter text in the input textbox
- The top 3 most probable next words are displayed in the output textbox
Conclusion
Limitations
- The laptop's built-in RAM was not enough to handle the full dataset
- Only a representative sample of ~1% of the data was used to train the model
- Sparse terms were removed when the term-document matrix was created
- The prediction model is biased towards the training data, so predictions involving words not seen in training are not very accurate
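The two memory-saving steps in the limitations above, sampling roughly 1% of the data and dropping sparse entries, might look like the following. This is only a sketch under assumptions: the source does not say how the sample was drawn or what sparsity threshold was used, so the per-line random sampling, the seed, and the min_count cutoff here are all hypothetical.

```python
import random
from collections import Counter

def sample_lines(lines, fraction=0.01, seed=42):
    """Keep each line independently with probability `fraction`.

    A fixed seed (hypothetical value) makes the sample reproducible.
    """
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]

def prune_sparse(counts, min_count=2):
    """Drop n-grams seen fewer than `min_count` times (assumed cutoff)."""
    return Counter({gram: c for gram, c in counts.items() if c >= min_count})

# Usage with made-up data: sample a list of lines, then prune a small Counter.
lines = [f"line {i}" for i in range(10000)]
sample = sample_lines(lines)
pruned = prune_sparse(Counter({("new", "york"): 5, ("york", "city"): 1}))
```

Sampling line by line means the whole corpus never has to be held in memory at once if the lines are streamed from disk, at the cost of a sample size that varies slightly around the target fraction.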
References