JHU/SwiftKey Capstone Project

Monnappa Somanna
Fri Dec 23 19:54:43 2016

Introduction

The goal of this project is to predict the most likely word the user wants to type next, based on the previous 2 or 3 words.

This application is useful in mobile texting, where it improves the user experience by making typing faster.

Millions of news articles, blog posts and tweets are used as the “corpus” for training the model.

Following are the key links:

Link to the Milestone report (submitted during week 2)
Link to the Application
Link to the Github

Algorithm Design for Text Prediction

Following are the key steps:

Create a 'corpus' by preprocessing text from millions of news articles, blog posts and tweets.

'Tokenization' of the text: break the given text up into units called tokens.
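A minimal sketch of the tokenization step in Python. The function name `tokenize` and the regular expression are illustrative assumptions (lowercase word tokens, apostrophes kept inside words), not the project's actual code.

```python
import re

# Hypothetical token pattern: runs of letters, optionally joined by
# apostrophes (so "can't" stays one token). Digits and punctuation are dropped.
TOKEN_PATTERN = re.compile(r"[a-z]+(?:'[a-z]+)*")

def tokenize(text):
    """Split raw text into a list of lowercase word tokens."""
    return TOKEN_PATTERN.findall(text.lower())

print(tokenize("I can't wait to see you!"))
```

Other choices (keeping numbers, marking sentence boundaries) are equally possible and would change the n-grams produced later.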

Create n-gram sequences from the above data. An n-gram is a contiguous sequence of n items from a given sequence of text or speech. … An n-gram of size 1 is referred to as a “unigram”; size 2 is a “bigram”; size 3 is a “trigram”.
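As an illustration of the definition above, a short Python sketch that slides a window of size n over a token list. The helper name `ngrams` is hypothetical.

```python
def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["thanks", "for", "the", "follow"]
print(ngrams(tokens, 1))  # unigrams
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams
```

A token list shorter than n simply yields no n-grams.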

Count the number of occurrences of each n-gram. The n-gram size is limited to n = 4 because of memory limitations.
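The counting step could look roughly like the following Python sketch, which keeps one frequency table per n-gram size up to 4. The names `count_ngrams` and `MAX_N` and the toy sentences are assumptions made for illustration.

```python
from collections import Counter

MAX_N = 4  # largest n-gram size kept, matching the limit stated above

def count_ngrams(sentences, max_n=MAX_N):
    """Return {n: Counter of n-gram tuples} for n = 1..max_n."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for tokens in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return counts

# Toy input: each sentence is already tokenized.
sentences = [["i", "love", "you"], ["i", "love", "coffee"]]
counts = count_ngrams(sentences)
print(counts[2][("i", "love")])
```

Each extra n-gram size adds another table, which is why memory use grows quickly with n.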

Calculate probabilities for each n-gram using Maximum Likelihood Estimation and simple linear interpolation.
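To make this step concrete: the maximum likelihood estimate of a word given its context is count(context + word) / count(context), and simple linear interpolation mixes the unigram, bigram and trigram estimates with weights that sum to 1. The Python sketch below assumes placeholder weights (0.1, 0.3, 0.6) and a toy token stream; the weights the project actually used are not given here.

```python
from collections import Counter

def mle(counts, context, word):
    """P(word | context) = count(context + word) / count(context)."""
    if context:
        denom = counts[len(context)][context]
    else:
        denom = sum(counts[1].values())  # empty context: total token count
    if denom == 0:
        return 0.0  # unseen context
    return counts[len(context) + 1][context + (word,)] / denom

def interpolated(counts, w1, w2, word, lambdas=(0.1, 0.3, 0.6)):
    """Weighted mix of unigram, bigram and trigram MLEs (placeholder weights)."""
    l1, l2, l3 = lambdas
    return (l1 * mle(counts, (), word)
            + l2 * mle(counts, (w2,), word)
            + l3 * mle(counts, (w1, w2), word))

# Toy counts built from a nine-token stream.
tokens = "i love you i love coffee i love you".split()
counts = {n: Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
          for n in (1, 2, 3)}

print(interpolated(counts, "i", "love", "you"))
```

Because the lower-order terms are always included, a word can still receive a non-zero probability when its trigram was never observed.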

Look up the user's input in the unigram, bigram and trigram data.

Extract the last three tokens (e.g. prev1, prev2) from the phrase. If the phrase is not long enough, extract the last two tokens or the last token.
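A small Python sketch of this extraction, assuming whitespace-separated input; the function name `last_tokens` is hypothetical. Slicing from the end handles short phrases without any special cases.

```python
def last_tokens(phrase, max_context=3):
    """Return up to the last `max_context` tokens of the phrase as a tuple."""
    tokens = phrase.lower().split()
    # A negative slice returns the whole list when it is shorter than max_context.
    return tuple(tokens[-max_context:])

print(last_tokens("Thanks for the follow"))
print(last_tokens("Thank you"))
print(last_tokens("Hello"))
```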

Return the top 3 matches with the highest probability.
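The lookup and ranking steps can be sketched end to end in Python as follows. Note the simplification: this sketch ranks candidates by plain relative frequency and falls back to a shorter context when the longer one was never seen, rather than using the interpolated probabilities listed in the steps. The names `build_table` and `predict` and the toy sentences are assumptions.

```python
from collections import Counter, defaultdict

def build_table(sentences, max_context=3):
    """Map each context tuple (0..max_context words) to a Counter of next words."""
    table = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for k in range(max_context + 1):
                if i - k < 0:
                    break  # not enough preceding words for a context of size k
                table[tuple(tokens[i - k:i])][word] += 1
    return table

def predict(table, phrase, top_k=3, max_context=3):
    """Return up to top_k (word, probability) pairs for the next word."""
    context = tuple(phrase.lower().split()[-max_context:])
    while True:
        candidates = table.get(context)
        if candidates:
            total = sum(candidates.values())
            return [(w, c / total) for w, c in candidates.most_common(top_k)]
        if not context:
            return []  # nothing known, even without context
        context = context[1:]  # drop the oldest word and retry

sentences = [["i", "love", "you"], ["i", "love", "coffee"], ["i", "love", "you"]]
table = build_table(sentences)
print(predict(table, "I love"))
```

With an unseen phrase the loop ends at the empty context, so the most frequent single words are returned as a last resort.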

App User Interface

Instructions for using the app:
- Wait 10 seconds for the app to load
- Enter text in the input textbox
- The top 3 most probable next words are displayed in the output textbox

Conclusion

Limitations of the Model

Because of RAM limitations when processing the data, a representative sample (~10K) was used from a corpus of 1M+ blog, news and Twitter entries.
The prediction model is biased towards the training data. Prediction of new words is only moderately accurate because of the above limitation.

References