Monnappa Somanna
Fri Dec 23 19:54:43 2016
The goal of this project is to “predict” the “most likely” word the user wants to type next, based on the previous 2 or 3 words.
This application is useful in mobile texting, where it improves the user experience by making typing faster.
Millions of news articles, blog posts and tweets are used as the “Corpus” for training the model.
Following are the key links:
Link to the Application
Link to the GitHub repository
Following are the key steps:
Create a 'Corpus' by preprocessing text from millions of news articles, blog posts and tweets.
'Tokenization' of the text: break the given text into units called tokens.
Create n-gram sequences from the above data. An n-gram is a contiguous sequence of N items from a given sequence of text or speech. … An n-gram of size 1 is referred to as a “unigram”; size 2 is a “bigram”; size 3 is a “trigram”.
Count the number of occurrences of each n-gram. N is limited to 4 because of memory limitations.
Calculate probabilities for each n-gram using the Maximum Likelihood Estimate and simple linear interpolation.
Look up the user input against the unigrams, bigrams and trigrams.
Extract the last three tokens (e.g. prev1, prev2) from the phrase. If the phrase is not long enough, extract the last two tokens or the last token.
Return the top 3 matches with the highest probability.
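The steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the application's actual code: the function names (`tokenize`, `count_ngrams`, `predict`), the simple regular-expression tokenizer and the interpolation weights are all assumptions made for the example. Each candidate word is scored by a weighted sum of its maximum likelihood estimates, count(context + word) / count(context), over n-gram orders 1 to 4.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase the text and keep runs of letters/apostrophes as tokens.
    return re.findall(r"[a-z']+", text.lower())

def count_ngrams(sentences, max_n=4):
    # counts[n] maps each n-token tuple to its number of occurrences.
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for sentence in sentences:
        tokens = tokenize(sentence)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return counts

def predict(phrase, counts, weights=(0.1, 0.2, 0.3, 0.4), top=3):
    # weights[n - 1] is the interpolation weight for order n.
    # These values are assumed for illustration, not tuned.
    tokens = tokenize(phrase)
    total = sum(counts[1].values())
    scores = Counter()
    for n, weight in enumerate(weights, start=1):
        if len(tokens) < n - 1:
            continue  # phrase too short for this order; rely on shorter contexts
        context = tuple(tokens[-(n - 1):]) if n > 1 else ()
        denom = counts[n - 1][context] if n > 1 else total
        if denom == 0:
            continue  # context never seen in the training data
        for gram, c in counts[n].items():
            if gram[:-1] == context:
                # Maximum likelihood estimate, weighted by the order's weight.
                scores[gram[-1]] += weight * c / denom
    return [word for word, _ in scores.most_common(top)]

sentences = [
    "I love New York",
    "I love new shoes",
    "I love New York City",
    "We love New York",
]
counts = count_ngrams(sentences)
suggestions = predict("I love new", counts)
print(suggestions)  # 'york' is ranked first, 'shoes' second
```

Scanning every n-gram for each lookup keeps the sketch short. With a large corpus, an index keyed by context would be the more practical choice.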
Instructions for using the App:
Limitations of the Model
Because of RAM limitations when processing the data, a representative sample (~10K) was used from a corpus of 1M+ blog, news and Twitter data.
The prediction model is biased towards the training data. Because of the above limitation, prediction of new words is only moderately accurate.
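One way to draw a fixed-size sample from a corpus too large for memory is reservoir sampling, which reads the input once and keeps only the sampled lines. The sketch below is an assumption about how such a sample could be taken, not a description of how this project's sample was produced. The function name `sample_lines` and the seed are hypothetical.

```python
import random

def sample_lines(lines, k, seed=42):
    # Reservoir sampling (Algorithm R): after one pass over the input,
    # every line has the same chance of being among the k kept lines.
    rng = random.Random(seed)
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            reservoir.append(line)  # fill the reservoir first
        else:
            j = rng.randint(0, i)   # inclusive on both ends
            if j < k:
                reservoir[j] = line  # replace a previously kept line
    return reservoir

# Usage with a generator standing in for a large corpus file.
corpus = (f"line {i}" for i in range(1_000_000))
sample = sample_lines(corpus, 10_000)
print(len(sample))
```

Because `lines` can be any iterable, an open file object can be passed in directly, so memory use stays proportional to the sample size rather than the corpus size.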
References