Johns Hopkins University Data Science Capstone

Monnappa Somanna
23-Dec-2016

The goal of this project is to predict the most likely word the user wants to type next, based on the previous 2 or 3 words.

This application is useful in mobile texting, where it improves the user experience by making typing faster.

Millions of news articles, blog posts and tweets are used as the “corpus” for training the model.

Following are the key links:

Following are the key steps:

1. Create a 'Corpus' by preprocessing text from millions of news articles, blog posts and tweets.

2. 'Tokenization' of the text: break the given text into units called tokens. A token may be a word, a number or a punctuation mark.

3. Create n-gram sequences from the above data. An n-gram is a contiguous sequence of N items from a given sequence of text or speech. … An n-gram of size 1 is referred to as a “unigram”; size 2 is a “bigram”; size 3 is a “trigram”.

4. Count the number of occurrences of each n-gram. N is limited to 4 because of memory constraints.

5. Calculate probabilities for each n-gram using Maximum Likelihood Estimation and Simple Linear Interpolation.

6. Look up the user's input among the unigrams, bigrams and trigrams.

7. Extract the last three tokens (e.g. prev1, prev2, prev3) from the phrase. If the phrase is not long enough, extract the last two tokens or the last token.
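
The steps above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the code behind the app: the function names (tokenize, count_ngrams, mle, interpolated, predict), the interpolation weights in LAMBDAS and the toy corpus are all hypothetical, and the tokenizer is simplified to keep words only, dropping numbers and punctuation.

```python
import re
from collections import Counter

MAX_N = 4  # n is limited to 4, as in the steps above
# Hypothetical interpolation weights for orders 1..4 (assumption, not from the project).
LAMBDAS = (0.1, 0.2, 0.3, 0.4)


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())


def count_ngrams(tokens, max_n=MAX_N):
    """Return {n: Counter of n-gram tuples} for n = 1..max_n."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[n][tuple(tokens[i:i + n])] += 1
    return counts


def mle(counts, context, word):
    """Maximum likelihood estimate: count(context + word) / count(context)."""
    order = len(context) + 1
    if order == 1:
        total = sum(counts[1].values())
        return counts[1][(word,)] / total if total else 0.0
    seen = counts[order - 1][context]
    return counts[order][context + (word,)] / seen if seen else 0.0


def interpolated(counts, context, word, lambdas=LAMBDAS):
    """Simple linear interpolation: weighted sum of the unigram..4-gram MLEs.

    Only orders that fit the available context are used, and the weights
    of those orders are renormalized so they sum to 1.
    """
    score, used = 0.0, 0.0
    for order, lam in enumerate(lambdas, start=1):
        k = order - 1  # number of context tokens this order needs
        if k > len(context):
            break
        score += lam * mle(counts, context[len(context) - k:], word)
        used += lam
    return score / used if used else 0.0


def predict(counts, phrase, top=3):
    """Rank every known word as the continuation of the phrase."""
    tokens = tokenize(phrase)
    context = tuple(tokens[-(MAX_N - 1):])  # last three tokens, or fewer if the phrase is short
    scored = [(interpolated(counts, context, w), w) for (w,) in counts[1]]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first, ties alphabetical
    return [w for _, w in scored[:top]]


# Toy corpus standing in for the news, blog and Twitter text.
corpus = "the cat sat on the mat the cat sat on the sofa the dog sat on the mat"
counts = count_ngrams(tokenize(corpus))
print(predict(counts, "sat on the"))  # ['mat', 'sofa', 'cat']
```

Scoring every word in the vocabulary is fine for a toy corpus; with millions of n-grams one would more likely score only the words that were actually observed after the context, and tune the weights on held-out text.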

Instructions for using the App:

Limitations of the App:

References: