Data Science Capstone project

Nagesh Subrahmanyam
Oct 9th, 2016

Predict the next word based on an input phrase

Applicability in smart phone keyboards to improve typing experience
A keyboard application predicts the next word once the user has typed some text
Predictions are learned from an English-language corpus
English language corpus available from this site: http://www.corpora.heliohost.org/
Corpus has three types of content: Twitter tweets, blog posts and news items

Develop model to predict next word based on user input

Human languages enforce a syntax (grammar) for both spoken and written text
The syntax can help arrive at a structure of a sentence
That is, given the first few words of a sentence, only a limited set of words can plausibly follow
These leading words are known as n-grams, where n=2 is a bigram, n=3 is a trigram, etc.
Model development follows the same pattern, i.e. build n-grams and use them to arrive at the most probable next word.
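The n-gram idea above can be sketched in a few lines of Python. This is only an illustration: the whitespace tokenizer, the function names `tokenize` and `followers`, and the three-sentence toy corpus are all assumptions, not the project's actual pipeline.

```python
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()

def followers(sentences, n):
    """Map each n-word prefix to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for sentence in sentences:
        tokens = tokenize(sentence)
        # Stop n words before the end so every prefix has a following word.
        for i in range(len(tokens) - n):
            prefix = " ".join(tokens[i:i + n])
            table[prefix][tokens[i + n]] += 1
    return table

# Toy corpus, invented for this example.
corpus = ["I love you so much", "I love you too", "We love you so"]
bigram_followers = followers(corpus, 2)
print(bigram_followers["love you"].most_common())
```

Each prefix ends up with a tally of the words seen after it, which is the raw material for ranking candidate next words.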

Data acquisition and cleaning

Tokenizing the text

Store and retrieve

The result is stored in a SQLite database and queried by a Shiny application for results.
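A minimal sketch of the store-and-retrieve step, written in Python with the standard `sqlite3` module rather than the Shiny application itself. The table name `ngrams`, the schema, and the helper `top_followers` are assumptions; the column names and the three rows are borrowed from the walkthrough table in this deck.

```python
import sqlite3

# In-memory database so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE ngrams (
           grams TEXT, gramNumWords INTEGER, totalCount INTEGER,
           word TEXT, wordCount INTEGER,
           PRIMARY KEY (grams, word))"""
)
rows = [
    ("i love you", 3, 2775, "so", 949),
    ("love you", 2, 2746, "too", 1535),
    ("you", 1, 288047, "are", 29052),
]
conn.executemany("INSERT INTO ngrams VALUES (?, ?, ?, ?, ?)", rows)

def top_followers(conn, grams, limit=3):
    """Return (word, score) pairs for one n-gram, highest score first."""
    cur = conn.execute(
        "SELECT word, 1.0 * wordCount / totalCount AS score "
        "FROM ngrams WHERE grams = ? ORDER BY score DESC LIMIT ?",
        (grams, limit),
    )
    return cur.fetchall()

print(top_followers(conn, "love you"))
```

Computing the score inside the query keeps the stored table to raw counts; precomputing a score column would work equally well.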

Sample walkthrough of an input phrase

grams	gramNumWords	totalCount	word	wordCount	score
i love you	3	2775	so	949	0.34198
love you	2	2746	too	1535	0.55899
you	1	288047	are	29052	0.10086

The trigram "i love you" occurred 2775 times, and "so" followed it 949 times.
Therefore, the score is 949 / 2775 = 0.341982.
If the trigram is not found, the bigram "love you" is tested, and so on.
If an n-gram is not available, it is automatically skipped from the list.
Lowering n as we proceed is a means of implementing the Stupid Backoff algorithm
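The walkthrough can be reproduced with a short Python sketch. The dictionary layout and the function name `candidates` are assumptions made for illustration; the counts are copied from the table above. Stupid Backoff as originally published also multiplies scores from shorter n-grams by a fixed factor (0.4 per backoff step); the sketch leaves that out so its scores match the table.

```python
# Counts copied from the walkthrough table:
# n-gram -> (totalCount, {following word: wordCount})
COUNTS = {
    "i love you": (2775, {"so": 949}),
    "love you": (2746, {"too": 1535}),
    "you": (288047, {"are": 29052}),
}

def candidates(phrase, counts=COUNTS, max_n=3):
    """List (n-gram, word, score) for each available n-gram, longest first."""
    tokens = phrase.lower().split()
    found = []
    for n in range(min(max_n, len(tokens)), 0, -1):
        gram = " ".join(tokens[-n:])
        if gram not in counts:
            continue  # n-gram unavailable: skip it and try a shorter one
        total, following = counts[gram]
        word, word_count = max(following.items(), key=lambda kv: kv[1])
        found.append((gram, word, round(word_count / total, 5)))
    return found

for row in candidates("i love you"):
    print(row)
print(candidates("they love you"))
```

For "they love you" the trigram is missing from the counts, so the list starts at the bigram "love you".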

List of references