Capstone Data Science Project

ACGII
Tue Dec 20 16:52:28 2016

The App - ACG's Next Word Predictor

Description

Often when entering text into a mobile telephone or other handheld device, you are prompted with a likely next word. That is the gist of this application. The user supplies a partial phrase or sentence and the application provides its best guess for the next word.

Data Set

The SwiftKey data set used to make this application consists of samples taken from three sources: news articles, blogs and Twitter messages. The data set is very large, so it must be reduced to a form that is compact and fast to search.

Application

The application uses this data to predict the next word in a phrase. The prediction relies on a history of previously seen phrase fragments, known as Ngrams. The application searches this history and chooses the next word based on statistical likelihood.

More about Ngrams. . .

Types

The different types of Ngrams used are:
(borrowed from Dr. Seuss - I do not like green eggs and ham.)

 * Unigrams - single words found in text (I)
 * Bigrams - two contiguous words found in text (I do)
 * Trigrams - three contiguous words found in text (I do not)
 * Quadragrams - four contiguous words found in text (I do not like)
 * Pentagrams - five contiguous words found in text (I do not like green)
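
As a rough illustration of the list above, the Ngrams of a sentence can be produced by sliding a window of n words across it. This is a minimal Python sketch, not the application's actual code; the function name ngrams and the whitespace-only tokenization are assumptions made for the example.

```python
def ngrams(text, n):
    """Return every run of n contiguous words in text, in order."""
    words = text.split()  # assumed tokenization: split on whitespace only
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "I do not like green eggs and ham"

# Print the first Ngram of each order, matching the examples in the list.
for n in range(1, 6):
    print(n, "-", ngrams(sentence, n)[0])
```

A sentence shorter than n words simply yields an empty list, which is why short sentences contribute nothing to the higher-order tables.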

Storage

Each type of Ngram is stored in its own table along with its frequency of occurrence. In other words, there are five separate tables, one each for unigrams, bigrams, trigrams, quadragrams and pentagrams. Each table contains two columns, name and frequency.

Algorithm and Data

Example - I am Sam. Sam I am.

The simple sentences are decomposed into unigrams, bigrams and trigrams. There are no quadragrams or pentagrams because each sentence has only three words.

 * Unigram(frequency) - I(2), am(2), Sam(2)
 * Bigram(frequency) - I am(2), am Sam(1), Sam I(1)
 * Trigram(frequency) - I am Sam(1), Sam I am(1)
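
The counts above can be reproduced with a small sketch that builds one frequency table per Ngram order. It is written in Python for illustration only. The function name build_tables, the use of collections.Counter as the two-column table, and the rule of splitting sentences on ., ! and ? are all assumptions, not details taken from the application.

```python
import re
from collections import Counter

def build_tables(text, max_n=5):
    """Build one frequency table per Ngram order, 1 through max_n.

    Ngrams are counted within each sentence and never span two sentences.
    """
    tables = {n: Counter() for n in range(1, max_n + 1)}
    # Assumed sentence rule: split on runs of '.', '!' or '?'.
    for sentence in re.split(r"[.!?]+", text):
        words = sentence.split()
        for n in tables:
            for i in range(len(words) - n + 1):
                tables[n][" ".join(words[i:i + n])] += 1
    return tables

tables = build_tables("I am Sam. Sam I am.")
for n, table in tables.items():
    print(n, dict(table))
```

The quadragram and pentagram tables come out empty here, in line with the example.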

The Algorithm - Explanation

When a sentence fragment is input, it is broken down into quadragrams, and the pentagram table is searched for entries that begin with the last quadragram. This is a Markov assumption: only the most recent words are used to predict the next one. The algorithm selects the most frequent match.

If no matches are found, the fragment is broken into trigrams and the quadragram table is searched. If there is still no match, the process is repeated for bigrams and then unigrams, each time searching the table one order higher.

The matching word with the highest frequency is returned as the most likely candidate.
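
The search described above can be sketched as a simple backoff loop. This is a hedged Python illustration under stated assumptions, not the application's code: the name predict_next, the dictionary of Counter tables keyed by Ngram order, and the final fallback to the most frequent unigram when nothing matches are all hypothetical. The toy tables are the ones from the "I am Sam. Sam I am." example.

```python
from collections import Counter

def predict_next(fragment, tables, max_n=5):
    """Guess the next word by backing off from long contexts to short ones.

    tables[n] is assumed to map each n-word Ngram (words joined by single
    spaces) to its frequency, for n = 1 through max_n.
    """
    words = fragment.split()
    # Try the last 4 words against the pentagram table, then the last 3
    # against the quadragram table, down to 1 word against the bigrams.
    for k in range(min(max_n - 1, len(words)), 0, -1):
        prefix = " ".join(words[-k:]) + " "
        matches = {g: c for g, c in tables[k + 1].items() if g.startswith(prefix)}
        if matches:
            best = max(matches, key=matches.get)  # highest frequency wins
            return best.split()[-1]
    # Assumed fallback: no context matched, so use the most common unigram.
    if tables[1]:
        return tables[1].most_common(1)[0][0]
    return None

# Toy tables taken from the "I am Sam. Sam I am." example.
tables = {
    1: Counter({"I": 2, "am": 2, "Sam": 2}),
    2: Counter({"I am": 2, "am Sam": 1, "Sam I": 1}),
    3: Counter({"I am Sam": 1, "Sam I am": 1}),
    4: Counter(),
    5: Counter(),
}

print(predict_next("Sam I", tables))
print(predict_next("I am", tables))
```

The linear scan over each table keeps the sketch short. With tables built from a large data set, an index keyed on the leading words would likely be needed for acceptable response time.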

Algorithm in Action

 * Input: Merry        Next Word: christmas
 * Input: Happy        Next Word: birthday
 * Input: Happy New    Next Word: year

Try it yourself at:

https://acgii.shinyapps.io/finalproject/