Teo Tse Tsong
15th April 2016
The objective of the project is to build a model that predicts the next word given a “phrase”. The model is built on a corpus of text collected from blogs, news and Twitter posts, available from the following URL:
https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip.
The prediction model is based on N-grams. An N-gram is a contiguous sequence of N tokens (words) in a sentence or phrase. In the model developed, tables of 2-grams, 3-grams and 4-grams are built and stored. These tables are sorted by frequency to determine the relative probabilities of different N-grams.
The N-grams have been extracted from 50,000 lines of text each from the blogs and news corpora. Memory limitations prevented more lines from being added. This constitutes about 5.5% of the blogs lines but about 65% of the available news lines.
Memory limitations also restrict the model to N-grams of order 4 or lower. More accurate prediction should be possible with the inclusion of higher-order N-grams.
The Twitter data set was not used because it tended to contain more colloquial expressions than complete, structured phrases and words.
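As an illustration of how such frequency tables can be built, here is a minimal Python sketch. It is not the code used in the project: the function names (tokenize, ngram_counts), the whitespace tokenizer and the two sample lines are assumptions made for this example only.

```python
from collections import Counter

def tokenize(line):
    # Assumed tokenizer: lowercase and split on whitespace.
    # A real pipeline would also handle punctuation, numbers, etc.
    return line.lower().split()

def ngram_counts(lines, n):
    # Count every contiguous sequence of n tokens within each line.
    counts = Counter()
    for line in lines:
        tokens = tokenize(line)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Hypothetical sample lines standing in for the blogs/news text.
sample_lines = ["the cat sat on the mat", "the cat ate the fish"]

bigrams = ngram_counts(sample_lines, 2)
total = sum(bigrams.values())

# Sort by frequency and convert counts to relative probabilities.
for ngram, count in bigrams.most_common(3):
    print(" ".join(ngram), count, round(count / total, 3))
```

Sorting by count, as most_common does here, corresponds to ordering the table by frequency; dividing by the total gives a relative probability for each N-gram.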
A very simple backoff approach is used in the following manner:
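The individual steps are not reproduced in this text. One common form of simple backoff, sketched here in Python as an assumption rather than as the project's actual procedure, looks up the last three words in the 4-gram table, then falls back to the 3-gram and 2-gram tables if no match is found. The names build_tables and predict_next, the parameter k and the sample lines are hypothetical.

```python
from collections import Counter, defaultdict

def build_tables(lines, max_n=4):
    # tables[n] maps an (n-1)-word context to a Counter of following words.
    tables = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    for line in lines:
        tokens = line.lower().split()
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                tables[n][context][tokens[i + n - 1]] += 1
    return tables

def predict_next(phrase, tables, max_n=4, k=3):
    # Try the longest context first, then back off to shorter ones.
    tokens = phrase.lower().split()
    for n in range(max_n, 1, -1):
        if len(tokens) < n - 1:
            continue  # phrase too short for this N-gram order
        context = tuple(tokens[-(n - 1):])
        candidates = tables[n].get(context)
        if candidates:
            # Return up to k most frequent continuations.
            return [word for word, _ in candidates.most_common(k)]
    return []  # no table contains a matching context

# Hypothetical sample lines standing in for the corpus.
sample_lines = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "a dog sat on the mat",
]
tables = build_tables(sample_lines)
print(predict_next("cat sat on the", tables))  # matched in the 4-gram table
print(predict_next("purple the", tables))      # backs off to the 2-gram table
```

This variant returns the first match without discounting lower-order probabilities; weighted schemes exist but are not assumed here.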
The prediction app can be found at
https://tsetsong.shinyapps.io/CapstoneShiny/
The steps required to use the app are labelled in the figure above. After a short phrase is entered, expect a wait of 1-2 minutes while the prediction runs; the results are then displayed.
A model for predicting the next word of a phrase using N-grams has been developed and functions reasonably well. Before this method was adopted, both a Maximum Entropy model and a Naive Bayes model were explored, but the results were not satisfactory.
While functioning reasonably well, the current model has a number of shortcomings:
Inability to discern contextual word associations. For example, “life” and “death” or “rags” and “riches”.
Large data tables required.
Current implementation is limited in “vocabulary”, with no self-learning ability.