Word Prediction Model

Aleksandr Voishchev
April 23, 2016

Data Science Specialization Capstone Project provided by

  • Coursera
  • Johns Hopkins Bloomberg School of Public Health
  • SwiftKey

Usage

The source code is available on GitHub
The Shiny App is available here

Instructions

Application and model description

  • An n-gram model with “Stupid Backoff” (Brants et al., 2007) is used; a minimal scoring sketch appears after this list.
  • The model dataset was built with the quanteda R package, which is considerably faster than the tm package (see the data-build sketch below).
  • A 3% random sample of the entire dataset (HC Corpora) was used; this sample size is a compromise between speed, model size, Shinyapps.io server limits, and accuracy.
  • Some statistics of this dataset are available here.
  • 2-gram, 3-gram, 4-gram and 5-gram sub-models were used, with stopwords removed.
  • The entire model data contains ~6.8 million observations and takes about 53 MB on disk.
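
As a rough illustration of the data-build step, the sketch below samples the raw text and counts one n-gram order with quanteda. The file path, seed, and tokenizer options are illustrative assumptions rather than the app's exact settings; the real build repeats the counting for the 2- to 5-gram tables.

    library(quanteda)

    set.seed(42)                                             # any fixed seed
    lines <- readLines("en_US.blogs.txt", skipNul = TRUE)    # hypothetical path
    samp  <- sample(lines, round(length(lines) * 0.03))      # 3% random sample

    toks <- tokens(samp, remove_punct = TRUE,
                   remove_numbers = TRUE, remove_symbols = TRUE)
    toks <- tokens_remove(tokens_tolower(toks), stopwords("en"))  # no stopwords

    # Count one n-gram order via a document-feature matrix; repeat for n = 2..5.
    trigrams <- tokens_ngrams(toks, n = 3, concatenator = " ")
    freq <- sort(colSums(dfm(trigrams)), decreasing = TRUE)
    head(freq)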
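
The scoring rule itself is simple enough to sketch in a few lines. The version below assumes the n-gram counts sit in a named list (the counts argument and the <TOTAL> key for the corpus size are illustrative, not the app's actual data structure); 0.4 is the backoff factor recommended by Brants et al. (2007).

    # Score a candidate word given a context, backing off with factor 0.4.
    sb_score <- function(context, word, counts, lambda = 0.4) {
      hi  <- counts[[paste(context, word)]]
      ctx <- counts[[context]]
      if (!is.null(hi) && !is.null(ctx)) {
        return(hi / ctx)                      # relative frequency when observed
      }
      parts <- strsplit(context, " ")[[1]]
      if (length(parts) <= 1) {               # bottom level: unigram score
        uni <- counts[[word]]
        return(lambda * if (is.null(uni)) 0 else uni / counts[["<TOTAL>"]])
      }
      shorter <- paste(parts[-1], collapse = " ")
      lambda * sb_score(shorter, word, counts, lambda)   # drop one context word
    }

Ranking every candidate word by sb_score(context, word, counts) and keeping the top three gives the top-3 predictions scored by the benchmark below.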

Test model

The model was tested with the Next word prediction benchmark

    Overall top-3 score:     13.39 %
    Overall top-1 precision: 10.40 %
    Overall top-3 precision: 15.98 %
    Average runtime:         753.63 msec
    Number of predictions:   5751
    Total memory used:       440.42 MB    

    Dataset details
    Dataset "blogs" (119 lines, 2986 words)
    Score: 13.11 %, Top-1 precision: 10.38 %, Top-3 precision: 15.54 %
    Dataset "tweets" (159 lines, 2808 words)
    Score: 13.67 %, Top-1 precision: 10.42 %, Top-3 precision: 16.42 %

These are not bad results for a model of this size.

Problems and further Exploration

  • Cleaning real texts from blogs and similar sources is the most important and most difficult task.
  • Not all non-letter symbols should simply be replaced with spaces; in this model “@” was kept as a word after manual analysis of the n-gram tables (a cleaning sketch follows this list).
  • Misspellings, abbreviations, and acronyms add further complexity.
  • More sophisticated smoothing algorithms such as Good-Turing and Kneser-Ney would be interesting to implement; experiments on Amazon EC2 instances would also help determine and optimize memory and other resource usage.
  • The question of cleaning real-world text requires further study.
  • The user interface is very simple; its usability needs improvement.
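
A minimal sketch of such a cleaning step, using plain base-R gsub rules; the character class is an illustrative assumption, except for the “@” handling described above.

    clean_text <- function(x) {
      x <- tolower(x)
      x <- gsub("[^a-z'@ ]", " ", x)   # most non-letter symbols become spaces
      x <- gsub("@", " @ ", x)         # but "@" survives as its own word
      x <- gsub("\\s+", " ", x)        # collapse repeated whitespace
      trimws(x)
    }

    clean_text("Meet me @5 -- OK?")
    #> [1] "meet me @ ok"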