Word Prediction Model

Aleksandr Voishchev
April 23, 2016

Data Science Specialization Capstone Project provided by

  • Coursera
  • Johns Hopkins Bloomberg School of Public Health
  • SwiftKey

Usage

The source code is available on GitHub
The Shiny App is available here

Instructions

Application and model description

  • An n-gram model with “Stupid Backoff” (Brants et al., 2007) is used; a minimal scoring sketch appears after this list.
  • The model dataset was built with the quanteda R package, which is considerably faster than the tm package (see the data-build sketch below).
  • A 3% random sample of the entire dataset (HC Corpora) was used; this sample size is a compromise between speed, model size, Shinyapps.io server limits, and accuracy.
  • Some statistics of this dataset are available here.
  • 2-gram, 3-gram, 4-gram and 5-gram sub-models were used, with stopwords removed.
  • The entire model data contains ~6.8 million observations and takes about 53 MB on disk.
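
As a rough illustration of the data-build step, the sketch below samples the raw text and counts one n-gram order with quanteda. The file path, seed, and tokenizer options are illustrative assumptions rather than the app's exact settings; the real build repeats the counting for the 2- to 5-gram tables.

    library(quanteda)

    set.seed(42)                                             # any fixed seed
    lines <- readLines("en_US.blogs.txt", skipNul = TRUE)    # hypothetical path
    samp  <- sample(lines, round(length(lines) * 0.03))      # 3% random sample

    toks <- tokens(samp, remove_punct = TRUE,
                   remove_numbers = TRUE, remove_symbols = TRUE)
    toks <- tokens_remove(tokens_tolower(toks), stopwords("en"))  # no stopwords

    # Count one n-gram order via a document-feature matrix; repeat for n = 2..5.
    trigrams <- tokens_ngrams(toks, n = 3, concatenator = " ")
    freq <- sort(colSums(dfm(trigrams)), decreasing = TRUE)
    head(freq)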
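
The scoring rule itself is simple enough to sketch in a few lines. The version below assumes the n-gram counts sit in a named list (the counts argument and the <TOTAL> key for the corpus size are illustrative, not the app's actual data structure); 0.4 is the backoff factor recommended by Brants et al. (2007).

    # Score a candidate word given a context, backing off with factor 0.4.
    sb_score <- function(context, word, counts, lambda = 0.4) {
      hi  <- counts[[paste(context, word)]]
      ctx <- counts[[context]]
      if (!is.null(hi) && !is.null(ctx)) {
        return(hi / ctx)                      # relative frequency when observed
      }
      parts <- strsplit(context, " ")[[1]]
      if (length(parts) <= 1) {               # bottom level: unigram score
        uni <- counts[[word]]
        return(lambda * if (is.null(uni)) 0 else uni / counts[["<TOTAL>"]])
      }
      shorter <- paste(parts[-1], collapse = " ")
      lambda * sb_score(shorter, word, counts, lambda)   # drop one context word
    }

Ranking every candidate word by sb_score(context, word, counts) and keeping the top three gives the top-3 predictions scored by the benchmark below.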

Test model

The model was tested with the Next word prediction benchmark

    Overall top-3 score:     13.39 %
    Overall top-1 precision: 10.40 %
    Overall top-3 precision: 15.98 %
    Average runtime:         753.63 msec
    Number of predictions:   5751
    Total memory used:       440.42 MB    

    Dataset details
    Dataset "blogs" (119 lines, 2986 words)
    Score: 13.11 %, Top-1 precision: 10.38 %, Top-3 precision: 15.54 %
    Dataset "tweets" (159 lines, 2808 words)
    Score: 13.67 %, Top-1 precision: 10.42 %, Top-3 precision: 16.42 %

These are not bad results for a model of this size.

Problems and further Exploration

  • Cleaning real texts from blogs and similar sources is the most important and most difficult task.
  • Not all non-letter symbols should simply be replaced with spaces; in this model “@” was kept as a word after manual analysis of the n-gram tables (a cleaning sketch follows this list).
  • Misspellings, abbreviations, and acronyms add further complexity.
  • More sophisticated smoothing algorithms such as Good-Turing and Kneser-Ney would be interesting to implement; experiments on Amazon EC2 instances would also help determine and optimize memory and other resource usage.
  • The question of cleaning real-world text requires further study.
  • The user interface is very simple; its usability needs improvement.
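
A minimal sketch of such a cleaning step, using plain base-R gsub rules; the character class is an illustrative assumption, except for the “@” handling described above.

    clean_text <- function(x) {
      x <- tolower(x)
      x <- gsub("[^a-z'@ ]", " ", x)   # most non-letter symbols become spaces
      x <- gsub("@", " @ ", x)         # but "@" survives as its own word
      x <- gsub("\\s+", " ", x)        # collapse repeated whitespace
      trimws(x)
    }

    clean_text("Meet me @5 -- OK?")
    #> [1] "meet me @ ok"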