Run-on Sentence Generator

Charles R. Allen
2016-06-12

The magic of n-gram prediction models

The Need for Prediction Models

User input on devices lacking a full-sized keyboard benefit greatly from reduced effort of expressing English language words.

Given the large corpus of human-generated text out there, it is desirable that a machine can ease experience of typing on a device not designed for a large quantity of text input.

In order to make such a system usable, reasonable Natural Language Processing needs to be in place, and the capability to pre-process large quantities of data into smaller mobile-friendly chunks needs developed.

This body of work presents the methods and results of such a system.

Data Cleaning

The raw data is tokenized and cleaned in the following ways:

  1. Each punctuation mark is eliminated.
  2. All letters are lower-cased.
  3. All numbers are eliminated.
  4. Lines are then tokenized by splitting on the space character.
  5. Words, bi-grams, tri-grams, and quad-grams are generated.
  6. Only n-grams which appear at least 3 times are kept.
  7. The n-grams are split into their terminal word and their prefix.
  8. The (n-1)-grams, the suffix word, and the occurance count are stored for use in R.

Processing And Prediction

SwiftKey provided a corpus of tweets, blog posts, and news stories for analysis. It contains 150,000 unique words over more than 3 million lines of text! This corpus is processed in a Big Data friendly way with Spark. App hosting is accomplished through shinyapps. R allows reasonable flexability in creating interactive statistic apps.

Prediction is accomplished int he R appliation with a smiple backoff model, whereby the tailing n-gram's next word is judged by its occurance previously. Prediction is judged initially by an input's terminal tri-gram. If an entry is found in the tri-gram to quad-gram mapping, it is used for prediction. Else, if the termina bi-gram is found in the bi-gram to tri-gram mapping, it is used for prediction. Else if the terminal word is used in the word to bi-gram mapping, it is used for prediction. Else a simple word count is used to predict the user's next input.

Result

The app implements the preceeding methodology on the SwiftKey provided corpus. The result is a simple app that effectively becomes a run-on sentence generator. The app can be found at https://allen-net.shinyapps.io/CourseraSwiftKeyNGramPredictor. The raw data, comprising over 500MB is stored in highly refined form for consumption by the app into a data file of only 17MB! making it suitable to be used on mobile applications.

Any user wishing to see what other words may have been choosen may do so by enabling a checkbox. A dynamic Kneser-Ney smoothing to get the probability weight of the resulting words can be enabled, but is computationally expensive so off by default and is intended primarily as a teaching and learning aid rather than a major feature of the prediction.