Charles R. Allen
2016-06-12
The magic of n-gram prediction models
User input on devices lacking a full-sized keyboard benefit greatly from reduced effort of expressing English language words.
Given the large corpus of human-generated text out there, it is desirable that a machine can ease experience of typing on a device not designed for a large quantity of text input.
In order to make such a system usable, reasonable Natural Language Processing needs to be in place, and the capability to pre-process large quantities of data into smaller mobile-friendly chunks needs developed.
This body of work presents the methods and results of such a system.
The raw data is tokenized and cleaned in the following ways:
SwiftKey provided a corpus of tweets, blog posts, and news stories for analysis. It contains 150,000 unique words over more than 3 million lines of text! This corpus is processed in a Big Data friendly way with Spark. App hosting is accomplished through shinyapps. R allows reasonable flexability in creating interactive statistic apps.
Prediction is accomplished int he R appliation with a smiple backoff model, whereby the tailing n-gram's next word is judged by its occurance previously. Prediction is judged initially by an input's terminal tri-gram. If an entry is found in the tri-gram to quad-gram mapping, it is used for prediction. Else, if the termina bi-gram is found in the bi-gram to tri-gram mapping, it is used for prediction. Else if the terminal word is used in the word to bi-gram mapping, it is used for prediction. Else a simple word count is used to predict the user's next input.
The app implements the preceeding methodology on the SwiftKey provided corpus. The result is a simple app that effectively becomes a run-on sentence generator. The app can be found at https://allen-net.shinyapps.io/CourseraSwiftKeyNGramPredictor. The raw data, comprising over 500MB is stored in highly refined form for consumption by the app into a data file of only 17MB! making it suitable to be used on mobile applications.
Any user wishing to see what other words may have been choosen may do so by enabling a checkbox. A dynamic Kneser-Ney smoothing to get the probability weight of the resulting words can be enabled, but is computationally expensive so off by default and is intended primarily as a teaching and learning aid rather than a major feature of the prediction.