Predict Next Words

Data Science Capstone Final Project

Guanglai Li
Aug. 22, 2015


Summary of the project

In this project we build a model that predicts the next word after a phrase has been typed. Models of this kind are widely used for text input on mobile devices such as cell phones and tablets.

data: Three text files, 'en_US.blogs.txt', 'en_US.news.txt', and 'en_US.twitter.txt', are used to build models for the English language.
tools: Linux text editing utilities are used to extract n-grams. The R language is used to build the prediction model and the application.
app: The app is hosted on shinyapps.io. Click here to use the model.

Algorithm - extract n-grams using Linux

procedures:
- profanity filtering
- remove dots from words like Dr., Mr., etc., i.e., …
- break sentences at quotation marks, but not at apostrophes
- additional cleaning so that the final file only contains a-z, space, and apostrophe ' in words like i'm, wasn't, …
- extract and count uni-, bi-, tri-, and four-grams
benefits of Linux text editing utilities:
- fast and memory efficient
- all text in the three files was processed in one hour on a laptop
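The cleaning and counting steps above can be sketched in a few lines. This is a minimal Python illustration, not the Linux pipeline used in the project: the profanity list, the abbreviation list, the set of sentence-breaking characters, and the choice to drop whole sentences that contain a profane word are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical placeholder; the project's actual profanity list is not given.
PROFANITY = {"badword"}
# Assumed subset of abbreviations whose dots should not end a sentence.
ABBREV_RE = re.compile(r"\b(?:dr|mr|mrs|ms|etc|i\.e|e\.g)\.")
# Assumed sentence breaks: end punctuation, quotation marks, newlines (not apostrophes).
SENTENCE_BREAK_RE = re.compile(r'[.!?;:"\u201c\u201d\n]+')


def clean_sentences(text):
    """Return a list of token lists containing only a-z and apostrophes."""
    text = text.lower()
    # Remove dots from abbreviations, e.g. "dr." -> "dr", "i.e." -> "ie".
    text = ABBREV_RE.sub(lambda m: m.group(0).replace(".", ""), text)
    sentences = []
    for fragment in SENTENCE_BREAK_RE.split(text):
        # Keep only a-z, space, and apostrophe; everything else becomes a space.
        fragment = re.sub(r"[^a-z' ]+", " ", fragment)
        # Strip apostrophes used as quotes at word edges, keep them inside words.
        tokens = [t.strip("'") for t in fragment.split()]
        tokens = [t for t in tokens if t]
        # Assumed policy: skip any sentence that contains a profane word.
        if tokens and not PROFANITY.intersection(tokens):
            sentences.append(tokens)
    return sentences


def count_ngrams(sentences, max_n=4):
    """Count uni- to max_n-grams without crossing sentence boundaries."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for tokens in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return counts


sample = 'Dr. Smith said "I\'m here." I\'m here too!'
counts = count_ngrams(clean_sentences(sample))
print(counts[2].most_common(1))
```

Counting within sentences only means that no n-gram spans a quotation mark or a full stop, which matches the sentence-breaking step listed above.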

Algorithm - build model using R

procedures:
- remove n-grams that are rarely observed
- predict the k-th word from the counts of the k-grams that begin with the preceding k-1 words
- use stupid backoff when no matching k-gram exists. For example, given three words (w1, w2, w3), predict w4 from the counts of the four-grams (w1, w2, w3, w4). If no four-gram exists for any w4, predict w4 from the counts of the trigrams (w2, w3, w4), and so on …
priority concerns:
- keep the computing time under one second
- predict next words with an acceptable accuracy
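The pruning and backoff steps above can be illustrated with a short sketch. It is written in Python rather than R, and the names `build_model` and `predict_next`, the `min_count` threshold, and the toy counts are hypothetical. Because candidates are taken only from the longest context that has a match, no backoff weight is needed to rank them in this simplified version.

```python
from collections import Counter, defaultdict


def build_model(ngram_counts, min_count=2):
    """Index n-gram counts by context, dropping rarely observed n-grams.

    ngram_counts maps a tuple of words to its count.
    min_count is an assumed pruning threshold.
    """
    model = defaultdict(Counter)
    for ngram, count in ngram_counts.items():
        if count >= min_count:
            context, word = ngram[:-1], ngram[-1]
            model[context][word] = count
    return model


def predict_next(model, words, top_k=3, max_order=4):
    """Return up to top_k next-word candidates, backing off to shorter contexts."""
    # Use at most max_order - 1 preceding words as the context.
    context = tuple(words[-(max_order - 1):]) if max_order > 1 else ()
    while True:
        candidates = model.get(context)
        if candidates:
            return [w for w, _ in candidates.most_common(top_k)]
        if not context:
            return []  # even the unigram table is empty
        context = context[1:]  # drop the oldest word and retry


# Toy counts for illustration only.
toy_counts = {
    ("i", "love", "you", "so"): 3,
    ("love", "you", "so"): 5,
    ("love", "you", "too"): 4,
    ("you", "too"): 6,
    ("you", "so"): 2,
    ("the",): 10,
    ("you",): 8,
    ("rare", "thing"): 1,  # pruned by min_count
}
model = build_model(toy_counts)
print(predict_next(model, ["we", "love", "you"]))
```

Indexing by context makes each prediction a handful of dictionary lookups, which is one way to stay within a tight response-time budget.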

The Shiny App

key points to improve the user experience:

- simple user interface
- the predicted words only show up after typing a space
- warning message for possible typos
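The input handling described above might look roughly like the following. This is a Python sketch, not the Shiny server code: the function `respond`, its arguments, and the rule that a word missing from the vocabulary counts as a possible typo are assumptions, since the slides do not say how typos are detected.

```python
def respond(text, vocabulary, predict):
    """Return (predictions, warning) for the current contents of the input box.

    vocabulary: set of known words (hypothetical stand-in for the model's unigrams).
    predict: callable that takes a list of words and returns candidate next words.
    """
    # Only predict once the user has finished a word by typing a space.
    if not text.endswith(" "):
        return [], None
    words = text.lower().split()
    if not words:
        return [], None
    warning = None
    # Assumed typo check: the last completed word is not in the vocabulary.
    if words[-1] not in vocabulary:
        warning = f"'{words[-1]}' is not a known word; possible typo."
    return predict(words), warning


# Toy stand-ins for illustration only.
vocab = {"i", "love"}
always_you = lambda words: ["you"]
print(respond("i lov", vocab, always_you))
print(respond("i love ", vocab, always_you))
print(respond("i lvoe ", vocab, always_you))
```

Waiting for the space avoids showing predictions for half-typed words, and the warning lets the user correct a word before relying on the suggestions.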

link to the app: shiny app