Next Word Prediction

Andriy Muzychuk
12/13/2014

Capstone Project

Task: Predict the next word for a given text (a sequence of words, symbols, numbers, or even an empty string)

Data Source: Text corpora that include samples of blogs, news, and tweets (English), downloaded from the Capstone Dataset

Expected Output: Prediction model, Shiny app for testing and visualization

Prediction Model

Motivation: The N-gram model is successfully used to predict the next item in a sequence. This model is very sensitive to its training set, so it was decided to apply the model separately to each data source. To combine the results, we can use weights that represent the probability that a given text comes from blogs, news, or tweets.

Tasks Overview

For each data source, predict the probability of the next words using a 4-gram model and the Stupid Backoff algorithm.
Combine the results and compute a rank for each word as a weighted sum of probabilities. The weights represent the probability of each text source and are predicted with the LDA classification algorithm.
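
As an illustration of the first step, here is a minimal Python sketch of Stupid Backoff scoring over n-grams up to order 4. It is not the project's implementation: the function names, the in-memory `Counter` storage, and the backoff factor of 0.4 (the value usually quoted for Stupid Backoff) are all assumptions made for this example.

```python
from collections import Counter

ALPHA = 0.4  # assumed backoff factor; the slides do not state the value used


def count_ngrams(tokens, max_n=4):
    """Count every n-gram of order 1..max_n in a list of tokens."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def stupid_backoff_scores(context, counts, vocab, total_tokens,
                          max_n=4, alpha=ALPHA):
    """Score each candidate next word for the given context (list of tokens).

    The longest available context is tried first. Each time the n-gram is
    unseen, the context is shortened by one word and the score is
    multiplied by alpha. An empty context falls back to unigram frequency.
    """
    context = tuple(context[-(max_n - 1):]) if context else ()
    scores = {}
    for word in vocab:
        penalty = 1.0
        ctx = context
        while ctx:
            seen = counts.get(ctx + (word,), 0)
            if seen > 0:
                scores[word] = penalty * seen / counts[ctx]
                break
            ctx = ctx[1:]      # back off to a shorter context
            penalty *= alpha
        else:
            # No context left: relative unigram frequency.
            scores[word] = penalty * counts.get((word,), 0) / total_tokens
    return scores
```

The scores are relative frequencies scaled by the backoff factor, so they do not sum to one across the vocabulary; they are meant for ranking candidates rather than as normalized probabilities.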

Data source prediction model

Collected Features: Number of special characters (@ or #), count of abbreviations, count of HTML tags, count of numerical sequences, count of non-character symbols, total number of words, average word length, etc.
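
A rough Python sketch of how several of these features could be computed from raw text is shown here. The regular expressions and the feature names are assumptions chosen for illustration; the abbreviation count is left out because the slides do not say how abbreviations were detected.

```python
import re


def extract_features(text):
    """Compute simple surface features of a text block (illustrative only)."""
    words = re.findall(r"[A-Za-z']+", text)
    return {
        # '@' and '#' are typical of tweets (mentions and hashtags)
        "special_chars": len(re.findall(r"[@#]", text)),
        # anything that looks like <tag> or </tag>
        "html_tags": len(re.findall(r"<[^>]+>", text)),
        # runs of digits such as "42" or "2014"
        "number_sequences": len(re.findall(r"\d+", text)),
        # characters that are neither letters, digits, nor whitespace
        "non_alnum_symbols": len(re.findall(r"[^A-Za-z0-9\s]", text)),
        "word_count": len(words),
        "avg_word_length": (sum(len(w) for w in words) / len(words)
                            if words else 0.0),
    }
```

Each text then becomes one numeric row that a classifier can consume.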

LDA motivation and results: We need a fast, lightweight, and easy-to-deploy model. We need probabilities, and we have 3 classes. LDA appears to be the best choice. It showed 57% accuracy on test data. Other models such as QDA, Decision Trees, and Random Forest showed the same accuracy on the constructed feature set. The model was trained on 1.5 million instances and tested on 0.5 million instances.
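
To make the "we need probabilities" point concrete, here is a small dependency-free Python sketch of a classifier of this kind. It assumes that LDA here means linear discriminant analysis (which the comparison with QDA suggests). It is a teaching sketch with hypothetical function names, not the model trained for the project.

```python
import math


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting.

    Assumes A is non-singular (no feature is constant within every class).
    """
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def fit_lda(X, y):
    """Fit class means, priors, and a pooled covariance; return linear scores."""
    classes = sorted(set(y))
    n, d = len(X), len(X[0])
    means, priors = {}, {}
    for k in classes:
        rows = [x for x, label in zip(X, y) if label == k]
        means[k] = [sum(col) / len(rows) for col in zip(*rows)]
        priors[k] = len(rows) / n
    # Pooled within-class covariance, shared by all classes.
    cov = [[0.0] * d for _ in range(d)]
    for x, label in zip(X, y):
        diff = [xi - mi for xi, mi in zip(x, means[label])]
        for i in range(d):
            for j in range(d):
                cov[i][j] += diff[i] * diff[j] / (n - len(classes))
    model = {}
    for k in classes:
        w = solve(cov, means[k])  # w = inverse(cov) * mean_k
        b = -0.5 * sum(wi * mi for wi, mi in zip(w, means[k]))
        model[k] = (w, b + math.log(priors[k]))
    return model


def predict_proba(model, x):
    """Return posterior class probabilities as a softmax of the linear scores."""
    scores = {k: sum(wi * xi for wi, xi in zip(w, x)) + b
              for k, (w, b) in model.items()}
    top = max(scores.values())
    exps = {k: math.exp(s - top) for k, s in scores.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}
```

Prediction only needs one weight vector and one offset per class, which is why a model like this is cheap to ship inside an app.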

Next Word Rank Calculations

The input for the prediction model is a block of text. This text is processed in the same way as the text used to prepare the dictionary.

The N-gram model returns a list of candidate next words with their probabilities, and the Stupid Backoff algorithm scales those probabilities. Let \( P(w_{i}|s_{k}) \) denote the probability that word \( w_{i} \) comes next for source \( s_{k} \) (\( s_{1} = \text{blogs}, \ldots \)).

Source Probabilities are the results of the LDA classification model. Let \( P(s_{k}|text) \) be the probability that the entered text comes from \( s_{k} \).

Word Rank: For a given text, the rank of word \( w_{i} \) is \[ R(w_{i}) = \sum_{k=1}^{3} P(w_{i}|s_{k}) \cdot P(s_{k}|text) \]
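
The rank formula is a weighted sum over the three sources. A short Python sketch follows; the dictionary layout and the function name are assumptions for illustration. A word missing from a source's candidate list is treated as having probability 0 for that source.

```python
def word_ranks(word_probs, source_probs):
    """Combine per-source word probabilities into one ranked list.

    word_probs:   {source: {word: P(word | source)}}
    source_probs: {source: P(source | text)}
    Returns (word, rank) pairs sorted from highest to lowest rank.
    """
    ranks = {}
    for source, probs in word_probs.items():
        weight = source_probs.get(source, 0.0)
        for word, p in probs.items():
            ranks[word] = ranks.get(word, 0.0) + p * weight
    return sorted(ranks.items(), key=lambda item: item[1], reverse=True)


# Made-up numbers, used only to show the call.
example = word_ranks(
    {"blogs": {"you": 0.5, "me": 0.25},
     "news": {"you": 0.25},
     "tweets": {"me": 0.5}},
    {"blogs": 0.5, "news": 0.25, "tweets": 0.25},
)
```

A word that is moderately likely in the most probable source can outrank a word that is very likely only in an improbable source.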

Visualization Results

The results are combined into an app that has the following features:

Ability to input custom text for prediction
Specify the source dictionary used for prediction
Observe the source probabilities for the entered text
Review a histogram of probabilities for up to the top 50 highest-ranked words, separately for each data source
Review a word cloud of up to the top 50 ranked words, weighted according to rank

*For more details, visit the word prediction app HERE*