Data Science Capstone - Predictive Text Model

Toby H.W. Lam
26 Nov 2014

In this project, we are required to build a predictive text model from text data collected from publicly available sources by a web crawler [1].
The Coursera-SwiftKey.zip archive can be downloaded from the Coursera page. There are three files:
- en_US.blogs.txt,
- en_US.news.txt,
- en_US.twitter.txt

Prediction Algorithm

1. Preprocess the text (e.g. filter out non-English words and symbols)
2. Tokenize the text
3. Build unigrams, bigrams and trigrams from the data
4. Count the occurrences of each unique unigram, bigram and trigram
5. Get the text phrase from the user
6. Extract the last two tokens (i.e. prev1, prev2) from the phrase. If the phrase is not long enough, extract only the last token (i.e. prev2)
7. Calculate the probability of every possible match (details below)
8. Return the top 10 predicted words
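
The counting and token-extraction steps above can be sketched as follows. This is a minimal illustration in Python, not the application's actual source; the function names (`tokenize`, `count_ngrams`, `last_tokens`), the regular expression used for filtering, and the two-sentence toy corpus are all assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep only English words (letters, inner apostrophes)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

def count_ngrams(lines, n):
    """Count every n-gram, stored as a tuple of n tokens, across all lines."""
    counts = Counter()
    for line in lines:
        tokens = tokenize(line)
        # zip over n shifted copies of the token list yields consecutive n-grams
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts

def last_tokens(phrase, k=2):
    """Return up to the last k tokens of the phrase (fewer if the phrase is short)."""
    return tokenize(phrase)[-k:]

# Hypothetical toy corpus standing in for the blogs, news and twitter files
corpus = ["I love data science", "I love text mining"]
unigrams = count_ngrams(corpus, 1)
bigrams = count_ngrams(corpus, 2)
trigrams = count_ngrams(corpus, 3)

print(bigrams[("i", "love")])         # 2
print(last_tokens("I really LOVE!"))  # ['really', 'love']
print(last_tokens("love"))            # ['love']
```

Storing each n-gram as a tuple key keeps the lookup for a given history simple, at the cost of more memory than a prefix tree would need.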

To take the diversity of histories into account, I adopted Modified Kneser-Ney Smoothing for calculating the probability [2].
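
For reference, modified Kneser-Ney smoothing is commonly written in the interpolated form below. This is the general textbook formulation and not necessarily the exact variant used in the application. The notation is an assumption introduced here: $h$ is the history, $h'$ is the history with its first word dropped, $c$ denotes counts, and $D_1$, $D_2$, $D_{3+}$ are the three discounts.

```latex
p_{KN}(w \mid h) = \frac{c(h, w) - D\bigl(c(h, w)\bigr)}{\sum_{w'} c(h, w')} + \gamma(h)\, p_{KN}(w \mid h')

D(c) = \begin{cases} 0 & c = 0 \\ D_1 & c = 1 \\ D_2 & c = 2 \\ D_{3+} & c \ge 3 \end{cases}

\gamma(h) = \frac{D_1 N_1(h\,\bullet) + D_2 N_2(h\,\bullet) + D_{3+} N_{3+}(h\,\bullet)}{\sum_{w'} c(h, w')}

Y = \frac{n_1}{n_1 + 2 n_2}, \quad
D_1 = 1 - 2Y \frac{n_2}{n_1}, \quad
D_2 = 2 - 3Y \frac{n_3}{n_2}, \quad
D_{3+} = 3 - 4Y \frac{n_4}{n_3}
```

Here $N_k(h\,\bullet)$ is the number of distinct words seen exactly $k$ times after $h$ (at least three times for $N_{3+}$), and $n_k$ is the number of n-grams that occur exactly $k$ times. For a trigram, $h$ = (prev1, prev2) and $h'$ = (prev2). In the lower-order distributions the raw counts are replaced by continuation counts, i.e. the number of distinct words that precede the n-gram, which is how the diversity of histories enters the estimate.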

1. Check whether there is any match in the trigrams (i.e. prev1, prev2 -> matchWord). If no match is found, go to step 3.
2. Return the top 10 predicted words, ranked by the calculated probability. Done!
3. [Back-off] Check whether there is any match in the bigrams (i.e. prev2 -> matchWord). If no match is found, go to step 5.
4. Return the top 10 predicted words, ranked by the calculated probability. Done!
5. [Back-off] Return the top 10 most frequently used unigrams. Done!
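
The back-off control flow above can be sketched as follows. To keep the example short it ranks candidates by raw count rather than by the Kneser-Ney probability, which is a deliberate simplification; the function name `predict` and the toy count tables are assumptions for illustration only.

```python
from collections import Counter

def predict(prev_tokens, trigrams, bigrams, unigrams, top_n=10):
    """Return up to top_n next-word candidates, backing off trigram -> bigram -> unigram."""
    # Steps 1-2: try the trigram table with the last two tokens as history
    if len(prev_tokens) >= 2:
        prev1, prev2 = prev_tokens[-2:]
        matches = Counter({w3: c for (w1, w2, w3), c in trigrams.items()
                           if (w1, w2) == (prev1, prev2)})
        if matches:
            return [w for w, _ in matches.most_common(top_n)]
    # Steps 3-4: back off to the bigram table with only the last token
    if prev_tokens:
        prev2 = prev_tokens[-1]
        matches = Counter({w2: c for (w1, w2), c in bigrams.items() if w1 == prev2})
        if matches:
            return [w for w, _ in matches.most_common(top_n)]
    # Step 5: fall back to the most frequent unigrams
    return [w for (w,), _ in unigrams.most_common(top_n)]

# Hypothetical toy counts; a real model would load these from the corpus
trigrams = Counter({("i", "love", "data"): 2, ("i", "love", "text"): 1})
bigrams = Counter({("love", "data"): 2, ("love", "text"): 1, ("data", "science"): 2})
unigrams = Counter({("i",): 3, ("love",): 3, ("data",): 2, ("science",): 2, ("text",): 1})

print(predict(["i", "love"], trigrams, bigrams, unigrams))    # ['data', 'text']
print(predict(["big", "data"], trigrams, bigrams, unigrams))  # ['science']
```

Scanning the whole table on every query is linear in the number of n-grams; indexing the counts by history beforehand would make each lookup much faster.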

The application can be accessed from https://tobylam.shinyapps.io/DataSciCapstone/
[Screenshot of the application]

Basic usage of the application:
1. Input phrase in the text field
2. Press the Submit button
3. Wait a moment for the result (be patient)
4. The top 10 predicted next words are listed on the right-hand side

At the time of creating these slides, I used only 1% of the data to develop the prediction model.
- Smaller data size -> quick response
I will check whether I can update the application to use a larger data set.
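
One simple way to draw such a subset is to keep each line with a fixed probability. The sketch below is an assumption about how a 1% sample could be taken, not a description of the method actually used; the function name `sample_lines`, the seed value and the generated stand-in lines are hypothetical.

```python
import random

def sample_lines(lines, fraction=0.01, seed=42):
    """Keep each line independently with probability `fraction` (reproducible via seed)."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]

# Stand-in data; in practice the lines would be read from the three text files
lines = [f"line {i}" for i in range(10000)]
sample = sample_lines(lines)
print(len(sample))  # roughly 100, i.e. about 1% of the input
```

Fixing the seed makes the sample reproducible, so the model can be rebuilt on the same subset or on a larger fraction for comparison.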
The milestone report and the source code can be downloaded from: https://github.com/2blam/DataSciCapstone

[2] P. Koehn, Statistical Machine Translation, pp. 201-203, Cambridge University Press, 2010