Text Prediction Shiny Application

Hsin-Hua Lai
June 10, 2016

Slide presentation for Capstone project
Johns Hopkins University
Coursera Data Science Specialization

Overview

  • This presentation illustrates the word prediction Shiny app I developed for the Capstone project

  • The Shiny app predicts the most probable word following a partial sentence entered by the user

  • The following slides cover

    • Data source and preprocessing
    • Algorithm for word prediction
    • Future Plan

The Shiny App is on the shinyapp.io website: https://hsinhualai.shinyapps.io/Capstone_Project-Text_Prediction/

All the code and relevant materials are on my GitHub: https://github.com/hsinhualai/datasciencecoursera/tree/master/Capstone

Data

  • The complete text data for building the text model was downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

  • The data archive contains four folders, of which the one named en_US is used

  • The en_US folder contains three text files drawn from blogs, news, and Twitter. A more complete exploratory analysis is at https://rpubs.com/hsinhua/177290.

  • Preprocessing steps (see the sketch after this list)

    • Sample about 22%, 20%, and 10% of the blogs, news, and Twitter text data, respectively
    • Convert text to lower case and remove profanity and all non-alphabetic characters, including numbers and punctuation
    • Extract unigrams, bigrams, trigrams, quadgrams, and pentagrams along with their frequencies
    • For efficiency, keep only 90% of the unigrams and the corresponding bigrams, trigrams, quadgrams, and pentagrams
    • Eliminate all trigrams, quadgrams, and pentagrams with a frequency of one, which significantly reduces file size
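
The sketch below illustrates these steps in base R. The file names, sampling rates, and the simple whitespace n-gram counter are assumptions for illustration, not the app's actual preprocessing code; the profanity filter is omitted here for brevity.

    # Minimal sketch of sampling, cleaning, and n-gram counting
    # (illustrative file names and rates; profanity filtering omitted)
    set.seed(1234)

    sample_file <- function(path, rate) {
      lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
      lines[rbinom(length(lines), 1, rate) == 1]
    }

    blogs   <- sample_file("en_US/en_US.blogs.txt",   0.22)
    news    <- sample_file("en_US/en_US.news.txt",    0.20)
    twitter <- sample_file("en_US/en_US.twitter.txt", 0.10)

    # Lower-case and keep only alphabetic characters and spaces
    clean <- function(x) {
      x <- tolower(x)
      x <- gsub("[^a-z ]", " ", x)
      gsub("\\s+", " ", trimws(x))
    }
    corpus <- clean(c(blogs, news, twitter))

    # Frequency table of n-grams built from whitespace tokens
    ngram_freq <- function(text, n) {
      tokens <- strsplit(text, " ", fixed = TRUE)
      grams <- unlist(lapply(tokens, function(w) {
        if (length(w) < n) return(character(0))
        vapply(seq_len(length(w) - n + 1),
               function(i) paste(w[i:(i + n - 1)], collapse = " "),
               character(1))
      }))
      sort(table(grams), decreasing = TRUE)
    }

    pentagrams <- ngram_freq(corpus, 5)
    pentagrams <- pentagrams[pentagrams > 1]   # drop frequency-one pentagrams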

Prediction Algorithm

  • The approach is a back-off model, as illustrated in these Natural Language Processing lecture notes: http://web.mit.edu/6.863/www/fall2012/lectures/lecture2&3-notes12.pdf

  • The back-off smoother takes a weighted average over the matching pentagrams, quadgrams, trigrams, bigrams, and unigrams

  • For a partial sentence longer than four words, only the last four words are used to predict the next word

    • The last four words combined with a candidate unigram give one pentagram, two quadgrams, three trigrams, four bigrams, and five unigrams
    • The algorithm searches the precomputed pentagram, quadgram, trigram, bigram, and unigram tables
    • If there is a match among higher-order n-grams, their contribution receives a significantly larger weight
    • If there is no match among the n-grams, the algorithm backs off to the (n-1)-gram tables and takes a weighted average from the (n-1)-grams down to the unigrams, as sketched below
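
The function below is a rough sketch of this back-off lookup, not the app's actual implementation. It assumes each n-gram table is a data frame with prefix, word, and freq columns stored in a list keyed by n, that unigrams is a named frequency vector, and that the 4^(n-2) weighting is just one illustrative way to favour higher-order matches.

    # Illustrative back-off lookup; ngram_tables is assumed to be a list of
    # data frames (keyed "2" to "5") with columns prefix, word, freq, and
    # unigrams a named vector of word frequencies.
    predict_next <- function(sentence, ngram_tables, unigrams, top_n = 5) {
      words <- strsplit(tolower(sentence), "\\s+")[[1]]
      words <- tail(words[words != ""], 4)        # keep at most the last four words
      scores <- c()                               # candidate word -> accumulated score

      if (length(words) > 0) {
        for (n in seq(length(words) + 1, 2)) {    # pentagram down to bigram
          prefix <- paste(tail(words, n - 1), collapse = " ")
          tab  <- ngram_tables[[as.character(n)]]
          hits <- tab[tab$prefix == prefix, ]
          if (nrow(hits) > 0) {
            add <- 4 ^ (n - 2) * hits$freq / sum(hits$freq)  # higher n => larger weight
            for (i in seq_len(nrow(hits))) {
              w   <- hits$word[i]
              old <- scores[w]
              scores[w] <- if (is.null(old) || is.na(old)) add[i] else old + add[i]
            }
          }
        }
      }

      if (length(scores) == 0)                    # no match at all: back off to unigrams
        scores <- unigrams
      names(head(sort(scores, decreasing = TRUE), top_n))
    }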

Shiny App and Future Plan

  • The Shiny App is on the shinyapp.io website: https://hsinhualai.shinyapps.io/Capstone_Project-Text_Prediction/

    • In the left panel, the user enters a partial sentence
    • In the main panel, the app shows the five most probable next words (a minimal UI sketch follows)
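
A minimal Shiny skeleton with this layout is sketched below; it assumes the n-gram tables and a predict_next() helper like the one on the previous slide are already loaded, and it is not the deployed app's actual ui.R/server.R.

    # Minimal sketch of the app layout, assuming ngram_tables, unigrams, and
    # the illustrative predict_next() above are available in the session
    library(shiny)

    ui <- fluidPage(
      sidebarLayout(
        sidebarPanel(
          textInput("phrase", "Enter a partial sentence:", value = "")
        ),
        mainPanel(
          h4("Five most probable next words"),
          tableOutput("predictions")
        )
      )
    )

    server <- function(input, output) {
      output$predictions <- renderTable({
        req(input$phrase)   # wait until the user has typed something
        data.frame(word = predict_next(input$phrase, ngram_tables, unigrams))
      })
    }

    shinyApp(ui = ui, server = server)
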
  • Future Plan

    • Try to build in higher-order n-grams with \( n > 5 \) for better predictions
    • Try to build in an algorithm that learns from users' input to extend the dictionary