Hsin-Hua Lai
June 10, 2016
Slide presentation for Capstone project
Johns Hopkins University
Coursera Data Science Specialization
This presentation describes the word prediction Shiny app I developed for the Capstone project
The Shiny app predicts the most probable word following a partial sentence entered by the user
The following slides illustrate
The Shiny app is hosted on shinyapps.io: https://hsinhualai.shinyapps.io/Capstone_Project-Text_Prediction/
All the code and relevant materials are on my GitHub: https://github.com/hsinhualai/datasciencecoursera/tree/master/Capstone
The complete text data for building the text model is downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
The archive contains four folders, of which only the one named en_US is used
The en_US folder contains three text files, drawn from blogs, news, and Twitter. A more complete exploratory analysis is at https://rpubs.com/hsinhua/177290.
Preprocessing steps
The approach is a back-off model, as described in the Natural Language Processing lecture notes at http://web.mit.edu/6.863/www/fall2012/lectures/lecture2&3-notes12.pdf
The back-off smoother takes a weighted average of the estimates from pentagrams, quadgrams, trigrams, bigrams, and unigrams.
For a partial sentence longer than four words, only the last four words are used to predict the next word
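The weighted-average scoring described above can be sketched in a few lines. This is an illustrative Python sketch, not the app's actual code: the function names `build_counts` and `predict`, the toy corpus, and the weights in `WEIGHTS` are all assumptions chosen for the example, and the real app may weight or prune the n-gram tables differently.

```python
from collections import Counter, defaultdict

MAX_ORDER = 5  # pentagrams down to unigrams

# Hypothetical interpolation weights per n-gram order (they sum to 1).
WEIGHTS = {5: 0.40, 4: 0.30, 3: 0.15, 2: 0.10, 1: 0.05}


def build_counts(tokens, max_order=MAX_ORDER):
    """Count every n-gram of order 1..max_order in a token list."""
    counts = {n: Counter() for n in range(1, max_order + 1)}
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[n][tuple(tokens[i:i + n])] += 1
    return counts


def predict(counts, sentence, weights=WEIGHTS, top_k=3):
    """Rank candidate next words by a weighted average of n-gram estimates."""
    words = sentence.lower().split()
    # Only the last four words can serve as pentagram history.
    context = words[-(MAX_ORDER - 1):]
    total_tokens = sum(counts[1].values())
    scores = defaultdict(float)

    for n in range(1, MAX_ORDER + 1):
        if n - 1 > len(context):
            break  # not enough history for this order or any higher one
        history = tuple(context[len(context) - (n - 1):]) if n > 1 else ()
        denom = total_tokens if n == 1 else counts[n - 1][history]
        if denom == 0:
            continue  # history never seen: this order contributes nothing
        for gram, c in counts[n].items():
            if gram[:-1] == history:
                scores[gram[-1]] += weights[n] * c / denom

    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_k]]


# Toy corpus standing in for the blogs, news, and Twitter text.
corpus = ("the cat sat on the mat the cat sat on the sofa "
          "the cat ate the fish").split()
counts = build_counts(corpus)
suggestions = predict(counts, "yesterday the cat sat on the")
```

Here the input has six words, so only "cat sat on the" is used as history. Orders whose history was never observed are skipped, which lets the shorter n-grams carry the prediction for unfamiliar input.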
Future Plan