Yanhua Hou
03/30/17
Presentation for Coursera Data Science Capstone Project
These slides present the predictive text model built for the Data Science Capstone project.
The data come from a corpus called HC Corpora, which consists of English documents from three sources: Twitter, blogs and news articles. A random 5% sample is drawn from each file, and the samples are combined into a single data set. Of the combined data, 80% is used for training, 10% for testing and 10% for validation.
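The sampling and splitting step could be sketched roughly as follows. This is only an illustration in Python, not the code used for the project: the function names, the fixed seed, and the choice to pass each source as an in-memory list of lines are all assumptions made for the example.

```python
import random


def sample_lines(lines, fraction, rng):
    """Return a random subset holding roughly `fraction` of the lines."""
    k = round(len(lines) * fraction)
    return rng.sample(lines, k)


def split_corpus(lines, rng, train=0.8, test=0.1):
    """Shuffle the lines and split them into train / test / validation parts."""
    shuffled = list(lines)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_test = int(len(shuffled) * test)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_test],
        shuffled[n_train + n_test:],  # whatever remains is the validation set
    )


def build_datasets(sources, fraction=0.05, seed=42):
    """Sample each source, combine the samples, then split 80/10/10.

    `sources` is a hypothetical dict mapping a source name
    (e.g. "twitter", "blogs", "news") to its list of text lines.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    combined = []
    for lines in sources.values():
        combined.extend(sample_lines(lines, fraction, rng))
    return split_corpus(combined, rng)
```

Sampling each file separately before combining keeps the three sources represented in proportion to their size, and the fixed seed makes the split repeatable across runs.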
Data cleaning involving
Building an n-gram dictionary
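One possible shape for such a dictionary is a table that maps each (n-1)-word prefix to counts of the words that follow it. The sketch below is an assumption about the structure, not the project's actual implementation: the whitespace tokenizer and the names `tokenize`, `build_ngram_dictionary` and `predict_next` are hypothetical.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Simplistic tokenizer (an assumption): lowercase, split on whitespace."""
    return text.lower().split()


def build_ngram_dictionary(lines, n):
    """Map each (n-1)-word prefix to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for line in lines:
        tokens = tokenize(line)
        for i in range(len(tokens) - n + 1):
            prefix = tuple(tokens[i:i + n - 1])
            next_word = tokens[i + n - 1]
            table[prefix][next_word] += 1
    return table


def predict_next(table, prefix_words, k=3):
    """Return up to k most frequent continuations of the given prefix."""
    prefix = tuple(word.lower() for word in prefix_words)
    followers = table.get(prefix, Counter())
    return [word for word, _ in followers.most_common(k)]
```

For example, building the table with `n=3` from the lines "I love data science", "I love data analysis" and "I love coffee" stores two occurrences of "data" and one of "coffee" after the prefix ("i", "love"), so "data" ranks first for that prefix.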