Cho Seng Mong
16 April 2016
This presentation is a high-level description of the language modeling Capstone Project of the Coursera Data Science Specialization.
The purpose of this project is to build a natural language model that suggests an appropriate next word for a user-specified sequence of words. Three types of text data (Twitter, news, and blogs) were used to train the model. Data cleaning and sub-setting techniques were applied to produce the final training data. Word combinations of various lengths (N-grams) were then created from the cleaned data sets, and a predictive algorithm (Katz Back-off) was applied to predict the next word. The final predictive model was optimized to run as a Shiny application.
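To make the N-gram step concrete, here is a minimal Python sketch of counting word combinations. It is an illustration only, not the project's own code: the whitespace tokenizer, the function names `tokenize` and `ngram_counts`, and the three-sentence toy corpus are all assumptions.

```python
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a simplifying assumption)."""
    return text.lower().split()

def ngram_counts(sentences, n):
    """Count every run of n consecutive tokens across the sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Hypothetical toy corpus standing in for the cleaned training data.
corpus = [
    "thanks for the follow",
    "thanks for the update",
    "see you at the game",
]
bigrams = ngram_counts(corpus, 2)
trigrams = ngram_counts(corpus, 3)
```

Keeping one count table per N-gram order makes it easy to look up the longest matching context first and fall back to shorter ones.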
Before the word prediction algorithm was built, the very large Twitter, news, and blogs files were handled and cleaned in several steps.
The next-word prediction model is based on the Katz Back-off algorithm, which is applied in a series of steps to predict the next word of the user-specified sentence.
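The back-off idea can be sketched in Python as follows. This is a simplified bigram-to-unigram variant: it uses a fixed discount (`DISCOUNT = 0.5`, an assumed value) rather than the Good-Turing discounting of the original Katz formulation, and it is not the project's actual implementation. The names `train`, `backoff_probs`, and `predict_next`, and the toy corpus, are hypothetical.

```python
from collections import Counter, defaultdict

DISCOUNT = 0.5  # assumed fixed discount; classic Katz uses Good-Turing estimates

def train(sentences):
    """Build unigram counts and, for each word, counts of the words that follow it."""
    unigrams = Counter()
    followers = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        unigrams.update(tokens)
        for prev, word in zip(tokens, tokens[1:]):
            followers[prev][word] += 1
    return unigrams, followers

def backoff_probs(prev, unigrams, followers):
    """Return P(word | prev), backing off to unigrams for unseen bigrams."""
    total_unigrams = sum(unigrams.values())
    seen = followers.get(prev, Counter())
    if not seen:
        # Unknown context: use the plain unigram distribution.
        return {w: c / total_unigrams for w, c in unigrams.items()}
    context_total = sum(seen.values())
    # Discounted probabilities for words observed after prev.
    probs = {w: (c - DISCOUNT) / context_total for w, c in seen.items()}
    # Probability mass freed by discounting, shared among unseen words
    # in proportion to their unigram counts.
    leftover = 1.0 - sum(probs.values())
    unseen = {w: c for w, c in unigrams.items() if w not in seen}
    unseen_total = sum(unseen.values())
    for w, c in unseen.items():
        probs[w] = leftover * c / unseen_total
    return probs

def predict_next(phrase, unigrams, followers, k=3):
    """Return the k most probable next words (ties broken alphabetically)."""
    tokens = phrase.lower().split()
    prev = tokens[-1] if tokens else None
    probs = backoff_probs(prev, unigrams, followers)
    return sorted(probs, key=lambda w: (-probs[w], w))[:k]

corpus = [
    "thanks for the follow",
    "thanks for the update",
    "see you at the game",
]
unigrams, followers = train(corpus)
print(predict_next("thanks for the", unigrams, followers))  # ['follow', 'game', 'update']
```

A full model would start from the longest available context (for example trigrams or 4-grams) and apply the same discount-and-redistribute step at each lower order.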
A Shiny application was developed on top of the next-word prediction model described above.