CY Ting
29/4/2016
This project uses three data sources: Twitter, blogs, and news. The combined dataset is more than 500 MB in size. However, only 10% of the combined data was used for next-word prediction.
This project uses n-gram models with n = 2, 3, and 4 only.
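To make the term concrete, an n-gram is a run of n consecutive words, and an n-gram model counts how often each run occurs in the text. The following is a minimal sketch in Python, not the project's own code (the project saves its data as .RData, so it was presumably written in R). The function name `ngrams` and the sample sentence are made up for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical sample text, already cleaned and split into words.
tokens = "thanks for the follow and thanks for the retweet".split()

# One frequency table per model order used in this project (n = 2, 3, 4).
counts = {n: Counter(ngrams(tokens, n)) for n in (2, 3, 4)}

print(counts[2][("thanks", "for")])         # frequency of the 2-gram "thanks for"
print(counts[3][("thanks", "for", "the")])  # frequency of the 3-gram "thanks for the"
```

Higher-order n-grams carry more context but occur less often, which is one reason to keep several orders side by side.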
The final product can be found at https://tingshinyapps.shinyapps.io/WordPrediction/
The original dataset consists of
The data requires pre-processing before any next-word prediction can take place. The pre-processing includes data cleaning, outlier removal, data transformation, and data sampling. Only 10% of the cleaned dataset was used in this project.
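The text does not spell out the exact rules behind each pre-processing step, so the following Python sketch only illustrates the general idea under stated assumptions: cleaning is taken to mean lowercasing and keeping letters and apostrophes, empty lines are dropped, and sampling is a reproducible random draw of 10% of the lines. The names `clean_line` and `sample_lines`, the seed, and the sample inputs are hypothetical.

```python
import random
import re

def clean_line(line):
    """Lowercase, replace anything but letters, apostrophes and spaces, collapse whitespace."""
    line = line.lower()
    line = re.sub(r"[^a-z' ]+", " ", line)  # assumed cleaning rule, not from the report
    return " ".join(line.split())

def sample_lines(lines, fraction=0.10, seed=2016):
    """Draw a reproducible random sample containing the given fraction of lines."""
    rng = random.Random(seed)
    k = max(1, round(len(lines) * fraction))
    return rng.sample(lines, k)

# Hypothetical raw input lines.
raw = ["Hello,   WORLD!! :)", "It's 5pm -- see you @ the GAME", "   "]

# Clean every line and drop the ones that end up empty.
cleaned = [c for c in (clean_line(line) for line in raw) if c]
print(cleaned)

# Sampling 10% of 50 lines keeps 5 of them.
corpus = ["line number %d" % i for i in range(50)]
print(len(sample_lines(corpus)))
```

Fixing the seed keeps the 10% sample identical between runs, which makes the resulting n-gram tables reproducible.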
The dataset was prepared for 2-gram, 3-gram, and 4-gram next-word prediction. The data is saved in “.RData” format.
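One way to picture data prepared for 2-gram, 3-gram, and 4-gram prediction is a set of lookup tables, one per order, that map the preceding words to the words seen after them. The Python sketch below is an illustration only. In particular, it assumes a simple back-off rule (try the 4-gram table first, then the 3-gram, then the 2-gram), which the text does not confirm. The names `build_tables` and `predict_next` and the sample sentences are hypothetical.

```python
from collections import Counter, defaultdict

def build_tables(sentences, orders=(2, 3, 4)):
    """For each order n, map every (n-1)-word context to a Counter of following words."""
    tables = {n: defaultdict(Counter) for n in orders}
    for sentence in sentences:
        tokens = sentence.split()
        for n in orders:
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                tables[n][context][tokens[i + n - 1]] += 1
    return tables

def predict_next(tables, text):
    """Assumed back-off: use the longest matching context, then shorter ones."""
    tokens = text.lower().split()
    for n in sorted(tables, reverse=True):
        context = tuple(tokens[-(n - 1):])
        if len(context) == n - 1 and context in tables[n]:
            return tables[n][context].most_common(1)[0][0]
    return None  # no table contains a matching context

# Hypothetical cleaned sentences standing in for the sampled corpus.
sentences = [
    "thanks for the follow",
    "thanks for the follow back",
    "thanks for the retweet",
    "see you at the game",
]
tables = build_tables(sentences)

print(predict_next(tables, "Thanks for the"))  # matched in the 4-gram table
print(predict_next(tables, "look at the"))     # falls back to the 3-gram table
```

Storing only contexts and counts keeps the tables compact enough to load quickly in an interactive application.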
A step-by-step illustration of the prediction process is given below:
