CY Ting
29/4/2016
This project used three data sources: Twitter, blogs, and news. The combined dataset is over 500 MB. However, only 10% of the combined data was used for next-word prediction, in order to reduce the server loading time.
This project implemented n-gram models with n = 2, 3, and 4 only.
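To make the n-gram idea concrete, here is a minimal sketch in Python of how 2-, 3-, and 4-grams can be extracted and counted from a tokenized sentence. It is an illustration only, not the project's own code; the function name `ngrams` and the sample sentence are made up for this example.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) found in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical sample text, not taken from the project data.
tokens = "thank you for the follow and thank you for the support".split()

# One frequency table per model order used in the project (n = 2, 3, 4).
counts = {n: Counter(ngrams(tokens, n)) for n in (2, 3, 4)}
```

Each table maps an n-gram to the number of times it occurs, which is the raw material a next-word predictor works from.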
The final product can be found at https://tingshinyapps.shinyapps.io/WordPrediction/
Detailed information about the original dataset:
The data requires pre-processing before next-word prediction can take place. Pre-processing includes data cleaning, outlier removal, data transformation, and data sampling. Only 10% of the cleaned dataset was used in this project.
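The cleaning and sampling steps might look roughly like the Python sketch below. The specific cleaning rules (lowercasing, dropping anything that is not a letter or apostrophe) are assumptions chosen for illustration, since the exact rules used in the project are not listed here. The helper names `clean_line` and `sample_lines` and the seed value are likewise hypothetical.

```python
import random
import re

def clean_line(line):
    """Lowercase, keep only letters, apostrophes and spaces, collapse whitespace."""
    line = line.lower()
    line = re.sub(r"[^a-z' ]+", " ", line)  # assumed rule: drop digits and punctuation
    return re.sub(r"\s+", " ", line).strip()

def sample_lines(lines, fraction=0.10, seed=2016):
    """Return a reproducible random sample containing `fraction` of the lines."""
    rng = random.Random(seed)  # fixed seed so the same 10% is drawn every run
    k = max(1, round(len(lines) * fraction))
    return rng.sample(lines, k)

# Hypothetical input lines standing in for the Twitter, blog, and news text.
raw = ["Line number %d, with SOME noise!!" % i for i in range(100)]
cleaned = [clean_line(line) for line in raw]
subset = sample_lines(cleaned)  # 10 of the 100 cleaned lines
```

Sampling after cleaning keeps the 10% subset free of lines that cleaning would have emptied or altered.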
The dataset was prepared for 2-gram, 3-gram, and 4-gram next-word prediction. The data is saved in “.RData” format.
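As a rough illustration of what prepared 2-, 3-, and 4-gram data can look like, the Python sketch below builds one lookup table per model order and queries them. The back-off strategy (try the 4-gram table first, then the 3-gram, then the 2-gram) is an assumption made for this example and is not confirmed as the project's method. The names `build_tables` and `predict_next` and the sample sentences are hypothetical.

```python
from collections import Counter, defaultdict

def build_tables(sentences, orders=(2, 3, 4)):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    tables = {n: defaultdict(Counter) for n in orders}
    for sentence in sentences:
        tokens = sentence.split()
        for n in orders:
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                tables[n][context][tokens[i + n - 1]] += 1
    return tables

def predict_next(tables, text):
    """Try the longest context first, then back off to shorter ones (assumed strategy)."""
    tokens = text.lower().split()
    for n in sorted(tables, reverse=True):
        context = tuple(tokens[-(n - 1):])
        # Skip this order if the input is too short to fill the context.
        if len(context) == n - 1 and context in tables[n]:
            return tables[n][context].most_common(1)[0][0]
    return None  # no table has seen this context

# Hypothetical cleaned sentences, not taken from the project data.
sentences = ["i love you so much", "i love you too", "i love new york"]
tables = build_tables(sentences)
suggestion = predict_next(tables, "I love")
```

Storing the tables keyed by context keeps each prediction to a handful of dictionary lookups, which suits an interactive app where loading time matters.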
A step-by-step illustration of the prediction process is given below: