Jiameng Yu
10 January 2021
The project dataset consists of 3 files (blogs, Twitter and news), each containing English-language text obtained from the corresponding source.
Of the roughly 71 million words/combinations, the top 150 words cover 50% of usage, while the top 100,000 cover 90%.
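As a rough illustration of how such coverage figures can be checked, the sketch below assumes a hypothetical data.table `word_freq` with one row per unique word and a `count` column; it is not part of the project code.

    library(data.table)

    # word_freq: hypothetical table of unique words and their counts (assumption)
    word_freq <- word_freq[order(-count)]
    coverage  <- cumsum(word_freq$count) / sum(word_freq$count)

    # smallest number of top-ranked words covering 50% and 90% of usage
    which(coverage >= 0.5)[1]
    which(coverage >= 0.9)[1]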
The sampled raw data is read into a data.table, which is then transformed into 4 sets of tokens (2- to 5-grams). All text is converted to lower case. Each set of tokens is ranked in decreasing order of frequency of use.
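A minimal sketch of this tokenisation and ranking step is shown below; it assumes the quanteda package and a character vector `sample_text` holding the sampled lines, whereas the actual project code may use a different tokenizer.

    library(quanteda)
    library(data.table)

    # sample_text: character vector of sampled lines (assumption)
    toks <- tokens(sample_text, remove_punct = TRUE)
    toks <- tokens_tolower(toks)

    # one ranked frequency table per n-gram order (2- to 5-grams)
    ngram_tables <- lapply(2:5, function(n) {
      ng    <- tokens_ngrams(toks, n = n, concatenator = " ")
      freqs <- colSums(dfm(ng))
      data.table(ngram = names(freqs), count = as.integer(freqs))[order(-count)]
    })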
Each set is saved as a separate Google Sheet in order to speed up processing time in the model.
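One way to write the four tables out, assuming the googlesheets4 package (the project may instead use a different Google Sheets client), is sketched here; the sheet names are placeholders.

    library(googlesheets4)

    # create one spreadsheet per ranked n-gram table; names are assumptions
    for (n in 2:5) {
      gs4_create(paste0("ngram_", n),
                 sheets = list(tokens = ngram_tables[[n - 1]]))
    }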
The sample dataset (2% of the total data) is only about 8 MB, but the resulting token tables already total 52 MB.
The code can be accessed at https://github.com/Dark-angel2019/Data_science_capstone
The app interface invites users to input a partial sentence or combination of at least 3 words.
It then carries out simple cleaning, such as converting all input text to lower case and removing punctuation. The last 3 words are then extracted to serve as the basis for prediction.
First, a search is carried out with the preceding 3 words, returning the 4th as the prediction. If no word is returned, a search is done based on the preceding 2 words, returning the 3rd. If still no word is returned, the search is based on the preceding 1 word, returning the 2nd. In practice, a prediction is usually found no later than the 3-gram search. If no word can be found at any level, "No Word Found" is returned.
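A sketch of this back-off lookup is given below. It assumes the three lookup tables are data.tables with `prefix` and `next_word` columns, already sorted by descending frequency; that layout is an assumption, since the stored tables may instead keep the whole n-gram in a single column.

    library(data.table)

    # Back-off search: 3-word context (4-gram), then 2-word, then 1-word.
    predict_next <- function(input, tbl4, tbl3, tbl2) {
      # simple cleaning: lower case, strip punctuation, split on whitespace
      words <- tolower(input)
      words <- gsub("[[:punct:]]", "", words)
      words <- strsplit(trimws(words), "\\s+")[[1]]
      words <- tail(words, 3)

      contexts <- list(paste(words, collapse = " "),           # 3 words -> 4-gram table
                       paste(tail(words, 2), collapse = " "),  # 2 words -> 3-gram table
                       tail(words, 1))                         # 1 word  -> 2-gram table
      tables <- list(tbl4, tbl3, tbl2)

      for (i in seq_along(tables)) {
        hit <- tables[[i]][prefix == contexts[[i]], next_word]
        if (length(hit) > 0) return(hit[1])
      }
      "No Word Found"
    }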
Code for the app can be accessed at: https://github.com/Dark-angel2019/Data_science_capstone