Vitalii
16/08/2020
This work was performed as part of the final task of the course “Data Science Capstone Final Project Submission”.
The goal of this exercise is to create a product that showcases the prediction algorithm and provides an interface that others can access.
The application can be found here.
Before the next-word prediction model is built, the data sets are processed and cleaned as follows:
the input data set is assembled from three sources (blogs, Twitter and news);
the input data set is prepared for further analysis by removing unnecessary characters;
the text is tokenized and the corresponding n-grams are created;
frequency tables are built by counting the words and phrases extracted from the n-grams and sorting them by frequency in descending order;
the n-grams are stored as .RData files.
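The preprocessing steps above can be sketched as follows. This is a minimal illustration in Python, not the code behind the application: the function names (`clean_text`, `tokenize`, `ngram_table`), the cleaning rule (keep only lowercase letters, apostrophes and spaces) and the two sample sentences are all assumptions made for the example.

```python
import re
from collections import Counter


def clean_text(text):
    """Lowercase the text and drop everything except letters, apostrophes and spaces.

    The exact cleaning rule is an assumption for this sketch.
    """
    text = text.lower()
    text = re.sub(r"[^a-z' ]+", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Split cleaned text into word tokens."""
    return clean_text(text).split()


def ngram_table(lines, n):
    """Count n-grams over all lines and return (ngram, count) pairs,
    sorted by descending frequency (ties broken alphabetically)."""
    counts = Counter()
    for line in lines:
        tokens = tokenize(line)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical sample input standing in for the blogs, Twitter and news text.
sample = [
    "I love data science!",
    "I love data, and I love R.",
]
bigrams = ngram_table(sample, 2)
print(bigrams[:2])
```

In this toy input the bigram "i love" occurs three times and "love data" twice, so they head the sorted table. Tables for 3-grams and 4-grams are produced the same way by changing `n`.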
The application, built on the next-word prediction model, works as follows:
Load the compressed data sets containing the n-grams ordered by descending frequency;
The words entered by the user are cleaned in the same way before the next word is predicted;
To predict the next word, the 4-gram table is tried first;
If no matching 4-gram is found, fall back to the 3-gram table;
If no matching 3-gram is found, fall back to the 2-gram table;
If no matching 2-gram is found, return the most frequent word.
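The back-off lookup described above can be sketched as follows. This is an illustrative Python version under stated assumptions, not the application's own code: the table layout (a dictionary mapping each context to its candidate next words, already sorted by descending frequency), the names `predict_next` and `clean_input`, and the toy entries are all hypothetical.

```python
import re


def clean_input(phrase):
    """Apply the same kind of cleaning to user input as to the training text
    (assumed rule: lowercase, keep letters, apostrophes and spaces)."""
    phrase = re.sub(r"[^a-z' ]+", " ", phrase.lower())
    return phrase.split()


def predict_next(phrase, tables, default_word):
    """Try the 4-gram table, then 3-gram, then 2-gram; otherwise return default_word.

    tables[n] maps a context of n-1 words (a tuple) to a list of next words
    ordered by descending frequency.
    """
    tokens = clean_input(phrase)
    for n in (4, 3, 2):
        context = tuple(tokens[-(n - 1):])
        if len(context) < n - 1:
            continue  # not enough words typed for this n-gram order
        candidates = tables.get(n, {}).get(context)
        if candidates:
            return candidates[0]
    return default_word


# Hypothetical toy tables; a real table would be built from the corpus.
tables = {
    4: {("thanks", "for", "the"): ["follow", "help"]},
    3: {("for", "the"): ["first", "follow"]},
    2: {("the",): ["same", "first"]},
}

print(predict_next("Thanks for the", tables, "the"))   # matched in the 4-gram table
print(predict_next("waiting for the", tables, "the"))  # falls back to the 3-gram table
print(predict_next("in the", tables, "the"))           # falls back to the 2-gram table
print(predict_next("xyzzy", tables, "the"))            # nothing matches, default word
```

Storing each context's candidates already sorted means a prediction is a single dictionary lookup per n-gram order, which keeps the response fast in an interactive application.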
To use the application, enter a word or phrase and it returns a prediction of the next word.