Final Project Submission

Vitalii
16/08/2020

Annotation

This work was completed as the final assignment of the course “Data Science Capstone Final Project Submission”.

The goal of this exercise is to create a data product that showcases the prediction algorithm built during the course and to provide an interface that others can access.

The application can be found here.

Data processing

Before the word prediction model is built, the data sets are processed and cleaned as follows:

  • the input data set is built from three sources (blogs, Twitter and news);

  • the input data is prepared for further analysis by removing unnecessary characters;

  • the text is tokenized and the corresponding n-grams are created;

  • frequency tables are built by counting the words and phrases extracted from the n-grams and sorting them by frequency in descending order;

  • the n-gram tables are stored as .RData files.
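The cleaning, tokenization and counting steps above can be sketched roughly as follows. This is an illustrative Python sketch, not the project's own code (the .RData files suggest the original was written in R). The function names, the cleaning rule (lowercase, keep only letters and apostrophes) and the sample sentences are all assumptions made for the example.

```python
import re
from collections import Counter


def clean_text(text):
    """Lowercase the text and replace anything other than letters,
    apostrophes and spaces with a space (assumed cleaning rule)."""
    text = text.lower()
    text = re.sub(r"[^a-z' ]+", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Split cleaned text into word tokens."""
    return clean_text(text).split()


def ngram_counts(lines, n):
    """Count every n-gram (as a tuple of words) across all lines."""
    counts = Counter()
    for line in lines:
        tokens = tokenize(line)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def frequency_table(counts):
    """Return (n-gram, count) pairs sorted by descending frequency,
    with ties broken alphabetically."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical sample input standing in for the blogs/Twitter/news corpus.
sample_lines = ["I love data science!", "I love R, and I love data."]
bigram_table = frequency_table(ngram_counts(sample_lines, 2))

for ngram, count in bigram_table:
    print(" ".join(ngram), count)
```

In a real pipeline the resulting tables would then be written to disk, which is the step the report performs with .RData files.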

How the word prediction model works

The application, built on the next-word prediction model, works as follows:

  1. The compressed data sets, which contain n-grams ordered by descending frequency, are loaded;

  2. The words entered by the user are cleaned before the next word is predicted;

  3. To predict the next word, the 4-gram table is tried first;

  4. If no matching 4-gram is found, the model falls back to the 3-grams;

  5. If no matching 3-gram is found, the model falls back to the 2-grams;

  6. If no matching 2-gram is found, the model falls back to the most frequent word.
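The fallback sequence described above can be sketched as a simple back-off lookup. This is a minimal Python illustration under stated assumptions, not the application's actual code: the `predict_next` function, the layout of `tables` (context tuple mapped to candidate words, most frequent first), the toy data and the fallback word "the" are all hypothetical, and input cleaning is reduced to lowercasing and splitting on whitespace.

```python
def predict_next(phrase, tables, fallback_word):
    """Predict the next word by trying the 4-gram, 3-gram and 2-gram
    tables in turn, then falling back to the most frequent word.

    `tables` maps n to a dict of {context tuple of n-1 words:
    list of next words ordered from most to least frequent}.
    """
    tokens = phrase.lower().split()  # simplified cleaning (assumption)
    for n in (4, 3, 2):
        context = tuple(tokens[-(n - 1):])
        if len(context) < n - 1:
            continue  # not enough words typed for this n-gram order
        candidates = tables.get(n, {}).get(context)
        if candidates:
            return candidates[0]
    return fallback_word


# Toy tables standing in for the stored n-gram frequency data.
tables = {
    4: {("i", "love", "data"): ["science"]},
    3: {("love", "data"): ["science", "analysis"]},
    2: {("data",): ["science"], ("love",): ["data", "r"]},
}

print(predict_next("I love data", tables, "the"))   # matched by the 4-gram table
print(predict_next("we love", tables, "the"))       # falls through to the 2-gram table
print(predict_next("hello world", tables, "the"))   # no match, fallback word
```

Storing the candidates already sorted by frequency keeps the lookup to a single dictionary access per n-gram order, which matches the report's choice to save the n-grams in descending frequency order.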

Instructions for using the application

To get a prediction, enter a word or phrase; the application then returns the predicted next word.