Data Science Capstone Course Project

20/7/2020

Introduction

This presentation gives a brief description of the application created as the final project of the Data Science Specialization offered by Johns Hopkins University on the Coursera platform. The application is built with the shiny library in RStudio and uses a prediction algorithm that takes a word or sentence as input and predicts the word that follows it.

Application Description

This application takes a word or sentence as input and predicts the next word. The input and output are described in more detail below:

Input: A complete word or sentence entered by the user.

Output: The application returns one word, the predicted next word. If a single word is entered, the prediction follows that word; if a sentence is entered, it follows the last word of the sentence.

Note: If no word or sentence is entered, the application returns “NULL” as output. Once a word or sentence is entered, the output switches back to the predicted word.

Application user interface

The application is available here.

Description of the algorithm used

The prediction algorithm uses files created by organizing and cleaning the data provided by SwiftKey. These files contain the quadgrams, trigrams and bigrams, each sorted by frequency in descending order.
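
As a rough illustration of what such frequency-sorted n-gram tables might look like, here is a minimal base R sketch. The toy corpus and the helper name `build_ngrams` are assumptions made for this example; they are not taken from the project's actual preprocessing code, which is not shown in this presentation.

```r
# Toy corpus standing in for the cleaned SwiftKey text (assumption).
corpus <- c("i love data science", "i love data analysis", "i love r")

# Hypothetical helper: count every n-gram of size n and return a
# data frame sorted by frequency in descending order.
build_ngrams <- function(lines, n) {
  grams <- unlist(lapply(strsplit(lines, " +"), function(words) {
    # Lines shorter than n words contribute no n-grams.
    if (length(words) < n) return(character(0))
    starts <- seq_len(length(words) - n + 1)
    vapply(starts,
           function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  counts <- table(grams)
  freq <- data.frame(ngram = names(counts),
                     freq = as.integer(counts),
                     stringsAsFactors = FALSE)
  # Most frequent first; ties are broken alphabetically.
  freq <- freq[order(-freq$freq, freq$ngram), ]
  rownames(freq) <- NULL
  freq
}

bigrams   <- build_ngrams(corpus, 2)
trigrams  <- build_ngrams(corpus, 3)
quadgrams <- build_ngrams(corpus, 4)
```

Because each table is sorted once up front, a lookup only needs to take the first matching row to get the most frequent continuation.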

The prediction algorithm works as follows. First, it loads the previously created files. Next, it cleans the entered word or sentence by converting all letters to lowercase and removing punctuation and numbers. It then searches for a prediction, starting with the quadgrams: the last three words of the entered sentence are matched against the first three words of each quadgram. If no quadgram matches, it falls back to the trigrams, which follow the same logic with the last two words. If no trigram matches, it uses the bigrams with the last word. If no bigram matches either, the algorithm returns the most frequent word as the prediction.
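
The cleaning and back-off steps described above could be sketched in base R roughly as follows. This is only an illustration under stated assumptions: the function names (`clean_input`, `lookup`, `predict_next`), the small example tables, and the fallback word "the" are hypothetical and do not come from the application's source code.

```r
# Hypothetical n-gram tables, already sorted by descending frequency.
# In the app these would be loaded from the prepared files.
quadgrams <- data.frame(ngram = c("thanks for the follow", "thanks for the support"),
                        freq = c(5L, 3L), stringsAsFactors = FALSE)
trigrams  <- data.frame(ngram = c("for the first", "one of the"),
                        freq = c(9L, 7L), stringsAsFactors = FALSE)
bigrams   <- data.frame(ngram = c("of the", "in the", "happy birthday"),
                        freq = c(20L, 15L, 4L), stringsAsFactors = FALSE)

# Lowercase the text, drop punctuation and numbers, split into words.
clean_input <- function(text) {
  text <- tolower(text)
  text <- gsub("[[:punct:][:digit:]]", "", text)
  words <- strsplit(trimws(text), "\\s+")[[1]]
  words[nzchar(words)]
}

# Return the last word of the most frequent n-gram that starts with
# the given words, or NULL when nothing matches.
lookup <- function(ngram_table, prefix_words) {
  prefix <- paste0(paste(prefix_words, collapse = " "), " ")
  hits <- ngram_table$ngram[startsWith(ngram_table$ngram, prefix)]
  if (length(hits) == 0) return(NULL)
  words <- strsplit(hits[1], " ")[[1]]  # first hit = highest frequency
  words[length(words)]
}

# Back off from quadgrams to trigrams to bigrams, then to a fallback word.
# The fallback "the" is an assumed stand-in for the most frequent word.
predict_next <- function(text, fallback = "the") {
  words <- clean_input(text)
  if (length(words) == 0) return(NULL)  # empty input gives NULL
  tables <- list(quadgrams, trigrams, bigrams)
  for (k in 3:1) {
    if (length(words) >= k) {
      guess <- lookup(tables[[4 - k]], utils::tail(words, k))
      if (!is.null(guess)) return(guess)
    }
  }
  fallback
}
```

With these example tables, `predict_next("Thanks for the")` is answered by the quadgrams, `predict_next("waiting for the")` falls back to the trigrams, and `predict_next("Happy")` is answered by the bigrams. Returning `NULL` for empty input mirrors the behavior described in the note on the application's output.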