30/9/2020

Introduction

This is the final product of the Data Science Specialization program, offered by Johns Hopkins University through Coursera.

The objective of the project is to develop skills in manipulating text corpora and to build a text prediction algorithm.

Overview

The data come from a corpus called HC Corpora. This exercise uses the files en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt.

  • Application made entirely in R
  • Built on the {shiny} package, together with various packages available for natural language processing
  • The application is hosted on the RStudio server

What is the concept of the algorithm?

Each of the three files was cleaned beforehand. A prediction algorithm was then trained on the pre-classified blocks of one, two, three and four consecutive words (n-grams) found in them.

This model is applied to the text string as the user types it into the input window provided.
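The idea of predicting from blocks of one to four consecutive words can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not the application's R code. The function names (tokenize, train_ngrams, predict_next), the toy corpus, and the rule of falling back from the longest matching context to shorter ones are all assumptions; the original does not say how the blocks are combined.

```python
from collections import Counter, defaultdict

def tokenize(text):
    # Assumption: lowercase and split on whitespace; real cleaning would do more.
    return text.lower().split()

def train_ngrams(lines, max_n=4):
    # tables[k] maps a context of k preceding words to a Counter of next words,
    # for k = 0 .. max_n - 1 (so blocks of 1 to max_n consecutive words).
    tables = {k: defaultdict(Counter) for k in range(max_n)}
    for line in lines:
        words = tokenize(line)
        for i, word in enumerate(words):
            for k in range(max_n):
                if i - k < 0:
                    break  # not enough preceding words for this context length
                context = tuple(words[i - k:i])
                tables[k][context][word] += 1
    return tables

def predict_next(tables, text, top=3):
    # Assumed strategy: try the longest available context first,
    # then fall back to shorter ones, ending with plain word frequency.
    words = tokenize(text)
    longest = min(max(tables), len(words))
    for k in range(longest, -1, -1):
        context = tuple(words[len(words) - k:])
        counts = tables[k].get(context)
        if counts:
            return [w for w, _ in counts.most_common(top)]
    return []

# Hypothetical toy corpus standing in for the cleaned blogs/news/twitter text.
toy_corpus = [
    "thanks for the follow",
    "thanks for the help",
    "thanks for the follow",
    "see you at the game",
]
tables = train_ngrams(toy_corpus)
print(predict_next(tables, "thanks for the"))
```

With this toy corpus, the three-word context "thanks for the" has been seen before, so the suggestions come from the four-word blocks; an unseen context falls through to shorter blocks.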

Conceptual Issues

In the training phase, it would be worthwhile to distinguish between behavior at the beginning of a text and behavior in the middle of it.
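One possible way to make that distinction, sketched here in Python for illustration only (the application itself is written in R), is to prepend a start marker to every training line, so that the first word is learned as following that marker. The marker "<s>", the function names and the toy lines are hypothetical; this is one option under those assumptions, not a description of the current model.

```python
from collections import Counter, defaultdict

START = "<s>"  # hypothetical marker for the beginning of a text

def train_bigrams_with_start(lines):
    # Count word pairs, treating START as the word before the first word.
    table = defaultdict(Counter)
    for line in lines:
        words = [START] + line.lower().split()
        for prev, nxt in zip(words, words[1:]):
            table[prev][nxt] += 1
    return table

def suggest(table, text, top=3):
    # Empty input means we are at the beginning of the text.
    words = text.lower().split()
    prev = words[-1] if words else START
    counts = table.get(prev, Counter())
    return [w for w, _ in counts.most_common(top)]

# Hypothetical toy lines, not taken from HC Corpora.
start_table = train_bigrams_with_start([
    "i love this game",
    "i think so",
    "the game is on",
])
print(suggest(start_table, ""))          # suggestions for an empty input box
print(suggest(start_table, "the game"))  # suggestions in the middle of a text
```

Keeping the start marker in the same table means the beginning of a text is handled by the same lookup as any other position, which avoids a separate model for the first word.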

It is also desirable to optimize the response time.

The future

All suggestions are welcome:

francisco.alvarez@correounivalle.edu.co