Text Prediction App

Diego Angulo
9/28/2019

Coursera
Data Science Capstone Project Course - #10
Johns Hopkins University - Data Science Specialization

Summary

This presentation is part of the final project for the Data Science Capstone course in the Data Science Specialization offered by Johns Hopkins University on Coursera.

The goal of this project is to build a predictive text product based on the analysis of text data and natural language processing. The product is presented as an R Shiny app that predicts the next word from the user's input.

All text mining and natural language processing was done with a variety of well-known R packages.

Data Processing

The data used in the model came from a corpus called HC Corpora, which was downloaded from the Coursera site.

Three data samples from the HC Corpora data were created separately (blogs, Twitter and news) and then merged into one main file. The merged sample was cleaned by converting it to lowercase and removing punctuation, links, extra white space and numbers.
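The cleaning steps listed above can be sketched in base R as follows. This is a minimal illustration, not the project's actual code: the function name clean_text and the regular expressions are assumptions, and the original work may have used dedicated text mining packages instead.

```r
# Hypothetical helper (name and patterns are illustrative assumptions).
clean_text <- function(x) {
  x <- tolower(x)                                            # convert to lowercase
  x <- gsub("(https?://|www\\.)\\S+", " ", x, perl = TRUE)   # drop links before punctuation is stripped
  x <- gsub("[[:digit:]]+", "", x)                           # remove numbers
  x <- gsub("[[:punct:]]+", "", x)                           # remove punctuation
  x <- gsub("[[:space:]]+", " ", x)                          # collapse runs of white space
  trimws(x)                                                  # trim leading/trailing white space
}

clean_text("Check  THIS out: https://example.com, it's 100% free!!")
# [1] "check this out its free"
```

Links are removed first because stripping punctuation beforehand would break the URL pattern and leave fragments such as "httpsexamplecom" in the text.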

The data sample was then tokenized into n-grams (contiguous sequences of n items from a given sequence of text or speech). In this project, bigrams, trigrams and quadgrams were used.
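To make the idea of an n-gram concrete, here is a small base R sketch that slides a window of n words over a cleaned sentence. The helper name make_ngrams is hypothetical, and splitting on single spaces assumes the text has already been cleaned; the project itself may have relied on a tokenizer package.

```r
# Hypothetical helper (illustrative name): returns all n-grams of a cleaned string.
make_ngrams <- function(text, n) {
  words <- strsplit(text, " ", fixed = TRUE)[[1]]
  words <- words[nzchar(words)]                      # ignore empty tokens
  if (length(words) < n) return(character(0))        # too short for any n-gram
  starts <- seq_len(length(words) - n + 1)           # start index of each window
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

sentence <- "thanks for the follow"
make_ngrams(sentence, 2)  # bigrams:  "thanks for" "for the" "the follow"
make_ngrams(sentence, 3)  # trigrams: "thanks for the" "for the follow"
make_ngrams(sentence, 4)  # quadgram: "thanks for the follow"
```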

These n-grams were then converted into frequency dictionaries stored as data frames, which were later used for the word prediction.
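As a rough illustration of a frequency dictionary held in a data frame, the sketch below counts a toy set of trigrams and looks up the most frequent continuation of a two-word prefix. The sample data, the names trigram_freq and predict_next, and the prefix-matching lookup are all assumptions; the presentation does not describe the exact prediction logic the app uses.

```r
# Toy trigram sample, made up for illustration (not project data).
trigrams <- c("thanks for the", "thanks for the", "thanks for your", "one of the")

# Count each distinct trigram and store the result as a data frame.
counts <- table(trigrams)
trigram_freq <- data.frame(
  ngram = names(counts),
  count = as.integer(counts),
  stringsAsFactors = FALSE
)
# Most frequent first; ties are broken alphabetically.
trigram_freq <- trigram_freq[order(-trigram_freq$count, trigram_freq$ngram), ]
rownames(trigram_freq) <- NULL

# Hypothetical lookup: return the last word of the most frequent n-gram
# that starts with the given prefix, or NA if nothing matches.
predict_next <- function(freq_table, prefix) {
  hits <- freq_table[startsWith(freq_table$ngram, paste0(prefix, " ")), ]
  if (nrow(hits) == 0) return(NA_character_)
  sub("^.* ", "", hits$ngram[1])   # rows are already sorted by count
}

predict_next(trigram_freq, "thanks for")
# [1] "the"
```

One plausible extension, again an assumption rather than the documented method, is to try the quadgram table first and fall back to the trigram and bigram tables when the longer prefix has no match.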

Application

App link: diegoangulo.shinyapps.io/Text_Prediction_App

The user interface is straightforward. Once the user types a word or sentence into the text box, the app refreshes immediately and the predicted next word appears in the main panel.