WordApp

Manuel Cázares
27/04/2015

Challenge

The final product of the Data Science Capstone is a Shiny app. The challenge was to predict the most probable next word based on the analysis of a large text corpus (news, blogs, and Twitter files). The necessary steps were:

  • Getting & cleaning the data
  • Creating a prediction algorithm
  • Testing the predictions
  • Creating a Shiny app

Getting & Cleaning Data

Because of the size of the data, we used only a small sample: the first 1,000 lines of each text file. The analysis ran on an Amazon EC2 instance to improve performance.
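
The sampling step can be sketched in base R as follows. This is a minimal illustration, not the project's actual code: the file names, the helper names `read_sample` and `clean_text`, and the specific cleaning rules (lower-casing, keeping only letters and apostrophes) are assumptions.

```r
# Read only the first `n` lines of a text file (1,000 per file in this project).
read_sample <- function(path, n = 1000) {
  con <- file(path, open = "r", encoding = "UTF-8")
  on.exit(close(con))
  readLines(con, n = n, warn = FALSE, skipNul = TRUE)
}

# Assumed cleaning rules: lower-case the text, replace anything that is not
# a letter, an apostrophe or a space with a space, then collapse whitespace.
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z' ]+", " ", x)
  x <- gsub("\\s+", " ", x)
  trimws(x)
}

# Hypothetical file names; adjust them to wherever the corpus is stored.
files <- c("en_US.news.txt", "en_US.blogs.txt", "en_US.twitter.txt")
files <- files[file.exists(files)]  # skip files that are not present
corpus <- unlist(lapply(files, function(f) clean_text(read_sample(f))))
```

Reading through a connection with `n = 1000` avoids loading the full files into memory, which matters more than the cleaning rules themselves when the raw files are large.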

Prediction Algorithm

For the prediction algorithm we used the Katz back-off model, a generative n-gram language model that estimates the conditional probability of a word given its history in the n-gram. It does this by “backing off” to models with shorter histories when the longer history has too few observations in the training data. As a result, the model with the most reliable information about a given history provides the estimate.
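
To make the back-off idea concrete, here is a small base R sketch for the bigram-to-unigram case. It is a simplification under stated assumptions, not the app's implementation: it uses a fixed absolute discount `d` in place of the Good-Turing discounting of the original Katz formulation, and the toy corpus and the function names `count_table`, `katz_bigram` and `predict_next` are hypothetical.

```r
# Toy corpus; tokens are assumed to be cleaned and lower-cased already.
tokens <- c("i", "like", "tea", "i", "like", "coffee", "i", "drink", "tea")

# Counts as plain named numeric vectors.
count_table <- function(x) {
  t <- table(x)
  setNames(as.numeric(t), names(t))
}
uni <- count_table(tokens)
bi  <- count_table(paste(head(tokens, -1), tail(tokens, -1)))

# P(word | prev) with back-off from bigrams to unigrams.
# d is a fixed discount (an assumption replacing Good-Turing discounting).
katz_bigram <- function(prev, word, uni, bi, d = 0.5) {
  p_uni <- uni / sum(uni)
  if (!(word %in% names(p_uni))) return(0)           # out-of-vocabulary word
  seen <- names(bi)[startsWith(names(bi), paste0(prev, " "))]
  if (length(seen) == 0) return(p_uni[[word]])       # unseen history: unigram
  history_count <- sum(bi[seen])
  seen_words <- substring(seen, nchar(prev) + 2)
  if (word %in% seen_words) {
    # Seen bigram: discounted relative frequency.
    return((bi[[paste(prev, word)]] - d) / history_count)
  }
  # Unseen bigram: share the discounted mass according to unigram probabilities.
  left_over <- d * length(seen) / history_count
  alpha <- left_over / (1 - sum(p_uni[seen_words]))
  alpha * p_uni[[word]]
}

# Most probable next word over the whole vocabulary.
predict_next <- function(prev, uni, bi, d = 0.5) {
  probs <- sapply(names(uni), function(w) katz_bigram(prev, w, uni, bi, d))
  names(probs)[which.max(probs)]
}

predict_next("i", uni, bi)
```

The probability mass removed from seen bigrams by the discount is exactly what `alpha` redistributes over unseen words, so the probabilities for a given history still sum to one.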

Shiny App

The Shiny App for this project can be found at: https://cazares.shinyapps.io/WordApp/

Packages used

Some of the packages used in this project are:

  • library("pander") # Tables
  • library("ggplot2") # Graphics
  • library("NLP") # Natural language processing
  • library("openNLP") # Natural language processing
  • library("tm") # Text mining
  • library("RWeka") # Machine learning