Manuel Cázares
27/04/2015
The final product of the Data Science Capstone is a Shiny app. The challenge was to predict the next most probable word based on an analysis of a large text corpus (news, blogs, and Twitter). The necessary steps were:
Because of the size of the data, we analyzed only a small sample, 1,000 lines from each text file, and ran the analysis on an Amazon EC2 instance to improve performance.
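As a rough illustration of the sampling step, the sketch below keeps the first 1,000 lines of a text source. It is written in Python for illustration only and is not the project's own code. The helper name `take_lines` is hypothetical, and taking the first lines (rather than a random sample) is an assumption, since the selection method is not stated here.

```python
from itertools import islice

def take_lines(lines, n=1000):
    """Return the first n lines of an iterable, with line endings stripped.

    `lines` can be any iterable of strings, including an open file object.
    Assumption: the sample is simply the first n lines of each file.
    """
    return [line.rstrip("\r\n") for line in islice(lines, n)]

# Small in-memory demonstration; with a real file one would write
#   with open(path, encoding="utf-8") as f: sample = take_lines(f)
demo = ["first line\n", "second line\n", "third line\n"]
print(take_lines(demo, n=2))
```

Because `islice` stops after `n` lines, the rest of a large file is never read into memory.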
For the prediction algorithm we used the Katz back-off model, a generative n-gram language model that estimates the conditional probability of a word given its history in the n-gram. It does so by “backing off” to models with shorter histories under certain conditions (typically when the longer n-gram was seen too rarely in the training data), so that the model with the most reliable information about a given history is used to provide better results.
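The back-off idea can be sketched in a few lines. The example below is a simplified illustration in Python, not the code behind the app: it backs off only from bigrams to unigrams, and it uses a fixed absolute discount in place of the Good-Turing discounting of the full Katz model. The function names, the `discount` value of 0.5, and the toy sentence are all assumptions made for the example.

```python
from collections import Counter

def train(tokens):
    """Count unigrams and bigrams in a list of tokens."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def backoff_prob(word, prev, unigrams, bigrams, discount=0.5):
    """Estimate P(word | prev), backing off from bigrams to unigrams."""
    # Words observed right after `prev`, with their bigram counts.
    seen = {w: c for (p, w), c in bigrams.items() if p == prev}
    history_count = sum(seen.values())
    if history_count == 0:
        # Unknown history: fall back to the plain unigram estimate.
        return unigrams[word] / sum(unigrams.values())
    if word in seen:
        # Seen bigram: discounted relative frequency.
        return (seen[word] - discount) / history_count
    # Unseen bigram: share the mass freed by discounting among the
    # words never seen after `prev`, in proportion to unigram counts.
    alpha = discount * len(seen) / history_count
    unseen_mass = sum(c for w, c in unigrams.items() if w not in seen)
    if unseen_mass == 0:
        return 0.0
    return alpha * unigrams[word] / unseen_mass

def predict_next(prev, unigrams, bigrams, discount=0.5):
    """Return the vocabulary word with the highest P(word | prev)."""
    return max(unigrams,
               key=lambda w: backoff_prob(w, prev, unigrams, bigrams, discount))

tokens = "the cat sat on the mat the cat ran".split()
unigrams, bigrams = train(tokens)
print(predict_next("the", unigrams, bigrams))  # cat
```

In this toy corpus "cat" follows "the" twice and "mat" once, so "cat" gets the highest discounted bigram probability, while words never seen after "the" share the remaining probability mass according to their unigram counts.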
The Shiny App for this project can be found at: https://cazares.shinyapps.io/WordApp/
Some of the packages used in this project are: