Padma Girish
September 16th, 2016
The goal of this project is to understand the data provided by HC Corpora, compute summary statistics, and establish an overall approach to building a prediction algorithm, using NLP in R, together with a Shiny app that predicts the most probable next word for a given input.
All text mining and natural language processing was done with a variety of well-known R packages.
The prediction algorithm uses the n-gram model, an NLP technique, together with the Markov assumption: the next word can be predicted with some probability from the few words that precede it.
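For a trigram model, the Markov assumption can be written as below. The count ratio on the right is the standard maximum-likelihood estimate and is an assumption here; the project does not state which estimator or smoothing it uses. C(.) denotes how often a word sequence occurs in the sampled corpus.

```latex
P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-2}, w_{i-1})
  \approx \frac{C(w_{i-2}\, w_{i-1}\, w_i)}{C(w_{i-2}\, w_{i-1})}
```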
The model was built from the HC Corpora corpus. The data was cleaned, analyzed, and sampled, and the sampled data was tokenized into n-grams. The unigram, bigram, and trigram frequency matrices were then converted into frequency dictionaries.
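As a rough illustration of the tokenizing and counting step, here is a minimal base R sketch. The toy corpus, the function names (tokenize, make_ngrams, ngram_freq) and the simple regular-expression tokenizer are all assumptions made for this example; the project itself relied on R text-mining packages that are not named here.

```r
# Toy corpus standing in for a cleaned sample of the HC Corpora text (assumption).
corpus <- c("the cat sat on the mat", "the cat ate the fish")

# Split one line into lowercase word tokens, dropping empty strings.
tokenize <- function(line) {
  words <- strsplit(tolower(line), "[^a-z']+")[[1]]
  words[nzchar(words)]
}

# Build every n-gram of size n from one token vector.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Count n-grams over all lines and return a data.frame sorted by frequency.
ngram_freq <- function(lines, n) {
  grams <- unlist(lapply(lines, function(l) make_ngrams(tokenize(l), n)))
  counts <- table(grams)
  df <- data.frame(ngram = names(counts),
                   freq = as.integer(counts),
                   stringsAsFactors = FALSE)
  df <- df[order(-df$freq, df$ngram), ]
  rownames(df) <- NULL
  df
}

unigrams <- ngram_freq(corpus, 1)
bigrams  <- ngram_freq(corpus, 2)
trigrams <- ngram_freq(corpus, 3)
```

N-grams are built per line, so no n-gram spans two lines of the corpus. That is a design choice of this sketch, not something the project states.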
The resulting data.frames, together with the Markov chain rule, are used to predict the next word from the text the user enters.
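The lookup can be sketched in base R as follows. The toy frequency tables, the column names (prefix, word, freq), the helper top_words and the function predict_next are hypothetical. Falling back from trigrams to bigrams to unigrams is also an assumption: the project does not say how it handles unseen contexts.

```r
# Hypothetical toy frequency tables; a real run would derive them from the sampled corpus.
unigrams <- data.frame(word = c("the", "cat", "sat", "mat"),
                       freq = c(4L, 2L, 1L, 1L), stringsAsFactors = FALSE)
bigrams <- data.frame(prefix = c("the", "the", "the", "cat", "cat"),
                      word   = c("cat", "mat", "fish", "sat", "ate"),
                      freq   = c(2L, 1L, 1L, 1L, 1L), stringsAsFactors = FALSE)
trigrams <- data.frame(prefix = c("the cat", "the cat", "on the"),
                       word   = c("sat", "ate", "mat"),
                       freq   = c(1L, 1L, 1L), stringsAsFactors = FALSE)

# Return up to k words that follow `prefix` in `tbl`, most frequent first.
top_words <- function(tbl, prefix, k) {
  hits <- tbl[tbl$prefix == prefix, ]
  head(hits$word[order(-hits$freq)], k)
}

# Suggest up to k next words: try the last two words against the trigram table,
# then the last word against the bigram table, then the most frequent unigrams.
predict_next <- function(text, k = 3) {
  tokens <- strsplit(tolower(trimws(text)), "\\s+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  n <- length(tokens)
  candidates <- character(0)
  if (n >= 2) {
    candidates <- top_words(trigrams, paste(tokens[n - 1], tokens[n]), k)
  }
  if (n >= 1) {
    candidates <- c(candidates, top_words(bigrams, tokens[n], k))
  }
  candidates <- c(candidates, unigrams$word[order(-unigrams$freq)])
  head(unique(candidates), k)
}

predict_next("I saw the cat")
```

With these toy tables, the call returns "sat", "ate" and "the": the first two come from the trigram context "the cat", and the third is filled in from the unigram table.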
The app interface is very simple. Enter text in the input textbox, and the three most probable next words are displayed in the output textbox.
The word prediction app is hosted on shinyapps.io; you can run it at https://pgirish.shinyapps.io/ds_capstone/
The pitch deck is located at http://rpubs.com/pgirish/capstone
Data for this project was downloaded from HC Corpora, a freely downloadable collection of corpora for various languages.