Word Prediction: Project

Sagar Pathak
May 14, 2016

Introduction

As the name suggests, this project predicts the next word from the phrase entered by the user. The prediction algorithm uses data tables of 4-grams with their frequencies of occurrence. The HC Corpora dataset comprises the output of crawls of news sites, blogs and Twitter. It contains three files (news, blogs and Twitter) for each of four languages (Russian, Finnish, German and English). This project uses the English-language datasets.

Features of prediction algorithm

  • Processes the last three words of the input phrase to look up the next word
  • Uses a back-off strategy: if the longer context has no match, a shorter one is tried
  • When no match is found, the resulting 'NA' is replaced with the most common English words
  • A "babble" feature generates random sentences (incomplete)
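
The back-off lookup listed above can be sketched roughly as follows. This is a minimal illustration in Python, not the R code used in the application. The table layout (a dict mapping a context tuple to a Counter of next words), the names predict_next, TOY_TABLES and COMMON_WORDS, and all counts are assumptions made up for the example.

```python
from collections import Counter

# Hypothetical toy tables: context tuple -> counts of the words that followed it.
TOY_TABLES = {
    ("thanks", "for", "the"): Counter({"follow": 5, "help": 3}),
    ("for", "the"): Counter({"first": 7, "follow": 2}),
    ("the",): Counter({"same": 9}),
}

# Hypothetical fallback list standing in for "most common English words".
COMMON_WORDS = ["the", "to", "and"]


def predict_next(phrase, tables, common_words, max_context=3):
    """Return the most frequent next word, backing off to shorter contexts."""
    words = phrase.lower().split()
    # Try the last 3 words, then the last 2, then the last 1.
    for n in range(min(max_context, len(words)), 0, -1):
        candidates = tables.get(tuple(words[-n:]))
        if candidates:
            return candidates.most_common(1)[0][0]
    # No context matched (the 'NA' case): fall back to a common word.
    return common_words[0]


print(predict_next("Thanks for the", TOY_TABLES, COMMON_WORDS))   # follow
print(predict_next("waiting for the", TOY_TABLES, COMMON_WORDS))  # first
print(predict_next("hello world", TOY_TABLES, COMMON_WORDS))      # the
```

The second call shows the back-off step: the three-word context is unknown, so the lookup falls through to the two-word context.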

The Objective

The main goal of this capstone project is to build a Shiny application that predicts the next word. The work was divided into seven subtasks, including data cleansing, exploratory analysis and the creation of a predictive model.

All text data that is used to create a frequency dictionary and thus to predict the next words comes from a corpus called HC Corpora.
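
As an illustration of what building such a frequency dictionary might look like, here is a minimal Python sketch. It is not the project's R code: the tokenizer, the function names tokenize and build_tables, and the sample lines are assumptions chosen for the example, and the real corpus cleaning is likely more involved.

```python
import re
from collections import Counter, defaultdict


def tokenize(line):
    """Lowercase a line and keep only runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", line.lower())


def build_tables(lines, max_n=4):
    """Count n-grams (2..max_n) as context tuple -> Counter of next words."""
    tables = defaultdict(Counter)
    for line in lines:
        tokens = tokenize(line)
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                # The last token of the n-gram is the word to predict;
                # the tokens before it form the context.
                *context, next_word = tokens[i:i + n]
                tables[tuple(context)][next_word] += 1
    return tables


# Hypothetical sample lines standing in for the corpus files.
sample_lines = [
    "Thanks for the follow",
    "thanks for the help",
    "Thanks for the follow!",
]
tables = build_tables(sample_lines)
print(tables[("thanks", "for", "the")].most_common())
```

Keying the counts by context keeps the later lookup cheap: predicting a word is a single dictionary access per context length.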

All text mining and natural language processing was done with well-known R packages such as stylo and data.table.

Application Link: https://sagar1992.shinyapps.io/word-predict-project

User Interface I

The user enters a phrase in the input box. The result is displayed on the fly in the box on the right (User Interface II).

User Interface II

After the user enters the input phrase, the predicted word is displayed as a tile.

Conclusion

The word prediction application was successfully created and hosted at

https://sagar1992.shinyapps.io/word-predict-project

using R packages such as stylo and data.table. This project helped me gain more advanced experience with the R programming language and with RStudio features such as RPres, shinyapps.io and RPubs, which will definitely be helpful for other research and presentations.

Thanks, Sagar