Data Science Specialization: Capstone Project

Binod Jung Bogati
June 23rd, 2018

Introduction

This is an NLP problem focused on text mining. The project lays the groundwork for predicting the next word from raw text data.

About:

  • Text mining project to predict the next word
  • Written in R
  • App built with Shiny and hosted on shinyapps.io

Word Prediction Model

The model is built following Tidy Data principles in R. The steps involved are:

  • Input: text files as training data
  • Clean data: group the text into 2-, 3-, and 4-word n-grams
  • N-grams function: uses a “Katz back-off” prediction model
    • takes the last 3, 2, or 1 words of the input to find the best-matching 4th, 3rd, or 2nd word in the stored n-grams
  • Output: Next word prediction
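
The steps above can be sketched in a few lines of base R. This is a minimal illustration, not the project's actual code: the toy corpus and the names tokenize, count_ngrams, and predict_next are hypothetical, and the lookup is a simplified back-off that returns the most frequent continuation for the longest matching context. It omits the discounting that a full Katz back-off model applies.

```r
# Toy corpus standing in for the real training text (an assumption for this sketch).
corpus <- c("the cat sat on the mat",
            "the cat sat on the sofa",
            "the dog sat on the mat")

# Split a line into lowercase word tokens.
tokenize <- function(line) {
  words <- unlist(strsplit(tolower(line), "[^a-z']+"))
  words[words != ""]
}

# Count every n-word sequence. Returns a data frame with columns
# prefix (first n - 1 words), word (last word), and n (count).
count_ngrams <- function(lines, n) {
  grams <- unlist(lapply(lines, function(line) {
    w <- tokenize(line)
    if (length(w) < n) return(character(0))
    vapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  tab <- as.data.frame(table(gram = grams), stringsAsFactors = FALSE)
  parts <- strsplit(tab$gram, " ")
  data.frame(
    prefix = vapply(parts, function(p) paste(p[-n], collapse = " "), character(1)),
    word   = vapply(parts, function(p) p[n], character(1)),
    n      = tab$Freq,
    stringsAsFactors = FALSE
  )
}

# One table per n-gram order, keyed by the order as a string.
tables <- list("2" = count_ngrams(corpus, 2),
               "3" = count_ngrams(corpus, 3),
               "4" = count_ngrams(corpus, 4))

# Simplified back-off: try the last 3 words against the 4-grams, then the
# last 2 against the 3-grams, then the last word against the 2-grams.
predict_next <- function(text, tables) {
  w <- tokenize(text)
  for (k in 3:1) {
    if (length(w) < k) next
    prefix <- paste(tail(w, k), collapse = " ")
    tab <- tables[[as.character(k + 1)]]
    hits <- tab[tab$prefix == prefix, ]
    if (nrow(hits) > 0) return(hits$word[which.max(hits$n)])
  }
  NA_character_  # no context matched at any order
}

predict_next("sat on the", tables)     # "mat" (matched in the 4-grams)
predict_next("zebra the cat", tables)  # "sat" (backs off to the 3-grams)
```

Trying the longest context first means the most specific evidence wins whenever it exists, and shorter contexts are consulted only when the longer one was never seen in training.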

Word Prediction App

Provides a simple UI to the word prediction model.

Salient Features:

  • Predictions update dynamically below the user input
  • Tabs with plots of the most frequent n-grams in the dataset
  • Can accommodate larger training data in future enhancements
  • Fast output response
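
The dynamic-output behaviour can be sketched with a minimal Shiny app. This is an assumption-laden illustration rather than the deployed app: the input and output IDs, the lookup vector, and this stand-in predict_next function are all hypothetical, and the real app would call the trained n-gram model instead of a two-entry table.

```r
library(shiny)

# Hypothetical stand-in for the trained model: maps the last typed word
# to a canned next word. The real app would query the n-gram tables.
lookup <- c(the = "cat", on = "the")

predict_next <- function(text) {
  words <- strsplit(trimws(tolower(text)), "\\s+")[[1]]
  if (length(words) == 0) return("")
  guess <- lookup[tail(words, 1)]
  if (is.na(guess)) "" else unname(guess)
}

# UI: a text box with the prediction rendered directly below it.
ui <- fluidPage(
  textInput("text", "Type a phrase:"),
  textOutput("prediction")
)

# Server: renderText re-runs whenever input$text changes, so the
# prediction updates as the user types.
server <- function(input, output) {
  output$prediction <- renderText(predict_next(input$text))
}

app <- shinyApp(ui, server)
```

In an interactive R session, runApp(app) launches the sketch locally.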

Resources & Documentation