Data Science Specialization: Capstone Project

Uma Balakrishnan
April 21, 2016

Introduction

  • The main goal of the Capstone project is to analyze a large corpus of text documents to discover the structure of the data and how words are put together.
  • The project requires applying data science skills to the analysis of text data and Natural Language Processing (NLP).
  • The training data set for the project was downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
  • The dataset consists of three corpus files (1. Twitter, 2. Blogs, 3. News) for each of four languages (German, US English, Finnish, and Russian).
  • My project, Shiny app, and presentation focus primarily on US English.

Prediction Algorithm

  1. Download and extract the text files
  2. Select the English-language text files for basic analysis
  3. Clean the corpus and perform basic exploratory analysis
  4. Sample data from each text file for further analysis
  5. Build N-gram models from the sampled data using the RWeka package (NGramTokenizer)
  6. Save the unique unigrams, bigrams, and trigrams along with their respective frequencies
  7. Construct a predictive model to predict the next possible words (I used Kneser-Ney smoothing to predict the next word)
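
Steps 3, 5, and 6 above can be illustrated with a minimal sketch. The project itself used R with RWeka's NGramTokenizer; the Python below is only a language-neutral illustration, and the helper names (clean, ngram_counts), the cleaning rule, and the two sample sentences are assumptions made for the example, not the project's actual code or data.

```python
import re
from collections import Counter

def clean(text):
    """Lowercase the text, replace anything that is not a letter,
    apostrophe, or space with a space, then split into tokens.
    (Hypothetical cleaning rule, chosen only for this sketch.)"""
    text = text.lower()
    text = re.sub(r"[^a-z' ]+", " ", text)
    return text.split()

def ngram_counts(lines, n):
    """Count every n-gram (as a tuple of tokens) across all lines."""
    counts = Counter()
    for line in lines:
        tokens = clean(line)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Tiny made-up sample standing in for the sampled corpus.
sample = ["I love data science", "I love text mining!"]

unigrams = ngram_counts(sample, 1)
bigrams = ngram_counts(sample, 2)
trigrams = ngram_counts(sample, 3)

# Each table maps a unique n-gram to its frequency, as in step 6.
for gram, freq in bigrams.most_common(3):
    print(" ".join(gram), freq)
```

Storing only the unique n-grams with their counts, rather than the raw text, is what allows the saved tables to be far smaller than the original corpus.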

The data file built from the unigram, bigram, and trigram tables is 64 KB; the original data file is 563 MB.

Shiny App for Word Prediction

  • The Shiny app for this project is hosted on the “shinyapps.io” server and consists of 2 tabs:
    1. Prediction of Next Word
      • The user enters words, and the app completes the input with a predicted next word.
      • The user input is passed to the prediction algorithm, which cleans the input and looks in the trigram, bigram, and unigram data frames, in that order, for the most frequent candidate words.
      • The three most frequent candidate words are displayed in the GUI.
    2. Project Overview
      • Describes the main goal of the project.

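The lookup order described for the first tab (trigram, then bigram, then unigram) can be sketched as follows. This is a plain frequency back-off written in Python for illustration, not the Kneser-Ney model or the R code used in the app; the function name predict_next and the toy frequency tables are hypothetical.

```python
from collections import Counter

# Hypothetical toy frequency tables; the real app loads its own
# unigram, bigram, and trigram data.
TRIGRAMS = Counter({("i", "love", "data"): 3, ("i", "love", "text"): 1})
BIGRAMS = Counter({("love", "data"): 3, ("love", "text"): 2,
                   ("love", "you"): 5, ("data", "science"): 4})
UNIGRAMS = Counter({("the",): 10, ("to",): 8, ("and",): 7, ("love",): 6})

def predict_next(text, k=3):
    """Return up to k candidate next words for the given input text."""
    tokens = text.lower().split()
    if not tokens:
        return []  # nothing typed, so nothing to predict
    # Try the longest context first: trigrams, then bigrams.
    for table, n in ((TRIGRAMS, 3), (BIGRAMS, 2)):
        context = tuple(tokens[-(n - 1):])
        if len(context) < n - 1:
            continue  # not enough words typed for this n-gram order
        matches = Counter({gram[-1]: count
                           for gram, count in table.items()
                           if gram[:-1] == context})
        if matches:
            return [word for word, _ in matches.most_common(k)]
    # Fall back to the most frequent single words.
    return [word for (word,), _ in UNIGRAMS.most_common(k)]

print(predict_next("I love"))
print(predict_next("love"))
```

In this sketch the first table that yields any match decides the result, so fewer than three words may be returned; whether the app tops up the list from lower-order tables is not stated in the slides.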
User's Guide

  • The Shiny app can be accessed at
    https://umaram08.shinyapps.io/Shiny-Capstone/
  • Type a word or sentence in English where it says “Input your Text here:”. If a prediction is found, the next 3 possible words are displayed; if not, “Please Input Text for Possible Predictions” is displayed by default.
  • Because of limited laptop memory, the Shiny app was built with a small training set sampled from the three text files (the RData for this Shiny app contains 2568 unique unigrams, 2399 bigrams, and 277 trigrams).
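
Drawing a small training set from each source can be sketched as below. The slides do not state the sampling method or rate, so the random line-level sampling, the 1% fraction, the seed, and the function name sample_lines are all assumptions for illustration; Python is used here although the project itself was built in R.

```python
import random

def sample_lines(lines, fraction, seed=2016):
    """Return a reproducible random sample of the given lines.
    The fraction and seed are hypothetical, not the project's values."""
    if not lines:
        return []
    rng = random.Random(seed)  # fixed seed makes the sample repeatable
    k = max(1, int(len(lines) * fraction))
    return rng.sample(lines, k)

# Made-up stand-ins for the Twitter, blogs, and news files.
corpora = {
    "twitter": [f"tweet {i}" for i in range(1000)],
    "blogs": [f"blog post {i}" for i in range(500)],
    "news": [f"news item {i}" for i in range(200)],
}

# Sample each source separately, then pool the results for training.
training = []
for name, lines in corpora.items():
    picked = sample_lines(lines, fraction=0.01)
    print(name, len(picked))
    training.extend(picked)
```

Sampling each file separately keeps all three sources represented in the pooled training set, at the cost of a much smaller n-gram vocabulary than the full corpus would give.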