Data Science Specialization: Capstone Project

Uma Balakrishnan
April 21, 2016

The main goal of the Capstone project is to analyze a large corpus of text documents to discover the structure of the data and how words are put together.
The project requires applying data science skills to the analysis of text data and Natural Language Processing (NLP).
The training data set for the project was downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
The dataset consists of three corpus files (1. Twitter, 2. Blogs, 3. News) in four languages (German, US English, Finnish, and Russian).
My project, Shiny app and presentation focus primarily on the US English files.

Download and extract the text files
Select each English-language text file for basic analysis
Clean the corpus & perform basic exploratory analysis
Sample data from each text file to perform analysis
Build an N-Gram model from the sampled data using the RWeka package (NGramTokenizer)
Save unique unigram, bigram and trigram data along with their respective frequencies
Construct a predictive model to suggest the next possible words (I used Kneser-Ney smoothing to predict the next word)
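
The counting and smoothing steps above can be sketched as follows. This is a minimal illustration in Python, not the R/RWeka code used for the project: the tokenizer, the toy sentences, the function names and the discount value of 0.75 are all assumptions, and only the bigram case of interpolated Kneser-Ney smoothing is shown.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep runs of letters/apostrophes (a simplification)."""
    return re.findall(r"[a-z']+", text.lower())


def ngram_counts(sentences, n):
    """Count n-grams of order n, without crossing sentence boundaries."""
    counts = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def kneser_ney_bigram(bigrams, discount=0.75):
    """Return p(word | prev) under interpolated Kneser-Ney, built from bigram counts."""
    context_total = Counter()      # total count of bigrams starting with prev
    followers = defaultdict(set)   # distinct words seen after prev
    preceders = defaultdict(set)   # distinct words seen before word
    for (prev, word), count in bigrams.items():
        context_total[prev] += count
        followers[prev].add(word)
        preceders[word].add(prev)
    n_bigram_types = len(bigrams)

    def prob(word, prev):
        # Continuation probability: in how many distinct contexts does word appear?
        p_cont = len(preceders.get(word, ())) / n_bigram_types
        total = context_total.get(prev, 0)
        if total == 0:
            return p_cont  # unseen context: use the continuation probability alone
        discounted = max(bigrams.get((prev, word), 0) - discount, 0) / total
        backoff_weight = discount * len(followers[prev]) / total
        return discounted + backoff_weight * p_cont

    return prob


# Toy corpus standing in for the sampled Twitter, Blogs and News text.
sentences = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat ate the fish",
]
bigrams = ngram_counts(sentences, 2)
prob = kneser_ney_bigram(bigrams)
print(prob("cat", "the"), prob("mat", "the"))
```

In this toy corpus "cat" and "mat" each follow "the" once, but "cat" is seen after two different words, so its continuation probability is higher and it scores above "mat". That preference for words that appear in many contexts is what Kneser-Ney adds over raw frequency counts.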

The data file built from the unigrams, bigrams and trigrams is 64 KB, compared with 563 MB for the original data file.

The Shiny app can be accessed at
https://umaram08.shinyapps.io/Shiny-Capstone/
Type a word or sentence in English in the box labeled “Input your Text here:”. If a prediction is found, the next 3 possible words are displayed; if not, the default message “Please Input Text for Possible Predictions” is displayed.
Because of limited laptop memory, the Shiny app was built from a small training set sampled from the three text files (the RData file for this app contains 2568 unique unigrams, 2399 bigrams and 277 trigrams).
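
The lookup behaviour described for the app (up to 3 suggested words, or a default message) might look roughly like the sketch below. It is written in Python for illustration and is not the app's R code: the toy tables, the `predict_next` name, the trigram-then-bigram fallback order and the ranking by raw frequency instead of Kneser-Ney scores are all assumptions.

```python
from collections import Counter

DEFAULT_MESSAGE = "Please Input Text for Possible Predictions"

# Hypothetical toy tables; the real app loads its n-gram tables from an RData file.
TRIGRAMS = Counter({
    ("thanks", "for", "the"): 5,
    ("thanks", "for", "your"): 3,
    ("for", "the", "first"): 2,
})
BIGRAMS = Counter({
    ("for", "the"): 7,
    ("for", "your"): 4,
    ("for", "a"): 3,
    ("for", "all"): 1,
    ("the", "first"): 2,
})


def candidates_for(context, table):
    """Collect the words that follow `context` in an n-gram table, with their counts."""
    size = len(context)
    found = Counter()
    for gram, count in table.items():
        if gram[:size] == context:
            found[gram[size]] += count
    return found


def predict_next(text, trigrams=TRIGRAMS, bigrams=BIGRAMS, k=3):
    """Return up to k suggested next words, or the default message if none is found."""
    tokens = text.lower().split()
    found = Counter()
    if len(tokens) >= 2:
        # Try the longest context first: the last two words against the trigrams.
        found = candidates_for(tuple(tokens[-2:]), trigrams)
    if not found and tokens:
        # Fall back to the last word against the bigrams.
        found = candidates_for(tuple(tokens[-1:]), bigrams)
    if not found:
        return DEFAULT_MESSAGE
    return [word for word, _ in found.most_common(k)]


print(predict_next("thanks for"))
print(predict_next("for"))
print(predict_next("zebra"))
```

Trying the longest matching context first keeps the suggestions as specific as the small tables allow, and the fallback covers inputs whose last two words were never seen together.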