This is an intermediate report for the Coursera Data Science Capstone Project. The objective of this step is to understand and become familiar with the data. The exploratory data analysis performed here will point to the key ingredients for the prediction algorithm and app. In particular, the main objectives of this report are:
The data set consists of a collection of text files in four languages: English, Russian, German and Finnish. For each language, three text files exist: blogs, news and twitter. We will only consider the English language files for this project.
The corpora have been collected from numerous different webpages, with the aim of getting a varied and comprehensive corpus of current use of the respective language. The data sources are newspapers, magazines, blogs (personal and professional) and Twitter updates.
* Basic information about the data set (file size, line, character and word counts for the blogs, news and twitter files):
| File Name | Size (MB) | Lines | Non-Empty Lines | Characters | Characters (no whitespace) | Word Count |
|---|---|---|---|---|---|---|
| en_US.blogs | 200.4242 | 899288 | 899288 | 206824382 | 170389539 | 37570839 |
| en_US.news | 196.2775 | 1010242 | 1010242 | 203223154 | 169860866 | 34494539 |
| en_US.twitter | 159.3641 | 2360148 | 2360148 | 162096031 | 134082634 | 30451128 |
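For reference, here is a minimal R sketch of how figures like these can be computed; the paths under `final/en_US/` are an assumption and may need to be adjusted:

```r
# Sketch: compute file size, line, character and word counts for each English file.
library(stringi)

files <- c("final/en_US/en_US.blogs.txt",
           "final/en_US/en_US.news.txt",
           "final/en_US/en_US.twitter.txt")

stats <- lapply(files, function(f) {
  lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  data.frame(
    File_Name  = basename(f),
    Size_MB    = file.size(f) / 1024^2,
    Lines      = length(lines),
    Chars      = sum(nchar(lines)),
    Word_Count = sum(stri_count_words(lines))
  )
})
do.call(rbind, stats)
```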
As one can see from the previous table, the data files are quite large. Exploratory data analysis on the full data set would be too time consuming. In order to facilitate faster exploratory analysis, we will create a random sample of the English language blogs, news and twitter files. We will randomly sample 15,000 lines from each file. These samples will then be written to their own directory for subsequent analysis.
Sampling was limited to 15,000 lines per file to keep the analysis within the limits of my laptop's performance.
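A minimal sketch of the sampling step, assuming the raw files live in `final/en_US/` and the samples are written to a `sample/` directory (both paths are placeholders):

```r
# Sketch: draw a 15,000-line random sample from each file and save it
# to a separate directory for the exploratory analysis.
set.seed(1234)  # for reproducibility

sample_file <- function(infile, outfile, n = 15000) {
  lines <- readLines(infile, encoding = "UTF-8", skipNul = TRUE)
  writeLines(sample(lines, n), outfile)
}

dir.create("sample", showWarnings = FALSE)
sample_file("final/en_US/en_US.blogs.txt",   "sample/en_US.blogs.sample.txt")
sample_file("final/en_US/en_US.news.txt",    "sample/en_US.news.sample.txt")
sample_file("final/en_US/en_US.twitter.txt", "sample/en_US.twitter.sample.txt")
```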
Next, the sampled data is used to create a corpus, and the following clean-up steps are performed.
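The sketch below shows one typical set of clean-up transformations with the tm package (lower-casing, removing punctuation, numbers and English stop words, and stripping whitespace); the exact steps used in the full analysis may differ slightly:

```r
# Sketch: build a corpus from the sampled files and clean it up.
library(tm)

corpus <- VCorpus(DirSource("sample", encoding = "UTF-8"))
corpus <- tm_map(corpus, content_transformer(tolower))       # lower case
corpus <- tm_map(corpus, removePunctuation)                  # drop punctuation
corpus <- tm_map(corpus, removeNumbers)                      # drop digits
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drop stop words
corpus <- tm_map(corpus, stripWhitespace)                    # collapse whitespace
```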
A term-document matrix has been created for each corpus.
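Continuing from the cleaned corpus above, a minimal sketch of that step:

```r
# Sketch: build a term-document matrix from the cleaned corpus
# and peek at the first few terms.
tdm <- TermDocumentMatrix(corpus)
inspect(tdm[1:10, ])
```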
We will show the most significant content of the three corpora using a word cloud chart. It is of interest to get an idea of the most frequently occurring words in the documents. The following code computes word frequencies for each document and orders them from largest to smallest. We report the top 20 most frequent words in each file.
Below is a summary of the most frequent words from the sampled data:
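A sketch of the frequency computation and word cloud, reusing the term-document matrix built above (the wordcloud and RColorBrewer packages are assumed to be installed):

```r
# Sketch: word frequencies, top-20 list and word cloud.
library(wordcloud)
library(RColorBrewer)

freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
head(freq, 20)  # top 20 most frequent words

wordcloud(names(freq), freq, max.words = 100,
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))
```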
We have used the RWeka package to create unigrams, bigrams and trigrams, and the ggplot2 package to plot them in order to evaluate the frequency of the main words in each corpus.
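As an illustration, here is a sketch of the bigram case; the unigram and trigram versions only differ in the min/max values passed to Weka_control:

```r
# Sketch: tokenise the corpus into bigrams with RWeka and plot the
# 20 most frequent ones with ggplot2.
library(RWeka)
library(ggplot2)

BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
bigram_tdm  <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
bigram_freq <- sort(rowSums(as.matrix(bigram_tdm)), decreasing = TRUE)

top_bigrams <- data.frame(ngram = names(bigram_freq)[1:20],
                          freq  = bigram_freq[1:20])

ggplot(top_bigrams, aes(x = reorder(ngram, freq), y = freq)) +
  geom_col() +
  coord_flip() +
  labs(x = "Bigram", y = "Frequency", title = "Top 20 bigrams")
```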
Prediction of the next word in a sentence will depend on the previous N-grams in that sentence or phrase. The prediction algorithm should therefore be based on 2-grams, 3-grams and higher-order n-grams. For example, given a phrase for which we want to predict the next word, separate predictions would be made based on the previous 2-gram, 3-gram, and so on.
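As a rough illustration of the idea (not the final model), the sketch below looks up the most frequent bigram continuation of the last word typed, using the hypothetical bigram_freq table built above; a real predictor would back off between trigrams, bigrams and unigrams:

```r
# Sketch: given the last word of the input, return the most frequent
# word that follows it in the sampled corpus, based on a sorted
# bigram frequency table with names like "last next".
predict_next <- function(last_word, bigram_freq) {
  pattern    <- paste0("^", last_word, " ")
  candidates <- bigram_freq[grepl(pattern, names(bigram_freq))]
  if (length(candidates) == 0) return(NA_character_)
  strsplit(names(candidates)[1], " ")[[1]][2]  # most frequent continuation
}

predict_next("happy", bigram_freq)
```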
After the exploratory analysis, the next steps are:
* Build the predictive model using the sampled data obtained from the previous analysis. The model will subsequently be tested and tweaked to strike a good balance between accuracy and speed.
* Develop the Shiny app and presentation to predict the next word based on user input.