Introduction

This is an intermediate report for the Coursera Data Science Capstone Project. The objective of this step is to understand the data and become comfortable working with it. The exploratory data analysis performed here will surface the key inputs for the prediction algorithm and app. In particular, the main objectives of this report are:

  1. Obtain the data
  2. Create a basic report of summary statistics about the data sets
  3. Report any interesting findings
  4. Get feedback on plans for creating a prediction algorithm and Shiny app

Basic Information about the Corpus Dataset

The data set consists of a collection of text files in four languages: English, Russian, German and Finnish. For each language, three text files exist: blogs, news and twitter. We will only consider the English-language files for this project.

The corpora have been collected from numerous different webpages, with the aim of getting a varied and comprehensive corpus of current use of the respective language. The data sources are newspapers, magazines, blogs (personal and professional) and Twitter updates.

Basic information about the dataset (file size, line, character and word counts for the blogs, news and twitter files):

| File Name     | Size (MB) | Lines   | Non-Empty Lines | Characters | Characters (No Whitespace) | Word Count |
|---------------|-----------|---------|-----------------|------------|----------------------------|------------|
| en_US.blogs   | 200.4242  | 899288  | 899288          | 206824382  | 170389539                  | 37570839   |
| en_US.news    | 196.2775  | 1010242 | 1010242         | 203223154  | 169860866                  | 34494539   |
| en_US.twitter | 159.3641  | 2360148 | 2360148         | 162096031  | 134082634                  | 30451128   |
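
These statistics can be computed with a short R script. Below is a sketch, assuming the files have been downloaded to a `final/en_US/` directory (the path is an assumption):

```r
library(stringi)

# Paths are an assumption: adjust to wherever the corpus was downloaded
files <- c("final/en_US/en_US.blogs.txt",
           "final/en_US/en_US.news.txt",
           "final/en_US/en_US.twitter.txt")

stats <- lapply(files, function(f) {
  lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  data.frame(
    File_Name   = basename(f),
    Size_MB     = round(file.info(f)$size / 1024^2, 4),
    Lines       = length(lines),
    LinesNEmpty = sum(nzchar(trimws(lines))),
    Chars       = sum(nchar(lines)),
    CharsNWhite = sum(nchar(gsub("\\s+", "", lines))),
    Word_Count  = sum(stri_count_words(lines))
  )
})
do.call(rbind, stats)
```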

Selecting a sample and cleaning data

As the previous table shows, the data files are quite large, and exploratory data analysis on the full data set would be too time-consuming. To enable faster exploratory analysis, we will create a random sample of the English-language blogs, news and twitter files, randomly sampling 15,000 lines from each file. These samples will then be written to their own directory for subsequent analysis.

The sample size of 15,000 lines per file was chosen to match my laptop's performance.
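
A minimal sketch of the sampling step; the `sample_file` helper and the `sample/` output directory are assumptions made for illustration:

```r
set.seed(1234)  # fix the seed so the sample is reproducible

# Hypothetical helper: read a file and keep a random subset of its lines
sample_file <- function(infile, outfile, n = 15000) {
  lines <- readLines(infile, encoding = "UTF-8", skipNul = TRUE)
  writeLines(sample(lines, n), outfile)
}

dir.create("sample", showWarnings = FALSE)
sample_file("final/en_US/en_US.blogs.txt",   "sample/en_US.blogs.txt")
sample_file("final/en_US/en_US.news.txt",    "sample/en_US.news.txt")
sample_file("final/en_US/en_US.twitter.txt", "sample/en_US.twitter.txt")
```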

Next, the sampled data is used to create a corpus, and a series of clean-up steps is performed.
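
A sketch of the corpus creation and clean-up using the tm package; the exact set of transformations (lowercasing, removal of punctuation, numbers and stop words, whitespace stripping) is an assumption about what a typical clean-up involves:

```r
library(tm)

# Build a corpus from the sampled files (one document per file)
corpus <- VCorpus(DirSource("sample", encoding = "UTF-8"))

# Typical clean-up transformations (assumed)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
```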

A term-document matrix has then been created from the cleaned corpus.
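
For instance, with the tm package this is a one-liner on the `corpus` object from above:

```r
# One row per term, one column per document (blogs, news, twitter)
tdm <- TermDocumentMatrix(corpus)
inspect(tdm[1:10, ])  # peek at the first few terms
```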

Word Frequencies: Exploring Common Words in the Sampled Corpus

We will show the most significant content of the three corpora with word cloud charts, as it is of interest to get an idea of the most frequently occurring words in the documents. The following code computes word frequencies for each document, orders them from largest to smallest, and reports the top 20 most frequent words in each file.
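
A sketch of that computation, reusing the `tdm` object from the previous step; the word cloud parameters are assumptions:

```r
library(wordcloud)

m <- as.matrix(tdm)
for (doc in colnames(m)) {
  # Order word frequencies for this document from largest to smallest
  freq <- sort(m[, doc], decreasing = TRUE)
  print(head(freq, 20))  # top 20 most frequent words
  wordcloud(names(freq), freq, max.words = 100, random.order = FALSE)
}
```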

Running this code produces a summary of the most frequent words in the sampled data for each file.

N-gram Tokenization

We have used the RWeka package to create unigram, bigram and trigram tokenizers, and then the ggplot2 package to plot the resulting n-gram frequencies, in order to evaluate the most common words and phrases in each corpus.
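
A minimal sketch for the bigram case, assuming the cleaned `corpus` from above; the trigram version only changes the `min`/`max` control values, and the top-20 cut-off and plot styling are my own choices:

```r
library(RWeka)
library(ggplot2)

# Tokenizer that splits text into bigrams (use min = max = 3 for trigrams)
bigram_tok <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

tdm2  <- TermDocumentMatrix(corpus, control = list(tokenize = bigram_tok))
freq2 <- sort(rowSums(as.matrix(tdm2)), decreasing = TRUE)

# Bar chart of the 20 most frequent bigrams
df <- data.frame(ngram = names(freq2)[1:20], count = freq2[1:20])
ggplot(df, aes(x = reorder(ngram, count), y = count)) +
  geom_col() +
  coord_flip() +
  labs(x = "Bigram", y = "Frequency")
```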

Next Steps and Plans for the Prediction Algorithm

Prediction of the next word in a sentence will depend on the previous N-grams in that sentence or phrase, so the prediction algorithm should be based on 2-grams, 3-grams and higher-order n-grams. For example, given a phrase for which we want to predict the next word, I would create separate predictions based on the previous 2-gram, 3-gram, and so on, falling back to shorter n-grams when a longer one has not been observed in the corpus.
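
A minimal back-off sketch of this idea; the helper name `predict_next` is hypothetical, and `freq3`/`freq2` are assumed to be named trigram/bigram count vectors, sorted in decreasing order, built as in the n-gram section:

```r
predict_next <- function(phrase, freq3, freq2) {
  # Keep the last two words of the input phrase
  words <- tail(strsplit(tolower(phrase), "\\s+")[[1]], 2)

  # Try trigrams whose first two words match the end of the phrase
  hits <- grep(paste0("^", paste(words, collapse = " "), " "),
               names(freq3), value = TRUE)

  # Back off to bigrams starting with the last word only
  if (length(hits) == 0) {
    hits <- grep(paste0("^", tail(words, 1), " "), names(freq2), value = TRUE)
  }
  if (length(hits) == 0) return(NA_character_)

  # Counts are sorted decreasing, so the first hit is the most frequent;
  # return its final word as the prediction
  tail(strsplit(hits[1], " ")[[1]], 1)
}
```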

After this exploratory analysis, the next steps are:

  1. Build n-gram frequency tables from a larger sample of the corpus
  2. Implement and test the next-word prediction algorithm sketched above
  3. Develop a Shiny app that takes a phrase as input and suggests the next word