Milestone Report - Exploratory Data Analysis

Introduction

This document is the first report for the milestone project. It explains the major features of the data and summarizes the plans for creating the prediction algorithm and Shiny app.

The report presents the initial steps for working with the raw data: (1) loading the data; (2) summary statistics of the data sets; and (3) other interesting findings.

Important summaries of the data sets are presented as tables and plots.

Part 1: Data Loading

Load the packages needed for text mining, tokenization and plotting:

library(tm)      # text mining framework (VCorpus, tm_map, TermDocumentMatrix)
library(RWeka)   # NGramTokenizer
library(ggplot2) # plotting
library(slam)    # rollup() for sparse matrices

Set working directory

setwd("/Volumes/Data/Coursera/Course10_DataProjectCapstone/")

Load the three files, which were copied into the working directory:
- en_US.blogs.txt
- en_US.news.txt
- en_US.twitter.txt

blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8",skipNul = TRUE)
news <- readLines("en_US.news.txt", encoding = "UTF-8",skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8",skipNul = TRUE)
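
As a hedged aside, the same read can be wrapped in a small helper that opens the file through an explicit binary connection. On some platforms (notably Windows) a text-mode read may stop early at embedded control characters, and a binary connection avoids that. The helper name `read_text_file` and the temporary demo file are assumptions made for illustration; they are not part of the original analysis.

```r
# Hypothetical helper (not from the original report): read all lines of a
# text file through a binary connection and always close the connection.
read_text_file <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Demo on a small temporary file so the sketch runs without the real data
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), tmp)
demo_lines <- read_text_file(tmp)
print(demo_lines)
```

With the real data, the call would look like `blogs <- read_text_file("en_US.blogs.txt")`.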

Part 2: Summary statistics of the data

Summary of file sizes, number of lines and characters in each file

file_names <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
size <- file.size(file_names)
size <- round((size / 1024) / 1000)   # approximate size in MB (bytes / 1024 / 1000)
lines <- sapply(list(blogs, news, twitter), length)   # number of lines per file
characters <- sapply(list(blogs, news, twitter), function(x) sum(nchar(x)))   # total characters per file
# named file_summary to avoid masking base::summary()
file_summary <- data.frame(files = file_names, size, lines, characters)
file_summary

##               files size   lines characters
## 1   en_US.blogs.txt  205  899288  206824505
## 2    en_US.news.txt  201 1010242  203223159
## 3 en_US.twitter.txt  163 2360148  162096241
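
A word count per line is a natural companion to the line and character counts above. The following is a minimal base R sketch on a toy vector; `toy_lines` and `count_words` are illustrative names, and splitting on whitespace is an assumed, simplistic definition of a word. It was not part of the original analysis.

```r
# Toy stand-in for one of the loaded character vectors (assumption: not real data)
toy_lines <- c("The quick brown fox", "jumps over", "the lazy dog today")

# Split each line on runs of whitespace and count the non-empty pieces
count_words <- function(lines) {
  tokens <- strsplit(trimws(lines), "\\s+")
  vapply(tokens, function(tok) sum(nzchar(tok)), integer(1))
}

words_per_line <- count_words(toy_lines)  # 4 2 4
total_words <- sum(words_per_line)        # 10
longest_line <- max(nchar(toy_lines))     # 19 characters
```

Applied to the real data, `sum(count_words(blogs))` would give an approximate word total for the blogs file.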

Part 3: Other findings

Clean the Data and Create a Corpus

To analyze the data, it is necessary to clean it and create a corpus.

Because the three data files are large and would take a long time to process, the cleaning and corpus creation are demonstrated on a subset of en_US.blogs.txt (10% of the original file).

Create subset of blogs

set.seed(1010) # Make process reproducible
sub_blogs = blogs[sample(length(blogs),length(blogs)*0.1)] # make subset
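
The sampling step can be sketched as a reusable helper, shown here on a toy vector. The function name `sample_lines` and its arguments are hypothetical; the seed value simply mirrors the one used above.

```r
# Hypothetical helper: draw a reproducible random subset of lines.
# Note that set.seed() resets the global random number generator state.
sample_lines <- function(lines, fraction, seed = 1010) {
  set.seed(seed)
  n_keep <- floor(length(lines) * fraction)  # make the rounding explicit
  lines[sample(length(lines), n_keep)]       # sample without replacement
}

toy <- paste("line", 1:50)
s1 <- sample_lines(toy, 0.1)
s2 <- sample_lines(toy, 0.1)
identical(s1, s2)  # same seed, same subset
```

The same helper could be applied to `news` and `twitter` if subsets of those files are needed later.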

Creating a corpus and cleaning data

sub_blogs <- iconv(sub_blogs, "latin1", "ASCII", sub = "") # pre-clean: drop non-ASCII characters
sub_blogs_Corpus <- VCorpus(VectorSource(sub_blogs)) # make corpus (one document per line)
sub_blogs_Corpus <- tm_map(sub_blogs_Corpus, stripWhitespace) # remove unnecessary white space
sub_blogs_Corpus <- tm_map(sub_blogs_Corpus, removePunctuation) # remove punctuation
sub_blogs_Corpus <- tm_map(sub_blogs_Corpus, removeNumbers) # remove numbers
sub_blogs_Corpus <- tm_map(sub_blogs_Corpus, content_transformer(tolower)) # convert to lowercase, keeping documents as PlainTextDocument
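
To see what these cleaning steps do to a single string, here is a rough base R approximation. It is only a sketch: `clean_text` is a hypothetical helper, and its regular expressions are assumed stand-ins that may differ in detail from the tm transformations.

```r
# Hypothetical base R approximation of the cleaning pipeline (not the tm code)
clean_text <- function(x) {
  x <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")  # drop non-ASCII characters
  x <- tolower(x)                    # convert to lowercase
  x <- gsub("[[:punct:]]+", "", x)   # remove punctuation
  x <- gsub("[[:digit:]]+", "", x)   # remove numbers
  x <- gsub("\\s+", " ", x)          # collapse runs of whitespace
  trimws(x)                          # drop leading and trailing whitespace
}

cleaned <- clean_text("Hello,   World! It's 2016...")
print(cleaned)  # "hello world its"
```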

Tokenizing, calculating frequencies and making plots of n-grams

After the cleaning process, the data need to be converted into a form that is useful for Natural Language Processing (NLP). In this study, we chose n-grams stored in Term Document Matrices (TDM). An n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram of size 1 is referred to as a “unigram”; size 2 is a “bigram”; size 3 is a “trigram”. Each TDM cell holds the frequency count of the corresponding n-gram in the corresponding document (here, one line of text).
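
The definition of an n-gram can be illustrated with a few lines of base R. This is a minimal sketch on a toy sentence, not the RWeka tokenizer used in this report; `make_ngrams` is a hypothetical helper that assumes words are separated by whitespace.

```r
# Hypothetical helper: all contiguous word sequences of length n in one string
make_ngrams <- function(text, n) {
  tokens <- strsplit(trimws(text), "\\s+")[[1]]
  if (length(tokens) < n) return(character(0))  # text too short for this n
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

sentence <- "i want to go home"
unigrams <- make_ngrams(sentence, 1)  # "i" "want" "to" "go" "home"
bigrams  <- make_ngrams(sentence, 2)  # "i want" "want to" "to go" "go home"
trigrams <- make_ngrams(sentence, 3)  # "i want to" "want to go" "to go home"
```

A sentence of k words yields k - n + 1 n-grams, which is why higher-order n-grams become sparser in a fixed amount of text.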

Create a function that:
- builds an n-gram tokenizer and creates the matrix of n-grams
- calculates the frequency of each n-gram
- plots the 20 most frequent n-grams as a bar plot

n_grams_plot <- function(n, data) {
  options(mc.cores=1)
  
  # Builds n-gram tokenizer 
  tk <- function(x) NGramTokenizer(x, Weka_control(min = n, max = n))
  # Create matrix
  ngrams_matrix <- TermDocumentMatrix(data, control=list(tokenize=tk))
  # make matrix for easy view
  ngrams_matrix <- as.matrix(rollup(ngrams_matrix, 2, na.rm=TRUE, FUN=sum))
  ngrams_matrix <- data.frame(word=rownames(ngrams_matrix), freq=ngrams_matrix[,1])
  # find 20 most frequent n-grams in the matrix
  ngrams_matrix <- ngrams_matrix[order(-ngrams_matrix$freq), ][1:20, ]
  ngrams_matrix$word <- factor(ngrams_matrix$word, as.character(ngrams_matrix$word))
  
  # plots
  ggplot(ngrams_matrix, aes(x=word, y=freq)) + 
    geom_bar(stat="identity") + 
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab("n-grams") + 
    ylab("Frequency")
}
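
The ranking step inside the function (count, sort by frequency, keep the top entries) can be sketched in base R on a toy vector. `top_ngrams` and `toy_bigrams` are illustrative names only, and `table()` is an assumed stand-in for the term-document matrix rollup used above.

```r
# Hypothetical helper: frequency table of n-grams, sorted, top k rows kept
top_ngrams <- function(ngrams, k) {
  counts <- sort(table(ngrams), decreasing = TRUE)
  counts <- counts[seq_len(min(k, length(counts)))]
  data.frame(word = names(counts),
             freq = as.integer(counts),
             stringsAsFactors = FALSE)
}

toy_bigrams <- c("of the", "in the", "of the", "to be", "of the", "in the")
top2 <- top_ngrams(toy_bigrams, 2)
print(top2)  # "of the" with freq 3, then "in the" with freq 2
```

The resulting data frame has the same `word` and `freq` columns that the plotting code expects.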

Plot of frequency distribution of 1-gram

n_grams_plot(n=1, data=sub_blogs_Corpus)

Plot of frequency distribution of 2-gram

n_grams_plot(n=2, data=sub_blogs_Corpus)

Plot of frequency distribution of 3-gram

n_grams_plot(n=3, data=sub_blogs_Corpus)

Plot of frequency distribution of 4-gram

n_grams_plot(n=4, data=sub_blogs_Corpus)

Next step for Prediction Algorithm and Shiny App