This is the Week 2 milestone report for the Data Science Specialization Capstone project. The aim of this report is to provide an initial exploratory data analysis of the dataset. The overall goal of the project is to build a predictive model that suggests the next word or phrase based on the text typed before it. This kind of prediction task falls under Natural Language Processing (NLP).
## Preparing packages
knitr::opts_chunk$set(echo = TRUE)
set.seed(100)
library(knitr)
library(stringi)
library(tm)
## Loading required package: NLP
library(RWeka)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
library(SnowballC)
Only the English-language datasets were used for this report.
blogs<-readLines(con = "../en_US.blogs.txt",encoding = "UTF-8")
twitter<-readLines(con = "../en_US.twitter.txt",encoding = "UTF-8")
## Warning in readLines(con = "../en_US.twitter.txt", encoding = "UTF-8"):
## line 167155 appears to contain an embedded nul
## Warning in readLines(con = "../en_US.twitter.txt", encoding = "UTF-8"):
## line 268547 appears to contain an embedded nul
## Warning in readLines(con = "../en_US.twitter.txt", encoding = "UTF-8"):
## line 1274086 appears to contain an embedded nul
## Warning in readLines(con = "../en_US.twitter.txt", encoding = "UTF-8"):
## line 1759032 appears to contain an embedded nul
news<-readLines(con = "../en_US.news.txt",encoding = "UTF-8")
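The embedded nul warnings above come from stray nul bytes in the twitter file; the affected lines are still read. If desired, the warnings can be avoided by skipping nul characters. A minimal sketch (twitter_nonul is a hypothetical name; this alternative read is not used for the results below):
## Optional sketch: re-read the twitter file, skipping embedded nul characters
twitter_nonul <- readLines(con = "../en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)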
Summary statistics were calculated for each of the three datasets (blogs, news, and twitter): the number of lines, the number of characters, and the number of words. The minimum, mean, and maximum number of words per line were also computed.
WPL=sapply(list(blogs,news,twitter),function(x)summary(stri_count_words(x))[c('Min.','Mean','Max.')])
rownames(WPL)=c('wplMIN','wplMEAN','wplMAX')
stats=data.frame(
Dataset=c("blogs","news","twitter"),
t(rbind(
sapply(list(blogs,news,twitter),stri_stats_general)[c('Lines','Chars'),],
Words=sapply(list(blogs,news,twitter),stri_stats_latex)['Words',],
WPL)
))
head(stats)
## Dataset Lines Chars Words wplMIN wplMEAN wplMAX
## 1 blogs 899288 206824382 37570839 0 41.75 6726
## 2 news 1010242 203223154 34494539 1 34.41 1796
## 3 twitter 2360148 162096031 30451128 1 12.75 47
The summary statistics show that the full datasets are too large to process comfortably, so sampling is essential. Each file was first cleaned by removing non-ASCII characters, and 1% of the lines in each of the blogs, news, and twitter files was then sampled. The three cleaned samples were combined into a single dataset called sample_total.
## Clean the data
blogs <- iconv(blogs, "latin1", "ASCII", sub="")
news <- iconv(news, "latin1", "ASCII", sub="")
twitter <- iconv(twitter, "latin1", "ASCII", sub="")
## Sampling the data
sample_blogs <- sample(blogs,size =1/100*length(blogs))
sample_news<- sample(news,size =1/100*length(news))
sample_twitter <- sample(twitter,size =1/100*length(twitter))
## Combine all the subsample into one sample
sample_total <- c(sample_blogs,sample_news,sample_twitter)
summary(sample_total)
## Length Class Mode
## 42695 character character
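As a rough check on why sampling was necessary, object.size() shows the memory footprint of each full dataset. A quick sketch (assuming the full blogs, news, and twitter objects are still in memory; the exact numbers depend on the machine and R version):
## Approximate in-memory size of each full dataset (sketch)
sapply(list(blogs = blogs, news = news, twitter = twitter),
       function(x) format(object.size(x), units = "MB"))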
Next, a corpus was built with the tm package, which was also used to clean the corpus before analysis. The pre-processing steps applied were: converting the text to lower case, removing punctuation, removing numbers, stripping extra whitespace, removing English stop words, stemming, and converting the documents back to plain text.
corpus <- VCorpus(VectorSource(sample_total))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument)
corpus <- tm_map(corpus, PlainTextDocument)
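To verify that the transformations behaved as expected, the first few cleaned documents can be inspected. A small sketch (output depends on the random sample):
## Inspect a few cleaned documents (sketch; output varies with the sample)
lapply(corpus[1:3], as.character)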
The corpus was then converted into Term Document Matrices (TDMs) of n-grams. An n-gram is a contiguous sequence of n words: unigrams are single words, bigrams are pairs of consecutive words, and trigrams are triples of consecutive words.
The TDMs store the frequencies of the n-grams. To create them, the RWeka package, which links the Weka data mining software to R, was used: its NGramTokenizer() function tokenizes the corpus into n-grams of a chosen length.
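For illustration, applying NGramTokenizer() to a toy sentence shows what the bigram tokenizer defined below produces (a sketch, not part of the analysis):
## Example: bigram tokens of a toy sentence (sketch)
NGramTokenizer("this is a simple test", Weka_control(min = 2, max = 2))
## Expected: "this is" "is a" "a simple" "simple test"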
unigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
bigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
unigrams <- TermDocumentMatrix(corpus, control = list(tokenize = unigram_tokenizer))
bigrams <- TermDocumentMatrix(corpus, control = list(tokenize = bigram_tokenizer))
trigrams <- TermDocumentMatrix(corpus, control = list(tokenize = trigram_tokenizer))
unigrams
## <<TermDocumentMatrix (terms: 41125, documents: 42695)>>
## Non-/sparse entries: 503672/1755328203
## Sparsity : 100%
## Maximal term length: 85
## Weighting : term frequency (tf)
bigrams
## <<TermDocumentMatrix (terms: 405384, documents: 42695)>>
## Non-/sparse entries: 515358/17307354522
## Sparsity : 100%
## Maximal term length: 93
## Weighting : term frequency (tf)
trigrams
## <<TermDocumentMatrix (terms: 471235, documents: 42695)>>
## Non-/sparse entries: 477373/20118900952
## Sparsity : 100%
## Maximal term length: 101
## Weighting : term frequency (tf)
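The matrices above are extremely sparse. If memory becomes a constraint, one option (a sketch, not applied in this report) is tm's removeSparseTerms(), which drops terms that appear in very few documents; unigrams_small is a hypothetical name:
## Optional sketch: shrink the unigram TDM by dropping very sparse terms
unigrams_small <- removeSparseTerms(unigrams, sparse = 0.999)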
With the n-grams tokenized, the frequent terms in each TDM were identified with the findFreqTerms() function from the tm package, and their frequencies were computed with rowSums(). The frequency table for each n-gram size was stored in a separate data frame for plotting later.
unigrams_freqTerm <- findFreqTerms(unigrams,lowfreq = 50)
bigrams_freqTerm <- findFreqTerms(bigrams,lowfreq=50)
trigrams_freqTerm <- findFreqTerms(trigrams,lowfreq=8)
## Unigram frequency dataframe
unigrams_freq <- rowSums(as.matrix(unigrams[unigrams_freqTerm,]))
unigrams_freq <- data.frame(word=names(unigrams_freq), frequency=unigrams_freq)
head(unigrams_freq)
## word frequency
## abil abil 100
## abl abl 308
## absolut absolut 114
## abus abus 67
## accept accept 150
## access access 106
## Bigram frequency dataframe
bigrams_freq <- rowSums(as.matrix(bigrams[bigrams_freqTerm,]))
bigrams_freq <- data.frame(word=names(bigrams_freq), frequency=bigrams_freq)
head(bigrams_freq)
## word frequency
## can get can get 107
## can help can help 52
## can make can make 68
## can see can see 74
## cant wait cant wait 188
## come back come back 84
## Trigram frequency dataframe
trigrams_freq <- rowSums(as.matrix(trigrams[trigrams_freqTerm,]))
trigrams_freq <- data.frame(word=names(trigrams_freq), frequency=trigrams_freq)
head(trigrams_freq)
## word frequency
## blah blah blah blah blah blah 8
## cant wait get cant wait get 13
## cant wait see cant wait see 40
## cinco de mayo cinco de mayo 20
## coupl week ago coupl week ago 8
## didnt even know didnt even know 9
With the n-gram frequency data frames created, the next step is to visualize them as bar charts. A plotting function is defined below to make this easier.
## Function for visualization of n-grams
plot_n_grams <- function(df_gram, title, num, barC) {
  ## Keep the `num` most frequent n-grams
  df_sort <- df_gram[order(-df_gram$frequency), ][1:num, ]
  ggplot(data = df_sort, aes(x = reorder(word, -frequency), y = frequency)) +
    geom_bar(stat = "identity", fill = barC, colour = "black") +
    labs(title = title) +
    xlab("Words") +
    ylab("Count") +
    theme(axis.text.x = element_text(angle = 90))
}
plot_n_grams(unigrams_freq,"Top 10 Unigrams",10,"red")
plot_n_grams(bigrams_freq,"Top 10 Bigrams",10,"green")
plot_n_grams(trigrams_freq,"Top 10 Trigrams",10,"blue")
This report presented the initial exploratory data analysis of the English news, blogs, and twitter datasets. Unigram, bigram, and trigram word frequencies were extracted from a 1% sample of each dataset. The analysis showed that term frequencies decrease as n increases. The next step is to use these n-gram frequency tables to build the predictive model described in the introduction, which will suggest the next word based on the words typed before it.
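As a rough illustration of how such a model might use these tables, the sketch below implements a simple frequency-based backoff over the trigram and bigram frequency data frames built above (predict_next_word is a hypothetical helper, not part of this report; a real model would need smoothing and unstemmed n-grams):
## Hypothetical sketch of a frequency-based backoff predictor (not part of this report)
predict_next_word <- function(phrase, tri_df = trigrams_freq, bi_df = bigrams_freq) {
  tokens <- unlist(strsplit(tolower(phrase), "\\s+"))
  n <- length(tokens)
  ## Try trigrams whose first two words match the last two typed words
  if (n >= 2) {
    prefix <- paste(tokens[n - 1], tokens[n])
    hits <- tri_df[startsWith(as.character(tri_df$word), paste0(prefix, " ")), ]
    if (nrow(hits) > 0) {
      best <- as.character(hits$word[which.max(hits$frequency)])
      return(tail(unlist(strsplit(best, " ")), 1))
    }
  }
  ## Back off to bigrams whose first word matches the last typed word
  hits <- bi_df[startsWith(as.character(bi_df$word), paste0(tokens[n], " ")), ]
  if (nrow(hits) > 0) {
    best <- as.character(hits$word[which.max(hits$frequency)])
    return(tail(unlist(strsplit(best, " ")), 1))
  }
  NA_character_
}
predict_next_word("cant wait")  # e.g. "see", given the sample frequencies shown above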