This is the week 2 milestone report (exploratory analysis) for the Coursera Data Science Capstone project. The main goal of the capstone is an application based on a predictive text model; this report explains the exploratory data analysis and the first steps toward building the prediction algorithm. Briefly, the application takes a word or phrase and then tries to predict the next word. The model will be trained on a collection of English text (a corpus) compiled from 3 sources - news, blogs, and tweets. The main parts of this report are loading and cleaning the data and applying NLP (Natural Language Processing) techniques in R as a first step toward building a predictive model.
Load the required files and set up the work environment.
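The analysis below relies on several packages; as a setup sketch (the package list is inferred from the functions called later in this report), something like the following is assumed to have been loaded:

library(knitr)        # kable() tables
library(stringi)      # stri_stats_general(), stri_stats_latex()
library(tm)           # VCorpus(), tm_map(), TermDocumentMatrix(), findFreqTerms()
library(RWeka)        # NGramTokenizer(), Weka_control()
library(ggplot2)      # n-gram frequency plots
library(gridExtra)    # grid.arrange()
library(wordcloud)    # wordcloud()
library(RColorBrewer) # brewer.pal() color palette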
blogs <- readLines("C:/Users/Chintan/Desktop/capstone/Coursera-SwiftKey/final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("C:/Users/Chintan/Desktop/capstone/Coursera-SwiftKey/final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("C:/Users/Chintan/Desktop/capstone/Coursera-SwiftKey/final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
Statistics
To get a sense of what the data looks like, I summarized the main information for each of the 3 datasets (blogs, news, and Twitter). I calculate the size of each file in MB, the number of lines and words in each file, the number of characters (with and without whitespace), and the maximum number of characters per line, among other details.
Overview <- data.frame(
  FileName = c("blogs", "news", "twitter"),
  MaxCharacters = sapply(list(blogs, news, twitter), function(x) max(nchar(x))),
  File.Size = sapply(list(blogs, news, twitter), function(x) format(object.size(x), "MB")),
  FileSizeinMB = c(file.info("C:/Users/Chintan/Desktop/capstone/Coursera-SwiftKey/final/en_US/en_US.blogs.txt")$size / 1024^2,
                   file.info("C:/Users/Chintan/Desktop/capstone/Coursera-SwiftKey/final/en_US/en_US.news.txt")$size / 1024^2,
                   file.info("C:/Users/Chintan/Desktop/capstone/Coursera-SwiftKey/final/en_US/en_US.twitter.txt")$size / 1024^2),
  t(rbind(sapply(list(blogs, news, twitter), stri_stats_general),
          WordCount = sapply(list(blogs, news, twitter), stri_stats_latex)[4, ]))
)
kable(Overview,caption = "The main datasets")
| FileName | MaxCharacters | File.Size | FileSizeinMB | Lines | LinesNEmpty | Chars | CharsNWhite | WordCount |
|---|---|---|---|---|---|---|---|---|
| blogs | 40833 | 255.4 Mb | 200.4242 | 899288 | 899288 | 206824382 | 170389539 | 37570839 |
| news | 5760 | 19.8 Mb | 196.2775 | 77259 | 77259 | 15639408 | 13072698 | 2651432 |
| twitter | 140 | 319 Mb | 159.3641 | 2360148 | 2360148 | 162096241 | 134082806 | 30451170 |
Statistics to compare all datasets
To summarize all the information so far, I selected a small subset (0.2%) of each dataset and compared it with the full files.
# Draw a 0.2% random sample from each dataset
Blogs_subset <- sample(blogs, length(blogs) * 0.002)
News_subset <- sample(news, length(news) * 0.002)
twitter_subset <- sample(twitter, length(twitter) * 0.002)
# A combined 0.2% sample drawn from all three sources
subset_blog_news_twitter <- c(sample(blogs, length(blogs) * 0.002),
                              sample(news, length(news) * 0.002),
                              sample(twitter, length(twitter) * 0.002))
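Note that sample() is random, so the subset figures in the table below will vary between runs. A one-line addition (not in the original code; the seed value is arbitrary) placed before the sample() calls would make the subsets reproducible:

set.seed(1234)  # arbitrary seed, assumed only for reproducibility of the 0.2% samples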
# The seven objects to compare: the three full datasets, their subsets, and the combined subset
samples <- list(blogs, news, twitter, Blogs_subset, News_subset, twitter_subset, subset_blog_news_twitter)
Overview.after.subset <- data.frame(
  File = c("blogs", "news", "twitter", "Blogs_subset", "News_subset", "twitter_subset", "subset_blog_news_twitter"),
  File.Size = sapply(samples, function(x) format(object.size(x), "MB")),
  Nentries = sapply(samples, length),
  TotalCharacters = sapply(samples, function(x) sum(nchar(x))),
  MaxCharacters = sapply(samples, function(x) max(nchar(x)))
)
kable(Overview.after.subset,caption = "7 datasets")
| File | File.Size | Nentries | TotalCharacters | MaxCharacters |
|---|---|---|---|---|
| blogs | 255.4 Mb | 899288 | 206824505 | 40833 |
| news | 19.8 Mb | 77259 | 15639408 | 5760 |
| twitter | 319 Mb | 2360148 | 162096241 | 140 |
| Blogs_subset | 0.5 Mb | 1798 | 405552 | 2145 |
| News_subset | 0 Mb | 154 | 30906 | 727 |
| twitter_subset | 0.6 Mb | 4720 | 327727 | 140 |
| subset_blog_news_twitter | 1.2 Mb | 6672 | 773546 | 4265 |
First step to clean the data
After reducing the size of each dataset by sampling, the sampled data is used to create a corpus, and the following clean-up steps are applied to it:
- Convert all words to lowercase
- Eliminate punctuation
- Eliminate numbers
- Strip whitespace
- Eliminate banned words
- Stemming using Porter's Stemming Algorithm
- Create plain text format
# Drop non-ASCII characters before building the corpus
Blogs_subset <- iconv(Blogs_subset, "UTF-8", "ASCII", sub = "")
News_subset <- iconv(News_subset, "UTF-8", "ASCII", sub = "")
twitter_subset <- iconv(twitter_subset, "UTF-8", "ASCII", sub = "")
Data_subset <- c(Blogs_subset, News_subset, twitter_subset)
building.corpus <- function(x = Data_subset) {
  corpus <- VCorpus(VectorSource(x))           # build a corpus from the sampled text
  corpus <- tm_map(corpus, tolower)            # convert all words to lowercase
  corpus <- tm_map(corpus, removePunctuation)  # eliminate punctuation
  corpus <- tm_map(corpus, removeNumbers)      # eliminate numbers
  corpus <- tm_map(corpus, stripWhitespace)    # strip extra whitespace
  corpus <- tm_map(corpus, PlainTextDocument)  # convert back to plain text documents
  corpus
}
corpus <- building.corpus(Data_subset)
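The clean-up list above also includes eliminating banned words and Porter stemming, which building.corpus() does not perform. A minimal sketch of how those two extra steps could be applied to the corpus with tm is shown here; the banned-word list is a placeholder, stemDocument() requires the SnowballC package, and these steps were not applied to the results shown below:

# Placeholder list: in practice this would be read from a file of banned/foul words
banned_words <- c("badword1", "badword2")
corpus <- tm_map(corpus, removeWords, banned_words)  # eliminate banned words
corpus <- tm_map(corpus, stemDocument)               # Porter's stemming algorithm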
Breaking a stream of text up into words or short phrases
Now that we have a clean dataset, we need to convert it to a format that is most useful for Natural Language Processing (NLP). I use the tm package together with RWeka's NGramTokenizer to build functions that tokenize the sample and construct term-document matrices of unigrams, bigrams, and trigrams.
# RWeka tokenizers for unigrams, bigrams, and trigrams
uni_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
bi_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tri_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
corpus.uni.matrix <- TermDocumentMatrix(corpus, control = list(tokenize = uni_tokenizer))
corpus.bi.matrix <- TermDocumentMatrix(corpus, control = list(tokenize = bi_tokenizer))
corpus.tri.matrix <- TermDocumentMatrix(corpus, control = list(tokenize = tri_tokenizer))
# Keep only n-grams that appear at least 10 times
corpus.uni <- findFreqTerms(corpus.uni.matrix, lowfreq = 10)
corpus.bi <- findFreqTerms(corpus.bi.matrix, lowfreq = 10)
corpus.tri <- findFreqTerms(corpus.tri.matrix, lowfreq = 10)
# Sum the frequencies of each retained n-gram across documents
corpus.uni.f <- rowSums(as.matrix(corpus.uni.matrix[corpus.uni, ]))
corpus.uni.f <- data.frame(word = names(corpus.uni.f), frequency = corpus.uni.f)
corpus.bi.f <- rowSums(as.matrix(corpus.bi.matrix[corpus.bi, ]))
corpus.bi.f <- data.frame(word = names(corpus.bi.f), frequency = corpus.bi.f)
corpus.tri.f <- rowSums(as.matrix(corpus.tri.matrix[corpus.tri, ]))
corpus.tri.f <- data.frame(word = names(corpus.tri.f), frequency = corpus.tri.f)
kable(head(corpus.uni.f),caption = "Only one word")
|  | word | frequency |
|---|---|---|
| able | able | 39 |
| about | about | 429 |
| above | above | 13 |
| absolutely | absolutely | 13 |
| abuse | abuse | 10 |
| according | according | 19 |
kable(head(corpus.bi.f),caption = "Two words")
|  | word | frequency |
|---|---|---|
| a beautiful | a beautiful | 11 |
| a big | a big | 19 |
| a bit | a bit | 30 |
| a chance | a chance | 12 |
| a couple | a couple | 28 |
| a day | a day | 19 |
kable(head(corpus.tri.f),caption = "Three words")
|  | word | frequency |
|---|---|---|
| a couple of | a couple of | 17 |
| a little bit | a little bit | 10 |
| a long time | a long time | 11 |
| a lot of | a lot of | 33 |
| all of the | all of the | 14 |
| all the time | all the time | 11 |
Frequency of words or short phrases
In this section, I find the most frequently occurring words and phrases in the data, listing the most common unigrams, bigrams, and trigrams. The N-gram representation of a text lists all N-tuples of words that appear in it.
plot.n.grams <- function(data, title, num) {
  # Keep the num most frequent n-grams
  df2 <- data[order(-data$frequency), ][1:num, ]
  ggplot(df2, aes(x = seq_len(num), y = frequency)) +
    geom_bar(stat = "identity", fill = "darkgreen", colour = "black") +
    coord_cartesian(xlim = c(0, num + 1)) +
    labs(title = title) +
    xlab("Words") +
    ylab("Count") +
    scale_x_continuous(breaks = seq_len(num), labels = df2$word) +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))
}
U<-plot.n.grams(corpus.uni.f,"Unigrams",20)
B<-plot.n.grams(corpus.bi.f,"Bigrams",20)
Tr<-plot.n.grams(corpus.tri.f,"Trigrams",20)
gridExtra::grid.arrange(U, B, Tr, ncol = 3)
Alternative graph to quickly see the main words
As an alternative to the plots above, and to give a quick impression of the most common words, I made word clouds showing the most frequent terms in the corpus.
# One word cloud per n-gram table (trigrams, bigrams, unigrams)
corpus.cloud <- list(corpus.tri.f, corpus.bi.f, corpus.uni.f)
par(mfrow = c(1, 3))
for (i in 1:3) {
  wordcloud(corpus.cloud[[i]]$word, corpus.cloud[[i]]$frequency,
            scale = c(3, 1), max.words = 100, random.order = FALSE, rot.per = 0,
            fixed.asp = TRUE, use.r.layout = FALSE, colors = brewer.pal(8, "Dark2"))
}
Finally, the next steps are to build the predictive algorithm and deploy it as a Shiny app. Briefly, the plan is to add a filter based on a file of foul (banned) words and remove those words from the data. There is also a second approach I want to try: filling in all the spaces between words and then cutting the text into similar short parts, so that phrases can be identified as a single word. Both algorithms will be based on n-gram frequencies.
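To illustrate the frequency-based idea, here is a minimal sketch of a next-word lookup built on the bigram frequency table corpus.bi.f from above; the function name and the simple "most frequent continuation" rule are my own placeholders, not the final algorithm:

# Hypothetical helper: return the most frequent word that follows `word` in the bigram table
predict_next_word <- function(word, bigrams = corpus.bi.f) {
  word <- tolower(word)
  parts <- strsplit(as.character(bigrams$word), " ")   # bigrams are stored as "word1 word2"
  first <- sapply(parts, `[`, 1)
  second <- sapply(parts, `[`, 2)
  hits <- which(first == word)
  if (length(hits) == 0) return(NA_character_)         # no match; a real model would back off
  second[hits[which.max(bigrams$frequency[hits])]]
}
predict_next_word("a")  # e.g. the most frequent continuation of "a" in the sample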
I will build the UI of the Shiny app, which will consist of a text input box that allows a user to enter a word or phrase.
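A minimal sketch of what that UI could look like, reusing the predict_next_word() helper sketched above; the widget IDs and server logic are placeholders rather than the final app:

library(shiny)

ui <- fluidPage(
  titlePanel("Next word prediction"),
  textInput("user_text", "Enter a word or phrase:"),
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    words <- strsplit(trimws(tolower(input$user_text)), "\\s+")[[1]]
    if (length(words) == 0) return("")
    pred <- predict_next_word(words[length(words)])    # predict from the last word entered
    if (is.na(pred)) "(no prediction)" else pred
  })
}

# shinyApp(ui, server)  # would launch the app interactively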