Introduction

The main goal of the capstone project is to build an application based on a predictive text model, starting from exploratory data analysis and then building the prediction algorithm. Briefly, the application takes a word as input and then tries to predict the next word. The model will be trained on a collection of English text (a corpus) compiled from 3 sources: news, blogs, and tweets. The main parts of this report are loading and cleaning the data and applying NLP (Natural Language Processing) tools in R.
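
As a rough illustration of the idea of next-word prediction (not the model built in this project), the following minimal sketch counts adjacent word pairs in a tiny made-up corpus and returns the most frequent follower of a given word. The toy sentences and the helper name `predict_next` are assumptions introduced only for this example.

```r
# Toy corpus; the project itself trains on news, blogs, and tweets
toy_text <- c("i love you", "i love pizza", "i love you too", "we love you")

# Split each line into words and collect adjacent word pairs (bigrams)
tokens <- strsplit(tolower(toy_text), "\\s+")
pairs <- do.call(rbind, lapply(tokens, function(w) {
  if (length(w) < 2) return(NULL)
  data.frame(first = head(w, -1), second = tail(w, -1), stringsAsFactors = FALSE)
}))

# predict_next is a hypothetical helper: it returns the word that most often
# follows `word` in the toy corpus, or NA if the word was never seen
predict_next <- function(word, pairs) {
  candidates <- pairs$second[pairs$first == tolower(word)]
  if (length(candidates) == 0) return(NA_character_)
  names(which.max(table(candidates)))
}

predict_next("love", pairs)
```

A real model would also need a fallback for unseen words and longer contexts (trigrams and beyond), which this sketch leaves out.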

Loading packages

Step 1: Load the required libraries.

Sys.setenv(JAVA_HOME="c:\\Program Files\\Java\\jdk1.8.0_161\\jre")
library(rJava)
library(knitr)

## Warning: package 'knitr' was built under R version 3.3.3

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tm)

## Loading required package: NLP

#install.packages("RWekajars")
library(RWekajars)
#install.packages("RWeka")
library(RWeka)
library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following object is masked from 'package:NLP':
## 
##     annotate

library(stringi)

## Warning: package 'stringi' was built under R version 3.3.3

library(NLP)
library(RColorBrewer)
library(wordcloud)
#install.packages("ngram")
library(ngram)
library(slam)
#install.packages("htmlTable")
library(htmlTable)
library(xtable)

Loading Data

Step 2: Load the required files and set up the work environment.

fileName_blog="final/en_US/en_US.blogs.txt"
con=file(fileName_blog,open="r")
lineBlogs=readLines(con, encoding = "UTF-8", skipNul = TRUE) 
longBlogs=length(lineBlogs)
close(con)

fileName_news="final/en_US/en_US.news.txt"
con=file(fileName_news,open="r")
lineNews=readLines(con, encoding = "UTF-8", skipNul = TRUE)

## Warning in readLines(con, encoding = "UTF-8", skipNul = TRUE): incomplete
## final line found on 'final/en_US/en_US.news.txt'

longNews=length(lineNews)
close(con)

fileName_twitter="final/en_US/en_US.twitter.txt"
con=file(fileName_twitter,open="r")
lineTwitter=readLines(con, encoding = "UTF-8", skipNul = TRUE) 
longTwitter=length(lineTwitter)
close(con)
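
The "incomplete final line" warning above can be a sign that reading in text mode stopped before the end of the file, which is one possible explanation for the news data being much smaller in memory than on disk. A minimal sketch of a reader that opens the connection in binary mode and always closes it is shown below; `read_text_lines` is a hypothetical helper, and whether it changes the line counts depends on the platform and the file contents.

```r
# read_text_lines is a hypothetical helper, not part of the original report
read_text_lines <- function(path) {
  con <- file(path, open = "rb")  # binary mode: control characters do not end the read
  on.exit(close(con))             # close the connection even if readLines fails
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Small self-contained check on a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), tmp)
demo_lines <- read_text_lines(tmp)
length(demo_lines)
```

With such a helper, the three files could be loaded as `read_text_lines(fileName_news)` and so on, without repeating the open and close calls.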

Overview

Step 3: Statistics

To get a sense of what the data looks like, the main information from each of the 3 datasets (Blogs, News and Twitter) is summarized.

The summary covers the size of each file in MB, the number of lines, characters, and words in each file, the maximum number of characters per line in each file, and other details.

datasets <- list(lineBlogs, lineNews, lineTwitter)
Overview <- data.frame(
  FileName = c("lineBlogs", "lineNews", "lineTwitter"),
  # Longest line in each dataset, in characters
  "MaxCharacters" = sapply(datasets, function(x) max(nchar(x))),
  # Size of each object in memory
  "File.Size" = sapply(datasets, function(x) format(object.size(x), "MB")),
  # Size of each file on disk, in MB
  FileSizeinMB = file.info(c(fileName_blog, fileName_news, fileName_twitter))$size / 1024^2,
  # Line and character counts from stringi, plus the word count (row 4 of stri_stats_latex)
  t(rbind(sapply(datasets, stri_stats_general),
          WordCount = sapply(datasets, stri_stats_latex)[4, ]))
)

Visualise the data in the table

kable(Overview,caption = "The main datasets")

The main datasets
FileName	MaxCharacters	File.Size	FileSizeinMB	Lines	LinesNEmpty	Chars	CharsNWhite	WordCount
lineBlogs	40833	248.5 Mb	200.4242	899288	899288	206824382	170389539	37570839
lineNews	5760	19.2 Mb	196.2775	77259	77259	15639408	13072698	2651432
lineTwitter	140	301.4 Mb	159.3641	2360148	2360148	162096241	134082806	30451170
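
The table above does not include an average word count per line. A minimal base-R sketch of how it could be derived is given below; `line_stats` is a hypothetical helper, and it approximates words as runs of non-whitespace characters, so its counts may differ slightly from those of `stri_stats_latex`.

```r
# line_stats is a hypothetical helper; words are approximated as runs of
# non-whitespace characters
line_stats <- function(lines) {
  # Empty or whitespace-only lines split into zero words
  words_per_line <- lengths(strsplit(trimws(lines), "\\s+"))
  data.frame(
    Lines = length(lines),
    Words = sum(words_per_line),
    AvgWordsPerLine = mean(words_per_line),
    MaxChars = max(nchar(lines))
  )
}

demo_stats <- line_stats(c("the quick brown fox", "jumps over", "the lazy dog"))
demo_stats
```

Applied to the full datasets, this would be called as `line_stats(lineBlogs)`, `line_stats(lineNews)`, and `line_stats(lineTwitter)`.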

Overview of the sample data

Step 4: Statistics to compare all datasets

To summarize the information so far, select a small subset (0.2%) of each dataset and compare it with the full files.

Blogs_subset <- sample(lineBlogs, length(lineBlogs) * 0.002)
News_subset <- sample(lineNews, length(lineNews) * 0.002)
twitter_subset <- sample(lineTwitter, length(lineTwitter) * 0.002)


subset_blog_news_twitter<-c(sample(lineBlogs, length(lineBlogs) * 0.002),
             sample(lineNews, length(lineNews) * 0.002),
             sample(lineTwitter, length(lineTwitter) * 0.002))

# All seven datasets: the three full files, their subsets, and the combined sample
all.sets <- list(lineBlogs, lineNews, lineTwitter, Blogs_subset, News_subset, twitter_subset, subset_blog_news_twitter)
Overview.after.subset <- data.frame('File' = c("lineBlogs","lineNews","lineTwitter","Blogs_subset","News_subset","twitter_subset","subset_blog_news_twitter"),
                      "File Size" = sapply(all.sets, function(x) format(object.size(x), "MB")),
                      'Nentries' = sapply(all.sets, length),
                      'TotalCharacters' = sapply(all.sets, function(x) sum(nchar(x))),
                      'MaxCharacters' = sapply(all.sets, function(x) max(nchar(x)))
)

Visualise the data of the subsets in the table

kable(Overview.after.subset,caption = "7 datasets")

7 datasets
File	File.Size	Nentries	TotalCharacters	MaxCharacters
lineBlogs	248.5 Mb	899288	206824505	40833
lineNews	19.2 Mb	77259	15639408	5760
lineTwitter	301.4 Mb	2360148	162096241	140
Blogs_subset	0.5 Mb	1798	420411	2442
News_subset	0 Mb	154	32193	1098
twitter_subset	0.6 Mb	4720	325881	140
subset_blog_news_twitter	1.2 Mb	6672	783546	9810
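
Because `sample` draws at random, the subset sizes above are stable but the sampled lines, and therefore the character counts, change on every run. A minimal sketch of making the draw reproducible is shown below; `sample_fraction` is a hypothetical helper and the seed value is arbitrary.

```r
# sample_fraction is a hypothetical helper; the seed value is arbitrary
sample_fraction <- function(lines, fraction, seed = 1234) {
  set.seed(seed)                                  # fix the random number stream
  sample(lines, floor(length(lines) * fraction))  # draw without replacement
}

demo <- paste("line", 1:1000)
first_draw  <- sample_fraction(demo, 0.002)
second_draw <- sample_fraction(demo, 0.002)
identical(first_draw, second_draw)  # same seed, same sample
```

Calling `set.seed` once at the top of the report would have a similar effect for all three `sample` calls.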

Corpus process

Step 5: First step to clean the data

After reducing the size of each loaded dataset, the sampled data is used to create a corpus, and the following clean-up steps are planned. Steps 5 and 6 are not part of the function shown below.

1) Convert all words to lowercase using tolower
2) Eliminate punctuation using removePunctuation
3) Eliminate numbers using removeNumbers
4) Strip whitespace using stripWhitespace
5) Eliminate banned words
6) Stemming using Porter's Stemming Algorithm
7) Create Plain Text Format using PlainTextDocument

Blogs_subset <- iconv(Blogs_subset, "UTF-8", "ASCII", sub="")
News_subset <- iconv(News_subset, "UTF-8", "ASCII", sub="")
twitter_subset <- iconv(twitter_subset, "UTF-8", "ASCII", sub="")
Data_subset <- c(Blogs_subset,News_subset,twitter_subset)



building.corpus <- function (x = Data_subset) {
  corpus <- VCorpus(VectorSource(x))                      # use the argument, not the global variable
  corpus <- tm_map(corpus, content_transformer(tolower))  # keep the documents as tm documents
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, PlainTextDocument)
  corpus                                                  # return the cleaned corpus
}
corpues <- building.corpus(Data_subset)
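
Steps 5 and 6 of the list above (banned words and stemming) do not appear in `building.corpus`. A minimal sketch of how they could be added with tm is shown below on a toy corpus. The `banned_words` vector is a hypothetical placeholder, not a real profanity list, and `stemDocument` assumes that the SnowballC package is installed.

```r
library(tm)  # stemDocument additionally needs the SnowballC package installed

# Hypothetical placeholder list; a real list would typically be loaded from a file
banned_words <- c("darn", "heck")

toy <- VCorpus(VectorSource(c("Darn these running dogs",
                              "Heck the cats were jumping")))
toy <- tm_map(toy, content_transformer(tolower))
toy <- tm_map(toy, removeWords, banned_words)  # step 5: drop banned words
toy <- tm_map(toy, stemDocument)               # step 6: Porter stemming
toy <- tm_map(toy, stripWhitespace)

# Extract the cleaned text of each document as a plain character vector
cleaned <- unname(sapply(toy, as.character))
cleaned
```

Whether stemming helps a next-word predictor is a design choice: it shrinks the vocabulary, but the predicted words would then be stems rather than full words.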

Tokenize

Step 6: Breaking a stream of text up into words or short phrases

The tm and RWeka packages are used to build functions that tokenize the sample and construct matrices of unigrams, bigrams, and trigrams. Now that we have a clean dataset, we need to convert it to a format that is more useful for Natural Language Processing (NLP).

#Unigrams
uni_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))

#Bigrams
bi_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

#Trigrams
tri_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

# Make Term Document Matrix
corpus.uni.matrix <- TermDocumentMatrix(corpues, control = list(tokenize = uni_tokenizer))
corpus.bi.matrix<- TermDocumentMatrix(corpues, control = list(tokenize = bi_tokenizer))
corpus.tri.matrix <- TermDocumentMatrix(corpues, control = list(tokenize = tri_tokenizer))

corpus.uni <- findFreqTerms(corpus.uni.matrix,lowfreq = 10)
corpus.bi <- findFreqTerms(corpus.bi.matrix,lowfreq=10)
corpus.tri <- findFreqTerms(corpus.tri.matrix,lowfreq=10)

corpus.uni.f <- rowSums(as.matrix(corpus.uni.matrix[corpus.uni,]))
corpus.uni.f <- data.frame(word=names(corpus.uni.f), frequency=corpus.uni.f)
corpus.bi.f <- rowSums(as.matrix(corpus.bi.matrix[corpus.bi,]))
corpus.bi.f <- data.frame(word=names(corpus.bi.f), frequency=corpus.bi.f)
corpus.tri.f <- rowSums(as.matrix(corpus.tri.matrix[corpus.tri,]))
corpus.tri.f <- data.frame(word=names(corpus.tri.f), frequency=corpus.tri.f)
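
To make the tokenization step concrete, here is a minimal base-R sketch of what an n-gram tokenizer followed by a frequency count computes. It does not use RWeka or tm, it splits on whitespace only, and `count_ngrams` is a hypothetical helper introduced for this illustration.

```r
# count_ngrams is a hypothetical helper that mimics, in base R, what an
# n-gram tokenizer followed by a frequency count produces
count_ngrams <- function(lines, n) {
  grams <- unlist(lapply(strsplit(tolower(lines), "\\s+"), function(w) {
    w <- w[nzchar(w)]
    if (length(w) < n) return(character(0))
    # Paste every window of n consecutive words into one n-gram
    starts <- seq_len(length(w) - n + 1)
    vapply(starts, function(i) paste(w[i:(i + n - 1)], collapse = " "), character(1))
  }))
  counts <- sort(table(grams), decreasing = TRUE)
  data.frame(word = names(counts), frequency = as.vector(counts),
             stringsAsFactors = FALSE)
}

toy_lines <- c("a lot of people", "a lot of time", "quite a lot")
toy_bigrams  <- count_ngrams(toy_lines, 2)
toy_trigrams <- count_ngrams(toy_lines, 3)
head(toy_bigrams)
```

The result has the same `word` and `frequency` columns as the data frames built above, so it can be inspected in the same way.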

Visualisation of Unigrams

kable(head(corpus.uni.f),caption = "Only one word")

Only one word
	word	frequency
able	able	32
about	about	436
above	above	19
absolutely	absolutely	20
access	access	11
according	according	20

Visualisation of Bigrams

kable(head(corpus.bi.f),caption = "Two words")

Two words
	word	frequency
a bad	a bad	10
a beautiful	a beautiful	11
a better	a better	13
a big	a big	22
a bit	a bit	44
a book	a book	10

Visualisation of Trigrams

kable(head(corpus.tri.f),caption = "Three words")

Three words
	word	frequency
a couple of	a couple of	28
a long time	a long time	10
a lot of	a lot of	37
all of the	all of the	13
all the time	all the time	10
and i am	and i am	11

Calculate Frequencies of N-Grams

Step 7: Frequency of words or short phrases

In this section, we find the most frequently occurring words in the data and plot the most common unigrams, bigrams, and trigrams. The N-gram representation of a text lists all sequences of N consecutive words that appear in it.

plot.n.grams <- function(data, title, num) {
  # Keep the num most frequent n-grams, ordered from most to least frequent
  df2 <- head(data[order(-data$frequency), ], num)
  # Fix the factor levels so the bars keep the frequency order on the x axis
  words <- as.character(df2$word)
  df2$word <- factor(words, levels = words)
  ggplot(df2, aes(x = word, y = frequency)) +
    geom_bar(stat = "identity", fill = "darkgreen", colour = "black") +
    labs(title = title) +
    xlab("Words") +
    ylab("Count") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))
}

U <- plot.n.grams(corpus.uni.f, "Unigrams", 20)
B <- plot.n.grams(corpus.bi.f, "Bigrams", 20)
Tr <- plot.n.grams(corpus.tri.f, "Trigrams", 20)
gridExtra::grid.arrange(U, B, Tr, ncol = 3)

Wordcloud

For a better visualisation, we build a word cloud based on the frequencies of the N-grams.