Introduction

The main goal of the capstone project is to build an application based on a predictive text model, using exploratory data analysis to guide the design of the prediction algorithm. Briefly, the application takes a word as input and tries to predict the next word. The model will be trained on a collection of English text (a corpus) compiled from 3 sources: news, blogs, and tweets. The main parts of this report are loading and cleaning the data and applying NLP (Natural Language Processing) tools in R.

Loading packages

Step 1: Load the required libraries.

Sys.setenv(JAVA_HOME="c:\\Program Files\\Java\\jdk1.8.0_161\\jre")
library(rJava)
library(knitr)
## Warning: package 'knitr' was built under R version 3.3.3
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tm)
## Loading required package: NLP
#install.packages("RWekajars")
library(RWekajars)
#install.packages("RWeka")
library(RWeka)
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
library(stringi)
## Warning: package 'stringi' was built under R version 3.3.3
library(NLP)
library(RColorBrewer)
library(wordcloud)
#install.packages("ngram")
library(ngram)
library(slam)
#install.packages("htmlTable")
library(htmlTable)
library(xtable)

Loading Data

Step 2: Load the required files and set up the work environment.

fileName_blog="final/en_US/en_US.blogs.txt"
con=file(fileName_blog,open="r")
lineBlogs=readLines(con, encoding = "UTF-8", skipNul = TRUE) 
longBlogs=length(lineBlogs)
close(con)

fileName_news="final/en_US/en_US.news.txt"
con=file(fileName_news,open="r")
lineNews=readLines(con, encoding = "UTF-8", skipNul = TRUE) 
## Warning in readLines(con, encoding = "UTF-8", skipNul = TRUE): incomplete
## final line found on 'final/en_US/en_US.news.txt'
longNews=length(lineNews)
close(con)

fileName_twitter="final/en_US/en_US.twitter.txt"
con=file(fileName_twitter,open="r")
lineTwitter=readLines(con, encoding = "UTF-8", skipNul = TRUE) 
longTwitter=length(lineTwitter)
close(con)
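
The three read blocks above repeat the same steps, and the warning about an incomplete final line in the news file hints that the read may have stopped early. A minimal sketch of a reusable helper follows. The name read_lines_binary is hypothetical and not part of the original analysis. It opens the connection in binary mode ("rb"), a commonly suggested workaround when a text file contains control characters that text mode may treat as end-of-file on Windows. Whether that applies to these files is an assumption that should be checked against the line counts.

```r
# Hypothetical helper (not from the original report): read a text file
# through a binary-mode connection and always close the connection.
read_lines_binary <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))  # runs even if readLines() fails
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Self-contained demonstration on a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), tmp)
demo_lines <- read_lines_binary(tmp)
length(demo_lines)
unlink(tmp)
```

With the real data, a call such as read_lines_binary(fileName_news) would stand in for the open, read, and close steps.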

Overview

Step 3: statistics

To get a sense of what the data looks like, the main information from each of the 3 datasets (Blog, News and Twitter) is summarized.

The following statistics are calculated for each file: size in MB, number of lines and words (from which the average word count per line follows), number of characters, maximum number of characters per line, and other details.

Overview <- data.frame(
  FileName=c("lineBlogs","lineNews","lineTwitter"),
  "MaxCharacters" = sapply(list(lineBlogs, lineNews, lineTwitter), function(x){max(unlist(lapply(x, function(y) nchar(y))))}),
  "File.Size" = sapply(list(lineBlogs, lineNews, lineTwitter), function(x){format(object.size(x),"MB")}),
  FileSizeinMB=c(file.info(fileName_blog)$size/1024^2,
                 file.info(fileName_news)$size/1024^2,
                 file.info(fileName_twitter)$size/1024^2),
  t(rbind(sapply(list(lineBlogs,lineNews,lineTwitter),stri_stats_general),
          WordCount=sapply(list(lineBlogs,lineNews,lineTwitter),stri_stats_latex)[4,])
    )
)

Visualise the data in the table

kable(Overview,caption = "The main datasets")
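The main datasets

| FileName    | MaxCharacters | File.Size | FileSizeinMB | Lines   | LinesNEmpty | Chars     | CharsNWhite | WordCount |
|:------------|--------------:|:----------|-------------:|--------:|------------:|----------:|------------:|----------:|
| lineBlogs   |         40833 | 248.5 Mb  |     200.4242 |  899288 |      899288 | 206824382 |   170389539 |  37570839 |
| lineNews    |          5760 | 19.2 Mb   |     196.2775 |   77259 |       77259 |  15639408 |    13072698 |   2651432 |
| lineTwitter |           140 | 301.4 Mb  |     159.3641 | 2360148 |     2360148 | 162096241 |   134082806 |  30451170 |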
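
To make the statistics above concrete, here is a minimal base R sketch on a toy vector. The names toy_lines and toy_summary are hypothetical, and words are counted by splitting on whitespace, which is a simplifying assumption and may differ slightly from the counts stri_stats_latex reports.

```r
# Toy stand-in for one of the line vectors (e.g. lineBlogs)
toy_lines <- c("the quick brown fox", "jumps over", "the lazy dog today")

# Split each line on runs of whitespace and count the pieces
words_per_line <- lengths(strsplit(trimws(toy_lines), "\\s+"))

toy_summary <- data.frame(
  Lines            = length(toy_lines),
  Chars            = sum(nchar(toy_lines)),
  MaxCharacters    = max(nchar(toy_lines)),  # nchar() is vectorised
  WordCount        = sum(words_per_line),
  MeanWordsPerLine = mean(words_per_line)
)
toy_summary
```

Because nchar() is vectorised, max(nchar(x)) gives the longest line directly, without looping over the elements.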

Overview of the sample data

Step 4: Statistics to compare all the datasets

To summarize the information so far, select a small subset of each dataset and compare it with the main files.

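# Draw a random 0.2% sample of lines from each source
Blogs_subset <- sample(lineBlogs, length(lineBlogs) * 0.002)
News_subset <- sample(lineNews, length(lineNews) * 0.002)
twitter_subset <- sample(lineTwitter, length(lineTwitter) * 0.002)

# Note: this draws fresh, independent samples, so the combined vector
# does not contain the same lines as the three subsets above
subset_blog_news_twitter<-c(sample(lineBlogs, length(lineBlogs) * 0.002),
             sample(lineNews, length(lineNews) * 0.002),
             sample(lineTwitter, length(lineTwitter) * 0.002))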

Overview.after.subset <- data.frame('File' = c("lineBlogs","lineNews","lineTwitter","Blogs_subset","News_subset","twitter_subset","subset_blog_news_twitter"),
                      "File Size" = sapply(list(lineBlogs,lineNews,lineTwitter,Blogs_subset,News_subset,twitter_subset,subset_blog_news_twitter), function(x){format(object.size(x),"MB")}),
                      'Nentries' = sapply(list(lineBlogs,lineNews,lineTwitter,Blogs_subset,News_subset,twitter_subset,subset_blog_news_twitter), function(x){length(x)}),
                      'TotalCharacters' = sapply(list(lineBlogs,lineNews,lineTwitter,Blogs_subset,News_subset,twitter_subset,subset_blog_news_twitter), function(x){sum(nchar(x))}),
                      'MaxCharacters' = sapply(list(lineBlogs,lineNews,lineTwitter,Blogs_subset,News_subset,twitter_subset,subset_blog_news_twitter), function(x){max(unlist(lapply(x, function(y) nchar(y))))})
)

Visualise the data of the subsets in the table

kable(Overview.after.subset,caption = "7 datasets")
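7 datasets

| File                     | File.Size | Nentries | TotalCharacters | MaxCharacters |
|:-------------------------|:----------|---------:|----------------:|--------------:|
| lineBlogs                | 248.5 Mb  |   899288 |       206824505 |         40833 |
| lineNews                 | 19.2 Mb   |    77259 |        15639408 |          5760 |
| lineTwitter              | 301.4 Mb  |  2360148 |       162096241 |           140 |
| Blogs_subset             | 0.5 Mb    |     1798 |          420411 |          2442 |
| News_subset              | 0 Mb      |      154 |           32193 |          1098 |
| twitter_subset           | 0.6 Mb    |     4720 |          325881 |           140 |
| subset_blog_news_twitter | 1.2 Mb    |     6672 |          783546 |          9810 |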
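
Because sample() is random, the subset figures change from run to run unless a seed is fixed. The sketch below shows one way to make the sampling reproducible and to build the combined set from the subsets already drawn. The seed value, the helper name sample_fraction, and the toy vectors are all assumptions made for illustration.

```r
set.seed(1234)  # arbitrary seed, chosen only for illustration

# Toy stand-ins for lineBlogs, lineNews and lineTwitter
toy_blogs   <- paste("blog line", 1:1000)
toy_news    <- paste("news line", 1:500)
toy_twitter <- paste("tweet", 1:2000)

# Draw a fixed fraction of lines from a character vector
sample_fraction <- function(x, fraction) {
  sample(x, size = floor(length(x) * fraction))
}

toy_blogs_subset   <- sample_fraction(toy_blogs, 0.01)
toy_news_subset    <- sample_fraction(toy_news, 0.01)
toy_twitter_subset <- sample_fraction(toy_twitter, 0.01)

# Combine the subsets already drawn instead of sampling again
toy_combined <- c(toy_blogs_subset, toy_news_subset, toy_twitter_subset)
length(toy_combined)
```

Combining the existing subsets means the per-source rows and the combined row of a summary table describe the same lines.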

Corpus process

Step 5: First step to clean the data

After reducing the size of each loaded dataset, the sampled data is used to create a corpus, and the following clean-up steps are performed.

1)Convert all words to lowercase using tolower
2)Eliminate punctuation using removePunctuation
3)Eliminate numbers using removeNumbers
4)Strip whitespace using stripWhitespace
5)Eliminate banned words
6)Stemming Using Porter’s Stemming Algorithm
7)Create Plain Text Format using PlainTextDocument

Blogs_subset <- iconv(Blogs_subset, "UTF-8", "ASCII", sub="")
News_subset <- iconv(News_subset, "UTF-8", "ASCII", sub="")
twitter_subset <- iconv(twitter_subset, "UTF-8", "ASCII", sub="")
Data_subset <- c(Blogs_subset,News_subset,twitter_subset)



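building.corpus <- function (x = Data_subset) {
  # Build the corpus from the argument, not from the global Data_subset
  corpus <- VCorpus(VectorSource(x))
  # content_transformer() keeps each element a tm document after tolower
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, PlainTextDocument)
  corpus  # return the cleaned corpus explicitly
}
corpues <- building.corpus(Data_subset)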
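
The code shown here does not include steps 5 and 6 from the list. As a rough illustration of step 5, the base R sketch below removes whole-word matches of a banned-word list. The word list, the helper name remove_banned, and the toy text are hypothetical. In a tm pipeline, removeWords would be the analogous transformation, and stemDocument (which relies on the SnowballC package) would cover step 6.

```r
# Hypothetical list of words to drop; a real list would be loaded from a file
banned_words <- c("darn", "heck")

# Remove whole-word matches of any banned word, then tidy the whitespace
remove_banned <- function(text, banned) {
  pattern <- paste0("\\b(", paste(banned, collapse = "|"), ")\\b")
  cleaned <- gsub(pattern, "", text, ignore.case = TRUE, perl = TRUE)
  trimws(gsub("\\s+", " ", cleaned, perl = TRUE))
}

toy_text <- c("well darn it", "what the heck happened", "nothing to remove")
cleaned_text <- remove_banned(toy_text, banned_words)
cleaned_text
```

The word-boundary anchors keep the filter from touching longer words that merely contain a banned word.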

Tokenize

Step 6: Breaking a stream of text up into words or short phrases

Now that we have a clean dataset, we need to convert it to a format that is most useful for Natural Language Processing (NLP). Using the tm and RWeka packages, we construct functions that tokenize the sample and build term-document matrices of unigrams, bigrams, and trigrams.

#Unigrams
uni_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))

#Bigrams
bi_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

#Trigrams
tri_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

# Make Term Document Matrix
corpus.uni.matrix <- TermDocumentMatrix(corpues, control = list(tokenize = uni_tokenizer))
corpus.bi.matrix<- TermDocumentMatrix(corpues, control = list(tokenize = bi_tokenizer))
corpus.tri.matrix <- TermDocumentMatrix(corpues, control = list(tokenize = tri_tokenizer))

corpus.uni <- findFreqTerms(corpus.uni.matrix,lowfreq = 10)
corpus.bi <- findFreqTerms(corpus.bi.matrix,lowfreq=10)
corpus.tri <- findFreqTerms(corpus.tri.matrix,lowfreq=10)

corpus.uni.f <- rowSums(as.matrix(corpus.uni.matrix[corpus.uni,]))
corpus.uni.f <- data.frame(word=names(corpus.uni.f), frequency=corpus.uni.f)
corpus.bi.f <- rowSums(as.matrix(corpus.bi.matrix[corpus.bi,]))
corpus.bi.f <- data.frame(word=names(corpus.bi.f), frequency=corpus.bi.f)
corpus.tri.f <- rowSums(as.matrix(corpus.tri.matrix[corpus.tri,]))
corpus.tri.f <- data.frame(word=names(corpus.tri.f), frequency=corpus.tri.f)
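
To show what the tokenizers and frequency tables above amount to, here is a minimal base R sketch that builds n-grams by hand and counts them. The function make_ngrams and the toy lines are hypothetical. It is a simplified stand-in for illustration, not the RWeka implementation.

```r
# Build all n-grams (as space-separated strings) from one line of text
make_ngrams <- function(line, n) {
  words <- strsplit(trimws(line), "\\s+")[[1]]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

toy_lines <- c("a lot of people", "a lot of time", "a couple of days")

# Count bigrams across all lines and sort by frequency
bigrams <- unlist(lapply(toy_lines, make_ngrams, n = 2))
bigram_counts <- sort(table(bigrams), decreasing = TRUE)
head(bigram_counts, 3)
```

Each line of k words yields k - n + 1 n-grams, so lines shorter than n contribute nothing.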

Visualisation of Unigrams

kable(head(corpus.uni.f),caption = "Only one word")
Only one word

|            | word       | frequency |
|:-----------|:-----------|----------:|
| able       | able       |        32 |
| about      | about      |       436 |
| above      | above      |        19 |
| absolutely | absolutely |        20 |
| access     | access     |        11 |
| according  | according  |        20 |

Visualisation of Bigrams

kable(head(corpus.bi.f),caption = "Two words")
Two words

|             | word        | frequency |
|:------------|:------------|----------:|
| a bad       | a bad       |        10 |
| a beautiful | a beautiful |        11 |
| a better    | a better    |        13 |
| a big       | a big       |        22 |
| a bit       | a bit       |        44 |
| a book      | a book      |        10 |

Visualisation of Trigrams

kable(head(corpus.tri.f),caption = "Three words")
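Three words

|              | word         | frequency |
|:-------------|:-------------|----------:|
| a couple of  | a couple of  |        28 |
| a long time  | a long time  |        10 |
| a lot of     | a lot of     |        37 |
| all of the   | all of the   |        13 |
| all the time | all the time |        10 |
| and i am     | and i am     |        11 |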

Calculate Frequencies of N-Grams

Step 7: Frequency of words or short phrases

In this section, we find the most frequently occurring words in the data and plot the most common unigrams, bigrams, and trigrams. The N-gram representation of a text lists all N-tuples of consecutive words that appear in it.

plot.n.grams <- function(data, title, num) {
  # Keep the num most frequent n-grams
  df2 <- data[order(-data$frequency),][1:num,]
  ggplot(df2, aes(x = seq_len(num), y = frequency)) +
    # A width below 1 keeps neighbouring bars from overlapping
    geom_bar(stat = "identity", fill = "darkgreen", colour = "black", width = 0.9) +
    coord_cartesian(xlim = c(0, num+1)) +
    labs(title = title) +
    xlab("Words") +
    ylab("Count") +
    # x is numeric, so label the bar positions with a continuous scale
    scale_x_continuous(breaks = seq(1, num, by = 1), labels = as.character(df2$word[1:num])) +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))
}

U <- plot.n.grams(corpus.uni.f,"Unigrams",20)
B <- plot.n.grams(corpus.bi.f,"Bigrams",20)
Tr <- plot.n.grams(corpus.tri.f,"Trigrams",20)
gridExtra::grid.arrange(U, B, Tr, ncol = 3)

Wordcloud

For a better visualisation, we build a word cloud based on the frequencies of the N-grams.

corpus.cloud<-list(corpus.tri.f,corpus.bi.f,corpus.uni.f)
par(mfrow=c(1, 3))
for (i in 1:3) {
  wordcloud(corpus.cloud[[i]]$word, corpus.cloud[[i]]$frequency, scale = c(3,1), max.words=100, random.order=FALSE, rot.per=0, fixed.asp = TRUE, use.r.layout = FALSE, colors=brewer.pal(8, "Dark2"))
}

Next Steps

The next steps are to:

1)Remove common words (stop words) to limit the skew that they may introduce.
2)Add sentiment analysis.
3)Implement word prediction algorithm.
4)Implement Shiny app.
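
As a rough idea of what step 3 could look like, the sketch below predicts the next word by looking up the last two words in a trigram table and backing off to bigrams when nothing matches. This is only one possible approach, not the algorithm the project will necessarily use. The functions best_continuation and predict_next and the toy frequency tables are hypothetical, shaped like corpus.tri.f and corpus.bi.f.

```r
# Toy trigram and bigram frequency tables in the same shape as
# corpus.tri.f and corpus.bi.f (columns: word, frequency)
toy_tri <- data.frame(word = c("a lot of", "a couple of", "lot of people"),
                      frequency = c(37, 28, 5), stringsAsFactors = FALSE)
toy_bi  <- data.frame(word = c("a lot", "a bit", "of the", "of people"),
                      frequency = c(50, 44, 90, 12), stringsAsFactors = FALSE)

# Return the last word of the most frequent n-gram that starts with
# `prefix`, or NA if no n-gram matches
best_continuation <- function(prefix, ngrams) {
  hits <- ngrams[startsWith(ngrams$word, paste0(prefix, " ")), ]
  if (nrow(hits) == 0) return(NA_character_)
  top <- hits$word[which.max(hits$frequency)]
  sub(".* ", "", top)  # keep only the last word
}

# Try the last two words against trigrams, then back off to bigrams
predict_next <- function(input, tri, bi) {
  words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
  n <- length(words)
  guess <- NA_character_
  if (n >= 2) {
    guess <- best_continuation(paste(words[(n - 1):n], collapse = " "), tri)
  }
  if (is.na(guess) && n >= 1) {
    guess <- best_continuation(words[n], bi)
  }
  guess
}

predict_next("I have a lot", toy_tri, toy_bi)  # matched in the trigram table
predict_next("one of", toy_tri, toy_bi)        # falls back to the bigram table
```

A fuller model would also need a strategy for unseen words and some form of smoothing. Both are left out of this sketch.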