For the capstone project of the Data Science Specialization a predictive text product will be built. To achieve this, a large corpus of text documents must be used. This report presents an exploratory analysis of the available data. The report is divided into four sections covering the following:
Downloading and loading the data.
Basic summary report for the data.
N-grams counting.
A way forward for the project.
Initially the necessary packages will be loaded and the appropriate directory will be set.
library("NLP")
library("tm")
library("RWeka")
library("stringi")
library("ggplot2")
library("dplyr")
library("gridExtra")
setwd("D:/coursera/capstone")
After this procedure the data will be downloaded into a new folder called data inside the chosen directory and unzipped there.
if (!dir.exists("D:/coursera/capstone/data")){
  dir.create("D:/coursera/capstone/data")}#creating directory for the data
setwd("D:/coursera/capstone/data")
if (!file.exists("Coursera-SwiftKey.zip")){
  path = getwd()
  url = "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
  download.file(url, file.path(path, "Coursera-SwiftKey.zip"), mode = "wb")#binary mode so the zip file is not corrupted
  unzip(zipfile = "Coursera-SwiftKey.zip")}
The data can be found at this URL. The next step is to read the data into R.
setwd("D:/coursera/capstone/data/final/en_US")#setting the directory where the data was downloaded
twitter=readLines("en_US.twitter.txt",encoding = "UTF-8",skipNul = TRUE)#read text twitter
blogs= readLines("en_US.blogs.txt",encoding = "UTF-8",skipNul = TRUE)#read text blogs
con = file("D:/coursera/capstone/data/final/en_US/en_US.news.txt", open = "rb")
news = readLines(con, encoding = "UTF-8", skipNul = TRUE)#read text news
close.connection(con)
rm(con)
In the following chunk a brief summary of the three documents will be produced. The summary includes the size of each file, the total number of records, the total number of characters, the total number of words, and the min, max and mean number of words per record (wpr). Each line of a document is considered a record.
setwd("D:/coursera/capstone/data/final/en_US")
files=list.files(path="D:/coursera/capstone/data/final/en_US")
size=file.info(files)$size/(1024)^2#size of the file
num_Records=sapply(list(blogs,news,twitter),length)#number of Records
num_char=sapply(list(nchar(blogs),nchar(news),nchar(twitter)),sum)#number of character
num_Words_perRecords = sapply(list(blogs, news, twitter), stri_count_words)#number of words
num_Words = sapply(num_Words_perRecords,sum)#number of words
summary=sapply(num_Words_perRecords,summary)#finding mean,min max number of word per record
summarydf=data.frame(sizeMB=round(size,digits = 0),Records=num_Records,Characters=num_char,
words=num_Words,min_wpr=round(summary[1,],digits = 0),
mean_wpr=round(summary[4,],digits = 0),
max_wpr=round(summary[6,],digits = 0),
row.names=c("blogs","news","twitter"))
summarydf
## sizeMB Records Characters words min_wpr mean_wpr max_wpr
## blogs 200 899288 206824505 37546239 0 42 6726
## news 196 1010242 203223159 34762395 1 34 1796
## twitter 159 2360148 162096241 30093413 1 13 47
From the summary we can see that the files are large, so the following analysis will be performed on a small sample (1% of each document) in order to reduce memory consumption. We can also see that each document has its own characteristics; for example, twitter records tend to be shorter than the others, as expected given the nature of the application.
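To make the memory argument concrete, the in-memory size of the full objects can be checked with base R's object.size() before they are removed; this is only an illustrative check, and the exact figures depend on the system.
sapply(list(blogs = blogs, news = news, twitter = twitter),
       function(x) format(object.size(x), units = "MB"))#approximate size of each full text vector in memory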
The sample will be random and equal to 1% of each original document. The sampled data will be written to disk, which helps keep the corpus objects smaller. At this point a first cleaning of the data will be performed: specifically, all non-English characters will be removed.
set.seed(1000)
sampledtwitter=sample(twitter,length(twitter)*0.01)
sampledblogs=sample(blogs,length(blogs)*0.01)
samplednews=sample(news,length(news)*0.01)
sampledtwitter = iconv(sampledtwitter, 'latin1', 'ASCII', sub = "")#drop non-ASCII characters instead of turning whole records into NA
sampledblogs = iconv(sampledblogs, 'latin1', 'ASCII', sub = "")
samplednews = iconv(samplednews, 'latin1', 'ASCII', sub = "")
if (!dir.exists("D:/coursera/capstone/data/final/sample")){
dir.create("D:/coursera/capstone/data/final/sample")}#creating directory for the sampled data
setwd("D:/coursera/capstone/data/final/sample")
write.table(sampledtwitter,"en_US.twitter.txt",row.names = FALSE)
write.table(sampledblogs,"en_US.blogs.txt",row.names = FALSE)
write.table(samplednews,"en_US.news.txt",row.names = FALSE)
rm(news,twitter,blogs,sampledtwitter,sampledblogs,samplednews,num_Records,num_char,num_Words_perRecords,
num_Words)#removing not usable objects
It will be very useful, especially for building a predictive product, to study the frequency of each unigram, bigram and trigram in the sampled documents. To achieve this, a corpus will be created using the tm package, which will also be used to clean the documents: numbers and punctuation will be removed, and all upper-case letters will be converted to lower case.
Sample_dir=DirSource("D:/coursera/capstone/data/final/sample",encoding ="UTF-8",mode="text" )
corpus=VCorpus(Sample_dir)
corpus=tm_map(corpus,content_transformer(tolower))#convert everything to lower case
corpus=tm_map(corpus,removeNumbers)#strip numbers
corpus=tm_map(corpus,removePunctuation)#strip punctuation
To avoid misspelled words or terms that are not English words (for example web addresses), a dictionary will be created from the words included in a text file. The text file comes from a GitHub project which contains about 466k English words and common terms. More information can be found at dwyl/english-words. The file used, words_alpha.txt, contains only [[:alpha:]] words (words that consist solely of letters, with no numbers or symbols). In the following chunk the list of words that must be removed from the corpus will be built.
setwd("D:/coursera/wordnet/Dictionary")
if (!file.exists("words_alpha.txt")){
  path <- getwd()
  url <- "https://github.com/dwyl/english-words/raw/master/words_alpha.txt"
  download.file(url, file.path(path, "words_alpha.txt"))}
con <- file("D:/coursera/wordnet/Dictionary/words_alpha.txt", open = "rb")
dictionary_il <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close.connection(con)
rm(con)
dictionary_il=dictionary_il[dictionary_il!=""]#drop empty lines
corpusTDM = DocumentTermMatrix(corpus)
MacorpusTDM=as.matrix(corpusTDM)
is.word <- function(x) x %in% dictionary_il
MacorpusTDM <- MacorpusTDM[,!is.word(colnames(corpusTDM))]#keep only the terms that are NOT in the dictionary
iliasstopword=colnames(MacorpusTDM)#these non-dictionary terms will be removed from the corpus
chunk <- 500#remove the words in chunks, since removeWords builds a single regular expression that fails if it gets too long
n = length(iliasstopword)
r = rep(1:ceiling(n/chunk),each=chunk)[1:n]
d = split(iliasstopword,r)#split the word list into chunks of 500
for (i in 1:length(d)) {
  corpus <- tm_map(corpus, removeWords, d[[i]])
}
rm(corpusTDM,MacorpusTDM,dictionary_il,d,iliasstopword)
At this stage of the capstone project two approaches will be considered regarding the stop words. In one approach the stop words will be removed, and in the other they will be kept. A second corpus without the stop words will be created.
corpusNSW=tm_map(corpus,removeWords,stopwords("english"))
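For reference, the list being removed here is tm's built-in English stop-word list; a quick, purely illustrative way to inspect it is shown below.
length(stopwords("english"))#how many stop words the list contains
head(stopwords("english"), 10)#a few example entries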
In the following figure the top 20 unigrams will be plotted for both cases (with and without stop words).
corpusTDM = DocumentTermMatrix(corpus)
unigramscount=colSums(as.matrix(corpusTDM));unigramscount=sort(unigramscount,decreasing = TRUE)
top20unigrams=as.data.frame(unigramscount[1:20])
top20unigrams <- tibble::rownames_to_column(top20unigrams, "unigram")
colnames(top20unigrams)=c("unigram","Counts")
corpusNSWTDM = DocumentTermMatrix(corpusNSW)
unigramsNSWcount=colSums(as.matrix(corpusNSWTDM));unigramsNSWcount=sort(unigramsNSWcount,decreasing = TRUE)
top20unigramsNSW=as.data.frame(unigramsNSWcount[1:20])
top20unigramsNSW <- tibble::rownames_to_column(top20unigramsNSW, "unigram")
colnames(top20unigramsNSW)=c("unigram","Counts")
g1<-ggplot(top20unigrams,aes(x = reorder(unigram, -Counts),
y = Counts))+geom_col()+labs(title="20 most frequent Unigrams Stop words included", x="Unigrams", y="Counts")+theme(plot.title = element_text(hjust = 0.5))
g2<-ggplot(top20unigramsNSW,aes(x = reorder(unigram, -Counts),y = Counts))+geom_col()+labs(title="20 most frequent Unigrams Stop words not included", x="Unigrams",y="Counts")+theme(plot.title = element_text(hjust = 0.5))
grid.arrange(g1,g2,ncol=1)
The next figure shows the most frequent bigrams for the two cases under examination.
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
corpus2gram <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
corpus2gramNSW <- TermDocumentMatrix(corpusNSW, control = list(tokenize = BigramTokenizer))
bigramscount=rowSums(as.matrix(corpus2gram));bigramscount=sort(bigramscount,decreasing = TRUE)
top20bigrams=as.data.frame(bigramscount[1:20])
top20bigrams <- tibble::rownames_to_column(top20bigrams, "bigram")
colnames(top20bigrams)=c("bigram","Counts")
bigramscountNSW=rowSums(as.matrix(corpus2gramNSW));bigramscountNSW=sort(bigramscountNSW,decreasing = TRUE)
top20bigramsNSW=as.data.frame(bigramscountNSW[1:20])
top20bigramsNSW = tibble::rownames_to_column(top20bigramsNSW, "bigram")
colnames(top20bigramsNSW)=c("bigram","Counts")
g1<-ggplot(top20bigrams,aes(x = reorder(bigram, -Counts),
y = Counts))+geom_col()+labs(title="20 most frequent bigrams Stop words included", x="bigrams",y="Counts")+theme(axis.text.x=element_text(angle=90, hjust=1),plot.title = element_text(hjust = 0.5))
g2<-ggplot(top20bigramsNSW,aes(x = reorder(bigram, -Counts),
y = Counts))+geom_col()+labs(title="20 most frequent bigrams Stop words not included",
x="bigrams",y="Counts")+theme(axis.text.x=element_text(angle=90, hjust=1),plot.title = element_text(hjust = 0.5))
grid.arrange(g1,g2,ncol=1)
In the last figure the 20 most frequent trigrams will be depicted.
trigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
corpus3gram <- TermDocumentMatrix(corpus, control = list(tokenize = trigramTokenizer))
corpus3gramNSW <- TermDocumentMatrix(corpusNSW, control = list(tokenize = trigramTokenizer))
trigramscount=rowSums(as.matrix(corpus3gram));trigramscount=sort(trigramscount,decreasing = TRUE)
top20trigrams=as.data.frame(trigramscount[1:20])
top20trigrams <- tibble::rownames_to_column(top20trigrams, "trigram")
colnames(top20trigrams)=c("trigram","Counts")
trigramscountNSW=rowSums(as.matrix(corpus3gramNSW));trigramscountNSW=sort(trigramscountNSW,decreasing = TRUE)
top20trigramsNSW=as.data.frame(trigramscountNSW[1:20])
top20trigramsNSW = tibble::rownames_to_column(top20trigramsNSW, "trigram")
colnames(top20trigramsNSW)=c("trigram","Counts")
g1<-ggplot(top20trigrams,aes(x = reorder(trigram, -Counts),
y = Counts))+geom_col()+labs(title="20 most frequent trigrams Stop words included",x="trigrams",y="Counts")+theme(axis.text.x=element_text(angle=90, hjust=1),plot.title = element_text(hjust = 0.5))
g2<-ggplot(top20trigramsNSW,aes(x = reorder(trigram, -Counts),
y = Counts))+geom_col()+labs(title="20 most frequent trigrams Stop words not included",x="trigrams",y="Counts")+theme(axis.text.x=element_text(angle=90, hjust=1),plot.title = element_text(hjust = 0.5))
grid.arrange(g1,g2,ncol=1)
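Before outlining the way forward, the following rough sketch illustrates how the bigram counts computed above could be turned into conditional probabilities for next-word prediction. The function predict_next and the example word "love" are only illustrative assumptions, not the final algorithm.
predict_next <- function(previous, bigram_counts, top = 3) {
  #bigram names have the form "word1 word2"; keep the ones that start with the given word
  matches <- bigram_counts[grepl(paste0("^", previous, " "), names(bigram_counts))]
  if (length(matches) == 0) return(character(0))
  probs <- matches / sum(matches)#maximum-likelihood estimate of P(next word | previous word)
  next_words <- sub(paste0("^", previous, " "), "", names(probs))
  head(data.frame(next_word = next_words, prob = round(probs, 3), row.names = NULL), top)
}
predict_next("love", bigramscount)#e.g. the three most likely words to follow "love"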
The next step in the project is to build the prediction algorithm and, finally, the Shiny application. The tokenization performed above will help a lot: the probability of appearance of each unigram, bigram, trigram or even higher-order n-gram will be used in the predictive algorithm. Based on the analysis, there are some points to take into consideration while building the algorithm: