This report presents some early findings and analysis of the datasets included in the Data Science Capstone of the Coursera Data Science Specialization. We’ll be setting up the basic knowledge we need towards building the text prediction algorithm that is the goal of the capstone. The logical steps we’ve followed are:

- Datasets loading
- Datasets sampling
- Text cleansing
- Exploratory analysis (wordcloud graphs)
- N-gram (1-5) analysis

We’ll base the analysis on the 3 English files for Webs, Twitter and Blogs.
First we load into R the packages we will use for the analysis.
In our case, we’re using a 64-bit Windows 10 machine, which requires a 64-bit JVM. Thus, we set the environment variable JAVA_HOME accordingly and increase the Java heap space. These two steps are necessary to use the rJava and RWeka libraries.
Sys.setenv(JAVA_HOME="C:\\Program Files\\Java\\jdk-14.0.2\\")
options(java.parameters = "-Xmx2048m")
library(dplyr)
library(stringi)
library(NLP)
library(tm)
library(wordcloud)
library(ggplot2)
library(plotrix)
library(rJava)
library(RWeka)
We first load the three files corresponding to Blogs, News and Tweets.
conn<- file("C:/Users/agustin.izquierdo/Documents/R/coursera/Capstone/final/en_US/en_US.blogs.txt", "rb")
Blogs <- readLines(conn)
close.connection(conn)
conn <- file("C:/Users/agustin.izquierdo/Documents/R/coursera/Capstone/final/en_US/en_US.news.txt", "rb")
Webs <- readLines(conn)
close.connection(conn)
conn <- file("C:/Users/agustin.izquierdo/Documents/R/coursera/Capstone/final/en_US/en_US.twitter.txt", "rb")
Twitter <- readLines(conn, skipNul = TRUE)
close.connection(conn)
rm(conn)
On the raw data, we delete non-English characters and generate a summary data frame for each data source, including the number of characters, words and lines.
#Remove all non-English characters
Blogs <- iconv(Blogs, "latin1", "ASCII", sub="")
Webs <- iconv(Webs, "latin1", "ASCII", sub="")
Twitter <- iconv(Twitter, "latin1", "ASCII", sub="")
#Generate a summary dataframe for Blogs, Tweets and Webs including number of chars, words and lines
summary_data <- data.frame(matrix(ncol = 4, nrow = 0))
colnames(summary_data)<- c("source","#chars", "# words", "# lines")
summary_data[1,1]<-"Blogs"
summary_data[1,2]<-sum(nchar(Blogs))
summary_data[1,3]<-sum(stri_count_words(Blogs))
summary_data[1,4]<-length(Blogs)
summary_data[2,1]<-"Twitter"
summary_data[2,2]<-sum(nchar(Twitter))
summary_data[2,3]<-sum(stri_count_words(Twitter))
summary_data[2,4]<-length(Twitter)
summary_data[3,1]<-"Blogs"
summary_data[3,2]<-sum(nchar(Webs))
summary_data[3,3]<-sum(stri_count_words(Webs))
summary_data[3,4]<-length(Webs)
Yielding this result for the information summary:
summary_data
## source #chars # words # lines
## 1 Blogs 206043906 37510168 899288
## 2 Twitter 161961555 30088605 2360148
## 3 Webs 202917604 34749301 1010242
Due to the size of the data sources, before moving forward we take a 10% sample of the data for the initial analysis in the scope of this report. We store the samples for Blogs, Tweets and Webs into three separate files we’ll be using later.
remove(summary_data)
set.seed(2020)
ssize<-0.1
sampleBlog <- sample(Blogs, ssize*length(Blogs))
sampleTwitter <- sample(Twitter, ssize*length(Twitter))
sampleWeb <- sample(Webs, ssize*length(Webs))
write(sampleBlog, file = "C:/Users/agustin.izquierdo/Documents/R/coursera/Capstone/final/en_US/samples/blog_sample.txt")
write(sampleTwitter, file = "C:/Users/agustin.izquierdo/Documents/R/coursera/Capstone/final/en_US/samples/twitter_sample.txt")
write(sampleWeb, file = "C:/Users/agustin.izquierdo/Documents/R/coursera/Capstone/final/en_US/samples/web_sample.txt")
rm(Blogs)
rm(Twitter)
rm(Webs)
rm(sampleBlog)
rm(sampleTwitter)
rm(sampleWeb)
With the sample files, it’s now time to generate the corpus we’ll be analyzing. We clean it by converting all words to lower case and removing punctuation and extra white space. We keep one sample corpus with just this cleansing, and a second one from which we also remove numbers and common English words. We’ll be using the combination of both of them in the n-gram analysis.
data_dir <- "C:/Users/agustin.izquierdo/Documents/R/coursera/Capstone/final/en_US/samples/"
corpus_sample <- VCorpus(DirSource(data_dir, encoding = "UTF-8"), readerControl = list(language = "en"))
rm(data_dir)
#remove capital letters
corpus_sample <- tm_map(corpus_sample,content_transformer(tolower))
#we remove all punctuation
corpus_sample <- tm_map(corpus_sample, removePunctuation)
#we remove all white spaces
corpus_sample <- tm_map(corpus_sample, stripWhitespace)
#we keep this copy of the corpus sample without removing numbers and common english words
corpus_sample2<-corpus_sample
#we remove common english words
corpus_sample <- tm_map(corpus_sample, removeWords, stopwords("english"))
#remove numbers
corpus_sample <- tm_map(corpus_sample, content_transformer(removeNumbers))
We’ll be using wordclouds for visual inspection of the data sources. We’ve chosen this option because it is a nice way of visualizing this type of source and offers something different from common bar plots or standard grids.
Notice:

- We’re using the 3 data sources all together in the term document matrix, but thanks to the way the term document structure is created we know to which data source each word belongs.
- Our goal with this visual analysis is to detect the most frequently used words in the 3 data sources together and separately, to see differences.
- We’ll be using the sample corpus with numbers and common English words removed.
To prepare the data for the wordclouds, we first need to create a term document matrix (we’ll use the tm package for this purpose). From the term document matrix we generate a matrix containing the words and their frequencies (number of times each appears).
#Create a Term Document Matrix
tdm <- TermDocumentMatrix(corpus_sample)
#Convert the tdm into a matrix
text_matrix<- as.matrix(tdm)
rm(tdm)
#sum rows to get frequency
text_words <- sort(rowSums(text_matrix), decreasing = T)
#convert the matrix into a DF to get names and frequencies
text_freq <- data.frame(terms =names(text_words), num = text_words)
#delete objects not required anymore
rm(text_words)
With a commonality cloud based on the filtered samples (excluding common English words), we visualize the most frequent words across the 3 sources all together.
#Commonality cloud
commonality.cloud(text_matrix,random.order=FALSE,max.words=500, colors=brewer.pal(3,"Dark2"),title.size=0.5)
And broken down by source, the comparison cloud shows the most frequent words in each data source.
#Comparison cloud
comparison.cloud(text_matrix, random.order=FALSE, colors = c("#00B2FF", "red", "#FF0099"),title.size=1.0)
We use the RWeka package to construct functions that tokenize the sample and build matrices of uni-grams, bi-grams, tri-grams, quadri-grams and penta-grams.
The first action is to generate the functions we’ll use later to extract the groups of 1 to 5 words.
tok1 <- function(x) {NGramTokenizer(x, Weka_control(min = 1, max = 1))}
tok2 <- function(x) {NGramTokenizer(x, Weka_control(min = 2, max = 2))}
tok3 <- function(x) {NGramTokenizer(x, Weka_control(min = 3, max = 3))}
tok4 <- function(x) {NGramTokenizer(x, Weka_control(min = 4, max = 4))}
tok5 <- function(x) {NGramTokenizer(x, Weka_control(min = 5, max = 5))}
Now we move from uni-grams through penta-grams to analyze their appearance frequency. We’ll set the lower limits of n-gram appearance for each case based on previous graphs we’ve generated, to minimize object sizes. We’ll display the code only for the first example, since it is basically the same for the others.
We generate the term document matrix for the corpus and keep terms that appear more than 5,000 times. Then we find the frequency of the terms in this matrix and construct a data frame storing each uni-gram and its frequency, sorted in decreasing order of frequency. Finally, we plot the results for the top 25 uni-grams.
matrix <- TermDocumentMatrix(corpus_sample, control = list(tokenize = tok1))
corpus <- findFreqTerms(matrix,lowfreq=5000)
corpus_freq <- rowSums(as.matrix(matrix[corpus,]))
corpus_freq<- sort(corpus_freq, decreasing = TRUE)
corpus_freq <- data.frame(word=names(corpus_freq), freq=corpus_freq)
topN<-25
ggplot(corpus_freq[1:topN,], aes(x = reorder(word, freq), y = freq)) + geom_bar(stat = "identity", width = 0.3, fill = "#FF3466") + xlab("Words") + ylab("Frequency") + coord_flip()
The same action for the non-filtered sample corpus (including common English terms), considering a frequency higher than 10,000, yields this result.
We perform the same action for bi-grams, considering terms appearing more than 500 times, and this is the result. A sketch of the bi-gram code is shown below.
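Since the report only prints the code for the first example, here is a minimal sketch of how the bi-gram step might look following the same pattern (object names such as matrix2 and bigram_freq are illustrative, not from the original analysis); swapping in corpus_sample2 and raising lowfreq reproduces the non-filtered variants described below.
#Sketch of the bi-gram step, mirroring the uni-gram code above
matrix2 <- TermDocumentMatrix(corpus_sample, control = list(tokenize = tok2))
bigrams <- findFreqTerms(matrix2, lowfreq = 500)
bigram_freq <- rowSums(as.matrix(matrix2[bigrams, ]))
bigram_freq <- sort(bigram_freq, decreasing = TRUE)
bigram_freq <- data.frame(word = names(bigram_freq), freq = bigram_freq)
ggplot(bigram_freq[1:topN, ], aes(x = reorder(word, freq), y = freq)) + geom_bar(stat = "identity", width = 0.3, fill = "#FF3466") + xlab("Bi-grams") + ylab("Frequency") + coord_flip()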
The same action for the non-filtered sample corpus (including common English terms), considering terms appearing more than 5,000 times, gives this result for bi-grams.
We perform the same action for tri-grams, considering terms appearing more than 50 times, and this is the result.
The same action for the non-filtered sample corpus (including common English terms), considering terms appearing more than 500 times, yields this result for tri-grams.
We perform the same action for quadri-grams, considering terms appearing more than 20 times, and this is the result.
The same action for the non-filtered sample corpus (including common English terms), considering terms appearing more than 200 times, yields this result for quadri-grams.
We perform the same action for penta-grams, considering terms appearing more than 20 times, and this is the result.
The same action for the non-filtered sample corpus (including common English terms), considering terms appearing more than 50 times, yields this result for penta-grams.
Once we have a good set of n-grams and the basic data cleansing guidelines, we can set up and apply machine learning algorithms to predict words given a typed text. We now have a better knowledge of how many n-grams to use, what words have to be avoided…
It is clear that for the text prediction algorithm we cannot skip common English words, since they are a very relevant part of the words that may be predicted (this can be clearly seen in the tri-, quadri- and penta-gram visual exploration).
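As a hedged illustration only (not the capstone algorithm itself), a simple next-word lookup over a bi-gram frequency table such as the bigram_freq data frame sketched above could look like this; the function name predict_next is hypothetical.
#Sketch only: predict the next word as the most frequent bi-gram that
#starts with the last typed word (predict_next and bigram_freq are illustrative)
predict_next <- function(text, bigram_freq) {
  last_word <- tail(strsplit(tolower(text), "\\s+")[[1]], 1)
  hits <- bigram_freq[grepl(paste0("^", last_word, " "), bigram_freq$word), ]
  if (nrow(hits) == 0) return(NA_character_)
  #bigram_freq is sorted decreasing, so the first hit is the most frequent
  strsplit(as.character(hits$word[1]), " ")[[1]][2]
}
#Example usage: predict_next("thanks for", bigram_freq)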
Besides, for this project it is very important to keep memory usage as low as possible, otherwise we’ll have severe performance issues.