Introduction

This report describes the approach I have taken to start processing the data. All processing and analysis was, and will continue to be, done on the en_US dataset. Because the application is demanding in both memory and processing time, the specifications of the machine I am using are listed below.

       System Model: HP Pavilion dv6 Notebook PC
               BIOS: InsydeH2O Version 03.60.48F.1C
          Processor: Intel(R) Core(TM) i7-2670QM CPU @ 2.20GHz (8 CPUs), ~2.2GHz
             Memory: 8192MB RAM
Available OS Memory: 8140MB RAM

Loading the data

Working with the whole dataset is almost impossible on this machine, so I worked with a sample of it. The code below opens a connection to each of the three files and reads lines from one of them.

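# Open a read connection to each of the three en_US files
twt <- file("./final/en_US/en_US.twitter.txt", "r")
news <- file("./final/en_US/en_US.news.txt", "r")
blog <- file("./final/en_US/en_US.blogs.txt", "r")

# Read up to 8000 lines from the chosen connection, skipping embedded nuls
df100 <- readLines(twt, 8000, skipNul = TRUE)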

Because the project needed many iterations and repeated tasks, I kept things simple: changing the connection passed to the readLines command directs the whole analysis to the desired dataset.
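The idea of switching datasets by changing a single input can be sketched as a small helper. This is only an illustration: the function name load_lines and its arguments are hypothetical and not part of the report's code, and the demo runs on a temporary file so it works without the Capstone data.

```r
# Hypothetical helper: read the first n lines of a text file.
# The name load_lines and its arguments are illustrative only.
load_lines <- function(path, n = 8000) {
  con <- file(path, "r")
  on.exit(close(con))               # always release the connection
  readLines(con, n, skipNul = TRUE)
}

# Demo on a temporary file so the sketch runs without the Capstone data.
demo_path <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), demo_path)
demo_lines <- load_lines(demo_path, n = 2)
print(demo_lines)
```

With the real data, the same call would take one of the three paths used above, for example load_lines("./final/en_US/en_US.news.txt").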

Dataset Properties

The table below lists the properties of the three datasets.

Data Set              # of Lines   Size (MB)
en_US.twitter.txt        2360148       301.4
en_US.news.txt             77259        19.2
en_US.blogs.txt           899288       248.5
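Properties of this kind can be collected with a few base R calls. The sketch below is an assumption about how such numbers could be obtained, not the method used for the table: the helper name file_properties is hypothetical, and since the report does not say whether the size is measured on disk or in memory, both are shown.

```r
# Hypothetical helper: line count plus on-disk and in-memory size for one file.
file_properties <- function(path) {
  lines <- readLines(path, skipNul = TRUE, warn = FALSE)
  data.frame(
    file      = basename(path),
    lines     = length(lines),
    disk_mb   = file.size(path) / 1024^2,                 # size of the file on disk
    memory_mb = as.numeric(object.size(lines)) / 1024^2   # size of the loaded vector
  )
}

# Demo on a temporary file so the sketch runs without the Capstone data.
demo_path <- tempfile(fileext = ".txt")
writeLines(c("one", "two", "three"), demo_path)
props <- file_properties(demo_path)
print(props)
```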

Data Pre-Processing

To prepare the data for the different operations, the text was first wrapped as a source using the VectorSource command and then converted to a volatile corpus using VCorpus. The text operations were:

  1. Removing special characters
  2. Transforming to lowercase
  3. Removing numbers
  4. Removing stop words
  5. Removing punctuation
  6. Stripping white space
  7. Stemming
  8. Stem completion

The code for these operations is listed in the Appendix.
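To make the effect of the first six operations concrete, here is a minimal base R sketch applied to a single sentence. It is not the tm pipeline from the Appendix: the function clean_text and the tiny stop-word list are hypothetical stand-ins, and stemming and stem completion are left out because they need an external stemmer.

```r
# Tiny illustrative stop-word list; tm's stopwords("english") is much longer.
stop_words <- c("the", "is", "a", "at", "and")

# Hypothetical helper mirroring steps 1 to 6 of the list above.
clean_text <- function(x) {
  x <- gsub("[/@|]", " ", x)                       # 1. special characters to spaces
  x <- tolower(x)                                  # 2. lowercase
  x <- gsub("[0-9]+", "", x)                       # 3. remove numbers
  pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)           # 4. remove stop words
  x <- gsub("[[:punct:]]+", "", x)                 # 5. remove punctuation
  x <- gsub("\\s+", " ", x)                        # 6. collapse white space
  trimws(x)
}

cleaned <- clean_text("The meeting is at 10/12 @ Room 5, and it's a BIG one!")
print(cleaned)
```

The order matters in the same way as in the Appendix: stop words are removed before punctuation, so a contraction such as "it's" survives as "its" unless it is on the stop-word list.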

Memory and Processing Resources Allocation

Because of the large volume of the dataset and the limited resources, I did some basic testing to find the bottlenecks in the code in terms of memory and CPU. The table below shows that the text operations, n-gram building and matrix conversion are the main bottlenecks. I also noticed that bigrams consume more memory and CPU than trigrams, which I attribute to the size of the generated segments.

Command                           1-Gram    Bigram   Trigram
as.matrix (time, ms)                 170      1090       580
DocumentTermMatrix (time, ms)       2250      6210      5920
Text Operations (time, ms)         12030     12030     12030
VCorpus (time, ms)                   490       490       490
as.matrix (memory, MB)             570.8    2086.7      1890
DocumentTermMatrix (memory, MB)     64.4      43.4      42.2
Text Operations (memory, MB)       103.7     103.7     103.7
VCorpus (memory, MB)                 8.4       8.4       8.4
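Measurements like the ones in the table can be taken with base R. The sketch below is one possible way and not necessarily how the numbers above were obtained: profile_step is a hypothetical helper, and object.size reports the size of the resulting object, not the peak memory used while computing it.

```r
# Hypothetical helper: run one step, report elapsed time (ms) and result size (MB).
profile_step <- function(label, step_fun) {
  timing <- system.time(result <- step_fun())
  data.frame(
    step    = label,
    time_ms = unname(timing["elapsed"]) * 1000,
    mem_mb  = as.numeric(object.size(result)) / 1024^2
  )
}

# Stand-in workload: a dense 1000 x 1000 numeric matrix.
report <- profile_step("dense matrix", function() matrix(0, 1000, 1000))
print(report)
```

A dense numeric matrix stores 8 bytes per cell, which is why converting a sparse term matrix with as.matrix is the most memory-hungry row in the table.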

I also investigated the relation between the number of lines in the dataset and the memory allocation needed, in order to select a number of lines that leaves some room in memory for the subsequent processing and matrix conversion. I selected 5000 lines. The figure below shows the relation between the number of lines in the source text and memory allocation.
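A relation of this kind can be measured on synthetic text with a rough sketch like the one here. Everything in it is an assumption made for illustration: the toy vocabulary, the line counts and the helper dense_size_mb are hypothetical, and the dense count matrix is built in base R instead of with tm.

```r
set.seed(1)
vocab <- paste0("w", 1:500)
# 400 synthetic lines of 10 words each
toy_lines <- replicate(400, paste(sample(vocab, 10), collapse = " "))

# Size in MB of a dense term-by-line count matrix built from the given lines.
dense_size_mb <- function(lines) {
  words <- strsplit(lines, " ", fixed = TRUE)
  terms <- unique(unlist(words))
  m <- sapply(words, function(w) table(factor(w, levels = terms)))
  as.numeric(object.size(m)) / 1024^2
}

line_counts <- c(100, 200, 400)
sizes <- sapply(line_counts, function(n) dense_size_mb(toy_lines[1:n]))
print(data.frame(lines = line_counts, size_mb = round(sizes, 2)))
```

The dense matrix has one cell per term and line, so its size grows with both the number of lines and the number of distinct terms those lines introduce.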

I am testing the Amazon AWS and IBM DSX platforms for the text processing and mining operations. DSX seems simpler than AWS, but I still have some issues with running jobs in the background using RStudio.

Tokenization, Bigram and Trigram Generation

The tokens, or 1-grams, were generated with the TermDocumentMatrix command. For the bigrams and trigrams I used the functions below and passed them to TermDocumentMatrix.

library(RWeka)  # provides NGramTokenizer and Weka_control
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
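To show what such tokenizers produce, here is a minimal base R sketch that builds n-grams from one line of text. It is a stand-in for illustration only and does not use RWeka: make_ngrams is a hypothetical helper that splits on white space and joins each run of n consecutive words.

```r
# Hypothetical base R helper: all n-grams of size n from one line of text.
make_ngrams <- function(line, n) {
  words <- strsplit(trimws(line), "\\s+")[[1]]
  if (length(words) < n) return(character(0))   # line too short for this n
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

bigrams  <- make_ngrams("thank you very much", 2)
trigrams <- make_ngrams("thank you very much", 3)
print(bigrams)
print(trigrams)
```

A line of w words yields w - n + 1 n-grams, so the four-word example gives three bigrams and two trigrams.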

Plotting the word Clouds and Histogram for Tokens

The package wordcloud was used to generate the clouds.

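library(wordcloud)     # wordcloud()
library(RColorBrewer)  # brewer.pal()

# Word cloud of up to 200 terms, most frequent terms placed first
wordcloud(words = df100.16$word, freq = df100.16$freq, min.freq = 1,
          max.words = 200, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))

# Histogram: bar plot of the first 30 rows of df100.16
barplot(df100.16[1:30, ]$freq, las = 2, names.arg = df100.16[1:30, ]$word,
        col = "lightblue", main = "Most frequent words",
        ylab = "Word frequencies")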
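The plotting code expects a data frame with word and freq columns, but the report does not show how df100.16 was built. One common way to get such a table from a term-by-document count matrix is sketched here. The toy matrix and the name word_freq are hypothetical; with tm the matrix would typically come from as.matrix applied to the term-document matrix.

```r
# Toy term-by-document count matrix (rows are terms, columns are documents).
toy_counts <- matrix(c(2, 1, 1,
                       1, 1, 1,
                       0, 0, 1),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(c("love", "day", "time"),
                                     c("d1", "d2", "d3")))

# Total count per term, sorted from most to least frequent.
totals <- sort(rowSums(toy_counts), decreasing = TRUE)
word_freq <- data.frame(word = names(totals), freq = unname(totals),
                        stringsAsFactors = FALSE)
print(word_freq)
```

Because the rows are sorted by frequency, taking the first 30 rows of such a table gives the most frequent words for the bar plot.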

Tokens Cloud and Histogram

Bigram Cloud and Histogram

Trigram Cloud and Histogram

Ngrams size and Number

I found it very interesting how the number of n-grams changes with the size of the n-gram being generated. I wanted to know what governs how many n-grams we get from a specific corpus given the n-gram size, whether it is a bigram, a trigram and so on. So I implemented a test case and plotted the relation between n-gram size and number. I found that the number of n-grams starts low, increases up to a certain point, and after that starts to decrease. My explanation for this behavior is that with 1-grams, or tokens, there are many repetitions, so the number of distinct n-grams is low, while increasing the n-gram size yields more n-grams than the previous size. Eventually the n-gram size becomes large enough that the corpus holds fewer n-grams than at the previous size. This test is not required for the Capstone; I did it out of curiosity.

The plot below shows the whole relation.
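The counting behind this relation can be reproduced on a toy corpus. The sketch below is an illustration under stated assumptions, not the original test case: the three lines of text and the helper count_distinct_ngrams are hypothetical, and the count is of distinct n-grams.

```r
toy_corpus <- c("to be or not to be",
                "not to be or to be",
                "be or not be to")

# Hypothetical helper: number of distinct n-grams of size n across all lines.
count_distinct_ngrams <- function(lines, n) {
  grams <- unlist(lapply(strsplit(lines, " ", fixed = TRUE), function(w) {
    if (length(w) < n) return(character(0))     # line too short for this n
    vapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  length(unique(grams))
}

distinct_counts <- sapply(1:6, function(n) count_distinct_ngrams(toy_corpus, n))
print(data.frame(n = 1:6, distinct = distinct_counts))
```

Two effects compete here. A small n repeats often, so few distinct n-grams exist, while a line of w words only contains w - n + 1 n-grams, so large values of n run out of positions. The toy corpus gives 4, 7, 8, 8, 5 and 2 distinct n-grams for n from 1 to 6, which is the same rise and fall described above.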

Goals and Objectives of this project, What is Next?

  1. Developing a predictive algorithm that predicts the next input words by calculating the joint probabilities.
  2. Wrapping this algorithm in a user-friendly Shiny app.
  3. Looking for ways to speed up processing and solve the memory allocation issues.
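As a first idea of what the predictive step could look like, here is a minimal sketch that picks the next word from bigram counts by relative frequency, that is, an estimate of P(next word | previous word). This is an assumption about one possible approach and not the final algorithm: the toy counts and the helper predict_next are hypothetical, and no smoothing or back-off is applied.

```r
# Toy bigram counts; in practice these would come from the bigram matrix.
bigram_counts <- c("thank you" = 3, "thank god" = 1, "you are" = 2)

# Hypothetical helper: most likely next word after prev_word, or NA if unseen.
predict_next <- function(prev_word, counts) {
  parts  <- strsplit(names(counts), " ", fixed = TRUE)
  first  <- vapply(parts, `[`, character(1), 1)
  second <- vapply(parts, `[`, character(1), 2)
  keep <- first == prev_word
  if (!any(keep)) return(NA_character_)
  probs <- counts[keep] / sum(counts[keep])   # P(next | prev_word)
  names(probs) <- second[keep]
  names(which.max(probs))
}

next_word <- predict_next("thank", bigram_counts)
print(next_word)
```

For a previous word that never appears in the counts the helper returns NA, which is the case a back-off to shorter n-grams would need to handle.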

Appendix

Text Operations Code

library(tm)         # VCorpus, tm_map, TermDocumentMatrix
library(SnowballC)  # stemmer used by stemDocument

# Build a volatile corpus from the character vector of lines
df100.1 <- VectorSource(df100)
df100.2 <- VCorpus(df100.1)

# Replace special characters with a space
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
df100.3 <- tm_map(df100.2, toSpace, "/")
df100.4 <- tm_map(df100.3, toSpace, "@")
df100.5 <- tm_map(df100.4, toSpace, "\\|")

# Lowercase, then remove numbers, stop words and punctuation
df100.6 <- tm_map(df100.5, content_transformer(tolower))
df100.7 <- tm_map(df100.6, removeNumbers)
df100.8 <- tm_map(df100.7, removeWords, stopwords("english"))
df100.9 <- tm_map(df100.8, removeWords, c("just", "like", "will", "can"))
df100.10 <- tm_map(df100.9, removePunctuation)

# Collapse white space, stem, and build the term-document matrix
df100.11 <- tm_map(df100.10, stripWhitespace)
df100.12 <- tm_map(df100.11, stemDocument)
df100.13 <- TermDocumentMatrix(df100.12)