The goal of this report is to show the approach I have taken to start processing the data. All processing and analysis was, and will be, done on the en_US dataset. Because the application is demanding in both memory and processing, the specifications of the machine I am using are listed below.
System Model: HP Pavilion dv6 Notebook PC
BIOS: InsydeH2O Version 03.60.48F.1C
Processor: Intel(R) Core(TM) i7-2670QM CPU @ 2.20GHz (8 CPUs), ~2.2GHz
Memory: 8192MB RAM
Available OS Memory: 8140MB RAM
Because it is almost impossible to work with the whole dataset on this machine, I sampled part of the data to work with. The code below was used to load each of the three files.
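# Open read-only connections to the three en_US files
twt <- file("./final/en_US/en_US.twitter.txt", "r")
news <- file("./final/en_US/en_US.news.txt", "r")
blog <- file("./final/en_US/en_US.blogs.txt", "r")
# Read the first 8000 lines from the chosen connection, skipping embedded nuls
df100 <- readLines(twt, 8000, skipNul = TRUE)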
Because the project needed many iterations and repeated tasks, I kept it simple: changing only the connection passed to the readLines command directs the whole analysis to the desired dataset.
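The idea of switching datasets by changing a single input can be sketched as a small helper. This is only an illustration: `read_sample` is a hypothetical name that does not appear in the original code, and the temporary file stands in for one of the en_US files.

```r
# Hypothetical helper: read the first n lines of a text file and
# always close the connection afterwards.
read_sample <- function(path, n) {
  con <- file(path, "r")
  on.exit(close(con))
  readLines(con, n = n, skipNul = TRUE)
}

# Demonstration on a temporary file (stand-in for an en_US file)
demo_path <- tempfile(fileext = ".txt")
writeLines(paste("line", 1:20), demo_path)
sample_lines <- read_sample(demo_path, 5)
```

With a helper like this, pointing the analysis at another dataset would only require changing the `path` argument.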
The table below lists the properties of the three datasets.
| Dataset | Number of lines | Size (MB) |
|---|---|---|
| en_US.twitter.txt | 2360148 | 301.4 |
| en_US.news.txt | 77259 | 19.2 |
| en_US.blogs.txt | 899288 | 248.5 |
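Properties like these can be gathered with base R. The sketch below is one possible way to do it, not necessarily how the table was produced; `describe_file` is a hypothetical helper, and the temporary file is a stand-in for the real data. The connection is opened in binary mode ("rb") as a precaution, since text-mode reads on Windows can stop early at an embedded end-of-file character.

```r
# Hypothetical helper: line count and on-disk size (in MB) of a text file
describe_file <- function(path) {
  con <- file(path, "rb")  # binary mode avoids early termination on Windows
  on.exit(close(con))
  n_lines <- length(readLines(con, skipNul = TRUE, warn = FALSE))
  data.frame(
    file = basename(path),
    lines = n_lines,
    size_mb = round(file.size(path) / 1024^2, 1),
    stringsAsFactors = FALSE
  )
}

# Demonstration on a temporary three-line file
demo_path <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), demo_path)
file_info <- describe_file(demo_path)
```

Note that this reads the whole file into memory just to count the lines, which is acceptable for a one-off summary but costly for the largest files.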
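To prepare the data for the different operations, it was first wrapped as a source using the VectorSource command and then converted to a volatile corpus using VCorpus. The text operations included replacing the characters "/", "@" and "|" with spaces, converting to lower case, removing numbers, removing English stop words and a few additional common words, removing punctuation, stripping extra whitespace and stemming.
Refer to the code in the Appendix.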
Because of the large volume of the dataset and the limited resources, I did some basic testing to find the bottlenecks in the code in terms of memory and CPU. The table below shows that the text operations, the n-gram building (DocumentTermMatrix) and the matrix conversion (as.matrix) are the main bottlenecks. I also noticed that bigrams consume more memory and CPU than trigrams in the matrix-building and conversion steps; I attribute this to the size of the generated segments.
| Command | 1-gram | Bigram | Trigram |
|---|---|---|---|
| as.matrix (time, ms) | 170 | 1090 | 580 |
| DocumentTermMatrix (time, ms) | 2250 | 6210 | 5920 |
| Text operations (time, ms) | 12030 | 12030 | 12030 |
| VCorpus (time, ms) | 490 | 490 | 490 |
| as.matrix (memory, MB) | 570.8 | 2086.7 | 1890 |
| DocumentTermMatrix (memory, MB) | 64.4 | 43.4 | 42.2 |
| Text operations (memory, MB) | 103.7 | 103.7 | 103.7 |
| VCorpus (memory, MB) | 8.4 | 8.4 | 8.4 |
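Measurements of this kind can be collected with base R's `system.time` and `object.size`. The sketch below is one possible approach and not necessarily how the figures in the table were obtained. `measure_step` is a hypothetical helper, and `object.size` reports the size of the resulting object rather than the peak memory used while building it.

```r
# Hypothetical helper: time a step and report the size of its result
measure_step <- function(step_fun) {
  timing <- system.time(result <- step_fun())
  list(
    elapsed_ms = unname(timing["elapsed"]) * 1000,
    size_mb = as.numeric(object.size(result)) / 1024^2,
    result = result
  )
}

# Demonstration: a dense 1000 x 1000 numeric matrix (8 bytes per cell)
dense_demo <- measure_step(function() matrix(0, nrow = 1000, ncol = 1000))
```

A dense numeric matrix costs 8 bytes per cell regardless of how many cells are zero, which is consistent with the matrix conversion dominating the memory figures.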
I also investigated the relation between the number of lines in the dataset and the memory allocation needed, in order to select a number of lines that leaves some room in memory for the subsequent processing and matrix conversion. I selected 5000 lines. The figure below shows the relation between the number of lines in the source text and the memory allocation.
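A rough way to explore this relation is to build a dense term-by-document count matrix for increasing numbers of lines and record its size. The sketch below uses base R and synthetic text as a stand-in for the tm objects measured in the report; the vocabulary, line length and helper names are all assumptions made for illustration.

```r
set.seed(1)
vocab <- paste0("w", 1:500)  # synthetic vocabulary (assumption)

# Generate n synthetic lines of 10 random words each
make_lines <- function(n) {
  replicate(n, paste(sample(vocab, 10, replace = TRUE), collapse = " "))
}

# Size in MB of a dense term-by-document count matrix for the given lines
dense_size_mb <- function(lines) {
  tokens <- strsplit(lines, " ", fixed = TRUE)
  doc_id <- rep(seq_along(tokens), lengths(tokens))
  counts <- table(term = unlist(tokens), doc = doc_id)
  as.numeric(object.size(unclass(counts))) / 1024^2
}

line_counts <- c(100, 200, 400)
sizes_mb <- sapply(line_counts, function(n) dense_size_mb(make_lines(n)))
```

Because the matrix has one column per line and one row per distinct term, its size grows at least in proportion to the number of lines, and faster while new terms keep appearing.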
I am also testing Amazon AWS and IBM DSX as platforms for the text processing and mining operations. DSX seems simpler than AWS, but I still have some issues with running jobs in the background using RStudio.
To generate the tokens (1-grams), the TermDocumentMatrix command was used. To generate the bigrams and trigrams, I used the functions below and passed them to TermDocumentMatrix.
library(RWeka)
# Tokenizers that emit only bigrams and only trigrams, respectively
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
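In tm, a custom tokenizer is handed to TermDocumentMatrix through the `tokenize` entry of its `control` list. The sketch below shows that call on a toy corpus. To avoid the Java dependency of RWeka it swaps in a stand-in tokenizer built on `NLP::ngrams`; the RWeka tokenizers above would be passed in the same way. The corpus contents and the names `toy_corpus` and `bigram_tokenizer` are assumptions for illustration.

```r
library(tm)  # tm depends on NLP, which provides ngrams() and words()

# Toy corpus standing in for the cleaned corpus (hypothetical data)
toy_corpus <- VCorpus(VectorSource(c("the quick brown fox",
                                     "the quick red fox")))

# Stand-in bigram tokenizer based on NLP::ngrams (not the RWeka one)
bigram_tokenizer <- function(x) {
  unlist(lapply(NLP::ngrams(NLP::words(x), 2L), paste, collapse = " "),
         use.names = FALSE)
}

# The tokenizer is supplied through the control list
bigram_tdm <- TermDocumentMatrix(toy_corpus,
                                 control = list(tokenize = bigram_tokenizer))
bigram_terms <- Terms(bigram_tdm)
```

Each row of `bigram_tdm` then corresponds to one bigram rather than one word.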
The wordcloud package was used to generate the word clouds.
library(wordcloud)
library(RColorBrewer)
# Word cloud of up to 200 of the most frequent words
wordcloud(words = df100.16$word, freq = df100.16$freq, min.freq = 1,
          max.words = 200, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))
# Bar plot of the 30 most frequent words
barplot(df100.16[1:30, ]$freq, las = 2, names.arg = df100.16[1:30, ]$word,
        col = "lightblue", main = "Most frequent words",
        ylab = "Word frequencies")
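The plotting calls expect a data frame with a `word` column and a `freq` column, sorted by decreasing frequency. How `df100.16` was built is not shown in this report, so the sketch below is only an assumption about its shape, using base R and a toy text. With tm, the counts would typically come from the row sums of the term-document matrix instead.

```r
# Toy text standing in for the cleaned corpus (hypothetical data)
toy_text <- c("good day good time", "good time")

# Count each word and sort by decreasing frequency
tokens <- unlist(strsplit(toy_text, " ", fixed = TRUE))
counts <- sort(table(tokens), decreasing = TRUE)

# Data frame in the shape the plotting code expects: word and freq
word_freq <- data.frame(word = names(counts),
                        freq = as.integer(counts),
                        stringsAsFactors = FALSE)
```

A data frame like `word_freq` could then be passed to `wordcloud` and `barplot` in place of `df100.16`.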
I found it very interesting how the number of n-grams changes with the n-gram size. I wanted to know what governs how many distinct n-grams a specific corpus yields for a given size (bigram, trigram and so on), so I implemented a test case and plotted the relation between the n-gram size and the number of n-grams. The number starts low, increases up to a certain point and then starts to decrease. My explanation is that 1-grams (tokens) repeat very often, so the number of distinct ones is low; as the size increases, more distinct combinations appear and the number grows. Eventually the size becomes large enough that the corpus can hold fewer n-grams than at the previous size. This test is not required for the Capstone; I did it out of curiosity.
The plot below illustrates this relation.
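The rise-then-fall behaviour can be reproduced on a very small scale with base R. The sketch below counts distinct n-grams within each line of a three-line toy corpus; the corpus and the helper names are assumptions for illustration and are not the test case used for the plot.

```r
# All n-grams of a single line, as space-separated strings
ngrams_of_line <- function(line, n) {
  words <- strsplit(tolower(line), "\\s+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Number of distinct n-grams across all lines
count_distinct_ngrams <- function(lines, n) {
  length(unique(unlist(lapply(lines, ngrams_of_line, n = n))))
}

toy_corpus <- c("the cat sat on the mat",
                "the dog sat on the mat",
                "the cat ate the fish")

ngram_counts <- sapply(1:6, function(n) count_distinct_ngrams(toy_corpus, n))
# ngram_counts is 8, 10, 9, 7, 5, 2 for n = 1..6
```

Here the count peaks at bigrams: repeated words keep the number of distinct 1-grams low, while for larger sizes each line simply has fewer positions where an n-gram can start.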
Appendix: code for the text operations
library(tm)
# Wrap the sampled lines as a source and build a volatile (in-memory) corpus
df100.1 <- VectorSource(df100)
df100.2 <- VCorpus(df100.1)
# Helper that replaces every match of a pattern with a space
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
df100.3 <- tm_map(df100.2, toSpace, "/")
df100.4 <- tm_map(df100.3, toSpace, "@")
df100.5 <- tm_map(df100.4, toSpace, "\\|")
# Lower-case, then remove numbers, stop words, punctuation and extra whitespace
df100.6 <- tm_map(df100.5, content_transformer(tolower))
df100.7 <- tm_map(df100.6, removeNumbers)
df100.8 <- tm_map(df100.7, removeWords, stopwords("english"))
df100.9 <- tm_map(df100.8, removeWords, c("just", "like", "will", "can"))
df100.10 <- tm_map(df100.9, removePunctuation)
df100.11 <- tm_map(df100.10, stripWhitespace)
# Stem the remaining words and build the term-document matrix
df100.12 <- tm_map(df100.11, stemDocument)
df100.13 <- TermDocumentMatrix(df100.12)