The main goal of this report is to demonstrate the capacity developed to handle unstructured data in order to produce a structured set of records that can be used for statistical modeling in the realm of text mining.
Due to the size of the provided raw files, I structured different scripts to handle specific steps of the entire process. For more details, please refer to my git repo, which contains the whole project.
I also tried to use as much as possible of the tm R package, along with its plugins for multicore processing (which I am still working on at this time).
I approached this project as a pipeline, separating different objectives into different files. At this point the project is organized into the following folders:
Prepare : Contains the R scripts to download the data, set up the local data structure, and split the raw data into smaller sample files (a sketch of the sampling idea follows this list). The file itself can be found here
Exploratory : Contains the R scripts that I used to learn the tm package, along with some preliminary studies of the data itself. Check the file here
ngramGennerator : Contains the R scripts that I used to process the short files (previously generated by the Prepare script) into a DocumentTermMatrix and then into three new CSV files containing the frequencies of the most common n-grams. See it here
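To illustrate the Prepare step, here is a minimal sketch of the line-sampling idea, assuming illustrative file paths and a 10% rate; the actual code is in prepareData.R.
set.seed(1234)
sampleFile <- function(inPath, outPath, rate = 0.1) {
    lines <- readLines(inPath, encoding = "UTF-8", skipNul = TRUE)
    keep <- rbinom(length(lines), 1, rate) == 1   # keep roughly 10% of the lines
    writeLines(lines[keep], outPath)
}
sampleFile("./data/raw/de_DE.blogs.txt", "./data/short/de_DE.blogs.txt")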
Ideally, you can run the scripts in the following sequence to achieve the same results I got:
Prepare : [Prepare everything](https://github.com/jguszr/DSCapstone/blob/master/Prepare/prepareData.R)
ngramGennerator : [Generate the Ngrams](https://github.com/jguszr/DSCapstone/blob/master/Prepare/prepareData.R)
These two steps will produce the .csv files containing the n-gram frequencies.
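In essence, the ngramGennerator script collapses an n-gram TermDocumentMatrix (like the one built later in this report) into a frequency table and writes it to disk. A hedged sketch, with illustrative object and file names:
freqs <- sort(rowSums(as.matrix(termMatrix)), decreasing = TRUE)   # n-gram counts across all documents
ngramFreq <- data.frame(ngram = names(freqs), freq = freqs, row.names = NULL)
write.csv(ngramFreq, "./data/ngrams/bigram_freq.csv", row.names = FALSE)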
require(tm)
require(RWeka)
Using a VCorpus tm object to load all the German files:
setwd("~/Documents/coursera/datasciencecoursera/DSCapstone")
corpus <- VCorpus(DirSource("./data/short", pattern = "^de_DE", encoding = "UTF-8"),
                  readerControl = list(reader = readPlain, language = "de", load = TRUE))
After that, we can use tm's inspect and meta functions to check the data and metadata of a specific corpus property, or get a full view of it using the code chunk below.
inspect(corpus)
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 3
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 8354894
##
## [[2]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 9393247
##
## [[3]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 7282117
Also, we can check specific metadata using meta for the entire corpus (in our case, 3 documents), or we can be more specific and indicate which document we want to extract the metadata from.
meta(corpus,"language")
## $de_DE.blogs.txt
## [1] "de"
##
## $de_DE.news.txt
## [1] "de"
##
## $de_DE.twitter.txt
## [1] "de"
meta(corpus[2],"id")
## $de_DE.news.txt
## [1] "de_DE.news.txt"
So, after loading the corpus, it is necessary to clean it up a bit in order to remove "noisy information" such as numbers, special characters, and so on. I will omit the results of the next chunk to keep this report a little more concise.
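A minimal sketch of that cleanup, assuming the usual tm transformations (the exact set used in the project may differ):
corpus <- tm_map(corpus, content_transformer(tolower))      # lower-case everything
corpus <- tm_map(corpus, removeNumbers)                     # drop digits
corpus <- tm_map(corpus, removePunctuation)                 # drop punctuation and special characters
corpus <- tm_map(corpus, removeWords, stopwords("german"))  # drop German stop words
corpus <- tm_map(corpus, stripWhitespace)                   # collapse extra whitespace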
I chose to generate a TermDocumentMatrix to explore the corpus first, since it already calculates several aspects such as term frequency and sparsity, among others. The matrix generation is quite resource demanding and takes some time, which is why I had to cut the raw data into shorter versions; otherwise I am unable to generate the matrix.
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
termMatrix <- TermDocumentMatrix(corpus,
control = list(tokenize = BigramTokenizer,
language="german",
removeNumbers=TRUE,
stemming=TRUE,
stopwords=TRUE
)
)
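With the matrix in place, the frequency and sparsity information it already carries can be checked directly; a quick look, where the 500 cut-off is just an arbitrary choice for illustration:
termMatrix                                  # printing shows dimensions and overall sparsity
findFreqTerms(termMatrix, lowfreq = 500)    # bigrams that appear at least 500 times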
While learning about NLP and text mining, I found a visualization technique called the "word cloud", which is an interesting way to approach a new text-based dataset.
require(wordcloud)
mtrx <- as.matrix(termMatrix)
freq <- sort(rowSums(mtrx))
tmdata <- data.frame(words = names(freq), freq)
wordcloud(tail(tmdata$words, 200), tail(tmdata$freq, 200), random.order = FALSE, colors = brewer.pal(8, "Spectral"))
I also plotted the bigram frequency distribution as a barplot to help understand the data pattern. I had to rm some objects in order to have enough memory to produce the barplot.
require(ggplot2)
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
rm(mtrx)
rm(corpus)
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 1953011 104.4 4703850 251.3 4703850 251.3
## Vcells 16291050 124.3 35222296 268.8 35207677 268.7
tmdata <- subset(tmdata,tmdata$freq>150)
ggplot(data=head(tmdata[order(tmdata$freq,decreasing = TRUE),],20), aes(x=words, y=freq)) +
geom_bar(stat="identity", fill="steelblue")+
geom_text(aes(label=words), vjust=-0.3, size=1.9,color="blue")+ ggtitle("Top 20 most frequent \n bigrams") +
theme_minimal()
Even working with a small (10%) sample of the full dataset, the computing time is at the limit of what my current system can handle.
I should worry not about the most frequent terms, but about those in the middle of the distribution: even in this tiny top-20 sample, it is quite visible that bigrams with frequencies around 300 will be more challenging to handle than those above 600.
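One rough way to quantify that concern is to count how many distinct bigrams fall into each frequency band, using the cut points mentioned above (illustrative only):
bands <- cut(tmdata$freq, breaks = c(150, 300, 600, Inf), labels = c("150-300", "300-600", "600+"))
table(bands)   # number of distinct bigrams in each frequency band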