
Programming Project: Exploratory Analysis of Text Corpora

Synopsis

This document describes the basic metadata of the 12 data files provided and then focuses on the German Blog Posts dataset.
I think this is legitimate because the files are all very similar, and the instructions for this assignment state:

In this exercise, you will use the English database but may consider three other databases in German, Russian and Finnish.

Exploratory Analysis 1: Basics

There are 3 document collections (blogs, news, twitter), each given in four languages: German, English, Russian, and Finnish. This makes 4 * 3 = 12 document collections. The following tables list the number of lines/documents, words, and characters for each file. The average word length is also given.
Within each collection, every document is stretched out on a single line, which simplifies processing. The 12 files are all very similar, with the English-language collections being the largest.

The average word length for Russian appears to be larger. This may be an artifact of counting characters as bytes: in UTF-8, Cyrillic characters occupy two bytes each.

Blogs collections

##    filename.blogs nlines.blogs nwords.blogs nchars.blogs avgwordl.blogs
## 1 de_DE.blogs.txt       371440     12652984     85459666           6.75
## 2 en_US.blogs.txt       899288     37334114    210160014           5.63
## 3 fi_FI.blogs.txt       439785     12731004    108503595           8.52
## 4 ru_RU.blogs.txt       337100      9405377    116855835          12.42

News collections

##    filename.news nlines.news nwords.news nchars.news avgwordl.news
## 1 de_DE.news.txt      244743    13216346    95591959          7.23
## 2 en_US.news.txt     1010242    34365936   205811889          5.99
## 3 fi_FI.news.txt      485758    10444685    94234350          9.02
## 4 ru_RU.news.txt      196360     9115829   118996424         13.05

Twitter collections

##           filenames nlines.twitter nwords.twitter nchars.twitter avgwl.tw
## 1 de_DE.twitter.txt         947774       11802976       75578341     6.40
## 2 en_US.twitter.txt        2360148       30359804      167105338     5.50
## 3 fi_FI.twitter.txt         285214        3152757       25331142     8.03
## 4 ru_RU.twitter.txt         881414        9223265      105182346    11.40
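
The exact commands behind these tables are not shown; the following is a minimal sketch of how comparable numbers could be computed, assuming the files sit under a final/ directory and counting characters as bytes (which would also explain the larger Russian averages):

count_file <- function(path) {
  lines  <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  nwords <- sum(vapply(strsplit(lines, "\\s+"), length, integer(1)))
  nchars <- sum(nchar(lines, type = "bytes"))      # byte counts, not characters
  data.frame(filename = basename(path), nlines = length(lines),
             nwords = nwords, nchars = nchars,
             avgwordl = round(nchars / nwords, 2))
}
# One such table per collection, e.g. for the blogs files:
do.call(rbind, lapply(Sys.glob("final/*/*.blogs.txt"), count_file))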

Exploratory Analysis 2: General remarks on the datasets

The dataset is rather messy and generally poor in metadata: there is no per-document metadata such as authorship or creation date. Visual inspection of randomly chosen documents indicates that most items are less than one page long; the content mostly covers everyday matters, written in a conversational tone. There are a few irrelevant outliers with respect to document length.

Any interesting metadata (such as sentence counts or word types) must be created manually, by parsing the entire text body or by sampling. In doing so, it becomes clear that the file creators have inserted a few documents written in a foreign language or in foreign (e.g. Asian) character sets. There are some ASCII null characters in very few documents (about 3 of 370,000), and there are probably a few “multiline strings”.
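
As a sketch of how such problem documents could be flagged, the snippet below scans one file for non-Latin scripts via Unicode character properties (the file path and the particular scripts checked are only illustrative):

# readLines() warns about embedded nul bytes unless skipNul = TRUE is set
lines <- readLines("final/de_DE/de_DE.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
# Flag documents containing CJK or Cyrillic characters via PCRE Unicode properties
foreign <- grepl("\\p{Han}|\\p{Hiragana}|\\p{Katakana}|\\p{Cyrillic}", lines, perl = TRUE)
sum(foreign)            # how many suspicious documents there are
head(which(foreign))    # indices of a few of them, for visual inspection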

Exploratory Analysis 3: Basics of the German Blog Posts dataset

This is some metadata from the German Blog Posts data file:

## data frame with 0 columns and 371434 rows
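
The term counts that follow are based on a corpus crp and a term-document matrix tdm, whose construction is not shown. Here is a minimal sketch of one way to build them with the tm package (the file path and the one-document-per-line reading are assumptions):

library(tm)
lines <- readLines("final/de_DE/de_DE.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
crp   <- VCorpus(VectorSource(lines), readerControl = list(language = "de"))
crp   <- tm_map(crp, content_transformer(tolower))   # lowercase all terms
tdm   <- TermDocumentMatrix(crp)                     # terms x documents, sparse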

A few (n=50) of the most frequent words are shown here, ordered by decreasing frequency:

freq <- 50                                               # number of top terms to show
vv <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)   # term frequencies over the whole corpus

head(vv, freq)
##    und    die    der    ich    das    den  nicht    ist    mit    ein 
## 370883 352659 275362 217509 160219 121142 113817 109691 108717 101227 
##    von    sie   auch    auf   sich   eine    für    dem   dass   aber 
##  94684  92821  92741  86584  83668  77121  75061  66171  63992  63351 
##   noch    als    wie    man    wir    des   dann    mir    nur  einen 
##  58779  56584  55277  53786  48918  44818  43098  41913  41385  40436 
##   oder    was   mich    bei   wenn    hat    war   nach    aus   sind 
##  40281  40247  39305  39106  38396  37813  37718  37030  36703  34389 
##  schon   habe  einem    zum    mal wieder  einer   wird   kann    ihr 
##  32881  32618  31007  30816  30268  30092  29804  29455  26886  26837

To compare document lengths, here are summaries of the per-document word and sentence counts:

Word counts:

lenw <- sapply(crp, function(x) { meta(x, tag = "wcount") })   # per-document word counts stored as corpus metadata
summary(lenw)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    10.0    25.0    39.2    53.0  1870.0
hist(log10(lenw), main = "Histogram of word counts (semilog plot)", xlab = "log10(Words per document)", right = TRUE)

[Figure: histogram of word counts (semilog plot)]

Sentence counts:

lens <- sapply(crp, function(x) { meta(x, tag = "scount") })   # per-document sentence counts stored as corpus metadata
summary(lens)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    1.00    1.00    2.35    3.00   96.00
hist(log10(lens), main = "Histogram of sentence counts (semilog plot)", xlab = "log10(Sentences per document)", right = TRUE)

[Figure: histogram of sentence counts (semilog plot)]
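
The wcount and scount attributes used above are not standard tm metadata; they have to be attached per document beforehand. A rough sketch of how this could be done (the whitespace and punctuation heuristics are assumptions, not necessarily the method used here):

for (i in seq_along(crp)) {
  txt <- paste(content(crp[[i]]), collapse = " ")
  meta(crp[[i]], "wcount") <- length(strsplit(txt, "\\s+")[[1]])     # crude word count
  meta(crp[[i]], "scount") <- length(gregexpr("[.!?]+", txt)[[1]])   # crude sentence count
}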

Plotting the term frequencies demonstrates that the word frequency distribution follows a power law.

[Figure: term frequency distribution]

An alternative representation, a log-log plot, illustrates [Zipf’s law](http://en.wikipedia.org/wiki/Zipf's_law). The most frequent words in common use, such as (in English) “is”, “the”, “me”, etc., appear on the left side of the graph. They fall below the straight log-log line, meaning they are less frequent than expected; and they are so for a reason, because I have applied a stopword removal filter to the dataset. I have not removed punctuation, so there might still be many stopwords in the corpus (such as “is.”, “is,”, “,is”, etc.).

[Figure: Zipf plot of term frequencies (log-log scale)]

The coefficients of a straight-line fit on the log-log scale; the slope is close to -1, as expected under Zipf’s law:

## (Intercept)           x 
##      13.920      -1.004
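
A fit like this can be reproduced by regressing log frequency on log rank; tm also provides Zipf_plot(), which draws the log-log plot and returns the coefficients of such a fit. A sketch of the manual version:

# vv holds the sorted term frequencies computed above; a slope near -1 is Zipf-like
fit <- lm(log(vv) ~ log(seq_along(vv)))
coef(fit)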

For ease of understanding, here are a few word clouds for 1-word, 2-word, and 3-word phrases of the German blog posts data. Rest assured, dear Reader, that the English corpus does not look much more interesting.

[Figures: word clouds for 1-word, 2-word, and 3-word phrases]
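
As a sketch, a cloud of 2-word phrases could be produced with RWeka’s NGramTokenizer and the wordcloud package (the control settings and cutoffs below are assumptions):

library(RWeka)
library(wordcloud)
# Build a bigram term-document matrix by swapping in an n-gram tokenizer
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm2  <- TermDocumentMatrix(crp, control = list(tokenize = BigramTokenizer))
freq2 <- sort(rowSums(as.matrix(tdm2)), decreasing = TRUE)
wordcloud(names(freq2), freq2, max.words = 50, random.order = FALSE)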


Exploratory Analysis 4: Outlook

I plan to do this:

  1. Perform experiments with Unicode properties, or rather Unicode Character Categories, in order to find out how many “dirty” documents there are. There are very few, but they can still pollute data structures and models.
  2. Try interesting data structures. I’ll try out the tm “index” data frame that is associated with a corpus; it can be joined in interesting ways with a term-document matrix. I’ll also experiment more with n-grams.
  3. Study part-of-speech annotation of sentences and add more attributes about the grammatical structure to the corpus. The openNLP package can do this (a minimal sketch follows this list).
  4. Experiment with position-specific scoring matrices, also known as position weight matrices.
  5. Work with persistent corpora, just to know what it is like.
  6. Study more web-service offerings that provide preprocessed corpora (e.g. Google BigQuery, Microsoft Ngram Service).
  7. Do more with RWeka, e.g. check out the properties of the MultinomialNaiveBayes classifier.
  8. Evaluate performance metrics, cost matrices, and learning curves. For this it is necessary to store modeling results in a database; when the Weka Experimenter is run from R, it can store many metrics in a JDBC database.
  9. Learn more about natural language processing.
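
As a pointer for item 3, here is a minimal openNLP sketch for part-of-speech tagging (it uses the default English models; German text would additionally need a model package such as openNLPmodels.de):

library(NLP)
library(openNLP)
s <- as.String("This is a small example sentence for tagging.")
# Annotators are applied in order: sentences, then words, then POS tags
a <- annotate(s, list(Maxent_Sent_Token_Annotator(),
                      Maxent_Word_Token_Annotator(),
                      Maxent_POS_Tag_Annotator()))
words <- subset(a, type == "word")
sapply(words$features, `[[`, "POS")   # one tag per word annotation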