For reference, see
This document describes the basic metadata of the 12 data files provided, and then focuses on the German blog posts dataset. I think this is legitimate, because all twelve files are very similar, and because the instructions for this assignment say:
In this exercise, you will use the English database but may consider three other databases in German, Russian and Finnish.
There are 3 document collections (blogs, news, twitter), each given in four languages: German, English, Russian, and Finnish. This makes 4 * 3 = 12 files. The following table lists the number of lines (documents), words, and characters for each file, together with the average word length; a sketch of how such counts can be computed follows the table.
For processing purposes, each document within a collection is stored on a single line of its file. The 12 files are all quite similar, with the English-language files being the largest.
The average word length for Russian appears larger than for the other languages. This might be an artifact: Cyrillic characters take two bytes each in UTF-8, so byte-based character counts roughly double the apparent word length.
| File | Lines (documents) | Words | Characters | Avg. word length |
|------|------------------:|------:|-----------:|-----------------:|
| de_DE.blogs.txt | 371440 | 12652984 | 85459666 | 6.75 |
| en_US.blogs.txt | 899288 | 37334114 | 210160014 | 5.63 |
| fi_FI.blogs.txt | 439785 | 12731004 | 108503595 | 8.52 |
| ru_RU.blogs.txt | 337100 | 9405377 | 116855835 | 12.42 |
| de_DE.news.txt | 244743 | 13216346 | 95591959 | 7.23 |
| en_US.news.txt | 1010242 | 34365936 | 205811889 | 5.99 |
| fi_FI.news.txt | 485758 | 10444685 | 94234350 | 9.02 |
| ru_RU.news.txt | 196360 | 9115829 | 118996424 | 13.05 |
| de_DE.twitter.txt | 947774 | 11802976 | 75578341 | 6.40 |
| en_US.twitter.txt | 2360148 | 30359804 | 167105338 | 5.50 |
| fi_FI.twitter.txt | 285214 | 3152757 | 25331142 | 8.03 |
| ru_RU.twitter.txt | 881414 | 9223265 | 105182346 | 11.40 |
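These counts could be obtained along the following lines (a minimal sketch for a single file; the path and the whitespace-based word splitting are assumptions, not necessarily how the table above was produced):

f   <- "final/de_DE/de_DE.blogs.txt"                       # example path
txt <- readLines(f, encoding = "UTF-8", skipNul = TRUE)    # skipNul drops embedded NUL bytes

nlines   <- length(txt)                                              # one document per line
nwords   <- sum(vapply(strsplit(txt, "\\s+"), length, integer(1)))   # whitespace-delimited words
nchars   <- sum(nchar(txt, type = "bytes"))                          # byte counts; see the Cyrillic note above
avgwordl <- round(nchars / nwords, 2)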
The dataset is pretty messy, and it is generally poor in metadata: there is no document metadata such as authorship or creation date. Visual inspection of randomly chosen documents indicates that most items are less than one page long; the content mostly deals with everyday matters and is written in a conversational tone. A few documents are outliers with respect to length.
Any interesting metadata (such as sentence counts or word types) must be created manually, by parsing the entire text body or by sampling. Doing so makes it clear that the file creators have left in a few documents written in a foreign language or in a non-Latin (e.g. Asian) script. A handful of documents (about 3 out of 370,000) contain ASCII NUL characters, and there are probably a few “multiline strings” as well.
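Oddities like these could be located with checks along these lines (a sketch only; these particular checks are assumptions, not the report's actual code, and `txt` is the vector of documents read above):

bad_utf8 <- which(!validUTF8(txt))         # documents with invalid UTF-8 byte sequences
cjk      <- grep("[\u4e00-\u9fff]", txt)   # documents containing CJK characters
length(bad_utf8); length(cjk)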
This is the (rather empty) metadata data frame of the German blog posts corpus; it has one row per document, but no metadata columns have been attached:
## data frame with 0 columns and 371434 rows
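For context, a corpus `crp` and a term-document matrix `tdm` like the ones used below could be built roughly as follows (a sketch assuming the tm package; the whitespace- and punctuation-based counting is an assumption, chosen only to match the wcount/scount metadata tags queried further down):

library(tm)   # also loads NLP

crp <- VCorpus(VectorSource(txt))   # txt: one document per element, as read above

# attach simple per-document metadata: word and sentence counts
crp <- tm_map(crp, function(d) {
  s <- as.character(d)
  meta(d, "wcount") <- length(unlist(strsplit(s, "\\s+")))
  meta(d, "scount") <- length(unlist(strsplit(s, "[.!?]+")))
  d
})

# raw term-document matrix (no filtering yet), used for the frequency table below
tdm <- TermDocumentMatrix(crp)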
A few (n=50) of the most frequent words are shown here, ordered by decreasing frequency:
vv <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)   # total corpus frequency of each term
head(vv, 50)                                             # the 50 most frequent terms
## und die der ich das den nicht ist mit ein
## 370883 352659 275362 217509 160219 121142 113817 109691 108717 101227
## von sie auch auf sich eine für dem dass aber
## 94684 92821 92741 86584 83668 77121 75061 66171 63992 63351
## noch als wie man wir des dann mir nur einen
## 58779 56584 55277 53786 48918 44818 43098 41913 41385 40436
## oder was mich bei wenn hat war nach aus sind
## 40281 40247 39305 39106 38396 37813 37718 37030 36703 34389
## schon habe einem zum mal wieder einer wird kann ihr
## 32881 32618 31007 30816 30268 30092 29804 29455 26886 26837
To compare document lengths, here are summaries of the word and sentence counts per document.
Word counts:
lenw <- sapply(crp, function(x) meta(x, tag = "wcount"))   # per-document word counts
summary(lenw)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 10.0 25.0 39.2 53.0 1870.0
hist(log10(lenw), main = "Histogram of word counts (semilog plot)", xlab = "log10(words per document)", right = TRUE)
Sentence counts:
lens <- sapply(crp, function(x) meta(x, tag = "scount"))   # per-document sentence counts
summary(lens)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 1.00 1.00 2.35 3.00 96.00
hist(log10(lens), main = "Histogram of sentence counts (semilog plot)", xlab = "log10(sentences per document)", right = TRUE)
Plotting the term frequencies demonstrates that the word frequency distribution follows a power law.
An alternative representation is a log-log plot, in which a power law appears as a straight line; for word frequencies this relationship is known as [Zipf’s law](http://en.wikipedia.org/wiki/Zipf's_law). The most common words, such as (in English) “is”, “the”, “me”, etc., plot on the left side of the graph and fall below the straight log-log line, i.e. they are less frequent than expected. There is a reason for this: I have applied a stopword removal filter to the dataset. Since I have not removed punctuation, many stopword variants may still remain in the corpus (such as “is.”, “is,”, “,is”, etc.).
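The coefficients printed below come from a straight-line fit to the log-log data. A minimal sketch of how such a fit could be computed, assuming `tdm_stop` is a hypothetical term-document matrix built with the stopword filter mentioned above:

tdm_stop <- TermDocumentMatrix(crp, control = list(stopwords = stopwords("german")))
freqs    <- sort(rowSums(as.matrix(tdm_stop)), decreasing = TRUE)

x <- log10(seq_along(freqs))   # log10 of term rank
y <- log10(freqs)              # log10 of term frequency
coef(lm(y ~ x))                # a slope near -1 is what Zipf's law predicts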
## (Intercept) x
## 13.920 -1.004
For ease of understanding, here are a few word clouds for 1-word, 2-word, and 3-word phrases from the German blog posts data. Rest assured, dear reader, that the English corpus does not look much more interesting.
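A sketch of how such n-gram clouds could be produced (the tokenizer follows the usual NLP/tm n-gram recipe; the function names and plotting parameters here are assumptions, not the report's exact code). The wordcloud() call draws to the graphics device and returns NULL, which would explain why the printed list below contains only NULL entries:

library(wordcloud)

# build an n-gram tokenizer from the NLP utilities loaded with tm
ngram_tokenizer <- function(n) {
  function(x) unlist(lapply(ngrams(words(x), n), paste, collapse = " "),
                     use.names = FALSE)
}

plot_ngram_cloud <- function(corpus, n, max.words = 100) {
  tdm_n <- TermDocumentMatrix(corpus, control = list(tokenize = ngram_tokenizer(n)))
  f     <- sort(rowSums(as.matrix(tdm_n)), decreasing = TRUE)
  wordcloud(names(f), f, max.words = max.words, random.order = FALSE)
}

lapply(1:3, function(n) plot_ngram_cloud(crp, n))   # 1-, 2-, and 3-gram clouds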
## [[1]]
## NULL
##
## [[2]]
## NULL
##
## [[3]]
## NULL
I plan to do this: