This report provides an exploratory analysis of the text data provided in Coursera’s Data Science Capstone Project. The original data used in our analysis comes from a corpus called HC Corpora and was downloaded from Coursera’s website. It consists of 1.31 GB of text in 4 languages (English, German, Finnish and Russian) collected from 3 kinds of sources (blogs, news and twitter). We first assess the data using a small sample of 500 sentences extracted from each English corpus, then carry out our exploratory analysis on a larger sample of some 42,000 sentences (1% of the full English corpus). We summarize the data, find the most frequent words and explore the relation between vocabulary size and text coverage. We conclude by presenting our plans for a prediction model and an app to be uploaded to the ShinyApps web service.
Sampling strategy
Given such a large dataset (1.31 GB, compressed into a 0.55 GB zip file), we devised a strategy to sample the data while keeping the original file in its compressed form. Our sampling method allows us to set the proportion of lines to be sampled (in other words, the probability that each line is selected) and, optionally, the maximum number of lines in the sample. The method also accepts a numeric id, so the sample can be recreated as needed, following reproducible research guidelines. The numeric id (user-specified or randomly generated) used to create the sample is appended to the sample file name. All source code is available at this GitHub repository.
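The sampling idea described above can be sketched in a few lines of base R. This is not the code from the repository: `sample_lines` and its arguments are hypothetical names, and the demo reads a small gzip file created on the fly rather than the real archive. It only illustrates per-line Bernoulli sampling from a compressed connection with a reusable seed.

```r
# Hypothetical helper (not the report's cpskSampleData): keep each line with
# probability `ratio`, reading the connection in chunks so the whole text is
# never held in memory at once.
sample_lines <- function(con, ratio, max_lines = Inf, seed = NULL,
                         chunk_size = 10000L) {
  # Draw a seed if none is given, so the sample can be recreated later.
  if (is.null(seed)) seed <- sample.int(.Machine$integer.max, 1L)
  set.seed(seed)
  if (!isOpen(con)) {
    open(con, "rt")
    on.exit(close(con))
  }
  kept <- character(0)
  repeat {
    chunk <- readLines(con, n = chunk_size, warn = FALSE)
    if (length(chunk) == 0L) break
    # One uniform draw per line: a line is kept when its draw falls below ratio.
    kept <- c(kept, chunk[runif(length(chunk)) < ratio])
    if (length(kept) >= max_lines) {
      kept <- head(kept, max_lines)
      break
    }
  }
  list(seed = seed, lines = kept)
}

# Demo on a throwaway compressed file (assumed data, 1000 dummy lines).
tmp <- tempfile(fileext = ".txt.gz")
gz <- gzfile(tmp, "wt")
writeLines(sprintf("line %d", 1:1000), gz)
close(gz)

s1 <- sample_lines(gzfile(tmp), ratio = 0.05, seed = 42L)
s2 <- sample_lines(gzfile(tmp), ratio = 0.05, seed = 42L)
identical(s1$lines, s2$lines)  # the same seed recreates the same sample
```

For a zip archive, a connection created with base R's `unz()` (archive path plus the path of the member inside it) could presumably be passed in place of `gzfile()`, which would leave the archive compressed on disk.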
Getting familiar with the data
Our first step is to extract a very small sample of 500 sentences from each corpus to get a rough idea of the data we’re dealing with. We’re interested in basic features such as the total number of words and the number of unique words in each corpus, the average word and sentence length, and the average number of words in a sentence. Later on, in our exploratory analysis, we’ll use a larger sample.
set.seed(14112014)
corpusList <- c("blogs", "news", "twitter")
names(corpusList) <- corpusList
sample1 <- list(corpusFiles = lapply(corpusList, cpskSampleData, size = 500))
sample1$tokens <- lapply(sample1$corpusFiles, cpskTokensFromFile, ngramOrder=1L)
sample1$sentencestats <- lapply(sample1$tokens, cpskSentenceStats)
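The `cpsk*` helpers used above live in the author's repository and are not reproduced in this report. As a rough illustration of the kind of statistics involved, here is a minimal sketch that assumes plain whitespace tokenization; `sentence_stats` and the toy sentences are hypothetical, and the real tokenizer clearly does more (for example, it splits off punctuation).

```r
# Hypothetical stand-in for the report's statistics helpers.
# Assumption: tokens are separated by whitespace only.
sentence_stats <- function(sentences) {
  tokens <- strsplit(trimws(sentences), "\\s+")
  words <- unlist(tokens)
  c(sentences              = length(sentences),
    total.words            = length(words),
    unique.words           = length(unique(tolower(words))),
    avg.words.per.sentence = mean(lengths(tokens)),
    avg.chars.per.sentence = mean(nchar(sentences)),
    avg.word.length        = mean(nchar(words)),
    max.word.length        = max(nchar(words)))
}

toy <- c("The cat sat on the mat", "Dogs bark", "the end")
toy_stats <- sentence_stats(toy)
round(toy_stats, 1)
```

Applying such a function to each corpus with `lapply` and binding the results would give a table shaped like the summaries in this report.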
Initial sample files
as.matrix(unlist(lapply(sample1$corpusFiles, basename)))
## [,1]
## blogs "en_US.blogs.98AD5F0F.500.txt"
## news "en_US.news.A5C925E6.500.txt"
## twitter "en_US.twitter.D903B313.500.txt"
Summaries of 500-sentence samples of each corpus
(sample1Stats <- t(round(cpskCorpusStats(sample1), 1)))
## blogs news twitter
## sentences 500.0 500.0 500.0
## total.words 20588.0 20000.0 7486.0
## unique.words 4771.0 5380.0 2155.0
## avg.words.per.sentence 41.2 40.0 15.0
## avg.chars.per.sentence 205.4 210.4 69.0
## avg.word.length 4.0 4.3 3.7
## max.word.length 78.0 21.0 45.0
As expected, considering that all tweets are limited to 140 characters, the twitter corpus sample has the smallest average number of words and characters per sentence. It also has the smallest number of unique words, which suggests it needs a smaller vocabulary.
On the other hand, most blogs and news statistics are similar, except for the maximum word length. The longest word in the news corpus sample is 21 characters long, while the blogs corpus sample has an impressive 78-character(!) word. We will check out the 5 longest words in each sample.
Long words
lapply(sample1$tokens, function(x) x[order(nchar(x), decreasing=TRUE)[1:5]])
## $blogs
## [1] "www.birthersummit.org/news/73-was-baby-virginia-sunaharas-identity-stolen.html"
## [2] "ahhhhhaaaaaaaaaaaaaaaaaaaaaa"
## [3] "definitions-particularly"
## [4] "means-plus-function"
## [5] "american!bandstand"
##
## $news
## [1] "stamping-and-assembly" "better-than-expected" "directing-producing"
## [4] "immigrant-shooting" "guitarist/vocalist"
##
## $twitter
## [1] "www.chea.org/about/2010ac/2010_acis_final.asp"
## [2] "he'sthegluethatheldtheshowtogether"
## [3] "www.youtube.com/kaebelltunes"
## [4] "dc/maryland/virginia"
## [5] "rumdiarieshouseparty"
It turns out that very long words are not likely to be real words after all. We’ll have to choose how to treat them: we could replace them with class tokens such as <url>, <email> and <hashtag> to mark their places in a sentence, or just get rid of them if they don’t contribute to our next-word prediction task.
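One way the class-token idea could look in practice is sketched below. The function name and the regular expressions are assumptions made for illustration, not the rules used in the project; in particular, the hashtag rule only works if the tokenizer leaves the leading `#` attached to the tag.

```r
# Hypothetical sketch: map whole tokens that look like URLs, e-mail addresses
# or hashtags to a class token. The patterns are deliberately simple.
replace_class_tokens <- function(tokens) {
  tokens <- sub("^(https?://|www\\.)\\S+$", "<url>", tokens, perl = TRUE)
  tokens <- sub("^[[:alnum:]._%+-]+@[[:alnum:].-]+\\.[[:alpha:]]{2,}$",
                "<email>", tokens, perl = TRUE)
  tokens <- sub("^#\\w+$", "<hashtag>", tokens, perl = TRUE)
  tokens
}

classed <- replace_class_tokens(c("www.youtube.com/kaebelltunes",
                                  "someone@example.com",
                                  "#rumdiaries",
                                  "hello"))
classed
```

Tokens that match none of the patterns pass through unchanged, so the step can be applied to a whole token vector before counting frequencies.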
For our exploratory analysis, we’ll take a larger sample of each corpus. These new samples comprise 1% of all sentences of the original dataset.
sample2 <- list(corpusFiles = lapply(corpusList, cpskSampleData, ratio=0.01))
sample2$tokens <- lapply(sample2$corpusFiles, cpskTokensFromFile, ngramOrder=1L)
sample2$sentencestats <- lapply(sample2$tokens, cpskSentenceStats)
as.matrix(unlist(lapply(sample2$corpusFiles, basename)))
## [,1]
## blogs "en_US.blogs.7D3DB9C6.txt"
## news "en_US.news.1FDEF851.txt"
## twitter "en_US.twitter.84DA6F3D.txt"
(sample2Stats <- t(round(cpskCorpusStats(sample2), 1)))
## blogs news twitter
## sentences 8910.0 10113.0 23519.0
## total.words 423520.0 400599.0 361856.0
## unique.words 30948.0 34083.0 27133.0
## avg.words.per.sentence 47.5 39.6 15.4
## avg.chars.per.sentence 236.7 207.6 70.7
## avg.word.length 4.0 4.3 3.7
## max.word.length 151.0 73.0 85.0
Based on the number of sentences in our 1% samples, we estimate that the full dataset contains around 891,000 sentences in the blogs corpus, 1,011,300 sentences in news and 2,351,900 sentences in twitter. As we can see below, between 7.3% and 8.5% of all words in each corpus are unique, giving us an upper-bound estimate of each vocabulary size.
Proportion of unique words
sample2Stats["unique.words", ]/sample2Stats["total.words", ]
## blogs news twitter
## 0.07307 0.08508 0.07498
sum(sample2Stats["unique.words", ])/sum(sample2Stats["total.words", ])
## [1] 0.07771
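The arithmetic behind these estimates is simple enough to check by hand. The sketch below reuses the counts printed in the 1% sample summary; the variable names are ours, not the report's.

```r
# Counts copied from the 1% sample summary above.
sampled_sentences <- c(blogs = 8910, news = 10113, twitter = 23519)
unique_words      <- c(blogs = 30948, news = 34083, twitter = 27133)
total_words       <- c(blogs = 423520, news = 400599, twitter = 361856)
sampling_ratio    <- 0.01

# Scale the sampled sentence counts up to the full corpus.
estimated_sentences <- round(sampled_sentences / sampling_ratio)

# Share of distinct tokens among all tokens, per corpus.
unique_share <- unique_words / total_words

estimated_sentences
round(unique_share, 5)
```

The share of unique tokens typically shrinks as more text is sampled, because frequent words keep repeating while new words arrive more slowly, which is why the figure is read here as an upper bound.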
In practice, a vocabulary will be much smaller because we will want to remove garbage, ill-formed and meaningless words, group multiple forms of the same word, group some words (like numbers, dates, URLs) under the same token/class, fix typos and apply other kinds of text cleaning and preprocessing. We’ll probably reduce it even further, trading vocabulary size for storage and memory efficiency. As an example, the very long words/tokens shown below are potential candidates to be removed or grouped into classes.
lapply(sample2$tokens, function(x) x[order(nchar(x), decreasing=TRUE)[1:2]])
## $blogs
## [1] "www.prnewswire.com/news-releases/philip-morris-limited-files-high-court-challenge-against-the-australian-government-over-plain-packaging-135897478.html"
## [2] "globalvoicesonline.org/2011/04/15/bloggers-react-to-demeaning-article-about-ghana"
##
## $news
## [1] " ________________________________________________________________________"
## [2] "www.rcscw.com/recreation_centers/receationcenters.htm"
##
## $twitter
## [1] "www.details.com/culture-trends/critical-eye/201204/rust-belt-revival-detroit-michigan"
## [2] "decidedtomakealonghashtagandithinkimdoingprettywellatit"
Tokenization and word frequency
Tokenization is “the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements”[2]. Its results, as the name suggests, are not words in the literary sense but tokens. In this section, even where we say words, we really mean tokens: the pieces of text produced by our (simple) tokenization process.
We can see in the histogram further below that the distribution of word frequencies across all three corpora combined is highly skewed, with many (or most) words appearing only rarely in the texts that form the corpora and a few words appearing many, many times. To better visualize it, we have transformed word frequencies to a log scale.
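To make the idea of a simple tokenizer concrete, here is a minimal sketch. It is not the tokenizer used in this report: `simple_tokenize` and its pattern are assumptions, chosen only so that the output resembles the tokens seen in this report (lowercase, punctuation marks as separate tokens, URL-like strings kept whole). The pattern is ASCII-oriented.

```r
# Hypothetical tokenizer: lowercases the text, keeps runs of letters/digits
# joined by ' . / or - as one token, and emits every other non-space
# character as a token of its own.
simple_tokenize <- function(text) {
  pattern <- "[[:alnum:]]+(?:['./-][[:alnum:]]+)*|[^[:space:][:alnum:]]"
  lowered <- tolower(text)
  unlist(regmatches(lowered, gregexpr(pattern, lowered, perl = TRUE)))
}

toks <- simple_tokenize("He said: don't stop!")
toks
```

A real tokenizer would also have to decide how to handle non-ASCII letters, emoticons and sentence boundaries, which this sketch ignores.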
tokenFreq <- lapply(sample2$tokens, cpskTokenFreq)
# token frequency in combined corpus
tokenFreqComb <- aggregate(freq ~ token, do.call(rbind, tokenFreq), sum)
tokenFreqComb <- tokenFreqComb[order(tokenFreqComb$freq, decreasing=TRUE), ]
rownames(tokenFreqComb) <- NULL
ggplot(tokenFreqComb, aes(x=log(freq))) +
  geom_histogram(binwidth = 0.5, fill="darkolivegreen3") +
  scale_x_continuous(breaks=seq(0, 12, 2)) +
  labs(title= expression(atop("Histogram of token frequencies",
                              scriptstyle("All corpora (1% sample)"))),
       x="log(token frequency)",
       y="Count") +
  guides(fill=FALSE) +
  theme_minimal() +
  theme(plot.title=element_text(size=18, face="bold"))
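The frequency tables used above come from the `cpskTokenFreq` helper, which is not shown here. A minimal base R sketch of the same idea follows; `token_freq` and the toy token vectors are hypothetical, and the combination step mirrors the `aggregate` call above.

```r
# Hypothetical stand-in for a token frequency helper: count each distinct
# token and return the counts sorted from most to least frequent.
token_freq <- function(tokens) {
  counts <- sort(table(tokens), decreasing = TRUE)
  data.frame(token = names(counts),
             freq  = as.integer(counts),
             stringsAsFactors = FALSE)
}

# Toy per-corpus token vectors (assumed data).
toy_tokens <- list(blogs = c("the", "cat", "the", ".", "the"),
                   news  = c("the", "he", "said", ".", "."))
toy_freq <- lapply(toy_tokens, token_freq)

# Sum the per-corpus counts to get the combined frequencies.
toy_comb <- aggregate(freq ~ token, do.call(rbind, toy_freq), sum)
toy_comb <- toy_comb[order(toy_comb$freq, decreasing = TRUE), ]
rownames(toy_comb) <- NULL
toy_comb
```

Even on this toy input the counts are uneven, with one token accounting for 4 of the 10 occurrences; the log transform in the histogram is what makes that kind of skew readable on real data.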
Most frequent words
In the table below, we can see the 20 most frequent words in each corpus and in all corpora combined. Many of these words (or tokens) are common to every top 20 list, like "the", "to", "and", "a", "of" and some punctuation marks, but there are some odd words in there.
In both the blogs and twitter corpora, "I" and "my" are among the 20 most frequent words, but they don’t make it into the news top 20 list. This might be related to the highly personal nature of social media like blogs and tweets. Conversely, the news top 20 list has secured places for "he" and "said", giving us a hint of the kind of content that makes the news.
top20comb <- topTokens(20)
top20corpus <- topTokens(20, FALSE)
knitr::kable(cbind(all.token=top20comb$token, all.freq=top20comb$freq, do.call(cbind, top20corpus)), format = "pandoc", align = rep("r", 8))
all.token | all.freq | blogs.token | blogs.freq | news.token | news.freq | twitter.token | twitter.freq |
---|---|---|---|---|---|---|---|
. | 66113 | . | 21138 | . | 20135 | . | 24840 |
the | 47632 | the | 18673 | , | 19983 | ! | 12464 |
, | 44803 | , | 17553 | the | 19730 | the | 9229 |
to | 27592 | and | 10815 | to | 9083 | to | 7774 |
and | 23986 | to | 10735 | and | 8890 | , | 7267 |
a | 23437 | a | 8897 | a | 8656 | i | 7142 |
of | 20222 | of | 8692 | of | 7888 | a | 5884 |
in | 16522 | i | 7564 | “ | 7843 | you | 5454 |
i | 16305 | in | 5854 | in | 6875 | and | 4281 |
“ | 14657 | that | 4473 | that | 3503 | ? | 4055 |
! | 14432 | is | 4373 | for | 3480 | : | 3855 |
for | 10814 | it | 4090 | is | 2843 | for | 3799 |
is | 10802 | “ | 3613 | on | 2711 | in | 3793 |
that | 10282 | for | 3535 | said | 2539 | of | 3642 |
you | 9321 | you | 2956 | with | 2505 | is | 3586 |
it | 9100 | on | 2758 | was | 2348 | “ | 3201 |
on | 8216 | with | 2754 | he | 2294 | my | 2885 |
with | 6984 | my | 2714 | it | 2166 | it | 2844 |
was | 6215 | was | 2714 | at | 2120 | on | 2747 |
: | 6181 | this | 2588 | - | 1994 | # | 2567 |
Text coverage and vocabulary size
In this final section of our exploratory analysis, we take a look at the proportion of the three corpora covered by different vocabulary sizes. By keeping only each corpus’s most frequent words, up to a specified vocabulary size, we try to find a good trade-off between the efficiency and the accuracy of our future model.
Confirming the skewness of the word frequency distribution, we can see that the 20 most frequent words cover more than 33% of each corpus, and of all three combined.
unlist(lapply(top20corpus, function(x) sum(x$freq)))/sample2Stats["total.words", ]
## blogs news twitter
## 0.3459 0.3435 0.3352
sum(top20comb$freq)/sum(sample2Stats["total.words", ])
## [1] 0.3319
The table and plot below show the vocabulary size needed to cover from 50% to 90% of the sampled corpora.
coverages <- seq(0.5, 0.9, 0.05)
names(coverages) <- paste0(coverages * 100, "%")
covdf <- cbind(all=sapply(coverages, coverage),
do.call(rbind, lapply(coverages, function(x) unlist(coverage(x, full=FALSE)))))
covdf
## all blogs news twitter
## 50% 86 73 104 73
## 55% 136 111 185 107
## 60% 223 175 326 160
## 65% 383 296 553 243
## 70% 652 517 912 381
## 75% 1110 899 1472 629
## 80% 1901 1561 2378 1096
## 85% 3383 2796 4001 2017
## 90% 6651 5405 7205 4075
covplot <- melt(covdf, measure.vars = names(covdf))
names(covplot) <- c("coverage", "corpus", "words")
ggplot(covplot, aes(x=coverage, y=words, group=corpus, color=corpus)) +
  geom_line(size=1) + geom_point(size=4) +
  labs(title="Vocabulary size and text coverage",
       x="Coverage (%)",
       y="Number of words",
       color="Corpus") +
  guides(fill=FALSE) +
  theme_minimal() +
  theme(plot.title=element_text(size=16, face="bold"),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank())
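The `coverage` helper called above is not shown in this report. One plausible way to compute the same quantity is to sort token frequencies in decreasing order, accumulate them, and find the first rank at which the cumulative share reaches the target. The sketch below does that with a hypothetical function name and made-up frequencies.

```r
# Hypothetical sketch: smallest number of most-frequent tokens whose combined
# frequency reaches `target` (a proportion between 0 and 1) of all tokens.
vocab_for_coverage <- function(freq, target) {
  cum_share <- cumsum(sort(freq, decreasing = TRUE)) / sum(freq)
  which(cum_share >= target)[1]
}

# Toy frequencies (assumed data): one dominant token and a short tail.
toy_freq <- c(50, 30, 10, 5, 5)
targets <- c("50%" = 0.5, "85%" = 0.85, "99%" = 0.99)
toy_cov <- sapply(targets, vocab_for_coverage, freq = toy_freq)
toy_cov
```

Because the frequencies are so skewed, each extra step of coverage needs many more words than the previous one, which matches the steep growth seen in the table above.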
Our future work will focus on building n-gram-based statistical language models trained on a much larger sample (60%) of the original dataset, pruning these models with a validation set (20%) and choosing the best model based on a test set (20%) made of sentences never used in training or validation. When building these models, we’ll try to optimize storage and memory use, keeping in mind the constraints of the free ShinyApps platform.
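A 60/20/20 split of this kind could be drawn as sketched below. This is only an illustration of the plan: `split_corpus`, its seed argument and the dummy sentences are assumptions, not code from the project.

```r
# Hypothetical sketch: shuffle sentence indices once, then cut them into
# disjoint training, validation and test sets.
split_corpus <- function(sentences, seed = 1L, train = 0.6, validation = 0.2) {
  set.seed(seed)
  n <- length(sentences)
  idx <- sample.int(n)
  n_train <- round(train * n)
  n_valid <- round(validation * n)
  pos <- seq_len(n)
  list(train      = sentences[idx[pos <= n_train]],
       validation = sentences[idx[pos > n_train & pos <= n_train + n_valid]],
       test       = sentences[idx[pos > n_train + n_valid]])
}

parts <- split_corpus(sprintf("sentence %d", 1:100), seed = 2014L)
sapply(parts, length)
```

Shuffling once and slicing keeps the three sets disjoint, so no test sentence can leak into training or validation.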
[1] COURSERA. Capstone Dataset. Available online at: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
[2] WIKIPEDIA. Tokenization (lexical analysis). Available online at: http://en.wikipedia.org/wiki/Tokenization_(lexical_analysis)
[3] McENERY, T.; HARDIE, A. Corpus Linguistics: Method, Theory and Practice. Cambridge: Cambridge University Press, 2012. Available online at: http://corpora.lancs.ac.uk/clmtp/index.php
[4] NATION, P.; WARING, R. Vocabulary Size, Text Coverage and Word Lists. Available online at: http://www.fltr.ucl.ac.be/fltr/germ/etan/bibs/vocab/cup.html
© 2014, Paulo Jean