Milestone report

November 2014

     

Executive Summary

This report provides an exploratory analysis of the text data provided for Coursera’s Data Science Capstone Project. The original data used in our analysis comes from a collection called HC Corpora and was downloaded from Coursera’s website. It consists of 1.31 GB of text in 4 languages (English, German, Finnish and Russian), collected from 3 kinds of sources (blogs, news and twitter). We initially assess the data based on a small sample of 500 sentences extracted from each English corpus, then conduct our exploratory analysis on a larger sample of some 42,000 sentences (1% of the full English dataset). We summarize the data, find the most frequent words and explore the relation between vocabulary size and text coverage. We conclude by presenting our plans for a prediction model and an app to be deployed to the ShinyApps web service.

Introduction

Sampling strategy

Given such a large dataset (1.31 GB of text compressed into a 0.55 GB zip file), we devised a strategy to sample the data while keeping the original file in its compressed form. Our sampling method allows us to set the proportion of lines to be sampled (put another way, the probability that each line has of being selected) and, optionally, the maximum number of lines in the sample. The method also accepts a numeric id, so the sample can be recreated as needed, following reproducible research guidelines. The numeric id (user-specified or randomly generated) used to create the sample is appended to the sample file name. All source code is available at this Github repository.
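To illustrate the idea, the sketch below is a simplified, hypothetical version of such a sampler; the actual cpskSampleData function used in this report lives in the repository and handles more details.

# Hypothetical sketch of the line-sampling idea (simplified; not the actual
# cpskSampleData implementation)
sampleLines <- function(zipFile, fileName, ratio = 0.01, maxLines = Inf,
                        id = sample.int(.Machine$integer.max, 1)) {
  set.seed(id)                       # the numeric id makes the sample reproducible
  con <- unz(zipFile, fileName)      # read straight from the compressed archive
  lines <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
  close(con)
  keep <- lines[runif(length(lines)) < ratio]   # keep each line with probability = ratio
  if (is.finite(maxLines) && length(keep) > maxLines) keep <- keep[seq_len(maxLines)]
  outFile <- sprintf("%s.%X.txt", sub("\\.txt$", "", basename(fileName)), id)
  writeLines(keep, outFile)          # the id is appended to the sample file name
  outFile
}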

Getting familiar with the data

Our first step is to extract a very small sample of 500 sentences from each corpus to get a rough idea of the data we’re dealing with. We’re interested in basic features like the total number of words and the number of unique words in each corpus, the average word and sentence length, and the average number of words per sentence. Later on, in our exploratory analysis, we’ll use a larger sample.

set.seed(14112014)
corpusList <- c("blogs", "news", "twitter")
names(corpusList) <- corpusList
# 500-sentence sample from each English corpus, then unigram tokens and
# per-sentence statistics for each sample
sample1 <- list(corpusFiles = lapply(corpusList, cpskSampleData, size = 500))
sample1$tokens <- lapply(sample1$corpusFiles, cpskTokensFromFile, ngramOrder=1L)
sample1$sentencestats <- lapply(sample1$tokens, cpskSentenceStats)

Initial sample files

as.matrix(unlist(lapply(sample1$corpusFiles, basename)))
##         [,1]                            
## blogs   "en_US.blogs.98AD5F0F.500.txt"  
## news    "en_US.news.A5C925E6.500.txt"   
## twitter "en_US.twitter.D903B313.500.txt"

Summaries of 500-sentence samples of each corpus

(sample1Stats <- t(round(cpskCorpusStats(sample1), 1)))
##                          blogs    news twitter
## sentences                500.0   500.0   500.0
## total.words            20588.0 20000.0  7486.0
## unique.words            4771.0  5380.0  2155.0
## avg.words.per.sentence    41.2    40.0    15.0
## avg.chars.per.sentence   205.4   210.4    69.0
## avg.word.length            4.0     4.3     3.7
## max.word.length           78.0    21.0    45.0

As expected, considering that tweets are limited to 140 characters, the twitter corpus sample has the smallest average number of words and characters per sentence. It also shows the smallest number of unique words, and thus requires a smaller vocabulary.

On the other hand, most blogs and news statistics are similar, except for the maximum word length. The longest word in the news corpus sample is 21 characters long, while the blogs corpus sample has an impressive 78-character(!) word. Let’s check out the 5 longest words in each sample.

Long words

lapply(sample1$tokens, function(x) x[order(nchar(x), decreasing=TRUE)[1:5]])
## $blogs
## [1] "www.birthersummit.org/news/73-was-baby-virginia-sunaharas-identity-stolen.html"
## [2] "ahhhhhaaaaaaaaaaaaaaaaaaaaaa"                                                  
## [3] "definitions-particularly"                                                      
## [4] "means-plus-function"                                                           
## [5] "american!bandstand"                                                            
## 
## $news
## [1] "stamping-and-assembly" "better-than-expected"  "directing-producing"  
## [4] "immigrant-shooting"    "guitarist/vocalist"   
## 
## $twitter
## [1] "www.chea.org/about/2010ac/2010_acis_final.asp"
## [2] "he'sthegluethatheldtheshowtogether"           
## [3] "www.youtube.com/kaebelltunes"                 
## [4] "dc/maryland/virginia"                         
## [5] "rumdiarieshouseparty"

It turns out that very long words are not likely to be real words after all. We’ll have to choose how to treat them, perhaps replacing them with class tokens such as <url>, <email> and <hashtag> to mark their places in a sentence, or simply removing them if they don’t contribute to our next-word prediction task.
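As an illustration, this kind of replacement could be done with a few regular expressions applied to the token list. The patterns below are rough, hypothetical examples, not necessarily the ones we will end up using.

# Replace some recognizable non-word tokens with class tokens
# (illustrative patterns only; real preprocessing will need more careful regexes)
classifyTokens <- function(tokens) {
  tokens <- gsub("^(https?://|www\\.)\\S+$", "<url>", tokens)
  tokens <- gsub("^\\S+@\\S+\\.[a-z]{2,}$", "<email>", tokens, ignore.case = TRUE)
  tokens <- gsub("^#\\S+$", "<hashtag>", tokens)
  tokens
}
classifyTokens(c("www.youtube.com/kaebelltunes", "someone@example.com", "#capstone", "hello"))
# [1] "<url>"     "<email>"   "<hashtag>" "hello"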

Exploratory Analysis

For our exploratory analysis, we’ll take a larger sample of each corpus. These new samples comprise 1% of all sentences of the original dataset.

# 1% sample of each corpus, tokenized as before
sample2 <- list(corpusFiles = lapply(corpusList, cpskSampleData, ratio=0.01))
sample2$tokens <- lapply(sample2$corpusFiles, cpskTokensFromFile, ngramOrder=1L)
sample2$sentencestats <- lapply(sample2$tokens, cpskSentenceStats)
as.matrix(unlist(lapply(sample2$corpusFiles, basename)))
##         [,1]                        
## blogs   "en_US.blogs.7D3DB9C6.txt"  
## news    "en_US.news.1FDEF851.txt"   
## twitter "en_US.twitter.84DA6F3D.txt"
(sample2Stats <- t(round(cpskCorpusStats(sample2), 1)))
##                           blogs     news  twitter
## sentences                8910.0  10113.0  23519.0
## total.words            423520.0 400599.0 361856.0
## unique.words            30948.0  34083.0  27133.0
## avg.words.per.sentence     47.5     39.6     15.4
## avg.chars.per.sentence    236.7    207.6     70.7
## avg.word.length             4.0      4.3      3.7
## max.word.length           151.0     73.0     85.0

Based on the number of sentences in our samples, we estimate the full dataset at around 891,000 sentences in the blogs corpus, 1,011,300 in news and 2,351,900 in twitter. As we can see below, between 7.3% and 8.5% of the words in each corpus are unique, giving us an upper-bound estimate of each vocabulary size.
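The full-dataset figures above are simply the sampled sentence counts scaled up by the 1% sampling ratio:

# scale the sampled sentence counts up by the 1% sampling ratio
sample2Stats["sentences", ] / 0.01
# blogs: ~891,000   news: ~1,011,300   twitter: ~2,351,900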

Proportion of unique words

sample2Stats["unique.words", ]/sample2Stats["total.words", ]
##   blogs    news twitter 
## 0.07307 0.08508 0.07498
sum(sample2Stats["unique.words", ])/sum(sample2Stats["total.words", ])
## [1] 0.07771

In practice, the vocabulary will be much smaller, because we will want to remove garbage, ill-formed and meaningless words, group multiple forms of the same word, group some words (like numbers, dates and URLs) under the same token/class, fix typos, and perform other kinds of text cleaning and preprocessing. And we’ll probably reduce it even further in order to trade off storage and memory efficiency against vocabulary size. As an example, the very long words/tokens shown below are potential candidates to be removed or grouped into classes.

lapply(sample2$tokens, function(x) x[order(nchar(x), decreasing=TRUE)[1:2]])
## $blogs
## [1] "www.prnewswire.com/news-releases/philip-morris-limited-files-high-court-challenge-against-the-australian-government-over-plain-packaging-135897478.html"
## [2] "globalvoicesonline.org/2011/04/15/bloggers-react-to-demeaning-article-about-ghana"                                                                      
## 
## $news
## [1] " ________________________________________________________________________"
## [2] "www.rcscw.com/recreation_centers/receationcenters.htm"                    
## 
## $twitter
## [1] "www.details.com/culture-trends/critical-eye/201204/rust-belt-revival-detroit-michigan"
## [2] "decidedtomakealonghashtagandithinkimdoingprettywellatit"

Tokenization and word frequency

Tokenization is “the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements” [2]. Its results, as the name suggests, are not words in the strict sense, but tokens. In this section, even when we say words here and there, we really mean tokens: pieces of text produced by our (simple) tokenization process.
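To make this concrete, a tokenizer in the spirit of the one used in this report could be as simple as the hypothetical sketch below: lowercase the text, split off common punctuation as separate tokens, then split on whitespace. The actual tokenization is done by cpskTokensFromFile, available in the repository.

# A deliberately simple, hypothetical tokenizer
simpleTokenize <- function(sentence) {
  s <- tolower(sentence)
  s <- gsub("([.,!?:;])", " \\1 ", s)        # keep punctuation marks as their own tokens
  tokens <- unlist(strsplit(s, "\\s+"))
  tokens[tokens != ""]
}
simpleTokenize("The result, as its name suggests: tokens!")
# [1] "the" "result" "," "as" "its" "name" "suggests" ":" "tokens" "!"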

We can see in the histogram further below that the distribution of word frequencies in all three corpora combined is highly skewed, with many (or most) words appearing only rarely in the texts that form the corpus and a few words appearing many, many times. To better visualize this, we have transformed word frequencies to a log scale.

tokenFreq <- lapply(sample2$tokens, cpskTokenFreq)
# token frequency in combined corpus
tokenFreqComb <- aggregate(freq ~ token, do.call(rbind, tokenFreq), sum)
tokenFreqComb <- tokenFreqComb[order(tokenFreqComb$freq, decreasing=TRUE), ]
rownames(tokenFreqComb) <- NULL
ggplot(tokenFreqComb, aes(x=log(freq))) +
  geom_histogram(binwidth = 0.5, fill="darkolivegreen3") +
  scale_x_continuous(breaks=seq(0, 12, 2)) + 
  labs(title= expression(atop("Histogram of token frequencies",
                        scriptstyle("All corpora (1% sample)"))),       
       x="log(token frequency)",
       y="Count") +
  guides(fill=FALSE) +
  theme_minimal() +
  theme(plot.title=element_text(size=18, face="bold"))

[Figure: histogram of token frequencies, all corpora (1% sample)]

Most frequent words

In the table below, we can see the 20 most frequent words in each corpus and in all corpora combined. Many of these words (or tokens) are common to every top-20 list, like the, to, and, a, of and some punctuation marks, but there are a few odd entries as well.

In both the blogs and twitter corpora, I and my are among the 20 most frequent words, but they don’t make it into the news top-20 list. This might be related to the highly personal nature of social media like blogs and tweets. Conversely, the news top-20 list has secured places for he and said, giving us a hint of the kind of content that makes the news.
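The topTokens helper used below is defined in the repository; a hypothetical sketch of what such a helper might look like, based on the tokenFreq and tokenFreqComb objects computed above, is:

# Hypothetical sketch of a topTokens-style helper (not the actual implementation):
# the n most frequent tokens, either from the combined table or per corpus
topTokensSketch <- function(n, full = TRUE) {
  if (full) return(head(tokenFreqComb, n))
  lapply(tokenFreq, function(tf) head(tf[order(tf$freq, decreasing = TRUE), ], n))
}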

top20comb <- topTokens(20)
top20corpus <- topTokens(20, FALSE)
knitr::kable(cbind(all.token=top20comb$token, all.freq=top20comb$freq, do.call(cbind, top20corpus)), format = "pandoc", align = rep("r", 8))
| all.token | all.freq | blogs.token | blogs.freq | news.token | news.freq | twitter.token | twitter.freq |
| ---------:| --------:| -----------:| ----------:| ----------:| ---------:| -------------:| ------------:|
| . | 66113 | . | 21138 | . | 20135 | . | 24840 |
| the | 47632 | the | 18673 | , | 19983 | ! | 12464 |
| , | 44803 | , | 17553 | the | 19730 | the | 9229 |
| to | 27592 | and | 10815 | to | 9083 | to | 7774 |
| and | 23986 | to | 10735 | and | 8890 | , | 7267 |
| a | 23437 | a | 8897 | a | 8656 | i | 7142 |
| of | 20222 | of | 8692 | of | 7888 | a | 5884 |
| in | 16522 | i | 7564 |  | 7843 | you | 5454 |
| i | 16305 | in | 5854 | in | 6875 | and | 4281 |
|  | 14657 | that | 4473 | that | 3503 | ? | 4055 |
| ! | 14432 | is | 4373 | for | 3480 | : | 3855 |
| for | 10814 | it | 4090 | is | 2843 | for | 3799 |
| is | 10802 |  | 3613 | on | 2711 | in | 3793 |
| that | 10282 | for | 3535 | said | 2539 | of | 3642 |
| you | 9321 | you | 2956 | with | 2505 | is | 3586 |
| it | 9100 | on | 2758 | was | 2348 |  | 3201 |
| on | 8216 | with | 2754 | he | 2294 | my | 2885 |
| with | 6984 | my | 2714 | it | 2166 | it | 2844 |
| was | 6215 | was | 2714 | at | 2120 | on | 2747 |
| : | 6181 | this | 2588 | - | 1994 | # | 2567 |

Text coverage and vocabulary size

In this final section of our exploratory analysis, we take a look at the proportion of the three corpora covered by vocabularies of different sizes. By keeping only each corpus’s most frequent words, up to a specified vocabulary size, we try to find an optimal trade-off between the efficiency and the accuracy of our future model.

Confirming the skewness of the word frequency distribution, we can see that the 20 most frequent words cover more than 33% of each of the three corpora, and even of all three combined.

unlist(lapply(top20corpus, function(x) sum(x$freq)))/sample2Stats["total.words", ]
##   blogs    news twitter 
##  0.3459  0.3435  0.3352
sum(top20comb$freq)/sum(sample2Stats["total.words", ])
## [1] 0.3319

The table and plot below show the vocabulary size needed to cover from 50% to 90% of the sampled corpora.
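The coverage helper used below is defined in the repository; in essence, this kind of computation is a cumulative-sum lookup over the sorted frequency table, as in the hypothetical sketch:

# Hypothetical sketch: number of most frequent tokens needed to cover a given
# proportion of all token occurrences (freqTable sorted by decreasing frequency)
vocabSizeFor <- function(target, freqTable) {
  cumCov <- cumsum(freqTable$freq) / sum(freqTable$freq)
  which(cumCov >= target)[1]
}
vocabSizeFor(0.5, tokenFreqComb)   # vocabulary size for 50% coverage, combined corpora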

coverages <- seq(0.5, 0.9, 0.05)
names(coverages) <- paste0(coverages * 100, "%")
covdf <- cbind(all=sapply(coverages, coverage), 
               do.call(rbind, lapply(coverages, function(x) unlist(coverage(x, full=FALSE)))))
covdf
##      all blogs news twitter
## 50%   86    73  104      73
## 55%  136   111  185     107
## 60%  223   175  326     160
## 65%  383   296  553     243
## 70%  652   517  912     381
## 75% 1110   899 1472     629
## 80% 1901  1561 2378    1096
## 85% 3383  2796 4001    2017
## 90% 6651  5405 7205    4075
covplot <- melt(covdf, measure.vars = names(covdf))
names(covplot) <- c("coverage", "corpus", "words")

ggplot(covplot, aes(x=coverage, y=words, group=corpus, color=corpus)) + 
  geom_line(size=1) + geom_point(size=4) +
  labs(title="Vocabulary size and text coverage",
       x="Coverage (%)",
       y="Number of words", 
       color="Corpus") +
  guides(fill=FALSE) +
  theme_minimal() +
  theme(plot.title=element_text(size=16, face="bold"),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank())

[Figure: vocabulary size vs. text coverage, by corpus]

Further work

Our future work will focus on building n-gram-based statistical language models trained on a much larger sample (60%) of the original dataset, pruning these models with a validation set (20%) and choosing the best model based on a test set (20%) of sentences never used in training or validation. When building these models, we’ll try to optimize storage and memory utilization, keeping in mind the constraints of the free ShinyApps platform.
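As a rough, hypothetical sketch, the sentence-level split could be drawn along these lines:

# Hypothetical 60/20/20 sentence-level split into training, validation and test sets
splitCorpus <- function(sentences, seed = 14112014) {
  set.seed(seed)
  grp <- sample(c("train", "valid", "test"), length(sentences),
                replace = TRUE, prob = c(0.6, 0.2, 0.2))
  split(sentences, grp)
}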

References

[1] COURSERA. Capstone Dataset. Available online at: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

[2] WIKIPEDIA. Tokenization (lexical analysis). Available online at: http://en.wikipedia.org/wiki/Tokenization_(lexical_analysis)

[3] McENERY, T., HARDIE, A. Corpus Linguistics: Method, Theory and Practice. Cambridge: Cambridge University Press, 2012. Available online at: http://corpora.lancs.ac.uk/clmtp/index.php

[4] NATION, P., WARING, R. Vocabulary Size, Text Coverage and Word Lists. Available online at: http://www.fltr.ucl.ac.be/fltr/germ/etan/bibs/vocab/cup.html

© 2014, Paulo Jean