Milestone report

November 2014

     

Executive Summary

This report provides an exploratory analysis of the text data provided for Coursera’s Data Science Capstone Project. The original data used in our analysis comes from a collection called HC Corpora and was downloaded from Coursera’s website. It consists of 1.31 GB of text in 4 languages (English, German, Finnish and Russian), collected from 3 kinds of sources (blogs, news and twitter). We initially assess the data based on a small sample of 500 sentences extracted from each English corpus, then conduct our exploratory analysis on a larger sample of some 42,000 sentences (1% of the full English dataset). We summarize the data, find the most frequent words and explore the relation between vocabulary size and text coverage. We conclude by presenting our plans for a prediction model and an app to be deployed to the ShinyApps web service.

Introduction

Sampling strategy

Given such a large dataset (1.31 GB of text compressed into a 0.55 GB zip file), we devised a strategy to sample the data while keeping the original file in its compressed form. Our sampling method allows us to set the proportion of lines to be sampled (put another way, the probability that each line has of being selected) and, optionally, the maximum number of lines in the sample. The method also accepts a numeric id, so the sample can be recreated as needed, following reproducible research guidelines. The numeric id (user-specified or randomly generated) used to create the sample is appended to the sample file name. All source code is available at this Github repository.
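To illustrate the idea, the sketch below is a simplified, hypothetical version of such a sampler; the actual cpskSampleData function used in this report lives in the repository and handles more details.

# Hypothetical sketch of the line-sampling idea (simplified; not the actual
# cpskSampleData implementation)
sampleLines <- function(zipFile, fileName, ratio = 0.01, maxLines = Inf,
                        id = sample.int(.Machine$integer.max, 1)) {
  set.seed(id)                       # the numeric id makes the sample reproducible
  con <- unz(zipFile, fileName)      # read straight from the compressed archive
  lines <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
  close(con)
  keep <- lines[runif(length(lines)) < ratio]   # keep each line with probability = ratio
  if (is.finite(maxLines) && length(keep) > maxLines) keep <- keep[seq_len(maxLines)]
  outFile <- sprintf("%s.%X.txt", sub("\\.txt$", "", basename(fileName)), id)
  writeLines(keep, outFile)          # the id is appended to the sample file name
  outFile
}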

Getting familiar with the data

Our first step is to extract a very small sample of 500 sentences from each corpus to get a rough idea of the data we’re dealing with. We’re interested in basic features like the total number of words and the number of unique words in each corpus, the average word and sentence length, and the average number of words per sentence. Later on, in our exploratory analysis, we’ll use a larger sample.

set.seed(14112014)
corpusList <- c("blogs", "news", "twitter")
names(corpusList) <- corpusList
# 500-sentence sample from each English corpus, then unigram tokens and
# per-sentence statistics for each sample
sample1 <- list(corpusFiles = lapply(corpusList, cpskSampleData, size = 500))
sample1$tokens <- lapply(sample1$corpusFiles, cpskTokensFromFile, ngramOrder=1L)
sample1$sentencestats <- lapply(sample1$tokens, cpskSentenceStats)

Initial sample files

as.matrix(unlist(lapply(sample1$corpusFiles, basename)))
##         [,1]                            
## blogs   "en_US.blogs.98AD5F0F.500.txt"  
## news    "en_US.news.A5C925E6.500.txt"   
## twitter "en_US.twitter.D903B313.500.txt"

Summaries of 500-sentence samples of each corpus

(sample1Stats <- t(round(cpskCorpusStats(sample1), 1)))
##                          blogs    news twitter
## sentences                500.0   500.0   500.0
## total.words            20588.0 20000.0  7486.0
## unique.words            4771.0  5380.0  2155.0
## avg.words.per.sentence    41.2    40.0    15.0
## avg.chars.per.sentence   205.4   210.4    69.0
## avg.word.length            4.0     4.3     3.7
## max.word.length           78.0    21.0    45.0

As expected, considering that tweets are limited to 140 characters, the twitter corpus sample has the smallest average number of words and characters per sentence. It also shows the smallest number of unique words, and thus requires a smaller vocabulary.

On the other hand, most blogs and news statistics are similar, except for the maximum word length. The longest word in the news corpus sample is 21 characters long, while the blogs corpus sample has an impressive 78-character(!) word. Let’s check out the 5 longest words in each sample.

Long words

lapply(sample1$tokens, function(x) x[order(nchar(x), decreasing=TRUE)[1:5]])
## $blogs
## [1] "www.birthersummit.org/news/73-was-baby-virginia-sunaharas-identity-stolen.html"
## [2] "ahhhhhaaaaaaaaaaaaaaaaaaaaaa"                                                  
## [3] "definitions-particularly"                                                      
## [4] "means-plus-function"                                                           
## [5] "american!bandstand"                                                            
## 
## $news
## [1] "stamping-and-assembly" "better-than-expected"  "directing-producing"  
## [4] "immigrant-shooting"    "guitarist/vocalist"   
## 
## $twitter
## [1] "www.chea.org/about/2010ac/2010_acis_final.asp"
## [2] "he'sthegluethatheldtheshowtogether"           
## [3] "www.youtube.com/kaebelltunes"                 
## [4] "dc/maryland/virginia"                         
## [5] "rumdiarieshouseparty"

It turns out that very long words are not likely to be real words after all. We’ll have to choose how to treat them, perhaps replacing them with class tokens such as <url>, <email> and <hashtag> to mark their places in a sentence, or simply removing them if they don’t contribute to our next-word prediction task.
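As an illustration, this kind of replacement could be done with a few regular expressions applied to the token list. The patterns below are rough, hypothetical examples, not necessarily the ones we will end up using.

# Replace some recognizable non-word tokens with class tokens
# (illustrative patterns only; real preprocessing will need more careful regexes)
classifyTokens <- function(tokens) {
  tokens <- gsub("^(https?://|www\\.)\\S+$", "<url>", tokens)
  tokens <- gsub("^\\S+@\\S+\\.[a-z]{2,}$", "<email>", tokens, ignore.case = TRUE)
  tokens <- gsub("^#\\S+$", "<hashtag>", tokens)
  tokens
}
classifyTokens(c("www.youtube.com/kaebelltunes", "someone@example.com", "#capstone", "hello"))
# [1] "<url>"     "<email>"   "<hashtag>" "hello"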

Exploratory Analysis

For our exploratory analysis, we’ll take a larger sample of each corpus. These new samples comprise 1% of all sentences of the original dataset.

# 1% sample of each corpus, tokenized as before
sample2 <- list(corpusFiles = lapply(corpusList, cpskSampleData, ratio=0.01))
sample2$tokens <- lapply(sample2$corpusFiles, cpskTokensFromFile, ngramOrder=1L)
sample2$sentencestats <- lapply(sample2$tokens, cpskSentenceStats)
as.matrix(unlist(lapply(sample2$corpusFiles, basename)))
##         [,1]                        
## blogs   "en_US.blogs.7D3DB9C6.txt"  
## news    "en_US.news.1FDEF851.txt"   
## twitter "en_US.twitter.84DA6F3D.txt"
(sample2Stats <- t(round(cpskCorpusStats(sample2), 1)))
##                           blogs     news  twitter
## sentences                8910.0  10113.0  23519.0
## total.words            423520.0 400599.0 361856.0
## unique.words            30948.0  34083.0  27133.0
## avg.words.per.sentence     47.5     39.6     15.4
## avg.chars.per.sentence    236.7    207.6     70.7
## avg.word.length             4.0      4.3      3.7
## max.word.length           151.0     73.0     85.0

Based on the number of sentences in our samples, we estimate the full dataset at around 891,000 sentences in the blogs corpus, 1,011,300 in news and 2,351,900 in twitter. As we can see below, between 7.3% and 8.5% of the words in each corpus are unique, giving us an upper-bound estimate of each vocabulary size.
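The full-dataset figures above are simply the sampled sentence counts scaled up by the 1% sampling ratio:

# scale the sampled sentence counts up by the 1% sampling ratio
sample2Stats["sentences", ] / 0.01
# blogs: ~891,000   news: ~1,011,300   twitter: ~2,351,900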

Proportion of unique words

sample2Stats["unique.words", ]/sample2Stats["total.words", ]
##   blogs    news twitter 
## 0.07307 0.08508 0.07498
sum(sample2Stats["unique.words", ])/sum(sample2Stats["total.words", ])
## [1] 0.07771

In practice, the vocabulary will be much smaller, because we will want to remove garbage, ill-formed and meaningless words, group multiple forms of the same word, group some words (like numbers, dates and URLs) under the same token/class, fix typos, and perform other kinds of text cleaning and preprocessing. And we’ll probably reduce it even further in order to trade off storage and memory efficiency against vocabulary size. As an example, the very long words/tokens shown below are potential candidates to be removed or grouped into classes.

lapply(sample2$tokens, function(x) x[order(nchar(x), decreasing=TRUE)[1:2]])
## $blogs
## [1] "www.prnewswire.com/news-releases/philip-morris-limited-files-high-court-challenge-against-the-australian-government-over-plain-packaging-135897478.html"
## [2] "globalvoicesonline.org/2011/04/15/bloggers-react-to-demeaning-article-about-ghana"                                                                      
## 
## $news
## [1] " ________________________________________________________________________"
## [2] "www.rcscw.com/recreation_centers/receationcenters.htm"                    
## 
## $twitter
## [1] "www.details.com/culture-trends/critical-eye/201204/rust-belt-revival-detroit-michigan"
## [2] "decidedtomakealonghashtagandithinkimdoingprettywellatit"

Tokenization and word frequency

Tokenization is “the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements” [2]. Its results, as the name suggests, are not words in the strict sense, but tokens. In this section, even when we say words here and there, we really mean tokens: pieces of text produced by our (simple) tokenization process.
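To make this concrete, a tokenizer in the spirit of the one used in this report could be as simple as the hypothetical sketch below: lowercase the text, split off common punctuation as separate tokens, then split on whitespace. The actual tokenization is done by cpskTokensFromFile, available in the repository.

# A deliberately simple, hypothetical tokenizer
simpleTokenize <- function(sentence) {
  s <- tolower(sentence)
  s <- gsub("([.,!?:;])", " \\1 ", s)        # keep punctuation marks as their own tokens
  tokens <- unlist(strsplit(s, "\\s+"))
  tokens[tokens != ""]
}
simpleTokenize("The result, as its name suggests: tokens!")
# [1] "the" "result" "," "as" "its" "name" "suggests" ":" "tokens" "!"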

We can see in the histogram further below that the distribution of word frequencies in all three corpora combined is highly skewed, with many (or most) words appearing only rarely in the texts that form the corpus and a few words appearing many, many times. To better visualize this, we have transformed word frequencies to a log scale.

tokenFreq <- lapply(sample2$tokens, cpskTokenFreq)
# token frequency in combined corpus
tokenFreqComb <- aggregate(freq ~ token, do.call(rbind, tokenFreq), sum)
tokenFreqComb <- tokenFreqComb[order(tokenFreqComb$freq, decreasing=TRUE), ]
rownames(tokenFreqComb) <- NULL
ggplot(tokenFreqComb, aes(x=log(freq))) +
  geom_histogram(binwidth = 0.5, fill="darkolivegreen3") +
  scale_x_continuous(breaks=seq(0, 12, 2)) + 
  labs(title= expression(atop("Histogram of token frequencies",
                        scriptstyle("All corpora (1% sample)"))),       
       x="log(token frequency)",
       y="Count") +
  guides(fill=FALSE) +
  theme_minimal() +
  theme(plot.title=element_text(size=18, face="bold"))

[Figure: histogram of token frequencies, all corpora (1% sample)]

Most frequent words

In the table below, we can see the 20 most frequent words in each corpus and in all corpora combined. Many of these words (or tokens) are common to every top-20 list, like the, to, and, a, of and some punctuation marks, but there are a few odd entries as well.

In both the blogs and twitter corpora, I and my are among the 20 most frequent words, but they don’t make it into the news top-20 list. This might be related to the highly personal nature of social media like blogs and tweets. Conversely, the news top-20 list has secured places for he and said, giving us a hint of the kind of content that makes the news.
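The topTokens helper used below is defined in the repository; a hypothetical sketch of what such a helper might look like, based on the tokenFreq and tokenFreqComb objects computed above, is:

# Hypothetical sketch of a topTokens-style helper (not the actual implementation):
# the n most frequent tokens, either from the combined table or per corpus
topTokensSketch <- function(n, full = TRUE) {
  if (full) return(head(tokenFreqComb, n))
  lapply(tokenFreq, function(tf) head(tf[order(tf$freq, decreasing = TRUE), ], n))
}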

top20comb <- topTokens(20)
top20corpus <- topTokens(20, FALSE)
knitr::kable(cbind(all.token=top20comb$token, all.freq=top20comb$freq, do.call(cbind, top20corpus)), format = "pandoc", align = rep("r", 8))
| all.token | all.freq | blogs.token | blogs.freq | news.token | news.freq | twitter.token | twitter.freq |
| ---------:| --------:| -----------:| ----------:| ----------:| ---------:| -------------:| ------------:|
| . | 66113 | . | 21138 | . | 20135 | . | 24840 |
| the | 47632 | the | 18673 | , | 19983 | ! | 12464 |
| , | 44803 | , | 17553 | the | 19730 | the | 9229 |
| to | 27592 | and | 10815 | to | 9083 | to | 7774 |
| and | 23986 | to | 10735 | and | 8890 | , | 7267 |
| a | 23437 | a | 8897 | a | 8656 | i | 7142 |
| of | 20222 | of | 8692 | of | 7888 | a | 5884 |
| in | 16522 | i | 7564 |  | 7843 | you | 5454 |
| i | 16305 | in | 5854 | in | 6875 | and | 4281 |
|  | 14657 | that | 4473 | that | 3503 | ? | 4055 |
| ! | 14432 | is | 4373 | for | 3480 | : | 3855 |
| for | 10814 | it | 4090 | is | 2843 | for | 3799 |
| is | 10802 |  | 3613 | on | 2711 | in | 3793 |
| that | 10282 | for | 3535 | said | 2539 | of | 3642 |
| you | 9321 | you | 2956 | with | 2505 | is | 3586 |
| it | 9100 | on | 2758 | was | 2348 |  | 3201 |
| on | 8216 | with | 2754 | he | 2294 | my | 2885 |
| with | 6984 | my | 2714 | it | 2166 | it | 2844 |
| was | 6215 | was | 2714 | at | 2120 | on | 2747 |
| : | 6181 | this | 2588 | - | 1994 | # | 2567 |

Text coverage and vocabulary size

In this final section of our exploratory analysis, we take a look at the proportion of the three corpora covered by vocabularies of different sizes. By keeping only each corpus’s most frequent words, up to a specified vocabulary size, we try to find an optimal trade-off between the efficiency and the accuracy of our future model.

Confirming the skewness of the word frequency distribution, we can see that the 20 most frequent words cover more than 33% of each of the three corpora, and even of all three combined.

unlist(lapply(top20corpus, function(x) sum(x$freq)))/sample2Stats["total.words", ]
##   blogs    news twitter 
##  0.3459  0.3435  0.3352
sum(top20comb$freq)/sum(sample2Stats["total.words", ])
## [1] 0.3319

The table and plot below show the vocabulary size needed to cover from 50% to 90% of the sampled corpora.
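The coverage helper used below is defined in the repository; in essence, this kind of computation is a cumulative-sum lookup over the sorted frequency table, as in the hypothetical sketch:

# Hypothetical sketch: number of most frequent tokens needed to cover a given
# proportion of all token occurrences (freqTable sorted by decreasing frequency)
vocabSizeFor <- function(target, freqTable) {
  cumCov <- cumsum(freqTable$freq) / sum(freqTable$freq)
  which(cumCov >= target)[1]
}
vocabSizeFor(0.5, tokenFreqComb)   # vocabulary size for 50% coverage, combined corpora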

coverages <- seq(0.5, 0.9, 0.05)
names(coverages) <- paste0(coverages * 100, "%")
covdf <- cbind(all=sapply(coverages, coverage), 
               do.call(rbind, lapply(coverages, function(x) unlist(coverage(x, full=FALSE)))))
covdf
##      all blogs news twitter
## 50%   86    73  104      73
## 55%  136   111  185     107
## 60%  223   175  326     160
## 65%  383   296  553     243
## 70%  652   517  912     381
## 75% 1110   899 1472     629
## 80% 1901  1561 2378    1096
## 85% 3383  2796 4001    2017
## 90% 6651  5405 7205    4075
covplot <- melt(covdf, measure.vars = names(covdf))
names(covplot) <- c("coverage", "corpus", "words")

ggplot(covplot, aes(x=coverage, y=words, group=corpus, color=corpus)) + 
  geom_line(size=1) + geom_point(size=4) +
  labs(title="Vocabulary size and text coverage",
       x="Coverage (%)",
       y="Number of words", 
       color="Corpus") +
  guides(fill=FALSE) +
  theme_minimal() +
  theme(plot.title=element_text(size=16, face="bold"),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank())

[Figure: vocabulary size vs. text coverage, by corpus]

Further work

Our future work will focus on building n-gram-based statistical language models trained on a much larger sample (60%) of the original dataset, pruning these models with a validation set (20%) and choosing the best model based on a test set (20%) of sentences never used in training or validation. When building these models, we’ll try to optimize storage and memory utilization, keeping in mind the constraints of the free ShinyApps platform.
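As a rough, hypothetical sketch, the sentence-level split could be drawn along these lines:

# Hypothetical 60/20/20 sentence-level split into training, validation and test sets
splitCorpus <- function(sentences, seed = 14112014) {
  set.seed(seed)
  grp <- sample(c("train", "valid", "test"), length(sentences),
                replace = TRUE, prob = c(0.6, 0.2, 0.2))
  split(sentences, grp)
}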

References

[1] COURSERA. Capstone Dataset. Available online at: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

[2] WIKIPEDIA. Tokenization (lexical analysis). Available online at: http://en.wikipedia.org/wiki/Tokenization_(lexical_analysis)

[3] McENERY, T., HARDIE, A. Corpus Linguistics: Method, Theory and Practice. Cambridge: Cambridge University Press, 2012. Available online at: http://corpora.lancs.ac.uk/clmtp/index.php

[4] NATION, P., WARING, R. Vocabulary Size, Text Coverage and Word Lists. Available online at: http://www.fltr.ucl.ac.be/fltr/germ/etan/bibs/vocab/cup.html

© 2014, Paulo Jean