Executive Summary

An exploratory data analysis was conducted on three files from the Heliohost Corpora, a collection of text gathered in multiple languages from the World Wide Web to assess the current usage of each language. We evaluated the English language versions of three files in the corpora: en_us.blogs.txt, en_us.news.txt, and en_us.twitter.txt.

To simplify the analysis, the three text files are combined into a single corpus of 4,269,678 documents, which is then sampled and analyzed as 1-grams, 2-grams, and 3-grams.

Key Questions Considered

The purpose of the exploratory data analysis is to understand the data in the context of the final assignment – develop a model that predicts the next word as someone is typing on a mobile device. Given this objective, the key questions addressed in the sections that follow include: how many unique words are required to cover most of the language in actual use, how many sentences and tokens the corpus contains and how they are distributed, which 1-grams, 2-grams, and 3-grams occur most frequently, how words from a foreign language can be identified, and how coverage of word combinations can be increased.

Required Unique Words

According to the Merriam-Webster Dictionary website, there are approximately 470,000 words in Webster’s Third New International Dictionary, Unabridged, together with its 1993 Addenda section. However, based on the frequency with which various words are used, the top 100 words / lexemes account for 50% of all of the words in the Oxford English Corpus, per Wikipedia’s Most Common Words in English article. The top 7,000 words account for 90% of the corpus based on usage, and the long tail includes lemmas that are rarely used, according to Oxford English Dictionary’s Facts About the Language.

## Warning in readLines(twitterFile): line 167155 appears to contain an
## embedded nul
## Warning in readLines(twitterFile): line 268547 appears to contain an
## embedded nul
## Warning in readLines(twitterFile): line 1274086 appears to contain an
## embedded nul
## Warning in readLines(twitterFile): line 1759032 appears to contain an
## embedded nul
## [1] "read_lines() took 20.2914791107178 secs"
## [1] "After sampling, allData contains 1280903 documents."

Understanding the Data

Each of the three text files contains a number of documents, where each row in the input text file corresponds to a document in the corpus. To manage the processing time for the analysis, we draw a 30% sample, for a total of approximately 1,280,903 documents. Once we’ve loaded the data into a corpus, we can see that the average number of sentences per document is very low, per the following histogram.

## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:readr':
## 
##     tokenize
## The following object is masked from 'package:stats':
## 
##     df
## The following object is masked from 'package:base':
## 
##     sample
## Loading required package: boot
## [1] "quanteda::corpus() took 7.33486199378967 secs"
## [1] "quanteda::summary() took 4.62915031512578 mins"

After loading the three text files provided for our research, sampling them, and combining the sample into a single corpus, the corpus contains a total of 1,280,903 documents.

How many sentences are in the corpus?

There are 2,449,465 sentences in the corpus, with an average of 1.91 per document. The distribution of sentences in the corpus is as follows.

nbr.val       1280903.0000000
nbr.null            0.0000000
nbr.na              0.0000000
min                 1.0000000
max               104.0000000
range             103.0000000
sum           2449465.0000000
median              1.0000000
mean                1.9122955
SE.mean             0.0013247
CI.mean.0.95        0.0025963
var                 2.2476163
std.dev             1.4992052
coef.var            0.7839820

The data is positively skewed, as fully 50% of the observations contain a single sentence. Another way to visualize the descriptive statistics is with a histogram, which clearly demonstrates the positive skew of the data. This is understandable, given that the data sources include a large volume of tweets, which are limited to 140 characters, alongside blog and news posts.

How many words / tokens are in the corpus?

There are 36,167,741 tokens in the corpus, with an average of 28.24 per document. The distribution of tokens in the corpus is as follows.

nbr.val        1280903.0000000
nbr.null             0.0000000
nbr.na               0.0000000
min                  1.0000000
max               2570.0000000
range             2569.0000000
sum           36167741.0000000
median              19.0000000
mean                28.2361280
SE.mean              0.0275323
CI.mean.0.95         0.0539624
var                970.9619990
std.dev             31.1602631
coef.var             1.1035601

The number of tokens per document varies more widely than the number of sentences, as illustrated by the following histogram.

Word Frequencies: 1-grams

Since the input data includes content from Twitter, we use the tokenizer options that handle hashtags and Twitter handles, so that these are kept as single tokens rather than split apart when punctuation is removed; a quick check of this behavior is sketched below. The table that follows the tokenizer output lists the top 20 1-grams, along with their relative frequency.
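The check below is not part of the original analysis; it simply uses the wordFreq 1-gram frequency table built in the Appendix to confirm how many distinct hashtag and handle tokens survive tokenization.

# Count distinct 1-grams that still look like hashtags or Twitter handles
# after tokenization (wordFreq is the 1-gram frequency table from the Appendix).
isTwitterToken <- grepl("^[#@]", as.character(wordFreq$ngram1))
sum(isTwitterToken)                  # number of distinct hashtag / handle tokens
sum(wordFreq$Freq[isTwitterToken])   # their total number of occurrences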

## Starting tokenization...
##   ...preserving Twitter characters (#, @)...total elapsed: 0.64 seconds.
##   ...tokenizing texts...total elapsed:  27.59 seconds.
##   ...replacing Twitter characters (#, @)...total elapsed: 13.75 seconds.
##   ...replacing names...total elapsed:  0.08 seconds.
## Finished tokenizing and cleaning 1,280,903 texts.
## [1] "quanteda::tokenize() took 44.3640871047974 secs"
## [1] "Building word frequencies took 48.5653610229492 secs"
         ngram1       Freq        pct
368211   the       1430217  0.0475714
373119   to         827045  0.0275089
56766    and        725535  0.0241325
45936    a          715601  0.0238021
273567   of         602554  0.0200419
195020   i          497623  0.0165518
198057   in         495646  0.0164860
159440   for        330303  0.0109864
203166   is         322820  0.0107375
368047   that       313175  0.0104167
412554   you        284410  0.0094599
203791   it         275413  0.0091607
275839   on         246382  0.0081951
402775   with       214875  0.0071471
395683   was        187372  0.0062323
258602   my         181456  0.0060355
63905    at         171815  0.0057149
72222    be         164377  0.0054675
369894   this       163293  0.0054314
183675   have       160476  0.0053377

There are a total of 415,742 unique 1-grams in the corpus. The top 20 1-grams represent 27.64 percent of the total frequency in the corpus. The top 100 1-grams represent 45.90 percent, and the top 7,000 represent 88.96 percent. These results are broadly consistent with the Oxford English Corpus statistics cited earlier in this report.
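For reference, coverage figures like these can be computed from the sorted frequency tables built in the Appendix with a short helper along the following lines (the coverage() function is added here for illustration and is not part of the original code).

# cumulative coverage of the top-N entries of a frequency table that is
# already sorted by descending Freq (as wordFreq, ngram2Freq, and ngram3Freq are)
coverage <- function(freqTable, topN) {
  topN <- min(topN, nrow(freqTable))
  sum(freqTable$Freq[seq_len(topN)]) / sum(freqTable$Freq)
}

coverage(wordFreq, 20)     # top 20 1-grams
coverage(wordFreq, 100)    # top 100 1-grams
coverage(wordFreq, 7000)   # top 7,000 1-grams

The same helper applies unchanged to ngram2Freq and ngram3Freq.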

Frequencies of 2-grams and 3-grams

We expect individual 2-grams and 3-grams to occur less frequently than individual 1-grams. The following table illustrates frequencies for the top 20 2-grams.

## [1] "Building word frequencies for 2-grams took 2.87706534862518 mins"
          ngram2       Freq        pct
3696703   of_the     128716  0.0044717
2642310   in_the     123801  0.0043010
5415179   to_the      64250  0.0022321
2032649   for_the     60454  0.0021002
3755613   on_the      59042  0.0020512
5389375   to_be       48555  0.0016869
551272    at_the      43061  0.0014960
397945    and_the     37822  0.0013140
2619378   in_a        36254  0.0012595
5918482   with_the    32021  0.0011124
2744231   is_a        30226  0.0010501
2785372   it_was      28665  0.0009959
2011621   for_a       28328  0.0009841
2572988   i_have      26232  0.0009113
2106538   from_the    26199  0.0009102
2577063   i_was       25887  0.0008993
372912    and_i       24731  0.0008592
2780384   it_is       24634  0.0008558
5897926   with_a      24420  0.0008484
5874292   will_be     24035  0.0008350

There are a total of 6,057,842 unique 2-grams in the corpus. The top 20 2-grams represent 3.12 percent of the total frequency in the corpus. The top 100 2-grams represent 7.10 percent, and the top 7,000 represent 33.09 percent.

The next table illustrates frequencies for the top 20 3-grams.

## [1] "Building word frequencies for 3-grams took 9.06613223155339 mins"
           ngram3                Freq        pct
9857464    one_of_the           10447  0.0003797
273470     a_lot_of              9057  0.0003292
13000680   thanks_for_the        7200  0.0002617
14249722   to_be_a               5504  0.0002001
5554493    going_to_be           5247  0.0001907
13332134   the_end_of            4473  0.0001626
6612296    i_want_to             4461  0.0001621
10116911   out_of_the            4455  0.0001619
7309386    it_was_a              4280  0.0001556
12261378   some_of_the           4110  0.0001494
1623228    as_well_as            4081  0.0001483
1946735    be_able_to            4000  0.0001454
10270979   part_of_the           3637  0.0001322
6564472    i_have_a              3516  0.0001278
13618756   the_rest_of           3394  0.0001234
6566533    i_have_to             3377  0.0001227
8062054    looking_forward_to    3343  0.0001215
6550904    i_don’t_know          3255  0.0001183
13361442   the_first_time        3140  0.0001141
7124802    is_going_to           3096  0.0001125

There are a total of 16,296,115 unique 3-grams in the corpus. The top 20 3-grams represent 0.34 percent of the total frequency in the corpus. The top 100 3-grams represent 0.96 percent, and the top 7,000 represent 8.15 percent.

As expected, as the number of tokens in the n-gram increases, the percentage of total frequency accounted for by the top N n-grams declines.

Evaluating Words from a Foreign Language

The dictionary capabilities within the quanteda package can be used to evaluate words from foreign languages. One can keep specific words in the document-feature matrix by specifying a foreign-language dictionary via the keptFeatures= option of the quanteda::dfm() function. Since quanteda supports the WordStat format, dictionaries for a variety of languages may be downloaded from the Provalis Research Dictionary Download web page.
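As an illustrative sketch of this approach (the dictionary file name below is a placeholder, and the call assumes the quanteda 0.9.x interface used elsewhere in this report):

# Placeholder example: load a WordStat-format dictionary (here a hypothetical
# "german.cat" file downloaded from Provalis Research) and keep only its terms
# in the document-feature matrix, as described above.
germanDict <- dictionary(file = "german.cat", format = "wordstat")
foreignDfm <- dfm(theText, keptFeatures = germanDict)
topfeatures(foreignDfm, 20)   # most frequent retained foreign-language terms

The frequencies of the retained features then indicate how prevalent foreign-language words are in the corpus.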

Increasing Coverage

Coverage of the word combinations can be increased in the following ways:

  1. Adding metadata to the model – by providing additional context beyond the words in the corpus, it is possible to improve the coverage of words without adding documents to the corpus.
  2. Semantic analysis – by using techniques to identify sub-structures of a document, one can categorize the n-grams into linguistic components such as “subject object predicate.” By understanding these components we may be able to improve our ability to predict the next word, given the preceding words and the linguistic structure of sentences in a document. These techniques are described in Impact of Linguistic Analysis on the Semantic Graph Coverage and Learning of Document Extracts, Leskovec et al., 2005.

Preparing for the Prediction Assignment and Shiny App

As we learn more about natural language processing, we will refine our strategy for developing a prediction algorithm. Based on what we know at this point, we expect to develop a document-feature matrix based on n-grams and, given the first n-1 words of an n-gram, predict the n-th word, as sketched below. The Shiny application will store a predictive model, allow the user to enter some text, and then display a selection of candidate words from which the user can choose the n-th word.
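Only as a rough sketch of that lookup (the predictNextWord() helper below is hypothetical and simply scans the ngram3Freq table built in the Appendix, where 3-grams are stored as "word1_word2_word3"):

# Hypothetical sketch: suggest next-word candidates from the 3-gram frequency table.
predictNextWord <- function(word1, word2, ngramFreq, topN = 3) {
  grams  <- as.character(ngramFreq$ngram3)
  prefix <- paste0(word1, "_", word2, "_")
  hits   <- ngramFreq[substr(grams, 1, nchar(prefix)) == prefix, ]
  if (nrow(hits) == 0) return(character(0))
  hits <- hits[order(-hits$Freq), ]
  # strip the two-word prefix, leaving the candidate next words
  head(substring(as.character(hits$ngram3), nchar(prefix) + 1), topN)
}

predictNextWord("looking", "forward", ngram3Freq)   # with the table above, "to" should rank first

A production version would add lower-order back-off to the 2-gram and 1-gram tables when no 3-gram match is found.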

Appendix

In the Appendix section we include a listing of the code used in the analysis. We refrained from echoing the code throughout the report to make it easier to read.

programStart <- Sys.time()
setwd("C:/Users/Leonard/gitrepos/datascience")

# set seed to make results reproducible
set.seed(300755442)
# load libraries
library(readr)
library(stringr)
# assign names to source files 

blogFile <- "./capstone/data/en_us.blogs.txt"
newsFile <- "./capstone/data/en_us.news.txt"
twitterFile <- "./capstone/data/en_us.twitter.txt"

intervalStart <- Sys.time()
blogData <- read_lines(blogFile)
newsData <- read_lines(newsFile)
twitterData <- readLines(twitterFile) # had to use readLines because it handled embedded nulls in 4 lines (167155, 268547, 1274086, and 1759032)
allData <- c(blogData,newsData,twitterData)
intervalEnd <- Sys.time()
paste("read_lines() took",intervalEnd - intervalStart,attr(intervalEnd - intervalStart,"units"))

# clean up: strip stray encoding artifacts, double quotes, and double spaces
allData <- str_replace_all(allData, "â","")
allData <- str_replace_all(allData, "\"","")
allData <- str_replace_all(allData, "  "," ")
# draw a 30% sample of the documents to keep processing time manageable
sample_pct <- .3
sample_size <- round(length(allData) * sample_pct,0)
allData <- sample(allData,sample_size)
paste("After sampling, allData contains",length(allData),"documents.")
library(quanteda)
library(knitr)
library(pastecs)
intervalStart <- Sys.time()
theText <- corpus(allData)
intervalEnd <- Sys.time()
paste("quanteda::corpus() took",intervalEnd - intervalStart,attr(intervalEnd - intervalStart,"units"))

# calculate number of words and sentences per blog 
intervalStart <- Sys.time()
summaryStats <- summary(theText,verbose=FALSE,n=ndoc(theText))
intervalEnd <- Sys.time()
paste("quanteda::summary() took",intervalEnd - intervalStart,attr(intervalEnd - intervalStart,"units"))

# descriptive statistics for sentences and tokens per document
sentences <- stat.desc(summaryStats$Sentences,desc=TRUE,basic=TRUE)
tokens <- stat.desc(summaryStats$Tokens,desc=TRUE,basic=TRUE)
# histograms of sentences and tokens per document
hist(summaryStats$Sentences,
     main="Sentences per Blog / News / Twitter post")
hist(summaryStats$Tokens,
     main="Tokens per Blog / News / Twitter post")
# convert to lower case, then tokenize, removing punctuation, numbers, and separators
theText <- toLower(theText)
intervalStart <- Sys.time()

words <- quanteda::tokenize(theText,
                            removePunct=TRUE,
                            removeNumbers=TRUE,
                            removeSeparators=TRUE,
                            verbose=TRUE)
intervalEnd <- Sys.time()
paste("quanteda::tokenize() took",intervalEnd - intervalStart,attr(intervalEnd - intervalStart,"units"))
# build 1-, 2-, and 3-gram token vectors from the tokenized texts
ngram1 <- unlist(ngrams(words,n=1))
ngram2 <- unlist(ngrams(words,n=2))
ngram3 <- unlist(ngrams(words,n=3))
# create data table of frequencies 
intervalStart <- Sys.time()
wordFreq <- as.data.frame(table(ngram1))
wordFreq$pct <- wordFreq$Freq / sum(wordFreq$Freq)
intervalEnd <- Sys.time()
paste("Building word frequencies took",intervalEnd - intervalStart,attr(intervalEnd - intervalStart,"units"))
wordFreq <- wordFreq[order(-wordFreq$Freq),]
kable(wordFreq[1:20,])
hist(wordFreq$Freq,
     main="Frequency Distribution of 1-grams")
intervalStart <- Sys.time()
ngram2Freq <- as.data.frame(table(ngram2))
ngram2Freq$pct <- ngram2Freq$Freq / sum(ngram2Freq$Freq)
intervalEnd <- Sys.time()
paste("Building word frequencies for 2-grams took",intervalEnd - intervalStart,attr(intervalEnd - intervalStart,"units"))
ngram2Freq <- ngram2Freq[order(-ngram2Freq$Freq),]
kable(ngram2Freq[1:20,])
hist(ngram2Freq$Freq,
     main="Frequency Distribution of 2-grams")
intervalStart <- Sys.time()
ngram3Freq <- as.data.frame(table(ngram3))
ngram3Freq$pct <- ngram3Freq$Freq / sum(ngram3Freq$Freq)
intervalEnd <- Sys.time()
paste("Building word frequencies for 3-grams took",intervalEnd - intervalStart,attr(intervalEnd - intervalStart,"units"))
ngram3Freq <- ngram3Freq[order(-ngram3Freq$Freq),]
kable(ngram3Freq[1:20,])
hist(ngram3Freq$Freq,
     main="Frequency Distribution of 3-grams")

Finally, we include the session information for the analysis.

sessionInfo()
## R version 3.2.4 (2016-03-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 10586)
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] pastecs_1.3-18 boot_1.3-18    knitr_1.12.3   quanteda_0.9.4
## [5] stringr_1.0.0  readr_0.2.2   
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.3      lattice_0.20-33  digest_0.6.9     chron_2.3-47    
##  [5] grid_3.2.4       formatR_1.3      magrittr_1.5     evaluate_0.8.3  
##  [9] highr_0.5.1      stringi_1.0-1    data.table_1.9.6 ca_0.64         
## [13] Matrix_1.2-4     rmarkdown_0.9.5  tools_3.2.4      parallel_3.2.4  
## [17] yaml_2.1.13      htmltools_0.3
intervalEnd <- Sys.time()
paste("Milestone Report took",intervalEnd - programStart,attr(intervalEnd - programStart,"units"))
## [1] "Milestone Report took 27.6026578823725 mins"