Executive Summary

An exploratory data analysis was conducted on three files from the Heliohost Corpora, a collection of text gathered in multiple languages from the World Wide Web to assess the current usage of each language. We evaluated the English language versions of three files in the corpora: en_us.blogs.txt, en_us.news.txt, and en_us.twitter.txt.

To simplify the analysis, the three text files are combined into a single corpus of 4,269,678 documents, which is then sampled and analyzed as 1-grams, 2-grams, and 3-grams.

Key Questions Considered

The purpose of the exploratory data analysis is to understand the data in the context of the final assignment – develop a model that predicts the next word as someone is typing on a mobile device. Given this objective, the key questions addressed in the sections that follow include: how many unique words are required to cover most of the language in actual use, how many sentences and tokens the corpus contains and how they are distributed, which 1-grams, 2-grams, and 3-grams occur most frequently, how words from a foreign language can be identified, and how coverage of word combinations can be increased.

Required Unique Words

According to the Merriam-Webster Dictionary website, there are approximately 470,000 words in Webster’s Third New International Dictionary, Unabridged, together with its 1993 Addenda section. However, based on the frequency with which various words are used, the top 100 words / lexemes account for 50% of all of the words in the Oxford English Corpus, per Wikipedia’s Most Common Words in English article. The top 7,000 words account for 90% of the corpus based on usage, and the long tail includes lemmas that are rarely used, according to Oxford English Dictionary’s Facts About the Language.

## Warning in readLines(twitterFile): line 167155 appears to contain an
## embedded nul
## Warning in readLines(twitterFile): line 268547 appears to contain an
## embedded nul
## Warning in readLines(twitterFile): line 1274086 appears to contain an
## embedded nul
## Warning in readLines(twitterFile): line 1759032 appears to contain an
## embedded nul
## [1] "read_lines() took 20.2914791107178 secs"
## [1] "After sampling, allData contains 1280903 documents."

Understanding the Data

Each of the three text files contains a number of documents, where each row in the input text file corresponds to a document in the corpus. To manage the processing time for the analysis, we draw a 30% sample, for a total of approximately 1,280,903 documents. Once we’ve loaded the data into a corpus, we can see that the average number of sentences per document is very low, per the following histogram.

## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:readr':
## 
##     tokenize
## The following object is masked from 'package:stats':
## 
##     df
## The following object is masked from 'package:base':
## 
##     sample
## Loading required package: boot
## [1] "quanteda::corpus() took 7.33486199378967 secs"
## [1] "quanteda::summary() took 4.62915031512578 mins"

After loading the three text files provided for our research, sampling them, and combining the sample into a single corpus, the corpus contains a total of 1,280,903 documents.

How many sentences are in the corpus?

There are 2,449,465 sentences in the corpus, with an average of 1.91 per document. The distribution of sentences in the corpus is as follows.

nbr.val       1280903.0000000
nbr.null            0.0000000
nbr.na              0.0000000
min                 1.0000000
max               104.0000000
range             103.0000000
sum           2449465.0000000
median              1.0000000
mean                1.9122955
SE.mean             0.0013247
CI.mean.0.95        0.0025963
var                 2.2476163
std.dev             1.4992052
coef.var            0.7839820

The data is positively skewed, as fully 50% of the observations contain a single sentence. Another way to visualize the descriptive statistics is with a histogram, which clearly demonstrates the positive skew of the data. This is understandable, given that the data sources include a large volume of tweets, which are limited to 140 characters, alongside blog and news posts.

How many words / tokens are in the corpus?

There are 36,167,741 tokens in the corpus, with an average of 28.24 per document. The distribution of tokens in the corpus is as follows.

nbr.val        1280903.0000000
nbr.null             0.0000000
nbr.na               0.0000000
min                  1.0000000
max               2570.0000000
range             2569.0000000
sum           36167741.0000000
median              19.0000000
mean                28.2361280
SE.mean              0.0275323
CI.mean.0.95         0.0539624
var                970.9619990
std.dev             31.1602631
coef.var             1.1035601

The number of tokens per document varies more widely than the number of sentences, as illustrated by the following histogram.

Word Frequencies: 1-grams

Since the input data includes content from Twitter, we use the tokenizer options that handle hashtags and Twitter handles, so that these are kept as single tokens rather than split apart when punctuation is removed; a quick check of this behavior is sketched below. The table that follows the tokenizer output lists the top 20 1-grams, along with their relative frequency.
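The check below is not part of the original analysis; it simply uses the wordFreq 1-gram frequency table built in the Appendix to confirm how many distinct hashtag and handle tokens survive tokenization.

# Count distinct 1-grams that still look like hashtags or Twitter handles
# after tokenization (wordFreq is the 1-gram frequency table from the Appendix).
isTwitterToken <- grepl("^[#@]", as.character(wordFreq$ngram1))
sum(isTwitterToken)                  # number of distinct hashtag / handle tokens
sum(wordFreq$Freq[isTwitterToken])   # their total number of occurrences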

## Starting tokenization...
##   ...preserving Twitter characters (#, @)...total elapsed: 0.64 seconds.
##   ...tokenizing texts...total elapsed:  27.59 seconds.
##   ...replacing Twitter characters (#, @)...total elapsed: 13.75 seconds.
##   ...replacing names...total elapsed:  0.08 seconds.
## Finished tokenizing and cleaning 1,280,903 texts.
## [1] "quanteda::tokenize() took 44.3640871047974 secs"
## [1] "Building word frequencies took 48.5653610229492 secs"
         ngram1       Freq        pct
368211   the       1430217  0.0475714
373119   to         827045  0.0275089
56766    and        725535  0.0241325
45936    a          715601  0.0238021
273567   of         602554  0.0200419
195020   i          497623  0.0165518
198057   in         495646  0.0164860
159440   for        330303  0.0109864
203166   is         322820  0.0107375
368047   that       313175  0.0104167
412554   you        284410  0.0094599
203791   it         275413  0.0091607
275839   on         246382  0.0081951
402775   with       214875  0.0071471
395683   was        187372  0.0062323
258602   my         181456  0.0060355
63905    at         171815  0.0057149
72222    be         164377  0.0054675
369894   this       163293  0.0054314
183675   have       160476  0.0053377

There are a total of 415,742 unique 1-grams in the corpus. The top 20 1-grams represent 27.64 percent of the total frequency in the corpus. The top 100 1-grams represent 45.90 percent, and the top 7,000 represent 88.96 percent. These results are broadly consistent with the Oxford English Corpus statistics cited earlier in this report.
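For reference, coverage figures like these can be computed from the sorted frequency tables built in the Appendix with a short helper along the following lines (the coverage() function is added here for illustration and is not part of the original code).

# cumulative coverage of the top-N entries of a frequency table that is
# already sorted by descending Freq (as wordFreq, ngram2Freq, and ngram3Freq are)
coverage <- function(freqTable, topN) {
  topN <- min(topN, nrow(freqTable))
  sum(freqTable$Freq[seq_len(topN)]) / sum(freqTable$Freq)
}

coverage(wordFreq, 20)     # top 20 1-grams
coverage(wordFreq, 100)    # top 100 1-grams
coverage(wordFreq, 7000)   # top 7,000 1-grams

The same helper applies unchanged to ngram2Freq and ngram3Freq.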

Frequencies of 2-grams and 3-grams

We expect individual 2-grams and 3-grams to occur less frequently than individual 1-grams. The following table illustrates frequencies for the top 20 2-grams.

## [1] "Building word frequencies for 2-grams took 2.87706534862518 mins"
          ngram2       Freq        pct
3696703   of_the     128716  0.0044717
2642310   in_the     123801  0.0043010
5415179   to_the      64250  0.0022321
2032649   for_the     60454  0.0021002
3755613   on_the      59042  0.0020512
5389375   to_be       48555  0.0016869
551272    at_the      43061  0.0014960
397945    and_the     37822  0.0013140
2619378   in_a        36254  0.0012595
5918482   with_the    32021  0.0011124
2744231   is_a        30226  0.0010501
2785372   it_was      28665  0.0009959
2011621   for_a       28328  0.0009841
2572988   i_have      26232  0.0009113
2106538   from_the    26199  0.0009102
2577063   i_was       25887  0.0008993
372912    and_i       24731  0.0008592
2780384   it_is       24634  0.0008558
5897926   with_a      24420  0.0008484
5874292   will_be     24035  0.0008350

There are a total of 6,057,842 unique 2-grams in the corpus. The top 20 2-grams represent 3.12 percent of the total frequency in the corpus. The top 100 2-grams represent 7.10 percent, and the top 7,000 represent 33.09 percent.

The next table illustrates frequencies for the top 20 3-grams.

## [1] "Building word frequencies for 3-grams took 9.06613223155339 mins"
           ngram3                Freq        pct
9857464    one_of_the           10447  0.0003797
273470     a_lot_of              9057  0.0003292
13000680   thanks_for_the        7200  0.0002617
14249722   to_be_a               5504  0.0002001
5554493    going_to_be           5247  0.0001907
13332134   the_end_of            4473  0.0001626
6612296    i_want_to             4461  0.0001621
10116911   out_of_the            4455  0.0001619
7309386    it_was_a              4280  0.0001556
12261378   some_of_the           4110  0.0001494
1623228    as_well_as            4081  0.0001483
1946735    be_able_to            4000  0.0001454
10270979   part_of_the           3637  0.0001322
6564472    i_have_a              3516  0.0001278
13618756   the_rest_of           3394  0.0001234
6566533    i_have_to             3377  0.0001227
8062054    looking_forward_to    3343  0.0001215
6550904    i_don’t_know          3255  0.0001183
13361442   the_first_time        3140  0.0001141
7124802    is_going_to           3096  0.0001125

There are a total of 16,296,115 unique 3-grams in the corpus. The top 20 3-grams represent 0.34 percent of the total frequency in the corpus. The top 100 3-grams represent 0.96 percent, and the top 7,000 represent 8.15 percent.

As expected, as the number of tokens in the n-gram increases, the percentage of total frequency accounted for by the top N n-grams declines.

Evaluating Words from a Foreign Language

The dictionary capabilities within the quanteda package can be used to evaluate words from foreign languages. One can keep specific words in the document-feature matrix by specifying a foreign-language dictionary via the keptFeatures= option of the quanteda::dfm() function. Since quanteda supports the WordStat format, dictionaries for a variety of languages may be downloaded from the Provalis Research Dictionary Download web page.
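As an illustrative sketch of this approach (the dictionary file name below is a placeholder, and the call assumes the quanteda 0.9.x interface used elsewhere in this report):

# Placeholder example: load a WordStat-format dictionary (here a hypothetical
# "german.cat" file downloaded from Provalis Research) and keep only its terms
# in the document-feature matrix, as described above.
germanDict <- dictionary(file = "german.cat", format = "wordstat")
foreignDfm <- dfm(theText, keptFeatures = germanDict)
topfeatures(foreignDfm, 20)   # most frequent retained foreign-language terms

The frequencies of the retained features then indicate how prevalent foreign-language words are in the corpus.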

Increasing Coverage

Coverage of the word combinations can be increased in the following ways:

  1. Adding metadata to the model – by providing additional context beyond the words in the corpus, it is possible to improve the coverage of words without adding documents to the corpus.
  2. Semantic analysis – by using techniques to identify sub-structures of a document, one can categorize the n-grams into linguistic components such as “subject object predicate.” By understanding these components we may be able to improve our ability to predict the next word, given the preceding words and the linguistic structure of sentences in a document. These techniques are described in Impact of Linguistic Analysis on the Semantic Graph Coverage and Learning of Document Extracts, Leskovec et al., 2005.

Preparing for the Prediction Assignment and Shiny App

As we learn more about natural language processing, we will refine our strategy for developing a prediction algorithm. Based on what we know at this point, we expect to develop a document-feature matrix based on n-grams and, given the first n-1 words of an n-gram, predict the n-th word, as sketched below. The Shiny application will store a predictive model, allow the user to enter some text, and then display a selection of candidate words from which the user can choose the n-th word.
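Only as a rough sketch of that lookup (the predictNextWord() helper below is hypothetical and simply scans the ngram3Freq table built in the Appendix, where 3-grams are stored as "word1_word2_word3"):

# Hypothetical sketch: suggest next-word candidates from the 3-gram frequency table.
predictNextWord <- function(word1, word2, ngramFreq, topN = 3) {
  grams  <- as.character(ngramFreq$ngram3)
  prefix <- paste0(word1, "_", word2, "_")
  hits   <- ngramFreq[substr(grams, 1, nchar(prefix)) == prefix, ]
  if (nrow(hits) == 0) return(character(0))
  hits <- hits[order(-hits$Freq), ]
  # strip the two-word prefix, leaving the candidate next words
  head(substring(as.character(hits$ngram3), nchar(prefix) + 1), topN)
}

predictNextWord("looking", "forward", ngram3Freq)   # with the table above, "to" should rank first

A production version would add lower-order back-off to the 2-gram and 1-gram tables when no 3-gram match is found.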

Appendix

In the Appendix section we include a listing of the code used in the analysis. We refrained from echoing the code throughout the report to make it easier to read.

programStart <- Sys.time()
setwd("C:/Users/Leonard/gitrepos/datascience")

# set seed to make results reproducible
set.seed(300755442)
# load libraries
library(readr)
library(stringr)
# assign names to source files 

blogFile <- "./capstone/data/en_us.blogs.txt"
newsFile <- "./capstone/data/en_us.news.txt"
twitterFile <- "./capstone/data/en_us.twitter.txt"

intervalStart <- Sys.time()
blogData <- read_lines(blogFile)
newsData <- read_lines(newsFile)
twitterData <- readLines(twitterFile) # had to use readLines because it handled embedded nulls in 4 lines (167155, 268547, 1274086, and 1759032)
allData <- c(blogData,newsData,twitterData)
intervalEnd <- Sys.time()
paste("read_lines() took",intervalEnd - intervalStart,attr(intervalEnd - intervalStart,"units"))

# clean up: strip stray encoding artifacts, double quotes, and double spaces
allData <- str_replace_all(allData, "â","")
allData <- str_replace_all(allData, "\"","")
allData <- str_replace_all(allData, "  "," ")
# draw a 30% sample of the documents to keep processing time manageable
sample_pct <- .3
sample_size <- round(length(allData) * sample_pct,0)
allData <- sample(allData,sample_size)
paste("After sampling, allData contains",length(allData),"documents.")
library(quanteda)
library(knitr)
library(pastecs)
intervalStart <- Sys.time()
theText <- corpus(allData)
intervalEnd <- Sys.time()
paste("quanteda::corpus() took",intervalEnd - intervalStart,attr(intervalEnd - intervalStart,"units"))

# calculate number of words and sentences per blog 
intervalStart <- Sys.time()
summaryStats <- summary(theText,verbose=FALSE,n=ndoc(theText))
intervalEnd <- Sys.time()
paste("quanteda::summary() took",intervalEnd - intervalStart,attr(intervalEnd - intervalStart,"units"))

# descriptive statistics for sentences and tokens per document
sentences <- stat.desc(summaryStats$Sentences,desc=TRUE,basic=TRUE)
tokens <- stat.desc(summaryStats$Tokens,desc=TRUE,basic=TRUE)
# histograms of sentences and tokens per document
hist(summaryStats$Sentences,
     main="Sentences per Blog / News / Twitter post")
hist(summaryStats$Tokens,
     main="Tokens per Blog / News / Twitter post")
# convert to lower case, then tokenize, removing punctuation, numbers, and separators
theText <- toLower(theText)
intervalStart <- Sys.time()

words <- quanteda::tokenize(theText,
                            removePunct=TRUE,
                            removeNumbers=TRUE,
                            removeSeparators=TRUE,
                            verbose=TRUE)
intervalEnd <- Sys.time()
paste("quanteda::tokenize() took",intervalEnd - intervalStart,attr(intervalEnd - intervalStart,"units"))
# build 1-, 2-, and 3-gram token vectors from the tokenized texts
ngram1 <- unlist(ngrams(words,n=1))
ngram2 <- unlist(ngrams(words,n=2))
ngram3 <- unlist(ngrams(words,n=3))
# create data table of frequencies 
intervalStart <- Sys.time()
wordFreq <- as.data.frame(table(ngram1))
wordFreq$pct <- wordFreq$Freq / sum(wordFreq$Freq)
intervalEnd <- Sys.time()
paste("Building word frequencies took",intervalEnd - intervalStart,attr(intervalEnd - intervalStart,"units"))
wordFreq <- wordFreq[order(-wordFreq$Freq),]
kable(wordFreq[1:20,])
hist(wordFreq$Freq,
     main="Frequency Distribution of 1-grams")
intervalStart <- Sys.time()
ngram2Freq <- as.data.frame(table(ngram2))
ngram2Freq$pct <- ngram2Freq$Freq / sum(ngram2Freq$Freq)
intervalEnd <- Sys.time()
paste("Building word frequencies for 2-grams took",intervalEnd - intervalStart,attr(intervalEnd - intervalStart,"units"))
ngram2Freq <- ngram2Freq[order(-ngram2Freq$Freq),]
kable(ngram2Freq[1:20,])
hist(ngram2Freq$Freq,
     main="Frequency Distribution of 2-grams")
intervalStart <- Sys.time()
ngram3Freq <- as.data.frame(table(ngram3))
ngram3Freq$pct <- ngram3Freq$Freq / sum(ngram3Freq$Freq)
intervalEnd <- Sys.time()
paste("Building word frequencies for 3-grams took",intervalEnd - intervalStart,attr(intervalEnd - intervalStart,"units"))
ngram3Freq <- ngram3Freq[order(-ngram3Freq$Freq),]
kable(ngram3Freq[1:20,])
hist(ngram3Freq$Freq,
     main="Frequency Distribution of 3-grams")

Finally, we include the session information for the analysis.

sessionInfo()
## R version 3.2.4 (2016-03-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 10586)
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] pastecs_1.3-18 boot_1.3-18    knitr_1.12.3   quanteda_0.9.4
## [5] stringr_1.0.0  readr_0.2.2   
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.3      lattice_0.20-33  digest_0.6.9     chron_2.3-47    
##  [5] grid_3.2.4       formatR_1.3      magrittr_1.5     evaluate_0.8.3  
##  [9] highr_0.5.1      stringi_1.0-1    data.table_1.9.6 ca_0.64         
## [13] Matrix_1.2-4     rmarkdown_0.9.5  tools_3.2.4      parallel_3.2.4  
## [17] yaml_2.1.13      htmltools_0.3
intervalEnd <- Sys.time()
paste("Milestone Report took",intervalEnd - programStart,attr(intervalEnd - programStart,"units"))
## [1] "Milestone Report took 27.6026578823725 mins"