An exploratory data analysis was conducted on three files from the Heliohost Corpora, a collection of text gathered in multiple languages from the World Wide Web to assess the current usage of each language. We evaluated the English-language versions of three files in the corpora:
To simplify the analysis, the three text files are combined into a single corpus of 4269678 documents, which is then sampled and analyzed as 1-grams, 2-grams, and 3-grams.
The purpose of the exploratory data analysis is to understand the data in the context of the final assignment: developing a model that predicts the next word as someone types into a mobile device. Given this objective, some of the key questions relevant to the exploratory data analysis include:
According to the Merriam-Webster Dictionary website, there are approximately 470,000 words in Webster’s Third New International Dictionary, Unabridged, together with its 1993 Addenda section. However, based on the frequency with which various words are used, the top 100 words / lexemes account for 50% of all of the words in the Oxford English Corpus, per Wikipedia’s Most Common Words in English article. The top 7,000 words account for 90% of the corpus based on usage, and the long tail includes lemmas that are rarely used, according to the Oxford English Dictionary’s Facts About the Language.
## Warning in readLines(twitterFile): line 167155 appears to contain an
## embedded nul
## Warning in readLines(twitterFile): line 268547 appears to contain an
## embedded nul
## Warning in readLines(twitterFile): line 1274086 appears to contain an
## embedded nul
## Warning in readLines(twitterFile): line 1759032 appears to contain an
## embedded nul
## [1] "read_lines() took 20.2914791107178 secs"
## [1] "After sampling, allData contains 1280903 documents."
Each of the three text files contains a number of documents, where each row in the input text file corresponds to a document in the corpus. To manage the processing time for the analysis we draw a 30% sample, for a total of 1280903 documents. Once the data is loaded into a corpus, we can see that the average number of sentences per document is very low, per the following histogram.
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:readr':
##
## tokenize
## The following object is masked from 'package:stats':
##
## df
## The following object is masked from 'package:base':
##
## sample
## Loading required package: boot
## [1] "quanteda::corpus() took 7.33486199378967 secs"
## [1] "quanteda::summary() took 4.62915031512578 mins"
After loading the three text files provided for our research, combining them into a single corpus, and drawing the 30% sample, the total number of documents in the corpus is 1280903.
There are 2449465 sentences in the corpus, with an average of 1.91 per document. The distribution of sentences in the corpus is as follows.
| Statistic | Sentences |
|---|---|
| nbr.val | 1280903.0000000 |
| nbr.null | 0.0000000 |
| nbr.na | 0.0000000 |
| min | 1.0000000 |
| max | 104.0000000 |
| range | 103.0000000 |
| sum | 2449465.0000000 |
| median | 1.0000000 |
| mean | 1.9122955 |
| SE.mean | 0.0013247 |
| CI.mean.0.95 | 0.0025963 |
| var | 2.2476163 |
| std.dev | 1.4992052 |
| coef.var | 0.7839820 |
The data is positively skewed, as at least half of the documents contain a single sentence. Another way to visualize the descriptive statistics is with a histogram, which clearly demonstrates the positive skew of the data. This is understandable, given that the data sources include blog posts and a large volume of tweets that are limited to 140 characters.
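This claim is easy to verify directly from the per-document summary. The lines below are a minimal sketch, assuming the summaryStats data frame produced by quanteda::summary() in the appendix.

# proportion of documents containing exactly one sentence
mean(summaryStats$Sentences == 1)
# the median of 1 in the table above tells the same story
median(summaryStats$Sentences)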
There are 36167741 tokens in the corpus, with an average of 28.24 per document. The distribution of tokens in the corpus is as follows.
| Statistic | Tokens |
|---|---|
| nbr.val | 1280903.0000000 |
| nbr.null | 0.0000000 |
| nbr.na | 0.0000000 |
| min | 1.0000000 |
| max | 2570.0000000 |
| range | 2569.0000000 |
| sum | 36167741.0000000 |
| median | 19.0000000 |
| mean | 28.2361280 |
| SE.mean | 0.0275323 |
| CI.mean.0.95 | 0.0539624 |
| var | 970.9619990 |
| std.dev | 31.1602631 |
| coef.var | 1.1035601 |
The number of tokens per document varies more widely than the number of sentences, as illustrated by the following histogram.
Since the input data includes content from Twitter, we use the tokenizer options appropriate for Twitter content so that hashtags and Twitter handles are handled consistently. The following table lists the top 20 1-grams, along with their percentage frequency.
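Before tokenizing, it is worth gauging how much Twitter-specific markup the sample contains. The lines below are a minimal sketch using stringr, which is already loaded in the appendix; the regular expressions are rough approximations of hashtags and handles, not the patterns quanteda applies internally.

# approximate counts of hashtags and @handles in the sampled documents
sum(str_count(allData, "#\\w+"))
sum(str_count(allData, "@\\w+"))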
## Starting tokenization...
## ...preserving Twitter characters (#, @)...total elapsed: 0.64 seconds.
## ...tokenizing texts...total elapsed: 27.59 seconds.
## ...replacing Twitter characters (#, @)...total elapsed: 13.75 seconds.
## ...replacing names...total elapsed: 0.08 seconds.
## Finished tokenizing and cleaning 1,280,903 texts.
## [1] "quanteda::tokenize() took 44.3640871047974 secs"
## [1] "Building word frequencies took 48.5653610229492 secs"
| | ngram1 | Freq | pct |
|---|---|---|---|
| 368211 | the | 1430217 | 0.0475714 |
| 373119 | to | 827045 | 0.0275089 |
| 56766 | and | 725535 | 0.0241325 |
| 45936 | a | 715601 | 0.0238021 |
| 273567 | of | 602554 | 0.0200419 |
| 195020 | i | 497623 | 0.0165518 |
| 198057 | in | 495646 | 0.0164860 |
| 159440 | for | 330303 | 0.0109864 |
| 203166 | is | 322820 | 0.0107375 |
| 368047 | that | 313175 | 0.0104167 |
| 412554 | you | 284410 | 0.0094599 |
| 203791 | it | 275413 | 0.0091607 |
| 275839 | on | 246382 | 0.0081951 |
| 402775 | with | 214875 | 0.0071471 |
| 395683 | was | 187372 | 0.0062323 |
| 258602 | my | 181456 | 0.0060355 |
| 63905 | at | 171815 | 0.0057149 |
| 72222 | be | 164377 | 0.0054675 |
| 369894 | this | 163293 | 0.0054314 |
| 183675 | have | 160476 | 0.0053377 |
There are a total of 415742 distinct 1-grams in the corpus. The top 20 1-grams represent 27.6417217 percent of the total frequency in the corpus. The top 100 1-grams represent 45.9049687 percent, and the top 7,000 represent 88.9554625 percent. These results are very similar to the Oxford English Corpus coverage statistics cited earlier in this report.
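These coverage figures can be reproduced directly from the frequency table. The function below is a minimal sketch, assuming the wordFreq data frame built in the appendix and already sorted by descending frequency; the same calculation applies to the 2-gram and 3-gram tables reported later.

# share of total token frequency covered by the top N n-grams
coverage <- function(freqTable, topN) {
  sum(freqTable$pct[1:topN])
}
coverage(wordFreq, 20)
coverage(wordFreq, 100)
coverage(wordFreq, 7000)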
We expect individual 2-grams and 3-grams to occur less frequently than individual 1-grams, since there are far more possible combinations. The following table illustrates frequencies for the top 20 2-grams.
## [1] "Building word frequencies for 2-grams took 2.87706534862518 mins"
| | ngram2 | Freq | pct |
|---|---|---|---|
| 3696703 | of_the | 128716 | 0.0044717 |
| 2642310 | in_the | 123801 | 0.0043010 |
| 5415179 | to_the | 64250 | 0.0022321 |
| 2032649 | for_the | 60454 | 0.0021002 |
| 3755613 | on_the | 59042 | 0.0020512 |
| 5389375 | to_be | 48555 | 0.0016869 |
| 551272 | at_the | 43061 | 0.0014960 |
| 397945 | and_the | 37822 | 0.0013140 |
| 2619378 | in_a | 36254 | 0.0012595 |
| 5918482 | with_the | 32021 | 0.0011124 |
| 2744231 | is_a | 30226 | 0.0010501 |
| 2785372 | it_was | 28665 | 0.0009959 |
| 2011621 | for_a | 28328 | 0.0009841 |
| 2572988 | i_have | 26232 | 0.0009113 |
| 2106538 | from_the | 26199 | 0.0009102 |
| 2577063 | i_was | 25887 | 0.0008993 |
| 372912 | and_i | 24731 | 0.0008592 |
| 2780384 | it_is | 24634 | 0.0008558 |
| 5897926 | with_a | 24420 | 0.0008484 |
| 5874292 | will_be | 24035 | 0.0008350 |
There are a total of 6057842 distinct 2-grams in the corpus. The top 20 2-grams represent 3.1174432 percent of the total frequency in the corpus. The top 100 2-grams represent 7.0993589 percent, and the top 7,000 represent 33.0870228 percent.
The next table illustrates frequencies for the top 20 3-grams.
## [1] "Building word frequencies for 3-grams took 9.06613223155339 mins"
| | ngram3 | Freq | pct |
|---|---|---|---|
| 9857464 | one_of_the | 10447 | 0.0003797 |
| 273470 | a_lot_of | 9057 | 0.0003292 |
| 13000680 | thanks_for_the | 7200 | 0.0002617 |
| 14249722 | to_be_a | 5504 | 0.0002001 |
| 5554493 | going_to_be | 5247 | 0.0001907 |
| 13332134 | the_end_of | 4473 | 0.0001626 |
| 6612296 | i_want_to | 4461 | 0.0001621 |
| 10116911 | out_of_the | 4455 | 0.0001619 |
| 7309386 | it_was_a | 4280 | 0.0001556 |
| 12261378 | some_of_the | 4110 | 0.0001494 |
| 1623228 | as_well_as | 4081 | 0.0001483 |
| 1946735 | be_able_to | 4000 | 0.0001454 |
| 10270979 | part_of_the | 3637 | 0.0001322 |
| 6564472 | i_have_a | 3516 | 0.0001278 |
| 13618756 | the_rest_of | 3394 | 0.0001234 |
| 6566533 | i_have_to | 3377 | 0.0001227 |
| 8062054 | looking_forward_to | 3343 | 0.0001215 |
| 6550904 | i_don’t_know | 3255 | 0.0001183 |
| 13361442 | the_first_time | 3140 | 0.0001141 |
| 7124802 | is_going_to | 3096 | 0.0001125 |
There are a total of 16296115 distinct 3-grams in the corpus. The top 20 3-grams represent 0.3419354 percent of the total frequency in the corpus. The top 100 3-grams represent 0.9608744 percent, and the top 7,000 represent 8.153288 percent.
As expected, as the number of tokens in the n-gram increases, the percentage of total frequency accounted for by the top N n-grams declines.
The dictionary capabilities within the quanteda package can be used to evaluate words from foreign languages. One can keep specific words in the document-feature matrix by supplying a foreign-language dictionary via the keptFeatures= option of the quanteda::dfm() function. Since quanteda supports the WordStat format, dictionaries for a variety of languages may be downloaded from the Provalis Research Dictionary Download web page.
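As an illustration, the sketch below loads a WordStat-format dictionary and restricts a document-feature matrix to its entries, following the keptFeatures= approach described above. The file path is hypothetical (it would point at a dictionary downloaded from the Provalis Research page), and the exact form accepted by keptFeatures may vary across quanteda versions, so treat this as a sketch rather than a verified recipe.

# hypothetical path to a WordStat-format French dictionary from Provalis Research
frenchDict <- dictionary(file = "./capstone/dictionaries/french.cat",
                         format = "wordstat")
# build a document-feature matrix that keeps only features listed in the dictionary
dfmFrench <- dfm(theText, keptFeatures = frenchDict, verbose = FALSE)
topfeatures(dfmFrench, 20)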
Coverage of the word combinations can be increased in the following ways:
As I learn more about natural language processing, I will refine my strategy for developing a prediction algorithm. Based on what I know at this point, I expect to build a document-feature matrix of n-grams and, given the first n-1 words of an n-gram, use it to predict the n-th word. The Shiny application will store the predictive model, allow the user to enter some text, and then display a selection of candidate words from which the user can choose the next word.
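As a first approximation of that idea, the 3-gram frequency table built for this report can already act as a crude next-word lookup: split each 3-gram into a two-word prefix and a completion, then return the most frequent completions for a given prefix. The sketch below assumes the ngram3Freq data frame from the appendix; the helper name predictNextWord is hypothetical, and a keyed data.table lookup would replace the linear scan in a real application.

# split each 3-gram "w1_w2_w3" into a two-word prefix and its completion
parts <- strsplit(as.character(ngram3Freq$ngram3), "_", fixed = TRUE)
ngram3Freq$prefix <- sapply(parts, function(p) paste(p[1:2], collapse = "_"))
ngram3Freq$nextWord <- sapply(parts, function(p) p[length(p)])
# return the most frequent completions observed after a two-word prefix
predictNextWord <- function(w1, w2, topN = 3) {
  candidates <- ngram3Freq[ngram3Freq$prefix == paste(w1, w2, sep = "_"), ]
  candidates <- candidates[order(-candidates$Freq), ]
  head(candidates$nextWord, topN)
}
predictNextWord("looking", "forward")  # "to" should rank first, per the table above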
In the Appendix section we include a full listing of the code used in the analysis. Aside from the brief illustrative sketches above, we refrained from echoing the code in the body of the report to keep it easy to read.
programStart <- Sys.time()
setwd("C:/Users/Leonard/gitrepos/datascience")
# set seed to make results reproducible
set.seed(300755442)
# load libraries
library(readr)
library(stringr)
# assign names to source files
blogFile <- "./capstone/data/en_us.blogs.txt"
newsFile <- "./capstone/data/en_us.news.txt"
twitterFile <- "./capstone/data/en_us.twitter.txt"
intervalStart <- Sys.time()
blogData <- read_lines(blogFile)
newsData <- read_lines(newsFile)
twitterData <- readLines(twitterFile) # had to use readLines because it handled embedded nulls in 4 lines (167155, 268547, 1274086, and 1759032)
allData <- c(blogData,newsData,twitterData)
intervalEnd <- Sys.time()
paste("read_lines() took",intervalEnd - intervalStart,attr(intervalEnd - intervalStart,"units"))
# strip encoding artifacts: stray mojibake characters, literal quotes, and non-breaking spaces
allData <- str_replace_all(allData, "â","")
allData <- str_replace_all(allData, "\"","")
allData <- str_replace_all(allData, "\u00a0"," ")
sample_pct <- .3
sample_size <- round(length(allData) * sample_pct,0)
allData <- sample(allData,sample_size)
paste("After sampling, allData contains",length(allData),"documents.")
library(quanteda)
library(knitr)
library(pastecs)
intervalStart <- Sys.time()
theText <- corpus(allData)
intervalEnd <- Sys.time()
paste("quanteda::corpus() took",intervalEnd - intervalStart,attr(intervalEnd - intervalStart,"units"))
# calculate number of words and sentences per blog
intervalStart <- Sys.time()
summaryStats <- summary(theText,verbose=FALSE,n=ndoc(theText))
intervalEnd <- Sys.time()
paste("quanteda::summary() took",intervalEnd - intervalStart,attr(intervalEnd - intervalStart,"units"))
sentences <- stat.desc(summaryStats$Sentences,desc=TRUE,basic=TRUE)
tokens <- stat.desc(summaryStats$Tokens,desc=TRUE,basic=TRUE)
hist(summaryStats$Sentences,
main="Sentences per Blog / News / Twitter post")
hist(summaryStats$Tokens,
main="Tokens per Blog / News / Twitter post")
# lower-case the corpus, then tokenize, dropping punctuation, numbers, and separators
theText <- toLower(theText)
intervalStart <- Sys.time()
words <- quanteda::tokenize(theText,
                            removePunct=TRUE,
                            removeNumbers=TRUE,
                            removeSeparators=TRUE,
                            verbose=TRUE)
intervalEnd <- Sys.time()
paste("quanteda::tokenize() took",intervalEnd - intervalStart,attr(intervalEnd - intervalStart,"units"))
# build 1-, 2-, and 3-gram vectors from the tokenized texts
ngram1 <- unlist(ngrams(words,n=1))
ngram2 <- unlist(ngrams(words,n=2))
ngram3 <- unlist(ngrams(words,n=3))
# create data frame of 1-gram frequencies and their share of total tokens
intervalStart <- Sys.time()
wordFreq <- as.data.frame(table(ngram1))
wordFreq$pct <- wordFreq$Freq / sum(wordFreq$Freq)
intervalEnd <- Sys.time()
paste("Building word frequencies took",intervalEnd - intervalStart,attr(intervalEnd - intervalStart,"units"))
wordFreq <- wordFreq[order(-wordFreq$Freq),]
kable(wordFreq[1:20,])
hist(wordFreq$Freq,
main="Frequency Distribution of 1-grams")
intervalStart <- Sys.time()
ngram2Freq <- as.data.frame(table(ngram2))
ngram2Freq$pct <- ngram2Freq$Freq / sum(ngram2Freq$Freq)
intervalEnd <- Sys.time()
paste("Building word frequencies for 2-grams took",intervalEnd - intervalStart,attr(intervalEnd - intervalStart,"units"))
ngram2Freq <- ngram2Freq[order(-ngram2Freq$Freq),]
kable(ngram2Freq[1:20,])
hist(ngram2Freq$Freq,
main="Frequency Distribution of 2-grams")
intervalStart <- Sys.time()
ngram3Freq <- as.data.frame(table(ngram3))
ngram3Freq$pct <- ngram3Freq$Freq / sum(ngram3Freq$Freq)
intervalEnd <- Sys.time()
paste("Building word frequencies for 3-grams took",intervalEnd - intervalStart,attr(intervalEnd - intervalStart,"units"))
ngram3Freq <- ngram3Freq[order(-ngram3Freq$Freq),]
kable(ngram3Freq[1:20,])
hist(ngram3Freq$Freq,
main="Frequency Distribution of 3-grams")
Finally, we include the session information for the analysis.
sessionInfo()
## R version 3.2.4 (2016-03-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 10586)
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] pastecs_1.3-18 boot_1.3-18 knitr_1.12.3 quanteda_0.9.4
## [5] stringr_1.0.0 readr_0.2.2
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.3 lattice_0.20-33 digest_0.6.9 chron_2.3-47
## [5] grid_3.2.4 formatR_1.3 magrittr_1.5 evaluate_0.8.3
## [9] highr_0.5.1 stringi_1.0-1 data.table_1.9.6 ca_0.64
## [13] Matrix_1.2-4 rmarkdown_0.9.5 tools_3.2.4 parallel_3.2.4
## [17] yaml_2.1.13 htmltools_0.3
intervalEnd <- Sys.time()
paste("Milestone Report took",intervalEnd - programStart,attr(intervalEnd - programStart,"units"))
## [1] "Milestone Report took 27.6026578823725 mins"