This report presents basic approach and results of the exploratory data analysis of the text corpus from english speaking auditorium. The final goal of the research is building of next word prediction algorithm and engine. Such an engine may be used at internet sites and in mobile devices.
The report is generated with R/knitr, the most of the shown data are calculated at the execution time.
On the other hand, the report may be used for marketing purposes, so it should be understandable to a non-data scientist manager. Consequently, the main part of source code is hidden in the text.
The goals of the milestone project are to apply and demonstrate the following abilities:
The data of an English text corpus called HC Corpora were downloaded and analysed as a primary training data set. As a reference english word lists were used the data from Wordnet and SCOWL projects.
Downloading, processing and exploration analysis of the data were performed using tools of R, a free software environment for statistical computing and graphics.
Aiming a non-data scientist manager with this report, all used processing and configuration routines stay under the hood, being still available for technical experts as source code at GitHub.
Data loading and unpacking is performed in the background using R function utils::download.file and utils:unzip .
Here only the proofs of the work done:
## Status before downloading
dir(datadir)
character(0)
dir(file.path(datadir,"final","en_US"))
character(0)
dir(file.path(scowldir))
character(0)
dir(file.path(datadir,"en_dicts"))
character(0)
## Status after downloading and unpacking the text corpus
dir(datadir)
[1] "Coursera-SwiftKey.zip" "en_dicts" "final"
[4] "SCOWL"
dir(file.path(datadir,"final","en_US"))
[1] "en_US.blogs.txt" "en_US.news.txt" "en_US.twitter.txt"
dir(file.path(scowldir))
character(0)
dir(file.path(datadir,"en_dicts"))
character(0)
Some technical details are unavoidable at this place, as we must reach some decisions, influencing feasibility and credibility of the following statistical evaluations and thus predictions.
Evaluation of the text corpus at the command line shows, that the text files has each from 899288 to 2360148 lines and respectively from 30373559 to 34365936 “words” (any sequence of non-blank characters).
From the technical point of view, the total volume of the raw text data reaches 556 megabytes, or about one gigabyte when loaded in the memory for analysis. Having the memory and CPU intensive explorative procedures in mind, the corpus in toto is definitely too big. We use randomized selection preparing the work data set (effectively 6% of all lines).
Loaded in the memory, data set has 54048, 60673, 141149 lines from blogs, news and twitter. The total size of the data in memory is 48 megabyte, so the data set stays manageable.
We load and read in the reference english word lists from Wordnet and SCOWL projects as well.
## Status after downloading the dictionaries
dir(datadir)
[1] "Coursera-SwiftKey.zip" "en_dicts" "final"
[4] "SCOWL"
dir(file.path(datadir,"final","en_US"))
[1] "en_US.blogs.txt" "en_US.news.txt" "en_US.twitter.txt"
dir(file.path(scowldir))
[1] "scowl-2016.06.26" "scowl-2016.06.26.tar.gz"
dir(file.path(datadir,"en_dicts"))
[1] "dict_scowl.txt" "dict_wordnet.txt"
The next performance relevant steps to be made are removing of articles, numbers (low predictive strength) and of longer non-verbal character sequences and trimming the text with respect to white spaces and punctuation marks (we leave them in). Than the words are to be filtered away, written with non-latin characters and after it those, not found in the reference list of total of 715094 words and variants from british, american, canadian and other flavors of English). We reject the stemming of the data set and reference word list.
This is a way tidy data being prepared, suitable for the regular exploration.
5412 or (0.262%) words were identified and removed as containing the non-latin characters. Further, 0 words could not be identified as english.
#Creating Term-Document Matrices
dtmTr <- DocumentTermMatrix(x = tidyTrainCorpus)
#Normalizing the frequencies in the matrix
cntIncid <- sum(sum(dtmTr))
dtmTr <- dtmTr/cntIncid
#Single word frequency, descending
wordFreqDescNorm <- sort(colSums(as.matrix(dtmTr)), decreasing = T)
The first step in the data analysis is calculating of the frequency of unique english words. The built Document-Text Matrix reveals the total of 4229287 word incidences of 74184 unique words.
The Top-20 single words by the frequency are (in descending order, frequencies in percent):
round(wordFreqDescNorm[1:20]*100,4)
and for that you with was this have are but
3.4013 1.5495 1.4644 1.3215 1.0037 0.8855 0.7530 0.7470 0.6925 0.6800
not from all will they said your his just about
0.5792 0.5425 0.4658 0.4503 0.4440 0.4323 0.4287 0.4278 0.4252 0.4171
Here is the estimation, how many top words build ten, twenty, fifty and ninety percent quantile of all word occurrences in the text corpus.
## Twenty percent of the words/stems are only the first twenty seven in the list.
## Less than 400 words constitute the half of the text corpus
sum(cumsum(wordFreqDescNorm) <= .1)
[1] 6
sum(cumsum(wordFreqDescNorm) <= .2)
[1] 27
sum(cumsum(wordFreqDescNorm) <= .5)
[1] 356
sum(cumsum(wordFreqDescNorm) <= .9)
[1] 8082
The pool, building the half of the text body, is only few hundred words.
Plotting the frequencies of the Top-300 words reveals the fast decrease up to rank about 50, at lower level the decline becomes much slower.
Hierarchical clustering is a method of cluster analysis which seeks to build a hierarchy of clusters.
To keep the plot manageable, we take only the Top-50 words by frequency in the cluster.
The word “and” keeps distance from the other popular words, showing less binding to the rest of the pool. The words from the Top-20 build own clusters.
K-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. The plot let us take a look at the same word pool, showing one more time the detachment of the word “and” and else essentially the same grouping.
To complete the mandatory program and have sum fun (sorry, some - I probably have had too much statistics ;-) , I’ve plotted the wordcloud of for Top-500 words.
It is nothing essentially new here, but the plot of the word universe is really pleasant!
Having seen the illustrations of the existent word associations, we should now come back to the more prosaic figures. The frequencies of all the word pairs occurring in the text corpus have been calculated and ranked. I took the punctuation marks for a “natural boundaries” building pairs, so that the last word before period or comma came as a second into the pair with the last but one, but not as a first in the next pair with the word after the punctuation mark. After the mark the count began from scratch, The same principle were applied to tri- and four-grams. Logically, this principle should weigh for the future prediction.
The total number of bigrams is 1402169. The number is really big. Fortunately for living speakers and the future prediction engine, it is still much less, than the number of possible permutations of all detected words (5503191672)! Statistically proven, English is not a simply word mess.
The top-20 bigrams are (absolute numbers here):
for (i in 1:20) print(paste(names(bigrams[i]),bigrams[i],sep = " - "))
[1] "to be - 9550"
[1] "it was - 5637"
[1] "i have - 5095"
[1] "will be - 4857"
[1] "and i - 4851"
[1] "i was - 4811"
[1] "going to - 4802"
[1] "it is - 4736"
[1] "i am - 4492"
[1] "one of - 4327"
[1] "to get - 4257"
[1] "if you - 4175"
[1] "want to - 3842"
[1] "have to - 3739"
[1] "to do - 3432"
[1] "out of - 3358"
[1] "i think - 3338"
[1] "to see - 3288"
[1] "this is - 3220"
[1] "that i - 3006"
The list looks (and much longer downwards!) like a page from the beginner’s textbook in English and it suggests like no surprises are here. But we pack it carefully in the separate R data.table as the first building block for the future prediction model.
Analysis has revealed 2833571 unique word triplets.
One more time, no real surprises under the top-20 and long downwards:
for (i in 1:20) print(paste(names(trigrams[i]),trigrams[i],sep = " - "))
[1] "going to be - 1072"
[1] "i want to - 921"
[1] "as well as - 843"
[1] "be able to - 827"
[1] "i have to - 703"
[1] "looking forward to - 687"
[1] "is going to - 634"
[1] "thank you for - 634"
[1] "i need to - 583"
[1] "i love you - 556"
[1] "you want to - 554"
[1] "you have to - 548"
[1] "it would be - 511"
[1] "is one of - 502"
[1] "one of my - 488"
[1] "to go to - 481"
[1] "there is no - 444"
[1] "in front of - 440"
[1] "i think i - 413"
[1] "at end of - 406"
And again, we’ll pack the long list in a data table for future utilization in the model.
Total number of unique four-grams 2998516
Top-20 four-grams:
for (i in 1:20) print(paste(names(four_grams[i]),four_grams[i],sep = " - "))
[1] "is going to be - 289"
[1] "when it comes to - 252"
[1] "to be able to - 206"
[1] "if you want to - 181"
[1] "thank you so much - 159"
[1] "i am going to - 143"
[1] "what do you think - 123"
[1] "will be able to - 123"
[1] "one of my favorite - 121"
[1] "i would like to - 120"
[1] "i would love to - 112"
[1] "to be part of - 112"
[1] "i wish i could - 107"
[1] "i was going to - 106"
[1] "was going to be - 101"
[1] "i just want to - 100"
[1] "as much as i - 99"
[1] "hope to see you - 99"
[1] "thanks so much for - 97"
[1] "i have no idea - 95"
At this place we have at last a surprise of some kind. The number of the unique word combinations changes from three to four words no more (from 2833571 to 2998516) as compared to the possible permutations for both cases. It reflects probably the real spectrum of the senses, the people express with the phrases or expressions. At that point we may assume the triplets of words being reasonable predictors for the future model.
Taking the full n-gram for the “outcome”, we can define the n-gram without a last word as “predictor” (e.g. “go to” is a predictor for “go to bed”). Using this approach we try to explore a potential statistical performance of predictors of the length 1, 2 and 3 (bi-, tri- and four-grams used as source for predictors and outcome respectively).
Having produced the ranked frequencies of n-grams, we could try to build the prediction model at this basis. Loving extremes, we take the worst case - the most versatile predictors (by the number of related outcomes).
Here are some examples of such predictors, they are on the top by the overall incidence as well. A real hardship case for practical prediction!
to be it was will be it is i have i was has been have been i am out of
The following plot is built averaging data for fifty most “multipotent” predictors (i.e., related to the more n-grams than the others). For each of the predictors the ranked array of frequencies of the “parents” has been selected and then mixed. At this basis some kind of “saturation curve” was plotted, reflecting how many tries with the top predictors we would need to reach the given coverage level for summarized frequency of related outcomes.
Comparing the plots for bi-, tri- and four-grams as outcomes/parents (predictors of length 1, 2 or 3) we can see, that 50% coverage:
- with a one-word-predictor needs on average 100 ranked variants to be probed
- with predictor of two words only 50 tries needed
- the same with three words is finished in twenty 20 steps
The difference for 90% quantile is much more strong.
So, the longer predictors have better predictive performance (at least for one, two and three words).
The most simple prediction model at this point is using of the calculated and raked n-grams frequencies for generation of the suggestions for the next word. There are some practical questions to be answered building the feasible practical solution.
The questions posed above are to intriguing to wait with the answers. Experiences gathered in the former stage oblige.
Here is a demonstration of the prediction engine prototype as built in a short time.
## Warning in data.table(freq = c(used_words, dummies), words_EN =
## c(names(used_words), : Item 2 is of size 74184 but maximum size is 715094
## (recycled leaving remainder of 47438 items)
## Warning in rm(dummies): object 'dummies' not found
## Short engine tests
```r
## We are probably not perfect yet ;-)
#Normal case
system.time(rslt <- predictEngine("red"))
user system elapsed
0.004 0.000 0.004
head(rslt)
freq predictor outcome
1: 70 red sox
2: 59 red and
3: 36 red cross
4: 35 red wine
5: 31 red carpet
6: 23 red light
#Normal case
system.time(rslt <- predictEngine("it was"))
user system elapsed
0.004 0.000 0.002
head(rslt)
freq predictor outcome
1: 143 it was great
2: 123 it was so
3: 121 it was just
4: 109 it was not
5: 98 it was good
6: 82 it was very
# Predictor not found in DB - shortening the second word
system.time(rslt <- predictEngine("it readd"))
user system elapsed
0.004 0.000 0.004
head(rslt)
freq predictor outcome
1: 1 it read yet
# Control for the second part
system.time(rslt <- predictEngine("readdd"))
user system elapsed
0.18 0.00 0.18
head(rslt)
freq predictor outcome
1: 173 read it
2: 84 read this
3: 75 read my
4: 74 read and
5: 71 read that
6: 59 read about
# Predictor ends with punctuation mark - nothing to be returned
system.time(rslt <- predictEngine("red river ,"))
user system elapsed
0 0 0
head(rslt)
data frame with 0 columns and 0 rows
# After foreign words no suggestions delivered
system.time(rslt <- predictEngine("red уменьшению"))
user system elapsed
0.004 0.000 0.001
head(rslt)
data frame with 0 columns and 0 rows
```
----------------------------------------------------------------------------------
# Conclusion
- We have successfully downloaded and performed an exploration data analysis of the english recommended text corpus.
- The results of the analysis let us build a plain, but well performing prediction algorithm.
- The insights, algorithm and prototype of prediction engine are a very good basis for further technically efficient and
parsimonious development product.