Introduction

This report presents basic approach and results of the exploratory data analysis of the text corpus from english speaking auditorium. The final goal of the research is building of next word prediction algorithm and engine. Such an engine may be used at internet sites and in mobile devices.

The report is generated with R/knitr, the most of the shown data are calculated at the execution time.

On the other hand, the report may be used for marketing purposes, so it should be understandable to a non-data scientist manager. Consequently, the main part of source code is hidden in the text.

Definition of goals

The goals of the milestone project are to apply and demonstrate the following abilities:

Obtaining and preprocessing the data.
Applying of the tools in the area of natural language processing and text mining.
Exploratory analysis of the data.
Formulating of the technical terms and prototyping the first and plain prediction algorithm.
Using the elaborated insights and iterating the think-try-think cycle we can subsequently develop more sophisticated concepts and models.

Language data sources

The data of an English text corpus called HC Corpora were downloaded and analysed as a primary training data set. As a reference english word lists were used the data from Wordnet and SCOWL projects.

Initial expectations and assumptions

The word set of the text corpus is expected to mainly correspond to the standard English.
Trivial english words and expressions are expected to be on the top by the frequency.
Typos and conscious mutilations are usual in Internet language.
Written expressive language of the news might differ from that of blogs or Twitter, the possible differences will be not studied or considered at this moment.
Internet jargon and dialects like smileys and emoticons should become the subject of a studies, other than this one.
Non-english words will be ignored, having the final goal of prediction in mind.

Technical backgrounds

Downloading, processing and exploration analysis of the data were performed using tools of R, a free software environment for statistical computing and graphics.
Aiming a non-data scientist manager with this report, all used processing and configuration routines stay under the hood, being still available for technical experts as source code at GitHub.

Data loading

Data loading and unpacking is performed in the background using R function utils::download.file and utils:unzip .

Here only the proofs of the work done:

## Status before downloading
dir(datadir)
character(0)
dir(file.path(datadir,"final","en_US"))
character(0)
dir(file.path(scowldir))
character(0)
dir(file.path(datadir,"en_dicts"))
character(0)

## Status after downloading and unpacking the text corpus
dir(datadir)
[1] "Coursera-SwiftKey.zip" "en_dicts"              "final"                
[4] "SCOWL"                
dir(file.path(datadir,"final","en_US"))
[1] "en_US.blogs.txt"   "en_US.news.txt"    "en_US.twitter.txt"
dir(file.path(scowldir))
character(0)
dir(file.path(datadir,"en_dicts"))
character(0)

Exploratory data analysis

Preliminary text corpus analysis and preprocessing the data

Some technical details are unavoidable at this place, as we must reach some decisions, influencing feasibility and credibility of the following statistical evaluations and thus predictions.

Evaluation of the text corpus at the command line shows, that the text files has each from 899288 to 2360148 lines and respectively from 30373559 to 34365936 “words” (any sequence of non-blank characters).

From the technical point of view, the total volume of the raw text data reaches 556 megabytes, or about one gigabyte when loaded in the memory for analysis. Having the memory and CPU intensive explorative procedures in mind, the corpus in toto is definitely too big. We use randomized selection preparing the work data set (effectively 6% of all lines).

Loaded in the memory, data set has 54048, 60673, 141149 lines from blogs, news and twitter. The total size of the data in memory is 48 megabyte, so the data set stays manageable.

We load and read in the reference english word lists from Wordnet and SCOWL projects as well.

## Status after downloading the dictionaries
dir(datadir)
[1] "Coursera-SwiftKey.zip" "en_dicts"              "final"                
[4] "SCOWL"                
dir(file.path(datadir,"final","en_US"))
[1] "en_US.blogs.txt"   "en_US.news.txt"    "en_US.twitter.txt"
dir(file.path(scowldir))
[1] "scowl-2016.06.26"        "scowl-2016.06.26.tar.gz"
dir(file.path(datadir,"en_dicts"))
[1] "dict_scowl.txt"   "dict_wordnet.txt"

The next performance relevant steps to be made are removing of articles, numbers (low predictive strength) and of longer non-verbal character sequences and trimming the text with respect to white spaces and punctuation marks (we leave them in). Than the words are to be filtered away, written with non-latin characters and after it those, not found in the reference list of total of 715094 words and variants from british, american, canadian and other flavors of English). We reject the stemming of the data set and reference word list.

This is a way tidy data being prepared, suitable for the regular exploration.

5412 or (0.262%) words were identified and removed as containing the non-latin characters. Further, 0 words could not be identified as english.

Exploratory data analysis

#Creating Term-Document Matrices

dtmTr <- DocumentTermMatrix(x = tidyTrainCorpus)
#Normalizing the frequencies in the matrix
cntIncid <- sum(sum(dtmTr))
dtmTr <- dtmTr/cntIncid

#Single word frequency, descending
wordFreqDescNorm <- sort(colSums(as.matrix(dtmTr)), decreasing = T)

The first step in the data analysis is calculating of the frequency of unique english words. The built Document-Text Matrix reveals the total of 4229287 word incidences of 74184 unique words.

The Top-20 single words by the frequency are (in descending order, frequencies in percent):

round(wordFreqDescNorm[1:20]*100,4)
   and    for   that    you   with    was   this   have    are    but 
3.4013 1.5495 1.4644 1.3215 1.0037 0.8855 0.7530 0.7470 0.6925 0.6800 
   not   from    all   will   they   said   your    his   just  about 
0.5792 0.5425 0.4658 0.4503 0.4440 0.4323 0.4287 0.4278 0.4252 0.4171

Here is the estimation, how many top words build ten, twenty, fifty and ninety percent quantile of all word occurrences in the text corpus.

## Twenty percent of the words/stems are only the first twenty seven in the list.
## Less than 400 words constitute the half of the text corpus
sum(cumsum(wordFreqDescNorm) <= .1)
[1] 6
sum(cumsum(wordFreqDescNorm) <= .2)
[1] 27
sum(cumsum(wordFreqDescNorm) <= .5)
[1] 356
sum(cumsum(wordFreqDescNorm) <= .9)
[1] 8082

The pool, building the half of the text body, is only few hundred words.

Plotting the frequencies of the Top-300 words reveals the fast decrease up to rank about 50, at lower level the decline becomes much slower.

Clustering by Term Similarity

Hierarchal Clustering

Hierarchical clustering is a method of cluster analysis which seeks to build a hierarchy of clusters.

To keep the plot manageable, we take only the Top-50 words by frequency in the cluster.

The word “and” keeps distance from the other popular words, showing less binding to the rest of the pool. The words from the Top-20 build own clusters.

K-means clustering

K-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. The plot let us take a look at the same word pool, showing one more time the detachment of the word “and” and else essentially the same grouping.

Word clouds

To complete the mandatory program and have sum fun (sorry, some - I probably have had too much statistics ;-) , I’ve plotted the wordcloud of for Top-500 words.

It is nothing essentially new here, but the plot of the word universe is really pleasant!

Frequencies of the word pairs (bigrams)

Having seen the illustrations of the existent word associations, we should now come back to the more prosaic figures. The frequencies of all the word pairs occurring in the text corpus have been calculated and ranked. I took the punctuation marks for a “natural boundaries” building pairs, so that the last word before period or comma came as a second into the pair with the last but one, but not as a first in the next pair with the word after the punctuation mark. After the mark the count began from scratch, The same principle were applied to tri- and four-grams. Logically, this principle should weigh for the future prediction.

The total number of bigrams is 1402169. The number is really big. Fortunately for living speakers and the future prediction engine, it is still much less, than the number of possible permutations of all detected words (5503191672)! Statistically proven, English is not a simply word mess.
The top-20 bigrams are (absolute numbers here):

for (i in 1:20) print(paste(names(bigrams[i]),bigrams[i],sep = " - "))
[1] "to be - 9550"
[1] "it was - 5637"
[1] "i have - 5095"
[1] "will be - 4857"
[1] "and i - 4851"
[1] "i was - 4811"
[1] "going to - 4802"
[1] "it is - 4736"
[1] "i am - 4492"
[1] "one of - 4327"
[1] "to get - 4257"
[1] "if you - 4175"
[1] "want to - 3842"
[1] "have to - 3739"
[1] "to do - 3432"
[1] "out of - 3358"
[1] "i think - 3338"
[1] "to see - 3288"
[1] "this is - 3220"
[1] "that i - 3006"

The list looks (and much longer downwards!) like a page from the beginner’s textbook in English and it suggests like no surprises are here. But we pack it carefully in the separate R data.table as the first building block for the future prediction model.

Frequencies of the word triplets (trigrams)

Analysis has revealed 2833571 unique word triplets.
One more time, no real surprises under the top-20 and long downwards:

for (i in 1:20) print(paste(names(trigrams[i]),trigrams[i],sep = " - "))
[1] "going to be - 1072"
[1] "i want to - 921"
[1] "as well as - 843"
[1] "be able to - 827"
[1] "i have to - 703"
[1] "looking forward to - 687"
[1] "is going to - 634"
[1] "thank you for - 634"
[1] "i need to - 583"
[1] "i love you - 556"
[1] "you want to - 554"
[1] "you have to - 548"
[1] "it would be - 511"
[1] "is one of - 502"
[1] "one of my - 488"
[1] "to go to - 481"
[1] "there is no - 444"
[1] "in front of - 440"
[1] "i think i - 413"
[1] "at end of - 406"

And again, we’ll pack the long list in a data table for future utilization in the model.

Four-grams

Total number of unique four-grams 2998516

Top-20 four-grams:

for (i in 1:20) print(paste(names(four_grams[i]),four_grams[i],sep = " - "))
[1] "is going to be - 289"
[1] "when it comes to - 252"
[1] "to be able to - 206"
[1] "if you want to - 181"
[1] "thank you so much - 159"
[1] "i am going to - 143"
[1] "what do you think - 123"
[1] "will be able to - 123"
[1] "one of my favorite - 121"
[1] "i would like to - 120"
[1] "i would love to - 112"
[1] "to be part of - 112"
[1] "i wish i could - 107"
[1] "i was going to - 106"
[1] "was going to be - 101"
[1] "i just want to - 100"
[1] "as much as i - 99"
[1] "hope to see you - 99"
[1] "thanks so much for - 97"
[1] "i have no idea - 95"

At this place we have at last a surprise of some kind. The number of the unique word combinations changes from three to four words no more (from 2833571 to 2998516) as compared to the possible permutations for both cases. It reflects probably the real spectrum of the senses, the people express with the phrases or expressions. At that point we may assume the triplets of words being reasonable predictors for the future model.

Comparative statistical perfrormance of n-grams

Taking the full n-gram for the “outcome”, we can define the n-gram without a last word as “predictor” (e.g. “go to” is a predictor for “go to bed”). Using this approach we try to explore a potential statistical performance of predictors of the length 1, 2 and 3 (bi-, tri- and four-grams used as source for predictors and outcome respectively).

Having produced the ranked frequencies of n-grams, we could try to build the prediction model at this basis. Loving extremes, we take the worst case - the most versatile predictors (by the number of related outcomes).

Here are some examples of such predictors, they are on the top by the overall incidence as well. A real hardship case for practical prediction!

to be       it was      will be        it is       i have        i was     has been    have been         i am       out of

The following plot is built averaging data for fifty most “multipotent” predictors (i.e., related to the more n-grams than the others). For each of the predictors the ranked array of frequencies of the “parents” has been selected and then mixed. At this basis some kind of “saturation curve” was plotted, reflecting how many tries with the top predictors we would need to reach the given coverage level for summarized frequency of related outcomes.

Comparing the plots for bi-, tri- and four-grams as outcomes/parents (predictors of length 1, 2 or 3) we can see, that 50% coverage:
- with a one-word-predictor needs on average 100 ranked variants to be probed
- with predictor of two words only 50 tries needed
- the same with three words is finished in twenty 20 steps

The difference for 90% quantile is much more strong.

So, the longer predictors have better predictive performance (at least for one, two and three words).

Concetps and algorithms for prediction model.

The most simple prediction model at this point is using of the calculated and raked n-grams frequencies for generation of the suggestions for the next word. There are some practical questions to be answered building the feasible practical solution.

We will give the ranked by frequency lists of bi-, tri- and four-grams a trial as prediction mechanism.
All the technical routines for selection of the lists are as described in the source code (hidden) above.
We will probably need not only the in-line-prediction (word by word), but the in-word-prediction as well (predicting the ending of the word being just typed).
We need some fallback mechanisms for missing predictors in database, typos and so on.
The controversy between bias and variance of statistical model is ubiquitous. In our case we have data with a maximal variance. It is a vivid mapping of the test data set, but probably some to vivid. That should be kept in mind for advanced modeling.
The next omnipresent controversy is that between the statistical performance of the model and hardware performance of implementation.
For the sake of technical performance (execution times meant) I have packed the word lists into indexed data tables, each for a kind of n-grams.
The other facet of technical performance is memory consumption of the databases. It is definitely the matter for further research.
It is not only the capacity of hardware, limiting the depth of the prediction, but the willingness of the end user to scroll through the suggestion lists. Already at this stage it makes no sense to plan delivering all 9550 variants for the predicthor “to be” to the end user.
The limitations put by hardware and user interfaces of the gadgets like smartphone reinforce the considerations above.
We need for development much enough resources, to produce the parsimonious and efficient end product.

Quick and plain prototype of prediction engine

The questions posed above are to intriguing to wait with the answers. Experiences gathered in the former stage oblige.

Here is a demonstration of the prediction engine prototype as built in a short time.

The order of evaluation of predictor against database is sequence of 3, then 2 and 1 words.
The lower limit for in-word completion set at 3 characters.
The upper limit for the number of suggestions set at 30 words.
As a fallback for predictor missed in the database is used a derivate, shortened by one word at front (“nice moonlight evening” -> “moonlight evening”). The procedure is iterated repeated when needed.
The in-word prediction works with the same list, allowing the completion of the short chunks.
If the char sequence can not be found in the word list, it is iterative shortened and compared against the database.
The last resort is suggesting the top-30 words form the common list.
Right after the punctuation marks come no suggestions, the play begins from scratch with the first word.
The incredible performance of the indexed data tables and dplyr package are extensively used.
Actual output of the engine includes the data like frequency, the predictor and suggestion (last column).

## Warning in data.table(freq = c(used_words, dummies), words_EN =
## c(names(used_words), : Item 2 is of size 74184 but maximum size is 715094
## (recycled leaving remainder of 47438 items)

## Warning in rm(dummies): object 'dummies' not found

    ## Short engine tests
    
    
    ```r
    ## We are probably not perfect yet ;-)
    
    #Normal case
    system.time(rslt <- predictEngine("red"))
       user  system elapsed 
      0.004   0.000   0.004 
    head(rslt)
       freq predictor outcome
    1:   70       red     sox
    2:   59       red     and
    3:   36       red   cross
    4:   35       red    wine
    5:   31       red  carpet
    6:   23       red   light
    #Normal case
    system.time(rslt <- predictEngine("it was"))
       user  system elapsed 
      0.004   0.000   0.002 
    head(rslt)
       freq predictor outcome
    1:  143    it was   great
    2:  123    it was      so
    3:  121    it was    just
    4:  109    it was     not
    5:   98    it was    good
    6:   82    it was    very
    # Predictor not found in DB - shortening the second word
    system.time(rslt <- predictEngine("it readd"))
       user  system elapsed 
      0.004   0.000   0.004 
    head(rslt)
       freq predictor outcome
    1:    1   it read     yet
    # Control for the second part
    system.time(rslt <- predictEngine("readdd"))
       user  system elapsed 
       0.18    0.00    0.18 
    head(rslt)
       freq predictor outcome
    1:  173      read      it
    2:   84      read    this
    3:   75      read      my
    4:   74      read     and
    5:   71      read    that
    6:   59      read   about
    # Predictor ends with punctuation mark - nothing to be returned
    system.time(rslt <- predictEngine("red river ,"))
       user  system elapsed 
          0       0       0 
    head(rslt)
    data frame with 0 columns and 0 rows
    
    # After foreign words no suggestions delivered
    system.time(rslt <- predictEngine("red уменьшению"))
       user  system elapsed 
      0.004   0.000   0.001 
    head(rslt)
    data frame with 0 columns and 0 rows
    ```
    
    ----------------------------------------------------------------------------------  
    
    # Conclusion
    
    - We have successfully downloaded and performed an exploration data analysis of the english recommended text corpus.  
    - The results of the analysis let us build a plain, but well performing prediction algorithm.  
    - The insights, algorithm and prototype of prediction engine are a very good basis for further technically efficient and 
    parsimonious development product.

Milestone Project: Explorative Data Analysis

Data Science Specializaton Capstone Project

D.Dakhno

September 6, 2016

Introduction

Definition of goals

Language data sources

Initial expectations and assumptions

Technical backgrounds

Data loading

Exploratory data analysis

Preliminary text corpus analysis and preprocessing the data

Exploratory data analysis

Clustering by Term Similarity

Hierarchal Clustering

K-means clustering

Word clouds

Frequencies of the word pairs (bigrams)

Frequencies of the word triplets (trigrams)

Four-grams

Comparative statistical perfrormance of n-grams

Concetps and algorithms for prediction model.

Quick and plain prototype of prediction engine