Capstone Project - Exploratory Data Analysis

The objective of the project is to develop an application that can predict the next word as text is entered at the keyboard. We are provided with text corpora that we can use to build our algorithm. The first step is to explore those data, starting with the English corpora.

path = "Z:/Professionnel/Cours/repos/Capstone/Capstone data/en_US/"
fichiers <- list.files(path= path )

for (fichier in fichiers)
{
    connect <- file(description = paste(path, fichier, sep=""), open = "rb")
    texte<-readLines(con=connect, skipNul = TRUE)
    close(connect)
    cat("file:", fichier, "# lines:",length(texte), "Longest line:",max(nchar(texte)), "\n")
}
## file: en_US.blogs.txt # lines: 899288 Longest line: 40835 
## file: en_US.news.txt # lines: 1010242 Longest line: 11384 
## file: en_US.twitter.txt # lines: 2360148 Longest line: 213

At this point, we do not estimate the number of words; we will do this during the tokenization phase.

Cleaning Up

We explore the Twitter data first. The raw data need some cleanup: they contain “funny” characters (emoticons, etc.). As the texts are in English, we can simply remove the non-ASCII characters. We time the operation.

# Replace every non-ASCII byte (value > 127) with a space
ununi <- function(x)
{
    rawline <- charToRaw(x)
    rawline[rawline > as.raw(127)] <- as.raw(32)
    return(rawToChar(rawline))
}

system.time(texte <- unlist(lapply(texte, ununi)))
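For illustration (this example is not part of the original analysis), the helper can be applied to a single string; assuming UTF-8 input, every byte of a multi-byte character is replaced by a space:

ununi("caf\u00e9 \u2764")   # each non-ASCII byte becomes a space: the accented
                            # character (2 bytes) and the emoji (3 bytes) turn into runs of spaces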

We define random samples of increasing size, tokenize them (using functions provided by the “tm” package) and record the processing time, the number of tokens and the number of distinct words detected.

library(tm)
## Loading required package: NLP
Tlength <- length(texte)

WordCount <-
    data.frame(
        size = integer(),
        words = integer(),
        tokens = integer(),
        TimeTokens = numeric(),
        TimeWords = numeric()
    )

LoopLength <- 25   # number of sample sizes to test

# Sample sizes, evenly spaced from a single line up to the whole corpus
Lengths <- as.integer(seq(from = 1,
                          to = Tlength,
                          length.out = LoopLength))

for (Length in Lengths) {

    # Draw a random sample of "Length" lines and collapse it into one string
    echant <-
        paste(texte[sample(length(texte), Length)], collapse = ' ')

    # Time the tokenization and the word-frequency count separately
    TTok <- system.time(tokens <- scan_tokenizer(echant))

    TWords <-
        system.time(WordFreq <- sort(table(tokens), decreasing = TRUE))

    # Store sample size, distinct-word count, token count and elapsed times
    ligne <-
        list(
            size = Length,
            words = length(WordFreq),
            tokens = length(tokens),
            TimeTokens = TTok[3],
            TimeWords = TWords[3]
        )
    WordCount <- rbind(WordCount, ligne)
}
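The regression summaries below were produced by code that is not echoed in the report. A minimal sketch of the two fits (the object names FitLin and FitLog are assumptions) would be:

# Linear fit: number of distinct words vs. sample size
FitLin <- lm(WordCount$words ~ WordCount$size)
summary(FitLin)

# Log-log fit: a straight line here indicates a power law
FitLog <- lm(log(WordCount$words) ~ log(WordCount$size))
summary(FitLog)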

## 
## Call:
## lm(formula = WordCount$words ~ WordCount$size)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -162336  -27396   14793   35966   47150 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.624e+05  1.959e+04   8.287 2.33e-08 ***
## WordCount$size 4.896e-01  1.423e-02  34.408  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50460 on 23 degrees of freedom
## Multiple R-squared:  0.9809, Adjusted R-squared:  0.9801 
## F-statistic:  1184 on 1 and 23 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = log(WordCount$words) ~ log(WordCount$size))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.08619 -0.04642 -0.01913  0.02993  0.17215 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         2.976562   0.062292   47.78   <2e-16 ***
## log(WordCount$size) 0.758461   0.004606  164.67   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06483 on 23 degrees of freedom
## Multiple R-squared:  0.9992, Adjusted R-squared:  0.9991 
## F-statistic: 2.711e+04 on 1 and 23 DF,  p-value: < 2.2e-16

We see that the total number of tokens grows linearly with the sample size, whereas the number of distinct words does not. The “log-log” graph looks linear, which the regression analysis confirms: this is the signature of a “power law”. In this case, the exponent is about 0.76.

We also see that it is possible to tokenize the whole sample in a reasonable time: about 12s.

Comparing tm and tokenizers

We now try another tokenizer, from the “tokenizers” package.
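The corresponding code is not echoed; it presumably follows the same sampling loop as above, with scan_tokenizer replaced by a tokenizers function such as tokenize_words (the exact call and the object names other than WordCount2 are assumptions):

library(tokenizers)

WordCount2 <-
    data.frame(
        size = integer(),
        words = integer(),
        tokens = integer(),
        TimeTokens = numeric(),
        TimeWords = numeric()
    )

for (Length in Lengths) {
    echant <-
        paste(texte[sample(length(texte), Length)], collapse = ' ')

    # tokenize_words() returns a list with one element per input string
    TTok <- system.time(tokens <- unlist(tokenize_words(echant)))

    TWords <-
        system.time(WordFreq <- sort(table(tokens), decreasing = TRUE))

    WordCount2 <- rbind(WordCount2,
                        list(size = Length,
                             words = length(WordFreq),
                             tokens = length(tokens),
                             TimeTokens = TTok[3],
                             TimeWords = TWords[3]))
}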

## 
## Attaching package: 'tokenizers'
## The following object is masked from 'package:tm':
## 
##     stopwords

## 
## Call:
## lm(formula = WordCount2$words ~ WordCount2$size)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -67830  -8680   5803  14359  17354 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     6.785e+04  7.613e+03   8.913  6.4e-09 ***
## WordCount2$size 1.365e-01  5.530e-03  24.687  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19610 on 23 degrees of freedom
## Multiple R-squared:  0.9636, Adjusted R-squared:  0.9621 
## F-statistic: 609.5 on 1 and 23 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = log(WordCount2$words) ~ log(WordCount2$size))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.08697 -0.04405 -0.01788  0.02132  0.18650 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          3.222465   0.062951   51.19   <2e-16 ***
## log(WordCount2$size) 0.657708   0.004655  141.30   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06552 on 23 degrees of freedom
## Multiple R-squared:  0.9988, Adjusted R-squared:  0.9988 
## F-statistic: 1.996e+04 on 1 and 23 DF,  p-value: < 2.2e-16

We see that the “tokenizers” library generates smaller token sets in a comparable amount of time. We also find the same kind of “power law”, but with a smaller exponent (about 0.66). At this stage, we decide to work with the “tokenizers” library, because the smaller token sets should reduce the memory consumption of the future application.

Getting the n-grams and exploring them

We are now going to have a look at the n-grams: we detect 2-grams and 3-grams.
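The extraction code is hidden; the timings below presumably come from something along these lines (tokenize_ngrams from the “tokenizers” package, applied to the whole cleaned corpus collapsed into one string; the object names are assumptions):

# Assumed: the whole cleaned corpus collapsed into a single string
fulltext <- paste(texte, collapse = ' ')
system.time(grams2 <- unlist(tokenize_ngrams(fulltext, n = 2)))
system.time(grams3 <- unlist(tokenize_ngrams(fulltext, n = 3)))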

##    user  system elapsed 
##  234.22    1.86  241.83
##    user  system elapsed 
##  572.99    1.97  591.12

Let’s now summarise the results for the “1-grams”, “2-grams” and “3-grams”.
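The summaries below (count, memory footprint, most and least frequent entries) can be produced along these lines; this is a sketch with assumed object names (tokens4 is taken to be the 1-gram vector, as suggested by the table header in the output):

Freq1 <- sort(table(tokens4), decreasing = TRUE)
cat("1-Grams: number:", length(Freq1), "memory size:", object.size(Freq1), "\n")
head(Freq1, 15)   # most frequent words
tail(Freq1, 15)   # least frequent words, mostly noise with a count of 1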

## 1-Grams: number: 367527 memory size: 23960248
## tokens4
##    the     to      i      a    you    and    for     in     of     is 
## 937405 788647 724437 611505 549009 438540 385349 380382 359636 358777 
##     it     my     on   that     me 
## 296213 291907 278033 234898 202568
## tokens4
##                       zzif                  zziiinnng 
##                          1                          1 
##                       zzle                      zzolo 
##                          1                          1 
##               zzooooommmmm                 zzoooooone 
##                          1                          1 
##                        zzt            zzziiiiiiiiippp 
##                          1                          1 
##                       zzzs          zzzzoooooommmmmmm 
##                          1                          1 
## zzzzz___eeee__zzzzz_eheheh                     zzzzzn 
##                          1                          1 
##                     zzzzzz                    zzzzzzs 
##                          1                          1 
##                   zzzzzzzx 
##                          1

## 2-Grams: number: 5305561 memory size: 369651984
## 
##     in the    for the     of the     on the      to be     to the 
##      78250      73921      56923      48432      47094      43385 
## thanks for     at the     i love   going to     have a  thank you 
##      42995      37162      35918      34273      33750      33404 
##     if you     i have       i am 
##      33391      31574      29838
## 
##                    zz z               zzhe gets              zzif stage 
##                       1                       1                       1 
##     zziiinnng rehearsal              zzle after         zzolo hopefully 
##                       1                       1                       1 
## zzoooooone throneflight               zzt album            zztop belief 
##                       1                       1                       1 
##           zztop surname                  zzup i   zzziiiiiiiiippp there 
##                       1                       1                       1 
##                  zzzs d        zzzzzs goodnight               zzzzzs in 
##                       1                       1                       1

## 3-Grams: number: 13906356 memory size: 1054281656
## 
##     thanks for the looking forward to      thank you for 
##              23619               8832               8678 
##         i love you     for the follow        going to be 
##               8419               7929               7415 
##      can't wait to          i want to           a lot of 
##               7344               7113               6250 
##            to be a          i need to           i have a 
##               5992               5921               5763 
##         one of the       have a great         to see you 
##               5587               5450               5303
## 
##                                       zz z ill 
##                                              1 
##                               zzhe gets really 
##                                              1 
##                              zzif stage lights 
##                                              1 
##                        zziiinnng rehearsal was 
##                                              1 
##                                zzle after this 
##                                              1 
##                            zzolo hopefully joe 
##                                              1 
## zzoooooone throneflight bestflightbacktolaever 
##                                              1 
##                                   zzt album is 
##                                              1 
##                                 zztop belief i 
##                                              1 
##                             zztop surname last 
##                                              1 
##                                    zzup guys i 
##                                              1 
##                                    zzup i miss 
##                                              1 
##                     zzziiiiiiiiippp there goes 
##                                              1 
##                      zzzzzs goodnight tweeters 
##                                              1 
##                               zzzzzs in before 
##                                              1

Future Work

We see that we’ll have to clean up the lists of words. We’ll do this by validating the words against lists of English words, such as http://www-01.sil.org/linguistics/wordlists/english/. This will also be an opportunity to remove profanity.
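A possible implementation (a sketch only; wordlist.txt and profanity.txt are hypothetical local files with one word per line, and Freq1 is the 1-gram frequency table from the sketch above):

dico <- tolower(readLines("wordlist.txt", skipNul = TRUE))   # hypothetical English word list
bad  <- tolower(readLines("profanity.txt", skipNul = TRUE))  # hypothetical profanity list
keep <- names(Freq1) %in% dico & !(names(Freq1) %in% bad)
Freq1Clean <- Freq1[keep]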

When this is done, we’ll implement Katz’s back-off model to generate the word suggestions.
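As a preview, here is a deliberately simplified sketch: it only implements the back-off order, not the discounting and back-off weights of the full Katz model, and all object names are assumptions.

# Given the last two words typed, prefer 3-gram continuations,
# then 2-gram continuations, then the overall most frequent words.
# Freq1, Freq2, Freq3: frequency tables sorted in decreasing order,
# with space-separated n-grams as names.
predict_next <- function(w1, w2, Freq3, Freq2, Freq1, k = 3) {
    hits <- Freq3[startsWith(names(Freq3), paste(w1, w2, ""))]
    if (length(hits) == 0) hits <- Freq2[startsWith(names(Freq2), paste(w2, ""))]
    if (length(hits) == 0) hits <- Freq1
    # return the last word of the k most frequent matches
    vapply(strsplit(head(names(hits), k), " "),
           function(g) g[length(g)], character(1))
}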