The objective of the project is to develop an application that can predict text entered at the keyboard. We are provided with text corpora that we can use to build our algorithm. The first step is to explore those data; let’s have a look at the English corpora.
# Folder containing the English corpora
path <- "Z:/Professionnel/Cours/repos/Capstone/Capstone data/en_US/"
fichiers <- list.files(path = path)
# For each file, open it in binary mode, read every line,
# then report the number of lines and the length of the longest line
for (fichier in fichiers) {
  connect <- file(description = paste(path, fichier, sep = ""), open = "rb")
  texte <- readLines(con = connect, skipNul = TRUE)
  close(connect)
  cat("file:", fichier, "# lines:", length(texte),
      "Longest line:", max(nchar(texte)), "\n")
}
## file: en_US.blogs.txt # lines: 899288 Longest line: 40835
## file: en_US.news.txt # lines: 1010242 Longest line: 11384
## file: en_US.twitter.txt # lines: 2360148 Longest line: 213
At this point, we do not estimate the number of words; we will do this during the tokenization phase.
We now explore the Twitter data. The raw data need some cleanup: they contain some “funny” characters (emoticons, etc.). As the texts are in English, we can simply remove the non-ASCII characters. We time the operation.
# Replace every non-ASCII byte (value above 127) with a space
ununi <- function(x) {
  rawline <- charToRaw(x)
  rawline[rawline > as.raw(127)] <- as.raw(32)
  rawToChar(rawline)
}
# Apply the cleanup to every line of the Twitter data and time it
system.time(texte <- unlist(lapply(texte, ununi)))
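As a side note, base R’s iconv can perform a similar substitution in a single vectorized call; a minimal sketch, assuming the input is UTF-8 (not what is timed above, and its byte-level behaviour may differ slightly from ununi):
# Vectorized alternative: convert to ASCII, replacing every
# non-convertible byte with a space (shown for comparison only)
texte_ascii <- iconv(texte, from = "UTF-8", to = "ASCII", sub = " ")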
We define random samples of increasing size, tokenize them (using functions provided by the “tm” package) and record the processing time, the total number of tokens and the number of distinct words detected.
library(tm)
## Loading required package: NLP
Tlength <- length(texte)
# Results table: sample size, number of distinct words, number of tokens,
# and the elapsed times for tokenization and word counting
WordCount <- data.frame(
  size = integer(),
  words = integer(),
  tokens = integer(),
  TimeTokens = numeric(),
  TimeWords = numeric()
)
# 25 sample sizes, evenly spaced from 1 line up to the whole corpus
LoopLength <- 25
Lengths <- as.integer(seq(from = 1, to = Tlength, length.out = LoopLength))
for (Length in Lengths) {
  # Random sample of 'Length' lines, collapsed into a single string
  echant <- paste(texte[sample(length(texte), Length)], collapse = ' ')
  # Tokenize with tm's scan_tokenizer and time it
  TTok <- system.time(tokens <- scan_tokenizer(echant))
  # Build the sorted frequency table of the tokens and time it
  TWords <- system.time(WordFreq <- sort(table(tokens), decreasing = TRUE))
  ligne <- list(
    size = Length,
    words = length(WordFreq),
    tokens = length(tokens),
    TimeTokens = TTok[3],
    TimeWords = TWords[3]
  )
  WordCount <- rbind(WordCount, ligne)
}
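The regression summaries below come from fitting a linear model on the raw counts and on the log-transformed counts; the calls, read back from the “Call:” lines, are:
# Linear fit: number of distinct words against sample size
summary(lm(WordCount$words ~ WordCount$size))
# Log-log fit: the slope estimates the power-law exponent
summary(lm(log(WordCount$words) ~ log(WordCount$size)))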
##
## Call:
## lm(formula = WordCount$words ~ WordCount$size)
##
## Residuals:
## Min 1Q Median 3Q Max
## -162336 -27396 14793 35966 47150
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.624e+05 1.959e+04 8.287 2.33e-08 ***
## WordCount$size 4.896e-01 1.423e-02 34.408 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50460 on 23 degrees of freedom
## Multiple R-squared: 0.9809, Adjusted R-squared: 0.9801
## F-statistic: 1184 on 1 and 23 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = log(WordCount$words) ~ log(WordCount$size))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.08619 -0.04642 -0.01913 0.02993 0.17215
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.976562 0.062292 47.78 <2e-16 ***
## log(WordCount$size) 0.758461 0.004606 164.67 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06483 on 23 degrees of freedom
## Multiple R-squared: 0.9992, Adjusted R-squared: 0.9991
## F-statistic: 2.711e+04 on 1 and 23 DF, p-value: < 2.2e-16
We see that the total number of tokens increases linearly with the sample size, whereas the number of distinct words does not. The “log-log” graph seems to be linear, which is confirmed by the regression analysis. This is the signature of a “power law”; in this case, the exponent (the slope of the log-log fit) is about 0.76.
We also see that it is possible to tokenize the whole sample in a reasonable time: about 12 s.
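Reading the coefficients off the log-log fit, the relation is roughly words ≈ exp(2.98) × size^0.76, which gives a quick way to extrapolate vocabulary growth; a sketch:
# Rough extrapolation of the number of distinct words from the log-log fit
# (coefficients copied from the regression output above)
predicted_words <- function(size) exp(2.976562) * size^0.758461
predicted_words(length(texte))  # expected vocabulary for the full Twitter sample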
We now try another tokenizer, from the “tokenizers” package.
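A sketch of how the same measurements can be collected with this package, assuming the same sampling loop as above (only the tokenizer call changes, tokenize_words() replacing scan_tokenizer(); WordCount2 mirrors WordCount):
library(tokenizers)
# Empty copy of WordCount's structure to hold the new measurements
WordCount2 <- WordCount[0, ]
for (Length in Lengths) {
  echant <- paste(texte[sample(length(texte), Length)], collapse = ' ')
  # tokenize_words() lower-cases and strips punctuation by default
  TTok <- system.time(tokens <- unlist(tokenize_words(echant)))
  TWords <- system.time(WordFreq <- sort(table(tokens), decreasing = TRUE))
  WordCount2 <- rbind(WordCount2,
                      list(size = Length, words = length(WordFreq),
                           tokens = length(tokens),
                           TimeTokens = TTok[3], TimeWords = TWords[3]))
}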
##
## Attaching package: 'tokenizers'
## The following object is masked from 'package:tm':
##
## stopwords
##
## Call:
## lm(formula = WordCount2$words ~ WordCount2$size)
##
## Residuals:
## Min 1Q Median 3Q Max
## -67830 -8680 5803 14359 17354
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.785e+04 7.613e+03 8.913 6.4e-09 ***
## WordCount2$size 1.365e-01 5.530e-03 24.687 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19610 on 23 degrees of freedom
## Multiple R-squared: 0.9636, Adjusted R-squared: 0.9621
## F-statistic: 609.5 on 1 and 23 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = log(WordCount2$words) ~ log(WordCount2$size))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.08697 -0.04405 -0.01788 0.02132 0.18650
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.222465 0.062951 51.19 <2e-16 ***
## log(WordCount2$size) 0.657708 0.004655 141.30 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06552 on 23 degrees of freedom
## Multiple R-squared: 0.9988, Adjusted R-squared: 0.9988
## F-statistic: 1.996e+04 on 1 and 23 DF, p-value: < 2.2e-16
We see that the “tokenizers” library generates smaller token sets in a comparable amount of time. We also find the same kind of “power law”, but with a somewhat smaller exponent (about 0.66). At this stage, we decide to work with the “tokenizers” library, because it generates smaller token sets, which should have a positive impact on the memory consumption of the future application.
We are now going to have a look at the n-grams: we detect the 2-grams and the 3-grams, timing each operation.
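A sketch of the n-gram extraction with tokenizers::tokenize_ngrams (the two timings below correspond to the 2-gram and 3-gram runs; the exact options used are an assumption):
# 2-grams over the whole cleaned sample, timed
system.time(bigrams <- unlist(tokenize_ngrams(texte, n = 2, n_min = 2)))
# 3-grams, timed
system.time(trigrams <- unlist(tokenize_ngrams(texte, n = 3, n_min = 3)))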
## user system elapsed
## 234.22 1.86 241.83
## user system elapsed
## 572.99 1.97 591.12
Let’s now summarise the results for the “1-grams”, “2-grams” and “3-grams”: for each, the number of distinct n-grams, their memory size (in bytes), and the most and least frequent entries.
## 1-Grams: number: 367527 memory size: 23960248
## tokens4
## the to i a you and for in of is
## 937405 788647 724437 611505 549009 438540 385349 380382 359636 358777
## it my on that me
## 296213 291907 278033 234898 202568
## tokens4
## zzif zziiinnng
## 1 1
## zzle zzolo
## 1 1
## zzooooommmmm zzoooooone
## 1 1
## zzt zzziiiiiiiiippp
## 1 1
## zzzs zzzzoooooommmmmmm
## 1 1
## zzzzz___eeee__zzzzz_eheheh zzzzzn
## 1 1
## zzzzzz zzzzzzs
## 1 1
## zzzzzzzx
## 1
## 2-Grams: number: 5305561 memory size: 369651984
##
## in the for the of the on the to be to the
## 78250 73921 56923 48432 47094 43385
## thanks for at the i love going to have a thank you
## 42995 37162 35918 34273 33750 33404
## if you i have i am
## 33391 31574 29838
##
## zz z zzhe gets zzif stage
## 1 1 1
## zziiinnng rehearsal zzle after zzolo hopefully
## 1 1 1
## zzoooooone throneflight zzt album zztop belief
## 1 1 1
## zztop surname zzup i zzziiiiiiiiippp there
## 1 1 1
## zzzs d zzzzzs goodnight zzzzzs in
## 1 1 1
## 3-Grams: number: 13906356 memory size: 1054281656
##
## thanks for the looking forward to thank you for
## 23619 8832 8678
## i love you for the follow going to be
## 8419 7929 7415
## can't wait to i want to a lot of
## 7344 7113 6250
## to be a i need to i have a
## 5992 5921 5763
## one of the have a great to see you
## 5587 5450 5303
##
## zz z ill
## 1
## zzhe gets really
## 1
## zzif stage lights
## 1
## zziiinnng rehearsal was
## 1
## zzle after this
## 1
## zzolo hopefully joe
## 1
## zzoooooone throneflight bestflightbacktolaever
## 1
## zzt album is
## 1
## zztop belief i
## 1
## zztop surname last
## 1
## zzup guys i
## 1
## zzup i miss
## 1
## zzziiiiiiiiippp there goes
## 1
## zzzzzs goodnight tweeters
## 1
## zzzzzs in before
## 1
We see that we’ll have to clean up the lists of words. We’ll do this by validating the words against lists of English words, such as http://www-01.sil.org/linguistics/wordlists/english/. This will also be an opportunity to remove profanity.
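A possible shape for this cleanup, assuming the word list and a profanity list have been saved as plain-text files (the file names and the Freq1 table, standing in for the 1-gram frequencies above, are hypothetical):
# Hypothetical file names: one word per line in each file
dico <- tolower(readLines("wordsEn.txt"))
profanity <- tolower(readLines("profanity.txt"))
# Keep only the 1-grams that are valid English words and are not profane
keep <- names(Freq1) %in% dico & !(names(Freq1) %in% profanity)
Freq1 <- Freq1[keep]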
When this is done, we’ll implement Katz’s back-off model to generate the proposed words.
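For reference, a minimal sketch of the back-off idea, using a simplified “stupid backoff” scoring rather than full Katz with Good-Turing discounting (Freq1, Freq2, Freq3 are assumed to be named frequency vectors of 1-, 2- and 3-grams like those printed above):
# Score a candidate word w3 given the two previous words (w1, w2):
# use the trigram if it was observed, otherwise back off to the bigram,
# then to the unigram, applying a fixed penalty lambda at each step
backoff_score <- function(w1, w2, w3, Freq1, Freq2, Freq3, lambda = 0.4) {
  tri <- paste(w1, w2, w3)
  bi <- paste(w2, w3)
  if (!is.na(Freq3[tri])) {
    return(Freq3[tri] / Freq2[paste(w1, w2)])
  }
  if (!is.na(Freq2[bi])) {
    return(lambda * Freq2[bi] / Freq1[w2])
  }
  lambda^2 * Freq1[w3] / sum(Freq1)
}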