Introduction:

The goal of this project is simply to show that I have become comfortable working with the data and that I'm on track to create my own prediction algorithm.

The data sets:

There are three data sets: one from Twitter, one from blogs and one from news. All of them are plain text, and each one is so big that it generates a heavy workload and occupies a lot of RAM. Because of this, we'll work with samples taken from them.

Importing and sampling data sets:

  1. Loading libraries: First of all, load the needed libraries and enable multicore processing. We'll really need it.
## Loading packages
require(quanteda)
library(doMC)
library(text2vec)
## Enabling multicore
registerDoMC(cores = 4)
  2. Loading data: Load the data sets but, because of their size, keep only a random sample of 1% (0.01) of the original lines (a note on making this reproducible follows the previews below).
twitter <- readLines("/home/oscar/Documents/DSSCapstone/en_US/en_US.twitter.txt")
news <- readLines("/home/oscar/Documents/DSSCapstone/en_US/en_US.news.txt")
blogs <- readLines("/home/oscar/Documents/DSSCapstone/en_US/en_US.blogs.txt")

twitter <- twitter[sample(seq(1, length(twitter)), size = length(twitter) * 0.01)]
news <- news[sample(seq(1, length(news)), size = length(news) * 0.01)]
blogs <- blogs[sample(seq(1, length(blogs)), size = length(blogs) * 0.01)]

head(twitter, 2)
## [1] "memphis soul 55 years." "Excited for"
head(news, 2)
## [1] "In the past six months, the firm recommended that both Gloucester and Camden counties seek application to enter the state plan, a recommendation that substantially reduced their compensation."                                                               
## [2] "There was no one place or time to intervene, or one step, that clearly would have prevented Brennan's suicide. And almost all of the scores of soldier suicides seem like that, a blur of events, symptoms and behaviors culminating in one final violent act."
head(blogs, 2)
## [1] "The human ear is designed to receive natural sounds lying in the range 5 kHz to 15 kHz. Our hearing threshold is maximum near 4-5 kHz. That is why the sound of a Buzzer, Cricket (the noisy insect) etc are piercing and irritating to us. These sound vibrations create stress on the ear drum because the eardrum will not vibrate exactly similar to the sound waves. In order to avoid such frequencies, the middle ear act as a low pass filter and prevent them passing into the inner ear. Natural sounds including human voice lies between 10-15 kHz which is friendly to human ear. Sound of unusual frequencies cause noise pollution and alters our psychomotor performances."
## [2] "Back in December I ordered a new bed for our new master bedroom. I purchased the bed shown below from Joss & Main during one of their flash sales."
  3. To avoid memory issues we'll parallelize the work, so first split the samples into chunks (a base-R alternative to the split_into() helper is sketched right after this code):
twitter <- split_into(twitter, 100)
news <- split_into(news, 100)
blogs <- split_into(blogs, 100)
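The split_into() used here comes from text2vec. In case that helper weren't available, an equivalent base-R chunking function might look like this (a sketch, not the function actually used above; the name split_chunks is my own choice):

split_chunks <- function(x, n) {
  ## Split a character vector into n roughly equal chunks
  split(x, cut(seq_along(x), breaks = n, labels = FALSE))
}
## Example: twitter <- split_chunks(twitter, 100)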
  4. Finally, enable multicore and run quanteda's functions in parallel (with foreach) over each of the chunks.

4.1. First the onegrams: I know this work could be done with a single function that takes the ngram order and the data set as arguments, but for now that isn't straightforward for me; a rough sketch of such a function is given just below.
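A rough sketch of that generalized builder, assuming foreach/doMC are loaded as in the chunks below and the same older quanteda API (dfm() and tokenize()) and cleaning options used throughout this report. The function name build_ngram_dfm is my own, hypothetical choice; note that for onegrams the chunk further down uses dfm()'s own cleaning arguments instead, which should behave similarly for n = 1 but isn't something I've verified:

build_ngram_dfm <- function(chunks, n) {
  ## Build one dfm per chunk in parallel and stack them with rbind
  foreach(i = 1:length(chunks), .combine = rbind, .packages = c("quanteda")) %dopar% {
    splitsn <- as.character(chunks[[i]])
    dfm(tokenize(splitsn, ngrams = n, verbose = FALSE, removeNumbers = T, removePunct = T, removeSymbols = T, removeSeparators = T, removeTwitter = T, removeHyphens = T, removeURL = T, simplify = F))
  }
}
## Example: twitterdfm <- build_ngram_dfm(twitter, 1); twitterdfmbi <- build_ngram_dfm(twitter, 2)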

library(doMC)
registerDoMC(cores = 4)
library(foreach)
library(doParallel)

twitterdfm <- foreach(i=1:length(twitter), .combine = rbind, .packages = c("quanteda")) %dopar% {
splitsn <- as.vector(twitter[[i]])
dfm <- dfm(as.character(splitsn), verbose = FALSE, what = c("word"), removeNumbers = T,removePunct = T, removeSymbols = T, removeSeparators = T, removeTwitter = T, removeHyphens = T, removeURL = T)
}

##

newsdfm <- foreach(i=1:length(news), .combine = rbind, .packages = c("quanteda")) %dopar% {
splitsn <- as.vector(news[[i]])
dfm <- dfm(as.character(splitsn), verbose = FALSE, what = c("word"), removeNumbers = T, removePunct = T, removeSymbols = T, removeSeparators = T, removeTwitter = T, removeHyphens = T, removeURL = T)
}

##

blogsdfm <- foreach(i=1:length(blogs), .combine = rbind, .packages = c("quanteda")) %dopar% {
splitsn <- as.vector(blogs[[i]])
dfm <- dfm(as.character(splitsn), verbose = FALSE, what = c("word"), removeNumbers = T, removePunct = T, removeSymbols = T, removeSeparators = T, removeTwitter = T, removeHyphens = T, removeURL = T)
}

##

head(twitterdfm)
## Document-feature matrix of: 23,601 documents, 24,900 features.
## (showing first 6 documents and first 6 features)
##        features
## docs    for you like our this can
##   text1   0   0    0   0    0   0
##   text2   1   0    0   0    0   0
##   text3   0   2    1   1    1   1
##   text4   0   2    1   0    0   0
##   text5   1   0    0   0    0   0
##   text6   0   1    0   0    2   0
head(newsdfm)
## Document-feature matrix of: 10,102 documents, 29,311 features.
## (showing first 6 documents and first 6 features)
##        features
## docs    in the that and to a
##   text1  1   3    2   1  1 1
##   text2  1   1    2   2  1 1
##   text3  0   0    0   0  0 0
##   text4  0   3    0   1  3 0
##   text5  0   5    0   1  2 1
##   text6  2   2    0   0  1 0
head(blogsdfm)
## Document-feature matrix of: 8,992 documents, 28,567 features.
## (showing first 6 documents and first 6 features)
##        features
## docs    the is to in our that
##   text1   9  4  6  2   2    1
##   text2   1  0  0  1   1    0
##   text3  16  5  3  3   1    5
##   text4   0  0  2  1   0    0
##   text5   0  0  0  0   0    0
##   text6   2  0  0  1   0    0
alldfm <- rbind(twitterdfm, newsdfm, blogsdfm)

##

head(alldfm)
## Document-feature matrix of: 42,695 documents, 53,195 features.
## (showing first 6 documents and first 6 features)
##        features
## docs    for you like our this can
##   text1   0   0    0   0    0   0
##   text2   1   0    0   0    0   0
##   text3   0   2    1   1    1   1
##   text4   0   2    1   0    0   0
##   text5   1   0    0   0    0   0
##   text6   0   1    0   0    2   0

4.2. Now the bigrams:

library(doMC)
registerDoMC(cores = 4)
library(foreach)
library(doParallel)

twitterdfmbi <- foreach(i=1:length(twitter), .combine = rbind, .packages = c("quanteda")) %dopar% {
splitsn <- as.vector(twitter[[i]])
ng <- dfm(tokenize(splitsn, ngrams = 2 , verbose = FALSE, removeNumbers = T, removePunct = T, removeSymbols = T, removeSeparators = T, removeTwitter = T, removeHyphens = T, removeURL = T, simplify = F))
}

##

newsdfmbi <- foreach(i=1:length(news), .combine = rbind, .packages = c("quanteda")) %dopar% {
splitsn <- as.vector(news[[i]])
ng <- dfm(tokenize(splitsn, ngrams = 2 , verbose = FALSE, removeNumbers = T, removePunct = T, removeSymbols = T, removeSeparators = T, removeTwitter = T, removeHyphens = T, removeURL = T, simplify = F))
}

##

blogsdfmbi <- foreach(i=1:length(blogs), .combine = rbind, .packages = c("quanteda")) %dopar% {
splitsn <- as.vector(blogs[[i]])
ng <- dfm(tokenize(splitsn, ngrams = 2 , verbose = FALSE, removeNumbers = T, removePunct = T, removeSymbols = T, removeSeparators = T, removeTwitter = T, removeHyphens = T, removeURL = T, simplify = F))
}

##

head(twitterdfmbi)
## Document-feature matrix of: 23,601 documents, 159,676 features.
## (showing first 6 documents and first 6 features)
##        features
## docs    for_the in_the to_be of_the on_the Thanks_for
##   text1       0      0     0      0      0          0
##   text2       0      0     0      0      0          0
##   text3       0      0     0      0      0          0
##   text4       0      0     0      0      0          0
##   text5       1      0     0      0      0          0
##   text6       0      0     0      0      0          0
head(newsdfmbi)
## Document-feature matrix of: 10,102 documents, 195,788 features.
## (showing first 6 documents and first 6 features)
##        features
## docs    of_the for_the at_the in_the of_a in_a
##   text1      0       0      0      0    0    0
##   text2      1       0      0      0    0    0
##   text3      0       0      0      0    0    0
##   text4      0       1      0      0    0    0
##   text5      0       0      0      0    0    0
##   text6      1       0      0      0    0    0
head(blogsdfmbi)
## Document-feature matrix of: 8,992 documents, 197,725 features.
## (showing first 6 documents and first 6 features)
##        features
## docs    in_the on_the to_the and_the of_the to_be
##   text1      1      1      1       0      0     0
##   text2      0      0      0       0      0     0
##   text3      0      0      0       1      2     0
##   text4      0      0      0       0      0     1
##   text5      0      0      0       0      0     0
##   text6      0      0      0       0      1     0
alldfmbi <- rbind(twitterdfmbi, newsdfmbi, blogsdfmbi)

##

head(alldfmbi)
## Document-feature matrix of: 42,695 documents, 477,556 features.
## (showing first 6 documents and first 6 features)
##        features
## docs    for_the in_the to_be of_the on_the Thanks_for
##   text1       0      0     0      0      0          0
##   text2       0      0     0      0      0          0
##   text3       0      0     0      0      0          0
##   text4       0      0     0      0      0          0
##   text5       1      0     0      0      0          0
##   text6       0      0     0      0      0          0

4.3. And trigrams:

library(doMC)
registerDoMC(cores = 4)
library(foreach)
library(doParallel)

twitterdfmtri <- foreach(i=1:length(twitter), .combine = rbind, .packages = c("quanteda")) %dopar% {
splitsn <- as.vector(twitter[[i]])
ng <- dfm(tokenize(splitsn, ngrams = 3 , verbose = FALSE, removeNumbers = T, removePunct = T, removeSymbols = T, removeSeparators = T, removeTwitter = T, removeHyphens = T, removeURL = T, simplify = F))
}

##

newsdfmtri <- foreach(i=1:length(news), .combine = rbind, .packages = c("quanteda")) %dopar% {
splitsn <- as.vector(news[[i]])
ng <- dfm(tokenize(splitsn, ngrams = 3 , verbose = FALSE, removeNumbers = T, removePunct = T, removeSymbols = T, removeSeparators = T, removeTwitter = T, removeHyphens = T, removeURL = T, simplify = F))
}

##

blogsdfmtri <- foreach(i=1:length(blogs), .combine = rbind, .packages = c("quanteda")) %dopar% {
splitsn <- as.vector(blogs[[i]])
ng <- dfm(tokenize(splitsn, ngrams = 3 , verbose = FALSE, removeNumbers = T, removePunct = T, removeSymbols = T, removeSeparators = T, removeTwitter = T, removeHyphens = T, removeURL = T, simplify = F))
}

##

head(twitterdfmtri)
## Document-feature matrix of: 23,601 documents, 225,045 features.
## (showing first 6 documents and first 6 features)
##        features
## docs    Thanks_for_the Thank_you_for know_how_to are_looking_for
##   text1              0             0           0               0
##   text2              0             0           0               0
##   text3              0             0           0               0
##   text4              0             0           0               0
##   text5              0             0           0               0
##   text6              0             0           0               0
##        features
## docs    wish_I_could would_love_to
##   text1            0             0
##   text2            0             0
##   text3            0             0
##   text4            0             0
##   text5            0             0
##   text6            0             0
head(newsdfmtri)
## Document-feature matrix of: 10,102 documents, 289,910 features.
## (showing first 6 documents and first 6 features)
##        features
## docs    part_of_the one_of_the to_be_the as_much_as I_had_a in_the_house
##   text1           0          0         0          0       0            0
##   text2           0          0         0          0       0            0
##   text3           0          0         0          0       0            0
##   text4           0          0         0          0       0            0
##   text5           0          0         0          0       0            0
##   text6           0          0         0          0       0            0
head(blogsdfmtri)
## Document-feature matrix of: 8,992 documents, 312,527 features.
## (showing first 6 documents and first 6 features)
##        features
## docs    a_lot_of I_have_a some_of_the of_the_same I_had_no I_did_not
##   text1        0        0           0           0        0         0
##   text2        0        0           0           0        0         0
##   text3        0        0           0           0        1         0
##   text4        0        0           0           0        0         0
##   text5        0        0           0           0        0         0
##   text6        0        0           0           0        0         0
alldfmtri <- rbind(twitterdfmtri, newsdfmtri, blogsdfmtri)

##

head(alldfmtri)
## Document-feature matrix of: 42,695 documents, 790,796 features.
## (showing first 6 documents and first 6 features)
##        features
## docs    Thanks_for_the know_how_to are_looking_for wish_I_could
##   text1              0           0               0            0
##   text2              0           0               0            0
##   text3              0           0               0            0
##   text4              0           0               0            0
##   text5              0           0               0            0
##   text6              0           0               0            0
##        features
## docs    would_love_to a_bunch_of
##   text1             0          0
##   text2             0          0
##   text3             0          0
##   text4             0          0
##   text5             0          0
##   text6             0          0

Exploratory data analysis and reduction of the sample:

Of all of those document-feature matrices (dfm), we are only interested in the frequency with which each of the ngrams (onegram, bigram or trigram) appears. So let's keep just that feature:

  1. Transforming the data sets: Let's give them the structure of a data frame (a quick sanity check with topfeatures() follows the results):
## Getting the feature
onegram <- colSums(alldfm)
bigram <- colSums(alldfmbi)
trigram <- colSums(alldfmtri)
## Converting to data frames
onegram <- as.data.frame(onegram)
bigram <- as.data.frame(bigram)
trigram <- as.data.frame(trigram)
## Moving row names into a token column
onegram$token <- rownames(onegram)
bigram$token <- rownames(bigram)
trigram$token <- rownames(trigram)
## Fixing col names
names(onegram) <- c("n", "token")
names(bigram) <- c("n", "token")
names(trigram) <- c("n", "token")
## Erasing rownames
rownames(onegram) <- NULL
rownames(bigram) <- NULL
rownames(trigram) <- NULL
## Showing
head(onegram)
##       n token
## 1 10915   for
## 2  9382   you
## 3  2774  like
## 4  1881   our
## 5  5468  this
## 6  2542   can
head(bigram)
##      n      token
## 1 1891    for_the
## 2 3832     in_the
## 3 1577      to_be
## 4 4191     of_the
## 5 1837     on_the
## 6  299 Thanks_for
head(trigram)
##     n           token
## 1 179  Thanks_for_the
## 2  42     know_how_to
## 3  21 are_looking_for
## 4  21    wish_I_could
## 5  44   would_love_to
## 6  41      a_bunch_of
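As a quick sanity check, quanteda's topfeatures() should report the same counts for the most frequent onegrams directly from the combined dfm, for example:

## Top 10 onegrams and their counts, straight from the dfm
topfeatures(alldfm, 10)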
  2. Feature creation: We need some features to use as criteria for selecting the part of the sample that will be kept.
## Creating percentages
onegram$perc <- onegram$n/sum(onegram$n)
bigram$perc <- bigram$n/sum(bigram$n)
trigram$perc <- trigram$n/sum(trigram$n)
## Sorting
onegram <- onegram[order(-onegram$perc),]
bigram <- bigram[order(-bigram$perc),]
trigram <- trigram[order(-trigram$perc),]
## Getting cumulative percentages
onegram$cumperc <- cumsum(onegram$perc)
bigram$cumperc <- cumsum(bigram$perc)
trigram$cumperc <- cumsum(trigram$perc)
## Adding an ID for plots
onegram$ID <- c(1:nrow(onegram))
bigram$ID <- c(1:nrow(bigram))
trigram$ID <- c(1:nrow(trigram))
head(onegram)
##        n token       perc    cumperc ID
## 19 47559   the 0.04730525 0.04730525  1
## 20 27313    to 0.02716727 0.07447253  2
## 8  23990   and 0.02386200 0.09833453  3
## 17 23893     a 0.02376552 0.12210006  4
## 31 20358    of 0.02024938 0.14234944  5
## 37 16695     i 0.01660593 0.15895536  6
head(bigram)
##      n   token        perc     cumperc ID
## 4 4191  of_the 0.004353458 0.004353458  1
## 2 3832  in_the 0.003980542 0.008334000  2
## 8 2087  to_the 0.002167900 0.010501899  3
## 1 1891 for_the 0.001964302 0.012466201  4
## 5 1837  on_the 0.001908209 0.014374410  5
## 3 1577   to_be 0.001638130 0.016012540  6
head(trigram)
##        n          token         perc      cumperc ID
## 250  305       a_lot_of 0.0003314317 0.0003314317  1
## 248  300     one_of_the 0.0003259984 0.0006574300  2
## 1    179 Thanks_for_the 0.0001945124 0.0008519424  3
## 563  174        to_be_a 0.0001890791 0.0010410215  4
## 39   170    going_to_be 0.0001847324 0.0012257539  5
## 1216 151     the_end_of 0.0001640858 0.0013898397  6
  3. Exploratory analysis: Let's explore the data composition by plotting the cumulative percentage and the individual percentage of each of the tokens (a single faceted version is sketched after the three plots):
library(ggplot2)
g <- ggplot(data = onegram, aes(x=ID, y=cumperc))
g <- g + geom_point()
g <- g + geom_point(data = onegram, aes(x=ID, y=perc), colour="blue")
g

library(ggplot2)
g <- ggplot(data = bigram, aes(x=ID, y=cumperc))
g <- g + geom_point()
g <- g + geom_point(data = bigram, aes(x=ID, y=perc), colour="blue")
g

library(ggplot2)
g <- ggplot(data = trigram, aes(x=ID, y=cumperc))
g <- g + geom_point()
g <- g + geom_point(data = trigram, aes(x=ID, y=perc), colour="blue")
g
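The three plots above are nearly identical, so they could also be drawn as one faceted figure. A sketch using temporary labelled copies, so the original tables stay untouched (the gram column and the plotdata name are my own choices):

library(ggplot2)
## Label each table and stack them for a single faceted plot
og <- transform(onegram, gram = "onegram")
bg <- transform(bigram, gram = "bigram")
tg <- transform(trigram, gram = "trigram")
plotdata <- rbind(og, bg, tg)
g <- ggplot(plotdata, aes(x = ID, y = cumperc)) +
  geom_point() +
  geom_point(aes(y = perc), colour = "blue") +
  facet_wrap(~ gram, scales = "free")
g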

  4. Reducing the sample: If we sort the tokens by the percentage of the total they cover, we can see that not all of them are needed to reach a cumulative coverage of 0.75 of the words in the sample. So, with the aim of speeding up the final model, we can keep just those tokens that cover 75% of the cumulative percentage of the words used (according to the sample).
lo1 <- nrow(onegram)
onegram <- onegram[onegram$cumperc <= 0.75, ]
lo2 <- nrow(onegram)

lb1 <- nrow(bigram)
bigram <- bigram[bigram$cumperc <= 0.75, ]
lb2 <- nrow(bigram)

lt1 <- nrow(trigram)
trigram <- trigram[trigram$cumperc <= 0.75, ]
lt2 <- nrow(trigram)

data.frame(ini = c(lo1, lb1, lt1), fin = c(lo2, lb2, lt2), 
           dif = c(lo2 - lo1, lb2 - lb1, lt2 - lt1), 
           redperc = c((lo2 - lo1) / lo1, (lb2 - lb1) / lb1, (lt2 - lt1) / lt1))
##      ini    fin     dif    redperc
## 1  53195   1499  -51696 -0.9718207
## 2 477556 236885 -240671 -0.5039639
## 3 790796 560733 -230063 -0.2909259

This could be a reasonable collection of tokens to build up a model.

The model:

About the model, maybe I'm oversimplifying, but taking into account the need for efficiency, the size of the files and the environment in which the app will run (Shiny), so far my idea is this:

  1. The onegram, bigram and trigram tables will be bound by rows (with the rbind function), keeping their respective percentages.
allgram <- rbind(onegram, bigram, trigram)
head(allgram)
##        n token       perc    cumperc ID
## 19 47559   the 0.04730525 0.04730525  1
## 20 27313    to 0.02716727 0.07447253  2
## 8  23990   and 0.02386200 0.09833453  3
## 17 23893     a 0.02376552 0.12210006  4
## 31 20358    of 0.02024938 0.14234944  5
## 37 16695     i 0.01660593 0.15895536  6
tail(allgram)
##        n                token         perc   cumperc     ID
## 559701 1         down_side_is 1.086661e-06 0.7499940 560728
## 559702 1   is_that_grapefruit 1.086661e-06 0.7499951 560729
## 559703 1 that_grapefruit_seed 1.086661e-06 0.7499962 560730
## 559704 1     seed_extract_can 1.086661e-06 0.7499973 560731
## 559705 1       extract_can_be 1.086661e-06 0.7499984 560732
## 559706 1         can_be_toxic 1.086661e-06 0.7499995 560733
  2. The combined table will be sorted by percentage (not cumulative percentage, just percentage).
allgram <- allgram[order(-allgram$perc), ]
  3. The input text for which the prediction or completion will be made will be taken as a string and passed to a function. Suppose inputs like "my", "I'm" or "can be".

  4. Then the combined data frame will be filtered using the input as the search pattern, returning the first three matches ordered by percentage (a sketch of a small function that wraps this up follows the examples):

newallgram <- allgram[grep("^my", allgram$token), ]
head(newallgram, 3)
##            n   token         perc   cumperc  ID
## 49      5962      my 0.0059301905 0.2532506  16
## 440      291  myself 0.0002894474 0.5961691 364
## 1051000  126 my_life 0.0001308842 0.1090764 344
newallgram$token[1:3]
## [1] "my"      "myself"  "my_life"
newallgram <- allgram[grep("^I'm", allgram$token), ]
head(newallgram, 3)
##        n     token         perc    cumperc  ID
## 2330 168   I'm_not 1.745123e-04 0.09165218 227
## 953   73     I'm_a 7.582974e-05 0.14755428 738
## 1532  71 I'm_going 7.375221e-05 0.15024156 774
newallgram$token[1:3]
## [1] "I'm_not"   "I'm_a"     "I'm_going"
## Spaces in the input have to be replaced with "_" in the regex
newallgram <- allgram[grep("^can_be", allgram$token), ]
head(newallgram, 3)
##           n      token         perc    cumperc    ID
## 2621000 296     can_be 3.074740e-04 0.06174514    91
## 4899     12 can_become 1.246516e-05 0.30763917  6941
## 51560     5   can_beat 5.193818e-06 0.42076468 22112
newallgram$token[1:3]
## [1] "can_be"     "can_become" "can_beat"

I know I still have a lot of work ahead with regexes and with pasting and replacing chunks of strings, but this is the line of work I'll follow.