The goal of this project is simply to show that I have become comfortable working with the data and that I am on track to create my own prediction algorithm.
There are three data sets: one from Twitter, one from blogs, and one from news. All of them are plain text, and each is so large that it generates a heavy workload and occupies a lot of RAM. For that reason we will take just samples from them.
## Loading packages
require(quanteda)
library(doMC)
library(text2vec)
## Enabling multicore
registerDoMC(cores = 4)
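## registerDoMC() registers a parallel backend so that the foreach %dopar%
## loops below can use the 4 cores.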
twitter <- readLines("/home/oscar/Documents/DSSCapstone/en_US/en_US.twitter.txt")
news <- readLines("/home/oscar/Documents/DSSCapstone/en_US/en_US.news.txt")
blogs <- readLines("/home/oscar/Documents/DSSCapstone/en_US/en_US.blogs.txt")
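## For reproducibility one would normally fix the RNG seed before sampling;
## the seed value here is my own arbitrary choice:
set.seed(1234)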
twitter <- twitter[sample(seq(1, length(twitter)), size = length(twitter) * 0.01)]
news <- news[sample(seq(1, length(news)), size = length(news) * 0.01)]
blogs <- blogs[sample(seq(1, length(blogs)), size = length(blogs) * 0.01)]
head(twitter, 2)
## [1] "memphis soul 55 years." "Excited for"
head(news, 2)
## [1] "In the past six months, the firm recommended that both Gloucester and Camden counties seek application to enter the state plan, a recommendation that substantially reduced their compensation."
## [2] "There was no one place or time to intervene, or one step, that clearly would have prevented Brennan's suicide. And almost all of the scores of soldier suicides seem like that, a blur of events, symptoms and behaviors culminating in one final violent act."
head(blogs, 2)
## [1] "The human ear is designed to receive natural sounds lying in the range 5 kHz to 15 kHz. Our hearing threshold is maximum near 4-5 kHz. That is why the sound of a Buzzer, Cricket (the noisy insect) etc are piercing and irritating to us. These sound vibrations create stress on the ear drum because the eardrum will not vibrate exactly similar to the sound waves. In order to avoid such frequencies, the middle ear act as a low pass filter and prevent them passing into the inner ear. Natural sounds including human voice lies between 10-15 kHz which is friendly to human ear. Sound of unusual frequencies cause noise pollution and alters our psychomotor performances."
## [2] "Back in December I ordered a new bed for our new master bedroom. I purchased the bed shown below from Joss & Main during one of their flash sales."
twitter <- split_into(twitter, 100)
news <- split_into(news, 100)
blogs <- split_into(blogs, 100)
4.1. First, the onegrams. I know this work could be done with a single function that takes the n-gram order and the data set as arguments, but for now that is not straightforward for me; a sketch of the idea is below.
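A minimal sketch of that single function, assuming the same (older) quanteda API used throughout this report; build_dfm is my own hypothetical name, and chunks stands for one of the lists produced by split_into() above:
## Hypothetical generic builder: n = 1 gives onegrams, n = 2 bigrams, and so on
build_dfm <- function(chunks, n) {
  foreach(i = 1:length(chunks), .combine = rbind, .packages = c("quanteda")) %dopar% {
    txt <- as.character(chunks[[i]])
    if (n == 1) {
      dfm(txt, verbose = FALSE, what = "word",
          removeNumbers = TRUE, removePunct = TRUE, removeSymbols = TRUE,
          removeSeparators = TRUE, removeTwitter = TRUE, removeHyphens = TRUE,
          removeURL = TRUE)
    } else {
      dfm(tokenize(txt, ngrams = n, verbose = FALSE,
                   removeNumbers = TRUE, removePunct = TRUE, removeSymbols = TRUE,
                   removeSeparators = TRUE, removeTwitter = TRUE, removeHyphens = TRUE,
                   removeURL = TRUE, simplify = FALSE))
    }
  }
}
## e.g. twitterdfm <- build_dfm(twitter, 1); twitterdfmbi <- build_dfm(twitter, 2)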
library(doMC)
registerDoMC(cores = 4)
library(foreach)
library(doParallel)
twitterdfm <- foreach(i = 1:length(twitter), .combine = rbind, .packages = c("quanteda")) %dopar% {
  splitsn <- as.vector(twitter[[i]])
  dfm(as.character(splitsn), verbose = FALSE, what = "word",
      removeNumbers = TRUE, removePunct = TRUE, removeSymbols = TRUE,
      removeSeparators = TRUE, removeTwitter = TRUE, removeHyphens = TRUE,
      removeURL = TRUE)
}
##
newsdfm <- foreach(i = 1:length(news), .combine = rbind, .packages = c("quanteda")) %dopar% {
  splitsn <- as.vector(news[[i]])
  dfm(as.character(splitsn), verbose = FALSE, what = "word",
      removeNumbers = TRUE, removePunct = TRUE, removeSymbols = TRUE,
      removeSeparators = TRUE, removeTwitter = TRUE, removeHyphens = TRUE,
      removeURL = TRUE)
}
##
blogsdfm <- foreach(i = 1:length(blogs), .combine = rbind, .packages = c("quanteda")) %dopar% {
  splitsn <- as.vector(blogs[[i]])
  dfm(as.character(splitsn), verbose = FALSE, what = "word",
      removeNumbers = TRUE, removePunct = TRUE, removeSymbols = TRUE,
      removeSeparators = TRUE, removeTwitter = TRUE, removeHyphens = TRUE,
      removeURL = TRUE)
}
##
head(twitterdfm)
## Document-feature matrix of: 23,601 documents, 24,900 features.
## (showing first 6 documents and first 6 features)
## features
## docs for you like our this can
## text1 0 0 0 0 0 0
## text2 1 0 0 0 0 0
## text3 0 2 1 1 1 1
## text4 0 2 1 0 0 0
## text5 1 0 0 0 0 0
## text6 0 1 0 0 2 0
head(newsdfm)
## Document-feature matrix of: 10,102 documents, 29,311 features.
## (showing first 6 documents and first 6 features)
## features
## docs in the that and to a
## text1 1 3 2 1 1 1
## text2 1 1 2 2 1 1
## text3 0 0 0 0 0 0
## text4 0 3 0 1 3 0
## text5 0 5 0 1 2 1
## text6 2 2 0 0 1 0
head(blogsdfm)
## Document-feature matrix of: 8,992 documents, 28,567 features.
## (showing first 6 documents and first 6 features)
## features
## docs the is to in our that
## text1 9 4 6 2 2 1
## text2 1 0 0 1 1 0
## text3 16 5 3 3 1 5
## text4 0 0 2 1 0 0
## text5 0 0 0 0 0 0
## text6 2 0 0 1 0 0
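## rbind() here is quanteda's method for dfm objects: it stacks the documents
## and, as far as I understand it, takes the union of the feature sets,
## zero-filling the features a source lacks; hence the 53,195 features below.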
alldfm <- rbind(twitterdfm, newsdfm, blogsdfm)
##
head(alldfm)
## Document-feature matrix of: 42,695 documents, 53,195 features.
## (showing first 6 documents and first 6 features)
## features
## docs for you like our this can
## text1 0 0 0 0 0 0
## text2 1 0 0 0 0 0
## text3 0 2 1 1 1 1
## text4 0 2 1 0 0 0
## text5 1 0 0 0 0 0
## text6 0 1 0 0 2 0
4.2. Now the bigrams:
library(doMC)
registerDoMC(cores = 4)
library(foreach)
library(doParallel)
twitterdfmbi <- foreach(i = 1:length(twitter), .combine = rbind, .packages = c("quanteda")) %dopar% {
  splitsn <- as.vector(twitter[[i]])
  dfm(tokenize(splitsn, ngrams = 2, verbose = FALSE,
               removeNumbers = TRUE, removePunct = TRUE, removeSymbols = TRUE,
               removeSeparators = TRUE, removeTwitter = TRUE, removeHyphens = TRUE,
               removeURL = TRUE, simplify = FALSE))
}
##
newsdfmbi <- foreach(i = 1:length(news), .combine = rbind, .packages = c("quanteda")) %dopar% {
  splitsn <- as.vector(news[[i]])
  dfm(tokenize(splitsn, ngrams = 2, verbose = FALSE,
               removeNumbers = TRUE, removePunct = TRUE, removeSymbols = TRUE,
               removeSeparators = TRUE, removeTwitter = TRUE, removeHyphens = TRUE,
               removeURL = TRUE, simplify = FALSE))
}
##
blogsdfmbi <- foreach(i = 1:length(blogs), .combine = rbind, .packages = c("quanteda")) %dopar% {
  splitsn <- as.vector(blogs[[i]])
  dfm(tokenize(splitsn, ngrams = 2, verbose = FALSE,
               removeNumbers = TRUE, removePunct = TRUE, removeSymbols = TRUE,
               removeSeparators = TRUE, removeTwitter = TRUE, removeHyphens = TRUE,
               removeURL = TRUE, simplify = FALSE))
}
##
head(twitterdfmbi)
## Document-feature matrix of: 23,601 documents, 159,676 features.
## (showing first 6 documents and first 6 features)
## features
## docs for_the in_the to_be of_the on_the Thanks_for
## text1 0 0 0 0 0 0
## text2 0 0 0 0 0 0
## text3 0 0 0 0 0 0
## text4 0 0 0 0 0 0
## text5 1 0 0 0 0 0
## text6 0 0 0 0 0 0
head(newsdfmbi)
## Document-feature matrix of: 10,102 documents, 195,788 features.
## (showing first 6 documents and first 6 features)
## features
## docs of_the for_the at_the in_the of_a in_a
## text1 0 0 0 0 0 0
## text2 1 0 0 0 0 0
## text3 0 0 0 0 0 0
## text4 0 1 0 0 0 0
## text5 0 0 0 0 0 0
## text6 1 0 0 0 0 0
head(blogsdfmbi)
## Document-feature matrix of: 8,992 documents, 197,725 features.
## (showing first 6 documents and first 6 features)
## features
## docs in_the on_the to_the and_the of_the to_be
## text1 1 1 1 0 0 0
## text2 0 0 0 0 0 0
## text3 0 0 0 1 2 0
## text4 0 0 0 0 0 1
## text5 0 0 0 0 0 0
## text6 0 0 0 0 1 0
alldfmbi <- rbind(twitterdfmbi, newsdfmbi, blogsdfmbi)
##
head(alldfmbi)
## Document-feature matrix of: 42,695 documents, 477,556 features.
## (showing first 6 documents and first 6 features)
## features
## docs for_the in_the to_be of_the on_the Thanks_for
## text1 0 0 0 0 0 0
## text2 0 0 0 0 0 0
## text3 0 0 0 0 0 0
## text4 0 0 0 0 0 0
## text5 1 0 0 0 0 0
## text6 0 0 0 0 0 0
4.3. And trigrams:
library(doMC)
registerDoMC(cores = 4)
library(foreach)
library(doParallel)
twitterdfmtri <- foreach(i = 1:length(twitter), .combine = rbind, .packages = c("quanteda")) %dopar% {
  splitsn <- as.vector(twitter[[i]])
  dfm(tokenize(splitsn, ngrams = 3, verbose = FALSE,
               removeNumbers = TRUE, removePunct = TRUE, removeSymbols = TRUE,
               removeSeparators = TRUE, removeTwitter = TRUE, removeHyphens = TRUE,
               removeURL = TRUE, simplify = FALSE))
}
##
newsdfmtri <- foreach(i = 1:length(news), .combine = rbind, .packages = c("quanteda")) %dopar% {
  splitsn <- as.vector(news[[i]])
  dfm(tokenize(splitsn, ngrams = 3, verbose = FALSE,
               removeNumbers = TRUE, removePunct = TRUE, removeSymbols = TRUE,
               removeSeparators = TRUE, removeTwitter = TRUE, removeHyphens = TRUE,
               removeURL = TRUE, simplify = FALSE))
}
##
blogsdfmtri <- foreach(i = 1:length(blogs), .combine = rbind, .packages = c("quanteda")) %dopar% {
  splitsn <- as.vector(blogs[[i]])
  dfm(tokenize(splitsn, ngrams = 3, verbose = FALSE,
               removeNumbers = TRUE, removePunct = TRUE, removeSymbols = TRUE,
               removeSeparators = TRUE, removeTwitter = TRUE, removeHyphens = TRUE,
               removeURL = TRUE, simplify = FALSE))
}
##
head(twitterdfmtri)
## Document-feature matrix of: 23,601 documents, 225,045 features.
## (showing first 6 documents and first 6 features)
## features
## docs Thanks_for_the Thank_you_for know_how_to are_looking_for
## text1 0 0 0 0
## text2 0 0 0 0
## text3 0 0 0 0
## text4 0 0 0 0
## text5 0 0 0 0
## text6 0 0 0 0
## features
## docs wish_I_could would_love_to
## text1 0 0
## text2 0 0
## text3 0 0
## text4 0 0
## text5 0 0
## text6 0 0
head(newsdfmtri)
## Document-feature matrix of: 10,102 documents, 289,910 features.
## (showing first 6 documents and first 6 features)
## features
## docs part_of_the one_of_the to_be_the as_much_as I_had_a in_the_house
## text1 0 0 0 0 0 0
## text2 0 0 0 0 0 0
## text3 0 0 0 0 0 0
## text4 0 0 0 0 0 0
## text5 0 0 0 0 0 0
## text6 0 0 0 0 0 0
head(blogsdfmtri)
## Document-feature matrix of: 8,992 documents, 312,527 features.
## (showing first 6 documents and first 6 features)
## features
## docs a_lot_of I_have_a some_of_the of_the_same I_had_no I_did_not
## text1 0 0 0 0 0 0
## text2 0 0 0 0 0 0
## text3 0 0 0 0 1 0
## text4 0 0 0 0 0 0
## text5 0 0 0 0 0 0
## text6 0 0 0 0 0 0
alldfmtri <- rbind(twitterdfmtri, newsdfmtri, blogsdfmtri)
##
head(alldfmtri)
## Document-feature matrix of: 42,695 documents, 790,796 features.
## (showing first 6 documents and first 6 features)
## features
## docs Thanks_for_the know_how_to are_looking_for wish_I_could
## text1 0 0 0 0
## text2 0 0 0 0
## text3 0 0 0 0
## text4 0 0 0 0
## text5 0 0 0 0
## text6 0 0 0 0
## features
## docs would_love_to a_bunch_of
## text1 0 0
## text2 0 0
## text3 0 0
## text4 0 0
## text5 0 0
## text6 0 0
From each of those document-feature matrices (dfm) we are interested only in the frequency with which each n-gram (onegram, bigram, or trigram) appears, so let's keep just that feature:
## Getting the frequencies
onegram <- colSums(alldfm)
bigram <- colSums(alldfmbi)
trigram <- colSums(alldfmtri)
## Converting to data frames
onegram <- as.data.frame(onegram)
bigram <- as.data.frame(bigram)
trigram <- as.data.frame(trigram)
## Moving the row names into a token column
onegram$token <- rownames(onegram)
bigram$token <- rownames(bigram)
trigram$token <- rownames(trigram)
## Fixing column names
names(onegram) <- c("n", "token")
names(bigram) <- c("n", "token")
names(trigram) <- c("n", "token")
## Erasing rownames
rownames(onegram) <- NULL
rownames(bigram) <- NULL
rownames(trigram) <- NULL
## Showing
head(onegram)
## n token
## 1 10915 for
## 2 9382 you
## 3 2774 like
## 4 1881 our
## 5 5468 this
## 6 2542 can
head(bigram)
## n token
## 1 1891 for_the
## 2 3832 in_the
## 3 1577 to_be
## 4 4191 of_the
## 5 1837 on_the
## 6 299 Thanks_for
head(trigram)
## n token
## 1 179 Thanks_for_the
## 2 42 know_how_to
## 3 21 are_looking_for
## 4 21 wish_I_could
## 5 44 would_love_to
## 6 41 a_bunch_of
## Creating percentages
onegram$perc <- onegram$n/sum(onegram$n)
bigram$perc <- bigram$n/sum(bigram$n)
trigram$perc <- trigram$n/sum(trigram$n)
## Sorting
onegram <- onegram[order(-onegram$perc),]
bigram <- bigram[order(-bigram$perc),]
trigram <- trigram[order(-trigram$perc),]
## Getting cumulative percentages
onegram$cumperc <- cumsum(onegram$perc)
bigram$cumperc <- cumsum(bigram$perc)
trigram$cumperc <- cumsum(trigram$perc)
## Adding an ID for the plots
onegram$ID <- c(1:nrow(onegram))
bigram$ID <- c(1:nrow(bigram))
trigram$ID <- c(1:nrow(trigram))
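All of the above repeats the same steps three times; it could be condensed into one helper. A minimal sketch, where freq_table is my own hypothetical name:
freq_table <- function(d) {
  counts <- colSums(d)
  tab <- data.frame(n = as.numeric(counts), token = names(counts),
                    stringsAsFactors = FALSE)
  tab$perc <- tab$n / sum(tab$n)          ## relative frequency
  tab <- tab[order(-tab$perc), ]          ## sort by descending frequency
  tab$cumperc <- cumsum(tab$perc)         ## cumulative coverage
  tab$ID <- seq_len(nrow(tab))            ## plotting index
  tab
}
## e.g. onegram <- freq_table(alldfm); trigram <- freq_table(alldfmtri)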
head(onegram)
## n token perc cumperc ID
## 19 47559 the 0.04730525 0.04730525 1
## 20 27313 to 0.02716727 0.07447253 2
## 8 23990 and 0.02386200 0.09833453 3
## 17 23893 a 0.02376552 0.12210006 4
## 31 20358 of 0.02024938 0.14234944 5
## 37 16695 i 0.01660593 0.15895536 6
head(bigram)
## n token perc cumperc ID
## 4 4191 of_the 0.004353458 0.004353458 1
## 2 3832 in_the 0.003980542 0.008334000 2
## 8 2087 to_the 0.002167900 0.010501899 3
## 1 1891 for_the 0.001964302 0.012466201 4
## 5 1837 on_the 0.001908209 0.014374410 5
## 3 1577 to_be 0.001638130 0.016012540 6
head(trigram)
## n token perc cumperc ID
## 250 305 a_lot_of 0.0003314317 0.0003314317 1
## 248 300 one_of_the 0.0003259984 0.0006574300 2
## 1 179 Thanks_for_the 0.0001945124 0.0008519424 3
## 563 174 to_be_a 0.0001890791 0.0010410215 4
## 39 170 going_to_be 0.0001847324 0.0012257539 5
## 1216 151 the_end_of 0.0001640858 0.0013898397 6
library(ggplot2)
g <- ggplot(data = onegram, aes(x=ID, y=cumperc))
g <- g + geom_point()
g <- g + geom_point(data = onegram, aes(x=ID, y=perc), colour="blue")
g
library(ggplot2)
g <- ggplot(data = bigram, aes(x=ID, y=cumperc))
g <- g + geom_point()
g <- g + geom_point(data = bigram, aes(x=ID, y=perc), colour="blue")
g
library(ggplot2)
g <- ggplot(data = trigram, aes(x=ID, y=cumperc))
g <- g + geom_point()
g <- g + geom_point(data = trigram, aes(x=ID, y=perc), colour="blue")
g
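The three plotting chunks are identical except for the data, so they could also be one small helper; a sketch (coverage_plot is my own hypothetical name):
coverage_plot <- function(tab) {
  ggplot(tab, aes(x = ID, y = cumperc)) +
    geom_point() +                               ## cumulative coverage in black
    geom_point(aes(y = perc), colour = "blue")   ## individual frequency in blue
}
## e.g. coverage_plot(trigram)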
The cumulative curves confirm that a relatively small share of the tokens accounts for most of the occurrences. Let's keep only the tokens needed to reach 75% cumulative coverage and measure how much each table shrinks:
lo1 <- nrow(onegram)
onegram <- onegram[onegram$cumperc <= 0.75, ]
lo2 <- nrow(onegram)
lb1 <- nrow(bigram)
bigram <- bigram[bigram$cumperc <= 0.75, ]
lb2 <- nrow(bigram)
lt1 <- nrow(trigram)
trigram <- trigram[trigram$cumperc <= 0.75, ]
lt2 <- nrow(trigram)
data.frame(ini = c(lo1, lb1, lt1), fin = c(lo2, lb2, lt2),
dif = c(lo2 - lo1, lb2 - lb1, lt2 - lt1),
redperc = c((lo2 - lo1) / lo1, (lb2 - lb1) / lb1, (lt2 - lt1) / lt1))
## ini fin dif redperc
## 1 53195 1499 -51696 -0.9718207
## 2 477556 236885 -240671 -0.5039639
## 3 790796 560733 -230063 -0.2909259
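The same 75% cut could be wrapped up too; a one-line sketch (trim_coverage is my own hypothetical name, with the 0.75 cutoff used above as the default):
trim_coverage <- function(tab, cutoff = 0.75) tab[tab$cumperc <= cutoff, ]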
This could be a reasonable collection of tokens from which to build a model.
About the model, maybe I am oversimplifying, but taking into account the need for efficiency, the size of the files, and the environment in which the app will run (Shiny), my idea so far is this:
allgram <- rbind(onegram, bigram, trigram)
head(allgram)
## n token perc cumperc ID
## 19 47559 the 0.04730525 0.04730525 1
## 20 27313 to 0.02716727 0.07447253 2
## 8 23990 and 0.02386200 0.09833453 3
## 17 23893 a 0.02376552 0.12210006 4
## 31 20358 of 0.02024938 0.14234944 5
## 37 16695 i 0.01660593 0.15895536 6
tail(allgram)
## n token perc cumperc ID
## 559701 1 down_side_is 1.086661e-06 0.7499940 560728
## 559702 1 is_that_grapefruit 1.086661e-06 0.7499951 560729
## 559703 1 that_grapefruit_seed 1.086661e-06 0.7499962 560730
## 559704 1 seed_extract_can 1.086661e-06 0.7499973 560731
## 559705 1 extract_can_be 1.086661e-06 0.7499984 560732
## 559706 1 can_be_toxic 1.086661e-06 0.7499995 560733
allgram <- allgram[order(-allgram$perc), ]  ## order(), not sort(): we need row indices
The input text on which the prediction or completion is to be made will be taken as a string and passed to a function. Suppose inputs like "my", "I'm", "can be".
Then the combined n-gram data frame has to be filtered with that input as the argument, returning the first three matches ordered by percentage:
newallgram <- allgram[grep("^my", allgram$token), ]
head(newallgram, 3)
## n token perc cumperc ID
## 49 5962 my 0.0059301905 0.2532506 16
## 440 291 myself 0.0002894474 0.5961691 364
## 1051000 126 my_life 0.0001308842 0.1090764 344
newallgram$token[1:3]
## [1] "my" "myself" "my_life"
newallgram <- allgram[grep("^I'm", allgram$token), ]
head(newallgram, 3)
## n token perc cumperc ID
## 2330 168 I'm_not 1.745123e-04 0.09165218 227
## 953 73 I'm_a 7.582974e-05 0.14755428 738
## 1532 71 I'm_going 7.375221e-05 0.15024156 774
newallgram$token[1:3]
## [1] "I'm_not" "I'm_a" "I'm_going"
## Spaces in the input have to be replaced with "_" in the regex
newallgram <- allgram[grep("^can_be", allgram$token), ]
head(newallgram, 3)
## n token perc cumperc ID
## 2621000 296 can_be 3.074740e-04 0.06174514 91
## 4899 12 can_become 1.246516e-05 0.30763917 6941
## 51560 5 can_beat 5.193818e-06 0.42076468 22112
newallgram$token[1:3]
## [1] "can_be" "can_become" "can_beat"
I know I still have a lot of work ahead with regexes and with pasting and replacing chunks of strings, but this is the line of work I will follow.