The goal of this milestone report is to present what has been obtained so far by working with the sample data from the blogs, news and Twitter texts provided by SwiftKey, and to demonstrate that the project is on track to create the prediction algorithm.
The motivation for this report is to show that the data have been downloaded, cleaned and summarized, to report the main findings of the exploratory analysis, and to outline the plan for the prediction algorithm.
Computer: Toshiba Satellite E55-A laptop
R libraries:
packages <- c("parallel", "quanteda", "readtext", "data.table", "gridExtra", "ggplot2", "dplyr")
noquote(c(R = paste(R.Version()[6:7], collapse = "."),
          sapply(packages, function(x) {
            library(x, character.only = TRUE, logical.return = TRUE)
            as.character(packageVersion(x))
          })))
R parallel quanteda readtext data.table gridExtra ggplot2 dplyr
3.4.4 3.4.4 1.1.1 0.50 1.10.4.3 2.3 2.2.1 0.7.4
Downloaded file:
fileURL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
dataDir <- "~/dsscapstone"
zipFile <- "Coursera-SwiftKey.zip"
zipfilePath <- file.path(dataDir, zipFile)
if(!file.exists(zipfilePath)) download.file(fileURL, destfile=zipfilePath, cacheOK = FALSE)
file.info(zipfilePath)[c(1,5)]
size ctime
~/dsscapstone/Coursera-SwiftKey.zip 574661177 2018-03-20 18:13:52
Compressed contents:
(zipContents <- unzip(zipfilePath, list=TRUE))
Name Length Date
1 final/ 0 2014-07-22 10:10:00
2 final/de_DE/ 0 2014-07-22 10:10:00
3 final/de_DE/de_DE.twitter.txt 75578341 2014-07-22 10:11:00
4 final/de_DE/de_DE.blogs.txt 85459666 2014-07-22 10:11:00
5 final/de_DE/de_DE.news.txt 95591959 2014-07-22 10:11:00
6 final/ru_RU/ 0 2014-07-22 10:10:00
7 final/ru_RU/ru_RU.blogs.txt 116855835 2014-07-22 10:12:00
8 final/ru_RU/ru_RU.news.txt 118996424 2014-07-22 10:12:00
9 final/ru_RU/ru_RU.twitter.txt 105182346 2014-07-22 10:12:00
10 final/en_US/ 0 2014-07-22 10:10:00
11 final/en_US/en_US.twitter.txt 167105338 2014-07-22 10:12:00
12 final/en_US/en_US.news.txt 205811889 2014-07-22 10:13:00
13 final/en_US/en_US.blogs.txt 210160014 2014-07-22 10:13:00
14 final/fi_FI/ 0 2014-07-22 10:10:00
15 final/fi_FI/fi_FI.news.txt 94234350 2014-07-22 10:11:00
16 final/fi_FI/fi_FI.blogs.txt 108503595 2014-07-22 10:12:00
17 final/fi_FI/fi_FI.twitter.txt 25331142 2014-07-22 10:10:00
Languages found:
langFilePattern <- "^final/(.._..)/.._...+\\.txt$"
unique(sub(langFilePattern, "\\1", grep(langFilePattern, zipContents$Name, value = TRUE)))
[1] "de_DE" "ru_RU" "en_US" "fi_FI"
The scope of this project is limited to the English (en_US) files. After uncompressing them, and prior to building a corpus, it is necessary to fix an unexpected end-of-file caused by the \032 character, which only occurs when running on Windows platforms, as is the case here.
unzipFiles <- sort(grep("^final/en_US/.._...+\\.txt$", zipContents$Name, value=T))
files <- file.path(dataDir, basename(unzipFiles))
docs <- sub(file.path(dataDir, "en_US\\.(.*)\\.txt$"), "\\1", files)
if(!all(file.exists(files))) unzip(zipfilePath, files=unzipFiles, setTimes=TRUE, exdir=dataDir, junkpaths=TRUE)
fix.files <- function(f, basenamer = function(x) basename(x)) {
# binary open avoids unexpected end of file on readLines() on Windows platforms
con <- file(f, open = "rb")
# removing unexpected end of file character "\032" on Windows platforms
buffer <- gsub("\032", "", readLines(con, skipNul=TRUE))
close(con)
outFile <- file.path(dirname(f), basenamer(f))
con <- file(outFile, "wb")
writeLines(buffer, con)
close(con)
outFile
}
if(!file.exists(paste0(dataDir, "/fixed"))) {
cluster <- makeCluster(detectCores())
out <- parSapply(cluster, files, fix.files) #35s
stopCluster(cluster)
writeLines("", paste0(dataDir, "/fixed"))
}
Summarize files via GNU coreutils wc:
wc <- function(paths) {
require(parallel)
cluster <- makeCluster(detectCores())
ret <- parSapplyLB(cluster, paths, function(x)
as.integer(unlist(strsplit(system(paste("wc -l -w -c -L", x), TRUE),"\\s+"))[2:5]))
stopCluster(cluster)
rownames(ret) <- c("lines", "words", "bytes", "longest.line")
data.frame(t(ret))
}
(wordcounts <- wc(files))
lines words bytes longest.line
~/dsscapstone/en_US.blogs.txt 899288 37272578 209260726 40832
~/dsscapstone/en_US.news.txt 1010242 34309642 204801643 11384
~/dsscapstone/en_US.twitter.txt 2360148 30341028 164745183 140
c(colSums(wordcounts[1:3]), longest.line=max(wordcounts[4]))
lines words bytes longest.line
4269678 101923248 578807552 40832
As defined in Wikipedia, an n-gram is a contiguous sequence of n items from a given sample of text or speech.
It is assumed that the optimal unit of sample text from which to create n-grams is the sentence; otherwise n-grams could be created across sentence boundaries, e.g., a bigram composed of the last word of one sentence and the first word of the very next sentence, as the sketch below illustrates.
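As a minimal illustration (the two-sentence line below is made up for this purpose only), reshaping a corpus to sentences before tokenizing prevents such boundary-crossing bigrams:
library(quanteda)
# a made-up two-sentence line, used only to illustrate the issue
txt <- corpus("I am tired. I am going to bed.")
# bigrams built from the whole line include one crossing the sentence boundary ("tired I")
tokens_ngrams(tokens(txt, remove_punct=TRUE), n=2, concatenator=" ")
# reshaping to sentences first keeps every bigram within a single sentence
tokens_ngrams(tokens(corpus_reshape(txt, to="sentences"), remove_punct=TRUE), n=2, concatenator=" ")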
The following creates one corpus per file, with each corpus document containing a single line of the original text:
for(i in seq_along(files)) {
corpus <- paste0(docs[i], ".corpus1")
corpusFile <- file.path(dataDir, paste0(corpus, ".rda"))
if(!file.exists(corpusFile)) { #62s
buffer <- readLines(files[i], skipNul=TRUE, encoding = "UTF-8")
tmp <- corpus(buffer)
docnames(tmp) <- NULL
assign(corpus, tmp)
rm(buffer, tmp)
save(list=corpus, file=corpusFile, compress = FALSE)
} else load(corpusFile) #43s
}
Summary
summary.corpora <- function(corpus.names) {
bind_rows(lapply(corpus.names, function(corpus.name) {
corpus <- get(corpus.name)
data.frame(Corpus = corpus.name,
Documents = ndoc(corpus),
Sentences = sum(nsentence(corpus)),
Tokens = sum(ntoken(corpus, remove_punct=TRUE)),
Megabytes = round(as.numeric(object.size(corpus))/1024/1024),
stringsAsFactors = FALSE)
}))
}
summariesFile <- file.path(dataDir, "corpora1.summary.rda")
if(!file.exists(summariesFile)) {
corpora1.summary <- summary.corpora(paste0(docs,".corpus1")) #42min
save(corpora1.summary, file=summariesFile, compress=FALSE)
} else load(summariesFile)
corpora1.summary
Corpus Documents Sentences Tokens Megabytes
1 blogs.corpus1 899288 2367316 37296232 247.1
2 news.corpus1 1010242 1993117 34258969 248.9
3 twitter.corpus1 2360148 3770706 29959374 301.3
colSums(corpora1.summary[,-1])
Documents Sentences Tokens Megabytes
4269678.0 8131139.0 101514575.0 797.3
for(i in seq_along(files)) {
corpus1 <- paste0(docs[i], ".corpus1")
corpus2 <- paste0(docs[i], ".corpus2")
corpus2File <- file.path(dataDir, paste0(corpus2, ".rda"))
if(!file.exists(corpus2File)) { #33min
tmp <- corpus_reshape(get(corpus1), to="sentences", use_docvars=FALSE)
docnames(tmp) <- NULL
assign(corpus2, tmp)
rm(tmp)
save(list=corpus2, file=corpus2File, compress = FALSE)
} else load(corpus2File) #43s
rm(list=corpus1)
}
Summary
summariesFile <- file.path(dataDir, "corpora2.summary.rda")
if(!file.exists(summariesFile)) {
corpora2.summary <- summary.corpora(paste0(docs,".corpus2")) #74min
save(corpora2.summary, file=summariesFile, compress=FALSE)
} else load(summariesFile)
corpora2.summary
Corpus Documents Sentences Tokens Megabytes
1 blogs.corpus2 2367316 2367331 37300823 355.8
2 news.corpus2 1993117 1993119 34260121 325.7
3 twitter.corpus2 3770706 3770710 29968955 380.5
colSums(corpora2.summary[,-1])
Documents Sentences Tokens Megabytes
8131139 8131160 101529899 1062
Word tokens (unigrams) are generated with the following transformations: numbers, punctuation, symbols, separators, Twitter characters, hyphens and URLs are removed; tokens longer than 20 characters are dropped; and all tokens are converted to lowercase.
for (doc in docs) { # 11min
corpus <- paste0(doc, ".corpus2")
corpusFile <- file.path(dataDir, paste0(corpus, ".rda"))
tokens <- paste0(doc, ".tokens")
tokensFile <- file.path(dataDir, paste0(tokens, ".rda"))
if(!file.exists(tokensFile)) {
if(!exists(corpus)) load(corpusFile)
assign(tokens, tokens(get(corpus), remove_numbers=T, remove_punct=T, remove_symbols=T,
remove_separators=T, remove_twitter=T, remove_hyphens=T, remove_url=T))
assign(tokens, tokens_remove(get(tokens), max_nchar = 20L))
assign(tokens, tokens_tolower(get(tokens)))
save(list=tokens, file=tokensFile, compress = FALSE)
} else load(tokensFile)
rm(list=corpus)
}
Summary
(tokens.summary <- data.frame(t(sapply(docs, function(doc) {
tokens <- paste0(doc, ".tokens")
c(Tokens = sum(ntoken(get(tokens))),
Types = length(types(get(tokens))),
size.MB = round(object.size(get(tokens))/1024/1024)) #11s
}))))
Tokens Types size.MB
blogs 37130798 290887 444
news 33862902 243376 387
twitter 29638369 338942 566
colSums(tokens.summary[,c(1,3)])
Tokens size.MB
100632069 1397
# generating document feature matrices
for (doc in docs) { #1h34m
tokens <- paste0(doc, ".tokens")
for (ngram in 1:5) {
dfm <- paste0(doc, ".dfm.", ngram, "gram")
dfmFile <- file.path(dataDir, paste0(dfm, ".rda"))
if(!file.exists(dfmFile)) {
assign(dfm, dfm(get(tokens), tolower=FALSE, ngrams=ngram, concatenator=" "))
save(list=dfm, file=dfmFile, compress = FALSE)
rm(list=dfm)
}
}
rm(list=tokens)
}
# generating a list of top 10 features
top=10
topfeatures.file <- file.path(dataDir, "topfeatures.bydoc.rda")
if(!file.exists(topfeatures.file)) {
topfeatures.bydoc <-
lapply(1:5, function(ngram) { #12min
sapply(docs, function(doc) {
dfm <- paste0(doc, ".dfm.", ngram, "gram")
load(file.path(dataDir, paste0(dfm, ".rda")))
tf <- topfeatures(get(dfm), top)
rm(list=dfm)
tf
}, simplify = FALSE)
})
save(topfeatures.bydoc, file=topfeatures.file, compress=FALSE)
} else load(topfeatures.file)
# plotting the top 10 features
for(ngram in seq_along(topfeatures.bydoc)) {
grobs <- lapply(seq_along(topfeatures.bydoc[[ngram]]), function(doc) {
frequency <- topfeatures.bydoc[[ngram]][[doc]]
feature <- names(frequency)
ggplot(mapping=aes(feature, frequency)) + geom_col() + coord_flip() +
labs(title=docs[doc], x=paste0(ngram, "-gram"))
})
grid.arrange(grobs = grobs, ncol=length(docs),
top=paste0("Top ",top," \"",ngram,"-gram\""," features"))
}
More than one hundred million tokens were found in the given texts.
The blogs contain the most tokens, followed by the news and then the Twitter texts.
Twitter has the most unique words (types), which means a more diverse vocabulary, followed by blogs and news. This is expected, since the language on Twitter is supposed to be more informal than that of blogs, while the news is supposed to be the least informal of the three.
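As a rough way to quantify this, the type-to-token ratio can be compared across the three sources using the tokens summary computed above (a sketch; a higher ratio suggests a more diverse vocabulary):
# type-to-token ratio per source, from the tokens.summary data frame above
round(setNames(tokens.summary$Types / tokens.summary$Tokens, rownames(tokens.summary)), 4)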
The top 1-grams show that the vocabularies are somewhat similar, although one can see a high frequency of the pronoun “you” in Twitter messages and of the pronoun “I” in both blogs and Twitter. This agrees with the fact that blogs tend to be a kind of biography written in the first person, while tweets are both biographical and frequently replies to someone else’s tweet, and news items are more likely to be written in the third person.
Comparing the top 2-grams with the top 3-grams and so on, one can see how the English language model starts to build up and how certain jargon appears, e.g.:
Given the substantial samples from blogs, news and Twitter, a prediction model based on n-grams should suffice to predict the next word from the previous ones with reasonable accuracy. Therefore, the proposal is to develop a prediction algorithm that does the following:
take the last n words of a given text, limited to a maximum of 4 words, and look up the next word in an (n+1)-gram table;
to improve the chances of finding a next word, if the n words are not found in the (n+1)-gram table, successively fall back to looking up the last n-1 words in the n-gram table, repeating until a continuation is found or n reaches 0. A minimal sketch of this back-off lookup follows.
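The sketch below assumes hypothetical data.table objects, where ngram_tables[[n]] maps a prefix of n words to candidate next words and their counts; the object and column names (prefix, word, count) are illustrative only, not part of the report’s code.
library(data.table)
predict_next <- function(text, ngram_tables) {
  words <- tolower(unlist(strsplit(trimws(text), "\\s+")))
  n <- min(length(words), 4)                   # use at most the last 4 words
  while (n > 0) {
    q <- paste(tail(words, n), collapse = " ") # the last n words of the input
    hit <- ngram_tables[[n]][prefix == q]      # look up in the (n+1)-gram table
    if (nrow(hit) > 0)
      return(hit$word[which.max(hit$count)])   # most frequent continuation
    n <- n - 1                                 # back off to a shorter prefix
  }
  NA_character_                                # no prediction found
}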