1 Introduction

The goal of this milestone report is to present what has been accomplished so far with the sample data from the blog, news and Twitter texts provided by SwiftKey, and to demonstrate that the project is on track to create the prediction algorithm.

The motivation for this report is to:

  1. Demonstrate that the data have been successfully downloaded and loaded into a corpus.
  2. Create summary statistics about the data sets.
  3. Report any interesting findings.
  4. Get feedback on the plans for creating a prediction algorithm and a Shiny application.

2 Resources used

Computer: Toshiba Satellite E55-A laptop

R libraries:

packages <- c("parallel", "quanteda", "readtext", "data.table", "gridExtra", "ggplot2", "dplyr")
# load each package and report its version, along with the R version
noquote(c(R=paste(R.Version()[c("major", "minor")], collapse="."),
          sapply(packages, function(x) {library(x, character.only=T, logical.return=T); as.character(packageVersion(x))})))
         R   parallel   quanteda   readtext data.table  gridExtra    ggplot2      dplyr 
     3.4.4      3.4.4      1.1.1       0.50   1.10.4.3        2.3      2.2.1      0.7.4 

3 Getting Data

Downloaded file:

fileURL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
dataDir <- "~/dsscapstone"
zipFile <- "Coursera-SwiftKey.zip"
zipfilePath <- file.path(dataDir, zipFile)
if(!file.exists(zipfilePath)) download.file(fileURL, destfile=zipfilePath, cacheOK = FALSE)
file.info(zipfilePath)[c(1,5)]
                                         size               ctime
~/dsscapstone/Coursera-SwiftKey.zip 574661177 2018-03-20 18:13:52

Compressed contents:

(zipContents <- unzip(zipfilePath, list=TRUE))
                            Name    Length                Date
1                         final/         0 2014-07-22 10:10:00
2                   final/de_DE/         0 2014-07-22 10:10:00
3  final/de_DE/de_DE.twitter.txt  75578341 2014-07-22 10:11:00
4    final/de_DE/de_DE.blogs.txt  85459666 2014-07-22 10:11:00
5     final/de_DE/de_DE.news.txt  95591959 2014-07-22 10:11:00
6                   final/ru_RU/         0 2014-07-22 10:10:00
7    final/ru_RU/ru_RU.blogs.txt 116855835 2014-07-22 10:12:00
8     final/ru_RU/ru_RU.news.txt 118996424 2014-07-22 10:12:00
9  final/ru_RU/ru_RU.twitter.txt 105182346 2014-07-22 10:12:00
10                  final/en_US/         0 2014-07-22 10:10:00
11 final/en_US/en_US.twitter.txt 167105338 2014-07-22 10:12:00
12    final/en_US/en_US.news.txt 205811889 2014-07-22 10:13:00
13   final/en_US/en_US.blogs.txt 210160014 2014-07-22 10:13:00
14                  final/fi_FI/         0 2014-07-22 10:10:00
15    final/fi_FI/fi_FI.news.txt  94234350 2014-07-22 10:11:00
16   final/fi_FI/fi_FI.blogs.txt 108503595 2014-07-22 10:12:00
17 final/fi_FI/fi_FI.twitter.txt  25331142 2014-07-22 10:10:00

Languages found:

langFilePattern <- "^final/(.._..)/.._...+\\.txt$"
unique(sub(langFilePattern, "\\1", grep(langFilePattern, zipContents$Name, value = TRUE)))
[1] "de_DE" "ru_RU" "en_US" "fi_FI"

The scope of this project is limited to the English files. After uncompressing them and before building a corpus, it is necessary to fix premature end-of-file errors caused by the \032 (Ctrl-Z) character; this problem occurs only on Windows platforms, which is the platform used here.

unzipFiles <- sort(grep("^final/en_US/.._...+\\.txt$", zipContents$Name, value=T))
files <- file.path(dataDir, basename(unzipFiles))
docs <- sub(file.path(dataDir, "en_US\\.(.*)\\.txt$"), "\\1", files)
if(!all(file.exists(files))) unzip(zipfilePath, files=unzipFiles, setTimes=T, exdir=dataDir, junkpaths=T)
fix.files <- function(f, basenamer = function(x) basename(x)) {
    # binary open avoids unexpected end of file on readLines() on Windows platforms
    con <- file(f, open = "rb")
    # removing unexpected end of file character "\032" on Windows platforms
    buffer <- gsub("\032", "", readLines(con, skipNul=TRUE))
    close(con)
    outFile <- file.path(dirname(f), basenamer(f))
    con <- file(outFile, "wb")
    writeLines(buffer, con)
    close(con)
    outFile
}
if(!file.exists(paste0(dataDir, "/fixed"))) {
    cluster <- makeCluster(detectCores())
    out <- parSapply(cluster, files, fix.files) #35s
    stopCluster(cluster)
    writeLines("", paste0(dataDir, "/fixed"))
}

Summarize files via GNU coreutils wc:

wc <- function(paths) {
    require(parallel)
    cluster <- makeCluster(detectCores())
    ret <- parSapplyLB(cluster, paths, function(x)
        as.integer(unlist(strsplit(system(paste("wc -l -w -c -L", x), TRUE),"\\s+"))[2:5]))
    stopCluster(cluster)
    rownames(ret) <- c("lines", "words", "bytes", "longest.line")
    data.frame(t(ret))
}
(wordcounts <- wc(files))
                                  lines    words     bytes longest.line
~/dsscapstone/en_US.blogs.txt    899288 37272578 209260726        40832
~/dsscapstone/en_US.news.txt    1010242 34309642 204801643        11384
~/dsscapstone/en_US.twitter.txt 2360148 30341028 164745183          140
c(colSums(wordcounts[1:3]), longest.line=max(wordcounts[4]))
       lines        words        bytes longest.line 
     4269678    101923248    578807552        40832 

4 Exploratory Data Analysis

As defined in Wikipedia, an n-gram is a contiguous sequence of n items from a given sample of text or speech.

It’s assumed that the optimal unit of sample text for creating n-grams is the sentence; otherwise n-grams could be built across sentence boundaries, e.g., a bigram composed of the last word of one sentence and the first word of the very next sentence.
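
As a minimal illustration (the two-sentence string below is made up, and quanteda’s tokens_ngrams() is used), bigrams built over the whole text include one that crosses the sentence boundary, while reshaping to sentences first avoids it:

txt <- "This is sentence one. Sentence two follows."
# bigrams over the whole text include the boundary-crossing "one_Sentence"
tokens_ngrams(tokens(txt, remove_punct=TRUE), n=2)
# reshaping to sentences first keeps every bigram within a single sentence
sentences <- corpus_reshape(corpus(txt), to="sentences")
tokens_ngrams(tokens(sentences, remove_punct=TRUE), n=2)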

4.1 Creating the corpus

One corpus is created per file, with each corpus document containing a single line of the source file.

for(i in seq_along(files)) {
    corpus <- paste0(docs[i], ".corpus1")
    corpusFile <- file.path(dataDir, paste0(corpus, ".rda"))
    if(!file.exists(corpusFile)) { #62s
        buffer <- readLines(files[i], skipNul=TRUE, encoding = "UTF-8")
        tmp <- corpus(buffer)
        docnames(tmp) <- NULL
        assign(corpus, tmp)
        rm(buffer, tmp)
        save(list=corpus, file=corpusFile, compress = FALSE)
    } else load(corpusFile) #43s
}

Summary

summary.corpora <- function(corpus.names) {
    bind_rows(lapply(corpus.names, function(corpus.name) {
        corpus <- get(corpus.name)
        data.frame(Corpus = corpus.name,
                   Documents = ndoc(corpus),
                   Sentences = sum(nsentence(corpus)),
                   Tokens = sum(ntoken(corpus, remove_punct=TRUE)),
                   Megabytes = round(as.numeric(object.size(corpus))/1024/1024, 1),
                   stringsAsFactors = FALSE)
    }))
}
summariesFile <- file.path(dataDir, "corpora1.summary.rda")
if(!file.exists(summariesFile)) {
    corpora1.summary <- summary.corpora(paste0(docs,".corpus1")) #42min
    save(corpora1.summary, file=summariesFile, compress=FALSE)
} else load(summariesFile)
corpora1.summary
           Corpus Documents Sentences   Tokens Megabytes
1   blogs.corpus1    899288   2367316 37296232     247.1
2    news.corpus1   1010242   1993117 34258969     248.9
3 twitter.corpus1   2360148   3770706 29959374     301.3
colSums(corpora1.summary[,-1])
  Documents   Sentences      Tokens   Megabytes 
  4269678.0   8131139.0 101514575.0       797.3 

4.2 Reshaping the corpora from documents to sentences

for(i in seq_along(files)) {
    corpus1 <- paste0(docs[i], ".corpus1")
    corpus2 <- paste0(docs[i], ".corpus2")
    corpus2File <- file.path(dataDir, paste0(corpus2, ".rda"))
    if(!file.exists(corpus2File)) { #33min
        tmp <- corpus_reshape(get(corpus1), to="sentences", use_docvars=FALSE)
        docnames(tmp) <- NULL
        assign(corpus2, tmp)
        rm(tmp)
        save(list=corpus2, file=corpus2File, compress = FALSE)
    } else load(corpus2File) #43s
    rm(list=corpus1)
}

Summary

summariesFile <- file.path(dataDir, "corpora2.summary.rda")
if(!file.exists(summariesFile)) {
    corpora2.summary <- summary.corpora(paste0(docs,".corpus2")) #74min
    save(corpora2.summary, file=summariesFile, compress=FALSE)
} else load(summariesFile)
corpora2.summary
           Corpus Documents Sentences   Tokens Megabytes
1   blogs.corpus2   2367316   2367331 37300823     355.8
2    news.corpus2   1993117   1993119 34260121     325.7
3 twitter.corpus2   3770706   3770710 29968955     380.5
colSums(corpora2.summary[,-1])
Documents Sentences    Tokens Megabytes 
  8131139   8131160 101529899      1062 

4.3 Generating word tokens

Generating word tokens (unigrams) with the following transformations:

  • Remove numbers
  • Remove punctuation
  • Remove symbols
  • Remove separators
  • Remove twitter tags
  • Remove hyphens
  • Remove URLs
  • Lowercase
  • Remove unlikely words longer than 20 characters, a limit based on Wikipedia’s discussion of the longest English words

for (doc in docs) { # 11min
    corpus <- paste0(doc, ".corpus2")
    corpusFile <- file.path(dataDir, paste0(corpus, ".rda"))
    tokens <- paste0(doc, ".tokens")
    tokensFile <- file.path(dataDir, paste0(tokens, ".rda"))
    if(!file.exists(tokensFile)) {
        if(!exists(corpus)) load(corpusFile)
        assign(tokens, tokens(get(corpus), remove_numbers=T, remove_punct=T, remove_symbols=T,
                              remove_separators=T, remove_twitter=T, remove_hyphens=T, remove_url=T))
        assign(tokens, tokens_remove(get(tokens), max_nchar = 20L))
        assign(tokens, tokens_tolower(get(tokens)))
        save(list=tokens, file=tokensFile, compress = FALSE)
    } else load(tokensFile)
    rm(list=corpus)
}

Summary

(tokens.summary <- data.frame(t(sapply(docs, function(doc) {
    tokens <- paste0(doc, ".tokens")
    c(Tokens = sum(ntoken(get(tokens))),
      Types = length(types(get(tokens))),
      size.MB = round(object.size(get(tokens))/1024/1024)) #11s
}))))
          Tokens  Types size.MB
blogs   37130798 290887     444
news    33862902 243376     387
twitter 29638369 338942     566
colSums(tokens.summary[,c(1,3)])
   Tokens   size.MB 
100632069      1397 

4.4 Histograms of the top 10 n-gram features, n from 1 to 5

# generating document feature matrices
for (doc in docs) { #1h34m
    tokens <- paste0(doc, ".tokens")
    for (ngram in 1:5) {
        dfm <- paste0(doc, ".dfm.", ngram, "gram")
        dfmFile <- file.path(dataDir, paste0(dfm, ".rda"))
        if(!file.exists(dfmFile)) {
            assign(dfm, dfm(get(tokens), tolower=FALSE, ngrams=ngram, concatenator=" "))
            save(list=dfm, file=dfmFile, compress = FALSE)
            rm(list=dfm)
        }
    }
    rm(list=tokens)
}
# generating a list of top 10 features
top=10
topfeatures.file <- file.path(dataDir, "topfeatures.bydoc.rda")
if(!file.exists(topfeatures.file)) {
    topfeatures.bydoc <-
        lapply(1:5, function(ngram) { #12min
            sapply(docs, function(doc) {
                dfm <- paste0(doc, ".dfm.", ngram, "gram")
                load(file.path(dataDir, paste0(dfm, ".rda")))
                tf <- topfeatures(get(dfm), top)
                rm(list=dfm)
                tf
            }, simplify = FALSE)
        })
    save(topfeatures.bydoc, file=topfeatures.file, compress=FALSE)
} else load(topfeatures.file)
# plotting the top 10 features
for(ngram in seq_along(topfeatures.bydoc)) {
    grobs <- lapply(seq_along(topfeatures.bydoc[[ngram]]), function(doc) {
        frequency <- topfeatures.bydoc[[ngram]][[doc]]
        feature <- names(frequency)
        ggplot(mapping=aes(feature, frequency)) + geom_col() + coord_flip() +
                labs(title=docs[doc], x=paste0(ngram, "-gram"))
    })
    grid.arrange(grobs = grobs, ncol=length(docs),
                 top=paste0("Top ",top," \"",ngram,"-gram\""," features"))
}

4.5 Findings

More than one hundred million tokens have been found in the given texts.

Blogs contain the most tokens, followed by news and Twitter.

Twitter has the most unique words (types), followed by blogs and news, which indicates a more diverse vocabulary. This is expected, since Twitter language is presumably more informal than that of blogs, while news is presumably the least informal of the three.
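
A rough way to quantify this from the tokens.summary table above (a quick check, not part of the original analysis) is the type/token ratio, which is highest for Twitter:

# type/token ratio (types per token) as a crude measure of vocabulary diversity,
# computed from the tokens.summary table above; roughly 0.0078 (blogs),
# 0.0072 (news) and 0.0114 (twitter) for the counts reported there
setNames(round(tokens.summary$Types / tokens.summary$Tokens, 4), rownames(tokens.summary))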

The top 1-grams show that the vocabularies are broadly similar, although the pronoun “you” is notably frequent in Twitter messages and the pronoun “I” in both blogs and Twitter. This agrees with the fact that blogs tend to be written in the first person, as a kind of personal account, that Twitter combines first-person writing with frequent replies to other users, and that news is more likely to be written in the third person.
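
This observation can be checked against the saved top-feature lists; the quick sketch below simply looks the two pronouns up by name in the top-10 unigrams, so a corpus where a pronoun falls outside the top 10 shows NA:

# frequencies of "i" and "you" among the saved top-10 unigrams per corpus
# (NA means the pronoun is not among that corpus's top 10 features)
sapply(topfeatures.bydoc[[1]], function(tf) setNames(tf[c("i", "you")], c("i", "you")))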

Comparing the top 2-grams with the top 3-grams and so on, one can see how the English language model starts to build up and how certain set phrases appear (one of these is checked against the saved document-feature matrices in the sketch after the list), e.g.:

  • Blogs and News: “at the end of”
  • Twitter: “thanks for the shout out”
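
For instance, the overall frequency of one of these phrases can be read back from the saved document-feature matrices (a small sketch reusing the blogs.dfm.4gram object created by the loop in section 4.4):

# total count of the 4-gram "at the end of" across all blog sentences
load(file.path(dataDir, "blogs.dfm.4gram.rda"))
sum(blogs.dfm.4gram[, "at the end of"])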

5 Prediction algorithm proposal

Given the substantial samples from blogs, news and Twitter, a prediction model based on n-grams should be able to predict the next word from the previous ones with reasonable accuracy.
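
As a rough, illustrative sketch only (not the final design), such a model could be stored as an n-gram frequency table and queried with a simple back-off from longer to shorter contexts. The tiny ngrams.dt table, its prefix/nextword/count columns and the predict.next() helper below are all made up for illustration, using data.table, which is already among the loaded packages:

library(data.table)
# a tiny, made-up n-gram frequency table; the real one would be built from the n-gram counts above
ngrams.dt <- data.table(prefix   = c("at the end", "the end", "end", "thanks for the"),
                        nextword = c("of", "of", "up", "shout"),
                        count    = c(250L, 900L, 120L, 80L))
setkey(ngrams.dt, prefix)

predict.next <- function(context, dt, max.order = 4) {
    words <- tail(unlist(strsplit(tolower(context), "\\s+")), max.order - 1)
    # try the longest available context first, then back off one word at a time
    for (n in rev(seq_along(words))) {
        hits <- dt[.(paste(tail(words, n), collapse = " ")), nomatch = 0L]
        if (nrow(hits) > 0) return(hits[which.max(count), nextword])
    }
    "the"  # last resort: a most-frequent-unigram fallback
}

predict.next("I will see you at the end", ngrams.dt)  # returns "of"

The proposal is therefore to develop a prediction algorithm that does the following: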