This document summarizes the basic properties of a dataset composed of texts obtained from news feeds, Twitter posts and blogs, and outlines an approach to exploit these texts to train an algorithm that predicts the next word in a piece of text. To this end, the document is split into four parts:
An explanation of the data acquisition process
A summary of the dataset
The data preparation process
An overview of the path towards a text prediction app.
First, the data are loaded into a directory (which is created if it does not yet exist). The zip archive is downloaded if necessary and then extracted, unless the extracted files are already present.
# Create the data directory if it does not yet exist
if (!file.exists("./projectData")) {
  dir.create("./projectData")
}
Url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
# Download the zip archive only if it is not already in the projectData directory
if (!file.exists("./projectData/Coursera-SwiftKey.zip")) {
  download.file(Url, destfile = "./projectData/Coursera-SwiftKey.zip", mode = "wb")
}
# Unzip only if the archive has not been extracted yet
if (!file.exists("./projectData/final")) {
  unzip(zipfile = "./projectData/Coursera-SwiftKey.zip", exdir = "./projectData")
}
The English data are then read from the files. All other languages are ignored for the time being. Note: The encoding is set to UTF-8.
twitter <- readLines(con <- file("./projectData/final/en_US/en_US.twitter.txt", "r"), encoding = "UTF-8", skipNul = TRUE)
blogs <- readLines(con <- file("./projectData/final/en_US/en_US.blogs.txt", "r"), encoding = "UTF-8", skipNul = TRUE)
news <- readLines(con <- file("./projectData/final/en_US/en_US.news.txt", "r"), encoding = "UTF-8", skipNul = TRUE)
## Warning in readLines(con <- file("./projectData/final/en_US/
## en_US.news.txt", : incomplete final line found on './projectData/final/
## en_US/en_US.news.txt'
close(con)
This section lists some summary statistics (i.e., basic properties) of the data: the file size, the number of lines and words in each file, and the mean number of words per line.
library(stringi)
# Get file sizes
blogs.size <- file.info("./projectData/final/en_US/en_US.blogs.txt")$size / 1024 ^ 2
news.size <- file.info("./projectData/final/en_US/en_US.news.txt")$size / 1024 ^ 2
twitter.size <- file.info("./projectData/final/en_US/en_US.twitter.txt")$size / 1024 ^ 2
# Get words in files
blogs.words <- stri_count_words(blogs)
news.words <- stri_count_words(news)
twitter.words <- stri_count_words(twitter)
# Summary of the data sets
data.frame(source = c("blogs", "news", "twitter"),
           file.size.MB = c(blogs.size, news.size, twitter.size),
           num.lines = c(length(blogs), length(news), length(twitter)),
           num.words = c(sum(blogs.words), sum(news.words), sum(twitter.words)),
           mean.num.words = c(mean(blogs.words), mean(news.words), mean(twitter.words)))
## source file.size.MB num.lines num.words mean.num.words
## 1 blogs 200.4242 899288 37546246 41.75108
## 2 news 196.2775 77259 2674536 34.61779
## 3 twitter 159.3641 2360148 30093410 12.75065
In order to prepare the data for modeling, they need to be processed further. To keep memory use and computation time manageable, only a 5 per cent sample of each source is considered here. The following steps are taken:
Remove punctuation and numbers
Convert all letters to lower case
Remove stop words (e.g., “a”, “and”, “the”, …)
Remove extra white spaces.
After this, the text is tokenized, that is, split into groups of one, two, three and four words (n-grams). These groups are counted to determine their frequencies, which are stored in term-document matrices.
Since these matrices mostly contain zeros (they are sparse), very sparse terms are removed. This saves computation time and memory.
Finally, the n-gram frequencies are extracted and sorted.
raw_data <- c(sample(blogs, length(blogs) * 0.05),
              sample(news, length(news) * 0.05),
              sample(twitter, length(twitter) * 0.05))
rm(twitter,blogs,news)
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 718590 38.4 4547696 242.9 4141480 221.2
## Vcells 7387280 56.4 133657291 1019.8 134110273 1023.2
# Remove non-ASCII characters from the sample
sampled_data <- sapply(raw_data, function(x) iconv(x, "UTF-8", "ASCII", sub = ""))
rm(raw_data)
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 736295 39.4 3638156 194.3 4547696 242.9
## Vcells 8189113 62.5 106925832 815.8 134110273 1023.2
library(tm)

docs <- Corpus(VectorSource(sampled_data))
docs <- tm_map(docs, removePunctuation)              # Remove punctuation
docs <- tm_map(docs, removeNumbers)                  # Remove numbers
docs <- tm_map(docs, content_transformer(tolower))   # Convert everything to lower case
docs <- tm_map(docs, removeWords, stopwords("en"))   # Remove stopwords, i.e., "a", "and", ...
docs <- tm_map(docs, stripWhitespace)                # Remove extra whitespace
docs <- tm_map(docs, PlainTextDocument)              # Convert back to plain text documents
library(RWeka)

uniGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
biGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
triGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
quadGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))
uniDocMatrix <- TermDocumentMatrix(docs, control = list(tokenize = uniGramTokenizer))
biDocMatrix <- TermDocumentMatrix(docs, control = list(tokenize = biGramTokenizer))
triDocMatrix <- TermDocumentMatrix(docs, control = list(tokenize = triGramTokenizer))
quadDocMatrix <- TermDocumentMatrix(docs, control = list(tokenize = quadGramTokenizer))
uniGram <- removeSparseTerms(uniDocMatrix, sparse = 0.99)
biGram <- removeSparseTerms(biDocMatrix, sparse = 0.999)
triGram <- removeSparseTerms(triDocMatrix, sparse = 0.999)
quadGram <- removeSparseTerms(quadDocMatrix, sparse = 0.999)
getFreq <- function(tdm) {
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
return(data.frame(word = names(freq), freq = freq))
}
freq1 <- getFreq(uniGram)
freq2 <- getFreq(biGram)
freq3 <- getFreq(triGram)
freq4 <- getFreq(quadGram)
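As a quick sanity check (a usage example; the exact counts depend on the random sample), the top entries of any of these tables can be inspected directly, e.g. for the bigrams:
head(freq2, 10)   # The ten most frequent 2-grams and their counts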
The following plots show the 30 most frequent 1-, 2- and 3-grams in the 5 per cent sample, together with their frequencies.
library(ggplot2)

ggplot(freq1[1:30,], aes(reorder(word, -freq), freq)) +
labs(x = "Most frequent 1-grams", y = "Frequency") +
theme(axis.text.x = element_text(angle = 60, size = 12, hjust = 1)) +
geom_bar(stat = "identity", fill = I("blue"))
ggplot(freq2[1:30,], aes(reorder(word, -freq), freq)) +
labs(x = "Most frequent 2-grams", y = "Frequency") +
theme(axis.text.x = element_text(angle = 60, size = 12, hjust = 1)) +
geom_bar(stat = "identity", fill = I("blue"))
ggplot(freq3[1:30,], aes(reorder(word, -freq), freq)) +
labs(x = "Most frequent 3-grams", y = "Frequency") +
theme(axis.text.x = element_text(angle = 60, size = 12, hjust = 1)) +
geom_bar(stat = "identity", fill = I("blue"))
## Warning: Removed 28 rows containing missing values (position_stack).
# No 4-grams in this sample
# ggplot(freq4[1:30,], aes(reorder(word, -freq), freq)) +
# labs(x = "Most frequent 4-grams", y = "Frequency") +
# theme(axis.text.x = element_text(angle = 60, size = 12, hjust = 1)) +
# geom_bar(stat = "identity", fill = I("blue"))
Since the relative frequency of an n-gram in the sample is easy to determine, it can be used as an estimate of its probability of occurrence without assuming any further information. It is, so to speak, the 'a priori' probability of occurrence. Building on this, the approach to modeling the next word is based on a Bayesian scheme. Bayes' law is a statement about conditional probability: the probability of B given A equals the probability of A given B, multiplied by the probability of B and divided by the probability of A.
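In formula form (a restatement of the rule above):

$$P(B \mid A) = \frac{P(A \mid B)\,P(B)}{P(A)}$$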
Applied to n-grams, the probability of the last word given the preceding n-1 words can be estimated directly from the frequency tables: P(w_n | w_1 … w_(n-1)) = count(w_1 … w_n) / count(w_1 … w_(n-1)). The prediction algorithm therefore goes through the n-gram tables, keeps those n-grams whose first n-1 words match the end of the entered text, and ranks the candidate last words by this conditional probability.
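A minimal sketch of this lookup, assuming the freq2, freq3 and freq4 data frames built above (the function name, the input cleaning and the simple backoff from 4-grams down to 2-grams are illustrative choices, not the final implementation):

# Illustrative next-word prediction via backoff over the n-gram tables.
# freq2/freq3/freq4 are assumed to be the data frames returned by getFreq()
# above, with columns 'word' (the n-gram as one string) and 'freq' (its count).
predictNextWord <- function(input, freq2, freq3, freq4) {
  # Clean the input roughly like the corpus: lower case, letters only
  words <- tolower(input)
  words <- gsub("[^a-z ]", " ", words)
  words <- unlist(strsplit(trimws(gsub("\\s+", " ", words)), " "))
  if (length(words) == 0) return(NA_character_)

  tables <- list(freq2, freq3, freq4)   # a k-word prefix indexes the (k+1)-gram table
  for (k in min(3, length(words)):1) {
    prefix <- paste(tail(words, k), collapse = " ")
    tbl <- tables[[k]]
    hits <- tbl[startsWith(as.character(tbl$word), paste0(prefix, " ")), ]
    if (nrow(hits) > 0) {
      # Return the final word of the most frequent matching n-gram
      best <- as.character(hits$word[which.max(hits$freq)])
      return(tail(unlist(strsplit(best, " ")), 1))
    }
  }
  NA_character_   # No matching n-gram found
}

# Example call (the result depends on the random 5 per cent sample):
# predictNextWord("thanks for the", freq2, freq3, freq4)

Note that stop words were removed during preprocessing, so this sketch cannot predict very common words such as "the"; for the final app it may be preferable to build the n-gram tables with stop words retained.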
The next step is to implement and test this prediction scheme and to wrap it into a text prediction app.