This document summarizes the basic properties of a dataset composed of texts obtained from news feeds, Twitter posts and blogs, and outlines an approach to exploit these texts to train an algorithm that predicts the next word in a piece of text. To this end, the document is split into four parts:
An explanation of the data acquisition process
A summary of the dataset
The data preparation process
An overview of the path towards a text prediction app.
First, the data are loaded into a directory (which is created if it does not yet exist). The zip archive is downloaded if necessary and then extracted, unless the extracted files are already present.
# Create the data directory if it does not yet exist
if (!file.exists("./projectData")) {
  dir.create("./projectData")
}
Url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
# Download the zip archive only if it is not already in the projectData directory
if (!file.exists("./projectData/Coursera-SwiftKey.zip")) {
  download.file(Url, destfile = "./projectData/Coursera-SwiftKey.zip", mode = "wb")
}
# Unzip only if the archive has not been extracted yet
if (!file.exists("./projectData/final")) {
  unzip(zipfile = "./projectData/Coursera-SwiftKey.zip", exdir = "./projectData")
}
The English data are then read from the files. All other languages are ignored for the time being. Note: The encoding is set to UTF-8.
twitter <- readLines(con <- file("./projectData/final/en_US/en_US.twitter.txt", "r"), encoding = "UTF-8", skipNul = TRUE)
blogs <- readLines(con <- file("./projectData/final/en_US/en_US.blogs.txt", "r"), encoding = "UTF-8", skipNul = TRUE)
news <- readLines(con <- file("./projectData/final/en_US/en_US.news.txt", "r"), encoding = "UTF-8", skipNul = TRUE)
## Warning in readLines(con <- file("./projectData/final/en_US/
## en_US.news.txt", : incomplete final line found on './projectData/final/
## en_US/en_US.news.txt'
close(con)
This section lists some summary statistics (i.e., basic properties) of the data: the file size, the number of lines and words in each file, and the mean number of words per line.
library(stringi)
# Get file sizes
blogs.size <- file.info("./projectData/final/en_US/en_US.blogs.txt")$size / 1024 ^ 2
news.size <- file.info("./projectData/final/en_US/en_US.news.txt")$size / 1024 ^ 2
twitter.size <- file.info("./projectData/final/en_US/en_US.twitter.txt")$size / 1024 ^ 2
# Get words in files
blogs.words <- stri_count_words(blogs)
news.words <- stri_count_words(news)
twitter.words <- stri_count_words(twitter)
# Summary of the data sets
data.frame(source = c("blogs", "news", "twitter"),
           file.size.MB = c(blogs.size, news.size, twitter.size),
           num.lines = c(length(blogs), length(news), length(twitter)),
           num.words = c(sum(blogs.words), sum(news.words), sum(twitter.words)),
           mean.num.words = c(mean(blogs.words), mean(news.words), mean(twitter.words)))
## source file.size.MB num.lines num.words mean.num.words
## 1 blogs 200.4242 899288 37546246 41.75108
## 2 news 196.2775 77259 2674536 34.61779
## 3 twitter 159.3641 2360148 30093410 12.75065
In order to prepare the data for modeling, they need to be processed further. To keep memory use and computation time manageable, only a 5 per cent sample of each source is considered here. The following steps are taken:
Remove punctuation and numbers
Convert all letters to lower case
Remove stop words (e.g., “a”, “and”, “the”, …)
Remove extra white spaces.
After this, the text is tokenized, that is, split into groups of one, two, three and four words (n-grams). These groups are counted to determine their frequencies, which are stored in term-document matrices.
Since these matrices mostly contain zeros (they are sparse), very sparse terms are removed. This saves computation time and memory.
Finally, the n-gram frequencies are extracted and sorted.
raw_data <- c(sample(blogs, length(blogs) * 0.05),
              sample(news, length(news) * 0.05),
              sample(twitter, length(twitter) * 0.05))
rm(twitter,blogs,news)
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 718590 38.4 4547696 242.9 4141480 221.2
## Vcells 7387280 56.4 133657291 1019.8 134110273 1023.2
# Remove non-ASCII characters from the sample
sampled_data <- sapply(raw_data, function(x) iconv(x, "UTF-8", "ASCII", sub = ""))
rm(raw_data)
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 736295 39.4 3638156 194.3 4547696 242.9
## Vcells 8189113 62.5 106925832 815.8 134110273 1023.2
library(tm)

docs <- Corpus(VectorSource(sampled_data))
docs <- tm_map(docs, removePunctuation)              # Remove punctuation
docs <- tm_map(docs, removeNumbers)                  # Remove numbers
docs <- tm_map(docs, content_transformer(tolower))   # Convert everything to lower case
docs <- tm_map(docs, removeWords, stopwords("en"))   # Remove stopwords, i.e., "a", "and", ...
docs <- tm_map(docs, stripWhitespace)                # Remove extra whitespace
docs <- tm_map(docs, PlainTextDocument)              # Convert back to plain text documents
library(RWeka)

uniGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
biGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
triGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
quadGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))
uniDocMatrix <- TermDocumentMatrix(docs, control = list(tokenize = uniGramTokenizer))
biDocMatrix <- TermDocumentMatrix(docs, control = list(tokenize = biGramTokenizer))
triDocMatrix <- TermDocumentMatrix(docs, control = list(tokenize = triGramTokenizer))
quadDocMatrix <- TermDocumentMatrix(docs, control = list(tokenize = quadGramTokenizer))
uniGram <- removeSparseTerms(uniDocMatrix, sparse = 0.99)
biGram <- removeSparseTerms(biDocMatrix, sparse = 0.999)
triGram <- removeSparseTerms(triDocMatrix, sparse = 0.999)
quadGram <- removeSparseTerms(quadDocMatrix, sparse = 0.999)
getFreq <- function(tdm) {
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
return(data.frame(word = names(freq), freq = freq))
}
freq1 <- getFreq(uniGram)
freq2 <- getFreq(biGram)
freq3 <- getFreq(triGram)
freq4 <- getFreq(quadGram)
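As a quick sanity check (a usage example; the exact counts depend on the random sample), the top entries of any of these tables can be inspected directly, e.g. for the bigrams:
head(freq2, 10)   # The ten most frequent 2-grams and their counts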
The following plots show the 30 most frequent 1-, 2- and 3-grams in the 5 per cent sample, together with their frequencies.
library(ggplot2)

ggplot(freq1[1:30,], aes(reorder(word, -freq), freq)) +
labs(x = "Most frequent 1-grams", y = "Frequency") +
theme(axis.text.x = element_text(angle = 60, size = 12, hjust = 1)) +
geom_bar(stat = "identity", fill = I("blue"))
ggplot(freq2[1:30,], aes(reorder(word, -freq), freq)) +
labs(x = "Most frequent 2-grams", y = "Frequency") +
theme(axis.text.x = element_text(angle = 60, size = 12, hjust = 1)) +
geom_bar(stat = "identity", fill = I("blue"))
ggplot(freq3[1:30,], aes(reorder(word, -freq), freq)) +
labs(x = "Most frequent 3-grams", y = "Frequency") +
theme(axis.text.x = element_text(angle = 60, size = 12, hjust = 1)) +
geom_bar(stat = "identity", fill = I("blue"))
## Warning: Removed 28 rows containing missing values (position_stack).
# No 4-grams in this sample
# ggplot(freq4[1:30,], aes(reorder(word, -freq), freq)) +
# labs(x = "Most frequent 4-grams", y = "Frequency") +
# theme(axis.text.x = element_text(angle = 60, size = 12, hjust = 1)) +
# geom_bar(stat = "identity", fill = I("blue"))
Since the relative frequency of an n-gram in the sample is easy to determine, it can be used as an estimate of its probability of occurrence without assuming any further information. It is, so to speak, the 'a priori' probability of occurrence. Building on this, the approach to modeling the next word is based on a Bayesian scheme. Bayes' law is a statement about conditional probability: the probability of B given A equals the probability of A given B, multiplied by the probability of B and divided by the probability of A.
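In formula form (a restatement of the rule above):

$$P(B \mid A) = \frac{P(A \mid B)\,P(B)}{P(A)}$$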
Applied to n-grams, the probability of the last word given the preceding n-1 words can be estimated directly from the frequency tables: P(w_n | w_1 … w_(n-1)) = count(w_1 … w_n) / count(w_1 … w_(n-1)). The prediction algorithm therefore goes through the n-gram tables, keeps those n-grams whose first n-1 words match the end of the entered text, and ranks the candidate last words by this conditional probability.
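A minimal sketch of this lookup, assuming the freq2, freq3 and freq4 data frames built above (the function name, the input cleaning and the simple backoff from 4-grams down to 2-grams are illustrative choices, not the final implementation):

# Illustrative next-word prediction via backoff over the n-gram tables.
# freq2/freq3/freq4 are assumed to be the data frames returned by getFreq()
# above, with columns 'word' (the n-gram as one string) and 'freq' (its count).
predictNextWord <- function(input, freq2, freq3, freq4) {
  # Clean the input roughly like the corpus: lower case, letters only
  words <- tolower(input)
  words <- gsub("[^a-z ]", " ", words)
  words <- unlist(strsplit(trimws(gsub("\\s+", " ", words)), " "))
  if (length(words) == 0) return(NA_character_)

  tables <- list(freq2, freq3, freq4)   # a k-word prefix indexes the (k+1)-gram table
  for (k in min(3, length(words)):1) {
    prefix <- paste(tail(words, k), collapse = " ")
    tbl <- tables[[k]]
    hits <- tbl[startsWith(as.character(tbl$word), paste0(prefix, " ")), ]
    if (nrow(hits) > 0) {
      # Return the final word of the most frequent matching n-gram
      best <- as.character(hits$word[which.max(hits$freq)])
      return(tail(unlist(strsplit(best, " ")), 1))
    }
  }
  NA_character_   # No matching n-gram found
}

# Example call (the result depends on the random 5 per cent sample):
# predictNextWord("thanks for the", freq2, freq3, freq4)

Note that stop words were removed during preprocessing, so this sketch cannot predict very common words such as "the"; for the final app it may be preferable to build the n-gram tables with stop words retained.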
The next step is to implement and test this prediction scheme and to wrap it into a text prediction app.