The subfolder en_US contains three text files (blogs, news and twitter). The files range between roughly 900K and 2.4 million lines and between 30 and 37 million words. A small fraction of each file was sampled, mixed together and then further processed. During the cleaning process, redundancies such as punctuation, numbers, symbols and contracted expressions were removed or resolved (e.g. aren’t to are not), and the whole data set was transformed to lower case. Finally, unigrams, bigrams and trigrams were constructed, each time comparing the set with and without stop words.
Stop words are by far more frequent than non-stop words. By removing stop words like ‘and’, ‘the’ and so forth, the distribution changes and, to my eyes, significant differences between the three files become visible. There are some non-English words in the text, but so far they have not caused any trouble during this exploration, although an in-depth examination was not performed.
The goal of the capstone project is to build a prediction model that is both memory-efficient and fast at runtime.
The data is downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip and its documentation can be found at http://www.corpora.heliohost.org/. The downloaded file is placed in a separate folder named “data” and unzipped there. The extracted data is located in “data/final”, which consists of four language subfolders. For this milestone report we will only explore the “en_US” subfolder.
if(!file.exists("data")){ dir.create("data") }
if(!file.exists("data/Coursera-SwiftKey.zip")){
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(url, destfile = "data/Coursera-SwiftKey.zip", mode="wb")
}
if(!file.exists("data/final")){
unzip("data/Coursera-SwiftKey.zip", exdir="data")
}
library(ngram) # for wordcount
dir()
## [1] "data" "Report_Exploratory.html"
## [3] "Report_Exploratory.Rmd" "Report_Exploratory_cache"
## [5] "Report_Exploratory_files"
dir("data")
## [1] "Coursera-SwiftKey.zip" "final" "trigramModel.RData"
dir("data/final") ## [1] "de_DE" "en_US" "fi_FI" "ru_RU"
## [1] "de_DE" "en_US" "fi_FI" "ru_RU"
dir("data/final/en_US") ## [1] "en_US.blogs.txt" "en_US.news.txt" "en_US.twitter.txt"
## [1] "en_US.blogs.txt" "en_US.news.txt" "en_US.twitter.txt"
con <- file("data/final/en_US/en_US.blogs.txt","r")
blog <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
# the news file cannot be read completely on Windows 10 in text mode,
# therefore "rb" (binary mode) is used instead of "r"
con <- file("data/final/en_US/en_US.news.txt","rb")
news <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
con <- file("data/final/en_US/en_US.twitter.txt","r")
twitter <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
print(data.frame(File = c("blogs", "news", "twitter"),
                 MBytes = c(object.size(blog)/2^20, object.size(news)/2^20, object.size(twitter)/2^20),
                 Lines = c(length(blog), length(news), length(twitter)),
                 Words = c(wordcount(blog), wordcount(news), wordcount(twitter))))
## File MBytes Lines Words
## 1 blogs 255.3545 899288 37334131
## 2 news 257.3404 1010242 34372530
## 3 twitter 318.9897 2360148 30373583
Natural languages contain redundancies such as punctuation, whitespace, and lower and upper case, which make text easier for us humans to read. For our exploratory data analysis we will remove these redundancies as well as numbers and symbols like @, #, etc. Moreover, we will expand common contractions to their full form, except for ’s (has, is and the genitive), which will simply be replaced by a white space to keep things simple. As the files are quite large, we only take a small sample of each of the three files in order to reduce the time needed for the following operations.
library(tm)
set.seed(1)
# can only work with small samples due to memory constraints
dataset <- c(sample(blog, length(blog)*0.0005),
             sample(news, length(news)*0.0005),
             sample(twitter, length(twitter)*0.0005))
# define functions
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
have <- content_transformer(function(x, pattern) gsub(pattern, " have", x))
haveNot <- content_transformer(function(x, pattern) gsub(pattern, "have not", x))
hasNot <- content_transformer(function(x, pattern) gsub(pattern, "has not", x))
hadNot <- content_transformer(function(x, pattern) gsub(pattern, "had not", x))
are <- content_transformer(function(x, pattern) gsub(pattern, " are", x))
goingTo <- content_transformer(function(x, pattern) gsub(pattern, "going to", x))
wasNot <- content_transformer(function(x, pattern) gsub(pattern, "was not", x))
wereNot <- content_transformer(function(x, pattern) gsub(pattern, "were not", x))
isNot <- content_transformer(function(x, pattern) gsub(pattern, "is not", x))
areNot <- content_transformer(function(x, pattern) gsub(pattern, "are not", x))
wantTo <- content_transformer(function(x, pattern) gsub(pattern, "want to", x))
willNot <- content_transformer(function(x, pattern) gsub(pattern, "will not", x))
didNot <- content_transformer(function(x, pattern) gsub(pattern, "did not", x))
doesNot <- content_transformer(function(x, pattern) gsub(pattern, "does not", x))
doNot <- content_transformer(function(x, pattern) gsub(pattern, "do not", x))
canNot <- content_transformer(function(x, pattern) gsub(pattern, "can not", x))
# clean data set
corpus <- VCorpus(VectorSource(dataset))
corpus <- tm_map(corpus, have, "'ve")
corpus <- tm_map(corpus, hasNot, "hasn't")
corpus <- tm_map(corpus, haveNot, "haven't")
corpus <- tm_map(corpus, hadNot, "hadn't")
corpus <- tm_map(corpus, are, "[^ a-zA-Z]+'re")
corpus <- tm_map(corpus, goingTo, "gonna")
corpus <- tm_map(corpus, wasNot, "wasn't")
corpus <- tm_map(corpus, wereNot, "weren't")
corpus <- tm_map(corpus, isNot, "isn't")
corpus <- tm_map(corpus, areNot, "aren't")
corpus <- tm_map(corpus, wantTo, "wanna")
corpus <- tm_map(corpus, willNot, "won't")
corpus <- tm_map(corpus, didNot, "didn't")
corpus <- tm_map(corpus, doesNot, "doesn't")
corpus <- tm_map(corpus, doNot, "don't")
corpus <- tm_map(corpus, canNot, "can't")
corpus <- tm_map(corpus, canNot, "cannot")
corpus <- tm_map(corpus, canNot, "cant")
corpus <- tm_map(corpus, toSpace, "'s")
corpus <- tm_map(corpus, toSpace, "`s")
corpus <- tm_map(corpus, toSpace, "´s")
corpus <- tm_map(corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")
corpus <- tm_map(corpus, toSpace, "@[^\\s]+")
corpus <- tm_map(corpus, toSpace, " [-]+ ")
corpus <- tm_map(corpus, toSpace, " [\"]+ ")
corpus <- tm_map(corpus, content_transformer(tolower)) # content_transformer keeps the documents as TextDocuments
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
# splitting data set into one with and another without stop words
corpusWithStopWords <- corpus
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpusWithStopWords <- tm_map(corpusWithStopWords, PlainTextDocument)
corpus <- tm_map(corpus, PlainTextDocument)
We will look at the frequencies of single words as well as 2-grams and 3-grams for the top 20 entries.
library(tm) # reincluded as this library likes to disappear somehow
library(RWeka)
# define functions to process data
(bigram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2)))
(trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3)))
getFrequency <- function(data) {
    frequency <- sort(rowSums(as.matrix(data)), decreasing = TRUE)
    return(data.frame(word = names(frequency), freq = frequency))
}
# create appropriate ngrams
uni <- getFrequency(removeSparseTerms(TermDocumentMatrix(corpus), 0.9999))
uniStop <- getFrequency(removeSparseTerms(TermDocumentMatrix(corpusWithStopWords), 0.9999))
bi <- getFrequency(removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = bigram)), 0.9999))
biStop <- getFrequency(removeSparseTerms(TermDocumentMatrix(corpusWithStopWords, control = list(tokenize = bigram)), 0.9999))
tri <- getFrequency(removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = trigram)), 0.9999))
triStop <- getFrequency(removeSparseTerms(TermDocumentMatrix(corpusWithStopWords, control = list(tokenize = trigram)), 0.9999))
Now we will plot the top 20 unigrams, bigrams and trigrams and compare their differences and frequencies.
library(ggplot2)
plotNgram <- function(data, label) {
    ggplot(data[1:20,], aes(reorder(word, -freq), freq)) +
        labs(x = label, y = "Frequency") +
        theme(panel.grid.major = element_blank(), panel.background = element_blank(),
              axis.text.x = element_text(angle = 75, size = 12, hjust = 1)) +
        geom_bar(stat = "identity", fill = I("blue"))
}
plotNgram(uniStop, "unigrams with stop words")
plotNgram(uni, "unigrams without stop words")
plotNgram(biStop, "bigrams with stop words")
plotNgram(bi, "bigrams without stop words")
plotNgram(triStop, "trigrams with stop words")
plotNgram(tri, "trigrams without stop words")
It is fascinating to see how the distribution of the n-grams differs with and without stop words. During the cleaning process we also noticed some snippets of text we thought we had removed, like ’re, ’s and so forth, which shows that either there are apostrophe variants we have not thought of, or our regular expressions are not completely correct. Furthermore, the frequencies of uni-, bi- and trigrams differ between the sets with and without stop words, and they decrease with the size of the n-gram. We can see that stop words play an important role for predicting the next word, as they are much more frequent than non-stop words. We still need to figure out how big our sample has to be in order to cover a reasonable share of the words we would like to predict, so that memory size and runtime stay at a reasonable optimum. Consequently, it would be interesting to see whether we can attain similar results with an even smaller sample.
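As a first, rough sketch of how such a coverage check could look (assuming the unigram frequency table uniStop from above is still in memory; the helper coverage is purely illustrative and not part of the report's pipeline), one could compute how many of the most frequent words are needed to cover a given share of all word instances in the sample:
# sketch: cumulative word coverage based on the unigram frequencies with stop words
coverage <- function(freqTable, threshold) {
    cumulative <- cumsum(freqTable$freq) / sum(freqTable$freq)
    # index of the first word at which the cumulative share reaches the threshold
    which(cumulative >= threshold)[1]
}
coverage(uniStop, 0.5) # words needed to cover 50% of all word instances
coverage(uniStop, 0.9) # words needed to cover 90% of all word instances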
In the following days, we will get acquainted with n-gram and backoff models in order to build a solid prediction model.
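As a hedged sketch of the direction such a model could take (assuming the frequency tables biStop and triStop from above; the function predictNext is hypothetical and only illustrates a stupid-backoff-style lookup, not the final model), one could first look up the last two words in the trigram table and fall back to the bigram table when nothing matches:
# illustrative backoff lookup over the n-gram frequency tables built above
predictNext <- function(lastTwoWords, trigrams = triStop, bigrams = biStop) {
    # trigram step: entries starting with both previous words
    hits <- trigrams[grepl(paste0("^", lastTwoWords, " "), trigrams$word), ]
    if (nrow(hits) == 0) {
        # backoff step: bigram entries starting with the last word only
        lastWord <- tail(strsplit(lastTwoWords, " ")[[1]], 1)
        hits <- bigrams[grepl(paste0("^", lastWord, " "), bigrams$word), ]
    }
    if (nrow(hits) == 0) return(NA_character_)
    # the predicted word is the last token of the most frequent matching n-gram
    tail(strsplit(as.character(hits$word[1]), " ")[[1]], 1)
}
predictNext("one of") # hypothetical usage with two preceding words
Since the frequency tables are already sorted in decreasing order, taking the first matching row corresponds to picking the most frequent continuation.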