This exercise is part of the Data Science Capstone: Swiftkey Project of Coursera’s Data Science specialization. The goal of this project is to build an algorithm that predicts the next word in a sequence of words and to present the results of this work in a Shiny app. This exercise uses data from HC Corpora. The zip file provided with this exercise contains English, German, Finnish and Russian data from Blogs, News and Twitter.
The focus of this interim Milestone Report is exploratory data analysis of the English corpus built from Blogs, News and Twitter. Since the goal is to build a predictive model for English text, only data from the English source files is used. For the NLP (Natural Language Processing) parts of this exercise the R packages tm and RWeka are used.
This milestone report describes the exploratory analysis of the data and presents directions for the prediction algorithm and the Shiny app. To count the words in the different sources, a function from the stringi package is used.
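As a quick illustration (not part of the pipeline itself), stri_count_words() from the stringi package returns one word count per element of a character vector:
# illustration only: one word count per text element
stri_count_words(c("This is a short sentence.", "And another one."))
## [1] 5 3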
# load R libraries
library(readr)
library(dplyr)
library(tidyr)
library(tm)
library(stringi)
library(RWeka)
library(ggplot2)
In the code below the zip file is downloaded (if it does not already exist in the current directory).
# URL and file name
zipUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zipFile <- "Coursera-SwiftKey.zip"
# download zip file (wb = binary mode) if it has not already been downloaded
if (!file.exists(zipFile)) {
download.file(zipUrl, zipFile, mode = "wb")
}
The next step extracts the content of the zip file to a specified subdirectory (only if this subdirectory does not yet exist in the current directory).
# unzip the zip file containing data if data directory does not exist
dataPath <- "Swiftkey-data"
if (!file.exists(dataPath)) {
unzip(zipFile, junkpaths = TRUE, exdir = dataPath)
}
In the code below, the en_US blogs, news and twitter texts are read in separately with the readLines() function. The R function iconv converts between character string encodings, in this case from UTF-8 to ASCII, to make sure only ASCII characters will be used.
# read in the three different files (only EN)
blogs <- readLines(paste0(dataPath, "//", "en_US.blogs.txt"), warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
news <- readLines(paste0(dataPath, "//", "en_US.news.txt"), warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines(paste0(dataPath, "//", "en_US.twitter.txt"), warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
# cleaning
blogs <- iconv(blogs, from = "UTF-8", to = "ASCII", sub = "")
news <- iconv(news, from = "UTF-8", to = "ASCII", sub = "")
twitter <- iconv(twitter, from = "UTF-8", to = "ASCII", sub = "")
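The effect of this conversion is easiest to see on a made-up string (illustration only): every character outside the ASCII range is simply dropped, not transliterated.
# illustration only: with sub = "" the accented characters are removed entirely
iconv("naïve café déjà vu", from = "UTF-8", to = "ASCII", sub = "")
# returns "nave caf dj vu"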
In the part below some preliminary exploration of the data is performed:
* file size
* nr of records per source
* max nr of characters per record per source
* total nr of characters per source
* total nr of words per source
c0 <- c("Blogs", "News", "Twitter")
# file size/source (in MB)
c1 <- NULL
c1[1] <- file.info(paste0(dataPath, "//", "en_US.blogs.txt"))$size / 1024^2
c1[2] <- file.info(paste0(dataPath, "//", "en_US.news.txt"))$size / 1024^2
c1[3] <- file.info(paste0(dataPath, "//", "en_US.twitter.txt"))$size / 1024^2
# nr of lines/source
c2 <- NULL
c2[1] <- length(blogs)
c2[2] <- length(news)
c2[3] <- length(twitter)
# total nr of characters/source
c3 <- NULL
c3[1] <- sum(nchar(blogs))
c3[2] <- sum(nchar(news))
c3[3] <- sum(nchar(twitter))
# max nr of characters per record/source
c4 <- NULL
c4[1] <- max(nchar(blogs))
c4[2] <- max(nchar(news))
c4[3] <- max(nchar(twitter))
# total number of words/source
c5 <- NULL
c5[1] <- sum(stri_count_words(blogs))
c5[2] <- sum(stri_count_words(news))
c5[3] <- sum(stri_count_words(twitter))
data.frame(source = c0, fsize = c1, lines = c2, chars = c3, maxch = c4, words = c5)
## source fsize lines chars maxch words
## 1 Blogs 200.4242 899288 206043906 40832 37510168
## 2 News 196.2775 77259 15615538 5760 2673480
## 3 Twitter 159.3641 2360148 161961555 140 30088605
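The same per-source summary could be built more compactly; the sketch below is an alternative formulation (not the code used above) that loops over a named list of the three sources:
# sketch only: build the lines/chars/maxch/words summary in one pass
sources <- list(Blogs = blogs, News = news, Twitter = twitter)
data.frame(
  lines = sapply(sources, length),
  chars = sapply(sources, function(x) sum(nchar(x))),
  maxch = sapply(sources, function(x) max(nchar(x))),
  words = sapply(sources, function(x) sum(stri_count_words(x)))
)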
To limit the use of memory and CPU time, I decided to work with samples. The sample dataset is constructed as a stratified composition of the three sources (a 1% sample of each).
# proportional sampling: .01
set.seed(123)
s_blogs <- sample(blogs, size = 0.01 * length(blogs))
s_news <- sample(news, size = 0.01 * length(news))
s_twitter <- sample(twitter, size = 0.01 * length(twitter))
sampleText <- c(s_blogs, s_news, s_twitter)
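A quick sanity check on the sample size (this assumes sample() truncates the fractional size argument, which the line counts reported above are consistent with):
# illustration only: expected number of lines in the combined 1% sample
floor(0.01 * 899288) + floor(0.01 * 77259) + floor(0.01 * 2360148)
## [1] 33365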
To free memory, the three full datasets and the per-source 1% samples were removed and garbage collection was called.
# free memory, garbage collection
rm(blogs, news, twitter)
rm(s_blogs, s_news, s_twitter)
rm(c0, c1,c2,c3,c4,c5)
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 784282 41.9 5767448 308.1 4501122 240.4
## Vcells 3940790 30.1 74330638 567.1 77361069 590.3
In this part we convert our sampleText into a corpus with the tm package. All text is changed to lowercase, and punctuation, numbers, URLs and extra whitespace are removed. Profanity words are removed as well. I decided not to remove stopwords and typos, as these may play a crucial role in predicting the next word(s). In the last step the corpus is converted to a regular dataframe.
# read in profanity words
profWords <- tolower(readLines("http://www.cs.cmu.edu/~biglou/resources/bad-words.txt"))
# tm
corpus <- Corpus(VectorSource(sampleText))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, content_transformer(removePunctuation))
corpus <- tm_map(corpus, content_transformer(removeNumbers))
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
corpus <- tm_map(corpus, content_transformer(removeURL))
corpus <- tm_map(corpus, removeWords, profWords)
# leave the stopwords in the corpus
# corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
# conversion to dataframe
corpus <- data.frame(text = get("content", corpus), stringsAsFactors = FALSE)
# check
glimpse(corpus)
## Observations: 33,365
## Variables: 1
## $ text <chr> "to sum this up when we are unjustly wounded by men let u...
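To make the cleaning steps concrete, here they are applied to a single made-up line (illustration only, using the character methods directly rather than tm_map):
# illustration only: the cleaning steps applied to one made-up line
x <- "Don't miss THIS: 3 tips at http://example.com!!"
x <- tolower(x)
x <- removePunctuation(x)            # "dont miss this 3 tips at httpexamplecom"
x <- removeNumbers(x)                # the digit disappears
x <- gsub("http[[:alnum:]]*", "", x) # the now punctuation-free url is dropped
x <- gsub("\\s+", " ", x)            # collapse the leftover whitespace
x                                    # lowercase words only: no punctuation, numbers or url left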
In this part so-called n-grams are produced. In our case an n-gram is a contiguous sequence of n words from a given sample of text (lines). A unigram is a sequence of length one, a bigram is a sequence of two words, a trigram of three words, and so on. Breaking up the text into tokens (in our case: words) is called tokenization in NLP. We use the RWeka package to tokenize our text.
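To illustrate what the tokenizer produces (illustration only, on a short example sentence):
# illustration only: all bigrams of a short sentence
NGramTokenizer("thanks for the follow", Weka_control(min = 2, max = 2))
# returns "thanks for", "for the", "the follow"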
In the code below the NGramTokenizer returns all unigrams in the text. In the next line (table(unigram)) this collection is deduplicated and a frequency is added. Next, the unigrams are ordered by decreasing frequency. Then the cumulative frequency, the frequency percentage and the cumulative frequency percentage are added to the dataframe.
# delimiters
delim <- " \\r\\n\\t.,;:\"()?!"
unigram <- NGramTokenizer(corpus, Weka_control(min = 1, max = 1, delimiters = delim))
unigram <- data.frame(table(unigram))
unigram <- unigram[order(unigram$Freq, decreasing = TRUE), ]
names(unigram) <- c("word", "freq")
unigram$word <- as.character(unigram$word)
# add frequency, cumulative frequency, percentage and cumulative percentage
unigram$cumfreq <- cumsum(unigram$freq)
unigram$freq_pct <- 100 * unigram$freq / sum(unigram$freq)
unigram$cumfreq_pct <- 100 * unigram$cumfreq / sum(unigram$freq)
# create a barplot
ggplot(data = unigram[1:15, ], aes(x = reorder(word, freq), y = freq)) +
xlab("tot-15 unigrams") +
ylab("unigram frequency") +
geom_bar(stat = "identity") +
coord_flip()
bigram <- NGramTokenizer(corpus, Weka_control(min = 2, max = 2, delimiters = delim))
bigram <- data.frame(table(bigram))
bigram <- bigram[order(bigram$Freq, decreasing = TRUE), ]
names(bigram) <- c("words", "freq")
bigram$words <- as.character(bigram$words)
# add frequency, cumulative frequency, percentage and cumulative percentage
bigram$cumfreq <- cumsum(bigram$freq)
bigram$freq_pct <- 100 * bigram$freq / sum(bigram$freq)
bigram$cumfreq_pct <- 100 * bigram$cumfreq / sum(bigram$freq)
# create a barplot
ggplot(data = bigram[1:15, ], aes(x = reorder(words, freq), y = freq)) +
xlab("tot-15 bigrams") +
ylab("bigram frequency") +
geom_bar(stat = "identity") +
coord_flip()
trigram <- NGramTokenizer(corpus, Weka_control(min = 3, max = 3, delimiters = delim))
trigram <- data.frame(table(trigram))
trigram <- trigram[order(trigram$Freq, decreasing = TRUE), ]
names(trigram) <- c("words", "freq")
trigram$words <- as.character(trigram$words)
# add frequency, cumulative frequency, percentage and cumulative percentage
trigram$cumfreq <- cumsum(trigram$freq)
trigram$freq_pct <- 100 * trigram$freq / sum(trigram$freq)
trigram$cumfreq_pct <- 100 * trigram$cumfreq / sum(trigram$freq)
# create a barplot
ggplot(data = trigram[1:15, ], aes(x = reorder(words, freq), y = freq)) +
xlab("tot-15 trigrams") +
ylab("trigram frequency") +
geom_bar(stat = "identity") +
coord_flip()
quadgram <- NGramTokenizer(corpus, Weka_control(min = 4, max = 4, delimiters = delim))
quadgram <- data.frame(table(quadgram))
quadgram <- quadgram[order(quadgram$Freq, decreasing = TRUE), ]
names(quadgram) <- c("words", "freq")
quadgram$words <- as.character(quadgram$words)
# add frequency, cumulative frequency, percentage and cumulative percentage
quadgram$cumfreq <- cumsum(quadgram$freq)
quadgram$freq_pct <- 100 * quadgram$freq / sum(quadgram$freq)
quadgram$cumfreq_pct <- 100 * quadgram$cumfreq / sum(quadgram$freq)
# create a barplot
ggplot(data = quadgram[1:15, ], aes(x = reorder(words, freq), y = freq)) +
xlab("tot-15 quadgrams") +
ylab("quadgram frequency") +
geom_bar(stat = "identity") +
coord_flip()
# separating the words in the bi-, tri- and quadgrams
bigram <- separate(bigram, words, c("word1", "word2"), sep = " ", remove = TRUE)
trigram <- separate(trigram, words, c("word1", "word2", "word3"), sep = " ", remove = TRUE)
quadgram <- separate(quadgram, words, c("word1", "word2", "word3", "word4"), sep = " ", remove = TRUE)
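With the words in separate columns, these frequency tables can be queried directly. As a rough sketch of where this is heading (predict_next is a hypothetical helper, not part of this report's code), the most frequent continuation of two given words can be read straight from the trigram table:
# sketch only: most frequent third word following two given words
predict_next <- function(w1, w2, n = 3) {
  cand <- trigram[trigram$word1 == w1 & trigram$word2 == w2, ]
  head(cand[order(cand$freq, decreasing = TRUE), "word3"], n)
}
predict_next("one", "of")   # in this sample "the" ranks first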
In the code below nrow(unigram) gives us the total number of different words after tokenization. The expression which.min(abs(unigram$cumfreq_pct-90)) returns the index (row number) whose cumulative frequency percentage is nearest to 90%. Dividing this index by the total number of unigrams gives us the coverage. So, 0.14 here means that we only need 14% of the distinct unigrams to account for 90% of all word occurrences.
glimpse(unigram)
## Observations: 44,254
## Variables: 5
## $ word <chr> "the", "to", "and", "a", "i", "of", "in", "you", "...
## $ freq <int> 29803, 19388, 15728, 15541, 15105, 12978, 10062, 8...
## $ cumfreq <int> 29803, 49191, 64919, 80460, 95565, 108543, 118605,...
## $ freq_pct <dbl> 4.3705373, 2.8432030, 2.3064729, 2.2790498, 2.2151...
## $ cumfreq_pct <dbl> 4.370537, 7.213740, 9.520213, 11.799263, 14.014374...
which.min(abs(unigram$cumfreq_pct-90)) / nrow(unigram)
## [1] 0.1423826
which.min(abs(unigram$cumfreq_pct-50)) / nrow(unigram)
## [1] 0.002598635
which.min(abs(unigram$cumfreq_pct-25)) / nrow(unigram)
## [1] 0.0003163556
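The same calculation can be wrapped in a small helper function (a hypothetical convenience, equivalent to the expressions above):
# sketch only: fraction of distinct n-grams needed to cover pct% of all occurrences
coverage <- function(ng, pct) which.min(abs(ng$cumfreq_pct - pct)) / nrow(ng)
coverage(unigram, 90)   # ~0.14: 14% of the distinct words cover 90% of usage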
glimpse(bigram)
## Observations: 329,385
## Variables: 6
## $ word1 <chr> "of", "in", "to", "for", "on", "to", "at", "i", "a...
## $ word2 <chr> "the", "the", "the", "the", "the", "be", "the", "h...
## $ freq <int> 2618, 2456, 1421, 1410, 1345, 1168, 902, 811, 769,...
## $ cumfreq <int> 2618, 5074, 6495, 7905, 9250, 10418, 11320, 12131,...
## $ freq_pct <dbl> 0.38392388, 0.36016694, 0.20838649, 0.20677337, 0....
## $ cumfreq_pct <dbl> 0.3839239, 0.7440908, 0.9524773, 1.1592507, 1.3564...
which.min(abs(bigram$cumfreq_pct-90)) / nrow(bigram)
## [1] 0.7929748
which.min(abs(bigram$cumfreq_pct-50)) / nrow(bigram)
## [1] 0.08475492
which.min(abs(bigram$cumfreq_pct-25)) / nrow(bigram)
## [1] 0.00669733
glimpse(trigram)
## Observations: 584,677
## Variables: 7
## $ word1 <chr> "thanks", "one", "a", "going", "i", "to", "looking...
## $ word2 <chr> "for", "of", "lot", "to", "want", "be", "forward",...
## $ word3 <chr> "the", "the", "of", "be", "to", "a", "to", "a", "t...
## $ freq <int> 219, 201, 194, 151, 129, 128, 121, 118, 110, 110, ...
## $ cumfreq <int> 219, 420, 614, 765, 894, 1022, 1143, 1261, 1371, 1...
## $ freq_pct <dbl> 0.03211591, 0.02947625, 0.02844971, 0.02214385, 0....
## $ cumfreq_pct <dbl> 0.03211591, 0.06159216, 0.09004187, 0.11218572, 0....
which.min(abs(trigram$cumfreq_pct-90)) / nrow(trigram)
## [1] 0.8833698
which.min(abs(trigram$cumfreq_pct-50)) / nrow(trigram)
## [1] 0.4168524
which.min(abs(trigram$cumfreq_pct-25)) / nrow(trigram)
## [1] 0.1252794
glimpse(quadgram)
## Observations: 664,757
## Variables: 8
## $ word1 <chr> "thanks", "is", "the", "at", "the", "i", "for", "a...
## $ word2 <chr> "for", "going", "rest", "the", "end", "cant", "the...
## $ word3 <chr> "the", "to", "of", "same", "of", "wait", "first", ...
## $ word4 <chr> "follow", "be", "the", "time", "the", "to", "time"...
## $ freq <int> 58, 46, 45, 41, 40, 39, 36, 34, 34, 32, 31, 31, 30...
## $ cumfreq <int> 58, 104, 149, 190, 230, 269, 305, 339, 373, 405, 4...
## $ freq_pct <dbl> 0.008505596, 0.006745818, 0.006599169, 0.006012577...
## $ cumfreq_pct <dbl> 0.008505596, 0.015251414, 0.021850583, 0.027863160...
which.min(abs(quadgram$cumfreq_pct-90)) / nrow(quadgram)
## [1] 0.8974212
which.min(abs(quadgram$cumfreq_pct-50)) / nrow(quadgram)
## [1] 0.4871028
which.min(abs(quadgram$cumfreq_pct-25)) / nrow(quadgram)
## [1] 0.2306542
The predictive algorithm to be built will rely on the n-gram distributions. Because the goal is to predict the next word, we will probably only use the n-grams with n > 1.
Next steps to take:
* minimize the dataset used by the algorithm, as it is meant to run on mobile devices with limited resources
* build a predictive algorithm
* build a Shiny app to implement the algorithm
* create a slide deck to present/publish the app