The objective of the Capstone Project is to build a SwiftKey-like tool. In the first assignment (this report) we download and explore the text corpus on which the tool will be built. This analysis will be followed by model building and putting together a Shiny app.
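The analysis relies on a handful of packages whose functions appear throughout the report (readr, dplyr, stringr, tm, RWeka, slam, ggplot2 and textcat); they are assumed to be installed and are loaded once up front.
# packages used throughout this report (assumed to be installed)
library(readr)     # read_lines()
library(dplyr)     # %>%, arrange(), desc()
library(stringr)   # str_detect()
library(tm)        # VCorpus(), tm_map(), TermDocumentMatrix()
library(RWeka)     # NGramTokenizer(), Weka_control()
library(slam)      # rollup()
library(ggplot2)   # plotting the n-gram frequencies
library(textcat)   # language detection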
First we download the dataset from the URL provided in the course assignment, then unzip it and check its contents and their sizes.
# download data from website and extract from zip archive in work directory
zip.file <- 'Coursera-SwiftKey.zip'
web.url <- 'https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip'
download.file(url = web.url, destfile = zip.file)
unzip(zipfile = zip.file)
# these are the extracted files
list.files('final')
[1] "de_DE" "en_US" "fi_FI" "ru_RU"
# we will focus on the English corpus
files <- list.files('final/en_US/')
# how big are our data files in MB?
sizes <- 'final/en_US/' %>% paste0(files) %>% file.size / 10^6
data.frame(files = files, size = round(sizes, digits = 2))
files size
1 en_US.blogs.txt 210.16
2 en_US.news.txt 205.81
3 en_US.twitter.txt 167.11
# load the text files into R
txt.blogs <- read_lines(file = 'final/en_US/en_US.blogs.txt')
txt.news <- read_lines(file = 'final/en_US/en_US.news.txt')
txt.twitter <- read_lines(file = 'final/en_US/en_US.twitter.txt')
Let’s look at how many characters and lines there are in each text file. Also, we’ll use this space to answer some questions from Quiz 1.
# calculate how many lines there are in each dataset
lines <- list(txt.blogs, txt.news, txt.twitter) %>% sapply(length)
# calculate how many characters there are in each file
# (nchar() is vectorized, so we can sum its result directly; we repeat the
#  call per file rather than abstracting it into a function)
nchar.blogs <- txt.blogs %>% nchar() %>% sum()
nchar.news <- txt.news %>% nchar() %>% sum()
nchar.twitter <- txt.twitter %>% nchar() %>% sum()
# format a data frame to display amounts of lines and characters
data.frame(files = files, chars = c(nchar.blogs, nchar.news, nchar.twitter),
lines = lines)
files chars lines
1 en_US.blogs.txt 206824505 899288
2 en_US.news.txt 203223159 1010242
3 en_US.twitter.txt 162096031 2360148
### Quiz 1
# lines containing 'love' vs 'hate' in the US Twitter data
love <- str_detect(txt.twitter, 'love') %>% sum()
hate <- str_detect(txt.twitter, 'hate') %>% sum()
# it is good to see more love than hate
love/hate
[1] 4.108592
# quiz question 5: biostats
grep(txt.twitter, pattern = 'biostats', value = TRUE)
[1] "i know how you feel.. i have biostats on tuesday and i have yet to study =/"
# quiz question 6
sum(str_detect(txt.twitter, 'A computer once beat me at chess, but it was no match for me at kickboxing'))
[1] 3
This section proved the most problematic. In order to prepare the corpus for bigram (and higher n-gram) tokenization, we have to use the VCorpus function instead of Corpus. In addition, passing tolower directly to tm_map converts the documents to plain character vectors, which the RWeka tokenizer no longer accepts; wrapping it in content_transformer() preserves the document structure and solved this issue. Finally, RWeka requires a compatible Java SDK, so the installed version 13 had to be downgraded to 11.0.1. This troubleshooting cost a significant amount of time.
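A toy illustration of the tolower behaviour described above (built on a tiny throwaway corpus, independent of the real data):
# toy corpus to show why content_transformer() is needed
toy <- VCorpus(VectorSource(c("One Example", "Another Example")))
class(tm_map(toy, tolower)[[1]])                       # "character": document structure lost
class(tm_map(toy, content_transformer(tolower))[[1]])  # still a PlainTextDocument: structure kept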
# subset the datasets for efficient processing
txt.cut <- c(sample(txt.blogs, 1000), sample(txt.news, 1000), sample(txt.twitter, 1000))
corpus <- txt.cut %>% VectorSource() %>% VCorpus() %>%
# clean corpus
tm_map(content_transformer(tolower)) %>%
tm_map(removePunctuation) %>%
tm_map(removeNumbers) %>%
# tm_map(removeWords, stopwords('english')) %>%
tm_map(stripWhitespace)
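As a quick sanity check of the cleaning steps (the exact text will vary, since the sample above is drawn without a fixed seed):
# peek at the beginning of the first cleaned document
corpus[[1]] %>% as.character() %>% strtrim(80)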
In this section let’s explore the frequency distributions of single words, bigrams and trigrams.
# define a function that wraps tokenization, data transformation and plotting
dngram <- function(n, corpus) {
  # tokenize the corpus into n-grams; force single-core processing,
  # which avoids known issues between tm's parallel backend and RWeka
  options(mc.cores = 1)
  corpus.ngram <- TermDocumentMatrix(corpus,
    control = list(tokenize = function(x) NGramTokenizer(x, Weka_control(min = n, max = n))))
  # collapse the TDM across documents and turn it into a frequency-sorted data frame
  ngram.mx <- as.matrix(rollup(corpus.ngram, MARGIN = 2,
                               na.rm = TRUE, FUN = sum))
  ngram.mx <- data.frame(word = rownames(ngram.mx),
                         freq = ngram.mx[, 1]) %>% arrange(desc(freq))
  # create a horizontal bar plot for the top 10 n-grams
  p <- ngram.mx %>% head(10) %>% ggplot(aes(reorder(word, freq), freq)) +
    geom_bar(stat = "identity") + coord_flip() +
    xlab(paste0(n, '-gram')) + ylab("Frequency")
  print(p)
  # return the full frequency table invisibly so it can be reused later
  invisible(ngram.mx)
}
# inspecting the top 1-grams is actually looking at the most frequent terms in the corpus
dngram(1, corpus)
# bigrams
dngram(2, corpus)
# trigrams
dngram(3, corpus)
How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?
# dngram() returns the full frequency-sorted 1-gram table invisibly,
# so its row count gives the number of unique words in the sample
nwords <- dngram(1, corpus) %>% nrow
# 50% of the unique words
nwords / 100 * 50
[1] 7029.5
# 90% of the unique words
nwords / 100 * 90
[1] 12653.1
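A more direct way to answer the coverage question is to walk down the frequency-sorted table and accumulate counts until 50% (or 90%) of all word instances in the sample are reached. A sketch, where coverage() is a small helper defined just for illustration and freq.1 is the table returned invisibly by dngram():
# sketch: number of unique words needed to cover a given share of word instances
coverage <- function(freq.table, share) {
  cum.share <- cumsum(freq.table$freq) / sum(freq.table$freq)
  which(cum.share >= share)[1]
}
freq.1 <- dngram(1, corpus)    # frequency-sorted 1-gram table
coverage(freq.1, 0.5)          # words needed to cover 50% of word instances
coverage(freq.1, 0.9)          # words needed to cover 90% of word instances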
How do you evaluate how many of the words come from foreign languages? Since we are working with a random sample of the dataset, and detecting the language of whole entries rather than individual words, we can use the resulting ratio as a rough approximation of the proportion of words coming from foreign languages.
# go over the corpus and detect the language of each entry
lang <- corpus %>% sapply(as.character) %>% textcat()
# store the number of entries detected as English
lang.english <- (lang %in% 'english') %>% sum
# store the total number of entries
lang.total <- lang %>% length
# compute the ratio of foreign vs English entries
1 - lang.english / lang.total
[1] 0.211
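Out of curiosity, we can also look at which languages textcat detected most often in the sample (the exact counts depend on the random sample):
# most frequently detected languages among the sampled entries
lang %>% table() %>% sort(decreasing = TRUE) %>% head()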
Can you think of a way to increase the coverage – identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases? Coverage could be increased by matching words against an external library/dictionary, by working with the set of unique words and pruning low-frequency terms, or by clustering words based on their semantics.
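As a rough illustration of the low-frequency pruning idea, one could drop words that occur only once in the sample and check how much of the dictionary, and how much coverage, remains. A sketch reusing the 1-gram frequency table returned by dngram():
# sketch: prune rare words (hapax legomena) from the dictionary
freq.1 <- dngram(1, corpus)                  # frequency-sorted 1-gram table
keep <- freq.1$freq > 1                      # keep words seen more than once
sum(keep) / nrow(freq.1)                     # fraction of the dictionary kept
sum(freq.1$freq[keep]) / sum(freq.1$freq)    # share of word instances still covered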