In this report we develop an initial exploratory analysis of the data, performed on a subset of the provided large corpus of text documents. The data is sampled and a frequency analysis is performed over single words, bi-grams and tri-grams. In this way we uncover some basic structure of the data that is useful to guide the subsequent modeling stages.
The data comes from a corpus called HC Corpora, which collects text in several languages from different sources, namely blogs, news and twitter. The data can be downloaded from the Coursera web site.
For the purpose of this initial exploration, we focus on the American English section of the corpora, which provides the data for the different sources in separate files. We first read those files into character vectors:
# data_dir is the data directory; it is defined in a chunk that is not shown
file_paths <- data.frame(source = c("blogs", "news", "twitter"),
                         orig = c(file.path(data_dir, "final", "en_US", "en_US.blogs.txt"),
                                  file.path(data_dir, "final", "en_US", "en_US.news.txt"),
                                  file.path(data_dir, "final", "en_US", "en_US.twitter.txt")),
                         sample = c(file.path(data_dir, "samples", "en_US.blogs.sample.txt"),
                                    file.path(data_dir, "samples", "en_US.news.sample.txt"),
                                    file.path(data_dir, "samples", "en_US.twitter.sample.txt")))
blogs <- readLines(file_paths$orig[1], skipNul = TRUE)
news <- readLines(file_paths$orig[2], skipNul = TRUE)
twitters <- readLines(file_paths$orig[3], skipNul = TRUE)
Here we report basic data summaries, in terms of word and line counts, for each of the three source files:
wordcount <- function(textlines) { sum(str_count(textlines, "\\S+")) }
data_summary <- data.frame(source = c("blogs", "news", "twitter"),
                           `word count` = c(wordcount(blogs),
                                            wordcount(news),
                                            wordcount(twitters)),
                           `line count` = c(length(blogs),
                                            length(news),
                                            length(twitters)))
data_summary
## source word.count line.count
## 1 blogs 37334131 899288
## 2 news 34372530 1010242
## 3 twitter 30373583 2360148
The three sources contribute roughly the same number of words. The blogs and news sources have approximately the same number of lines, while the twitter source has more than twice as many. This is not surprising, as twitter messages are limited to 140 characters, so many more lines are needed to reach a comparable word count.
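As a quick check of this explanation, we can compare the average line length across the three sources. The following is only a rough sketch reusing the character vectors loaded above; the exact figures depend on the downloaded snapshot of the corpus:
# average number of characters per line for each source;
# twitter lines are expected to be much shorter than blog or news lines
data.frame(source = c("blogs", "news", "twitter"),
           `mean chars per line` = c(mean(nchar(blogs)),
                                     mean(nchar(news)),
                                     mean(nchar(twitters))),
           check.names = FALSE)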
This exploratory analysis is performed on a random sample amounting to 1% of the original data. The code used for sampling can be found in the Appendix.
seed <- 2791
for(i in 1:length(file_paths$sample)) {
  orig <- as.character(file_paths$orig[i])
  sample <- as.character(file_paths$sample[i])
  if(!file.exists(sample)) {
    sampleFile(orig, sample, perc = 1, append = TRUE, seed = seed)
  }
}
samples <- bind_rows(data.frame(text = readLines(as.character(file_paths$sample[1]),
                                                 skipNul = TRUE), source = "blogs"),
                     data.frame(text = readLines(as.character(file_paths$sample[2]),
                                                 skipNul = TRUE), source = "news"),
                     data.frame(text = readLines(as.character(file_paths$sample[3]),
                                                 skipNul = TRUE), source = "twitters"))
Finally, we load a list of profane words collected for the English language:
badwords <- readLines("https://raw.githubusercontent.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en")
badwords <- data.frame(word = badwords[-length(badwords)]) # last element read is empty string
The text is converted to lowercase, and then numbers, punctuation, high-frequency words (i.e. stop words), profane words and twitter-specific elements are removed:
data(stop_words)
corpus <- samples %>%
unnest_tokens(word, text) %>%
anti_join(stop_words) %>% # remove stop words
filter(!str_detect(word, "^\\d+\\.*\\d*$")) %>% # remove numeric tokens (an alternative would be mutate(word = str_extract(word, "[a-z]+")))
anti_join(badwords) %>% # profanity filtering
filter(!str_detect(word, "^(rt|lol|[@#]\\S+)$")) # filter out re-tweets (rt), hash tags, twitter handles and common jargon (lol)
## Joining, by = "word"
## Joining, by = "word"
## Warning: Column `word` joining character vector and factor, coercing into
## character vector
The most common words in the whole sample:
corpus %>%
count(word, sort = TRUE) %>%
mutate(word = reorder(word,n)) %>%
top_n(20) %>%
ggplot(aes(word, n)) + geom_col() + xlab(NULL) + coord_flip()
## Selecting by n
This is the distribution of word frequencies:
corpus %>%
count(word, sort = TRUE) %>%
mutate(total = sum(n)) %>%
ggplot(aes(n/total)) + geom_histogram(show.legend = FALSE) + xlim(NA,0.0006)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 152 rows containing non-finite values (stat_bin).
This type of long-tailed distribution is expected in any corpus of natural language and is a classic example of Zipf's law: many words occur rarely, while few words occur very frequently. This holds even after removing the most common (i.e. stop) words of the language.
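A common way to visualize Zipf's law is to plot word frequency against frequency rank on logarithmic scales, where an approximately straight line with negative slope is expected. A minimal sketch, reusing the corpus object built above:
# rank-frequency plot on log-log scales: under Zipf's law the points
# fall approximately on a straight line with negative slope
corpus %>%
  count(word, sort = TRUE) %>%
  mutate(rank = row_number(), frequency = n / sum(n)) %>%
  ggplot(aes(rank, frequency)) + geom_line() +
  scale_x_log10() + scale_y_log10()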
Here we plot the cumulative frequency over the frequency-sorted dictionary extracted from the filtered corpus:
corpus %>%
count(word, sort = TRUE) %>%
mutate(rank = row_number(), total = sum(n), frequency = n / total) %>%
ggplot(aes(rank, cumsum(frequency))) + geom_line() +
scale_x_continuous(breaks=seq(0,50000,2500)) +
scale_y_continuous(breaks=seq(0,1,.1))
Interestingly, and in line with the previous findings, we can see that only a small fraction of the dictionary is needed to cover a significant portion of all word occurrences: about 4% of the unique words (~2000) cover 50% of all word instances in the corpus, and about 37% (~19000) cover 90%.
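These coverage figures can also be computed directly from the cumulative frequencies, instead of being read off the plot. A small sketch, again based on the corpus object:
# number of top-ranked unique words needed to cover 50% and 90% of all word occurrences
coverage <- corpus %>%
  count(word, sort = TRUE) %>%
  mutate(cum_frequency = cumsum(n) / sum(n))
c(`50%` = which(coverage$cum_frequency >= 0.5)[1],
  `90%` = which(coverage$cum_frequency >= 0.9)[1])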
Let’s find out if there are significant differences among the three sources of text considered here:
source_words <- corpus %>% count(source, word, sort = TRUE) %>% ungroup()
total_source_words <- source_words %>% group_by(source) %>% summarize(total = sum(n))
source_words <- left_join(source_words, total_source_words)
## Joining, by = "source"
ggplot(source_words, aes(n/total, fill = source)) +
geom_histogram(show.legend = FALSE) +
xlim(NA, 0.0009) +
facet_wrap(~source, ncol = 3, scales = "free_y")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 224 rows containing non-finite values (stat_bin).
These plots exhibit similar distributions for all the sources.
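A more quantitative check of this similarity is to correlate the relative word frequencies between sources. The sketch below is only indicative; it reuses the source_words data frame and assumes the source labels used when building the samples data frame ("blogs", "news", "twitters"):
# spread relative word frequencies by source and compute pairwise rank correlations
freq_by_source <- source_words %>%
  mutate(frequency = n / total) %>%
  select(source, word, frequency) %>%
  spread(source, frequency, fill = 0)
cor(freq_by_source[, c("blogs", "news", "twitters")], method = "spearman")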
Moving on to n-grams, we first tokenize by bi-grams:
bigrams <- samples %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2)
bigrams %>%
count(bigram, sort = TRUE)
## # A tibble: 489,481 x 2
## bigram n
## <chr> <int>
## 1 of the 4295
## 2 in the 4087
## 3 to the 2128
## 4 on the 2004
## 5 for the 1992
## 6 to be 1629
## 7 at the 1448
## 8 and the 1214
## 9 in a 1148
## 10 with the 1078
## # ... with 489,471 more rows
As one might expect, many of the most common bigrams are pairs of common (i.e. stop) words. Let's remove the cases where either word is a stop word, a profane word or a non-word, in order to inspect the bigrams that are both frequent and interesting:
bigrams_filtered <-
bigrams %>%
separate(bigram, c("word1", "word2"), sep = " ") %>%
filter(str_detect(word1, "^[a-z]+$") & str_detect(word2, "^[a-z]+$")) %>%
filter(!word2 %in% stop_words$word & !word1 %in% stop_words$word) %>%
filter(!word1 %in% badwords$word & !word2 %in% badwords$word)
dim(bigrams_filtered)
## [1] 146402 3
After filtering, we are left with a total of 146402 bigram occurrences. Let's examine their frequency distribution:
bigrams_filtered %>%
unite("bigram", c("word1","word2"), sep=" ") %>%
count(bigram, sort = TRUE) %>%
mutate(total = sum(n)) %>%
ggplot(aes(n/total)) + geom_histogram(show.legend = FALSE) + xlim(NA,0.0002)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 13 rows containing non-finite values (stat_bin).
Not surprisingly, we get another long right-tailed distribution, even more skewed than the one obtained with single words; we will have to take this into account when developing the n-gram based prediction algorithm (a quick comparison of singleton counts below illustrates the difference). Let's now plot the counts of the most common bi-grams:
bigrams_filtered %>%
unite("bigram", c("word1","word2"), sep=" ") %>%
count(bigram, sort = TRUE) %>%
mutate(bigram = reorder(bigram,n)) %>%
top_n(25) %>%
ggplot(aes(bigram, n)) + geom_col() + xlab(NULL) + coord_flip()
## Selecting by n
Interestingly, there are 6 US cities among the 25 most frequent bi-grams; the rest are dominated by word pairs that represent a single entity, e.g. health care, ice cream, vice president.
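To make the difference in skewness concrete, we can compare the fraction of unique words and of unique filtered bigrams that occur only once in the sample. This is only a rough sketch, reusing the corpus and bigrams_filtered objects from above; the exact proportions depend on the sample drawn:
# fraction of unique terms occurring exactly once (singletons)
word_counts <- corpus %>% count(word)
bigram_counts <- bigrams_filtered %>%
  unite("bigram", c("word1", "word2"), sep = " ") %>%
  count(bigram)
c(word_singletons = mean(word_counts$n == 1),
  bigram_singletons = mean(bigram_counts$n == 1))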
Let's now tokenize by tri-grams and inspect the most frequent ones. This time we do not filter out those containing stop words, as doing so would leave us with only triplets of very low frequency.
trigrams <- samples %>%
unnest_tokens(trigram, text, token = "ngrams", n = 3)
trigrams %>%
count(trigram, sort = TRUE)
## # A tibble: 876,506 x 2
## trigram n
## <chr> <int>
## 1 one of the 350
## 2 a lot of 286
## 3 thanks for the 219
## 4 to be a 192
## 5 as well as 161
## 6 going to be 158
## 7 i want to 155
## 8 out of the 153
## 9 the end of 145
## 10 i have a 136
## # ... with 876,496 more rows
There are nearly a million distinct tri-grams; compared with the almost half a million distinct bi-grams, this shows how rapidly the number of potential model parameters grows with the n-gram order n (the short tabulation after the next plot makes this explicit). Again, the set of the most frequent tri-grams is dominated by those containing common words.
trigrams %>%
count(trigram, sort = TRUE) %>%
mutate(trigram = reorder(trigram,n)) %>%
top_n(25) %>%
ggplot(aes(trigram, n)) + geom_col() + xlab(NULL) + coord_flip()
## Selecting by n
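The growth in vocabulary size with the n-gram order can be tabulated directly. The following is only a minimal sketch, assuming the samples data frame built earlier is still in memory; the exact counts will depend on the random sample drawn:
# number of distinct n-grams in the sample for n = 1, 2, 3
sapply(1:3, function(n) {
  samples %>%
    unnest_tokens(ngram, text, token = "ngrams", n = n) %>%
    distinct(ngram) %>%
    nrow()
})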
# get the number of lines in a file, using the external `wc` command
# (assumes a Unix-like system with `wc` available on the PATH)
numberOfLines <- function(fname) {
  if(!file.exists(fname)) {
    stop(paste("Cannot read file", fname))
  }
  # `wc` prints "lines words bytes filename"; take the first field after trimming
  as.integer(strsplit(trimws(system(paste("wc", fname), intern = TRUE)), " +")[[1]][1])
}
# sample a certain fraction of the lines of a file, and write them to an output file
#
# adapted from https://stat.ethz.ch/pipermail/r-help/2007-February/124812.html
#
# The basic premise of the approach below is that you are in effect
# creating a sequential file cache in an R object: you read large chunks
# of the source file into the cache, randomly select rows within the
# cache, and then write out the selected rows.
# Thus, if you can read 100,000 rows at once, you would need 9 reads of
# the source file and 9 writes of the target file.
# The key thing here is to ensure that the offsets within the cache and
# the corresponding random row values are properly aligned.
#
# WARNING
# works only with files that have no (or very few) empty lines
sampleFile <- function(ifname, ofname, perc = 1, append = TRUE, seed = 1234) {
  nlines <- numberOfLines(ifname)
  # generate the random row indices to keep
  set.seed(seed)
  sel <- sample(1:nlines, nlines * perc / 100)
  # set up a sequence for the cache chunks;
  # the chunk size is a 9th of the number of lines
  chunk_size <- floor(nlines / 9)
  cuts <- seq(0, nlines, chunk_size)
  # loop over the length of cuts, less 1
  for (i in seq(along = cuts[-1])) {
    # read a chunk_size row chunk, skipping the rows already covered;
    # the chunk might contain fewer than chunk_size lines if there are empty lines
    chunk <- scan(ifname, what = character(), sep = "\n", skip = cuts[i], nlines = chunk_size)
    # row indices covered by the current chunk
    rows <- (cuts[i] + 1):(cuts[i + 1])
    # are any of the randomly selected rows in the current chunk?
    # if so, extract them and write them out
    chunk.sel <- sel[which(sel %in% rows)]
    if (length(chunk.sel) > 0) {
      # positions of the selected rows within the current chunk
      chunk_index <- chunk.sel - cuts[i]
      # take into account that the chunk might have fewer than chunk_size lines
      write.rows <- chunk[chunk_index[chunk_index > 0 & chunk_index <= length(chunk)]]
      write(write.rows, ofname, append = append, sep = "\n")
    }
  }
}