Synopsis

In this report, we carry out an initial exploratory analysis on a subset of the provided corpus of text documents. The data is sampled, and frequency analysis is performed over single words, bigrams, and trigrams. In this way we uncover some basic structure of the data that will guide the subsequent modeling stages.

Getting the Data

The data comes from a collection called HC Corpora, which gathers text in several languages from different sources, namely blogs, news articles, and Twitter. The data can be downloaded from the Coursera web site.

For the purpose of this initial exploration, we focus on the American English section of the corpora, which provides the data for each source in a separate file. We first read those files into character vectors:

# data_dir is the data directory: it's hidden
file_paths <- data.frame(source = c("blogs", "news", "twitter"),
                         orig = c(file.path(data_dir, "final", "en_US", "en_US.blogs.txt"), 
                                  file.path(data_dir, "final", "en_US", "en_US.news.txt"), 
                                  file.path(data_dir, "final", "en_US", "en_US.twitter.txt")), 
                         sample = c(file.path(data_dir, "samples", "en_US.blogs.sample.txt"), 
                                    file.path(data_dir, "samples", "en_US.news.sample.txt"), 
                                    file.path(data_dir, "samples", "en_US.twitter.sample.txt")))
blogs <- readLines(file_paths$orig[1], skipNul = TRUE)
news <- readLines(file_paths$orig[2], skipNul = TRUE)
twitters <- readLines(file_paths$orig[3], skipNul = TRUE)

Data Summary

Here we compute basic summaries, in terms of word and line counts, for each of the three source files:

wordcount <- function(textlines) { sum(str_count(textlines, "\\S+")) }
data_summary <- data.frame(source = c("blogs", "news", "twitter"), 
                           `word count` = c(wordcount(blogs), 
                                            wordcount(news), 
                                            wordcount(twitters)), 
                           `line count` = c(length(blogs), 
                                            length(news), 
                                            length(twitters)))

data_summary
##    source word.count line.count
## 1   blogs   37334131     899288
## 2    news   34372530    1010242
## 3 twitter   30373583    2360148

The three sources contribute roughly the same number of words. The blogs and news sources have approximately the same number of lines, while the twitter source has more than twice as many. This is not surprising, as Twitter messages are limited to 140 characters.
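
As a quick sanity check of this interpretation, we can compute the average number of words per line for each source; a minimal sketch reusing the data_summary object above (note that data.frame normalized the column names to word.count and line.count):

# average words per line per source: blogs and news use long lines,
# twitter lines are much shorter, consistent with the 140-character limit
data_summary$word.count / data_summary$line.count

From the counts in the table, this comes out at roughly 41 words per line for blogs, 34 for news, and 13 for twitter.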

Preprocessing and Loading the Data

This exploratory analysis is performed on a random sample amounting to 1% of the original data. The code used for sampling can be found in the Appendix.

seed <- 2791

for(i in 1:length(file_paths$sample)) {
  orig <- as.character(file_paths$orig[i])
  sample <- as.character(file_paths$sample[i])
  if(!file.exists(sample)) {
    sampleFile(orig, sample, perc = 1, append = TRUE, seed = seed)
  }
}

samples <- bind_rows(data.frame(text = readLines(as.character(file_paths$sample[1]), 
                                                 skipNul = TRUE), source = "blogs"),
                     data.frame(text = readLines(as.character(file_paths$sample[2]), 
                                                 skipNul = TRUE), source = "news"),
                     data.frame(text = readLines(as.character(file_paths$sample[3]), 
                                                 skipNul = TRUE), source = "twitters"))

Finally, we load the collected list of profane words for the English language:

badwords <- readLines("https://raw.githubusercontent.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en")
badwords <- data.frame(word = badwords[-length(badwords)])  # last element read is empty string

Word Frequency Analysis

Tokenizing and Filtering

Text is converted to lowercase, and then numbers, punctuation, high-frequency words (i.e. stop words), profane words, and Twitter-specific elements are removed:

data(stop_words)

corpus <- samples %>% 
  unnest_tokens(word, text) %>% 
  anti_join(stop_words) %>% # remove stop words
  filter(!str_detect(word, "^\\d+\\.*\\d*$")) %>% # remove numbers (alternatively: mutate(word = str_extract(word, "[a-z]+")))
  anti_join(badwords) %>% # profanity filtering
  filter(!str_detect(word, "^rt|lol|[@#]\\S+$")) # filter out re-tweets, hashtags, Twitter handles and other common jargon (e.g. "lol")
## Joining, by = "word"
## Joining, by = "word"
## Warning: Column `word` joining character vector and factor, coercing into
## character vector

Word Frequency Distribution

The most common words in the whole sample:

corpus %>%
  count(word, sort = TRUE) %>%
  mutate(word = reorder(word,n)) %>%
  top_n(20) %>%
  ggplot(aes(word, n)) + geom_col() + xlab(NULL) + coord_flip()
## Selecting by n

This is the distribution of word frequencies:

corpus %>%
  count(word, sort = TRUE) %>%
  mutate(total = sum(n)) %>%
  ggplot(aes(n/total)) + geom_histogram(show.legend = FALSE) + xlim(NA,0.0006)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 152 rows containing non-finite values (stat_bin).

This type of long-tailed distribution is expected in any corpus of natural language and is a classic example of Zipf’s law: many words occur rarely, while few words occur very frequently. This holds even after removing the most common (i.e. stop) words of the language.
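
To make the connection with Zipf’s law more explicit, we can plot term frequency against rank on log-log axes, where the law predicts an approximately straight line; a minimal sketch reusing the corpus object above:

# rank words by frequency and plot frequency vs. rank on log-log scales
corpus %>%
  count(word, sort = TRUE) %>%
  mutate(rank = row_number(), frequency = n / sum(n)) %>%
  ggplot(aes(rank, frequency)) + geom_line() +
  scale_x_log10() + scale_y_log10()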

Here we plot the cumulative frequency of the frequency-sorted dictionary extracted from the filtered corpus:

corpus %>%
  count(word, sort = TRUE) %>%
  mutate(rank = row_number(), total = sum(n), frequency = n / total) %>%
  ggplot(aes(rank, cumsum(frequency))) + geom_line() + 
         scale_x_continuous(breaks=seq(0,50000,2500)) + 
         scale_y_continuous(breaks=seq(0,1,.1))

Interestingly, and in line with the previous findings, only a small fraction of the dictionary is needed to cover a significant portion of all word occurrences: ~4% (~2000) of the unique words cover 50%, and ~37% (~19000) cover 90% of all word instances in the corpus.
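
These coverage figures can be computed directly from the cumulative distribution; a minimal sketch, again reusing the corpus object (word_coverage is just a throwaway name used here):

word_coverage <- corpus %>%
  count(word, sort = TRUE) %>%
  mutate(cum_frac = cumsum(n) / sum(n))
sum(word_coverage$cum_frac < 0.5) + 1  # unique words needed to cover 50% of occurrences
sum(word_coverage$cum_frac < 0.9) + 1  # unique words needed to cover 90% of occurrences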

Let’s find out if there are significant differences among the three sources of text considered here:

source_words <- corpus %>% count(source, word, sort = TRUE) %>% ungroup()
total_source_words <- source_words %>% group_by(source) %>% summarize(total = sum(n))
source_words <- left_join(source_words, total_source_words)
## Joining, by = "source"
ggplot(source_words, aes(n/total, fill = source)) + 
  geom_histogram(show.legend = FALSE) + 
  xlim(NA, 0.0009) + 
  facet_wrap(~source, ncol = 3, scales = "free_y")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 224 rows containing non-finite values (stat_bin).

The plots exhibit similar word-frequency distributions across all three sources.
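
One way to make this comparison more quantitative is to spread the per-source relative frequencies into one column per source and correlate them pairwise; a minimal sketch, assuming the tidyr package is loaded and using the source labels from the samples data frame (freq_by_source is just a throwaway name):

# high pairwise correlations would back up the visual impression that
# the three sources use words in a similar way
freq_by_source <- source_words %>%
  mutate(frequency = n / total) %>%
  select(source, word, frequency) %>%
  spread(source, frequency)
cor(freq_by_source$blogs, freq_by_source$news, use = "complete.obs")
cor(freq_by_source$blogs, freq_by_source$twitters, use = "complete.obs")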

N-gram Frequency Analysis

Bigrams

First, we tokenize by bigrams:

bigrams <- samples %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
bigrams %>%
  count(bigram, sort = TRUE)
## # A tibble: 489,481 x 2
##      bigram     n
##       <chr> <int>
##  1   of the  4295
##  2   in the  4087
##  3   to the  2128
##  4   on the  2004
##  5  for the  1992
##  6    to be  1629
##  7   at the  1448
##  8  and the  1214
##  9     in a  1148
## 10 with the  1078
## # ... with 489,471 more rows

As one might expect, many of the most common bigrams are pairs of common (i.e. stop) words. Let’s remove the bigrams where either word is a stop word, a profane word, or a non-word, in order to inspect those that are both common and interesting:

bigrams_filtered <- 
  bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(str_detect(word1, "^[a-z]+$") & str_detect(word2, "^[a-z]+$")) %>% 
  filter(!word2 %in% stop_words$word & !word1 %in% stop_words$word) %>%
  filter(!word1 %in% badwords$word & !word2 %in% badwords$word)
dim(bigrams_filtered)
## [1] 146402      3

After filtering, we end up with a total of 146,402 bigrams. Let’s examine their frequencies:

bigrams_filtered %>%
  unite("bigram", c("word1","word2"), sep=" ") %>%
  count(bigram, sort = TRUE) %>%
  mutate(total = sum(n)) %>%
  ggplot(aes(n/total)) + geom_histogram(show.legend = FALSE) + xlim(NA,0.0002)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 13 rows containing non-finite values (stat_bin).

Not surprisingly, we get another long right-tailed distribution, even more skewed than the one obtained with single words. We will have to take this into account when developing the n-gram based prediction algorithm. Let’s now plot the most common bigrams:

bigrams_filtered %>%
  unite("bigram", c("word1","word2"), sep=" ") %>%
  count(bigram, sort = TRUE) %>%
  mutate(bigram = reorder(bigram,n)) %>%
  top_n(25) %>%
  ggplot(aes(bigram, n)) + geom_col() + xlab(NULL) + coord_flip()
## Selecting by n

Interestingly, there are 6 US cities among the 25 most frequent bigrams; the rest is dominated by word pairs that represent a single entity, e.g. health care, ice cream, vice president.

Trigrams

Let’s tokenize by trigrams and inspect the most frequent ones. This time we don’t filter out those containing stop words, as doing so would leave only triplets with very low frequencies.

trigrams <- samples %>% 
  unnest_tokens(trigram, text, token = "ngrams", n = 3)
trigrams %>%
  count(trigram, sort = TRUE)
## # A tibble: 876,506 x 2
##           trigram     n
##             <chr> <int>
##  1     one of the   350
##  2       a lot of   286
##  3 thanks for the   219
##  4        to be a   192
##  5     as well as   161
##  6    going to be   158
##  7      i want to   155
##  8     out of the   153
##  9     the end of   145
## 10       i have a   136
## # ... with 876,496 more rows

There are nearly a million distinct trigrams which, together with the bigram counts above, shows how rapidly the number of parameters grows with the n-gram order n. Again, the set of the most frequent trigrams is dominated by those containing common words.

trigrams %>%
  count(trigram, sort = TRUE) %>%
  mutate(trigram = reorder(trigram,n)) %>%
  top_n(25) %>%
  ggplot(aes(trigram, n)) + geom_col() + xlab(NULL) + coord_flip()
## Selecting by n
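
To make the growth in model size noted above concrete, we can tabulate the number of distinct n-grams observed in the sample for n = 1, 2, 3; a minimal sketch (note that corpus is the stop-word-filtered unigram set, while the bigram and trigram sets are unfiltered):

# count distinct n-grams per order using dplyr's n_distinct()
c(unigrams = n_distinct(corpus$word),
  bigrams = n_distinct(bigrams$bigram),
  trigrams = n_distinct(trigrams$trigram))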

Predictive Text Modeling - Plan

Appendix

Number of Lines in a File

# get the number of lines in a file (relies on the external wc utility)
numberOfLines <- function(fname) {
  if(!file.exists(fname)) {
    stop(paste("Cannot read file", fname))
  }
  # "wc -l <file>" prints "<line count> <file>"; take the first field
  out <- system(paste("wc", "-l", fname), intern = TRUE)
  as.integer(strsplit(trimws(out), " +")[[1]][1])
}

Sample a Percentage of Lines in a File

# sample a certain fraction of the lines of a file, and write them to an output file
#
# adapted from https://stat.ethz.ch/pipermail/r-help/2007-February/124812.html
#
# The basic premise with this approach below, is that you are in effect
# creating a sequential file cache in an R object. Reading large chunks of
# the source file into the cache. Then randomly selecting rows within the
# cache and then writing out the selected rows.
# Thus, if you can read 100,000 rows at once, you would have 9 reads of
# the source file, and 9 writes of the target file.
# The key thing here is to ensure that the offsets within the cache and
# the corresponding random row values are properly set.
#
# WARN
#   works only for files with no (or very few) empty lines
sampleFile <- function(ifname, ofname, perc = 1, append = TRUE, seed = 1234) {
  nlines <- numberOfLines(ifname)
  
  # generate the random row values
  set.seed(seed)
  sel <- sample(1:nlines, nlines * perc / 100)
  
  # set up a sequence for the cache chunks,
  # chunk size is one ninth of the number of lines
  chunk_size <- floor(nlines/9)
  cuts <- seq(0, nlines, chunk_size)
  
  # loop over the length of cuts, less 1
  for ( i in seq(along = cuts[-1]) ) {
    # get a chunk_size row chunk, skipping rows
    # as appropriate for each subsequent chunk
    # might get less then chunk_size lines, if there are empty lines
    chunk <- scan(ifname, what = character(), sep = "\n", skip = cuts[i], nlines = chunk_size)
    
    # set up a row sequence for the current chunk
    rows <- (cuts[i]+1):(cuts[i+1])
    
    # are any of the random values in the current chunk?
    # if so, get them and write them out
    chunk.sel <- sel[sel %in% rows]
    if(length(chunk.sel) > 0) {
      # convert global line numbers to indices within the chunk, capping them
      # at length(chunk) since the chunk may hold fewer than chunk_size lines
      # when empty lines were skipped by scan
      chunk_index <- chunk.sel - cuts[i]
      write.rows <- chunk[chunk_index[chunk_index <= length(chunk)]]
      write(write.rows, ofname, append = append, sep = "\n")
    }
  }
}