In this report I tried to understand what the data looks like and how it could be preprocessed and used to build prediction models. Here are some key takeaways:
Context matters. The sets of n-grams obtained from different sources differ considerably, so it is reasonable to use only the source most similar to the expected user's environment.
The filtering procedures used for the analysis should (and apparently can) be improved, since there are many 'false positives' in the filtered data set that can noticeably distort the predictions.
Although the sample used in this analysis was large enough for exploratory purposes, it is worth considering increasing the sample size for the final release.
The initial text corpora are quite large: each file contains about a million lines or more (first column) and more than 30 million words.
## [1] " 899288 37334690 en_US.blogs.txt"
## [1] " 1010242 34372720 en_US.news.txt"
## [1] " 2360148 30374206 en_US.twitter.txt"
For exploratory analysis and modelling we don't need such a large amount of data, although it may be useful for the final release. So we took only a small portion of the data: 10% of the lines from each file. This is sufficient for our current purposes and can be processed in reasonable time even on an ordinary PC. The data not included in the sample was stored separately (see split.sources in the Appendix).
Here are the characteristics of the files with the sample data:
## [1] " 89929 3735421 en_US.blogs_sample.txt"
## [1] " 101024 3432002 en_US.news_sample.txt"
## [1] " 236015 3033155 en_US.twitter_sample.txt"
All three corpora were processed separately to explore the differences in language and patterns for each source, but the strategy was the same in all cases. Every document in each corpus was broken down into sentences (since we don't need n-grams spanning two sentences) and filtered.
Profane words were removed, contractions like can’t or I’ll were substituted with special terms (as were dates), punctuation was stripped out, etc. (see the code in the Appendix).
Every corpus was broken down into uni-, bi-, and trigrams, and the resulting data was stored to disk in two variants: as a hash object and as a data.table.
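To give a sense of how the files used below fit together, here is a rough sketch of the per-corpus pipeline built from the functions listed in the Appendix; the actual driver script is not shown in this report, and the intermediate object and hash file names are illustrative.
twitter.text <- filter.text(readLines("en_US.twitter_sample.txt", warn = FALSE))
twitter.3g.hash <- get.ngrams(twitter.text, n = 3)       # hash variant
twitter.trigrams <- ngrams.to.dt(twitter.3g.hash)        # data.table variant
saveRDS(twitter.3g.hash, "twitter_3grams_hash.rds")      # assumed file name
saveRDS(twitter.trigrams, "twitter_3grams_dt.rds")       # the file loaded below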
twitter.trigrams <- readRDS("twitter_3grams_dt.rds")
setkey(twitter.trigrams, freq)
most.frequent.ngrams(twitter.trigrams)
## keys freq
## 1: i know iIM 87
## 2: iIM pretty sure 86
## 3: i guess i 86
## 4: right now i 85
## 5: i thought i 83
## 6: looking forward seeing 81
## 7: have great weekend 77
## 8: have great day 76
## 9: feel like i 73
## 10: last night i 70
## 11: i know right 70
## 12: i just wanna 68
## 13: i can see 66
## 14: i really like 65
## 15: i look forward 65
## 16: i wanna go 64
## 17: i need get 64
## 18: i love u 63
## 19: i hope i 62
## 20: r i p 60
blogs.trigrams <- readRDS("blogs_3grams_dt.rds")
setkey(blogs.trigrams, freq)
most.frequent.ngrams(blogs.trigrams)
## keys freq
## 1: i can tell 75
## 2: of course i 73
## 3: new york times 70
## 4: i thought iAPOSTROPHED 70
## 5: i say i 69
## 6: i just want 67
## 7: last year i 66
## 8: feel like i 66
## 9: i will never 65
## 10: i really like 63
## 11: i told i 62
## 12: i just love 62
## 13: last week i 61
## 14: last time i 61
## 15: i can say 58
## 16: i really want 57
## 17: i came across 57
## 18: iIM pretty sure 56
## 19: you can also 55
## 20: in fact i 55
As can be seen above, each corpus yields a very different set of most frequent phrases, so it makes sense to build the final product on the corpus most similar to the expected user environment.
blogs.bigrams <- readRDS("blogs_2grams_dt.rds")
setkey(blogs.bigrams, freq)
most.frequent.ngrams(blogs.bigrams)
## keys freq
## 1: know i 657
## 2: i found 654
## 3: i wanted 649
## 4: i hope 644
## 5: now i 641
## 6: i made 631
## 7: you can 628
## 8: i never 616
## 9: i still 569
## 10: i need 561
## 11: when i 552
## 12: i went 546
## 13: i decided 524
## 14: iIM sure 517
## 15: right now 516
## 16: i say 506
## 17: years ago 505
## 18: new york 496
## 19: think i 488
## 20: i started 487
It is clearly seen from the figures above that increasing the length of the n-gram reduces even the counts of the most frequent combinations by an order of magnitude, and the number of terms found only once is very large even for bigrams: for example, the code below shows that only about 14% of the bigrams from the blogs corpus occur more than once. Such terms may be occasional and should not be used for prediction.
nrow(subset(blogs.bigrams, freq > 1))/nrow(blogs.bigrams)
## [1] 0.1402275
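One simple way to handle such occasional terms is to prune all n-grams seen only once before they reach the prediction model; a minimal sketch (not part of the original preprocessing code) could look like this:
prune.singletons <- function(dt) dt[freq > 1]   # keep only n-grams seen at least twice
blogs.bigrams.pruned <- prune.singletons(blogs.bigrams)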
Below are the functions used for preprocessing: reading the initial data, sampling, filtering, tokenizing, and saving the processed data.
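The functions rely on several packages; the ones assumed here are hash and data.table for storing the n-gram counts, NLP and openNLP for sentence detection, and tm for the stopword list:
library(hash)        # hash(), has.key(), keys(), values()
library(data.table)  # data.table(), setkey()
library(NLP)         # as.String(), annotate(), ngrams()
library(openNLP)     # Maxent_Sent_Token_Annotator()
library(tm)          # stopwords()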
The initial corpora are split into sample and complement sets and saved to disk.
split.sources <- function(filenames, sample.fraction){
    for (filename in filenames){
        if (file.exists(filename)){
            txt <- readLines(filename, warn = FALSE)
            inSample <- sample(1:length(txt),
                               size = round(sample.fraction * length(txt)),
                               replace = FALSE)
            txtSample <- txt[inSample]
            txtFilename <- paste0(sub("\\.txt$", "", filename), "_sample.txt")
            writeLines(txtSample, txtFilename)
            txtComplement <- txt[-inSample]
            txtFilename <- paste0(sub("\\.txt$", "", filename),
                                  "_complement.txt")
            writeLines(txtComplement, txtFilename)
        }
    }
}
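A hypothetical call reproducing the 10% sample described above (the seed value is illustrative and only matters for reproducibility):
set.seed(20140713)
split.sources(c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"),
              sample.fraction = 0.1)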
There are several filtering procedures, including substituting profane words with special terms, splitting blocks of text into sentences, masking contractions, dates, and times, and removing stopwords (articles, pronouns, etc.), punctuation, and extra spaces.
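The special substitution terms are assumed to be plain string constants defined elsewhere in the preprocessing script; for example, the profanity placeholder used below could be defined as:
SUBST_PROFANE <- "<PROFANE>"   # assumed value; the actual constant is not shown in this report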
profanity.filter <- function(x){
    if (!file.exists("profane_words.txt"))
        profaneWords <- c("anus", "bitch", "cunt", "butt", "damn", "fuck",
                          "bastard", "cock", "crap", "dick", "fag", "fart",
                          "whore", "nigger", "nigga", "queer", "pussy", "boobs",
                          "wank", "arse", "shit", "tits", "slut")
    else
        profaneWords <- readLines("profane_words.txt")
    for (word in profaneWords){
        pattern <- paste0("\\b[[:alnum:]]*?", word, "[[:alnum:]]*?\\b")
        x <- gsub(pattern, SUBST_PROFANE, x)
    }
    x
}
split.to.sentences <- function(text){
    sentence_token_annotator <- Maxent_Sent_Token_Annotator(language = "en")
    inIncrement <- 50
    inStart <- 1
    inEnd <- inStart + inIncrement
    textLen <- length(text)
    if (inEnd > textLen)
        inEnd <- textLen
    sentences <- character()
    while (inStart <= textLen){
        chunk <- text[inStart:inEnd]
        chunk <- as.String(chunk)
        sentence.boundaries <- annotate(chunk, sentence_token_annotator)
        sentences <- c(sentences, chunk[sentence.boundaries])
        inStart <- inStart + inIncrement + 1
        inEnd <- inStart + inIncrement
        if (inEnd > textLen)
            inEnd <- textLen
    }
    sentences
}
remove.urls <- function(text)
    gsub("https?://.*?([[:space:]]|>|$)", "<URL> ", text, ignore.case = TRUE)
remove.time <- function(text)
    gsub("\\b([0-2]?[0-9]:[0-9]{2}([[:space:]]am|[[:space:]]pm)?|[0-9]{1,2}[[:space:]]?(am|pm))\\b",
         "<TIME>", text, ignore.case = TRUE)
remove.date <- function(text){
    date.formats <- c(
        "\\b[0-3]?[0-9]/[0-3]?[0-9]/[0-9]{2,4}\\b",
        "(january|february|march|april|may|june|july|august|september|october|november|december)[[:space:]]+[0-9]{1,2}(,[[:space:]]+[0-9]{4})?",
        "(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[[:space:]]+[0-9]{1,2}(,[[:space:]]+[0-9]{4})?"
    )
    gsub(paste(date.formats, collapse = "|"), "<DATE>", text,
         ignore.case = TRUE)
}
substitute.contractions <- function(text){
    text <- gsub("o['’]clock", "<OCLOCK>", text, ignore.case = TRUE)
    text <- gsub("['’]m", "IM", text, ignore.case = TRUE)
    text <- gsub("['’]ll", "APOSTROPHELL", text, ignore.case = TRUE)
    text <- gsub("['’]d", "APOSTROPHED", text, ignore.case = TRUE)
    text <- gsub("['’]s", "APOSTROPHES", text, ignore.case = TRUE)
    text <- gsub("['’]t", "APOSTROPHET", text, ignore.case = TRUE)
    text <- gsub("['’]ve", "APOSTROPHEVE", text, ignore.case = TRUE)
    text <- gsub("['’]re", "APOSTROPHERE", text, ignore.case = TRUE)
    text <- gsub("['’]all", "<APOSTROPHEALL>", text, ignore.case = TRUE)
    text
}
remove.punctuation <- function(text)
    gsub("[[:punct:]]", " ", text)
remove.spaces <- function(text)
    gsub("[[:space:]]+", " ", text)
remove.stopwords <- function(text){
    pattern <- paste(stopwords(), collapse = "|")
    pattern <- paste0("\\b(", pattern, ")\\b")
    gsub(pattern, "", text)
}
remove.numbers <- function(text){
    gsub("[0-9]", "", text)
}
filter.text <- function(text){
    text <- split.to.sentences(text)
    text <- remove.stopwords(text)
    text <- tolower(text)
    text <- substitute.contractions(text)
    text <- remove.punctuation(text)
    text <- profanity.filter(text)
    text <- remove.urls(text)
    text <- remove.date(text)
    text <- remove.time(text)
    text <- remove.numbers(text)
    text <- remove.spaces(text)
    text
}
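An illustrative call of the complete filter chain on one line of raw text (the exact result depends on the stopword list and the substitution constants above):
filter.text("I can't wait to see the New York Times on January 5, 2014 at 10:30 am!")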
In this case tokens are defined as n-grams. For the purposes of this analysis I used bi- and trigrams, but, of course, it is possible to get tokens of any length. Tokens are stored on disk in .RDS format in two variants: as hash objects (using the hash package) and as data.tables (using the package of the same name).
add.to.hash <- function(h, ngram){
    if (has.key(ngram, h))
        h[[ngram]] <- h[[ngram]] + 1
    else
        h[ngram] <- 1
    h
}
get.ngrams <- function(text, n = 2){
    freq.tokens <- hash()
    for (line in text){
        tokens <- ngrams(strsplit(line, " ", fixed = TRUE)[[1]], n)
        if (length(tokens) > 0)
            for (i in 1:length(tokens)){
                # collapse the i-th n-gram (a character vector) into a single key
                token <- paste(tokens[[i]], collapse = " ")
                add.to.hash(freq.tokens, token)
            }
    }
    freq.tokens
}
ngrams.to.dt <- function(h){
    k <- keys(h)
    v <- values(h)
    data.table(keys = k, freq = v)
}
most.frequent.ngrams <- function(dt, n = 20){
    setkey(dt, freq)           # sort ascending by frequency
    nr <- nrow(dt)
    dt[nr:(nr - n + 1), ]      # last n rows, i.e. the most frequent, in descending order
}
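To illustrate the two storage variants (object names here are hypothetical, assuming blogs.text already holds the filtered blogs sample): the hash is convenient for direct look-ups of a known n-gram, while the data.table is convenient for frequency-ordered queries.
blogs.2g.hash <- get.ngrams(blogs.text, n = 2)
blogs.2g.hash[["new york"]]          # direct count look-up in the hash variant
blogs.2g.dt <- ngrams.to.dt(blogs.2g.hash)
most.frequent.ngrams(blogs.2g.dt)    # frequency-ordered view of the data.table variant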