In this report I tried to understand what the data looks like and how it could be preprocessed and used to build prediction models. Here are some key takeaways:
Context matters. The sets of n-grams obtained from different sources differ considerably, so it is reasonable to use only the source most similar to the expected user's environment.
The filtering procedures used for the analysis should (and apparently can) be improved, since there are many 'false positives' in the filtered data set that can noticeably distort the predictions.
Although the sample used in this analysis was large enough for exploratory purposes, it is worth considering increasing the sample size for the final release.
The initial text corpora are quite large: each file contains about a million lines or more (first column) and more than 30 million words.
## [1] " 899288 37334690 en_US.blogs.txt"
## [1] " 1010242 34372720 en_US.news.txt"
## [1] " 2360148 30374206 en_US.twitter.txt"
For exploratory analysis and modelling we don't need such a large amount of data, although it may be useful for the final release. So we took only a small portion of the data: 10% of the lines from each file. This is sufficient for our current purposes and can be processed in reasonable time even on an ordinary PC. The data not included in the sample was stored separately (see split.sources in the Appendix).
Here are the characteristics of the files with the sample data:
## [1] " 89929 3735421 en_US.blogs_sample.txt"
## [1] " 101024 3432002 en_US.news_sample.txt"
## [1] " 236015 3033155 en_US.twitter_sample.txt"
All three corpora were processed separately to explore the differences in language and patterns for each source, but the strategy was the same in all cases. Every document in each corpus was broken down into sentences (since we don't need n-grams spanning two sentences) and filtered.
Profane words were removed, contractions like can’t or I’ll were substituted with special terms (as were dates), punctuation was stripped out, etc. (see the code in the Appendix).
Every corpus was broken down into uni-, bi-, and trigrams, and the resulting data was stored to disk in two variants: as a hash object and as a data.table.
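To give a sense of how the files used below fit together, here is a rough sketch of the per-corpus pipeline built from the functions listed in the Appendix; the actual driver script is not shown in this report, and the intermediate object and hash file names are illustrative.
twitter.text <- filter.text(readLines("en_US.twitter_sample.txt", warn = FALSE))
twitter.3g.hash <- get.ngrams(twitter.text, n = 3)       # hash variant
twitter.trigrams <- ngrams.to.dt(twitter.3g.hash)        # data.table variant
saveRDS(twitter.3g.hash, "twitter_3grams_hash.rds")      # assumed file name
saveRDS(twitter.trigrams, "twitter_3grams_dt.rds")       # the file loaded below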
twitter.trigrams <- readRDS("twitter_3grams_dt.rds")
setkey(twitter.trigrams, freq)
most.frequent.ngrams(twitter.trigrams)
## keys freq
## 1: i know iIM 87
## 2: iIM pretty sure 86
## 3: i guess i 86
## 4: right now i 85
## 5: i thought i 83
## 6: looking forward seeing 81
## 7: have great weekend 77
## 8: have great day 76
## 9: feel like i 73
## 10: last night i 70
## 11: i know right 70
## 12: i just wanna 68
## 13: i can see 66
## 14: i really like 65
## 15: i look forward 65
## 16: i wanna go 64
## 17: i need get 64
## 18: i love u 63
## 19: i hope i 62
## 20: r i p 60
blogs.trigrams <- readRDS("blogs_3grams_dt.rds")
setkey(blogs.trigrams, freq)
most.frequent.ngrams(blogs.trigrams)
## keys freq
## 1: i can tell 75
## 2: of course i 73
## 3: new york times 70
## 4: i thought iAPOSTROPHED 70
## 5: i say i 69
## 6: i just want 67
## 7: last year i 66
## 8: feel like i 66
## 9: i will never 65
## 10: i really like 63
## 11: i told i 62
## 12: i just love 62
## 13: last week i 61
## 14: last time i 61
## 15: i can say 58
## 16: i really want 57
## 17: i came across 57
## 18: iIM pretty sure 56
## 19: you can also 55
## 20: in fact i 55
As can be seen above, each corpus yields a very different set of most frequent phrases, so it makes sense to build the final product on the corpus most similar to the expected user environment.
blogs.bigrams <- readRDS("blogs_2grams_dt.rds")
setkey(blogs.bigrams, freq)
most.frequent.ngrams(blogs.bigrams)
## keys freq
## 1: know i 657
## 2: i found 654
## 3: i wanted 649
## 4: i hope 644
## 5: now i 641
## 6: i made 631
## 7: you can 628
## 8: i never 616
## 9: i still 569
## 10: i need 561
## 11: when i 552
## 12: i went 546
## 13: i decided 524
## 14: iIM sure 517
## 15: right now 516
## 16: i say 506
## 17: years ago 505
## 18: new york 496
## 19: think i 488
## 20: i started 487
It is clearly seen from the figures above that increasing the length of the n-gram reduces even the counts of the most frequent combinations by an order of magnitude, and the number of terms found only once is very large even for bigrams: for example, the code below shows that only about 14% of the bigrams from the blogs corpus occur more than once. Such terms may be occasional and should not be used for prediction.
nrow(subset(blogs.bigrams, freq > 1))/nrow(blogs.bigrams)
## [1] 0.1402275
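One simple way to handle such occasional terms is to prune all n-grams seen only once before they reach the prediction model; a minimal sketch (not part of the original preprocessing code) could look like this:
prune.singletons <- function(dt) dt[freq > 1]   # keep only n-grams seen at least twice
blogs.bigrams.pruned <- prune.singletons(blogs.bigrams)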
Below are the functions used for preprocessing: reading the initial data, sampling, filtering, tokenizing, and saving the processed data.
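The functions rely on several packages; the ones assumed here are hash and data.table for storing the n-gram counts, NLP and openNLP for sentence detection, and tm for the stopword list:
library(hash)        # hash(), has.key(), keys(), values()
library(data.table)  # data.table(), setkey()
library(NLP)         # as.String(), annotate(), ngrams()
library(openNLP)     # Maxent_Sent_Token_Annotator()
library(tm)          # stopwords()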
The initial corpora are split into sample and complement sets and saved to disk.
split.sources <- function(filenames, sample.fraction){
    for (filename in filenames){
        if (file.exists(filename)){
            txt <- readLines(filename, warn = FALSE)
            inSample <- sample(1:length(txt),
                               size = round(sample.fraction * length(txt)),
                               replace = FALSE)
            txtSample <- txt[inSample]
            txtFilename <- paste0(sub("\\.txt$", "", filename), "_sample.txt")
            writeLines(txtSample, txtFilename)
            txtComplement <- txt[-inSample]
            txtFilename <- paste0(sub("\\.txt$", "", filename),
                                  "_complement.txt")
            writeLines(txtComplement, txtFilename)
        }
    }
}
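A hypothetical call reproducing the 10% sample described above (the seed value is illustrative and only matters for reproducibility):
set.seed(20140713)
split.sources(c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"),
              sample.fraction = 0.1)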
There are several filtering procedures, including substituting profane words with special terms, splitting blocks of text into sentences, masking contractions, dates, and times, and removing stopwords (articles, pronouns, etc.), punctuation, and extra spaces.
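The special substitution terms are assumed to be plain string constants defined elsewhere in the preprocessing script; for example, the profanity placeholder used below could be defined as:
SUBST_PROFANE <- "<PROFANE>"   # assumed value; the actual constant is not shown in this report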
profanity.filter <- function(x){
    if (!file.exists("profane_words.txt"))
        profaneWords <- c("anus", "bitch", "cunt", "butt", "damn", "fuck",
                          "bastard", "cock", "crap", "dick", "fag", "fart",
                          "whore", "nigger", "nigga", "queer", "pussy", "boobs",
                          "wank", "arse", "shit", "tits", "slut")
    else
        profaneWords <- readLines("profane_words.txt")
    for (word in profaneWords){
        pattern <- paste0("\\b[[:alnum:]]*?", word, "[[:alnum:]]*?\\b")
        x <- gsub(pattern, SUBST_PROFANE, x)
    }
    x
}
split.to.sentences <- function(text){
    sentence_token_annotator <- Maxent_Sent_Token_Annotator(language = "en")
    inIncrement <- 50
    inStart <- 1
    inEnd <- inStart + inIncrement
    textLen <- length(text)
    if (inEnd > textLen)
        inEnd <- textLen
    sentences <- character()
    while (inStart <= textLen){
        chunk <- text[inStart:inEnd]
        chunk <- as.String(chunk)
        sentence.boundaries <- annotate(chunk, sentence_token_annotator)
        sentences <- c(sentences, chunk[sentence.boundaries])
        inStart <- inStart + inIncrement + 1
        inEnd <- inStart + inIncrement
        if (inEnd > textLen)
            inEnd <- textLen
    }
    sentences
}
remove.urls <- function(text)
    gsub("https?://.*?([[:space:]]|>|$)", "<URL> ", text, ignore.case = TRUE)
remove.time <- function(text)
    gsub("\\b([0-2]?[0-9]:[0-9]{2}([[:space:]]am|[[:space:]]pm)?|[0-9]{1,2}[[:space:]]?(am|pm))\\b",
         "<TIME>", text, ignore.case = TRUE)
remove.date <- function(text){
    date.formats <- c(
        "\\b[0-3]?[0-9]/[0-3]?[0-9]/[0-9]{2,4}\\b",
        "(january|february|march|april|may|june|july|august|september|october|november|december)[[:space:]]+[0-9]{1,2}(,[[:space:]]+[0-9]{4})?",
        "(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[[:space:]]+[0-9]{1,2}(,[[:space:]]+[0-9]{4})?"
    )
    gsub(paste(date.formats, collapse = "|"), "<DATE>", text,
         ignore.case = TRUE)
}
substitute.contractions <- function(text){
    text <- gsub("o['’]clock", "<OCLOCK>", text, ignore.case = TRUE)
    text <- gsub("['’]m", "IM", text, ignore.case = TRUE)
    text <- gsub("['’]ll", "APOSTROPHELL", text, ignore.case = TRUE)
    text <- gsub("['’]d", "APOSTROPHED", text, ignore.case = TRUE)
    text <- gsub("['’]s", "APOSTROPHES", text, ignore.case = TRUE)
    text <- gsub("['’]t", "APOSTROPHET", text, ignore.case = TRUE)
    text <- gsub("['’]ve", "APOSTROPHEVE", text, ignore.case = TRUE)
    text <- gsub("['’]re", "APOSTROPHERE", text, ignore.case = TRUE)
    text <- gsub("['’]all", "<APOSTROPHEALL>", text, ignore.case = TRUE)
    text
}
remove.punctuation <- function(text)
    gsub("[[:punct:]]", " ", text)
remove.spaces <- function(text)
    gsub("[[:space:]]+", " ", text)
remove.stopwords <- function(text){
    pattern <- paste(stopwords(), collapse = "|")
    pattern <- paste0("\\b(", pattern, ")\\b")
    gsub(pattern, "", text)
}
remove.numbers <- function(text){
    gsub("[0-9]", "", text)
}
filter.text <- function(text){
    text <- split.to.sentences(text)
    text <- remove.stopwords(text)
    text <- tolower(text)
    text <- substitute.contractions(text)
    text <- remove.punctuation(text)
    text <- profanity.filter(text)
    text <- remove.urls(text)
    text <- remove.date(text)
    text <- remove.time(text)
    text <- remove.numbers(text)
    text <- remove.spaces(text)
    text
}
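An illustrative call of the complete filter chain on one line of raw text (the exact result depends on the stopword list and the substitution constants above):
filter.text("I can't wait to see the New York Times on January 5, 2014 at 10:30 am!")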
In this case tokens are defined as n-grams. For the purposes of this analysis I used bi- and trigrams, but, of course, it is possible to get tokens of any length. Tokens are stored on disk in .RDS format in two variants: as hash objects (using the hash package) and as data.tables (using the package of the same name).
add.to.hash <- function(h, ngram){
    if (has.key(ngram, h))
        h[[ngram]] <- h[[ngram]] + 1
    else
        h[ngram] <- 1
    h
}
get.ngrams <- function(text, n = 2){
    freq.tokens <- hash()
    for (line in text){
        tokens <- ngrams(strsplit(line, " ", fixed = TRUE)[[1]], n)
        if (length(tokens) > 0)
            for (i in 1:length(tokens)){
                # collapse the i-th n-gram (a character vector) into a single key
                token <- paste(tokens[[i]], collapse = " ")
                add.to.hash(freq.tokens, token)
            }
    }
    freq.tokens
}
ngrams.to.dt <- function(h){
    k <- keys(h)
    v <- values(h)
    data.table(keys = k, freq = v)
}
most.frequent.ngrams <- function(dt, n = 20){
    setkey(dt, freq)           # sort ascending by frequency
    nr <- nrow(dt)
    dt[nr:(nr - n + 1), ]      # last n rows, i.e. the most frequent, in descending order
}
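To illustrate the two storage variants (object names here are hypothetical, assuming blogs.text already holds the filtered blogs sample): the hash is convenient for direct look-ups of a known n-gram, while the data.table is convenient for frequency-ordered queries.
blogs.2g.hash <- get.ngrams(blogs.text, n = 2)
blogs.2g.hash[["new york"]]          # direct count look-up in the hash variant
blogs.2g.dt <- ngrams.to.dt(blogs.2g.hash)
most.frequent.ngrams(blogs.2g.dt)    # frequency-ordered view of the data.table variant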