In this report I briefly illustrate the exploratory analysis performed on three datasets, comprising text from blogs, news and tweets.
The ultimate goal is to produce a lightweight application able to predict the next word(s) given some preceding text, mimicking the predictive typing feature of the software keyboards of modern portable devices.
As a playground, a fairly substantial dataset was made available, comprising text from various heterogeneous sources (blogs, news, twitter). These datasets are the foundation for developing an understanding of language processing and, in turn, for devising a strategy to achieve the goal; perhaps more importantly, in practice they constitute our training and testing datasets.
I decided to invest a significant amount of time in exploring the data, and delved (too) deeply into data cleaning, assuming that this effort would pay off by making any algorithm more robust.
At this stage of the project I will mostly review my exploratory analysis of the data and outline my current thoughts about the strategy for developing the algorithm for the text-predicting application.
Performance issues: it is worth mentioning that one of the main challenges has been dealing smartly with the computational load, which turned out to be a serious limiting factor even on a powerful workstation.
I did not use the suggested tm suite for the heavy lifting and relied instead on perl and, in R, mainly on dplyr, NLP and RWeka.
My current thoughts about the strategy, still very much in flux, are that an n-gram-based approach would be the most effective.
In particular, I am leaning towards a weighted combination of 2-, 3-, 4- and 5-grams (linear interpolation), perhaps assisted by some additional information drawn from an analysis of the association of words within sentences or their distance within them.
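To fix ideas, the kind of linear interpolation I have in mind scores a candidate next word with a weighted sum of the conditional probabilities estimated from each n-gram order; written for a model up to 4-grams (the weights \(\lambda_k\), with \(\sum_k \lambda_k = 1\), would have to be tuned, e.g. on held-out data):
\[
\hat{P}(w_i \mid w_{i-3}, w_{i-2}, w_{i-1}) = \lambda_4\, P(w_i \mid w_{i-3}, w_{i-2}, w_{i-1}) + \lambda_3\, P(w_i \mid w_{i-2}, w_{i-1}) + \lambda_2\, P(w_i \mid w_{i-1}) + \lambda_1\, P(w_i)
\]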
An important issue that I have not yet had a chance to ponder sufficiently is the handling of “zeros”, i.e. words not included in the dictionary of the training set or, more importantly with an n-gram approach, words that have never been seen following a given (n-1)-gram. In practice, based on my readings, this problem is tackled with some form of smoothing, that is, assigning a probability to the “zeros” (and in turn re-allocating some probability mass away from the observed n-grams).
I have not yet had a chance to explore the feasibility and effectiveness of methods like Good-Turing or Stupid Backoff.
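Just to fix ideas, here is a minimal sketch of what a Stupid Backoff score could look like in R, assuming (hypothetically) that the n-gram counts of all orders are stored in a single named numeric vector; the back-off factor alpha = 0.4 is the value suggested by Brants et al. (2007):
# Sketch only: 'counts' is a hypothetical named numeric vector of n-gram counts of all
# orders, e.g. c("to" = 500, "to be" = 120, "to be able" = 25, ...).
sb_score <- function(word, context, counts, alpha = 0.4) {
    if (length(context) == 0) {
        # base case: relative frequency of the word among the 1-grams
        return(unname(counts[word]) / sum(counts[!grepl(" ", names(counts))]))
    }
    ngram  <- paste(c(context, word), collapse = " ")
    prefix <- paste(context, collapse = " ")
    if (!is.na(counts[ngram]) && !is.na(counts[prefix])) {
        # the full n-gram was observed: use its relative frequency given the context
        unname(counts[ngram] / counts[prefix])
    } else {
        # back off to a shorter context, discounted by alpha
        alpha * sb_score(word, context[-1], counts, alpha)
    }
}
# toy example
toy.counts <- c("to" = 500, "be" = 300, "able" = 40,
                "to be" = 120, "be able" = 30, "to be able" = 25)
sb_score("able", c("to", "be"), toy.counts)   # observed 3-gram: 25/120
sb_score("able", c("of", "be"), toy.counts)   # backs off to "be able": 0.4 * 30/300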
Libraries needed for data processing and plotting:
#-----------------------------
# NLP
library("tm")
library("SnowballC")
library("openNLP")
library("NLP")
# To help java fail less :-(
options( java.parameters = "-Xmx6g")
library("RWeka") # [NGramTokenizer], [Weka_control]
#-----------------------------
# general
library("dplyr")
library("magrittr")
library("devtools")
library("ggplot2")
library("gridExtra")
# library("RColorBrewer")
#-----------------------------
# my functions
source("./scripts/my_functions.R")
#-----------------------------
After a quick review of the data with various R functions and packages, I decided to perform some cleaning of the text with standard Linux command line tools.
The main task was to analyze the mix of individual characters present in the three datasets, with the goal of homogenizing and tidying up non-alphanumeric characters, such as quotes, which can come in different forms.
The method used is not elegant, but effective enough: a simple perl command deletes a set of regular characters, leaving a stream of “odd” characters that is subsequently parsed and cleaned to produce a list of odd characters sorted by their count.
perl -pe 's|[\d\w\$\,\.\!\?\(\);:\/\\\-=&%#_\~<>]||g; s|\s||g; s|[\^@"\+\*\[\]]||g;' | \
perl -pe "s/\'//g;" | \
egrep -v '^$' | \
split_to_singles.awk | \
sort -k 1 | uniq -c | sort -k 1nr
# split_to_singles.awk is a short awk script not worth including here (it's on GitHub)
The number of unique odd characters found in each dataset is 2159 for blogs, 310 for news, and 2087 for twitter.
The following is the census of odd characters appearing more than 500 times in each of the datasets (the full sorted lists are available on the GitHub repo in the data directory).
blogs news twitter
----------- ---------- ------------------------
387317 [’] 102911 [’] 27440 [“] 726 [»]
109154 [”] 48115 [—] 26895 [”] 718 [«]
108769 [“] 47090 [“] 11419 [’] 715 [😔]
50176 [–] 43992 [”] 5746 [♥] 686 [😉]
41129 […] 8650 [–] 5241 […] 680 [😳]
23836 [‘] 6991 [ø] 3838 [|] 639 [{]
18757 [—] 6723 [] 2353 [❤] 617 [•]
3963 [é] 6544 [] 2314 [–] 593 [‘]
2668 [£] 6267 [] 1799 [—] 578 [�]
1301 [′] 4898 [‘] 1333 [😊] 561 [💜]
914 [´] 3641 [] 1211 [👍] 560 [😃]
755 [″] 3319 [é] 1149 [😂] 544 [😏]
643 [€] 3062 […] 977 [é] 506 [☀]
624 [ā] 2056 [] 963 [😁] 503 [😜]
605 [½] 1408 [] 955 [☺]
598 [á] 1152 [�] 926 [😒]
582 [ö] 971 [•] 802 [`]
555 [è] 837 [½] 758 [😍]
518 [°] 711 [`] 751 [😘]
537 [ñ] 741 [}]
For this preliminary stage I decided not to worry about accented letters and characters from non-Latin alphabets (e.g. Asian scripts, emoji), but I thought it would be helpful to standardize a small set of very frequent characters whose “meaning” is substantially equivalent:
blogs news twitter TOTAL
quotes [‘] 23836 4898 593 = 29327
[’] 387317 102911 11419 = 501647
[“] 108769 47090 27440 = 183299
[”] 109154 43992 26895 = 180041
[«] 0 0 718 = 718
[»] 0 0 726 = 726
dashes [–] 50176 8650 2314 = 61140
[—] 18757 48115 1799 = 68671
ellipsis […] 41129 3062 5241 = 49432
The datasets were cleaned with these perl commands and saved.
perl -pe "s|’|\'|g; s|…|...|g; s|–|\-|g; s|—|\-|g; s|‘|\'|g;" $1 | \
perl -pe 's|«|"|g; s|»|"|g; s|”|"|g; s|“|"|g;'
The datasets are read in separately into character vectors, using a compact user-defined function (readByLine()) (see the Appendix for its short source).
in.blogs.CL <- readByLine("./data/en_US.blogs.CLEANED1.txt.gz", check_nl = FALSE, skipNul = TRUE)
in.news.CL <- readByLine("./data/en_US.news.CLEANED1.txt.gz", check_nl = FALSE, skipNul = TRUE)
in.twitter.CL <- readByLine("./data/en_US.twitter.CLEANED1.txt.gz", check_nl = FALSE, skipNul = TRUE)
Some basic statistics of the three datasets:
stats.blogs <- as.numeric(system("gzip -dc ./data/en_US.blogs.CLEANED1.txt.gz | wc | awk '{print $1; print $2; print $3}'", intern = TRUE))
stats.news <- as.numeric(system("gzip -dc ./data/en_US.news.CLEANED1.txt.gz | wc | awk '{print $1; print $2; print $3}'", intern = TRUE))
stats.twitter <- as.numeric(system("gzip -dc ./data/en_US.twitter.CLEANED1.txt.gz | wc | awk '{print $1; print $2; print $3}'", intern = TRUE))
stats.df <- data.frame( blogs = stats.blogs, news = stats.news, twitter = stats.twitter,
row.names = c("lines", "words", "characters"), stringsAsFactors = FALSE)
stats.df
## blogs news twitter
## lines 899288 1010242 2360148
## words 37334114 34365936 30359804
## characters 208763813 205300313 166962974
There are some common, customary operations performed on a text dataset before proceeding to analyze it.
Given that the goal is to predict words in a typing context, I think that removing stopwords does not make much sense.
Working with a text stripped of stopwords may be useful if one wanted the prediction algorithm to use some information about the association of words in sentences, which may help discriminate meaningfully between the different next-word candidates “proposed” by an n-gram-based algorithm.
Because of the context, I also do not think that removing punctuation would be wise or make much sense.
I have applied the other three transformations (lower-casing, number removal, whitespace stripping) to the data, as follows (btw, a big obligatory acknowledgement and thank you to Hadley Wickham and Stefan Bache for bringing us %>%!).
in.blogs.CL.cleaned <- tolower(in.blogs.CL) %>% removeNumbers() %>% stripWhitespace()
in.news.CL.cleaned <- tolower(in.news.CL) %>% removeNumbers() %>% stripWhitespace()
in.twitter.CL.cleaned <- tolower(in.twitter.CL) %>% removeNumbers() %>% stripWhitespace()
During my initial attempts, the problem of excessively short rows of text emerged immediately. In particular, because I decided to perform tokenization on individual sentences, not directly on individual rows, the tokenizer tripped and failed on empty “sentences” resulting from short rows.
I therefore decided to set a cutoff on the minimum acceptable row length. After some empirical testing and row-length analysis with command line tools (e.g. something like awk '{if(length <= 8){printf "%6d - %-s\n",NR,$0}}') I set the threshold at 6 characters.
nchar.min <- 6
nchar.blogs.CL <- nchar(in.blogs.CL.cleaned)
in.blogs.CL.cleaned <- in.blogs.CL.cleaned[nchar.blogs.CL > nchar.min]
nchar.news.CL <- nchar(in.news.CL.cleaned)
in.news.CL.cleaned <- in.news.CL.cleaned[nchar.news.CL > nchar.min]
nchar.twitter.CL <- nchar(in.twitter.CL.cleaned)
in.twitter.CL.cleaned <- in.twitter.CL.cleaned[nchar.twitter.CL > nchar.min]
It immediately became clear that analyzing the entire dataset requires fairly powerful computing resources and a lot of time, even on a very high-end laptop or workstation.
Therefore, for exploration and prototyping I have been working with a subset of 20% of the data of each type.
fraction <- 0.2
# for reproducibility set seed!
set.seed(6420)
idx.blogs <- sample(1:length(in.blogs.CL.cleaned), ceiling(fraction*length(in.blogs.CL.cleaned)))
idx.news <- sample(1:length(in.news.CL.cleaned), ceiling(fraction*length(in.news.CL.cleaned)))
idx.twitter <- sample(1:length(in.twitter.CL.cleaned), ceiling(fraction*length(in.twitter.CL.cleaned)))
sel.blogs <- in.blogs.CL.cleaned[idx.blogs]
sel.news <- in.news.CL.cleaned[idx.news]
sel.twitter <- in.twitter.CL.cleaned[idx.twitter]
As noted, after some tests I settled on an approach whereby n-gram tokenization is performed on separate individual sentences, instead of directly on individual rows as loaded from the dataset.
This is motivated by the fact that the tokenizer I adopted (the NGramTokenizer of the RWeka package, chosen because I found its performance more satisfactory) does not seem to stop its construction of n-grams at what are very likely sentence boundaries.
With next word prediction in mind, it makes a lot of sense to restrict n-grams to sequences of words within the boundaries of a sentence.
Therefore, after cleaning, transforming and filtering the data, the first real operation I perform is the annotation of sentences, for which I have been using the openNLP sentence annotator Maxent_Sent_Token_Annotator(), with its default settings.
sent_token_annotator <- Maxent_Sent_Token_Annotator()
sent_token_annotator
## An annotator inheriting from classes
## Simple_Sent_Token_Annotator Annotator
## with description
## Computes sentence annotations using the Apache OpenNLP Maxent sentence detector employing the default model for language 'en'.
I want the data in the form of a vector of individual sentences, so I opted for sapply() combined with a function that wraps the operations necessary to prepare a row of data for annotation, performs the annotation itself, and finally returns a vector of sentences (the short function is shown in the Appendix).
sel.blogs.sentences <- sapply(sel.blogs, FUN = find_sentences, USE.NAMES = FALSE) %>% unlist
sel.news.sentences <- sapply(sel.news, FUN = find_sentences, USE.NAMES = FALSE) %>% unlist
sel.twitter.sentences <- sapply(sel.twitter, FUN = find_sentences, USE.NAMES = FALSE) %>% unlist
N.sentences <- c(length(sel.blogs.sentences), length(sel.news.sentences), length(sel.twitter.sentences))
stats.df[4, ] <- as.numeric(5*N.sentences)
row.names(stats.df)[4] <- "sentences"
The stats table, with the added estimated number of sentences (because the analysis is on just 20% of the data, the tabulated number is 5*N.sentences), is as follows:
stats.df
## blogs news twitter
## lines 899288 1010242 2360148
## words 37334114 34365936 30359804
## characters 208763813 205300313 166962974
## sentences 2209980 1956430 3682545
round(stats.df[4, ]/stats.df[1, ], 3)
## blogs news twitter
## sentences 2.457 1.937 1.56
For the n-gram tokenization I have been using the RWeka tokenizer NGramTokenizer, passing to it a list of token delimiters.
I have been extracting n-grams for \(n = 1, 2, 3, 4, 5\). It turns out that the 1-grams seem to represent a better definition of words than what is produced by the WordTokenizer: for instance, the latter breaks don’t into two tokens, while the NGramTokenizer picks it up as a single 1-gram.
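As a quick illustration of this difference (purely a sketch; the behaviour of NGramTokenizer depends on the delimiter set, here the same one used for the tokenization below):
token_delim <- " \\t\\r\\n.!?,;\"()"   # same delimiter set used below
test.str <- "i don't know"
WordTokenizer(test.str)                # splits don't into two tokens
NGramTokenizer(test.str, Weka_control(min = 1, max = 1, delimiters = token_delim))   # keeps don't whole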
I have not been able to run NGramTokenizer on the full vector of sentences for each dataset: it fails with some variant of a memory-allocation error (which honestly does not make much sense to me, considering that I am running it on machines with 12GB of RAM).
So, I am processing the data in chunks of 100,000 sentences, as exemplified by this block of code (the n-grams data for the following section are loaded from a previously saved analysis).
token_delim <- " \\t\\r\\n.!?,;\"()"
nl.chunk <- 100000
N <- ceiling(length(sel.blogs.sentences)/nl.chunk)
#----- BLOGS ------------------------------------
end.blogs <- length(sel.blogs.sentences)
#----- 2-grams -----
cat(" *** Tokenizing : blogs : 2-grams ------------------------------------------------------------\n")
n2grams.blogs.1 <- NGramTokenizer(sel.blogs.sentences[1:100000],
Weka_control(min = 2, max = 2, delimiters = token_delim))
n2grams.blogs.2 <- NGramTokenizer(sel.blogs.sentences[100001:200000],
Weka_control(min = 2, max = 2, delimiters = token_delim))
n2grams.blogs.3 <- NGramTokenizer(sel.blogs.sentences[200001:300000],
Weka_control(min = 2, max = 2, delimiters = token_delim))
n2grams.blogs.4 <- NGramTokenizer(sel.blogs.sentences[300001:400000],
Weka_control(min = 2, max = 2, delimiters = token_delim))
n2grams.blogs.5 <- NGramTokenizer(sel.blogs.sentences[400001:end.blogs],
Weka_control(min = 2, max = 2, delimiters = token_delim))
#----- 3-grams -----
cat(" *** Tokenizing : blogs : 3-grams ------------------------------------------------------------\n")
n3grams.blogs.1 <- NGramTokenizer(sel.blogs.sentences[1:100000],
Weka_control(min = 3, max = 3, delimiters = token_delim))
n3grams.blogs.2 <- NGramTokenizer(sel.blogs.sentences[100001:200000],
Weka_control(min = 3, max = 3, delimiters = token_delim))
n3grams.blogs.3 <- NGramTokenizer(sel.blogs.sentences[200001:300000],
Weka_control(min = 3, max = 3, delimiters = token_delim))
n3grams.blogs.4 <- NGramTokenizer(sel.blogs.sentences[300001:400000],
Weka_control(min = 3, max = 3, delimiters = token_delim))
n3grams.blogs.5 <- NGramTokenizer(sel.blogs.sentences[400001:end.blogs],
Weka_control(min = 3, max = 3, delimiters = token_delim))
# Combining split N-grams vector
source("./scripts/combine_nXgrams_blogs.R")
From the n-gram vectors we can compute frequencies, which will be an important basis for the prediction algorithms.
For now we can take a peek at the most frequent 3-grams and 4-grams in the three datasets.
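The frequency tables printed below (n3g.blogs.freq, n4g.news.freq, etc.) were computed in a separate script and loaded from a previously saved run. As a minimal sketch (assuming the combined 3-gram vector for blogs, produced by the combining script above, is called n3grams.blogs), one such table could be built like this:
# Sketch: turn the combined vector of 3-grams into a frequency table sorted by count
n3g.blogs.freq <- as.data.frame(table(ngram = n3grams.blogs), stringsAsFactors = FALSE) %>%
    rename(count = Freq) %>%
    arrange(desc(count))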
print(cbind(head(n3g.blogs.freq, 20), head(n3g.news.freq, 20), head(n3g.twitter.freq, 20)),
print.gap = 1, right = FALSE)
## ngram count ngram count ngram count
## 1 one of the 4416 the united states 1324 thanks for the 7135
## 2 a lot of 3613 the first time 1249 thank you for 2590
## 3 to be a 2078 for the first 1021 i love you 2474
## 4 it was a 2076 more than <DOLLARAMOUNT> 1000 for the follow 2334
## 5 as well as 2067 the end the 896 for the rt 1311
## 6 some of the 1988 it would be 751 let me know 1301
## 7 the end of 1974 it was the 722 i do_not know 1265
## 8 out of the 1954 the fact that 690 i feel like 1179
## 9 be able to 1927 <DOLLARAMOUNT> - <DOLLARAMOUNT> 680 i wish i 1154
## 10 i want to 1882 this is the 679 thanks for following 1048
## 11 a couple of 1828 the rest the 676 you for the 1013
## 12 the fact that 1596 said he was 667 i can_not wait 968
## 13 this is a 1592 he said he 655 <HASHTAG> <HASHTAG> <HASHTAG> 963
## 14 the rest of 1539 i do_not think 652 how are you 960
## 15 going to be 1521 the new york 651 for the <HASHTAG> 958
## 16 part of the 1478 he said the 628 can_not wait for 919
## 17 i_am going to 1448 i do_not know 626 rt : i 915
## 18 i do_not know 1425 for more than 622 i think i 895
## 19 one of my 1408 the same time 578 if you want 867
## 20 i had to 1373 when he was 565 what do you 858
print(head(n4g.blogs.freq, 20), print.gap = 3, right = FALSE)
## ngram count
## 1 the end of the 1011
## 2 the rest of the 913
## 3 at the end of 872
## 4 at the same time 700
## 5 when it comes to 611
## 6 one of the most 610
## 7 to be able to 578
## 8 for the first time 565
## 9 in the middle of 519
## 10 if you want to 469
## 11 is one of the 462
## 12 i do_not want to 461
## 13 a bit of a 403
## 14 i was going to 395
## 15 on the other hand 393
## 16 i would like to 375
## 17 one of my favorite 350
## 18 as well as the 325
## 19 i was able to 304
## 20 is going to be 302
print(head(n4g.news.freq, 20), print.gap = 3, right = FALSE)
## ngram count
## 1 for the first time 791
## 2 more than <DOLLARAMOUNT> million 398
## 3 the first time since 195
## 4 more than <DOLLARAMOUNT> billion 150
## 5 for more than years 138
## 6 feet <DATE> for <DOLLARAMOUNT> 137
## 7 square feet <DATE> for 137
## 8 <DOLLARAMOUNT> million <DOLLARAMOUNT> million 136
## 9 for the most part 133
## 10 the past two years 132
## 11 told the associated press 132
## 12 the united states and 131
## 13 i do_not know if 126
## 14 the end the year 126
## 15 the end the day 124
## 16 g fat g saturated 118
## 17 the new york times 118
## 18 dow jones industrial average 114
## 19 be reached for comment 112
## 20 i do_not know what 112
print(head(n4g.twitter.freq, 20), print.gap = 3, right = FALSE)
## ngram count
## 1 thanks for the follow 1882
## 2 thanks for the rt 1031
## 3 thank you for the 916
## 4 for the first time 513
## 5 i wish i could 410
## 6 thanks for the <HASHTAG> 375
## 7 rt : rt : 358
## 8 thanks for the mention 358
## 9 let me know if 330
## 10 <HASHTAG> <HASHTAG> <HASHTAG> <HASHTAG> 322
## 11 that awkward moment when 299
## 12 what do you think 292
## 13 thank you much for 276
## 14 hope all is well 266
## 15 can_not wait for the 262
## 16 thanks for the shout 257
## 17 for the shout out 254
## 18 i thought it was 243
## 19 thank you for following 240
## 20 thank you for your 232
It is apparent that some work will be necessary on the validation of the n-grams, or better still on further text transformations, in particular for the twitter dataset, which “suffers” from the tendency to use shorthand slang (e.g. “rt” for “re-tweet”) that adds a lot of “noise” to the data.
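As a purely illustrative sketch of the kind of additional transformation this might require (not something applied yet; the object name sel.twitter.sentences.noslang is hypothetical), a standalone shorthand token like “rt” could be dropped before tokenization:
# Illustration only: drop the standalone "rt" token and tidy up the whitespace;
# a real cleanup pass would handle a longer list of shorthands (dropping or expanding them).
sel.twitter.sentences.noslang <- gsub("\\brt\\b", "", sel.twitter.sentences) %>%
    stripWhitespace() %>%
    trimws()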
These are two handy functions used in the analysis.
The second one, find_sentences(), is used with sapply() to annotate sentences, allowing me to work row by row instead of converting the whole dataset into one document.
#-----------------------------------------------------------------------------------------
# modified readLines
readByLine <- function(fname, check_nl = TRUE, skipNul = TRUE) {
if( check_nl ) {
cmd.nl <- paste("gzip -dc", fname, "| wc -l | awk '{print $1}'", sep = " ")
nl <- as.integer(system(cmd.nl, intern = TRUE)) # wc prints a string; coerce to integer for readLines()
} else {
nl <- -1L
}
con <- gzfile(fname, open = "r")
on.exit(close(con))
readLines(con, n = nl, skipNul = skipNul)
}
#-----------------------------------------------------------------------------------------
# to use w/ sapply for finer sentence splitting.
find_sentences <- function(x) {
s <- paste(x, collapse = " ") %>% as.String()
a <- NLP::annotate(s , sent_token_annotator)
as.vector(s[a])
}
#-----------------------------------------------------------------------------------------
NGramTokenizer in a loop: because the NGramTokenizer would fail with a java memory error when fed the full vector of sentences, but would run when fed chunks of 100,000 sentences, I thought that turning this into a basic loop, handling the splitting into chunks, collecting the output and finally returning just one vector of n-grams, would work and be more compact and smarter.
It turns out that it fails… and this puzzles me deeply.
Is R somehow handling the “stuff” in the loop in the same way it would if I run the tokenizer with the full vector?
Any clue?
nl.chunk <- 100000
N <- ceiling(length(sel.blogs.sentences)/nl.chunk)
alt.n3grams.blogs <- vector("list", N)
system.time({
for( i in 1:N ) {
# chunk boundaries for this iteration
n1 <- (i-1)*nl.chunk + 1
n2 <- min(i*nl.chunk, end.blogs)
cat(" ", i, n1, n2, "\n")
alt.n3grams.blogs[[i]] <- NGramTokenizer(sel.blogs.sentences[n1:n2],
Weka_control(min = 3, max = 3,
delimiters = token_delim))
}
})