In this report I briefly illustrate the exploratory analysis performed on three datasets, comprising text from blogs, news and tweets.
The ultimate goal is to produce a lightweight application able to predict the next word(s) given some preceding text, mimicking the predictive typing feature of the software keyboards of modern portable devices.
As a playground, a fairly substantial dataset was made available, comprising text from various heterogeneous sources (blogs, news, twitter). These datasets are the foundation for developing an understanding of language processing and, in turn, for devising a strategy to achieve the goal; perhaps more importantly, in practice they constitute our training and testing datasets.
I decided to invest a significant amount of time in exploring the data, and delved (too) deeply into data cleaning, assuming that this effort would pay off by making any algorithm more robust.
At this stage of the project I will mostly review my exploratory analysis of the data and outline my current thoughts about the strategy for developing the algorithm for the text-predicting application.
Performance issues: it is worth mentioning that one of the main challenges has been dealing smartly with the computational load, which turned out to be a serious limiting factor even on a powerful workstation.
I did not use the suggested tm suite for the heavy lifting and relied instead on perl and, in R, mainly on dplyr, NLP and RWeka.
My current thoughts about the strategy, still very much in flux, are that an n-gram-based approach would be the most effective.
In particular, I am leaning towards a weighted combination of 2-, 3-, 4- and 5-grams (linear interpolation), perhaps assisted by some additional information drawn from an analysis of the association of words within sentences or their distance within them.
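To fix ideas, the kind of linear interpolation I have in mind scores a candidate next word with a weighted sum of the conditional probabilities estimated from each n-gram order; written for a model up to 4-grams (the weights \(\lambda_k\), with \(\sum_k \lambda_k = 1\), would have to be tuned, e.g. on held-out data):
\[
\hat{P}(w_i \mid w_{i-3}, w_{i-2}, w_{i-1}) = \lambda_4\, P(w_i \mid w_{i-3}, w_{i-2}, w_{i-1}) + \lambda_3\, P(w_i \mid w_{i-2}, w_{i-1}) + \lambda_2\, P(w_i \mid w_{i-1}) + \lambda_1\, P(w_i)
\]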
An important issue that I have not yet had a chance to ponder sufficiently is the handling of “zeros”, i.e. words not included in the dictionary of the training set or, more importantly with an n-gram approach, words that have never been seen following a given (n-1)-gram. In practice, based on my readings, this problem is tackled with some form of smoothing, that is, assigning a probability to the “zeros” (and in turn re-allocating some probability mass away from the observed n-grams).
I have not yet had a chance to explore the feasibility and effectiveness of methods like Good-Turing or Stupid Backoff.
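Just to fix ideas, here is a minimal sketch of what a Stupid Backoff score could look like in R, assuming (hypothetically) that the n-gram counts of all orders are stored in a single named numeric vector; the back-off factor alpha = 0.4 is the value suggested by Brants et al. (2007):
# Sketch only: 'counts' is a hypothetical named numeric vector of n-gram counts of all
# orders, e.g. c("to" = 500, "to be" = 120, "to be able" = 25, ...).
sb_score <- function(word, context, counts, alpha = 0.4) {
    if (length(context) == 0) {
        # base case: relative frequency of the word among the 1-grams
        return(unname(counts[word]) / sum(counts[!grepl(" ", names(counts))]))
    }
    ngram  <- paste(c(context, word), collapse = " ")
    prefix <- paste(context, collapse = " ")
    if (!is.na(counts[ngram]) && !is.na(counts[prefix])) {
        # the full n-gram was observed: use its relative frequency given the context
        unname(counts[ngram] / counts[prefix])
    } else {
        # back off to a shorter context, discounted by alpha
        alpha * sb_score(word, context[-1], counts, alpha)
    }
}
# toy example
toy.counts <- c("to" = 500, "be" = 300, "able" = 40,
                "to be" = 120, "be able" = 30, "to be able" = 25)
sb_score("able", c("to", "be"), toy.counts)   # observed 3-gram: 25/120
sb_score("able", c("of", "be"), toy.counts)   # backs off to "be able": 0.4 * 30/300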
Libraries needed for data processing and plotting:
#-----------------------------
# NLP
library("tm")
library("SnowballC")
library("openNLP")
library("NLP")
# To help java fail less :-(
options( java.parameters = "-Xmx6g")
library("RWeka") # [NGramTokenizer], [Weka_control]
#-----------------------------
# general
library("dplyr")
library("magrittr")
library("devtools")
library("ggplot2")
library("gridExtra")
# library("RColorBrewer")
#-----------------------------
# my functions
source("./scripts/my_functions.R")
#-----------------------------
After a quick review of the data with various R functions and packages, I decided to perform some cleaning of the text with standard Linux command line tools.
The main task was to analyze the mix of individual characters present in the three datasets, with the goal of homogenizing and tidying up non-alphanumeric characters, such as quotes, which can come in different forms.
The method used is not elegant, but effective enough: a simple perl command deletes a set of regular characters, leaving a stream of “odd” characters that is subsequently parsed and cleaned to produce a list of odd characters sorted by their count.
perl -pe 's|[\d\w\$\,\.\!\?\(\);:\/\\\-=&%#_\~<>]||g; s|\s||g; s|[\^@"\+\*\[\]]||g;' | \
perl -pe "s/\'//g;" | \
egrep -v '^$' | \
split_to_singles.awk | \
sort -k 1 | uniq -c | sort -k 1nr
# split_to_singles.awk is a short awk script not worth including here (it's on GitHub)
The number of unique odd characters found in each dataset is 2159 for blogs, 310 for news, and 2087 for twitter.
The following is the census of odd characters appearing more than 500 times in each of the datasets (the full sorted lists are available on the GitHub repo in the data directory).
blogs news twitter
----------- ---------- ------------------------
387317 [’] 102911 [’] 27440 [“] 726 [»]
109154 [”] 48115 [—] 26895 [”] 718 [«]
108769 [“] 47090 [“] 11419 [’] 715 [😔]
50176 [–] 43992 [”] 5746 [♥] 686 [😉]
41129 […] 8650 [–] 5241 […] 680 [😳]
23836 [‘] 6991 [ø] 3838 [|] 639 [{]
18757 [—] 6723 [] 2353 [❤] 617 [•]
3963 [é] 6544 [] 2314 [–] 593 [‘]
2668 [£] 6267 [] 1799 [—] 578 [�]
1301 [′] 4898 [‘] 1333 [😊] 561 [💜]
914 [´] 3641 [] 1211 [👍] 560 [😃]
755 [″] 3319 [é] 1149 [😂] 544 [😏]
643 [€] 3062 […] 977 [é] 506 [☀]
624 [ā] 2056 [] 963 [😁] 503 [😜]
605 [½] 1408 [] 955 [☺]
598 [á] 1152 [�] 926 [😒]
582 [ö] 971 [•] 802 [`]
555 [è] 837 [½] 758 [😍]
518 [°] 711 [`] 751 [😘]
537 [ñ] 741 [}]
For this preliminary stage I decided not to worry about accented letters and characters from non-Latin alphabets (e.g. Asian scripts, emoji), but I thought it would be helpful to standardize a small set of very frequent characters whose “meaning” is substantially equivalent:
blogs news twitter TOTAL
quotes [‘] 23836 4898 593 = 29327
[’] 387317 102911 11419 = 501647
[“] 108769 47090 27440 = 183299
[”] 109154 43992 26895 = 180041
[«] 0 0 718 = 718
[»] 0 0 726 = 726
dashes [–] 50176 8650 2314 = 61140
[—] 18757 48115 1799 = 68671
ellipsis […] 41129 3062 5241 = 49432
The datasets were cleaned with these perl commands and saved.
perl -pe "s|’|\'|g; s|…|...|g; s|–|\-|g; s|—|\-|g; s|‘|\'|g;" $1 | \
perl -pe 's|«|"|g; s|»|"|g; s|”|"|g; s|“|"|g;'
The datasets are read in separately into character vectors, using a compact user-defined function (readByLine()) (see the Appendix for its short source).
in.blogs.CL <- readByLine("./data/en_US.blogs.CLEANED1.txt.gz", check_nl = FALSE, skipNul = TRUE)
in.news.CL <- readByLine("./data/en_US.news.CLEANED1.txt.gz", check_nl = FALSE, skipNul = TRUE)
in.twitter.CL <- readByLine("./data/en_US.twitter.CLEANED1.txt.gz", check_nl = FALSE, skipNul = TRUE)
Some basic statistics of the three datasets:
stats.blogs <- as.numeric(system("gzip -dc ./data/en_US.blogs.CLEANED1.txt.gz | wc | awk '{print $1; print $2; print $3}'", intern = TRUE))
stats.news <- as.numeric(system("gzip -dc ./data/en_US.news.CLEANED1.txt.gz | wc | awk '{print $1; print $2; print $3}'", intern = TRUE))
stats.twitter <- as.numeric(system("gzip -dc ./data/en_US.twitter.CLEANED1.txt.gz | wc | awk '{print $1; print $2; print $3}'", intern = TRUE))
stats.df <- data.frame( blogs = stats.blogs, news = stats.news, twitter = stats.twitter,
row.names = c("lines", "words", "characters"), stringsAsFactors = FALSE)
stats.df
## blogs news twitter
## lines 899288 1010242 2360148
## words 37334114 34365936 30359804
## characters 208763813 205300313 166962974
There are some common, customary operations performed on a text dataset before proceeding to analyze it.
Given that the goal is to predict words in a typing context, I think that removing stopwords does not make much sense.
Working with a text stripped of stopwords may be useful if one wanted the prediction algorithm to use some information about the association of words in sentences, which may help discriminate meaningfully between the different next-word candidates “proposed” by an n-gram-based algorithm.
Because of the context, I also do not think that removing punctuation would be wise or make much sense.
I have applied the other three transformations (lower-casing, number removal, whitespace stripping) to the data, as follows (btw, a big obligatory acknowledgement and thank you to Hadley Wickham and Stefan Bache for bringing us %>%!).
in.blogs.CL.cleaned <- tolower(in.blogs.CL) %>% removeNumbers() %>% stripWhitespace()
in.news.CL.cleaned <- tolower(in.news.CL) %>% removeNumbers() %>% stripWhitespace()
in.twitter.CL.cleaned <- tolower(in.twitter.CL) %>% removeNumbers() %>% stripWhitespace()
During my initial attempts, the problem of excessively short rows of text emerged immediately. In particular, because I decided to perform tokenization on individual sentences, not directly on individual rows, the tokenizer tripped and failed on empty “sentences” resulting from short rows.
I therefore decided to set a cutoff on the minimum acceptable row length. After some empirical testing and row-length analysis with command line tools (e.g. something like awk '{if(length <= 8){printf "%6d - %-s\n",NR,$0}}') I set the threshold at 6 characters.
nchar.min <- 6
nchar.blogs.CL <- nchar(in.blogs.CL.cleaned)
in.blogs.CL.cleaned <- in.blogs.CL.cleaned[nchar.blogs.CL > nchar.min]
nchar.news.CL <- nchar(in.news.CL.cleaned)
in.news.CL.cleaned <- in.news.CL.cleaned[nchar.news.CL > nchar.min]
nchar.twitter.CL <- nchar(in.twitter.CL.cleaned)
in.twitter.CL.cleaned <- in.twitter.CL.cleaned[nchar.twitter.CL > nchar.min]
It immediately became clear that analyzing the entire dataset requires fairly powerful computing resources and a lot of time, even on a very high-end laptop or workstation.
Therefore, for exploration and prototyping I have been working with a subset of 20% of the data of each type.
fraction <- 0.2
# for reproducibility set seed!
set.seed(6420)
idx.blogs <- sample(1:length(in.blogs.CL.cleaned), ceiling(fraction*length(in.blogs.CL.cleaned)))
idx.news <- sample(1:length(in.news.CL.cleaned), ceiling(fraction*length(in.news.CL.cleaned)))
idx.twitter <- sample(1:length(in.twitter.CL.cleaned), ceiling(fraction*length(in.twitter.CL.cleaned)))
sel.blogs <- in.blogs.CL.cleaned[idx.blogs]
sel.news <- in.news.CL.cleaned[idx.news]
sel.twitter <- in.twitter.CL.cleaned[idx.twitter]
As noted, after some tests I settled on an approach whereby n-gram tokenization is performed on separate individual sentences, instead of directly on individual rows as loaded from the dataset.
This is motivated by the fact that the tokenizer I adopted (the NGramTokenizer of the RWeka package, chosen because I found its performance more satisfactory) does not seem to stop its construction of n-grams at what are very likely sentence boundaries.
With next word prediction in mind, it makes a lot of sense to restrict n-grams to sequences of words within the boundaries of a sentence.
Therefore, after cleaning, transforming and filtering the data, the first real operation I perform is the annotation of sentences, for which I have been using the openNLP sentence annotator Maxent_Sent_Token_Annotator(), with its default settings.
sent_token_annotator <- Maxent_Sent_Token_Annotator()
sent_token_annotator
## An annotator inheriting from classes
## Simple_Sent_Token_Annotator Annotator
## with description
## Computes sentence annotations using the Apache OpenNLP Maxent sentence detector employing the default model for language 'en'.
I want the data in the form of a vector of individual sentences, so I opted for sapply() combined with a function that wraps the operations necessary to prepare a row of data for annotation, performs the annotation itself, and finally returns a vector of sentences (the short function is shown in the Appendix).
sel.blogs.sentences <- sapply(sel.blogs, FUN = find_sentences, USE.NAMES = FALSE) %>% unlist
sel.news.sentences <- sapply(sel.news, FUN = find_sentences, USE.NAMES = FALSE) %>% unlist
sel.twitter.sentences <- sapply(sel.twitter, FUN = find_sentences, USE.NAMES = FALSE) %>% unlist
N.sentences <- c(length(sel.blogs.sentences), length(sel.news.sentences), length(sel.twitter.sentences))
stats.df[4, ] <- as.numeric(5*N.sentences)
row.names(stats.df)[4] <- "sentences"
The stats table, with the added estimated number of sentences (because the analysis is on just 20% of the data, the tabulated number is 5*N.sentences), is as follows:
stats.df
## blogs news twitter
## lines 899288 1010242 2360148
## words 37334114 34365936 30359804
## characters 208763813 205300313 166962974
## sentences 2209980 1956430 3682545
round(stats.df[4, ]/stats.df[1, ], 3)
## blogs news twitter
## sentences 2.457 1.937 1.56
For the n-gram tokenization I have been using the RWeka tokenizer NGramTokenizer, passing to it a list of token delimiters.
I have been extracting n-grams for \(n = 1, 2, 3, 4, 5\). It turns out that the 1-grams seem to represent a better definition of words than what is produced by the WordTokenizer: for instance, the latter breaks don’t into two tokens, while the NGramTokenizer picks it up as a single 1-gram.
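As a quick illustration of this difference (purely a sketch; the behaviour of NGramTokenizer depends on the delimiter set, here the same one used for the tokenization below):
token_delim <- " \\t\\r\\n.!?,;\"()"   # same delimiter set used below
test.str <- "i don't know"
WordTokenizer(test.str)                # splits don't into two tokens
NGramTokenizer(test.str, Weka_control(min = 1, max = 1, delimiters = token_delim))   # keeps don't whole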
I have not been able to run NGramTokenizer on the full vector of sentences for each dataset: it fails with some variant of a memory-allocation error (which honestly does not make much sense to me, considering that I am running it on machines with 12GB of RAM).
So, I am processing the data in chunks of 100,000 sentences, as exemplified by this block of code (the n-grams data for the following section are loaded from a previously saved analysis).
token_delim <- " \\t\\r\\n.!?,;\"()"
nl.chunk <- 100000
N <- ceiling(length(sel.blogs.sentences)/nl.chunk)
#----- BLOGS ------------------------------------
end.blogs <- length(sel.blogs.sentences)
#----- 2-grams -----
cat(" *** Tokenizing : blogs : 2-grams ------------------------------------------------------------\n")
n2grams.blogs.1 <- NGramTokenizer(sel.blogs.sentences[1:100000],
Weka_control(min = 2, max = 2, delimiters = token_delim))
n2grams.blogs.2 <- NGramTokenizer(sel.blogs.sentences[100001:200000],
Weka_control(min = 2, max = 2, delimiters = token_delim))
n2grams.blogs.3 <- NGramTokenizer(sel.blogs.sentences[200001:300000],
Weka_control(min = 2, max = 2, delimiters = token_delim))
n2grams.blogs.4 <- NGramTokenizer(sel.blogs.sentences[300001:400000],
Weka_control(min = 2, max = 2, delimiters = token_delim))
n2grams.blogs.5 <- NGramTokenizer(sel.blogs.sentences[400001:end.blogs],
Weka_control(min = 2, max = 2, delimiters = token_delim))
#----- 3-grams -----
cat(" *** Tokenizing : blogs : 3-grams ------------------------------------------------------------\n")
n3grams.blogs.1 <- NGramTokenizer(sel.blogs.sentences[1:100000],
Weka_control(min = 3, max = 3, delimiters = token_delim))
n3grams.blogs.2 <- NGramTokenizer(sel.blogs.sentences[100001:200000],
Weka_control(min = 3, max = 3, delimiters = token_delim))
n3grams.blogs.3 <- NGramTokenizer(sel.blogs.sentences[200001:300000],
Weka_control(min = 3, max = 3, delimiters = token_delim))
n3grams.blogs.4 <- NGramTokenizer(sel.blogs.sentences[300001:400000],
Weka_control(min = 3, max = 3, delimiters = token_delim))
n3grams.blogs.5 <- NGramTokenizer(sel.blogs.sentences[400001:end.blogs],
Weka_control(min = 3, max = 3, delimiters = token_delim))
# Combining split N-grams vector
source("./scripts/combine_nXgrams_blogs.R")
From the n-gram vectors we can compute frequencies, which will be an important basis for the prediction algorithms.
For now we can take a peek at the most frequent 3-grams and 4-grams in the three datasets.
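The frequency tables printed below (n3g.blogs.freq, n4g.news.freq, etc.) were computed in a separate script and loaded from a previously saved run. As a minimal sketch (assuming the combined 3-gram vector for blogs, produced by the combining script above, is called n3grams.blogs), one such table could be built like this:
# Sketch: turn the combined vector of 3-grams into a frequency table sorted by count
n3g.blogs.freq <- as.data.frame(table(ngram = n3grams.blogs), stringsAsFactors = FALSE) %>%
    rename(count = Freq) %>%
    arrange(desc(count))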
print(cbind(head(n3g.blogs.freq, 20), head(n3g.news.freq, 20), head(n3g.twitter.freq, 20)),
print.gap = 1, right = FALSE)
## ngram count ngram count ngram count
## 1 one of the 4416 the united states 1324 thanks for the 7135
## 2 a lot of 3613 the first time 1249 thank you for 2590
## 3 to be a 2078 for the first 1021 i love you 2474
## 4 it was a 2076 more than <DOLLARAMOUNT> 1000 for the follow 2334
## 5 as well as 2067 the end the 896 for the rt 1311
## 6 some of the 1988 it would be 751 let me know 1301
## 7 the end of 1974 it was the 722 i do_not know 1265
## 8 out of the 1954 the fact that 690 i feel like 1179
## 9 be able to 1927 <DOLLARAMOUNT> - <DOLLARAMOUNT> 680 i wish i 1154
## 10 i want to 1882 this is the 679 thanks for following 1048
## 11 a couple of 1828 the rest the 676 you for the 1013
## 12 the fact that 1596 said he was 667 i can_not wait 968
## 13 this is a 1592 he said he 655 <HASHTAG> <HASHTAG> <HASHTAG> 963
## 14 the rest of 1539 i do_not think 652 how are you 960
## 15 going to be 1521 the new york 651 for the <HASHTAG> 958
## 16 part of the 1478 he said the 628 can_not wait for 919
## 17 i_am going to 1448 i do_not know 626 rt : i 915
## 18 i do_not know 1425 for more than 622 i think i 895
## 19 one of my 1408 the same time 578 if you want 867
## 20 i had to 1373 when he was 565 what do you 858
print(head(n4g.blogs.freq, 20), print.gap = 3, right = FALSE)
## ngram count
## 1 the end of the 1011
## 2 the rest of the 913
## 3 at the end of 872
## 4 at the same time 700
## 5 when it comes to 611
## 6 one of the most 610
## 7 to be able to 578
## 8 for the first time 565
## 9 in the middle of 519
## 10 if you want to 469
## 11 is one of the 462
## 12 i do_not want to 461
## 13 a bit of a 403
## 14 i was going to 395
## 15 on the other hand 393
## 16 i would like to 375
## 17 one of my favorite 350
## 18 as well as the 325
## 19 i was able to 304
## 20 is going to be 302
print(head(n4g.news.freq, 20), print.gap = 3, right = FALSE)
## ngram count
## 1 for the first time 791
## 2 more than <DOLLARAMOUNT> million 398
## 3 the first time since 195
## 4 more than <DOLLARAMOUNT> billion 150
## 5 for more than years 138
## 6 feet <DATE> for <DOLLARAMOUNT> 137
## 7 square feet <DATE> for 137
## 8 <DOLLARAMOUNT> million <DOLLARAMOUNT> million 136
## 9 for the most part 133
## 10 the past two years 132
## 11 told the associated press 132
## 12 the united states and 131
## 13 i do_not know if 126
## 14 the end the year 126
## 15 the end the day 124
## 16 g fat g saturated 118
## 17 the new york times 118
## 18 dow jones industrial average 114
## 19 be reached for comment 112
## 20 i do_not know what 112
print(head(n4g.twitter.freq, 20), print.gap = 3, right = FALSE)
## ngram count
## 1 thanks for the follow 1882
## 2 thanks for the rt 1031
## 3 thank you for the 916
## 4 for the first time 513
## 5 i wish i could 410
## 6 thanks for the <HASHTAG> 375
## 7 rt : rt : 358
## 8 thanks for the mention 358
## 9 let me know if 330
## 10 <HASHTAG> <HASHTAG> <HASHTAG> <HASHTAG> 322
## 11 that awkward moment when 299
## 12 what do you think 292
## 13 thank you much for 276
## 14 hope all is well 266
## 15 can_not wait for the 262
## 16 thanks for the shout 257
## 17 for the shout out 254
## 18 i thought it was 243
## 19 thank you for following 240
## 20 thank you for your 232
It is apparent that some work will be necessary on the validation of the n-grams, or better still on further text transformations, in particular for the twitter dataset, which “suffers” from the tendency to use shorthand slang (e.g. “rt” for “re-tweet”) that adds a lot of “noise” to the data.
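As a purely illustrative sketch of the kind of additional transformation this might require (not something applied yet; the object name sel.twitter.sentences.noslang is hypothetical), a standalone shorthand token like “rt” could be dropped before tokenization:
# Illustration only: drop the standalone "rt" token and tidy up the whitespace;
# a real cleanup pass would handle a longer list of shorthands (dropping or expanding them).
sel.twitter.sentences.noslang <- gsub("\\brt\\b", "", sel.twitter.sentences) %>%
    stripWhitespace() %>%
    trimws()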
These are two handy functions used in the analysis.
The second one, find_sentences(), is used with sapply() to annotate sentences, allowing me to work row by row instead of converting the whole dataset into one document.
#-----------------------------------------------------------------------------------------
# modified readLines
readByLine <- function(fname, check_nl = TRUE, skipNul = TRUE) {
if( check_nl ) {
cmd.nl <- paste("gzip -dc", fname, "| wc -l | awk '{print $1}'", sep = " ")
nl <- as.integer(system(cmd.nl, intern = TRUE)) # wc prints a string; coerce to integer for readLines()
} else {
nl <- -1L
}
con <- gzfile(fname, open = "r")
on.exit(close(con))
readLines(con, n = nl, skipNul = skipNul)
}
#-----------------------------------------------------------------------------------------
# to use w/ sapply for finer sentence splitting.
find_sentences <- function(x) {
s <- paste(x, collapse = " ") %>% as.String()
a <- NLP::annotate(s , sent_token_annotator)
as.vector(s[a])
}
#-----------------------------------------------------------------------------------------
NGramTokenizer in a loop: because the NGramTokenizer would fail with a java memory error when fed the full vector of sentences, but would run when fed chunks of 100,000 sentences, I thought that turning this into a basic loop, handling the splitting into chunks, collecting the output and finally returning just one vector of n-grams, would work and be more compact and smarter.
It turns out that it fails… and this puzzles me deeply.
Is R somehow handling the “stuff” in the loop in the same way it would if I run the tokenizer with the full vector?
Any clue?
nl.chunk <- 100000
N <- ceiling(length(sel.blogs.sentences)/nl.chunk)
alt.n3grams.blogs <- vector("list", N)
system.time({
for( i in 1:N ) {
# chunk boundaries for this iteration
n1 <- (i-1)*nl.chunk + 1
n2 <- min(i*nl.chunk, end.blogs)
cat(" ", i, n1, n2, "\n")
alt.n3grams.blogs[[i]] <- NGramTokenizer(sel.blogs.sentences[n1:n2],
Weka_control(min = 3, max = 3,
delimiters = token_delim))
}
})