Comparing heuristics on wide, noisy data

Different heuristics are better-suited to different data environments. Here we compare heuristics like Take the Best, a unit-weight linear model, and multiple linear regression on wide, noisy data sets, such as those found in genetics.

Background

Consider a situation common in genetic data: only a few rows of data but many, many columns of noisy data. For example, in trying to find which genes cause Alzheimer’s Disease, we might have data from just 5 patients– but we have measured thousands or even millions of genes for each.

Here I simulate this situation by generating data with few row but many columns. Furthermore, I make only one cue (column) predictive, giving it a cue validity of 0.75 with the criterion, a score of severity of Alzheimer’s symptoms. Then I add additional cues, but all of them are randomly generated. As more random cues are added, it gets to be harder and harder for heuristics to pick out the one cue that has a “signal” in it and ignore the noise!

Replication

# Use this seed to exactly replicate my tables and graphs below.
set.seed(3)
# Remove it to see a new sampling-- and whether my overall conclusions still
# hold.

Helper functions

First let’s load the heuristica package to get the heuristics we will compare. It also includes functions to calculate accuracy. (The package is only on github for now.)

# Uncomment and execute install functions you may need.
# install.packages("devtools") 
# Uncomment and execute 
# install.packages("heuristica") 
#    -- OR --
# devtools::install_github("jeanimal/heuristica")
#    -- OR --
devtools::load_all()

## Loading heuristica

library("heuristica")

Now here is a function to generate data. The first cue has a 75% validity. Then we tell it to generate num_random_cues additional cues. Then it measures predictive accuracy on this data for the list of models we gave it in vec_of_models, which is a list of the model types (i.e. “constructors”);

# If you do not have reshape2 installed yet, uncomment the install line the first time.
# install.packages("reshape2") 
library("reshape2")

multiPredict75 <- function(vec_of_models, num_rand_columns) {
  num_rows <- 5  # adding 0.75 validity vector below hard-codes 5 rows
  m_train <- matrix(sample(0:1, num_rows*num_rand_columns, replace=TRUE), num_rows, num_rand_columns)
  m_train <- cbind(c(num_rows:1), c(1,1,1,0,1), m_train)
  m_test <- matrix(sample(0:1, num_rows*num_rand_columns, replace=TRUE), num_rows, num_rand_columns)
  m_test <- cbind(c(num_rows:1), c(1,1,1,0,1), m_test)
  criterion_col <- 1
  cols_to_fit <- c(2:(num_rand_columns+2))
  df <- data.frame()
  list_of_fitted_models <- list()
  for (model_constructor in vec_of_models) {
    fitted_model <- model_constructor(m_train, criterion_col, cols_to_fit)
    list_of_fitted_models[[length(list_of_fitted_models)+1]] <- fitted_model
  }
  pctCorrect <- pctCorrectOfPredictPair(list_of_fitted_models, m_test)
  pctCorrect <- cbind.data.frame(num_rand_col=num_rand_columns, pctCorrect)
  df <- melt(pctCorrect, id.vars=c("num_rand_col"))
  names(df) <- c("num_rand_col", "model", "accuracy")
  return(df)
}

Now we need an easy way to summarize a lot of results. I got this function from an R cookbook. It uses plyr.’

## Summarizes data.
## Gives count, mean, standard deviation, standard error of the mean, and confidence interval (default 95%).
##   data: a data frame.
##   measurevar: the name of a column that contains the variable to be summariezed
##   groupvars: a vector containing names of columns that contain grouping variables
##   na.rm: a boolean that indicates whether to ignore NA's
##   conf.interval: the percent range of the confidence interval (default is 95%)
## Source:
## http://www.cookbook-r.com/Manipulating_data/Summarizing_data/
summarySE <- function(data=NULL, measurevar, groupvars=NULL, na.rm=FALSE,
                      conf.interval=.95, .drop=TRUE) {
    # New version of length which can handle NA's: if na.rm==T, don't count them
    length2 <- function (x, na.rm=FALSE) {
        if (na.rm) sum(!is.na(x))
        else       length(x)
    }
    # This does the summary. For each group's data frame, return a vector with
    # N, mean, and sd
    if (requireNamespace("plyr", quietly = TRUE)) {
     datac <- plyr::ddply(data, groupvars, .drop=.drop,
      .fun = function(xx, col) {
        c(N    = length2(xx[[col]], na.rm=na.rm),
          mean = mean   (xx[[col]], na.rm=na.rm),
          sd   = sd     (xx[[col]], na.rm=na.rm)   
        )
      },
      measurevar
     )
     # Rename the "mean" column    
     datac <- plyr::rename(datac, c("mean" = measurevar))
    }
    datac$se <- datac$sd / sqrt(datac$N)  # Calculate standard error of the mean
    # Confidence interval multiplier for standard error
    # Calculate t-statistic for confidence interval: 
    # e.g., if conf.interval is .95, use .975 (above/below), and use df=N-1
    ciMult <- qt(conf.interval/2 + .5, datac$N-1)
    datac$ci <- datac$se * ciMult
    return(datac)
}

Also load the ggplot2 library so we will be able to plot results.

# If you do not have ggplot2 installed yet, uncomment the install line the first time.
# install.packages("ggplot2") 
library("ggplot2")

Generating data

For 0 random cues, the models will be using the cue with 75% validity. You might think that would lead to 75% accuracy, but the definition of cue validity ignores cases where the cues are the same– when they do not discriminate between objects. This is fairly common with binary data, when both objects can have a cue value of 1 or both have a cue value of 0. In those cases, the models have to guess, which has a 50% accuracy. So the overall accuracy will be somwehat less than 75%.

Below we see all models earn 60% accuracy. They are equally good at fitting the cue with a “signal” in it.

model_list <- list(dawesModel, franklinModel, ttbModel, logRegModel, regNoIModel, regModel)
multiPredict75(model_list, 0)

##   num_rand_col         model accuracy
## 1            0    dawesModel      0.6
## 2            0 franklinModel      0.6
## 3            0      ttbModel      0.6
## 4            0   logRegModel      0.6
## 5            0   regNoIModel      0.6
## 6            0      regModel      0.6

Now consider adding one random cue. To be sure we are fair, we should run this several times, say 200 times, and take the average.

out <- data.frame()
for (i in 1:200) { out <- rbind(out, multiPredict75(model_list, 1)) }
out_summary <- summarySE(out, "accuracy", groupvars=c("model", "num_rand_col"))
out_summary[c("model", "accuracy")]

##           model accuracy
## 1    dawesModel  0.57700
## 2 franklinModel  0.57150
## 3      ttbModel  0.57350
## 4   logRegModel  0.57225
## 5   regNoIModel  0.59950
## 6      regModel  0.56100

In general, the models have fallen in accuracy. The random cue definitely adds a challenge. regNoI model– a regression without fitting an intercept– is typically best, and regModel is typically much worse than the rest because it wastes a degree of freedom on fitting an intercept. An intercept fitted to each row is useless when comparing pairs of rows because the intercept of one row is the same as the intercept of the other row. (Idea: We could fit a regression on the pairs matrix.)

Below is a hacky but effective way to generate a lot of data for a variety of number of random cues. You need 2 repetitions for 0 random cues because otherwise the summarize function complains. Using 200 repetitions takes about a minute on my MacBook Air, while 2000 repetitions takes enough time to grab a coffee.

num_rep <- 400
out <- data.frame()
for (i in 1:2) { out <- rbind(out, multiPredict75(model_list, 0)) }
for (i in 1:num_rep) { out <- rbind(out, multiPredict75(model_list, 1)) }
for (i in 1:num_rep) { out <- rbind(out, multiPredict75(model_list, 2)) }
for (i in 1:num_rep) { out <- rbind(out, multiPredict75(model_list, 3)) }
for (i in 1:num_rep) { out <- rbind(out, multiPredict75(model_list, 4)) }
for (i in 1:num_rep) { out <- rbind(out, multiPredict75(model_list, 5)) }
for (i in 1:num_rep) { out <- rbind(out, multiPredict75(model_list, 6)) }
for (i in 1:num_rep) { out <- rbind(out, multiPredict75(model_list, 7)) }
for (i in 1:num_rep) { out <- rbind(out, multiPredict75(model_list, 8)) }

Now for the fun part, the analysis as we add more and more random columns! Which model is best at picking out the signal amidst the noise?

out_summary <- summarySE(out, "accuracy", groupvars=c("model", "num_rand_col"))

And a plot, which shows the trends of the mean accuracy with a 95% confidence interval around it (although the ci assumes a normal distribution, so take it with a grain of salt).

ggplot(out_summary, aes(x=num_rand_col, y=accuracy, colour=model)) + 
     geom_errorbar(aes(ymin=accuracy-ci, ymax=accuracy+ci), width=.1) +
     geom_line() +
     geom_point() + xlab("number of random cues")

It shows model accuracy decreasing as more random cues are added, which is not a surprise. In general, regNoIModel, dawesModel, and franklinModel are best. ttbModel does pretty badly.

More good news is that most heuristics continue to maintain above-chance performance (above 50%) even with 8 additional random cues. They are champions of picking out signals amidst the noise!

Discussion

I think ttbModel does badly because it puts so much emphasis on the first discriminating high-validity cue. But with random data, we will generate some cues that appear to have a validity well above 75%, and ttb will incorrectly rely on those, ignoring the rest.

In contrast, models like dawesModel– unit linear, so every cue’s weight is +1 or -1– and franklinModel– where cue weights are their cue validities except validities below 0.5 are reversed– let the noise of additional cues cancel each other out. For example, for dawesModel, if there are 100 random cues, 50 of them are likely to get +1 weights and 50 are likely to get -1 weights. In fact, the more random cues are added, the more likely they are to cancel out! Unfortunately, the weight of the one signal cue becomes a smaller percentage of the total weight, and its signal eventually gets lost anyway.

But if there were multiple signal cues– e.g. if 10% of the added cues were signal cues rather than noise cues– then dawesModel could maintain its good performance with 100’s of extra cues. In contrast, ttb would increasingly choose cues which seemed to have a validity of 1.0 but were actually pure noise cues.

Comparing heuristics on wide, noisy data

Jean Czerlinski Whitmore

2015-12-28

Comparing heuristics on wide, noisy data