Comparing the performance of simple heuristics using the Heuristica R package

Daniel Barkoczi

2016-06-06

This document provides a simple example of how to compare the out-of-sample performance of different models in the heuristica package.

Replication

# Use this seed to exactly replicate my tables and graphs below.
#set.seed(3)
# Remove it to see a new sampling-- and whether my overall conclusions still
# hold.

Helper functions

First let’s load the heuristica package to get the heuristics we will compare. It also includes functions to calculate accuracy.

# Uncomment and execute if you do not already have devtools.
#install.packages("devtools")
#devtools::install_github("jeanimal/heuristica")
devtools::load_all() # For development work.

## Loading heuristica

library("heuristica")

Now we can load the German cities dataset which is included in the heuristica package. The criterion column may change depending on your data set, so set it correctly!

data(city_population)
data_set <- city_population
criterion_col <- 3
cols_to_fit <- 4:ncol(data_set)

Let’s enter the models we want to test (note we fit linear regression with an intercept term because the original study by Czerlinski, Gigerenzer and Goldstein, 1999 also included an intercept. For regression with no intercept use regModel():

vec_of_models <-c(ttbModel, unitWeightModel, regInterceptModel,minModel)

Here’s a function that does cross-validation taking the vector of models, criterion column, columns to fit, the dataset, and the number of repetitions as input:

crossV <- function(vec_of_models, criterion_col, cols_to_fit, data, train_size, reps){
  fitting <- vector()
  prediction <- vector()
  for(i in 1:reps){
    
    #randomly sample training and test items
    train <- sample(1:nrow(data), nrow(data)*0.5)
    test <- setdiff(1:nrow(data), train)
   
    #create training and test set
    training_set <- data[train,]
    test_set <- data[test,]
    
    # If a regression is overdetermined (e.g. has too many columns(), it will
    # drop the right-most columns.  To instead make it drop random columns,
    # we shuffle the column order.
    shuffled_cols_to_fit <- sample(cols_to_fit)

    models<-list()
    y <- 0
    for (mod in vec_of_models) { #fit the models to the training_set
      y <- y+1
      models[[y]] <- mod(training_set, criterion_col, shuffled_cols_to_fit)
    }

    #calculate percentage of correct predictions
    fittingAccuracy <- percentCorrect(models, training_set)
    predictionAccuracy <- percentCorrect(models, test_set)
    fitting <- rbind(fitting,fittingAccuracy)
    prediction <- rbind(prediction,predictionAccuracy)
  }

  return (rbind(colMeans(fitting),colMeans(prediction)))
}

City population

Then we can just run this function to calculate predictive accuracy for different training and test set sizes:

reps <- 1000
results <- crossV(vec_of_models, criterion_col, cols_to_fit, data_set, train_size, reps)

Finally, let’s plot the results:

library(ggplot2)
library(reshape)
rownames(results) <- c("Fitting","Prediction")
p <- melt(results)
colnames(p) <- c("condition","model","value")
ggplot(p, aes(x=condition, y=value, colour=model,group=model)) +
  geom_line() + 
  geom_point() + 
  xlab("Condition") + ylab("Proportion correct")

High school drop-outs

Now do the same analysis for the school drop-out data set. It has real-valued cues rather than binary cues. Also, it has a whopping 23 cues for the 63 Chicago public high schools.

Note that this data set has na’s, so we use na.omit to clean them because not all heuristics can handle them properly.

data(highschool_dropout)
data_set <- na.omit(highschool_dropout)
criterion_col <- 4
cols_to_fit <- 6:ncol(data_set)

reps <- 1000
results <- crossV(vec_of_models, criterion_col, cols_to_fit, data_set, train_size, reps) 

rownames(results) <- c("Fitting","Prediction")
p <- melt(results)
colnames(p) <- c("condition","model","value")
ggplot(p, aes(x=condition, y=value, colour=model,group=model)) +
  geom_line() + 
  geom_point() + 
  xlab("Condition") + ylab("Proportion correct")

As expected, the performance of all models drops when they are predicting unseen data. Importantly, for some of the simple models (TTB and UnitWeightModel) the drop in accuracy is smaller than for linear regression.

How would other models compare to take-the-best? Try some of the existing models in the heuristica package (e.g., logRegModel for logistic regression) or create your own model (see vignette on ‘how to make a heuristic’).