This document provides a simple example of how to compare the predictive performance of different models in the heuristica package. It uses cross-validation on the German cities dataset and measures model performance by the percentage of correct predictions.
Replication
# Use this seed to exactly replicate my tables and graphs below.
set.seed(3)
# Remove it to see a new sampling-- and whether my overall conclusions still
# hold.
Helper functions
First let’s load the heuristica package to get the heuristics we will compare. It also includes functions to calculate accuracy. (The package is only on github for now.)
# Uncomment and execute if you do not already have devtools.
#install.packages("devtools")
devtools::install_github("jeanimal/heuristica")
## Downloading github repo jeanimal/heuristica@master
## Installing heuristica
## '/Library/Frameworks/R.framework/Resources/bin/R' --no-site-file \
## --no-environ --no-save --no-restore CMD INSTALL \
## '/private/var/folders/wg/3hpqt3dd5n155w_506frkg_w0000gn/T/RtmpGQGa9o/devtools1764702fb56f/jeanimal-heuristica-b9ab705' \
## --library='/Library/Frameworks/R.framework/Versions/3.2/Resources/library' \
## --install-tests
library("heuristica")
Now we can load the German cities dataset which is included in the heuristica package.
data(city_population)
We remove the first and second columns which contain data that we do not need for the models. Then we store the position of the criterion values and the predictor variables.
criterion_col <- 3
cols_to_fit <- 4:ncol(city_population)
Let’s enter the models we want to test:
vec_of_models <-c(ttbBinModel,dawesModel,franklinModel,regModel,logRegModel)
Next we define the size of the training and test sets.
trainVec <- seq(0.1,0.9,0.1)
trainS <- round(nrow(city_population)*trainVec)
testS <- nrow(city_population)-trainS
Here’s a function that does cross-validation taking the dataset, the vector of models and the number of repetitions as input:
crossV <- function(vec_of_models,x,testRowPairs,reps){
results <- vector()
for ( i in 1:reps){
test <- sample(1:nrow(x),testSize)
train <- setdiff(1:nrow(x),test)
training_set <- x[train,]
test_set <- x[test,]
models<-list()
y=0
for (mod in vec_of_models){ #fit the models to the training_set
y=y+1
models[[y]] <- mod(training_set,criterion_col,cols_to_fit)
}
#calculate percentage of correct predictions
df <- predictAlternativeWithCorrect(models,test_set,row_pairs=testRowPairs)
errors <- createErrorsFromPredicts(df)
pctCorr <- createPctCorrectsFromErrors(errors)
results <- rbind(results,pctCorr)
}
return (colMeans(results,na.rm=T))
}
Then we can just run this function to calculate predictive accuracy for different training and test set sizes:
accuracy<-vector()
for (s in 1:length(trainS)){
trainSize <- trainS[s]
testSize <- testS[s]
testRowPairs <- rowPairGenerator(testSize)
accuracy <- rbind(accuracy,crossV(vec_of_models,city_population,testRowPairs,500))
}
Finally, let’s plot the results:
library(reshape)
library(ggplot2)
p <- melt(accuracy)
p$X1 <- rep(seq(10,90,10),ncol(accuracy))
colnames(p)<-c("size","model","value")
ggplot(p, aes(x=size, y=value, colour=model)) +
geom_line() +
geom_point() + xlab("Size of training set") + ylab("Percent correct")+
scale_x_continuous(limits=c(10,90),breaks=seq(10,90,10),expand=c(0.01,0.01)) +
scale_y_continuous(limits=c(min(accuracy)-0.01, max(accuracy)+0.01),breaks=seq(0.6,0.8, 0.1),expand=c(0.01,0.01))