How predictive is the Predictive R-squared?

In a previous post I’ve summarized the relevant information on the Predictive R-squared.

You may found the post here: https://rpubs.com/RatherBit/102428

The predictive R² is a measure that helps you to determine how well the model predicts responses for new observations. The predictive R² is something of a in-house cooked measure, which is “calculated by systematically removing each observation from the data set, estimating the regression equation, and determining how well the model predicts the removed observation”.

Reference: http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/regression-and-correlation/goodness-of-fit-statistics/r-squared/

However, to have a hint on how well your model predicts responses for new observations, what you really need is an independent sample.

This is what I will test in the present post.

Some months ago, Tom Hopper, who maintains a great blog on #rstats (http://tomhopper.me/), published a post on the predictive R² offering support for its calculation.

You may find the post here: http://tomhopper.me/2014/05/16/can-we-do-better-than-r-squared/

The scripts are also here: https://gist.github.com/tomhopper/8c204d978c4a0cbcb8c0

I’ve played along with these scripts, and I’ve created a general script to be sourced for easy of calculation.

I added the calculation of the ratio of adjusted R² to R², which tells you the likely decrease in model fit when the model is applied to new data. As far as the ratio of adjusted R² to R² is concerned, the higher, the better. Ideally, the adjusted R² should be very close to the R² for a good fit

Let’s call the scripts created by Tom Hopper to calculate the predictive R².

### short-hand: source("predictiveR2.R")

cat("These functions are helpful for calculating the predictive R-square of a model \n
  The functions are used essentially to compare different models \n
    The package plyr is necessary for the functions to work \n
    For models mod1 and mod2 \n
    digit: ldply(list(mod1, mod2), model_fit_stats) \n
    
    If you have just one model and you want to know the predictive R-square of your model [e.g, named 'fit'] \n
    just digit: ldply(list(fit), model_fit_stats) \n
    
    ")

## These functions are helpful for calculating the predictive R-square of a model 
## 
##   The functions are used essentially to compare different models 
## 
##  The package plyr is necessary for the functions to work 
## 
##  For models mod1 and mod2 
## 
##  digit: ldply(list(mod1, mod2), model_fit_stats) 
## 
##  
##  If you have just one model and you want to know the predictive R-square of your model [e.g, named 'fit'] 
## 
##  just digit: ldply(list(fit), model_fit_stats) 
## 
##  
##

#####################################################################
###
### Functions to calculate Predictive R-squared
###
#####################################################################


### This calculate the PRESS (predictive residual sum of squares), the lower, the better

#' @title PRESS
#' @author Thomas Hopper
#' @description Returns the PRESS statistic (predictive residual sum of squares).
#'              Useful for evaluating predictive power of regression models.
#' @param linear.model A linear regression model (class 'lm'). Required.
#' 
PRESS <- function(linear.model) {
  #' calculate the predictive residuals
  pr <- residuals(linear.model)/(1-lm.influence(linear.model)$hat)
  #' calculate the PRESS
  PRESS <- sum(pr^2)
  
  return(PRESS)
}


### This calculate the Predictive r-squared

#' @title Predictive R-squared
#' @author Thomas Hopper
#' @description returns the predictive r-squared. Requires the function PRESS(), which returns
#'              the PRESS statistic.
#' @param linear.model A linear regression model (class 'lm'). Required.
#'
pred_r_squared <- function(linear.model) {
  #' Use anova() to get the sum of squares for the linear model
  lm.anova <- anova(linear.model)
  #' Calculate the total sum of squares
  tss <- sum(lm.anova$'Sum Sq')
  # Calculate the predictive R^2
  pred.r.squared <- 1-PRESS(linear.model)/(tss)
  
  return(pred.r.squared)
}



### This calculate the R-squared, the adj-R-squared and the Predictive R-squared (from the functions above)

#' @title Model Fit Statistics
#' @description Returns lm model fit statistics R-squared, adjucted R-squared,
#'      predicted R-squared and PRESS.
#'      Thanks to John Mount for his 6-June-2014 blog post, R style tip: prefer functions that return data frames" for
#'      the idea \link{http://www.win-vector.com/blog/2014/06/r-style-tip-prefer-functions-that-return-data-frames}
#' @return Returns a data frame with one row and a column for each statistic
#' @param linear.model A \code{lm()} model.
model_fit_stats <- function(linear.model) {
  r.sqr <- summary(linear.model)$r.squared
  adj.r.sqr <- summary(linear.model)$adj.r.squared
  ratio.adjr2.to.r2 <- (adj.r.sqr/r.sqr)
  pre.r.sqr <- pred_r_squared(linear.model)
  press <- PRESS(linear.model)
  return.df <- data.frame("R-squared" = r.sqr, "Adj R-squared" = adj.r.sqr, 
    "Ratio Adj.R2 to R2" = ratio.adjr2.to.r2, "Pred R-squared" = pre.r.sqr, PRESS = press)
  return(round(return.df,3))
}


### This will allow the testing (e.g., comparison of model1 and model2)

library(plyr)

## Warning: package 'plyr' was built under R version 3.1.2

# Now call model_fit_stats() for each lm model that
# we have, using ldply. This returns the results in
# a data frame that is easily used for further 
# calculations.

# ldply(list(model1, model2), model_fit_stats)

Now, we need a dataset. Let’s start with a simulation, a simple one: two variables with medium correlation.

###############################################################################
#### (n = 1000, rho = 0.30)
###############################################################################

set.seed(170815)  ### for reproducibility

require(MASS)

## Loading required package: MASS

dat1 <- mvrnorm(1000, mu = c(0,0), Sigma = matrix(c(1,0.3,0.3,1), ncol = 2),
               empirical = FALSE)
    dat1 <- as.data.frame(dat1)
    names(dat1) <- c("X", "Y")

    ### correlazion test

    c <- cor.test(dat1$X,dat1$Y)

    c

## 
##  Pearson's product-moment correlation
## 
## data:  dat1$X and dat1$Y
## t = 11.59, df = 998, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2885 0.3979
## sample estimates:
##    cor 
## 0.3444

We have 1000 observations. Let’s separate the data into a training set (70% of data) and a test set (the remaining 30%).

###############################################################################
#### Train and test set with caret
###############################################################################

library(caret)

## Warning: package 'caret' was built under R version 3.1.3

## Loading required package: lattice
## Loading required package: ggplot2

### define an 70%/30% train/test split of the dataset

set.seed(234)   ### for reproducibility

trainIndex <- createDataPartition(dat1$Y, p=0.70, list=FALSE)

train1 <- dat1[ trainIndex,]
test1 <- dat1[-trainIndex,]

Now, let’s train a linear regression model in the training and the test set

###############################################################################
#### train a linear regression model
###############################################################################

set.seed(825)

fit.tr1 <-lm(train1$Y ~ train1$X) 
su.tr <- summary(fit.tr1)
r2.tr <- round(su.tr$r.squared,3)


fit.te1 <-lm(test1$Y ~ test1$X) 
su.te <- summary(fit.te1)
r2.te <- round(su.te$r.squared,3)

###############################################################################
#### plot the models
###############################################################################

par(mfrow=c(1,2))

plot(train1$Y ~ train1$X, cex=1.5, col="blue", xlab="X", ylab="Y", main="Train set")
abline(fit.tr1, col = "red", lwd=3)
abline(fit.te1, col = "blue", lwd=3)

### Why so much space in the second "mtext" call? 
### Because I was unable to figure out how to plot the value next to an "expression" command...
### Any suggestion is greatly appreciated.
 
mtext(expression(paste("Train set R"^"2", " = "))); mtext(paste("                                ", r2.tr))

legend("topleft", # places a legend at the appropriate place 
    c("Train fit","Test fit"), # puts text in the legend
    text.col=c("blue", "red"), bty="n")

plot(test1$Y ~ test1$X, cex=1.5, col="red", xlab="X", ylab="Y", main="Test set")
abline(fit.tr1, col = "red", lwd=3)
abline(fit.te1, col = "blue", lwd=3)


mtext(expression(paste("Test set R"^"2", " = "))); mtext(paste("                                ", r2.te))

legend("topleft", # places a legend at the appropriate place 
    c("Train fit","Test fit"), # puts text in the legend
    text.col=c("blue", "red"), bty="n")

plot of chunk unnamed-chunk-5

Let’s see the predictive R² for the training set

ldply(list(fit.tr1), model_fit_stats)

##   R.squared Adj.R.squared Ratio.Adj.R2.to.R2 Pred.R.squared PRESS
## 1     0.105         0.104              0.988            0.1 617.3

As it seems, the predictive R² for the training set is not a good approximation of the R² in the test set (take a look at the figure)

Let’s try another dataset, with stronger links between the X and the Y (the target)

###############################################################################
#### (n = 1000, rho = 0.70)
###############################################################################

set.seed(170815)

require(MASS)



dat2 <- mvrnorm(1000, mu = c(0,0), Sigma = matrix(c(1,0.7,0.7,1), ncol = 2),
               empirical = FALSE)
    dat2 <- as.data.frame(dat2)
    names(dat2) <- c("X", "Y")

    ### correlazion test

    c <- cor.test(dat2$X,dat2$Y)

    c

## 
##  Pearson's product-moment correlation
## 
## data:  dat2$X and dat2$Y
## t = 33.2, df = 998, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6936 0.7526
## sample estimates:
##    cor 
## 0.7244

###############################################################################
#### Train and test set with caret
###############################################################################

library(caret)

### define an 70%/30% train/test split of the dataset

set.seed(234)

trainIndex <- createDataPartition(dat2$Y, p=0.70, list=FALSE)

train2 <- dat2[ trainIndex,]
test2 <- dat2[-trainIndex,]




###############################################################################
#### train a linear regression model
###############################################################################

set.seed(825)

fit.tr2 <-lm(train2$Y ~ train2$X) 
su.tr <- summary(fit.tr2)
r2.tr <- round(su.tr$r.squared,3)


fit.te2 <-lm(test2$Y ~ test2$X) 
su.te <- summary(fit.te2)
r2.te <- round(su.te$r.squared,3)

### plot the models

par(mfrow=c(1,2))

plot(train2$Y ~ train2$X, cex=1.5, col="blue", xlab="X", ylab="Y", main="Train set")
abline(fit.tr2, col = "red", lwd=3)
abline(fit.te2, col = "blue", lwd=3)

mtext(expression(paste("Train set R"^"2", " = "))); mtext(paste("                                ", r2.tr))

legend("topleft", # places a legend at the appropriate place 
    c("Train fit","Test fit"), # puts text in the legend
    text.col=c("blue", "red"), bty="n")

plot(test2$Y ~ test2$X, cex=1.5, col="red", xlab="X", ylab="Y", main="Test set")
abline(fit.tr2, col = "red", lwd=3)
abline(fit.te2, col = "blue", lwd=3)


mtext(expression(paste("Test set R"^"2", " = "))); mtext(paste("                                ", r2.te))

legend("topleft", # places a legend at the appropriate place 
    c("Train fit","Test fit"), # puts text in the legend
    text.col=c("blue", "red"), bty="n")

plot of chunk unnamed-chunk-7

Again, let’s see the predictive R² for the training set

ldply(list(fit.tr2), model_fit_stats)

##   R.squared Adj.R.squared Ratio.Adj.R2.to.R2 Pred.R.squared PRESS
## 1     0.522         0.521              0.999          0.519   367

In this example, the predictive R² for the training set is a good approximation of the R² in the test set (take a look at the figure)

Repetita iuvant. Let’s see all results. You must check the predictive R² for the training set and the R² for the corresponding test set

ldply(list(fit.tr1, fit.te1, fit.tr2, fit.te2), model_fit_stats)

##   R.squared Adj.R.squared Ratio.Adj.R2.to.R2 Pred.R.squared PRESS
## 1     0.105         0.104              0.988          0.100 617.3
## 2     0.152         0.149              0.981          0.141 298.9
## 3     0.522         0.521              0.999          0.519 367.0
## 4     0.534         0.532              0.997          0.528 134.7

Let’s try again with a dataset with a very poor link between the X and the Y (the target)

###############################################################################
#### (n = 1000, rho = 0.10)
###############################################################################

set.seed(170815)

require(MASS)



dat3 <- mvrnorm(1000, mu = c(0,0), Sigma = matrix(c(1,0.1,0.1,1), ncol = 2),
               empirical = FALSE)
    dat3 <- as.data.frame(dat3)
    names(dat3) <- c("X", "Y")

    ### correlazion test

    c <- cor.test(dat3$X,dat3$Y)

    c

## 
##  Pearson's product-moment correlation
## 
## data:  dat3$X and dat3$Y
## t = 4.75, df = 998, p-value = 2.333e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.0875 0.2088
## sample estimates:
##    cor 
## 0.1487

###############################################################################
#### Train and test set with caret
###############################################################################

library(caret)

### define an 70%/30% train/test split of the dataset

set.seed(234)

trainIndex <- createDataPartition(dat3$Y, p=0.70, list=FALSE)

train3 <- dat3[ trainIndex,]
test3 <- dat3[-trainIndex,]




###############################################################################
#### train a linear regression model
###############################################################################

set.seed(825)

fit.tr3 <-lm(train3$Y ~ train3$X) 
su.tr <- summary(fit.tr3)
r3.tr <- round(su.tr$r.squared,3)


fit.te3 <-lm(test3$Y ~ test3$X) 
su.te <- summary(fit.te3)
r3.te <- round(su.te$r.squared,3)

### plot the models

par(mfrow=c(1,2))

plot(train3$Y ~ train3$X, cex=1.5, col="blue", xlab="X", ylab="Y", main="Train set")
abline(fit.tr3, col = "red", lwd=3)
abline(fit.te3, col = "blue", lwd=3)

mtext(expression(paste("Train set R"^"2", " = "))); mtext(paste("                                ", r3.tr))

legend("topleft", # places a legend at the appropriate place 
    c("Train fit","Test fit"), # puts text in the legend
    text.col=c("blue", "red"), bty="n")

plot(test3$Y ~ test3$X, cex=1.5, col="red", xlab="X", ylab="Y", main="Test set")
abline(fit.tr3, col = "red", lwd=3)
abline(fit.te3, col = "blue", lwd=3)


mtext(expression(paste("Test set R"^"2", " = "))); mtext(paste("                                ", r3.te))

legend("topleft", # places a legend at the appropriate place 
    c("Train fit","Test fit"), # puts text in the legend
    text.col=c("blue", "red"), bty="n")

plot of chunk unnamed-chunk-10

### comparison between the models
### You must check the predictive R-squared  for the training set
### and the R-squared for the corresponding test set 

ldply(list(fit.tr3, fit.te3), model_fit_stats)

##   R.squared Adj.R.squared Ratio.Adj.R2.to.R2 Pred.R.squared PRESS
## 1     0.018         0.016              0.920          0.012 712.4
## 2     0.034         0.031              0.905          0.022 296.6

Now, a dataset with expected R² = 0.80

###############################################################################
#### (n = 1000, rho = 0.90)
###############################################################################

set.seed(170815)

require(MASS)



dat4 <- mvrnorm(1000, mu = c(0,0), Sigma = matrix(c(1,0.9,0.9,1), ncol = 2),
               empirical = FALSE)
    dat4 <- as.data.frame(dat4)
    names(dat4) <- c("X", "Y")

    ### correlazion test

    c <- cor.test(dat4$X,dat4$Y)

    c

## 
##  Pearson's product-moment correlation
## 
## data:  dat4$X and dat4$Y
## t = 68.91, df = 998, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8976 0.9192
## sample estimates:
##   cor 
## 0.909

###############################################################################
#### Train and test set with caret
###############################################################################

library(caret)

### define an 70%/30% train/test split of the dataset

set.seed(234)

trainIndex <- createDataPartition(dat4$Y, p=0.70, list=FALSE)

train4 <- dat4[ trainIndex,]
test4 <- dat4[-trainIndex,]




###############################################################################
#### train a linear regression model
###############################################################################

set.seed(825)

fit.tr4 <-lm(train4$Y ~ train4$X) 
su.tr <- summary(fit.tr4)
r4.tr <- round(su.tr$r.squared,3)


fit.te4 <-lm(test4$Y ~ test4$X) 
su.te <- summary(fit.te4)
r4.te <- round(su.te$r.squared,3)

### plot the models

par(mfrow=c(1,2))

plot(train4$Y ~ train4$X, cex=1.5, col="blue", xlab="X", ylab="Y", main="Train set")
abline(fit.tr4, col = "red", lwd=3)
abline(fit.te4, col = "blue", lwd=3)

mtext(expression(paste("Train set R"^"2", " = "))); mtext(paste("                                ", r4.tr))

legend("topleft", # places a legend at the appropriate place 
    c("Train fit","Test fit"), # puts text in the legend
    text.col=c("blue", "red"), bty="n")

plot(test4$Y ~ test4$X, cex=1.5, col="red", xlab="X", ylab="Y", main="Test set")
abline(fit.tr4, col = "red", lwd=3)
abline(fit.te4, col = "blue", lwd=3)


mtext(expression(paste("Test set R"^"2", " = "))); mtext(paste("                                ", r4.te))

legend("topleft", # places a legend at the appropriate place 
    c("Train fit","Test fit"), # puts text in the legend
    text.col=c("blue", "red"), bty="n")

plot of chunk unnamed-chunk-11

### comparison between the models
### You must check the predictive R-squared  for the training set
### and the R-squared for the corresponding test set 

ldply(list(fit.tr4, fit.te4), model_fit_stats)

##   R.squared Adj.R.squared Ratio.Adj.R2.to.R2 Pred.R.squared  PRESS
## 1     0.812         0.812              1.000          0.811 133.08
## 2     0.855         0.855              0.999          0.853  51.17

General summary. Again, you must check the predictive R² for the training set and the R² for the corresponding test set

ldply(list(fit.tr1, fit.te1, fit.tr2, fit.te2, fit.tr3, fit.te3, fit.tr4, fit.te4), model_fit_stats)

##   R.squared Adj.R.squared Ratio.Adj.R2.to.R2 Pred.R.squared  PRESS
## 1     0.105         0.104              0.988          0.100 617.32
## 2     0.152         0.149              0.981          0.141 298.88
## 3     0.522         0.521              0.999          0.519 366.98
## 4     0.534         0.532              0.997          0.528 134.74
## 5     0.018         0.016              0.920          0.012 712.36
## 6     0.034         0.031              0.905          0.022 296.57
## 7     0.812         0.812              1.000          0.811 133.08
## 8     0.855         0.855              0.999          0.853  51.17

Conclusion 1: When the dataset has a low R², the predictive R² is a poor approximation of the R² in an independent sample. However, when the R² is above 0.50, the predictive R² is rather good.

Anyway, these were simulated data. What about real data?

Let see the “abalone” dataset, from the UCI Machine Learning Archives. The abalone is a marine gastropod mollusc, and the “abalone” dataset is used for regression exercise. Essentially, the aim is to predict abalone’s age (through the number of rings on the shell) given various descriptive attributes of the abalone.

Reference: Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

I took the dataset from here: http://scg.sdsu.edu/linear-regression-in-r-abalone-dataset/

aburl = 'http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data'
abnames = c('sex','length','diameter','height','weight.w','weight.s','weight.v','weight.sh','rings')
abalone = read.table(aburl, header = F , sep = ',', col.names = abnames)

Good, now we will take a look at the data

dim(abalone)

## [1] 4177    9

head(abalone)

##   sex length diameter height weight.w weight.s weight.v weight.sh rings
## 1   M  0.455    0.365  0.095   0.5140   0.2245   0.1010     0.150    15
## 2   M  0.350    0.265  0.090   0.2255   0.0995   0.0485     0.070     7
## 3   F  0.530    0.420  0.135   0.6770   0.2565   0.1415     0.210     9
## 4   M  0.440    0.365  0.125   0.5160   0.2155   0.1140     0.155    10
## 5   I  0.330    0.255  0.080   0.2050   0.0895   0.0395     0.055     7
## 6   I  0.425    0.300  0.095   0.3515   0.1410   0.0775     0.120     8

str(abalone)

## 'data.frame':    4177 obs. of  9 variables:
##  $ sex      : Factor w/ 3 levels "F","I","M": 3 3 1 3 2 2 1 1 3 1 ...
##  $ length   : num  0.455 0.35 0.53 0.44 0.33 0.425 0.53 0.545 0.475 0.55 ...
##  $ diameter : num  0.365 0.265 0.42 0.365 0.255 0.3 0.415 0.425 0.37 0.44 ...
##  $ height   : num  0.095 0.09 0.135 0.125 0.08 0.095 0.15 0.125 0.125 0.15 ...
##  $ weight.w : num  0.514 0.226 0.677 0.516 0.205 ...
##  $ weight.s : num  0.2245 0.0995 0.2565 0.2155 0.0895 ...
##  $ weight.v : num  0.101 0.0485 0.1415 0.114 0.0395 ...
##  $ weight.sh: num  0.15 0.07 0.21 0.155 0.055 0.12 0.33 0.26 0.165 0.32 ...
##  $ rings    : int  15 7 9 10 7 8 20 16 9 19 ...

summary(abalone)

##  sex          length         diameter         height         weight.w    
##  F:1307   Min.   :0.075   Min.   :0.055   Min.   :0.000   Min.   :0.002  
##  I:1342   1st Qu.:0.450   1st Qu.:0.350   1st Qu.:0.115   1st Qu.:0.442  
##  M:1528   Median :0.545   Median :0.425   Median :0.140   Median :0.799  
##           Mean   :0.524   Mean   :0.408   Mean   :0.140   Mean   :0.829  
##           3rd Qu.:0.615   3rd Qu.:0.480   3rd Qu.:0.165   3rd Qu.:1.153  
##           Max.   :0.815   Max.   :0.650   Max.   :1.130   Max.   :2.825  
##     weight.s        weight.v        weight.sh          rings      
##  Min.   :0.001   Min.   :0.0005   Min.   :0.0015   Min.   : 1.00  
##  1st Qu.:0.186   1st Qu.:0.0935   1st Qu.:0.1300   1st Qu.: 8.00  
##  Median :0.336   Median :0.1710   Median :0.2340   Median : 9.00  
##  Mean   :0.359   Mean   :0.1806   Mean   :0.2388   Mean   : 9.93  
##  3rd Qu.:0.502   3rd Qu.:0.2530   3rd Qu.:0.3290   3rd Qu.:11.00  
##  Max.   :1.488   Max.   :0.7600   Max.   :1.0050   Max.   :29.00

### plot the dataset
### the first column (abalone[,1]) is for sex
### the abalone has a clear sexual dimorphism


plot(abalone[,2:9], col = as.numeric(abalone[,1]))

plot of chunk unnamed-chunk-14

Now, some analyses.

###############################################################################
#### Train and test set with caret
###############################################################################

library(caret)

### define an 70%/30% train/test split of the dataset

set.seed(234)

trainIndex <- createDataPartition(abalone$rings, p=0.70, list=FALSE)

train.ab <- abalone[ trainIndex,]
test.ab <- abalone[-trainIndex,]

### rapid check

summary(train.ab$rings)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    8.00    9.00    9.95   11.00   29.00

summary(test.ab$rings)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    8.00    9.00    9.89   11.00   27.00

t.test(train.ab$rings, test.ab$rings)

## 
##  Welch Two Sample t-test
## 
## data:  train.ab$rings and test.ab$rings
## t = 0.6068, df = 2458, p-value = 0.544
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.145  0.275
## sample estimates:
## mean of x mean of y 
##     9.953     9.888

### OK, no differences in the target

###############################################################################
#### Regression model
###############################################################################


### In the plot, there is a good relationhsip between "rings" and "diameter"
### Let's see a regression model


set.seed(825)

fit.tr1 <-lm(train.ab$rings ~ train.ab$diameter) 
su.tr <- summary(fit.tr1)
r.ab.tr <- round(su.tr$r.squared,3)

summary(fit.tr1)

## 
## Call:
## lm(formula = train.ab$rings ~ train.ab$diameter)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.249 -1.724 -0.727  0.895 15.659 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          2.214      0.210    10.6   <2e-16 ***
## train.ab$diameter   19.020      0.501    38.0   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.67 on 2923 degrees of freedom
## Multiple R-squared:  0.33,   Adjusted R-squared:  0.33 
## F-statistic: 1.44e+03 on 1 and 2923 DF,  p-value: <2e-16

fit.te1 <-lm(test.ab$rings ~ test.ab$diameter) 
su.te <- summary(fit.te1)
r.ab.te <- round(su.te$r.squared,3)

summary(fit.te1)

## 
## Call:
## lm(formula = test.ab$rings ~ test.ab$diameter)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.661 -1.601 -0.690  0.926 16.131 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         2.539      0.304    8.37   <2e-16 ***
## test.ab$diameter   17.914      0.719   24.93   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.56 on 1250 degrees of freedom
## Multiple R-squared:  0.332,  Adjusted R-squared:  0.332 
## F-statistic:  621 on 1 and 1250 DF,  p-value: <2e-16

So far, so good. Let’s plot the models and check the predictive R².

### plot the models

par(mfrow=c(1,2))

plot(train.ab$rings ~ train.ab$diameter, cex=1.5, col="blue", xlab="diameter", ylab="rings", main="Train set")
abline(fit.tr1, col = "red", lwd=3)
abline(fit.te1, col = "blue", lwd=3)

mtext(expression(paste("Train set R"^"2", " = "))); mtext(paste("                                ", r.ab.tr))

legend("topleft", # places a legend at the appropriate place 
    c("Train fit","Test fit"), # puts text in the legend
    text.col=c("blue", "red"), bty="n")

plot(test.ab$rings ~ test.ab$diameter, cex=1.5, col="red", xlab="diameter", ylab="rings", main="Test set")
abline(fit.tr1, col = "red", lwd=3)
abline(fit.te1, col = "blue", lwd=3)


mtext(expression(paste("Test set R"^"2", " = "))); mtext(paste("                                ", r.ab.te))

legend("topleft", # places a legend at the appropriate place 
    c("Train fit","Test fit"), # puts text in the legend
    text.col=c("blue", "red"), bty="n")

plot of chunk unnamed-chunk-16

### comparison between the models
### You must check the predictive R-squared  for the training set
### and the R-squared for the corresponding test set 

ldply(list(fit.tr1, fit.te1), model_fit_stats)

##   R.squared Adj.R.squared Ratio.Adj.R2.to.R2 Pred.R.squared PRESS
## 1     0.330         0.330              0.999          0.329 20887
## 2     0.332         0.332              0.998          0.330  8210

Well, in this case the predictive R² for the training set was a good approximation of the R² for the corresponding test set.

Let’s try another dataset. We will use the “airquality” dataset, available in the “datasets” package, included in R.

data(airquality)

dim(airquality)

## [1] 153   6

head(airquality)

##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6

str(airquality)

## 'data.frame':    153 obs. of  6 variables:
##  $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
##  $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
##  $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
##  $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
##  $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...

summary(airquality)

##      Ozone          Solar.R         Wind            Temp     
##  Min.   :  1.0   Min.   :  7   Min.   : 1.70   Min.   :56.0  
##  1st Qu.: 18.0   1st Qu.:116   1st Qu.: 7.40   1st Qu.:72.0  
##  Median : 31.5   Median :205   Median : 9.70   Median :79.0  
##  Mean   : 42.1   Mean   :186   Mean   : 9.96   Mean   :77.9  
##  3rd Qu.: 63.2   3rd Qu.:259   3rd Qu.:11.50   3rd Qu.:85.0  
##  Max.   :168.0   Max.   :334   Max.   :20.70   Max.   :97.0  
##  NA's   :37      NA's   :7                                   
##      Month           Day      
##  Min.   :5.00   Min.   : 1.0  
##  1st Qu.:6.00   1st Qu.: 8.0  
##  Median :7.00   Median :16.0  
##  Mean   :6.99   Mean   :15.8  
##  3rd Qu.:8.00   3rd Qu.:23.0  
##  Max.   :9.00   Max.   :31.0  
##

plot(airquality)

plot of chunk unnamed-chunk-17

### Alarm missing data!

aq <- na.omit(airquality)

dim(aq)

## [1] 111   6

head(aq)

##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 7    23     299  8.6   65     5   7
## 8    19      99 13.8   59     5   8

###############################################################################
#### Train and test set with caret
###############################################################################

library(caret)

### define an 70%/30% train/test split of the dataset

set.seed(234)

trainIndex <- createDataPartition(aq$Ozone, p=0.70, list=FALSE)

train.aq <- aq[ trainIndex,]
test.aq <- aq[-trainIndex,]


### rapid check

summary(train.aq$Ozone)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    18.0    30.0    40.1    61.0   118.0

summary(test.aq$Ozone)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     4.0    16.0    31.5    47.1    62.0   168.0

t.test(train.aq$Ozone, test.aq$Ozone)

## 
##  Welch Two Sample t-test
## 
## data:  train.aq$Ozone and test.aq$Ozone
## t = -0.8592, df = 43.54, p-value = 0.3949
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -23.337   9.389
## sample estimates:
## mean of x mean of y 
##     40.09     47.06

### OK, no differences in the target



###############################################################################
#### train a linear regression model
###############################################################################

### Let's try Wind as predictor

### 
set.seed(825)

fit.tr5 <-lm(train.aq$Ozone ~ train.aq$Wind) 
su.tr <- summary(fit.tr5)
r5.tr <- round(su.tr$r.squared,3)

summary(fit.tr5)

## 
## Call:
## lm(formula = train.aq$Ozone ~ train.aq$Wind)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -41.12 -15.98  -3.84  13.63  53.10 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     90.089      8.243   10.93  < 2e-16 ***
## train.aq$Wind   -4.946      0.772   -6.41  1.1e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.6 on 77 degrees of freedom
## Multiple R-squared:  0.348,  Adjusted R-squared:  0.339 
## F-statistic: 41.1 on 1 and 77 DF,  p-value: 1.07e-08

fit.te5 <-lm(test.aq$Ozone ~ test.aq$Wind) 
su.te <- summary(fit.te5)
r5.te <- round(su.te$r.squared,3)

summary(fit.te5)

## 
## Call:
## lm(formula = test.aq$Ozone ~ test.aq$Wind)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -42.10 -26.88  -4.17  22.65  76.47 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    116.23      15.60    7.45  2.6e-08 ***
## test.aq$Wind    -7.27       1.53   -4.76  4.5e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32.2 on 30 degrees of freedom
## Multiple R-squared:  0.431,  Adjusted R-squared:  0.412 
## F-statistic: 22.7 on 1 and 30 DF,  p-value: 4.55e-05

### plot the models

par(mfrow=c(1,2))

plot(train.aq$Ozone ~ train.aq$Wind, cex=1.5, col="blue", xlab="Wind", ylab="Ozone", main="Train set")
abline(fit.tr5, col = "red", lwd=3)
abline(fit.te5, col = "blue", lwd=3)

mtext(expression(paste("Train set R"^"2", " = "))); mtext(paste("                                ", r5.tr))

legend("topleft", # places a legend at the appropriate place 
    c("Train fit","Test fit"), # puts text in the legend
    text.col=c("blue", "red"), bty="n")

plot(test.aq$Ozone ~ test.aq$Wind, cex=1.5, col="red", xlab="Wind", ylab="Ozone", main="Test set")
abline(fit.tr5, col = "red", lwd=3)
abline(fit.te5, col = "blue", lwd=3)


mtext(expression(paste("Test set R"^"2", " = "))); mtext(paste("                                ", r5.te))

legend("topleft", # places a legend at the appropriate place 
    c("Train fit","Test fit"), # puts text in the legend
    text.col=c("blue", "red"), bty="n")

plot of chunk unnamed-chunk-17

### comparison between the models
### You must check the predictive R-squared  for the training set
### and the R-squared for the corresponding test set 

ldply(list(fit.tr5, fit.te5), model_fit_stats)

##   R.squared Adj.R.squared Ratio.Adj.R2.to.R2 Pred.R.squared PRESS
## 1     0.348         0.339              0.976          0.304 45877
## 2     0.431         0.412              0.956          0.325 36956

OK. It seems that in real dataset, the predictive R² for the training set is a good approximation of the R² for the test set, even when the R² is below 0.50. Essentially, the predictive R² is a good approximation of the fit of the model in an independent sample.

What next? To evaluate how the predictive R² works in multiple regression. We will see this in a next future.

I hope you’ve enjoyed this!

Have a nice day!

Alt text

Twitter: @AntoViral

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents.

For more details on using R Markdown see http://rmarkdown.rstudio.com.

August 21, 2015

How predictive is the Predictive R-squared?

Twitter: @AntoViral

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents.