Predictive R-squared according to Tom Hopper

Recently there has been some buzz about the predictive R²

The predictive R² is a measure that helps you to determine how well the model predicts responses for new observations.

Of course, to have a hint on how well your model predicts responses for new observations, what you really need is an independent sample.

The predictive R² is something of a in-house cooked measure, which is “calculated by systematically removing each observation from the data set, estimating the regression equation, and determining how well the model predicts the removed observation”.

Reference: http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/regression-and-correlation/goodness-of-fit-statistics/r-squared/

Minitab© has included this measure, which adds to the more known R-squared and R-squared adjusted.

A short recap may be helpful.

The R² is a mesure of fit of your model, and it is a measure of the proportion of the variance explained by the model (from 0 to 1 = 0% to 100%). It is also known as the coefficient of determination or multiple determination (in multiple linear regression).

The R² is calculated as follow:

R-squared = 1 - (sums of squares error / sums of squares total)

The Minitab© site has a great graphical explanation about error sums of squares and total sums of squares. You may find it here: http://blog.minitab.com/blog/statistics-and-quality-data-analysis/r-squared-sometimes-a-square-is-just-a-square

Essentially, the total sum of squares is the sum of the squares that may be built by using as the side of each square the vertical distance of the point from the mean of the target (the dependent variable, the “y” of your model):

Alt text

The error sum of squares is the sum of the squares that may be built by using as the side of each square the vertical distance of each data value from the regression line:

Alt text

I’m a bit lazy, so I added the figures from the Minitab© site.

The R² is a biased measure, inasmuch it increases the more terms are added into the model as predictors.

The adjusted R² compensates for this overfitting by penalizing for the number of terms included as predictors.

In a LinkedIn discussion (https://www.linkedin.com/grp/post/3696237-77951329), Marcel Rietjens states that the adjusted R² is related to R² as follows (Dillon and Goldstein. Multivariate analysis: Methods and Applications. 1984, p 222).

    adjusted R-squared = 1 - ((1-R2)*(n - 1)/(n - p))

            where n is the number of measurements and p the number of parameters or variables.

In the future, R will includes, in all likelihood, this measure in the summary of the lm and related functions.

To date, you have to calculate by yourself. As pointed out by Liviu Ibanescu on LinkedIn (https://www.linkedin.com/grp/post/3696237-77951329) the predicted R squared is calculated by using the predicted residual sums of squares (PRESS):

    predictive R-squared = [1 - (PRESS / sums of squares total)] * 100

So, you have to calculate the PRESS to derive the predictive R-squared.

Some months ago, Tom Hopper, who maintains a great blog on #rstats (http://tomhopper.me/) and also has an interesting GitHub gist (https://gist.github.com/tomhopper), published a post on the predictive R² offering support for its calculation. In particular, he posted the script for calculating the PRESS, the predictive R², and a summary script for comparison of R², adjusted R² and predictive R².

You may find it here: http://tomhopper.me/2014/05/16/can-we-do-better-than-r-squared/

The scripts are also here: https://gist.github.com/tomhopper/8c204d978c4a0cbcb8c0

You may apply the scripts to a single model or use it for comparing two or more models.

I’ve played along with these scripts, and I’ve created a general script to be sourced for easy of calculation.

I added the calculation of the ratio of adjusted R² to R². Indeed, as stated by Ronán Michael Conroy in the quoted LinkedIn conversation, “The ratio of Adjusted R-squared to R-Squared tells you the likely decrease in model fit when the model is applied to new data”.

As far as the ratio of adjusted R² to R² is concerned, the higher, the better. Ideally, the adjusted R² should be very close to the R² for a good fit.

Here The script:

### short-hand: source("predictiveR2.R")

cat("These functions are helpful for calculating the predictive R-square of a model \n
    The functions are used essentially to compare different models \n
    The package plyr is necessary for the functions to work \n
    For models mod1 and mod2 \n
    digit: ldply(list(mod1, mod2), model_fit_stats) \n
    
    If you have just one model and you want to know the predictive R-square of your model [e.g, named 'fit'] \n
    just digit: ldply(list(fit), model_fit_stats) \n
    
    ")

## These functions are helpful for calculating the predictive R-square of a model 
## 
##  The functions are used essentially to compare different models 
## 
##  The package plyr is necessary for the functions to work 
## 
##  For models mod1 and mod2 
## 
##  digit: ldply(list(mod1, mod2), model_fit_stats) 
## 
##  
##  If you have just one model and you want to know the predictive R-square of your model [e.g, named 'fit'] 
## 
##  just digit: ldply(list(fit), model_fit_stats) 
## 
##  
##

#####################################################################
###
### Functions to calculate Predictive R-squared
###
#####################################################################


### This calculate the PRESS (predictive residual sum of squares), the lower, the better

#' @title PRESS
#' @author Thomas Hopper
#' @description Returns the PRESS statistic (predictive residual sum of squares).
#'              Useful for evaluating predictive power of regression models.
#' @param linear.model A linear regression model (class 'lm'). Required.
#' 
PRESS <- function(linear.model) {
  #' calculate the predictive residuals
  pr <- residuals(linear.model)/(1-lm.influence(linear.model)$hat)
  #' calculate the PRESS
  PRESS <- sum(pr^2)
  
  return(PRESS)
}


### This calculate the Predictive r-squared

#' @title Predictive R-squared
#' @author Thomas Hopper
#' @description returns the predictive r-squared. Requires the function PRESS(), which returns
#'              the PRESS statistic.
#' @param linear.model A linear regression model (class 'lm'). Required.
#'
pred_r_squared <- function(linear.model) {
  #' Use anova() to get the sum of squares for the linear model
  lm.anova <- anova(linear.model)
  #' Calculate the total sum of squares
  tss <- sum(lm.anova$'Sum Sq')
  # Calculate the predictive R^2
  pred.r.squared <- 1-PRESS(linear.model)/(tss)
  
  return(pred.r.squared)
}



### This calculate the R-squared, the adj-R-squared and the Predictive R-squared (from the functions above)

#' @title Model Fit Statistics
#' @description Returns lm model fit statistics R-squared, adjucted R-squared,
#'      predicted R-squared and PRESS.
#'      Thanks to John Mount for his 6-June-2014 blog post, R style tip: prefer functions that return data frames" for
#'      the idea \link{http://www.win-vector.com/blog/2014/06/r-style-tip-prefer-functions-that-return-data-frames}
#' @return Returns a data frame with one row and a column for each statistic
#' @param linear.model A \code{lm()} model.
model_fit_stats <- function(linear.model) {
  r.sqr <- summary(linear.model)$r.squared
  adj.r.sqr <- summary(linear.model)$adj.r.squared
  ratio.adjr2.to.r2 <- (adj.r.sqr/r.sqr)
  pre.r.sqr <- pred_r_squared(linear.model)
  press <- PRESS(linear.model)
  return.df <- data.frame("R-squared" = r.sqr, "Adj R-squared" = adj.r.sqr, 
    "Ratio Adj.R2 to R2" = ratio.adjr2.to.r2, "Pred R-squared" = pre.r.sqr, PRESS = press)
  return(round(return.df,3))
}


### This will allow the testing (e.g., comparison of model1 and model2)

library(plyr)

## Warning: package 'plyr' was built under R version 3.1.2

# Now call model_fit_stats() for each lm model that
# we have, using ldply. This returns the results in
# a data frame that is easily used for further 
# calculations.

# ldply(list(model1, model2), model_fit_stats)

When you run the script, you will be ready to test your model.

Let’s see an example by Tom Hopper.

#####################################################################
###
### Example usage for model_fit_stats.R
###
#####################################################################


# Set up some data
x <- seq(from=0, to=10, by=0.5)
y1 <- 2*x + rnorm(length(x))




# We want to compare two different linear models:
my.lm1 <- lm(y1 ~ sin(x))
my.lm2 <- lm(y1 ~ x)

# plot the two models: in black the linear one, in red the one using the sin

plot(y1 ~ x, pch=19, col="black", cex=1.2)
points(y1 ~ sin(x), pch=19, col="red", cex=1.2)

plot of chunk unnamed-chunk-2

# We will use plyr for the dispalying of the results.
library(plyr)

# Now call model_fit_stats() for each lm model that
# we have, using ldply. This returns the results in
# a data frame that is easily used for further 
# calculations.
ldply(list(my.lm1, my.lm2), model_fit_stats)

##   R.squared Adj.R.squared Ratio.Adj.R2.to.R2 Pred.R.squared   PRESS
## 1     0.006        -0.046             -7.966         -0.173 1014.02
## 2     0.981         0.980              0.999          0.978   19.44

As you can see, the “real” linear model has a great fit and a great predictive R².

Now, another example with the classical “iris” dataset.

##########################################################
##
## Example with dataset iris
##
##########################################################

### the data

data(iris)

attach(iris)

### data inspection

dim(iris)

## [1] 150   5

head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

str(iris)

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

### data visualization

plot(iris[,1:4], pch=19, cex=1.5, col=as.factor(iris[,5]))

plot of chunk unnamed-chunk-3

### We want to compare three different linear models:

mod1 <- lm(Sepal.Length ~ Sepal.Width, data=iris)

summary(mod1)

## 
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width, data = iris)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.556 -0.633 -0.112  0.558  2.223 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.526      0.479   13.63   <2e-16 ***
## Sepal.Width   -0.223      0.155   -1.44     0.15    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.825 on 148 degrees of freedom
## Multiple R-squared:  0.0138, Adjusted R-squared:  0.00716 
## F-statistic: 2.07 on 1 and 148 DF,  p-value: 0.152

mod2 <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data=iris)

summary(mod2)

## 
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width + Petal.Length, data = iris)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.9616 -0.2349  0.0008  0.2145  0.7856 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    2.2491     0.2480    9.07  7.0e-16 ***
## Sepal.Width    0.5955     0.0693    8.59  1.2e-14 ***
## Petal.Length   0.4719     0.0171   27.57  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.333 on 147 degrees of freedom
## Multiple R-squared:  0.84,   Adjusted R-squared:  0.838 
## F-statistic:  386 on 2 and 147 DF,  p-value: <2e-16

mod3 <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data=iris)

summary(mod3)

## 
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, 
##     data = iris)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8282 -0.2199  0.0187  0.1971  0.8457 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.8560     0.2508    7.40  9.9e-12 ***
## Sepal.Width    0.6508     0.0666    9.77  < 2e-16 ***
## Petal.Length   0.7091     0.0567   12.50  < 2e-16 ***
## Petal.Width   -0.5565     0.1275   -4.36  2.4e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.315 on 146 degrees of freedom
## Multiple R-squared:  0.859,  Adjusted R-squared:  0.856 
## F-statistic:  296 on 3 and 146 DF,  p-value: <2e-16

# We will use plyr for best presentation of the results.

library(plyr)

# Now call model_fit_stats() for each lm model that
# we have, using ldply. This returns the results in
# a data frame that is easily used for further 
# calculations.

ldply(list(mod1, mod2, mod3), model_fit_stats)

##   R.squared Adj.R.squared Ratio.Adj.R2.to.R2 Pred.R.squared  PRESS
## 1     0.014         0.007              0.518         -0.011 103.28
## 2     0.840         0.838              0.997          0.834  16.96
## 3     0.859         0.856              0.997          0.851  15.27

### plotting the predictive r-square for visual inspection

prs <- ldply(list(mod1, mod2, mod3), model_fit_stats)

pr2 <- rbind(prs$Pred.R.squared[1], prs$Pred.R.squared[2], prs$Pred.R.squared[3])

max_pr2 <- max(pr2) ### this calculates the highest predictive R-squared, which is then plotted as a red line.

barplot(c("Model 1"=prs$Pred.R.squared[1], "Model 2"=prs$Pred.R.squared[2], "Model 3"=prs$Pred.R.squared[3]),
    col="light blue",
    ylab="Predictive R-squared")
abline(h=max_pr2,col="red", lwd=3)

plot of chunk unnamed-chunk-3

The model 3 is the best, but not very different from model 2.

 Tip for using superscript: e.g., for producing R2 with the '2' superscript, just type R<sup>2</sup>

I hope you’ve enjoyed this!

Have a nice day!

Alt text

Twitter: @AntoViral

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents.

For more details on using R Markdown see http://rmarkdown.rstudio.com.

August 16, 2015

Predictive R-squared according to Tom Hopper

Twitter: @AntoViral

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents.