Recently there has been some buzz about the predictive R2
The predictive R2 is a measure that helps you to determine how well the model predicts responses for new observations.
Of course, to have a hint on how well your model predicts responses for new observations, what you really need is an independent sample.
The predictive R2 is something of a in-house cooked measure, which is “calculated by systematically removing each observation from the data set, estimating the regression equation, and determining how well the model predicts the removed observation”.
Minitab© has included this measure, which adds to the more known R-squared and R-squared adjusted.
A short recap may be helpful.
The R2 is a mesure of fit of your model, and it is a measure of the proportion of the variance explained by the model (from 0 to 1 = 0% to 100%). It is also known as the coefficient of determination or multiple determination (in multiple linear regression).
The R2 is calculated as follow:
R-squared = 1 - (sums of squares error / sums of squares total)
The Minitab© site has a great graphical explanation about error sums of squares and total sums of squares. You may find it here: http://blog.minitab.com/blog/statistics-and-quality-data-analysis/r-squared-sometimes-a-square-is-just-a-square
Essentially, the total sum of squares is the sum of the squares that may be built by using as the side of each square the vertical distance of the point from the mean of the target (the dependent variable, the “y” of your model):
The error sum of squares is the sum of the squares that may be built by using as the side of each square the vertical distance of each data value from the regression line:
I’m a bit lazy, so I added the figures from the Minitab© site.
The R2 is a biased measure, inasmuch it increases the more terms are added into the model as predictors.
The adjusted R2 compensates for this overfitting by penalizing for the number of terms included as predictors.
In a LinkedIn discussion (https://www.linkedin.com/grp/post/3696237-77951329), Marcel Rietjens states that the adjusted R2 is related to R2 as follows (Dillon and Goldstein. Multivariate analysis: Methods and Applications. 1984, p 222).
adjusted R-squared = 1 - ((1-R2)*(n - 1)/(n - p))
where n is the number of measurements and p the number of parameters or variables.
In the future, R will includes, in all likelihood, this measure in the summary of the lm and related functions.
To date, you have to calculate by yourself. As pointed out by Liviu Ibanescu on LinkedIn (https://www.linkedin.com/grp/post/3696237-77951329) the predicted R squared is calculated by using the predicted residual sums of squares (PRESS):
predictive R-squared = [1 - (PRESS / sums of squares total)] * 100
So, you have to calculate the PRESS to derive the predictive R-squared.
Some months ago, Tom Hopper, who maintains a great blog on #rstats (http://tomhopper.me/) and also has an interesting GitHub gist (https://gist.github.com/tomhopper), published a post on the predictive R2 offering support for its calculation. In particular, he posted the script for calculating the PRESS, the predictive R2, and a summary script for comparison of R2, adjusted R2 and predictive R2.
You may find it here: http://tomhopper.me/2014/05/16/can-we-do-better-than-r-squared/
The scripts are also here: https://gist.github.com/tomhopper/8c204d978c4a0cbcb8c0
You may apply the scripts to a single model or use it for comparing two or more models.
I’ve played along with these scripts, and I’ve created a general script to be sourced for easy of calculation.
I added the calculation of the ratio of adjusted R2 to R2. Indeed, as stated by Ronán Michael Conroy in the quoted LinkedIn conversation, “The ratio of Adjusted R-squared to R-Squared tells you the likely decrease in model fit when the model is applied to new data”.
As far as the ratio of adjusted R2 to R2 is concerned, the higher, the better. Ideally, the adjusted R2 should be very close to the R2 for a good fit.
Here The script:
### short-hand: source("predictiveR2.R")
cat("These functions are helpful for calculating the predictive R-square of a model \n
The functions are used essentially to compare different models \n
The package plyr is necessary for the functions to work \n
For models mod1 and mod2 \n
digit: ldply(list(mod1, mod2), model_fit_stats) \n
If you have just one model and you want to know the predictive R-square of your model [e.g, named 'fit'] \n
just digit: ldply(list(fit), model_fit_stats) \n
")
## These functions are helpful for calculating the predictive R-square of a model
##
## The functions are used essentially to compare different models
##
## The package plyr is necessary for the functions to work
##
## For models mod1 and mod2
##
## digit: ldply(list(mod1, mod2), model_fit_stats)
##
##
## If you have just one model and you want to know the predictive R-square of your model [e.g, named 'fit']
##
## just digit: ldply(list(fit), model_fit_stats)
##
##
##
#####################################################################
###
### Functions to calculate Predictive R-squared
###
#####################################################################
### This calculate the PRESS (predictive residual sum of squares), the lower, the better
#' @title PRESS
#' @author Thomas Hopper
#' @description Returns the PRESS statistic (predictive residual sum of squares).
#' Useful for evaluating predictive power of regression models.
#' @param linear.model A linear regression model (class 'lm'). Required.
#'
PRESS <- function(linear.model) {
#' calculate the predictive residuals
pr <- residuals(linear.model)/(1-lm.influence(linear.model)$hat)
#' calculate the PRESS
PRESS <- sum(pr^2)
return(PRESS)
}
### This calculate the Predictive r-squared
#' @title Predictive R-squared
#' @author Thomas Hopper
#' @description returns the predictive r-squared. Requires the function PRESS(), which returns
#' the PRESS statistic.
#' @param linear.model A linear regression model (class 'lm'). Required.
#'
pred_r_squared <- function(linear.model) {
#' Use anova() to get the sum of squares for the linear model
lm.anova <- anova(linear.model)
#' Calculate the total sum of squares
tss <- sum(lm.anova$'Sum Sq')
# Calculate the predictive R^2
pred.r.squared <- 1-PRESS(linear.model)/(tss)
return(pred.r.squared)
}
### This calculate the R-squared, the adj-R-squared and the Predictive R-squared (from the functions above)
#' @title Model Fit Statistics
#' @description Returns lm model fit statistics R-squared, adjucted R-squared,
#' predicted R-squared and PRESS.
#' Thanks to John Mount for his 6-June-2014 blog post, R style tip: prefer functions that return data frames" for
#' the idea \link{http://www.win-vector.com/blog/2014/06/r-style-tip-prefer-functions-that-return-data-frames}
#' @return Returns a data frame with one row and a column for each statistic
#' @param linear.model A \code{lm()} model.
model_fit_stats <- function(linear.model) {
r.sqr <- summary(linear.model)$r.squared
adj.r.sqr <- summary(linear.model)$adj.r.squared
ratio.adjr2.to.r2 <- (adj.r.sqr/r.sqr)
pre.r.sqr <- pred_r_squared(linear.model)
press <- PRESS(linear.model)
return.df <- data.frame("R-squared" = r.sqr, "Adj R-squared" = adj.r.sqr,
"Ratio Adj.R2 to R2" = ratio.adjr2.to.r2, "Pred R-squared" = pre.r.sqr, PRESS = press)
return(round(return.df,3))
}
### This will allow the testing (e.g., comparison of model1 and model2)
library(plyr)
## Warning: package 'plyr' was built under R version 3.1.2
# Now call model_fit_stats() for each lm model that
# we have, using ldply. This returns the results in
# a data frame that is easily used for further
# calculations.
# ldply(list(model1, model2), model_fit_stats)
When you run the script, you will be ready to test your model.
Let’s see an example by Tom Hopper.
#####################################################################
###
### Example usage for model_fit_stats.R
###
#####################################################################
# Set up some data
x <- seq(from=0, to=10, by=0.5)
y1 <- 2*x + rnorm(length(x))
# We want to compare two different linear models:
my.lm1 <- lm(y1 ~ sin(x))
my.lm2 <- lm(y1 ~ x)
# plot the two models: in black the linear one, in red the one using the sin
plot(y1 ~ x, pch=19, col="black", cex=1.2)
points(y1 ~ sin(x), pch=19, col="red", cex=1.2)
# We will use plyr for the dispalying of the results.
library(plyr)
# Now call model_fit_stats() for each lm model that
# we have, using ldply. This returns the results in
# a data frame that is easily used for further
# calculations.
ldply(list(my.lm1, my.lm2), model_fit_stats)
## R.squared Adj.R.squared Ratio.Adj.R2.to.R2 Pred.R.squared PRESS
## 1 0.006 -0.046 -7.966 -0.173 1014.02
## 2 0.981 0.980 0.999 0.978 19.44
As you can see, the “real” linear model has a great fit and a great predictive R2.
Now, another example with the classical “iris” dataset.
##########################################################
##
## Example with dataset iris
##
##########################################################
### the data
data(iris)
attach(iris)
### data inspection
dim(iris)
## [1] 150 5
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
### data visualization
plot(iris[,1:4], pch=19, cex=1.5, col=as.factor(iris[,5]))
### We want to compare three different linear models:
mod1 <- lm(Sepal.Length ~ Sepal.Width, data=iris)
summary(mod1)
##
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.556 -0.633 -0.112 0.558 2.223
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.526 0.479 13.63 <2e-16 ***
## Sepal.Width -0.223 0.155 -1.44 0.15
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.825 on 148 degrees of freedom
## Multiple R-squared: 0.0138, Adjusted R-squared: 0.00716
## F-statistic: 2.07 on 1 and 148 DF, p-value: 0.152
mod2 <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data=iris)
summary(mod2)
##
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width + Petal.Length, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.9616 -0.2349 0.0008 0.2145 0.7856
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2491 0.2480 9.07 7.0e-16 ***
## Sepal.Width 0.5955 0.0693 8.59 1.2e-14 ***
## Petal.Length 0.4719 0.0171 27.57 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.333 on 147 degrees of freedom
## Multiple R-squared: 0.84, Adjusted R-squared: 0.838
## F-statistic: 386 on 2 and 147 DF, p-value: <2e-16
mod3 <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data=iris)
summary(mod3)
##
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
## data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8282 -0.2199 0.0187 0.1971 0.8457
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.8560 0.2508 7.40 9.9e-12 ***
## Sepal.Width 0.6508 0.0666 9.77 < 2e-16 ***
## Petal.Length 0.7091 0.0567 12.50 < 2e-16 ***
## Petal.Width -0.5565 0.1275 -4.36 2.4e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.315 on 146 degrees of freedom
## Multiple R-squared: 0.859, Adjusted R-squared: 0.856
## F-statistic: 296 on 3 and 146 DF, p-value: <2e-16
# We will use plyr for best presentation of the results.
library(plyr)
# Now call model_fit_stats() for each lm model that
# we have, using ldply. This returns the results in
# a data frame that is easily used for further
# calculations.
ldply(list(mod1, mod2, mod3), model_fit_stats)
## R.squared Adj.R.squared Ratio.Adj.R2.to.R2 Pred.R.squared PRESS
## 1 0.014 0.007 0.518 -0.011 103.28
## 2 0.840 0.838 0.997 0.834 16.96
## 3 0.859 0.856 0.997 0.851 15.27
### plotting the predictive r-square for visual inspection
prs <- ldply(list(mod1, mod2, mod3), model_fit_stats)
pr2 <- rbind(prs$Pred.R.squared[1], prs$Pred.R.squared[2], prs$Pred.R.squared[3])
max_pr2 <- max(pr2) ### this calculates the highest predictive R-squared, which is then plotted as a red line.
barplot(c("Model 1"=prs$Pred.R.squared[1], "Model 2"=prs$Pred.R.squared[2], "Model 3"=prs$Pred.R.squared[3]),
col="light blue",
ylab="Predictive R-squared")
abline(h=max_pr2,col="red", lwd=3)
The model 3 is the best, but not very different from model 2.
Tip for using superscript: e.g., for producing R2 with the '2' superscript, just type R<sup>2</sup>
I hope you’ve enjoyed this!
Have a nice day!