In a previous post I’ve summarized the relevant information on the Predictive R-squared.
You may found the post here: https://rpubs.com/RatherBit/102428
The predictive R2 is a measure that helps you to determine how well the model predicts responses for new observations. The predictive R2 is something of a in-house cooked measure, which is “calculated by systematically removing each observation from the data set, estimating the regression equation, and determining how well the model predicts the removed observation”.
However, to have a hint on how well your model predicts responses for new observations, what you really need is an independent sample.
This is what I will test in the present post.
Some months ago, Tom Hopper, who maintains a great blog on #rstats (http://tomhopper.me/), published a post on the predictive R2 offering support for its calculation.
You may find the post here: http://tomhopper.me/2014/05/16/can-we-do-better-than-r-squared/
The scripts are also here: https://gist.github.com/tomhopper/8c204d978c4a0cbcb8c0
I’ve played along with these scripts, and I’ve created a general script to be sourced for easy of calculation.
I added the calculation of the ratio of adjusted R2 to R2, which tells you the likely decrease in model fit when the model is applied to new data. As far as the ratio of adjusted R2 to R2 is concerned, the higher, the better. Ideally, the adjusted R2 should be very close to the R2 for a good fit
Let’s call the scripts created by Tom Hopper to calculate the predictive R2.
### short-hand: source("predictiveR2.R")
cat("These functions are helpful for calculating the predictive R-square of a model \n
The functions are used essentially to compare different models \n
The package plyr is necessary for the functions to work \n
For models mod1 and mod2 \n
digit: ldply(list(mod1, mod2), model_fit_stats) \n
If you have just one model and you want to know the predictive R-square of your model [e.g, named 'fit'] \n
just digit: ldply(list(fit), model_fit_stats) \n
")
## These functions are helpful for calculating the predictive R-square of a model
##
## The functions are used essentially to compare different models
##
## The package plyr is necessary for the functions to work
##
## For models mod1 and mod2
##
## digit: ldply(list(mod1, mod2), model_fit_stats)
##
##
## If you have just one model and you want to know the predictive R-square of your model [e.g, named 'fit']
##
## just digit: ldply(list(fit), model_fit_stats)
##
##
##
#####################################################################
###
### Functions to calculate Predictive R-squared
###
#####################################################################
### This calculate the PRESS (predictive residual sum of squares), the lower, the better
#' @title PRESS
#' @author Thomas Hopper
#' @description Returns the PRESS statistic (predictive residual sum of squares).
#' Useful for evaluating predictive power of regression models.
#' @param linear.model A linear regression model (class 'lm'). Required.
#'
PRESS <- function(linear.model) {
#' calculate the predictive residuals
pr <- residuals(linear.model)/(1-lm.influence(linear.model)$hat)
#' calculate the PRESS
PRESS <- sum(pr^2)
return(PRESS)
}
### This calculate the Predictive r-squared
#' @title Predictive R-squared
#' @author Thomas Hopper
#' @description returns the predictive r-squared. Requires the function PRESS(), which returns
#' the PRESS statistic.
#' @param linear.model A linear regression model (class 'lm'). Required.
#'
pred_r_squared <- function(linear.model) {
#' Use anova() to get the sum of squares for the linear model
lm.anova <- anova(linear.model)
#' Calculate the total sum of squares
tss <- sum(lm.anova$'Sum Sq')
# Calculate the predictive R^2
pred.r.squared <- 1-PRESS(linear.model)/(tss)
return(pred.r.squared)
}
### This calculate the R-squared, the adj-R-squared and the Predictive R-squared (from the functions above)
#' @title Model Fit Statistics
#' @description Returns lm model fit statistics R-squared, adjucted R-squared,
#' predicted R-squared and PRESS.
#' Thanks to John Mount for his 6-June-2014 blog post, R style tip: prefer functions that return data frames" for
#' the idea \link{http://www.win-vector.com/blog/2014/06/r-style-tip-prefer-functions-that-return-data-frames}
#' @return Returns a data frame with one row and a column for each statistic
#' @param linear.model A \code{lm()} model.
model_fit_stats <- function(linear.model) {
r.sqr <- summary(linear.model)$r.squared
adj.r.sqr <- summary(linear.model)$adj.r.squared
ratio.adjr2.to.r2 <- (adj.r.sqr/r.sqr)
pre.r.sqr <- pred_r_squared(linear.model)
press <- PRESS(linear.model)
return.df <- data.frame("R-squared" = r.sqr, "Adj R-squared" = adj.r.sqr,
"Ratio Adj.R2 to R2" = ratio.adjr2.to.r2, "Pred R-squared" = pre.r.sqr, PRESS = press)
return(round(return.df,3))
}
### This will allow the testing (e.g., comparison of model1 and model2)
library(plyr)
## Warning: package 'plyr' was built under R version 3.1.2
# Now call model_fit_stats() for each lm model that
# we have, using ldply. This returns the results in
# a data frame that is easily used for further
# calculations.
# ldply(list(model1, model2), model_fit_stats)
Now, we need a dataset. Let’s start with a simulation, a simple one: two variables with medium correlation.
###############################################################################
#### (n = 1000, rho = 0.30)
###############################################################################
set.seed(170815) ### for reproducibility
require(MASS)
## Loading required package: MASS
dat1 <- mvrnorm(1000, mu = c(0,0), Sigma = matrix(c(1,0.3,0.3,1), ncol = 2),
empirical = FALSE)
dat1 <- as.data.frame(dat1)
names(dat1) <- c("X", "Y")
### correlazion test
c <- cor.test(dat1$X,dat1$Y)
c
##
## Pearson's product-moment correlation
##
## data: dat1$X and dat1$Y
## t = 11.59, df = 998, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2885 0.3979
## sample estimates:
## cor
## 0.3444
We have 1000 observations. Let’s separate the data into a training set (70% of data) and a test set (the remaining 30%).
###############################################################################
#### Train and test set with caret
###############################################################################
library(caret)
## Warning: package 'caret' was built under R version 3.1.3
## Loading required package: lattice
## Loading required package: ggplot2
### define an 70%/30% train/test split of the dataset
set.seed(234) ### for reproducibility
trainIndex <- createDataPartition(dat1$Y, p=0.70, list=FALSE)
train1 <- dat1[ trainIndex,]
test1 <- dat1[-trainIndex,]
Now, let’s train a linear regression model in the training and the test set
###############################################################################
#### train a linear regression model
###############################################################################
set.seed(825)
fit.tr1 <-lm(train1$Y ~ train1$X)
su.tr <- summary(fit.tr1)
r2.tr <- round(su.tr$r.squared,3)
fit.te1 <-lm(test1$Y ~ test1$X)
su.te <- summary(fit.te1)
r2.te <- round(su.te$r.squared,3)
###############################################################################
#### plot the models
###############################################################################
par(mfrow=c(1,2))
plot(train1$Y ~ train1$X, cex=1.5, col="blue", xlab="X", ylab="Y", main="Train set")
abline(fit.tr1, col = "red", lwd=3)
abline(fit.te1, col = "blue", lwd=3)
### Why so much space in the second "mtext" call?
### Because I was unable to figure out how to plot the value next to an "expression" command...
### Any suggestion is greatly appreciated.
mtext(expression(paste("Train set R"^"2", " = "))); mtext(paste(" ", r2.tr))
legend("topleft", # places a legend at the appropriate place
c("Train fit","Test fit"), # puts text in the legend
text.col=c("blue", "red"), bty="n")
plot(test1$Y ~ test1$X, cex=1.5, col="red", xlab="X", ylab="Y", main="Test set")
abline(fit.tr1, col = "red", lwd=3)
abline(fit.te1, col = "blue", lwd=3)
mtext(expression(paste("Test set R"^"2", " = "))); mtext(paste(" ", r2.te))
legend("topleft", # places a legend at the appropriate place
c("Train fit","Test fit"), # puts text in the legend
text.col=c("blue", "red"), bty="n")
Let’s see the predictive R2 for the training set
ldply(list(fit.tr1), model_fit_stats)
## R.squared Adj.R.squared Ratio.Adj.R2.to.R2 Pred.R.squared PRESS
## 1 0.105 0.104 0.988 0.1 617.3
As it seems, the predictive R2 for the training set is not a good approximation of the R2 in the test set (take a look at the figure)
Let’s try another dataset, with stronger links between the X and the Y (the target)
###############################################################################
#### (n = 1000, rho = 0.70)
###############################################################################
set.seed(170815)
require(MASS)
dat2 <- mvrnorm(1000, mu = c(0,0), Sigma = matrix(c(1,0.7,0.7,1), ncol = 2),
empirical = FALSE)
dat2 <- as.data.frame(dat2)
names(dat2) <- c("X", "Y")
### correlazion test
c <- cor.test(dat2$X,dat2$Y)
c
##
## Pearson's product-moment correlation
##
## data: dat2$X and dat2$Y
## t = 33.2, df = 998, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6936 0.7526
## sample estimates:
## cor
## 0.7244
###############################################################################
#### Train and test set with caret
###############################################################################
library(caret)
### define an 70%/30% train/test split of the dataset
set.seed(234)
trainIndex <- createDataPartition(dat2$Y, p=0.70, list=FALSE)
train2 <- dat2[ trainIndex,]
test2 <- dat2[-trainIndex,]
###############################################################################
#### train a linear regression model
###############################################################################
set.seed(825)
fit.tr2 <-lm(train2$Y ~ train2$X)
su.tr <- summary(fit.tr2)
r2.tr <- round(su.tr$r.squared,3)
fit.te2 <-lm(test2$Y ~ test2$X)
su.te <- summary(fit.te2)
r2.te <- round(su.te$r.squared,3)
### plot the models
par(mfrow=c(1,2))
plot(train2$Y ~ train2$X, cex=1.5, col="blue", xlab="X", ylab="Y", main="Train set")
abline(fit.tr2, col = "red", lwd=3)
abline(fit.te2, col = "blue", lwd=3)
mtext(expression(paste("Train set R"^"2", " = "))); mtext(paste(" ", r2.tr))
legend("topleft", # places a legend at the appropriate place
c("Train fit","Test fit"), # puts text in the legend
text.col=c("blue", "red"), bty="n")
plot(test2$Y ~ test2$X, cex=1.5, col="red", xlab="X", ylab="Y", main="Test set")
abline(fit.tr2, col = "red", lwd=3)
abline(fit.te2, col = "blue", lwd=3)
mtext(expression(paste("Test set R"^"2", " = "))); mtext(paste(" ", r2.te))
legend("topleft", # places a legend at the appropriate place
c("Train fit","Test fit"), # puts text in the legend
text.col=c("blue", "red"), bty="n")
Again, let’s see the predictive R2 for the training set
ldply(list(fit.tr2), model_fit_stats)
## R.squared Adj.R.squared Ratio.Adj.R2.to.R2 Pred.R.squared PRESS
## 1 0.522 0.521 0.999 0.519 367
In this example, the predictive R2 for the training set is a good approximation of the R2 in the test set (take a look at the figure)
Repetita iuvant. Let’s see all results. You must check the predictive R2 for the training set and the R2 for the corresponding test set
ldply(list(fit.tr1, fit.te1, fit.tr2, fit.te2), model_fit_stats)
## R.squared Adj.R.squared Ratio.Adj.R2.to.R2 Pred.R.squared PRESS
## 1 0.105 0.104 0.988 0.100 617.3
## 2 0.152 0.149 0.981 0.141 298.9
## 3 0.522 0.521 0.999 0.519 367.0
## 4 0.534 0.532 0.997 0.528 134.7
Let’s try again with a dataset with a very poor link between the X and the Y (the target)
###############################################################################
#### (n = 1000, rho = 0.10)
###############################################################################
set.seed(170815)
require(MASS)
dat3 <- mvrnorm(1000, mu = c(0,0), Sigma = matrix(c(1,0.1,0.1,1), ncol = 2),
empirical = FALSE)
dat3 <- as.data.frame(dat3)
names(dat3) <- c("X", "Y")
### correlazion test
c <- cor.test(dat3$X,dat3$Y)
c
##
## Pearson's product-moment correlation
##
## data: dat3$X and dat3$Y
## t = 4.75, df = 998, p-value = 2.333e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.0875 0.2088
## sample estimates:
## cor
## 0.1487
###############################################################################
#### Train and test set with caret
###############################################################################
library(caret)
### define an 70%/30% train/test split of the dataset
set.seed(234)
trainIndex <- createDataPartition(dat3$Y, p=0.70, list=FALSE)
train3 <- dat3[ trainIndex,]
test3 <- dat3[-trainIndex,]
###############################################################################
#### train a linear regression model
###############################################################################
set.seed(825)
fit.tr3 <-lm(train3$Y ~ train3$X)
su.tr <- summary(fit.tr3)
r3.tr <- round(su.tr$r.squared,3)
fit.te3 <-lm(test3$Y ~ test3$X)
su.te <- summary(fit.te3)
r3.te <- round(su.te$r.squared,3)
### plot the models
par(mfrow=c(1,2))
plot(train3$Y ~ train3$X, cex=1.5, col="blue", xlab="X", ylab="Y", main="Train set")
abline(fit.tr3, col = "red", lwd=3)
abline(fit.te3, col = "blue", lwd=3)
mtext(expression(paste("Train set R"^"2", " = "))); mtext(paste(" ", r3.tr))
legend("topleft", # places a legend at the appropriate place
c("Train fit","Test fit"), # puts text in the legend
text.col=c("blue", "red"), bty="n")
plot(test3$Y ~ test3$X, cex=1.5, col="red", xlab="X", ylab="Y", main="Test set")
abline(fit.tr3, col = "red", lwd=3)
abline(fit.te3, col = "blue", lwd=3)
mtext(expression(paste("Test set R"^"2", " = "))); mtext(paste(" ", r3.te))
legend("topleft", # places a legend at the appropriate place
c("Train fit","Test fit"), # puts text in the legend
text.col=c("blue", "red"), bty="n")
### comparison between the models
### You must check the predictive R-squared for the training set
### and the R-squared for the corresponding test set
ldply(list(fit.tr3, fit.te3), model_fit_stats)
## R.squared Adj.R.squared Ratio.Adj.R2.to.R2 Pred.R.squared PRESS
## 1 0.018 0.016 0.920 0.012 712.4
## 2 0.034 0.031 0.905 0.022 296.6
Now, a dataset with expected R2 = 0.80
###############################################################################
#### (n = 1000, rho = 0.90)
###############################################################################
set.seed(170815)
require(MASS)
dat4 <- mvrnorm(1000, mu = c(0,0), Sigma = matrix(c(1,0.9,0.9,1), ncol = 2),
empirical = FALSE)
dat4 <- as.data.frame(dat4)
names(dat4) <- c("X", "Y")
### correlazion test
c <- cor.test(dat4$X,dat4$Y)
c
##
## Pearson's product-moment correlation
##
## data: dat4$X and dat4$Y
## t = 68.91, df = 998, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8976 0.9192
## sample estimates:
## cor
## 0.909
###############################################################################
#### Train and test set with caret
###############################################################################
library(caret)
### define an 70%/30% train/test split of the dataset
set.seed(234)
trainIndex <- createDataPartition(dat4$Y, p=0.70, list=FALSE)
train4 <- dat4[ trainIndex,]
test4 <- dat4[-trainIndex,]
###############################################################################
#### train a linear regression model
###############################################################################
set.seed(825)
fit.tr4 <-lm(train4$Y ~ train4$X)
su.tr <- summary(fit.tr4)
r4.tr <- round(su.tr$r.squared,3)
fit.te4 <-lm(test4$Y ~ test4$X)
su.te <- summary(fit.te4)
r4.te <- round(su.te$r.squared,3)
### plot the models
par(mfrow=c(1,2))
plot(train4$Y ~ train4$X, cex=1.5, col="blue", xlab="X", ylab="Y", main="Train set")
abline(fit.tr4, col = "red", lwd=3)
abline(fit.te4, col = "blue", lwd=3)
mtext(expression(paste("Train set R"^"2", " = "))); mtext(paste(" ", r4.tr))
legend("topleft", # places a legend at the appropriate place
c("Train fit","Test fit"), # puts text in the legend
text.col=c("blue", "red"), bty="n")
plot(test4$Y ~ test4$X, cex=1.5, col="red", xlab="X", ylab="Y", main="Test set")
abline(fit.tr4, col = "red", lwd=3)
abline(fit.te4, col = "blue", lwd=3)
mtext(expression(paste("Test set R"^"2", " = "))); mtext(paste(" ", r4.te))
legend("topleft", # places a legend at the appropriate place
c("Train fit","Test fit"), # puts text in the legend
text.col=c("blue", "red"), bty="n")
### comparison between the models
### You must check the predictive R-squared for the training set
### and the R-squared for the corresponding test set
ldply(list(fit.tr4, fit.te4), model_fit_stats)
## R.squared Adj.R.squared Ratio.Adj.R2.to.R2 Pred.R.squared PRESS
## 1 0.812 0.812 1.000 0.811 133.08
## 2 0.855 0.855 0.999 0.853 51.17
General summary. Again, you must check the predictive R2 for the training set and the R2 for the corresponding test set
ldply(list(fit.tr1, fit.te1, fit.tr2, fit.te2, fit.tr3, fit.te3, fit.tr4, fit.te4), model_fit_stats)
## R.squared Adj.R.squared Ratio.Adj.R2.to.R2 Pred.R.squared PRESS
## 1 0.105 0.104 0.988 0.100 617.32
## 2 0.152 0.149 0.981 0.141 298.88
## 3 0.522 0.521 0.999 0.519 366.98
## 4 0.534 0.532 0.997 0.528 134.74
## 5 0.018 0.016 0.920 0.012 712.36
## 6 0.034 0.031 0.905 0.022 296.57
## 7 0.812 0.812 1.000 0.811 133.08
## 8 0.855 0.855 0.999 0.853 51.17
Conclusion 1: When the dataset has a low R2, the predictive R2 is a poor approximation of the R2 in an independent sample. However, when the R2 is above 0.50, the predictive R2 is rather good.
Anyway, these were simulated data. What about real data?
Let see the “abalone” dataset, from the UCI Machine Learning Archives. The abalone is a marine gastropod mollusc, and the “abalone” dataset is used for regression exercise. Essentially, the aim is to predict abalone’s age (through the number of rings on the shell) given various descriptive attributes of the abalone.
Reference: Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
I took the dataset from here: http://scg.sdsu.edu/linear-regression-in-r-abalone-dataset/
aburl = 'http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data'
abnames = c('sex','length','diameter','height','weight.w','weight.s','weight.v','weight.sh','rings')
abalone = read.table(aburl, header = F , sep = ',', col.names = abnames)
Good, now we will take a look at the data
dim(abalone)
## [1] 4177 9
head(abalone)
## sex length diameter height weight.w weight.s weight.v weight.sh rings
## 1 M 0.455 0.365 0.095 0.5140 0.2245 0.1010 0.150 15
## 2 M 0.350 0.265 0.090 0.2255 0.0995 0.0485 0.070 7
## 3 F 0.530 0.420 0.135 0.6770 0.2565 0.1415 0.210 9
## 4 M 0.440 0.365 0.125 0.5160 0.2155 0.1140 0.155 10
## 5 I 0.330 0.255 0.080 0.2050 0.0895 0.0395 0.055 7
## 6 I 0.425 0.300 0.095 0.3515 0.1410 0.0775 0.120 8
str(abalone)
## 'data.frame': 4177 obs. of 9 variables:
## $ sex : Factor w/ 3 levels "F","I","M": 3 3 1 3 2 2 1 1 3 1 ...
## $ length : num 0.455 0.35 0.53 0.44 0.33 0.425 0.53 0.545 0.475 0.55 ...
## $ diameter : num 0.365 0.265 0.42 0.365 0.255 0.3 0.415 0.425 0.37 0.44 ...
## $ height : num 0.095 0.09 0.135 0.125 0.08 0.095 0.15 0.125 0.125 0.15 ...
## $ weight.w : num 0.514 0.226 0.677 0.516 0.205 ...
## $ weight.s : num 0.2245 0.0995 0.2565 0.2155 0.0895 ...
## $ weight.v : num 0.101 0.0485 0.1415 0.114 0.0395 ...
## $ weight.sh: num 0.15 0.07 0.21 0.155 0.055 0.12 0.33 0.26 0.165 0.32 ...
## $ rings : int 15 7 9 10 7 8 20 16 9 19 ...
summary(abalone)
## sex length diameter height weight.w
## F:1307 Min. :0.075 Min. :0.055 Min. :0.000 Min. :0.002
## I:1342 1st Qu.:0.450 1st Qu.:0.350 1st Qu.:0.115 1st Qu.:0.442
## M:1528 Median :0.545 Median :0.425 Median :0.140 Median :0.799
## Mean :0.524 Mean :0.408 Mean :0.140 Mean :0.829
## 3rd Qu.:0.615 3rd Qu.:0.480 3rd Qu.:0.165 3rd Qu.:1.153
## Max. :0.815 Max. :0.650 Max. :1.130 Max. :2.825
## weight.s weight.v weight.sh rings
## Min. :0.001 Min. :0.0005 Min. :0.0015 Min. : 1.00
## 1st Qu.:0.186 1st Qu.:0.0935 1st Qu.:0.1300 1st Qu.: 8.00
## Median :0.336 Median :0.1710 Median :0.2340 Median : 9.00
## Mean :0.359 Mean :0.1806 Mean :0.2388 Mean : 9.93
## 3rd Qu.:0.502 3rd Qu.:0.2530 3rd Qu.:0.3290 3rd Qu.:11.00
## Max. :1.488 Max. :0.7600 Max. :1.0050 Max. :29.00
### plot the dataset
### the first column (abalone[,1]) is for sex
### the abalone has a clear sexual dimorphism
plot(abalone[,2:9], col = as.numeric(abalone[,1]))
Now, some analyses.
###############################################################################
#### Train and test set with caret
###############################################################################
library(caret)
### define an 70%/30% train/test split of the dataset
set.seed(234)
trainIndex <- createDataPartition(abalone$rings, p=0.70, list=FALSE)
train.ab <- abalone[ trainIndex,]
test.ab <- abalone[-trainIndex,]
### rapid check
summary(train.ab$rings)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 8.00 9.00 9.95 11.00 29.00
summary(test.ab$rings)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 8.00 9.00 9.89 11.00 27.00
t.test(train.ab$rings, test.ab$rings)
##
## Welch Two Sample t-test
##
## data: train.ab$rings and test.ab$rings
## t = 0.6068, df = 2458, p-value = 0.544
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.145 0.275
## sample estimates:
## mean of x mean of y
## 9.953 9.888
### OK, no differences in the target
###############################################################################
#### Regression model
###############################################################################
### In the plot, there is a good relationhsip between "rings" and "diameter"
### Let's see a regression model
set.seed(825)
fit.tr1 <-lm(train.ab$rings ~ train.ab$diameter)
su.tr <- summary(fit.tr1)
r.ab.tr <- round(su.tr$r.squared,3)
summary(fit.tr1)
##
## Call:
## lm(formula = train.ab$rings ~ train.ab$diameter)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.249 -1.724 -0.727 0.895 15.659
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.214 0.210 10.6 <2e-16 ***
## train.ab$diameter 19.020 0.501 38.0 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.67 on 2923 degrees of freedom
## Multiple R-squared: 0.33, Adjusted R-squared: 0.33
## F-statistic: 1.44e+03 on 1 and 2923 DF, p-value: <2e-16
fit.te1 <-lm(test.ab$rings ~ test.ab$diameter)
su.te <- summary(fit.te1)
r.ab.te <- round(su.te$r.squared,3)
summary(fit.te1)
##
## Call:
## lm(formula = test.ab$rings ~ test.ab$diameter)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.661 -1.601 -0.690 0.926 16.131
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.539 0.304 8.37 <2e-16 ***
## test.ab$diameter 17.914 0.719 24.93 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.56 on 1250 degrees of freedom
## Multiple R-squared: 0.332, Adjusted R-squared: 0.332
## F-statistic: 621 on 1 and 1250 DF, p-value: <2e-16
So far, so good. Let’s plot the models and check the predictive R2.
### plot the models
par(mfrow=c(1,2))
plot(train.ab$rings ~ train.ab$diameter, cex=1.5, col="blue", xlab="diameter", ylab="rings", main="Train set")
abline(fit.tr1, col = "red", lwd=3)
abline(fit.te1, col = "blue", lwd=3)
mtext(expression(paste("Train set R"^"2", " = "))); mtext(paste(" ", r.ab.tr))
legend("topleft", # places a legend at the appropriate place
c("Train fit","Test fit"), # puts text in the legend
text.col=c("blue", "red"), bty="n")
plot(test.ab$rings ~ test.ab$diameter, cex=1.5, col="red", xlab="diameter", ylab="rings", main="Test set")
abline(fit.tr1, col = "red", lwd=3)
abline(fit.te1, col = "blue", lwd=3)
mtext(expression(paste("Test set R"^"2", " = "))); mtext(paste(" ", r.ab.te))
legend("topleft", # places a legend at the appropriate place
c("Train fit","Test fit"), # puts text in the legend
text.col=c("blue", "red"), bty="n")
### comparison between the models
### You must check the predictive R-squared for the training set
### and the R-squared for the corresponding test set
ldply(list(fit.tr1, fit.te1), model_fit_stats)
## R.squared Adj.R.squared Ratio.Adj.R2.to.R2 Pred.R.squared PRESS
## 1 0.330 0.330 0.999 0.329 20887
## 2 0.332 0.332 0.998 0.330 8210
Well, in this case the predictive R2 for the training set was a good approximation of the R2 for the corresponding test set.
Let’s try another dataset. We will use the “airquality” dataset, available in the “datasets” package, included in R.
data(airquality)
dim(airquality)
## [1] 153 6
head(airquality)
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
str(airquality)
## 'data.frame': 153 obs. of 6 variables:
## $ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
## $ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
## $ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
## $ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
## $ Month : int 5 5 5 5 5 5 5 5 5 5 ...
## $ Day : int 1 2 3 4 5 6 7 8 9 10 ...
summary(airquality)
## Ozone Solar.R Wind Temp
## Min. : 1.0 Min. : 7 Min. : 1.70 Min. :56.0
## 1st Qu.: 18.0 1st Qu.:116 1st Qu.: 7.40 1st Qu.:72.0
## Median : 31.5 Median :205 Median : 9.70 Median :79.0
## Mean : 42.1 Mean :186 Mean : 9.96 Mean :77.9
## 3rd Qu.: 63.2 3rd Qu.:259 3rd Qu.:11.50 3rd Qu.:85.0
## Max. :168.0 Max. :334 Max. :20.70 Max. :97.0
## NA's :37 NA's :7
## Month Day
## Min. :5.00 Min. : 1.0
## 1st Qu.:6.00 1st Qu.: 8.0
## Median :7.00 Median :16.0
## Mean :6.99 Mean :15.8
## 3rd Qu.:8.00 3rd Qu.:23.0
## Max. :9.00 Max. :31.0
##
plot(airquality)
### Alarm missing data!
aq <- na.omit(airquality)
dim(aq)
## [1] 111 6
head(aq)
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 7 23 299 8.6 65 5 7
## 8 19 99 13.8 59 5 8
###############################################################################
#### Train and test set with caret
###############################################################################
library(caret)
### define an 70%/30% train/test split of the dataset
set.seed(234)
trainIndex <- createDataPartition(aq$Ozone, p=0.70, list=FALSE)
train.aq <- aq[ trainIndex,]
test.aq <- aq[-trainIndex,]
### rapid check
summary(train.aq$Ozone)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 18.0 30.0 40.1 61.0 118.0
summary(test.aq$Ozone)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.0 16.0 31.5 47.1 62.0 168.0
t.test(train.aq$Ozone, test.aq$Ozone)
##
## Welch Two Sample t-test
##
## data: train.aq$Ozone and test.aq$Ozone
## t = -0.8592, df = 43.54, p-value = 0.3949
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -23.337 9.389
## sample estimates:
## mean of x mean of y
## 40.09 47.06
### OK, no differences in the target
###############################################################################
#### train a linear regression model
###############################################################################
### Let's try Wind as predictor
###
set.seed(825)
fit.tr5 <-lm(train.aq$Ozone ~ train.aq$Wind)
su.tr <- summary(fit.tr5)
r5.tr <- round(su.tr$r.squared,3)
summary(fit.tr5)
##
## Call:
## lm(formula = train.aq$Ozone ~ train.aq$Wind)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.12 -15.98 -3.84 13.63 53.10
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 90.089 8.243 10.93 < 2e-16 ***
## train.aq$Wind -4.946 0.772 -6.41 1.1e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.6 on 77 degrees of freedom
## Multiple R-squared: 0.348, Adjusted R-squared: 0.339
## F-statistic: 41.1 on 1 and 77 DF, p-value: 1.07e-08
fit.te5 <-lm(test.aq$Ozone ~ test.aq$Wind)
su.te <- summary(fit.te5)
r5.te <- round(su.te$r.squared,3)
summary(fit.te5)
##
## Call:
## lm(formula = test.aq$Ozone ~ test.aq$Wind)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.10 -26.88 -4.17 22.65 76.47
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 116.23 15.60 7.45 2.6e-08 ***
## test.aq$Wind -7.27 1.53 -4.76 4.5e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.2 on 30 degrees of freedom
## Multiple R-squared: 0.431, Adjusted R-squared: 0.412
## F-statistic: 22.7 on 1 and 30 DF, p-value: 4.55e-05
### plot the models
par(mfrow=c(1,2))
plot(train.aq$Ozone ~ train.aq$Wind, cex=1.5, col="blue", xlab="Wind", ylab="Ozone", main="Train set")
abline(fit.tr5, col = "red", lwd=3)
abline(fit.te5, col = "blue", lwd=3)
mtext(expression(paste("Train set R"^"2", " = "))); mtext(paste(" ", r5.tr))
legend("topleft", # places a legend at the appropriate place
c("Train fit","Test fit"), # puts text in the legend
text.col=c("blue", "red"), bty="n")
plot(test.aq$Ozone ~ test.aq$Wind, cex=1.5, col="red", xlab="Wind", ylab="Ozone", main="Test set")
abline(fit.tr5, col = "red", lwd=3)
abline(fit.te5, col = "blue", lwd=3)
mtext(expression(paste("Test set R"^"2", " = "))); mtext(paste(" ", r5.te))
legend("topleft", # places a legend at the appropriate place
c("Train fit","Test fit"), # puts text in the legend
text.col=c("blue", "red"), bty="n")
### comparison between the models
### You must check the predictive R-squared for the training set
### and the R-squared for the corresponding test set
ldply(list(fit.tr5, fit.te5), model_fit_stats)
## R.squared Adj.R.squared Ratio.Adj.R2.to.R2 Pred.R.squared PRESS
## 1 0.348 0.339 0.976 0.304 45877
## 2 0.431 0.412 0.956 0.325 36956
OK. It seems that in real dataset, the predictive R2 for the training set is a good approximation of the R2 for the test set, even when the R2 is below 0.50. Essentially, the predictive R2 is a good approximation of the fit of the model in an independent sample.
What next? To evaluate how the predictive R2 works in multiple regression. We will see this in a next future.
I hope you’ve enjoyed this!
Have a nice day!