#######################################################################################
When I started my activity as a statistician in the medical field, regression was a technique requiring to pull together one or more independent variables supposed to be related to a dependent one. All I needed was to get some p < .05 and, possibly, a R-squared > 10%.
I was vaguely aware that some requirements had to be met for a linear regression be valid:
- linearity and additivity of the relationship between dependent and independent variables
- statistical independence of the errors
- homoscedasticity (i.e., constant variance) of the errors
- normality of the error distribution.
However, the procedures to test these requirements with available commercial software were complicated, so I crossed the fingers
Then I have discovered R, and the wide range of opportunities that available packages offer to execute appropriate diagnostics for regression.
Over time, I have prepared a set of scripts to test some measures helpful to evaluate a regression model.
Now I’ ve reunited these scripts into two functions: one function calculates the root-mean-square error (RMSE), the coefficient of determination (denoted R2 or r2 and pronounced R squared), the Cohens 2 effect size measure for multiple regression, accuracy and risk of overfitting. More recently, I’ ve added to this function the predictive R2 (for details, see https://rpubs.com/RatherBit/102428 and https://rpubs.com/RatherBit/104537) and the 95% confidence of interval for both the R2 and the adjusted R2.
The other function is a diagnostics, which assesses violation of homoscedasticity, Durbin-Watson test for non-independence of errors, and normality of residuals, also plotting the qq plot for studentized residuals and the plot of residuals against fitted values, with Tuckey test (to confirm normality of residuals). More recently, I’ ve added to this second function a simple method to calculate the presence of outliers or influential points and the Ramsey Regression Equation Specification Error Test (RESET), which tests for misspecification of the model. If the RESET’ s p < 0.05, the model is misspecified.
Here these functions, that you can copy and paste into a script and save it in your main working directory as “GoF.Regression.R”.
You can call the function with source(“GoF.Regression.R”).
The first function can be called with ‘GoF(fit)’, where ‘GoF’ is the name of the function and ‘fit’ is the fit of the model. See the example.
The second function can be called with ‘GoF2(fit)’, where ‘GoF2’ is the name of the function and ‘fit’ is the fit of the model. See the example.
########################################################################################
########################################################################################
###
### Function for calculating RMSE, f2, risk of overfitting and accuracy in regression
###
########################################################################################
########################################################################################
cat("Utility for calculating RMSE, f2, risk of overfitting and accuracy in regression \n
and for other regression diagnostics \n
Apply to the fit of the model: fit <- lm(y~x1+x2+x3,data = mydata) \n
Usage --> GoF(fit) \n
--> GoF2(fit)
")
## Utility for calculating RMSE, f2, risk of overfitting and accuracy in regression
##
## and for other regression diagnostics
##
## Apply to the fit of the model: fit <- lm(y~x1+x2+x3,data = mydata)
##
## Usage --> GoF(fit)
##
## --> GoF2(fit)
##
### Contact about this funtion: apreti@tin.it
### Twitter: @AntoViral
### Best use: put the script into your main directory
### When you need it, call it with this command: source("GoF.Regression.R")
### The function will be loaded in your workspace --> check with ls()
### Information about the indexes were retrieved from Wikipedia or other publicly available web sites
### The root-mean-square error (RMSE) is a frequently used measure of the differences
### between value predicted by a model or an estimator and the values actually observed. B
### The RMSE represents the sample standard deviation of the differences between predicted values and observed values.
### The RMSE is used to evaluate models by summarizing the differences between the actual (observed) and predicted values.
### RMSE gives extra weight to large errors.
### In comparing models, the lower the RMSE, the better.
### The coefficient of determination, denoted R2 or r2 and pronounced R squared,
### is a number that indicates how well data fit a statistical model
### R-squared can be interpreted as the percent of variance explained by the model.
### The 2 effect size measure for multiple regression is defined as: Cohens f2 = R2 / (1 R2)
### By convention, 2 effect sizes of 0.02, 0.15, and 0.35 are termed 'small', 'medium', and 'large', respectively.
### The f2 may be helpful to calculate a priori sample size of multiple regression.
### Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). New Jersey: Lawrence Erlbaum.
### Accuracy in a measurement system is the degress of the measurement of the quantuty to the actual quantity.
### In regression, the accuracy is the relation between the prediction estimated by the model and the actual dependent variable.
### Accuracy vary from 0 (no accuracy in prediction) to 1 (complete accuracy in prediction).
### In comparing models, the higher the accuracy, the better.
### Overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship.
### Overfitting generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observations.
### A model that has been overfit will generally have poor predictive performance, as it can exaggerate minor fluctuations in the data.
#### As a rough guide against overfitting, calculate the number of data points in the estimation per coefficient estimated
#### If you have less than 10 data points per coefficient estimated, you should be alert to the possibility of overfitting.
### goodness of fit function
### Use: after fitting a regression model like this: fit <- lm(y~x1+x2+x3,data = mydata)
### Call the function: GoF(fit)
### The function will calculate for you the RMSE, the R2 of the model, the Cohen's f2, the accuracy of the model and the overfitting ratio
GoF <- function(fit)
{
rmse <- round(sqrt(mean(fit$residuals^2)),3) ### RMSE (root mean squared error)
r2 <- round(summary(fit)$r.squared,3) ### R2 of the model
f2 <- round(r2/(1-r2),3) ### Cohen's f2
n.points <- round(dim(fit$model)[1],3) ### number of data points in the analysis
num.coeff <- round(NROW(fit$coefficients)) ### number of estimated coefficients (including the intercept)
overfitting.ratio <- round(n.points/num.coeff) ### overfitting ration, best > 10
fit.Predicted <- predict(fit, fit$model) ### Use the built models to predict on test data
### Calculate Accuracy, based on this tutorial: http://rstatistics.net/robust-regression/
fit.Actuals.pred <- cbind(fit.Predicted, fit$model[1])
accuracy <- round(mean(apply(fit.Actuals.pred, 1, min)/ apply(fit.Actuals.pred, 1, max)),3)
### R-squared and adjusted R-squared with 95 confidence of interval
require(MBESS)
su <- summary(fit)
n.pr <- length(fit$coefficients)-1
### R-squared
R2.ci <- ci.R2(R2 = su$r.squared, N = length(su$residuals), K = n.pr, conf.level=.95)
R2 <- round(su$r.squared, 3)
R2.low <- round(R2.ci$Lower.Conf.Limit.R2, 3)
R2.up <- round(R2.ci$Upper.Conf.Limit.R2, 3)
### adjusted R-squared
adj.R2.ci <- ci.R2(R2 = su$adj.r.squared, N = length(su$residuals), K = n.pr, conf.level=.95)
adj.R2 <- round(su$adj.r.squared, 3)
adj.R2.low <- round(adj.R2.ci$Lower.Conf.Limit.R2, 3)
adj.R2.up <- round(adj.R2.ci$Upper.Conf.Limit.R2, 3)
### rearranged outcome
R2.out <- paste0("R-squared: ", R2, " (95%CI: ", R2.low, " to ", R2.up, ")")
adj.R2.out <- paste0("Adjusted R-squared: ", adj.R2, " (95%CI: ", adj.R2.low, " to ", adj.R2.up, ")")
### Predictive R-squared
### This calculate the PRESS (predictive residual sum of squares), the lower, the better
#' @title PRESS
#' @author Thomas Hopper
#' @description Returns the PRESS statistic (predictive residual sum of squares).
#' Useful for evaluating predictive power of regression models.
#' @param linear.model A linear regression model (class 'lm'). Required.
#'
PRESS <- function(linear.model) {
#' calculate the predictive residuals
pr <- residuals(linear.model)/(1-lm.influence(linear.model)$hat)
#' calculate the PRESS
PRESS <- sum(pr^2)
return(PRESS)
}
### This calculate the Predictive r-squared
#' @title Predictive R-squared
#' @author Thomas Hopper
#' @description returns the predictive r-squared. Requires the function PRESS(), which returns
#' the PRESS statistic.
#' @param linear.model A linear regression model (class 'lm'). Required.
#'
pred_r_squared <- function(linear.model) {
#' Use anova() to get the sum of squares for the linear model
lm.anova <- anova(linear.model)
#' Calculate the total sum of squares
tss <- sum(lm.anova$'Sum Sq')
# Calculate the predictive R^2
pred.r.squared <- 1-PRESS(linear.model)/(tss)
return(pred.r.squared)
}
### Rearranged outcome
press <- round(PRESS(fit), 2)
p.r2 <- round(pred_r_squared(fit), 3)
### 'if' statement on the risk of overfitting
if(overfitting.ratio>9) print("There is no risk of overfitting") else print("There is risk of overfitting")
### create an output list of the calculated indexes
out <- c(rmse, f2, accuracy, overfitting.ratio)
### assign names to the calculated indexes
names(out) <- c("RMSE","Cohen's f2", "Accuracy", "Overfitting ratio")
out2 <- c(R2.out, adj.R2.out)
out3 <- c(p.r2, press)
names(out3) <- c("Predictive R-squared", "PRESS")
list(out, out2, out3) ### return all output
}
####################################################################################################
#### Diagnostics function
####################################################################################################
GoF2 <- function(fit)
{
### Violation of homoscedasticity
### The bptest() function in library(lmtest) performs the Breusch-Pagan test for heteroscedasticity
library(lmtest)
H <-"Violation of homoscedasticity (Yes if Breusch-Pagan test's p < 0.05)"
bp <- bptest(fit)
###
### Multi-collinearity
library(car)
VI.F <-"Multi-collinearity: Variance inflation factors must be not > 2)"
VI.Pr <- "If any problem, TRUE will be printed below the problematic variables"
if (length(fit$coefficients)<3)
{vf=NA; sq.vf=NA}
else
{vf=round(vif(fit),3); # variance inflation factors
sq.vf <- sqrt(vif(fit)) > 2} # problem?
###
### Non-independence of Errors
### Test for Autocorrelated Errors
NI <- "Durbin-Watson test for non-independence of Errors (DW statistics should be near 2, with p > 0.05)"
DW <- durbinWatsonTest(fit)
###
### RESET Test
### Ramsey, J. B. (1969). "Tests for Specification Errors in Classical Linear Least Squares Regression Analysis".
### Journal of the Royal Statistical Society Series B 31 (2): 350371
### the Ramsey Regression Equation Specification Error Test (RESET)
### is a general specification test for the linear regression model.
### If non-linear combinations of the explanatory variables have any power
### in explaining the response variable, the model is mis-specified.
R.T <- "If the RESET' s p < 0.05, the model is mispecified"
require(lmtest)
r.t <- resettest(fit, power=2, type="regressor")
r.t.round <- round(r.t$statistic[[1]],2)
p.rt <- r.t$p.value
p.rt <- ifelse(p.rt < 0.0001, 0.0001, round(p.rt,3))
RESET <- paste0("RESET Test: ", r.t.round, "; df(", r.t$parameter[1],",",r.t$parameter[2],"); p = ", p.rt)
###
### Outlier based on boxplot
n.c <- length(fit$coefficients) ### calculate the number of coefficients
bx <- boxplot(fit$model[,2:n.c], plot=FALSE) ### apply boxplot to the predictors (all coefficients minus the dependent target)
outliers <- bx$out ### extract the outliers / influential points
n.out <- length(bx$out) ### calculate the number of potential outliers / influential points
### Normality of residuals
R <- "Now check normality of residuals (after saving, type 'dev.off()')"
out <-list(H,bp,VI.F,VI.Pr, vf,sq.vf,NI,DW,R.T,RESET,R)
###
########################
### Diagnostics plots
########################
par(mfrow=c(1,3))
library(car)
# qq plot for studentized residuals
qqPlot(fit, main="QQ Plot\nNumbered potential outliers must be inside the CI",id.n=5)
# distribution of studentized residuals
library(MASS)
sresid <- studres(fit)
hist(sresid, freq=FALSE,
main="Distribution of Studentized Residuals")
xfit<-seq(min(sresid),max(sresid),length=40)
yfit<-dnorm(xfit)
lines(xfit, yfit)
### plot of residuals against fitted values, with Tuckey test
r.p<-residualPlots(fit,~1,fitted=TRUE)
if (r.p[2] > 0)
{r.p[2] = r.p[2]}
else
{r.p[2] = 0.0001}
title(paste("Residuals against fitted values\nTuckey test: ",r.p[1],", p-value = ",r.p[2]))
par(mfrow=c(1,1))
### done
### 'if' statement on the risk of overfitting
if(n.out>0) out.l <- paste0("Outliers or influential points were detected: n = ", n.out) else out.l <- paste0("No effect attributable to outliers")
list("Violation of homoscedasticity (Yes if Breusch-Pagan test's p < 0.05)" = bp,
"Multi-collinearity: Variance inflation factors must be not > 2)" = vf,
"If any problem, TRUE will be printed below the problematic variables"=sq.vf,
"Durbin-Watson test for non-independence of Errors (DW statistics should be near 2, with p > 0.05)"=DW,
"If the RESET' s p < 0.05, the model is misspecified"=RESET,
"Outliers / influential points"= out.l,
"Now check normality of residuals (after saving, type 'dev.off()')")
}
#############################################################################################################
#############################################################################################################
####
#### EXAMPLE
####
#############################################################################################################
#############################################################################################################
#### Copy and paste the commands to see how the function works
data(airquality)
attach(airquality)
head(airquality)
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
### first model
fit1 <- lm(Ozone~ Solar.R + Wind + Temp,data=airquality,na.action=na.omit)
summary(fit1)
##
## Call:
## lm(formula = Ozone ~ Solar.R + Wind + Temp, data = airquality,
## na.action = na.omit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -40.48 -14.22 -3.55 10.10 95.62
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -64.3421 23.0547 -2.79 0.0062 **
## Solar.R 0.0598 0.0232 2.58 0.0112 *
## Wind -3.3336 0.6544 -5.09 1.5e-06 ***
## Temp 1.6521 0.2535 6.52 2.4e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.2 on 107 degrees of freedom
## (42 observations deleted due to missingness)
## Multiple R-squared: 0.606, Adjusted R-squared: 0.595
## F-statistic: 54.8 on 3 and 107 DF, p-value: <2e-16
### call the function
#### you can call the function with: source("GoF.Regression.R")
#### but we now will use the function as it is
### apply the fist function
GoF(fit1)
## Loading required package: MBESS
## Loading required package: gsl
## Loading required package: MASS
## [1] "There is no risk of overfitting"
## [[1]]
## RMSE Cohen's f2 Accuracy Overfitting ratio
## 20.796 1.538 0.573 28.000
##
## [[2]]
## [1] "R-squared: 0.606 (95%CI: 0.467 to 0.703)"
## [2] "Adjusted R-squared: 0.595 (95%CI: 0.454 to 0.693)"
##
## [[3]]
## Predictive R-squared PRESS
## 0.573 52038.870
### apply the second function
GoF2(fit1)
## Warning: package 'lmtest' was built under R version 3.1.3
## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Warning: package 'car' was built under R version 3.1.3
## $`Violation of homoscedasticity (Yes if Breusch-Pagan test's p < 0.05)`
##
## studentized Breusch-Pagan test
##
## data: fit
## BP = 5.055, df = 3, p-value = 0.1678
##
##
## $`Multi-collinearity: Variance inflation factors must be not > 2)`
## Solar.R Wind Temp
## 1.095 1.329 1.431
##
## $`If any problem, TRUE will be printed below the problematic variables`
## Solar.R Wind Temp
## FALSE FALSE FALSE
##
## $`Durbin-Watson test for non-independence of Errors (DW statistics should be near 2, with p > 0.05)`
## lag Autocorrelation D-W Statistic p-value
## 1 0.03151 1.935 0.644
## Alternative hypothesis: rho != 0
##
## $`If the RESET' s p < 0.05, the model is misspecified`
## [1] "RESET Test: 13.12; df(3,104); p = 1e-04"
##
## $`Outliers / influential points`
## [1] "Outliers or influential points were detected: n = 3"
##
## [[7]]
## [1] "Now check normality of residuals (after saving, type 'dev.off()')"
#############################################################################################################
### second model
#############################################################################################################
fit2 <- lm(Ozone~ Solar.R + Wind + Temp+Month+Day,data=airquality,na.action=na.omit)
summary(fit2)
##
## Call:
## lm(formula = Ozone ~ Solar.R + Wind + Temp + Month + Day, data = airquality,
## na.action = na.omit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.01 -12.28 -3.30 8.45 95.35
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -64.1163 23.4825 -2.73 0.0074 **
## Solar.R 0.0503 0.0234 2.15 0.0341 *
## Wind -3.3184 0.6445 -5.15 1.2e-06 ***
## Temp 1.8958 0.2739 6.92 3.7e-10 ***
## Month -3.0400 1.5135 -2.01 0.0471 *
## Day 0.2739 0.2297 1.19 0.2358
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.9 on 105 degrees of freedom
## (42 observations deleted due to missingness)
## Multiple R-squared: 0.625, Adjusted R-squared: 0.607
## F-statistic: 35 on 5 and 105 DF, p-value: <2e-16
GoF(fit2)
## [1] "There is no risk of overfitting"
## [[1]]
## RMSE Cohen's f2 Accuracy Overfitting ratio
## 20.287 1.667 0.582 18.000
##
## [[2]]
## [1] "R-squared: 0.625 (95%CI: 0.48 to 0.713)"
## [2] "Adjusted R-squared: 0.607 (95%CI: 0.458 to 0.698)"
##
## [[3]]
## Predictive R-squared PRESS
## 0.578 51386.750
GoF2(fit2)
## Warning: 'rlm' failed to converge in 20 steps
## $`Violation of homoscedasticity (Yes if Breusch-Pagan test's p < 0.05)`
##
## studentized Breusch-Pagan test
##
## data: fit
## BP = 5.309, df = 5, p-value = 0.3793
##
##
## $`Multi-collinearity: Variance inflation factors must be not > 2)`
## Solar.R Wind Temp Month Day
## 1.152 1.329 1.722 1.257 1.011
##
## $`If any problem, TRUE will be printed below the problematic variables`
## Solar.R Wind Temp Month Day
## FALSE FALSE FALSE FALSE FALSE
##
## $`Durbin-Watson test for non-independence of Errors (DW statistics should be near 2, with p > 0.05)`
## lag Autocorrelation D-W Statistic p-value
## 1 -0.04491 2.088 0.806
## Alternative hypothesis: rho != 0
##
## $`If the RESET' s p < 0.05, the model is misspecified`
## [1] "RESET Test: 7.94; df(5,100); p = 1e-04"
##
## $`Outliers / influential points`
## [1] "Outliers or influential points were detected: n = 3"
##
## [[7]]
## [1] "Now check normality of residuals (after saving, type 'dev.off()')"
Overall, these two models did not violate the main requirements for a regression, but for the most important: normality of residuals. The two models were also misspecified, and some influence by outliers was detected. What to do? Check for outliers and test for a non-linear model
### Please, if you use this script in a study, cite it in the references:
### Preti, A. (2015). Regression diagnostics made easy: A script in r.
### Retrieved from https://rpubs.com/RatherBit
I hope you’ve enjoyed this!
Have a nice day!