June 1, 2015

Regression diagnostics made easy

Please, if you use this script in a study, cite it in the references:

Preti, A. (2015). Regression diagnostics made easy: A script in r. Retrieved from https://rpubs.com/RatherBit

#######################################################################################

When I started my activity as a statistician in the medical field, regression was a technique requiring to pull together one or more independent variables supposed to be related to a dependent one. All I needed was to get some p < .05 and, possibly, a R-squared > 10%.

I was vaguely aware that some requirements had to be met for a linear regression be valid:

  • linearity and additivity of the relationship between dependent and independent variables

  • statistical independence of the errors

  • homoscedasticity (i.e., constant variance) of the errors

  • normality of the error distribution.

However, the procedures to test these requirements with available commercial software were complicated, so I crossed the fingers.

Then I’ve discovered R, and the wide range of opportunities that available packages offer to execute appropriate diagnostics for regression.

Over time, I’ve prepared a set of scripts to test some measures helpful to evaluate a regression model. Then I’ve reunited these scripts into two functions: one function calculates the root-mean-square error (RMSE), the coefficient of determination (denoted R2 or r2 and pronounced R squared), the Cohen’s f2 effect size measure for multiple regression, accuracy and risk of overfitting.

The other function is a diagnostics, which assesses violation of homoscedasticity, Durbin-Watson test for non-independence of errors, and normality of residuals, also plotting the qq plot for studentized residuals and the plot of residuals against fitted values, with Tuckey test (to confirm normality of residuals).

Here these functions, that you can copy and paste into a script and save it in your main working directory as “GoF.Regression.R”.

You can call the function with source(“GoF.Regression.R”).

The first function can be called with ‘GoF(fit)’, where ‘GoF’ is the name of the function and ‘fit’ is the fit of the model. See the example.

The second function can be called with ‘GoF2(fit)’, where ‘GoF2’ is the name of the function and ‘fit’ is the fit of the model. See the example.

########################################################################################
########################################################################################
###
### Function for calculating RMSE, f2, risk of overfitting and accuracy in regression
###
########################################################################################
########################################################################################




cat("Utility for calculating RMSE, f2, risk of overfitting and accuracy in regression \n
and for other regression diagnostics \n
Apply to the fit of the model: fit <- lm(y~x1+x2+x3,data = mydata) \n
Usage --> GoF(fit) \n 
--> GOF2(fit)
")


### Contact about this funtion: apreti@tin.it
### Twitter:  @AntoViral

### Best use: put the script into your main directory
### When you need it, call it with this command: source("GoF.Regression.R")
### The function will be loaded in your workspace --> check with ls()

### Information about the indexes were retrieved from Wikipedia or other publicly available web sites

### The root-mean-square error (RMSE) is a frequently used measure of the differences 
### between value predicted by a model or an estimator and the values actually observed. B
### The RMSE represents the sample standard deviation of the differences between predicted values and observed values. 
### The RMSE is used to evaluate models by summarizing  the differences between the actual (observed) and predicted values. 
### RMSE gives extra weight to large errors.
### In comparing models, the lower the RMSE, the better.

### The coefficient of determination, denoted R2 or r2 and pronounced R squared, 
### is a number that indicates how well data fit a statistical model
### R-squared can be interpreted as the "percent of variance explained" by the model.

### The f2 effect size measure for multiple regression is defined as: Cohen's f2 = R2 / (1 - R2)
### By convention, f2 effect sizes of 0.02, 0.15, and 0.35 are termed 'small', 'medium', and 'large', respectively.
### The f2 may be helpful to calculate a priori sample size of multiple regression.
### Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). New Jersey: Lawrence Erlbaum.

### Accuracy in a measurement system is the degree of closeness of measurements of a quantity to that quantity's actual (true) value.
### In regression, the accuracy is the relation between the prediction estimated by the model and the actual dependent variable.
### Accuracy vary from 0 (no accuracy in prediction) to 1 (complete accuracy in prediction).
### In comparing models, the higher the accuracy, the better.

### Overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship. 
### Overfitting generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observations. 
### A model that has been overfit will generally have poor predictive performance, as it can exaggerate minor fluctuations in the data.
### As a rough guide against overfitting, calculate the number of data points in the estimation per coefficient estimated 
### If you have less than 10 data points per coefficient estimated, you should be alert to the possibility of overfitting. 


### goodness of fit function

### Use: after fitting a regression model like this: fit <- lm(y~x1+x2+x3,data = mydata)
### Call the function: GoF(fit)
### The function will calculate for you the RMSE, the R2 of the model, the Cohen's f2, the accuracy of the model and the overfitting ratio

GoF <- function(fit)
 {
   rmse <- round(sqrt(mean(fit$residuals^2)),3)     ### RMSE (root mean squared error)
   r2 <- round(summary(fit)$r.squared,3)            ### R2 of the model
   f2 <- round(r2/(1-r2),3)                 ### Cohen's f2
   n.points <- round(dim(fit$model)[1],3)           ### number of data points in the analysis
   num.coeff <- round(NROW(fit$coefficients))       ### number of estimated coefficients (including the intercept)
   overfitting.ratio <- round(n.points/num.coeff)   ### overfitting ration, best > 10
   fit.Predicted <- predict(fit, fit$model)     ### Use the built models to predict on test data
                                
   ### Calculate Accuracy, based on this tutorial: http://rstatistics.net/robust-regression/
   fit.Actuals.pred <- cbind(fit.Predicted, fit$model[1])
   accuracy <- round(mean(apply(fit.Actuals.pred, 1, min)/ apply(fit.Actuals.pred, 1, max)),3) 

                                ### 'if' statement on the risk of overfitting
 if(overfitting.ratio>9) print("There is no risk of overfitting") else print("There is risk of overfitting")
                                ### create an output list of the calculated indexes
 out <- c(rmse, r2, f2, accuracy, overfitting.ratio)        
                                ### assign names to the calculated indexes
 names(out) <- c("RMSE","R-squared","Cohen's f2", "Accuracy", "Overfitting ratio")
 return(out)                            ### return the output
 }



####################################################################################################
#### Diagnostics function
####################################################################################################


GoF2 <- function(fit)
 {

### Violation of homoscedasticity
### The bptest() function in library(lmtest) performs the Breusch-Pagan test for heteroscedasticity 
library(lmtest)
H <-"Violation of homoscedasticity (Yes if Breusch-Pagan test's p < 0.05)"
bp <- bptest(fit)
###


### Non-independence of Errors
### Test for Autocorrelated Errors
library(car)
NI <- "Durbin-Watson test for non-independence of Errors (DW statistics should be near 2, with p > 0.05)"
DW <- durbinWatsonTest(fit)
###
### Normality of residuals
R <- "Now check normality of residuals (after saving, type 'dev.off()')"
out <-list(H,bp,NI,DW,R)
###

########################
### Diagnostics plots
########################

par(mfrow=c(1,3))

library(car)

# qq plot for studentized residuals
qqPlot(fit, main="QQ Plot\nPotential outliers must be inside CI",id.n=5)

# distribution of studentized residuals
library(MASS)
sresid <- studres(fit)
hist(sresid, freq=FALSE,
   main="Distribution of Studentized Residuals")
xfit<-seq(min(sresid),max(sresid),length=40)
yfit<-dnorm(xfit)
lines(xfit, yfit) 

### plot of residuals against fitted values, with Tuckey test
r.p<-residualPlots(fit,~1,fitted=TRUE)
title(paste("Residuals against fitted values\nTuckey test: ",r.p[1],", p-value = ",r.p[2]))
### done

return(out)
 }
#############################################################################################################
#############################################################################################################
####
#### EXAMPLE
####
#############################################################################################################
#############################################################################################################
#### Copy and paste the commands to see how the function works
data(airquality)
attach(airquality)

head(airquality)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6
### first model
fit1 <- lm(Ozone~ Solar.R + Wind + Temp,data=airquality,na.action=na.omit) 
summary(fit1)
## 
## Call:
## lm(formula = Ozone ~ Solar.R + Wind + Temp, data = airquality, 
##     na.action = na.omit)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -40.48 -14.22  -3.55  10.10  95.62 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -64.3421    23.0547   -2.79   0.0062 ** 
## Solar.R       0.0598     0.0232    2.58   0.0112 *  
## Wind         -3.3336     0.6544   -5.09  1.5e-06 ***
## Temp          1.6521     0.2535    6.52  2.4e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.2 on 107 degrees of freedom
##   (42 observations deleted due to missingness)
## Multiple R-squared:  0.606,  Adjusted R-squared:  0.595 
## F-statistic: 54.8 on 3 and 107 DF,  p-value: <2e-16
### call the function
#### you can call the function with: source("GoF.Regression.R")
#### but we now will use the function as it is
GoF <- function(fit)
     {
       rmse <- round(sqrt(mean(fit$residuals^2)),3)     ### RMSE (root mean squared error)
       r2 <- round(summary(fit)$r.squared,3)            ### R2 of the model
       f2 <- round(r2/(1-r2),3)                 ### Cohen's f2
       n.points <- round(dim(fit$model)[1],3)           ### number of data points in the analysis
       num.coeff <- round(NROW(fit$coefficients))       ### number of estimated coefficients (including the intercept)
       overfitting.ratio <- round(n.points/num.coeff)   ### overfitting ration, best > 10
       fit.Predicted <- predict(fit, fit$model)     ### Use the built models to predict on test data
                                    ### Calculate Accuracy, based on this tutorial: http://rstatistics.net/robust-regression/
       fit.Actuals.pred <- cbind(fit.Predicted, fit$model[1])
       accuracy <- round(mean(apply(fit.Actuals.pred, 1, min)/ apply(fit.Actuals.pred, 1, max)),3) 

                                    ### 'if' statement on the risk of overfitting
     if(overfitting.ratio>9) print("There is no risk of overfitting") else print("There is risk of overfitting")
                                    ### create an output list of the calculated indexes
     out <- c(rmse, r2, f2, accuracy, overfitting.ratio)        
                                    ### assign names to the calculated indexes
     names(out) <- c("RMSE","R-squared","Cohen's f2", "Accuracy", "Overfitting ratio")
     return(out)                            ### return the output
     }

    ####################################################################################################
    #### Diagnostics function
    ####################################################################################################

    GoF2 <- function(fit)
     {

    ### Violation of homoscedasticity
    ### The bptest() function in library(lmtest) performs the Breusch-Pagan test for heteroscedasticity 
    library(lmtest)
    H <-"Violation of homoscedasticity (Yes if Breusch-Pagan test's p < 0.05)"
    bp <- bptest(fit)
    ###

    ### Non-independence of Errors
    ### Test for Autocorrelated Errors
    library(car)
    NI <- "Durbin-Watson test for non-independence of Errors (DW statistics should be near 2, with p > 0.05)"
    DW <- durbinWatsonTest(fit)
    ###
    ### Normality of residuals
    R <- "Now check normality of residuals (after saving, type 'dev.off()')"
    out <-list(H,bp,NI,DW,R)
    ###

    ########################
    ### Diagnostics plots
    ########################

    par(mfrow=c(1,3))

    library(car)

    # qq plot for studentized residuals
    qqPlot(fit, main="QQ Plot\nPotential outliers must be inside CI",id.n=5)

    # distribution of studentized residuals
    library(MASS)
    sresid <- studres(fit)
    hist(sresid, freq=FALSE,
       main="Distribution of Studentized Residuals")
    xfit<-seq(min(sresid),max(sresid),length=40)
    yfit<-dnorm(xfit)
    lines(xfit, yfit) 

    ### plot of residuals against fitted values, with Tuckey test
    r.p<-residualPlots(fit,~1,fitted=TRUE)
    title(paste("Residuals against fitted values\nTuckey test: ",r.p[1],", p-value = ",r.p[2]))
    ### done

    return(out)
     }
### apply the fist function
GoF(fit1)
## [1] "There is no risk of overfitting"
##              RMSE         R-squared        Cohen's f2          Accuracy 
##            20.796             0.606             1.538             0.573 
## Overfitting ratio 
##            28.000
### apply the second function

par(mar = c(5, 4, 5, 4) + 0.1)
GoF2(fit1)
## Warning: package 'lmtest' was built under R version 3.1.3
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Warning: package 'car' was built under R version 3.1.3

plot of chunk unnamed-chunk-10

## [[1]]
## [1] "Violation of homoscedasticity (Yes if Breusch-Pagan test's p < 0.05)"
## 
## [[2]]
## 
##  studentized Breusch-Pagan test
## 
## data:  fit
## BP = 5.055, df = 3, p-value = 0.1678
## 
## 
## [[3]]
## [1] "Durbin-Watson test for non-independence of Errors (DW statistics should be near 2, with p > 0.05)"
## 
## [[4]]
##  lag Autocorrelation D-W Statistic p-value
##    1         0.03151         1.935    0.65
##  Alternative hypothesis: rho != 0
## 
## [[5]]
## [1] "Now check normality of residuals (after saving, type 'dev.off()')"
#############################################################################################################
### second model
#############################################################################################################
fit2 <- lm(Ozone~ Solar.R + Wind + Temp+Month+Day,data=airquality,na.action=na.omit)
summary(fit2)
## 
## Call:
## lm(formula = Ozone ~ Solar.R + Wind + Temp + Month + Day, data = airquality, 
##     na.action = na.omit)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -37.01 -12.28  -3.30   8.45  95.35 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -64.1163    23.4825   -2.73   0.0074 ** 
## Solar.R       0.0503     0.0234    2.15   0.0341 *  
## Wind         -3.3184     0.6445   -5.15  1.2e-06 ***
## Temp          1.8958     0.2739    6.92  3.7e-10 ***
## Month        -3.0400     1.5135   -2.01   0.0471 *  
## Day           0.2739     0.2297    1.19   0.2358    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.9 on 105 degrees of freedom
##   (42 observations deleted due to missingness)
## Multiple R-squared:  0.625,  Adjusted R-squared:  0.607 
## F-statistic:   35 on 5 and 105 DF,  p-value: <2e-16
GoF(fit2)
## [1] "There is no risk of overfitting"
##              RMSE         R-squared        Cohen's f2          Accuracy 
##            20.287             0.625             1.667             0.582 
## Overfitting ratio 
##            18.000
par(mar = c(5, 4, 5, 4) + 0.1)
GoF2(fit2)
## Warning: 'rlm' failed to converge in 20 steps

plot of chunk unnamed-chunk-14

## [[1]]
## [1] "Violation of homoscedasticity (Yes if Breusch-Pagan test's p < 0.05)"
## 
## [[2]]
## 
##  studentized Breusch-Pagan test
## 
## data:  fit
## BP = 5.309, df = 5, p-value = 0.3793
## 
## 
## [[3]]
## [1] "Durbin-Watson test for non-independence of Errors (DW statistics should be near 2, with p > 0.05)"
## 
## [[4]]
##  lag Autocorrelation D-W Statistic p-value
##    1        -0.04491         2.088   0.898
##  Alternative hypothesis: rho != 0
## 
## [[5]]
## [1] "Now check normality of residuals (after saving, type 'dev.off()')"

Overall, these two models did not violate the main requirements for a regression, but for the most important: normality of residuals. What to do? Check for outliers and test for a non-linear model…

### Please, if you use this script in a study, cite it in the references:
### Preti, A. (2015). Regression diagnostics made easy: A script in r. Retrieved from https://rpubs.com/RatherBit

I hope you’ve enjoyed this!

Twitter: @AntoViral

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
library(knitr)