Linear Regression: Diagnostics

Introduction

After building or creating a linear regression model and evaluating for estimation and goodness of fit, the data scientist should use regression diagnostics to truly ascertain the value of the model before using the model for future predictions.

The regression diagnostics include checking to see if the errors are independent, have equal variance and are normally distributed in order to confirm that the structural part of the model is correct. And lastly, unusual observations that don’t fit the model should be considered given that these observations could change the choice and fit of the model.

The ability to detect variance in the errors can be done easily with a plot of the residuals. The goal is to confirm homoscedasticity in the variance of the residuals.

Implementation

Using the dataset found at https://www.kaggle.com/loveall/appliances-energy-prediction, the following implementation walks through several diagnostic techniques for properly evaluating a linear model. The linear model attempts to predict the amount of energy consumed by the appliances of a home based on the outside temperature, humidity, windspeed and dew point.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

energy_data <- read.csv("KAG_energydata_complete.csv")

energy_data <- energy_data[complete.cases(energy_data), ]

summary(energy_data)
##                   date         Appliances          lights             T1       
##  2016-01-11 17:00:00:    1   Min.   :  10.00   Min.   : 0.000   Min.   :16.79  
##  2016-01-11 17:10:00:    1   1st Qu.:  50.00   1st Qu.: 0.000   1st Qu.:20.76  
##  2016-01-11 17:20:00:    1   Median :  60.00   Median : 0.000   Median :21.60  
##  2016-01-11 17:30:00:    1   Mean   :  97.69   Mean   : 3.802   Mean   :21.69  
##  2016-01-11 17:40:00:    1   3rd Qu.: 100.00   3rd Qu.: 0.000   3rd Qu.:22.60  
##  2016-01-11 17:50:00:    1   Max.   :1080.00   Max.   :70.000   Max.   :26.26  
##  (Other)            :19729                                                     
##       RH_1             T2             RH_2             T3       
##  Min.   :27.02   Min.   :16.10   Min.   :20.46   Min.   :17.20  
##  1st Qu.:37.33   1st Qu.:18.79   1st Qu.:37.90   1st Qu.:20.79  
##  Median :39.66   Median :20.00   Median :40.50   Median :22.10  
##  Mean   :40.26   Mean   :20.34   Mean   :40.42   Mean   :22.27  
##  3rd Qu.:43.07   3rd Qu.:21.50   3rd Qu.:43.26   3rd Qu.:23.29  
##  Max.   :63.36   Max.   :29.86   Max.   :56.03   Max.   :29.24  
##                                                                 
##       RH_3             T4             RH_4             T5       
##  Min.   :28.77   Min.   :15.10   Min.   :27.66   Min.   :15.33  
##  1st Qu.:36.90   1st Qu.:19.53   1st Qu.:35.53   1st Qu.:18.28  
##  Median :38.53   Median :20.67   Median :38.40   Median :19.39  
##  Mean   :39.24   Mean   :20.86   Mean   :39.03   Mean   :19.59  
##  3rd Qu.:41.76   3rd Qu.:22.10   3rd Qu.:42.16   3rd Qu.:20.62  
##  Max.   :50.16   Max.   :26.20   Max.   :51.09   Max.   :25.80  
##                                                                 
##       RH_5             T6              RH_6             T7       
##  Min.   :29.82   Min.   :-6.065   Min.   : 1.00   Min.   :15.39  
##  1st Qu.:45.40   1st Qu.: 3.627   1st Qu.:30.02   1st Qu.:18.70  
##  Median :49.09   Median : 7.300   Median :55.29   Median :20.03  
##  Mean   :50.95   Mean   : 7.911   Mean   :54.61   Mean   :20.27  
##  3rd Qu.:53.66   3rd Qu.:11.256   3rd Qu.:83.23   3rd Qu.:21.60  
##  Max.   :96.32   Max.   :28.290   Max.   :99.90   Max.   :26.00  
##                                                                  
##       RH_7             T8             RH_8             T9       
##  Min.   :23.20   Min.   :16.31   Min.   :29.60   Min.   :14.89  
##  1st Qu.:31.50   1st Qu.:20.79   1st Qu.:39.07   1st Qu.:18.00  
##  Median :34.86   Median :22.10   Median :42.38   Median :19.39  
##  Mean   :35.39   Mean   :22.03   Mean   :42.94   Mean   :19.49  
##  3rd Qu.:39.00   3rd Qu.:23.39   3rd Qu.:46.54   3rd Qu.:20.60  
##  Max.   :51.40   Max.   :27.23   Max.   :58.78   Max.   :24.50  
##                                                                 
##       RH_9           T_out         Press_mm_hg        RH_out      
##  Min.   :29.17   Min.   :-5.000   Min.   :729.3   Min.   : 24.00  
##  1st Qu.:38.50   1st Qu.: 3.667   1st Qu.:750.9   1st Qu.: 70.33  
##  Median :40.90   Median : 6.917   Median :756.1   Median : 83.67  
##  Mean   :41.55   Mean   : 7.412   Mean   :755.5   Mean   : 79.75  
##  3rd Qu.:44.34   3rd Qu.:10.408   3rd Qu.:760.9   3rd Qu.: 91.67  
##  Max.   :53.33   Max.   :26.100   Max.   :772.3   Max.   :100.00  
##                                                                   
##    Windspeed        Visibility      Tdewpoint           rv1          
##  Min.   : 0.000   Min.   : 1.00   Min.   :-6.600   Min.   : 0.00532  
##  1st Qu.: 2.000   1st Qu.:29.00   1st Qu.: 0.900   1st Qu.:12.49789  
##  Median : 3.667   Median :40.00   Median : 3.433   Median :24.89765  
##  Mean   : 4.040   Mean   :38.33   Mean   : 3.761   Mean   :24.98803  
##  3rd Qu.: 5.500   3rd Qu.:40.00   3rd Qu.: 6.567   3rd Qu.:37.58377  
##  Max.   :14.000   Max.   :66.00   Max.   :15.500   Max.   :49.99653  
##                                                                      
##       rv2          
##  Min.   : 0.00532  
##  1st Qu.:12.49789  
##  Median :24.89765  
##  Mean   :24.98803  
##  3rd Qu.:37.58377  
##  Max.   :49.99653  
## 
dim(energy_data)
## [1] 19735    29
# Take a sample of 5000 instances for shapiro wilk's test later
energy_data_5K <- sample_n(energy_data, 5000)

lmod <- lm(Appliances ~ T_out + RH_out + Windspeed + Tdewpoint, data=energy_data_5K)

summary(lmod)
## 
## Call:
## lm(formula = Appliances ~ T_out + RH_out + Windspeed + Tdewpoint, 
##     data = energy_data_5K)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -104.53  -48.45  -29.44   -4.43  988.86 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 291.0093    58.3886   4.984 6.44e-07 ***
## T_out        -5.9884     2.7624  -2.168 0.030221 *  
## RH_out       -2.2887     0.6050  -3.783 0.000157 ***
## Windspeed     2.4125     0.6266   3.850 0.000120 ***
## Tdewpoint     6.6857     2.8754   2.325 0.020104 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 105 on 4995 degrees of freedom
## Multiple R-squared:  0.02732,    Adjusted R-squared:  0.02655 
## F-statistic: 35.08 on 4 and 4995 DF,  p-value: < 2.2e-16

Output linear model to confirm significance of selected independent variables. Temperature and dewpoint do not appear appears as value as humidity and wind speed to the model predictions.

energy_data_5K %>%
 ggplot(aes(x = (T_out), y = (Appliances))) +
 geom_point(colour = "red") +
 geom_smooth(method = "lm", fill = NA)
## `geom_smooth()` using formula 'y ~ x'

Plot of the outside temperature to the energy used by the appliances shows a slightly positive linear relationship. The plot does show many outliers.

energy_data_5K %>%
 ggplot(aes(x = sqrt(T_out), y = sqrt(Appliances))) +
 geom_point(colour = "red") +
 geom_smooth(method = "lm", fill = NA)
## Warning in sqrt(T_out): NaNs produced

## Warning in sqrt(T_out): NaNs produced

## Warning in sqrt(T_out): NaNs produced
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 300 rows containing non-finite values (stat_smooth).
## Warning: Removed 300 rows containing missing values (geom_point).

Applying square root to the outside temperature and energy used to perhaps identify a clearer linear relationship. Unfortunately, the resulting plot is no clearer of a linear relationship than the initial plot.

plot(fitted(lmod), residuals(lmod), xlab="Fitted", ylab="Residuals")
abline(h=0)

The fitted vs residuals plot shows that the variance in residuals is relatively even across the predicted values.

plot(fitted(lmod), sqrt(abs(residuals(lmod))), xlab="Fitted", ylab="Residuals Square Root")

Attempting the fitted vs residuals plot by taking the square root of the residuals. The plot does not reveal anything more meaningful than the first fitted vs residuals plot.

plot(energy_data_5K$T_out, residuals(lmod), xlab="Temperature Outside", ylab="Residuals")
abline(h=0)

plot(energy_data_5K$RH_out, residuals(lmod), xlab="Humidity Outside", ylab="Residuals")
abline(h=0)

The above plots show two of the predictor variables against the residuals. No significance in temperature impacts the size of the residuals, and similar for humidity. The measure of residuals appears constant throughout the amount of humidity.

shapiro.test(residuals(lmod))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(lmod)
## W = 0.58785, p-value < 2.2e-16

Applying the Shapiro-Wilk’s test to the residuals to check for the normality. Based on the result of the p-value being less than 0.05, the distribution of the given data is considered different from normal distribution significantly. I think that’s a fair assessment as the data didn’t necessarily follow a normal distribution.

Conclusion

Unusual observations, or outliers, can meaningfully change the fit of the model. Also, using plots can help determine if the data follows a normal distribution to perform proper diagnostics. The above implementation shows just how much real-world data is not going to fit basic normality and thus require further investigation and application of additional techniques to build proper regression models. As the initial summary of the data indicates, more variables are available for the regression model. Those values represent the temperature and humidity of different rooms within the house. The opportunity for improved models requires time and patience to work through the different diagnostics and comparison of different linear regression models.