After building or creating a linear regression model and evaluating for estimation and goodness of fit, the data scientist should use regression diagnostics to truly ascertain the value of the model before using the model for future predictions.
The regression diagnostics include checking to see if the errors are independent, have equal variance and are normally distributed in order to confirm that the structural part of the model is correct. And lastly, unusual observations that don’t fit the model should be considered given that these observations could change the choice and fit of the model.
The ability to detect variance in the errors can be done easily with a plot of the residuals. The goal is to confirm homoscedasticity in the variance of the residuals.
Using the dataset found at https://www.kaggle.com/loveall/appliances-energy-prediction, the following implementation walks through several diagnostic techniques for properly evaluating a linear model. The linear model attempts to predict the amount of energy consumed by the appliances of a home based on the outside temperature, humidity, windspeed and dew point.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
energy_data <- read.csv("KAG_energydata_complete.csv")
energy_data <- energy_data[complete.cases(energy_data), ]
summary(energy_data)
## date Appliances lights T1
## 2016-01-11 17:00:00: 1 Min. : 10.00 Min. : 0.000 Min. :16.79
## 2016-01-11 17:10:00: 1 1st Qu.: 50.00 1st Qu.: 0.000 1st Qu.:20.76
## 2016-01-11 17:20:00: 1 Median : 60.00 Median : 0.000 Median :21.60
## 2016-01-11 17:30:00: 1 Mean : 97.69 Mean : 3.802 Mean :21.69
## 2016-01-11 17:40:00: 1 3rd Qu.: 100.00 3rd Qu.: 0.000 3rd Qu.:22.60
## 2016-01-11 17:50:00: 1 Max. :1080.00 Max. :70.000 Max. :26.26
## (Other) :19729
## RH_1 T2 RH_2 T3
## Min. :27.02 Min. :16.10 Min. :20.46 Min. :17.20
## 1st Qu.:37.33 1st Qu.:18.79 1st Qu.:37.90 1st Qu.:20.79
## Median :39.66 Median :20.00 Median :40.50 Median :22.10
## Mean :40.26 Mean :20.34 Mean :40.42 Mean :22.27
## 3rd Qu.:43.07 3rd Qu.:21.50 3rd Qu.:43.26 3rd Qu.:23.29
## Max. :63.36 Max. :29.86 Max. :56.03 Max. :29.24
##
## RH_3 T4 RH_4 T5
## Min. :28.77 Min. :15.10 Min. :27.66 Min. :15.33
## 1st Qu.:36.90 1st Qu.:19.53 1st Qu.:35.53 1st Qu.:18.28
## Median :38.53 Median :20.67 Median :38.40 Median :19.39
## Mean :39.24 Mean :20.86 Mean :39.03 Mean :19.59
## 3rd Qu.:41.76 3rd Qu.:22.10 3rd Qu.:42.16 3rd Qu.:20.62
## Max. :50.16 Max. :26.20 Max. :51.09 Max. :25.80
##
## RH_5 T6 RH_6 T7
## Min. :29.82 Min. :-6.065 Min. : 1.00 Min. :15.39
## 1st Qu.:45.40 1st Qu.: 3.627 1st Qu.:30.02 1st Qu.:18.70
## Median :49.09 Median : 7.300 Median :55.29 Median :20.03
## Mean :50.95 Mean : 7.911 Mean :54.61 Mean :20.27
## 3rd Qu.:53.66 3rd Qu.:11.256 3rd Qu.:83.23 3rd Qu.:21.60
## Max. :96.32 Max. :28.290 Max. :99.90 Max. :26.00
##
## RH_7 T8 RH_8 T9
## Min. :23.20 Min. :16.31 Min. :29.60 Min. :14.89
## 1st Qu.:31.50 1st Qu.:20.79 1st Qu.:39.07 1st Qu.:18.00
## Median :34.86 Median :22.10 Median :42.38 Median :19.39
## Mean :35.39 Mean :22.03 Mean :42.94 Mean :19.49
## 3rd Qu.:39.00 3rd Qu.:23.39 3rd Qu.:46.54 3rd Qu.:20.60
## Max. :51.40 Max. :27.23 Max. :58.78 Max. :24.50
##
## RH_9 T_out Press_mm_hg RH_out
## Min. :29.17 Min. :-5.000 Min. :729.3 Min. : 24.00
## 1st Qu.:38.50 1st Qu.: 3.667 1st Qu.:750.9 1st Qu.: 70.33
## Median :40.90 Median : 6.917 Median :756.1 Median : 83.67
## Mean :41.55 Mean : 7.412 Mean :755.5 Mean : 79.75
## 3rd Qu.:44.34 3rd Qu.:10.408 3rd Qu.:760.9 3rd Qu.: 91.67
## Max. :53.33 Max. :26.100 Max. :772.3 Max. :100.00
##
## Windspeed Visibility Tdewpoint rv1
## Min. : 0.000 Min. : 1.00 Min. :-6.600 Min. : 0.00532
## 1st Qu.: 2.000 1st Qu.:29.00 1st Qu.: 0.900 1st Qu.:12.49789
## Median : 3.667 Median :40.00 Median : 3.433 Median :24.89765
## Mean : 4.040 Mean :38.33 Mean : 3.761 Mean :24.98803
## 3rd Qu.: 5.500 3rd Qu.:40.00 3rd Qu.: 6.567 3rd Qu.:37.58377
## Max. :14.000 Max. :66.00 Max. :15.500 Max. :49.99653
##
## rv2
## Min. : 0.00532
## 1st Qu.:12.49789
## Median :24.89765
## Mean :24.98803
## 3rd Qu.:37.58377
## Max. :49.99653
##
dim(energy_data)
## [1] 19735 29
# Take a sample of 5000 instances for shapiro wilk's test later
energy_data_5K <- sample_n(energy_data, 5000)
lmod <- lm(Appliances ~ T_out + RH_out + Windspeed + Tdewpoint, data=energy_data_5K)
summary(lmod)
##
## Call:
## lm(formula = Appliances ~ T_out + RH_out + Windspeed + Tdewpoint,
## data = energy_data_5K)
##
## Residuals:
## Min 1Q Median 3Q Max
## -104.53 -48.45 -29.44 -4.43 988.86
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 291.0093 58.3886 4.984 6.44e-07 ***
## T_out -5.9884 2.7624 -2.168 0.030221 *
## RH_out -2.2887 0.6050 -3.783 0.000157 ***
## Windspeed 2.4125 0.6266 3.850 0.000120 ***
## Tdewpoint 6.6857 2.8754 2.325 0.020104 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 105 on 4995 degrees of freedom
## Multiple R-squared: 0.02732, Adjusted R-squared: 0.02655
## F-statistic: 35.08 on 4 and 4995 DF, p-value: < 2.2e-16
Output linear model to confirm significance of selected independent variables. Temperature and dewpoint do not appear appears as value as humidity and wind speed to the model predictions.
energy_data_5K %>%
ggplot(aes(x = (T_out), y = (Appliances))) +
geom_point(colour = "red") +
geom_smooth(method = "lm", fill = NA)
## `geom_smooth()` using formula 'y ~ x'
Plot of the outside temperature to the energy used by the appliances shows a slightly positive linear relationship. The plot does show many outliers.
energy_data_5K %>%
ggplot(aes(x = sqrt(T_out), y = sqrt(Appliances))) +
geom_point(colour = "red") +
geom_smooth(method = "lm", fill = NA)
## Warning in sqrt(T_out): NaNs produced
## Warning in sqrt(T_out): NaNs produced
## Warning in sqrt(T_out): NaNs produced
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 300 rows containing non-finite values (stat_smooth).
## Warning: Removed 300 rows containing missing values (geom_point).
Applying square root to the outside temperature and energy used to perhaps identify a clearer linear relationship. Unfortunately, the resulting plot is no clearer of a linear relationship than the initial plot.
plot(fitted(lmod), residuals(lmod), xlab="Fitted", ylab="Residuals")
abline(h=0)
The fitted vs residuals plot shows that the variance in residuals is relatively even across the predicted values.
plot(fitted(lmod), sqrt(abs(residuals(lmod))), xlab="Fitted", ylab="Residuals Square Root")
Attempting the fitted vs residuals plot by taking the square root of the residuals. The plot does not reveal anything more meaningful than the first fitted vs residuals plot.
plot(energy_data_5K$T_out, residuals(lmod), xlab="Temperature Outside", ylab="Residuals")
abline(h=0)
plot(energy_data_5K$RH_out, residuals(lmod), xlab="Humidity Outside", ylab="Residuals")
abline(h=0)
The above plots show two of the predictor variables against the residuals. No significance in temperature impacts the size of the residuals, and similar for humidity. The measure of residuals appears constant throughout the amount of humidity.
shapiro.test(residuals(lmod))
##
## Shapiro-Wilk normality test
##
## data: residuals(lmod)
## W = 0.58785, p-value < 2.2e-16
Applying the Shapiro-Wilk’s test to the residuals to check for the normality. Based on the result of the p-value being less than 0.05, the distribution of the given data is considered different from normal distribution significantly. I think that’s a fair assessment as the data didn’t necessarily follow a normal distribution.
Unusual observations, or outliers, can meaningfully change the fit of the model. Also, using plots can help determine if the data follows a normal distribution to perform proper diagnostics. The above implementation shows just how much real-world data is not going to fit basic normality and thus require further investigation and application of additional techniques to build proper regression models. As the initial summary of the data indicates, more variables are available for the regression model. Those values represent the temperature and humidity of different rooms within the house. The opportunity for improved models requires time and patience to work through the different diagnostics and comparison of different linear regression models.