Residual Analysis
Before starting the residual analysis, let's understand what residuals are.
What are residuals?
Residuals are the differences between the measured output from the validation data set and the one-step-predicted output from the model. They are also loosely called errors.
Residual = Observed value - Predicted value
\[residuals = y - \hat y\]
- residuals = error
- y = Observed value
- \(\hat y\) = Predicted value
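As a minimal sketch of the formula, the snippet below subtracts predicted values from observed values. The numbers and the names `observed` and `predicted` are hypothetical, chosen only to illustrate the sign convention.

```r
# Hypothetical observed and predicted values (not from any real model)
observed  <- c(3, 5, 7)
predicted <- c(2.5, 5.5, 7)

# Residual = Observed value - Predicted value
res <- observed - predicted
print(res)  # residuals are 0.5, -0.5 and 0
```

A positive residual means the model under-predicted that observation, and a negative residual means it over-predicted.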
What is residual analysis used for?
Residual analysis is a diagnostic tool for assessing the quality of a statistical or machine learning model. Because residuals are the part of the observed data that the model fails to explain, any systematic pattern in them points to a problem with the model.
Residual Plot
A residual plot shows the residuals on the vertical axis against the independent variable (or the fitted values) on the horizontal axis. If the residuals scatter randomly around zero with no visible pattern, a linear model is appropriate; if they follow a systematic pattern, the linear model is not a good fit for the data.
We will work through an example to get more insight.
Load data
A laboratory (Smith Scientific Services, Akron, OH) conducted an experiment that produced a data set (treadwear.txt) containing the mileage driven (x, in 1000 miles) and the depth of the remaining groove (y, in mils). The resulting data:
##   mileage groove
## 1       0 394.33
## 2       4 329.50
## 3       8 291.00
## 4      12 255.17
## 5      16 229.33
## 6      20 204.83
## 7      24 179.00
## 8      28 163.83
## 9      32 150.33
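The code that loaded the data is not shown, so here is a minimal sketch that rebuilds the same data frame by hand from the printed values. The name `data` matches the `lm()` call used later; everything else is an assumption, and in practice the file treadwear.txt would presumably be read from disk instead.

```r
# Values copied from the printed output above; the original loading
# step is not shown, so the data frame is typed in directly.
data <- data.frame(
  mileage = seq(0, 32, by = 4),               # in 1000 miles
  groove  = c(394.33, 329.50, 291.00, 255.17, # remaining depth in mils
              229.33, 204.83, 179.00, 163.83, 150.33)
)

print(data)
```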
A scatter plot of the data suggests a roughly linear relation between mileage and groove depth. Let's fit a linear model to get the predicted values.
##
## Call:
## lm(formula = groove ~ mileage, data = data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -18.099 -11.392  -6.902   7.051  33.693
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 360.6367    11.6886   30.85 9.70e-09 ***
## mileage      -7.2806     0.6138  -11.86 6.87e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.02 on 7 degrees of freedom
## Multiple R-squared: 0.9526, Adjusted R-squared: 0.9458
## F-statistic: 140.7 on 1 and 7 DF, p-value: 6.871e-06
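The summary above could be reproduced with something like the following sketch, which also computes the residuals by hand as observed minus predicted and compares them with what `residuals()` returns. The object name `fit` is chosen here for illustration, and the data frame is retyped from the printed values so that the snippet runs on its own.

```r
# Data retyped from the printed table (assumption: values as displayed)
data <- data.frame(
  mileage = seq(0, 32, by = 4),
  groove  = c(394.33, 329.50, 291.00, 255.17,
              229.33, 204.83, 179.00, 163.83, 150.33)
)

# Fit the simple linear regression groove ~ mileage
fit <- lm(groove ~ mileage, data = data)
print(summary(fit))

# Residual = Observed value - Predicted value
predicted  <- fitted(fit)
res_manual <- data$groove - predicted

# The hand-computed residuals should agree with residuals(fit)
print(all.equal(unname(res_manual), unname(residuals(fit))))
print(round(res_manual, 3))
```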
Apply fitted line
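# Scatter plot of the raw data
plot(data$mileage, data$groove, col = "darkblue", main = "Regression Plot")
# Overlay the least-squares line from the linear model
abline(lm(groove ~ mileage, data = data), col = "darkred")

The linear model above shows: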
- A very low p-value (6.87e-06), so the effect of mileage is statistically significant
- A high R-squared: about 95% of the variability in groove depth is explained by the model
- Standard errors that are small relative to the coefficient estimates
The model looks fine so far. We will run a residual analysis to check whether it really is a good fit.
Plot residuals
The residual plot shows a clear curved pattern rather than a random scatter around zero, which indicates that a non-linear model would describe the relationship between the two variables better. Note also that the \(R^2\) value is very high. This is a good example of why a large \(R^2\) value should not be interpreted as meaning that the estimated regression line fits the data well.
So the linear model is not a good fit here.
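A residual plot of this kind could be drawn along the following lines. This is a sketch under assumptions: the data frame is retyped from the printed values, and the names `fit` and `res` are chosen here for illustration. The last line prints the sign of each residual, which makes the systematic pattern visible even without the plot.

```r
# Data retyped from the printed table (assumption: values as displayed)
data <- data.frame(
  mileage = seq(0, 32, by = 4),
  groove  = c(394.33, 329.50, 291.00, 255.17,
              229.33, 204.83, 179.00, 163.83, 150.33)
)

fit <- lm(groove ~ mileage, data = data)
res <- residuals(fit)

# Residuals against the independent variable, with a reference line at zero
plot(data$mileage, res, col = "darkblue",
     xlab = "mileage", ylab = "residual", main = "Residual Plot")
abline(h = 0, lty = 2, col = "darkred")

# Signs of the residuals in order of increasing mileage
print(sign(res))
```

With a well-specified linear model the signs would alternate irregularly; here the residual is positive at the lowest mileage, negative through the middle of the range, and positive again at the highest mileages.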
Summary
In this analysis we have learned:
- What residuals are
- How to analyze a residual plot
- How residuals help identify specific problems with a model