Residual Analysis

Before starting residual analysis, let's understand what residuals are.

What are Residuals?

Residuals are differences between the one-step-predicted output from the model and the measured output from the validation data set. They are also known as errors.

Residual = Observed value - Predicted value

\[residuals = y - \hat y\]

residuals = error
y = Observed value
\(\hat y\) = Predicted value
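As a minimal sketch of the formula above, residuals can be computed by plain subtraction. The numbers here are made up for illustration and are not taken from the treadwear data used later.

```r
# Hypothetical observed and predicted values (illustrative only)
y     <- c(3.0, 5.0, 7.0, 9.0)   # observed values
y_hat <- c(2.5, 5.5, 7.0, 9.5)   # predicted values

# Residual = Observed value - Predicted value
residuals <- y - y_hat
print(residuals)
```

A positive residual means the model under-predicted that observation; a negative one means it over-predicted.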

What is residual analysis used for?

Residuals in a statistical or machine learning model are the differences between observed and predicted values of data. They are a diagnostic measure used when assessing the quality of a model.

Residual Plot

A residual plot shows the residuals on the vertical axis against the independent variable (or the fitted values) on the horizontal axis. If the residuals scatter randomly around zero, a linear model is appropriate. If they follow a systematic pattern, such as a curve, a linear model is not a good fit for the data.
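To make the difference between a random and a systematic pattern concrete, here is a small sketch on simulated data. Everything in it (the variable names, the seed, the two made-up relationships) is an assumption for illustration only and is not part of the treadwear example.

```r
# Illustrative data only: one truly linear relationship, one curved one
set.seed(1)                          # arbitrary seed, for reproducibility
x        <- 1:10
y_linear <- 2 + 3 * x + rnorm(10)    # straight line plus random noise
y_curved <- x^2                      # curved relationship, no noise

# Fit a straight line to each and keep the residuals
res_linear <- resid(lm(y_linear ~ x))
res_curved <- resid(lm(y_curved ~ x))

# Side-by-side residual plots; the dashed line marks zero
par(mfrow = c(1, 2))
plot(x, res_linear, main = "Linear data", ylab = "Residual")
abline(h = 0, lty = 2)
plot(x, res_curved, main = "Curved data", ylab = "Residual")
abline(h = 0, lty = 2)
par(mfrow = c(1, 1))
```

For the curved data the residuals are positive at both ends of x and negative in the middle, which is the kind of systematic pattern to look for.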

We will work through an example to get more insight.

Load data

A laboratory (Smith Scientific Services, Akron, OH) conducted an experiment that produced a data set (treadwear.txt) containing the mileage driven (x, in 1000 miles) and the depth of the remaining groove (y, in mils). The data are loaded and printed below:

data <- read.csv("/Users/subhalaxmirout/DATA 621/treadwear.csv")
data

##   mileage groove
## 1       0 394.33
## 2       4 329.50
## 3       8 291.00
## 4      12 255.17
## 5      16 229.33
## 6      20 204.83
## 7      24 179.00
## 8      28 163.83
## 9      32 150.33
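Because the CSV path above is specific to the author's machine, the same nine rows can also be typed in directly. This is just a convenience sketch that copies the values from the printed output; it reuses the name `data` so that the code in the rest of this analysis runs unchanged.

```r
# The nine observations printed above, entered by hand
data <- data.frame(
  mileage = c(0, 4, 8, 12, 16, 20, 24, 28, 32),        # in 1000 miles
  groove  = c(394.33, 329.50, 291.00, 255.17, 229.33,
              204.83, 179.00, 163.83, 150.33)           # in mils
)
str(data)
```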

plot(data$mileage, data$groove, col = "darkblue", main = "Mileage vs Groove")

The plot above suggests a roughly linear relationship between mileage and groove depth. Let's fit a linear model to get the predicted values.

lm <- lm(groove ~ mileage, data = data)
summary(lm)

## 
## Call:
## lm(formula = groove ~ mileage, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.099 -11.392  -6.902   7.051  33.693 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 360.6367    11.6886   30.85 9.70e-09 ***
## mileage      -7.2806     0.6138  -11.86 6.87e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.02 on 7 degrees of freedom
## Multiple R-squared:  0.9526, Adjusted R-squared:  0.9458 
## F-statistic: 140.7 on 1 and 7 DF,  p-value: 6.871e-06

Add the fitted line

plot(data$mileage, data$groove, col = "darkblue", main = "Regression Plot")
abline(lm, col = "darkred")

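The linear model above shows:

A low p-value (6.87e-06), so the slope is statistically significant
A high R-squared: about 95% of the variability in groove depth is explained by the model
A small standard error for the slope (0.6138) relative to its estimate (-7.2806)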
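The figures quoted above can also be pulled out of the model summary programmatically rather than read off the printout. The sketch below retypes the data so that it runs on its own; the names `fit`, `s` and `coefs` are chosen for this example and are not from the original code.

```r
# Treadwear data, copied from the printed output
data <- data.frame(
  mileage = c(0, 4, 8, 12, 16, 20, 24, 28, 32),
  groove  = c(394.33, 329.50, 291.00, 255.17, 229.33,
              204.83, 179.00, 163.83, 150.33)
)

fit   <- lm(groove ~ mileage, data = data)
s     <- summary(fit)
coefs <- coef(s)   # matrix: Estimate, Std. Error, t value, Pr(>|t|)

slope_p  <- coefs["mileage", "Pr(>|t|)"]    # p-value of the slope
slope_se <- coefs["mileage", "Std. Error"]  # standard error of the slope
r2       <- s$r.squared                     # multiple R-squared
resid_se <- s$sigma                         # residual standard error

print(c(p_value = slope_p, slope_se = slope_se,
        r_squared = r2, resid_se = resid_se))
```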

The model looks fine so far. We will use residual analysis to check whether it really is a good fit.

Plot residuals

residuals <- resid(lm)
plot(data$mileage, residuals, col = "darkblue", main = "Residual Plot")

hist(residuals)

plot(lm)

The residual plot above shows a systematic pattern rather than random scatter: the residuals are positive at the lowest and highest mileages and negative in between. This indicates that a non-linear model would better describe the relationship between the two variables. Note also that the \(R^2\) value is very high. This is a good example of why a large \(R^2\) value should not be interpreted as meaning that the estimated regression line fits the data well.
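The pattern in the residual plot can also be checked numerically by listing each residual next to its mileage. This is a sketch that retypes the data so it runs on its own; `fit` and `res` are names chosen for this example.

```r
# Treadwear data, copied from the printed output
data <- data.frame(
  mileage = c(0, 4, 8, 12, 16, 20, 24, 28, 32),
  groove  = c(394.33, 329.50, 291.00, 255.17, 229.33,
              204.83, 179.00, 163.83, 150.33)
)

fit <- lm(groove ~ mileage, data = data)
res <- resid(fit)

# Residuals in mileage order, rounded for readability
print(data.frame(mileage = data$mileage, residual = round(res, 2)))

# Sign of each residual: +1 above the fitted line, -1 below it
print(sign(res))
```

A run of same-signed residuals in the middle, flanked by the opposite sign at the ends, is what a curved relationship looks like when a straight line is forced through it.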

So a linear model is not a good fit here.
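One possible way to follow up, not part of the original analysis, is to add a squared mileage term and look at the residuals again. The choice of a quadratic is an assumption made for this sketch; the analysis above only establishes that a straight line is inadequate, not which non-linear form is right.

```r
# Treadwear data, copied from the printed output
data <- data.frame(
  mileage = c(0, 4, 8, 12, 16, 20, 24, 28, 32),
  groove  = c(394.33, 329.50, 291.00, 255.17, 229.33,
              204.83, 179.00, 163.83, 150.33)
)

# Straight-line fit and a hypothetical quadratic alternative
linear    <- lm(groove ~ mileage, data = data)
quadratic <- lm(groove ~ mileage + I(mileage^2), data = data)

# Compare R-squared, then inspect the quadratic model's residuals
print(c(linear    = summary(linear)$r.squared,
        quadratic = summary(quadratic)$r.squared))

plot(data$mileage, resid(quadratic), col = "darkblue",
     main = "Residual Plot (quadratic sketch)",
     xlab = "mileage", ylab = "residuals")
abline(h = 0, lty = 2)
```

If the curved pattern disappears from this second residual plot, that supports the non-linear form; if a pattern remains, another transformation or model would be worth trying.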

Summary

In this analysis we have learned:

What residuals are
How to read a residual plot
How residuals identify specific problems with a model

Subhalaxmi Rout

12/20/2020