Question

Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis.)

  • Predictor: Speed
  • Response Variable: stopping distance

Data Summary

Let’s check the data. Our objective is to explain “stopping distance” by Speed.

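library(psych)    # provides describe(), cor.plot() and pairs.panels() used below
dataCar <- cars   # assumption: dataCar is a copy of R's built-in cars dataset
head(dataCar)
describe(dataCar)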
summary(dataCar)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00
tibble::glimpse(dataCar)
## Observations: 50
## Variables: 2
## $ speed <dbl> 4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13...
## $ dist  <dbl> 2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28...
attach(dataCar)
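head(dataCar)

Checking correlation

Let’s check how strongly speed and distance are correlated. With a single predictor there is no collinearity to check, since collinearity concerns relationships among several predictors. What matters here is the correlation between the predictor and the response.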

cor(dataCar)
##           speed      dist
## speed 1.0000000 0.8068949
## dist  0.8068949 1.0000000
cor.plot(dataCar,numbers=TRUE)

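The plot above shows a correlation of about 0.81 between speed and distance, which is a strong positive linear relationship. This is a correlation coefficient, not a percentage of explained variation: the proportion of variation in distance that speed accounts for is the square of the correlation, 0.81^2 ≈ 0.65.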
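
The link between the correlation coefficient and the explained variation can be checked directly. The following is a minimal sketch that assumes dataCar is a copy of R's built-in cars dataset; the names r and r_squared are only illustrative.

```r
# Assumption: dataCar is a copy of the built-in cars dataset.
dataCar <- cars

# Pearson correlation between the predictor and the response
r <- cor(dataCar$speed, dataCar$dist)

# Squaring it gives the share of variation in dist that a straight-line
# fit on speed accounts for (the R-squared of a simple linear regression).
r_squared <- r^2

print(round(c(r = r, r_squared = r_squared), 4))
```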

pairs.panels(dataCar)

In the plot above, the histogram of speed looks roughly symmetric and bell-shaped, which suggests that Speed is approximately normally distributed.

Regression Model

We will use the lm() function to build our regression model and see how well we can explain “Distance” as a function of “Speed”. In this scenario we have:
  • Response variable: “Distance”
  • Predictor variable: Speed

The simplest regression model is a straight line. It has the mathematical form:
y = a0 + a1 * x1
This is the same as y = m * x + c, where m is the slope and c is a constant. Here:
  • x1 is the input to the system,
  • a0 is the y-intercept of the line,
  • a1 is the slope, and
  • y is the output value the model predicts.

In our case we can write:
Predicted Distance = yIntercept + slope of line * Actual value of Speed
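
For a single predictor, the least-squares intercept and slope have a closed form, so they can be computed without a modelling function. A minimal sketch, assuming dataCar is a copy of the built-in cars dataset; slope_hat and intercept_hat are illustrative names.

```r
# Assumption: dataCar is a copy of the built-in cars dataset.
dataCar <- cars
x <- dataCar$speed
y <- dataCar$dist

# Least-squares slope: covariance of x and y divided by the variance of x
slope_hat <- cov(x, y) / var(x)

# The fitted line passes through the point of means (mean(x), mean(y))
intercept_hat <- mean(y) - slope_hat * mean(x)

print(c(intercept = intercept_hat, slope = slope_hat))
```

These two values should agree with the coefficients that lm() reports for dist ~ speed.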

model_dis_speed = lm(dist~speed,dataCar)
model_dis_speed
## 
## Call:
## lm(formula = dist ~ speed, data = dataCar)
## 
## Coefficients:
## (Intercept)        speed  
##     -17.579        3.932

We get an intercept of a0 = -17.579 and a slope of a1 = 3.932, so the fitted regression line is Distance = -17.579 + 3.932 * Speed. Each additional unit of speed increases the predicted stopping distance by about 3.9 units.
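
To see how the fitted line turns a speed into a predicted distance, the sketch below compares predict() with the formula evaluated by hand. It assumes dataCar is a copy of the built-in cars dataset; fit, new_speeds, and the chosen speeds are illustrative.

```r
# Assumption: dataCar is a copy of the built-in cars dataset.
dataCar <- cars
fit <- lm(dist ~ speed, data = dataCar)

# Hypothetical speeds at which we want a predicted stopping distance
new_speeds <- data.frame(speed = c(10, 15, 20))

# Prediction through the model object
pred <- predict(fit, newdata = new_speeds)

# The same prediction written out as intercept + slope * speed
by_hand <- coef(fit)[["(Intercept)"]] + coef(fit)[["speed"]] * new_speeds$speed

print(cbind(new_speeds, predicted_dist = pred, by_hand = by_hand))
```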

Evaluating the Quality of the Model

summary(model_dis_speed)
## 
## Call:
## lm(formula = dist ~ speed, data = dataCar)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

The residual quantiles above give a first impression, but we need to check whether the residuals are approximately normally distributed. We can use a scatter plot of the residuals and a Q-Q plot to see this.

Residuals vs Fitted



plot(model_dis_speed$fitted.values, model_dis_speed$residuals, xlab='Fitted Values', ylab='Residuals')
abline(0,0)

hist(model_dis_speed$residuals)

plot_ss(x = speed, y = dist, showSquares = TRUE)

## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##     -17.579        3.932  
## 
## Sum of Squares:  11353.52

The scatter plot of the residuals suggests that they do not form any pattern and vary randomly, which satisfies the requirement of the linear regression model. The data points are spread roughly evenly on both sides of 0 on the y axis.

The histogram of the residuals is roughly bell-shaped, although the residual quantiles (minimum -29.07, median -2.27, maximum 43.20) hint at a mild right skew.

QQ Plot

A Q-Q plot takes the sample data, sorts it in ascending order, and plots it against quantiles calculated from a theoretical distribution. Here, the sorted residuals are plotted against the quantiles of a normal distribution.

If the residuals are normally distributed, they line up closely along the straight line with little deviation. Strong or sharp deviations, especially in the tails, indicate that the residuals are not normally distributed, for example because of skew or outliers.

qqnorm(resid(model_dis_speed))
qqline(resid(model_dis_speed))
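
The construction described above can be reproduced step by step. The following is a sketch under the assumption that dataCar is a copy of the built-in cars dataset; the object names are illustrative. It sorts the residuals, pairs them with normal quantiles from ppoints(), and draws a reference line through the first and third quartiles, which is the convention qqline() uses by default.

```r
# Assumption: dataCar is a copy of the built-in cars dataset.
dataCar <- cars
fit <- lm(dist ~ speed, data = dataCar)
res <- resid(fit)
n <- length(res)

# Sample quantiles: the residuals sorted in ascending order
sample_q <- sort(res)

# Theoretical quantiles: normal quantiles at evenly spaced probabilities
theo_q <- qnorm(ppoints(n))

plot(theo_q, sample_q,
     xlab = "Theoretical Quantiles", ylab = "Sample Quantiles",
     main = "Q-Q plot built by hand")

# Reference line through the first and third quartiles
q_res <- quantile(res, c(0.25, 0.75), names = FALSE)
q_norm <- qnorm(c(0.25, 0.75))
line_slope <- diff(q_res) / diff(q_norm)
line_intercept <- q_res[1] - line_slope * q_norm[1]
abline(line_intercept, line_slope)
```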


Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12


R-squared ranges between 0 and 1, where 1 indicates a perfect fit and 0 indicates that the regression line explains none of the variation in the response.

The R-squared value of 0.6511 means that our regression line explains about 65% of the variance in Distance. The remaining 35% is not explained by Speed alone and is left in the residuals. In simple linear regression the F-statistic adds no information beyond the t-test on the slope: F equals the square of the slope’s t value (9.464^2 ≈ 89.57) and has the same p-value of 1.49e-12.
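
The numbers in the summary can be rebuilt from the sums of squares. The sketch below assumes dataCar is a copy of the built-in cars dataset; rss, tss, and the other names are illustrative.

```r
# Assumption: dataCar is a copy of the built-in cars dataset.
dataCar <- cars
fit <- lm(dist ~ speed, data = dataCar)

rss <- sum(resid(fit)^2)                            # residual sum of squares
tss <- sum((dataCar$dist - mean(dataCar$dist))^2)   # total sum of squares

# R-squared: share of the total variation removed by the regression line
r_squared <- 1 - rss / tss

# Residual standard error: typical size of a residual, using n - 2 degrees of freedom
rse <- sqrt(rss / df.residual(fit))

# With one predictor, the F-statistic is the square of the slope's t value
t_slope <- summary(fit)$coefficients["speed", "t value"]
f_stat <- summary(fit)$fstatistic[["value"]]

print(round(c(RSS = rss, R2 = r_squared, RSE = rse,
              t_squared = t_slope^2, F = f_stat), 4))
```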

Regression line

The plot below shows Distance against Speed together with the fitted regression line.

plot(dist ~ speed, main = "Relationship")
abline(model_dis_speed)

Plot Analysis

par(mfrow=c(2,2)) # Change the panel layout to 2 x 2
plot(model_dis_speed)

“Residuals vs Fitted”: In the plot above, most data points show the random scatter of residuals required for a linear regression model. The only exceptions are the three points labelled as potential outliers: observations 23, 35, and 49.

Such outliers can be important, so how do we tell whether they matter? We switch our focus to “Residuals vs Leverage” and look at the dashed red lines, which mark contours of Cook’s distance. No data points fall beyond the dashed lines, which means that all points are within an acceptable range of Cook’s distance.

“Residuals vs Leverage”: This plot helps us find influential cases, if any. Not all outliers are influential in linear regression analysis. Even when the data contain extreme values, those values might not be influential in determining the regression line. That means the results would not be much different whether we include or exclude them from the analysis. They follow the trend of the majority of cases and do not really matter. On the other hand, some cases can be very influential even if they appear to be within a reasonable range of values.
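
Influence can also be inspected numerically with cooks.distance(). The following is a sketch, assuming dataCar is a copy of the built-in cars dataset. The cut-off of 0.5 is an illustrative rule of thumb, chosen because it matches the inner contour level that plot.lm() draws by default; it is not part of the original analysis.

```r
# Assumption: dataCar is a copy of the built-in cars dataset.
dataCar <- cars
fit <- lm(dist ~ speed, data = dataCar)

# Cook's distance for every observation
cd <- cooks.distance(fit)

# The three observations with the largest influence
top3 <- head(sort(cd, decreasing = TRUE), 3)
print(round(top3, 3))

# Illustrative threshold (assumption): flag observations above 0.5
threshold <- 0.5
flagged <- names(cd)[cd > threshold]
print(flagged)
```

An empty result for flagged agrees with the reading of the plot: no observation crosses the dashed contours.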

Scale-Location: It’s also called Spread-Location plot. This plot shows if residuals are spread equally along the ranges of predictors. This is how you can check the assumption of equal variance (homoscedasticity). It’s good if you see a horizontal line with equally (randomly) spread points.

Here the residuals are spread randomly around the red line. Since we don’t see a prominent pattern, the plot supports the assumption of equal variance in the residuals (homoscedasticity).
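
To make clear what the Scale-Location panel displays, it can be rebuilt by hand: the square root of the absolute standardized residuals plotted against the fitted values. A minimal sketch, assuming dataCar is a copy of the built-in cars dataset; the lowess() smoother only approximates the red line in the built-in panel.

```r
# Assumption: dataCar is a copy of the built-in cars dataset.
dataCar <- cars
fit <- lm(dist ~ speed, data = dataCar)

# Standardized residuals, then the square root of their absolute value
std_res <- rstandard(fit)
y_sl <- sqrt(abs(std_res))

plot(fitted(fit), y_sl,
     xlab = "Fitted values",
     ylab = "sqrt(|Standardized residuals|)",
     main = "Scale-Location by hand")

# A roughly flat smoother suggests constant residual variance
lines(lowess(fitted(fit), y_sl), col = "red")
```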

qqnorm(model_dis_speed$residuals)
qqline(model_dis_speed$residuals)  # adds diagonal line to the normal prob plot

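Model with Log Term

We extend our model with log(speed) and then try to explain Distance again. Note that the formula speed * log(speed) expands to speed + log(speed) + speed:log(speed), so this model adds a log term and its interaction with speed rather than a quadratic term.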

model2_dis_speed = lm(dist~speed*log(speed),dataCar)
summary(model2_dis_speed)
## 
## Call:
## lm(formula = dist ~ speed * log(speed), data = dataCar)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -28.246  -9.161  -3.083   5.580  45.294 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)        -11.93      45.94  -0.260    0.796
## speed              -35.57      46.30  -0.768    0.446
## log(speed)          77.58     115.83   0.670    0.506
## speed:log(speed)     9.20      10.29   0.894    0.376
## 
## Residual standard error: 15.32 on 46 degrees of freedom
## Multiple R-squared:  0.668,  Adjusted R-squared:  0.6464 
## F-statistic: 30.85 on 3 and 46 DF,  p-value: 4.392e-11
par(mfrow=c(2,2)) # Change the panel layout to 2 x 2
plot(model2_dis_speed)
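
A polynomial alternative to the log terms is a quadratic in speed. The sketch below is an addition for comparison and not part of the original analysis. It assumes dataCar is a copy of the built-in cars dataset, and fit_linear and fit_quad are illustrative names. Because the two models are nested, anova() gives an F-test of whether the squared term improves the fit.

```r
# Assumption: dataCar is a copy of the built-in cars dataset.
dataCar <- cars

fit_linear <- lm(dist ~ speed, data = dataCar)

# I() protects the square so it is treated as arithmetic, not formula syntax
fit_quad <- lm(dist ~ speed + I(speed^2), data = dataCar)

# Adjusted R-squared penalises the extra coefficient
print(c(linear = summary(fit_linear)$adj.r.squared,
        quadratic = summary(fit_quad)$adj.r.squared))

# F-test for the nested models: does the squared term help?
cmp <- anova(fit_linear, fit_quad)
print(cmp)
```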

Conclusion

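With the additional log terms we were able to explain about 67% of the variance in Distance (R-squared 0.668), which is only a small improvement over the simple linear model (0.6511). Adjusted R-squared barely moves (0.6464 versus 0.6438), and none of the individual coefficients in the extended model is significant, so the extra terms are hard to justify. Given the limited data, we can predict Distance from Speed only with a noticeable margin of error (residual standard error of about 15). If more data were available, we might be able to predict Distance from Speed more accurately.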