Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis).
# Let's look at a summary of the data
summary(cars)
## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
colnames(cars)
## [1] "speed" "dist"
nrow(cars)
## [1] 50
The dataset has 2 columns (speed and dist) and 50 rows.
#Print the data
cars
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
## 7 10 18
## 8 10 26
## 9 10 34
## 10 11 17
## 11 11 28
## 12 12 14
## 13 12 20
## 14 12 24
## 15 12 28
## 16 13 26
## 17 13 34
## 18 13 34
## 19 13 46
## 20 14 26
## 21 14 36
## 22 14 60
## 23 14 80
## 24 15 20
## 25 15 26
## 26 15 54
## 27 16 32
## 28 16 40
## 29 17 32
## 30 17 40
## 31 17 50
## 32 18 42
## 33 18 56
## 34 18 76
## 35 18 84
## 36 19 36
## 37 19 46
## 38 19 68
## 39 20 32
## 40 20 48
## 41 20 52
## 42 20 56
## 43 20 64
## 44 22 66
## 45 23 54
## 46 24 70
## 47 24 92
## 48 24 93
## 49 24 120
## 50 25 85
#Correlation
cor(cars$dist,cars$speed)
## [1] 0.8068949
#Plot the spread
plot(x = cars$speed, y = cars$dist, main = "Cars Data", xlab = "Speed (mph)", ylab = "Distance (feet)")
Stopping distance increases as speed increases, and the correlation of about 0.81 indicates a strong positive linear association. It is therefore reasonable to model distance as a linear function of speed.
cars_model <- lm(cars$dist ~ cars$speed)
summary(cars_model)
##
## Call:
## lm(formula = cars$dist ~ cars$speed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## cars$speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
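Summary: The model's R-squared is 0.6511, so speed explains about 65% of the variation in stopping distance. The standard error of the speed coefficient (0.4155) is roughly one tenth of its estimate (3.9324), which gives a t value of 9.464 and a very small p-value (1.49e-12), so the slope is estimated precisely. The intercept is less precise: its standard error (6.7584) is more than a third of the magnitude of its estimate (-17.5791), although it is still significant at the 5% level (p = 0.0123).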
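The figures quoted in the summary can be reproduced from the model object. The following is a minimal sketch, assuming the built-in `cars` data; the names `fit`, `r`, `r_squared`, `coefs`, and `t_by_hand` are introduced here for illustration only. It relies on two standard facts: with a single predictor, R-squared equals the squared correlation, and each t value is the estimate divided by its standard error.

```r
# Illustrative check; all object names below are introduced for this sketch.
fit <- lm(dist ~ speed, data = cars)

# With a single predictor, R-squared is the squared Pearson correlation.
r <- cor(cars$dist, cars$speed)
r_squared <- summary(fit)$r.squared
print(c(from_correlation = r^2, from_model = r_squared))

# The t value is the estimate divided by its standard error.
coefs <- coef(summary(fit))
t_by_hand <- coefs[, "Estimate"] / coefs[, "Std. Error"]
print(cbind(coefs[, c("Estimate", "Std. Error")], ratio = t_by_hand))
```

The `ratio` column should match the `t value` column of the summary output, which is why the ratio for speed is close to 10 while the ratio for the intercept is much smaller in magnitude.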
plot(cars$speed, cars$dist, xlab = "Speed (mph)", ylab = "Distance (feet)", main = "Speed vs Stopping Distance")
abline(cars_model)
The fitted regression line is:
distance = -17.5791 + 3.9324 * speed
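The regression line can be used to predict stopping distance at a given speed. The sketch below is illustrative only: it refits the same model with a `data` argument (an assumption made here, because `predict()` needs to find a `speed` column in the new data, which does not work cleanly when the formula refers to `cars$speed`), and the speeds in `new_speeds` are hypothetical values chosen for the example.

```r
# Refit with a data argument so predict() can match the `speed` column in new data.
fit <- lm(dist ~ speed, data = cars)

# Hypothetical speeds (mph) chosen for illustration.
new_speeds <- data.frame(speed = c(10, 15, 20))

# Prediction from the model object.
predicted <- predict(fit, newdata = new_speeds)

# The same prediction computed by hand from intercept + slope * speed.
by_hand <- coef(fit)[["(Intercept)"]] + coef(fit)[["speed"]] * new_speeds$speed

print(cbind(new_speeds, predicted, by_hand))
```

At 15 mph, for example, the line gives -17.5791 + 3.9324 * 15, or about 41.4 feet. Predictions are only meaningful within the observed speed range of 4 to 25 mph; at very low speeds the negative intercept produces negative distances.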
summary(residuals(cars_model))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -29.069 -9.525 -2.272 0.000 9.215 43.201
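The mean of the residuals is zero, which is expected for any least-squares fit that includes an intercept. The median (-2.272) is slightly below zero and the largest positive residual (43.201) is bigger in magnitude than the largest negative one (-29.069), which suggests a mild right skew.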
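As a rough check of the residual summary, the residual standard error reported by `summary()` can be recomputed from the residuals. This is a minimal sketch, assuming the built-in `cars` data; `fit`, `res`, and `rse` are names introduced here for illustration.

```r
# Illustrative sketch; object names are introduced for this example.
fit <- lm(dist ~ speed, data = cars)
res <- residuals(fit)

# The residual mean is zero up to floating-point error.
print(mean(res))

# Residual standard error: sqrt(residual sum of squares / residual degrees of freedom).
# There are 50 observations and 2 estimated coefficients, so 48 degrees of freedom.
rse <- sqrt(sum(res^2) / df.residual(fit))
print(rse)
```

The value of `rse` should agree with the "Residual standard error: 15.38 on 48 degrees of freedom" line in the model summary.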
plot(cars_model$residuals ~ cars$speed, xlab = "Speed (mph)", ylab = "Residuals", main = "Speed vs Linear Model Residuals")
abline(h=0, lty=3)
qqnorm(cars_model$residuals)
qqline(cars_model$residuals)
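The residual plot shows no obvious pattern, and the residuals are scattered around zero with roughly constant variability. The Q-Q plot follows the reference line through the middle of the distribution and departs from it only at the tails, so the residuals look approximately normal apart from a few outliers.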
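The visual reading of the plots can be supplemented with numerical checks. The sketch below is one possible approach and not part of the textbook analysis: it applies the Shapiro-Wilk normality test to the residuals and lists observations whose standardized residuals exceed 2 in absolute value. The cutoff of 2 is a common rule of thumb assumed here, and `fit`, `sw`, `std_res`, and `flagged` are names introduced for illustration.

```r
# Illustrative sketch; object names and the cutoff of 2 are assumptions.
fit <- lm(dist ~ speed, data = cars)

# Shapiro-Wilk test: a small p-value suggests the residuals depart from normality.
sw <- shapiro.test(residuals(fit))
print(sw)

# Standardized residuals put every observation on a comparable scale.
std_res <- rstandard(fit)

# Flag observations with unusually large standardized residuals.
flagged <- which(abs(std_res) > 2)
print(cars[flagged, ])
```

Any rows printed at the end are the candidates for the tail points seen in the Q-Q plot and are worth inspecting individually.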