Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis from chapter 3 of your textbook (visualization, quality evaluation of the model, and residual analysis).
rm(list=ls())
library(ggplot2)
The cars dataset has 50 rows and 2 columns. Each row is one observation that pairs a car's speed (in mph) with the distance (in feet) it took that car to stop. The columns in the dataset are “speed” and “dist”.
head(cars)
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
summary(cars)
## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
plot(x = cars$speed, y = cars$dist, main="Cars Data - R package", xlab = "Speed(mph)", ylab = "Distance(feet)")
Now, let’s create the linear regression model and look at the correlation between speed and distance.
cars.lm <- lm(cars$dist ~ cars$speed)
cars.lm
##
## Call:
## lm(formula = cars$dist ~ cars$speed)
##
## Coefficients:
## (Intercept) cars$speed
## -17.579 3.932
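The printed coefficients define the fitted line as dist = intercept + slope * speed. As a minimal sketch of how to read that line, the block below predicts stopping distance at two speeds, once by hand and once with `predict()`. It refits the same model using the `data = cars` form of the formula so that `predict()` can accept new speeds; the names `fit`, `new_speeds`, `by_hand`, and `by_predict` are illustrative and not part of the original analysis.

```r
# Same model as cars.lm, written with data= so predict() can use newdata.
# `fit` and `new_speeds` are hypothetical names used only for this sketch.
fit <- lm(dist ~ speed, data = cars)

# Two example speeds (mph) at which to read off the fitted line
new_speeds <- data.frame(speed = c(10, 20))

# Prediction by hand: intercept + slope * speed
by_hand <- coef(fit)[["(Intercept)"]] + coef(fit)[["speed"]] * new_speeds$speed

# The same prediction through predict()
by_predict <- unname(predict(fit, newdata = new_speeds))

print(round(by_hand, 2))     # 21.74 61.07
print(round(by_predict, 2))  # 21.74 61.07
```

Fitting with `cars$dist ~ cars$speed` gives the same coefficients, but `predict()` with `newdata` only works reliably when the formula refers to column names and the data frame is passed separately.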
corr<-cor(cars$dist,cars$speed)
(round(corr,4))
## [1] 0.8069
plot(x = cars$speed, y = cars$dist, main="Cars Data - R package", xlab = "Speed(mph)", ylab = "Distance(feet)")
abline(h=mean(cars$dist))
abline(cars.lm, col="red")
The black horizontal line marks the mean stopping distance, and the red line is the fitted regression model. The red line shows that as speed increases, the distance a car travels after the brakes are applied also increases. Now we can look at the quality of the linear model.
summary(cars.lm)
##
## Call:
## lm(formula = cars$dist ~ cars$speed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## cars$speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
Linear regression equation: dist = -17.5791 + (3.9324 * speed)
Correlation coefficient: 0.8069
Multiple R-squared: 0.6511
Adjusted R-squared: 0.6438
The reported R-squared of 0.6511 means that the model explains 65.11 percent of the variation in stopping distance.
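To make the link between these numbers concrete, here is a small sketch that recomputes R-squared from the residuals and checks it against the squared correlation coefficient, which is the same quantity in a regression with a single predictor. The variable names (`fit`, `sse`, `sst`, and so on) are illustrative, and the adjusted R-squared uses the standard formula with one predictor.

```r
# Refit the model; `fit` is an illustrative name for the same regression.
fit <- lm(dist ~ speed, data = cars)

# Residual sum of squares and total sum of squares around the mean
sse <- sum(residuals(fit)^2)
sst <- sum((cars$dist - mean(cars$dist))^2)

# R-squared: share of the variation in dist explained by the model
r2_manual <- 1 - sse / sst

# With one predictor, R-squared equals the squared correlation
r2_from_cor <- cor(cars$dist, cars$speed)^2

# Adjusted R-squared penalizes for the number of predictors (p = 1 here)
n <- nrow(cars)
p <- 1
adj_r2 <- 1 - (1 - r2_manual) * (n - 1) / (n - p - 1)

print(round(c(r2_manual, r2_from_cor, adj_r2), 4))  # 0.6511 0.6511 0.6438
```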
residuals(cars.lm)
## 1 2 3 4 5 6
## 3.849460 11.849460 -5.947766 12.052234 2.119825 -7.812584
## 7 8 9 10 11 12
## -3.744993 4.255007 12.255007 -8.677401 2.322599 -15.609810
## 13 14 15 16 17 18
## -9.609810 -5.609810 -1.609810 -7.542219 0.457781 0.457781
## 19 20 21 22 23 24
## 12.457781 -11.474628 -1.474628 22.525372 42.525372 -21.407036
## 25 26 27 28 29 30
## -15.407036 12.592964 -13.339445 -5.339445 -17.271854 -9.271854
## 31 32 33 34 35 36
## 0.728146 -11.204263 2.795737 22.795737 30.795737 -21.136672
## 37 38 39 40 41 42
## -11.136672 10.863328 -29.069080 -13.069080 -9.069080 -5.069080
## 43 44 45 46 47 48
## 2.930920 -2.933898 -18.866307 -6.798715 15.201285 16.201285
## 49 50
## 43.201285 4.268876
summary(residuals(cars.lm))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -29.069 -9.525 -2.272 0.000 9.215 43.201
ggplot(cars.lm, aes(.fitted, .resid)) +
  geom_point(color = "red", size = 2) +
  labs(title = "Fitted Values vs Residuals", x = "Fitted Values", y = "Residuals")
qqnorm(resid(cars.lm))
qqline(resid(cars.lm))
Conclusion
A residual is the difference between an observed value and the corresponding value on the fitted regression line. A positive residual means the observed value lies above the fitted line, and a negative residual means it lies below the fitted line.
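For a least-squares model with an intercept, the mean of the residuals is zero by construction, whether or not the residuals are normally distributed. In a well-fitted model, the residuals should also scatter evenly above and below the fitted line and follow an approximately normal distribution.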
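The zero mean can be checked directly. The sketch below, using the illustrative names `fit` and `res`, confirms that the residuals average to zero up to floating-point error and that they are uncorrelated with the predictor, which is another property of a least-squares fit with an intercept. It also counts how many observations fall above and below the line.

```r
# Refit the model; `fit` and `res` are illustrative names.
fit <- lm(dist ~ speed, data = cars)
res <- residuals(fit)

# Mean of the residuals: zero up to floating-point rounding
print(mean(res))

# Residuals are orthogonal to the predictor in a least-squares fit
print(sum(res * cars$speed))

# Number of observations above and below the fitted line
print(c(above = sum(res > 0), below = sum(res < 0)))
```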
We see that the two ends diverge from the Q-Q plot line. This behavior indicates that the residuals are not normally distributed: the distribution’s tails are “heavier” than we would expect from a normal distribution. Together with an R-squared of 0.6511, this suggests that a simple linear model on speed alone does not fully describe stopping distance.
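The Q-Q plot is a visual check. As a possible supplement that is not part of the original analysis, the Shapiro-Wilk test from base R's `stats` package gives a numeric check of the same question. This is a sketch only; `fit` and `sw` are illustrative names, and the 0.05 threshold in the comment is a common convention rather than a rule.

```r
# Refit the model; `fit` and `sw` are illustrative names.
fit <- lm(dist ~ speed, data = cars)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
sw <- shapiro.test(residuals(fit))
print(sw)

# A small p-value (conventionally below 0.05) is evidence against normality
print(sw$p.value)
```

A formal test and a Q-Q plot answer slightly different questions: the test reports whether a departure from normality is detectable, while the plot shows where in the distribution the departure occurs.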