Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis).
##      speed           dist
##  Min.   : 4.0   Min.   :  2.00
##  1st Qu.:12.0   1st Qu.: 26.00
##  Median :15.0   Median : 36.00
##  Mean   :15.4   Mean   : 42.98
##  3rd Qu.:19.0   3rd Qu.: 56.00
##  Max.   :25.0   Max.   :120.00
# Histogram of speed; map the column by name rather than via cars$speed
ggplot(data = cars, aes(x = speed)) +
  geom_histogram(aes(fill = after_stat(count))) +
  scale_fill_gradient("Count", low = "green", high = "red") +
  labs(title = "Histogram - Speed", x = "speed", y = "Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Histogram of stopping distance
ggplot(data = cars, aes(x = dist)) +
  geom_histogram(aes(fill = after_stat(count))) +
  scale_fill_gradient("Count", low = "green", high = "red") +
  labs(title = "Histogram - Distance", x = "distance", y = "Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(cars, aes(x=speed, y=dist)) +
geom_point(size=2, shape=23)
cor(cars$speed, cars$dist)
## [1] 0.8068949
A simple linear regression has the form \( \hat{y} = a x + b \), where b is the y-intercept of the line, a is the slope, x is the speed, and \( \hat{y} \) is the predicted stopping distance (dist). The lm function fits this model:
car_model <- lm(cars$dist ~ cars$speed)
car_model
##
## Call:
## lm(formula = cars$dist ~ cars$speed)
##
## Coefficients:
## (Intercept)   cars$speed
##     -17.579        3.932
The fitted regression model is \[ \widehat{\text{dist}} = 3.932 \cdot \text{speed} - 17.579 \]
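As a cross-check on the coefficients reported by lm, the slope and intercept of a simple linear regression can also be computed in closed form from the sample covariance and variance. The sketch below is not part of the original analysis; the names `slope`, `intercept`, and `fit` are illustrative, and the model is refitted with the `data = cars` form of the formula so that the block runs on its own.

```r
# Closed-form least-squares estimates for dist = a * speed + b
x <- cars$speed
y <- cars$dist

slope <- cov(x, y) / var(x)              # a = cov(x, y) / var(x)
intercept <- mean(y) - slope * mean(x)   # b = mean(y) - a * mean(x)

# Compare with the coefficients estimated by lm()
fit <- lm(dist ~ speed, data = cars)
print(c(intercept = intercept, slope = slope))
print(coef(fit))
```

Both printed vectors should agree with the intercept and slope reported in the lm output.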
plot(x = cars$speed, y = cars$dist, main = "Cars Data - R package", xlab = "Speed (mph)", ylab = "Distance (feet)")
abline(car_model, col="red")
summary(car_model)
##
## Call:
## lm(formula = cars$dist ~ cars$speed)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -29.069  -9.525  -2.272   9.215  43.201
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *
## cars$speed    3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
The multiple R-squared value of 0.6511 means that the model explains about 65.11% of the variation in stopping distance.
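To make the meaning of R-squared concrete, it can be recomputed from the residual and total sums of squares. In a simple regression with one predictor it also equals the square of the correlation between speed and dist. The following is a minimal sketch under those definitions; the variable names (`ss_res`, `ss_tot`, `r2_manual`, `r2_from_cor`, `r2_lm`) are illustrative and do not appear in the original analysis.

```r
# Refit the model so this block is self-contained
fit <- lm(dist ~ speed, data = cars)

ss_res <- sum(resid(fit)^2)                      # residual sum of squares
ss_tot <- sum((cars$dist - mean(cars$dist))^2)   # total sum of squares
r2_manual <- 1 - ss_res / ss_tot                 # R^2 = 1 - SS_res / SS_tot

r2_from_cor <- cor(cars$speed, cars$dist)^2      # squared correlation
r2_lm <- summary(fit)$r.squared                  # value reported by summary()

print(c(manual = r2_manual, from_cor = r2_from_cor, lm = r2_lm))
```

All three values should match the 0.6511 shown in the summary output.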
qqnorm(resid(car_model))
qqline(resid(car_model))
Based on the visualization of the residuals, we see that the two ends diverge from the Q-Q plot line. This indicates that the residuals are not exactly normally distributed, particularly in the tails.
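The Q-Q plot can be complemented by a residuals-versus-fitted plot and a formal normality test. The sketch below is an optional addition rather than part of the original analysis: the Shapiro-Wilk test is an assumption about what a reader might want to check, and `fit`, `res`, and `sw` are illustrative names.

```r
# Refit the model so this block is self-contained
fit <- lm(dist ~ speed, data = cars)
res <- resid(fit)

# Residuals vs. fitted values: curvature or a funnel shape would suggest
# a missing term or non-constant variance
plot(fitted(fit), res,
     xlab = "Fitted distance (feet)", ylab = "Residual (feet)",
     main = "Residuals vs. fitted values")
abline(h = 0, lty = 2)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
sw <- shapiro.test(res)
print(sw)
```

A small p-value from the test would count as evidence against normally distributed residuals, and should be read alongside the Q-Q plot rather than in place of it.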
Speed and stopping distance have a correlation of 0.8069, and the model has a multiple R-squared of 65.11%. Together with the Q-Q plot, this suggests that speed alone is not sufficient to explain stopping distance. We would therefore suggest adding other factors to the model to make it more reliable.