Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis).
library(ggplot2)
library(DT)
?cars
datatable(cars, options = list(scrollX = TRUE))
summary(cars)
##      speed           dist
##  Min.   : 4.0   Min.   :  2.00
##  1st Qu.:12.0   1st Qu.: 26.00
##  Median :15.0   Median : 36.00
##  Mean   :15.4   Mean   : 42.98
##  3rd Qu.:19.0   3rd Qu.: 56.00
##  Max.   :25.0   Max.   :120.00
The data give the speed of cars and the distances taken to stop. Note that the data were recorded in the 1920s.
cars
A data frame with 50 observations on 2 variables.
Column | Name | Type | Description |
---|---|---|---|
[,1] | speed | numeric | Speed (mph) |
[,2] | dist | numeric | Stopping distance (ft) |
Ezekiel, M. (1930) Methods of Correlation Analysis. Wiley.
McNeil, D. R. (1977) Interactive Data Analysis. Wiley.
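Independent variable: speed (mph). Dependent variable: dist (stopping distance, ft).
The histograms of both variables appear roughly normal, although stopping distance is somewhat right-skewed (its mean, 42.98 ft, exceeds its median, 36 ft).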
ggplot(data = cars, aes(x = speed)) + geom_histogram(bins = 10) +
labs(x = "Speed" , y = "count") +
ggtitle("Speed Histogram") + theme(plot.title=element_text(hjust=0.5))+
geom_vline(xintercept = mean(cars$speed),
color = "indianred") +
geom_vline(xintercept = median(cars$speed),
color = "cornflowerblue")
ggplot(data = cars, aes(x = dist)) + geom_histogram(bins = 15) +
labs(x = "Stopping Distance (ft)" , y = "count") +
ggtitle("Stopping Distance (ft) Histogram") + theme(plot.title=element_text(hjust=0.5)) +
geom_vline(xintercept = mean(cars$dist),
color = "indianred") +
geom_vline(xintercept = median(cars$dist),
color = "cornflowerblue")
There is a strong positive relationship between the speed of the car and the distance needed for the car to stop. The faster the car, the longer the distance required for it to stop.
ggplot(data = cars, aes(x = speed, y = dist)) + geom_point() + geom_smooth() +
labs(x = "Speed" , y = "Stopping Distance (ft)") +
ggtitle("Speed v. Stopping Distance (ft)") + theme(plot.title=element_text(hjust=0.5))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
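The strength of the relationship seen in the scatterplot can also be quantified. The following is a minimal sketch using base R's `cor()` and `cor.test()` on the built-in `cars` data; the variable names `r` and `ct` are introduced here only for illustration.

```r
# Pearson correlation between speed and stopping distance
r <- cor(cars$speed, cars$dist)
round(r, 2)  # approximately 0.81

# Test of H0: the true correlation is zero, with a 95% confidence interval
ct <- cor.test(cars$speed, cars$dist)
ct$conf.int
```

A correlation near 1 supports the visual impression of a strong positive linear association, though it says nothing about whether a straight line is the best functional form.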
The linear function is:
\[ stopping \; distance = B_{0}+B_{1}\cdot speed\] where
\(B_{0}\) is the intercept.
\(B_{1}\) is the slope.
\[ stopping \; distance = -17.5791+3.9324\cdot speed\]
m1 <- lm(dist ~ speed, data = cars)
summary(m1)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -29.069  -9.525  -2.272   9.215  43.201
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
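To read the coefficients in context: the slope suggests that each additional mph adds about 3.93 ft of stopping distance, while the negative intercept has no physical meaning because a speed of 0 lies outside the observed range (4 to 25 mph). The sketch below, which assumes the same model and uses hypothetical speeds of 10 and 20 mph chosen only for illustration, shows one way to obtain confidence intervals for the coefficients and predictions from the fitted line.

```r
m1 <- lm(dist ~ speed, data = cars)

# 95% confidence intervals for the intercept and slope
confint(m1, level = 0.95)

# Predicted stopping distance (ft) at illustrative speeds of 10 and 20 mph,
# with 95% prediction intervals for an individual car
new_speeds <- data.frame(speed = c(10, 20))
pred <- predict(m1, newdata = new_speeds, interval = "prediction", level = 0.95)
pred
```

The `fit` column reproduces the fitted equation (for example, -17.5791 + 3.9324 × 20 ≈ 61.07 ft), while `lwr` and `upr` indicate how much an individual observation may deviate from it.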
ggplot(cars, aes(x = speed, y = dist)) +
geom_point() +
stat_smooth(method = "lm", col = "indianred") +
ggtitle("Speed v. Stopping Distance (ft)") + theme(plot.title=element_text(hjust=0.5))
## `geom_smooth()` using formula = 'y ~ x'
Linearity and Constant Variability: Residuals are the differences between the actual and predicted values. We want the mean of the residuals to be close to 0. Based on the plot below, the points are scattered around 0 with no apparent pattern, so both linearity and constant variability appear to be met.
library(ggplot2)
ggplot(data = m1, aes(x = .fitted, y = .resid)) + geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") + xlab("Fitted values") +
ylab("Residuals") +
ggtitle("Residuals v. Fitted") + theme(plot.title=element_text(hjust=0.5))
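The visual check can be complemented with a couple of quick numeric summaries. This is a rough sketch rather than a formal test; the median split used to compare residual spread is an arbitrary choice made here for illustration.

```r
m1 <- lm(dist ~ speed, data = cars)
res <- resid(m1)

# With an intercept in the model, least-squares residuals average to zero
# (up to floating-point error)
mean(res)

# Rough check of constant variability: compare the residual standard
# deviation in the lower and upper halves of the fitted values
upper_half <- fitted(m1) > median(fitted(m1))
spread <- tapply(res, upper_half, sd)
spread
```

If the two standard deviations differ substantially, the constant-variability assumption deserves a closer look.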
Normality of Residuals: The points lie roughly on the Q-Q line, so the assumption of normality appears to be reasonably met.
ggplot(data = m1, aes(x = .resid)) + geom_histogram(binwidth = 8) + xlab("Residuals")
ggplot(data = m1, aes(sample = .resid)) + stat_qq() + stat_qq_line(col = "red") +
xlab("Theoretical Quantiles") + ylab("Residuals") +
ggtitle("Normal Q-Q") + theme(plot.title=element_text(hjust=0.5))
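A formal test can be read alongside the histogram and Q-Q plot. The sketch below applies the Shapiro-Wilk test from base R to the model residuals; the object name `sw` is introduced here for illustration. With only 50 observations, the result is best treated as one piece of evidence rather than a verdict.

```r
m1 <- lm(dist ~ speed, data = cars)

# Shapiro-Wilk test: H0 is that the residuals come from a normal distribution
sw <- shapiro.test(resid(m1))
sw
```

A small p-value would indicate some departure from normality, in which case the tails of the Q-Q plot are the place to look for the observations responsible.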
With the model assumptions reasonably met and an adjusted R-squared of 0.64 (speed explains roughly 64% of the variability in stopping distance), linear regression appears to be an appropriate approach for modeling this dataset.
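As a check on the reported value, the adjusted R-squared can be recovered from the multiple R-squared in the model summary, assuming the standard adjustment with \(n = 50\) observations and \(p = 1\) predictor:

\[ R^{2}_{adj} = 1 - (1 - R^{2})\cdot\frac{n-1}{n-p-1} = 1 - (1 - 0.6511)\cdot\frac{49}{48} \approx 0.6438 \]

With a single predictor and 50 observations the penalty is small, which is why the adjusted value is so close to 0.6511.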