Using the “cars” dataset in R, build a linear model for stopping
distance as a function of speed and replicate the analysis of your
textbook chapter 3 (visualization, quality evaluation of the model, and
residual analysis).
Null Hypothesis (H₀):
There is no relationship between the cars’ speeds and their stopping
distances.
Alternative Hypothesis (H₁):
A linear relationship exists between the cars’ speeds and their
stopping distances.
After loading the cars dataset, an initial inspection shows two variables, ‘speed’ and ‘dist’, and 50 rows of observations.
Variable Definition
Speed: The speed of the car (in miles per hour).
Distance (dist): The distance required to stop the car (in feet).
Speed Summary Statistics
The ‘speed’ variable is approximately symmetric, as supported by the
similar mean (15.4) and median (15.0). Values range from 4 to 25, and
the 1st and 3rd quartiles (12.0 and 19.0) sit at similar distances from
the median. A histogram of the values is roughly bell-shaped, which is
consistent with an approximately normal distribution.
Distance Summary Statistics
The ‘dist’ variable appears skewed, as indicated by the mean (42.98)
exceeding the median (36.00). Values range from 2 to 120. The median
lies closer to the 1st quartile (26.00) than to the 3rd quartile
(56.00), and the maximum (120) lies far above the 3rd quartile. The
‘dist’ observations are therefore skewed to the right. A histogram of
the values also shows this right skew.
data(cars)  # loads 'cars' into the workspace; data() returns only the dataset name, so its result is not assigned
#head(cars)
#nrow(cars)
#names(cars)
summary(cars)
## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
hist(cars$speed,
main = "Histogram of Car 'Speed' Variable",
xlab = "Speed (mph)",
col = "lightblue",
border = "black")
#breaks = seq(from = min(cars$speed), to = max(cars$speed), by = 5))
hist(cars$dist,
main = "Histogram of 'Dist' Variable",
xlab = "Distance",
col = "lightblue",
border = "black")
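The right skew in ‘dist’ noted in the summary statistics can also be checked numerically. The following is a minimal sketch using base R only. The variable names are illustrative, and the moment-based skewness formula is one common definition chosen here as an assumption, not necessarily the one used in the textbook.

```r
# Numeric check of the right skew seen in the 'dist' histogram (base R only).
data(cars)

q <- quantile(cars$dist, c(0.25, 0.5, 0.75))

# A mean above the median suggests a longer right tail.
mean_minus_median <- mean(cars$dist) - median(cars$dist)

# Under right skew, the median sits closer to the 1st quartile than to the 3rd.
lower_gap <- unname(q[2] - q[1])   # 1st quartile to median
upper_gap <- unname(q[3] - q[2])   # median to 3rd quartile

# Moment-based sample skewness (assumed definition); positive means right skew.
z <- cars$dist - mean(cars$dist)
skewness <- mean(z^3) / mean(z^2)^1.5

mean_minus_median
c(lower_gap = lower_gap, upper_gap = upper_gap)
skewness
```

All three quantities point in the same direction when the distribution has a longer right tail: a positive mean-minus-median difference, an upper gap larger than the lower gap, and a positive skewness value.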
Speed and stopping distance were plotted against each other to visually assess linearity. In the ‘Stopping Distance vs. Speed’ plot below, stopping distance increases roughly in a straight line as speed increases, which supports a linear relationship between the two variables.
plot(dist ~ speed, data=cars, main="Stopping Distance vs. Speed",
xlab="Speed (mph)", ylab="Stopping Distance (feet)")
A simple linear model was fit in R below, with stopping distance (dist) as the dependent variable and speed as the independent variable. The estimated y-intercept is -17.579 and the estimated slope is 3.932, so each additional mile per hour of speed is associated with about 3.93 more feet of stopping distance.
The regression model can therefore be written as: \[\widehat{\text{dist}} = -17.579 + 3.932 \times \text{speed}\]
lm<-lm(cars$dist ~ cars$speed, data=cars)
lm
##
## Call:
## lm(formula = cars$dist ~ cars$speed, data = cars)
##
## Coefficients:
## (Intercept) cars$speed
## -17.579 3.932
plot(dist ~ speed, data=cars, main="Stopping Distance vs. Speed",
xlab="Speed (mph)", ylab="Stopping Distance (feet)")
abline(lm, col="red")
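The fitted line can be used to estimate stopping distance at a given speed. The sketch below is an illustration rather than part of the original analysis: it refits the same model under the hypothetical name `fit`, using bare column names in the formula so that `predict()` can accept new data, and the speed of 20 mph is an arbitrary example value.

```r
# Estimating stopping distance from the fitted line.
# 'fit', 'new_speed', 'by_predict' and 'by_hand' are illustrative names.
data(cars)
fit <- lm(dist ~ speed, data = cars)   # same model, written with bare column names

new_speed <- data.frame(speed = 20)    # 20 mph, inside the observed 4 to 25 mph range
by_predict <- unname(predict(fit, newdata = new_speed))

# The same calculation by hand: intercept + slope * speed
b <- coef(fit)
by_hand <- unname(b["(Intercept)"] + b["speed"] * 20)

by_predict   # about 61.07 feet
by_hand
```

Because the intercept is negative, the line gives negative stopping distances at very low speeds (for example, about -1.85 feet at 4 mph), so it should not be extrapolated below the observed range.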
Below, the results of the linear model are interpreted.
summary(lm)
##
## Call:
## lm(formula = cars$dist ~ cars$speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## cars$speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
Residuals
The median residual (-2.272) is relatively near zero, and the first and
third quartiles (-9.525 and 9.215) are of similar magnitude, which is
consistent with residuals that are roughly symmetric about zero.
However, the minimum (-29.069) and maximum (43.201) are not of similar
magnitude, which suggests a longer right tail and provides some
evidence that the model may not be the best fit for the data.
Coefficients
A standard error at least five to ten times smaller than the
coefficient estimate is a rule of thumb for a well-determined
coefficient. Here the standard error for ‘speed’ (0.4155) is about 9.5
times smaller than the estimate (3.9324). This ratio is the t value
(9.464), and it has a statistically significant p-value of 1.49e-12
(that is, 1.49 × 10^-12). A ratio of this size means there is
relatively little uncertainty in the slope estimate and provides
further evidence that the simple linear model is a reasonable fit for
this data.
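The relationship between the estimate, its standard error, the t value, and the p-value can be reproduced by hand. This is a minimal sketch that refits the model under the illustrative name `fit`; the 95% confidence level is an assumed choice, not one stated in the original analysis.

```r
# Reproducing the t value and p-value for the 'speed' coefficient.
# 'fit', 'coefs', 't_speed', 'p_speed' and 'ci_speed' are illustrative names.
data(cars)
fit <- lm(dist ~ speed, data = cars)
coefs <- summary(fit)$coefficients

# t value = estimate divided by its standard error
t_speed <- coefs["speed", "Estimate"] / coefs["speed", "Std. Error"]

# Two-sided p-value from the t distribution with the residual degrees of freedom
p_speed <- 2 * pt(abs(t_speed), df = df.residual(fit), lower.tail = FALSE)

# 95% confidence interval for the slope (assumed level)
ci_speed <- confint(fit, "speed", level = 0.95)

t_speed    # about 9.464
p_speed
ci_speed
```

An interval for the slope that excludes zero leads to the same conclusion as the small p-value: the data are not consistent with a slope of zero.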
Residual Standard Error and Degrees of Freedom
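The residual standard error (15.38) is approximately 1.6 times the
magnitude of the 1st and 3rd quartile residuals (-9.525 and 9.215). For
normally distributed residuals this ratio is expected to be roughly
1.5, so the residuals appear approximately normally distributed.
Degrees of freedom are the number of observations in the dataset minus the number of coefficients estimated by the SLM (here, the intercept and the slope). For this model we have 50 - 2 = 48 degrees of freedom.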
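The residual standard error and the degrees of freedom can be recomputed directly from the residuals. The sketch below refits the model under the illustrative name `fit`; all variable names are assumptions made for this example.

```r
# Recomputing the residual standard error and degrees of freedom by hand.
data(cars)
fit <- lm(dist ~ speed, data = cars)
r <- resid(fit)

n  <- nrow(cars)          # 50 observations
p  <- length(coef(fit))   # 2 estimated coefficients: intercept and slope
df <- n - p               # 48 degrees of freedom

# Residual standard error: square root of the residual sum of squares over df
rse <- sqrt(sum(r^2) / df)

# Ratio of the RSE to the magnitude of the 1st and 3rd quartile residuals
ratio <- rse / abs(quantile(r, c(0.25, 0.75)))

# For an exactly normal distribution, this ratio is 1 / qnorm(0.75)
normal_ratio <- 1 / qnorm(0.75)

df
rse            # about 15.38
ratio
normal_ratio   # about 1.48
```

The closer the two observed ratios are to the normal reference value, the better the quartiles of the residuals agree with what a normal distribution would produce.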
The Multiple R-squared Value
The reported R² of 0.6511 for this model means that 65.11% of the
variability in stopping distance is explained by its linear
relationship with speed.
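The R-squared values can be reproduced from the sums of squares. This is a minimal sketch with illustrative variable names, refitting the model as `fit`.

```r
# Reproducing Multiple and Adjusted R-squared by hand.
data(cars)
fit <- lm(dist ~ speed, data = cars)

rss <- sum(resid(fit)^2)                      # variation left unexplained by the model
tss <- sum((cars$dist - mean(cars$dist))^2)   # total variation in stopping distance

r2 <- 1 - rss / tss

# With a single predictor, R-squared equals the squared correlation.
r2_cor <- cor(cars$speed, cars$dist)^2

# Adjusted R-squared penalizes for the number of estimated coefficients.
n <- nrow(cars)
adj_r2 <- 1 - (1 - r2) * (n - 1) / df.residual(fit)

r2       # about 0.6511
r2_cor
adj_r2   # about 0.6438
```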
Below, residual analysis is conducted using a Residual versus Fitted Value Plot and a Q-Q Plot.
Residual versus Fitted Value Plot
For a Residual versus Fitted Value Plot to support the linear model,
the residuals should be scattered randomly, with no clear pattern,
around the horizontal line where the residual equals zero. In the plot
below for our linear model, the residuals look to be scattered randomly
around that line, so the assumptions of linearity and constant variance
appear to be satisfied. The plot also hints at outliers, with the
largest residuals a little above 40.
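plot(fitted(lm), resid(lm), xlab="Fitted Values", ylab="Residuals")
abline(h=0, lty=2)  # reference line where the residual equals zero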
Q-Q Plot
For the Q-Q Plot to support our linear model, we would expect the
plotted points to follow a straight line, indicating that the residuals
are normally distributed. Our model’s Q-Q Plot below suggests that the
distribution of the residuals is approximately normal. However, both
the right and left tails deviate slightly from the expected straight
line, suggesting that the model could be improved.
qqnorm(resid(lm))
qqline(resid(lm))
More Plots
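The plots below are R's four default diagnostic plots for a linear model (Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage). R labels the most extreme observations in each plot, which helps identify potential outliers.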
par(mfrow=c(2,2))
plot(lm)
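The potential outliers can also be listed by row. The sketch below is an illustration, not part of the original analysis: it refits the model under the hypothetical name `fit` and reports the three observations with the largest absolute residuals. The choice of three is arbitrary.

```r
# Listing the observations with the largest absolute residuals.
# 'fit', 'r', 'top3' and 'outlier_table' are illustrative names.
data(cars)
fit <- lm(dist ~ speed, data = cars)
r <- resid(fit)

# Row indices of the three largest residuals in absolute value
top3 <- order(abs(r), decreasing = TRUE)[1:3]

# Show speed and dist alongside the fitted value and residual for those rows
outlier_table <- cbind(cars[top3, ],
                       fitted   = fitted(fit)[top3],
                       residual = r[top3])
outlier_table
```

The first row of the table corresponds to the maximum residual of 43.201 reported in the model summary.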
Overall, the analysis supports a relationship between speed and stopping distance. The null hypothesis, that there is no relationship between the cars’ speeds and their stopping distances, can be rejected (p-value of 1.49e-12 for the slope). However, there is also evidence that speed alone does not fully explain stopping distance: the tails of the Q-Q Plot deviate from the expected line, a few outliers are present, and about 35% of the variability in stopping distance is left unexplained. Other factors that might influence the relationship include car type, brake type, car manufacturer, and brake manufacturer.