Data605: HW11

Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis.)

Preliminary Review of the Numeric Variables

summary(cars)

##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

The summary statistics show less variability in speed than in distance: speed ranges from 4 to 25 (mph), while stopping distance ranges from 2 to 120 (ft).
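Because the two variables are measured in different units, a unitless measure makes the comparison of spread fairer. The following is a minimal sketch, assuming only the built-in cars data frame; the name spread is introduced here for illustration and is not part of the original analysis.

```r
# Compare the spread of speed and dist in the built-in cars dataset.
# `spread` is an illustrative name, not from the original analysis.
spread <- data.frame(
  sd  = sapply(cars, sd),                      # standard deviation (original units)
  iqr = sapply(cars, IQR),                     # interquartile range (original units)
  cv  = sapply(cars, sd) / sapply(cars, mean)  # coefficient of variation (unitless)
)
print(spread)
```

The cv column is the one to compare across variables, since sd and iqr are in mph for speed and ft for dist.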

library(dplyr)  # provides glimpse()
glimpse(cars)

## Rows: 50
## Columns: 2
## $ speed <dbl> 4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13, 13…
## $ dist  <dbl> 2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26, 34…

It is a little surprising that nearly all of the distances shown are even values (17 is an exception), but perhaps some rounding was done to simplify this pre-loaded dataset, or the measuring techniques were not designed to be precise (the documentation indicates these values were obtained in the 1920s).
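The observation about even values can be checked directly rather than by eye. This is a small sketch using base R only; is_even is an illustrative name.

```r
# Flag each stopping distance as even or odd using the modulo operator.
is_even <- cars$dist %% 2 == 0

# Tabulate how many distances are even (TRUE) vs. odd (FALSE).
table(is_even)

# List the rows whose distance is odd.
cars[!is_even, ]
```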

Initial Scatterplot

The first way to inspect for a potential linear relationship between a dependent and an independent variable is to graph them in a scatterplot. A clearly nonlinear relationship between the two variables is usually obvious in such a plot.

library(ggplot2)  # provides ggplot(), geom_point(), geom_smooth()
data(cars)
ggplot(cars, aes(x = speed, y = dist)) +
    geom_point() +
    geom_smooth(method = 'lm')

## `geom_smooth()` using formula 'y ~ x'

There appears to be a fairly strong positive relationship between speed and stopping distance: the distance it takes to stop increases as speed increases. The best fit line appears to do a reasonably good job of approximating the relationship between the dependent and independent variables.
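The visual impression of a strong positive relationship can be quantified with the Pearson correlation coefficient. A minimal sketch, assuming the built-in cars data; r is an illustrative name.

```r
# Pearson correlation between speed and stopping distance.
r <- cor(cars$speed, cars$dist)
r

# In simple linear regression, the squared correlation equals the model's R-squared.
r^2
```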

Simple Linear Regression Model

car.lm <- lm(dist ~ speed, data=cars)
summary(car.lm)

## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

The p-value for the speed coefficient (1.49e-12) is very small, indicating strong evidence of a linear relationship between the predictor and response variables. The estimated slope implies that each additional unit of speed adds about 3.93 units of stopping distance. In the residual summary, the median (-2.27) is close to zero and the first and third quartiles (-9.53 and 9.22) are similar in magnitude, although the maximum (43.20) is noticeably larger in magnitude than the minimum (-29.07), which suggests a slight right skew. Lastly, the R-squared statistic indicates that about 65% of the variability in stopping distance can be explained by speed.
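To make the quality measures in the summary output more concrete, the sketch below recomputes R-squared and the residual standard error from the residuals, and adds confidence intervals for the coefficients. The model is refit so the snippet runs on its own; ss_res, ss_tot, r_squared, and rse are illustrative names.

```r
# Refit the model from the text so this snippet is self-contained.
car.lm <- lm(dist ~ speed, data = cars)

# Residual and total sums of squares.
ss_res <- sum(residuals(car.lm)^2)
ss_tot <- sum((cars$dist - mean(cars$dist))^2)

# R-squared: share of the variability in dist explained by the model.
r_squared <- 1 - ss_res / ss_tot

# Residual standard error: sqrt of SS_res over residual degrees of freedom (48 here).
rse <- sqrt(ss_res / df.residual(car.lm))

c(r_squared = r_squared, rse = rse)

# 95% confidence intervals for the intercept and slope.
confint(car.lm)
```

These values should agree with the "Multiple R-squared" and "Residual standard error" lines reported by summary(car.lm).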

Diagnostic Plots

The standard diagnostic plots help assess whether any of the assumptions of SLR are violated for this dataset.

par(mfrow=c(2,2))
plot(car.lm)

The expectation when evaluating the residuals and the standardized residuals in the two left plots (Residuals vs Fitted and Scale-Location) is that no clear pattern or trend will be visible. The residuals appear to be centered around zero, which suggests that the positive and negative differences balance each other out and is consistent with residuals that are roughly normally distributed.
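The residuals-versus-fitted view can also be drawn by hand, which makes it explicit what is being plotted. This is a sketch in base R graphics, with the model refit so the snippet is self-contained. Note that with an intercept in the model the residuals average to zero by construction, so the thing to look for is a trend or funnel shape rather than the overall centering.

```r
# Refit the model from the text so this snippet is self-contained.
car.lm <- lm(dist ~ speed, data = cars)

# Residuals against fitted values; a patternless cloud around zero is the goal.
plot(fitted(car.lm), resid(car.lm),
     xlab = "Fitted stopping distance",
     ylab = "Residual")

# Dashed horizontal reference line at zero.
abline(h = 0, lty = 2)
```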

The Normal Q-Q plot at the top right supports the evaluation of normality of the residuals: the points should follow the reference line fairly closely. As expected in most real-world datasets, there is some natural variation off this line, typically at the largest and smallest values. There is a slight right skew in the residuals for this cars dataset, as was also suggested by the scatterplot.
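The Q-Q plot can be reproduced directly, and a formal test can complement the visual check. The sketch below adds a Shapiro-Wilk test, which is not part of the original analysis and is offered only as one possible supplement; res and sw are illustrative names.

```r
# Refit the model from the text so this snippet is self-contained.
car.lm <- lm(dist ~ speed, data = cars)
res <- resid(car.lm)

# Normal Q-Q plot of the raw residuals with a reference line.
qqnorm(res)
qqline(res)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal.
sw <- shapiro.test(res)
sw
```

A small p-value here would point toward non-normal residuals, which should be weighed together with the plot rather than on its own.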

The Residuals vs Leverage plot is used to assess how strongly individual observations influence the fitted model. A few points appear to have a comparatively large impact on the model given their leverage. Points 39 and 49 are closest to the Cook's distance line, which marks influential points that affect the regression coefficients. Because they still fall inside the Cook's distance boundary on this plot, removing them would not be expected to change the model substantially.
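Influence can also be examined numerically. The sketch below computes Cook's distance and leverage, flags points above a 4/n cutoff, and compares the coefficients with and without the most influential point. The 4/n cutoff is a common rule of thumb assumed here for illustration, not a threshold used in the original analysis; cd, lev, cutoff, and worst are illustrative names.

```r
# Refit the model from the text so this snippet is self-contained.
car.lm <- lm(dist ~ speed, data = cars)

cd  <- cooks.distance(car.lm)  # influence of each observation on the fit
lev <- hatvalues(car.lm)       # leverage of each observation

# The three observations with the largest Cook's distance.
head(sort(cd, decreasing = TRUE), 3)

# Assumed rule-of-thumb cutoff (4 / n); other cutoffs are also in use.
cutoff <- 4 / nrow(cars)
which(cd > cutoff)

# Compare coefficients with and without the single most influential point.
worst <- which.max(cd)
coef(car.lm)
coef(lm(dist ~ speed, data = cars[-worst, ]))
```

If the two sets of coefficients are close, that supports the reading that no single point is driving the fit.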

Conclusion

After following the outlined approach to assess the linear relationship of the two variables in the cars dataset, it appears there is a statistically significant relationship between stopping distance and speed. A linear model is a fair method for describing how the dependent variable varies with the independent variable. Although SLR on speed does a decent job of estimating the distance needed to stop a car, there are likely other factors at play, such as the weight of the car or the age of the brakes, that would increase the explained variability and could be explored with multiple linear regression (MLR).
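As an illustration of using the fitted model to estimate stopping distance, the sketch below predicts distance at a few speeds with 95% prediction intervals. The speeds 10, 15, and 20 are arbitrary values chosen for illustration, and new_speeds and pred are illustrative names.

```r
# Refit the model from the text so this snippet is self-contained.
car.lm <- lm(dist ~ speed, data = cars)

# Hypothetical speeds at which to estimate stopping distance.
new_speeds <- data.frame(speed = c(10, 15, 20))

# Point predictions with 95% prediction intervals for individual stops.
pred <- predict(car.lm, newdata = new_speeds,
                interval = "prediction", level = 0.95)

cbind(new_speeds, pred)
```

The width of the lwr to upr range reflects the residual standard error of the model, so the intervals are wide relative to the point estimates.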