Assignment instructions

Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis.)

I coincidentally picked the same problem for my week 11 discussion item (second time!), so this is an extension of that item, hewing closely to the analysis in chapter 3 of Linear Regression Using R: An Introduction to Data Modeling (Lilja, D., 2016).



Select Data

Here we select R’s built-in dataset, “cars”:

df <- cars
summary(df)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00


Does our data look linear?

Here we plot the values to inspect for linearity.

plot(df$speed, df$dist, main="Does this look linear?", xlab="Speed", ylab="Distance")



Create new columns

The slope of the implied line in the graph above seems to increase subtly as speed increases, so we’re going to create a new column, the square root of speed, to use as the independent variable.

df$speed_root <- sqrt(df$speed)
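One rough way to see how a transformation changes the linear association is to compare the correlation of each candidate column with stopping distance. The sketch below is only illustrative: the `candidates` data frame and the extra `speed_sq` column are names assumed here, not part of the analysis above, and no particular outcome is assumed.

```r
# Sketch: how strongly does `dist` track a few transformations of `speed`?
df <- cars

candidates <- data.frame(
  speed      = df$speed,        # untransformed speed
  speed_root = sqrt(df$speed),  # the transformation used in this write-up
  speed_sq   = df$speed^2       # hypothetical extra candidate, for comparison only
)

# Pearson correlation of each candidate column with stopping distance
cors <- sapply(candidates, function(x) cor(x, df$dist))
print(round(cors, 3))
```

A correlation closer to 1 suggests a more nearly linear relationship for that choice of independent variable.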


Make the model

Here we make the model to arrive at:

\(\hat{distance} = -65.819 + 28.192\sqrt{speed}\)

lm <- lm(dist ~ speed_root, data=df)
lm
## 
## Call:
## lm(formula = dist ~ speed_root, data = df)
## 
## Coefficients:
## (Intercept)   speed_root  
##      -65.82        28.19
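As a worked example of the fitted equation, the sketch below predicts the stopping distance at a speed of 16, where the square root is exactly 4, so by hand the rounded coefficients give -65.819 + 28.192 × 4 = 46.949. The object name `fit` and the value `new_speed` are assumptions made for this illustration; the write-up stores the same model in `lm`.

```r
# Rebuild the model so this block runs on its own
df <- cars
df$speed_root <- sqrt(df$speed)
fit <- lm(dist ~ speed_root, data = df)

new_speed <- 16  # illustrative speed, chosen because sqrt(16) = 4

# Prediction by plugging into the fitted equation
by_hand <- coef(fit)[["(Intercept)"]] + coef(fit)[["speed_root"]] * sqrt(new_speed)

# The same prediction via predict(); newdata must carry the transformed column
by_predict <- predict(fit, newdata = data.frame(speed_root = sqrt(new_speed)))

print(c(by_hand = by_hand, by_predict = unname(by_predict)))
```

Note that `predict()` needs the `speed_root` column, not raw `speed`, because that is the variable the model was fitted on.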


Quality evaluation of the model

Here we look at the linear regression model overlaid on the data (remember we’ve taken the square root of speed).

plot(dist ~ speed_root, data=df)
abline(lm)

Here we extract additional data about the quality of the model.

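Residuals in a good model should have a median value near zero, with a min and max of roughly the same magnitude. Our median of -2.978 is reasonably close to zero given the spread, but the Max of 47.709 is about 1.7 times the magnitude of the Min of -28.258, so the residuals are not quite symmetric.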

In a good model, each standard error should be five to ten times smaller than its corresponding coefficient. The ratio is just over five for the intercept (a t value of -5.242, so 5.242 in magnitude) and just under nine for the square root of speed (a t value of 8.812). The larger the ratio, the less variability there is in the coefficient estimate relative to its size. (It’s not clear to me what the Std. Error of the intercept means.)
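The ratio of each coefficient to its standard error can be read straight out of the coefficient table, and it is the same number that `summary()` reports as the t value (in magnitude). A minimal sketch, with `fit`, `coefs`, and `ratio` as names assumed for this illustration:

```r
# Rebuild the model so this block runs on its own
df <- cars
df$speed_root <- sqrt(df$speed)
fit <- lm(dist ~ speed_root, data = df)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- summary(fit)$coefficients

# How many standard errors each estimate is away from zero
ratio <- abs(coefs[, "Estimate"]) / coefs[, "Std. Error"]
print(round(ratio, 3))
```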

The Pr(>|t|) is the p-value. The p-value of 1.34e-11 is the probability that we’d observe a t value of 8.812 or more extreme if there were no linear relationship between the square root of speed and the stopping distance.
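To make the link between the t value and the p-value concrete, the two-sided p-value can be recomputed from the t distribution with the model's residual degrees of freedom. This is a sketch with illustrative names (`fit`, `t_slope`, `p_by_hand`), not output taken from the write-up:

```r
# Rebuild the model so this block runs on its own
df <- cars
df$speed_root <- sqrt(df$speed)
fit <- lm(dist ~ speed_root, data = df)

coefs   <- summary(fit)$coefficients
t_slope <- coefs["speed_root", "t value"]

# Two-sided tail probability of a t value at least this extreme
# under the null hypothesis that the true slope is zero
p_by_hand <- 2 * pt(abs(t_slope), df = df.residual(fit), lower.tail = FALSE)

print(c(by_hand = p_by_hand, from_summary = coefs["speed_root", "Pr(>|t|)"]))
```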

For the intercept, the p-value of 3.51e-06 is the probability of observing a t value of -5.242 or more extreme, assuming the true intercept is zero. (Of course the true intercept would be zero! Because then it’s already stopped!)

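The residual standard error measures the total variation in the residuals. If the residuals are normally distributed, the first and third quartiles of the residuals should be about 0.67 times the residual standard error in magnitude. Here 0.67 × 16.09 ≈ 10.8, which is close to our quartiles of -10.969 and 10.518.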
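For a normal distribution, the quartiles sit about 0.674 standard deviations either side of the mean, so a rough check is to scale the residual standard error by `qnorm(0.25)` and `qnorm(0.75)` and compare with the observed residual quartiles. A sketch, with `fit`, `rse`, `expected_q`, and `observed_q` as names assumed here:

```r
# Rebuild the model so this block runs on its own
df <- cars
df$speed_root <- sqrt(df$speed)
fit <- lm(dist ~ speed_root, data = df)

rse <- summary(fit)$sigma  # residual standard error

# Quartiles we would expect if the residuals were normal with sd = rse
expected_q <- qnorm(c(0.25, 0.75)) * rse

# Quartiles actually observed in the residuals
observed_q <- quantile(resid(fit), c(0.25, 0.75))

print(round(expected_q, 3))
print(round(observed_q, 3))
```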

The degrees of freedom is the number of observations in the model minus the number of coefficients. So we have 50 points of data and two coefficients (intercept and square root of speed) for 48 degrees of freedom.

The Multiple R-squared value, 61.8%, is the percentage of the variation in stopping distance that is explained by the independent variable, the square root of speed.

The Adjusted R-squared value measures the same thing but is penalized for the number of independent variables in the model, so it is slightly smaller.
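Both R-squared values can be reproduced from the residuals, which may make their meaning clearer. The sketch below uses the standard definitions; the names `fit`, `ss_res`, `ss_tot`, `r2`, and `adj_r2` are assumptions for this illustration.

```r
# Rebuild the model so this block runs on its own
df <- cars
df$speed_root <- sqrt(df$speed)
fit <- lm(dist ~ speed_root, data = df)

ss_res <- sum(resid(fit)^2)                  # variation left unexplained by the model
ss_tot <- sum((df$dist - mean(df$dist))^2)   # total variation in stopping distance

r2 <- 1 - ss_res / ss_tot                    # Multiple R-squared

n <- nrow(df)  # number of observations
p <- 1         # number of independent variables
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)  # Adjusted R-squared

print(round(c(r2 = r2, adj_r2 = adj_r2), 4))
```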

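The F-statistic compares the current model to a model that only has the intercept parameter. It’s supposed to be more informative in multiple linear regression, where there are several independent (explanatory) variables. With a single independent variable, as here, it carries the same information as the t-test on the slope: 77.65 is the square of the t value 8.812, and its p-value of 1.345e-11 matches the slope’s.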
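The comparison against an intercept-only model can be run explicitly with `anova()`. In a model with one independent variable the resulting F value should equal the square of the slope's t value. A sketch, where `null_fit`, `fit`, and `cmp` are names assumed for this illustration:

```r
# Rebuild the model so this block runs on its own
df <- cars
df$speed_root <- sqrt(df$speed)
fit      <- lm(dist ~ speed_root, data = df)
null_fit <- lm(dist ~ 1, data = df)   # intercept-only model

# F-test of the intercept-only model against the fitted model
cmp <- anova(null_fit, fit)
print(cmp)

f_stat  <- cmp$F[2]
t_slope <- summary(fit)$coefficients["speed_root", "t value"]
print(c(F = f_stat, t_squared = t_slope^2))
```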

summary(lm)
## 
## Call:
## lm(formula = dist ~ speed_root, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -28.258 -10.969  -2.978  10.518  47.709 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -65.819     12.555  -5.242 3.51e-06 ***
## speed_root    28.192      3.199   8.812 1.34e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.09 on 48 degrees of freedom
## Multiple R-squared:  0.618,  Adjusted R-squared:   0.61 
## F-statistic: 77.65 on 1 and 48 DF,  p-value: 1.345e-11


Residual Analysis

Residuals Plot

Here we plot the residuals of our model against their fitted values. In a good model we would expect to see an even dispersal of these above and below zero and along the whole range of fitted values. Ours do not seem uniformly scattered.

plot(fitted(lm), resid(lm))

Quantile-versus-quantile (Q-Q) Plot

Here we generate the Q-Q plot. If the residuals are normally distributed we would expect the points to follow a straight line. Our points are slightly convex, or bowed toward the bottom right corner, which indicates that our residuals are slightly right-skewed.

qqnorm(lm$residuals)
qqline(lm$residuals)
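The skew seen in the Q-Q plot can also be summarized as a single number. The sketch below computes a simple moment-based sample skewness of the residuals; this particular estimator and the names `fit`, `z`, and `skew` are choices made for this illustration, not something taken from the textbook.

```r
# Rebuild the model so this block runs on its own
df <- cars
df$speed_root <- sqrt(df$speed)
fit <- lm(dist ~ speed_root, data = df)

r <- resid(fit)

# Standardize the residuals, then average their cubes
z    <- (r - mean(r)) / sd(r)
skew <- mean(z^3)

print(round(skew, 3))
```

A value above zero points to a longer right tail, and a value near zero to a roughly symmetric distribution.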

All four default diagnostic plots for the simple linear regression model

This is a more convenient way to generate the Residuals Plot and the Q-Q Plot.

It also includes the “Scale-Location” plot, which plots the square root of the absolute standardized residuals against the fitted values. This can help in visually spotting patterns in the spread of the residuals.

The Residuals vs Leverage plot wasn’t discussed in the textbook, but it shows whether any data points had an outsized influence on the regression model. It looks like two points had a lot of influence on the model, and it would be interesting to see how the model would change without those two points.

plot(lm)
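One way to follow up on the influential points is to rank the observations by Cook's distance and refit without the top two. This is a sketch under an assumption: picking the two largest Cook's distances may not select exactly the two points that stand out visually in the plot. The names `fit`, `cd`, `top2`, and `refit` are illustrative.

```r
# Rebuild the model so this block runs on its own
df <- cars
df$speed_root <- sqrt(df$speed)
fit <- lm(dist ~ speed_root, data = df)

# Cook's distance measures how much each observation moves the fitted model
cd   <- cooks.distance(fit)
top2 <- order(cd, decreasing = TRUE)[1:2]   # row indices of the two largest values
print(df[top2, ])

# Refit without those two rows and compare the coefficients
refit <- lm(dist ~ speed_root, data = df[-top2, ])
print(rbind(all_points = coef(fit), without_top2 = coef(refit)))
```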



Was the linear model appropriate?

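Our Adjusted R-squared value of 61% was relatively low, and the residuals were not uniformly scattered and were slightly right-skewed. Still, both coefficients are highly significant, so it looks like we have a reasonable model!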