Homework 11 - Linear Regression

The cars dataset in R has 50 observations of two variables recorded in the 1920s: speed (in mph) and stopping distance (dist, in feet). We’ll build a simple linear regression model for stopping distance as a function of speed and analyze the results.

Data Setup and EDA

df_cars <- cars
head(df_cars)

##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10

Summary Statistics

summary(df_cars)

##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

The median speed is 15.0 (range 4 to 25), and the median stopping distance is 36.0 (range 2 to 120). summary() reports no NA counts for either column, so there are no missing values in the dataset.
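The absence of missing values can also be checked directly rather than read off the summary output. The following is a minimal sketch in base R; the name na_counts is introduced here only for illustration.

```r
# Work on a copy of the built-in cars data frame, as in the rest of the report.
df_cars <- cars

# Count missing values per column.
na_counts <- colSums(is.na(df_cars))
print(na_counts)

# TRUE only if no value anywhere in the data frame is NA.
print(!anyNA(df_cars))
```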

Visualize the Data

plot(df_cars[,'speed'],
     df_cars[,'dist'],
     main='cars',
     ylab='Stopping Distance',
     xlab='Speed')

The scatterplot suggests a positive, roughly linear relationship between the two variables: stopping distance tends to increase as speed increases.
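The visual impression can be backed by a number. As a small sketch, the Pearson correlation coefficient from base R's cor() measures the strength of the linear association; the variable name r is an assumption made for this example.

```r
df_cars <- cars

# Pearson correlation between speed and stopping distance.
r <- cor(df_cars$speed, df_cars$dist)
print(round(r, 2))  # about 0.81, i.e. a fairly strong positive association
```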

Simple Linear Regression Model

lm_cars <- lm(dist ~ speed, data = df_cars)
lm_cars$coefficients

## (Intercept)       speed 
##  -17.579095    3.932409

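\[\widehat{dist} = -17.579095 + 3.932409 \times speed\]

The coefficients of the linear model give us the linear equation for this relationship. The slope means that each additional unit of speed is associated with an increase of about 3.93 units in predicted stopping distance. The intercept (-17.58) is the predicted stopping distance at a speed of 0; a negative distance has no physical meaning, so it is best read as an artifact of extrapolating below the observed range of speeds (4 to 25).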
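To see the equation at work, predict() can evaluate the fitted model at chosen speeds. This is a sketch only; the speeds 10 and 20 and the names new_speeds and pred are hypothetical values picked for illustration, both inside the observed range.

```r
# Refit the model so this example stands on its own.
lm_cars <- lm(dist ~ speed, data = cars)

# Hypothetical speeds (illustrative, within the observed 4 to 25 range).
new_speeds <- data.frame(speed = c(10, 20))

# Predicted stopping distances from the fitted line.
pred <- predict(lm_cars, newdata = new_speeds)
print(round(pred, 2))

# By hand for speed = 10: -17.579095 + 3.932409 * 10 = 21.744995
```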

plot(dist~speed,
     data=df_cars,
     main='lm_cars',
     ylab='Stopping Distance',
     xlab='Speed')

abline(lm_cars)

The fitted regression line is drawn over the scatterplot by abline() above.

Evaluating Model Quality

summary(lm_cars)

## 
## Call:
## lm(formula = dist ~ speed, data = df_cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

The summary() function provides diagnostic statistics about our model.

The Residuals section describes the distribution of the model residuals, which we would expect to be approximately normal with a median value near zero, min/max values of approximately the same magnitude, and quartile values of approximately the same magnitude. (In this case the maximum, 43.2, is noticeably larger in magnitude than the minimum, -29.1, and the median is slightly below zero, so we may suspect the distribution to be somewhat right-skewed; we’ll confirm this later.)

The Coefficients section describes the fitted values (Estimate) and related statistics for each variable in our model.

As a general rule of thumb, the Std. Error should be at least 5 to 10 times smaller than the corresponding coefficient. The t value column (Student’s t-statistic) is exactly this ratio, Estimate divided by Std. Error; for speed it is 3.9324 / 0.4155 = 9.464, so the SE is about 9.5 times smaller than the estimate. Larger absolute t values indicate that the estimate is large relative to its sampling variability.

Finally, the p-value Pr(>|t|) is the probability of observing a t-value at least this extreme under the assumption that the true coefficient is zero, i.e. that no linear relationship exists between the predictor and response variables. The smaller the p-value, the stronger the evidence against that assumption - and in this case the p-value for speed is very small (1.49e-12).
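The t value and p-value for speed can be recomputed by hand from the coefficient table, which makes the definitions above concrete. A minimal sketch, assuming the same model as above; coefs, t_speed and p_speed are names introduced for this example.

```r
lm_cars <- lm(dist ~ speed, data = cars)

# Coefficient table as a matrix: Estimate, Std. Error, t value, Pr(>|t|).
coefs <- coef(summary(lm_cars))

# t statistic = estimate divided by its standard error.
t_speed <- coefs["speed", "Estimate"] / coefs["speed", "Std. Error"]

# Two-sided p-value from the t distribution with the residual degrees of freedom (48).
p_speed <- 2 * pt(abs(t_speed), df = df.residual(lm_cars), lower.tail = FALSE)

print(c(t = t_speed, p = p_speed))
```

Both values should agree with the speed row of the summary() output.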

The Multiple R-squared and Adjusted R-squared statistics measure how much of the variability in the response is explained by the model (the adjusted version penalizes for the number of predictors) - in this case, around 65%.
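Both statistics follow from the residual and total sums of squares. The sketch below recomputes them in base R; ss_res, ss_tot, r2 and adj_r2 are names chosen for illustration, and p = 1 reflects the single predictor in this model.

```r
lm_cars <- lm(dist ~ speed, data = cars)

# Residual sum of squares: variation left unexplained by the model.
ss_res <- sum(resid(lm_cars)^2)

# Total sum of squares: variation of dist around its mean.
ss_tot <- sum((cars$dist - mean(cars$dist))^2)

# R-squared is the share of total variation explained by the model.
r2 <- 1 - ss_res / ss_tot

# Adjusted R-squared penalizes for the number of predictors p.
n <- nrow(cars)
p <- 1
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(round(c(r2 = r2, adj_r2 = adj_r2), 4))
```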

Residual Analysis

If the relationship is indeed linear, then the vertical distances of the observed values above and below the fitted line (the errors or “residuals”) should be roughly normally distributed around zero, with a similar spread along the whole line.

We can check this with a scatterplot of the residuals against the fitted values. We would expect a random scattering around the horizontal line at 0. However, the spread of the residuals does appear to increase slightly as we move from left to right.

plot(fitted(lm_cars), resid(lm_cars))
abline(0,0)
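As a rough numeric companion to the residual plot, the spread of the residuals can be compared between the lower and upper halves of the fitted values. This is only an informal sketch, not a formal test of constant variance; splitting at the median fitted value and the names upper_half and spread are assumptions made for this example.

```r
lm_cars <- lm(dist ~ speed, data = cars)
res <- resid(lm_cars)
fit <- fitted(lm_cars)

# Split observations at the median fitted value.
upper_half <- fit > median(fit)

# Standard deviation of the residuals in each half; similar values
# would be consistent with a roughly constant spread.
spread <- c(lower = sd(res[!upper_half]), upper = sd(res[upper_half]))
print(round(spread, 2))
```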

The quantile-versus-quantile (Q-Q) plot below gives a clearer picture of the distribution of the residuals; if they are approximately normal, we would expect most points to fall along the Q-Q line.

As we can see, there are some non-normally distributed values towards the extremes, suggesting some skew to the distribution of residuals:

qqnorm(resid(lm_cars))
qqline(resid(lm_cars))
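A formal normality test can complement the visual check. As a hedged sketch, base R's shapiro.test() applies the Shapiro-Wilk test to the residuals; this is an optional addition to the analysis, and the name sw is introduced for illustration. A small p-value would be evidence against normally distributed residuals.

```r
lm_cars <- lm(dist ~ speed, data = cars)

# Shapiro-Wilk test; the null hypothesis is that the residuals are normal.
sw <- shapiro.test(resid(lm_cars))
print(sw)
```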

Conclusion

Overall, this simple linear regression model seems to describe the relationship between speed and stopping distance fairly well, explaining around 65% of the variability in distance. However, some skewness in the distribution of the residuals suggests the relationship is not perfectly described, and there may be other variables at work that the model does not capture.