The cars dataset in R has 50 observations of two
variables, recorded in the 1920s: speed (in mph) and stopping distance
(dist, in ft). We’ll fit a simple linear regression model of
stopping distance as a function of speed, and analyze the results.
df_cars <- cars
head(df_cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10
summary(df_cars)
##      speed           dist
##  Min.   : 4.0   Min.   :  2.00
##  1st Qu.:12.0   1st Qu.: 26.00
##  Median :15.0   Median : 36.00
##  Mean   :15.4   Mean   : 42.98
##  3rd Qu.:19.0   3rd Qu.: 56.00
##  Max.   :25.0   Max.   :120.00
The median speed is 15.0 mph (with a range from 4 to 25),
and the median stopping distance is 36.0 ft (with a range from 2 to
120). The summary reports no NA counts, so there are no missing values in the dataset.
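The claim about missing values can also be checked directly rather than read off the summary. The following is a minimal sketch using base R only; the names `na_counts` and `all_complete` are introduced here for illustration.

```r
# Work on a copy of the built-in cars dataset, as in the text
df_cars <- cars

# Number of missing values in each column
na_counts <- colSums(is.na(df_cars))
print(na_counts)

# TRUE only if every row has no missing value in any column
all_complete <- all(complete.cases(df_cars))
print(all_complete)
```

If either column contained NA values, `summary()` would also have listed an `NA's` row for it.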
plot(df_cars[,'speed'],
df_cars[,'dist'],
main='cars',
ylab='Stopping Distance',
xlab='Speed')
The scatterplot suggests a positive relationship between the two variables: stopping distance tends to increase as speed increases.
lm_cars <- lm(dist ~ speed, data = df_cars)
lm_cars$coefficients
## (Intercept)       speed
##  -17.579095    3.932409
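\[\widehat{dist} = -17.579095 + 3.932409 \times speed\]
The coefficients of the linear model give us the equation of the fitted line. Each additional unit of speed (1 mph) increases the predicted stopping distance by about 3.93 units (ft). The intercept of -17.6 is the predicted distance at a speed of zero; it lies outside the observed range of speeds (4 to 25) and has no physical meaning on its own.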
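To make the equation concrete, here is a small sketch that evaluates it by hand and compares the result with `predict()`. The three speeds in `new_speeds` are hypothetical values picked for illustration, all within the observed range of 4 to 25.

```r
df_cars <- cars
lm_cars <- lm(dist ~ speed, data = df_cars)

# Hypothetical speeds chosen for illustration (within the observed range)
new_speeds <- data.frame(speed = c(10, 15, 20))

# Evaluate intercept + slope * speed directly from the coefficients
b <- coef(lm_cars)
by_hand <- unname(b["(Intercept)"] + b["speed"] * new_speeds$speed)

# Same predictions via predict()
by_predict <- unname(predict(lm_cars, newdata = new_speeds))

print(cbind(new_speeds, by_hand, by_predict))
```

At a speed of 15, for example, the equation gives -17.579095 + 3.932409 × 15, or about 41.4.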
plot(dist~speed,
data=df_cars,
main='lm_cars',
ylab='Stopping Distance',
xlab='Speed')
abline(lm_cars)
The abline() call above draws this fitted line over the
scatterplot.
summary(lm_cars)
##
## Call:
## lm(formula = dist ~ speed, data = df_cars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -29.069  -9.525  -2.272   9.215  43.201
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
The summary() function provides diagnostic statistics
about our model.
The Residuals section summarizes the distribution of the
model residuals. If the model fits well, we would expect them to be approximately normal, with a
median value near zero, min/max values of approximately the same
magnitude, and quartile values of approximately the same magnitude. (In
this case the maximum, 43.2, is noticeably larger in magnitude than the minimum, -29.1, and the median is
below zero, so we may suspect the distribution to be somewhat right-skewed,
and we’ll check this later.)
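The Residuals block and the residual standard error can be reproduced from the fitted model, which may help show where these numbers come from. This is a sketch in base R; `five_num` and `rse` are names introduced here.

```r
lm_cars <- lm(dist ~ speed, data = cars)
res <- resid(lm_cars)

# Min, 1Q, median, 3Q and max of the residuals
five_num <- quantile(res)
print(round(five_num, 3))

# With an intercept in the model, the residuals average to zero
print(mean(res))

# Residual standard error: sqrt(sum of squared residuals / residual df)
rse <- sqrt(sum(res^2) / df.residual(lm_cars))
print(rse)
```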
The Coefficients section lists the fitted values
(Estimate) and related statistics for each term in our model (the intercept and speed).
As a rule of thumb, the Std. Error should be 5 to 10
times smaller than the corresponding coefficient. The
t value column (Student’s t-statistic) is exactly this ratio, Estimate divided by Std. Error;
for speed it is 3.9324 / 0.4155 = 9.46, so the SE is about 9.5 times smaller than the estimate.
A larger absolute t-value means the slope estimate is large relative to
its uncertainty.
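The ratio described above can be recomputed from the coefficient table. The sketch below also prints a 95% confidence interval for each coefficient with `confint()`, which is an addition not used in the original analysis; `coef_table`, `t_by_hand` and `ci` are names introduced here.

```r
lm_cars <- lm(dist ~ speed, data = cars)

# Coefficient table as a numeric matrix
coef_table <- coef(summary(lm_cars))

# t value = Estimate / Std. Error, computed by hand for each term
t_by_hand <- coef_table[, "Estimate"] / coef_table[, "Std. Error"]
print(cbind(coef_table[, c("Estimate", "Std. Error")], t_by_hand))

# 95% confidence intervals for the intercept and the slope
ci <- confint(lm_cars, level = 0.95)
print(ci)
```

The hand-computed values match the `t value` column of the summary (9.464 for speed).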
Finally, the p-value Pr(>|t|) is the probability of
observing a t-value at least this extreme under the assumption
that the true coefficient is zero, i.e. that no linear relationship exists between the predictor and response
variables. The smaller our p-value, the stronger the evidence against
that assumption - and in this case our p-value for speed
is very small (1.49e-12).
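As a sketch of how this p-value arises, the two-sided tail probability can be computed from the t-distribution with the model's residual degrees of freedom (48 here). The names `t_speed`, `df_res` and `p_by_hand` are introduced for illustration.

```r
lm_cars <- lm(dist ~ speed, data = cars)
coef_table <- coef(summary(lm_cars))

# t-statistic for the slope and the residual degrees of freedom
t_speed <- coef_table["speed", "t value"]
df_res <- df.residual(lm_cars)

# Two-sided p-value: probability of |t| at least this large under a zero slope
p_by_hand <- 2 * pt(abs(t_speed), df = df_res, lower.tail = FALSE)

print(c(by_hand = p_by_hand, from_summary = coef_table["speed", "Pr(>|t|)"]))
```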
The Multiple R-squared and
Adjusted R-squared statistics measure how much of the
variability in the response can be explained by the model - in this case,
around 65%.
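Both statistics can be derived from sums of squares, which makes "variability explained" concrete. The following is a minimal sketch; `ss_res`, `ss_tot`, `r2`, `adj_r2` and `r2_cor` are names introduced here.

```r
lm_cars <- lm(dist ~ speed, data = cars)

# Variability left unexplained by the model, and total variability in dist
ss_res <- sum(resid(lm_cars)^2)
ss_tot <- sum((cars$dist - mean(cars$dist))^2)

# R-squared: share of the total variability explained by the model
r2 <- 1 - ss_res / ss_tot

# Adjusted R-squared penalizes the number of predictors (p = 1 here)
n <- nrow(cars)
p <- 1
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# With a single predictor, R-squared equals the squared correlation
r2_cor <- cor(cars$speed, cars$dist)^2

print(c(r2 = r2, adj_r2 = adj_r2, r2_cor = r2_cor))
```

These reproduce the 0.6511 and 0.6438 reported by `summary()`.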
If the relationship is indeed linear, then the residuals (the vertical distances of the observed values above and below the fitted line) should be roughly normally distributed around zero.
We can check this with a scatterplot of the residuals against the fitted values. We would expect a completely random scatter around the horizontal line at zero. However, the spread of the residuals does appear to increase slightly as we move from left to right.
plot(fitted(lm_cars), resid(lm_cars))
abline(0,0)
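As a rough numeric companion to the residual plot, one can compare the spread of the residuals in the lower and upper halves of the fitted values. This is an informal check added here for illustration, not part of the original analysis and not a formal test; the split at the median and the names `upper_half` and `spread` are assumptions of this sketch.

```r
lm_cars <- lm(dist ~ speed, data = cars)
fit <- fitted(lm_cars)
res <- resid(lm_cars)

# Split observations at the median fitted value (an arbitrary, illustrative cut)
upper_half <- fit > median(fit)

# Standard deviation of the residuals within each half
spread <- tapply(res, ifelse(upper_half, "upper", "lower"), sd)
print(spread)
```

Similar values in the two halves would be consistent with constant variance; a clearly larger value in one half would support the impression from the plot.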
The quantile-quantile (Q-Q) plot below gives a clearer picture of the distribution of the residuals; if they are normally distributed, we would expect to see most points fall along the Q-Q line.
As we can see, some points towards the extremes depart from the line, suggesting some skew in the distribution of residuals:
qqnorm(resid(lm_cars))
qqline(resid(lm_cars))
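The skew suggested by the Q-Q plot can also be quantified. The sketch below computes the sample skewness of the residuals by hand (positive values indicate a right skew) and runs a Shapiro-Wilk normality test with `shapiro.test()` from base R. Both are additions for illustration and not part of the original analysis; `skew` and `sw` are names introduced here.

```r
lm_cars <- lm(dist ~ speed, data = cars)
res <- resid(lm_cars)

# Sample skewness: third central moment over the 1.5 power of the second
centered <- res - mean(res)
skew <- mean(centered^3) / mean(centered^2)^1.5
print(skew)

# Shapiro-Wilk test; a small p-value is evidence against normality
sw <- shapiro.test(res)
print(sw)
```

A positive skewness agrees with the longer upper tail seen in the Residuals summary, where the maximum exceeds the minimum in magnitude.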
Overall, this simple linear regression model seems to describe the relationship between
speed and distance fairly well, although some
skewness in the distribution of residuals suggests this relationship is
not perfectly captured, and that there may be other variables at work
that the model does not include.