In this analysis, I illustrate a univariate linear regression following the approach outlined in the textbook “Linear Regression Using R” by David Lilja. I use a standard data set cars included with R which contains the speed of cars and the distances taken to stop.
In the next section, I describe the data set. Then I visualize the data set via exploratory data analysis. Next, I run the regression and describe the model outputs. Finally, I conduct residual analysis to see if a univariate linear model can explain the data well.
The main innovation of this exercise (compared to the Week 11 Discussion) is the use of ggplot for data visualization.
The data set cars is derived from the 1930 book Methods of Correlation Analysis by M. Ezekiel using cars from the 1920s. 50 observations of two variables are recorded: speed measured in miles per hour and dist the stopping distance of each car in feet. We can think of these observations as experiments conducted with a number of cars on a race track to measure how much stopping distance is required from a given rate of speed.
head(cars)
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
str(cars)
## 'data.frame': 50 obs. of 2 variables:
## $ speed: num 4 4 7 7 8 9 10 10 10 11 ...
## $ dist : num 2 10 4 22 16 10 18 26 34 17 ...
We explain the columns above by using the first obbservation in the cars as an example. The first car sped up to 4 miles per hour and took 2 feet to subsequently stop.
We see in the scatterplot below that speed and distance to stop are generally proportional. This should be logical given the laws of physics and the physical properties of the braking systems of cars. As a car’s running speed is higher, the distance required during which brakes are applied to bring the car to rest should increase.
gg <- ggplot(data=cars, aes(x=speed, y = dist)) +
geom_point() + labs(title="Cars Stopping Distance vs. Speed", x="Speed (mph)", y="Distance (ft)", caption="source: M. Ezekiel, Methods of Correlation Analysis") + theme_economist()
plot(gg)
Now let’s explore the distribution of speed through summary statistics and its histogram.
The median speed in these trials was 15 mph. This is not very high. The lowest speed was 4 mph. The highest speed was 25 mph. So the distribution of speeds is consistent with the range observed in residential streets – not on highways. The distribution is relatively symmetric and appears centered around the median of 15mph.
summary(cars$speed)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.0 12.0 15.0 15.4 19.0 25.0
gg2 <- ggplot(data=cars, aes(x=speed)) +
geom_histogram(bins=10) + labs(title="Cars Speed", x="Speed (mph)", caption="source: M. Ezekiel, Methods of Correlation Analysis") + theme_economist()
plot(gg2)
Now let’s explore the distance to stop.
The median stopping distance was 36 feet with a mean of 42.98 ft. The distribution is slightly skewed to the right due to an outlier at 120 ft and a slightly longer right tail.
summary(cars$dist)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 26.00 36.00 42.98 56.00 120.00
gg3 <- ggplot(data=cars, aes(x=dist)) +
geom_histogram(bins=10) + labs(title="Cars Distance to Stop", x="Distance (ft)", caption="source: M. Ezekiel, Methods of Correlation Analysis") + theme_economist()
plot(gg3)
We now construct a linear regression model to explain stopping distances in terms of the car’s speed. We expect the model to show a positive slope for the fitted line because a higher initial car speed ought to require a longer distance to brake and stop.
mod1 = lm( cars$dist ~ cars$speed , data=cars)
summary(mod1)
##
## Call:
## lm(formula = cars$dist ~ cars$speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## cars$speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
The linear regression model shows:
\[ distance = -17.5791 + 3.3924 * speed \]
To evaluate the quality of the model, we examine several things:
Now we visualize the fit of the linear regression model to the data.
gg4 <- ggplot(data=cars, aes(x=speed, y = dist)) +
geom_point() +
labs(title="Cars Stopping Distance vs. Speed", x="Speed (mph)",
y="Distance (ft)",
caption="source: M. Ezekiel, Methods of Correlation Analysis") +
geom_smooth(method="lm", se=TRUE , col = "red") +
theme_economist()
plot(gg4)
The regression line does seem to fit the data points well. However, we notice a slight assymmetry in that the outlier points at 14 mph and 24 mph above the line seem larger than the points below.
The slope of the line is positive which is consistent with my expectations. However, the intercept \(\beta_0 = -17.5791\) is inconsistent with physical reality. The linear regression model cannot extend to the speed equal to zero. Since the stopping distance cannot be negative -17.5791 feet if the car is standing still.
Because the p-value of the coefficients is sufficiently small, we know the slope \(\beta_1 = 3.3924\) is statistically significant at the \(p=0.001\) level which is a very strong result. However, the slope \(\beta_0 = -17.5791\) is only significant at the \(p-0.05\) level consistent with its physical implausibility.
The \(R^2\) is 65.11 percent which means the model explains 65 percent of the variability of the data. This is a strong fit.
The F-statistic of 89.57 and p-value of the regression suggest the model is statistically significant.
To display diagnostic plots using ggplot, we adapt the sample code from Rpubs by Raju Rimal. [https://rpubs.com/therimalaya/43190]
model = mod1
p1<-ggplot(model, aes(.fitted, .resid))+geom_point()
p1<-p1+stat_smooth(method="loess")+geom_hline(yintercept=0, col="red", linetype="dashed")
p1<-p1+xlab("Fitted values")+ylab("Residuals")
p1<-p1+ggtitle("Residual vs Fitted Plot")+theme_economist()
p2<-ggplot(model, aes(qqnorm(.stdresid)[[1]], .stdresid))+geom_point(na.rm = TRUE)
p2<-p2+geom_abline(aes(slope=1 , intercept=0 ))+xlab("Theoretical Quantiles")+ylab("Standardized Residuals")
p2<-p2+ggtitle("Normal Q-Q")+theme_economist()
p3<-ggplot(model, aes(.fitted, sqrt(abs(.stdresid))))+geom_point(na.rm=TRUE)
p3<-p3+stat_smooth(method="loess", na.rm = TRUE)+xlab("Fitted Value")
p3<-p3+ylab(expression(sqrt("|Standardized residuals|")))
p3<-p3+ggtitle("Scale-Location")+theme_economist()
p4<-ggplot(model, aes(.hat, .stdresid))+geom_point(aes(size=.cooksd), na.rm=TRUE)
p4<-p4+stat_smooth(method="loess", na.rm=TRUE)
p4<-p4+xlab("Leverage")+ylab("Standardized Residuals")
p4<-p4+ggtitle("Residual vs Leverage Plot")
p4<-p4+scale_size_continuous("Cook's Distance", range=c(1,5))
p4<-p4+theme_economist()+theme(legend.position="bottom")
p5<-ggplot(model, aes(seq_along(.cooksd), .cooksd))+geom_bar(stat="identity", position="identity")
p5<-p5+xlab("Obs. Number")+ylab("Cook's distance")
p5<-p5+ggtitle("Cook's distance")+theme_bw()
plot(p1)
plot(p2)
plot(p3)
plot(p4)
plot(p5)
In the above diagnostic plots, we conclude from p1 (Residual vs Fitted Plot) that positive residuals have larger outliers than negative residuals. Otherwise, the same is reasonable.
From plot p2 (Normal Q-Q Plot), we see the distribution is slightly fat-tailed at the extremities where a 3 standard deviation actual result is compared to a 2 sd theoretical result.
From plot p3 (Scale vs. Location) we see the variance of the residuals is roughly equal across the range of values. This is a good sign for the appropriateness of the regression model.
From plot p4 (Residual vs. Leverage Plot) there is evidence that some data points are moderate outliers.
From plot p5 (Cook’s distance) we observe that observation 149 could be viewed as an influential outlier. It has values speed = 24, distance = 120.
Re-estimating the linear model by removing observation 49 gives a slightly lower slope. This means the distance required to brake is reduced when the outlier is removed. It changes from \(\beta_1 = 3.9324\) to \(\beta_1 = 3.6396\). This appears to material but does not change the conclusion that the regression is significant.
cars1 = cars[-49,]
mod2 = lm( cars1$dist ~ cars1$speed, data=cars1)
summary(mod2)
##
## Call:
## lm(formula = cars1$dist ~ cars1$speed, data = cars1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.789 -9.149 -1.672 8.013 43.048
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -14.0021 6.2951 -2.224 0.031 *
## cars1$speed 3.6396 0.3918 9.290 3.26e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.1 on 47 degrees of freedom
## Multiple R-squared: 0.6474, Adjusted R-squared: 0.6399
## F-statistic: 86.31 on 1 and 47 DF, p-value: 3.262e-12
The univariate linear regression model of stopping distance as a function of speed is sensible and roughly linear. This is supported by the laws of physics and general engineering concept for cars. The data is limited to residential speeds and the model may not work if speed are highway level (50-80 mph). There was one outlier but even testing the model with it removed did not change the story drastically.