Data 605 Assignment Week 11

Alexander Ng

11/10/2019

Overview

In this analysis, I illustrate a univariate linear regression following the approach outlined in the textbook “Linear Regression Using R” by David Lilja. I use a standard data set cars included with R which contains the speed of cars and the distances taken to stop.

In the next section, I describe the data set. Then I visualize the data set via exploratory data analysis. Next, I run the regression and describe the model outputs. Finally, I conduct residual analysis to see if a univariate linear model can explain the data well.

The main innovation of this exercise (compared to the Week 11 Discussion) is the use of ggplot for data visualization.

Obtaining and Describing the Data

The data set cars is derived from the 1930 book Methods of Correlation Analysis by M. Ezekiel using cars from the 1920s. 50 observations of two variables are recorded: speed measured in miles per hour and dist the stopping distance of each car in feet. We can think of these observations as experiments conducted with a number of cars on a race track to measure how much stopping distance is required from a given rate of speed.

head(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10
str(cars)
## 'data.frame':    50 obs. of  2 variables:
##  $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
##  $ dist : num  2 10 4 22 16 10 18 26 34 17 ...

We explain the columns above by using the first obbservation in the cars as an example. The first car sped up to 4 miles per hour and took 2 feet to subsequently stop.

Visualizing and Describing the Dataset

We see in the scatterplot below that speed and distance to stop are generally proportional. This should be logical given the laws of physics and the physical properties of the braking systems of cars. As a car’s running speed is higher, the distance required during which brakes are applied to bring the car to rest should increase.

gg <- ggplot(data=cars, aes(x=speed, y = dist)) +
   geom_point() + labs(title="Cars Stopping Distance vs. Speed", x="Speed (mph)", y="Distance (ft)", caption="source: M. Ezekiel, Methods of Correlation Analysis")  + theme_economist()

plot(gg)

Now let’s explore the distribution of speed through summary statistics and its histogram.

The median speed in these trials was 15 mph. This is not very high. The lowest speed was 4 mph. The highest speed was 25 mph. So the distribution of speeds is consistent with the range observed in residential streets – not on highways. The distribution is relatively symmetric and appears centered around the median of 15mph.

summary(cars$speed)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     4.0    12.0    15.0    15.4    19.0    25.0
gg2 <- ggplot(data=cars, aes(x=speed)) +
   geom_histogram(bins=10) + labs(title="Cars Speed", x="Speed (mph)", caption="source: M. Ezekiel, Methods of Correlation Analysis")  + theme_economist()

plot(gg2)

Now let’s explore the distance to stop.
The median stopping distance was 36 feet with a mean of 42.98 ft. The distribution is slightly skewed to the right due to an outlier at 120 ft and a slightly longer right tail.

summary(cars$dist)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   26.00   36.00   42.98   56.00  120.00
gg3 <- ggplot(data=cars, aes(x=dist)) +
   geom_histogram(bins=10) + labs(title="Cars Distance to Stop", x="Distance (ft)", caption="source: M. Ezekiel, Methods of Correlation Analysis")  + theme_economist()
plot(gg3)

Constructing a linear regression

We now construct a linear regression model to explain stopping distances in terms of the car’s speed. We expect the model to show a positive slope for the fitted line because a higher initial car speed ought to require a longer distance to brake and stop.

mod1 = lm( cars$dist ~ cars$speed , data=cars)
summary(mod1)
## 
## Call:
## lm(formula = cars$dist ~ cars$speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## cars$speed    3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

The linear regression model shows:

\[ distance = -17.5791 + 3.3924 * speed \]

Evaluating the Quality of the model

To evaluate the quality of the model, we examine several things:

  1. visualize the fit of the linear model versus the scatter plot
  2. consider the coefficients (sign, magnitude) and real world accuracy.
  3. analyze the statistical significance of the coefficients and the regression
  4. inspect the R-squared
  5. review the F-statistic

Now we visualize the fit of the linear regression model to the data.

gg4 <- ggplot(data=cars, aes(x=speed, y = dist)) +
   geom_point() + 
  labs(title="Cars Stopping Distance vs. Speed", x="Speed (mph)", 
       y="Distance (ft)", 
       caption="source: M. Ezekiel, Methods of Correlation Analysis") + 
geom_smooth(method="lm", se=TRUE , col = "red") +
theme_economist()

plot(gg4)

The regression line does seem to fit the data points well. However, we notice a slight assymmetry in that the outlier points at 14 mph and 24 mph above the line seem larger than the points below.

The slope of the line is positive which is consistent with my expectations. However, the intercept \(\beta_0 = -17.5791\) is inconsistent with physical reality. The linear regression model cannot extend to the speed equal to zero. Since the stopping distance cannot be negative -17.5791 feet if the car is standing still.

Because the p-value of the coefficients is sufficiently small, we know the slope \(\beta_1 = 3.3924\) is statistically significant at the \(p=0.001\) level which is a very strong result. However, the slope \(\beta_0 = -17.5791\) is only significant at the \(p-0.05\) level consistent with its physical implausibility.

The \(R^2\) is 65.11 percent which means the model explains 65 percent of the variability of the data. This is a strong fit.

The F-statistic of 89.57 and p-value of the regression suggest the model is statistically significant.

Residual Analysis

To display diagnostic plots using ggplot, we adapt the sample code from Rpubs by Raju Rimal. [https://rpubs.com/therimalaya/43190]

model = mod1

p1<-ggplot(model, aes(.fitted, .resid))+geom_point()
p1<-p1+stat_smooth(method="loess")+geom_hline(yintercept=0, col="red", linetype="dashed")
p1<-p1+xlab("Fitted values")+ylab("Residuals")
p1<-p1+ggtitle("Residual vs Fitted Plot")+theme_economist()

p2<-ggplot(model, aes(qqnorm(.stdresid)[[1]], .stdresid))+geom_point(na.rm = TRUE)
p2<-p2+geom_abline(aes(slope=1 , intercept=0 ))+xlab("Theoretical Quantiles")+ylab("Standardized Residuals")
p2<-p2+ggtitle("Normal Q-Q")+theme_economist()

p3<-ggplot(model, aes(.fitted, sqrt(abs(.stdresid))))+geom_point(na.rm=TRUE)
p3<-p3+stat_smooth(method="loess", na.rm = TRUE)+xlab("Fitted Value")
p3<-p3+ylab(expression(sqrt("|Standardized residuals|")))
p3<-p3+ggtitle("Scale-Location")+theme_economist()

p4<-ggplot(model, aes(.hat, .stdresid))+geom_point(aes(size=.cooksd), na.rm=TRUE)
p4<-p4+stat_smooth(method="loess", na.rm=TRUE)
p4<-p4+xlab("Leverage")+ylab("Standardized Residuals")
p4<-p4+ggtitle("Residual vs Leverage Plot")
p4<-p4+scale_size_continuous("Cook's Distance", range=c(1,5))
p4<-p4+theme_economist()+theme(legend.position="bottom")

p5<-ggplot(model, aes(seq_along(.cooksd), .cooksd))+geom_bar(stat="identity", position="identity")
p5<-p5+xlab("Obs. Number")+ylab("Cook's distance")
p5<-p5+ggtitle("Cook's distance")+theme_bw()
plot(p1)

plot(p2)

plot(p3)

plot(p4)

plot(p5)

In the above diagnostic plots, we conclude from p1 (Residual vs Fitted Plot) that positive residuals have larger outliers than negative residuals. Otherwise, the same is reasonable.

From plot p2 (Normal Q-Q Plot), we see the distribution is slightly fat-tailed at the extremities where a 3 standard deviation actual result is compared to a 2 sd theoretical result.

From plot p3 (Scale vs. Location) we see the variance of the residuals is roughly equal across the range of values. This is a good sign for the appropriateness of the regression model.

From plot p4 (Residual vs. Leverage Plot) there is evidence that some data points are moderate outliers.

From plot p5 (Cook’s distance) we observe that observation 149 could be viewed as an influential outlier. It has values speed = 24, distance = 120.

Re-estimating the linear model by removing observation 49 gives a slightly lower slope. This means the distance required to brake is reduced when the outlier is removed. It changes from \(\beta_1 = 3.9324\) to \(\beta_1 = 3.6396\). This appears to material but does not change the conclusion that the regression is significant.

cars1 = cars[-49,]
mod2 = lm( cars1$dist ~ cars1$speed, data=cars1)
summary(mod2)
## 
## Call:
## lm(formula = cars1$dist ~ cars1$speed, data = cars1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.789  -9.149  -1.672   8.013  43.048 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -14.0021     6.2951  -2.224    0.031 *  
## cars1$speed   3.6396     0.3918   9.290 3.26e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.1 on 47 degrees of freedom
## Multiple R-squared:  0.6474, Adjusted R-squared:  0.6399 
## F-statistic: 86.31 on 1 and 47 DF,  p-value: 3.262e-12

Conclusion

The univariate linear regression model of stopping distance as a function of speed is sensible and roughly linear. This is supported by the laws of physics and general engineering concept for cars. The data is limited to residential speeds and the model may not work if speed are highway level (50-80 mph). There was one outlier but even testing the model with it removed did not change the story drastically.