Problem 1:

Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis.)

Solution:

Visualization:

The first step in this one-factor modeling process is to determine whether or not a linear relationship appears to exist between the predictor (speed) and the output value (distance). We do this using the plot function and observe that distance does indeed tend to increase with speed. So, our next step is to develop a regression model that will help us quantify the degree of linearity in the relationship between the output and the predictor.

The simplest regression model is a straight line and has the form y = mx + c. Here, x is the input to the system, c is the y-intercept and m is the slope of the line.

plot(cars[,"speed"], cars[,"dist"], main='Stopping Distance vs. Speed', xlab='Speed (mph)', ylab='Stopping Distance (ft)')

summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Quality evaluation of the model:

Since R provides the function lm() to generate a linear model of the data, we will use that to generate a one-factor linear model as given below and assign it to the object cars_lm. As we can see from the summary function below, the lm function creates a linear model with the intercept (c) = -17.579 and slope m = 3.932. Hence, the one-factor linear regression model can be written as dist = -17.579 + 3.932 * speed.
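As a cross-check on the coefficients reported by lm(), the slope and intercept of a one-predictor least-squares line can also be computed directly from the sample covariance and variance. The following is a minimal sketch; the names x, y, m_hat, c_hat, fit_check and pred_20 are illustrative and not part of the original analysis, and the speed of 20 mph is an arbitrary example value.

```r
# Illustrative names only; the original analysis does not define them.
x <- cars$speed
y <- cars$dist

# Closed-form least-squares estimates for a one-predictor model
m_hat <- cov(x, y) / var(x)          # slope
c_hat <- mean(y) - m_hat * mean(x)   # intercept

# The same model fitted with lm(), for comparison
fit_check <- lm(dist ~ speed, data = cars)
print(c(intercept = c_hat, slope = m_hat))
print(coef(fit_check))

# Predicted stopping distance (ft) at an example speed of 20 mph
pred_20 <- c_hat + m_hat * 20
print(pred_20)
```

Both approaches should agree up to floating-point rounding, which confirms that lm() is fitting the ordinary least-squares line described above.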

The model quality can be determined from some of the model-related data extracted by the summary function. We examine the residuals first, which represent the difference between the actual measured values and the corresponding values on the fitted regression line. With reference to the plot of the data with the fitted line (drawn with abline) below, the min residual value of -29.069 is the vertical distance from the regression line to the point furthest below the line. Similarly, the max residual value of 43.201 is the vertical distance from the regression line to the point furthest above the line. The median is the median value of all the residuals, and the 1Q and 3Q values are the first and third quartiles of the sorted residual values.

Residuals

The interpretation of the residual values is that if the regression line is a good fit for the data, we would expect the residuals to be normally distributed around a mean of 0. This implies that a good model will tend to have a median residual near 0, min and max residuals of roughly the same magnitude, and first and third quartiles of roughly the same magnitude.

Looking at the values of the model below, the median residual value of -2.272 is close to 0 relative to the overall range of the residuals, which is consistent with a good fit. The 1Q and 3Q residual values of -9.525 and 9.215 respectively are essentially the same in magnitude. The min and max residual values of -29.069 and 43.201 respectively are of broadly similar magnitude, although the larger maximum suggests that a few points lie well above the line. So, overall, assessing model quality using the residuals leads us to the conclusion that this linear model is a reasonably good fit.
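The five residual statistics discussed here can be reproduced by hand, which makes their definition concrete. This is a small sketch under the assumption that the model is refitted locally; fit_check, res_manual and res_summary are illustrative names.

```r
# Illustrative names only; refit the model so the block is self-contained.
fit_check <- lm(dist ~ speed, data = cars)

# Residual = measured value minus value on the fitted line
res_manual <- cars$dist - fitted(fit_check)

# Min, 1Q, median, 3Q and max of the residuals
res_summary <- quantile(res_manual)
print(round(res_summary, 3))

# For a least-squares fit with an intercept, the mean residual is zero
# up to floating-point rounding
print(mean(res_manual))
```

The quantiles should match the Residuals block printed by summary(), and the mean of essentially zero shows why the median, rather than the mean, is the informative measure of centre here.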

Standard error

The std. error column shows the statistical standard error for each of the coefficients. A good model will typically have standard error that is at least five to ten times smaller than the corresponding coefficient. Here, the standard error for speed is 9.5 (3.9324 / 0.4155) times smaller than the speed coefficient. However, for the intercept, the standard error is only 2.6 (17.5791 / 6.7584) times smaller than the coefficient. What this tells us is that although there is little variability in the slope estimate m (standard error for speed is 9.5 times smaller than the coefficient), there is more variability in the y-intercept coefficient.
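The ratios quoted above can be read straight off the coefficient table, and they are the absolute values of the t value column. A minimal sketch, with fit_check, coef_table and ratio as illustrative names:

```r
# Illustrative names only; refit the model so the block is self-contained.
fit_check <- lm(dist ~ speed, data = cars)

# Matrix with columns Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(summary(fit_check))

# How many times larger each coefficient is than its standard error
ratio <- abs(coef_table[, "Estimate"]) / coef_table[, "Std. Error"]
print(round(ratio, 2))
```

Because the ratio equals |t value|, the five-to-ten-times rule of thumb is another way of asking for a t statistic of at least 5 to 10 in magnitude.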

p-value

The column labeled Pr(>|t|) shows the p-value for each coefficient, often read loosely as the probability that the corresponding coefficient is not relevant in the model. More precisely, it is the probability of observing a t value at least this extreme if the true coefficient were zero. From the values below, this probability is a minuscule 1.49e-12 for the speed coefficient, while for the intercept it is larger at 0.0123, or roughly a 1-in-80 chance. Again, this is another indication that the intercept is estimated less reliably than the speed coefficient, even though both are significant at the 0.05 level.
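To see where the Pr(>|t|) values come from, they can be recomputed from the t values and the residual degrees of freedom using the two-sided tail area of the t distribution. This is a sketch with illustrative names (fit_check, coef_table, t_vals, df_res, p_manual).

```r
# Illustrative names only; refit the model so the block is self-contained.
fit_check <- lm(dist ~ speed, data = cars)
coef_table <- coef(summary(fit_check))

t_vals <- coef_table[, "t value"]
df_res <- df.residual(fit_check)   # 50 observations - 2 coefficients

# Two-sided p-value: area in both tails beyond |t|
p_manual <- 2 * pt(-abs(t_vals), df = df_res)
print(signif(p_manual, 3))
```

The recomputed values should agree with the Pr(>|t|) column of the summary output.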

Multiple R-squared value

The multiple R-squared value is a number between 0 and 1 that measures how well the model describes the data. It is computed by dividing the variation explained by the model by the total variation of the data; a higher value generally means a better fit, although not always. The reported R^2 of 0.6511 means that the linear model explains roughly 65% of the variation in stopping distance.

Again, by this measure the overall model appears to be a good fit for the data.
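The R-squared value can be verified from its definition as one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch, with fit_check, ss_res, ss_tot and r2_manual as illustrative names:

```r
# Illustrative names only; refit the model so the block is self-contained.
fit_check <- lm(dist ~ speed, data = cars)

ss_res <- sum(resid(fit_check)^2)                # variation left unexplained
ss_tot <- sum((cars$dist - mean(cars$dist))^2)   # total variation in dist

r2_manual <- 1 - ss_res / ss_tot
print(round(r2_manual, 4))

# For a one-predictor model, R-squared is also the squared correlation
print(cor(cars$speed, cars$dist)^2)
```

Both printed values should match the Multiple R-squared line of the summary output.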

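attach(cars)  # expose the speed and dist columns by name for the calls below
cars_lm <- lm(dist ~ speed); cars_lm  # fit the one-factor model and print its coefficients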
## 
## Call:
## lm(formula = dist ~ speed)
## 
## Coefficients:
## (Intercept)        speed  
##     -17.579        3.932
plot(speed, dist)
abline(cars_lm)

summary(cars_lm)
## 
## Call:
## lm(formula = dist ~ speed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

Residual Analysis:

As we know, the residual value is the difference between the actual measured value and the value that the fitted regression line predicts for that corresponding data point. Residual values greater than zero mean that the regression model predicted a value that was too small compared to the actual measured value, and negative residual values indicate that the model predicted too large a value compared to the actual measured value.

A model that fits the data well would tend to over-predict as often as it under-predicts. Hence, if we plot the residual values, we would expect to see them scattered evenly around zero for a well-fitted model. From the plot below, we see that although the spread of the residuals tends to increase as we move to the right, they are more or less evenly scattered above and below zero.
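The visual impression from the residual plot can be supplemented with a simple count of how often the model under-predicts and over-predicts, and with a rough comparison of residual spread for low and high fitted values. This is only an informal sketch; the names fit_check, res, n_under, n_over, low_half, sd_low and sd_high, and the split at the median fitted value, are illustrative choices rather than part of the original analysis.

```r
# Illustrative names only; refit the model so the block is self-contained.
fit_check <- lm(dist ~ speed, data = cars)
res <- resid(fit_check)

# Positive residual: model predicted too small a value (under-prediction)
n_under <- sum(res > 0)
# Zero or negative residual: model predicted too large a value
n_over <- sum(res <= 0)
print(c(under_predicted = n_under, over_predicted = n_over))

# Rough check of whether the spread grows with the fitted value:
# compare residual standard deviation below and above the median fit
low_half <- fitted(fit_check) <= median(fitted(fit_check))
sd_low <- sd(res[low_half])
sd_high <- sd(res[!low_half])
print(c(sd_low = sd_low, sd_high = sd_high))
```

Roughly equal counts would support the even scatter around zero, while a noticeably larger standard deviation in the upper half would agree with the widening seen toward the right of the plot.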

Another test of the residuals uses the quantile-versus-quantile, or Q-Q, plot. The Q-Q plot provides a visual indication of whether the residuals from the model are normally distributed. If the residuals are normally distributed, we would expect the points to fall along a straight line. The residuals in our model mostly follow a straight line, indicating an approximately normal distribution with a few outliers at the ends. Hence, the residual analysis through both the residual plot and the Q-Q plot tells us that this linear model fits the data well, albeit with some outliers.

Conclusion:

In conclusion, we first determined through visual analysis that a linear relationship exists between speed and distance, which allowed us to proceed with a one-factor linear model. Next, the quality evaluation showed that the proposed linear model is a good fit for the data. This was based on the residuals (median residual close to 0, 1Q and 3Q residuals close in magnitude, min and max residuals broadly comparable in magnitude), the coefficients (the standard error for the speed coefficient is ~10x smaller than the coefficient value), the p-value (minuscule probability of the speed coefficient not being relevant in the model) and the multiple R-squared value of ~65%. The residual analysis (residual plot, Q-Q plot) similarly leads us to believe that this model is a good fit, albeit with the presence of outliers which need to be investigated further.

plot(fitted(cars_lm), resid(cars_lm), xlab='Fitted values', ylab='Residuals')
abline(h = 0, lty = 2)  # reference line at zero residual

qqnorm(resid(cars_lm))
qqline(resid(cars_lm))