Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis).
The cars dataset was recorded in the 1920s. The data give the speed of cars and the distances taken to stop.
From the summary below, we can see that the cars dataset has 2 variables with 50 observations. The 2 variables are the speed measured in miles per hour and the stopping distance (dist) measured in feet.
The summary statistics also show that both variables have high variation, as indicated by their standard deviations; the box plot below makes this clear. The skew value for speed is -0.11, which implies that its distribution is slightly skewed to the left, or negatively skewed. The stopping distance is the opposite: its skew value is 0.76, meaning its distribution is slightly skewed to the right, or positively skewed. The kurtosis values for both variables (-0.67 for speed and 0.12 for dist) are fairly close to zero, which implies that the distributions are roughly mesokurtic. Mesokurtic distributions are similar to normal distributions, in which extreme or outlier events are very unlikely.
library(psych)
# Structure and summary statistics of the dataset
str(cars)
## 'data.frame': 50 obs. of 2 variables:
## $ speed: num 4 4 7 7 8 9 10 10 10 11 ...
## $ dist : num 2 10 4 22 16 10 18 26 34 17 ...
describe(cars)
## vars n mean sd median trimmed mad min max range skew kurtosis
## speed 1 50 15.40 5.29 15 15.47 5.93 4 25 21 -0.11 -0.67
## dist 2 50 42.98 25.77 36 40.88 23.72 2 120 118 0.76 0.12
## se
## speed 0.75
## dist 3.64
# Box plot
boxplot(cars, main = 'Stopping Distance & Speed Variables')
Below is the scatter plot where the speed is the explanatory variable and stopping distance is the response variable.
# Scatter plot
plot(cars$speed, cars$dist, xlab = 'Speed (mph)',
ylab = 'Stopping Distance (ft)',
main = 'Stopping Distance vs Speed',
col = "red",
pch = 16)
Let’s now build a linear model and find the best fitting line.
# Linear regression model
cars_lm <- lm(cars$dist ~ cars$speed)
cars_lm
##
## Call:
## lm(formula = cars$dist ~ cars$speed)
##
## Coefficients:
## (Intercept) cars$speed
## -17.579 3.932
The linear model is \[\text{dist} = 3.932 \times \text{speed} - 17.579\]
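As a quick sanity check, we can evaluate the fitted line at a particular speed. The sketch below refits the model with a data argument (under the new name cars_lm2, introduced here for illustration) so that predict() can accept new speeds, which the cars$dist ~ cars$speed form above does not allow.
# Refit with a data argument so predict() can score new speeds
cars_lm2 <- lm(dist ~ speed, data = cars)
predict(cars_lm2, newdata = data.frame(speed = 20))
# By hand: 3.932 * 20 - 17.579 = 61.06 feet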
# Linear regression plot with the best fit line
plot(cars$speed, cars$dist, xlab = 'Speed (mph)',
ylab = 'Stopping Distance (ft)',
main = 'Stopping Distance vs Speed',
col = "red",
pch = 16)
abline(cars_lm)
From the plot above, there appears to be a positive linear relationship between the speed and stopping distance variables. In other words, as speed increases, the distance a car needs to stop is likely to increase as well. Let’s now evaluate the linear model we generated.
We can use the summary() function, as shown below, to extract additional information that helps determine how well the resulting model fits the data.
summary(cars_lm)
##
## Call:
## lm(formula = cars$dist ~ cars$speed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## cars$speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
First, let’s examine the residuals information from the summary. A good model tends to have a median value near zero, minimum and maximum values of roughly the same magnitude, and first and third quartile values of roughly the same magnitude. We don’t see all of this in our model; only the 1Q and 3Q values are of roughly the same magnitude. Also, if the model fits the data well, we would expect the residuals to be normally (Gaussian) distributed around a mean of zero. That is, a good model’s residuals should be roughly balanced around zero and should not stray too far from it.
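As a quick numerical check of this balance (a small sketch beyond the textbook steps), the same residual statistics can be computed directly from the fitted model:
# Least-squares residuals always average out to (numerically) zero;
# the spread around zero is what we judge the fit by
mean(cars_lm$residuals)
summary(cars_lm$residuals)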
The residuals plot below suggests heteroscedasticity. The residuals are not uniformly scattered above and below zero, and their spread tends to increase as we move to the right, forming a cone-shaped pattern: the vertical range of the residuals grows as the fitted values grow. In addition, the residuals histogram shows a right-skewed distribution, which also suggests that the residuals are not normally distributed. Overall, these plots tell us that using speed as the sole predictor in the regression model does not sufficiently or fully explain the data.
# Residuals plot
plot(cars_lm$fitted.values, cars_lm$residuals,
xlab='Fitted Values', ylab='Residuals')
abline(h = 0, lty = 3, col="blue")   # zero reference line
abline(h = 20, lty = 3, col="red")   # +/- 20 ft reference bands
abline(h = -20, lty = 3, col="red")
# Histogram plot
hist(cars_lm$residuals, main = "Histogram of Linear Model Residuals", xlab = "Residuals")
For a good model, we typically would like to see a standard error that is at least five to ten times smaller than the corresponding coefficient. From the summary, we can see that the standard error for the speed coefficient is about 9.5 times smaller than the coefficient itself (3.9324/0.4155 ≈ 9.5). This large ratio means that there is relatively little variability in the slope estimate, \({a}_{1}\). The standard error for the intercept, \({a}_{0}\), however, is only about 2.6 times smaller than the corresponding coefficient. This small ratio suggests that the estimate of this coefficient can vary quite a lot.
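These ratios are exactly the t values reported in the summary table; a one-line sketch to compute them directly:
# Coefficient estimates divided by their standard errors give the t values
coef(summary(cars_lm))[, "Estimate"] / coef(summary(cars_lm))[, "Std. Error"]
# matches the t values -2.601 and 9.464 reported above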
The p-value for each independent variable tests the null hypothesis that the variable has no correlation with the dependent variable. From the summary, we see that the p-values for both the speed coefficient and the intercept are well below the 5% significance level. This indicates that the sample data (50 observations) provide enough evidence to reject the null hypothesis for the entire population, suggesting a strong association between changes in the speed variable and shifts in the stopping distance variable.
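A complementary way to see the same thing (a sketch beyond the original analysis) is through 95% confidence intervals for the coefficients; if the estimates are significant at the 5% level, neither interval should contain zero:
# 95% confidence intervals for the intercept and slope
confint(cars_lm, level = 0.95)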
The last pieces of information we can use from the model summary to examine the quality of the regression model’s fit are the final few lines of the output. We can see that the residual standard error is 15.38. A smaller residual standard error means better predictions, so a good model tends to have a small residual standard error; this model’s is fairly large. Recall that the mean of the residuals is 0 and that, for an approximately normal distribution, about 95% of observations fall within 2 standard deviations of the mean. Here that means roughly 95% of predictions would fall within about ±2 × 15.38 ≈ ±31 feet of the fitted line, which is large relative to the mean stopping distance of about 43 feet.
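A quick sketch of that calculation in R, using the sigma() accessor from the stats package for the residual standard error:
# Roughly 95% of points fall within +/- 2 residual standard errors of the line
2 * sigma(cars_lm)   # about 30.8 feet, using the 15.38 reported above
mean(cars$dist)      # versus a mean stopping distance of about 43 feet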
The reported R-squared of 0.6511 for this model means that the model explains 65.11 percent of the variation in the data. Taking the square root of the R-squared value gives the correlation coefficient, R (sqrt(0.6511) ≈ 0.807), which quantifies the strength of the relationship between the variables: the larger R is, the stronger the relationship.
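We can verify this relationship directly:
# For simple regression, the square root of R-squared equals the sample correlation
sqrt(summary(cars_lm)$r.squared)   # about 0.807
cor(cars$speed, cars$dist)         # same value, about 0.807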
In addition to the residuals plot, we can also use the quantile-versus-quantile (QQ) plot to examine whether the residuals from the model are normally distributed. If the residuals were normally distributed, we would expect the points in the QQ plot to follow a straight diagonal line. From the QQ plot below, we see that the two ends diverge from that straight diagonal line and the points do not lie closely on it. This behavior indicates that the residuals are not normally distributed, which further confirms that using only speed as a predictor in the model is insufficient to explain the data.
qqnorm(cars_lm$residuals)
qqline(cars_lm$residuals)
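As an optional check beyond the textbook’s analysis, the Shapiro-Wilk test gives a formal hypothesis test of residual normality; a p-value below 0.05 would agree with what the QQ plot shows:
# Shapiro-Wilk normality test on the model residuals
shapiro.test(cars_lm$residuals)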
To conclude, speed as the sole predictor in the regression model does not sufficiently or fully explain the data. This does not mean that the single-explanatory-variable model is useless; it just will not give very precise predictions. Perhaps we need to build a more complex model that is better able to explain the data by adding more predictors, such as the car weight, brake type, and tire type. Or, if we don’t have other predictors available, we can address the heteroscedasticity with methods such as a Box-Cox transformation, which reshapes a non-normal dependent variable toward a normal one, or a generalized least squares estimator.
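As one possible follow-up, here is a sketch of the Box-Cox approach using the boxcox() function from the MASS package (this step is beyond the original analysis and is only an illustration):
# Profile the Box-Cox log-likelihood over candidate power transforms of dist;
# a peak near lambda = 0.5 would suggest modeling sqrt(dist) instead of dist
library(MASS)
boxcox(cars_lm, lambda = seq(-0.5, 1.5, by = 0.1))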
Lilja, D. (2017). Linear Regression Using R: An Introduction to Data Modeling. University of Minnesota Libraries Publishing. Retrieved from https://conservancy.umn.edu/handle/11299/189222