MSDS Spring 2018

DATA 605 Fundamentals of Computational Mathematics

Jiadi Li

HW #11 - Linear Model: One-Factor Regression

Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis.)

1. Visualize the Data

summary(cars)

##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Scatter plot: stopping distance vs. speed

plot(cars$speed,cars$dist,xlab = 'speed',ylab = 'stopping distance')

The plot shows that stopping distance tends to increase as speed increases. The relationship between the predictor (speed) and the output (stopping distance) appears roughly linear, with a positive correlation.

cor(cars$speed,cars$dist)

## [1] 0.8068949

The strength of the linear relationship can be quantified with its correlation coefficient (0.81).
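As a quick check on what `cor()` reports, the coefficient can be recomputed from its definition. This is a minimal sketch using only the built-in `cars` data; the variable names `x`, `y`, `r_manual` and `r_builtin` are introduced here for illustration.

```r
# Pearson correlation recomputed from its definition on the built-in cars data
x <- cars$speed
y <- cars$dist

# sum of cross-products divided by the square root of the product
# of the two sums of squared deviations
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

r_builtin <- cor(x, y)

# the two values should agree up to floating-point error
all.equal(r_manual, r_builtin)

# squared correlation: for a one-predictor linear model this equals
# the Multiple R-squared reported by summary(lm(...))
r_builtin^2
```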

2. Linear Regression Model

Stopping distance as a function of speed:

lrmodel <- lm(dist ~ speed, data = cars)  

summary(lrmodel)

## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

Based on the coefficients:
\(\widehat{\text{dist}} = -17.5791 + 3.9324 \times \text{speed}\)
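To see how the fitted equation is used, the sketch below predicts stopping distance for a few speeds, once by hand from the coefficients and once with `predict()`. The speeds 10, 15 and 20 are hypothetical values chosen only for illustration, and `new_speeds`, `manual` and `via_predict` are names introduced here.

```r
lrmodel <- lm(dist ~ speed, data = cars)
b <- coef(lrmodel)

# hypothetical speeds, chosen only for illustration
new_speeds <- data.frame(speed = c(10, 15, 20))

# prediction by hand: intercept + slope * speed
manual <- b[["(Intercept)"]] + b[["speed"]] * new_speeds$speed

# the same prediction through predict()
via_predict <- unname(predict(lrmodel, newdata = new_speeds))

# side-by-side comparison; the two columns should match
cbind(new_speeds, manual, via_predict)
```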

plot(cars$speed,cars$dist,main = 'Regression Model: Stopping distance ~ Speed',xlab = 'Predictor: speed',ylab = 'stopping distance')
abline(lrmodel)

3. Evaluate the Quality of the Model

summary(lrmodel$residuals)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -29.069  -9.525  -2.272   0.000   9.215  43.201

The residuals are the differences between the actual measured values and the corresponding values on the fitted regression line. If the line is a good fit with the data, we would expect residual values that are normally distributed around a mean of zero. This distribution implies that there is a decreasing probability of finding residual values as we move further away from the mean. That is, a good model’s residuals should be roughly balanced around and not too far away from the mean of zero.

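Based on the residual summary, this model is a fair fit to the data:
the median (-2.272) is close to 0 and the 1st and 3rd quartiles (-9.525 and 9.215) are roughly equal in magnitude, but the maximum residual (43.201) lies considerably further from 0 than the minimum (-29.069), so the residuals are skewed to the right. The mean of exactly 0 is guaranteed by least squares whenever the model includes an intercept, so it does not by itself indicate a good fit.
Visualization of the residuals is covered in part 4.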
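The definition of a residual can be checked directly against what `lm()` stores. The following is a small sketch; `res_manual` is a name introduced here for illustration.

```r
lrmodel <- lm(dist ~ speed, data = cars)

# residual = observed value - fitted value on the regression line
res_manual <- cars$dist - fitted(lrmodel)

# should agree with the residuals stored in the model object
all.equal(unname(res_manual), unname(residuals(lrmodel)))

# with an intercept in the model, least squares forces the mean
# of the residuals to zero (up to floating-point error)
mean(res_manual)

# compare the two tails: a longer positive tail hints at right skew
range(res_manual)
```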

The \(Std. Error\) column shows the statistical standard error for each of the coefficients. For a good model, we typically would like to see a standard error that is at least five to ten times smaller than the corresponding coefficient.

Based on the coefficient estimates, this model is a fair model:
for speed, 3.9324 / 0.4155 = 9.46, so the standard error is more than five times smaller than the coefficient. For the intercept, however, |-17.5791| / 6.7584 = 2.60, so the standard error is less than five times smaller than the magnitude of the coefficient.
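The ratios above can be read straight from the coefficient table instead of being typed by hand. This is a sketch under the assumption that the model is refit as `lrmodel`; `ctab` and `ratio` are names introduced here.

```r
lrmodel <- lm(dist ~ speed, data = cars)

# coefficient matrix with columns Estimate, Std. Error, t value, Pr(>|t|)
ctab <- coef(summary(lrmodel))

# magnitude of each estimate relative to its standard error
ratio <- abs(ctab[, "Estimate"]) / ctab[, "Std. Error"]
ratio

# this ratio is the absolute value of the reported t value
all.equal(ratio, abs(ctab[, "t value"]))
```

In other words, the "five to ten times smaller" rule of thumb is a statement about the size of the t value.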

The last column, labeled \(Pr(>|t|)\), shows the p-value (significance) of each coefficient: the probability of observing a t value at least this extreme if the true coefficient were zero. A small p-value is evidence that the coefficient is relevant in the model.

Based on the coefficient estimates, this model is a fair model:
the p-value for the intercept is 0.0123, which is significant at the 5% level but not at the 1% level, while the p-value for speed (1.49e-12) is extremely small.
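The values in the \(Pr(>|t|)\) column can be reproduced from the t values and the residual degrees of freedom. A minimal sketch, with `ctab`, `df` and `p_manual` as names introduced here:

```r
lrmodel <- lm(dist ~ speed, data = cars)
ctab <- coef(summary(lrmodel))

# residual degrees of freedom: 50 observations - 2 estimated coefficients
df <- df.residual(lrmodel)

# two-sided p-value: probability mass in both tails of the t distribution
# beyond the observed |t value|
p_manual <- 2 * pt(abs(ctab[, "t value"]), df = df, lower.tail = FALSE)

# should match the Pr(>|t|) column of the summary
all.equal(p_manual, ctab[, "Pr(>|t|)"])
```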

The quality of the regression model’s fit to the data:
The Multiple R-squared value is a number between 0 and 1 that measures how well the model describes the measured data. The reported \(R^2\) of 0.6511 means that the model explains 65.11% of the variation in stopping distance.
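To make "explains 65.11% of the variation" concrete, \(R^2\) can be recomputed from the residual and total sums of squares. This is a sketch; `ss_res`, `ss_tot`, `r2_manual` and `rse_manual` are names introduced here.

```r
lrmodel <- lm(dist ~ speed, data = cars)

# variation left unexplained by the model
ss_res <- sum(residuals(lrmodel)^2)

# total variation of stopping distance around its mean
ss_tot <- sum((cars$dist - mean(cars$dist))^2)

# share of the total variation that the model accounts for
r2_manual <- 1 - ss_res / ss_tot
all.equal(r2_manual, summary(lrmodel)$r.squared)

# the residual standard error comes from the same residual sum of squares
rse_manual <- sqrt(ss_res / df.residual(lrmodel))
all.equal(rse_manual, summary(lrmodel)$sigma)
```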

4. Residual Analysis

hist(lrmodel$residuals,breaks = 20,xlab = 'Residuals',main = 'Histogram of Residuals',prob = TRUE)

In addition to viewing the residual summary, plotting the residuals in a histogram reveals that they are not evenly distributed around 0. This suggests that speed alone cannot fully predict the stopping distance.

qqnorm(lrmodel$residuals)
qqline(lrmodel$residuals)  # adds diagonal line to the normal prob plot

If the residuals were normally distributed, we would expect the points plotted in this figure to follow a straight line.

In our model, the points diverge noticeably from the line at both ends, and there are also deviations in the middle. This behavior indicates that the residuals are not normally distributed, which further suggests that speed alone is insufficient to explain the data.
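One rough way to put a number on how closely the Q-Q points follow a straight line is the correlation between the theoretical and sample quantiles. This is only an informal supplement to reading the plot, not a formal normality test; `qq` and `qq_cor` are names introduced here.

```r
lrmodel <- lm(dist ~ speed, data = cars)
res <- residuals(lrmodel)

# qqnorm() can return the plotted coordinates without drawing:
# x = theoretical normal quantiles, y = sample quantiles (the residuals)
qq <- qqnorm(res, plot.it = FALSE)

# correlation of the Q-Q points; values closer to 1 mean the points
# lie closer to a straight line
qq_cor <- cor(qq$x, qq$y)
qq_cor
```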