library(ggplot2)
library(dplyr)

Week 11, Regression Analysis in R 1: 4-10 Apr

Problem

Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis.)


SOLUTION:

Explore the Data

Below we can see that the cars dataset contains 50 observations of 2 variables: speed and dist.

Each observation represents the speed of a given car and the distance that it takes for the car to stop.

Show some of data in cars dataset

head(cars, 10)
#>    speed dist
#> 1      4    2
#> 2      4   10
#> 3      7    4
#> 4      7   22
#> 5      8   16
#> 6      9   10
#> 7     10   18
#> 8     10   26
#> 9     10   34
#> 10    11   17


Structure of the cars dataset

str(cars)
#> 'data.frame':    50 obs. of  2 variables:
#>  $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
#>  $ dist : num  2 10 4 22 16 10 18 26 34 17 ...


Show some summary statistics about the dataset

summary(cars)
#>      speed           dist       
#>  Min.   : 4.0   Min.   :  2.00  
#>  1st Qu.:12.0   1st Qu.: 26.00  
#>  Median :15.0   Median : 36.00  
#>  Mean   :15.4   Mean   : 42.98  
#>  3rd Qu.:19.0   3rd Qu.: 56.00  
#>  Max.   :25.0   Max.   :120.00


Visualize the Data

Determine whether or not it looks as though a linear relationship exists between the predictor “speed” and the output value “stopping distance” (dist).

ggplot(cars, aes(x = speed, y = dist, color = speed)) +
  geom_point(shape = 1, alpha = 0.6) +
  geom_smooth(method="lm", se=FALSE) +
  labs(title = "Car Speed vs Stopping Distance") +
    xlab("Speed") + 
    ylab("Stopping Distance")


This figure shows that the “stopping distance” tends to increase as the speed increases, as we expected. After superimposing a straight line on this scatter plot, we see that the relationship between the predictor (the speed) and the output (the stopping distance) is roughly linear and positive. It is not perfectly linear, however.


Develop a Regression Model

Our next step is to develop a regression model that will help us quantify the degree of linearity in the relationship between the output and the predictor.


Generate a linear model using the lm function

model = lm (cars$dist ~ cars$speed, data = cars)


Show the statistics summary of the model

mod_summary <- summary(model)

mod_summary
#> 
#> Call:
#> lm(formula = cars$dist ~ cars$speed, data = cars)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -29.069  -9.525  -2.272   9.215  43.201 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
#> cars$speed    3.9324     0.4155   9.464 1.49e-12 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 15.38 on 48 degrees of freedom
#> Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
#> F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12


Show the resulting linear equation for the model

intercept <- round(model$coefficients[1], 4)
slope     <- round(model$coefficients[2], 4)

\(\widehat{dist} = {-17.5791} + {3.9324} * {speed}\)


Evaluating the Quality of the Model


Evaluating the Residuals

Evaluate the Residuals generated by the model

summary(mod_summary$residuals)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#> -29.069  -9.525  -2.272   0.000   9.215  43.201

When we look at the residual values reported by summary(), we can see that we have a relatively good model because the median value is near zero, the minimum and maximum values are of roughly 10 units apart in magnitude, and first and third quartile values are of roughly the same magnitude. Therefore, for this model, the residual values are not too far off what we would expect for Gaussian-distributed numbers. But we will need to run additional tests to verify this.


Evaluating the Coefficients

Evaluate the Coefficients generated by the model

mod_summary$coefficients
#>               Estimate Std. Error   t value     Pr(>|t|)
#> (Intercept) -17.579095  6.7584402 -2.601058 1.231882e-02
#> cars$speed    3.932409  0.4155128  9.463990 1.489836e-12

For a good model, we typically would like to see a standard error that is at least five to ten times smaller than the corresponding coefficient. In this case, our standard error is about 2.6 time smaller in magnitude than the Intercept coefficient. The standard error for the slope is roughly 9.46 times smaller.

The value of Pr(>|t|) for speed is very small, so the probability that speed is not relevant to the model is very tiny. The value of Pr(>|t|) for distance is small (0.0123), which represent a small probability that the distance is not relevant to the model.


Evaluating the Residual standard error

Evaluate the Residual standard error generated by the model

mod_summary$sigma
#> [1] 15.37959

If the residuals are distributed normally, the first and third quantiles of the previous residuals should be about 1.5 times this standard error. For this model, the residuals are roughly 1.5 time the standard error.


Quality Evaluation of the Model

The Multiple R-squared value is a statistical measure of how well the model describes the data. The reported R-Squared of 0.6511 from the summary for this model means that the model explains 65.11 percent of the data’s variation.


Residual Analysis

Plot the Fitted Values versus the Residual Values

ggplot(model, aes(.fitted, .resid)) + 
  geom_point(size=2) +
  labs(title = "Fitted Values vs Residuals") +
  labs(x = "Fitted Values") +
  labs(y = "Residuals") +
  geom_abline(slope = 0)

From the plot above, we see that the residuals tend to increase as we move to the right. Additionally, the residuals are not uniformly scattered above and below zero. Overall, this plot tells us that using the speed as the sole predictor in the regression model does not sufficiently or fully explain the data. There might be other additional factors that affect the distance needed to stop a car, such as terrain conditions, tire thread, weather conditions, etc.


Generate the Q-Q plot

qqnorm(resid(model))
qqline(resid(model))

The Q-Q Plot above, shows how the data, for the most part, follows along model until a it starts diverging for higher values of speed, which can be seen near the upper right of the plot.

LS0tDQp0aXRsZTogIkRBVEE2MDUgLSBIb21ld29yayAxMSINCmF1dGhvcjogIkVzdGViYW4gQXJhbWF5byINCmRhdGU6ICIyMDIyLTA0LTEwIg0Kb3V0cHV0OiBvcGVuaW50cm86OmxhYl9yZXBvcnQNCi0tLQ0KDQpgYGB7ciBnbG9iYWwtb3B0aW9ucywgaW5jbHVkZT1GQUxTRX0NCmtuaXRyOjpvcHRzX2NodW5rJHNldChlY2hvPVRSVUUsIHdhcm5pbmc9RkFMU0UsIG1lc3NhZ2U9RkFMU0UsDQogICAgICAgICAgICAgICAgICAgICAgY29sbGFwc2UgPSBUUlVFLA0KICAgICAgICAgICAgICAgICAgICAgIGNvbW1lbnQgPSAiIz4iICkNCmBgYA0KDQoNCmBgYHtyfQ0KbGlicmFyeShnZ3Bsb3QyKQ0KbGlicmFyeShkcGx5cikNCmBgYA0KDQoNCg0KIyBXZWVrIDExLCBSZWdyZXNzaW9uIEFuYWx5c2lzIGluIFIgMTogNC0xMCBBcHINCg0KIyMgUHJvYmxlbQ0KDQpVc2luZyB0aGUg4oCcY2Fyc+KAnSBkYXRhc2V0IGluIFIsIGJ1aWxkIGEgbGluZWFyIG1vZGVsIGZvciBzdG9wcGluZyBkaXN0YW5jZSBhcyBhIGZ1bmN0aW9uIG9mIHNwZWVkIGFuZCByZXBsaWNhdGUgdGhlIGFuYWx5c2lzIG9mIHlvdXIgdGV4dGJvb2sgY2hhcHRlciAzICh2aXN1YWxpemF0aW9uLCBxdWFsaXR5IGV2YWx1YXRpb24gb2YgdGhlIG1vZGVsLCBhbmQgcmVzaWR1YWwgYW5hbHlzaXMuKQ0KDQo8YnI+DQoNCioqU09MVVRJT046KioNCg0KIyMjIEV4cGxvcmUgdGhlIERhdGENCg0KQmVsb3cgd2UgY2FuIHNlZSB0aGF0IHRoZSBjYXJzIGRhdGFzZXQgY29udGFpbnMgNTAgb2JzZXJ2YXRpb25zIG9mIDIgdmFyaWFibGVzOiBzcGVlZCBhbmQgZGlzdC4NCg0KRWFjaCBvYnNlcnZhdGlvbiByZXByZXNlbnRzIHRoZSBzcGVlZCBvZiBhIGdpdmVuIGNhciBhbmQgdGhlIGRpc3RhbmNlIHRoYXQgaXQgdGFrZXMgZm9yIHRoZSBjYXIgdG8gc3RvcC4NCg0KDQpTaG93IHNvbWUgb2YgZGF0YSBpbiBjYXJzIGRhdGFzZXQNCg0KYGBge3J9DQpoZWFkKGNhcnMsIDEwKQ0KYGBgDQoNCjxicj4NCg0KU3RydWN0dXJlIG9mIHRoZSBjYXJzIGRhdGFzZXQNCg0KYGBge3J9DQpzdHIoY2FycykNCmBgYA0KDQo8YnI+DQoNClNob3cgc29tZSBzdW1tYXJ5IHN0YXRpc3RpY3MgYWJvdXQgdGhlIGRhdGFzZXQNCg0KYGBge3J9DQpzdW1tYXJ5KGNhcnMpDQpgYGANCg0KPGJyPg0KDQojIyMgVmlzdWFsaXplIHRoZSBEYXRhDQoNCkRldGVybWluZSB3aGV0aGVyIG9yIG5vdCBpdCBsb29rcyBhcyB0aG91Z2ggYSBsaW5lYXIgcmVsYXRpb25zaGlwIGV4aXN0cyBiZXR3ZWVuIHRoZSBwcmVkaWN0b3IgInNwZWVkIiBhbmQNCnRoZSBvdXRwdXQgdmFsdWUgInN0b3BwaW5nIGRpc3RhbmNlIiAoZGlzdCkuDQoNCmBgYHtyfQ0KZ2dwbG90KGNhcnMsIGFlcyh4ID0gc3BlZWQsIHkgPSBkaXN0LCBjb2xvciA9IHNwZWVkKSkgKw0KICBnZW9tX3BvaW50KHNoYXBlID0gMSwgYWxwaGEgPSAwLjYpICsNCiAgZ2VvbV9zbW9vdGgobWV0aG9kPSJsbSIsIHNlPUZBTFNFKSArDQogIGxhYnModGl0bGUgPSAiQ2FyIFNwZWVkIHZzIFN0b3BwaW5nIERpc3RhbmNlIikgKw0KICAgIHhsYWIoIlNwZWVkIikgKyANCiAgICB5bGFiKCJTdG9wcGluZyBEaXN0YW5jZSIpDQpgYGANCjxicj4NCg0KVGhpcyBmaWd1cmUgc2hvd3MgdGhhdCB0aGUgInN0b3BwaW5nIGRpc3RhbmNlIiB0ZW5kcyB0byBpbmNyZWFzZSBhcyB0aGUgc3BlZWQgaW5jcmVhc2VzLCBhcyB3ZSBleHBlY3RlZC4gQWZ0ZXIgc3VwZXJpbXBvc2luZyBhIHN0cmFpZ2h0IGxpbmUgb24gdGhpcyBzY2F0dGVyIHBsb3QsIHdlIHNlZSB0aGF0IHRoZSByZWxhdGlvbnNoaXAgYmV0d2VlbiB0aGUgcHJlZGljdG9yICh0aGUgc3BlZWQpIGFuZCB0aGUgb3V0cHV0ICh0aGUgc3RvcHBpbmcgZGlzdGFuY2UpIGlzICoqcm91Z2hseSBsaW5lYXIgYW5kIHBvc2l0aXZlKiouIEl0IGlzIG5vdCBwZXJmZWN0bHkgbGluZWFyLCBob3dldmVyLg0KDQo8YnI+DQoNCiMjIyBEZXZlbG9wIGEgUmVncmVzc2lvbiBNb2RlbA0KDQpPdXIgbmV4dCBzdGVwIGlzIHRvIGRldmVsb3AgYSByZWdyZXNzaW9uIG1vZGVsIHRoYXQgd2lsbCBoZWxwIHVzIHF1YW50aWZ5IHRoZSBkZWdyZWUgb2YgbGluZWFyaXR5IGluIHRoZSByZWxhdGlvbnNoaXAgYmV0d2VlbiB0aGUgb3V0cHV0IGFuZCB0aGUgcHJlZGljdG9yLg0KDQo8YnI+DQoNCkdlbmVyYXRlIGEgbGluZWFyIG1vZGVsIHVzaW5nIHRoZSBsbSBmdW5jdGlvbg0KDQpgYGB7cn0NCm1vZGVsID0gbG0gKGNhcnMkZGlzdCB+IGNhcnMkc3BlZWQsIGRhdGEgPSBjYXJzKQ0KDQpgYGANCg0KPGJyPg0KDQpTaG93IHRoZSBzdGF0aXN0aWNzIHN1bW1hcnkgb2YgdGhlIG1vZGVsDQoNCmBgYHtyfQ0KbW9kX3N1bW1hcnkgPC0gc3VtbWFyeShtb2RlbCkNCg0KbW9kX3N1bW1hcnkNCmBgYA0KDQo8YnI+DQoNClNob3cgdGhlICoqcmVzdWx0aW5nIGxpbmVhciBlcXVhdGlvbiBmb3IgdGhlIG1vZGVsKiogDQpgYGB7cn0NCmludGVyY2VwdCA8LSByb3VuZChtb2RlbCRjb2VmZmljaWVudHNbMV0sIDQpDQpzbG9wZSAgICAgPC0gcm91bmQobW9kZWwkY29lZmZpY2llbnRzWzJdLCA0KQ0KYGBgDQokXHdpZGVoYXR7ZGlzdH0gPSB7YHIgaW50ZXJjZXB0YH0gKyB7YHIgc2xvcGVgfSAqIHtzcGVlZH0kDQoNCjxicj4NCg0KDQojIyMgRXZhbHVhdGluZyB0aGUgUXVhbGl0eSBvZiB0aGUgTW9kZWwNCg0KPGJyPg0KDQojIyMjIEV2YWx1YXRpbmcgdGhlIFJlc2lkdWFscw0KDQpFdmFsdWF0ZSB0aGUgUmVzaWR1YWxzIGdlbmVyYXRlZCBieSB0aGUgbW9kZWwNCg0KYGBge3J9DQpzdW1tYXJ5KG1vZF9zdW1tYXJ5JHJlc2lkdWFscykNCmBgYA0KDQoNCldoZW4gd2UgbG9vayBhdCB0aGUgcmVzaWR1YWwgdmFsdWVzIHJlcG9ydGVkIGJ5IHN1bW1hcnkoKSwgd2UgY2FuIHNlZSB0aGF0IHdlIGhhdmUgYSByZWxhdGl2ZWx5IGdvb2QgbW9kZWwgYmVjYXVzZSB0aGUgbWVkaWFuIHZhbHVlIGlzIG5lYXIgemVybywgdGhlIG1pbmltdW0gYW5kIG1heGltdW0gdmFsdWVzIGFyZSBvZiByb3VnaGx5IDEwIHVuaXRzIGFwYXJ0IGluIG1hZ25pdHVkZSwgYW5kIGZpcnN0IGFuZCB0aGlyZCBxdWFydGlsZSB2YWx1ZXMgYXJlIG9mIHJvdWdobHkgdGhlIHNhbWUgbWFnbml0dWRlLiBUaGVyZWZvcmUsIGZvciB0aGlzIG1vZGVsLCB0aGUgcmVzaWR1YWwgdmFsdWVzIGFyZSBub3QgdG9vIGZhciBvZmYgd2hhdCB3ZSB3b3VsZCBleHBlY3QgZm9yIEdhdXNzaWFuLWRpc3RyaWJ1dGVkIG51bWJlcnMuIEJ1dCB3ZSB3aWxsIG5lZWQgdG8gcnVuIGFkZGl0aW9uYWwgdGVzdHMgdG8gdmVyaWZ5IHRoaXMuDQoNCg0KPGJyPg0KDQojIyMjIEV2YWx1YXRpbmcgdGhlIENvZWZmaWNpZW50cw0KDQpFdmFsdWF0ZSB0aGUgQ29lZmZpY2llbnRzIGdlbmVyYXRlZCBieSB0aGUgbW9kZWwNCg0KYGBge3J9DQptb2Rfc3VtbWFyeSRjb2VmZmljaWVudHMNCmBgYA0KDQpGb3IgYSBnb29kIG1vZGVsLCB3ZSB0eXBpY2FsbHkgd291bGQgbGlrZSB0byBzZWUgYSBzdGFuZGFyZCBlcnJvciB0aGF0IGlzIGF0IGxlYXN0IGZpdmUgdG8gdGVuIHRpbWVzIHNtYWxsZXIgdGhhbiB0aGUgY29ycmVzcG9uZGluZyBjb2VmZmljaWVudC4gSW4gdGhpcyBjYXNlLCBvdXIgc3RhbmRhcmQgZXJyb3IgaXMgYWJvdXQgMi42IHRpbWUgc21hbGxlciBpbiBtYWduaXR1ZGUgdGhhbiB0aGUgSW50ZXJjZXB0IGNvZWZmaWNpZW50LiBUaGUgc3RhbmRhcmQgZXJyb3IgZm9yIHRoZSBzbG9wZSBpcyByb3VnaGx5IDkuNDYgdGltZXMgc21hbGxlci4NCg0KVGhlIHZhbHVlIG9mIFByKD58dHwpIGZvciBzcGVlZCBpcyB2ZXJ5IHNtYWxsLCBzbyB0aGUgcHJvYmFiaWxpdHkgdGhhdCBzcGVlZCBpcyBub3QgcmVsZXZhbnQgdG8gdGhlIG1vZGVsIGlzIHZlcnkgdGlueS4gVGhlIHZhbHVlIG9mIFByKD58dHwpIGZvciBkaXN0YW5jZSBpcyBzbWFsbCAoMC4wMTIzKSwgd2hpY2ggcmVwcmVzZW50IGEgc21hbGwgcHJvYmFiaWxpdHkgdGhhdCB0aGUgZGlzdGFuY2UgaXMgbm90IHJlbGV2YW50IHRvIHRoZSBtb2RlbC4NCg0KDQo8YnI+DQoNCiMjIyMgRXZhbHVhdGluZyB0aGUgUmVzaWR1YWwgc3RhbmRhcmQgZXJyb3INCg0KRXZhbHVhdGUgdGhlIFJlc2lkdWFsIHN0YW5kYXJkIGVycm9yIGdlbmVyYXRlZCBieSB0aGUgbW9kZWwNCg0KYGBge3J9DQptb2Rfc3VtbWFyeSRzaWdtYQ0KYGBgDQoNCklmIHRoZSByZXNpZHVhbHMgYXJlIGRpc3RyaWJ1dGVkIG5vcm1hbGx5LCB0aGUgZmlyc3QgYW5kIHRoaXJkIHF1YW50aWxlcyBvZiB0aGUgcHJldmlvdXMgcmVzaWR1YWxzIHNob3VsZCBiZSBhYm91dCAxLjUgdGltZXMgdGhpcyBzdGFuZGFyZCBlcnJvci4gRm9yIHRoaXMgbW9kZWwsIHRoZSByZXNpZHVhbHMgYXJlIHJvdWdobHkgMS41IHRpbWUgdGhlIHN0YW5kYXJkIGVycm9yLg0KDQo8YnI+DQoNCiMjIyMgUXVhbGl0eSBFdmFsdWF0aW9uIG9mIHRoZSBNb2RlbA0KDQpUaGUgTXVsdGlwbGUgUi1zcXVhcmVkIHZhbHVlIGlzIGEgc3RhdGlzdGljYWwgbWVhc3VyZSBvZiBob3cgd2VsbCB0aGUgbW9kZWwgZGVzY3JpYmVzIHRoZSBkYXRhLiBUaGUgcmVwb3J0ZWQgUi1TcXVhcmVkIG9mIDAuNjUxMSBmcm9tIHRoZSBzdW1tYXJ5IGZvciB0aGlzIG1vZGVsIG1lYW5zIHRoYXQgdGhlIG1vZGVsIGV4cGxhaW5zIDY1LjExIHBlcmNlbnQgb2YgdGhlIGRhdGHigJlzIHZhcmlhdGlvbi4NCg0KDQo8YnI+DQoNCiMjIyMgUmVzaWR1YWwgQW5hbHlzaXMNCg0KUGxvdCB0aGUgRml0dGVkIFZhbHVlcyB2ZXJzdXMgdGhlIFJlc2lkdWFsIFZhbHVlcw0KDQpgYGB7cn0NCmdncGxvdChtb2RlbCwgYWVzKC5maXR0ZWQsIC5yZXNpZCkpICsgDQogIGdlb21fcG9pbnQoc2l6ZT0yKSArDQogIGxhYnModGl0bGUgPSAiRml0dGVkIFZhbHVlcyB2cyBSZXNpZHVhbHMiKSArDQogIGxhYnMoeCA9ICJGaXR0ZWQgVmFsdWVzIikgKw0KICBsYWJzKHkgPSAiUmVzaWR1YWxzIikgKw0KICBnZW9tX2FibGluZShzbG9wZSA9IDApDQpgYGANCkZyb20gdGhlIHBsb3QgYWJvdmUsIHdlIHNlZSB0aGF0IHRoZSByZXNpZHVhbHMgdGVuZCB0byBpbmNyZWFzZSBhcyB3ZSBtb3ZlIHRvIHRoZSByaWdodC4gQWRkaXRpb25hbGx5LCB0aGUgcmVzaWR1YWxzIGFyZSBub3QgdW5pZm9ybWx5IHNjYXR0ZXJlZCBhYm92ZSBhbmQgYmVsb3cgemVyby4gT3ZlcmFsbCwgdGhpcyBwbG90IHRlbGxzIHVzIHRoYXQgdXNpbmcgdGhlICpzcGVlZCogYXMgdGhlIHNvbGUgcHJlZGljdG9yIGluIHRoZSByZWdyZXNzaW9uIG1vZGVsIGRvZXMgbm90IHN1ZmZpY2llbnRseSBvciBmdWxseSBleHBsYWluIHRoZSBkYXRhLiBUaGVyZSBtaWdodCBiZSBvdGhlciBhZGRpdGlvbmFsIGZhY3RvcnMgdGhhdCBhZmZlY3QgdGhlIGRpc3RhbmNlIG5lZWRlZCB0byBzdG9wIGEgY2FyLCBzdWNoIGFzIHRlcnJhaW4gY29uZGl0aW9ucywgdGlyZSB0aHJlYWQsIHdlYXRoZXIgY29uZGl0aW9ucywgZXRjLg0KDQoNCjxicj4NCg0KIyMjIyBHZW5lcmF0ZSB0aGUgUS1RIHBsb3QNCg0KYGBge3J9DQpxcW5vcm0ocmVzaWQobW9kZWwpKQ0KcXFsaW5lKHJlc2lkKG1vZGVsKSkNCmBgYA0KDQpUaGUgUS1RIFBsb3QgYWJvdmUsIHNob3dzIGhvdyB0aGUgZGF0YSwgZm9yIHRoZSBtb3N0IHBhcnQsIGZvbGxvd3MgYWxvbmcgbW9kZWwgdW50aWwgYSBpdCBzdGFydHMgZGl2ZXJnaW5nIGZvciBoaWdoZXIgdmFsdWVzIG9mIHNwZWVkLCB3aGljaCBjYW4gYmUgc2VlbiBuZWFyIHRoZSB1cHBlciByaWdodCBvZiB0aGUgcGxvdC4NCg0KDQoNCg0KDQo=