data("mtcars")
# make factors for the categorical variables
# "V-Shaped" or otherwise
mtcars$vs = as.factor(mtcars$vs)
# automatic or manual
mtcars$am = as.factor(mtcars$am)Consider the mtcars dataset, which contains data from the 1974 Motor Trend Magazine on 32 cars.
| mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
| Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
| Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
| Valiant | 18.1 | 6 | 225.0 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
| Duster 360 | 14.3 | 8 | 360.0 | 245 | 3.21 | 3.570 | 15.84 | 0 | 0 | 3 | 4 |
| Merc 240D | 24.4 | 4 | 146.7 | 62 | 3.69 | 3.190 | 20.00 | 1 | 0 | 4 | 2 |
| Merc 230 | 22.8 | 4 | 140.8 | 95 | 3.92 | 3.150 | 22.90 | 1 | 0 | 4 | 2 |
| Merc 280 | 19.2 | 6 | 167.6 | 123 | 3.92 | 3.440 | 18.30 | 1 | 0 | 4 | 4 |
| Merc 280C | 17.8 | 6 | 167.6 | 123 | 3.92 | 3.440 | 18.90 | 1 | 0 | 4 | 4 |
| Merc 450SE | 16.4 | 8 | 275.8 | 180 | 3.07 | 4.070 | 17.40 | 0 | 0 | 3 | 3 |
| Merc 450SL | 17.3 | 8 | 275.8 | 180 | 3.07 | 3.730 | 17.60 | 0 | 0 | 3 | 3 |
| Merc 450SLC | 15.2 | 8 | 275.8 | 180 | 3.07 | 3.780 | 18.00 | 0 | 0 | 3 | 3 |
| Cadillac Fleetwood | 10.4 | 8 | 472.0 | 205 | 2.93 | 5.250 | 17.98 | 0 | 0 | 3 | 4 |
| Lincoln Continental | 10.4 | 8 | 460.0 | 215 | 3.00 | 5.424 | 17.82 | 0 | 0 | 3 | 4 |
| Chrysler Imperial | 14.7 | 8 | 440.0 | 230 | 3.23 | 5.345 | 17.42 | 0 | 0 | 3 | 4 |
| Fiat 128 | 32.4 | 4 | 78.7 | 66 | 4.08 | 2.200 | 19.47 | 1 | 1 | 4 | 1 |
| Honda Civic | 30.4 | 4 | 75.7 | 52 | 4.93 | 1.615 | 18.52 | 1 | 1 | 4 | 2 |
| Toyota Corolla | 33.9 | 4 | 71.1 | 65 | 4.22 | 1.835 | 19.90 | 1 | 1 | 4 | 1 |
| Toyota Corona | 21.5 | 4 | 120.1 | 97 | 3.70 | 2.465 | 20.01 | 1 | 0 | 3 | 1 |
| Dodge Challenger | 15.5 | 8 | 318.0 | 150 | 2.76 | 3.520 | 16.87 | 0 | 0 | 3 | 2 |
| AMC Javelin | 15.2 | 8 | 304.0 | 150 | 3.15 | 3.435 | 17.30 | 0 | 0 | 3 | 2 |
| Camaro Z28 | 13.3 | 8 | 350.0 | 245 | 3.73 | 3.840 | 15.41 | 0 | 0 | 3 | 4 |
| Pontiac Firebird | 19.2 | 8 | 400.0 | 175 | 3.08 | 3.845 | 17.05 | 0 | 0 | 3 | 2 |
| Fiat X1-9 | 27.3 | 4 | 79.0 | 66 | 4.08 | 1.935 | 18.90 | 1 | 1 | 4 | 1 |
| Porsche 914-2 | 26.0 | 4 | 120.3 | 91 | 4.43 | 2.140 | 16.70 | 0 | 1 | 5 | 2 |
| Lotus Europa | 30.4 | 4 | 95.1 | 113 | 3.77 | 1.513 | 16.90 | 1 | 1 | 5 | 2 |
| Ford Pantera L | 15.8 | 8 | 351.0 | 264 | 4.22 | 3.170 | 14.50 | 0 | 1 | 5 | 4 |
| Ferrari Dino | 19.7 | 6 | 145.0 | 175 | 3.62 | 2.770 | 15.50 | 0 | 1 | 5 | 6 |
| Maserati Bora | 15.0 | 8 | 301.0 | 335 | 3.54 | 3.570 | 14.60 | 0 | 1 | 5 | 8 |
| Volvo 142E | 21.4 | 4 | 121.0 | 109 | 4.11 | 2.780 | 18.60 | 1 | 1 | 4 | 2 |
The variables include:
| Name | Description |
|---|---|
| mpg | Miles/(US) gallon |
| cyl | Number of cylinders |
| disp | Displacement (cu.in.) |
| hp | Gross horsepower |
| drat | Rear axle ratio |
| wt | Weight (1000 lbs) |
| qsec | 1/4 mile time |
| vs | Engine (0 = V-shaped, 1 = straight) |
| am | Transmission (0 = automatic, 1 = manual) |
| gear | Number of forward gears |
| carb | Number of carburetors |
The goal is to predict MPG from the other features.
First, let’s consider a model which includes each of the other variables.
##
## Call:
## lm(formula = formula1, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.450644 -1.604402 -0.119605 1.219268 4.627094
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.3033742 18.7178844 0.65731 0.518124
## cyl -0.1114405 1.0450234 -0.10664 0.916087
## disp 0.0133352 0.0178575 0.74676 0.463489
## hp -0.0214821 0.0217686 -0.98684 0.334955
## drat 0.7871110 1.6353731 0.48130 0.635278
## wt -3.7153039 1.8944143 -1.96119 0.063252 .
## qsec 0.8210407 0.7308448 1.12341 0.273941
## vs1 0.3177628 2.1045086 0.15099 0.881423
## am1 2.5202269 2.0566506 1.22540 0.233990
## gear 0.6554130 1.4932600 0.43891 0.665206
## carb -0.1994193 0.8287525 -0.24063 0.812179
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.6502 on 21 degrees of freedom
## Multiple R-squared: 0.869016, Adjusted R-squared: 0.806642
## F-statistic: 13.9325 on 10 and 21 DF, p-value: 0.000000379315
##
## Call:
## lm(formula = formula1b, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.450644 -1.604402 -0.119605 1.219268 4.627094
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.3033742 18.7178844 0.65731 0.518124
## cyl -0.1114405 1.0450234 -0.10664 0.916087
## disp 0.0133352 0.0178575 0.74676 0.463489
## hp -0.0214821 0.0217686 -0.98684 0.334955
## drat 0.7871110 1.6353731 0.48130 0.635278
## wt -3.7153039 1.8944143 -1.96119 0.063252 .
## qsec 0.8210407 0.7308448 1.12341 0.273941
## vs1 0.3177628 2.1045086 0.15099 0.881423
## am1 2.5202269 2.0566506 1.22540 0.233990
## gear 0.6554130 1.4932600 0.43891 0.665206
## carb -0.1994193 0.8287525 -0.24063 0.812179
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.6502 on 21 degrees of freedom
## Multiple R-squared: 0.869016, Adjusted R-squared: 0.806642
## F-statistic: 13.9325 on 10 and 21 DF, p-value: 0.000000379315
Well, this is not promising – although this generates a model with a high \(R^2\), NONE of the variables are significant!
Let’s look at the correlations between the variables
From the above, we note that the following variables have strong negative correlation with mpg :
and the following variables have strong correlation with mpg :
The Boruta algorithm determines “importance” of various variables:
## Loading required package: ranger
## Boruta performed 14 iterations in 0.440195084 secs.
## 10 attributes confirmed important: am, carb, cyl, disp, drat and 5 more;
## No attributes deemed unimportant.
According to the above, all variables are “important”, with disp, wt, hp, and cyl most important.
Let’s consider a model with just these “most important” variables:
##
## Call:
## lm(formula = formula3, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.056232 -1.463594 -0.428125 1.285408 5.826944
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40.8285367 2.7574679 14.80653 0.000000000000017614 ***
## disp 0.0115992 0.0117268 0.98912 0.33138556
## wt -3.8539035 1.0154736 -3.79518 0.00075895 ***
## hp -0.0205384 0.0121468 -1.69085 0.10237913
## cyl -1.2933197 0.6558768 -1.97189 0.05894681 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.51252 on 27 degrees of freedom
## Multiple R-squared: 0.848635, Adjusted R-squared: 0.82621
## F-statistic: 37.8441 on 4 and 27 DF, p-value: 0.000000000106109
Under the above, only wt, the weight of the car, is significant.
Stepwise regression is a “greedy” algorithm whichcan run forward (adding the most promising variables to an empty model) or backward (dropping the most unpromising variables from a complete model).
# Stepwise Regression
library(MASS)
# The specification "mpg ~ ." specifies that all of the other variables in the dataframe
# are predictors to be used in the model.
allvars <- lm("mpg ~ .",data=mtcars)
# The specification "mpg ~ 1" means that the model has only an intercept term,
# with no predictor variables.
novars <- lm("mpg ~ 1",data=mtcars)
# step forward from an empty model, adding the most promising variables at each step
# The AIC (Akaike Information Criterion) is a comparative measure;
# the model with the lowest (i.e., most negative) AIC is "best"
stepforward <- stepAIC(novars, direction="both",scope=list(upper=allvars,lower=novars))## Start: AIC=115.94
## mpg ~ 1
##
## Df Sum of Sq RSS AIC
## + wt 1 847.7252 278.3219 73.21736
## + cyl 1 817.7130 308.3342 76.49435
## + disp 1 808.8885 317.1587 77.39732
## + hp 1 678.3729 447.6743 88.42656
## + drat 1 522.4805 603.5667 97.98786
## + vs 1 496.5279 629.5193 99.33506
## + am 1 405.1506 720.8966 103.67231
## + carb 1 341.7761 784.2711 106.36860
## + gear 1 259.7492 866.2980 109.55178
## + qsec 1 197.3919 928.6553 111.77605
## <none> 1126.0472 115.94345
##
## Step: AIC=73.22
## mpg ~ wt
##
## Df Sum of Sq RSS AIC
## + cyl 1 87.1500 191.1720 63.19800
## + hp 1 83.2742 195.0478 63.84027
## + qsec 1 82.8583 195.4636 63.90843
## + vs 1 54.2281 224.0939 68.28253
## + carb 1 44.6024 233.7196 69.62836
## + disp 1 31.6394 246.6825 71.35572
## <none> 278.3219 73.21736
## + drat 1 9.0806 269.2413 74.15591
## + gear 1 1.1369 277.1850 75.08638
## + am 1 0.0022 278.3197 75.21711
## - wt 1 847.7252 1126.0472 115.94345
##
## Step: AIC=63.2
## mpg ~ wt + cyl
##
## Df Sum of Sq RSS AIC
## + hp 1 14.55145 176.6205 62.66456
## + carb 1 13.77242 177.3996 62.80540
## <none> 191.1720 63.19800
## + qsec 1 10.56739 180.6046 63.37837
## + gear 1 3.02815 188.1438 64.68706
## + disp 1 2.67959 188.4924 64.74629
## + vs 1 0.70594 190.4660 65.07961
## + am 1 0.12491 191.0471 65.17708
## + drat 1 0.00102 191.1709 65.19783
## - cyl 1 87.14997 278.3219 73.21736
## - wt 1 117.16227 308.3342 76.49435
##
## Step: AIC=62.66
## mpg ~ wt + cyl + hp
##
## Df Sum of Sq RSS AIC
## <none> 176.6205 62.66456
## - hp 1 14.55145 191.1720 63.19800
## + am 1 6.62275 169.9978 63.44158
## + disp 1 6.17616 170.4444 63.52554
## - cyl 1 18.42723 195.0478 63.84027
## + carb 1 2.51872 174.1018 64.20494
## + drat 1 2.24529 174.3752 64.25515
## + qsec 1 1.40105 175.2195 64.40971
## + gear 1 0.85585 175.7647 64.50912
## + vs 1 0.05988 176.5606 64.65371
## - wt 1 115.35403 291.9746 76.74978
##
## Call:
## lm(formula = mpg ~ wt + cyl + hp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.929035 -1.559784 -0.531133 1.185008 5.898552
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.7517874 1.7868640 21.68704 < 0.000000000000000222 ***
## wt -3.1669731 0.7405759 -4.27637 0.00019948 ***
## cyl -0.9416168 0.5509164 -1.70918 0.09848010 .
## hp -0.0180381 0.0118762 -1.51884 0.14001516
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.51155 on 28 degrees of freedom
## Multiple R-squared: 0.84315, Adjusted R-squared: 0.826345
## F-statistic: 50.1715 on 3 and 28 DF, p-value: 0.0000000000218418
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## mpg ~ 1
##
## Final Model:
## mpg ~ wt + cyl + hp
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 31 1126.047187 115.9434500
## 2 + wt 1 847.7252500 30 278.321938 73.2173629
## 3 + cyl 1 87.1499713 29 191.171966 63.1979989
## 4 + hp 1 14.5514461 28 176.620520 62.6645624
Forward stepwise selected the model mpg ~ wt + cyl + hp .
It stopped with AIC = 62.66 and \(R^2 = 0.843\) , but only wt is highly significant.
Let’s try the opposite direction: start from a full model, and drop variables when doing so improves the AIC:
## Start: AIC=70.9
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
##
## Df Sum of Sq RSS AIC
## - cyl 1 0.079871 147.5743 68.91507
## - vs 1 0.160126 147.6546 68.93247
## - carb 1 0.406669 147.9011 68.98585
## - gear 1 1.353055 148.8475 69.18996
## - drat 1 1.627026 149.1215 69.24881
## - disp 1 3.916667 151.4111 69.73641
## - hp 1 6.839910 154.3343 70.34833
## - qsec 1 8.864116 156.3586 70.76531
## <none> 147.4944 70.89774
## - am 1 10.546651 158.0411 71.10781
## - wt 1 27.014385 174.5088 74.27966
##
## Step: AIC=68.92
## mpg ~ disp + hp + drat + wt + qsec + vs + am + gear + carb
##
## Df Sum of Sq RSS AIC
## - vs 1 0.268523 147.8428 66.97324
## - carb 1 0.520142 148.0944 67.02766
## - gear 1 1.821147 149.3955 67.30755
## - drat 1 1.982623 149.5569 67.34212
## - disp 1 3.900939 151.4752 67.74996
## - hp 1 7.363165 154.9375 68.47314
## <none> 147.5743 68.91507
## - qsec 1 10.093256 157.6676 69.03209
## - am 1 11.835944 159.4102 69.38384
## + cyl 1 0.079871 147.4944 70.89774
## - wt 1 27.028045 174.6024 72.29681
##
## Step: AIC=66.97
## mpg ~ disp + hp + drat + wt + qsec + am + gear + carb
##
## Df Sum of Sq RSS AIC
## - carb 1 0.685461 148.5283 65.12126
## - gear 1 2.143684 149.9865 65.43390
## - drat 1 2.213899 150.0567 65.44888
## - disp 1 3.646651 151.4895 65.75297
## - hp 1 7.105960 154.9488 66.47548
## <none> 147.8428 66.97324
## - am 1 11.569381 159.4122 67.38424
## - qsec 1 15.683048 163.5259 68.19953
## + vs 1 0.268523 147.5743 68.91507
## + cyl 1 0.188268 147.6546 68.93247
## - wt 1 27.379942 175.2228 70.41031
##
## Step: AIC=65.12
## mpg ~ disp + hp + drat + wt + qsec + am + gear
##
## Df Sum of Sq RSS AIC
## - gear 1 1.56497 150.0933 63.45667
## - drat 1 1.93213 150.4604 63.53485
## <none> 148.5283 65.12126
## - disp 1 10.11026 158.6386 65.22856
## - am 1 12.32322 160.8515 65.67186
## - hp 1 14.82564 163.3539 66.16586
## + carb 1 0.68546 147.8428 66.97324
## + vs 1 0.43384 148.0944 67.02766
## + cyl 1 0.41443 148.1139 67.03185
## - qsec 1 26.40806 174.9363 68.35796
## - wt 1 69.12692 217.6552 75.34964
##
## Step: AIC=63.46
## mpg ~ disp + hp + drat + wt + qsec + am
##
## Df Sum of Sq RSS AIC
## - drat 1 3.34455 153.4378 62.16190
## - disp 1 8.54536 158.6386 63.22857
## <none> 150.0933 63.45667
## - hp 1 13.28465 163.3779 64.17056
## + gear 1 1.56497 148.5283 65.12126
## + cyl 1 1.00340 149.0899 65.24203
## + vs 1 0.64549 149.4478 65.31875
## + carb 1 0.10675 149.9865 65.43390
## - am 1 20.03588 170.1291 65.46630
## - qsec 1 25.57441 175.6677 66.49146
## - wt 1 67.57196 217.6652 73.35111
##
## Step: AIC=62.16
## mpg ~ disp + hp + wt + qsec + am
##
## Df Sum of Sq RSS AIC
## - disp 1 6.62865 160.0665 61.51530
## <none> 153.4378 62.16190
## - hp 1 12.57205 166.0099 62.68196
## + drat 1 3.34455 150.0933 63.45667
## + gear 1 2.97739 150.4604 63.53485
## + cyl 1 2.44669 150.9911 63.64752
## + vs 1 1.12081 152.3170 63.92730
## + carb 1 0.01143 153.4264 64.15952
## - qsec 1 26.46980 179.9076 65.25464
## - am 1 32.19752 185.6353 66.25754
## - wt 1 69.04304 222.4809 72.05136
##
## Step: AIC=61.52
## mpg ~ hp + wt + qsec + am
##
## Df Sum of Sq RSS AIC
## - hp 1 9.21947 169.2859 61.30730
## <none> 160.0665 61.51530
## + disp 1 6.62865 153.4378 62.16190
## + carb 1 3.22719 156.8393 62.86354
## + drat 1 1.42785 158.6386 63.22857
## - qsec 1 20.22461 180.2911 63.32277
## + cyl 1 0.24898 159.8175 63.46549
## + vs 1 0.24855 159.8179 63.46557
## + gear 1 0.17112 159.8953 63.48107
## - am 1 25.99284 186.0593 64.33054
## - wt 1 78.49377 238.5602 72.28435
##
## Step: AIC=61.31
## mpg ~ wt + qsec + am
##
## Df Sum of Sq RSS AIC
## <none> 169.2859 61.30730
## + hp 1 9.21947 160.0665 61.51530
## + carb 1 8.03594 161.2500 61.75104
## + disp 1 3.27607 166.0099 62.68196
## + cyl 1 1.50106 167.7849 63.02230
## + drat 1 1.39962 167.8863 63.04164
## + gear 1 0.12272 169.1632 63.28410
## + vs 1 0.00047 169.2855 63.30722
## - am 1 26.17770 195.4636 63.90843
## - qsec 1 109.03377 278.3197 75.21711
## - wt 1 183.34726 352.6332 82.79016
##
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.481067 -1.555525 -0.725731 1.411012 4.660998
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.617781 6.959593 1.38195 0.17791517
## wt -3.916504 0.711202 -5.50688 0.0000069527 ***
## qsec 1.225886 0.288670 4.24668 0.00021617 ***
## am1 2.935837 1.410905 2.08082 0.04671551 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.45885 on 28 degrees of freedom
## Multiple R-squared: 0.849664, Adjusted R-squared: 0.833556
## F-statistic: 52.7496 on 3 and 28 DF, p-value: 0.0000000000121045
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
##
## Final Model:
## mpg ~ wt + qsec + am
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 21 147.494430 70.8977443
## 2 - cyl 1 0.0798712088 22 147.574301 68.9150683
## 3 - vs 1 0.2685228049 23 147.842824 66.9732418
## 4 - carb 1 0.6854607735 24 148.528285 65.1212642
## 5 - gear 1 1.5649705268 25 150.093255 63.4566688
## 6 - drat 1 3.3445511717 26 153.437807 62.1619012
## 7 - disp 1 6.6286536884 27 160.066460 61.5153025
## 8 - hp 1 9.2194693468 28 169.285930 61.3073047
Backward stepwise selected the model mpg ~ wt + qsec + am with a slightly lower AIC (61.31) and a slightly higher \(R^2 = 0.850\) . Each of the variables is significant at 95% confidence, and one variable, am, is dichotomous: it indicates whether the car has an automatic or manual (stick-shift) transmission.
So, let’s try adding quadratic and interactions to this model:
##
## Call:
## lm(formula = formula5, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.481067 -1.555525 -0.725731 1.411012 4.660998
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.617781 6.959593 1.38195 0.17791517
## wt -3.916504 0.711202 -5.50688 0.0000069527 ***
## qsec 1.225886 0.288670 4.24668 0.00021617 ***
## am1 2.935837 1.410905 2.08082 0.04671551 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.45885 on 28 degrees of freedom
## Multiple R-squared: 0.849664, Adjusted R-squared: 0.833556
## F-statistic: 52.7496 on 3 and 28 DF, p-value: 0.0000000000121045
Rather than creating a separate variable wt2=wt^2 , the proper way to add a quadratic term to a model is by specifying poly(variablename, 2) . Under the latter method, R will not allow the quadratic term to be kept without also keeping the linear term.
##
## Call:
## lm(formula = mpg ~ poly(wt, 2) + qsec + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.750796 -1.312468 -0.495434 1.072236 4.855303
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.353222 5.255953 0.44773 0.6579234
## poly(wt, 2)1 -25.463350 3.765874 -6.76161 0.00000029221 ***
## poly(wt, 2)2 7.065872 2.509318 2.81585 0.0089778 **
## qsec 0.970511 0.273906 3.54323 0.0014614 **
## am1 1.021518 1.434550 0.71208 0.4825217
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.20149 on 27 degrees of freedom
## Multiple R-squared: 0.883791, Adjusted R-squared: 0.866574
## F-statistic: 51.3348 on 4 and 27 DF, p-value: 0.00000000000310741
Here the \(R^2\) has increased to 0.884, which improves the earlier result.
Let’s try adding an interaction between qsec and am . Since both variables are already in the model, we can simply change qsec+am to qsec*am ; this will keep the 2 underlying variables while adding an interaction which will be named qsec:am
##
## Call:
## lm(formula = mpg ~ poly(wt, 2) + qsec * am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.661388 -1.431606 -0.156905 1.183505 3.957388
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.992714 5.824678 1.02885 0.313027
## poly(wt, 2)1 -24.205530 3.821151 -6.33462 0.0000010444 ***
## poly(wt, 2)2 5.732660 2.657629 2.15706 0.040424 *
## qsec 0.760502 0.310687 2.44780 0.021436 *
## am1 -10.328338 8.456372 -1.22137 0.232907
## qsec:am1 0.669237 0.491620 1.36129 0.185105
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.16752 on 26 degrees of freedom
## Multiple R-squared: 0.891522, Adjusted R-squared: 0.870661
## F-statistic: 42.736 on 5 and 26 DF, p-value: 0.00000000000993575
This has improved the \(R^2\) slightly, to 0.892 , but am and the interaction term, qsec:am, are not individually significant.
The model is \[mpg = 5.993 -24.206 \cdot wt + 5.733 \cdot wt^2 + 0.761 \cdot qsec -10.328 \cdot am + 0.669 \cdot am \cdot qsec\] .
This means that:
wt of the car increases by 1 unit, the miles per gallon mpg changes by \(-24.206 \cdot wt + 5.732 \cdot wt^2\) , where we have to consider the quadratic term along with the linear termam==0):
qsec “1/4 mile time” increases by 1 unit, the mpg increases by 0.761am==1):
qsec “1/4 mile time” increases by 1 unit, the mpg increases by 0.761 + 0.669 = 1.43This means that the interaction can be expressed as two separate models:
For automatic cars (am==0) the model is \[mpg_{automatic} = 5.993 -24.206 \cdot wt + 5.733 \cdot wt^2 + 0.761 \cdot qsec\] For manual cars (am==1) the model is \[\begin{aligned}
mpg_{manual} &= (5.993 -10.328) -24.206 \cdot wt + 5.733 \cdot wt^2 + (0.761 + 0.669) \cdot qsec \\
&= \quad \quad -4.335 \quad \quad -24.206 \cdot wt + 5.733 \cdot wt^2 + \quad \quad \quad 1.43 \cdot qsec
\end{aligned}\] .
Residual = resid(m7)
Fitted = fitted(m7)
plot(Fitted,Residual,
main="mtcars Dataset: Fitted vs. Residuals",
xlab="Fitted mpg (Miles per Gallon))")
abline(h=0, col="blue")hist(Residual, main = "Histogram of Residuals - 8 breaks", ylab = "Density",
ylim = c(0, 0.35),
xlim = c(-6,6),
prob = TRUE,breaks=8, col="lightblue")
curve(dnorm(x, mean = mean(Residual), sd = sd(Residual)), col="red", add=TRUE)The above histogram looks vaguely Normal. It is difficult to assess because there are only 32 data points.
We can run several standard tests of normality:
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:MASS':
##
## cement
## The following object is masked from 'package:datasets':
##
## rivers
## -----------------------------------------------
## Test Statistic pvalue
## -----------------------------------------------
## Shapiro-Wilk 0.97 0.4998
## Kolmogorov-Smirnov 0.0887 0.9434
## Cramer-von Mises 2.4371 0.0000
## Anderson-Darling 0.3011 0.5586
## -----------------------------------------------
Three out of four tests pass – only Cramer-von Mises fails.
We can test using the lmSupport package:
## Descriptive Statistics for Studentized Residuals
##
## Call:
## lm(formula = mpg ~ poly(wt, 2) + qsec * am, data = mtcars)
##
## Coefficients:
## (Intercept) poly(wt, 2)1 poly(wt, 2)2 qsec am1 qsec:am1
## 5.992714 -24.205530 5.732660 0.760502 -10.328338 0.669237
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = Model)
##
## Value p-value Decision
## Global Stat 2.64503844 0.618866 Assumptions acceptable.
## Skewness 0.59000334 0.442418 Assumptions acceptable.
## Kurtosis 0.46970548 0.493123 Assumptions acceptable.
## Link Function 1.58012407 0.208743 Assumptions acceptable.
## Heteroscedasticity 0.00520555 0.942483 Assumptions acceptable.
All tests pass, including the test of Heteroscedasticity, indicating that the residuals have constant variance.
While there is clearly a relationship between the independent variable vs. the residual, the residuals themselves are actually independent of each other, as each represents the result from an individual country. Plotting the residuals sequentially (where the sequence happens to be alphabetical by country name) doesn’t yield any discernable pattern:
alpha_Residuals <- t(t(Residual[order(rownames(mtcars))]))
alpha_Residuals %>% kable(caption = "Residual") %>% kable_styling(c("striped", "bordered"))| AMC Javelin | -2.225402818 |
| Cadillac Fleetwood | -2.050588463 |
| Camaro Z28 | -1.018642663 |
| Chrysler Imperial | 2.821691116 |
| Datsun 710 | -3.661388284 |
| Dodge Challenger | -1.226871406 |
| Duster 360 | -1.430263386 |
| Ferrari Dino | 0.304378522 |
| Fiat 128 | 3.957387786 |
| Fiat X1-9 | -2.066866646 |
| Ford Pantera L | -0.101949870 |
| Honda Civic | -0.669267752 |
| Hornet 4 Drive | 1.333389926 |
| Hornet Sportabout | 1.509702508 |
| Lincoln Continental | -1.671453773 |
| Lotus Europa | 0.897628777 |
| Maserati Bora | 0.770241525 |
| Mazda RX4 | -0.606280813 |
| Mazda RX4 Wag | -0.002961576 |
| Merc 230 | -0.211859782 |
| Merc 240D | 3.787550367 |
| Merc 280 | 1.036260200 |
| Merc 280C | -0.820040882 |
| Merc 450SE | 1.402480525 |
| Merc 450SL | 0.887664531 |
| Merc 450SLC | -1.319570420 |
| Pontiac Firebird | 3.653150732 |
| Porsche 914-2 | 1.133543056 |
| Toyota Corolla | 2.418715870 |
| Toyota Corona | -3.021563634 |
| Valiant | -1.435632679 |
| Volvo 142E | -2.373180593 |
plot(alpha_Residuals, main="Residuals, sequenced alphabetically by car name")
abline(h=0, col="red")There is no discernable pattern to the residuals, which are independent of each other.
Here are the standard plots:
One key observation is that cars with manual transmission have better gas mileage than cars with automatic transmission (at least among the cars in the set of vehicles on this Motor Trend list.)
ggplot(mtcars, aes(y=mpg, x=factor(am, labels = c("automatic", "manual")), fill=factor(am)))+
theme(legend.position="none")+
geom_boxplot(colour="black", size=1)+
xlab("transmission type") + ylab("MPG")+
ggtitle("MPG by Transmission type") +
theme(plot.title = element_text(hjust = 0.5))plot(m7$fitted,m7$residuals, pch=19, cex=1.5, col=mtcars$am,
xlab = "Fitted MPG",
ylab = "Residual",
main = "Fitted MPG and residual, by car transmission type")
legend("bottomright",legend=c("Manual","Automatic"), pch=16, col=unique(mtcars$am))Yes, the linear model is appropriate, as the key variables are significant, the \(R^2\) is high, and the conditions for linear regression are met.