MY605_mtcars_regression

mtcars - Predicting Mileage per Gallon

Using R, build a multiple regression model for data that interests you.

Include in this model

at least one quadratic term,
one dichotomous term, and
one dichotomous vs. quantitative interaction term.

data("mtcars")

# make factors for the categorical variables
# "V-Shaped" or otherwise
mtcars$vs = as.factor(mtcars$vs)
# automatic or manual
mtcars$am = as.factor(mtcars$am)

Consider the mtcars dataset, which contains data from the 1974 Motor Trend Magazine on 32 cars.

mtcars %>% kable() %>% kable_styling(c("striped", "bordered"))

	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
Mazda RX4	21.0	6	160.0	110	3.90	2.620	16.46	0	1	4	4
Mazda RX4 Wag	21.0	6	160.0	110	3.90	2.875	17.02	0	1	4	4
Datsun 710	22.8	4	108.0	93	3.85	2.320	18.61	1	1	4	1
Hornet 4 Drive	21.4	6	258.0	110	3.08	3.215	19.44	1	0	3	1
Hornet Sportabout	18.7	8	360.0	175	3.15	3.440	17.02	0	0	3	2
Valiant	18.1	6	225.0	105	2.76	3.460	20.22	1	0	3	1
Duster 360	14.3	8	360.0	245	3.21	3.570	15.84	0	0	3	4
Merc 240D	24.4	4	146.7	62	3.69	3.190	20.00	1	0	4	2
Merc 230	22.8	4	140.8	95	3.92	3.150	22.90	1	0	4	2
Merc 280	19.2	6	167.6	123	3.92	3.440	18.30	1	0	4	4
Merc 280C	17.8	6	167.6	123	3.92	3.440	18.90	1	0	4	4
Merc 450SE	16.4	8	275.8	180	3.07	4.070	17.40	0	0	3	3
Merc 450SL	17.3	8	275.8	180	3.07	3.730	17.60	0	0	3	3
Merc 450SLC	15.2	8	275.8	180	3.07	3.780	18.00	0	0	3	3
Cadillac Fleetwood	10.4	8	472.0	205	2.93	5.250	17.98	0	0	3	4
Lincoln Continental	10.4	8	460.0	215	3.00	5.424	17.82	0	0	3	4
Chrysler Imperial	14.7	8	440.0	230	3.23	5.345	17.42	0	0	3	4
Fiat 128	32.4	4	78.7	66	4.08	2.200	19.47	1	1	4	1
Honda Civic	30.4	4	75.7	52	4.93	1.615	18.52	1	1	4	2
Toyota Corolla	33.9	4	71.1	65	4.22	1.835	19.90	1	1	4	1
Toyota Corona	21.5	4	120.1	97	3.70	2.465	20.01	1	0	3	1
Dodge Challenger	15.5	8	318.0	150	2.76	3.520	16.87	0	0	3	2
AMC Javelin	15.2	8	304.0	150	3.15	3.435	17.30	0	0	3	2
Camaro Z28	13.3	8	350.0	245	3.73	3.840	15.41	0	0	3	4
Pontiac Firebird	19.2	8	400.0	175	3.08	3.845	17.05	0	0	3	2
Fiat X1-9	27.3	4	79.0	66	4.08	1.935	18.90	1	1	4	1
Porsche 914-2	26.0	4	120.3	91	4.43	2.140	16.70	0	1	5	2
Lotus Europa	30.4	4	95.1	113	3.77	1.513	16.90	1	1	5	2
Ford Pantera L	15.8	8	351.0	264	4.22	3.170	14.50	0	1	5	4
Ferrari Dino	19.7	6	145.0	175	3.62	2.770	15.50	0	1	5	6
Maserati Bora	15.0	8	301.0	335	3.54	3.570	14.60	0	1	5	8
Volvo 142E	21.4	4	121.0	109	4.11	2.780	18.60	1	1	4	2

The variables include:

Name	Description
mpg	Miles/(US) gallon
cyl	Number of cylinders
disp	Displacement (cu.in.)
hp	Gross horsepower
drat	Rear axle ratio
wt	Weight (1000 lbs)
qsec	1/4 mile time
vs	Engine (0 = V-shaped, 1 = straight)
am	Transmission (0 = automatic, 1 = manual)
gear	Number of forward gears
carb	Number of carburetors

The goal is to predict MPG from the other features.

Base model

First, let’s consider a model which includes each of the other variables.

formula1 <- "mpg ~ ."
m1 <- lm(formula1, data=mtcars) 
summary(m1)

## 
## Call:
## lm(formula = formula1, data = mtcars)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -3.450644 -1.604402 -0.119605  1.219268  4.627094 
## 
## Coefficients:
##               Estimate Std. Error  t value Pr(>|t|)  
## (Intercept) 12.3033742 18.7178844  0.65731 0.518124  
## cyl         -0.1114405  1.0450234 -0.10664 0.916087  
## disp         0.0133352  0.0178575  0.74676 0.463489  
## hp          -0.0214821  0.0217686 -0.98684 0.334955  
## drat         0.7871110  1.6353731  0.48130 0.635278  
## wt          -3.7153039  1.8944143 -1.96119 0.063252 .
## qsec         0.8210407  0.7308448  1.12341 0.273941  
## vs1          0.3177628  2.1045086  0.15099 0.881423  
## am1          2.5202269  2.0566506  1.22540 0.233990  
## gear         0.6554130  1.4932600  0.43891 0.665206  
## carb        -0.1994193  0.8287525 -0.24063 0.812179  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.6502 on 21 degrees of freedom
## Multiple R-squared:  0.869016,   Adjusted R-squared:  0.806642 
## F-statistic: 13.9325 on 10 and 21 DF,  p-value: 0.000000379315

formula1b <- "mpg ~ ."
m1b <- lm(formula1b, data=mtcars) 
summary(m1b)

## 
## Call:
## lm(formula = formula1b, data = mtcars)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -3.450644 -1.604402 -0.119605  1.219268  4.627094 
## 
## Coefficients:
##               Estimate Std. Error  t value Pr(>|t|)  
## (Intercept) 12.3033742 18.7178844  0.65731 0.518124  
## cyl         -0.1114405  1.0450234 -0.10664 0.916087  
## disp         0.0133352  0.0178575  0.74676 0.463489  
## hp          -0.0214821  0.0217686 -0.98684 0.334955  
## drat         0.7871110  1.6353731  0.48130 0.635278  
## wt          -3.7153039  1.8944143 -1.96119 0.063252 .
## qsec         0.8210407  0.7308448  1.12341 0.273941  
## vs1          0.3177628  2.1045086  0.15099 0.881423  
## am1          2.5202269  2.0566506  1.22540 0.233990  
## gear         0.6554130  1.4932600  0.43891 0.665206  
## carb        -0.1994193  0.8287525 -0.24063 0.812179  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.6502 on 21 degrees of freedom
## Multiple R-squared:  0.869016,   Adjusted R-squared:  0.806642 
## F-statistic: 13.9325 on 10 and 21 DF,  p-value: 0.000000379315

Well, this is not promising – although this generates a model with a high \(R^2\), NONE of the variables are significant!

Correlations

Let’s look at the correlations between the variables

From the above, we note that the following variables have strong negative correlation with mpg :

carb
wt
hp
cyl
disp

and the following variables have strong correlation with mpg :

qsec
drat
gear
am (dichotomous)
vs (dichotomous)

Boruta

The Boruta algorithm determines “importance” of various variables:

library(Boruta)

## Loading required package: ranger

Boruta(mpg ~ .  , data=mtcars)->Bor.hvo
print(Bor.hvo)

## Boruta performed 14 iterations in 0.440195084 secs.
##  10 attributes confirmed important: am, carb, cyl, disp, drat and 5 more;
##  No attributes deemed unimportant.

plot(Bor.hvo,  cex.axis=0.75, las=2, main="Boruta algorithm for Feature Selection")

According to the above, all variables are “important”, with disp, wt, hp, and cyl most important.

Important variables

Let’s consider a model with just these “most important” variables:

formula3 <- "mpg ~ disp+wt+hp+cyl"
m3 <- lm(formula3, data=mtcars) 
summary(m3)

## 
## Call:
## lm(formula = formula3, data = mtcars)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -4.056232 -1.463594 -0.428125  1.285408  5.826944 
## 
## Coefficients:
##               Estimate Std. Error  t value             Pr(>|t|)    
## (Intercept) 40.8285367  2.7574679 14.80653 0.000000000000017614 ***
## disp         0.0115992  0.0117268  0.98912           0.33138556    
## wt          -3.8539035  1.0154736 -3.79518           0.00075895 ***
## hp          -0.0205384  0.0121468 -1.69085           0.10237913    
## cyl         -1.2933197  0.6558768 -1.97189           0.05894681 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.51252 on 27 degrees of freedom
## Multiple R-squared:  0.848635,   Adjusted R-squared:  0.82621 
## F-statistic: 37.8441 on 4 and 27 DF,  p-value: 0.000000000106109

Under the above, only wt, the weight of the car, is significant.

Stepwise regression

Stepwise regression is a “greedy” algorithm whichcan run forward (adding the most promising variables to an empty model) or backward (dropping the most unpromising variables from a complete model).

Forward stepwise

# Stepwise Regression
library(MASS)
# The specification "mpg ~ ." specifies that all of the other variables in the dataframe 
# are predictors to be used in the model.
allvars <- lm("mpg ~ .",data=mtcars)

# The specification "mpg ~ 1" means that the model has only an intercept term, 
# with no predictor variables.
novars <- lm("mpg ~ 1",data=mtcars)

# step forward from an empty model, adding the most promising variables at each step
# The AIC (Akaike Information Criterion) is a comparative measure; 
# the model with the lowest (i.e., most negative) AIC is "best"
stepforward <- stepAIC(novars, direction="both",scope=list(upper=allvars,lower=novars))

## Start:  AIC=115.94
## mpg ~ 1
## 
##        Df Sum of Sq       RSS       AIC
## + wt    1  847.7252  278.3219  73.21736
## + cyl   1  817.7130  308.3342  76.49435
## + disp  1  808.8885  317.1587  77.39732
## + hp    1  678.3729  447.6743  88.42656
## + drat  1  522.4805  603.5667  97.98786
## + vs    1  496.5279  629.5193  99.33506
## + am    1  405.1506  720.8966 103.67231
## + carb  1  341.7761  784.2711 106.36860
## + gear  1  259.7492  866.2980 109.55178
## + qsec  1  197.3919  928.6553 111.77605
## <none>              1126.0472 115.94345
## 
## Step:  AIC=73.22
## mpg ~ wt
## 
##        Df Sum of Sq       RSS       AIC
## + cyl   1   87.1500  191.1720  63.19800
## + hp    1   83.2742  195.0478  63.84027
## + qsec  1   82.8583  195.4636  63.90843
## + vs    1   54.2281  224.0939  68.28253
## + carb  1   44.6024  233.7196  69.62836
## + disp  1   31.6394  246.6825  71.35572
## <none>               278.3219  73.21736
## + drat  1    9.0806  269.2413  74.15591
## + gear  1    1.1369  277.1850  75.08638
## + am    1    0.0022  278.3197  75.21711
## - wt    1  847.7252 1126.0472 115.94345
## 
## Step:  AIC=63.2
## mpg ~ wt + cyl
## 
##        Df Sum of Sq      RSS      AIC
## + hp    1  14.55145 176.6205 62.66456
## + carb  1  13.77242 177.3996 62.80540
## <none>              191.1720 63.19800
## + qsec  1  10.56739 180.6046 63.37837
## + gear  1   3.02815 188.1438 64.68706
## + disp  1   2.67959 188.4924 64.74629
## + vs    1   0.70594 190.4660 65.07961
## + am    1   0.12491 191.0471 65.17708
## + drat  1   0.00102 191.1709 65.19783
## - cyl   1  87.14997 278.3219 73.21736
## - wt    1 117.16227 308.3342 76.49435
## 
## Step:  AIC=62.66
## mpg ~ wt + cyl + hp
## 
##        Df Sum of Sq      RSS      AIC
## <none>              176.6205 62.66456
## - hp    1  14.55145 191.1720 63.19800
## + am    1   6.62275 169.9978 63.44158
## + disp  1   6.17616 170.4444 63.52554
## - cyl   1  18.42723 195.0478 63.84027
## + carb  1   2.51872 174.1018 64.20494
## + drat  1   2.24529 174.3752 64.25515
## + qsec  1   1.40105 175.2195 64.40971
## + gear  1   0.85585 175.7647 64.50912
## + vs    1   0.05988 176.5606 64.65371
## - wt    1 115.35403 291.9746 76.74978

summary(stepforward)

## 
## Call:
## lm(formula = mpg ~ wt + cyl + hp, data = mtcars)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -3.929035 -1.559784 -0.531133  1.185008  5.898552 
## 
## Coefficients:
##               Estimate Std. Error  t value               Pr(>|t|)    
## (Intercept) 38.7517874  1.7868640 21.68704 < 0.000000000000000222 ***
## wt          -3.1669731  0.7405759 -4.27637             0.00019948 ***
## cyl         -0.9416168  0.5509164 -1.70918             0.09848010 .  
## hp          -0.0180381  0.0118762 -1.51884             0.14001516    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.51155 on 28 degrees of freedom
## Multiple R-squared:  0.84315,    Adjusted R-squared:  0.826345 
## F-statistic: 50.1715 on 3 and 28 DF,  p-value: 0.0000000000218418

# display results
stepforward$anova

## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## mpg ~ 1
## 
## Final Model:
## mpg ~ wt + cyl + hp
## 
## 
##    Step Df    Deviance Resid. Df  Resid. Dev         AIC
## 1                             31 1126.047187 115.9434500
## 2  + wt  1 847.7252500        30  278.321938  73.2173629
## 3 + cyl  1  87.1499713        29  191.171966  63.1979989
## 4  + hp  1  14.5514461        28  176.620520  62.6645624

Forward stepwise selected the model mpg ~ wt + cyl + hp .
It stopped with AIC = 62.66 and \(R^2 = 0.843\) , but only wt is highly significant.

Backward stepwise

Let’s try the opposite direction: start from a full model, and drop variables when doing so improves the AIC:

stepbackward <- stepAIC(allvars, direction="both",scope=list(upper=allvars,lower=novars))

## Start:  AIC=70.9
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## 
##        Df Sum of Sq      RSS      AIC
## - cyl   1  0.079871 147.5743 68.91507
## - vs    1  0.160126 147.6546 68.93247
## - carb  1  0.406669 147.9011 68.98585
## - gear  1  1.353055 148.8475 69.18996
## - drat  1  1.627026 149.1215 69.24881
## - disp  1  3.916667 151.4111 69.73641
## - hp    1  6.839910 154.3343 70.34833
## - qsec  1  8.864116 156.3586 70.76531
## <none>              147.4944 70.89774
## - am    1 10.546651 158.0411 71.10781
## - wt    1 27.014385 174.5088 74.27966
## 
## Step:  AIC=68.92
## mpg ~ disp + hp + drat + wt + qsec + vs + am + gear + carb
## 
##        Df Sum of Sq      RSS      AIC
## - vs    1  0.268523 147.8428 66.97324
## - carb  1  0.520142 148.0944 67.02766
## - gear  1  1.821147 149.3955 67.30755
## - drat  1  1.982623 149.5569 67.34212
## - disp  1  3.900939 151.4752 67.74996
## - hp    1  7.363165 154.9375 68.47314
## <none>              147.5743 68.91507
## - qsec  1 10.093256 157.6676 69.03209
## - am    1 11.835944 159.4102 69.38384
## + cyl   1  0.079871 147.4944 70.89774
## - wt    1 27.028045 174.6024 72.29681
## 
## Step:  AIC=66.97
## mpg ~ disp + hp + drat + wt + qsec + am + gear + carb
## 
##        Df Sum of Sq      RSS      AIC
## - carb  1  0.685461 148.5283 65.12126
## - gear  1  2.143684 149.9865 65.43390
## - drat  1  2.213899 150.0567 65.44888
## - disp  1  3.646651 151.4895 65.75297
## - hp    1  7.105960 154.9488 66.47548
## <none>              147.8428 66.97324
## - am    1 11.569381 159.4122 67.38424
## - qsec  1 15.683048 163.5259 68.19953
## + vs    1  0.268523 147.5743 68.91507
## + cyl   1  0.188268 147.6546 68.93247
## - wt    1 27.379942 175.2228 70.41031
## 
## Step:  AIC=65.12
## mpg ~ disp + hp + drat + wt + qsec + am + gear
## 
##        Df Sum of Sq      RSS      AIC
## - gear  1   1.56497 150.0933 63.45667
## - drat  1   1.93213 150.4604 63.53485
## <none>              148.5283 65.12126
## - disp  1  10.11026 158.6386 65.22856
## - am    1  12.32322 160.8515 65.67186
## - hp    1  14.82564 163.3539 66.16586
## + carb  1   0.68546 147.8428 66.97324
## + vs    1   0.43384 148.0944 67.02766
## + cyl   1   0.41443 148.1139 67.03185
## - qsec  1  26.40806 174.9363 68.35796
## - wt    1  69.12692 217.6552 75.34964
## 
## Step:  AIC=63.46
## mpg ~ disp + hp + drat + wt + qsec + am
## 
##        Df Sum of Sq      RSS      AIC
## - drat  1   3.34455 153.4378 62.16190
## - disp  1   8.54536 158.6386 63.22857
## <none>              150.0933 63.45667
## - hp    1  13.28465 163.3779 64.17056
## + gear  1   1.56497 148.5283 65.12126
## + cyl   1   1.00340 149.0899 65.24203
## + vs    1   0.64549 149.4478 65.31875
## + carb  1   0.10675 149.9865 65.43390
## - am    1  20.03588 170.1291 65.46630
## - qsec  1  25.57441 175.6677 66.49146
## - wt    1  67.57196 217.6652 73.35111
## 
## Step:  AIC=62.16
## mpg ~ disp + hp + wt + qsec + am
## 
##        Df Sum of Sq      RSS      AIC
## - disp  1   6.62865 160.0665 61.51530
## <none>              153.4378 62.16190
## - hp    1  12.57205 166.0099 62.68196
## + drat  1   3.34455 150.0933 63.45667
## + gear  1   2.97739 150.4604 63.53485
## + cyl   1   2.44669 150.9911 63.64752
## + vs    1   1.12081 152.3170 63.92730
## + carb  1   0.01143 153.4264 64.15952
## - qsec  1  26.46980 179.9076 65.25464
## - am    1  32.19752 185.6353 66.25754
## - wt    1  69.04304 222.4809 72.05136
## 
## Step:  AIC=61.52
## mpg ~ hp + wt + qsec + am
## 
##        Df Sum of Sq      RSS      AIC
## - hp    1   9.21947 169.2859 61.30730
## <none>              160.0665 61.51530
## + disp  1   6.62865 153.4378 62.16190
## + carb  1   3.22719 156.8393 62.86354
## + drat  1   1.42785 158.6386 63.22857
## - qsec  1  20.22461 180.2911 63.32277
## + cyl   1   0.24898 159.8175 63.46549
## + vs    1   0.24855 159.8179 63.46557
## + gear  1   0.17112 159.8953 63.48107
## - am    1  25.99284 186.0593 64.33054
## - wt    1  78.49377 238.5602 72.28435
## 
## Step:  AIC=61.31
## mpg ~ wt + qsec + am
## 
##        Df Sum of Sq      RSS      AIC
## <none>              169.2859 61.30730
## + hp    1   9.21947 160.0665 61.51530
## + carb  1   8.03594 161.2500 61.75104
## + disp  1   3.27607 166.0099 62.68196
## + cyl   1   1.50106 167.7849 63.02230
## + drat  1   1.39962 167.8863 63.04164
## + gear  1   0.12272 169.1632 63.28410
## + vs    1   0.00047 169.2855 63.30722
## - am    1  26.17770 195.4636 63.90843
## - qsec  1 109.03377 278.3197 75.21711
## - wt    1 183.34726 352.6332 82.79016

summary(stepbackward)

## 
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -3.481067 -1.555525 -0.725731  1.411012  4.660998 
## 
## Coefficients:
##              Estimate Std. Error  t value     Pr(>|t|)    
## (Intercept)  9.617781   6.959593  1.38195   0.17791517    
## wt          -3.916504   0.711202 -5.50688 0.0000069527 ***
## qsec         1.225886   0.288670  4.24668   0.00021617 ***
## am1          2.935837   1.410905  2.08082   0.04671551 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.45885 on 28 degrees of freedom
## Multiple R-squared:  0.849664,   Adjusted R-squared:  0.833556 
## F-statistic: 52.7496 on 3 and 28 DF,  p-value: 0.0000000000121045

stepbackward$anova

## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## 
## Final Model:
## mpg ~ wt + qsec + am
## 
## 
##     Step Df     Deviance Resid. Df Resid. Dev        AIC
## 1                               21 147.494430 70.8977443
## 2  - cyl  1 0.0798712088        22 147.574301 68.9150683
## 3   - vs  1 0.2685228049        23 147.842824 66.9732418
## 4 - carb  1 0.6854607735        24 148.528285 65.1212642
## 5 - gear  1 1.5649705268        25 150.093255 63.4566688
## 6 - drat  1 3.3445511717        26 153.437807 62.1619012
## 7 - disp  1 6.6286536884        27 160.066460 61.5153025
## 8   - hp  1 9.2194693468        28 169.285930 61.3073047

Backward stepwise selected the model mpg ~ wt + qsec + am with a slightly lower AIC (61.31) and a slightly higher \(R^2 = 0.850\) . Each of the variables is significant at 95% confidence, and one variable, am, is dichotomous: it indicates whether the car has an automatic or manual (stick-shift) transmission.

So, let’s try adding quadratic and interactions to this model:

formula5 <- mpg ~ wt + qsec + am
m5 <- lm(formula5, data=mtcars)
summ5 <- summary(m5)
summ5

## 
## Call:
## lm(formula = formula5, data = mtcars)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -3.481067 -1.555525 -0.725731  1.411012  4.660998 
## 
## Coefficients:
##              Estimate Std. Error  t value     Pr(>|t|)    
## (Intercept)  9.617781   6.959593  1.38195   0.17791517    
## wt          -3.916504   0.711202 -5.50688 0.0000069527 ***
## qsec         1.225886   0.288670  4.24668   0.00021617 ***
## am1          2.935837   1.410905  2.08082   0.04671551 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.45885 on 28 degrees of freedom
## Multiple R-squared:  0.849664,   Adjusted R-squared:  0.833556 
## F-statistic: 52.7496 on 3 and 28 DF,  p-value: 0.0000000000121045

Quadratic term

Rather than creating a separate variable wt2=wt^2 , the proper way to add a quadratic term to a model is by specifying poly(variablename, 2) . Under the latter method, R will not allow the quadratic term to be kept without also keeping the linear term.

m6 <- lm(mpg ~ poly(wt,2) + qsec + am, data=mtcars)
summ6 <- summary(m6)
summ6

## 
## Call:
## lm(formula = mpg ~ poly(wt, 2) + qsec + am, data = mtcars)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -3.750796 -1.312468 -0.495434  1.072236  4.855303 
## 
## Coefficients:
##                Estimate Std. Error  t value      Pr(>|t|)    
## (Intercept)    2.353222   5.255953  0.44773     0.6579234    
## poly(wt, 2)1 -25.463350   3.765874 -6.76161 0.00000029221 ***
## poly(wt, 2)2   7.065872   2.509318  2.81585     0.0089778 ** 
## qsec           0.970511   0.273906  3.54323     0.0014614 ** 
## am1            1.021518   1.434550  0.71208     0.4825217    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.20149 on 27 degrees of freedom
## Multiple R-squared:  0.883791,   Adjusted R-squared:  0.866574 
## F-statistic: 51.3348 on 4 and 27 DF,  p-value: 0.00000000000310741

Here the \(R^2\) has increased to 0.884, which improves the earlier result.

Interaction term

Let’s try adding an interaction between qsec and am . Since both variables are already in the model, we can simply change qsec+am to qsec*am ; this will keep the 2 underlying variables while adding an interaction which will be named qsec:am

m7 <- lm(mpg ~ poly(wt,2) + qsec*am,data=mtcars)
summ7 <- summary(m7)
summ7

## 
## Call:
## lm(formula = mpg ~ poly(wt, 2) + qsec * am, data = mtcars)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -3.661388 -1.431606 -0.156905  1.183505  3.957388 
## 
## Coefficients:
##                Estimate Std. Error  t value     Pr(>|t|)    
## (Intercept)    5.992714   5.824678  1.02885     0.313027    
## poly(wt, 2)1 -24.205530   3.821151 -6.33462 0.0000010444 ***
## poly(wt, 2)2   5.732660   2.657629  2.15706     0.040424 *  
## qsec           0.760502   0.310687  2.44780     0.021436 *  
## am1          -10.328338   8.456372 -1.22137     0.232907    
## qsec:am1       0.669237   0.491620  1.36129     0.185105    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.16752 on 26 degrees of freedom
## Multiple R-squared:  0.891522,   Adjusted R-squared:  0.870661 
## F-statistic:  42.736 on 5 and 26 DF,  p-value: 0.00000000000993575

This has improved the \(R^2\) slightly, to 0.892 , but am and the interaction term, qsec:am, are not individually significant.

Interpret all coefficients.

The model is \[mpg = 5.993 -24.206 \cdot wt + 5.733 \cdot wt^2 + 0.761 \cdot qsec -10.328 \cdot am + 0.669 \cdot am \cdot qsec\] .

This means that:

As the weight wt of the car increases by 1 unit, the miles per gallon mpg changes by \(-24.206 \cdot wt + 5.732 \cdot wt^2\) , where we have to consider the quadratic term along with the linear term
If the car has an automatic transmission (am==0):
- the intercept term is 5.993 , which would be the mpg if the car’s weight and qsec were both zero
- as the qsec “1/4 mile time” increases by 1 unit, the mpg increases by 0.761
If the car has a manual transmission (am==1):
- the intercept term becomes 5.993 - 10.328 = -4.335, which would be the mpg if the car’s weight and qsec were both zero
- as the qsec “1/4 mile time” increases by 1 unit, the mpg increases by 0.761 + 0.669 = 1.43

This means that the interaction can be expressed as two separate models:

For automatic cars (am==0) the model is \[mpg_{automatic} = 5.993 -24.206 \cdot wt + 5.733 \cdot wt^2 + 0.761 \cdot qsec\] For manual cars (am==1) the model is \[\begin{aligned} mpg_{manual} &= (5.993 -10.328) -24.206 \cdot wt + 5.733 \cdot wt^2 + (0.761 + 0.669) \cdot qsec \\ &= \quad \quad -4.335 \quad \quad -24.206 \cdot wt + 5.733 \cdot wt^2 + \quad \quad \quad 1.43 \cdot qsec \end{aligned}\] .

Conduct residual analysis.

Plots

Residual = resid(m7) 
Fitted = fitted(m7)
plot(Fitted,Residual, 
     main="mtcars Dataset: Fitted vs. Residuals", 
     xlab="Fitted mpg (Miles per Gallon))")
abline(h=0, col="blue")

QQplot

qqnorm(Residual)
qqline(Residual, col="red")

Plot histograms of residuals

hist(Residual, main = "Histogram of Residuals - 8 breaks", ylab = "Density", 
     ylim = c(0, 0.35),
     xlim = c(-6,6),
     prob = TRUE,breaks=8, col="lightblue")
curve(dnorm(x, mean = mean(Residual), sd = sd(Residual)), col="red", add=TRUE)

The above histogram looks vaguely Normal. It is difficult to assess because there are only 32 data points.

Normality tests

We can run several standard tests of normality:

library(olsrr)

## 
## Attaching package: 'olsrr'

## The following object is masked from 'package:MASS':
## 
##     cement

## The following object is masked from 'package:datasets':
## 
##     rivers

ols_test_normality(m7)

## -----------------------------------------------
##        Test             Statistic       pvalue  
## -----------------------------------------------
## Shapiro-Wilk               0.97          0.4998 
## Kolmogorov-Smirnov        0.0887         0.9434 
## Cramer-von Mises          2.4371         0.0000 
## Anderson-Darling          0.3011         0.5586 
## -----------------------------------------------

Three out of four tests pass – only Cramer-von Mises fails.

Homogeneity of residuals variance.

We can test using the lmSupport package:

library(lmSupport)
modelAssumptions(m7,"NORMAL")

## Descriptive Statistics for Studentized Residuals
## 
## Call:
## lm(formula = mpg ~ poly(wt, 2) + qsec * am, data = mtcars)
## 
## Coefficients:
##  (Intercept)  poly(wt, 2)1  poly(wt, 2)2          qsec           am1      qsec:am1  
##     5.992714    -24.205530      5.732660      0.760502    -10.328338      0.669237  
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma(x = Model) 
## 
##                         Value  p-value                Decision
## Global Stat        2.64503844 0.618866 Assumptions acceptable.
## Skewness           0.59000334 0.442418 Assumptions acceptable.
## Kurtosis           0.46970548 0.493123 Assumptions acceptable.
## Link Function      1.58012407 0.208743 Assumptions acceptable.
## Heteroscedasticity 0.00520555 0.942483 Assumptions acceptable.

All tests pass, including the test of Heteroscedasticity, indicating that the residuals have constant variance.

Independence of residuals

While there is clearly a relationship between the independent variable vs. the residual, the residuals themselves are actually independent of each other, as each represents the result from an individual country. Plotting the residuals sequentially (where the sequence happens to be alphabetical by country name) doesn’t yield any discernable pattern:

alpha_Residuals <- t(t(Residual[order(rownames(mtcars))]))
alpha_Residuals %>%  kable(caption = "Residual") %>% kable_styling(c("striped", "bordered"))

Residual
AMC Javelin	-2.225402818
Cadillac Fleetwood	-2.050588463
Camaro Z28	-1.018642663
Chrysler Imperial	2.821691116
Datsun 710	-3.661388284
Dodge Challenger	-1.226871406
Duster 360	-1.430263386
Ferrari Dino	0.304378522
Fiat 128	3.957387786
Fiat X1-9	-2.066866646
Ford Pantera L	-0.101949870
Honda Civic	-0.669267752
Hornet 4 Drive	1.333389926
Hornet Sportabout	1.509702508
Lincoln Continental	-1.671453773
Lotus Europa	0.897628777
Maserati Bora	0.770241525
Mazda RX4	-0.606280813
Mazda RX4 Wag	-0.002961576
Merc 230	-0.211859782
Merc 240D	3.787550367
Merc 280	1.036260200
Merc 280C	-0.820040882
Merc 450SE	1.402480525
Merc 450SL	0.887664531
Merc 450SLC	-1.319570420
Pontiac Firebird	3.653150732
Porsche 914-2	1.133543056
Toyota Corolla	2.418715870
Toyota Corona	-3.021563634
Valiant	-1.435632679
Volvo 142E	-2.373180593

plot(alpha_Residuals, main="Residuals, sequenced alphabetically by car name")
abline(h=0, col="red")

There is no discernable pattern to the residuals, which are independent of each other.

Here are the standard plots:

plot(m7)

Manual is better

One key observation is that cars with manual transmission have better gas mileage than cars with automatic transmission (at least among the cars in the set of vehicles on this Motor Trend list.)

ggplot(mtcars, aes(y=mpg, x=factor(am, labels = c("automatic", "manual")), fill=factor(am)))+
  theme(legend.position="none")+
  geom_boxplot(colour="black", size=1)+
  xlab("transmission type") + ylab("MPG")+
  ggtitle("MPG by Transmission type") +
  theme(plot.title = element_text(hjust = 0.5))

plot(m7$fitted,m7$residuals, pch=19, cex=1.5, col=mtcars$am,
     xlab = "Fitted MPG",
     ylab = "Residual",
     main = "Fitted MPG and residual, by car transmission type")
legend("bottomright",legend=c("Manual","Automatic"), pch=16, col=unique(mtcars$am))

Was the linear model appropriate? Why or why not?

Yes, the linear model is appropriate, as the key variables are significant, the \(R^2\) is high, and the conditions for linear regression are met.