Using regression analysis to explore factors influencing a car’s mileage.

Ximena Ramirez

Questions

  • What is the relationship between a car’s transmission (predictor) and its miles per gallon (outcome)?
  • Is there a significant difference in average MPG between automatic and manual cars?
  • Are there other variables/predictors which significantly impact the MPG?
  • Goal: Determine significant predictors (i.e. feature selection)

Motivation

  • We need cars with better mileage.
  • Individual Impact
    • reduced fuel costs & better investment
  • Environmental & Societal Impact
    • reduced greenhouse gas emissions & improved air quality
  • Economic & Geopolitical Impact
    • reduce country’s dependence on foreign oil

The Data - Overview

mtcars dataset from R

  • Extracted from the 1974 Motor Trend US magazine
  • Measures fuel consumption & 10 aspects of automobile design and performance for 32 automobiles
  • Car models include sedans, luxury sedans, muscle cars and high-end sports cars (1973-74 models)

The Data - Variables

data(mtcars) #load the data set
head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

mpg - Miles/(US) gallon

cyl - Number of cylinders

disp - Displacement (cu.in.)

hp - Gross horsepower

drat - Rear axle ratio

wt - Weight (1000 lbs)

qsec - 1/4 mile time (seconds)

vs - Engine shape (0 = V-shaped, 1 = straight)

am - Transmission (0 = automatic, 1 = manual)

gear - No. of forward gears

carb - Number of carburetors

  • Learn more about the variables in the dataset's help page (?mtcars)!

Exploring the Dataset

str(mtcars) #MPG = mpg, Transmission = am
'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
mtcars[!complete.cases(mtcars),] #Check for NAs in dataset
 [1] mpg  cyl  disp hp   drat wt   qsec vs   am   gear carb
<0 rows> (or 0-length row.names)
  • All variables numeric…
  • There are no missing values!

Cleaning the Dataset

mtcars2 <- within(mtcars, {
        vs <- factor(vs, labels = c("V", "S"))
        am <- factor(am, labels = c("automatic", "manual"))
        cyl  <- ordered(cyl)
        gear <- ordered(gear)
        carb <- ordered(carb)
        })
#adapted from the example in the mtcars documentation
  • New data frame mtcars2 created
  • Want vs and am as factors
  • Want cyl, gear, carb as ordered factors

Exploratory Plot

mean_am <- aggregate(mpg ~ am, mean, data = mtcars2)
print(mean_am)
         am      mpg
1 automatic 17.14737
2    manual 24.39231
  • There is a mean difference of about 7.245 mpg between automatic and manual transmissions.
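
The comparison of group means can also be viewed graphically. A minimal sketch, assuming a base R boxplot is an acceptable stand-in for the exploratory plot; the axis labels and the object name bp are illustrative and not from the original, and mtcars2 is rebuilt here in abbreviated form so the block runs on its own:

```r
# Abbreviated rebuild of mtcars2 (only am is needed for this plot)
mtcars2 <- within(mtcars, {
  am <- factor(am, labels = c("automatic", "manual"))
})

# Side-by-side boxplots of mpg for each transmission type
bp <- boxplot(mpg ~ am, data = mtcars2,
              xlab = "Transmission", ylab = "Miles per (US) gallon")

# Number of cars in each group (automatic, manual)
bp$n
```

The group sizes are unequal, which is one reason an unequal-variance test is a reasonable choice for the next step.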

Is Difference Significant?


    Welch Two Sample t-test

data:  mpg by am
t = -3.7671, df = 18.332, p-value = 0.001374
alternative hypothesis: true difference in means between group automatic and group manual is not equal to 0
95 percent confidence interval:
 -11.280194  -3.209684
sample estimates:
mean in group automatic    mean in group manual 
               17.14737                24.39231 
  • H0: mean_automatic = mean_manual (i.e., difference in means equals 0)
  • 95% CI does not include 0
  • p-value = 0.0014 < 0.05
  • Hence, difference in means is significant!
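
The call that produces this output is not shown on the slide. A sketch of what it most likely looks like, assuming the default (Welch, unequal variance) form of t.test(); the object name tt is an illustrative choice:

```r
# Abbreviated rebuild of mtcars2 (only am is needed here)
mtcars2 <- within(mtcars, {
  am <- factor(am, labels = c("automatic", "manual"))
})

# Welch two-sample t-test: var.equal = FALSE is the default
tt <- t.test(mpg ~ am, data = mtcars2)

tt$p.value   # compare against the 0.05 threshold
tt$conf.int  # 95% CI for mean(automatic) - mean(manual)
```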

Modeling - Simple Linear Regression


Call:
lm(formula = mpg ~ am, data = mtcars2)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.3923 -3.0923 -0.2974  3.2439  9.5077 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   17.147      1.125  15.247 1.13e-15 ***
ammanual       7.245      1.764   4.106 0.000285 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.902 on 30 degrees of freedom
Multiple R-squared:  0.3598,    Adjusted R-squared:  0.3385 
F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

Model: mpg = 17.15 + 7.245 * am + e

where am is either 0 (automatic) or 1 (manual).

  • Automatic: the mean mpg is 17.15, plus error
  • Manual: the mean mpg is expected to be 7.245 higher (about 24.39), plus error
  • The p-value for the slope is significant (smaller than 0.001)
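
A sketch of how the simple model could be fit and stored. The name simple_summary matches the object printed on the next slide; fit_simple is an assumed name for the fitted model, since the original call is not shown:

```r
# Abbreviated rebuild of mtcars2 (only am is needed here)
mtcars2 <- within(mtcars, {
  am <- factor(am, labels = c("automatic", "manual"))
})

# Simple linear regression of mpg on transmission type
fit_simple <- lm(mpg ~ am, data = mtcars2)
simple_summary <- summary(fit_simple)

# The intercept is the automatic mean; intercept + slope is the manual mean
b <- coef(fit_simple)
c(automatic = unname(b["(Intercept)"]),
  manual    = unname(b["(Intercept)"] + b["ammanual"]))
```

With a single two-level factor as the predictor, the fitted values are just the two group means, so this model restates the earlier comparison of means (with a pooled rather than Welch variance).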

The Best Model?

Model: mpg = 17.15 + 7.245 * am + e

print(simple_summary$r.squared) #check R-Squared Value
[1] 0.3597989
  • The model has an R-squared value of only about 0.36.

  • Not a strong predictive model.

  • Therefore, other regressors/variables may be needed to explain mpg.

Confounding Variables? Pairs Plot!

  • First row of plot shows possible correlation between mpg and several other variables.

  • Therefore, a regression model including more than am (transmission) may be a better fit.

  • cyl, disp, hp, drat, wt
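
The pairs plot itself is not reproduced here. A minimal sketch of how such a plot could be drawn, assuming base R graphics and a subset of columns chosen for readability (the vector name vars and the plot title are illustrative):

```r
# Columns to compare; mpg first so its scatterplots fill the first row
vars <- c("mpg", "cyl", "disp", "hp", "drat", "wt", "am")

# Scatterplot matrix on the original numeric data frame
pairs(mtcars[, vars], main = "mtcars: mpg vs. candidate predictors")
```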

Modeling - Multiple Regression

Create model including ALL variables & check residuals.

fit_all <- lm(mpg ~ ., data = mtcars2)
summary(fit_all)$coef
               Estimate  Std. Error     t value   Pr(>|t|)
(Intercept) 26.57171130 19.56615969  1.35804428 0.19452722
cyl.L       -0.23770312  5.06255894 -0.04695316 0.96317000
cyl.Q        2.02541267  2.14952059  0.94226251 0.36098761
disp         0.03554632  0.03189920  1.11433290 0.28267339
hp          -0.07050683  0.03942556 -1.78835344 0.09393155
drat         1.18283018  2.48348458  0.47627845 0.64073922
wt          -4.52977584  2.53874584 -1.78425732 0.09461859
qsec         0.36784482  0.93539569  0.39325050 0.69966720
vsS          1.93085054  2.87125777  0.67247551 0.51150791
ammanual     1.21211570  3.21354514  0.37718957 0.71131573
gear.L       1.78784595  2.64200408  0.67670068 0.50889747
gear.Q       0.12234634  2.40896047  0.05078802 0.96016458
carb.L       6.06155540  6.72821581  0.90091572 0.38187054
carb.Q       1.78825141  2.80043297  0.63856247 0.53273703
carb.C       0.42384333  2.57388972  0.16467035 0.87140202
carb^4       0.93317347  2.45041298  0.38082294 0.70867391
carb^5      -2.46409938  2.90450253 -0.84837226 0.40956528
  • Now none of the coefficients is significant at the 0.05 level!
  • How do we pick a model?

Modeling with ANOVA

Analysis of Variance Table

Response: mpg
          Df Sum Sq Mean Sq F value    Pr(>F)    
cyl        2 824.78  412.39 51.3766 1.943e-07 ***
disp       1  57.64   57.64  7.1813   0.01714 *  
hp         1  18.50   18.50  2.3050   0.14975    
drat       1  11.91   11.91  1.4843   0.24191    
wt         1  55.79   55.79  6.9500   0.01870 *  
qsec       1   1.52    1.52  0.1899   0.66918    
vs         1   0.30    0.30  0.0376   0.84878    
am         1  16.57   16.57  2.0639   0.17135    
gear       2   5.02    2.51  0.3128   0.73606    
carb       5  13.60    2.72  0.3388   0.88144    
Residuals 15 120.40    8.03                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • anova() on a single model tests each variable sequentially, in the order entered, so the p-values depend on variable order
  • Take variables with small p-values
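
Because anova() on one fitted model uses sequential sums of squares, the credit a variable gets depends on where it appears in the formula. A small sketch of that effect, using am and wt as an illustrative pair (the object names a_first and a_last are assumptions):

```r
# Abbreviated rebuild of mtcars2 (only am is needed here)
mtcars2 <- within(mtcars, {
  am <- factor(am, labels = c("automatic", "manual"))
})

# Same two predictors, entered in opposite orders
a_first <- anova(lm(mpg ~ am + wt, data = mtcars2))  # am entered first
a_last  <- anova(lm(mpg ~ wt + am, data = mtcars2))  # am entered after wt

# Sum of squares attributed to am in each ordering
a_first["am", "Sum Sq"]
a_last["am", "Sum Sq"]
```

When am is entered after wt, it is only credited with the variation that wt has not already explained, so its sum of squares shrinks.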

Modeling with ANOVA


Call:
lm(formula = mpg ~ disp + wt + cyl, data = mtcars2)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5965 -1.2361 -0.4855  1.4740  5.8043 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 30.498891   2.483658  12.280 1.46e-12 ***
disp         0.001715   0.013481   0.127  0.89972    
wt          -3.306751   1.105083  -2.992  0.00586 ** 
cyl.L       -4.470885   1.837358  -2.433  0.02186 *  
cyl.Q        0.934208   1.032856   0.904  0.37374    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.603 on 27 degrees of freedom
Multiple R-squared:  0.8375,    Adjusted R-squared:  0.8135 
F-statistic:  34.8 on 4 and 27 DF,  p-value: 2.726e-10
  • Chosen predictors are not all significant now…
  • Can we use ANOVA?
    • Not all inputs are categorical
    • Better for comparing nested models

Problems with ANOVA Model

  • Residuals do not appear normal or homoscedastic.
  • Will need to try another approach…

Modeling with Correlation Matrix

sort(cor(mtcars)[1,]) #use mtcars because cor() requires numeric columns
        wt        cyl       disp         hp       carb       qsec       gear 
-0.8676594 -0.8521620 -0.8475514 -0.7761684 -0.5509251  0.4186840  0.4802848 
        am         vs       drat        mpg 
 0.5998324  0.6640389  0.6811719  1.0000000 
  • Want predictors with the largest (absolute value) correlation with mpg
  • Other uses of correlation matrix

Modeling with Correlation Matrix


Call:
lm(formula = mpg ~ wt + cyl + disp + hp, data = mtcars2)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.2740 -1.0349 -0.3831  0.9810  5.4192 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 33.595992   2.862725  11.736 6.86e-12 ***
wt          -3.428626   1.055455  -3.248  0.00319 ** 
cyl.L       -2.653932   1.989795  -1.334  0.19385    
cyl.Q        1.297738   1.002641   1.294  0.20693    
disp         0.004199   0.012917   0.325  0.74774    
hp          -0.023517   0.012216  -1.925  0.06523 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.482 on 26 degrees of freedom
Multiple R-squared:  0.8578,    Adjusted R-squared:  0.8305 
F-statistic: 31.37 on 5 and 26 DF,  p-value: 3.18e-10

Corr Model: MPG = 33.60 - 3.43wt - 2.65cyl.L + 1.30cyl.Q + 0.0042disp - 0.024hp + error

Modeling with Step-Wise Algorithm

fit_step <- step(fit_all, direction = "both", trace = 0) #iterative feature selection
summary(fit_step)

Call:
lm(formula = mpg ~ cyl + hp + wt + am, data = mtcars2)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.9387 -1.2560 -0.4013  1.1253  5.0513 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 31.97665    3.06337  10.438 8.61e-11 ***
cyl.L       -1.52995    1.61521  -0.947  0.35225    
cyl.Q        1.59177    0.88076   1.807  0.08231 .  
hp          -0.03211    0.01369  -2.345  0.02693 *  
wt          -2.49683    0.88559  -2.819  0.00908 ** 
ammanual     1.80921    1.39630   1.296  0.20646    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.41 on 26 degrees of freedom
Multiple R-squared:  0.8659,    Adjusted R-squared:  0.8401 
F-statistic: 33.57 on 5 and 26 DF,  p-value: 1.506e-10

Step Model: MPG = 31.98 - 1.53cyl.L + 1.59cyl.Q - 0.032hp - 2.50wt + 1.81am + error

  • step() keeps the model with the lowest AIC, trading goodness of fit against the number of parameters
  • Note: am now accounts for only a 1.81 difference in mpg between automatic & manual!
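
To see the trade-off that the stepwise search is making, the AIC of the full and selected models can be put side by side. A minimal sketch that rebuilds the objects from the earlier slides so it runs on its own:

```r
# Full rebuild of mtcars2, as in the cleaning step
mtcars2 <- within(mtcars, {
  vs   <- factor(vs, labels = c("V", "S"))
  am   <- factor(am, labels = c("automatic", "manual"))
  cyl  <- ordered(cyl)
  gear <- ordered(gear)
  carb <- ordered(carb)
})

fit_all  <- lm(mpg ~ ., data = mtcars2)
fit_step <- step(fit_all, direction = "both", trace = 0)

# Lower AIC is better; df counts the estimated parameters
AIC(fit_all, fit_step)
```

Setting trace = 1 instead would print each candidate move and its AIC, which can help when explaining why a variable was dropped.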

Residuals

  • Residuals vs. Fitted - relationship appears linear
  • Scale-Location - residuals have roughly equal variance along regression line (homoscedastic)
  • Q-Q Residuals - residuals normally distributed, with a few exceptions
  • Residuals vs. Leverage - no influential cases detected
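
The four diagnostic panels named above are the defaults of plot() for an lm object. A sketch of how they could be drawn for the stepwise model, with Cook's distance added as a numeric companion to the leverage plot (the name cd is illustrative):

```r
# Abbreviated rebuild of mtcars2 (am and cyl are the factors in this model)
mtcars2 <- within(mtcars, {
  am  <- factor(am, labels = c("automatic", "manual"))
  cyl <- ordered(cyl)
})
fit_step <- lm(mpg ~ wt + cyl + hp + am, data = mtcars2)

# 2 x 2 grid: Residuals vs Fitted, Q-Q, Scale-Location, Residuals vs Leverage
op <- par(mfrow = c(2, 2))
plot(fit_step)
par(op)  # restore the previous plotting layout

# Largest Cook's distances, to see which cars have the most influence
cd <- cooks.distance(fit_step)
head(sort(cd, decreasing = TRUE), 3)
```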

Summary

  • 3 models made
    • fit_anova <- lm(mpg ~ wt + cyl + disp, data = mtcars2)
    • fit_cor <- lm(mpg ~ wt + cyl + disp + hp, data = mtcars2)
    • fit_step <- lm(mpg ~ wt + cyl + hp + am, data = mtcars2)
  • Consider each model's adjusted R-squared
  • Consider the magnitude of each coefficient
  • Weight was significant in every model
  • NOTE: Each model has the following contrasts for the levels of cyl:
contrasts(mtcars2$cyl) #row 1 is plugged into model if 4 cyl, row 2 for 6 cyl, row 3 for 8 cyl
                .L         .Q
[1,] -7.071068e-01  0.4082483
[2,] -7.850462e-17 -0.8164966
[3,]  7.071068e-01  0.4082483
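
Because cyl is an ordered factor, its coefficients multiply the contrast values rather than the cylinder count. A worked sketch for one car, assuming the stepwise model and using the Datsun 710 (4 cylinders, manual) as an arbitrary example; the names datsun, ctr and by_hand are illustrative:

```r
# Abbreviated rebuild of mtcars2 (am and cyl are the factors in this model)
mtcars2 <- within(mtcars, {
  am  <- factor(am, labels = c("automatic", "manual"))
  cyl <- ordered(cyl)
})
fit_step <- lm(mpg ~ wt + cyl + hp + am, data = mtcars2)
b <- coef(fit_step)

datsun <- mtcars2["Datsun 710", ]

# Row of the contrast matrix for this car's cylinder level (row 1 = 4 cyl)
ctr <- contrasts(mtcars2$cyl)[as.integer(datsun$cyl), ]

# Plug the contrast values into the fitted equation by hand
by_hand <- b["(Intercept)"] +
  b["cyl.L"] * ctr[".L"] + b["cyl.Q"] * ctr[".Q"] +
  b["hp"] * datsun$hp + b["wt"] * datsun$wt +
  b["ammanual"] * (datsun$am == "manual")

unname(by_hand)
unname(predict(fit_step, newdata = datsun))  # should agree with by_hand
```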

Further Considerations…Collinearity

Check each model's variance inflation factor (VIF) to gauge collinearity among predictors:

          GVIF Df GVIF^(1/(2*Df))
disp 12.772144  1        3.573814
wt    5.348955  1        2.312781
cyl   6.969360  2        1.624794
          GVIF Df GVIF^(1/(2*Df))
wt    5.368271  1        2.316953
cyl   8.993015  2        1.731715
disp 12.900887  1        3.591780
hp    3.531254  1        1.879163
        GVIF Df GVIF^(1/(2*Df))
cyl 5.824545  2        1.553515
hp  4.703625  1        2.168784
wt  4.007113  1        2.001778
am  2.590777  1        1.609589
  • disp has by far the largest GVIF (about 12.8 to 12.9); consider leaving disp out of fit_cor
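
The GVIF tables look like output from a VIF function such as the one in the car package, though the call is not shown. For a single-coefficient predictor, the same quantity can be sketched in base R as 1 / (1 - R²), where R² comes from regressing that predictor on the others. The helper vif_1df below is hypothetical and only covers numeric, one-column predictors:

```r
# Abbreviated rebuild of mtcars2 (only cyl is needed here)
mtcars2 <- within(mtcars, {
  cyl <- ordered(cyl)
})

# Hypothetical helper: VIF of one numeric predictor given the other predictors
vif_1df <- function(predictor, others, data) {
  aux <- lm(reformulate(others, response = predictor), data = data)
  1 / (1 - summary(aux)$r.squared)
}

# How well do wt and cyl already explain disp?
vif_disp <- vif_1df("disp", c("wt", "cyl"), mtcars2)
vif_disp
```

A large value means the other predictors nearly reproduce disp, which inflates the standard error of its coefficient.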

Limitations & Improvements

  • Small sample size & overfitting
    • Consider bootstrapping
    • Collect new data rather than rely on an online dataset
    • Random Forests
  • Consider interactions
    • Remove redundancy
  • ANOVA to compare models
  • New variables
    • Updated data
    • Changes in technology & parameters
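
As one illustration of the bootstrapping idea, the sketch below resamples cars with replacement and refits the stepwise model to see how stable the transmission coefficient is. The number of resamples, the seed, and the names B and boot_am are arbitrary choices, and a percentile interval is only one of several bootstrap intervals:

```r
# Abbreviated rebuild of mtcars2 (am and cyl are the factors in this model)
mtcars2 <- within(mtcars, {
  am  <- factor(am, labels = c("automatic", "manual"))
  cyl <- ordered(cyl)
})

set.seed(1)  # arbitrary seed, for reproducibility only
B <- 500     # number of bootstrap resamples (arbitrary)

# Refit the model on each resample and keep the manual-vs-automatic coefficient
boot_am <- replicate(B, {
  idx <- sample(nrow(mtcars2), replace = TRUE)
  fit <- lm(mpg ~ wt + cyl + hp + am, data = mtcars2[idx, ])
  unname(coef(fit)["ammanual"])
})

# Percentile interval for the am effect
quantile(boot_am, c(0.025, 0.975), na.rm = TRUE)
```

With only 32 cars, a wide interval here would reinforce the small-sample caveat above.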

Conclusions

  • Manual cars may have better mileage…
    • Cylinders & weight impact!
    • Other variables: drat, qsec, vs, gear, or carb
  • Create predictive model (next step)
    • parsimony vs performance
    • Importance of interpretability
  • Impact: Facilitates & optimizes experimental design

THE END!