Using regression analysis to explore factors influencing a car’s mileage.

Ximena Ramirez

Questions

What is the relationship between a car’s transmission (predictor) and its miles per gallon (outcome)?
Is there a significant difference in the average MPG between automatic and manual cars?
Are there other variables/predictors which significantly impact the MPG?
Goal: Determine significant predictors (i.e. feature selection)

Motivation

We need cars with better mileage.
Individual Impact
- reduced fuel costs & better investment
Environmental & Societal Impact
- reduced greenhouse gas emissions & improved air quality
Economic & Geopolitical Impact
- reduce country’s dependence on foreign oil

The Data - Overview

mtcars dataset from R
- Extracted from the 1974 Motor Trend US magazine
- Measures fuel consumption & 10 aspects of automobile design and performance for 32 automobiles
- Car models include sedans, luxury sedans, muscle cars and high-end sports cars (1973-74 models)

The Data - Variables

data(mtcars) #load the data set
head(mtcars)

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

mpg - Miles/(US) gallon

cyl - Number of cylinders

disp - Displacement (cu.in.)

hp - Gross horsepower

drat - Rear axle ratio

wt - Weight (1000 lbs)

qsec - 1/4 mile time (seconds)

vs - Engine shape (0 = V-shaped, 1 = straight)

am - Transmission (0 = automatic, 1 = manual)

gear - No. of forward gears

carb - Number of carburetors

Learn more about the variables here!

Exploring the Dataset

str(mtcars) #MPG = mpg, Transmission = am

'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

mtcars[!complete.cases(mtcars),] #Check for NAs in dataset

 [1] mpg  cyl  disp hp   drat wt   qsec vs   am   gear carb
<0 rows> (or 0-length row.names)

All variables numeric…
There are no missing values!

Cleaning the Dataset

mtcars2 <- within(mtcars, {
        vs <- factor(vs, labels = c("V", "S"))
        am <- factor(am, labels = c("automatic", "manual"))
        cyl  <- ordered(cyl)
        gear <- ordered(gear)
        carb <- ordered(carb)
        })
#adapted from the mtcars documentation

New data frame mtcars2 created
Want vs and am as factors
Want cyl, gear, carb as ordered factors

Exploratory Plot

mean_am <- aggregate(mpg ~ am, mean, data = mtcars2)
print(mean_am)

         am      mpg
1 automatic 17.14737
2    manual 24.39231

There is a mean difference of about 7.245 mpg between automatic and manual transmissions.
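
The plot for this slide is not shown above. A minimal sketch of one way to draw the comparison, assuming a boxplot was intended (the plot type and labels are assumptions), and rebuilding only the am factor so the snippet runs on its own:

```r
# Rebuild only the transmission factor so this snippet is self-contained.
mtcars2 <- within(mtcars, am <- factor(am, labels = c("automatic", "manual")))

# Side-by-side boxplots of mpg for each transmission type.
boxplot(mpg ~ am, data = mtcars2,
        xlab = "Transmission", ylab = "Miles per (US) gallon",
        main = "MPG by transmission type")

# Difference in group means (manual minus automatic).
mean_am <- aggregate(mpg ~ am, data = mtcars2, FUN = mean)
mean_diff <- diff(mean_am$mpg)
print(mean_diff)
```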

Is Difference Significant?


    Welch Two Sample t-test

data:  mpg by am
t = -3.7671, df = 18.332, p-value = 0.001374
alternative hypothesis: true difference in means between group automatic and group manual is not equal to 0
95 percent confidence interval:
 -11.280194  -3.209684
sample estimates:
mean in group automatic    mean in group manual 
               17.14737                24.39231

H_0: mean_a = mean_m (i.e., difference in means equals 0)
95% CI does not include 0
p-value = 0.0014 < 0.05
Hence, difference in means is significant!
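
The call that produced the output above is not shown. A sketch that should reproduce it, assuming the default Welch test on the mtcars2 factor coding (tt is a hypothetical name):

```r
# Rebuild only the transmission factor so this snippet is self-contained.
mtcars2 <- within(mtcars, am <- factor(am, labels = c("automatic", "manual")))

# Welch two-sample t-test; t.test() does not assume equal variances by default.
tt <- t.test(mpg ~ am, data = mtcars2)
print(tt)

# Reject H_0 at the 5% level when p < 0.05
# (equivalently, when the 95% CI excludes 0).
print(tt$p.value < 0.05)
```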

Modeling - Simple Linear Regression


Call:
lm(formula = mpg ~ am, data = mtcars2)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.3923 -3.0923 -0.2974  3.2439  9.5077 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   17.147      1.125  15.247 1.13e-15 ***
ammanual       7.245      1.764   4.106 0.000285 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.902 on 30 degrees of freedom
Multiple R-squared:  0.3598,    Adjusted R-squared:  0.3385 
F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

Model: mpg = 17.15 + 7.245 * am + e

Where am is either 0 (automatic) or 1 (manual).

Automatic: the expected mpg is the intercept, 17.15
Manual: the expected mpg is 17.15 + 7.245 = 24.39, i.e. 7.245 higher than automatic
The p-value for the slope is significant (0.000285 < 0.001)
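
The fitting code is not shown above. A sketch of how the model and the simple_summary object could be created (fit_simple is a hypothetical name; simple_summary is the name used on the next slide):

```r
# Rebuild only the transmission factor so this snippet is self-contained.
mtcars2 <- within(mtcars, am <- factor(am, labels = c("automatic", "manual")))

fit_simple <- lm(mpg ~ am, data = mtcars2)   # hypothetical object name
simple_summary <- summary(fit_simple)
print(simple_summary)

# With a 0/1 predictor the coefficients are just group means:
b <- coef(fit_simple)
mean_automatic <- b[["(Intercept)"]]                  # intercept
mean_manual    <- b[["(Intercept)"]] + b[["ammanual"]] # intercept + slope
print(c(automatic = mean_automatic, manual = mean_manual))
```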

The Best Model?

Model: mpg =17.15 + 7.245* am + e

print(simple_summary$r.squared) #check R-Squared Value

[1] 0.3597989

The model has an R-squared value of only about 0.36, i.e. am alone explains about 36% of the variance in mpg.
Not a strong predictive model.
Therefore, other regressors/variables may be needed to explain mpg.

Confounding Variables? Pairs Plot!

The first row of the plot shows possible correlation between mpg and several other variables.
Therefore, a regression model including more than am (transmission) may be a better fit.
Candidate predictors: cyl, disp, hp, drat, wt
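
The pairs plot itself is not shown above. A sketch of how such a plot could be produced, assuming the numeric mtcars data and a smoothed trend line in each panel (the column subset and panel choice are assumptions):

```r
# Candidate predictors named on this slide.
candidates <- c("cyl", "disp", "hp", "drat", "wt")

# Scatterplot matrix; the first row shows mpg against every other variable.
pairs(mtcars[, c("mpg", "am", candidates)],
      panel = panel.smooth,
      main = "mpg vs. candidate predictors")
```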

Modeling - Multivariate Regression

Create model including ALL variables & check residuals.

fit_all <- lm(mpg ~ ., data = mtcars2)
summary(fit_all)$coef

               Estimate  Std. Error     t value   Pr(>|t|)
(Intercept) 26.57171130 19.56615969  1.35804428 0.19452722
cyl.L       -0.23770312  5.06255894 -0.04695316 0.96317000
cyl.Q        2.02541267  2.14952059  0.94226251 0.36098761
disp         0.03554632  0.03189920  1.11433290 0.28267339
hp          -0.07050683  0.03942556 -1.78835344 0.09393155
drat         1.18283018  2.48348458  0.47627845 0.64073922
wt          -4.52977584  2.53874584 -1.78425732 0.09461859
qsec         0.36784482  0.93539569  0.39325050 0.69966720
vsS          1.93085054  2.87125777  0.67247551 0.51150791
ammanual     1.21211570  3.21354514  0.37718957 0.71131573
gear.L       1.78784595  2.64200408  0.67670068 0.50889747
gear.Q       0.12234634  2.40896047  0.05078802 0.96016458
carb.L       6.06155540  6.72821581  0.90091572 0.38187054
carb.Q       1.78825141  2.80043297  0.63856247 0.53273703
carb.C       0.42384333  2.57388972  0.16467035 0.87140202
carb^4       0.93317347  2.45041298  0.38082294 0.70867391
carb^5      -2.46409938  2.90450253 -0.84837226 0.40956528

Now none of the variables are significant!
How do we pick a model?

Modeling with ANOVA

Analysis of Variance Table

Response: mpg
          Df Sum Sq Mean Sq F value    Pr(>F)    
cyl        2 824.78  412.39 51.3766 1.943e-07 ***
disp       1  57.64   57.64  7.1813   0.01714 *  
hp         1  18.50   18.50  2.3050   0.14975    
drat       1  11.91   11.91  1.4843   0.24191    
wt         1  55.79   55.79  6.9500   0.01870 *  
qsec       1   1.52    1.52  0.1899   0.66918    
vs         1   0.30    0.30  0.0376   0.84878    
am         1  16.57   16.57  2.0639   0.17135    
gear       2   5.02    2.51  0.3128   0.73606    
carb       5  13.60    2.72  0.3388   0.88144    
Residuals 15 120.40    8.03                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Tests each variable sequentially, in the order it enters the formula (so results depend on that order)
Take variables with small p-values (here cyl, disp, wt)
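
The call behind the table above is not shown. A sketch that should reproduce it and pull out the terms with p < 0.05, assuming the table comes from anova() on the all-variables model (aov_all and sig_terms are hypothetical names):

```r
# Same recoding as the cleaning step, repeated so the snippet is self-contained.
mtcars2 <- within(mtcars, {
  vs   <- factor(vs, labels = c("V", "S"))
  am   <- factor(am, labels = c("automatic", "manual"))
  cyl  <- ordered(cyl)
  gear <- ordered(gear)
  carb <- ordered(carb)
})

fit_all <- lm(mpg ~ ., data = mtcars2)

# Sequential (Type I) ANOVA: each term is tested after the terms listed before it.
aov_all <- anova(fit_all)
print(aov_all)

# Terms whose sequential p-value is below 0.05 (which() drops the NA for Residuals).
sig_terms <- rownames(aov_all)[which(aov_all[["Pr(>F)"]] < 0.05)]
print(sig_terms)
```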

Modeling with ANOVA


Call:
lm(formula = mpg ~ disp + wt + cyl, data = mtcars2)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5965 -1.2361 -0.4855  1.4740  5.8043 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 30.498891   2.483658  12.280 1.46e-12 ***
disp         0.001715   0.013481   0.127  0.89972    
wt          -3.306751   1.105083  -2.992  0.00586 ** 
cyl.L       -4.470885   1.837358  -2.433  0.02186 *  
cyl.Q        0.934208   1.032856   0.904  0.37374    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.603 on 27 degrees of freedom
Multiple R-squared:  0.8375,    Adjusted R-squared:  0.8135 
F-statistic:  34.8 on 4 and 27 DF,  p-value: 2.726e-10

Chosen predictors are not all significant now…
Can we use ANOVA?
- Not all inputs are categorical
- Better for comparing nested models
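
As an illustration of the nested-model use, a sketch that tests whether disp adds anything once wt and cyl are in the model. The reduced model fit_no_disp is a hypothetical addition, not one of the models in the slides:

```r
# Only cyl needs recoding for these two models.
mtcars2 <- within(mtcars, cyl <- ordered(cyl))

fit_no_disp <- lm(mpg ~ wt + cyl, data = mtcars2)         # reduced (hypothetical)
fit_anova   <- lm(mpg ~ disp + wt + cyl, data = mtcars2)  # full model from the slide

# Partial F-test: does adding disp significantly reduce the residual sum of squares?
cmp <- anova(fit_no_disp, fit_anova)
print(cmp)

p_disp <- cmp[["Pr(>F)"]][2]
print(p_disp)
```

For a single added coefficient this F-test is equivalent to the t-test on disp in the full model, so the p-value matches the one in the coefficient table.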

Problems with ANOVA Model

Residuals do not appear normal or homoscedastic.
Will need to try another approach…

Modeling with Correlation Matrix

sort(cor(mtcars)[1,]) #use mtcars (not mtcars2) because cor() requires numeric columns

        wt        cyl       disp         hp       carb       qsec       gear 
-0.8676594 -0.8521620 -0.8475514 -0.7761684 -0.5509251  0.4186840  0.4802848 
        am         vs       drat        mpg 
 0.5998324  0.6640389  0.6811719  1.0000000

Want predictors with the largest correlation coefficients in absolute value (wt, cyl, disp, hp)
Other uses of correlation matrix
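
The selection can be sketched in code. The 0.75 cutoff is an assumption, chosen here only because it separates the four strongest predictors from the rest; the slides do not state a threshold:

```r
# Correlation of every variable with mpg, dropping mpg's correlation with itself.
r <- cor(mtcars)["mpg", ]
r <- r[names(r) != "mpg"]

# Rank by absolute correlation, strongest first.
ranked <- r[order(abs(r), decreasing = TRUE)]
print(round(ranked, 3))

threshold <- 0.75  # assumed cutoff, not taken from the slides
selected <- names(ranked)[abs(ranked) > threshold]
print(selected)
```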

Modeling with Correlation Matrix


Call:
lm(formula = mpg ~ wt + cyl + disp + hp, data = mtcars2)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.2740 -1.0349 -0.3831  0.9810  5.4192 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 33.595992   2.862725  11.736 6.86e-12 ***
wt          -3.428626   1.055455  -3.248  0.00319 ** 
cyl.L       -2.653932   1.989795  -1.334  0.19385    
cyl.Q        1.297738   1.002641   1.294  0.20693    
disp         0.004199   0.012917   0.325  0.74774    
hp          -0.023517   0.012216  -1.925  0.06523 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.482 on 26 degrees of freedom
Multiple R-squared:  0.8578,    Adjusted R-squared:  0.8305 
F-statistic: 31.37 on 5 and 26 DF,  p-value: 3.18e-10

Corr Model: MPG = 33.60 - 3.43wt - 2.65cyl.L + 1.30cyl.Q + 0.0042disp - 0.024hp + error

Modeling with Step-Wise Algorithm

fit_step <- step(fit_all, direction = "both", trace = 0) #iterative feature selection
summary(fit_step)


Call:
lm(formula = mpg ~ cyl + hp + wt + am, data = mtcars2)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.9387 -1.2560 -0.4013  1.1253  5.0513 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 31.97665    3.06337  10.438 8.61e-11 ***
cyl.L       -1.52995    1.61521  -0.947  0.35225    
cyl.Q        1.59177    0.88076   1.807  0.08231 .  
hp          -0.03211    0.01369  -2.345  0.02693 *  
wt          -2.49683    0.88559  -2.819  0.00908 ** 
ammanual     1.80921    1.39630   1.296  0.20646    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.41 on 26 degrees of freedom
Multiple R-squared:  0.8659,    Adjusted R-squared:  0.8401 
F-statistic: 33.57 on 5 and 26 DF,  p-value: 1.506e-10

Step Model: MPG = 31.98 - 1.53cyl.L + 1.59cyl.Q - 0.032hp - 2.50wt + 1.81am + error

step() keeps the model with the lowest AIC, which balances goodness of fit against the number of parameters
Note: am now accounts for only a 1.81 difference in mpg between automatic & manual!
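
To see the AIC trade-off directly, a sketch comparing the full model with the step-wise result (aic_tab is a hypothetical name). Note that AIC() and the criterion step() uses internally differ by a constant for a given dataset, so the ranking is the same:

```r
# Same recoding as the cleaning step, repeated so the snippet is self-contained.
mtcars2 <- within(mtcars, {
  vs   <- factor(vs, labels = c("V", "S"))
  am   <- factor(am, labels = c("automatic", "manual"))
  cyl  <- ordered(cyl)
  gear <- ordered(gear)
  carb <- ordered(carb)
})

fit_all  <- lm(mpg ~ ., data = mtcars2)
fit_step <- step(fit_all, direction = "both", trace = 0)

# One row per model: df = number of estimated parameters, AIC = criterion value.
aic_tab <- AIC(fit_all, fit_step)
print(aic_tab)

# Terms kept by the step-wise search.
kept <- attr(terms(fit_step), "term.labels")
print(kept)
```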

Residuals

Residuals vs. Fitted - relationship appears linear
Scale-Location - residuals have equal variance along regression line (homoscedastic)
Q-Q Residuals - residuals normally distributed w/ exceptions
Residuals vs. Leverage - no influential case detected
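
The four diagnostic plots are not shown above. A sketch of how they could be drawn for the step-wise model, assuming the default plot() method for lm objects was used (cd is a hypothetical name):

```r
# Only am and cyl need recoding for this model.
mtcars2 <- within(mtcars, {
  am  <- factor(am, labels = c("automatic", "manual"))
  cyl <- ordered(cyl)
})
fit_step <- lm(mpg ~ cyl + hp + wt + am, data = mtcars2)

# Default plot.lm panels: Residuals vs Fitted, Q-Q, Scale-Location,
# Residuals vs Leverage.
op <- par(mfrow = c(2, 2))
plot(fit_step)
par(op)

# Numeric companion to the leverage plot: largest Cook's distances.
cd <- cooks.distance(fit_step)
print(head(sort(cd, decreasing = TRUE), 3))
```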

Summary

3 models made
- fit_anova <- lm(mpg ~ wt + cyl + disp, data = mtcars2)
- fit_cor <- lm(mpg ~ wt + cyl + disp + hp, data = mtcars2)
- fit_step <- lm(mpg ~ wt + cyl + hp + am, data = mtcars2)
Consider each model's Adj. R² (fit_anova 0.8135, fit_cor 0.8305, fit_step 0.8401)
Consider magnitude of each variable
Weight was significant in every model
NOTE: Each model has the following contrasts for the levels of cyl:

contrasts(mtcars2$cyl) #row 1 is plugged into model if 4 cyl, row 2 for 6 cyl, row 3 for 8 cyl

                .L         .Q
[1,] -7.071068e-01  0.4082483
[2,] -7.850462e-17 -0.8164966
[3,]  7.071068e-01  0.4082483
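
To make the contrast coding concrete, a sketch that computes one prediction by hand from the step-wise coefficients and checks it against predict(). The choice of the Datsun 710 (4 cylinders, manual) is only for illustration:

```r
# Only am and cyl need recoding for this model.
mtcars2 <- within(mtcars, {
  am  <- factor(am, labels = c("automatic", "manual"))
  cyl <- ordered(cyl)
})
fit_step <- lm(mpg ~ cyl + hp + wt + am, data = mtcars2)

b  <- coef(fit_step)
ct <- contrasts(mtcars2$cyl)   # rows: 4, 6, 8 cylinders

car <- mtcars2["Datsun 710", ] # 4 cylinders, so use row 1 of the contrasts

# Plug the row-1 contrast values in for cyl.L and cyl.Q; am = 1 for manual.
by_hand <- b[["(Intercept)"]] +
  b[["cyl.L"]] * ct[1, ".L"] +
  b[["cyl.Q"]] * ct[1, ".Q"] +
  b[["hp"]] * car$hp +
  b[["wt"]] * car$wt +
  b[["ammanual"]] * 1

by_model <- unname(predict(fit_step, newdata = car))
print(all.equal(unname(by_hand), by_model))
```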

Further Considerations…Collinearity

Check each model's variance inflation factor (VIF) to check for collinearity between predictors (tables in order: fit_anova, fit_cor, fit_step):

          GVIF Df GVIF^(1/(2*Df))
disp 12.772144  1        3.573814
wt    5.348955  1        2.312781
cyl   6.969360  2        1.624794

          GVIF Df GVIF^(1/(2*Df))
wt    5.368271  1        2.316953
cyl   8.993015  2        1.731715
disp 12.900887  1        3.591780
hp    3.531254  1        1.879163

        GVIF Df GVIF^(1/(2*Df))
cyl 5.824545  2        1.553515
hp  4.703625  1        2.168784
wt  4.007113  1        2.001778
am  2.590777  1        1.609589

Consider leaving disp out of fit_cor: its GVIF (about 12.9) is the largest of any predictor
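
The call behind the tables above is not shown; their layout resembles the output of vif() from the car package, which is an assumption. For a term with 1 degree of freedom the GVIF reduces to the ordinary VIF, 1 / (1 - R²), where R² comes from regressing that predictor on the others. A base-R sketch for disp in fit_anova (aux and vif_disp are hypothetical names):

```r
# Only cyl needs recoding here.
mtcars2 <- within(mtcars, cyl <- ordered(cyl))

# Auxiliary regression: how well do the other predictors in
# mpg ~ disp + wt + cyl explain disp?
aux <- lm(disp ~ wt + cyl, data = mtcars2)
r2  <- summary(aux)$r.squared

# Variance inflation factor for disp.
vif_disp <- 1 / (1 - r2)
print(vif_disp)
```

A large value means disp is nearly a linear combination of wt and cyl, which inflates the standard error of its coefficient.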

Limitations & Improvements

Small sample size & overfitting
- Consider bootstrapping
- Collect data vs. online
- Random Forests
Consider interactions
- Remove redundancy
ANOVA to compare models
New variables
- Updated data
- Changes in technology & parameters

Conclusions

Manual cars may have better mileage…
- Cylinders & weight impact!
- Other variables: drat, qsec, vs, gear, or carb
Create predictive model (next step)
- parsimony vs performance
- Importance of interpretability
- Get an idea! Click here!
Impact: Facilitates & optimizes experimental design

THE END!