You work for Motor Trend, a magazine about the automobile industry. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome). They are particularly interested in the following two questions:
Is an automatic or manual transmission better for MPG?
Quantify the MPG difference between automatic and manual transmissions.
An ANOVA showed that a model with all features does not fit significantly better than a smaller one, so I chose to test three models: miles per gallon by transmission; by transmission and cylinders; and by transmission, cylinders, and weight. In all three, the estimate for having a manual transmission instead of an automatic was positive, but the confidence intervals for all but the first model included zero. This means that, after accounting for simple things like the number of cylinders and the weight of the car, it can’t be concluded that either transmission is better for miles per gallon.
A variety of factors can affect the miles per gallon of a vehicle, including (but not limited to) cylinders, weight, and (what this analysis aims to determine) transmission type (am).
First, we’ll simply plot miles per gallon against the number of cylinders, colored by transmission and sized by weight, to see if any obvious issues arise in the data.
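The figure itself is not reproduced in the text. A plot along these lines could be drawn with base graphics, as in the rough sketch below. It assumes the data are the built-in `mtcars` set with `am` relabelled as Automatic/Manual; the colors, jitter, and point scaling are arbitrary choices, not necessarily those of the original figure.

```r
# Assumption: `cars` is a relabelled copy of the built-in mtcars data set
cars <- mtcars
cars$am <- ifelse(cars$am == 1, "Manual", "Automatic")

# One color per transmission type (arbitrary choice)
point_col <- ifelse(cars$am == "Manual", "steelblue", "tomato")

# mpg against cylinders; points are jittered sideways and sized by weight
plot(jitter(cars$cyl, amount = 0.2), cars$mpg,
     col = point_col, pch = 19, cex = cars$wt / 2,
     xaxt = "n", xlab = "Cylinders", ylab = "Miles per gallon")
axis(1, at = c(4, 6, 8))
legend("topright", legend = c("Automatic", "Manual"),
       col = c("tomato", "steelblue"), pch = 19)
```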
Some confounding problems with a dataset of only 32 observations may be:
The cars with each transmission may not be appropriately sampled from the population
The transmission status itself may directly affect other features (such as weight)
The limited data is immediately a concern. There are no automatic (am = 0) cars with more than 25 miles per gallon. In addition, sizing the points by weight shows that the manual (am = 1) cars tend to be the lighter ones: within every cylinder group, the manual cars weigh less on average than the automatic cars. The following table of means (weight in 1000 lbs) shows some of these issues more concisely.
## # A tibble: 6 x 4
## # Groups: cyl [?]
## cyl am mpg wt
## <fctr> <chr> <dbl> <dbl>
## 1 4 Automatic 22.90000 2.935000
## 2 4 Manual 28.07500 2.042250
## 3 6 Automatic 19.12500 3.388750
## 4 6 Manual 20.56667 2.755000
## 5 8 Automatic 15.05000 4.104083
## 6 8 Manual 15.40000 3.370000
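The code that produced this table is not shown. A base R sketch that should give the same group means is below, assuming `cars` is the built-in `mtcars` data with `cyl` converted to a factor and `am` relabelled (the original appears to use a grouped tibble instead).

```r
# Assumption: `cars` is a recoded copy of the built-in mtcars data set
cars <- mtcars
cars$cyl <- factor(cars$cyl)
cars$am <- ifelse(cars$am == 1, "Manual", "Automatic")

# Mean mpg and weight (in 1000 lbs) for each cylinder/transmission group
group_means <- aggregate(cbind(mpg, wt) ~ cyl + am, data = cars, FUN = mean)

# Order the rows by cylinder, then transmission, as in the table above
group_means[order(group_means$cyl, group_means$am), ]
```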
We compare four nested models: one with just transmission, a second adding cylinders, a third adding weight (the variables plotted above), and a final model with all features. The ANOVA shows that adding cylinders and then weight each improves the fit significantly, but going on to the model with all features does not (p = 0.42), even though, as expected, it reduces the RSS further (from 182.97 to 133.32).
mdl1 <- lm(mpg ~ am, data = cars) # only transmission
mdl2 <- lm(mpg ~ am + cyl, data = cars) # transmission and cylinder (4, 6, or 8)
mdl3 <- lm(mpg ~ am + cyl + wt, data = cars) # adding weight (unit is 1000 lbs)
mdl4 <- lm(mpg ~ ., data = cars) # everything
anova(mdl1, mdl2, mdl3, mdl4)
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ am + cyl
## Model 3: mpg ~ am + cyl + wt
## Model 4: mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 28 264.50 2 456.40 34.2326 3.488e-07 ***
## 3 27 182.97 1 81.53 12.2300 0.00227 **
## 4 20 133.32 7 49.64 1.0639 0.42114
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
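To see where the F column comes from, the statistics can be recomputed by hand from the RSS and residual degrees of freedom printed above. This is only a worked check on the printed (rounded) values. It relies on the fact that, when several models are compared, `anova` divides each step's mean square by the residual mean square of the largest model.

```r
# Values copied from the ANOVA table above
rss <- c(720.90, 264.50, 182.97, 133.32)
res_df <- c(30, 28, 27, 20)

ss <- -diff(rss)     # sum of squares explained by each added set of terms
df <- -diff(res_df)  # degrees of freedom used by each step

# Residual mean square of the largest model (model 4)
ms_resid <- rss[4] / res_df[4]

f_stat <- (ss / df) / ms_resid
p_val <- pf(f_stat, df, res_df[4], lower.tail = FALSE)

round(f_stat, 2)  # 34.23 12.23 1.06
```

Only the last step (from model 3 to the full model) has a p-value above 0.05, which is why the analysis continues with the first three models.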
models <- list(mdl1, mdl2, mdl3)
lapply(models, coefficients)
## [[1]]
## (Intercept) amManual
## 17.147368 7.244939
##
## [[2]]
## (Intercept) amManual cyl6 cyl8
## 24.801852 2.559954 -6.156118 -10.067560
##
## [[3]]
## (Intercept) amManual cyl6 cyl8 wt
## 33.7535920 0.1501031 -4.2573185 -6.0791189 -3.1495978
All three models estimate that a manual transmission adds to a car’s miles per gallon, but by a smaller amount with each added variable: about 7.24, 2.56, and 0.15 miles per gallon, respectively.
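The transmission estimate can also be pulled out of each model directly rather than read off the full coefficient listings. The sketch below refits the three models so that it runs on its own, assuming `cars` is `mtcars` with `cyl` as a factor and `am` relabelled; if that assumption holds, the values should match the `amManual` entries above.

```r
# Assumption: `cars` is a recoded copy of the built-in mtcars data set
cars <- mtcars
cars$cyl <- factor(cars$cyl)
cars$am <- ifelse(cars$am == 1, "Manual", "Automatic")

models <- list(
  lm(mpg ~ am, data = cars),
  lm(mpg ~ am + cyl, data = cars),
  lm(mpg ~ am + cyl + wt, data = cars)
)

# Estimated change in mpg for a manual transmission, one value per model
am_effect <- sapply(models, function(m) coef(m)[["amManual"]])
round(am_effect, 2)
```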
Looking at the simple residual plots for each model doesn’t show anything too egregious such as increasing variance or other problems.
## [[1]]
## Test Pvalue
## NA NA
##
## [[2]]
## Test Pvalue
## 1.66887566 0.09514202
##
## [[3]]
## Test Pvalue
## 2.23713843 0.02527729
Looking at Pearson’s residual plots for each model, there isn’t much of a pattern (beyond whether or not features were included) for the residuals around miles per gallon.
So, only looking at the first three models, we can check the confidence intervals for transmission.
lapply(models, confint)
## [[1]]
## 2.5 % 97.5 %
## (Intercept) 14.85062 19.44411
## amManual 3.64151 10.84837
##
## [[2]]
## 2.5 % 97.5 %
## (Intercept) 22.09259815 27.511106
## amManual -0.09801611 5.217924
## cyl6 -9.30190345 -3.010332
## cyl8 -13.04201552 -7.093104
##
## [[3]]
## 2.5 % 97.5 %
## (Intercept) 27.980802 39.526382
## amManual -2.517734 2.817941
## cyl6 -7.152943 -1.361694
## cyl8 -9.533813 -2.624425
## wt -5.012761 -1.286434
In the most basic model, the issues shown by the plot (selection bias) become obvious: the interval is wide, an increase of between 3.6 and 10.8 miles per gallon for having a manual transmission. In the following models (accounting for cylinders, and then also for weight), the confidence interval includes zero. Thus, it cannot be determined conclusively that either transmission (in a vacuum) is better for miles per gallon.
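The check on whether each interval includes zero can be made explicit. The sketch below refits the three models under the same assumption about the data (`mtcars` with `cyl` as a factor and `am` relabelled) and keeps only the `amManual` row of each confidence interval; the row labels are illustrative names, not part of the original analysis.

```r
# Assumption: `cars` is a recoded copy of the built-in mtcars data set
cars <- mtcars
cars$cyl <- factor(cars$cyl)
cars$am <- ifelse(cars$am == 1, "Manual", "Automatic")

models <- list(
  lm(mpg ~ am, data = cars),
  lm(mpg ~ am + cyl, data = cars),
  lm(mpg ~ am + cyl + wt, data = cars)
)

# 95% confidence interval for the manual-transmission effect, one row per model
am_ci <- t(sapply(models, function(m) confint(m)["amManual", ]))
rownames(am_ci) <- c("am", "am + cyl", "am + cyl + wt")

# TRUE where the interval contains zero
includes_zero <- am_ci[, 1] < 0 & am_ci[, 2] > 0
cbind(round(am_ci, 2), includes_zero)
```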