Summary

This report explores the relationship between a set of variables and miles per gallon (MPG), the outcome, in the mtcars car mileage dataset. We are particularly interested in the following two questions:

“Is an automatic or manual transmission better for MPG?”

“Quantify the MPG difference between automatic and manual transmissions.”

The analysis shows that although manual cars have a higher average MPG, much of this difference is explained by weight: in this dataset manual transmissions are more common in lighter cars, while automatic transmissions are more common in heavier cars. Two cars that are similar in weight but differ in transmission type are expected to have similar MPG. This conclusion carries a rather large degree of uncertainty, because the linear models do not explain all of the variance in the data.

Data loading and processing

library(datasets)
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

[, 1] mpg – Miles/(US) gallon

[, 2] cyl – Number of cylinders

[, 3] disp – Displacement (cu.in.)

[, 4] hp – Gross horsepower

[, 5] drat – Rear axle ratio

[, 6] wt – Weight (1000 lbs)

[, 7] qsec – 1/4 mile time (seconds)

[, 8] vs – Engine shape (0 = V-shaped, 1 = straight)

[, 9] am – Transmission (0 = automatic, 1 = manual)

[,10] gear – Number of forward gears

[,11] carb – Number of carburetors

A few changes help with the rest of the plotting and analysis: the transmission variable is converted to a factor with descriptive labels.

mtcars$am <- as.factor(mtcars$am)
levels(mtcars$am) <- c("Automatic", "Manual")
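As a hedged alternative to the two-step conversion above, a single factor() call can state the mapping from codes to labels explicitly, followed by a quick count per group. The copy name mt is an assumption introduced here for illustration, and datasets::mtcars is used so that the sketch starts from the unmodified data.

```r
# Sketch only: work on a copy named `mt` (hypothetical name) of the pristine dataset
mt <- datasets::mtcars

# Map code 0 to "Automatic" and code 1 to "Manual" in one explicit step
mt$am <- factor(mt$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Sanity check: number of cars per transmission type
am_counts <- table(mt$am)
print(am_counts)
```

Stating levels and labels together avoids relying on the default ordering of the factor levels.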

Exploratory data analysis

The following plot of all pairs of variables helps detect outliers that could influence the results and gives an intuition of the relationships between the variables.

pairs(mtcars, panel = panel.smooth, main = "mtcars data plot")

The differences between the MPG for cars with manual and automatic transmission can be visualized in the following box and whiskers plot:

boxplot(mpg ~ am, data = mtcars, main = "MPG by Transmission Type", ylab = "Miles per gallon")

The first row of the pairs plot shows visible correlations between the target variable (mpg) and most of the other variables. The box plot shows that cars with automatic transmission tend to have lower MPG than cars with manual transmission.
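The visual impression from the box plot can be backed by group averages. The following is a minimal sketch, assuming a relabelled copy of the data named mt (a name introduced here, not part of the original analysis); it also reports mean weight per group, since weight turns out to matter in the modeling sections.

```r
# Sketch only: `mt` is a hypothetical copy of the unmodified dataset
mt <- datasets::mtcars
mt$am <- factor(mt$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Mean MPG and mean weight (1000 lbs) for each transmission type
group_means <- aggregate(cbind(mpg, wt) ~ am, data = mt, FUN = mean)
print(group_means)

# Unadjusted difference in mean MPG, Manual minus Automatic
mpg_diff <- diff(group_means$mpg)
print(mpg_diff)
```

The unadjusted difference computed this way is the same quantity that a regression of mpg on transmission type alone estimates as its slope.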

Modeling Procedures - Simple Linear Regression

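Following the exploratory analysis, the naive approach is to fit a model with transmission type as the only predictor (model 1). For comparison, we also fit a model with weight alone (model 2) and a model with both weight and transmission type (model 3).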

# Model1: Mileage and Transmission type regression analysis
fit1 <- lm(mpg ~ am, data = mtcars)
#summary(fit1)

# Model2: Mileage and Weight regression analysis
fit2 <- lm(mpg ~ wt, data = mtcars)
#summary(fit2)

# Model3: Both the weight and the transmission type to be checked for Mileage impact
fit3 <- lm(mpg ~ am + wt, data = mtcars)
#summary(fit3)
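One way to compare the three fits is to look at the fraction of variance each explains and to test whether transmission type adds anything to the weight-only model. This is a sketch under the assumption that the data are held in a relabelled copy named mt (hypothetical name); the nested-model F-test via anova() is an addition for illustration, not a step taken in the original analysis.

```r
# Sketch only: `mt` is a hypothetical copy of the unmodified dataset
mt <- datasets::mtcars
mt$am <- factor(mt$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

fit1 <- lm(mpg ~ am, data = mt)       # transmission only
fit2 <- lm(mpg ~ wt, data = mt)       # weight only
fit3 <- lm(mpg ~ am + wt, data = mt)  # transmission and weight

# Fraction of variance explained by each model
r2 <- sapply(list(model1 = fit1, model2 = fit2, model3 = fit3),
             function(fit) summary(fit)$r.squared)
print(round(r2, 4))

# F-test on nested models: does adding transmission improve on weight alone?
nested_test <- anova(fit2, fit3)
print(nested_test)
```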

Modeling Procedures - Multivariate regression

Here we use the step() function, which performs stepwise selection by AIC, starting from the model that contains all predictors.

stepm = step(lm(data = mtcars, mpg ~ .), trace=0, steps = 10000)
#summary(stepm)
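To see which predictors the stepwise search keeps, the selected terms can be read from the fitted object. The sketch below assumes a relabelled copy of the data named mt and the helper names full and selected, all introduced here for illustration.

```r
# Sketch only: `mt`, `full` and `selected` are hypothetical names
mt <- datasets::mtcars
mt$am <- factor(mt$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Start from the model with every predictor; with no scope given,
# step() defaults to backward elimination by AIC
full <- lm(mpg ~ ., data = mt)
stepm <- step(full, trace = 0)

# Predictors retained by the search, and their estimated coefficients
selected <- attr(terms(stepm), "term.labels")
print(selected)
print(summary(stepm)$coef)
```

Stepwise selection is data-driven, so the retained terms are best read as a candidate model rather than as evidence that the dropped variables have no effect.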

Model Comparison

Residuals

To visualize the variance left unexplained by each model, we examine the residuals as a function of car weight. The standard diagnostic plots for each model are in the appendix.

The residuals for model 1 show a linear pattern against weight, which means that adding weight as a feature can reduce them. Model 1 has larger residuals than the other two models because it explains a smaller fraction of the variance in the data. The residuals of models 2 and 3 are almost identical. They do, however, grow as the weight moves away from its mean, so reducing them further would require polynomial features or a non-linear model.
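A residual-versus-weight plot of the kind described here can be sketched as follows. The copy name mt, the list fits and the summary vector res_sd are assumptions introduced for illustration; the three models are refit inside the block so that it runs on its own.

```r
# Sketch only: `mt`, `fits` and `res_sd` are hypothetical names
mt <- datasets::mtcars
mt$am <- factor(mt$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

fits <- list(
  "Model 1: am"      = lm(mpg ~ am, data = mt),
  "Model 2: wt"      = lm(mpg ~ wt, data = mt),
  "Model 3: am + wt" = lm(mpg ~ am + wt, data = mt)
)

# One panel per model: residuals on the y axis, car weight on the x axis
op <- par(mfrow = c(1, 3))
for (name in names(fits)) {
  plot(mt$wt, resid(fits[[name]]), main = name,
       xlab = "Weight (1000 lbs)", ylab = "Residual (MPG)")
  abline(h = 0, lty = 2)  # reference line at zero
}
par(op)  # restore the previous graphics settings

# Spread of the residuals for each model
res_sd <- sapply(fits, function(fit) sd(resid(fit)))
print(round(res_sd, 3))
```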

Coefficients

The estimated coefficients for the three models are:

summary(fit1)$coef
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 17.147368   1.124603 15.247492 1.133983e-15
## amManual     7.244939   1.764422  4.106127 2.850207e-04
summary(fit2)$coef
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 37.285126   1.877627 19.857575 8.241799e-19
## wt          -5.344472   0.559101 -9.559044 1.293959e-10
summary(fit3)$coef
##                Estimate Std. Error     t value     Pr(>|t|)
## (Intercept) 37.32155131  3.0546385 12.21799285 5.843477e-13
## amManual    -0.02361522  1.5456453 -0.01527855 9.879146e-01
## wt          -5.35281145  0.7882438 -6.79080719 1.867415e-07

The 95% confidence interval for the transmission coefficient in model 3 is:

summary(fit3)$coef[2, 1] + c(-1, 1) * qt(.975, df = fit3$df.residual) * summary(fit3)$coef[2, 2]
## [1] -3.184815  3.137584
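The same interval can be obtained with the built-in confint() function, which avoids indexing the coefficient table by position. The sketch below assumes a relabelled copy of the data named mt (hypothetical name) and refits model 3 so that it is self-contained; it also recomputes the interval by hand for comparison.

```r
# Sketch only: `mt` is a hypothetical copy of the unmodified dataset
mt <- datasets::mtcars
mt$am <- factor(mt$am, levels = c(0, 1), labels = c("Automatic", "Manual"))
fit3 <- lm(mpg ~ am + wt, data = mt)

# Built-in 95% interval for the transmission coefficient
ci_builtin <- confint(fit3, "amManual", level = 0.95)
print(ci_builtin)

# Manual computation: estimate +/- t quantile * standard error
coefs <- coef(summary(fit3))
est <- coefs["amManual", "Estimate"]
se <- coefs["amManual", "Std. Error"]
ci_manual <- est + c(-1, 1) * qt(0.975, df = df.residual(fit3)) * se
print(ci_manual)
```

Selecting the coefficient by name keeps the code correct if the order of the model terms changes.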

The coefficient for the transmission variable in model 3 is estimated at -0.0236: at equal weight, a car with manual transmission is expected to get 0.0236 fewer miles per gallon than a car with automatic transmission. The 95% confidence interval, (-3.1848, 3.1376), is wide relative to the estimate and contains zero (p = 0.988), so the data do not show a transmission effect once weight is accounted for. For comparison, an increase in weight of 1000 lbs lowers the expected MPG by 5.3528.

Conclusion

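Taken on its own, transmission type appears to matter: in model 1, manual cars average 7.24 MPG more than automatic cars (p < 0.001). Once weight is included (model 3), the estimated difference drops to -0.02 MPG with a 95% confidence interval of (-3.18, 3.14), so we cannot conclude that either transmission type is better for MPG among cars of similar weight. The linear models are a reasonable but incomplete fit, and with only 32 cars the estimate remains uncertain.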

Appendix

par(mfrow = c(1, 4))
plot(fit1)

plot(fit2)

plot(fit3)