library(ggplot2)

Executive summary

This report examines the influence of transmission type (automatic or manual) on miles per gallon (MPG) for 32 cars. The data come from the standard mtcars dataset. Exploratory analysis shows that cars with manual transmission typically have higher MPG than automatic ones. A linear model with an adjusted R squared of 88% was obtained. Under this model, the difference in MPG between manual and automatic transmission at equal qsec is 14.08 - 4.14*weight, with weight in units of 1000 lbs. Analysis of Cook’s distances showed no highly influential data points.

1. Quick look at the data

First, here are the dimensions of the data set and its first 4 rows.

## [1] 32 11
##                 mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710     22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
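
The output above can be reproduced with base R alone. This is a minimal sketch, assuming the built-in mtcars data frame has not been modified; the names dims and first_rows are illustrative and not part of the original analysis.

```r
# mtcars ships with base R (package datasets), so nothing needs to be loaded
dims <- dim(mtcars)             # number of rows (cars) and columns (variables)
first_rows <- head(mtcars, 4)   # first 4 rows of the data frame

dims
first_rows
```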

Second, we estimate the average mpg for each transmission type, automatic and manual respectively.

## average mpg for automatic and manual transmission
round(c(mean(mtcars$mpg[mtcars$am == 0]), mean(mtcars$mpg[mtcars$am == 1])), 2)
## [1] 17.15 24.39

Fig. 1 in the Appendix shows the distribution of mpg (median and quartiles) for each transmission type. Fig. 2 in the Appendix shows linear regression lines of mpg on hp fitted separately for the two transmission types. The slopes are nearly equal, so the gap between the two lines is mostly a constant offset attributable to transmission type, as in the model mpg~hp+am. Together these suggest a strong dependence of miles per gallon on transmission type.

2. Regression model

Let’s look at a linear regression using all features:

coef(summary(lm(mpg ~ ., data = mtcars)))
##                Estimate  Std. Error    t value   Pr(>|t|)
## (Intercept) 12.30337416 18.71788443  0.6573058 0.51812440
## cyl         -0.11144048  1.04502336 -0.1066392 0.91608738
## disp         0.01333524  0.01785750  0.7467585 0.46348865
## hp          -0.02148212  0.02176858 -0.9868407 0.33495531
## drat         0.78711097  1.63537307  0.4813036 0.63527790
## wt          -3.71530393  1.89441430 -1.9611887 0.06325215
## qsec         0.82104075  0.73084480  1.1234133 0.27394127
## vs           0.31776281  2.10450861  0.1509915 0.88142347
## am           2.52022689  2.05665055  1.2254035 0.23398971
## gear         0.65541302  1.49325996  0.4389142 0.66520643
## carb        -0.19941925  0.82875250 -0.2406258 0.81217871

No feature is significant at the 5% level. We follow the advice to keep features whose t value has absolute value greater than 1: wt, am and qsec. Weight and transmission type are natural choices. The intuition for qsec is that we would like horsepower, engine type, number of cylinders and displacement to enter as a single feature, and qsec (quarter-mile time) is a good candidate because it results from the combination of all of them.
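
The selection rule above can be sketched in code. The names full_fit, t_vals and selected are illustrative and not part of the original analysis; the cutoff of 1 is the one stated in the text.

```r
# Fit the full model and pull the t values of all predictors
full_fit <- lm(mpg ~ ., data = mtcars)
t_vals <- coef(summary(full_fit))[-1, "t value"]   # drop the intercept row

# Keep predictors whose t value exceeds 1 in absolute value
selected <- names(t_vals)[abs(t_vals) > 1]
selected
```

With the coefficient table shown above, this keeps wt, qsec and am.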

That strategy leads to two models: mpg~qsec+wt+am and mpg~qsec+wt*am. Comparing adjusted R squared values, 83% vs 88%, we chose the model with the interaction between weight and transmission type, which explains 88% of the variance of mpg.

fit_alt <- lm(mpg ~ qsec + wt + am, data = mtcars)  # summary(fit_alt)
fit <- lm(mpg ~ qsec + wt * am, data = mtcars)      # summary(fit)
round(c(summary(fit_alt)$adj.r.squared, summary(fit)$adj.r.squared), 2)
## [1] 0.83 0.88
coef(fit)
## (Intercept)        qsec          wt          am       wt:am 
##    9.723053    1.016974   -2.936531   14.079428   -4.141376

The coefficients mean that, with weight and transmission type held fixed, every additional second of qsec adds about one mile per gallon. With qsec held constant, switching from automatic to manual transmission changes mpg by 14.08 - 4.14*weight, that is, the am coefficient plus the wt:am coefficient times weight, where weight is wt in units of 1000 lbs. Interestingly, there is a weight, weight0, at which cars with manual and automatic transmission have equal mpg. We find it by solving am + wt:am*weight0 = 0, which gives weight0 = 3.40, i.e. about 3,400 lbs. Heavier cars with manual transmission have lower predicted mpg than automatic ones with equal qsec.
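
Under the fitted model, the manual-minus-automatic difference in mpg at fixed qsec and weight is the am coefficient plus the wt:am coefficient times weight. The sketch below computes that difference and the weight at which it is zero. The helper am_effect and the names b and weight0 are illustrative and not part of the original analysis.

```r
fit <- lm(mpg ~ qsec + wt * am, data = mtcars)
b <- coef(fit)

# Predicted mpg difference (manual minus automatic) at a given weight,
# with qsec held equal; wt is measured in units of 1000 lbs
am_effect <- function(wt) unname(b["am"] + b["wt:am"] * wt)

# Weight at which both transmission types have equal predicted mpg
weight0 <- unname(-b["am"] / b["wt:am"])

round(weight0, 2)   # crossover weight, in 1000 lbs
am_effect(2)        # positive: manual is better for a light car
am_effect(5)        # negative: automatic is better for a heavy car
```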

3. Residuals and influence

The top two plots in Fig. 3 show that the residuals have no obvious pattern against the fitted values (Residuals vs Fitted) and are approximately normally distributed (Normal Q-Q). The maximal Cook’s distance of the fit is:

max(cooks.distance(fit))
## [1] 0.225106

As shown in Fig. 4, no Cook’s distance stands out from the rest, and the largest value, 0.23, is well below the common rule-of-thumb threshold of 1, so there are no highly influential points.
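
The influence check can also be done numerically rather than by eye. This is a minimal sketch; the cutoff of 1 is a common rule of thumb and an assumption here, not a value taken from the original analysis, and the names cd and influential are illustrative.

```r
fit <- lm(mpg ~ qsec + wt * am, data = mtcars)
cd <- cooks.distance(fit)            # one value per car, named by row

# Cars exceeding the assumed rule-of-thumb cutoff of 1
influential <- names(cd)[cd > 1]
length(influential)

# The three cars with the largest Cook's distance, for inspection
head(sort(cd, decreasing = TRUE), 3)
```

Other cutoffs are in use, and a stricter one may flag some cars, so the result is best read together with the plot in Fig. 4.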

Appendix

Fig. 1. Boxplot of mpg by transmission type
# am is numeric in mtcars, so convert it to a factor to get one box per type
ggplot(mtcars, aes(x = factor(am), y = mpg, colour = factor(am))) + 
        geom_boxplot() + 
        #labs(title = "Fig.1: boxplot of mpg by transmission type") + 
        scale_colour_discrete(labels = c("automatic", "manual"), 
                              name = "Transmission type")

Fig. 2. Regression lines of mpg~hp by transmission type
ggplot(mtcars) + 
        geom_jitter(aes(hp, mpg, colour = factor(am)), size = 3, alpha = .7) +
        geom_smooth(aes(hp, mpg, colour = factor(am)), method = lm) + 
        #labs(title = "Fig.2: Regression lines of mpg~hp by transmission type")+
        scale_colour_discrete(labels = c("automatic", "manual"), 
                              name = "Transmission type")

Fig. 3. Diagnostic plots (residuals, leverage and influence) of the chosen linear model.
par(mfrow=c(2, 2))
plot(fit)

Fig. 4. Plot of Cook’s distance for the chosen linear model.
plot(cooks.distance(fit))