Executive Summary

At Motor Trend magazine, we received several requests to analyse the effect of manual versus automatic transmission on fuel consumption in cars. In response, we assigned one of our consultants to analyse data gathered for 32 car models and answer the following questions:

  1. Is an automatic or manual transmission better for MPG?
  2. How large is the MPG difference between automatic and manual transmissions?

This analysis revealed that:

  1. Transmission type has a significant effect on fuel consumption: cars with automatic transmission have lower MPG than cars with manual transmission
  2. Among the other parameters recorded for each car, weight and acceleration (quarter-mile time, qsec) are the most important predictors of fuel usage alongside transmission type
  3. After adjusting for weight and quarter-mile time, automatic transmission cars achieve about 2.94 MPG less than manual transmission cars

Some facts about the data

First of all, we are going to look at the data:

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

mtcars is a data frame with 32 observations on 11 variables.

Data Preparation

As we are going to work with the transmission type (am), we will convert it to a factor variable:

mtcars$am <- as.factor(mtcars$am)
levels(mtcars$am) <- c("Automatic", "Manual")
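As a quick sanity check on the conversion, the sketch below labels the levels explicitly and counts the cars in each group. It relies on the coding documented for mtcars (0 = automatic, 1 = manual). The name `cars_df` is only illustrative; the sketch works on a copy so that the built-in dataset is left untouched.

```r
# Work on a copy of the built-in dataset (cars_df is an illustrative name)
cars_df <- datasets::mtcars

# Map the documented codes explicitly: 0 = automatic, 1 = manual
cars_df$am <- factor(cars_df$am, levels = c(0, 1),
                     labels = c("Automatic", "Manual"))

# Count the cars in each transmission group
am_counts <- table(cars_df$am)
print(am_counts)
```

Naming both `levels` and `labels` makes the mapping visible in the code, rather than depending on the order in which `as.factor` happens to sort the codes.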

Exploratory data analysis

Let's have a look at MPG and its distribution:

par(mfrow = c(1, 2))
boxplot(mtcars$mpg, col="blue", xlab="Miles per Gallon", ylab="Miles per Gallon", main="MPG boxplot")
h <- hist(mtcars$mpg, breaks=10, col="blue", xlab="Miles Per Gallon", main="Histogram of Miles per Gallon")
xx <-seq(min(mtcars$mpg),max(mtcars$mpg),length=40)
yy <-dnorm(xx,mean=mean(mtcars$mpg),sd=sd(mtcars$mpg))
yy <- yy * diff(h$mids[1:2])*length(mtcars$mpg)
lines(xx, yy, col="black", lwd=2)

From the above diagrams we see that MPG has a mean near 20 and is roughly normally distributed, with a standard deviation of about 6.0.
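The summary figures can be recomputed directly, and the visual impression of normality can be complemented with a formal check. The following is a minimal sketch, assuming a Shapiro-Wilk test and a normal Q-Q plot are acceptable checks for a sample of this size; the variable names are illustrative.

```r
# MPG values from the built-in dataset
mpg <- datasets::mtcars$mpg

# Sample mean and standard deviation
mpg_mean <- mean(mpg)
mpg_sd <- sd(mpg)
print(round(c(mean = mpg_mean, sd = mpg_sd), 2))

# Shapiro-Wilk test: a small p-value would indicate departure from normality
normality <- shapiro.test(mpg)
print(normality)

# Normal Q-Q plot: points close to the line support approximate normality
qqnorm(mpg)
qqline(mpg)
```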

Now we are going to check how MPG changes with transmission type:

boxplot(mpg~am, data=mtcars, col = c("red", "green"), xlab = "Transmission", ylab = "Miles per Gallon",
        main = "MPG by Transmission Type")

From the above diagram we see that manual transmission cars tend to have higher MPG than automatic transmission cars.

Let's have a look at the correlation between MPG and the other variables:

# am is now a factor, so restrict the correlation matrix to the numeric columns
a <- cor(mtcars[, sapply(mtcars, is.numeric)])
sort(a["mpg", ])

From the above correlation data, we see that MPG is strongly negatively correlated with weight, number of cylinders, engine displacement and horsepower (absolute value well above 0.5)

Hypothesis testing

Now we are going to check whether the difference in mean MPG between automatic and manual transmission cars is statistically significant, or whether it could simply be an artefact of our sample. To find out, we will run a two-sample t-test:

autoCars <- mtcars[mtcars$am == "Automatic",]
manualCars <- mtcars[mtcars$am == "Manual",]
t.test(autoCars$mpg, manualCars$mpg)
## 
##  Welch Two Sample t-test
## 
## data:  autoCars$mpg and manualCars$mpg
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean of x mean of y 
##  17.14737  24.39231

From this test we see that the p-value (0.0014) is less than 0.05, so H0 (no difference in means) is rejected: the difference in mean MPG between automatic and manual transmission cars (17.15 versus 24.39) is statistically significant.
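The same Welch test can also be written with the formula interface of `t.test`, which avoids splitting the data frame by hand. The sketch below extracts the estimated difference in means and its confidence interval; the names `cars_df`, `welch`, `mean_diff` and `ci` are illustrative.

```r
# Copy of the data with transmission as a labelled factor
cars_df <- datasets::mtcars
cars_df$am <- factor(cars_df$am, levels = c(0, 1),
                     labels = c("Automatic", "Manual"))

# Welch two-sample t-test (unequal variances is the default)
welch <- t.test(mpg ~ am, data = cars_df)

# Difference in group means: Automatic minus Manual
mean_diff <- unname(welch$estimate[1] - welch$estimate[2])

# 95% confidence interval for that difference
ci <- welch$conf.int

print(mean_diff)
print(ci)
print(welch$p.value)
```

The interval lies entirely below zero, which is another way of reading the rejection of H0.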

Single variable linear regression model

Now we are going to build a single-variable regression model to describe the relationship between MPG and transmission type:

fit <- lm(mpg~am, data = mtcars)
summary(fit)
## 
## Call:
## lm(formula = mpg ~ am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## amManual       7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

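From the above linear model, we see that automatic cars average 17.15 MPG (the intercept) and that manual cars average 7.25 MPG more (the amManual coefficient). Transmission type alone, however, explains only about 36% of the variance in MPG (R-squared 0.36).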
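With a single two-level factor as predictor, the regression coefficients are just the group means in another form: the intercept is the mean of the reference group and the slope is the difference between the two group means. A short sketch to confirm this, using illustrative names:

```r
# Copy of the data with transmission as a labelled factor
cars_df <- datasets::mtcars
cars_df$am <- factor(cars_df$am, levels = c(0, 1),
                     labels = c("Automatic", "Manual"))

# Mean MPG per transmission group
group_means <- tapply(cars_df$mpg, cars_df$am, mean)

# Single-variable regression of MPG on transmission type
fit_am <- lm(mpg ~ am, data = cars_df)
coefs <- coef(fit_am)

# Intercept = mean of Automatic; slope = Manual mean minus Automatic mean
print(group_means)
print(coefs)
```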

Multi variable linear regression model

For a better fit, we are going to build a multi-variable linear regression model.

Which variables to include?

First we should decide which variables to include. If we build the model with all variables included, we get:

summary(lm(mpg ~ cyl+disp+hp+drat+wt+qsec+factor(vs)+factor(am)+gear+carb, data = mtcars))$coef
##                     Estimate  Std. Error    t value   Pr(>|t|)
## (Intercept)      12.30337416 18.71788443  0.6573058 0.51812440
## cyl              -0.11144048  1.04502336 -0.1066392 0.91608738
## disp              0.01333524  0.01785750  0.7467585 0.46348865
## hp               -0.02148212  0.02176858 -0.9868407 0.33495531
## drat              0.78711097  1.63537307  0.4813036 0.63527790
## wt               -3.71530393  1.89441430 -1.9611887 0.06325215
## qsec              0.82104075  0.73084480  1.1234133 0.27394127
## factor(vs)1       0.31776281  2.10450861  0.1509915 0.88142347
## factor(am)Manual  2.52022689  2.05665055  1.2254035 0.23398971
## gear              0.65541302  1.49325996  0.4389142 0.66520643
## carb             -0.19941925  0.82875250 -0.2406258 0.81217871

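From the above summary, we see that no variable is individually a significant predictor of MPG at the 0.05 level (all p-values are above 0.05). This often happens when predictors are strongly correlated with one another, so we need a more selective model.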
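One way to gauge how strongly the predictors overlap is the variance inflation factor (VIF). The sketch below is an assumption-laden illustration rather than part of the original analysis: it computes VIFs in base R as the diagonal of the inverse of the predictors' correlation matrix, treating the two-level variables vs and am as 0/1 numerics.

```r
# All predictors in their original numeric coding (everything except mpg)
raw <- datasets::mtcars
predictors <- raw[, setdiff(names(raw), "mpg")]

# VIF_j is the j-th diagonal element of the inverse correlation matrix
vif <- diag(solve(cor(predictors)))
names(vif) <- colnames(predictors)

# Larger values mean the predictor is largely explained by the others
print(round(sort(vif, decreasing = TRUE), 1))
```

A common rule of thumb treats values above roughly 5 to 10 as a sign of problematic collinearity.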

Stepwise variable selection

We use stepwise selection based on AIC to decide which variables to include:

library(MASS)
fit <- lm(mpg ~ cyl+disp+hp+drat+wt+qsec+factor(vs)+factor(am)+gear+carb, data = mtcars)
step <- stepAIC(fit, direction="both", trace=FALSE)
summary(step)$coeff
##                   Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)       9.617781  6.9595930  1.381946 1.779152e-01
## wt               -3.916504  0.7112016 -5.506882 6.952711e-06
## qsec              1.225886  0.2886696  4.246676 2.161737e-04
## factor(am)Manual  2.935837  1.4109045  2.080819 4.671551e-02
summary(step)$r.squared
## [1] 0.8496636

The above analysis shows us that:

  • Apart from transmission type, weight and acceleration (quarter-mile time, qsec) have a strong impact on MPG; holding them fixed, manual cars achieve about 2.94 MPG more than automatic cars
  • The R-squared value (0.85) shows that the model explains about 85% of the variance in MPG in this sample, so it fits the data well
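To quantify the uncertainty around the transmission effect, the selected model can be refitted directly and a confidence interval requested for its coefficients. The sketch below also compares it with the single-variable model using an F-test. It assumes the selected formula is mpg ~ wt + qsec + am, as in the stepwise output; the object names are illustrative.

```r
# Copy of the data with transmission as a labelled factor
cars_df <- datasets::mtcars
cars_df$am <- factor(cars_df$am, levels = c(0, 1),
                     labels = c("Automatic", "Manual"))

# Refit the model chosen by the stepwise procedure
final_fit <- lm(mpg ~ wt + qsec + am, data = cars_df)

# 95% confidence interval for the manual-vs-automatic effect
am_ci <- confint(final_fit)["amManual", ]
print(am_ci)

# Nested-model F-test: does adding wt and qsec improve on mpg ~ am?
base_fit <- lm(mpg ~ am, data = cars_df)
comparison <- anova(base_fit, final_fit)
print(comparison)
```

The interval for amManual is wide relative to the point estimate, so the 2.94 MPG figure is best read as an estimate with considerable uncertainty.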

End of Report