At Motor Trend magazine, we received several requests to analyse the effect of manual versus automatic transmission on fuel consumption in cars. In response, we assigned one of our consultants to analyse the question, based on data gathered for 32 car models, in order to answer the following questions:
This analysis revealed that:
First of all, we are going to look at the data:
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
We see that mtcars is a data frame with 32 observations on 11 variables:
Since we are going to work with the transmission type (am), we convert it to a factor variable:
mtcars$am <- as.factor(mtcars$am)
levels(mtcars$am) <- c("Automatic", "Manual")
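The relabelling above relies on the factor levels coming out in the order 0, 1. A minimal sketch of a more explicit mapping follows, assuming the coding given in the mtcars documentation (0 = automatic, 1 = manual). It works on a copy named dat, a hypothetical name used here so that mtcars itself is not modified a second time.

```r
# Work on a fresh copy so the conversion above is not applied twice.
dat <- datasets::mtcars

# Map the numeric codes to labels explicitly: 0 -> Automatic, 1 -> Manual.
dat$am <- factor(dat$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Count the cars in each transmission group as a sanity check.
am_counts <- table(dat$am)
print(am_counts)
```

Checking the group counts right after the conversion is a cheap way to confirm that the labels were attached to the intended codes.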
Let's have a look at MPG and its distribution:
par(mfrow = c(1, 2))
boxplot(mtcars$mpg, col="blue", xlab="Miles per Gallon", ylab="Miles per Gallon", main="MPG boxplot")
h <- hist(mtcars$mpg, breaks=10, col="blue", xlab="Miles Per Gallon", main="Histogram of Miles per Gallon")
xx <-seq(min(mtcars$mpg),max(mtcars$mpg),length=40)
yy <-dnorm(xx,mean=mean(mtcars$mpg),sd=sd(mtcars$mpg))
yy <- yy * diff(h$mids[1:2])*length(mtcars$mpg)
lines(xx, yy, col="black", lwd=2)
From the above diagrams we see that MPG has a mean near 20 and is approximately normally distributed, with a standard deviation of about 6.0.
Now we are going to check how MPG changes with transmission type:
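The figures read off the plots can also be checked numerically. The sketch below uses base R only; the Shapiro-Wilk test is an addition that the original analysis does not use, and the variable names are illustrative.

```r
mpg <- datasets::mtcars$mpg

# Sample mean and standard deviation of MPG.
mean_mpg <- mean(mpg)
sd_mpg <- sd(mpg)

# Shapiro-Wilk test: the null hypothesis is that the sample comes
# from a normal distribution, so a large p-value means normality
# is not rejected (it does not prove normality).
sw <- shapiro.test(mpg)

print(c(mean = mean_mpg, sd = sd_mpg, shapiro_p = sw$p.value))
```

With only 32 observations the test has limited power, so it is best read alongside the histogram rather than instead of it.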
boxplot(mpg~am, data=mtcars, col = c("red", "green"), xlab = "Transmission", ylab = "Miles per Gallon",
main = "MPG by Transmission Type")
From the above diagram we see that:
Let's have a look at the correlation between MPG and the other variables:
a <- cor(mtcars[sapply(mtcars, is.numeric)])  # am is now a factor, so keep only numeric columns
sort(a["mpg", ])  # correlations of every numeric variable with MPG
From the above correlation data, we see that MPG is strongly correlated with weight, number of cylinders, engine displacement and horsepower (i.e. an absolute correlation greater than 0.5). All four correlations are negative: heavier and more powerful cars get fewer miles per gallon.
Next we check whether the difference between the mean MPG of automatic and manual transmission cars is statistically significant, or whether it could simply be an artefact of our sample. To find out, we run a two-sample t-test:
autoCars <- mtcars[mtcars$am == "Automatic",]
manualCars <- mtcars[mtcars$am == "Manual",]
t.test(autoCars$mpg, manualCars$mpg)
##
## Welch Two Sample t-test
##
## data: autoCars$mpg and manualCars$mpg
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean of x mean of y
## 17.14737 24.39231
Since the p-value (0.0014) is less than 0.05, we reject the null hypothesis H0 of equal means: the difference in mean MPG between automatic and manual transmission cars is statistically significant.
Now we are going to build a single-variable regression model to describe the relationship between MPG and transmission type:
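To make the test less of a black box, the Welch statistic, its degrees of freedom and the p-value can be reproduced by hand. This is a minimal sketch in base R; it reads the raw 0/1 coding of am (0 = automatic, 1 = manual, per the mtcars documentation) from a fresh copy of the data, and all variable names are illustrative.

```r
dat <- datasets::mtcars
auto <- dat$mpg[dat$am == 0]    # automatic transmission
manual <- dat$mpg[dat$am == 1]  # manual transmission

# Per-group variance of the sample mean.
v_auto <- var(auto) / length(auto)
v_manual <- var(manual) / length(manual)

# Welch t statistic: difference in means over its standard error.
t_stat <- (mean(auto) - mean(manual)) / sqrt(v_auto + v_manual)

# Welch-Satterthwaite approximation to the degrees of freedom.
df <- (v_auto + v_manual)^2 /
  (v_auto^2 / (length(auto) - 1) + v_manual^2 / (length(manual) - 1))

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), df)

print(c(t = t_stat, df = df, p = p_value))
```

The values should agree with the t, df and p-value lines of the t.test output above.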
fit <- lm(mpg~am, data = mtcars)
summary(fit)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## amManual       7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
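With a single two-level factor as predictor, the coefficients in this summary map directly onto the group means: the intercept is the mean MPG of the reference group (Automatic), and amManual is the difference between the two group means. A minimal sketch that checks this on a fresh copy of the data (dat, fit_am and the other names are illustrative):

```r
dat <- datasets::mtcars
dat$am <- factor(dat$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Mean MPG per transmission type.
group_means <- tapply(dat$mpg, dat$am, mean)

# Same single-variable model as above, fitted on the copy.
fit_am <- lm(mpg ~ am, data = dat)
intercept <- coef(fit_am)[["(Intercept)"]]
slope <- coef(fit_am)[["amManual"]]

# Compare the coefficients with the group means.
mean_diff <- group_means[["Manual"]] - group_means[["Automatic"]]
print(c(intercept = intercept, automatic_mean = group_means[["Automatic"]]))
print(c(slope = slope, mean_difference = mean_diff))
```

This also explains why the slope of about 7.2 MPG matches the gap between the two sample means reported by the t-test.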
From the above linear model, we see:
For a better fit, we are going to build a multivariable linear regression model.
First we must decide which variables to include. If we build the model with all variables included, we get:
summary(lm(mpg ~ cyl+disp+hp+drat+wt+qsec+factor(vs)+factor(am)+gear+carb, data = mtcars))$coef
##                     Estimate  Std. Error    t value   Pr(>|t|)
## (Intercept)      12.30337416 18.71788443  0.6573058 0.51812440
## cyl              -0.11144048  1.04502336 -0.1066392 0.91608738
## disp              0.01333524  0.01785750  0.7467585 0.46348865
## hp               -0.02148212  0.02176858 -0.9868407 0.33495531
## drat              0.78711097  1.63537307  0.4813036 0.63527790
## wt               -3.71530393  1.89441430 -1.9611887 0.06325215
## qsec              0.82104075  0.73084480  1.1234133 0.27394127
## factor(vs)1       0.31776281  2.10450861  0.1509915 0.88142347
## factor(am)Manual  2.52022689  2.05665055  1.2254035 0.23398971
## gear              0.65541302  1.49325996  0.4389142 0.66520643
## carb             -0.19941925  0.82875250 -0.2406258 0.81217871
From the above summary, we see that no single variable is a significant predictor of MPG once all the others are included: every p-value is above 0.05.
We use stepwise selection by AIC (stepAIC from the MASS package) to decide which variables to include:
library(MASS)
fit <- lm(mpg ~ cyl+disp+hp+drat+wt+qsec+factor(vs)+factor(am)+gear+carb, data = mtcars)
step <- stepAIC(fit, direction="both", trace=FALSE)
summary(step)$coeff
##                   Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)       9.617781  6.9595930  1.381946 1.779152e-01
## wt               -3.916504  0.7112016 -5.506882 6.952711e-06
## qsec              1.225886  0.2886696  4.246676 2.161737e-04
## factor(am)Manual  2.935837  1.4109045  2.080819 4.671551e-02
summary(step)$r.squared
## [1] 0.8496636
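The stepwise search keeps wt, qsec and am, so the selected model can also be fitted directly and compared with the single-variable model. The sketch below does this on a fresh copy of the data; the nested-model F-test (anova) and the confidence interval (confint) are additions not present in the original analysis, and the names dat, base_fit and final_fit are illustrative.

```r
dat <- datasets::mtcars
dat$am <- factor(dat$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Single-variable model and the model selected by stepAIC above.
base_fit <- lm(mpg ~ am, data = dat)
final_fit <- lm(mpg ~ wt + qsec + am, data = dat)

# Coefficients and R-squared of the selected model.
final_coef <- coef(final_fit)
final_r2 <- summary(final_fit)$r.squared

# 95% confidence interval for the manual-vs-automatic effect,
# holding weight and quarter-mile time fixed.
am_ci <- confint(final_fit)["amManual", ]

# F-test: do wt and qsec improve on the transmission-only model?
cmp <- anova(base_fit, final_fit)
print(cmp)
print(am_ci)
```

The interval for amManual is a convenient way to report the size of the transmission effect together with its uncertainty, rather than the point estimate alone.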
The above analysis shows us that: