The aim of this project is to analyze data about a collection of cars and explore the relationship between a set of variables and the miles per gallon (the outcome) of the cars. Specifically, the two questions that we are trying to answer are:
Question 1: Is an automatic or manual transmission better for MPG?
Question 2: Quantify the MPG difference between automatic and manual transmissions.
For the purpose of this analysis, we will be using the mtcars dataset provided in R.
library(datasets)
data(mtcars)
Let's take a quick look at the mtcars data.
head(mtcars,3)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
We can make a plot to see how miles per gallon varies with the kind of transmission. We also make a matrix of scatter plots to understand the relation between all the variables in the mtcars dataset. The plots are shown in the Appendix below.
From the boxplot we can see that cars with a manual transmission tend to have higher mpg than cars with an automatic transmission. To understand this further, let's perform a hypothesis test.
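The visual impression from the boxplot can be backed by a few group summaries. The following is a minimal sketch in base R; the names `trans` and `group_summary` are introduced here for illustration and are not part of the original analysis.

```r
library(datasets)
data(mtcars)

# am is coded 0 = automatic, 1 = manual (see ?mtcars)
trans <- factor(mtcars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Count, mean and median mpg for each transmission type
group_summary <- data.frame(
  n      = as.vector(table(trans)),
  mean   = as.vector(tapply(mtcars$mpg, trans, mean)),
  median = as.vector(tapply(mtcars$mpg, trans, median)),
  row.names = levels(trans)
)
print(group_summary)
```

The two groups are of unequal size, which is one reason a test that does not assume equal variances is a reasonable choice here.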
We can use a two-sample t-test to check whether the difference in mean mpg between manual and automatic cars is statistically significant. In mtcars, am = 0 denotes an automatic transmission and am = 1 a manual one. The null hypothesis is that the two transmission types have the same mean mpg.
t.test(mtcars$mpg ~ mtcars$am)
##
## Welch Two Sample t-test
##
## data: mtcars$mpg by mtcars$am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean in group 0 mean in group 1
## 17.14737 24.39231
The 95% confidence interval for the difference in means (automatic minus manual), (-11.28, -3.21), lies entirely below 0, and the p-value (0.0014) is below 0.05, so we reject the null hypothesis that the means are equal. On average, manual cars get 24.39 - 17.15 = 7.24 mpg more than automatic cars. This comparison does not account for any other variable, however, so to quantify the effect of transmission type we build a regression model.
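Because the test reports the difference as automatic minus manual, it can be easier to read with the sign flipped. The sketch below reruns the same Welch test through the formula and `data` interface and extracts the estimate and interval from the returned object; `mean_diff` and `ci_diff` are illustrative names.

```r
library(datasets)
data(mtcars)

# Same Welch two-sample t-test, using the formula + data interface
tt <- t.test(mpg ~ am, data = mtcars)

# tt$estimate holds the group means in the order (am = 0, am = 1)
mean_diff <- unname(tt$estimate[2] - tt$estimate[1])  # manual minus automatic

# The reported interval is for automatic minus manual: flip sign and order
ci_diff <- rev(-as.numeric(tt$conf.int))

cat(sprintf("Manual - automatic: %.2f mpg (95%% CI %.2f to %.2f)\n",
            mean_diff, ci_diff[1], ci_diff[2]))
```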
Next we need to look into building our model. Let's examine the fit of the base model - where mpg is the dependent variable, and am is the predictor.
fit <- lm(mpg~factor(am),data=mtcars)
summary(fit)$r.squared
## [1] 0.3597989
We can see that only 36% of the variability in mpg is explained by our base model. We need to look at building a better model to explain the variability in mpg. Putting together the pairs plot we plotted earlier and a correlation matrix of all the variables in the mtcars dataset, we should be able to extract variables that might be helpful in defining our regression model.
head(cor(mtcars),4)
## mpg cyl disp hp drat wt
## mpg 1.0000000 -0.8521620 -0.8475514 -0.7761684 0.6811719 -0.8676594
## cyl -0.8521620 1.0000000 0.9020329 0.8324475 -0.6999381 0.7824958
## disp -0.8475514 0.9020329 1.0000000 0.7909486 -0.7102139 0.8879799
## hp -0.7761684 0.8324475 0.7909486 1.0000000 -0.4487591 0.6587479
## qsec vs am gear carb
## mpg 0.4186840 0.6640389 0.5998324 0.4802848 -0.5509251
## cyl -0.5912421 -0.8108118 -0.5226070 -0.4926866 0.5269883
## disp -0.4336979 -0.7104159 -0.5912270 -0.5555692 0.3949769
## hp -0.7082234 -0.7230967 -0.2432043 -0.1257043 0.7498125
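The correlation output above shows that cyl and disp are strongly correlated with each other (about 0.90). One way to gauge how much several predictors overlap is a variance inflation factor (VIF): regress each predictor on the others and compute 1 / (1 - R^2). The sketch below does this in base R for four variables that are strongly correlated with mpg; the choice of these four and the names `candidates` and `vif` are assumptions made for illustration.

```r
library(datasets)
data(mtcars)

# Pairwise correlation between cylinders and displacement
cyl_disp_cor <- cor(mtcars$cyl, mtcars$disp)

# VIF for each candidate predictor: regress it on the other
# candidates and compute 1 / (1 - R^2)
candidates <- c("cyl", "disp", "wt", "hp")
vif <- sapply(candidates, function(v) {
  others <- setdiff(candidates, v)
  r2 <- summary(lm(reformulate(others, response = v), data = mtcars))$r.squared
  1 / (1 - r2)
})

print(round(cyl_disp_cor, 3))
print(round(vif, 2))
```

Larger values indicate that a predictor is largely explained by the others, which makes its individual coefficient harder to estimate.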
Let's consider the variables cyl, disp, hp and wt, in addition to am, to determine our best regression model. To find the model that fits best, we start with the base model that predicts mpg from am alone and then add the variables that seem to have the most impact on mpg. The variables cyl and disp are strongly correlated with each other (0.90), which suggests collinearity, so we build models that exclude both of them, include only cyl, and include both.
fit <- lm(mpg~factor(am),data=mtcars)
fit1 <- lm(mpg~factor(am)+wt+hp,data=mtcars)
fit2 <- lm(mpg~factor(am)+cyl+wt+hp,data=mtcars)
fit3 <- lm(mpg~factor(am)+cyl+disp+wt+hp,data=mtcars)
anova(fit,fit1,fit2,fit3)
## Analysis of Variance Table
##
## Model 1: mpg ~ factor(am)
## Model 2: mpg ~ factor(am) + wt + hp
## Model 3: mpg ~ factor(am) + cyl + wt + hp
## Model 4: mpg ~ factor(am) + cyl + disp + wt + hp
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 28 180.29 2 540.61 43.0841 5.576e-09 ***
## 3 27 170.00 1 10.29 1.6407 0.2115
## 4 26 163.12 1 6.88 1.0963 0.3047
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
With a p-value of 5.6 × 10^-9 we can conclude that the model fit1 is significantly better than the base model that uses only 'am' as the predictor. The other two models do not significantly outperform fit1 (p = 0.21 and p = 0.30), so we continue with fit1.
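As a complement to the nested F-tests, the same four models can be compared on adjusted R-squared and AIC, both of which penalize extra predictors. This is a sketch of one possible cross-check, not part of the original analysis; the `models` list and `comparison` data frame are illustrative names.

```r
library(datasets)
data(mtcars)

# Refit the four candidate models so the block is self-contained
models <- list(
  fit  = lm(mpg ~ factor(am), data = mtcars),
  fit1 = lm(mpg ~ factor(am) + wt + hp, data = mtcars),
  fit2 = lm(mpg ~ factor(am) + cyl + wt + hp, data = mtcars),
  fit3 = lm(mpg ~ factor(am) + cyl + disp + wt + hp, data = mtcars)
)

# Higher adjusted R-squared and lower AIC indicate a better trade-off
comparison <- data.frame(
  adj_r2 = sapply(models, function(m) summary(m)$adj.r.squared),
  aic    = sapply(models, AIC)
)
print(round(comparison, 3))
```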
summary(fit1)
##
## Call:
## lm(formula = mpg ~ factor(am) + wt + hp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4221 -1.7924 -0.3788 1.2249 5.5317
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.002875 2.642659 12.867 2.82e-13 ***
## factor(am)1 2.083710 1.376420 1.514 0.141268
## wt -2.878575 0.904971 -3.181 0.003574 **
## hp -0.037479 0.009605 -3.902 0.000546 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.538 on 28 degrees of freedom
## Multiple R-squared: 0.8399, Adjusted R-squared: 0.8227
## F-statistic: 48.96 on 3 and 28 DF, p-value: 2.908e-11
The residual plots (shown in the Appendix) suggest that the residuals are approximately normally distributed. This model explains approximately 84% of the variability in 'mpg' and fits significantly better than the base model. The coefficient of 'am' suggests that, holding 'wt' and 'hp' constant, a manual transmission gives 2.08 more miles per gallon than an automatic transmission. Similarly, holding the other predictors constant, every 1000 lb increase in weight reduces mpg by 2.88, and every 100-unit increase in horsepower reduces mpg by 3.75. However, the p-value of the 'am' coefficient (0.141) is above 0.05, so after adjusting for weight and horsepower the effect of transmission type on mpg is not statistically significant in this model.
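The uncertainty in the transmission effect is easiest to see as a confidence interval. The sketch below computes 95% intervals for the coefficients of this model, rescales the 'wt' and 'hp' effects to the units used in the text, and runs a Shapiro-Wilk test as one possible formal complement to the residual plots. The variable names are illustrative, and the Shapiro-Wilk test is an addition that the original analysis does not use.

```r
library(datasets)
data(mtcars)

fit1 <- lm(mpg ~ factor(am) + wt + hp, data = mtcars)

# 95% confidence intervals for all coefficients; pick out the am row
ci <- confint(fit1)
am_ci <- ci["factor(am)1", ]

# Effects on the scales used in the text
wt_per_1000lb <- unname(coef(fit1)["wt"])        # wt is recorded in 1000 lb
hp_per_100    <- unname(100 * coef(fit1)["hp"])  # per 100 units of horsepower

# Shapiro-Wilk test of the residuals (null hypothesis: normality)
resid_normality <- shapiro.test(residuals(fit1))

print(round(am_ci, 2))
print(round(c(wt_per_1000lb = wt_per_1000lb, hp_per_100 = hp_per_100), 2))
print(resid_normality$p.value)
```

Since the p-value of the 'am' coefficient is above 0.05, its 95% interval includes 0, which is another way of saying that the adjusted transmission effect is not distinguishable from zero in this sample.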
library(ggplot2)
p1 <- ggplot(mtcars, aes(factor(am), mpg)) +
  geom_boxplot(aes(fill = factor(am))) +
  ggtitle("MPG vs Transmission type") +
  xlab("Transmission type (Automatic=0, Manual=1)") +
  ylab("Miles per Gallon (MPG)")
plot(p1)
pairs(mtcars)
par(mfrow=c(2,2))
plot(fit1)