Motor Trend is a magazine about the automobile industry. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and miles per gallon (MPG), the outcome. They are particularly interested in the following two questions: whether an automatic or a manual transmission is better for MPG, and how large the MPG difference between the two transmission types is.
We use hypothesis testing and multiple linear regression to analyze the relationship between MPG and the other variables, including the transmission type (automatic or manual).
We conclude that manual transmission is better for MPG than automatic transmission.
The other variables in the final model are weight and quarter-mile time (acceleration). Both have a significant effect on MPG and must be accounted for when quantifying the MPG difference between automatic and manual transmission cars.
We load the mtcars data set and look at its column names and their contents.
data(mtcars)
names(mtcars)
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
The predictor variable, am, is of class numeric. Since it is a dichotomous variable, let’s convert it to a factor and label the levels as Automatic and Manual.
mtcars$am <- as.factor(mtcars$am)
levels(mtcars$am) <- c("Automatic", "Manual")
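The same conversion can also be written in one step with factor(), which ties each label to an explicit code instead of relying on the order of the levels. This is a minimal sketch, assuming the coding documented in ?mtcars (0 = automatic, 1 = manual); the copy named cars is a hypothetical name, used so that mtcars itself is left untouched.

```r
# Work on a copy so the analysis data set is not modified (cars is a hypothetical name)
cars <- datasets::mtcars

# Map code 0 to "Automatic" and code 1 to "Manual" explicitly
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Count the cars in each transmission group
am_counts <- table(cars$am)
print(am_counts)
```

The table shows that the two groups are unequal in size (19 automatic and 13 manual cars), which is worth keeping in mind when comparing their means.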
We begin the exploratory data analysis by looking at the pairwise scatter plots between all variables. (Plots shown in Appendix 1.)
Before modeling our variable of interest, MPG, we check whether it approximately follows a Normal distribution and whether there are any outliers.
par(mfrow = c(1, 2))
# Histogram with a fitted Normal curve, as a visual check of Normality
x <- mtcars$mpg
h <- hist(x, breaks = 10, col = "green", xlab = "Miles Per Gallon",
          main = "Histogram of Miles per Gallon")
xfit <- seq(min(x), max(x), length = 40)
yfit <- dnorm(xfit, mean = mean(x), sd = sd(x))
# Scale the Normal density to the histogram's count axis (bin width * n)
yfit <- yfit * diff(h$mids[1:2]) * length(x)
lines(xfit, yfit, col = "red", lwd = 3)
# Kernel Density Plot
d <- density(mtcars$mpg)
plot(d, xlab = "MPG", main ="Density Plot for MPG")
From the histogram, MPG seems to follow an approximately Normal distribution, and we don't see any outliers.
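The visual impression can be backed up with a formal test. The sketch below uses the Shapiro-Wilk test from base R; it is an addition to the original analysis, and sw_mpg is a hypothetical name.

```r
# Shapiro-Wilk test: the null hypothesis is that the sample comes from a Normal distribution
sw_mpg <- shapiro.test(datasets::mtcars$mpg)
print(sw_mpg)

# TRUE means Normality is not rejected at the 5% level
sw_mpg$p.value > 0.05
```

With only 32 observations the test has limited power, so it complements the plots rather than replacing them.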
We can check how mpg varies between automatic and manual transmission using a boxplot.
boxplot(mpg ~ am, data = mtcars,
        col = c("dark green", "light green"),
        xlab = "Transmission",
        ylab = "Miles per Gallon",
        main = "MPG by Transmission Type")
From the boxplot we see that manual transmission cars get more miles per gallon than automatic ones. However, we need a formal test to confirm this.
Null Hypothesis (H0):
There is no difference in mean Miles per Gallon (MPG) between Automatic and Manual transmission.
Alternate Hypothesis (Ha):
There is a difference in mean Miles per Gallon (MPG) between Automatic and Manual transmission.
aggregate(mpg~am, data = mtcars, mean)
## am mpg
## 1 Automatic 17.14737
## 2 Manual 24.39231
The mean MPG for manual transmission is 24.39231, whereas that for automatic transmission is 17.14737.
Thus the mean MPG of cars with manual transmission is 7.245 higher than that of cars with automatic transmission. (We have not yet considered confounding variables.)
We will run a t-test at a significance level (alpha) of 0.05 to find out whether the difference is significant.
autoTrans <- mtcars[mtcars$am == "Automatic",]
manualTrans <- mtcars[mtcars$am == "Manual",]
t.test(autoTrans$mpg, manualTrans$mpg)
##
## Welch Two Sample t-test
##
## data: autoTrans$mpg and manualTrans$mpg
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean of x mean of y
## 17.14737 24.39231
The p-value is 0.001374, well below 0.05, hence we reject the null hypothesis and conclude that there is a significant difference in mean MPG between cars with manual transmission and cars with automatic transmission.
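To make the test less of a black box, the Welch statistic, its degrees of freedom and the p-value can be recomputed by hand. This is a sketch using the standard Welch-Satterthwaite formulas; all variable names are hypothetical, and the numeric coding of am (0 = automatic, 1 = manual) is taken from ?mtcars.

```r
cars <- datasets::mtcars            # untouched copy; am is coded 0 = automatic, 1 = manual
auto_mpg   <- cars$mpg[cars$am == 0]
manual_mpg <- cars$mpg[cars$am == 1]

# Squared standard error of each group mean
se2_auto   <- var(auto_mpg) / length(auto_mpg)
se2_manual <- var(manual_mpg) / length(manual_mpg)

# Welch t statistic (automatic minus manual, matching the t.test() call above)
t_stat <- (mean(auto_mpg) - mean(manual_mpg)) / sqrt(se2_auto + se2_manual)

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (se2_auto + se2_manual)^2 /
  (se2_auto^2 / (length(auto_mpg) - 1) + se2_manual^2 / (length(manual_mpg) - 1))

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = df_welch)
c(t = t_stat, df = df_welch, p = p_value)
```

The three values should agree with the t, df and p-value lines of the t.test() output shown above.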
Now we need to quantify the difference, as per the second question.
Since we are interested in determining the relationship between mpg and the other variables, we first check the correlation between mpg and the other variables using the cor() function.
# Reload the original data: cor() needs numeric columns, and am was converted to a factor above
data(mtcars)
sort(cor(mtcars)[1,])
## wt cyl disp hp carb qsec
## -0.8676594 -0.8521620 -0.8475514 -0.7761684 -0.5509251 0.4186840
## gear am vs drat mpg
## 0.4802848 0.5998324 0.6640389 0.6811719 1.0000000
We see that our variable of interest, am, is moderately positively correlated with the dependent variable mpg (r = 0.60).
Variables showing positive correlation with mpg, in descending order of strength, are drat, vs, am, gear and qsec.
Variables showing negative correlation with mpg, in descending order of strength, are wt, cyl, disp, hp and carb.
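Whether the correlation between mpg and am is distinguishable from zero can be checked with cor.test() from base R. This is a sketch added for illustration; ct and cars are hypothetical names.

```r
cars <- datasets::mtcars             # untouched copy with numeric am
ct <- cor.test(cars$mpg, cars$am)    # Pearson correlation with a t-based test

ct$estimate      # sample correlation between mpg and am
ct$p.value       # p-value for H0: the true correlation is zero
ct$estimate^2    # share of the variance in mpg that am alone accounts for
```

The squared correlation is the same quantity as the R^2 of a simple regression of mpg on am, so it gives a preview of how much transmission type explains on its own.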
fit <- lm(mpg~am, data = mtcars)
summary(fit)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## am 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
Looking at the intercept and the coefficient, we can say that, on average, automatic cars get 17.147 MPG and manual transmission cars get 17.147 + 7.245 = 24.392 MPG. (We have not yet considered confounding variables.)
In addition, the R^2 value is 0.3598, which means that transmission type alone explains only 35.98% of the variance in mpg.
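With a single 0/1 predictor, the regression simply reproduces the two group means: the intercept is the mean of the am = 0 group and the slope is the difference between the group means. The sketch below checks this and adds a confidence interval for the difference; fit_simple, group_means and cars are hypothetical names.

```r
cars <- datasets::mtcars                         # untouched copy with numeric am
fit_simple <- lm(mpg ~ am, data = cars)          # same specification as fit above
group_means <- tapply(cars$mpg, cars$am, mean)   # mean mpg for am = 0 and am = 1

coef(fit_simple)[["(Intercept)"]]   # equals the automatic-group mean
coef(fit_simple)[["am"]]            # equals the difference between the group means
confint(fit_simple)["am", ]         # 95% confidence interval for that difference
```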
# We use the stepwise algorithm (step(), backward elimination by AIC) to select the best model
bestFitModel <- step(lm(mpg ~ ., data = mtcars), trace = 0, steps = 10000)
summary(bestFitModel)
##
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4811 -1.5555 -0.7257 1.4110 4.6610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.6178 6.9596 1.382 0.177915
## wt -3.9165 0.7112 -5.507 6.95e-06 ***
## qsec 1.2259 0.2887 4.247 0.000216 ***
## am 2.9358 1.4109 2.081 0.046716 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared: 0.8497, Adjusted R-squared: 0.8336
## F-statistic: 52.75 on 3 and 28 DF, p-value: 1.21e-11
This shows that, apart from transmission, the weight of the vehicle and its acceleration (quarter-mile time) act as confounding variables in explaining the variation in mpg. The adjusted R^2 is 0.8336, which means that the model explains about 83% of the variation in mpg, a large improvement over the 36% explained by transmission alone.
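The stepwise selection can be sanity-checked by comparing the AIC of the full model with that of the selected model. This is a sketch with hypothetical names (full_model, reduced_model, cars). Note that step() uses extractAIC() internally, which differs from AIC() by an additive constant for a given data set, so the ranking of models is the same.

```r
cars <- datasets::mtcars                                  # untouched copy with numeric am
full_model    <- lm(mpg ~ ., data = cars)                 # all ten predictors
reduced_model <- lm(mpg ~ wt + qsec + am, data = cars)    # model chosen by step()

# Lower AIC is better; backward elimination drops terms while doing so lowers the AIC
AIC(full_model, reduced_model)
```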
# As seen from the stepwise regression, we select the model with 3 variables (wt, qsec and am), which accounts for about 83% of the total variance (adjusted R^2).
bestFitModel <- lm(mpg~am + wt + qsec, data = mtcars)
anova(fit, bestFitModel)
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ am + wt + qsec
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 28 169.29 2 551.61 45.618 1.55e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value is very small (1.55e-09), hence we reject the null hypothesis that wt and qsec add nothing to the model, and conclude that the multivariable model fits significantly better than the simple linear regression model.
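The F statistic in the table above comes directly from the two residual sums of squares. The sketch below recomputes it by hand with the usual nested-model F formula; m1, m2 and the other names are hypothetical.

```r
cars <- datasets::mtcars                       # untouched copy with numeric am
m1 <- lm(mpg ~ am, data = cars)                # simple model
m2 <- lm(mpg ~ am + wt + qsec, data = cars)    # model with wt and qsec added

rss1 <- deviance(m1)                           # residual sum of squares, model 1
rss2 <- deviance(m2)                           # residual sum of squares, model 2
df_extra <- df.residual(m1) - df.residual(m2)  # number of added terms

# Drop in RSS per added term, relative to the residual variance of the larger model
f_stat <- ((rss1 - rss2) / df_extra) / (rss2 / df.residual(m2))
p_val <- pf(f_stat, df_extra, df.residual(m2), lower.tail = FALSE)
c(F = f_stat, p = p_val)
```

Plugging in the table values gives (551.61 / 2) / (169.29 / 28), which is approximately 45.6.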
Before finalizing our model, it is important to check the residuals for any signs of non-normality and to examine the residuals vs. fitted values plot for heteroskedasticity.
This check and the relevant plots are shown in Appendix 2.
The residuals appear approximately Normal and show no evidence of heteroskedasticity.
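The plots can be complemented with numeric checks using base R only. The sketch below is an addition to the original analysis: it runs a Shapiro-Wilk test on the residuals and computes a studentized Breusch-Pagan statistic by hand (n times the R^2 of a regression of the squared residuals on the predictors). All names are hypothetical.

```r
cars <- datasets::mtcars                      # untouched copy with numeric am
m <- lm(mpg ~ am + wt + qsec, data = cars)    # final model specification
cars$res_sq <- residuals(m)^2                 # squared residuals

# Normality of the residuals: Shapiro-Wilk test
sw_res <- shapiro.test(residuals(m))

# Heteroskedasticity: regress squared residuals on the predictors;
# n * R^2 is approximately chi-squared with 3 degrees of freedom under constant variance
aux <- lm(res_sq ~ am + wt + qsec, data = cars)
bp_stat <- nrow(cars) * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 3, lower.tail = FALSE)

c(shapiro_p = sw_res$p.value, bp_p = bp_p)
```

Small p-values would point to a departure from Normality or from constant variance, respectively; with 32 observations both tests have limited power, so they should be read together with the plots.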
Now we can check the important parameters of our final model through the “summary” command.
# bestFitModel results
summary(bestFitModel)
##
## Call:
## lm(formula = mpg ~ am + wt + qsec, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4811 -1.5555 -0.7257 1.4110 4.6610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.6178 6.9596 1.382 0.177915
## am 2.9358 1.4109 2.081 0.046716 *
## wt -3.9165 0.7112 -5.507 6.95e-06 ***
## qsec 1.2259 0.2887 4.247 0.000216 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared: 0.8497, Adjusted R-squared: 0.8336
## F-statistic: 52.75 on 3 and 28 DF, p-value: 1.21e-11
Weight and quarter-mile time (acceleration) act as confounding variables when determining the relationship between transmission type and mpg.
Accounting for these confounders, manual transmission cars get on average 2.94 MPG more than automatic transmission cars (p = 0.047). This is much lower than the earlier estimate of 7.245 MPG, which did not account for confounding.
The adjusted R^2 is 0.8336, which means that the model explains about 83% of the variation in mpg after adjusting for the number of predictors, indicating a good fit.
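A point estimate alone hides how uncertain the adjusted difference is. The sketch below, with hypothetical names final_model and ci_am, extracts the 95% confidence interval for the am coefficient.

```r
cars <- datasets::mtcars                                # untouched copy with numeric am
final_model <- lm(mpg ~ am + wt + qsec, data = cars)    # same specification as bestFitModel

# 95% confidence interval for the manual-vs-automatic difference, adjusted for wt and qsec
ci_am <- confint(final_model, "am", level = 0.95)
ci_am
```

Because the p-value for am is just below 0.05, the interval excludes zero, but its lower bound is close to zero, so the size of the adjusted difference is estimated with considerable uncertainty.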
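# Appendix 1: pairwise scatter plots of all variables
pairs(mtcars)
# Appendix 2: residual diagnostic plots for the final model
par(mfrow = c(2,2))
plot(bestFitModel)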