Author: Murtuza Ali Lakhani
Date: October 23, 2015
This assignment analyzes the mtcars data set, a collection of cars, to determine how a set of variables relates to miles per gallon (mpg) and to answer two research questions: 1) Is an automatic or a manual transmission better for mpg? 2) What is the mpg difference between automatic and manual transmissions? Stepwise regression shows that, in addition to transmission type, two other variables have a significant influence on the dependent variable, mpg: wt and qsec. The combined model explains 84.97 percent of the variation in mpg, and its low p-value (1.21e-11) indicates that the model as a whole is significant. The coefficient of 2.9358 for transmission type (p = 0.0467) quantifies its influence: with wt and qsec held constant, changing from an automatic to a manual transmission is associated with an increase of 2.9358 mpg. We therefore conclude that a manual transmission yields better gas mileage than an automatic transmission, and that the estimated difference is 2.9358 miles per gallon.
In this part of the project, we focus on reading in the data and performing exploratory analysis.
Read in the mtcars data set and explore the variables and their types.
data(mtcars)
names(mtcars)
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
class(mtcars$am)
## [1] "numeric"
We convert the “am” predictor, which is coded 0 for automatic and 1 for manual, to a factor with two levels. We label the levels “Automatic” and “Manual” for better readability.
mtcars$am <- as.factor(mtcars$am)
levels(mtcars$am) <- c("Automatic", "Manual")
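As a quick sanity check on the conversion, the sketch below builds the factor with explicit levels and labels and counts the cars in each group. It is an illustration rather than part of the original analysis; the names `cars` and `am_counts` are introduced here, and a fresh copy of the data is used so the working data set is not modified.

```r
# Fresh copy of the data ("cars" is an illustrative name, not from the original analysis).
cars <- datasets::mtcars

# Map the numeric codes explicitly: 0 -> "Automatic", 1 -> "Manual".
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Count the cars in each transmission group.
am_counts <- table(cars$am)
print(am_counts)
```

Stating the levels and labels in one call keeps the mapping visible and does not depend on the order in which as.factor happens to sort the codes.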
The first check is a boxplot of gas mileage for cars with automatic and manual transmissions. The boxplot in Appendix I - Part 1 shows that manual cars have better gas mileage, but the plot alone is not enough to draw a conclusion because 1) the variability within each group has not yet been accounted for, and 2) relationships among the study variables have not been taken into account.
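The numbers behind the boxplot can be summarized directly. The following minimal sketch, which is not part of the original analysis, computes the mean and standard deviation of mpg within each transmission group; `cars`, `group_mean`, and `group_sd` are illustrative names.

```r
# Fresh copy of the data with a labelled transmission factor (illustrative names).
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Mean and standard deviation of mpg within each transmission group.
group_mean <- tapply(cars$mpg, cars$am, mean)
group_sd   <- tapply(cars$mpg, cars$am, sd)
print(round(cbind(mean = group_mean, sd = group_sd), 2))
```

Reading the two columns side by side shows how large the gap in means is relative to the spread within each group.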
The second check is a t-test of whether the difference in gas mileage between transmission types is significant. The results are included in Appendix I - Part 2. The low p-value (0.001374) indicates a significant difference in mean gas mileage between the transmission types, but even this is not enough to conclude with confidence. A two-sample t-test ignores every other study variable; it attributes the whole difference to transmission type, which is valid only if the other variables are distributed similarly across automatic and manual cars, and that may or may not be the case. This brings us to the third check.
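To illustrate why the t-test alone may mislead, the sketch below compares the average weight and quarter-mile time of the two transmission groups. It is a minimal illustration, not part of the original analysis; the names `cars` and `group_means` are introduced here for the example.

```r
# Fresh copy of the data with a labelled transmission factor (illustrative names).
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Mean weight (wt) and quarter-mile time (qsec) within each transmission group.
group_means <- aggregate(cbind(wt, qsec) ~ am, data = cars, FUN = mean)
print(group_means)
```

If the group means differ noticeably, part of the raw mpg gap may be due to those variables rather than to the transmission itself.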
For the third check we perform a correlation analysis to understand how the variables in the data set relate to one another. The results are included in Appendix I - Part 3. Every variable is at least moderately correlated with gas mileage, mpg, with absolute correlations ranging from 0.42 (qsec) to 0.87 (wt). This suggests the need for multiple regression analysis to account for the relationships among the dependent and independent variables.
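Correlations among the predictors themselves matter as much as their correlations with mpg. The following sketch, offered as an illustration and not as part of the original analysis, prints both; the names `cars`, `cor_mpg`, and `pred_cor`, and the choice of the four predictors shown, are assumptions made for the example.

```r
# Fresh, all-numeric copy of the data, as cor() requires (illustrative name).
cars <- datasets::mtcars

# Correlation of every other variable with mpg, sorted from most negative to most positive.
cor_mpg <- sort(cor(cars)["mpg", colnames(cars) != "mpg"])
print(round(cor_mpg, 2))

# Correlations among a few of the predictors (an arbitrary subset chosen for illustration).
pred_cor <- cor(cars[, c("wt", "disp", "cyl", "hp")])
print(round(pred_cor, 2))
```

Strong correlations among predictors are one reason a model with all variables can be hard to interpret, which motivates selecting a smaller model.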
One assumption we check before fitting the regression is multivariate normality; strictly, the linear model requires only that the residuals be approximately normal, which we examine after fitting. We check multivariate normality using the mardiaTest function in the MVN package. The results are included in Appendix II - Part 1. The outcome shows that the data are not multivariate normal: the skewness test rejects normality (p = 2.11e-05), although the kurtosis test does not (p = 0.969). Appendix II - Part 2 shows the distribution properties of the dependent variable, mpg. A Shapiro-Wilk normality test indicates that mpg is not quite normal but is close to normal. For the purposes of this exercise we ignore these deviations in the given data.
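The Shapiro-Wilk test mentioned above can be run with base R alone. This is a minimal sketch under the assumption that the test is applied to the raw mpg values; the name `mpg_sw` is introduced here for illustration.

```r
# Shapiro-Wilk test of normality on the raw mpg values.
# A small p-value would be evidence against normality.
mpg_sw <- shapiro.test(datasets::mtcars$mpg)
print(mpg_sw)
```

The W statistic lies between 0 and 1, with values near 1 indicating a distribution close to normal.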
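Since this case involves a large number of candidate independent variables, we perform stepwise regression. Starting from the full model with every predictor, the step function drops one term at a time for as long as doing so lowers the Akaike information criterion (AIC), and returns the model at which no further removal helps.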
stepwise <- step(lm(mpg ~ ., data = mtcars), trace = 0, steps = 20000)
summary(stepwise)
##
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.4811 -1.5555 -0.7257  1.4110  4.6610
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   9.6178     6.9596   1.382 0.177915
## wt           -3.9165     0.7112  -5.507 6.95e-06 ***
## qsec          1.2259     0.2887   4.247 0.000216 ***
## amManual      2.9358     1.4109   2.081 0.046716 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared: 0.8497, Adjusted R-squared: 0.8336
## F-statistic: 52.75 on 3 and 28 DF, p-value: 1.21e-11
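As a rough cross-check on the selection, the sketch below refits the full model, repeats the selection, and compares the AIC of the two models. It is an illustration only; `cars`, `full`, and `selected` are names introduced here, and AIC() uses a different additive constant from the criterion step reports internally, so only the comparison between the two values is meaningful.

```r
# Fresh copy of the data with a labelled transmission factor (illustrative names).
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Full model with every predictor, then AIC-based selection starting from it.
full     <- lm(mpg ~ ., data = cars)
selected <- step(full, trace = 0)

# Terms retained by the selection, and the AIC of both models (lower is better).
print(formula(selected))
print(c(full = AIC(full), selected = AIC(selected)))
```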
The stepwise regression shows that, in addition to transmission type, two other variables have a significant influence on the dependent variable, mpg: wt and qsec. The combined model explains 84.97 percent of the variation in mpg (adjusted R-squared 0.8336), and its low p-value (1.21e-11) indicates that the model as a whole is significant. The coefficient of 2.9358 for transmission type (p = 0.0467) quantifies its influence on mpg: with wt and qsec held constant, changing from an automatic to a manual transmission is associated with an increase of 2.9358 mpg.
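A point estimate is easier to judge alongside its uncertainty. The following minimal sketch, which is not part of the original analysis, computes a 95 percent confidence interval for the transmission coefficient; `cars`, `fit`, and `am_ci` are illustrative names.

```r
# Fresh copy of the data with a labelled transmission factor (illustrative names).
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Refit the selected model and extract the interval for the transmission effect.
fit   <- lm(mpg ~ wt + qsec + am, data = cars)
am_ci <- confint(fit, "amManual", level = 0.95)
print(am_ci)
```

An interval whose lower bound sits close to zero would indicate that the size of the transmission effect is estimated with considerable uncertainty, even when it is significant at the 5 percent level.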
As a final check, we look at the residuals for signs of non-normality and examine the residuals vs. fitted values plot for heteroscedasticity. The residual diagnostics (Appendix III) are consistent with normally distributed residuals and show no evidence of heteroscedasticity.
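The diagnostic plots can be complemented with numerical checks. The sketch below is an informal illustration and not part of the original analysis: it applies a Shapiro-Wilk test to the residuals and, as a rough stand-in for a formal heteroscedasticity test, looks at the rank correlation between the absolute residuals and the fitted values. The names `cars`, `fit`, `res`, `res_sw`, and `spread` are introduced here.

```r
# Fresh copy of the data with a labelled transmission factor (illustrative names).
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

fit <- lm(mpg ~ wt + qsec + am, data = cars)
res <- residuals(fit)

# Shapiro-Wilk test on the residuals: a small p-value would suggest non-normality.
res_sw <- shapiro.test(res)
print(res_sw)

# Informal spread check: a rank correlation far from zero between |residuals|
# and fitted values would hint that the variance changes with the fitted level.
spread <- cor.test(abs(res), fitted(fit), method = "spearman", exact = FALSE)
print(spread$estimate)
```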
The above analysis supports the conclusion that a manual transmission yields better gas mileage than an automatic transmission, with an estimated difference of 2.9358 miles per gallon after adjusting for wt and qsec.
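To see how much the adjustment matters, the following sketch compares the transmission coefficient from a model with transmission alone against the coefficient from the selected model. It is an illustration under the assumption that a simple one-predictor model is a fair baseline; `cars`, `unadjusted`, and `adjusted` are names introduced here.

```r
# Fresh copy of the data with a labelled transmission factor (illustrative names).
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Transmission effect without and with adjustment for wt and qsec.
unadjusted <- coef(lm(mpg ~ am, data = cars))[["amManual"]]
adjusted   <- coef(lm(mpg ~ wt + qsec + am, data = cars))[["amManual"]]
print(round(c(unadjusted = unadjusted, adjusted = adjusted), 4))
```

The unadjusted coefficient equals the raw difference in group means reported by the t-test, so the gap between the two values shows how much of that raw difference is accounted for by wt and qsec.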
Appendix I - Part 1
boxplot(mpg ~ am, data = mtcars,
        col = c("light blue", "light green"),
        xlab = "Transmission Type (am)",
        ylab = "Miles per Gallon (mpg)",
        main = "Miles per Gallon by Transmission Type")
Appendix I - Part 2
automatic <- mtcars[mtcars$am == "Automatic",]
manual <- mtcars[mtcars$am == "Manual",]
t.test(automatic$mpg, manual$mpg)
##
## Welch Two Sample t-test
##
## data: automatic$mpg and manual$mpg
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean of x mean of y
## 17.14737 24.39231
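Appendix I - Part 3
# Reload the data so that am is numeric again, as cor() requires
data(mtcars)
# Correlation of every variable with mpg (first row of the matrix), sorted
sort(cor(mtcars)[1,])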
##         wt        cyl       disp         hp       carb       qsec
## -0.8676594 -0.8521620 -0.8475514 -0.7761684 -0.5509251  0.4186840
##       gear         am         vs       drat        mpg
##  0.4802848  0.5998324  0.6640389  0.6811719  1.0000000
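Appendix II - Part 1
# Note: mardiaTest() belongs to older MVN releases; later releases appear to
# provide Mardia's test through mvn() instead.
library(MVN)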
mardiaTest(mtcars, qqplot = FALSE)
## Mardia's Multivariate Normality Test
## ---------------------------------------
## data : mtcars
##
## g1p : 73.98927
## chi.skew : 394.6095
## p.value.skew : 2.114446e-05
##
## g2p : 143.229
## z.kurtosis : 0.038304
## p.value.kurt : 0.9694453
##
## chi.small.skew : 438.2442
## p.value.small : 1.687389e-08
##
## Result : Data are not multivariate normal.
## ---------------------------------------
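Appendix II - Part 2
x <- mtcars$mpg
# Histogram of mpg with a normal curve of matching mean and standard deviation
h <- hist(x, breaks = 10, col = "Green", xlab = "Miles per Gallon (mpg)",
          main = "Distribution Properties of Miles per Gallon (mpg)")
xfit <- seq(min(x), max(x), length = 40)
yfit <- dnorm(xfit, mean = mean(x), sd = sd(x))
# Rescale the density to the histogram's count scale (bin width times sample size)
yfit <- yfit * diff(h$mids[1:2]) * length(x)
lines(xfit, yfit, col = "Red", lwd = 3)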
# Kernel Density Plot
d <- density(mtcars$mpg)
plot(d, xlab = "Miles per Gallon (mpg)", main ="Density Plot for Miles per Gallon (mpg)")
Appendix III
# Arrange the four standard lm diagnostic plots in a 2 x 2 grid
par(mfrow = c(2, 2))
finalfit <- lm(mpg ~ am + wt + qsec, data = mtcars)
plot(finalfit)