You work for Motor Trend, a magazine about the automobile industry. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome). They are particularly interested in the following two questions:
Write executive summary
library(datasets)
library(Hmisc)
library(ggplot2)
library(corrgram)
library(knitr)
data(mtcars)
Correlation matrix for the data
mcorr<-rcorr(as.matrix(mtcars))
mcorr$r
From the matrix above, mpg is strongly negatively correlated with wt, cyl and disp.
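To make the ranking of correlations easier to read, the mpg row can be pulled out of the matrix and sorted by absolute value. This is a minimal sketch that uses base R's `cor` (Pearson, the same default as `rcorr`); the names `mpg_cor` and `mpg_cor_sorted` are illustrative and not part of the original analysis.

```r
data(mtcars)

# Pearson correlation of every variable with mpg
mpg_cor <- cor(mtcars)["mpg", ]

# Drop the trivial self-correlation of mpg with itself
mpg_cor <- mpg_cor[names(mpg_cor) != "mpg"]

# Sort by strength of association, strongest first
mpg_cor_sorted <- mpg_cor[order(abs(mpg_cor), decreasing = TRUE)]
round(mpg_cor_sorted, 2)
```

The first three entries of the sorted vector should be the variables named above.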
## am mpg
## 1 0 17.14737
## 2 1 24.39231
The box plot and group means show that manual transmission cars (am = 1) have a higher average mpg than automatic transmission cars (am = 0).
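The code that produced the group means and the box plot is not shown in this report. One way to reproduce them, assuming `aggregate` and base `boxplot` were used (the original calls may differ), is:

```r
data(mtcars)

# Mean mpg per transmission type (0 = automatic, 1 = manual)
am_means <- aggregate(mpg ~ am, data = mtcars, FUN = mean)
am_means

# Box plot of mpg by transmission type
boxplot(mpg ~ am, data = mtcars,
        names = c("Automatic", "Manual"),
        xlab = "Transmission", ylab = "Miles per gallon")
```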
We will perform a t test and then a regression to confirm and quantify the mpg difference between manual and automatic transmissions.
To check whether the mpg data are approximately normally distributed:
plot(density(mtcars$mpg), main = "mpg density plot")
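A density plot is a visual check only. As a complement, a formal test such as Shapiro-Wilk could be run; this is an addition to the original analysis, not part of it. A p-value above 0.05 would mean there is no evidence against normality at the 5% level.

```r
data(mtcars)

# Shapiro-Wilk test: the null hypothesis is that mpg is normally distributed
sw <- shapiro.test(mtcars$mpg)
sw$statistic
sw$p.value
```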
We can conduct a t test to check the claim that manual transmission is better than automatic for mpg.
Null hypothesis: There is no difference in mean mpg between manual and automatic transmissions.
Alternative hypothesis: There is a difference in mean mpg between manual and automatic transmissions.
t.test(mtcars$mpg~mtcars$am,conf.level=0.95)
##
## Welch Two Sample t-test
##
## data: mtcars$mpg by mtcars$am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean in group 0 mean in group 1
## 17.14737 24.39231
The p-value (0.0014) is lower than 0.05, so we reject the null hypothesis. Based on the t test, manual transmission cars have a significantly higher mean mpg than automatic ones; the 95% confidence interval for the difference is 3.2 to 11.3 mpg.
To quantify the difference, we fit a linear regression.
model1 <- lm(mpg ~ am, data = mtcars)
summary(model1)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## am 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
We can reject the null hypothesis because the p-value is lower than 0.05, but the model explains only about 34% of the variance (adjusted R-squared). Transmission alone is therefore a weak predictor of mpg.
The next option is a multivariable regression, with the predictors selected by the step function.
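With a single 0/1 predictor, the slope of the regression is simply the difference between the two group means, and the intercept is the mean of the am = 0 group. The sketch below checks this numerically; `group_means`, `mean_diff` and `slope` are illustrative names.

```r
data(mtcars)

model1 <- lm(mpg ~ am, data = mtcars)

# Mean mpg for automatic ("0") and manual ("1") cars
group_means <- tapply(mtcars$mpg, mtcars$am, mean)

# Difference in means versus the fitted slope for am
mean_diff <- unname(group_means["1"] - group_means["0"])
slope <- unname(coef(model1)["am"])

c(mean_diff = mean_diff, slope = slope)
```

This is why the 7.245 estimate in the regression output matches the gap between the two sample means reported by the t test.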
stepmodel = step(lm(data = mtcars, mpg ~ .),trace=0,steps=100)
summary(stepmodel)
##
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4811 -1.5555 -0.7257 1.4110 4.6610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.6178 6.9596 1.382 0.177915
## wt -3.9165 0.7112 -5.507 6.95e-06 ***
## qsec 1.2259 0.2887 4.247 0.000216 ***
## am 2.9358 1.4109 2.081 0.046716 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared: 0.8497, Adjusted R-squared: 0.8336
## F-statistic: 52.75 on 3 and 28 DF, p-value: 1.21e-11
This is a better model, which explains about 83% of the variance (adjusted R-squared). It indicates that wt, qsec and am together are much better predictors of mpg.
Now we will test how much am contributes to the prediction by fitting the model with and without it.
fit1<- lm(mpg ~ wt + qsec + am, data = mtcars)
fit2<- lm(mpg ~ wt + qsec, data = mtcars)
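Because fit2 is nested inside fit1, the two models can also be compared directly with an F test. This is a sketch of an additional check, not a step taken in the original analysis; `cmp` and `p_am` are illustrative names.

```r
data(mtcars)

fit1 <- lm(mpg ~ wt + qsec + am, data = mtcars)
fit2 <- lm(mpg ~ wt + qsec, data = mtcars)

# F test of the smaller model against the larger one
cmp <- anova(fit2, fit1)
cmp

# p-value for adding am to a model that already contains wt and qsec
p_am <- cmp[["Pr(>F)"]][2]
p_am
```

With only one extra term, this p-value is the same as the one reported for the am coefficient in the model summary.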
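Next we build the prediction inputs: two data frames holding the mean wt and mean qsec, one for automatic (am = 0) and one for manual (am = 1).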
data1 = data.frame(wt = mean(mtcars$wt), qsec = mean(mtcars$qsec), am = 0)
data2 = data.frame(wt = mean(mtcars$wt), qsec = mean(mtcars$qsec), am = 1)
Now we use the predict function to obtain fitted values with 95% confidence intervals.
predict_am0<- predict(fit1, newdata = data1, level=0.95, interval="confidence")
predict_am1<- predict(fit1, newdata = data2, level=0.95, interval="confidence")
predict_withoutam<-predict(fit2, newdata = data1, level=0.95, interval="confidence")
predict_test<-rbind(predict_am0,predict_am1,predict_withoutam)
kable(x = predict_test, align = NULL, padding = 10, caption = "Comparison of prediction test")
|  | fit | lwr | upr |
|---|---|---|---|
| 1 | 18.89794 | 17.42441 | 20.37147 |
| 1 | 21.83378 | 19.90054 | 23.76702 |
| 1 | 20.09062 | 19.15198 | 21.02927 |
There is a difference of about 3 mpg between the predicted values for automatic (predict_am0) and manual (predict_am1), matching the am coefficient estimated earlier. The prediction that does not use am (predict_withoutam) lies between the two.
This indicates that transmission mode makes a difference, but not a large one: the prediction made without am (20.1 mpg) falls inside the 95% confidence intervals of both predictions made with am.
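In a linear model, two predictions that differ only in am must differ by exactly the am coefficient, whatever values wt and qsec are held at. The sketch below confirms this; `at_means` and `pred_diff` are illustrative names.

```r
data(mtcars)

fit1 <- lm(mpg ~ wt + qsec + am, data = mtcars)

# One row per transmission type, with wt and qsec held at their means
at_means <- data.frame(wt = mean(mtcars$wt),
                       qsec = mean(mtcars$qsec),
                       am = c(0, 1))

pred <- predict(fit1, newdata = at_means)

# Manual minus automatic prediction
pred_diff <- unname(pred[2] - pred[1])
c(pred_diff = pred_diff, am_coef = unname(coef(fit1)["am"]))
```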
From our analysis we can infer the following:
With transmission as the only predictor, manual transmission gives 7.25 mpg more than automatic, but this is not a reliable estimate because the model explains only 34% of the variance.
If we add weight (wt) and quarter-mile time (qsec), then manual transmission gives 2.9 mpg more than automatic. This is a much better model, with an adjusted R-squared of 83%.
mpg can also be predicted without transmission mode, with an R-squared of about 81% (model fit2).
Overall, there are better predictors of mpg than transmission mode.
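The variance-explained figures quoted in these conclusions can be collected side by side. This is a minimal sketch that refits the three models and extracts the adjusted R-squared from each summary; the list name `models` and its labels are illustrative.

```r
data(mtcars)

# The three models discussed in this report
models <- list(
  am_only    = lm(mpg ~ am, data = mtcars),
  wt_qsec_am = lm(mpg ~ wt + qsec + am, data = mtcars),
  wt_qsec    = lm(mpg ~ wt + qsec, data = mtcars)
)

# Adjusted R-squared for each model
adj_r2 <- sapply(models, function(m) summary(m)$adj.r.squared)
round(adj_r2, 3)
```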
Diagnostic plots of the residuals for model fit2:
par(mfrow=c(2,2))
plot(fit2)
The residuals show a few outliers but are otherwise approximately normally distributed.
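The visual reading of the diagnostic plots could be backed up numerically. The sketch below is an optional addition, not part of the original analysis: it runs a Shapiro-Wilk test on the residuals of fit2 and lists the cars whose standardized residual exceeds 2 in absolute value. The cutoff of 2 is a common rule of thumb and an assumption here.

```r
data(mtcars)

fit2 <- lm(mpg ~ wt + qsec, data = mtcars)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
res_sw <- shapiro.test(resid(fit2))
res_sw$p.value

# Cars with large standardized residuals (possible outliers)
std_res <- rstandard(fit2)
names(std_res)[abs(std_res) > 2]
```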