Regression Models Assignment

Instructions

You work for Motor Trend, a magazine about the automobile industry. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome). They are particularly interested in the following two questions:

“Is an automatic or manual transmission better for MPG”
“Quantify the MPG difference between automatic and manual transmissions”

Executive Summary

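This report uses the mtcars data set to examine how transmission type relates to miles per gallon (mpg). On average, cars with a manual transmission achieve about 7.25 mpg more than cars with an automatic transmission, and a Welch t test finds this difference significant (p = 0.0014). Transmission alone, however, explains only about 34% of the variance in mpg (adjusted R-squared). After adjusting for weight (wt) and quarter-mile time (qsec), the estimated advantage of a manual transmission shrinks to about 2.9 mpg (p = 0.047), in a model that explains about 83% of the variance.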

Loading libraries and data

library(datasets)
library(Hmisc)
library(ggplot2)
library(corrgram)
library(knitr)
data(mtcars)

Data Exploration and Transformation

Correlation matrix for the data

mcorr<-rcorr(as.matrix(mtcars))
mcorr$r

From the correlation matrix, mpg is strongly (and negatively) correlated with wt, cyl and disp.

##   am      mpg
## 1  0 17.14737
## 2  1 24.39231

From the box plot and the group means, manual transmission (am = 1) gives a higher mean mpg than automatic transmission (am = 0).
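The code behind the group means and the box plot is not shown in this report. A minimal sketch that would produce them is given here; the use of `aggregate` and `boxplot`, and the axis labels, are assumptions rather than the original code.

```r
library(datasets)
data(mtcars)

# Mean mpg by transmission type (0 = automatic, 1 = manual)
group_means <- aggregate(mpg ~ am, data = mtcars, FUN = mean)
print(group_means)

# Box plot of mpg by transmission type
boxplot(mpg ~ am, data = mtcars,
        names = c("Automatic", "Manual"),
        xlab = "Transmission", ylab = "Miles per gallon",
        main = "mpg by transmission type")
```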

Data Analysis

We will perform a t test and then a regression to test and quantify the mpg difference between manual and automatic transmissions.

Statistical Analysis

To check whether the mpg data are approximately normally distributed:

plot(density(mtcars$mpg), main="mpg density plot")
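A density plot is only a visual check. As a complement, a formal test and a Q-Q plot could be sketched as follows; the Shapiro-Wilk test and the Q-Q plot are additions for illustration and are not part of the original analysis.

```r
library(datasets)
data(mtcars)

# Shapiro-Wilk test: the null hypothesis is that mpg is normally distributed
sw <- shapiro.test(mtcars$mpg)
print(sw)

# Q-Q plot: points close to the reference line suggest approximate normality
qqnorm(mtcars$mpg, main = "Normal Q-Q plot of mpg")
qqline(mtcars$mpg)
```

A p-value above 0.05 would mean the test finds no evidence against normality, which is weaker than showing that the data are normal.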

We can conduct a t test to check the claim that manual transmission is better than automatic transmission for mpg.

Null hypothesis: there is no difference in mean mpg between manual and automatic transmissions.
Alternative hypothesis: there is a difference in mean mpg between manual and automatic transmissions.

t.test(mtcars$mpg~mtcars$am,conf.level=0.95)

## 
##  Welch Two Sample t-test
## 
## data:  mtcars$mpg by mtcars$am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean in group 0 mean in group 1 
##        17.14737        24.39231

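The p-value (0.001374) is lower than 0.05, so we reject the null hypothesis in favour of the alternative. Based on the t test, mean mpg is significantly higher for manual cars than for automatic cars; the 95% confidence interval puts the difference between about 3.2 and 11.3 mpg.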
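To make the test statistic less of a black box, the Welch t statistic and its degrees of freedom can be computed by hand. This is a sketch for illustration; the variable names (`auto`, `manual`, `v_a`, `v_m`) are chosen here and do not appear in the original analysis.

```r
library(datasets)
data(mtcars)

auto   <- mtcars$mpg[mtcars$am == 0]
manual <- mtcars$mpg[mtcars$am == 1]

# Variance of each group mean (sample variance divided by group size)
v_a <- var(auto) / length(auto)
v_m <- var(manual) / length(manual)

# Welch t statistic: automatic minus manual, matching t.test's group order
t_stat <- (mean(auto) - mean(manual)) / sqrt(v_a + v_m)

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (v_a + v_m)^2 /
  (v_a^2 / (length(auto) - 1) + v_m^2 / (length(manual) - 1))

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = df_welch)

round(c(t = t_stat, df = df_welch, p = p_value), 4)
```

The rounded values should agree with the t, df and p-value lines of the t.test output above.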

Linear Regression

To quantify the difference, we fit a linear regression of mpg on am.

model1 <- lm(mpg ~ am, data = mtcars)
summary(model1)

## 
## Call:
## lm(formula = mpg ~ am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## am             7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

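We can reject the null hypothesis because the p-value (0.000285) is lower than 0.05: manual cars average 7.245 mpg more than automatic cars. However, the model explains only about 34% of the variance in mpg (adjusted R-squared 0.3385), so transmission type alone is a weak predictor of mpg.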
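With a single 0/1 predictor, the regression coefficients are just the group means in another form. The sketch below illustrates this and adds a confidence interval for the unadjusted difference; the `tapply` and `confint` calls are additions for illustration.

```r
library(datasets)
data(mtcars)

model1 <- lm(mpg ~ am, data = mtcars)

# Group means of mpg: automatic (am = 0) and manual (am = 1)
means <- tapply(mtcars$mpg, mtcars$am, mean)

# Intercept = automatic mean; am coefficient = manual mean minus automatic mean
coef(model1)
mean_diff <- unname(means["1"] - means["0"])
mean_diff

# 95% confidence interval for the unadjusted difference
confint(model1, "am", level = 0.95)
```

Note that `lm` assumes equal variances in the two groups, so this interval differs slightly from the Welch interval reported by `t.test`.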

The next option is a multivariable regression, with the predictors selected by the step function.

stepmodel = step(lm(data = mtcars, mpg ~ .),trace=0,steps=100)
summary(stepmodel)

## 
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4811 -1.5555 -0.7257  1.4110  4.6610 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.6178     6.9596   1.382 0.177915    
## wt           -3.9165     0.7112  -5.507 6.95e-06 ***
## qsec          1.2259     0.2887   4.247 0.000216 ***
## am            2.9358     1.4109   2.081 0.046716 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared:  0.8497, Adjusted R-squared:  0.8336 
## F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11

This is a better model: it explains about 83% of the variance in mpg (adjusted R-squared 0.8336). Together, wt, qsec and am predict mpg much better than am alone.
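Because the am-only model is nested inside the selected model, the two can also be compared directly. The following sketch uses an F test and AIC (the criterion `step` uses for selection); this comparison is an addition for illustration and was not part of the original analysis.

```r
library(datasets)
data(mtcars)

model1    <- lm(mpg ~ am, data = mtcars)
stepmodel <- lm(mpg ~ wt + qsec + am, data = mtcars)

# F test for nested models: does adding wt and qsec reduce residual variance?
print(anova(model1, stepmodel))

# AIC for both models; lower values indicate a better trade-off
# between fit and number of parameters
AIC(model1, stepmodel)
```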

Testing the Prediction

Now we will check how much am contributes to predicting mpg once wt and qsec are in the model.

fit1<- lm(mpg ~ wt + qsec + am, data = mtcars)
fit2<- lm(mpg ~ wt + qsec, data = mtcars)

Next, build two data frames of predictor values, one for each transmission type, with wt and qsec held at their sample means.

data1 = data.frame(wt = mean(mtcars$wt), qsec = mean(mtcars$qsec), am = 0)
data2 = data.frame(wt = mean(mtcars$wt), qsec = mean(mtcars$qsec), am = 1)

Now use the predict function to obtain predictions with 95% confidence intervals.

predict_am0<- predict(fit1, newdata = data1, level=0.95, interval="confidence")
predict_am1<- predict(fit1, newdata = data2, level=0.95, interval="confidence")
predict_withoutam<-predict(fit2, newdata = data1, level=0.95, interval="confidence")

predict_test<-rbind(predict_am0,predict_am1,predict_withoutam)
rownames(predict_test)<-c("predict_am0","predict_am1","predict_withoutam")
kable(x = predict_test,align = NULL, padding = 10,caption = "Comparison of prediction test")

Comparison of prediction test
	fit	lwr	upr
predict_am0	18.89794	17.42441	20.37147
predict_am1	21.83378	19.90054	23.76702
predict_withoutam	20.09062	19.15198	21.02927

The predicted values for automatic (predict_am0) and manual (predict_am1) differ by about 2.9 mpg, which matches the am coefficient estimated earlier. The prediction made without am as a variable (predict_withoutam) lies between the other two.

This indicates that transmission type makes a difference, but not a large one. The prediction made without am (20.09) falls inside the 95% confidence intervals of both predictions made with am, so leaving am out changes the predicted mpg only modestly.
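The gap between the two predictions is the am coefficient itself, because wt and qsec are held at the same values in both rows. A short sketch under that setup (the names `newdata` and `diff_pred` are illustrative):

```r
library(datasets)
data(mtcars)

fit1 <- lm(mpg ~ wt + qsec + am, data = mtcars)

# Same wt and qsec (sample means) in both rows; only am differs
newdata <- data.frame(wt = mean(mtcars$wt),
                      qsec = mean(mtcars$qsec),
                      am = c(0, 1))
pred <- predict(fit1, newdata = newdata)

# Manual minus automatic prediction
diff_pred <- unname(pred[2] - pred[1])
diff_pred

# The same quantity read off the model, with its 95% confidence interval
coef(fit1)["am"]
confint(fit1, "am", level = 0.95)
```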

Conclusion

From our analysis we can infer the following:

With transmission as the only predictor, manual transmission gives 7.25 mpg more than automatic, but this is not a good model because it explains only 34% of the variance.
If we add weight (wt) and quarter-mile time (qsec), then manual transmission gives 2.9 mpg more than automatic. This is a much better model, with an adjusted R-squared of 83%.
mpg can also be predicted without transmission mode, using wt and qsec only, with an adjusted R-squared of about 81% (model fit2).

Overall, there are better predictors of mpg than transmission mode.

Appendix

Diagnostic plots of the residuals for model fit2.

par(mfrow=c(2,2))
plot(fit2)

The residuals show a few outliers but are, overall, approximately normally distributed.
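The visual reading of the diagnostic plots can be backed up numerically. The sketch below tests the residuals for normality and lists the cars with the largest standardized residuals; both steps are additions for illustration, not part of the original appendix.

```r
library(datasets)
data(mtcars)

fit2 <- lm(mpg ~ wt + qsec, data = mtcars)

# Shapiro-Wilk test on the residuals (null hypothesis: normally distributed)
sw_resid <- shapiro.test(residuals(fit2))
print(sw_resid)

# Standardized residuals, named by car; large absolute values flag outliers
std_res <- rstandard(fit2)
head(sort(abs(std_res), decreasing = TRUE), 3)
```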