The purpose of this project is to explore the relationship between a car's transmission type and its fuel economy in miles per gallon (MPG), using the 'mtcars' dataset in R. The analysis fits several linear models and concludes by identifying the best one.
The first step is to preprocess the data and create a boxplot comparing MPG across transmission types.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.4
# Recode am (0 = automatic, 1 = manual) as a labelled factor; "Auto" is the reference level
mtcars$am <- factor(mtcars$am, levels = c(0, 1), labels = c("Auto", "Manual"))
ggplot(mtcars, aes(x = am, y = mpg, fill = am)) +
  geom_boxplot() +
  labs(x = "Transmission type", y = "Miles/gallon", title = "MPG Consumption by Transmission Type") +
  scale_fill_manual(values = c("blue", "green"), name = "Transmission")
This plot suggests that cars with a manual transmission achieve higher MPG than cars with an automatic one.
To determine whether the transmission type is statistically significant, multiple models were fit.
The first model takes into account only the transmission type and the defined outcome (MPG).
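The visual impression from the boxplot can be backed up with a few group-wise numbers. The following is a minimal sketch, not part of the original analysis; the names `mt`, `mpg_mean`, `mpg_median` and `group_n` are illustrative, and the data is taken from `datasets::mtcars` so the block runs on its own.

```r
# Work on a fresh copy so the block does not depend on earlier recoding.
mt <- datasets::mtcars
mt$am <- factor(mt$am, levels = c(0, 1), labels = c("Auto", "Manual"))

# Mean and median MPG, plus the number of cars, for each transmission type.
mpg_mean   <- tapply(mt$mpg, mt$am, mean)
mpg_median <- tapply(mt$mpg, mt$am, median)
group_n    <- table(mt$am)

print(rbind(mean = mpg_mean, median = mpg_median, n = group_n))
```

The two groups are unequal in size (19 automatic and 13 manual cars), which is worth keeping in mind when reading the boxplot.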
fit.1 <- lm(mpg ~ am, data = mtcars)
summary(fit.1)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -9.3923 -3.0923 -0.2974  3.2439  9.5077
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## amManual       7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
confint(fit.1)
##                2.5 %   97.5 %
## (Intercept) 14.85062 19.44411
## amManual     3.64151 10.84837
Given these results, we can conclude that the transmission type is statistically significant when it is the only predictor (p = 0.000285). On average, cars with a manual transmission achieve about 7.2 more miles per gallon than cars with an automatic one, that is, they use less fuel. The 95% confidence interval for this coefficient is (3.64, 10.85).
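With a single two-level predictor, the slope of the linear model is simply the difference between the two group means, and its t-test matches a pooled-variance two-sample t-test. The sketch below checks both facts; it is an illustration added here, and the names `mt`, `fit_am`, `mean_diff` and `tt` are not from the original analysis.

```r
# Fresh copy of the data with am recoded as a labelled factor.
mt <- datasets::mtcars
mt$am <- factor(mt$am, levels = c(0, 1), labels = c("Auto", "Manual"))

fit_am <- lm(mpg ~ am, data = mt)

# The amManual coefficient equals mean(Manual) - mean(Auto).
group_means <- tapply(mt$mpg, mt$am, mean)
mean_diff   <- group_means[["Manual"]] - group_means[["Auto"]]
slope       <- coef(fit_am)[["amManual"]]

# A two-sample t-test with equal variances gives the same p-value as the slope.
tt      <- t.test(mpg ~ am, data = mt, var.equal = TRUE)
slope_p <- summary(fit_am)$coefficients["amManual", "Pr(>|t|)"]

print(c(slope = slope, mean_diff = mean_diff))
print(c(lm_p = slope_p, ttest_p = tt$p.value))
```

This equivalence also makes the limitation clear: the model compares raw group averages and does not adjust for any other difference between the two groups of cars.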
Now, two more models will be created by adding the variables horsepower and car weight.
fit.2 <- lm(mpg ~ am + hp, data = mtcars)
summary(fit.2)
##
## Call:
## lm(formula = mpg ~ am + hp, data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.3843 -2.2642  0.1366  1.6968  5.8657
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.584914   1.425094  18.655  < 2e-16 ***
## amManual     5.277085   1.079541   4.888 3.46e-05 ***
## hp          -0.058888   0.007857  -7.495 2.92e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.909 on 29 degrees of freedom
## Multiple R-squared: 0.782, Adjusted R-squared: 0.767
## F-statistic: 52.02 on 2 and 29 DF, p-value: 2.55e-10
fit.3 <- lm(mpg ~ am + hp + wt, data = mtcars)
summary(fit.3)
##
## Call:
## lm(formula = mpg ~ am + hp + wt, data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.4221 -1.7924 -0.3788  1.2249  5.5317
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.002875   2.642659  12.867 2.82e-13 ***
## amManual     2.083710   1.376420   1.514 0.141268
## hp          -0.037479   0.009605  -3.902 0.000546 ***
## wt          -2.878575   0.904971  -3.181 0.003574 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.538 on 28 degrees of freedom
## Multiple R-squared: 0.8399, Adjusted R-squared: 0.8227
## F-statistic: 48.96 on 3 and 28 DF, p-value: 2.908e-11
It is worth noting that adding car weight to the model makes the transmission type non-significant (p = 0.14), which suggests that weight confounds the relationship between transmission type and MPG.
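One way to probe the suspected confounding is to compare car weight across transmission types and to track how the transmission coefficient shrinks as covariates are added. This is a rough sketch under the same model formulas as above; the names `mt`, `wt_by_am` and `am_effect` are illustrative.

```r
# Fresh copy of the data with am recoded as a labelled factor.
mt <- datasets::mtcars
mt$am <- factor(mt$am, levels = c(0, 1), labels = c("Auto", "Manual"))

# Average weight (in 1000 lbs) for each transmission type.
wt_by_am <- tapply(mt$wt, mt$am, mean)
print(wt_by_am)

# Estimated manual-vs-automatic effect under each of the three models.
am_effect <- c(
  am_only  = coef(lm(mpg ~ am,           data = mt))[["amManual"]],
  am_hp    = coef(lm(mpg ~ am + hp,      data = mt))[["amManual"]],
  am_hp_wt = coef(lm(mpg ~ am + hp + wt, data = mt))[["amManual"]]
)
print(round(am_effect, 3))
```

If the automatic cars are markedly heavier on average, part of the raw MPG gap attributed to transmission type may in fact reflect weight, which is consistent with the shrinking coefficient.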
To select the best of these nested models, an ANOVA table is created.
anova(fit.1, fit.2, fit.3)
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ am + hp
## Model 3: mpg ~ am + hp + wt
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)
## 1     30 720.90
## 2     29 245.44  1    475.46 73.841 2.445e-09 ***
## 3     28 180.29  1     65.15 10.118  0.003574 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
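The F-tests in the ANOVA table show that each added variable significantly reduces the RSS, so the model with the 3 variables is the best of the three. However, we also have to take into account that one of its variables, the transmission type, is not significant (p = 0.14). It is therefore suggested that a new model be fit without the transmission type.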
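The suggested follow-up can be sketched as a nested comparison between the three-variable model and the same model without the transmission type. This block is an illustration rather than a result of the original analysis; `mt`, `fit_full`, `fit_reduced` and `cmp` are names chosen here.

```r
# Fresh copy of the data with am recoded as a labelled factor.
mt <- datasets::mtcars
mt$am <- factor(mt$am, levels = c(0, 1), labels = c("Auto", "Manual"))

fit_full    <- lm(mpg ~ am + hp + wt, data = mt)
fit_reduced <- lm(mpg ~ hp + wt,      data = mt)

# F-test for dropping am: does the full model fit significantly better?
cmp <- anova(fit_reduced, fit_full)
print(cmp)

# Adjusted R-squared penalises the extra parameter, so it is comparable across models.
adj_r2 <- c(reduced = summary(fit_reduced)$adj.r.squared,
            full    = summary(fit_full)$adj.r.squared)
print(round(adj_r2, 4))
```

Because only one coefficient is dropped, the p-value of this F-test is the same as the p-value of the `amManual` t-test in the three-variable model, so the comparison is mainly useful for inspecting how the remaining coefficients and the adjusted R-squared change.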
Finally, the residuals of the three-variable model are examined using the standard diagnostic plots.
par(mfrow=c(2,2))
plot(fit.3)
The plots show no clear sign of heteroskedasticity, but the residuals still follow a pattern, which suggests an underlying relationship in the data that the model does not take into account.
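The visual reading of the diagnostic plots can be complemented with a numeric check. The sketch below implements a studentized Breusch-Pagan style test by hand in base R (regressing the squared residuals on the model's predictors); it is an added illustration, not the method used in the original analysis, and the names `mt`, `fit`, `aux`, `bp_stat` and `bp_p` are assumptions made here.

```r
# Fresh copy of the data with am recoded as a labelled factor.
mt <- datasets::mtcars
mt$am <- factor(mt$am, levels = c(0, 1), labels = c("Auto", "Manual"))

fit <- lm(mpg ~ am + hp + wt, data = mt)

# Auxiliary regression: squared residuals on the same predictors.
mt$res_sq <- residuals(fit)^2
aux <- lm(res_sq ~ am + hp + wt, data = mt)

# Test statistic n * R^2, compared with a chi-squared distribution whose
# degrees of freedom equal the number of predictors in the auxiliary model.
bp_stat <- nrow(mt) * summary(aux)$r.squared
bp_df   <- length(coef(aux)) - 1
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

print(c(statistic = bp_stat, df = bp_df, p.value = bp_p))
```

A large p-value would be consistent with constant residual variance. With only 32 observations the test has limited power, so it should be read alongside the plots rather than in place of them.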