In this report we study the influence of transmission type on the MPG (miles per gallon) that cars achieve, using the mtcars data set. We will see that a good model of MPG needs to take the weight and the horsepower of the car into consideration. Once those are included, manual cars are estimated to travel about 2.1 MPG more than automatic cars, although this difference is not statistically significant at the 5% level (p = 0.14).
Let’s see how MPG is distributed for automatic and manual transmissions:
with(mtcars, plot(mpg ~ am, col=am, xlab="Transmission", ylab="MPG", main="Consumption depending on transmission"))
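The regression output that follows labels the transmission coefficient `factor(am)Manual`, which suggests that `am` was recoded as a labelled factor at some earlier point. That step is not shown in the report, so the following is only a sketch of how it might have been done; the data frame name `cars`, the label strings and the colours are assumptions.

```r
# Work on a copy so the original numeric mtcars stays untouched
# (the name `cars` is hypothetical, not taken from the report)
cars <- transform(
  mtcars,
  am = factor(am, levels = c(0, 1), labels = c("Automatic", "Manual"))
)

# Mean MPG for each transmission type
group_means <- tapply(cars$mpg, cars$am, mean)
print(round(group_means, 3))

# Box plot of MPG by transmission, one colour per group
boxplot(mpg ~ am, data = cars, col = c("grey70", "steelblue"),
        xlab = "Transmission", ylab = "MPG",
        main = "MPG by transmission type")
```

The two group means differ by about 7.2 MPG, which is the same quantity that a regression of MPG on transmission alone reports as its slope.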
Next, we model the relation between MPG and transmission as a simple linear regression:
fit<-lm(mpg ~ factor(am), mtcars)
summary(fit)
##
## Call:
## lm(formula = mpg ~ factor(am), data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## factor(am)Manual 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
Only 36% of the variation in MPG is explained by this model, so we need to extend it to a multivariable model. Let’s check which variables to include.
# Reload the original data set so that cor() can be computed on all of its columns
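As a worked check of where the 36% figure comes from, R-squared can be recomputed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. This is a minimal sketch on the built-in `mtcars` data; the variable names `rss`, `tss` and `r2` are introduced here for illustration only.

```r
# Simple regression of MPG on transmission
fit <- lm(mpg ~ factor(am), data = mtcars)

rss <- sum(residuals(fit)^2)                   # residual sum of squares
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)  # total sum of squares
r2  <- 1 - rss / tss                           # share of variation explained

print(c(rss = rss, tss = tss, r.squared = r2))
```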
data(mtcars)
We can check which variables are most strongly correlated with MPG:
sort(abs(cor(mtcars)[1,]), decreasing=TRUE)
## mpg wt cyl disp hp drat vs
## 1.0000000 0.8676594 0.8521620 0.8475514 0.7761684 0.6811719 0.6640389
## am carb gear qsec
## 0.5998324 0.5509251 0.4802848 0.4186840
This means that MPG is strongly correlated with weight, number of cylinders, displacement and horsepower. However, if we look a little closer, we can also see that cylinders, displacement and horsepower are highly correlated with each other, and that displacement is also highly correlated with weight.
sort(abs(cor(mtcars)[2,]), decreasing=TRUE)
## cyl disp mpg hp vs wt drat
## 1.0000000 0.9020329 0.8521620 0.8324475 0.8108118 0.7824958 0.6999381
## qsec carb am gear
## 0.5912421 0.5269883 0.5226070 0.4926866
sort(abs(cor(mtcars)[3,]), decreasing=TRUE)
## disp cyl wt mpg hp vs drat
## 1.0000000 0.9020329 0.8879799 0.8475514 0.7909486 0.7104159 0.7102139
## am gear qsec carb
## 0.5912270 0.5555692 0.4336979 0.3949769
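Because `abs()` discards the sign and each call above shows only one row, it can help to view the signed correlations among the candidate predictors in a single table. The following is a small sketch; the choice of columns in `vars` is an assumption based on the variables discussed in the text.

```r
# Variables discussed in the text: the response plus the four strongest candidates
vars <- c("mpg", "wt", "cyl", "disp", "hp")

# Signed pairwise Pearson correlations
cor_sub <- cor(mtcars[, vars])
print(round(cor_sub, 2))
```

In this table MPG is negatively correlated with all four predictors, while the predictors are positively correlated with one another.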
Now we need to choose which of these three parameters (cylinders, displacement or horsepower) to add to our model alongside weight. First we check whether each of them contributes significantly to the model:
# models
fit2<-lm(mpg ~ factor(am) + wt, mtcars)
fit3<-lm(mpg ~ factor(am) + wt + factor(cyl), mtcars)
fit4<-lm(mpg ~ factor(am) + wt + disp, mtcars)
fit5<-lm(mpg ~ factor(am) + wt + hp, mtcars)
anova(fit, fit2, fit3)
## Analysis of Variance Table
##
## Model 1: mpg ~ factor(am)
## Model 2: mpg ~ factor(am) + wt
## Model 3: mpg ~ factor(am) + wt + factor(cyl)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 29 278.32 1 442.58 65.3095 1.107e-08 ***
## 3 27 182.97 2 95.35 7.0353 0.003473 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(fit, fit2, fit4)
## Analysis of Variance Table
##
## Model 1: mpg ~ factor(am)
## Model 2: mpg ~ factor(am) + wt
## Model 3: mpg ~ factor(am) + wt + disp
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 29 278.32 1 442.58 50.2610 1.032e-07 ***
## 3 28 246.56 1 31.76 3.6072 0.06788 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(fit, fit2, fit5)
## Analysis of Variance Table
##
## Model 1: mpg ~ factor(am)
## Model 2: mpg ~ factor(am) + wt
## Model 3: mpg ~ factor(am) + wt + hp
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 29 278.32 1 442.58 68.734 5.071e-09 ***
## 3 28 180.29 1 98.03 15.224 0.0005464 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Cylinders and horsepower are relevant: their p-values are below 5%, so the null hypothesis that the added term does not improve the model can be rejected, and either parameter can be included. Displacement is not significant at that level (p = 0.068). To choose between cylinders and horsepower, we select the model that explains the larger share of the variation in MPG.
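To make the link between these F tests and the coefficient tables explicit: when a nested comparison adds a single term, its F statistic equals the square of that term's t statistic in the larger model. The sketch below checks this for horsepower; the names `cmp`, `f_stat` and `t_hp` are introduced here for illustration only.

```r
# Nested models: with and without horsepower
fit2 <- lm(mpg ~ factor(am) + wt, data = mtcars)
fit5 <- lm(mpg ~ factor(am) + wt + hp, data = mtcars)

# F test for adding hp to the model with transmission and weight
cmp    <- anova(fit2, fit5)
f_stat <- cmp$F[2]

# t statistic of hp in the larger model
t_hp <- coef(summary(fit5))["hp", "t value"]

print(c(F = f_stat, t_squared = t_hp^2))
```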
summary(fit3)$r.squared
## [1] 0.8375127
summary(fit5)$r.squared
## [1] 0.8398903
The model with weight and horsepower has the higher R-squared value (0.8399 against 0.8375), so it explains slightly more of the observed variation. This model will be used to describe the relationship between MPG and transmission, and to quantify the MPG difference between automatic and manual transmission.
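Since the cylinder model spends two parameters on `factor(cyl)` while the horsepower model spends one, comparing adjusted R-squared, which penalises extra parameters, may be a fairer check than plain R-squared. This is an additional check and not part of the original analysis; the vector name `adj` is introduced here for illustration.

```r
# The two candidate models from the report
fit3 <- lm(mpg ~ factor(am) + wt + factor(cyl), data = mtcars)
fit5 <- lm(mpg ~ factor(am) + wt + hp, data = mtcars)

# Adjusted R-squared for each model
adj <- c(fit3 = summary(fit3)$adj.r.squared,
         fit5 = summary(fit5)$adj.r.squared)
print(round(adj, 4))
```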
summary(fit5)
##
## Call:
## lm(formula = mpg ~ factor(am) + wt + hp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4221 -1.7924 -0.3788 1.2249 5.5317
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.002875 2.642659 12.867 2.82e-13 ***
## factor(am)1 2.083710 1.376420 1.514 0.141268
## wt -2.878575 0.904971 -3.181 0.003574 **
## hp -0.037479 0.009605 -3.902 0.000546 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.538 on 28 degrees of freedom
## Multiple R-squared: 0.8399, Adjusted R-squared: 0.8227
## F-statistic: 48.96 on 3 and 28 DF, p-value: 2.908e-11
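From that summary, we can see that manual cars are estimated to do about 2.1 MPG more than automatic cars of the same weight and horsepower. The estimate has a standard error of 1.38 and a p-value of 0.14, so it is not statistically significant at the 5% level.
The diagnostic plots of the model: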
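One way to express the uncertainty around the transmission estimate is a 95% confidence interval for its coefficient. The sketch below assumes the original numeric coding of `am` in `mtcars`, under which the coefficient is named `factor(am)1`; the name `ci_am` is introduced here for illustration.

```r
# Final model from the report
fit5 <- lm(mpg ~ factor(am) + wt + hp, data = mtcars)

# 95% confidence interval for the manual-vs-automatic difference
ci_am <- confint(fit5, parm = "factor(am)1", level = 0.95)
print(ci_am)
```

An interval that contains zero is consistent with the p-value above the 5% threshold reported for this coefficient.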
par(mfrow = c(2,2))
plot(fit5)
From these plots it can be seen that the residuals are approximately normally distributed.
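The visual impression from the diagnostic plots can be complemented with a formal test. The following sketch applies the Shapiro-Wilk test to the residuals; this test is an addition for illustration and was not part of the original analysis, and the names `res` and `sw` are hypothetical.

```r
# Final model from the report
fit5 <- lm(mpg ~ factor(am) + wt + hp, data = mtcars)
res  <- residuals(fit5)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
sw <- shapiro.test(res)
print(sw)

# Normal Q-Q plot of the residuals with a reference line
qqnorm(res)
qqline(res)
```

A p-value above the chosen significance level means the test gives no evidence against normality; with only 32 observations it has limited power, so it is best read alongside the Q-Q plot.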
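In conclusion, manual transmission cars are estimated to travel about 2.1 MPG more than automatic cars of the same weight and horsepower, but this difference is not statistically significant at the 5% level.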