In the mtcars data set, manual transmission cars get better mileage (fuel efficiency) than automatic transmission cars; that is, they cover more miles per gallon of fuel consumed.
After adding the covariates cyl (number of cylinders), hp (horsepower) and wt (weight) to the simple linear regression model (mpg ~ am), manual transmission cars are estimated to get 1.81 miles per gallon more, on average, than automatic transmission cars. This adjusted difference is not statistically significant at the 5% level (p = 0.21).
We investigate fuel efficiency in cars, measured in miles per gallon, using the well-known public data set mtcars. We focus on comparing automatic and manual transmission cars, and we evaluate how introducing some or all of the other covariates into the linear fit changes the estimated difference.
datos <- mtcars
str(datos)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
head(order(datos))
## [1] 225 226 229 231 236 237
At first glance, each observation refers to a particular car model, and no model appears more than once.
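The claim that no car model is repeated can be checked directly. This is a minimal sketch, assuming (as in the built-in mtcars) that the row names hold the model names; `n_duplicated` is an illustrative name, not part of the original analysis.

```r
# Row names of mtcars hold the car model names (one row per model).
datos <- mtcars

# anyDuplicated() returns 0 when no value is repeated,
# otherwise the index of the first duplicate.
n_duplicated <- anyDuplicated(rownames(datos))
n_duplicated

# Number of distinct models versus number of rows.
c(models = length(unique(rownames(datos))), rows = nrow(datos))
```

A result of 0 from `anyDuplicated()` means every row corresponds to a different model.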
According to the documentation of the data set mtcars, the variables of initial interest for this study are:
mpg - (continuous numerical) Miles traveled per gallon of fuel.
am - (binary: 0 - automatic, 1 - manual) The type of transmission of the car.
I then convert am into a factor and relabel its levels for readability.
datos$am <- factor(datos$am)
levels(datos$am) <- c("automatic","manual")
head(table(datos$mpg, datos$am))
##
## automatic manual
## 10.4 2 0
## 13.3 1 0
## 14.3 1 0
## 14.7 1 0
## 15 0 1
## 15.2 2 0
We see that the lowest miles per gallon values correspond mostly to automatic transmission cars.
tail(table(datos$mpg, datos$am))
##
## automatic manual
## 24.4 1 0
## 26 0 1
## 27.3 0 1
## 30.4 0 2
## 32.4 0 1
## 33.9 0 1
At the same time, the highest miles per gallon values correspond mostly to manual transmission cars.
If we observe the Appendix - Plot 1, we can clearly see that the manual transmission cars are more efficient in terms of distance traveled per unit of fuel than automatic cars, on average.
We are going to confirm this with a hypothesis test comparing the mpg of manual transmission cars with that of automatic transmission cars.
The null hypothesis is that the difference in mean mpg between the two transmission types is zero.
The alternative hypothesis is that the difference is not zero; that is, one of the two types of cars is more efficient than the other in distance traveled per unit of fuel.
datos.automatic <- subset(datos, as.character(am) == "automatic")
datos.manual <- subset(datos, as.character(am) == "manual")
datos.ht <- t.test(datos.manual$mpg, datos.automatic$mpg, paired = FALSE, var.equal = FALSE)
datos.ht$conf.int
## [1] 3.209684 11.280194
## attr(,"conf.level")
## [1] 0.95
datos.ht$p.value
## [1] 0.001373638
Given that the p-value is less than 0.05 (5%) and the 95% confidence interval (3.21 to 11.28 mpg) does not contain zero, we reject the null hypothesis in favor of the alternative hypothesis.
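The same Welch test can also be written with the formula interface, which avoids building the two subsets by hand. This is a sketch, not the code used above; `welch` is an illustrative name. With the formula interface the difference is taken as first level minus second (automatic minus manual), so the confidence interval changes sign while the p-value stays the same.

```r
datos <- mtcars
datos$am <- factor(datos$am, labels = c("automatic", "manual"))

# Welch two-sample t-test (var.equal = FALSE is the default).
welch <- t.test(mpg ~ am, data = datos)

welch$p.value   # same p-value as the two-vector call
welch$conf.int  # interval for mean(automatic) - mean(manual)
welch$estimate  # the two group means
```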
We have confirmed that the fuel efficiency of manual transmission cars differs from that of automatic transmission cars. Taken in isolation, manual transmission cars cover between 3.2 and 11.3 more miles per gallon, with 95% confidence.
Other covariates, such as weight, number of cylinders and horsepower, may well account for part of this difference.
We first fit a linear regression using am as the only covariate.
fit.slm <- lm(mpg ~ am, data = datos)
summary(fit.slm)
##
## Call:
## lm(formula = mpg ~ am, data = datos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## ammanual 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
We can draw 2 interesting conclusions:
Manual transmission cars travel, on average, 7.245 mpg more than automatic transmission cars, for a group mean of 17.147 + 7.245 = 24.392 mpg. This is much higher than the automatic mean of 17.147 mpg.
Although the coefficient is statistically significant (p = 0.000285), the model is a poor description of the data as a whole: it explains only 36% of the variation in mpg (R-squared = 0.3598).
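With a single two-level factor as covariate, the fitted coefficients are the group means in disguise: the intercept is the mean of the reference level (automatic) and the ammanual coefficient is the difference between the two means. A small sketch that checks this; `medias` and `fit` are illustrative names.

```r
datos <- mtcars
datos$am <- factor(datos$am, labels = c("automatic", "manual"))

# Mean mpg per transmission type.
medias <- tapply(datos$mpg, datos$am, mean)

# Simple linear regression with am as the only covariate.
fit <- lm(mpg ~ am, data = datos)

# Intercept equals the automatic mean; slope equals the difference in means.
round(medias, 3)
round(coef(fit), 3)
round(unname(medias["manual"] - medias["automatic"]), 3)
```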
We now include all of the covariates of the data set in order to study their impact on the linear regression model. Beforehand, I convert the categorical ones (cyl, vs, gear and carb) into factors.
datos$cyl <- relevel(as.factor(datos$cyl), ref = '8')
datos$vs <- relevel(as.factor(datos$vs), ref = '0')
datos$gear <- relevel(as.factor(datos$gear), ref = '3')
datos$carb <- relevel(as.factor(datos$carb), ref = '2')
fit.mlm <- lm(mpg ~ ., data = datos)
summary(fit.mlm)
##
## Call:
## lm(formula = mpg ~ ., data = datos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5087 -1.3584 -0.0948 0.7745 4.6251
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.56362 17.74315 1.272 0.2229
## cyl4 0.33616 7.15954 0.047 0.9632
## cyl6 -2.31253 5.49944 -0.421 0.6801
## disp 0.03555 0.03190 1.114 0.2827
## hp -0.07051 0.03943 -1.788 0.0939 .
## drat 1.18283 2.48348 0.476 0.6407
## wt -4.52978 2.53875 -1.784 0.0946 .
## qsec 0.36784 0.93540 0.393 0.6997
## vs1 1.93085 2.87126 0.672 0.5115
## ammanual 1.21212 3.21355 0.377 0.7113
## gear4 1.11435 3.79952 0.293 0.7733
## gear5 2.52840 3.73636 0.677 0.5089
## carb1 0.97935 2.31797 0.423 0.6787
## carb3 3.97899 3.85764 1.031 0.3187
## carb4 2.07078 3.49316 0.593 0.5621
## carb6 5.45692 5.82857 0.936 0.3640
## carb8 8.22977 7.84852 1.049 0.3110
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.833 on 15 degrees of freedom
## Multiple R-squared: 0.8931, Adjusted R-squared: 0.779
## F-statistic: 7.83 on 16 and 15 DF, p-value: 0.000124
datos.origen <- mtcars
cor(datos.origen)[1,]
## mpg cyl disp hp drat wt
## 1.0000000 -0.8521620 -0.8475514 -0.7761684 0.6811719 -0.8676594
## qsec vs am gear carb
## 0.4186840 0.6640389 0.5998324 0.4802848 -0.5509251
As we suspected, including all of the covariates is not a good idea. The am coefficient drops from 7.245 to 1.212 and is far from significant (p = 0.71): with 17 coefficients (including the intercept) estimated from only 32 observations, the standard errors are too large for the estimate to be trusted. R-squared rises to 89%, but adjusted R-squared is only 0.779.
With regard to the correlation (computed on the original numeric data), mpg shows a strong relationship with cyl, disp, hp and wt. These 4 covariates, together with am, are the candidates for our own linear regression model.
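To make the ranking explicit, the correlations with mpg can be sorted by absolute value. A sketch on the original numeric mtcars; `cors` and `cors.sorted` are illustrative names.

```r
# Correlation of every variable with mpg, on the numeric data.
cors <- cor(mtcars)["mpg", ]

# Drop mpg itself and sort by strength, ignoring the sign.
cors <- cors[names(cors) != "mpg"]
cors.sorted <- cors[order(abs(cors), decreasing = TRUE)]
round(cors.sorted, 3)
```

The first four entries are the covariates singled out above.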
Using these findings, I am going to build our own linear regression model.
fit.jclm <- lm(mpg ~ am + cyl + disp + hp + wt, data = datos)
summary(fit.jclm)
##
## Call:
## lm(formula = mpg ~ am + cyl + disp + hp + wt, data = datos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9374 -1.3347 -0.3903 1.1910 5.0757
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.146495 4.144394 7.515 7.2e-08 ***
## ammanual 1.806099 1.421079 1.271 0.2155
## cyl4 2.717781 2.898149 0.938 0.3573
## cyl6 -0.418285 2.243037 -0.186 0.8536
## disp 0.004088 0.012767 0.320 0.7515
## hp -0.032480 0.013983 -2.323 0.0286 *
## wt -2.738695 1.175978 -2.329 0.0282 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.453 on 25 degrees of freedom
## Multiple R-squared: 0.8664, Adjusted R-squared: 0.8344
## F-statistic: 27.03 on 6 and 25 DF, p-value: 8.861e-10
My initial conclusion is that I can dispense with the covariate disp: its coefficient, +0.004 mpg per additional cubic inch of displacement, is almost imperceptible and far from significant (p = 0.75).
My final proposal is to include the covariates cyl, hp and wt (in addition to am) in the linear regression model. Now we are going to see whether this coincides with the automatic selection made by R's step function.
At this point, we let step search for the best-fitting model by stepwise selection on AIC.
fit.blm <- step(lm(mpg ~ ., data = datos), trace = 0)
summary(fit.blm)
##
## Call:
## lm(formula = mpg ~ cyl + hp + wt + am, data = datos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9387 -1.2560 -0.4013 1.1253 5.0513
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.54465 3.88461 8.120 1.34e-08 ***
## cyl4 2.16368 2.28425 0.947 0.35225
## cyl6 -0.86767 1.71921 -0.505 0.61803
## hp -0.03211 0.01369 -2.345 0.02693 *
## wt -2.49683 0.88559 -2.819 0.00908 **
## ammanual 1.80921 1.39630 1.296 0.20646
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared: 0.8659, Adjusted R-squared: 0.8401
## F-statistic: 33.57 on 5 and 26 DF, p-value: 1.506e-10
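When no scope argument is given, step() performs backward elimination, dropping one term at a time as long as the AIC decreases. The sketch below compares the AIC of the full and the selected model, assuming the same factor conversions as above; `fit.full` and `fit.sel` are illustrative names.

```r
datos <- mtcars
datos$am   <- factor(datos$am, labels = c("automatic", "manual"))
datos$cyl  <- relevel(factor(datos$cyl),  ref = "8")
datos$vs   <- relevel(factor(datos$vs),   ref = "0")
datos$gear <- relevel(factor(datos$gear), ref = "3")
datos$carb <- relevel(factor(datos$carb), ref = "2")

fit.full <- lm(mpg ~ ., data = datos)
fit.sel  <- step(fit.full, trace = 0)

# Lower AIC is better; the selected model should not be worse than the full one.
AIC(fit.full, fit.sel)
formula(fit.sel)
```

Stepwise selection is a heuristic search, so the result is a good candidate rather than a guaranteed optimum.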
The model fits much better when the covariates cyl (number of cylinders), hp (horsepower) and wt (weight) are included. 87% of the variability in mpg is explained by the model, a more than acceptable amount.
Manual transmission cars are estimated to get 1.81 mpg more than automatic transmission cars, with the rest of the covariates held constant. This estimate is not statistically significant (p = 0.206), so the data are also compatible with no difference once cylinders, horsepower and weight are accounted for.
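A confidence interval makes the uncertainty around the 1.81 mpg estimate visible. This is a minimal sketch that refits the selected model directly rather than through step; `fit` and `ci.am` are illustrative names.

```r
datos <- mtcars
datos$am  <- factor(datos$am, labels = c("automatic", "manual"))
datos$cyl <- relevel(factor(datos$cyl), ref = "8")

fit <- lm(mpg ~ cyl + hp + wt + am, data = datos)

# 95% confidence interval for the adjusted manual-vs-automatic difference.
ci.am <- confint(fit, "ammanual", level = 0.95)
ci.am
```

An interval that contains zero is consistent with the p-value of 0.206 reported in the summary.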
We can reach other conclusions like the following: when weight increases by 1,000 pounds, mileage decreases by about 2.5 mpg, with the rest of the covariates held constant.
Likewise, 4-cylinder cars are estimated to get about 2.2 mpg more than 8-cylinder cars, with the rest of the covariates held constant, although this coefficient is not significant (p = 0.35).
Our conclusions coincide with the model selected by R's step function. All that is left, therefore, is to compare the models that have been generated. To keep the comparison clear, we use only the simplest model and the optimized model. If the optimized model fits significantly better, we can consider it superior.
We are going to test whether the multivariate model fits better than the simple linear regression model. If it does, we can consider it better adjusted and closer to the underlying structure of the analyzed data.
anova(fit.slm, fit.blm)
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ cyl + hp + wt + am
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 26 151.03 4 569.87 24.527 1.688e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In this context, the null hypothesis is that the additional covariates (cyl, hp and wt) do not improve the fit, and the alternative is that they do. We can reject the null hypothesis because the resulting p-value (1.688e-08) is far below 5% (0.05). We conclude that the optimized model fits the data significantly better than the simple one.
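The F statistic in the table can be reproduced from the residual sums of squares of the two nested models: F = ((RSS1 - RSS2) / (df1 - df2)) / (RSS2 / df2). A sketch of that arithmetic; `fit.small`, `fit.big` and the other names are illustrative.

```r
datos <- mtcars
datos$am  <- factor(datos$am, labels = c("automatic", "manual"))
datos$cyl <- relevel(factor(datos$cyl), ref = "8")

fit.small <- lm(mpg ~ am, data = datos)
fit.big   <- lm(mpg ~ cyl + hp + wt + am, data = datos)

# Residual sum of squares and residual degrees of freedom of each model.
rss1 <- sum(resid(fit.small)^2); df1 <- df.residual(fit.small)
rss2 <- sum(resid(fit.big)^2);   df2 <- df.residual(fit.big)

# Partial F test for nested models.
f.stat <- ((rss1 - rss2) / (df1 - df2)) / (rss2 / df2)
p.val  <- pf(f.stat, df1 - df2, df2, lower.tail = FALSE)
c(F = f.stat, p = p.val)
```

With the values from the table this is (569.87 / 4) / (151.03 / 26), which gives the reported F of about 24.5.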
In the Appendix - Plot 2 we can see its residuals, which do not appear to present any anomalies.
Plot 1 - Distance covered in mpg, comparing the type of transmission, automatic vs. manual.
library(ggplot2)
g <- ggplot(datos, aes(x = am, y = mpg))
g <- g + geom_boxplot(aes(fill = am))
g <- g + labs(title = "Distance traveled (mpg) by automatic vs. manual transmission cars")
g <- g + labs(x = "Transmission", y = "Distance (mpg)")
g
NOTE: Taken in isolation (considering only the 2 variables mpg and am), manual transmission cars make better use of fuel than automatic transmission cars; that is, they cover more miles per gallon.
Plot 2 - Representation of Residuals
par(mfrow = c(2,2))
plot(fit.blm)
NOTE: The 4 plots look reasonable, and nothing seems out of the ordinary. The points in the Q-Q plot follow the line closely and there do not seem to be any anomalies. I don’t see the need for further diagnostics.
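Should a numerical check ever be wanted to back up this visual impression, a Shapiro-Wilk test on the residuals and Cook's distances are two common options. This is an optional sketch, not part of the original analysis; `fit`, `sw` and `cook` are illustrative names, and the model is refitted directly rather than through step.

```r
datos <- mtcars
datos$am  <- factor(datos$am, labels = c("automatic", "manual"))
datos$cyl <- relevel(factor(datos$cyl), ref = "8")
fit <- lm(mpg ~ cyl + hp + wt + am, data = datos)

# Normality of residuals: a small p-value would cast doubt on normality.
sw <- shapiro.test(resid(fit))
sw$p.value

# Influence: the three observations with the largest Cook's distance.
cook <- cooks.distance(fit)
head(sort(cook, decreasing = TRUE), 3)
```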