Executive Summary

This work examines the mtcars dataset, focusing on the relationship between miles per gallon (MPG) and the other variables. We use these relationships to address two questions: (I) how automatic or manual transmission affects MPG, and (II) how large the MPG difference between automatic and manual transmissions is. We conclude that manual transmission performs better than automatic transmission, with a mean MPG about 40% higher in the unadjusted comparison.

Exploratory Analysis

View the first 10 cars.

head(mtcars, 10)

Plot the data to visualize the distribution of the values.

plot(mtcars)

Summarize all variables.

library(knitr)
data(mtcars)
kable(summary(mtcars[1:5]))
kable(summary(mtcars[6:11]))
mpg             cyl             disp            hp              drat
Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0   Min.   :2.760
1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5   1st Qu.:3.080
Median :19.20   Median :6.000   Median :196.3   Median :123.0   Median :3.695
Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7   Mean   :3.597
3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0   3rd Qu.:3.920
Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0   Max.   :4.930

wt              qsec            vs               am               gear            carb
Min.   :1.513   Min.   :14.50   Min.   :0.0000   Min.   :0.0000   Min.   :3.000   Min.   :1.000
1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000
Median :3.325   Median :17.71   Median :0.0000   Median :0.0000   Median :4.000   Median :2.000
Mean   :3.217   Mean   :17.85   Mean   :0.4375   Mean   :0.4062   Mean   :3.688   Mean   :2.812
3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000
Max.   :5.424   Max.   :22.90   Max.   :1.0000   Max.   :1.0000   Max.   :5.000   Max.   :8.000

Inspect the data types:

class(c(mtcars$mpg, mtcars$cyl, mtcars$drat, mtcars$gear, mtcars$wt, mtcars$qsec, mtcars$am))
## [1] "numeric"
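Because `c()` concatenates the columns into a single vector, the call above reports only one class for all of them together. A per-column check could look like the following minimal sketch, which uses base R's `sapply()` and `str()`; the name `col_classes` is an illustrative choice and not part of the original analysis.

```r
# Start from the unmodified dataset shipped with R
cars <- datasets::mtcars

# Report the class of every column separately
col_classes <- sapply(cars, class)
print(col_classes)

# str() gives a compact overview of types and first values
str(cars)
```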

Several variables are stored as numeric but take only a few distinct values, and the sample is small (32 cars), so we convert them to factors to improve the analysis:

mtcars$cyl  <- factor(mtcars$cyl)
mtcars$vs   <- factor(mtcars$vs)
mtcars$gear <- factor(mtcars$gear)
mtcars$carb <- factor(mtcars$carb)
mtcars$am   <- factor(mtcars$am,labels=c("Automatic","Manual"))
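A quick way to confirm that the labels were attached to the intended codes (0 = Automatic, 1 = Manual) is to inspect the factor levels and the group sizes. This is a sketch on a local copy, `cars`, which is a name introduced here for illustration only.

```r
# Work on a copy so the check does not depend on earlier steps
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

# Level order follows the original 0/1 coding
levels(cars$am)

# Number of cars per transmission type
am_counts <- table(cars$am)
print(am_counts)
```

The first level of a factor becomes the reference level in `lm()`, so here every transmission coefficient is expressed relative to automatic cars.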

Dataset Analysis

Apply linear regression to identify the effect of the predictor variables on MPG.

fit1 <- lm(mpg ~ am, data = mtcars)
summary(fit1)
## 
## Call:
## lm(formula = mpg ~ am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## amManual       7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

The results indicate that automatic vehicles average 17.1 MPG, whereas manual vehicles average 7.2 MPG more (about 24.4 MPG). The R-squared is 0.36 (adjusted 0.34), so transmission type alone explains only about 36% of the variance in MPG. To gain explanatory power, we fit a model with multiple predictors.
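With a single two-level factor as predictor, the regression coefficients are just the group means in another form. The sketch below illustrates this on a local copy; the names `cars`, `fit` and `group_means` are illustrative and not taken from the original analysis.

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

fit <- lm(mpg ~ am, data = cars)
group_means <- tapply(cars$mpg, cars$am, mean)

# Intercept = mean MPG of the reference level (Automatic)
coef(fit)[["(Intercept)"]]
group_means[["Automatic"]]

# Intercept + amManual = mean MPG of the manual cars
coef(fit)[["(Intercept)"]] + coef(fit)[["amManual"]]
group_means[["Manual"]]

# 95% confidence interval for the difference between the groups
confint(fit, "amManual")
```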

fit2 <- lm(mpg~am + cyl + disp + hp + wt, data = mtcars)
summary(fit2)
## 
## Call:
## lm(formula = mpg ~ am + cyl + disp + hp + wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9374 -1.3347 -0.3903  1.1910  5.0757 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.864276   2.695416  12.564 2.67e-12 ***
## amManual     1.806099   1.421079   1.271   0.2155    
## cyl6        -3.136067   1.469090  -2.135   0.0428 *  
## cyl8        -2.717781   2.898149  -0.938   0.3573    
## disp         0.004088   0.012767   0.320   0.7515    
## hp          -0.032480   0.013983  -2.323   0.0286 *  
## wt          -2.738695   1.175978  -2.329   0.0282 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.453 on 25 degrees of freedom
## Multiple R-squared:  0.8664, Adjusted R-squared:  0.8344 
## F-statistic: 27.03 on 6 and 25 DF,  p-value: 8.861e-10

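This model has a much higher adjusted R-squared than the previous one: it explains about 83% of the variance in MPG. Once the other variables are included, the estimated manual-transmission effect shrinks to 1.8 MPG and is no longer significant at the 5% level (p = 0.2155). To confirm the gain in fit, we compare the two models with an ANOVA test.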
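The adjusted R-squared reported by `summary()` penalizes the plain R-squared for the number of estimated slopes. As a sketch, it can be recomputed by hand from 1 - (1 - R^2)(n - 1)/(n - p - 1); the names `cars`, `fit`, `s`, `n`, `p` and `adj_manual` are introduced here for illustration.

```r
cars <- datasets::mtcars
cars$am  <- factor(cars$am, labels = c("Automatic", "Manual"))
cars$cyl <- factor(cars$cyl)

fit <- lm(mpg ~ am + cyl + disp + hp + wt, data = cars)
s <- summary(fit)

n <- nrow(cars)              # number of observations
p <- length(coef(fit)) - 1   # estimated slopes (cyl contributes two)

# Adjusted R-squared computed by hand
adj_manual <- 1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)

# Compare with the value stored in the summary object
c(manual = adj_manual, reported = s$adj.r.squared)
```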

options(scipen = 999)
anova(fit1, fit2)

The p-value of 0.00000008636804 (8.637e-08) indicates that the second model has significantly more explanatory power than the first.
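For two nested linear models, `anova()` performs an F-test on the reduction in residual sum of squares. The sketch below recomputes the statistic by hand and compares it with the `anova()` table; the model names `small` and `large` and the other variable names are illustrative, not taken from the original code.

```r
cars <- datasets::mtcars
cars$am  <- factor(cars$am, labels = c("Automatic", "Manual"))
cars$cyl <- factor(cars$cyl)

small <- lm(mpg ~ am, data = cars)
large <- lm(mpg ~ am + cyl + disp + hp + wt, data = cars)

# Residual sums of squares and the number of extra parameters
rss_small <- sum(residuals(small)^2)
rss_large <- sum(residuals(large)^2)
df_diff   <- df.residual(small) - df.residual(large)

# F statistic: drop in RSS per extra parameter, relative to the
# residual variance of the larger model
f_stat <- ((rss_small - rss_large) / df_diff) /
  (rss_large / df.residual(large))
p_val <- pf(f_stat, df_diff, df.residual(large), lower.tail = FALSE)

# Compare with the values reported by anova()
tab <- anova(small, large)
c(f_manual = f_stat, f_anova = tab$F[2])
c(p_manual = p_val, p_anova = tab[["Pr(>F)"]][2])
```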

Appendix

boxplot(mpg ~ am, data = mtcars, col = c("yellow", "pink"), xlab = "Transmission Type", ylab = "MPG", main = "Distribution of MPG by Transmission")

From this plot we conclude that manual transmission performs better than automatic: its MPG distribution sits above that of the automatic transmission.

We retrieve the mean MPG for the two transmission types to quantify the difference between them.

ag = aggregate(mtcars$mpg,by=list(mtcars$am),FUN=mean)
ag
ag$x[2]/ag$x[1]
## [1] 1.42251

We now conclude that the manual transmission is (on average) 42% more efficient than the automatic transmission.
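The ratio above compares raw group means. As a hedged sketch, the same ratio can be obtained with `tapply()`, and a Welch two-sample t-test (an addition not present in the original analysis) gives a rough idea of whether the unadjusted difference could be due to chance. The names `cars`, `means`, `ratio` and `tt` are illustrative.

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

# Mean MPG per transmission type and their ratio
means <- tapply(cars$mpg, cars$am, mean)
ratio <- means[["Manual"]] / means[["Automatic"]]
ratio

# Welch t-test on the unadjusted difference in means
tt <- t.test(mpg ~ am, data = cars)
tt
```

This comparison does not adjust for weight, horsepower or the number of cylinders, so it describes the observed difference between the two groups rather than the isolated effect of the transmission.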

boxplot(mpg ~ cyl, data=mtcars, col=(c("yellow","pink", "red")), xlab="Cylinders", ylab="MPG", main="Distribution Cylinders vs. MPG")

# disp, hp and wt are continuous, so scatterplots are used instead of boxplots
plot(mpg ~ disp, data = mtcars, col = "green", xlab = "Displacement", ylab = "MPG", main = "MPG vs. Displacement")

plot(mpg ~ hp, data = mtcars, xlab = "HP", ylab = "MPG", main = "MPG vs. HP")

plot(mpg ~ wt, data = mtcars, col = "red", xlab = "WT", ylab = "MPG", main = "MPG vs. WT")

Check Correlations

# am and cyl are factors at this point; they enter as integer level codes,
# which leaves the Pearson correlations unchanged
mtcars_used <- data.frame(MPG = mtcars$mpg, AM = as.integer(mtcars$am), CYL = as.integer(mtcars$cyl), DISP = mtcars$disp, HP = mtcars$hp, WT = mtcars$wt)
head(mtcars_used)
pairs(mtcars_used, panel = panel.smooth, col = 9)

cor(mtcars_used)
##             MPG         AM        CYL       DISP         HP         WT
## MPG   1.0000000  0.5998324 -0.8521620 -0.8475514 -0.7761684 -0.8676594
## AM    0.5998324  1.0000000 -0.5226070 -0.5912270 -0.2432043 -0.6924953
## CYL  -0.8521620 -0.5226070  1.0000000  0.9020329  0.8324475  0.7824958
## DISP -0.8475514 -0.5912270  0.9020329  1.0000000  0.7909486  0.8879799
## HP   -0.7761684 -0.2432043  0.8324475  0.7909486  1.0000000  0.6587479
## WT   -0.8676594 -0.6924953  0.7824958  0.8879799  0.6587479  1.0000000

Check Residuals

First Model

par(mfrow = c(2,2))
plot(fit1)

Second Model

par(mfrow = c(2,2))
plot(fit2)

Results

In the multivariate model the residuals appear approximately normally distributed, which supports the quality of the fitted model.
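The visual impression from the diagnostic plots can be complemented with a formal check. The sketch below applies the Shapiro-Wilk test from base R to the residuals of the multivariate model; this test is an addition and not part of the original analysis, and the names `cars`, `fit`, `res` and `sw` are illustrative.

```r
cars <- datasets::mtcars
cars$am  <- factor(cars$am, labels = c("Automatic", "Manual"))
cars$cyl <- factor(cars$cyl)

fit <- lm(mpg ~ am + cyl + disp + hp + wt, data = cars)
res <- residuals(fit)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
sw <- shapiro.test(res)
sw

# Normal Q-Q plot of the residuals as a visual counterpart
qqnorm(res)
qqline(res)
```

A small p-value would argue against normality. With only 32 observations the test has limited power, so it is best read together with the Q-Q plot.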