Executive Summary

This document examines a data set of cars and explores the relationship between a set of variables and miles per gallon (MPG), the outcome.

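There is a particular interest in the following two questions:

  • Is an automatic or manual transmission better for MPG?
  • How large is the MPG difference between automatic and manual transmissions?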

Key takeaways:

  • Regarding the first question: Manual Transmissions are better for MPG.
  • Regarding the second question: once weight (wt) and quarter-mile time (qsec) are held constant, Manual Transmissions are better by about 2.94 MPG. Without any adjustment, the mean MPG of Manual cars is higher by 7.24.

Data Analysis

At first glance, it seems that Manual Transmission cars have higher MPG than Automatic cars. Let’s try to quantify it. For the code and graphics, go to the Appendix section, Plot 1.

Linear Model

fit <- lm(mpg ~ am, data = mtcars)
summary(fit)$coefficients
summary(fit)$r.squared
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 17.147368   1.124603 15.247492 1.133983e-15
## am           7.244939   1.764422  4.106127 2.850207e-04
## [1] 0.3597989

Conclusions:

  • The average MPG of Automatic Transmission cars is 17.15 (the intercept).
  • The difference in empirical mean MPG between Manual and Automatic cars is 7.24 (the am coefficient).
  • Is Automatic different from Manual? The p-value of about 0.0003 (less than 0.05) indicates a statistically significant difference in MPG between cars with Automatic and Manual Transmission.
  • The R-squared value tells us that only 36% of the variability in MPG values is explained by Transmission alone.
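The first two conclusions follow from how a 0/1 predictor is coded: the intercept is the mean of the group coded 0 and the slope is the difference between the two group means. A minimal sketch of that check, assuming the unmodified `datasets::mtcars` (numeric `am`, 0 = Automatic, 1 = Manual); the names `cars`, `mean_auto` and `mean_manual` are introduced here for illustration only:

```r
# Work on a copy so any factor conversions done elsewhere are not affected.
cars <- datasets::mtcars

fit_simple <- lm(mpg ~ am, data = cars)

# Empirical group means (am is numeric here: 0 = Automatic, 1 = Manual).
mean_auto   <- mean(cars$mpg[cars$am == 0])
mean_manual <- mean(cars$mpg[cars$am == 1])

intercept <- coef(fit_simple)[["(Intercept)"]]
slope     <- coef(fit_simple)[["am"]]

# Both comparisons should report TRUE (up to floating-point tolerance).
all.equal(intercept, mean_auto)
all.equal(slope, mean_manual - mean_auto)
```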

Multivariate Linear Model

fullModel <- lm(mpg ~ ., data=mtcars)
summary(fullModel)$adj.r.squared
sum(summary(fullModel)$coef[,4] < 0.05)
## [1] 0.8066423
## [1] 0

Although the Adjusted R-squared is 0.81, no individual predictor stands out, since none of the p-values are lower than 0.05. For more details on the coefficients, check Appendix: Full Model Coefficients.

stepModel <- step(fullModel, k=log(nrow(mtcars)))
summary(stepModel)$coefficients
summary(stepModel)$r.squared
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)  9.617781  6.9595930  1.381946 1.779152e-01
## wt          -3.916504  0.7112016 -5.506882 6.952711e-06
## qsec         1.225886  0.2886696  4.246676 2.161737e-04
## am           2.935837  1.4109045  2.080819 4.671551e-02
## [1] 0.8496636

Much better! The variables (wt + qsec + am) explain 85% of the variability in MPG values. All three predictors are statistically significant at the 0.05 level (the intercept is not), and wt has the smallest p-value, which suggests it is the most explanatory variable.
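Passing `k = log(n)` to `step()` replaces the default AIC penalty with the BIC penalty, so the selection favours smaller models. The sketch below, again assuming the unmodified `datasets::mtcars`, repeats the selection silently and compares `BIC()` for the two models; the names `cars`, `full_model` and `bic_model` are illustrative and not part of the original analysis:

```r
# Work on a copy of the built-in data set (all columns numeric).
cars <- datasets::mtcars
n <- nrow(cars)

full_model <- lm(mpg ~ ., data = cars)

# k = log(n) is the BIC penalty; trace = 0 suppresses the step-by-step log.
bic_model <- step(full_model, k = log(n), trace = 0)

# Terms kept by the selection.
formula(bic_model)

# Lower BIC is better; the reduced model should come out lower.
c(full = BIC(full_model), reduced = BIC(bic_model))
```

The criterion that `step()` prints differs from `BIC()` by an additive constant, so the ranking of models is the same even though the raw numbers are not.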

Residuals Analysis

For details on the plot, check Appendix: Residuals Plot.

  • Residuals vs. Fitted: Shows no consistent pattern, which supports the independence assumption. No other patterns (exponential, sinusoidal) are recognizable, so no further exploration is suggested.
  • Normal Q-Q: The residuals are approximately normally distributed, because the points lie close to the line. The further from the mean, the more the points spread away from it.
  • Scale-Location: Supports the constant variance assumption, as the points are randomly distributed.
  • Residuals vs. Leverage: Shows that no influential outliers are present, as all values fall well within the 0.5 Cook's distance bands.
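The visual checks above can be complemented with numbers. The following is only a sketch: it refits the reduced model on the unmodified `datasets::mtcars`, lists the largest Cook's distances, and runs a Shapiro-Wilk test on the residuals. The Shapiro-Wilk test is an addition that the original analysis does not use, and the names `cars`, `reduced_model` and `cooks` are illustrative:

```r
# Refit the reduced model on a copy of the built-in data set.
cars <- datasets::mtcars
reduced_model <- lm(mpg ~ wt + qsec + am, data = cars)

# Cook's distance per car; the 0.5 contour in the leverage plot refers to this.
cooks <- cooks.distance(reduced_model)
head(sort(cooks, decreasing = TRUE), 3)

# Shapiro-Wilk test: a small p-value would indicate non-normal residuals.
shapiro.test(residuals(reduced_model))
```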

Inference

Taking as the Null Hypothesis that cars with Manual and Automatic Transmission come from the same population, the t-test p-value of 0.00137 leads us to reject it. So, the automatic and manual transmissions are from different populations.

Appendix

Data Loading and Summarization:

data(mtcars)
mtcars$cyl <- factor(mtcars$cyl)
mtcars$vs <- factor(mtcars$vs)
mtcars$am <- factor(mtcars$am)
levels(mtcars$am) <- c("Automatic", "Manual")
mtcars$gear <- factor(mtcars$gear)
mtcars$carb <- factor(mtcars$carb)
summary(mtcars$mpg)
table(mtcars$am)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   15.42   19.20   20.09   22.80   33.90 
## 
## Automatic    Manual 
##        19        13

Plot 1: MPG by Transmission Boxplot

library(ggplot2)
g <- ggplot(mtcars, aes(x=am,y=mpg, fill=am)) + geom_boxplot()
g <- g + xlab("Transmission") + ylab("Miles per Gallon")
g <- g + ggtitle("Miles Per Gallon by Transmission \n (Plot 1)")
g

Full Model Coefficients

summary(fullModel)$coef
##                Estimate  Std. Error    t value   Pr(>|t|)
## (Intercept) 12.30337416 18.71788443  0.6573058 0.51812440
## cyl         -0.11144048  1.04502336 -0.1066392 0.91608738
## disp         0.01333524  0.01785750  0.7467585 0.46348865
## hp          -0.02148212  0.02176858 -0.9868407 0.33495531
## drat         0.78711097  1.63537307  0.4813036 0.63527790
## wt          -3.71530393  1.89441430 -1.9611887 0.06325215
## qsec         0.82104075  0.73084480  1.1234133 0.27394127
## vs           0.31776281  2.10450861  0.1509915 0.88142347
## am           2.52022689  2.05665055  1.2254035 0.23398971
## gear         0.65541302  1.49325996  0.4389142 0.66520643
## carb        -0.19941925  0.82875250 -0.2406258 0.81217871

How do we interpret the p-values?

  • A predictor that has a low p-value is likely to be a meaningful addition to our model, because changes in the predictor’s value are related to changes in the response variable once the other predictors are held fixed.
  • A larger p-value means the data do not provide enough evidence that changes in the predictor are associated with changes in the response. It does not prove that there is no association.

Residuals Plot:

plot(stepModel)

Inference

result <- t.test(mtcars$mpg ~ mtcars$am)
result$p.value
## [1] 0.001373638
result$estimate
## mean in group Automatic    mean in group Manual 
##                17.14737                24.39231
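The same test object also carries a confidence interval for the difference in means, which quantifies the gap rather than only rejecting the null. A minimal sketch, assuming the unmodified `datasets::mtcars` where `am` is numeric, so the groups are labelled 0 (Automatic) and 1 (Manual) and the difference is Automatic minus Manual; the names `cars` and `welch` are illustrative:

```r
# Welch two-sample t-test on a copy of the built-in data set.
cars <- datasets::mtcars
welch <- t.test(mpg ~ am, data = cars)

# 95% confidence interval for mean(Automatic) - mean(Manual).
# An interval entirely below zero means Manual cars have the higher mean MPG.
welch$conf.int
```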