This document is the final report for the Peer Assessment project of Coursera's Regression Models course, part of the Data Science Specialization. It was written in RStudio using knitr and is meant to be published in PDF format.
This analysis is intended as a study for Motor Trend, a magazine about the automotive industry. Using a dataset of cars (mtcars), we explore the relationship between a set of variables described below and fuel autonomy in miles per gallon (MPG), which is the outcome. We are particularly interested in exploring:
To answer these questions, we performed a brief exploratory data analysis and then used hypothesis testing and linear regression to make the necessary inferences. Both simple and multivariable linear regression were used, supported by a stepwise, AIC-based selection of the variables to include in the final model. Using this model selection strategy, we found that:
For this analysis we use the mtcars dataset, which was extracted from the 1974 Motor Trend US magazine and comprises fuel autonomy and 10 other aspects of automobile design and performance for 32 automobiles (1973-74 models). The table below briefly describes the variables in the dataset:
| column | variable | description | unit |
|---|---|---|---|
| [, 1] | mpg | fuel autonomy | miles/US gallon |
| [, 2] | cyl | number of cylinders | number |
| [, 3] | disp | displacement | cu.in. |
| [, 4] | hp | gross power | horsepower |
| [, 5] | drat | rear axle ratio | ratio |
| [, 6] | wt | car weight | lb/1000 |
| [, 7] | qsec | 1/4 mile time | seconds |
| [, 8] | vs | engine type | 0 = V engine, 1 = straight engine |
| [, 9] | am | transmission | 0 = automatic, 1 = manual |
| [,10] | gear | forward gears | number |
| [,11] | carb | carburetors | number |
We first load the R libraries that are necessary for the analysis.
rm(list=ls()) # clear the workspace before starting the analysis
setwd("~/Cursos/Data Science/07 Regression Models/Projetos")
library(knitr)
library(ggplot2)
library(GGally)
library(datasets)
library(MASS)
The next step is to load the dataset. Its first six rows are shown below.
data(mtcars)
kable(head(mtcars),align = 'c')
| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
| Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
| Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
| Valiant | 18.1 | 6 | 225 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
Data overview
We start with a quick look at MPG for cars with automatic and manual transmissions. Summary statistics and a boxplot of both groups are shown below.
trAutom <- mtcars$mpg[mtcars$am == 0]
trManual <- mtcars$mpg[mtcars$am == 1]
summary(trAutom)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.40 14.95 17.30 17.15 19.20 24.40
summary(trManual)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15.00 21.00 22.80 24.39 30.40 33.90
ggplot(mtcars,aes(y = mpg, x = factor(am), fill=factor(am))) +
geom_boxplot() + geom_jitter() +
ggtitle("Fuel Autonomy (in miles/US gallon)") +
ylab("mpg") + xlab("transmission type") +
scale_x_discrete(breaks=NULL) +
scale_fill_discrete(name="transmission type", labels=c("automatic","manual"))
t Test
To check for a significant difference in MPG between automatic and manual transmissions (and so justify further analysis), we ran a t test on the data, preceded by an F test comparing the variances of the two groups.
var.test(trAutom,trManual)
##
## F test to compare two variances
##
## data: trAutom and trManual
## F = 0.38656, num df = 18, denom df = 12, p-value = 0.06691
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.1243721 1.0703429
## sample estimates:
## ratio of variances
## 0.3865615
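With a p-value of 0.067, the F test does not reject equality of variances at the 5% level, but the evidence is borderline, so we take the conservative route and do not assume equal variances in the t test. In fact, the t test leads to the same conclusion whether or not equal variances are assumed.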
t.test(trAutom, trManual, paired = FALSE, alternative="two.sided", var.equal=FALSE)
##
## Welch Two Sample t-test
##
## data: trAutom and trManual
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean of x mean of y
## 17.14737 24.39231
The t test p-value is 0.0014, which indicates a significant difference between the mean MPG of automatic and manual transmissions (manual transmission is 7.245 MPG higher on average).
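The Welch statistic reported above can be reproduced by hand, which makes explicit where the fractional degrees of freedom come from. This is a minimal sketch; the names `auto`, `manual`, `v_auto`, `v_manual`, `t_stat`, `df_welch` and `p_value` are illustrative and not part of the original analysis.

```r
# Recompute the Welch two-sample t test from the two MPG samples.
data(mtcars)
auto   <- mtcars$mpg[mtcars$am == 0]   # automatic transmission
manual <- mtcars$mpg[mtcars$am == 1]   # manual transmission

# Variance of each sample mean
v_auto   <- var(auto)   / length(auto)
v_manual <- var(manual) / length(manual)

# Welch t statistic: difference in means divided by its standard error
t_stat <- (mean(auto) - mean(manual)) / sqrt(v_auto + v_manual)

# Welch-Satterthwaite approximation of the degrees of freedom
df_welch <- (v_auto + v_manual)^2 /
  (v_auto^2 / (length(auto) - 1) + v_manual^2 / (length(manual) - 1))

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = df_welch)

c(t = t_stat, df = df_welch, p = p_value)
```

The three values should agree with the `t.test` output above (t = -3.7671, df = 18.332, p-value = 0.001374).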
Data Correlations
A first glimpse of the correlations of all the variables with MPG is shown in the plot matrix below.
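# wrap() passes extra arguments to each panel function
# (current GGally releases; older ones used the `params` argument instead)
ggpairs(mtcars,
lower = list(continuous = wrap("smooth_loess", colour = "blue")),
diag = list(continuous = wrap("barDiag", colour = "blue")),
upper = list(continuous = wrap("cor", size = 5)), axisLabels = 'show')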
Most of the variables show some association with MPG. For that reason, it is advisable to run a model selection procedure to identify the ones that really matter for MPG.
A first linear regression analysis, using only MPG and transmission type (am) as variables, was made to show the impact of transmission on MPG without taking the other variables into account.
trLM <- lm(mpg ~ am, data = mtcars)
summary(trLM)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147368 1.124603 15.247492 1.133983e-15
## am 7.244939 1.764422 4.106127 2.850207e-04
As the t test already suggested, the model shows a large difference in MPG in favor of manual transmission (+7.245 miles per gallon) when the other variables are not considered.
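With a single binary predictor, the slope test in this regression is the same test as a two-sample t test with pooled (equal) variances, which is why the estimate matches the difference in group means. The sketch below checks this on the data; the object names `pooled` and `slope` are illustrative.

```r
# Compare the pooled-variance t test with the slope test of mpg ~ am.
data(mtcars)
auto   <- mtcars$mpg[mtcars$am == 0]
manual <- mtcars$mpg[mtcars$am == 1]

# Two-sample t test assuming equal variances
pooled <- t.test(auto, manual, var.equal = TRUE)

# Row of the coefficient table for the transmission slope
slope <- summary(lm(mpg ~ am, data = mtcars))$coefficients["am", ]

# The t statistics agree up to sign (t.test uses automatic minus manual)
c(pooled_t = unname(pooled$statistic), lm_t = unname(slope["t value"]))

# The p-values are identical
c(pooled_p = pooled$p.value, lm_p = unname(slope["Pr(>|t|)"]))
```

This also backs the earlier remark that the equal-variance and Welch versions of the test lead to the same conclusion: the pooled p-value is the 2.85e-04 shown in the regression table, against 0.0014 for Welch.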
The correlation plots make it clear that other variables are also associated with MPG, so a multivariable regression analysis is performed below.
Including all variables we have:
trMVAR <- lm(mpg ~ . , data = mtcars)
summary(trMVAR)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.30337416 18.71788443 0.6573058 0.51812440
## cyl -0.11144048 1.04502336 -0.1066392 0.91608738
## disp 0.01333524 0.01785750 0.7467585 0.46348865
## hp -0.02148212 0.02176858 -0.9868407 0.33495531
## drat 0.78711097 1.63537307 0.4813036 0.63527790
## wt -3.71530393 1.89441430 -1.9611887 0.06325215
## qsec 0.82104075 0.73084480 1.1234133 0.27394127
## vs 0.31776281 2.10450861 0.1509915 0.88142347
## am 2.52022689 2.05665055 1.2254035 0.23398971
## gear 0.65541302 1.49325996 0.4389142 0.66520643
## carb -0.19941925 0.82875250 -0.2406258 0.81217871
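We may observe that all variables have p-values higher than 0.05, so none of them is individually significant at the 5% level when all the others are in the model. With predictors this strongly correlated, that does not mean they have no impact on MPG; it means the full model cannot separate their individual contributions.
To separate the ones that really matter, a stepwise model selection by AIC (using the stepAIC function from the MASS package) is performed.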
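One way to see how strongly the predictors overlap is the variance inflation factor (VIF), which is not part of the original analysis. The sketch below computes it in base R as the diagonal of the inverse correlation matrix of the predictors; the names `predictors` and `vif` are illustrative.

```r
# Variance inflation factors for the predictors of the full model.
data(mtcars)

# All columns except the outcome
predictors <- mtcars[, names(mtcars) != "mpg"]

# VIF_j equals the j-th diagonal element of the inverse correlation matrix,
# which is the same as 1 / (1 - R^2) from regressing predictor j on the others
vif <- diag(solve(cor(predictors)))

round(sort(vif, decreasing = TRUE), 2)
```

A VIF well above 1 means that the variance of that coefficient is inflated by correlation with the other predictors, which is consistent with the large standard errors in the table above.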
fitModel <- stepAIC(lm(mpg ~ . ,data=mtcars), direction = 'both', trace = FALSE)
fitModel
##
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
##
## Coefficients:
## (Intercept) wt qsec am
## 9.618 -3.917 1.226 2.936
According to the selection above, the variables that matter most for MPG, besides transmission type (am), are the weight of the car (wt) and the quarter mile time (qsec).
This means that the other variables either add little once those are included or are correlated with them strongly enough to be dropped, which keeps the variance of the estimates in the final model down.
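The trade-off between the full model and the selected one can be checked directly by comparing their AIC and adjusted R-squared. This is a minimal sketch; the object names `full` and `selected` are illustrative, and the selected formula is the one returned by `stepAIC` above.

```r
# Compare the full model with the model chosen by stepwise selection.
data(mtcars)
full     <- lm(mpg ~ ., data = mtcars)
selected <- lm(mpg ~ wt + qsec + am, data = mtcars)

# Lower AIC is better; AIC penalizes each extra coefficient
AIC(full, selected)

# Adjusted R-squared also accounts for the number of predictors
c(full     = summary(full)$adj.r.squared,
  selected = summary(selected)$adj.r.squared)
```

`AIC()` and the criterion that `stepAIC` prints differ by a constant for models fitted to the same data, so they rank the two models in the same order.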
The final model, relating MPG to transmission (am), weight (wt) and quarter mile time (qsec), is:
finalModel <- lm(mpg ~ factor(am) + qsec + wt, data = mtcars)
summary(finalModel)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.617781 6.9595930 1.381946 1.779152e-01
## factor(am)1 2.935837 1.4109045 2.080819 4.671551e-02
## qsec 1.225886 0.2886696 4.246676 2.161737e-04
## wt -3.916504 0.7112016 -5.506882 6.952711e-06
In this model, the impact of transmission on MPG is much smaller than in the simple regression, because weight and quarter mile time now account for part of the difference. If the other variables are kept constant, manual transmission is associated with only 2.936 more miles per gallon on average.
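A confidence interval gives a sense of how precisely this adjusted effect is estimated. The sketch below refits the final model and obtains the 95% interval for the transmission coefficient, both with `confint` and by hand; the names `ci_am`, `est`, `se` and `manual_ci` are illustrative.

```r
# 95% confidence interval for the effect of manual transmission.
data(mtcars)
finalModel <- lm(mpg ~ factor(am) + qsec + wt, data = mtcars)

# Interval reported by confint()
ci_am <- confint(finalModel, level = 0.95)["factor(am)1", ]

# Same interval by hand: estimate +/- t quantile * standard error
est <- coef(summary(finalModel))["factor(am)1", "Estimate"]
se  <- coef(summary(finalModel))["factor(am)1", "Std. Error"]
manual_ci <- est + c(-1, 1) * qt(0.975, df = df.residual(finalModel)) * se

ci_am
manual_ci
```

Since the p-value of this coefficient is 0.047, just under 0.05, the interval excludes zero only narrowly, so the size of the adjusted effect should be read with caution.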
par(mfrow = c(2, 2))
plot(finalModel)
There are no clear visual trends in the residuals of the final model, and the normal Q-Q plot shows a good normality pattern. This suggests that the model assumptions are reasonable.
In conclusion, the analysis above reinforces that:
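The visual check can be complemented with numeric influence measures, which are not part of the original analysis. The sketch below lists the cars with the largest Cook's distance and flags those above 4/n, a common rule of thumb that is used here as an assumption rather than a formal test.

```r
# Influence diagnostics for the final model.
data(mtcars)
finalModel <- lm(mpg ~ factor(am) + qsec + wt, data = mtcars)

cooks <- cooks.distance(finalModel)  # influence of each car on the fit
lev   <- hatvalues(finalModel)       # leverage of each car

# Three most influential cars
head(sort(cooks, decreasing = TRUE), 3)

# Cars above the 4/n rule-of-thumb threshold (an assumption, not a test)
names(cooks)[cooks > 4 / nrow(mtcars)]
```

Cars flagged here are worth a closer look, but a high Cook's distance alone does not mean that an observation is wrong or should be removed.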