This is the course project for the Regression Models course, part of the Data Science Specialization offered by the Johns Hopkins Bloomberg School of Public Health on Coursera.
In this work we look at a data set describing a collection of cars and explore the relationship between a set of variables and fuel consumption, measured in miles per gallon (MPG).
In this analysis we’ll try to address the following two questions:
For this analysis we’ll be using the standard R dataset mtcars. According to the R documentation, the mtcars dataset was extracted from the 1974 Motor Trend US magazine and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models), for a total of 11 variables:
data(mtcars) # loading the dataset
str(mtcars) # inspecting the data structure
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
To perform the analysis, it’s necessary to convert some of the ‘num’ variables into factors:
mtcars$am <- factor(mtcars$am,labels=c('Automatic','Manual'))
mtcars$cyl <- factor(mtcars$cyl)
mtcars$vs <- factor(mtcars$vs)
mtcars$gear <- factor(mtcars$gear)
mtcars$carb <- factor(mtcars$carb)
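A quick way to confirm that the labels landed on the right codes is sketched below. It works on a separate copy of the data (the name `cars` is hypothetical, chosen so the sketch does not touch the `mtcars` object used in this report) and relies on the fact that `factor()` sorts the numeric codes, so 0 receives the first label and 1 the second.

```r
# Work on a copy so the report's own mtcars object is left untouched
cars <- datasets::mtcars

# Cross-tabulate the raw 0/1 codes against the labelled factor
am_labelled <- factor(cars$am, labels = c("Automatic", "Manual"))
check <- table(code = cars$am, label = am_labelled)
print(check)

# Off-diagonal cells should be empty if 0 -> Automatic and 1 -> Manual
stopifnot(check["0", "Manual"] == 0, check["1", "Automatic"] == 0)
```

The same kind of cross-tabulation can be used for any of the other converted variables.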
First, is there an overall significant MPG difference between automatic and manual transmissions?
t.test(mpg~am,data=mtcars)
##
## Welch Two Sample t-test
##
## data: mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean in group Automatic mean in group Manual
## 17.14737 24.39231
There does seem to be a statistically significant difference (p-value = 0.001374) in consumption between the two types of transmission, favoring the manual type (24.39231 mpg on average) over the automatic type (17.14737 mpg on average).
So, let’s find the best model for describing consumption with the mtcars variables. We’ll first build a model that includes all variables and then use the stepwise algorithm (R’s step function, which selects by AIC) to check which variables are worth keeping in the model.
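The numbers quoted from the test can also be pulled straight out of the object returned by `t.test`, which avoids copying them by hand. This is a minimal sketch on a separate copy of the data; the names `cars`, `tt` and `diff_means` are hypothetical.

```r
# Separate copy with the same transmission labels used in this report
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

tt <- t.test(mpg ~ am, data = cars)

# Group means are stored in level order: Automatic first, Manual second
diff_means <- unname(tt$estimate[2] - tt$estimate[1])  # Manual minus Automatic

print(tt$p.value)   # p-value of the Welch test
print(tt$conf.int)  # 95% interval for Automatic minus Manual, hence negative
print(diff_means)   # unadjusted advantage of manual cars, in mpg
```

Note that the confidence interval is reported for the first level minus the second, so its sign is the opposite of `diff_means`.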
allvars <- lm(mpg ~ .,data=mtcars) # initial model with all variables
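best <- step(allvars, trace = 0) # stepwise selection by AIC, without printing each step
summary(best) # summary of the selected model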
##
## Call:
## lm(formula = mpg ~ cyl + hp + wt + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9387 -1.2560 -0.4013 1.1253 5.0513
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.70832 2.60489 12.940 7.73e-13 ***
## cyl6 -3.03134 1.40728 -2.154 0.04068 *
## cyl8 -2.16368 2.28425 -0.947 0.35225
## hp -0.03211 0.01369 -2.345 0.02693 *
## wt -2.49683 0.88559 -2.819 0.00908 **
## amManual 1.80921 1.39630 1.296 0.20646
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared: 0.8659, Adjusted R-squared: 0.8401
## F-statistic: 33.57 on 5 and 26 DF, p-value: 1.506e-10
As we can see above, the best-fitting model (R-squared = 0.87) involves, besides the transmission type (am), the variables cyl (number of cylinders, shown as cyl6/cyl8), hp (horsepower) and wt (weight). At the same levels of cyl, hp and wt, changing from automatic to manual transmission increases mpg by an estimated 1.8, although this coefficient is not statistically significant in this model (p-value = 0.206).
You can see the residual analysis of this model in the appendices.
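To express the uncertainty around the transmission effect, one option is a confidence interval for the `amManual` coefficient. The sketch below refits the selected formula on a separate copy of the data (the names `cars`, `fit` and `ci_am` are hypothetical) and uses `confint` from base R.

```r
# Separate copy with the factor conversions that matter for this formula
cars <- datasets::mtcars
cars$am  <- factor(cars$am, labels = c("Automatic", "Manual"))
cars$cyl <- factor(cars$cyl)

# Same formula as the model selected by step()
fit <- lm(mpg ~ cyl + hp + wt + am, data = cars)

# 95% confidence interval for the adjusted manual-vs-automatic difference
ci_am <- confint(fit, "amManual", level = 0.95)
print(ci_am)
```

An interval that contains zero means the data are compatible with no adjusted difference between the two transmission types once cyl, hp and wt are held fixed.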
library(corrplot)
data(mtcars) # reload the original dataset so all variables are numeric again
cor_mat <- cor(mtcars)
ord <- corrMatOrder(cor_mat, order="AOE")
corrplot.mixed(cor_mat[ord,ord])
par(mfrow=c(2,2))
plot(best)
We can see that the Residuals vs. Fitted chart shows no clear pattern (the residuals are randomly scattered), which supports the independence condition, and the Normal Q-Q chart indicates that they are approximately normally distributed (the points follow the line).
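The visual reading of the Q-Q chart can be complemented with a formal test. The sketch below applies the Shapiro-Wilk test to the residuals; this test is an addition for illustration and not part of the original analysis, and the names `cars`, `fit` and `sw` are hypothetical.

```r
# Refit the selected model on a separate copy of the data
cars <- datasets::mtcars
cars$am  <- factor(cars$am, labels = c("Automatic", "Manual"))
cars$cyl <- factor(cars$cyl)
fit <- lm(mpg ~ cyl + hp + wt + am, data = cars)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
sw <- shapiro.test(residuals(fit))
print(sw)
```

A p-value above the chosen significance level would mean there is no evidence against normality; with only 32 observations the test has limited power, so it should be read together with the chart.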