This document focuses on the analysis of the R dataset mtcars.
Description of dataset mtcars can be found in internet. Link: https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/mtcars.html
Question 1: is a car with automatic or manual transmission better in term of miles per gallons (mpg)?
Question 2: quantify the mpg difference between automatic and manual transmission.
I will address both issues from different angles employing a set of methodologies that can be broadly grouped as follows:
Univariate analysis on target variable (mpg).
Bivariate analysis on target variable and relevant covariates.
Multivariate analysis: estimation a set of regression models for the conditional mean of mpg. For model selection, I compare the Best Fit and Forward Stepwise Selection procedures.
In this chapter I focus on analyzing the target variable (mpg) alone by splitting the observations into two groups, i.e. cars wtih automatic or manual transmission. I will execute 3 analysis:
Compute sample means by group (i.e. transmission automatic vs. manual).
Validate if the difference of the group means is statistically significant by computing a 95% confidence interval for means difference.
Verify the robustness of this result by executing a permutation test with Monte Carlo trials that shuffle the allocation group > mpg.
Initially a perform a subset of the data by group.
#### generate subset: automatic and manual cars ####
cars_auto = subset(mtcars, am == 0)
cars_manu = subset(mtcars, am == 1)
# dimensions
dim(mtcars)
## [1] 32 11
dim(cars_auto); dim(cars_manu)
## [1] 19 11
## [1] 13 11
# sample means mpg by group
mean(cars_auto$mpg); mean(cars_manu$mpg)
## [1] 17.14737
## [1] 24.39231
sd(cars_auto$mpg); sd(cars_manu$mpg)
## [1] 3.833966
## [1] 6.166504
# % increase in mpg based on the sample mean
(mean(cars_manu$mpg) - mean(cars_auto$mpg))/mean(cars_auto$mpg)
## [1] 0.4225103
#### mpg plots ####
par(mfrow = c(2, 1))
hist(cars_auto$mpg, main = "Distribution mpg - automatic transmission", xlab = "mpg")
abline(v = mean(cars_auto$mpg), col = "red")
hist(cars_manu$mpg, main = "Distribution mpg - manual transmission", xlab = "mpg")
abline(v = mean(cars_manu$mpg), col = "red")
Conclusions:
mpg empirical mean of cars with manual transmission is greater than cars with automatic transmission, however has also higher variance.
Based on this result alone can be concluded that on average cars with manual transmission have 42% mileage than cars with automatic transmission. >>> is correct to conclude this?
The analysis on sample means concludes that sample mean of mpg for car with manual trasmission is greater than automatic:
Now I test if this difference (i.e. in the sample means) is statistically significant (from zero).
I execute a t.test for unpaired samples: I assume inequality in variances for the two groups for the calculation of the pooled variance.
#### 95% confidence interval for mean difference ####
# Question: is the sample mean difference significant?
t.test(cars_manu$mpg, cars_auto$mpg, paired = F, var.equal = F)
##
## Welch Two Sample t-test
##
## data: cars_manu$mpg and cars_auto$mpg
## t = 3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 3.209684 11.280194
## sample estimates:
## mean of x mean of y
## 24.39231 17.14737
Conclusions:
The 95% interval does not contain 0
The sample mean difference is significant at 95% (p-value 0.1%).
I test the robustness of result obtained in the previous step.
I execute a permutation test by shuffling the allocation mean > groups with 100000 trials of Montecarlo simulation.
Code:
#### Permutation test ####
# what if I shuffle the am groups and calculate the mean?
# get target variable and group vectors
y = mtcars$mpg
group = mtcars$am
y; group
# baseline group means and difference
baselineMeans = tapply(mtcars$mpg, mtcars$am, mean)
baselineMeansDiff = baselineMeans[2] - baselineMeans[1]
tStat = function(w, g) mean(w[g == 1]) - mean(w[g == 0])
observedDiff = tStat(y, group)
# check if function works - should be 0:
baselineMeansDiff - observedDiff
# execute shuffle:
permutations = sapply(1:100000, function(i) tStat(y, sample(group)))
Plotting experiment results:
# shuffle experiment results plots:
par(mfrow = c(2, 1), mar = c(4, 4, 2, 2))
hist(permutations, main = "Distribution of shuffled group mean differences") # distribution of difference of averages of permuted groups
plot(permutations, type = "b", main = "Shuffled group mean trials", xlab = "trial", ylab = "shuffled group mean differences", ylim = c(-14, 14))
abline(h = observedDiff, col = "red", lwd = 3)
# there is not even 1 case where by chance I get a difference value greater than the observed!
mean(permutations > observedDiff)
## [1] 2e-04
CONCLUSIONS:
Out of 100000 trials only the 0.002% have breached the observed values for the difference in group empirical means.
We can conclude that the empirical means difference of groups is robust with regards to random reshuffling and in not likely to be generated by pure chance. is this correct?
In this chapter I analyse the behavior of target variable (mpg) conditional on a set of explanatory variables.
#### generate subset: automatic and manual cars ####
cars_auto = subset(mtcars, am == 0)
cars_manu = subset(mtcars, am == 1)
#### Visual inspection of all covariates ####
pairs(mtcars)
#### 4 bivariate analysis: hp / wt / drat / disp ####
par(mfrow = c(2, 2), mar = c(2, 3, 2, 3))
# plot1
with(mtcars, plot(hp, mpg, type = "n", main = "mpg vs. hp - by transmission type")) # no data
with(cars_auto, points(hp, mpg, col = "red", pch = 20))
with(cars_manu, points(hp, mpg, col = "blue", pch = 20))
legend("topright", pch = 20, col = c("red", "blue"), legend = c("auto", "manu")) # add legend
model1_auto = lm(mpg ~ hp, data = cars_auto)
model1_manu = lm(mpg ~ hp, data = cars_manu)
abline(model1_auto, col = "red", lwd = 2)
abline(model1_manu, col = "blue", lwd = 2)
abline(v = 175, lty = 2)
# plot2
with(mtcars, plot(wt, mpg, type = "n", main = "mpg vs. weight - by transmission type")) # no data
with(cars_auto, points(wt, mpg, col = "red", pch = 20))
with(cars_manu, points(wt, mpg, col = "blue", pch = 20))
legend("topright", pch = 20, col = c("red", "blue"), legend = c("auto", "manu")) # add legend
abline(v = 3.2, lty = 2)
# plot 3
with(mtcars, plot(drat, mpg, type = "n", main = "mpg vs. drat - by transmission type")) # no data
with(cars_auto, points(drat, mpg, col = "red", pch = 20))
with(cars_manu, points(drat, mpg, col = "blue", pch = 20))
legend("topright", pch = 20, col = c("red", "blue"), legend = c("auto", "manu")) # add legend
model2_auto = lm(mpg ~ drat, data = cars_auto)
model2_manu = lm(mpg ~ drat, data = cars_manu)
abline(model2_auto, col = "red", lwd = 2)
abline(model2_manu, col = "blue", lwd = 2)
abline(v = 175, lty = 2)
# plot 4
with(mtcars, plot(disp, mpg, type = "n", main = "mpg vs. disp - by transmission type")) # no data
with(cars_auto, points(disp, mpg, col = "red", pch = 20))
with(cars_manu, points(disp, mpg, col = "blue", pch = 20))
legend("topright", pch = 20, col = c("red", "blue"), legend = c("auto", "manu")) # add legend
labels = with(mtcars, paste(as.character(disp), as.character(mpg), sep = ",")) # generate point labels
with(mtcars, text(disp, mpg, labels = labels, cex = 0.7, pos = 2))
abline(v = 167.6, lty = 2)
CONCLUSIONS:
mpg vs. hp: linear negative relation: as horse power of the engine (hp) increases, the mileage (mpg) reduces. Cars with manual transmission seems however to be more efficient: the group restricted regression (blue) has a higher intercept. It has to be highlighted however that the parameters of blue regression might be influenced by two extreme values with high hp - the regression should be re-estimated by removing the two datapoints.
mpg vs. weight: negative relation, the functional form might be non-linear (hyperbolic ?), as weight of the car increases, the mileage decreases. The weight variable seems to provide perfect separation between manual and automatic transmission cars, i.e. all cars that are heavier than 3.2 ton (circa) are automatic and vice-versa.
mpg vs. drat: the functional form is not clear: it appears also to be increase in the variance as the real axel ratio (drat) increases. To verify this a regression model using all observations has to be estimated and analyse the residuals for verifying if the model is heteroskedastic.
mpg vs. disp: seems to have a negative (hyperbolic ?) relation: as the displacement (disp) of the engine increases, the mileage decreases. Also in this case it seems that disp accounts for perfect separation in the transmission type: almost all cars with disp > 180 are automatic.
In this chapter I run a set of regression models for estimating the impact of some predictors on the mpg.
For model selection I employ the following techniques:
Manual selection of regressors: I hand pick regressors for
Best fit procedure:
Forward stepwise procedure:
Analysis of covariance matrix:
### analyse covariance matrix for regressor selection:
z <- cor(mtcars)
require(lattice)
## Loading required package: lattice
levelplot(z)
A model with only transmission:
# only am
data = mtcars
data$am = as.factor(data$am)
model2 = lm(mpg ~ am, data = data)
# get results
summary(model2)
##
## Call:
## lm(formula = mpg ~ am, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## am1 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
Observations:
the intercept is 17.15: exactly the same mean of mpg for cars with automatic transmission
the coefficient of am is 7.24: exactly the difference of mpg means for cars with manual / automatic transmission
the sum of intercept and am coefficient gives the mpg unconditional mean for cars with manual transmission
I run the best fit procedure for identifying the optimal number of regressors that minimises the cp, which is (…)
#### model selection using leaps ####
library(leaps)
data = mtcars
data$log_mpg = log(data$mpg) # add log of y
#### method 1. best fit ####
regfit.full = regsubsets(log_mpg ~. , data = data, nvmax = 10)
reg.summary = summary(regfit.full)
reg.summary
## Subset selection object
## Call: regsubsets.formula(log_mpg ~ ., data = data, nvmax = 10)
## 11 Variables (and intercept)
## Forced in Forced out
## mpg FALSE FALSE
## cyl FALSE FALSE
## disp FALSE FALSE
## hp FALSE FALSE
## drat FALSE FALSE
## wt FALSE FALSE
## qsec FALSE FALSE
## vs FALSE FALSE
## am FALSE FALSE
## gear FALSE FALSE
## carb FALSE FALSE
## 1 subsets of each size up to 10
## Selection Algorithm: exhaustive
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 ( 1 ) "*" " " " " " " " " " " " " " " " " " " " "
## 2 ( 1 ) "*" " " "*" " " " " " " " " " " " " " " " "
## 3 ( 1 ) "*" " " "*" " " " " "*" " " " " " " " " " "
## 4 ( 1 ) "*" " " "*" " " " " "*" " " " " "*" " " " "
## 5 ( 1 ) "*" " " "*" " " " " "*" " " "*" "*" " " " "
## 6 ( 1 ) "*" "*" "*" " " " " " " " " " " "*" "*" "*"
## 7 ( 1 ) "*" "*" "*" " " " " "*" " " " " "*" "*" "*"
## 8 ( 1 ) "*" "*" "*" " " " " "*" " " "*" "*" "*" "*"
## 9 ( 1 ) "*" "*" "*" " " "*" "*" " " "*" "*" "*" "*"
## 10 ( 1 ) "*" "*" "*" " " "*" "*" "*" "*" "*" "*" "*"
Plotting the ….
# how I select the optimal number of variables?
plot(reg.summary$cp, xlab = "Number of variables", ylab = "cp", type = "b")
regfit.fwd = regsubsets(log_mpg ~ ., data = data, nvmax = 10, method = "forward")
summary(regfit.fwd)
## Subset selection object
## Call: regsubsets.formula(log_mpg ~ ., data = data, nvmax = 10, method = "forward")
## 11 Variables (and intercept)
## Forced in Forced out
## mpg FALSE FALSE
## cyl FALSE FALSE
## disp FALSE FALSE
## hp FALSE FALSE
## drat FALSE FALSE
## wt FALSE FALSE
## qsec FALSE FALSE
## vs FALSE FALSE
## am FALSE FALSE
## gear FALSE FALSE
## carb FALSE FALSE
## 1 subsets of each size up to 10
## Selection Algorithm: forward
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 ( 1 ) "*" " " " " " " " " " " " " " " " " " " " "
## 2 ( 1 ) "*" " " "*" " " " " " " " " " " " " " " " "
## 3 ( 1 ) "*" " " "*" " " " " "*" " " " " " " " " " "
## 4 ( 1 ) "*" " " "*" " " " " "*" " " " " "*" " " " "
## 5 ( 1 ) "*" " " "*" " " " " "*" " " "*" "*" " " " "
## 6 ( 1 ) "*" " " "*" " " " " "*" " " "*" "*" "*" " "
## 7 ( 1 ) "*" " " "*" " " " " "*" " " "*" "*" "*" "*"
## 8 ( 1 ) "*" "*" "*" " " " " "*" " " "*" "*" "*" "*"
## 9 ( 1 ) "*" "*" "*" " " "*" "*" " " "*" "*" "*" "*"
## 10 ( 1 ) "*" "*" "*" " " "*" "*" "*" "*" "*" "*" "*"
Plotting the ….
plot(regfit.fwd, scale = "Cp")
A model including all regressors.
#### lm with all variables / no split ####
# prepare data
data = mtcars
data$am = as.factor(data$am)
model1 = lm(mpg ~ ., data = data)
# get results
summary(model1)
##
## Call:
## lm(formula = mpg ~ ., data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4506 -1.6044 -0.1196 1.2193 4.6271
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.30337 18.71788 0.657 0.5181
## cyl -0.11144 1.04502 -0.107 0.9161
## disp 0.01334 0.01786 0.747 0.4635
## hp -0.02148 0.02177 -0.987 0.3350
## drat 0.78711 1.63537 0.481 0.6353
## wt -3.71530 1.89441 -1.961 0.0633 .
## qsec 0.82104 0.73084 1.123 0.2739
## vs 0.31776 2.10451 0.151 0.8814
## am1 2.52023 2.05665 1.225 0.2340
## gear 0.65541 1.49326 0.439 0.6652
## carb -0.19942 0.82875 -0.241 0.8122
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared: 0.869, Adjusted R-squared: 0.8066
## F-statistic: 13.93 on 10 and 21 DF, p-value: 3.793e-07
# plot residual analysis
par(mfrow = c(2, 2))
plot(model1)
# plot hist
par(mfrow = c(1, 1))
hist(model1$residuals)
# normality test on residuals
shapiro.test(model1$residuals)
##
## Shapiro-Wilk normality test
##
## data: model1$residuals
## W = 0.9569, p-value = 0.2261