Preamble:

This document focuses on the analysis of the R dataset mtcars.

A description of the mtcars dataset is available online: https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/mtcars.html

Research questions:

  1. Is there a difference in fuel efficiency (mpg) between cars with automatic and manual transmission, and is it statistically significant?

  2. How is mpg related to the other covariates in the dataset, and which regression model best explains it?

Analysis structure:

I will address both questions from different angles, employing a set of methodologies that can be broadly grouped as follows:

  1. Univariate analysis on target variable (mpg).

  2. Bivariate analysis on target variable and relevant covariates.

  3. Multivariate analysis: estimation of a set of regression models for the conditional mean of mpg. For model selection, I compare the Best Subset and Forward Stepwise Selection procedures.

Chapter 1. Univariate analysis

In this chapter I focus on analyzing the target variable (mpg) alone by splitting the observations into two groups, i.e. cars with automatic or manual transmission. I will carry out three analyses:

  1. Compute sample means by group (i.e. transmission automatic vs. manual).

  2. Test whether the difference between the group means is statistically significant by computing a 95% confidence interval for the mean difference.

  3. Verify the robustness of this result with a permutation test that uses Monte Carlo trials to shuffle the allocation of mpg values to transmission groups.

1.1 Sample means by group

First, I subset the data by transmission group.

#### generate subset: automatic and manual cars ####
cars_auto = subset(mtcars, am == 0)
cars_manu = subset(mtcars, am == 1)

# dimensions
dim(mtcars)
## [1] 32 11
dim(cars_auto); dim(cars_manu)
## [1] 19 11
## [1] 13 11
# sample means mpg by group
mean(cars_auto$mpg); mean(cars_manu$mpg)
## [1] 17.14737
## [1] 24.39231
sd(cars_auto$mpg); sd(cars_manu$mpg)
## [1] 3.833966
## [1] 6.166504
# % increase in mpg based on the sample mean
(mean(cars_manu$mpg) - mean(cars_auto$mpg))/mean(cars_auto$mpg)
## [1] 0.4225103
#### mpg plots ####
par(mfrow = c(2, 1))
hist(cars_auto$mpg, main = "Distribution mpg - automatic transmission", xlab = "mpg")
abline(v = mean(cars_auto$mpg), col = "red")
hist(cars_manu$mpg, main = "Distribution mpg - manual transmission", xlab = "mpg")
abline(v = mean(cars_manu$mpg), col = "red")

Conclusions: the sample mean of mpg for manual-transmission cars (24.39) is about 42% higher than for automatic-transmission cars (17.15), and the manual group also shows greater dispersion (sd 6.17 vs. 3.83).

1.2 95% confidence interval for the difference of the group means

The analysis of sample means shows that the mean mpg for cars with manual transmission is greater than for cars with automatic transmission.

Now I test whether this difference in sample means is statistically significantly different from zero.

I run a t-test for unpaired samples, assuming unequal variances in the two groups (Welch's t-test, which does not pool the variances).

#### 95% confidence interval for mean difference ####

# Question: is the sample mean difference significant?
t.test(cars_manu$mpg, cars_auto$mpg, paired = F, var.equal = F)
## 
##  Welch Two Sample t-test
## 
## data:  cars_manu$mpg and cars_auto$mpg
## t = 3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   3.209684 11.280194
## sample estimates:
## mean of x mean of y 
##  24.39231  17.14737
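
As a sanity check, the Welch statistic and its degrees of freedom can be reproduced by hand; this is a minimal sketch using my own variable names, not part of the original analysis.

# reproduce the Welch t statistic and Welch-Satterthwaite degrees of freedom
x = cars_manu$mpg; nx = length(x)
y = cars_auto$mpg; ny = length(y)
vx = var(x) / nx; vy = var(y) / ny               # variance of each group mean
(mean(x) - mean(y)) / sqrt(vx + vy)              # should match t = 3.7671
(vx + vy)^2 / (vx^2/(nx - 1) + vy^2/(ny - 1))    # should match df = 18.332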

Conclusions: the 95% confidence interval for the mean difference, [3.21, 11.28], does not contain zero and the p-value is about 0.0014, so the difference in means is statistically significant at the 5% level.

1.3 Permutation test on groups association

I test the robustness of the result obtained in the previous step.

I run a permutation test that shuffles the allocation of mpg values to transmission groups, with 100,000 Monte Carlo trials.

Code:

#### Permutation test ####
# what if I shuffle the am groups and calculate the mean?

# get target variable and group vectors
y = mtcars$mpg
group = mtcars$am
y; group

# baseline group means and difference
baselineMeans = tapply(mtcars$mpg, mtcars$am, mean)
baselineMeansDiff = baselineMeans[2] - baselineMeans[1]

tStat = function(w, g) mean(w[g == 1]) - mean(w[g == 0])
observedDiff = tStat(y, group)

# check if function works - should be 0:
baselineMeansDiff - observedDiff

# execute shuffle:
permutations = sapply(1:100000, function(i) tStat(y, sample(group)))

Plotting the experiment results:

# shuffle experiment results plots:
par(mfrow = c(2, 1), mar = c(4, 4, 2, 2))
hist(permutations, main = "Distribution of shuffled group mean differences") # distribution of difference of averages of permuted groups
plot(permutations, type = "b", main = "Shuffled group mean trials", xlab = "trial", ylab = "shuffled group mean differences", ylim = c(-14, 14))
abline(h = observedDiff, col = "red", lwd = 3)

# only a tiny fraction of the shuffled trials produces a difference greater than the observed one
mean(permutations > observedDiff)
## [1] 2e-04
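
For completeness, a two-sided permutation p-value can be obtained by comparing absolute differences; this is a small sketch using the objects defined above.

# fraction of shuffled |differences| at least as extreme as the observed |difference|
mean(abs(permutations) >= abs(observedDiff))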

CONCLUSIONS: only about 0.02% of the 100,000 shuffled allocations produce a mean difference larger than the observed one, so the observed difference is very unlikely to be due to chance; this confirms the result of the t-test.

Chapter 2. Bivariate analysis

In this chapter I analyse the behaviour of the target variable (mpg) conditional on a set of explanatory variables, comparing automatic and manual cars in each plot.

#### generate subset: automatic and manual cars ####
cars_auto = subset(mtcars, am == 0)
cars_manu = subset(mtcars, am == 1)

#### Visual inspection of all covariates ####
pairs(mtcars)

#### 4 bivariate analysis: hp / wt / drat / disp ####
par(mfrow = c(2, 2), mar = c(2, 3, 2, 3))

# plot1
with(mtcars, plot(hp, mpg, type = "n", main = "mpg vs. hp - by transmission type")) # no data
with(cars_auto, points(hp, mpg, col = "red", pch = 20))
with(cars_manu, points(hp, mpg, col = "blue", pch = 20))
legend("topright", pch = 20, col = c("red", "blue"), legend = c("auto", "manu")) # add legend
model1_auto = lm(mpg ~ hp, data = cars_auto)
model1_manu = lm(mpg ~ hp, data = cars_manu)
abline(model1_auto, col = "red", lwd = 2)
abline(model1_manu, col = "blue", lwd = 2)
abline(v = 175, lty = 2)

# plot2
with(mtcars, plot(wt, mpg, type = "n", main = "mpg vs. weight - by transmission type")) # no data
with(cars_auto, points(wt, mpg, col = "red", pch = 20))
with(cars_manu, points(wt, mpg, col = "blue", pch = 20))
legend("topright", pch = 20, col = c("red", "blue"), legend = c("auto", "manu")) # add legend
abline(v = 3.2, lty = 2)

# plot 3
with(mtcars, plot(drat, mpg, type = "n", main = "mpg vs. drat - by transmission type")) # no data
with(cars_auto, points(drat, mpg, col = "red", pch = 20))
with(cars_manu, points(drat, mpg, col = "blue", pch = 20))
legend("topright", pch = 20, col = c("red", "blue"), legend = c("auto", "manu")) # add legend
model2_auto = lm(mpg ~ drat, data = cars_auto)
model2_manu = lm(mpg ~ drat, data = cars_manu)
abline(model2_auto, col = "red", lwd = 2)
abline(model2_manu, col = "blue", lwd = 2)
# note: drat ranges roughly from 2.8 to 4.9, so the original reference line at 175
# would fall outside the plot; 3.7 is used here purely as an illustrative threshold
abline(v = 3.7, lty = 2)

# plot 4
with(mtcars, plot(disp, mpg, type = "n", main = "mpg vs. disp - by transmission type")) # no data
with(cars_auto, points(disp, mpg, col = "red", pch = 20))
with(cars_manu, points(disp, mpg, col = "blue", pch = 20))
legend("topright", pch = 20, col = c("red", "blue"), legend = c("auto", "manu")) # add legend
labels = with(mtcars, paste(as.character(disp), as.character(mpg), sep = ",")) # generate point labels
with(mtcars, text(disp, mpg, labels = labels, cex = 0.7, pos = 2))
abline(v = 167.6, lty = 2)
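
To complement the plots, the group means of the four covariates can be computed directly; a minimal sketch:

# mean hp, wt, drat and disp by transmission type (0 = automatic, 1 = manual)
aggregate(cbind(hp, wt, drat, disp) ~ am, data = mtcars, FUN = mean)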

CONCLUSIONS: in all four scatterplots the manual-transmission cars cluster in the region of lower hp, lower weight and lower displacement (and higher drat), which is also the region of higher mpg. This suggests that part of the mpg advantage of manual cars observed in Chapter 1 is confounded with these covariates.

Chapter 3. Multivariate analysis

In this chapter I estimate a set of regression models to quantify the impact of selected predictors on mpg.

For model selection I employ the following techniques:

  1. Manual selection of regressors: I hand-pick regressors based on the correlation matrix and fit a simple model with transmission only.

  2. Best Subset procedure: an exhaustive search over all subsets of regressors (leaps::regsubsets).

  3. Forward Stepwise procedure: regressors are added one at a time, keeping at each step the variable that most improves the fit.

3.1 Manual selection

Analysis of the correlation matrix:

### analyse the correlation matrix for regressor selection:
z <- cor(mtcars)
require(lattice)
## Loading required package: lattice
levelplot(z)
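
To make the visual inspection concrete, the covariates can also be ranked by the absolute value of their correlation with mpg; a minimal sketch using the matrix z defined above:

# rank covariates by absolute correlation with mpg (excluding mpg itself)
sort(abs(z["mpg", colnames(z) != "mpg"]), decreasing = TRUE)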

A model with transmission (am) as the only regressor:

# only am
data = mtcars
data$am = as.factor(data$am)
model2 = lm(mpg ~ am, data = data)

# get results
summary(model2)
## 
## Call:
## lm(formula = mpg ~ am, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## am1            7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

Observations: the coefficient on am1 (7.245) equals the difference in group means from Section 1.1 and is highly significant; however, the model explains only about 36% of the variance in mpg (adjusted R-squared 0.34), so other covariates should be considered.
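
To relate this back to Section 1.2, a 95% confidence interval for the am1 coefficient can be extracted; a minimal sketch (note that, unlike the Welch interval, it assumes a common residual variance):

# 95% CI for the transmission coefficient from the simple regression
confint(model2, "am1", level = 0.95)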

3.2 Best Subset Procedure

I run the best subset procedure (an exhaustive search over all subsets of regressors) to identify the number of regressors that minimises Mallows' Cp, a criterion that estimates out-of-sample prediction error by penalising model size. Note that the target here is log(mpg) and that the original mpg column remains among the candidate regressors, which is why it is selected in every subset below; a cleaner specification would exclude it (e.g. log_mpg ~ . - mpg).

#### model selection using leaps ####
library(leaps)
data = mtcars
data$log_mpg = log(data$mpg) # add log of y

#### method 1. best subset ####
regfit.full = regsubsets(log_mpg ~. , data = data, nvmax = 10)
reg.summary = summary(regfit.full)
reg.summary
## Subset selection object
## Call: regsubsets.formula(log_mpg ~ ., data = data, nvmax = 10)
## 11 Variables  (and intercept)
##      Forced in Forced out
## mpg      FALSE      FALSE
## cyl      FALSE      FALSE
## disp     FALSE      FALSE
## hp       FALSE      FALSE
## drat     FALSE      FALSE
## wt       FALSE      FALSE
## qsec     FALSE      FALSE
## vs       FALSE      FALSE
## am       FALSE      FALSE
## gear     FALSE      FALSE
## carb     FALSE      FALSE
## 1 subsets of each size up to 10
## Selection Algorithm: exhaustive
##           mpg cyl disp hp  drat wt  qsec vs  am  gear carb
## 1  ( 1 )  "*" " " " "  " " " "  " " " "  " " " " " "  " " 
## 2  ( 1 )  "*" " " "*"  " " " "  " " " "  " " " " " "  " " 
## 3  ( 1 )  "*" " " "*"  " " " "  "*" " "  " " " " " "  " " 
## 4  ( 1 )  "*" " " "*"  " " " "  "*" " "  " " "*" " "  " " 
## 5  ( 1 )  "*" " " "*"  " " " "  "*" " "  "*" "*" " "  " " 
## 6  ( 1 )  "*" "*" "*"  " " " "  " " " "  " " "*" "*"  "*" 
## 7  ( 1 )  "*" "*" "*"  " " " "  "*" " "  " " "*" "*"  "*" 
## 8  ( 1 )  "*" "*" "*"  " " " "  "*" " "  "*" "*" "*"  "*" 
## 9  ( 1 )  "*" "*" "*"  " " "*"  "*" " "  "*" "*" "*"  "*" 
## 10  ( 1 ) "*" "*" "*"  " " "*"  "*" "*"  "*" "*" "*"  "*"

Plotting Cp against the number of variables to choose the optimal model size:

# how do I select the optimal number of variables?
plot(reg.summary$cp, xlab = "Number of variables", ylab = "cp", type = "b")
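
The model size that minimises Cp, and the coefficients of the corresponding subset, can be extracted directly; a short sketch using the objects above:

# subset size with the smallest Mallows' Cp
best_size = which.min(reg.summary$cp)
best_size
# coefficients of the best subset of that size
coef(regfit.full, best_size)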

3.3 Forward Stepwise Procedure

regfit.fwd = regsubsets(log_mpg ~ ., data = data, nvmax = 10, method = "forward")
summary(regfit.fwd)
## Subset selection object
## Call: regsubsets.formula(log_mpg ~ ., data = data, nvmax = 10, method = "forward")
## 11 Variables  (and intercept)
##      Forced in Forced out
## mpg      FALSE      FALSE
## cyl      FALSE      FALSE
## disp     FALSE      FALSE
## hp       FALSE      FALSE
## drat     FALSE      FALSE
## wt       FALSE      FALSE
## qsec     FALSE      FALSE
## vs       FALSE      FALSE
## am       FALSE      FALSE
## gear     FALSE      FALSE
## carb     FALSE      FALSE
## 1 subsets of each size up to 10
## Selection Algorithm: forward
##           mpg cyl disp hp  drat wt  qsec vs  am  gear carb
## 1  ( 1 )  "*" " " " "  " " " "  " " " "  " " " " " "  " " 
## 2  ( 1 )  "*" " " "*"  " " " "  " " " "  " " " " " "  " " 
## 3  ( 1 )  "*" " " "*"  " " " "  "*" " "  " " " " " "  " " 
## 4  ( 1 )  "*" " " "*"  " " " "  "*" " "  " " "*" " "  " " 
## 5  ( 1 )  "*" " " "*"  " " " "  "*" " "  "*" "*" " "  " " 
## 6  ( 1 )  "*" " " "*"  " " " "  "*" " "  "*" "*" "*"  " " 
## 7  ( 1 )  "*" " " "*"  " " " "  "*" " "  "*" "*" "*"  "*" 
## 8  ( 1 )  "*" "*" "*"  " " " "  "*" " "  "*" "*" "*"  "*" 
## 9  ( 1 )  "*" "*" "*"  " " "*"  "*" " "  "*" "*" "*"  "*" 
## 10  ( 1 ) "*" "*" "*"  " " "*"  "*" "*"  "*" "*" "*"  "*"

Plotting, for each model size, which variables are selected, with models ordered by Cp:

plot(regfit.fwd, scale = "Cp")
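
As with the best subset search, the Cp-minimising forward model can be inspected directly; a short sketch:

# coefficients of the forward-stepwise model with the smallest Cp
fwd.summary = summary(regfit.fwd)
coef(regfit.fwd, which.min(fwd.summary$cp))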

Appendix

A model including all regressors.

#### lm with all variables / no split ####
# prepare data
data = mtcars
data$am = as.factor(data$am)

model1 = lm(mpg ~ ., data = data)

# get results
summary(model1)
## 
## Call:
## lm(formula = mpg ~ ., data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4506 -1.6044 -0.1196  1.2193  4.6271 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 12.30337   18.71788   0.657   0.5181  
## cyl         -0.11144    1.04502  -0.107   0.9161  
## disp         0.01334    0.01786   0.747   0.4635  
## hp          -0.02148    0.02177  -0.987   0.3350  
## drat         0.78711    1.63537   0.481   0.6353  
## wt          -3.71530    1.89441  -1.961   0.0633 .
## qsec         0.82104    0.73084   1.123   0.2739  
## vs           0.31776    2.10451   0.151   0.8814  
## am1          2.52023    2.05665   1.225   0.2340  
## gear         0.65541    1.49326   0.439   0.6652  
## carb        -0.19942    0.82875  -0.241   0.8122  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared:  0.869,  Adjusted R-squared:  0.8066 
## F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07
# plot residual analysis
par(mfrow = c(2, 2))
plot(model1)

# plot hist
par(mfrow = c(1, 1))
hist(model1$residuals)

# normality test on residuals
shapiro.test(model1$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model1$residuals
## W = 0.9569, p-value = 0.2261