Executive Summary

This report is submitted as graded coursework for the Regression Models course.

I will use the built-in mtcars dataset to show that:

  1. Before adjusting for other variables, manual transmissions yield better fuel economy than automatics: a mean of 24.4 MPG against 17.1 MPG.

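  2. At least two variables, number of cylinders (cyl) and weight (wt), confound the relationship between MPG and transmission type: once they enter the model, the transmission effect largely disappears (a pattern related to Simpson’s Paradox).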

Data Analysis

Loading and exploring the data:

library(ggplot2)
data(mtcars)
dim(mtcars)
## [1] 32 11
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Setting factors

mtcars$vs <- factor(mtcars$vs)
mtcars$transm <- factor(mtcars$am, labels=c("Automatic","Manual")) 
mtcars$gear <- factor(mtcars$gear)
mtcars$carb <- factor(mtcars$carb)
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb    transm
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4    Manual
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4    Manual
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1    Manual
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1 Automatic
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 Automatic
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1 Automatic

Comparing MPG by transmission type in a boxplot

boxplot(mpg ~ transm, data = mtcars, col = c("red", "blue"), ylab = "MPG", xlab = "Transmission Type")

The mean MPG for each transmission type:

aggregate(mtcars$mpg,by=list(mtcars$transm),FUN=mean)
##     Group.1        x
## 1 Automatic 17.14737
## 2    Manual 24.39231

Regression Analysis

lmt <- lm(mpg ~ factor(am), data = mtcars)
summary(lmt)
## 
## Call:
## lm(formula = mpg ~ factor(am), data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## factor(am)1    7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

In the simple linear regression, the coefficient for transmission type is 7.245 MPG (manual minus automatic) with a p-value of 0.000285, well below 0.001, so we reject the null hypothesis that mean MPG is the same for both transmission types. However, R-squared is 0.3598, so transmission type alone explains only about 36% of the variance in MPG.
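With a single two-level predictor, the regression slope is just the difference between the two group means, and R-squared is the share of total variation explained by that difference. The following is a minimal sketch of that check, not part of the original analysis. It works on a fresh copy of the data under the illustrative name `cars`, so the factor conversions made earlier are left untouched.

```r
# Fresh copy of the built-in data; `cars` is an illustrative name
cars <- datasets::mtcars

# Mean MPG per transmission type (am: 0 = automatic, 1 = manual)
group_means <- tapply(cars$mpg, cars$am, mean)

# Difference in means, manual minus automatic
mean_diff <- unname(group_means["1"] - group_means["0"])

# Slope from the simple regression of mpg on transmission type
fit_simple <- lm(mpg ~ factor(am), data = cars)
slope <- unname(coef(fit_simple)[2])

# The two quantities should agree up to floating-point error
all.equal(mean_diff, slope)

# R-squared computed by hand: 1 - residual SS / total SS
r2 <- 1 - sum(residuals(fit_simple)^2) / sum((cars$mpg - mean(cars$mpg))^2)
round(c(difference = mean_diff, slope = slope, r_squared = r2), 4)
```

Under these assumptions, the rounded values should match the 7.245 estimate and the 0.3598 R-squared in the summary output above.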

Variance Analysis

vart <- aov(mpg ~ ., data = mtcars)
summary(vart)
##             Df Sum Sq Mean Sq F value  Pr(>F)    
## cyl          1  817.7   817.7 102.591 2.3e-08 ***
## disp         1   37.6    37.6   4.717 0.04525 *  
## hp           1    9.4     9.4   1.176 0.29430    
## drat         1   16.5    16.5   2.066 0.16988    
## wt           1   77.5    77.5   9.720 0.00663 ** 
## qsec         1    3.9     3.9   0.495 0.49161    
## vs           1    0.1     0.1   0.016 0.90006    
## am           1   14.5    14.5   1.816 0.19657    
## gear         2    2.3     1.2   0.145 0.86578    
## carb         5   19.0     3.8   0.477 0.78789    
## Residuals   16  127.5     8.0                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

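In the analysis of variance we are looking for variables with a p-value below 0.05. Three qualify: cyl (p = 2.3e-08, by far the strongest), wt (p = 0.0066) and disp (p = 0.045). Because aov reports sequential sums of squares, each p-value depends on the order in which the variables enter the model; am, entered after these variables, is not significant here (p = 0.197).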
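Because the sums of squares are sequential, the variation credited to a variable depends on what was entered before it. The sketch below illustrates this with a reduced, hypothetical model (only am, cyl and wt, not the full formula used above), fitted twice with am in a different position. The names `cars`, `am_first` and `am_last` are illustrative.

```r
# Fresh copy of the built-in data; `cars` is an illustrative name
cars <- datasets::mtcars

# am entered first: it is credited with all the variation it shares with cyl and wt
am_first <- anova(lm(mpg ~ am + cyl + wt, data = cars))

# am entered last: it is credited only with what cyl and wt leave unexplained
am_last <- anova(lm(mpg ~ cyl + wt + am, data = cars))

# Sum of squares attributed to am under each ordering
c(am_first = am_first["am", "Sum Sq"], am_last = am_last["am", "Sum Sq"])
```

A large gap between the two values is a sign that am shares much of its explanatory power with the other variables.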

Multivariate Regression Analysis

multivart <- lm(mpg ~ cyl + disp + wt + am, data = mtcars)
summary(multivart)
## 
## Call:
## lm(formula = mpg ~ cyl + disp + wt + am, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.318 -1.362 -0.479  1.354  6.059 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 40.898313   3.601540  11.356 8.68e-12 ***
## cyl         -1.784173   0.618192  -2.886  0.00758 ** 
## disp         0.007404   0.012081   0.613  0.54509    
## wt          -3.583425   1.186504  -3.020  0.00547 ** 
## am           0.129066   1.321512   0.098  0.92292    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.642 on 27 degrees of freedom
## Multiple R-squared:  0.8327, Adjusted R-squared:  0.8079 
## F-statistic: 33.59 on 4 and 27 DF,  p-value: 4.038e-10

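The multivariate regression now has an R-squared above 0.83, suggesting that about 83% of the variance in MPG is explained by the model. The p-values for cyl and wt are below 0.05, while the coefficient for am drops to 0.13 and is no longer significant (p = 0.92). This suggests that number of cylinders and weight are the confounders in the relationship between MPG and transmission type. disp is not significant in this model (p = 0.55).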
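A variable can only confound the transmission effect if it is associated with both transmission type and MPG. The sketch below is one quick way to look at that, assuming a fresh copy of the data under the illustrative name `cars`; it is a supplementary check rather than part of the original analysis.

```r
# Fresh copy of the built-in data; `cars` is an illustrative name
cars <- datasets::mtcars

# Mean number of cylinders and mean weight (1000 lbs) per transmission type
# (am: 0 = automatic, 1 = manual)
by_am <- aggregate(cbind(cyl, wt) ~ am, data = cars, FUN = mean)
by_am

# Pairwise correlations between MPG, transmission type and the candidate confounders
round(cor(cars[, c("mpg", "am", "cyl", "wt")]), 2)
```

If the automatic cars in the sample are heavier and have more cylinders on average, part of their lower MPG reflects those characteristics rather than the transmission itself.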

We can therefore fit the final model with only the relevant variables, cyl and wt, alongside transmission type:

multivart_final <- lm(mpg ~ cyl + wt + am, data = mtcars)
summary(multivart_final)
## 
## Call:
## lm(formula = mpg ~ cyl + wt + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1735 -1.5340 -0.5386  1.5864  6.0812 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  39.4179     2.6415  14.923 7.42e-15 ***
## cyl          -1.5102     0.4223  -3.576  0.00129 ** 
## wt           -3.1251     0.9109  -3.431  0.00189 ** 
## am            0.1765     1.3045   0.135  0.89334    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.612 on 28 degrees of freedom
## Multiple R-squared:  0.8303, Adjusted R-squared:  0.8122 
## F-statistic: 45.68 on 3 and 28 DF,  p-value: 6.51e-11

MPG = 39.42 + 0.18*x1 - 1.51*x2 - 3.13*x3, where x1 = type of transmission (0 for automatic and 1 for manual), x2 = number of cylinders and x3 = weight (in 1000 lbs)
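To make the fitted equation concrete, the sketch below predicts MPG for two hypothetical cars that differ only in transmission type. The cars (4 cylinders, 2,500 lbs) and the names `cars`, `fit` and `new_cars` are made up for illustration.

```r
# Fresh copy of the built-in data; `cars` is an illustrative name
cars <- datasets::mtcars
fit <- lm(mpg ~ cyl + wt + am, data = cars)

# Two hypothetical cars, identical except for transmission
# (am: 0 = automatic, 1 = manual; wt is in 1000 lbs)
new_cars <- data.frame(cyl = c(4, 4), wt = c(2.5, 2.5), am = c(0, 1))

# Predicted MPG from the fitted model
pred <- predict(fit, newdata = new_cars)

# The same predictions written out from the coefficients
b <- coef(fit)
by_hand <- b["(Intercept)"] + b["cyl"] * new_cars$cyl +
  b["wt"] * new_cars$wt + b["am"] * new_cars$am

# Both routes should agree
all.equal(unname(pred), unname(by_hand))

# The gap between the two predictions is the am coefficient
unname(pred[2] - pred[1])
```

Holding cylinders and weight fixed, the predicted difference between manual and automatic is the am coefficient alone, which is small compared with the unadjusted difference in group means.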

As a final check, the model assumptions are examined with residual plots.

Residual Plot Analysis

par(mfrow = c(2, 2))
plot(multivart_final)

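The “Residuals vs Fitted” plot shows no clear pattern in the spread of the residuals, which is consistent with homoscedasticity. In the Normal Q-Q plot the residuals follow the theoretical quantiles closely, which is consistent with approximately normal errors.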
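Reading diagnostic plots is somewhat subjective, so a numerical check can complement them. The sketch below is an optional supplement, not part of the original analysis. It applies the Shapiro-Wilk test from base R to the residuals, plus a rough, informal indicator of non-constant variance. The names `cars`, `fit` and `sw` are illustrative.

```r
# Fresh copy of the built-in data; `cars` is an illustrative name
cars <- datasets::mtcars
fit <- lm(mpg ~ cyl + wt + am, data = cars)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normally distributed,
# so a small p-value would count against normality
sw <- shapiro.test(residuals(fit))
sw$p.value

# Informal heteroscedasticity indicator: correlation between the absolute residuals
# and the fitted values (values near zero suggest roughly constant spread)
spread_cor <- cor(abs(residuals(fit)), fitted(fit))
spread_cor
```

With only 32 observations these checks have limited power, so they are best read alongside the plots rather than in place of them.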

Conclusion

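The number of cylinders and weight are confounders in the relationship between MPG and type of transmission, and the model that best explains the relationship is the multivariate model MPG = 39.42 + 0.18*x1 - 1.51*x2 - 3.13*x3, where x1 = type of transmission (0 for automatic and 1 for manual), x2 = number of cylinders and x3 = weight (in 1000 lbs). On average, a manual car achieves 24.4 MPG and an automatic 17.1 MPG, but after adjusting for cylinders and weight the difference attributable to transmission shrinks to about 0.18 MPG and is not statistically significant (p = 0.89).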