Bikram Bhusal
28 July, 2019

Introduction

In this analysis, we use the “mtcars” data set to examine the relationship between a set of variables and miles per gallon (MPG), the outcome. We are particularly interested in the following two questions:

  1. Is an automatic or manual transmission better for MPG?
  2. Quantify the MPG difference between automatic and manual transmissions

Data Analysis:

library(ggplot2)
data(mtcars)
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
mtcars$vs<-factor(mtcars$vs)
mtcars$am.label<-factor(mtcars$am,labels=c("Automatic","Manual")) # 0 = automatic, 1 = manual
mtcars$gear<-factor(mtcars$gear)
mtcars$carb<-factor(mtcars$carb)

Boxplot of MPG by transmission type:

boxplot(mpg ~ am.label,data=mtcars,col=c("yellow","green"),xlab="Transmission Type",ylab="Miles Per Gallon")

The boxplot shows that cars with a manual transmission tend to have higher MPG than cars with an automatic transmission.
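The visual impression from the boxplot can be checked numerically. The following is a minimal sketch, not part of the original analysis, that computes the mean MPG per transmission type and runs a Welch two-sample t-test. The names `am_label`, `group_means`, `mean_diff`, and `welch` are illustrative.

```r
data(mtcars)

# Label the transmission indicator: 0 = automatic, 1 = manual
am_label <- factor(mtcars$am, labels = c("Automatic", "Manual"))

# Mean MPG within each transmission group
group_means <- tapply(mtcars$mpg, am_label, mean)
print(group_means)

# Difference in mean MPG (manual minus automatic)
mean_diff <- unname(group_means["Manual"] - group_means["Automatic"])
print(mean_diff)

# Welch two-sample t-test (does not assume equal variances)
welch <- t.test(mtcars$mpg ~ am_label)
print(welch)
```

The difference in group means is the same quantity that a simple regression of MPG on transmission type estimates as its slope.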

Regression Analysis

Let’s fit a simple linear regression of MPG on transmission type:

R_simple<-lm(mpg~factor(am),data = mtcars)
summary(R_simple)
## 
## Call:
## lm(formula = mpg ~ factor(am), data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## factor(am)1    7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

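Here, the p-value for the transmission coefficient is 0.000285, well below 0.05, so we reject the null hypothesis of no difference in mean MPG between the two transmission types. The intercept (17.147) is the mean MPG for automatic cars, and the coefficient (7.245) says that manual cars average about 7.2 MPG more. At the same time, the R-squared value (0.3598) shows that transmission type alone explains only about 36% of the variance in MPG, so the relationship is not especially strong.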
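To quantify the uncertainty around the 7.245 estimate, one option is a confidence interval for the coefficient. The sketch below refits the same simple model under the illustrative name `fit_simple` and uses `confint()` from base R; the 95% level is an assumption, not something the original analysis specifies.

```r
data(mtcars)

# Same simple model as above: MPG explained by transmission type only
fit_simple <- lm(mpg ~ factor(am), data = mtcars)

# 95% confidence intervals for the intercept and the manual-vs-automatic difference
ci <- confint(fit_simple, level = 0.95)
print(ci)
```

If the interval for `factor(am)1` excludes zero, that agrees with rejecting the null hypothesis at the 5% level.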

Now, let’s perform an analysis of variance on all variables:

Aov<-aov(mpg~.,data=mtcars)
summary(Aov)
##             Df Sum Sq Mean Sq F value  Pr(>F)    
## cyl          1  817.7   817.7 102.591 2.3e-08 ***
## disp         1   37.6    37.6   4.717 0.04525 *  
## hp           1    9.4     9.4   1.176 0.29430    
## drat         1   16.5    16.5   2.066 0.16988    
## wt           1   77.5    77.5   9.720 0.00663 ** 
## qsec         1    3.9     3.9   0.495 0.49161    
## vs           1    0.1     0.1   0.016 0.90006    
## am           1   14.5    14.5   1.816 0.19657    
## gear         2    2.3     1.2   0.145 0.86578    
## carb         5   19.0     3.8   0.477 0.78789    
## Residuals   16  127.5     8.0                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
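The table above uses sequential sums of squares, so each row is adjusted only for the terms listed before it, and the result for a variable can change with its position in the formula. The following sketch illustrates this for `am` with a reduced set of predictors, using `anova()` on an `lm` fit, which reports the same kind of sequential table. The object names are illustrative.

```r
data(mtcars)

# Sequential ANOVA tables with am entered last and with am entered first
tab_am_last  <- anova(lm(mpg ~ cyl + disp + wt + am, data = mtcars))
tab_am_first <- anova(lm(mpg ~ am + cyl + disp + wt, data = mtcars))

# Sum of squares attributed to am under each ordering
ss_am_last  <- tab_am_last["am", "Sum Sq"]
ss_am_first <- tab_am_first["am", "Sum Sq"]
print(c(am_last = ss_am_last, am_first = ss_am_first))
```

Entered first, `am` is credited with all the variation it shares with the other predictors; entered last, it is credited only with what the others leave unexplained.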

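We see that cyl, disp, and wt have p-values less than 0.05. The p-value for am (0.197) is not below 0.05, but we keep am in the model because it is the variable of interest. Note that aov reports sequential sums of squares, so these p-values depend on the order of the terms. Let’s perform the multivariable linear regression analysis: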

R_multivar<-lm(mpg~cyl+disp+wt+am,data = mtcars)
summary(R_multivar)
## 
## Call:
## lm(formula = mpg ~ cyl + disp + wt + am, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.318 -1.362 -0.479  1.354  6.059 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 40.898313   3.601540  11.356 8.68e-12 ***
## cyl         -1.784173   0.618192  -2.886  0.00758 ** 
## disp         0.007404   0.012081   0.613  0.54509    
## wt          -3.583425   1.186504  -3.020  0.00547 ** 
## am           0.129066   1.321512   0.098  0.92292    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.642 on 27 degrees of freedom
## Multiple R-squared:  0.8327, Adjusted R-squared:  0.8079 
## F-statistic: 33.59 on 4 and 27 DF,  p-value: 4.038e-10

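The R-squared value is 0.8327, which suggests that about 83% of the variance in MPG can be explained by this multivariable model. The p-values for cyl and wt are below 0.05, suggesting that these variables confound the relationship between transmission type and MPG. After adjusting for them, the am coefficient drops to 0.129 with a p-value of 0.92, so the MPG difference between manual and automatic transmissions is no longer statistically significant.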
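One way to check whether the extra predictors improve on the transmission-only model is a nested-model F-test. The sketch below is an addition, not part of the original analysis. It compares the two fits with `anova()` and prints the correlation between `am` and `wt` as a rough indication of why the transmission effect shrinks once weight is in the model. The object names are illustrative.

```r
data(mtcars)

# Transmission-only model and the multivariable model from above
fit_simple <- lm(mpg ~ am, data = mtcars)
fit_multi  <- lm(mpg ~ cyl + disp + wt + am, data = mtcars)

# F-test: do cyl, disp and wt jointly improve on the transmission-only model?
cmp <- anova(fit_simple, fit_multi)
print(cmp)

# Correlation between transmission type (0/1) and weight
am_wt_cor <- cor(mtcars$am, mtcars$wt)
print(am_wt_cor)
```

A negative correlation here means the manual cars in this data set tend to be lighter, so part of their MPG advantage in the simple model may reflect weight rather than transmission.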

Pairs plot for the dataset

pairs(mpg ~ .,data=mtcars)

Residuals plot

par(mfrow=c(2,2))
plot(R_multivar)

The Residuals vs Fitted plot shows no clear pattern, suggesting that the residuals are roughly homoscedastic. The Q-Q plot suggests that the residuals are approximately normally distributed, apart from a few outliers.
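The diagnostic plots can be complemented with numerical checks. This is a minimal sketch using base R only: a Shapiro-Wilk test on the residuals and Cook's distance to flag influential cars. Neither check appears in the original analysis, and the object names are illustrative.

```r
data(mtcars)

# Refit the multivariable model so the block is self-contained
fit_multi <- lm(mpg ~ cyl + disp + wt + am, data = mtcars)

# Shapiro-Wilk test: a small p-value would indicate non-normal residuals
sw <- shapiro.test(residuals(fit_multi))
print(sw)

# Cook's distance: the largest values point to the most influential observations
cd <- cooks.distance(fit_multi)
print(head(sort(cd, decreasing = TRUE), 3))
```

With only 32 observations, both checks have limited power and are best read alongside the plots rather than in place of them.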