Executive Summary

In this report, we examine the mtcars dataset and explore how miles per gallon (mpg) is affected by the other variables. We mainly ask how transmission type relates to mpg and which other variables influence that relationship.

Our analysis shows that:

  1. Manual transmission gives about 7.2 mpg more on average than automatic transmission (24.39 vs 17.15 mpg) when no other variables are taken into account.
  2. Apart from the transmission type,
    • weight,
    • horsepower and
    • number of cylinders
      are some of the confounding variables that affect the relation between transmission type and mpg.

Exploratory Analysis

# loading required libraries
library(ggplot2)
# loading dataset
data(mtcars)
# copying it for later operations
mt <- mtcars
# summarizing the variables
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

All variables are stored as numeric, so we convert the categorical ones (cyl, vs, gear, carb and am) to factors.

mtcars$cyl  <- factor(mtcars$cyl)
mtcars$vs   <- factor(mtcars$vs)
mtcars$gear <- factor(mtcars$gear)
mtcars$carb <- factor(mtcars$carb)
mtcars$am   <- factor(mtcars$am,labels=c("Automatic","Manual"))

We first explore mpg against am (transmission type).

boxplot(mpg ~ am, data = mtcars, col = (c("red","blue")), ylab = "Miles Per Gallon", xlab = "Type of Transmission", main="Miles per Gallon by Transmission type")

The plot shows that mpg differs by transmission type, with manual cars tending to have higher mpg than automatic cars.

tapply(mtcars$mpg, mtcars$am, mean)
## Automatic    Manual 
##  17.14737  24.39231

In this initial analysis, manual cars average about 7.2 mpg more than automatic cars. We explore this further with regression modelling.
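Before fitting a regression, the difference in group means can be checked with a two-sample t-test. The sketch below is not part of the original analysis; it assumes a Welch test (the default of `t.test`) is acceptable here, and `cars_df`, `tt` and `mean_diff` are illustrative names.

```r
# Sketch: Welch two-sample t-test of mpg by transmission type.
# cars_df, tt and mean_diff are illustrative names (assumptions).
cars_df <- datasets::mtcars
cars_df$am <- factor(cars_df$am, labels = c("Automatic", "Manual"))

tt <- t.test(mpg ~ am, data = cars_df)

# Difference in group means, Manual minus Automatic
mean_diff <- unname(tt$estimate[2] - tt$estimate[1])
mean_diff

# p-value and 95% confidence interval; the interval is for
# Automatic minus Manual, so its bounds are negative
tt$p.value
tt$conf.int
```

A small p-value here would agree with the boxplot: the gap between the two groups is unlikely to be due to chance alone, although it says nothing yet about confounding.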

Regression Modelling

fit <- lm(mpg~am, mtcars)
summary(fit)
## 
## Call:
## lm(formula = mpg ~ am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## amManual       7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

A p-value of 0.000285 means that we reject the null hypothesis that transmission type has no effect on mpg. At the same time, the R-squared value is about 0.36, which means that transmission type alone explains only 36% of the variance in mpg, so other variables are likely affecting mpg as well.
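To make the R-squared figure concrete, it can be recomputed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. This is a minimal sketch, not part of the original analysis; `d`, `m`, `rss`, `tss` and `r2` are illustrative names.

```r
# Sketch: recomputing R-squared for the single-predictor model by hand.
# d, m, rss, tss and r2 are illustrative names (assumptions).
d <- datasets::mtcars
d$am <- factor(d$am, labels = c("Automatic", "Manual"))
m <- lm(mpg ~ am, data = d)

rss <- sum(residuals(m)^2)            # variation left unexplained by am
tss <- sum((d$mpg - mean(d$mpg))^2)   # total variation in mpg
r2  <- 1 - rss / tss
r2

# Agrees with the value reported by summary()
all.equal(r2, summary(m)$r.squared)
```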

cor(mt)[1,]
##        mpg        cyl       disp         hp       drat         wt       qsec 
##  1.0000000 -0.8521620 -0.8475514 -0.7761684  0.6811719 -0.8676594  0.4186840 
##         vs         am       gear       carb 
##  0.6640389  0.5998324  0.4802848 -0.5509251

We see that cyl, disp, hp and wt are strongly correlated with mpg (absolute correlation of 0.78 or more), so we include them in the next model.
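The selection of candidate variables can be made explicit by ranking the correlations by absolute value. This is only a sketch: the 0.75 cutoff is an assumption chosen for illustration, not a threshold used in the original analysis, and `cors`, `ranked` and `strong` are illustrative names.

```r
# Sketch: ranking variables by absolute correlation with mpg.
# The 0.75 cutoff is an illustrative assumption.
cors <- cor(datasets::mtcars)[, "mpg"]
cors <- cors[names(cors) != "mpg"]        # drop mpg's correlation with itself

ranked <- cors[order(abs(cors), decreasing = TRUE)]
round(ranked, 3)

# Variables above the cutoff
strong <- names(ranked)[abs(ranked) > 0.75]
strong
```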

fit2 <- lm(mpg~am + cyl + disp + hp + wt, mtcars)
summary(fit2)
## 
## Call:
## lm(formula = mpg ~ am + cyl + disp + hp + wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9374 -1.3347 -0.3903  1.1910  5.0757 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.864276   2.695416  12.564 2.67e-12 ***
## amManual     1.806099   1.421079   1.271   0.2155    
## cyl6        -3.136067   1.469090  -2.135   0.0428 *  
## cyl8        -2.717781   2.898149  -0.938   0.3573    
## disp         0.004088   0.012767   0.320   0.7515    
## hp          -0.032480   0.013983  -2.323   0.0286 *  
## wt          -2.738695   1.175978  -2.329   0.0282 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.453 on 25 degrees of freedom
## Multiple R-squared:  0.8664, Adjusted R-squared:  0.8344 
## F-statistic: 27.03 on 6 and 25 DF,  p-value: 8.861e-10

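In this model, cyl6, hp and wt have p-values below 0.05, while the coefficient for amManual drops to about 1.81 with a p-value of 0.2155, so it is no longer significant once the other variables are included. The R-squared value is about 0.87, so the model explains roughly 87% of the variance in mpg.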
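One way to check whether the extra variables improve on the transmission-only model is an F-test on the two nested models. The sketch below is an addition rather than part of the original analysis; it refits both models on a local copy of the data, and `d`, `m1`, `m2`, `cmp` and `p_val` are illustrative names.

```r
# Sketch: F-test comparing the nested models with anova().
# d, m1, m2, cmp and p_val are illustrative names (assumptions).
d <- datasets::mtcars
d$am  <- factor(d$am, labels = c("Automatic", "Manual"))
d$cyl <- factor(d$cyl)

m1 <- lm(mpg ~ am, data = d)
m2 <- lm(mpg ~ am + cyl + disp + hp + wt, data = d)

cmp <- anova(m1, m2)
cmp

# p-value for the null hypothesis that the added terms do not help
p_val <- cmp[["Pr(>F)"]][2]
p_val
```

A small p-value supports keeping the additional variables, which is consistent with the jump in R-squared between the two fits.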

To check the residuals, we draw the standard diagnostic plots:

par(mfrow = c(2, 2))
plot(fit2)

As seen in the Normal Q-Q panel, the residuals are approximately normally distributed, apart from a few outliers.
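The visual check can be complemented by a formal test. The sketch below applies a Shapiro-Wilk test to the residuals; this test is an assumption of this example and was not used in the original analysis, and `d`, `m2` and `sw` are illustrative names.

```r
# Sketch: Shapiro-Wilk normality test on the residuals of the larger model.
# d, m2 and sw are illustrative names (assumptions).
d <- datasets::mtcars
d$am  <- factor(d$am, labels = c("Automatic", "Manual"))
d$cyl <- factor(d$cyl)
m2 <- lm(mpg ~ am + cyl + disp + hp + wt, data = d)

sw <- shapiro.test(residuals(m2))
sw$statistic
sw$p.value
```

A p-value above 0.05 would be consistent with approximately normal residuals; with only 32 observations the test has limited power, so it is best read alongside the Q-Q plot rather than instead of it.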

Conclusion

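We can conclude that mpg is related to am: manual cars average about 7.2 mpg more than automatic cars. However, confounding variables such as wt, hp and cyl affect this relation. Once they are included in the model, the estimated advantage of manual transmission drops to about 1.8 mpg and is no longer statistically significant (p = 0.2155).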