Executive Summary

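In this report, we explore the relationship between a set of variables and miles per gallon (MPG) in the mtcars data set. We are particularly interested in the following two questions:

  • Is an automatic or manual transmission better for MPG?

  • Quantify the MPG difference between automatic and manual transmissions.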

Data

We will be using the mtcars data set. The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

library(datasets)
data(mtcars)

A data frame with 32 observations on 11 (numeric) variables.

[, 1]  mpg   Miles/(US) gallon
[, 2]  cyl   Number of cylinders
[, 3]  disp  Displacement (cu.in.)
[, 4]  hp    Gross horsepower
[, 5]  drat  Rear axle ratio
[, 6]  wt    Weight (1000 lbs)
[, 7]  qsec  1/4 mile time
[, 8]  vs    Engine (0 = V-shaped, 1 = straight)
[, 9]  am    Transmission (0 = automatic, 1 = manual)
[,10]  gear  Number of forward gears
[,11]  carb  Number of carburetors

Exploratory Data Analysis

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

We copy the data to a new data frame, store the transmission type in a new factor column (trans) that replaces am, and create separate data frames for automatic and manual cars.

# Work on a copy so the original mtcars stays untouched
mtcars2 <- mtcars
# Recode am (0/1) as a labelled factor in a new column
mtcars2$trans <- factor(mtcars$am, levels = c(0, 1),
                        labels = c("Automatic", "Manual"))

drop <- c("am")
mtcars2 <- mtcars2[, !(names(mtcars2) %in% drop)]

auto <- mtcars2[mtcars2$trans == "Automatic",]
manual <- mtcars2[mtcars2$trans == "Manual",]
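A quick sanity check on the recoding can be sketched as follows. This is an illustrative addition, not part of the original analysis: it rebuilds the factor with the same two labels and cross-tabulates it against the original 0/1 coding. The names `check` and `auto_only` are hypothetical helpers. It also shows `droplevels()`, which removes the unused level from a subset.

```r
# Rebuild the recoded factor (same labels as in the report) and compare it
# with the original 0/1 coding; every car should land on the diagonal.
trans <- factor(mtcars$am, levels = c(0, 1),
                labels = c("Automatic", "Manual"))
check <- table(am = mtcars$am, trans = trans)
print(check)

# A subset keeps both factor levels unless they are dropped explicitly.
auto_only <- data.frame(mpg = mtcars$mpg, trans = trans)[trans == "Automatic", ]
auto_only <- droplevels(auto_only)
levels(auto_only$trans)
```

Dropping the unused level is optional; without it, summary() reports a zero count for the other transmission type, as seen in the summaries in this report.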

Looking at Automatic Cars

summary(auto)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   :120.1   Min.   : 62.0  
##  1st Qu.:14.95   1st Qu.:6.000   1st Qu.:196.3   1st Qu.:116.5  
##  Median :17.30   Median :8.000   Median :275.8   Median :175.0  
##  Mean   :17.15   Mean   :6.947   Mean   :290.4   Mean   :160.3  
##  3rd Qu.:19.20   3rd Qu.:8.000   3rd Qu.:360.0   3rd Qu.:192.5  
##  Max.   :24.40   Max.   :8.000   Max.   :472.0   Max.   :245.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :2.465   Min.   :15.41   Min.   :0.0000  
##  1st Qu.:3.070   1st Qu.:3.438   1st Qu.:17.18   1st Qu.:0.0000  
##  Median :3.150   Median :3.520   Median :17.82   Median :0.0000  
##  Mean   :3.286   Mean   :3.769   Mean   :18.18   Mean   :0.3684  
##  3rd Qu.:3.695   3rd Qu.:3.842   3rd Qu.:19.17   3rd Qu.:1.0000  
##  Max.   :3.920   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##       gear            carb             trans   
##  Min.   :3.000   Min.   :1.000   Automatic:19  
##  1st Qu.:3.000   1st Qu.:2.000   Manual   : 0  
##  Median :3.000   Median :3.000                 
##  Mean   :3.211   Mean   :2.737                 
##  3rd Qu.:3.000   3rd Qu.:4.000                 
##  Max.   :4.000   Max.   :4.000

There are 19 automatic cars, with a mean mpg of 17.15.

Looking at Manual Cars

summary(manual)
##       mpg             cyl             disp             hp             drat     
##  Min.   :15.00   Min.   :4.000   Min.   : 71.1   Min.   : 52.0   Min.   :3.54  
##  1st Qu.:21.00   1st Qu.:4.000   1st Qu.: 79.0   1st Qu.: 66.0   1st Qu.:3.85  
##  Median :22.80   Median :4.000   Median :120.3   Median :109.0   Median :4.08  
##  Mean   :24.39   Mean   :5.077   Mean   :143.5   Mean   :126.8   Mean   :4.05  
##  3rd Qu.:30.40   3rd Qu.:6.000   3rd Qu.:160.0   3rd Qu.:113.0   3rd Qu.:4.22  
##  Max.   :33.90   Max.   :8.000   Max.   :351.0   Max.   :335.0   Max.   :4.93  
##        wt             qsec             vs              gear      
##  Min.   :1.513   Min.   :14.50   Min.   :0.0000   Min.   :4.000  
##  1st Qu.:1.935   1st Qu.:16.46   1st Qu.:0.0000   1st Qu.:4.000  
##  Median :2.320   Median :17.02   Median :1.0000   Median :4.000  
##  Mean   :2.411   Mean   :17.36   Mean   :0.5385   Mean   :4.385  
##  3rd Qu.:2.780   3rd Qu.:18.61   3rd Qu.:1.0000   3rd Qu.:5.000  
##  Max.   :3.570   Max.   :19.90   Max.   :1.0000   Max.   :5.000  
##       carb             trans   
##  Min.   :1.000   Automatic: 0  
##  1st Qu.:1.000   Manual   :13  
##  Median :2.000                 
##  Mean   :2.923                 
##  3rd Qu.:4.000                 
##  Max.   :8.000

There are 13 manual cars, with a mean mpg of 24.39.

Is an automatic or manual transmission better for MPG?

By direct comparison, manual cars have better mileage on average (24.39 mpg) than automatic cars (17.15 mpg), a difference of about 7.2 mpg.

Later on, we will build linear models that include other variables affecting MPG, to check whether this comparison still holds.
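The group means quoted above can also be computed directly instead of being read off the summary tables. The sketch below is an illustrative addition; `group_means`, `group_sizes` and `mpg_gap` are hypothetical names.

```r
# Group sizes, mean mpg per transmission type, and the raw difference.
trans <- factor(mtcars$am, levels = c(0, 1),
                labels = c("Automatic", "Manual"))
group_sizes <- table(trans)
group_means <- tapply(mtcars$mpg, trans, mean)
mpg_gap <- unname(group_means["Manual"] - group_means["Automatic"])

group_sizes
round(group_means, 2)
round(mpg_gap, 2)
```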

Quantify the MPG difference between automatic and manual transmissions

library(ggplot2)
g <- ggplot(data = mtcars2, aes(x = trans, y = mpg, fill = trans))
g <- g + geom_boxplot() + geom_jitter()
g <- g + xlab("Transmission") + ylab("Miles Per Gallon (MPG)") + 
    ggtitle("MPG difference by Transmission type")
g

Comparing the boxes, the median manual car (22.8 mpg) exceeds the third quartile of the automatic cars (19.2 mpg). In other words, more than 50% of the manual cars have better mileage than 75% of the automatic cars.
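The reading of the boxplot can be backed by the quartiles themselves. The following sketch is an illustrative addition with hypothetical names (`quartiles`, `auto_q3`, `share_above`). It relies on quantile()'s default method, which is the same one summary() uses.

```r
# Quartiles of mpg per transmission type (rows: 25%, 50%, 75%).
trans <- factor(mtcars$am, levels = c(0, 1),
                labels = c("Automatic", "Manual"))
quartiles <- sapply(split(mtcars$mpg, trans), quantile,
                    probs = c(0.25, 0.5, 0.75))
round(quartiles, 2)

# Share of manual cars whose mpg exceeds the automatic third quartile.
auto_q3 <- quartiles["75%", "Automatic"]
share_above <- mean(mtcars$mpg[trans == "Manual"] > auto_q3)
share_above
```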

Hypothesis Testing

To check whether the difference in mean mpg between automatic and manual cars is statistically significant, we perform a two-sample t-test. By default, R runs Welch's version, which does not assume equal variances.

t.test(auto$mpg, manual$mpg)
## 
##  Welch Two Sample t-test
## 
## data:  auto$mpg and manual$mpg
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean of x mean of y 
##  17.14737  24.39231

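Taking a significance level of alpha = 0.05, the p-value (0.001374) is below alpha, so we reject the null hypothesis of equal means. Thus, there is a significant difference between the mean mpg of automatic and manual cars. The 95% confidence interval for the difference (automatic minus manual) runs from -11.28 to -3.21 mpg.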
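The decision rule can be made explicit in code rather than read off the printed output. This sketch is an illustrative addition: it uses the formula interface of t.test() on the original am column (0 = automatic, 1 = manual), which should give the same Welch test as the two-vector call above. `tt`, `alpha` and `reject_null` are hypothetical names.

```r
# Welch two-sample t-test via the formula interface; the groups are
# am = 0 (automatic) and am = 1 (manual), in that order.
tt <- t.test(mpg ~ am, data = mtcars)

alpha <- 0.05
reject_null <- tt$p.value < alpha   # TRUE means the means differ at level alpha

tt$p.value
tt$conf.int    # 95% CI for mean(automatic) - mean(manual)
reject_null
```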

Regression Modelling

Strategy for Model Selection

Since the dependent variable, i.e. mpg, is continuous, we will perform linear regression.

First, we fit a linear model with mpg as the dependent variable and transmission type as the only predictor.

Later, we add more variables to the model to see how that changes the results.

Model 1

fit1 <- lm(mpg ~ trans, data = mtcars2)
summary(fit1)
## 
## Call:
## lm(formula = mpg ~ trans, data = mtcars2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## transManual    7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

With an R-squared of ~0.36, this model explains only 36% of the total variability in mpg, so there is scope to improve it.

Interpreting the coefficients

The Intercept corresponds to the reference level, Automatic. Its estimate of ~17.15 is the mean mileage of automatic cars in mpg.

The ‘transManual’ coefficient is the difference in mean mileage between manual and automatic cars. The estimate of ~7.25 indicates that a manual car gives, on average, 7.25 mpg more than an automatic car.
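With a single two-level factor as the predictor, the coefficients are just a re-expression of the two group means. The sketch below is an illustrative addition that checks this; it rebuilds the trans column with transform() so that it runs on its own, and `group_means` is a hypothetical name.

```r
# Refit Model 1 and compare its coefficients with the raw group means.
mtcars2 <- transform(mtcars,
                     trans = factor(am, levels = c(0, 1),
                                    labels = c("Automatic", "Manual")))
fit1 <- lm(mpg ~ trans, data = mtcars2)
group_means <- tapply(mtcars2$mpg, mtcars2$trans, mean)

# Intercept = mean of the reference level (Automatic)
all.equal(unname(coef(fit1)["(Intercept)"]), unname(group_means["Automatic"]))
# Intercept + transManual = mean of the Manual group
all.equal(sum(coef(fit1)), unname(group_means["Manual"]))
```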

Next, we build a model with all variables to see how the transmission effect changes once the other predictors are included.

Model with all Variables

fit_all <- lm(mpg ~ . , data = mtcars2)
summary(fit_all)
## 
## Call:
## lm(formula = mpg ~ ., data = mtcars2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4506 -1.6044 -0.1196  1.2193  4.6271 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 12.30337   18.71788   0.657   0.5181  
## cyl         -0.11144    1.04502  -0.107   0.9161  
## disp         0.01334    0.01786   0.747   0.4635  
## hp          -0.02148    0.02177  -0.987   0.3350  
## drat         0.78711    1.63537   0.481   0.6353  
## wt          -3.71530    1.89441  -1.961   0.0633 .
## qsec         0.82104    0.73084   1.123   0.2739  
## vs           0.31776    2.10451   0.151   0.8814  
## gear         0.65541    1.49326   0.439   0.6652  
## carb        -0.19942    0.82875  -0.241   0.8122  
## transManual  2.52023    2.05665   1.225   0.2340  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared:  0.869,  Adjusted R-squared:  0.8066 
## F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07

Interpreting the coefficients

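This model explains ~87% of the variability in mpg (adjusted R-squared ~0.81), but none of the coefficients is significant at the 0.05 level; wt comes closest (p = 0.063). A high R-squared with no individually significant predictor is a typical symptom of multicollinearity.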

Diagnostics

Let's check the variance inflation factors (VIF).

library(car)
## Warning: package 'car' was built under R version 4.1.2
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.1.2
vif(fit_all)
##       cyl      disp        hp      drat        wt      qsec        vs      gear 
## 15.373833 21.620241  9.832037  3.374620 15.164887  7.527958  4.965873  5.357452 
##      carb     trans 
##  7.908747  4.648487

Several variables, notably cyl (15.4), disp (21.6) and wt (15.2), have a very high VIF, indicating that they are highly correlated with other predictors in the model. So let's check the correlations between all variables.
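For a model with only single-coefficient terms, the VIF of predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on all the other predictors. The sketch below is an illustrative addition that computes this in base R without the car package. It uses the original 0/1 am column, which gives the same R-squared as the two-level trans factor; `predictors` and `manual_vif` are hypothetical names.

```r
# VIF by hand: regress each predictor on the remaining predictors and
# convert the resulting R-squared into 1 / (1 - R^2).
predictors <- setdiff(names(mtcars), "mpg")
manual_vif <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  r2 <- summary(lm(reformulate(others, response = p), data = mtcars))$r.squared
  1 / (1 - r2)
})
round(manual_vif, 2)
```

The values should agree with the vif(fit_all) output above, with am taking the place of trans.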

library(corrplot)
## Warning: package 'corrplot' was built under R version 4.1.2
## corrplot 0.92 loaded
corrplot(cor(mtcars))

As expected, a few variables such as cyl, disp and wt are highly correlated not only with mpg but also with each other. We can’t simply remove all or many of them, as omitting relevant predictors would bias the remaining estimates. We also can’t keep all of them, since they inflate the variance of the coefficient estimates to a large extent.
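The correlation plot is easier to act on with the numbers next to it. This sketch is an illustrative addition that prints the correlation matrix for mpg and the four strongly related predictors; `vars` and `cm` are hypothetical names.

```r
# Pairwise Pearson correlations among mpg and four correlated predictors.
vars <- c("mpg", "cyl", "disp", "hp", "wt")
cm <- round(cor(mtcars[, vars]), 2)
cm

# Which of disp, hp and wt is most strongly correlated with cyl?
names(which.max(cm["cyl", c("disp", "hp", "wt")]))
```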

Let's build on Model 1 and create a couple of models by adding a few independent variables in steps.

Models 2 & 3

Model 2: Let's include variables that are strongly correlated with mpg, but drop one of them that is also strongly correlated with another predictor in the set. This reduces the variance inflation caused by the multicollinearity we observed.

The candidates are cyl, disp, hp and wt. From the correlation matrix we can see that cyl is more strongly correlated with disp than with hp or wt, so we keep cyl and drop disp.

fit2 <- lm(mpg ~ trans + cyl + hp + wt, mtcars2)
summary(fit2)
## 
## Call:
## lm(formula = mpg ~ trans + cyl + hp + wt, data = mtcars2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4765 -1.8471 -0.5544  1.2758  5.6608 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 36.14654    3.10478  11.642 4.94e-12 ***
## transManual  1.47805    1.44115   1.026   0.3142    
## cyl         -0.74516    0.58279  -1.279   0.2119    
## hp          -0.02495    0.01365  -1.828   0.0786 .  
## wt          -2.60648    0.91984  -2.834   0.0086 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.509 on 27 degrees of freedom
## Multiple R-squared:  0.849,  Adjusted R-squared:  0.8267 
## F-statistic: 37.96 on 4 and 27 DF,  p-value: 1.025e-10

Model 3: Let's add a few more variables that correlate well with mpg.

fit3 <- lm(mpg ~ trans + cyl + hp + wt + drat + vs, data = mtcars2)
summary(fit3)
## 
## Call:
## lm(formula = mpg ~ trans + cyl + hp + wt + drat + vs, data = mtcars2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5421 -1.5787 -0.4003  1.3326  5.4488 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 31.34852    9.01121   3.479  0.00186 **
## transManual  1.83252    1.76168   1.040  0.30820   
## cyl         -0.32673    0.85544  -0.382  0.70573   
## hp          -0.02660    0.01437  -1.850  0.07611 . 
## wt          -2.50419    0.96337  -2.599  0.01545 * 
## drat         0.40474    1.51180   0.268  0.79111   
## vs           1.19317    1.84800   0.646  0.52438   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.583 on 25 degrees of freedom
## Multiple R-squared:  0.8518, Adjusted R-squared:  0.8163 
## F-statistic: 23.96 on 6 and 25 DF,  p-value: 3.139e-09

Interpreting the coefficients

Both models explain around 85% of the variability in mpg. The wt variable is significant in both models (p = 0.0086 in Model 2, p = 0.015 in Model 3). The two variables added in Model 3, drat and vs, are not significant.

Diagnostics

vif(fit2)
##    trans      cyl       hp       wt 
## 2.546159 5.333685 4.310029 3.988305
vif(fit3)
##     trans       cyl        hp        wt      drat        vs 
##  3.589652 10.842157  4.512313  4.127433  3.035194  4.029990
anova(fit1, fit2, fit3)
## Analysis of Variance Table
## 
## Model 1: mpg ~ trans
## Model 2: mpg ~ trans + cyl + hp + wt
## Model 3: mpg ~ trans + cyl + hp + wt + drat + vs
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
## 1     30 720.90                                   
## 2     27 170.00  3    550.90 27.5170 4.365e-08 ***
## 3     25 166.84  2      3.16  0.2369    0.7908    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model Selection

From the above results we can see that Model 2 is a significantly better fit than Model 1 (F = 27.52, p = 4.4e-08), whereas Model 3 adds no significant improvement over Model 2 (F = 0.24, p = 0.79). Model 2 also has the lowest VIFs: all are below 5.4, whereas cyl reaches 10.8 in Model 3.
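The F statistics in the ANOVA table can be reproduced from the residual sums of squares. The sketch below is an illustrative addition with hypothetical names (`rss`, `df_res`, `denom`, `f_12`, `f_23`, `p_12`). When several models are compared at once, the denominator is the residual mean square of the largest model, here Model 3.

```r
# Refit the three nested models so the block runs on its own.
mtcars2 <- transform(mtcars,
                     trans = factor(am, levels = c(0, 1),
                                    labels = c("Automatic", "Manual")))
fit1 <- lm(mpg ~ trans, data = mtcars2)
fit2 <- lm(mpg ~ trans + cyl + hp + wt, data = mtcars2)
fit3 <- lm(mpg ~ trans + cyl + hp + wt + drat + vs, data = mtcars2)

rss    <- c(deviance(fit1), deviance(fit2), deviance(fit3))
df_res <- c(df.residual(fit1), df.residual(fit2), df.residual(fit3))

# Residual mean square of the largest model (Model 3)
denom <- rss[3] / df_res[3]

# F = (drop in RSS / number of added terms) / denom
f_12 <- ((rss[1] - rss[2]) / (df_res[1] - df_res[2])) / denom
f_23 <- ((rss[2] - rss[3]) / (df_res[2] - df_res[3])) / denom
p_12 <- pf(f_12, df_res[1] - df_res[2], df_res[3], lower.tail = FALSE)

round(c(f_12 = f_12, f_23 = f_23), 4)
```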

Diagnostic Plots

Let's look at the diagnostic plots for Model 2.

par(mfrow = c(2, 2))
plot(fit2)

Examining the assumptions of Linear Regression through diagnostic plots:

  • Linearity and independence of residuals - In the Residuals vs Fitted plot, we don’t observe a particular pattern, so the linearity assumption looks reasonable and the residuals appear independent and randomly scattered.

  • Normal distribution of residuals - In the Q-Q plot, the residuals fall roughly on a straight line, suggesting that they are approximately normally distributed.

  • Equal variance of residuals - In the Scale-Location plot, the residuals seem equally spread around the horizontal line, suggesting homoscedasticity.

Also, in the Residuals vs Leverage plot, no point lies beyond the Cook’s distance contours, indicating that no single data point is drastically influencing the regression estimates.
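The influence check can also be done numerically. The sketch below is an illustrative addition: it extracts the Cook's distances and leverages for Model 2 and counts how many points reach 0.5, the inner contour that plot() draws by default. `cd`, `lev` and `n_beyond` are hypothetical names.

```r
# Refit Model 2 so the block runs on its own.
mtcars2 <- transform(mtcars,
                     trans = factor(am, levels = c(0, 1),
                                    labels = c("Automatic", "Manual")))
fit2 <- lm(mpg ~ trans + cyl + hp + wt, data = mtcars2)

cd  <- cooks.distance(fit2)   # influence of each car on the fitted coefficients
lev <- hatvalues(fit2)        # leverage of each car

# The three most influential cars by Cook's distance
head(sort(cd, decreasing = TRUE), 3)

# Number of cars at or beyond the 0.5 contour
n_beyond <- sum(cd >= 0.5)
n_beyond
```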

Conclusion

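We selected Model 2, which explains 85% of the variability in mpg using transmission type, number of cylinders, horsepower and weight as the independent variables. Based on this model, a car with manual transmission gives on average 1.478 mpg more than an automatic car with the same number of cylinders, horsepower and weight, although this coefficient is not statistically significant (p = 0.314).

Thus, a manual transmission is associated with better MPG than an automatic one: the unadjusted difference is about 7.2 mpg, and it shrinks to about 1.5 mpg once cylinders, horsepower and weight are accounted for.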
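The uncertainty around the adjusted transmission effect can be summarised with a confidence interval. The sketch below is an illustrative addition using confint() on Model 2; `ci` is a hypothetical name.

```r
# Refit Model 2 so the block runs on its own.
mtcars2 <- transform(mtcars,
                     trans = factor(am, levels = c(0, 1),
                                    labels = c("Automatic", "Manual")))
fit2 <- lm(mpg ~ trans + cyl + hp + wt, data = mtcars2)

# 95% confidence interval for the manual-vs-automatic difference,
# holding cyl, hp and wt fixed.
ci <- confint(fit2, "transManual", level = 0.95)
ci
```

An interval that includes zero is consistent with the p-value of 0.314 reported for transManual in the Model 2 summary.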

Figures

par(mfrow = c(1,1))
g

corrplot(cor(mtcars))

plot(fit2)