In this report, we explore the relationship between a set of variables and miles per gallon (MPG). We are particularly interested in the following two questions:
“Is an automatic or manual transmission better for MPG?”
“Quantify the MPG difference between automatic and manual transmissions.”
We will be using the mtcars data set. The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).
library(datasets)
data(mtcars)
A data frame with 32 observations on 11 (numeric) variables.
[, 1] mpg Miles/(US) gallon
[, 2] cyl Number of cylinders
[, 3] disp Displacement (cu.in.)
[, 4] hp Gross horsepower
[, 5] drat Rear axle ratio
[, 6] wt Weight (1000 lbs)
[, 7] qsec 1/4 mile time
[, 8] vs Engine (0 = V-shaped, 1 = straight)
[, 9] am Transmission (0 = automatic, 1 = manual)
[,10] gear Number of forward gears
[,11] carb Number of carburetors
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
We copy the data to a new data frame, store the transmission type in a new factor column, and create separate data frames for the two transmission types.
mtcars2 <- mtcars
# Recode am (0 = automatic, 1 = manual) as a labelled factor in a new column
mtcars2$trans <- factor(mtcars2$am, levels = c(0, 1), labels = c("Automatic", "Manual"))
# Drop the original am column so the transmission does not enter the models twice
mtcars2$am <- NULL
# One data frame per transmission type
auto <- mtcars2[mtcars2$trans == "Automatic",]
manual <- mtcars2[mtcars2$trans == "Manual",]
Looking at Automatic Cars
summary(auto)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. :120.1 Min. : 62.0
## 1st Qu.:14.95 1st Qu.:6.000 1st Qu.:196.3 1st Qu.:116.5
## Median :17.30 Median :8.000 Median :275.8 Median :175.0
## Mean :17.15 Mean :6.947 Mean :290.4 Mean :160.3
## 3rd Qu.:19.20 3rd Qu.:8.000 3rd Qu.:360.0 3rd Qu.:192.5
## Max. :24.40 Max. :8.000 Max. :472.0 Max. :245.0
## drat wt qsec vs
## Min. :2.760 Min. :2.465 Min. :15.41 Min. :0.0000
## 1st Qu.:3.070 1st Qu.:3.438 1st Qu.:17.18 1st Qu.:0.0000
## Median :3.150 Median :3.520 Median :17.82 Median :0.0000
## Mean :3.286 Mean :3.769 Mean :18.18 Mean :0.3684
## 3rd Qu.:3.695 3rd Qu.:3.842 3rd Qu.:19.17 3rd Qu.:1.0000
## Max. :3.920 Max. :5.424 Max. :22.90 Max. :1.0000
## gear carb trans
## Min. :3.000 Min. :1.000 Automatic:19
## 1st Qu.:3.000 1st Qu.:2.000 Manual : 0
## Median :3.000 Median :3.000
## Mean :3.211 Mean :2.737
## 3rd Qu.:3.000 3rd Qu.:4.000
## Max. :4.000 Max. :4.000
There are 19 automatic cars with a mean mpg = 17.15
Looking at Manual Cars
summary(manual)
## mpg cyl disp hp drat
## Min. :15.00 Min. :4.000 Min. : 71.1 Min. : 52.0 Min. :3.54
## 1st Qu.:21.00 1st Qu.:4.000 1st Qu.: 79.0 1st Qu.: 66.0 1st Qu.:3.85
## Median :22.80 Median :4.000 Median :120.3 Median :109.0 Median :4.08
## Mean :24.39 Mean :5.077 Mean :143.5 Mean :126.8 Mean :4.05
## 3rd Qu.:30.40 3rd Qu.:6.000 3rd Qu.:160.0 3rd Qu.:113.0 3rd Qu.:4.22
## Max. :33.90 Max. :8.000 Max. :351.0 Max. :335.0 Max. :4.93
## wt qsec vs gear
## Min. :1.513 Min. :14.50 Min. :0.0000 Min. :4.000
## 1st Qu.:1.935 1st Qu.:16.46 1st Qu.:0.0000 1st Qu.:4.000
## Median :2.320 Median :17.02 Median :1.0000 Median :4.000
## Mean :2.411 Mean :17.36 Mean :0.5385 Mean :4.385
## 3rd Qu.:2.780 3rd Qu.:18.61 3rd Qu.:1.0000 3rd Qu.:5.000
## Max. :3.570 Max. :19.90 Max. :1.0000 Max. :5.000
## carb trans
## Min. :1.000 Automatic: 0
## 1st Qu.:1.000 Manual :13
## Median :2.000
## Mean :2.923
## 3rd Qu.:4.000
## Max. :8.000
There are 13 manual cars with a mean mpg = 24.39
By direct comparison, Manual cars have better mileage on average (24.39 mpg) than Automatic cars (17.15 mpg).
Later on, we will build linear models that include other variables affecting MPG, to check whether this comparison still holds.
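The group sizes and means quoted above can also be obtained without building separate data frames. The following is a minimal sketch that works from the original `am` coding; the names `trans`, `group_n`, `group_mean` and `mean_diff` are illustrative and not part of the original analysis.

```r
# Group sizes and mean mpg by transmission, computed from the raw am column
data(mtcars)

# Label the 0/1 coding so the output is readable
trans <- factor(mtcars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

group_n    <- table(trans)                     # number of cars per group
group_mean <- tapply(mtcars$mpg, trans, mean)  # mean mpg per group

# Unadjusted difference in mean mpg (Manual minus Automatic)
mean_diff <- unname(group_mean["Manual"] - group_mean["Automatic"])

print(group_n)
print(round(group_mean, 2))
print(round(mean_diff, 2))
```

The difference printed last is the raw gap between the two groups, before any other variable is taken into account.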
library(ggplot2)
g <- ggplot(data = mtcars2, aes(x = trans, y = mpg, fill = trans))
g <- g + geom_boxplot() + geom_jitter()
g <- g + xlab("Transmission") + ylab("Miles Per Gallon (MPG)") +
ggtitle("MPG difference by Transmission type")
g
Comparing the boxes, the median mpg of Manual cars (22.8) lies above the third quartile of Automatic cars (19.2), so more than 50% of the Manual cars have better mileage than 75% of the Automatic cars.
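The reading of the boxplot can be checked numerically. The sketch below compares the Manual median with the Automatic third quartile and computes the share of Manual cars above that quartile; the variable names are illustrative, and `quantile()` is used with its default interpolation method.

```r
# Numeric check of the boxplot comparison
data(mtcars)

auto_mpg   <- mtcars$mpg[mtcars$am == 0]
manual_mpg <- mtcars$mpg[mtcars$am == 1]

auto_q3       <- unname(quantile(auto_mpg, 0.75))  # upper edge of the Automatic box
manual_median <- median(manual_mpg)                # middle line of the Manual box

# Proportion of Manual cars whose mpg exceeds the Automatic third quartile
share_above <- mean(manual_mpg > auto_q3)

print(c(auto_q3 = auto_q3, manual_median = manual_median))
print(share_above)
```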
To check whether the difference between the mean mpg of Automatic and Manual cars is statistically significant, we perform a two-sample t-test.
t.test(auto$mpg, manual$mpg)
##
## Welch Two Sample t-test
##
## data: auto$mpg and manual$mpg
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean of x mean of y
## 17.14737 24.39231
Taking alpha (significance level) = 0.05, the p-value (0.001374) is less than alpha, so we reject the null hypothesis that the two group means are equal. Thus, there is a significant difference between the mean mpg of automatic and manual cars.
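For readers who want to see where the t statistic and the fractional degrees of freedom come from, the Welch test can be recomputed by hand. This is a sketch using the standard Welch-Satterthwaite formula; the variable names are illustrative.

```r
# Welch two-sample t-test recomputed step by step
data(mtcars)

x <- mtcars$mpg[mtcars$am == 0]  # automatic
y <- mtcars$mpg[mtcars$am == 1]  # manual

# Squared standard error of each group mean
se2_x <- var(x) / length(x)
se2_y <- var(y) / length(y)

# Test statistic: difference in means over its standard error
t_stat <- (mean(x) - mean(y)) / sqrt(se2_x + se2_y)

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (se2_x + se2_y)^2 /
  (se2_x^2 / (length(x) - 1) + se2_y^2 / (length(y) - 1))

# Two-sided p-value
p_value <- 2 * pt(-abs(t_stat), df_welch)

print(c(t = t_stat, df = df_welch, p = p_value))
```

These three numbers should agree with the `t.test()` output, since `t.test()` does not assume equal variances by default.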
Since the dependent variable, i.e. mpg, is continuous, we will perform linear regression.
Firstly, we will fit the linear model with mpg as the dependent variable and just transmission type as the predictor.
Later on we will include more variables to the model to understand how that impacts the results.
fit1 <- lm(mpg ~ trans, data = mtcars2)
summary(fit1)
##
## Call:
## lm(formula = mpg ~ trans, data = mtcars2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## transManual 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
With the R-squared value of ~0.36, this model explains 36% of the total variability in mpg. So, there’s scope to improve this model.
The Intercept refers to the Automatic cars, the reference level of the trans factor. The Estimate of ~17.15 tells us that, on average, an Automatic car gives a mileage of 17.15 mpg.
The ‘transManual’ coefficient is the difference in mean mileage between Manual cars and Automatic cars (the Intercept). Thus the Estimate of ~7.25 indicates that a Manual transmission car, on average, gives 7.25 mpg more than an Automatic car.
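With a single two-level factor as predictor, the regression coefficients are just the group means in another form. The sketch below illustrates this and adds a 95% confidence interval for the difference; the data frame `cars` and the other names are illustrative and rebuild the factor from `am` so the block runs on its own.

```r
# Single-predictor model: coefficients versus group means
data(mtcars)

cars <- transform(mtcars,
                  trans = factor(am, levels = c(0, 1),
                                 labels = c("Automatic", "Manual")))

fit   <- lm(mpg ~ trans, data = cars)
coefs <- coef(fit)
means <- tapply(cars$mpg, cars$trans, mean)

# Intercept = mean of the reference group (Automatic);
# transManual = Manual mean minus Automatic mean
print(coefs)
print(means)

# 95% confidence interval for the Manual vs Automatic difference
ci <- confint(fit, "transManual", level = 0.95)
print(ci)
```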
Now we build a model with all variables to see how the estimated transmission effect changes once the other variables are included.
fit_all <- lm(mpg ~ . , data = mtcars2)
summary(fit_all)
##
## Call:
## lm(formula = mpg ~ ., data = mtcars2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4506 -1.6044 -0.1196 1.2193 4.6271
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.30337 18.71788 0.657 0.5181
## cyl -0.11144 1.04502 -0.107 0.9161
## disp 0.01334 0.01786 0.747 0.4635
## hp -0.02148 0.02177 -0.987 0.3350
## drat 0.78711 1.63537 0.481 0.6353
## wt -3.71530 1.89441 -1.961 0.0633 .
## qsec 0.82104 0.73084 1.123 0.2739
## vs 0.31776 2.10451 0.151 0.8814
## gear 0.65541 1.49326 0.439 0.6652
## carb -0.19942 0.82875 -0.241 0.8122
## transManual 2.52023 2.05665 1.225 0.2340
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared: 0.869, Adjusted R-squared: 0.8066
## F-statistic: 13.93 on 10 and 21 DF, p-value: 3.793e-07
This model explains ~87% of the variability, but none of the individual coefficients is significant at the 0.05 level. This could be due to multicollinearity.
Let's check the variance inflation factors (VIF).
library(car)
vif(fit_all)
## cyl disp hp drat wt qsec vs gear
## 15.373833 21.620241 9.832037 3.374620 15.164887 7.527958 4.965873 5.357452
## carb trans
## 7.908747 4.648487
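The values returned by `vif()` can be reproduced with base R, which also makes their meaning concrete: each predictor is regressed on all the other predictors, and the VIF is 1 / (1 - R²) of that auxiliary regression. The sketch below uses the numeric `am` column in place of the `trans` factor, which gives the same value for a two-level variable; `predictors` and `vif_by_hand` are illustrative names.

```r
# Variance inflation factors computed from auxiliary regressions
data(mtcars)

predictors <- setdiff(names(mtcars), "mpg")

vif_by_hand <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  # Regress predictor v on all remaining predictors
  aux <- lm(reformulate(others, response = v), data = mtcars)
  # VIF = 1 / (1 - R^2) of the auxiliary regression
  1 / (1 - summary(aux)$r.squared)
})

print(round(sort(vif_by_hand, decreasing = TRUE), 2))
```

A VIF of about 15 for wt means that the variance of its coefficient is roughly 15 times larger than it would be if wt were uncorrelated with the other predictors.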
Several variables, such as cyl, disp and wt, have a very high VIF, indicating that they are highly correlated with other variables in the model. So let's check the correlation between all variables.
library(corrplot)
corrplot(cor(mtcars))
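The correlation plot is easier to reason about alongside the underlying numbers. The sketch below prints the correlations among mpg, cyl, disp, hp and wt and extracts the correlations of cyl with the other three predictors; `vars`, `cor_mat`, `cor_with_mpg` and `cor_cyl` are illustrative names.

```r
# Pairwise correlations among mpg and the four strongest candidate predictors
data(mtcars)

vars    <- c("mpg", "cyl", "disp", "hp", "wt")
cor_mat <- cor(mtcars[, vars])
print(round(cor_mat, 3))

# Correlation of each candidate predictor with mpg
cor_with_mpg <- cor_mat["mpg", c("cyl", "disp", "hp", "wt")]

# Correlation of cyl with the other candidate predictors
cor_cyl <- cor_mat["cyl", c("disp", "hp", "wt")]
print(round(cor_cyl, 3))
```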
As expected, a few of the variables, such as cyl, disp and wt, are highly correlated not only with mpg but also with each other. We can't simply remove all or many of them, as omitting relevant variables would bias the remaining estimates. We also can't keep all of them, since we know they inflate the variance of the coefficient estimates to a large extent.
Let's build on top of Model 1 and create a couple of models by adding a few independent variables in iterations.
Model 2: Let's include some variables with very high correlation to mpg, but drop one of them that is strongly correlated with another variable in the set. This reduces the variance inflation due to the multicollinearity we observed.
We pick cyl, disp, hp and wt. From the correlation matrix we can see that cyl has a greater correlation with disp than with hp or wt. Let's keep cyl and drop disp.
fit2 <- lm(mpg ~ trans + cyl + hp + wt, mtcars2)
summary(fit2)
##
## Call:
## lm(formula = mpg ~ trans + cyl + hp + wt, data = mtcars2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4765 -1.8471 -0.5544 1.2758 5.6608
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.14654 3.10478 11.642 4.94e-12 ***
## transManual 1.47805 1.44115 1.026 0.3142
## cyl -0.74516 0.58279 -1.279 0.2119
## hp -0.02495 0.01365 -1.828 0.0786 .
## wt -2.60648 0.91984 -2.834 0.0086 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.509 on 27 degrees of freedom
## Multiple R-squared: 0.849, Adjusted R-squared: 0.8267
## F-statistic: 37.96 on 4 and 27 DF, p-value: 1.025e-10
Model 3: Let's add a few more variables that have a good correlation with mpg.
fit3 <- lm(mpg ~ trans + cyl + hp + wt + drat + vs, data = mtcars2)
summary(fit3)
##
## Call:
## lm(formula = mpg ~ trans + cyl + hp + wt + drat + vs, data = mtcars2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5421 -1.5787 -0.4003 1.3326 5.4488
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.34852 9.01121 3.479 0.00186 **
## transManual 1.83252 1.76168 1.040 0.30820
## cyl -0.32673 0.85544 -0.382 0.70573
## hp -0.02660 0.01437 -1.850 0.07611 .
## wt -2.50419 0.96337 -2.599 0.01545 *
## drat 0.40474 1.51180 0.268 0.79111
## vs 1.19317 1.84800 0.646 0.52438
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.583 on 25 degrees of freedom
## Multiple R-squared: 0.8518, Adjusted R-squared: 0.8163
## F-statistic: 23.96 on 6 and 25 DF, p-value: 3.139e-09
Both models explain around 85% of the variability in mpg. The wt variable is significant in both models. The two variables added in Model 3 (drat and vs) are not significant.
vif(fit2)
## trans cyl hp wt
## 2.546159 5.333685 4.310029 3.988305
vif(fit3)
## trans cyl hp wt drat vs
## 3.589652 10.842157 4.512313 4.127433 3.035194 4.029990
anova(fit1, fit2, fit3)
## Analysis of Variance Table
##
## Model 1: mpg ~ trans
## Model 2: mpg ~ trans + cyl + hp + wt
## Model 3: mpg ~ trans + cyl + hp + wt + drat + vs
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 27 170.00 3 550.90 27.5170 4.365e-08 ***
## 3 25 166.84 2 3.16 0.2369 0.7908
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the above results we can see that Model 2 is a significantly better fit than Model 1, whereas Model 3 adds no significant improvement over Model 2 (p = 0.79). The VIF values of Model 2 are also lower than those of Model 3 and the full model.
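The nested F-tests can be complemented with criteria that penalise model size. The sketch below compares adjusted R-squared and AIC across the three models; AIC is not used in the original analysis and is added here only as an assumed cross-check, and the names `cars`, `fits`, `adj_r2` and `aic` are illustrative.

```r
# Adjusted R-squared and AIC for the three candidate models
data(mtcars)

cars <- transform(mtcars,
                  trans = factor(am, levels = c(0, 1),
                                 labels = c("Automatic", "Manual")))

fits <- list(
  model1 = lm(mpg ~ trans, data = cars),
  model2 = lm(mpg ~ trans + cyl + hp + wt, data = cars),
  model3 = lm(mpg ~ trans + cyl + hp + wt + drat + vs, data = cars)
)

# Higher adjusted R-squared is better; lower AIC is better
adj_r2 <- sapply(fits, function(m) summary(m)$adj.r.squared)
aic    <- sapply(fits, AIC)

print(round(cbind(adj_r2, aic), 3))
```

Both criteria account for the number of parameters, so a model that adds predictors without reducing the residual sum of squares much is not rewarded for it.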
Let's look at the Model 2 diagnostic plots.
par(mfrow = c(2,2))
plot(fit2)
Examining the assumptions of Linear Regression through diagnostic plots:
Linearity and independence of residuals - In the Residuals vs Fitted plot, we don’t observe a particular pattern, so the linearity assumption looks reasonable and the residuals appear to be independent.
Normal distribution of residuals - In the Q-Q plot, the residuals lie roughly on a straight line, suggesting that they are approximately normally distributed.
Equal variance of residuals - In the Scale-Location plot, the residuals seem equally spread around the horizontal line, suggesting homoscedasticity.
Also - In the Residuals vs Leverage plot, we don’t see a point beyond the Cook’s distance contours, indicating there isn’t any influential data point that’s affecting the regression estimates drastically.
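The visual checks can be backed by a few numbers. The sketch below refits Model 2 so that it runs on its own, applies a Shapiro-Wilk test to the residuals, and lists the largest Cook's distances; the Shapiro-Wilk test is an addition for illustration and not part of the original analysis, and the variable names are illustrative.

```r
# Numeric companions to the diagnostic plots of Model 2
data(mtcars)

cars <- transform(mtcars,
                  trans = factor(am, levels = c(0, 1),
                                 labels = c("Automatic", "Manual")))
fit <- lm(mpg ~ trans + cyl + hp + wt, data = cars)

res <- residuals(fit)

# Shapiro-Wilk test: a small p-value would argue against normal residuals
normality <- shapiro.test(res)
print(normality$p.value)

# Cook's distance per car; the plot draws reference contours at 0.5 and 1
cooks <- cooks.distance(fit)
print(round(head(sort(cooks, decreasing = TRUE), 3), 3))
```

With only 32 observations, such tests have limited power, so they are best read together with the plots rather than in place of them.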
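We selected Model 2, which explains 85% of the variability in mpg using transmission type, number of cylinders, horsepower and weight as the independent variables. Based on this model, holding the other variables constant, a car with Manual transmission gives on average 1.478 mpg more than an Automatic car, although this coefficient is not statistically significant at the 0.05 level (p = 0.314).
Thus, in this data set a Manual transmission is better for MPG than an Automatic one, but the difference is much smaller than the unadjusted 7.245 mpg once cylinders, horsepower and weight are accounted for.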
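To quantify the uncertainty around the adjusted difference, a confidence interval for the transmission coefficient in Model 2 can be computed. This is a minimal sketch that refits the model so the block is self-contained; `cars`, `fit`, `estimate` and `ci` are illustrative names.

```r
# 95% confidence interval for the adjusted Manual vs Automatic difference
data(mtcars)

cars <- transform(mtcars,
                  trans = factor(am, levels = c(0, 1),
                                 labels = c("Automatic", "Manual")))
fit <- lm(mpg ~ trans + cyl + hp + wt, data = cars)

estimate <- coef(fit)[["transManual"]]
ci       <- confint(fit, "transManual", level = 0.95)

print(round(estimate, 3))
print(round(ci, 3))
```

An interval that includes zero means the data are compatible with no difference between the transmission types after adjusting for cylinders, horsepower and weight.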
par(mfrow = c(1,1))
g
corrplot(cor(mtcars))
plot(fit2)