This study aims to better understand the factors on which a vehicle's miles per gallon (MPG) depends. For this purpose, the ‘mtcars’ data set was loaded and studied to help answer the following questions:
1. Is an automatic or manual transmission better for MPG?
2. Quantify the MPG difference between automatic and manual transmissions.
To arrive at the answers, the data set was examined and narrowed down to the variables that contributed the most to the mpg of a vehicle. Several models were then fitted, with the selected variables as regressors and mpg as the outcome, and one of them was selected. Residual plots were produced for the selected model, and conclusions were drawn from them.
The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). The data frame consists of 32 observations on 11 variables:
1 - mpg Miles/(US) gallon
2 - cyl Number of cylinders
3 - disp Displacement (cu.in.)
4 - hp Gross horsepower
5 - drat Rear axle ratio
6 - wt Weight (1000 lbs)
7 - qsec 1/4 mile time
8 - vs Engine (0 = V-shaped, 1 = straight)
9 - am Transmission (0 = automatic, 1 = manual)
10 - gear Number of forward gears
11 - carb Number of carburetors
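As a quick check of the description above, the following minimal sketch inspects the data set as it ships with base R. The labelled factor `am_label` is a hypothetical helper introduced only for illustration; it is not part of the original analysis.

```r
# mtcars ships with base R (package 'datasets'), so nothing needs downloading.
data(mtcars)

dim(mtcars)   # number of observations and number of variables
str(mtcars)   # every column is stored as numeric, including am and vs

# Hypothetical helper (not in the original analysis): readable labels for 'am'
am_label <- factor(mtcars$am, levels = c(0, 1),
                   labels = c("automatic", "manual"))
table(am_label)   # how many cars fall in each transmission group
```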
The first step is to explore the data set by fitting a linear model with mpg as the outcome and all remaining variables as regressors.
fit_exp<- lm(mpg~., data = mtcars)
summary(fit_exp)
##
## Call:
## lm(formula = mpg ~ ., data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4506 -1.6044 -0.1196 1.2193 4.6271
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.30337 18.71788 0.657 0.5181
## cyl -0.11144 1.04502 -0.107 0.9161
## disp 0.01334 0.01786 0.747 0.4635
## hp -0.02148 0.02177 -0.987 0.3350
## drat 0.78711 1.63537 0.481 0.6353
## wt -3.71530 1.89441 -1.961 0.0633 .
## qsec 0.82104 0.73084 1.123 0.2739
## vs 0.31776 2.10451 0.151 0.8814
## am 2.52023 2.05665 1.225 0.2340
## gear 0.65541 1.49326 0.439 0.6652
## carb -0.19942 0.82875 -0.241 0.8122
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared: 0.869, Adjusted R-squared: 0.8066
## F-statistic: 13.93 on 10 and 21 DF, p-value: 3.793e-07
From the summary we can deduce that, despite the high adjusted R-squared, we might not need to fit all of the variables to predict mpg: none of the individual coefficients is significant at the 5% level, and fitting too many correlated variables results in overfitting and inflated variance.
Coming to the first question, we seek to find out which kind of transmission is better for higher mpg.
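One way to make the variance inflation concern concrete is to compute variance inflation factors (VIFs) for the ten regressors. The sketch below is an illustration added here, not part of the original analysis; it relies on the standard identity that the VIFs are the diagonal of the inverse of the regressors' correlation matrix, and uses base R only.

```r
# Regressors of the full model: every column except the outcome mpg
X <- mtcars[, setdiff(names(mtcars), "mpg")]

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing regressor j on
# all the others; equivalently, the diagonal of the inverse correlation matrix.
vif <- diag(solve(cor(X)))
names(vif) <- names(X)

# Larger values mean the coefficient's variance is more inflated
round(sort(vif, decreasing = TRUE), 2)
```

A common rule of thumb treats values well above 5 to 10 as a sign that a regressor is largely explained by the others.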
fit_am<- lm(mpg~ factor(am), data = mtcars)
summary(fit_am)
##
## Call:
## lm(formula = mpg ~ factor(am), data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## factor(am)1 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
The summary estimates mean mpg to be 7.245 higher for vehicles with a manual transmission (am = 1) than for vehicles with an automatic transmission (am = 0), whose mean is 17.147 mpg. However, the low adjusted R-squared (0.3385) suggests that ‘am’ alone explains only a small part of the variation in ‘mpg’.
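With a single binary regressor, the intercept is the mean of the reference group and the slope is the difference between the two group means. The sketch below, added for illustration, recovers the same numbers directly from the group means and runs a two-sample t-test as a cross-check.

```r
# Mean mpg per transmission type; names are "0" (automatic) and "1" (manual)
means <- tapply(mtcars$mpg, mtcars$am, mean)
means

# Manual minus automatic: matches the factor(am)1 coefficient of fit_am
diff_mpg <- means[["1"]] - means[["0"]]
diff_mpg

# Welch two-sample t-test of the same difference (does not assume equal variances)
t.test(mpg ~ am, data = mtcars)
```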
The variables with the most impact on ‘mpg’ are selected with a stepwise algorithm that uses AIC and searches in both directions. At each iteration, the step function adds or drops the term that most lowers the AIC, and it stops once no single change improves the AIC further.
model<- step(lm(mpg~., data = mtcars), direction = "both", trace = 0) # trace = 0 suppresses the per-step log
model
##
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
##
## Coefficients:
## (Intercept) wt qsec am
## 9.618 -3.917 1.226 2.936
summary(lm(mpg~wt+qsec+am, data = mtcars))
##
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4811 -1.5555 -0.7257 1.4110 4.6610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.6178 6.9596 1.382 0.177915
## wt -3.9165 0.7112 -5.507 6.95e-06 ***
## qsec 1.2259 0.2887 4.247 0.000216 ***
## am 2.9358 1.4109 2.081 0.046716 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared: 0.8497, Adjusted R-squared: 0.8336
## F-statistic: 52.75 on 3 and 28 DF, p-value: 1.21e-11
From the stepwise selection, it becomes evident that, along with ‘am’, the variables ‘qsec’ and ‘wt’ have to be accounted for to obtain a robust model.
Now that we know which covariates ought to be fitted, the next step is to decide on the interactions between them. We prefer the interaction structure that produces the highest adjusted R-squared.
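The criterion that step minimises can be inspected directly. The following sketch, added for illustration, compares the AIC of the full model with that of the selected model using `extractAIC` (the function step relies on) and uses `drop1` to show what removing each remaining term would cost.

```r
full     <- lm(mpg ~ ., data = mtcars)
selected <- lm(mpg ~ wt + qsec + am, data = mtcars)

# extractAIC returns c(equivalent degrees of freedom, AIC)
extractAIC(full)
extractAIC(selected)

# AIC of the models obtained by dropping one term at a time from 'selected';
# if every row is worse than <none>, step has no reason to drop anything more
drop1(selected)
```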
m_1<- summary(lm(mpg ~ wt * qsec * factor(am), data = mtcars))
m_2<- summary(lm(mpg ~ wt + qsec * factor(am), data = mtcars))
m_3<- summary(lm(mpg ~ wt * qsec + factor(am), data = mtcars))
m_4<- summary(lm(mpg ~ qsec * wt + factor(am), data = mtcars))
m_5<- summary(lm(mpg ~ qsec + wt * factor(am), data = mtcars))
m_6<- summary(lm(mpg ~ qsec + wt + factor(am), data = mtcars))
models_radj<- data.frame((m_1)$adj.r.squared,(m_2)$adj.r.squared,(m_3)$adj.r.squared,(m_4)$adj.r.squared, (m_5)$adj.r.squared,(m_6)$adj.r.squared)
colnames(models_radj)<- c("m_1","m_2","m_3","m_4","m_5","m_6")
models_radj
## m_1 m_2 m_3 m_4 m_5 m_6
## 1 0.8759496 0.8531624 0.8347545 0.8347545 0.8804219 0.8335561
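Model 5 (‘m_5’) has the highest adjusted R-squared, hence we proceed with it. (Models ‘m_3’ and ‘m_4’ are the same model with the terms written in a different order, which is why their values are identical.)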
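Adjusted R-squared is one way to compare these models. As a complementary check that is not part of the original analysis, the additive model and the model with the wt-by-am interaction are nested, so an F-test via `anova` can indicate whether the interaction term is worth its extra parameter. The object names `additive` and `interact` are introduced here for illustration.

```r
additive <- lm(mpg ~ qsec + wt + factor(am), data = mtcars)   # same terms as m_6
interact <- lm(mpg ~ qsec + wt * factor(am), data = mtcars)   # same terms as m_5

# F-test of the nested pair: does adding wt:factor(am) reduce the
# residual sum of squares by more than chance would suggest?
cmp <- anova(additive, interact)
cmp

# p-value for the added interaction term (second row of the table)
p_val <- cmp[["Pr(>F)"]][2]
p_val
```

The fitted equation of the chosen model is: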
mpg = 9.7230526 + (1.0169736)qsec + (-2.9365309)wt + (14.0794278)am + (-4.1413764)wt*am
Hence,
when am = 0, the slope of wt is -2.937 and the intercept is 9.723.
when am = 1, the slope of wt is -7.078 and the intercept is 23.802.
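Note that the ‘am’ coefficient of 14.079 is the estimated difference between manual and automatic vehicles at wt = 0, with qsec held fixed. Because of the interaction, the difference at a given weight is 14.079 - 4.141*wt mpg, so manual transmission vehicles are estimated to travel further per gallon than automatic transmission vehicles for weights below about 3.4 (3,400 lbs), with the advantage shrinking as weight increases. For lighter vehicles, manual is therefore the better option.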
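Because the model contains a wt-by-am interaction, the gap between the two transmission types changes with weight, and the ‘am’ coefficient on its own is the gap at wt = 0. The sketch below, added for illustration, evaluates the estimated gap at a few weights and finds the weight at which it reaches zero. The helper `am_effect` and the chosen weights are hypothetical.

```r
fit <- lm(mpg ~ qsec + wt * factor(am), data = mtcars)
b <- coef(fit)

# Estimated manual-minus-automatic mpg at a given weight (in 1000 lbs),
# holding qsec fixed: beta_am + beta_(wt:am) * wt
am_effect <- function(wt) b[["factor(am)1"]] + b[["wt:factor(am)1"]] * wt

wts <- c(2, 3, 4)   # illustrative weights
data.frame(wt = wts, manual_minus_automatic = am_effect(wts))

# Weight at which the two fitted lines cross
break_even <- -b[["factor(am)1"]] / b[["wt:factor(am)1"]]
break_even
```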
The uncertainty in the slope and intercept estimates, expressed as 95% confidence intervals, is calculated below:
fit<- lm(mpg ~ qsec + wt * factor(am), data = mtcars)
confint(fit)
## 2.5 % 97.5 %
## (Intercept) -2.3807791 21.826884
## qsec 0.4998811 1.534066
## wt -4.3031019 -1.569960
## factor(am)1 7.0308746 21.127981
## wt:factor(am)1 -6.5970316 -1.685721
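For reference, these intervals are each estimate plus or minus a t quantile times its standard error. The sketch below, added for illustration, rebuilds them by hand from the coefficient table; `manual_ci` is a name introduced here.

```r
fit <- lm(mpg ~ qsec + wt * factor(am), data = mtcars)

coefs <- coef(summary(fit))        # estimates, standard errors, t and p values
est   <- coefs[, "Estimate"]
se    <- coefs[, "Std. Error"]

# Two-sided 95% interval: 0.975 quantile of t with the residual degrees of freedom
t_crit <- qt(0.975, df = df.residual(fit))

manual_ci <- cbind(lower = est - t_crit * se,
                   upper = est + t_crit * se)
manual_ci
```

An interval that excludes zero, such as the one for the interaction term, corresponds to a coefficient that is significant at the 5% level.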
Figure 2 in the Appendix displays the residual plots for the selected model, stored in the variable ‘fit’. No clear patterns could be recognised in the plots that would render the model flawed. That said, the Q-Q plot looks rather skewed, which does raise some concerns. To better understand the reason for this, the hatvalues and dfbetas of the model are calculated below.
hist(hatvalues(fit),xlab = "hatvalues", main = "Histogram of hatvalues for model 'fit'")
From the histogram, observations with hatvalues greater than about 0.3 appear to have high leverage; this matches the common 2p/n rule of thumb, which gives 2*5/32 = 0.3125 for 5 coefficients and 32 observations.
dfbetas(fit)[which(hatvalues(fit) > 2*5/32),]
## (Intercept) qsec wt factor(am)1
## Merc 230 0.25656732 -0.35363727 0.012205353 0.13319953
## Lincoln Continental 0.31582832 -0.13136203 -0.548982600 -0.32920707
## Lotus Europa 0.02054631 -0.02254647 -0.007125334 0.02922367
## Maserati Bora 0.06467465 -0.07097064 -0.022428766 -0.39813267
## wt:factor(am)1
## Merc 230 -0.15370459
## Lincoln Continental 0.25093655
## Lotus Europa -0.03400343
## Maserati Bora 0.53072741
The skewed nature of the Q-Q plot could be explained by the high-leverage points within the model, but since these points have low influence (every dfbetas value shown above is below 1 in absolute value), the model should not be too affected and can be used for prediction and inference purposes.
Figure 1 - Exploratory Analysis with mpg as the output
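The leverage and influence checks can be gathered in one place. The following sketch is an illustration rather than the original code: it derives the 2p/n cutoff from the model instead of hard-coding it, lists the observations above it, and adds Cook's distance as a further influence measure that the original analysis does not use.

```r
fit <- lm(mpg ~ qsec + wt * factor(am), data = mtcars)

h <- hatvalues(fit)
p <- length(coef(fit))   # number of estimated coefficients
n <- nobs(fit)           # number of observations

# Rule-of-thumb leverage cutoff: twice the average hat value (sum(h) equals p)
cutoff   <- 2 * p / n
high_lev <- names(h)[h > cutoff]
high_lev

# Influence of the high-leverage points
round(cooks.distance(fit)[high_lev], 3)   # Cook's distance (added here)
max(abs(dfbetas(fit)[high_lev, ]))        # largest standardised coefficient shift
```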
Figure 2 - Residual Plots
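The code that produced Figure 2 is not shown in this report. A minimal sketch of how such residual plots are commonly drawn in base R is given below, assuming the standard four diagnostic panels of `plot` for an lm object. The Shapiro-Wilk test is an extra, optional check on the normality of the residuals and is not part of the original analysis.

```r
fit <- lm(mpg ~ qsec + wt * factor(am), data = mtcars)

# Four standard diagnostic panels: residuals vs fitted, normal Q-Q,
# scale-location, and residuals vs leverage
op <- par(mfrow = c(2, 2))
plot(fit)
par(op)   # restore the previous graphics settings

# Optional numerical companion to the Q-Q plot
shapiro.test(residuals(fit))
```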