The following report examines the relationship between a set of variables and miles per gallon (MPG), using the mtcars data set of a collection of cars. It attempts to answer the question of interest: “Is an automatic or manual transmission better for MPG?”
The model fits MPG (outcome) with weight, transmission type and quarter mile time. Model selection was done by backward elimination of the variables until all remaining variables were statistically significant. The final model was then compared with smaller nested models by using the anova function.
Under the selected model, manual transmission gave better MPG than automatic transmission, holding weight and quarter mile time fixed. A 95% confidence interval was calculated for the difference in MPG between “Manual” and “Automatic” transmission. The residual and diagnostic plots suggest that the chosen model fits the data reasonably well.
data(mtcars)
round(cor(mtcars),2)
As shown below, I fitted a model to predict mpg with drat, wt, qsec, vs, am, gear and carb as regressors.
I removed the variable with the highest p-value (vs in this case), since a high p-value indicates that its coefficient is not statistically significant.
I refitted the model without that variable. I repeated these two steps until every remaining variable was statistically significant, i.e., the p-value of its coefficient was below 0.05, so that the null hypothesis (the variable doesn’t affect MPG) can be rejected.
These variables were wt, qsec and am.
fitr <- lm(mpg ~ drat + wt + qsec + vs + am + gear + carb, data = mtcars)
summary(fitr) # looking at the p-value to judge which variable to eliminate
fitr <- update(fitr, .~. - vs)
summary(fitr)
fitr <- update(fitr,.~. - gear)
summary(fitr)
fitr <- update(fitr, .~. - drat)
summary(fitr)
fitr <- update(fitr, .~. - carb)
summary(fitr)
summary(fitr)$coefficients
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)  9.617781  6.9595930  1.381946 1.779152e-01
## wt          -3.916504  0.7112016 -5.506882 6.952711e-06
## qsec         1.225886  0.2886696  4.246676 2.161737e-04
## am           2.935837  1.4109045  2.080819 4.671551e-02
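The manual elimination steps above can also be written as a loop. The following is only a minimal sketch: `backward_eliminate`, `full_fit` and `auto_fit` are hypothetical names that do not appear in the original analysis, and the 0.05 cut-off is the one stated in the text. It assumes all regressors are numeric, so that coefficient names match term names.

```r
data(mtcars)

# Hypothetical helper (not part of the original analysis): repeatedly drop the
# regressor with the largest p-value until all remaining ones are below alpha.
# Assumes numeric regressors only, so coefficient names equal term names.
backward_eliminate <- function(fit, alpha = 0.05) {
  repeat {
    tab <- coef(summary(fit))
    pvals <- setNames(tab[, "Pr(>|t|)"], rownames(tab))
    pvals <- pvals[names(pvals) != "(Intercept)"]  # never drop the intercept
    if (length(pvals) == 0 || max(pvals) < alpha) break
    worst <- names(which.max(pvals))
    fit <- update(fit, as.formula(paste(". ~ . -", worst)))
  }
  fit
}

full_fit <- lm(mpg ~ drat + wt + qsec + vs + am + gear + carb, data = mtcars)
auto_fit <- backward_eliminate(full_fit)
formula(auto_fit)  # formula of the model that survives the elimination
```

If the manual steps followed the same highest-p-value rule, the surviving formula should match the one used for fitr.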
mpg is negatively related to wt and positively related to qsec and am. Thus the expression for mpg can be written as follows: \(mpg = 9.6 - 3.9 \times wt + 1.2 \times qsec + 2.9 \times am\)
mpg decreases by 3.9 when wt increases by 1 unit (1000 lbs, the unit in which wt is recorded in mtcars), keeping other variables constant.
mpg increases by 2.9 when manual transmission is used in place of automatic transmission.
Let’s fit two more models, fit1 and fit2, as shown below. Variables are added going from fit1 to fit2 (added am) to fitr (added qsec). The anova function tests whether including these additional terms is significant.
#fit <- lm(mpg ~ ., data = mtcars)
fit1 <- lm(mpg ~ wt, data = mtcars)
fit2 <- lm(mpg ~ wt + factor(am),data=mtcars)
#fit3 <- lm(mpg ~ wt*factor(am),data=mtcars)
anova(fit1,fit2,fitr)
## Analysis of Variance Table
##
## Model 1: mpg ~ wt
## Model 2: mpg ~ wt + factor(am)
## Model 3: mpg ~ wt + qsec + am
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)
## 1     30 278.32
## 2     29 278.32  1     0.002  0.0004 0.9847784
## 3     28 169.29  1   109.034 18.0343 0.0002162 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The result above shows that adding am to a model containing only wt does not significantly improve the fit (fit2, p = 0.98), whereas adding qsec does (fitr, p = 0.0002). Once qsec is in the model, the coefficient of am is also significant at the 5% level (p = 0.047 in the coefficient table of fitr). Therefore, I have selected model fitr for the prediction of mpg.
As pointed out earlier, in the fitr model mpg increases by 2.9 when manual transmission is used in place of automatic transmission, keeping other variables fixed.
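To make "keeping other variables fixed" concrete, the sketch below predicts mpg for two hypothetical cars that differ only in transmission type. The values wt = 3 and qsec = 18 are arbitrary illustration values, not taken from the report, and the names `cars`, `pred` and `diff_mpg` are introduced here for illustration.

```r
data(mtcars)
fitr <- lm(mpg ~ wt + qsec + am, data = mtcars)

# Two hypothetical cars that differ only in transmission type.
# wt = 3 (i.e. 3000 lbs) and qsec = 18 are arbitrary illustration values.
cars <- data.frame(wt = c(3, 3), qsec = c(18, 18), am = c(0, 1))
pred <- predict(fitr, newdata = cars)

# Manual minus automatic, at fixed wt and qsec
diff_mpg <- unname(pred[2] - pred[1])
diff_mpg
coef(fitr)["am"]  # the same quantity, read directly from the model
```

Because the model has no interaction terms, the predicted difference is the same for any choice of wt and qsec.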
The 95% confidence interval for this difference in MPG is estimated as follows:
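cof <- coef(summary(fitr))
be <- cof["am", "Estimate"]    # estimated Manual - Automatic difference
se <- cof["am", "Std. Error"]  # its standard error
q <- qt(p = .975, df = df.residual(fitr))  # t quantile on the residual degrees of freedom
be + c(-1,1)*q*se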
## [1] 0.04573031 5.82594408
The p-value of the coefficient for am in the fitr model is 0.0467, just below 0.05, and the 95% confidence interval does not include 0 (although its lower bound is close to it). Thus the mpg difference between manual and automatic transmission is statistically significant at the 5% level. With 95% confidence, this MPG difference (Manual - Automatic) lies between 0.05 and 5.83, holding wt and qsec fixed.
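The interval computed by hand above can be cross-checked against R's built-in `confint` function. This is a small sketch rather than part of the original analysis; `manual_ci` and `builtin_ci` are names introduced here.

```r
data(mtcars)
fitr <- lm(mpg ~ wt + qsec + am, data = mtcars)

# Interval from the estimate, standard error and t quantile
cof <- coef(summary(fitr))
manual_ci <- cof["am", "Estimate"] +
  c(-1, 1) * qt(0.975, df.residual(fitr)) * cof["am", "Std. Error"]

# Same interval from the built-in helper
builtin_ci <- confint(fitr, parm = "am", level = 0.95)

manual_ci
builtin_ci
```

Both approaches use the t distribution on the residual degrees of freedom, so the two intervals should agree.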
In the residuals vs fitted plot, the points look fairly evenly scattered around the horizontal axis. There is no systematic pattern in the residuals that would suggest heteroskedasticity or non-linearity.
Chrysler Imperial, Fiat 128 and Toyota Corolla are among the cars with large residuals.
In the Normal Q-Q plot, most of the points lie close to the reference line, suggesting that the residuals are approximately normally distributed.
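The two diagnostic plots discussed above can be regenerated with base R's `plot` method for `lm` objects. The sketch below is one way to do so and may differ from the plotting code used in the original report. It also lists the cars with the largest absolute residuals; `top_resid` is a name introduced here.

```r
data(mtcars)
fitr <- lm(mpg ~ wt + qsec + am, data = mtcars)

# Residuals vs Fitted (which = 1) and Normal Q-Q (which = 2), side by side
par(mfrow = c(1, 2))
plot(fitr, which = 1:2)

# Three cars with the largest absolute residuals
top_resid <- tail(sort(abs(resid(fitr))), 3)
top_resid
```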
Below, I have investigated the top 3 points by leverage and by influence on the am coefficient. From the output, it can be seen that Toyota Corona, Fiat 128 and Chrysler Imperial are the most influential points; Fiat 128 and Chrysler Imperial also had large residuals, as pointed out earlier.
tail(sort(hatvalues(fitr)),3) #Leverage
##   Chrysler Imperial Lincoln Continental            Merc 230
##           0.2296338           0.2642151           0.2970422
tail(sort(dfbetas(fitr)[,4]),3) #Influential Point
##     Toyota Corona          Fiat 128 Chrysler Imperial
##         0.4050410         0.4765680         0.5626418
The following graph shows the scatter plot of mpg against wt. The colour of each point indicates the type of transmission: light blue is for automatic transmission (am = 0) and salmon is for manual transmission (am = 1).
Looking at the graph, one can infer that when the weight (wt) of the car is below about 3 (i.e., 3000 lbs, since wt is recorded in units of 1000 lbs), manual transmission is better for mpg, whereas when the weight is above 3, automatic transmission seems to be better for mpg.
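The crossover described above can be estimated with an interaction model, which gives each transmission type its own intercept and wt slope. This is a sketch under assumptions: it uses the wt * factor(am) formula that appears commented out as fit3 earlier, the name `crossover` is introduced here, and the base R colour names "lightblue" and "salmon" are only a guess at the colours of the original plot.

```r
data(mtcars)

# Interaction model: separate intercept and wt slope per transmission type
fit3 <- lm(mpg ~ wt * factor(am), data = mtcars)
b <- coef(fit3)

# The two fitted lines cross where the am offset and slope difference cancel:
#   b["factor(am)1"] + b["wt:factor(am)1"] * wt = 0
crossover <- unname(-b["factor(am)1"] / b["wt:factor(am)1"])
crossover  # in units of 1000 lbs

# Scatter plot with one fitted line per transmission type
plot(mtcars$wt, mtcars$mpg, pch = 19, xlab = "wt (1000 lbs)", ylab = "mpg",
     col = c("lightblue", "salmon")[mtcars$am + 1])
abline(b["(Intercept)"], b["wt"], col = "lightblue")
abline(b["(Intercept)"] + b["factor(am)1"],
       b["wt"] + b["wt:factor(am)1"], col = "salmon")
```

Below the crossover weight the manual line lies above the automatic line, and above it the order reverses.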
It is evident from the box plot that the median MPG is higher for “manual” transmission type than for “automatic” transmission type.
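The comparison in the box plot can be backed up with group summaries. The sketch below is an illustration rather than the original plotting code; `trans`, `mpg_mean` and `mpg_median` are names introduced here.

```r
data(mtcars)

# Label the transmission types (am = 0 automatic, am = 1 manual)
trans <- factor(mtcars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

boxplot(mtcars$mpg ~ trans, xlab = "Transmission", ylab = "MPG")

# Mean and median MPG per transmission type
mpg_mean <- tapply(mtcars$mpg, trans, mean)
mpg_median <- tapply(mtcars$mpg, trans, median)
rbind(mean = mpg_mean, median = mpg_median)
```

This raw comparison does not adjust for weight or quarter mile time, which is why the report relies on the regression model for its main conclusion.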