Executive Summary

This report examines the relationship between a set of variables and miles per gallon (MPG), using the mtcars data set, a collection of cars. It attempts to answer the question of interest: “Is an automatic or manual transmission better for MPG?”

The model fits MPG (the outcome) with weight, transmission type and quarter-mile time as regressors. Model selection was done by backward elimination until all remaining variables were statistically significant. The final model was then compared with simpler nested models using the anova function.

Under the selected model, manual transmission comes out better than automatic: about 2.9 MPG higher, holding weight and quarter-mile time fixed. A 95% confidence interval for the MPG difference between “Manual” and “Automatic” transmission was calculated (roughly 0.05 to 5.83). The residual and diagnostic plots suggest that the chosen model fits the data reasonably well.

Model Selection

To select a set of variables, I first looked at the correlations between the variables using the command round(cor(mtcars), 2).
The output shows that wt is highly correlated with cyl (0.78), disp (0.89) and hp (0.66).
To limit multicollinearity, these three variables are not included in the list of regressors. The following code displays the correlations.

data(mtcars)
round(cor(mtcars),2)
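
To focus on the three correlations quoted above, a minimal sketch can pull out only the wt row of the correlation matrix. The variable name wt_cor is my own and is not part of the original analysis.

```r
data(mtcars)

# Correlation of wt with the three variables left out of the regressor list
wt_cor <- round(cor(mtcars)["wt", c("cyl", "disp", "hp")], 2)
wt_cor
```

The result is a named vector with one entry each for cyl, disp and hp.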

As shown below, I fitted a model to predict mpg with drat, wt, qsec, vs, am, gear and carb as regressors.
I then removed the variable with the highest p-value (vs in this case), since a high p-value indicates that its contribution is not statistically significant.
I refitted the model without that variable and repeated these two steps until all remaining variables were statistically significant, i.e., the p-values of their coefficients were below 0.05, so that the null hypothesis (the variable does not affect MPG) can be rejected.
These variables were wt, qsec and am.

# Starting model: all regressors except cyl, disp and hp
fitr <- lm(mpg ~ drat + wt + qsec + vs + am + gear + carb, data = mtcars)
summary(fitr) # look at the p-values to judge which variable to eliminate
fitr <- update(fitr, . ~ . - vs)   # drop vs, the least significant term
summary(fitr)
fitr <- update(fitr, . ~ . - gear) # then gear
summary(fitr)
fitr <- update(fitr, . ~ . - drat) # then drat
summary(fitr)
fitr <- update(fitr, . ~ . - carb) # then carb, leaving wt, qsec and am
summary(fitr)
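
The manual steps above can also be written as a loop. The following is a minimal sketch, assuming the same rule (drop the term with the largest p-value until every remaining p-value is below 0.05). The names fit_bw, pvals and worst are my own. If the manual steps followed that rule at each stage, the loop should arrive at the same regressors.

```r
data(mtcars)

# Start from the same seven-regressor model
fit_bw <- lm(mpg ~ drat + wt + qsec + vs + am + gear + carb, data = mtcars)

repeat {
  # p-values of the regressors (intercept row removed), kept as a matrix
  pvals <- coef(summary(fit_bw))[-1, "Pr(>|t|)", drop = FALSE]
  if (nrow(pvals) == 0 || max(pvals) < 0.05) break
  # Name of the least significant regressor
  worst <- rownames(pvals)[which.max(pvals)]
  # Refit without it
  fit_bw <- update(fit_bw, as.formula(paste(". ~ . -", worst)))
}

formula(fit_bw)
```

This works here because every regressor is numeric, so coefficient names match term names. With factor regressors the names would differ and the sketch would need adjusting.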

Analysis of Coefficients

summary(fitr)$coefficients

##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)  9.617781  6.9595930  1.381946 1.779152e-01
## wt          -3.916504  0.7112016 -5.506882 6.952711e-06
## qsec         1.225886  0.2886696  4.246676 2.161737e-04
## am           2.935837  1.4109045  2.080819 4.671551e-02

It can be seen above that mpg is negatively related to wt and positively related to qsec and am. The fitted expression for mpg can be written as follows: \(mpg = 9.6 - 3.9 \cdot wt + 1.2 \cdot qsec + 2.9 \cdot am\)
This means that mpg decreases by about 3.9 when wt increases by one unit (1,000 lbs, the unit of wt in mtcars), keeping the other variables constant.
Similarly, mpg is about 2.9 higher with manual transmission (am = 1) than with automatic transmission (am = 0), at the same wt and qsec.
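
As an illustration of how the am coefficient is read, the sketch below predicts mpg for two hypothetical cars that differ only in transmission. The values wt = 3 and qsec = 18 are made up for the example. The difference between the two predictions equals the am coefficient.

```r
data(mtcars)
fitr <- lm(mpg ~ wt + qsec + am, data = mtcars)

# Two hypothetical cars: same weight and quarter-mile time,
# automatic (am = 0) versus manual (am = 1)
newcars <- data.frame(wt = c(3, 3), qsec = c(18, 18), am = c(0, 1))
pred <- predict(fitr, newdata = newcars)

pred         # predicted mpg for each car
diff(pred)   # manual minus automatic
coef(fitr)["am"]
```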

Comparing with other models

Let’s fit two more models, fit1 and fit2, as shown below. Variables are added going from fit1 to fit2 (am added) to fitr (qsec added). The anova function tests whether each added term significantly improves the fit.

#fit  <- lm(mpg ~ ., data = mtcars)
fit1 <- lm(mpg ~ wt, data = mtcars)
fit2 <- lm(mpg ~ wt + factor(am), data = mtcars)
#fit3 <- lm(mpg ~ wt * factor(am), data = mtcars)
anova(fit1, fit2, fitr) # sequential F-tests for the nested models

## Analysis of Variance Table
## 
## Model 1: mpg ~ wt
## Model 2: mpg ~ wt + factor(am)
## Model 3: mpg ~ wt + qsec + am
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
## 1     30 278.32                                   
## 2     29 278.32  1     0.002  0.0004 0.9847784    
## 3     28 169.29  1   109.034 18.0343 0.0002162 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

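The table shows that adding am to a model containing only wt does not significantly reduce the residual sum of squares (p ≈ 0.98), whereas adding qsec on top of wt and am does (p ≈ 0.0002). Although am adds little on its own once wt is accounted for, its coefficient is significant at the 5% level in fitr (p ≈ 0.047), where qsec is also included. Therefore, I have selected model fitr for the prediction of mpg.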

Is Automatic or Manual Transmission Better?

As noted earlier, in the fitr model mpg increases by about 2.9 when manual transmission is used in place of automatic transmission, keeping the other variables fixed.

The 95% confidence interval for the difference in MPG is estimated as follows:

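cof <- coef(summary(fitr))
be <- cof["am", "Estimate"]    # estimated Manual - Automatic difference
se <- cof["am", "Std. Error"]  # its standard error
q <- qt(p = .975, df = df.residual(fitr)) # t quantile on the residual df
be + c(-1, 1) * q * se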

## [1] 0.04573031 5.82594408

The p-value of the am coefficient in the fitr model is below 0.05 (0.0467) and the 95% confidence interval does not include 0, so the MPG difference between manual and automatic transmission is statistically significant at the 5% level, although only marginally. The interval for this difference (Manual - Automatic) runs from about 0.05 to 5.83 MPG; it comes from a procedure that would cover the true difference in 95% of repeated samples.
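
The interval computed by hand can be cross-checked with the built-in confint function. This is a minimal sketch that refits the model so it runs on its own; the name ci_am is my own.

```r
data(mtcars)
fitr <- lm(mpg ~ wt + qsec + am, data = mtcars)

# 95% confidence interval for the am coefficient only
ci_am <- confint(fitr, parm = "am", level = 0.95)
ci_am
```

The two bounds should agree with the manually computed values 0.04573031 and 5.82594408.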

Residual Plot and Diagnostic

Residual

In the residuals vs fitted plot, the points look fairly evenly scattered around the horizontal axis. There is no systematic pattern in the residuals that would suggest heteroskedasticity or non-linearity.
Chrysler Imperial, Fiat 128 and Toyota Corolla are a few of the cars with large residuals.
In the Normal Q-Q plot, most of the points lie close to the reference line, suggesting that the residuals are approximately normally distributed.
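
The two plots discussed here can be produced with the plot method for lm objects. This is a sketch of one way to draw them, not necessarily the code used for the original figures; the name largest is my own.

```r
data(mtcars)
fitr <- lm(mpg ~ wt + qsec + am, data = mtcars)

op <- par(mfrow = c(1, 2))  # two panels side by side
plot(fitr, which = 1)       # residuals vs fitted values
plot(fitr, which = 2)       # normal Q-Q plot of standardized residuals
par(op)                     # restore the previous graphics settings

# The three cars with the largest absolute residuals
largest <- tail(sort(abs(resid(fitr))), 3)
largest
```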

Diagnostic

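Below, I have investigated the top 3 points by leverage (hat values) and by influence on the am coefficient (dfbetas). The output shows that Chrysler Imperial, Lincoln Continental and Merc 230 have the highest leverage, while Toyota Corona, Fiat 128 and Chrysler Imperial have the largest influence on the am coefficient. Fiat 128 and Chrysler Imperial were also flagged for large residuals earlier.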

tail(sort(hatvalues(fitr)),3) #Leverage

##   Chrysler Imperial Lincoln Continental            Merc 230 
##           0.2296338           0.2642151           0.2970422

tail(sort(dfbetas(fitr)[,4]),3) #Influential Point

##     Toyota Corona          Fiat 128 Chrysler Imperial 
##         0.4050410         0.4765680         0.5626418
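
To put the leverage values in context, one common rule of thumb (my addition, not part of the original analysis) flags hat values above 2p/n, where p is the number of coefficients and n the number of observations. A minimal sketch, with variable names of my own choosing:

```r
data(mtcars)
fitr <- lm(mpg ~ wt + qsec + am, data = mtcars)

h <- hatvalues(fitr)
p <- length(coef(fitr))  # 4 coefficients, intercept included
n <- nrow(mtcars)        # 32 cars
cutoff <- 2 * p / n      # rule-of-thumb threshold: 2 * 4 / 32 = 0.25

# Cars whose leverage exceeds the threshold
high_lev <- names(h)[h > cutoff]
high_lev
```

Given the hat values listed above, only Lincoln Continental (0.264) and Merc 230 (0.297) exceed 0.25, while Chrysler Imperial (0.230) falls just below it.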

Appendix - Exploratory Data Analyses

Scatter Plot

The following graph shows the scatter plot of mpg against wt. The colour of each point indicates the type of transmission: light blue for automatic (am = 0) and salmon for manual (am = 1).

From the graph, one can infer that when the weight (wt) of the car is below about 3 (that is, 3,000 lbs, since wt is measured in units of 1,000 lbs), manual transmission is better for mpg, whereas for cars heavier than that, automatic transmission seems to be better for mpg.
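
A scatter plot of this kind can be drawn in base R as sketched below. The per-group regression lines and the crossover calculation are my own additions to illustrate the weight at which the two groups swap order; they are not taken from the original figure.

```r
data(mtcars)
cols <- c("lightblue", "salmon")  # automatic, manual

plot(mtcars$wt, mtcars$mpg, pch = 19, col = cols[mtcars$am + 1],
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
legend("topright", legend = c("automatic", "manual"), col = cols, pch = 19)

# Separate simple regressions of mpg on wt for each transmission type
fit_a <- lm(mpg ~ wt, data = mtcars, subset = am == 0)
fit_m <- lm(mpg ~ wt, data = mtcars, subset = am == 1)
abline(fit_a, col = cols[1])
abline(fit_m, col = cols[2])

# Weight at which the two fitted lines cross
cross <- unname((coef(fit_m)[1] - coef(fit_a)[1]) /
                (coef(fit_a)[2] - coef(fit_m)[2]))
cross
```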

Box Plot

It is evident from the box plot that MPG is typically higher (higher median) for the “manual” transmission type than for the “automatic” type.
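
A box plot like the one described, together with the group medians and means behind it, can be obtained as follows. This is a sketch; the names trans, med and avg are my own.

```r
data(mtcars)

# Label the transmission types for readable output
trans <- factor(mtcars$am, levels = c(0, 1),
                labels = c("automatic", "manual"))

boxplot(mtcars$mpg ~ trans, xlab = "Transmission", ylab = "Miles per gallon")

med <- tapply(mtcars$mpg, trans, median)  # the line inside each box
avg <- tapply(mtcars$mpg, trans, mean)    # group means, for comparison
med
avg
```

The box plot itself displays medians and quartiles, so the means are printed separately.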

Final Project

Regression Models Class