In this project, using the mtcars dataset, I will try to answer the main question of how an automatic transmission compares to a manual transmission in terms of fuel consumption (mpg).
In order to do that, linear regression and multivariable linear regression have been used to select a model.
The data, included in R's datasets package, is called mtcars, and its documentation says:
The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).
First, a quick look at the data: its dimensions and what the observations look like (types included):
data("mtcars")
dim(mtcars)
## [1] 32 11
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
So we have 32 observations of 11 variables. From the same documentation, I have extracted the following variable descriptions:
| Variables | Description |
|---|---|
| mpg | Miles/(US) gallon |
| cyl | Number of cylinders |
| disp | Displacement (cu.in.) |
| hp | Gross horsepower |
| drat | Rear axle ratio |
| wt | Weight (1000 lbs) |
| qsec | 1/4 mile time |
| vs | Engine (0 = V-shaped, 1 = straight) |
| am | Transmission (0 = automatic, 1 = manual) |
| gear | Number of forward gears |
| carb | Number of carburetors |
Some of the variables are discrete (cyl, vs, am, gear and carb), so I convert them to factors:
mtcars$vs <- factor(mtcars$vs,labels=c('V','S'))
mtcars$am <- factor(mtcars$am,labels=c('A','M'))
mtcars$cyl <- factor(mtcars$cyl)
mtcars$gear <- factor(mtcars$gear)
mtcars$carb <- factor(mtcars$carb)
Let's see how cyl, vs, gear and carb are distributed across am (which, along with mpg, is the variable of main interest):
par(mfrow=c(2,2))
display_barplot <- function(table, title, xlab) {
barplot(table, main=title, xlab=xlab, col=c("darkblue","red"), beside=TRUE)
legend("topright", title = "AM", rownames(table), fill = 1:2, ncol = 1, cex = 0.5)
}
display_barplot(table(mtcars$am, mtcars$cyl) ,"AM vs. Cylinders", "Number of Cylinders")
display_barplot(table(mtcars$am, mtcars$vs) ,"AM vs. VS", "V/Straight Engine")
display_barplot(table(mtcars$am, mtcars$gear) ,"AM vs. Gears", "Number of Gears")
display_barplot(table(mtcars$am, mtcars$carb) ,"AM vs. Carb", "Number of Carburetors")
Automatic cars appear to lack a fifth gear. Similarly, the number of carburetors looks more limited for automatic cars.
Now, for better insight into the data, let's look at two scatterplot matrices of mpg, hp and wt, separated by am and by cyl:
library(car)
scatterplotMatrix(~mpg+hp+wt|am, data=mtcars, main="MPG - HP - WT vs AM")
In the first panel we can see higher values of mpg for the manual transmission. This trend is also visible in the plots of mpg against hp and wt.
scatterplotMatrix(~mpg+hp+wt|cyl, data=mtcars, main="MPG - HP - WT vs CYL")
Here we have three main categories of cars based on the number of cylinders: 4, 6 and 8. It is easy to see that mpg drops as the number of cylinders increases, reflecting higher fuel consumption. Likewise, higher values of hp and wt correspond to lower mpg.
Appendix A displays the correlations across the whole set of variables in the dataset. Looking at the values for mpg, some of those correlations are negative. However, these correlations are computed for single predictors taken in isolation; with multiple regressors in the model, their impact on the outcome might change. The correlations also highlight another main point: mpg is strongly correlated with am, but also with cyl, disp, hp, drat, wt and vs, and what makes things even more complicated is the mutual correlation among these variables. Variance Inflation Factors (VIF) help to quantify how strongly each regressor is related to the others. For example, this is how the value for disp changes with and without wt in the model:
sort(sqrt(vif(lm(mpg~., mtcars))), decreasing = TRUE)
## [1] 22.432384 11.319053 7.769536 7.131081 5.312210 4.881683 3.284842
## [8] 3.151269 2.843970 2.787389 2.609533 2.304823 2.236068 2.209453
## [15] 1.834225 1.812413 1.775181 1.686407 1.634138 1.615405 1.414214
## [22] 1.414214 1.364858 1.000000 1.000000 1.000000 1.000000 1.000000
## [29] 1.000000 1.000000
sort(sqrt(vif(lm(mpg~. -wt, mtcars))), decreasing = TRUE)
## [1] 16.056162 10.255633 6.811936 5.272347 5.038395 3.176320 3.060302
## [8] 2.835853 2.498479 2.296159 2.244637 2.236068 1.789537 1.782223
## [15] 1.749372 1.683999 1.615540 1.580658 1.414214 1.414214 1.319970
## [22] 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
disp has the highest value, but it drops after removing wt, which suggests a strong relationship between the two.
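Since sorting the square root of the whole VIF matrix discards the term names, a small sketch that keeps them could look like this (assuming car::vif returns the usual GVIF matrix, with a "GVIF^(1/(2*Df))" column, for models containing factor terms):
# Sketch: named, Df-adjusted VIF values (assumes the GVIF matrix layout of car::vif)
gvif <- vif(lm(mpg ~ ., data = mtcars))
sort(gvif[, "GVIF^(1/(2*Df))"], decreasing = TRUE)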
To select the model, I will use mpg ~ am as the base model for the outcome mpg and incrementally add new regressors, ordered by their absolute correlation with mpg as shown in Appendix A, thus creating a set of nested models. With the anova function I will then select the best model based on the p-values:
fit1 <- lm(mpg ~ am , mtcars)
fit2 <- update(fit1, mpg ~ am + wt)
fit3 <- update(fit1, mpg ~ am + wt + cyl)
fit4 <- update(fit1, mpg ~ am + wt + cyl + disp)
fit5 <- update(fit1, mpg ~ am + wt + cyl + disp + hp)
fit6 <- update(fit1, mpg ~ am + wt + cyl + disp + hp + drat)
fit7 <- update(fit1, mpg ~ am + wt + cyl + disp + hp + drat + carb)
fit8 <- update(fit1, mpg ~ am + wt + cyl + disp + hp + drat + carb + gear)
fit9 <- update(fit1, mpg ~ am + wt + cyl + disp + hp + drat + carb + gear + qsec)
fit10 <- update(fit1, mpg ~ am + wt + cyl + disp + hp + drat + carb + gear + qsec + vs)
anova(fit1, fit2, fit3, fit4, fit5, fit6, fit7, fit8, fit9, fit10)
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ am + wt
## Model 3: mpg ~ am + wt + cyl
## Model 4: mpg ~ am + wt + cyl + disp
## Model 5: mpg ~ am + wt + cyl + disp + hp
## Model 6: mpg ~ am + wt + cyl + disp + hp + drat
## Model 7: mpg ~ am + wt + cyl + disp + hp + drat + carb
## Model 8: mpg ~ am + wt + cyl + disp + hp + drat + carb + gear
## Model 9: mpg ~ am + wt + cyl + disp + hp + drat + carb + gear + qsec
## Model 10: mpg ~ am + wt + cyl + disp + hp + drat + carb + gear + qsec +
## vs
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 29 278.32 1 442.58 55.1371 2.129e-06 ***
## 3 27 182.97 2 95.35 5.9395 0.01259 *
## 4 26 182.87 1 0.10 0.0123 0.91304
## 5 25 150.41 1 32.46 4.0440 0.06266 .
## 6 24 150.10 1 0.31 0.0384 0.84726
## 7 19 132.74 5 17.36 0.4326 0.81894
## 8 17 125.78 2 6.96 0.4335 0.65609
## 9 16 124.03 1 1.75 0.2176 0.64756
## 10 15 120.40 1 3.63 0.4522 0.51151
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The model based on am + wt looks the best. The model am + wt + cyl is still significant, although its p-value is higher than that of am + wt. Let us see if it can be improved by including the interaction wt * cyl:
fit3Interaction <- update(fit1, mpg ~ am + wt + cyl + wt * cyl)
anova(fit1, fit2, fit3, fit3Interaction)
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ am + wt
## Model 3: mpg ~ am + wt + cyl
## Model 4: mpg ~ am + wt + cyl + wt:cyl
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 29 278.32 1 442.58 71.9823 7.916e-09 ***
## 3 27 182.97 2 95.35 7.7541 0.002399 **
## 4 25 153.71 2 29.26 2.3793 0.113263
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
There is no real advantage with the interaction, so I will stick with the model mpg ~ am + wt.
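As a cross-check on the nested-models comparison (not part of the selection procedure above), an AIC-based stepwise search starting from the full model could be sketched as follows; it simply reports which model the automated search prefers:
# Cross-check sketch: AIC-based stepwise selection from the full model (stats::step)
step_fit <- step(lm(mpg ~ ., data = mtcars), trace = 0)
formula(step_fit)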
Comparing the coefficients of the models mpg ~ am and mpg ~ am + wt:
summary(fit1)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147368 1.124603 15.247492 1.133983e-15
## amM 7.244939 1.764422 4.106127 2.850207e-04
summary(fit2)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.32155131 3.0546385 12.21799285 5.843477e-13
## amM -0.02361522 1.5456453 -0.01527855 9.879146e-01
## wt -5.35281145 0.7882438 -6.79080719 1.867415e-07
It is possible to see that in the first model the manual transmission looks better than the automatic one, with an estimated increase of +7.245 mpg. In the second model the two transmissions are practically the same. This means that the wt regressor strongly confounds the effect of the transmission, and we cannot easily state that one transmission is better than the other.
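To make this more concrete, the confidence intervals for the manual-transmission coefficient (named amM because of the factor labels used above) in the two models can be compared; a minimal sketch:
# Sketch: 95% confidence intervals for the amM coefficient
confint(fit1)["amM", ] # mpg ~ am: interval well above zero
confint(fit2)["amM", ] # mpg ~ am + wt: interval containing zero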
Appendix B displays the residual plots for the models fit2 and fit10. In both cases the models seem to fit the data reasonably well. However, a few observations are highlighted: the Normal Q-Q plot clearly shows that they deviate from the fitted linear model (potentially they have some leverage), while the Residuals vs Fitted plot otherwise covers the distribution of the data well.
The suspicious observations correspond to the cars Chrysler Imperial, Toyota Corolla and Fiat 128. Let's assess their influence using hatvalues to quantify their leverage, and dfbetas to see how much their inclusion affects the coefficients:
require(ggplot2)
require(reshape2)
model_under_investigation <- lm(mpg ~ am + wt - 1, mtcars)
htv <- melt(as.matrix(round(hatvalues(model_under_investigation), 3)))
ggplot(data=htv, aes(x=Var1, y=value, group=Var2)) + geom_line() + theme(axis.text.x = element_text(angle=90, hjust=1, vjust=0.5)) + ylab('hatvalues') + xlab('cars') + ggtitle("Hatvalues: mpg ~ am + wt")
dfb <- melt(round(dfbetas(model_under_investigation), 3))
ggplot(data=dfb, aes(x=Var1, y=value, group=Var2)) + geom_line(aes(colour = Var2))+ theme(axis.text.x = element_text(angle=90, hjust=1, vjust=0.5)) + ylab('dfbetas') + xlab('cars') + ggtitle("Dfbetas: mpg ~ am + wt")
The hatvalue peaks are not negligible, but they are not large enough to meaningfully affect the model. In the dfbetas plot the three cars show high peaks (both positive and negative). For the Chrysler Imperial the peaks for both am coefficients are negative while the one for wt is positive, which would explain why the Chrysler Imperial appears in the upper-left part of the Residuals vs Fitted plot. We can conclude that these observations have some influence, but they do not substantially affect the model.
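One way to double-check this conclusion (a sketch, not part of the original analysis) is to refit mpg ~ am + wt without the three cars and compare the coefficients:
# Sketch: refit without the three potentially influential cars and compare coefficients
influential <- c("Chrysler Imperial", "Toyota Corolla", "Fiat 128")
fit2_reduced <- update(fit2, data = mtcars[!rownames(mtcars) %in% influential, ])
rbind(full = coef(fit2), reduced = coef(fit2_reduced))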
It is also interesting to compare the models mpg ~ am, mpg ~ am + wt and mpg ~ . graphically:
plot(predict(fit2))
lines(predict(fit2), col = "black")
lines(predict(fit1), col = "blue")
lines(predict(lm(mpg ~ ., mtcars)), col = "red")
Overall we can say that the predictions of fit2 are closer to those of the full model fit10 than the predictions of fit1 are, indicating that am and wt are quite important regressors.
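A quick numeric counterpart of this visual comparison (a small sketch) is the adjusted R-squared of the three models:
# Sketch: adjusted R-squared of the three models being compared
c(fit1 = summary(fit1)$adj.r.squared,
  fit2 = summary(fit2)$adj.r.squared,
  fit10 = summary(fit10)$adj.r.squared)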
Looking at the model I have used (mpg ~ am + wt), there is no real difference between automatic and manual transmission once wt is taken into account. Residual analysis showed that this model fits the data about as well as the model including the full variable set. However, the VIF values and the correlation graph also show that some of the variables are mutually correlated, so adding or removing some of them could lead to over- or under-fitting the model.
As for the main question, Is an automatic or manual transmission better for MPG?, there is no clear answer, because it is not really possible to establish which regressors are more important than others. This is due to the fact that the variables are mutually correlated, and a linear model is not accurate enough to describe those interactions, so exchanging regressors can easily lead to different results.
More advanced models and more insightful data are required to give a more accurate answer.
Appendix A: Correlation matrices
require(GGally)
require(ggplot2)
lowerFn <- function(data, mapping, ...) {
p <- ggplot(data = data, mapping = mapping) +
geom_point(color = 'blue', alpha=0.3, size=0.5) +
geom_smooth(color = 'black', method='lm', size=0.7,...)
p
}
printCorrelations <- function(modelData, graphTitle) {
g <- ggpairs(
data = modelData,
title = graphTitle,
lower = list(
continuous = wrap(lowerFn)
),
upper = list(continuous = wrap("cor", size = 2))
)
g <- g + theme(
axis.text = element_text(size = 4),
axis.title = element_text(size = 4),
legend.background = element_rect(fill = "white"),
panel.grid.major = element_line(colour = NA),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "grey95")
)
suppressMessages(print(g))
}
printCorrelations( subset(mtcars, select = c(mpg,am)), "Correlation Matrix for MPG/AM" )
printCorrelations( mtcars, "Correlation Matrix for all variables" )
Appendix B: Residual plots for fit2 and fit10
par(mfrow=c(2,2))
plot(fit2)
par(mfrow=c(2,2))
plot(fit10)
## Warning: not plotting observations with leverage one:
## 30, 31
## Warning: not plotting observations with leverage one:
## 30, 31