Summary

The purpose of the project is to demonstrate an understanding of the model selection process in a regression context and to garner insights about the data. We have used the dataset of a collection of cars from Motor Trend, a magazine about the automobile industry. We have explored the relationship between a set of variables and miles per gallon. We have choosen on the basis of specific criteria a model that give us a better out-of-sample accuracy. With this model we have illustrated that manual is better than automatic transmission

Exploratory data analysis

library(ggplot2);data(mtcars)
par(mfrow=c(2,2))
mtcars = transform(mtcars, transmission=factor(am,labels=c("automatic","manual")))
pairs(mtcars[,c(1,3,4,6,9)], panel = panel.smooth, main = "Exploring mtcars data", col = 1 + (mtcars$wt))


We have first explored all the mtcars data(not shown in this report), then we have subsetted variables that seem to have strong relationship with MPG. At the end, the graph above shows us that Weight seems to have a better relationship over disp and hp. We will be focusing on mpg as outcome and the set of variables: wt and am.

library(ggplot2);data(mtcars)
mtcars = transform(mtcars, transmission=factor(am,labels=c("automatic","manual")))
qplot(wt,mpg,data=mtcars,color=transmission,geom=c("point","smooth"),method="lm")


Here we smooth the relationship between Weight and miles per gallon (MPG) (outcome). We are looking how this relationship is different betwen manual and automatic transmission. There’s a negative relationship and it seems to be strong for manual transmission.

Is an automatic or manual transmission better for MPG?

Let’s include all the variable in the mtcars dataset as predictor and mpg as the outcome in our first model to understand multiple variable regression

fit0<-lm(mpg ~ ., data=mtcars)
as.data.frame(summary(fit0)$coef)[9,]
##    Estimate Std. Error  t value  Pr(>|t|)
## am 2.520227   2.056651 1.225404 0.2339897

We grap only am coefficients. Its éstimated value is 2.61. So we estimate an expected 2.61 increase in Miles/(US) gallon for every 1% increase in number of automatic/manual transmission in holding the remaining variables constant.
Notice that The t-test for \(H_0: \beta_{transmission} = 0\) versus \(H_a: \beta_{transmission} \neq 0\) has a P-value equal to 0.2. it is not significant.

fit1<-lm(mpg ~ factor(am), data=mtcars)
summary(fit1)$coef
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 17.147368   1.124603 15.247492 1.133983e-15
## factor(am)1  7.244939   1.764422  4.106127 2.850207e-04

If we include only am, disregarding or not the factor function, because we have only two levels, the estimate is now 7.24, but the P-value is 2.8e-04< 0.05. The test statistic is interestingly significative. Notice that we have only am1, the manual transmission in the table. It is because R has elected to choose am0: the automatic transmission as the reference category. The number7.24 is the estimated increase in mpg comparing manual transmission to automatic.

fit2<-lm(mpg ~ factor(am)-1, data=mtcars)
summary(fit2)$coef
##             Estimate Std. Error  t value     Pr(>|t|)
## factor(am)0 17.14737   1.124603 15.24749 1.133983e-15
## factor(am)1 24.39231   1.359578 17.94109 1.376283e-17

If we omit the intercept, then the model includes both automatic and manual transmission. Now manual is not a linear combination of automatic, there’s 2 means in the dataset. The expected value of the outcome should be the mean for automatic or manual transmission. As we can see in the table automatic is about 17.14737 and manual is about 24.39231 and it is clearly illustrated in the boxplot below.

boxplot(mpg ~ transmission, data = mtcars,
        xlab = "Type of transmission", ylab = "Miles/(US) gallon count",
        main = "Automatic versus manual transmission", varwidth = TRUE, col = "lightgray")


If I were to substract 17.14737 to 24.39231, I would get exactly 7.244939. In other words, the model are perfectly consistent. If you are going to fix the mean of the automatic and manual, any way you do this, the models will alway be consistent. If you parameterize the model so all the coefficients are interpreted as: automatic is the reference category in change from manual, then they will be consistent. In this model, the coefficients are interpreted as the mean.

Statistical inference

alpha <- 0.05
n<-nrow(mtcars)
pe <- coef(summary(fit1))["factor(am)1", "Estimate"]
se <- coef(summary(fit1))["factor(am)1", "Std. Error"]
tstat <- qt(1 - alpha/2, n - 2)  # n - 2 for model with intercept and slope
pe + c(-1, 1) * (se * tstat)
## [1]  3.64151 10.84837

Uncertainty and Conclusion

If we were willing to choose the model 1 as our best model, then the confidence interval for the 7.24 MPG difference would be 3.64151 to 10.84837

t.test(mpg ~ factor(am), data = mtcars)
## 
##  Welch Two Sample t-test
## 
## data:  mpg by factor(am)
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean in group 0 mean in group 1 
##        17.14737        24.39231

Conclusion 1

In the first and second model, we have illustrated that manual transmission is better than automatic transmission and an MPG difference of 7.24. The P-value is 0.00137, smaller than 0.05. We reject the null hypothesis and claim that there is a signficiant difference in the mean MPG between manual transmission cars and that of automatic transmission cars

fit3<-lm(mpg ~ am + wt, data=mtcars)
summary(fit3)$coef
##                Estimate Std. Error     t value     Pr(>|t|)
## (Intercept) 37.32155131  3.0546385 12.21799285 5.843477e-13
## am          -0.02361522  1.5456453 -0.01527855 9.879146e-01
## wt          -5.35281145  0.7882438 -6.79080719 1.867415e-07

In addition, if we include the weight variable, this ajustment reverse the sign of the estimate effect. However the transmission P-value 9.8e-01 is not significant.

Residuals and Diagnostic 1

n<-nrow(mtcars);x2 <- mtcars$wt ; x1 <-mtcars$am; y = mtcars$mpg
par(mfrow = c(1, 2))
plot(x1, y, pch=21,col="black",bg=topo.colors(n)[x2], frame = FALSE, cex = 1.5,main
='Unadjusted, color is Weight',cex.main=.9)
abline(lm(y ~ x1), lwd = 2)
plot(resid(lm(x1 ~ x2)), resid(lm(y ~ x2)), pch = 21, col = "black", bg = "lightblue", frame = FALSE, cex = 1.5)
title('Adjusted',cex.main=.9)
abline(0, coef(lm(y ~ x1 + x2))[2], lwd = 2)
x<-resid(lm(x1 ~ x2))
e<-resid(lm(y ~ x2))
for (i in 1 : n)
lines(c(x[i], x[i]), c(e[i], 0), col = "red", lwd = 1)


The first graph is the unajusted one. It shows a poor relationship. You can also see the confounder wt is poorly correlated with am. We can remove the effect of wt by taking it out of am. You can see that once we factor out wt, we have a relationonship slightly in the opposite direction, and it’s basically saying if we were to fix the unajusted model fit2, what you would see is a negative slope between mpg and wt.

fit4<-lm(mpg ~ wt + factor(am) + am*wt,data = mtcars)
summary(fit4)$coef
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 31.416055  3.0201093 10.402291 4.001043e-11
## wt          -3.785908  0.7856478 -4.818836 4.551182e-05
## factor(am)1 14.878423  4.2640422  3.489277 1.621034e-03
## wt:am       -5.298360  1.4446993 -3.667449 1.017148e-03

We consider a fourth model with mpg as the outcome model that considers the interaction between transmission (as a factor variable) and weight. automatic is the reference category. 31.41 and -3.78 is giving us the intercept and the slope for automatic, 14.87 is the change in the intercept for manual, so that the sum of 31.41 and 14.87 is the intercept for manual, and the sum of -3.78 and -5.29 is the change in the slope betwen automatic and manual transmission. We have two lines with different slopes, one for transmission and one for manual. Our new slope for manual transmission would be the sum of -3.78 and -5.29. The way that is illustrated is as follows:

fit4<-lm(mpg ~ wt + am + am*wt,data = mtcars)
plot(mtcars$wt,mtcars$mpg,pch=19,ylab="Miles/(US) gallon",xlab="Weight (lb/1000)")
points(mtcars$wt,mtcars$mpg,pch=19,col=((mtcars$am==0)*1+1))
abline(c(fit4$coeff[1],fit4$coeff[2]),col="red",lwd=3)
abline(c(fit4$coeff[1] + fit4$coeff[3],fit4$coeff[2] + fit4$coeff[4]),col="black",lwd=3)
legend("topright",col=c("black", "red"),lty=1, lwd=2,y.intersp=1.2,bty="n",
       legend=c("Automatic", "Manual"))

Conclusion 2

In the fourth model, We have demonstrated that manual transmission is better than automatic and that the MPG difference is 14.87. We have attempted to fix the model 2 using Model 3. Nonetheless, Model 3 shows a poor relationship. We have used a formal model selection based on the P-value for the likelihood ratio test comparing model 3 and model 4.

anova(fit3,fit4)
## Analysis of Variance Table
## 
## Model 1: mpg ~ am + wt
## Model 2: mpg ~ wt + am + am * wt
##   Res.Df    RSS Df Sum of Sq     F   Pr(>F)   
## 1     29 278.32                               
## 2     28 188.01  1    90.312 13.45 0.001017 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The P-value is: *0.001017, smaller than 0.05. So, according to our criterion, we would reject the null hypothesis, which suggests that the interaction terms is necessary and the model 4 is the best one so far.

Residuals and Diagnostic 2

fit4<-lm(mpg ~ wt + am + am*wt,data = mtcars)
plot(fit4, which=c(1))

highest <- dfbetas(fit4)
tail(sort(highest[,3]),3)
##          Fiat 128    Toyota Corolla Chrysler Imperial 
##         0.3266016         0.4685961         0.5836215
influential<- hatvalues(fit4)
tail(sort(influential),3)
##   Chrysler Imperial Lincoln Continental       Maserati Bora 
##           0.2809856           0.3044512           0.3709866

Conclusion 3

The hat diagonal for the three most influential point are: Chrysler Imperial, Lincoln Continental, Maserati Bora whereas the slope dfbeta for the three points with the highest hat value are: Fiat 128, Toyota Corolla and Chrysler Imperial.