Executive summary

This Project is part of course assignment of Regression Models - a magazine about the automobile industry.
Looking at a data set of a collection of cars, we are interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome) which are the following two questions:

1.Is an automatic or manual transmission better for MPG?
2.Quantify the MPG difference between automatic and manual transmissions?

Using by simple linear model, we determine that there is a signficant difference between the mean 7.3 MPG for automatic and manual transmission cars. With multiple linear model, this increase is approximately 2.9 MPG when switching from an automatic transmission to a manual one, with the weight and qsec(1/4 mile time) held constant.

The way to find the best model is located in the Appendix to this document.

Part 1. Simple Regression and Explorary Analysis

1.0 Data and Library Load

library(ggplot2)
library(car)
data(mtcars)
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
#let's check the data distribution by boxplot
boxplot(mtcars$mpg ~ mtcars$am, names = c("Automatic", "Manual"), xlab = "transmission", ylab = "mpg", col = "orange")  

1.1 Simple Linear Regression modeling

#H0 : the mean of mpg of Automatic and Manual is same.
#HA : the mean of mpg of Automatic and Manual is different.
#this model is based on ols(ordinary least squares)

fit <- lm(mpg ~ am, data = mtcars)

1.2 Model check by Exploratory Analysis

#residual plot
g = ggplot(mtcars, aes(x = am, y= resid(fit)))
g = g + geom_hline(yintercept = 0, size = 2)
g = g + geom_point(size = 6, colour = "black", alpha = 0.4)
g = g + geom_point(size = 5, colour = "red", alpha = 0.4)
g = g + xlab("transmission 0: Auto, 1: Manual") + ylab("Residual")
g

summary(fit)
## 
## Call:
## lm(formula = mpg ~ am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## am             7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285
#The intercept(b0) coefficient is mean of MPG for cars with automatic (am == 0)
#the am coefficient(b1) is the mean increase in MPG for cars with manual (am == 1). 
#the bo+b1 is the mean of MPG for cars with manual transmissions.
#we can calculate a 95% confidence interval for b0 and b1
confint(fit)
##                2.5 %   97.5 %
## (Intercept) 14.85062 19.44411
## am           3.64151 10.84837
diff(tapply(mtcars$mpg,mtcars$am, mean))
##        1 
## 7.244939
#the manual cars tend to have more average mpg of 7.244939 than automatic cars.
diff(tapply(mtcars$mpg,mtcars$am, sd))
##        1 
## 2.332537
#the manual cars tend to have more standard deviation mpg of 2.332537 than automatic cars.

Conclusion :

The p-value .0002850207410 for b1 is significantly small enough to reject the H0 and accept the HA that there is a significant difference in MPG between the two groups at alpha = 0.05. However when we check the Adjusted R-squard 0.3385 doesn’t explain the data enoughly, so we need to work on another model that accout for the data.

Part 2. Multiple Linear Regression and Explorary Analysis

2.0 Best Model and Library Load

#best model: to have a parsimounious model - accept wt + qsec + am as predictors 
#check the appendix to see how to find the best model.

2.1 Multiple linear modeling

bestmdl <- lm(mpg ~ wt + qsec + am, data = mtcars)
sqrt(vif(bestmdl))
##       wt     qsec       am 
## 1.575738 1.168049 1.594189
#It should be okay with Variance Influence Factor
influencePlot(bestmdl)

##                         StudRes       Hat        CookD
## Merc 230            -1.25110559 0.2970422 0.1620826668
## Lincoln Continental  0.08383783 0.2642151 0.0006541961
## Chrysler Imperial    2.32311937 0.2296338 0.3475974030
## Fiat 128             2.12214584 0.1276313 0.1464019096
#some with names of data need to be double checked

2.2 Model check by Exploratory Analysis

#residual plot
g = ggplot(mtcars, aes(x= bestmdl$fitted.values, y= resid(bestmdl)))
g = g + geom_hline(yintercept = 0, size = 2)
g = g + geom_point(size = 4, colour = "black", alpha = 0.4)
g = g + geom_point(size = 3, colour = "red", alpha = 0.4)
g = g + xlab("Fitted Values lm(mpg ~ wt + qsec + am)") + ylab("Residual")
g = g + geom_smooth()

g
## `geom_smooth()` using method = 'loess'

summary(bestmdl)
## 
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4811 -1.5555 -0.7257  1.4110  4.6610 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.6178     6.9596   1.382 0.177915    
## wt           -3.9165     0.7112  -5.507 6.95e-06 ***
## qsec          1.2259     0.2887   4.247 0.000216 ***
## am            2.9358     1.4109   2.081 0.046716 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared:  0.8497, Adjusted R-squared:  0.8336 
## F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11
confint(bestmdl)
##                   2.5 %    97.5 %
## (Intercept) -4.63829946 23.873860
## wt          -5.37333423 -2.459673
## qsec         0.63457320  1.817199
## am           0.04573031  5.825944

Conclusion :

The p-value .0467 for am is not significantly small enough to reject the H0 and the p-values for wt, qsec are significantly small to accept the HA that there is a significant difference in MPG between the two groups at alpha = 0.05. It has the Adjusted R-squared value of 0.8336 which is higher than the previous model.

It is hard to say that this model is best, however based on the supporting analysis in Appendix section, this model can be regarded as a model that account for the data.

Appendix

#Let's find out the best model by utilizing the stepwise regression to find the best regression model automatically.
#I am using Backward stepwise regression.

full.model = lm(mpg ~ ., data = mtcars)
reduced.model= step(full.model, direction = "backward")
## Start:  AIC=70.9
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## 
##        Df Sum of Sq    RSS    AIC
## - cyl   1    0.0799 147.57 68.915
## - vs    1    0.1601 147.66 68.932
## - carb  1    0.4067 147.90 68.986
## - gear  1    1.3531 148.85 69.190
## - drat  1    1.6270 149.12 69.249
## - disp  1    3.9167 151.41 69.736
## - hp    1    6.8399 154.33 70.348
## - qsec  1    8.8641 156.36 70.765
## <none>              147.49 70.898
## - am    1   10.5467 158.04 71.108
## - wt    1   27.0144 174.51 74.280
## 
## Step:  AIC=68.92
## mpg ~ disp + hp + drat + wt + qsec + vs + am + gear + carb
## 
##        Df Sum of Sq    RSS    AIC
## - vs    1    0.2685 147.84 66.973
## - carb  1    0.5201 148.09 67.028
## - gear  1    1.8211 149.40 67.308
## - drat  1    1.9826 149.56 67.342
## - disp  1    3.9009 151.47 67.750
## - hp    1    7.3632 154.94 68.473
## <none>              147.57 68.915
## - qsec  1   10.0933 157.67 69.032
## - am    1   11.8359 159.41 69.384
## - wt    1   27.0280 174.60 72.297
## 
## Step:  AIC=66.97
## mpg ~ disp + hp + drat + wt + qsec + am + gear + carb
## 
##        Df Sum of Sq    RSS    AIC
## - carb  1    0.6855 148.53 65.121
## - gear  1    2.1437 149.99 65.434
## - drat  1    2.2139 150.06 65.449
## - disp  1    3.6467 151.49 65.753
## - hp    1    7.1060 154.95 66.475
## <none>              147.84 66.973
## - am    1   11.5694 159.41 67.384
## - qsec  1   15.6830 163.53 68.200
## - wt    1   27.3799 175.22 70.410
## 
## Step:  AIC=65.12
## mpg ~ disp + hp + drat + wt + qsec + am + gear
## 
##        Df Sum of Sq    RSS    AIC
## - gear  1     1.565 150.09 63.457
## - drat  1     1.932 150.46 63.535
## <none>              148.53 65.121
## - disp  1    10.110 158.64 65.229
## - am    1    12.323 160.85 65.672
## - hp    1    14.826 163.35 66.166
## - qsec  1    26.408 174.94 68.358
## - wt    1    69.127 217.66 75.350
## 
## Step:  AIC=63.46
## mpg ~ disp + hp + drat + wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## - drat  1     3.345 153.44 62.162
## - disp  1     8.545 158.64 63.229
## <none>              150.09 63.457
## - hp    1    13.285 163.38 64.171
## - am    1    20.036 170.13 65.466
## - qsec  1    25.574 175.67 66.491
## - wt    1    67.572 217.66 73.351
## 
## Step:  AIC=62.16
## mpg ~ disp + hp + wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## - disp  1     6.629 160.07 61.515
## <none>              153.44 62.162
## - hp    1    12.572 166.01 62.682
## - qsec  1    26.470 179.91 65.255
## - am    1    32.198 185.63 66.258
## - wt    1    69.043 222.48 72.051
## 
## Step:  AIC=61.52
## mpg ~ hp + wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## - hp    1     9.219 169.29 61.307
## <none>              160.07 61.515
## - qsec  1    20.225 180.29 63.323
## - am    1    25.993 186.06 64.331
## - wt    1    78.494 238.56 72.284
## 
## Step:  AIC=61.31
## mpg ~ wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## <none>              169.29 61.307
## - am    1    26.178 195.46 63.908
## - qsec  1   109.034 278.32 75.217
## - wt    1   183.347 352.63 82.790
summary(reduced.model)
## 
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4811 -1.5555 -0.7257  1.4110  4.6610 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.6178     6.9596   1.382 0.177915    
## wt           -3.9165     0.7112  -5.507 6.95e-06 ***
## qsec          1.2259     0.2887   4.247 0.000216 ***
## am            2.9358     1.4109   2.081 0.046716 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared:  0.8497, Adjusted R-squared:  0.8336 
## F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11
#Reduced best model include wt, qsec, am
#Let's double check with leaps to check the all subset regression to complemnt the stepwise model.

library(leaps)
leaps.model <- regsubsets(mpg ~ ., 
                          data = mtcars,
                          nbest = 4)
plot(leaps.model, scale= "adjr2") #using adjusted R-square

#Find a model which maximize the ajusted R-sqaure and parsimounious model if possible.
#Check the model adjr2 0.84 with wt, qsec, am, this model regarded as parsimonious model.

spreadLevelPlot(bestmdl)

## 
## Suggested power transformation:  -0.5249755
#let's check the corelation.
library(GGally)
ggpairs(mtcars, columns = c(1, 6, 7, 9))