****************************************************************************************************************************

Problem Statement:

We are doing some analysis work for Motor Trend, a magazine about the automobile industry. Looking at a data set mtcars which is a data set with a collection of cars. Our bosses at Motor Trend are interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome).

They are particularly interested in the following two questions:

  1. “Is an automatic or manual transmission better for MPG”
  2. “Quantify the MPG difference between automatic and manual transmissions”

Our Approach (high-level):

We will first understand the data through EDA, then build regression models with MPG as an outcome, then we will analyze models through residual plots and diagnostics, and select best fitted model. Finally we will make decisions and draw conclusion on best MPG performance based on transmission type.

Following steps outline our approach:

  1. Do some Exploratory Data Analysis (EDA) - look at data, its strcuture, and descriptive statistics on variables of our interest. Also we will have to make decision about how we will compare MPG performance based on whether the car is automatic transmission or manual transmission car.
  2. Univariate Regression Models - we will try and fit a simple linear regression to the data, and analyze outcomes
  3. Multivariate Regression Models and Model Selection - we will fit several multivariable regressions and analyze outcomes
  4. Perform Predictions - we will predict MPG using a test data set and the selected model
  5. Conclusion - we will summarize final findings and suggestions
  6. Appendix - additional information

**************************************************************************************************************************

Load the required packages from the library

suppressMessages(library(UsingR))
suppressMessages(library(ggplot2))
suppressMessages(library(stats))
suppressMessages(library(dplyr))
data("mtcars")

1. Exploratory Data Analysis

We are particularly interested in variables “mpg” and “am” where “mpg” is the measure of miles per gallon and “am” descirbes whether the car is automatic or manual transmission (0=automatic, 1=manual)

# Let's see the first six rows of the data
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
# Now let's see summary descriptive statistics on variable "mpg" 
summary(mtcars$mpg)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   15.42   19.20   20.09   22.80   33.90
# Now let's count how many of all 32 cars are with automatic transmission and how many are with manual
table(mtcars$am)
## 
##  0  1 
## 19 13
# Let's do boxplot of MPG against transmission type
boxplot(mpg ~ am, data=mtcars, xlab="Transmission (0 = Automatic, 1 = Manual)", ylab="Miles per Gallon", main="Boxplot of MPG vs. Transmission")

# Now let's see average MPG on automatic transmission cars and manual transmission cars separately
auto<-subset(mtcars, mtcars$am==0)
summary(auto$mpg)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   14.95   17.30   17.15   19.20   24.40
manual<-subset(mtcars, mtcars$am==1)
summary(manual$mpg)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   15.00   21.00   22.80   24.39   30.40   33.90
# Now let's see how many cylinder types we have in this data type
table(mtcars$cyl)
## 
##  4  6  8 
## 11  7 14

According to our Exploratory Data Analysis (EDA), data details 11 features for 32 different types of cars. Out of these 32 cars, 19 cars are with automatic transmission and the rest are with manual transmission. We also found out that average MPG is 20.09 with minimum MPG value of 10.40 and maximum MPG value of 33.90. According to summary descriptve statistics on average MPG on automatic and manual transmission cars separately, data shows that we get more MPG on manual transmission cars with average MPG on manual transmission cars comes out to be 24.39 compared to that of automatic transmission cars which is 17.15.

2. UNIVARIATE REGRESSION MODELS

2.1 Let’s try and fit univariate linear regression model to our data

# "mpg" as an outcome and "am"" as a predictor variable
fit1<-lm(mpg~factor(am), data = mtcars)
summary(fit1)
## 
## Call:
## lm(formula = mpg ~ factor(am), data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## factor(am)1    7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

As seen from the summary output of the linear regression, the intercept is the mean MPG of automatic transmission and the slope coefficient is the difference in average MPG in manual transmission cars compared to automatic transmission cars. Thus, 17.147 is the average MPG that automatic transmission cars get while 17.147+7.245=24.392 is the average MPG that manual transmission cars get. This supports the average MPG values we got through simple EDA we did prior to this. However, this model only explains 34% (R-squared 0.3385) of the total variation. The leftover from this variation is the residual variation. So let’s look at the residual plots and then we will decide if we must further include more variables in our model.

2.2 Residual plots

library(knitr)
e1<-residuals(fit1)
x<-factor(mtcars$am)
g=ggplot(data.frame(x=x,y=e1),aes(x=x,y=y))
g=g+geom_hline(yintercept = 0,size=2)
g=g+geom_point(size=7,colour="black",alpha=0.4)
g=g+geom_point(size=5,colour="red",alpha=0.4)
g=g+ggtitle("fit1 - Residual plot")+xlab("Transmission Type")+ylab("Residuals")
g

According to the residual plot of fit1, we see that there are lots of residual variation in both transmission types, especially with manual transmission. That means, we must further include more variables into the model which then likely explain these residual variations.

3. MULTIVARIATE REGRESSION MODELS AND MODEL SELECTION

3.1 Let’s fit several models and see how which model presents best results

Looking at the correlation matrix and graph in the APPENDIX part of this report, we will now include variables cyl, drat and qsec in the nested multivariable models.

# MPG as an outcome, am and cyl as predictors
fit2<-lm(mpg~factor(am)+factor(cyl), data = mtcars)
summary(fit2)
## 
## Call:
## lm(formula = mpg ~ factor(am) + factor(cyl), data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.9618 -1.4971 -0.2057  1.8907  6.5382 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    24.802      1.323  18.752  < 2e-16 ***
## factor(am)1     2.560      1.298   1.973 0.058457 .  
## factor(cyl)6   -6.156      1.536  -4.009 0.000411 ***
## factor(cyl)8  -10.068      1.452  -6.933 1.55e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.073 on 28 degrees of freedom
## Multiple R-squared:  0.7651, Adjusted R-squared:  0.7399 
## F-statistic:  30.4 on 3 and 28 DF,  p-value: 5.959e-09
# MPG as an outcome, am, cyl, and drat as predictors
fit3<-lm(mpg~factor(am)+factor(cyl)+drat, data = mtcars)
summary(fit3)
## 
## Call:
## lm(formula = mpg ~ factor(am) + factor(cyl) + drat, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.938 -1.585 -0.211  1.911  6.536 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   23.9197     6.7812   3.527  0.00152 ** 
## factor(am)1    2.4381     1.6087   1.516  0.14125    
## factor(cyl)6  -6.0768     1.6737  -3.631  0.00117 ** 
## factor(cyl)8  -9.9381     1.7711  -5.611 5.93e-06 ***
## drat           0.2385     1.7966   0.133  0.89539    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.129 on 27 degrees of freedom
## Multiple R-squared:  0.7653, Adjusted R-squared:  0.7305 
## F-statistic: 22.01 on 4 and 27 DF,  p-value: 3.606e-08
# MPG as an outcome, am, cyl, drat, and qsec as predictors
fit4<-lm(mpg~factor(am)+factor(cyl)+drat+qsec, data = mtcars)
summary(fit4)
## 
## Call:
## lm(formula = mpg ~ factor(am) + factor(cyl) + drat + qsec, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.9914 -1.2239 -0.1982  1.8050  5.8845 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   13.6963    16.7136   0.819   0.4200  
## factor(am)1    3.4540     2.2219   1.555   0.1321  
## factor(cyl)6  -5.1320     2.2010  -2.332   0.0277 *
## factor(cyl)8  -8.0778     3.3010  -2.447   0.0215 *
## drat           0.5171     1.8622   0.278   0.7834  
## qsec           0.4363     0.6506   0.671   0.5084  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.161 on 26 degrees of freedom
## Multiple R-squared:  0.7693, Adjusted R-squared:  0.7249 
## F-statistic: 17.34 on 5 and 26 DF,  p-value: 1.481e-07

According to the summary outputs of fit2, fit3, and fit4, including cyl as a predictor is significant and it increases the value of R-squared to 0.7399 which means this model now explains the 74% of the total variation in MPG. However, including drat and qsec as predictors does not help much and not significant so we will take them out of the model. However, as per our correltation matrix in appendix section, let’s include wt and qsec variables as regressors and see what happens.

# MPG as an outcome, am and wt
fit3<-lm(mpg~factor(am)+wt, data = mtcars)
summary(fit3)
## 
## Call:
## lm(formula = mpg ~ factor(am) + wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5295 -2.3619 -0.1317  1.4025  6.8782 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.32155    3.05464  12.218 5.84e-13 ***
## factor(am)1 -0.02362    1.54565  -0.015    0.988    
## wt          -5.35281    0.78824  -6.791 1.87e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.098 on 29 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7358 
## F-statistic: 44.17 on 2 and 29 DF,  p-value: 1.579e-09
# MPG as an outcome, am and wt as predictors
fit4<-lm(mpg~factor(am)+wt+qsec, data = mtcars)
summary(fit4)
## 
## Call:
## lm(formula = mpg ~ factor(am) + wt + qsec, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4811 -1.5555 -0.7257  1.4110  4.6610 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.6178     6.9596   1.382 0.177915    
## factor(am)1   2.9358     1.4109   2.081 0.046716 *  
## wt           -3.9165     0.7112  -5.507 6.95e-06 ***
## qsec          1.2259     0.2887   4.247 0.000216 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared:  0.8497, Adjusted R-squared:  0.8336 
## F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11

The outputs of the summary show that including wt and qsec variable as regressors is significant and R-squared value of 0.8336 tells us that model explains 83% of total variation in predicted MPG. So now, let’s compare fit1 (original simple linear regression), fit2 (multivariate regression including cyl variable) and fit4 (multivariate regression including both wt and qsec variables) using anova test.

anova(fit1, fit2, fit4)
## Analysis of Variance Table
## 
## Model 1: mpg ~ factor(am)
## Model 2: mpg ~ factor(am) + factor(cyl)
## Model 3: mpg ~ factor(am) + wt + qsec
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)    
## 1     30 720.90                                 
## 2     28 264.50  2    456.40 24.158 8.01e-07 ***
## 3     28 169.29  0     95.21                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
c(deviance(fit1),deviance(fit2),deviance(fit4))
## [1] 720.8966 264.4957 169.2859
shapiro.test(fit1$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  fit1$residuals
## W = 0.98208, p-value = 0.8573
shapiro.test(fit2$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  fit2$residuals
## W = 0.98002, p-value = 0.8001
shapiro.test(fit4$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  fit4$residuals
## W = 0.9411, p-value = 0.08043

According to ANOVA, fit4 seems to be the best model with lowest residual sums of squares. Also notice the residual normality test and it suggests that we can trust the results of ANOVA. The reduction in residual sums of squares with the model fit4 is also a supporting fact to our decision to select fit4 as the best model.

3.2 Residual plot on selected model

e2<-residuals(fit4)
x<-factor(mtcars$am)
g=ggplot(data.frame(x=x,y=e2),aes(x=x,y=y))
g=g+geom_hline(yintercept = 0,size=2)
g=g+geom_point(size=7,colour="black",alpha=0.4)
g=g+geom_point(size=5,colour="red",alpha=0.4)
g=g+ggtitle("fit4 - Residual plot")+xlab("Transmission Type")+ylab("Residuals")
g

As seen from the residual plot above, the fit4 has much smaller residuals than that of fit1. Thus, based on ANOVA, residual normality test and the residual plot, we select fit4 as the best model.

4. PERFORM PREDICTIONS

# Let's first create a test data set
tam<-mtcars$am
tqsec<-mtcars$qsec
twt<-mtcars$wt
ndata<-data.frame(am=factor(tam), qsec=tqsec, wt=twt)
# Predict with fit1 as a baseline to compare with fit4
p1<-predict(fit1, newdata = ndata, interval = "prediction")
# Predict with model 4
p4<-predict(fit4, newdata = ndata, interval = "prediction")
# Compare predictions
prds<-data.frame(p1=p1, p4=p4)
fits<-data.frame(p1=mean(prds$p1.fit),p4=mean(prds$p4.fit))
lwrs<-data.frame(p1=mean(prds$p1.lwr),p4=mean(prds$p4.lwr))
uprs<-data.frame(p1=mean(prds$p1.upr),p4=mean(prds$p4.upr))
ovrl<-rbind(fits, lwrs, uprs)
row.names(ovrl)<-c("Avg_fits", "Avg_lwr", "Avg_upr")
ovrl
##                 p1       p4
## Avg_fits 20.090625 20.09062
## Avg_lwr   9.771398 14.75029
## Avg_upr  30.409852 25.43096

Accroding to the predictions, fit4 predicts average MPG exactly same as fit1 (which is also the average of MPG in mtcars data) but predicts lower and upper prediction boundaries more accurate and realistically than fit1.

5. CONCLUSION

p4<-data.frame(p4)
f<-p4$fit
x<-factor(tam)
g=ggplot(data.frame(x=x,y=f),aes(x=x,y=y))
g=g+geom_hline(yintercept = 0,size=2)
g=g+geom_point(size=7,colour="black",alpha=0.4)
g=g+geom_point(size=5,colour="red",alpha=0.4)
g=g+ggtitle("Predicted MPG values vs Transmission Type")+xlab("Transmission Type")+ylab("Predicted MPG")
g

fit1<-lm(mpg~factor(am)+wt+qsec, data = mtcars)
summary(fit1)
## 
## Call:
## lm(formula = mpg ~ factor(am) + wt + qsec, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4811 -1.5555 -0.7257  1.4110  4.6610 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.6178     6.9596   1.382 0.177915    
## factor(am)1   2.9358     1.4109   2.081 0.046716 *  
## wt           -3.9165     0.7112  -5.507 6.95e-06 ***
## qsec          1.2259     0.2887   4.247 0.000216 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared:  0.8497, Adjusted R-squared:  0.8336 
## F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11

According to the plot, often times manual transmission cars get higher MPG performance than the automatic cars. Thus, we conclude that manual transmission is better for MPG.

Also, the MPG difference between automatic and manual transmissions is 2.93.

6. APPENDIX

let’s try to understand the details about the data

# Let's first quickly read about mtcars data on R help
?mtcars
# Now let's look at the variables names of the data
names(mtcars)
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"
# Now see more details about data as to its number of observations, number of variables, and the class of each variable
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Let’s see correlation between variables that might aid us in model selection

# Let's first glance at the correlation table to see which variables have high correlation with variable mpg
cor(mtcars,use = "complete.obs", method = "kendall")
##             mpg        cyl       disp         hp        drat         wt
## mpg   1.0000000 -0.7953134 -0.7681311 -0.7428125  0.46454879 -0.7278321
## cyl  -0.7953134  1.0000000  0.8144263  0.7851865 -0.55131785  0.7282611
## disp -0.7681311  0.8144263  1.0000000  0.6659987 -0.49898277  0.7433824
## hp   -0.7428125  0.7851865  0.6659987  1.0000000 -0.38262689  0.6113081
## drat  0.4645488 -0.5513178 -0.4989828 -0.3826269  1.00000000 -0.5471495
## wt   -0.7278321  0.7282611  0.7433824  0.6113081 -0.54714953  1.0000000
## qsec  0.3153652 -0.4489698 -0.3008155 -0.4729061  0.03272155 -0.1419881
## vs    0.5896790 -0.7710007 -0.6033059 -0.6305926  0.37510111 -0.4884787
## am    0.4690128 -0.4946212 -0.5202739 -0.3039956  0.57554849 -0.6138790
## gear  0.4331509 -0.5125435 -0.4759795 -0.2794458  0.58392476 -0.5435956
## carb -0.5043945  0.4654299  0.4137360  0.5959842 -0.09535193  0.3713741
##             qsec         vs          am        gear        carb
## mpg   0.31536522  0.5896790  0.46901280  0.43315089 -0.50439455
## cyl  -0.44896982 -0.7710007 -0.49462115 -0.51254349  0.46542994
## disp -0.30081549 -0.6033059 -0.52027392 -0.47597955  0.41373600
## hp   -0.47290613 -0.6305926 -0.30399557 -0.27944584  0.59598416
## drat  0.03272155  0.3751011  0.57554849  0.58392476 -0.09535193
## wt   -0.14198812 -0.4884787 -0.61387896 -0.54359562  0.37137413
## qsec  1.00000000  0.6575431 -0.16890405 -0.09126069 -0.50643945
## vs    0.65754312  1.0000000  0.16834512  0.26974788 -0.57692729
## am   -0.16890405  0.1683451  1.00000000  0.77078758 -0.05859929
## gear -0.09126069  0.2697479  0.77078758  1.00000000  0.09801487
## carb -0.50643945 -0.5769273 -0.05859929  0.09801487  1.00000000
# Let's do a correlation table to aid ourselves in model selection.
pairs(mpg~cyl+drat+qsec,data = mtcars)

pairs(mpg~cyl+wt+qsec,data = mtcars)