We are doing analysis work for Motor Trend, a magazine about the automobile industry, using the mtcars data set, which contains data on a collection of cars. Our bosses at Motor Trend are interested in exploring the relationship between a set of variables and miles per gallon (MPG), the outcome.
They are particularly interested in the following two questions:
1. Is an automatic or manual transmission better for MPG?
2. How large is the MPG difference between automatic and manual transmissions?
We will first understand the data through exploratory data analysis (EDA), then build regression models with MPG as the outcome, analyze the models through residual plots and diagnostics, and select the best-fitting model. Finally, we will draw conclusions about MPG performance based on transmission type.
The following steps outline our approach:
suppressMessages(library(UsingR))
suppressMessages(library(ggplot2))
suppressMessages(library(stats))
suppressMessages(library(dplyr))
data("mtcars")
We are particularly interested in the variables “mpg” and “am”, where “mpg” is the measure of miles per gallon and “am” describes whether the car has an automatic or manual transmission (0 = automatic, 1 = manual).
# Let's see the first six rows of the data
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
# Now let's see summary descriptive statistics on variable "mpg"
summary(mtcars$mpg)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.40 15.42 19.20 20.09 22.80 33.90
# Now let's count how many of the 32 cars have automatic transmission and how many have manual
table(mtcars$am)
##
## 0 1
## 19 13
# Let's do boxplot of MPG against transmission type
boxplot(mpg ~ am, data=mtcars, xlab="Transmission (0 = Automatic, 1 = Manual)", ylab="Miles per Gallon", main="Boxplot of MPG vs. Transmission")
# Now let's see average MPG on automatic transmission cars and manual transmission cars separately
auto<-subset(mtcars, mtcars$am==0)
summary(auto$mpg)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.40 14.95 17.30 17.15 19.20 24.40
manual<-subset(mtcars, mtcars$am==1)
summary(manual$mpg)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15.00 21.00 22.80 24.39 30.40 33.90
# Now let's see how many cylinder types we have in this data set
table(mtcars$cyl)
##
## 4 6 8
## 11 7 14
According to our Exploratory Data Analysis (EDA), the data contains 11 variables for 32 different cars. Of these 32 cars, 19 have automatic transmission and the remaining 13 have manual transmission. The overall average MPG is 20.09, with a minimum of 10.40 and a maximum of 33.90. Looking at the descriptive statistics for each transmission type separately, manual transmission cars average 24.39 MPG, compared to 17.15 MPG for automatic transmission cars.
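As a quick cross-check of these group means, the same numbers can be computed with dplyr (loaded above); this is only a sketch of an alternative to the subset() calls, and its output is not reproduced here.
# Group the cars by transmission type and summarise MPG
mtcars %>%
  group_by(am) %>%
  summarise(cars = n(), mean_mpg = mean(mpg), sd_mpg = sd(mpg))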
2.1 Let’s fit a simple linear regression model to our data
# "mpg" as an outcome and "am"" as a predictor variable
fit1<-lm(mpg~factor(am), data = mtcars)
summary(fit1)
##
## Call:
## lm(formula = mpg ~ factor(am), data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## factor(am)1 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
As seen from the summary output of the linear regression, the intercept is the mean MPG of automatic transmission cars and the slope coefficient is the difference in average MPG for manual transmission cars compared to automatic transmission cars. Thus, 17.147 is the average MPG for automatic transmission cars, while 17.147 + 7.245 = 24.392 is the average MPG for manual transmission cars. This matches the average MPG values we obtained through the simple EDA above. However, this model only explains about 34% of the total variation (adjusted R-squared 0.3385); the rest is residual variation. So let’s look at the residual plot and then decide whether we should include more variables in the model.
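As a quick sanity check on this interpretation, the group means implied by the coefficients can be recovered directly from the fitted object (a sketch; output not reproduced here).
# Recover the implied group means from the fit1 coefficients
coef(fit1)[1]      # intercept = mean MPG for automatic transmission (~17.15)
sum(coef(fit1))    # intercept + slope = mean MPG for manual transmission (~24.39)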
2.2 Residual plots
library(knitr)
e1<-residuals(fit1)
x<-factor(mtcars$am)
g=ggplot(data.frame(x=x,y=e1),aes(x=x,y=y))
g=g+geom_hline(yintercept = 0,size=2)
g=g+geom_point(size=7,colour="black",alpha=0.4)
g=g+geom_point(size=5,colour="red",alpha=0.4)
g=g+ggtitle("fit1 - Residual plot")+xlab("Transmission Type")+ylab("Residuals")
g
According to the residual plot of fit1, there is a lot of residual variation within both transmission types, especially for manual transmission. This suggests we should include more variables in the model, which will likely explain some of this residual variation.
3.1 Let’s fit several models and see which model gives the best results
Looking at the correlation matrix and pairs plots in the APPENDIX section of this report, we will now include the variables cyl, drat, and qsec in nested multivariable models.
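To make that choice concrete, the correlations of mpg with the candidate predictors can be pulled out directly (a sketch using the Kendall method from the appendix; output not reproduced here).
# Correlation of mpg with candidate predictors (wt is considered later as well)
cor(mtcars$mpg, mtcars[, c("cyl", "drat", "qsec", "wt")], method = "kendall")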
# MPG as an outcome, am and cyl as predictors
fit2<-lm(mpg~factor(am)+factor(cyl), data = mtcars)
summary(fit2)
##
## Call:
## lm(formula = mpg ~ factor(am) + factor(cyl), data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.9618 -1.4971 -0.2057 1.8907 6.5382
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.802 1.323 18.752 < 2e-16 ***
## factor(am)1 2.560 1.298 1.973 0.058457 .
## factor(cyl)6 -6.156 1.536 -4.009 0.000411 ***
## factor(cyl)8 -10.068 1.452 -6.933 1.55e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.073 on 28 degrees of freedom
## Multiple R-squared: 0.7651, Adjusted R-squared: 0.7399
## F-statistic: 30.4 on 3 and 28 DF, p-value: 5.959e-09
# MPG as an outcome, am, cyl, and drat as predictors
fit3<-lm(mpg~factor(am)+factor(cyl)+drat, data = mtcars)
summary(fit3)
##
## Call:
## lm(formula = mpg ~ factor(am) + factor(cyl) + drat, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.938 -1.585 -0.211 1.911 6.536
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.9197 6.7812 3.527 0.00152 **
## factor(am)1 2.4381 1.6087 1.516 0.14125
## factor(cyl)6 -6.0768 1.6737 -3.631 0.00117 **
## factor(cyl)8 -9.9381 1.7711 -5.611 5.93e-06 ***
## drat 0.2385 1.7966 0.133 0.89539
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.129 on 27 degrees of freedom
## Multiple R-squared: 0.7653, Adjusted R-squared: 0.7305
## F-statistic: 22.01 on 4 and 27 DF, p-value: 3.606e-08
# MPG as an outcome, am, cyl, drat, and qsec as predictors
fit4<-lm(mpg~factor(am)+factor(cyl)+drat+qsec, data = mtcars)
summary(fit4)
##
## Call:
## lm(formula = mpg ~ factor(am) + factor(cyl) + drat + qsec, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.9914 -1.2239 -0.1982 1.8050 5.8845
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.6963 16.7136 0.819 0.4200
## factor(am)1 3.4540 2.2219 1.555 0.1321
## factor(cyl)6 -5.1320 2.2010 -2.332 0.0277 *
## factor(cyl)8 -8.0778 3.3010 -2.447 0.0215 *
## drat 0.5171 1.8622 0.278 0.7834
## qsec 0.4363 0.6506 0.671 0.5084
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.161 on 26 degrees of freedom
## Multiple R-squared: 0.7693, Adjusted R-squared: 0.7249
## F-statistic: 17.34 on 5 and 26 DF, p-value: 1.481e-07
According to the summary outputs of fit2, fit3, and fit4, including cyl as a predictor is significant and raises the adjusted R-squared to 0.7399, which means the model now explains about 74% of the total variation in MPG. However, including drat and qsec as additional predictors in these models does not help much and is not significant, so we will take them out. Still, guided by the correlation matrix in the appendix section, let’s include the wt and qsec variables as regressors and see what happens.
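One compact way to compare the models fitted so far (a sketch using the objects above; output not reproduced here) is to extract their adjusted R-squared values.
# Adjusted R-squared for each of the models fitted so far
sapply(list(fit1 = fit1, fit2 = fit2, fit3 = fit3, fit4 = fit4),
       function(m) summary(m)$adj.r.squared)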
# MPG as an outcome, am and wt as predictors
fit3<-lm(mpg~factor(am)+wt, data = mtcars)
summary(fit3)
##
## Call:
## lm(formula = mpg ~ factor(am) + wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5295 -2.3619 -0.1317 1.4025 6.8782
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.32155 3.05464 12.218 5.84e-13 ***
## factor(am)1 -0.02362 1.54565 -0.015 0.988
## wt -5.35281 0.78824 -6.791 1.87e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.098 on 29 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7358
## F-statistic: 44.17 on 2 and 29 DF, p-value: 1.579e-09
# MPG as an outcome, am, wt, and qsec as predictors
fit4<-lm(mpg~factor(am)+wt+qsec, data = mtcars)
summary(fit4)
##
## Call:
## lm(formula = mpg ~ factor(am) + wt + qsec, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4811 -1.5555 -0.7257 1.4110 4.6610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.6178 6.9596 1.382 0.177915
## factor(am)1 2.9358 1.4109 2.081 0.046716 *
## wt -3.9165 0.7112 -5.507 6.95e-06 ***
## qsec 1.2259 0.2887 4.247 0.000216 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared: 0.8497, Adjusted R-squared: 0.8336
## F-statistic: 52.75 on 3 and 28 DF, p-value: 1.21e-11
The summary output shows that including wt and qsec as regressors is significant, and the adjusted R-squared of 0.8336 tells us the model explains about 83% of the total variation in MPG. So now, let’s compare fit1 (the original simple linear regression), fit2 (the multivariable regression including cyl), and fit4 (the multivariable regression including both wt and qsec) using an ANOVA test.
anova(fit1, fit2, fit4)
## Analysis of Variance Table
##
## Model 1: mpg ~ factor(am)
## Model 2: mpg ~ factor(am) + factor(cyl)
## Model 3: mpg ~ factor(am) + wt + qsec
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 28 264.50 2 456.40 24.158 8.01e-07 ***
## 3 28 169.29 0 95.21
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
c(deviance(fit1),deviance(fit2),deviance(fit4))
## [1] 720.8966 264.4957 169.2859
shapiro.test(fit1$residuals)
##
## Shapiro-Wilk normality test
##
## data: fit1$residuals
## W = 0.98208, p-value = 0.8573
shapiro.test(fit2$residuals)
##
## Shapiro-Wilk normality test
##
## data: fit2$residuals
## W = 0.98002, p-value = 0.8001
shapiro.test(fit4$residuals)
##
## Shapiro-Wilk normality test
##
## data: fit4$residuals
## W = 0.9411, p-value = 0.08043
According to the ANOVA output, fit4 appears to be the best model, with the lowest residual sum of squares. Note that fit2 and fit4 have the same residual degrees of freedom, so no F-test is reported for that comparison; the deviance (residual sum of squares) comparison and the Shapiro-Wilk tests above, which do not reject normality of the residuals, suggest we can trust these results. The large reduction in residual sum of squares with fit4 supports our decision to select it as the best model.
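A complementary way to compare the three candidate models (a sketch, not part of the original output) is via information criteria, which do not require the models to be nested.
# AIC/BIC comparison of the candidate models (lower is better; output not shown)
AIC(fit1, fit2, fit4)
BIC(fit1, fit2, fit4)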
3.2 Residual plot on selected model
e2<-residuals(fit4)
x<-factor(mtcars$am)
g=ggplot(data.frame(x=x,y=e2),aes(x=x,y=y))
g=g+geom_hline(yintercept = 0,size=2)
g=g+geom_point(size=7,colour="black",alpha=0.4)
g=g+geom_point(size=5,colour="red",alpha=0.4)
g=g+ggtitle("fit4 - Residual plot")+xlab("Transmission Type")+ylab("Residuals")
g
As seen from the residual plot above, fit4 has much smaller residuals than fit1. Thus, based on the ANOVA, the residual normality tests, and the residual plot, we select fit4 as the best model.
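For completeness, R’s built-in regression diagnostics could also be inspected (a sketch; these plots are not reproduced in this report).
# Base-R diagnostic plots for fit4: residuals vs fitted, Q-Q, scale-location, leverage
par(mfrow = c(2, 2))
plot(fit4)
par(mfrow = c(1, 1))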
# Let's first create a data set of predictor values (taken from the original data) to predict on
tam<-mtcars$am
tqsec<-mtcars$qsec
twt<-mtcars$wt
ndata<-data.frame(am=factor(tam), qsec=tqsec, wt=twt)
# Predict with fit1 as a baseline to compare with fit4
p1<-predict(fit1, newdata = ndata, interval = "prediction")
# Predict with model 4
p4<-predict(fit4, newdata = ndata, interval = "prediction")
# Compare predictions
prds<-data.frame(p1=p1, p4=p4)
fits<-data.frame(p1=mean(prds$p1.fit),p4=mean(prds$p4.fit))
lwrs<-data.frame(p1=mean(prds$p1.lwr),p4=mean(prds$p4.lwr))
uprs<-data.frame(p1=mean(prds$p1.upr),p4=mean(prds$p4.upr))
ovrl<-rbind(fits, lwrs, uprs)
row.names(ovrl)<-c("Avg_fits", "Avg_lwr", "Avg_upr")
ovrl
## p1 p4
## Avg_fits 20.090625 20.09062
## Avg_lwr 9.771398 14.75029
## Avg_upr 30.409852 25.43096
According to the predictions, fit4 predicts the same average MPG as fit1 (which is also the overall average MPG in the mtcars data), but its lower and upper prediction bounds are narrower and more realistic than those of fit1.
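That claim about tighter prediction bounds can be checked directly by comparing the average prediction-interval widths (a sketch; output not reproduced here).
# Average width of the 95% prediction intervals (narrower = more precise)
c(fit1 = mean(p1[, "upr"] - p1[, "lwr"]),
  fit4 = mean(p4[, "upr"] - p4[, "lwr"]))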
p4<-data.frame(p4)
f<-p4$fit
x<-factor(tam)
g=ggplot(data.frame(x=x,y=f),aes(x=x,y=y))
g=g+geom_hline(yintercept = 0,size=2)
g=g+geom_point(size=7,colour="black",alpha=0.4)
g=g+geom_point(size=5,colour="red",alpha=0.4)
g=g+ggtitle("Predicted MPG values vs Transmission Type")+xlab("Transmission Type")+ylab("Predicted MPG")
g
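# Refit the selected model (mpg ~ am + wt + qsec) to report the final coefficients; note this reuses the object name fit1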
fit1<-lm(mpg~factor(am)+wt+qsec, data = mtcars)
summary(fit1)
##
## Call:
## lm(formula = mpg ~ factor(am) + wt + qsec, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4811 -1.5555 -0.7257 1.4110 4.6610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.6178 6.9596 1.382 0.177915
## factor(am)1 2.9358 1.4109 2.081 0.046716 *
## wt -3.9165 0.7112 -5.507 6.95e-06 ***
## qsec 1.2259 0.2887 4.247 0.000216 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared: 0.8497, Adjusted R-squared: 0.8336
## F-statistic: 52.75 on 3 and 28 DF, p-value: 1.21e-11
According to the plot, manual transmission cars often get better MPG than automatic transmission cars. Thus, we conclude that manual transmission is better for MPG.
Also, after adjusting for weight (wt) and quarter-mile time (qsec), the estimated MPG difference between automatic and manual transmissions is about 2.94 MPG in favor of manual (the factor(am)1 coefficient above).
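To put an uncertainty range around this point estimate (a supplementary sketch using the refit model above; output not reproduced here), a 95% confidence interval for the transmission coefficient can be computed with confint().
# 95% confidence interval for the manual-transmission coefficient, adjusted for wt and qsec
confint(fit1, "factor(am)1")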
APPENDIX
Let’s try to understand the details of the data
# Let's first quickly read about mtcars data on R help
?mtcars
# Now let's look at the variables names of the data
names(mtcars)
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
# Now let's see more details about the data: the number of observations, the number of variables, and the class of each variable
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
Let’s look at the correlations between variables that might aid us in model selection
# Let's first glance at the correlation table to see which variables have high correlation with variable mpg
cor(mtcars,use = "complete.obs", method = "kendall")
## mpg cyl disp hp drat wt
## mpg 1.0000000 -0.7953134 -0.7681311 -0.7428125 0.46454879 -0.7278321
## cyl -0.7953134 1.0000000 0.8144263 0.7851865 -0.55131785 0.7282611
## disp -0.7681311 0.8144263 1.0000000 0.6659987 -0.49898277 0.7433824
## hp -0.7428125 0.7851865 0.6659987 1.0000000 -0.38262689 0.6113081
## drat 0.4645488 -0.5513178 -0.4989828 -0.3826269 1.00000000 -0.5471495
## wt -0.7278321 0.7282611 0.7433824 0.6113081 -0.54714953 1.0000000
## qsec 0.3153652 -0.4489698 -0.3008155 -0.4729061 0.03272155 -0.1419881
## vs 0.5896790 -0.7710007 -0.6033059 -0.6305926 0.37510111 -0.4884787
## am 0.4690128 -0.4946212 -0.5202739 -0.3039956 0.57554849 -0.6138790
## gear 0.4331509 -0.5125435 -0.4759795 -0.2794458 0.58392476 -0.5435956
## carb -0.5043945 0.4654299 0.4137360 0.5959842 -0.09535193 0.3713741
## qsec vs am gear carb
## mpg 0.31536522 0.5896790 0.46901280 0.43315089 -0.50439455
## cyl -0.44896982 -0.7710007 -0.49462115 -0.51254349 0.46542994
## disp -0.30081549 -0.6033059 -0.52027392 -0.47597955 0.41373600
## hp -0.47290613 -0.6305926 -0.30399557 -0.27944584 0.59598416
## drat 0.03272155 0.3751011 0.57554849 0.58392476 -0.09535193
## wt -0.14198812 -0.4884787 -0.61387896 -0.54359562 0.37137413
## qsec 1.00000000 0.6575431 -0.16890405 -0.09126069 -0.50643945
## vs 0.65754312 1.0000000 0.16834512 0.26974788 -0.57692729
## am -0.16890405 0.1683451 1.00000000 0.77078758 -0.05859929
## gear -0.09126069 0.2697479 0.77078758 1.00000000 0.09801487
## carb -0.50643945 -0.5769273 -0.05859929 0.09801487 1.00000000
# Let's do pairs plots to aid ourselves in model selection.
pairs(mpg~cyl+drat+qsec,data = mtcars)
pairs(mpg~cyl+wt+qsec,data = mtcars)