You work for Motor Trend, a magazine about the automobile industry. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome). They are particularly interested in the following two questions:
Is an automatic or manual transmission better for MPG
Quantify the MPG difference between automatic and manual transmissions
A few models were tested including simple linear regression of mpg ~ am, multivariate regression, mpg~. and the final model where only significant variable were chosen. The final model showed that manual transmission cars on average have 1.55 miles per gallon more than automatic cars.
The mtcars dataset has 32 observations of 11 variables. All the data is represented as numeric however some variables are better represented as factor variables.
library(dplyr)
library(ggplot2)
carData <- mtcars
str(carData)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
This sections converts the relevant variables into factor variables
carData$vs <- as.factor(carData$vs)
carData$am <- as.factor(carData$am)
Summary of the data
summary(carData)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs am
## Min. :2.760 Min. :1.513 Min. :14.50 0:18 0:19
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1:14 1:13
## Median :3.695 Median :3.325 Median :17.71
## Mean :3.597 Mean :3.217 Mean :17.85
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90
## Max. :4.930 Max. :5.424 Max. :22.90
## gear carb
## Min. :3.000 Min. :1.000
## 1st Qu.:3.000 1st Qu.:2.000
## Median :4.000 Median :2.000
## Mean :3.688 Mean :2.812
## 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :5.000 Max. :8.000
ggplot(carData,aes(x=am,y=mpg,fill=am))+geom_boxplot()
From the box plot we can see that there is a significant difference in the means between the two. ## Simple Linear Regression
First we do simple linear regression of am against mpg
fit <- lm(mpg~am,data=carData)
summary(fit)
##
## Call:
## lm(formula = mpg ~ am, data = carData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## am1 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
The intercept represents the mean of mpg of automatic cars while the slope is the difference between manual cars and automatic cars. The adjusted r-squared value is only 0.338 meaning that only 33.8% of the regression variance is explained by the model therefore there may be other significant variables which contribute to the model.
First, a model including all variables is fitted so as to determine the significant variables.
fit2 <- lm(mpg~.,data=carData)
summary(fit2)
##
## Call:
## lm(formula = mpg ~ ., data = carData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4506 -1.6044 -0.1196 1.2193 4.6271
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.30337 18.71788 0.657 0.5181
## cyl -0.11144 1.04502 -0.107 0.9161
## disp 0.01334 0.01786 0.747 0.4635
## hp -0.02148 0.02177 -0.987 0.3350
## drat 0.78711 1.63537 0.481 0.6353
## wt -3.71530 1.89441 -1.961 0.0633 .
## qsec 0.82104 0.73084 1.123 0.2739
## vs1 0.31776 2.10451 0.151 0.8814
## am1 2.52023 2.05665 1.225 0.2340
## gear 0.65541 1.49326 0.439 0.6652
## carb -0.19942 0.82875 -0.241 0.8122
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared: 0.869, Adjusted R-squared: 0.8066
## F-statistic: 13.93 on 10 and 21 DF, p-value: 3.793e-07
cor(mtcars)[1,]
## mpg cyl disp hp drat wt
## 1.0000000 -0.8521620 -0.8475514 -0.7761684 0.6811719 -0.8676594
## qsec vs am gear carb
## 0.4186840 0.6640389 0.5998324 0.4802848 -0.5509251
From the above analysis, we can see that cyl,disp,hp,wt,am show strong correlations for the model therefore we will use them and remove the rest of the variables for the new model.
fit3 <- lm(mpg~cyl+disp+hp+wt+am,data=carData)
summary(fit3)
##
## Call:
## lm(formula = mpg ~ cyl + disp + hp + wt + am, data = carData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5952 -1.5864 -0.7157 1.2821 5.5725
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.20280 3.66910 10.412 9.08e-11 ***
## cyl -1.10638 0.67636 -1.636 0.11393
## disp 0.01226 0.01171 1.047 0.30472
## hp -0.02796 0.01392 -2.008 0.05510 .
## wt -3.30262 1.13364 -2.913 0.00726 **
## am1 1.55649 1.44054 1.080 0.28984
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.505 on 26 degrees of freedom
## Multiple R-squared: 0.8551, Adjusted R-squared: 0.8273
## F-statistic: 30.7 on 5 and 26 DF, p-value: 4.029e-10
This model has a higher adjusted r-squared value as compared to the second model thus is preferred. Residuals are also plotted to examine heteroskedacity.
par(mfrow=c(2,2))
plot(fit3)
The residuals have approximately the same variance and are normally distributed (as seen in QQ plot).
The final model explains the most variance out of the 3 model we have tested. Manual transmission cars on average have 1.55 miles per gallon more than automatic cars.
library(GGally)
ggpairs(mtcars)