We are looking at a data set of a collection of cars, and are interested in exploring the relationship between a set of variables and miles per gallon. In particularly the following two questions:
We’ll take the following steps:
First change ‘am’ to factor (0 = automatic, 1 = manual) And make cylinders a factor as well (since it is not continious)
library(ggplot2)
library(GGally)
library(dplyr)
library(ggfortify)
data(mtcars)
mtcarsFactors <- mtcars
mtcarsFactors$am <- as.factor(mtcarsFactors$am)
levels(mtcarsFactors$am) <- c("automatic", "manual")
mtcarsFactors$cyl <- as.factor(mtcarsFactors$cyl)
mtcarsFactors$gear <- as.factor(mtcarsFactors$gear)
mtcarsFactors$vs <- as.factor(mtcarsFactors$vs)
levels(mtcarsFactors$vs) <- c("V", "S")
Look at the dimensions & head of the dataset to get an idea
# Res 1
dim(mtcarsFactors)
## [1] 32 11
# Res 2
head(mtcarsFactors)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 V manual 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 V manual 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 S manual 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 S automatic 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 V automatic 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 S automatic 3 1
Now just take a look at the realtionship between the two parameters we are intereseted in.
# Figure 1
library(ggplot2)
p <- ggplot(mtcarsFactors, aes(am, mpg))
p + geom_boxplot(aes(fill = am))
Even this shows clearly that the manual transmissions have higher mpg’s there could be a bias in the dataset that we are overlooking. Before creating a model we should look at which paramters to include besides ‘am’. So we look at all correlations of parameters and take only those higher then the ‘am’ correlation.
# Res 3
cors <- cor(mtcars$mpg, mtcars)
orderedCors <- cors[,order(-abs(cors[1,]))]
orderedCors
## mpg wt cyl disp hp drat vs am carb gear
## 1.0000000 -0.8676594 -0.8521620 -0.8475514 -0.7761684 0.6811719 0.6640389 0.5998324 -0.5509251 0.4802848
## qsec
## 0.4186840
# Res 4
amPos <- which(names(orderedCors)=="am")
subsetColumns <- names(orderedCors)[1:amPos]
subsetColumns
## [1] "mpg" "wt" "cyl" "disp" "hp" "drat" "vs" "am"
# Figure 2
mtcarsFactors[,subsetColumns] %>%
ggpairs(
mapping = ggplot2::aes(color = am),
upper = list(continuous = wrap("cor", size = 3)),
lower = list(continuous = wrap("smooth", alpha=0.4, size=1), combo = wrap("dot"))
)
Now that we have seen that mpg has many other (stronger) correlations than just ‘am’ we can guess that a model predicting the mpg solely on this parameter will not be the most accurate model. Let’s check this out.
First we start with the basic model
# Res 5
basicFit <- lm(mpg ~ am, mtcarsFactors)
summary(basicFit)
##
## Call:
## lm(formula = mpg ~ am, data = mtcarsFactors)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## ammanual 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
The p-values are actually quite low, the R-squared is problematic however. Now go to the other side of the spectrum by fitting all parameters of mtcars
# Res 6
totalFit <- lm(mpg ~ ., mtcarsFactors)
summary(totalFit)
##
## Call:
## lm(formula = mpg ~ ., data = mtcarsFactors)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2015 -1.2319 0.1033 1.1953 4.3085
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.09262 17.13627 0.881 0.3895
## cyl6 -1.19940 2.38736 -0.502 0.6212
## cyl8 3.05492 4.82987 0.633 0.5346
## disp 0.01257 0.01774 0.708 0.4873
## hp -0.05712 0.03175 -1.799 0.0879 .
## drat 0.73577 1.98461 0.371 0.7149
## wt -3.54512 1.90895 -1.857 0.0789 .
## qsec 0.76801 0.75222 1.021 0.3201
## vsS 2.48849 2.54015 0.980 0.3396
## ammanual 3.34736 2.28948 1.462 0.1601
## gear4 -0.99922 2.94658 -0.339 0.7382
## gear5 1.06455 3.02730 0.352 0.7290
## carb 0.78703 1.03599 0.760 0.4568
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.616 on 19 degrees of freedom
## Multiple R-squared: 0.8845, Adjusted R-squared: 0.8116
## F-statistic: 12.13 on 12 and 19 DF, p-value: 1.764e-06
The R-squared has improved, but the p-values hardly show any significance anymore. Perhaps this is due to overfitting. We now have to meet somewhere in the middle. Let’s iterare using the step method
# Res 7
bestFit <- step(totalFit,direction="both",trace=FALSE)
summary(bestFit)
##
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcarsFactors)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4811 -1.5555 -0.7257 1.4110 4.6610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.6178 6.9596 1.382 0.177915
## wt -3.9165 0.7112 -5.507 6.95e-06 ***
## qsec 1.2259 0.2887 4.247 0.000216 ***
## ammanual 2.9358 1.4109 2.081 0.046716 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared: 0.8497, Adjusted R-squared: 0.8336
## F-statistic: 52.75 on 3 and 28 DF, p-value: 1.21e-11
The resulting best model mpg ~ wt + qsec + am
is actually dependant on the transmission (am), but also weight (wt) and 1/4 mile time (qsec). All have significant p-values The R-squared is pretty good to (0.85)
Now let’s look (amongst others) at the Residuals vs Fitted
# Figure 3
autoplot(bestFit)
The ‘Normal Q-Q’ plot looks ok, but the ‘Residuals vs Fitted’ and ‘Scale-Location’ both show worrysome trends
The question “Is an automatic or manual transmission better for MPG” can be answered because all models (#Res5, #Res6 and #Res7) show that, holding all other paramters constant, manual transmission will increase your MPG.
The question “Quantify the MPG difference between automatic and manual transmissions” is harder to answer.
Based on the ‘bestFit’ (#Res7) model mpg ~ wt + qsec + am
we could conclude that (with a p < 0.05 confidence) cars with manual transmission have 2.9358 (say 3) more miles per gallon than automatic transmissions. The model seems clean with a p < 0.05 and R squared of 0.85
The residuals vs fitted chart however warns us that there is something missing in our model. The real problem I think is that we only have 32 observations to train on (#Res1) and that observations hardly have overlap on the parameters ‘wt’ and ‘qsec’ (amongst others) if we look at the diagonal in the matrix chart (#Figure2)
Although the conclusion of ca. 3 mpg better performance on manual transmissions seems feasible, I cannot with confidence conclude that this model will fit all future observations.