This report uses the mtcars data set, a collection of cars, to explore the relationship between a set of variables and miles per gallon (MPG), the outcome. In particular, it seeks to answer two questions: is an automatic or manual transmission better for MPG, and how large is the MPG difference between the two?
Steps taken: 1. Load and process the dataset. 2. Exploratory analysis of the dataset. 3. Model selection among various regression models. 4. Analysis of the residuals for the best-fit model.
# Loading libraries
library(ggplot2)
library(GGally)
library(dplyr)
# Loading mtcars dataset
data(mtcars)
mtcarsData <- mtcars
# Convert the categorical (non-continuous) variables into factors
mtcarsData$am <- as.factor(mtcarsData$am)
levels(mtcarsData$am) <- c("Auto", "Manual")
mtcarsData$vs <- as.factor(mtcarsData$vs)
levels(mtcarsData$vs) <- c("V", "S")
mtcarsData$cyl <- as.factor(mtcarsData$cyl)
mtcarsData$gear <- as.factor(mtcarsData$gear)
mtcarsData$carb <- as.factor(mtcarsData$carb)
# Structure of the dataset (dimensions and variable types)
str(mtcarsData)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : Factor w/ 2 levels "V","S": 1 1 2 2 1 2 1 2 2 2 ...
## $ am : Factor w/ 2 levels "Auto","Manual": 2 2 2 1 1 1 1 1 1 1 ...
## $ gear: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
## $ carb: Factor w/ 6 levels "1","2","3","4",..: 4 4 1 1 2 1 4 2 2 4 ...
# Look at the first 6 observations of the dataset
head(mtcarsData)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 V Manual 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 V Manual 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 S Manual 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 S Auto 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 V Auto 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 S Auto 3 1
We’ll begin by looking at the pair-wise relationship between the variables in the dataset. The figure below shows scatterplots produced by plotting each variable against all others.
pairs(mtcarsData, panel=panel.smooth, pch=5, cex=0.5, gap=0.3, lwd=3, las=1, cex.axis=0.8)
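The scatterplots can be complemented with numbers. The following is a minimal sketch, assuming the original numeric coding of mtcars (before the factor conversion above), that ranks the variables by the absolute value of their Pearson correlation with mpg. The names `cor_with_mpg` and `ranked` are illustrative and not used elsewhere in this report.

```r
# Use the original numeric coding so that cor() applies to every column
data(mtcars)

# Pearson correlation of each variable with mpg (dropping mpg itself)
cor_with_mpg <- cor(mtcars)[, "mpg"]
cor_with_mpg <- cor_with_mpg[names(cor_with_mpg) != "mpg"]

# Rank by strength of association, strongest first
ranked <- cor_with_mpg[order(abs(cor_with_mpg), decreasing = TRUE)]
print(round(ranked, 3))
```

Because cyl, gear and carb are treated as numeric here, the ranking is only a rough guide to which variables deserve a closer look in the regression models.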
We’ll also look at the relationship between transmission type and mpg.
ggplot(mtcarsData, aes(am, mpg, colour = am)) +
geom_boxplot() + theme(legend.position = "right") + ggtitle("Relationship between MPG and Transmission type (am)") +
theme(plot.title = element_text(lineheight = 1, face = "bold", size = 14))
We can observe that manual transmission cars have higher MPG than automatic transmission cars.
fitAM <- lm(mpg ~ am, mtcarsData)
summary(fitAM)
##
## Call:
## lm(formula = mpg ~ am, data = mtcarsData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## amManual 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
While the p-value is low, the adjusted R-squared is only 33.85%: transmission type (am) alone accounts for roughly a third of the variance in mpg (multiple R-squared 35.98%), so we need to look at how the other variables affect mpg.
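To make the coefficients of the single-predictor model concrete, here is a small sketch showing that the intercept is the mean mpg of the automatic cars and the amManual coefficient is the difference between the two group means. It also checks that the p-value matches a pooled-variance two-sample t-test. The names `d`, `group_means`, `mean_diff`, `tt` and `p_lm` are illustrative only.

```r
data(mtcars)
d <- mtcars
d$am <- factor(d$am, labels = c("Auto", "Manual"))

# Mean mpg per transmission type
group_means <- tapply(d$mpg, d$am, mean)
mean_diff <- unname(group_means["Manual"] - group_means["Auto"])

# Same single-predictor model as in the report
fit_am <- lm(mpg ~ am, data = d)

# A pooled-variance t-test is equivalent to the t-test on the am coefficient
tt <- t.test(mpg ~ am, data = d, var.equal = TRUE)
p_lm <- coef(summary(fit_am))["amManual", "Pr(>|t|)"]

print(group_means)
print(c(slope = coef(fit_am)[["amManual"]], mean_diff = mean_diff))
print(c(p_lm = p_lm, p_ttest = tt$p.value))
```

This equivalence holds only for the equal-variance test; the default Welch test in `t.test()` would give a somewhat different p-value.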
fitAll <- lm(mpg ~ ., mtcarsData)
summary(fitAll)
##
## Call:
## lm(formula = mpg ~ ., data = mtcarsData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5087 -1.3584 -0.0948 0.7745 4.6251
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.87913 20.06582 1.190 0.2525
## cyl6 -2.64870 3.04089 -0.871 0.3975
## cyl8 -0.33616 7.15954 -0.047 0.9632
## disp 0.03555 0.03190 1.114 0.2827
## hp -0.07051 0.03943 -1.788 0.0939 .
## drat 1.18283 2.48348 0.476 0.6407
## wt -4.52978 2.53875 -1.784 0.0946 .
## qsec 0.36784 0.93540 0.393 0.6997
## vsS 1.93085 2.87126 0.672 0.5115
## amManual 1.21212 3.21355 0.377 0.7113
## gear4 1.11435 3.79952 0.293 0.7733
## gear5 2.52840 3.73636 0.677 0.5089
## carb2 -0.97935 2.31797 -0.423 0.6787
## carb3 2.99964 4.29355 0.699 0.4955
## carb4 1.09142 4.44962 0.245 0.8096
## carb6 4.47757 6.38406 0.701 0.4938
## carb8 7.25041 8.36057 0.867 0.3995
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.833 on 15 degrees of freedom
## Multiple R-squared: 0.8931, Adjusted R-squared: 0.779
## F-statistic: 7.83 on 16 and 15 DF, p-value: 0.000124
As expected, the adjusted R-squared (77.9%) is higher than that of the previous model. However, none of the individual coefficients is significant at the 5% level, so we can likely get a better fit with fewer variables.
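One way to see why the full model is a candidate for simplification is to compare the number of estimated coefficients with the number of observations. The sketch below refits the full model on a copy of the data with the same factor conversions; the names `d`, `factor_cols`, `n_obs`, `n_coef` and `resid_df` are illustrative.

```r
data(mtcars)
d <- mtcars

# Same categorical variables as in the report, converted to factors
factor_cols <- c("cyl", "vs", "am", "gear", "carb")
d[factor_cols] <- lapply(d[factor_cols], factor)

fit_all <- lm(mpg ~ ., data = d)

n_obs <- nrow(d)                  # observations
n_coef <- length(coef(fit_all))   # intercept plus slopes and dummy variables
resid_df <- df.residual(fit_all)  # degrees of freedom left for the residuals

print(c(n_obs = n_obs, n_coef = n_coef, resid_df = resid_df))
print(c(r2 = summary(fit_all)$r.squared,
        adj_r2 = summary(fit_all)$adj.r.squared))
```

With so few residual degrees of freedom, the gap between the multiple and adjusted R-squared is large, which is the penalty the adjusted statistic applies for the extra terms.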
fitBest <- step(fitAll,direction="both",trace=FALSE)
summary(fitBest)
##
## Call:
## lm(formula = mpg ~ cyl + hp + wt + am, data = mtcarsData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9387 -1.2560 -0.4013 1.1253 5.0513
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.70832 2.60489 12.940 7.73e-13 ***
## cyl6 -3.03134 1.40728 -2.154 0.04068 *
## cyl8 -2.16368 2.28425 -0.947 0.35225
## hp -0.03211 0.01369 -2.345 0.02693 *
## wt -2.49683 0.88559 -2.819 0.00908 **
## amManual 1.80921 1.39630 1.296 0.20646
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared: 0.8659, Adjusted R-squared: 0.8401
## F-statistic: 33.57 on 5 and 26 DF, p-value: 1.506e-10
The selected model, mpg ~ cyl + hp + wt + am, includes four variables, and transmission type (am) is retained among them. Its adjusted R-squared of 84.01% improves on the full model's 77.9%, and the overall model p-value is small (1.506e-10). Not every coefficient is individually significant at the 5% level, however: cyl6, hp and wt are, while cyl8 (p = 0.352) and amManual (p = 0.206) are not.
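As a complement to the summary table, the following sketch compares the transmission-only model with the selected model using a partial F-test and computes a 95% confidence interval for the amManual coefficient. It refits both models on a copy of the data; `d`, `cmp` and `ci_am` are illustrative names.

```r
data(mtcars)
d <- mtcars
d$am <- factor(d$am, labels = c("Auto", "Manual"))
d$cyl <- factor(d$cyl)

fit_am <- lm(mpg ~ am, data = d)
fit_best <- lm(mpg ~ cyl + hp + wt + am, data = d)

# Partial F-test: do cyl, hp and wt add explanatory power beyond am?
cmp <- anova(fit_am, fit_best)
print(cmp)

# 95% confidence interval for the adjusted manual-vs-automatic difference
ci_am <- confint(fit_best, "amManual", level = 0.95)
print(ci_am)
```

An interval that spans zero is consistent with the amManual p-value of 0.206 reported in the summary above.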
Exploring this model further:
par(mfrow = c(2,2))
plot(fitBest)
According to the above plots:
Given the small sample of just 32 observations, we should treat the estimates with caution. Despite this, the model fitBest appears to be quite a good fit: it has a high adjusted R-squared, and the residual plots also support the fit.
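The visual diagnostics can be backed up with a few numeric checks. This is a minimal sketch, not part of the original analysis, that runs a Shapiro-Wilk normality test on the residuals and lists the observations with the highest leverage and Cook's distance. The names `d`, `res`, `sw`, `lev` and `cook` are illustrative.

```r
data(mtcars)
d <- mtcars
d$am <- factor(d$am, labels = c("Auto", "Manual"))
d$cyl <- factor(d$cyl)

fit_best <- lm(mpg ~ cyl + hp + wt + am, data = d)

# Residual normality (Shapiro-Wilk); a large p-value gives no evidence against normality
res <- resid(fit_best)
sw <- shapiro.test(res)
print(sw)

# Leverage (hat values) and influence (Cook's distance), largest first
lev <- hatvalues(fit_best)
cook <- cooks.distance(fit_best)
print(head(sort(lev, decreasing = TRUE), 3))
print(head(sort(cook, decreasing = TRUE), 3))
```

With only 32 observations the normality test has limited power, so it is best read alongside the Q-Q plot rather than in place of it.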
Back to the questions:
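Based on the analysis above, manual transmission cars give better MPG. In the best-fit model the difference is 1.8 miles per gallon: holding cyl, hp and wt fixed, manual cars travel about 1.8 miles more per gallon than automatic cars. This adjusted difference is not statistically significant at the 5% level (p = 0.206), whereas the unadjusted difference from the am-only model is 7.2 miles per gallon.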