Looking at a data set of a collection of cars, we are interested in exploring the relationship between a set of variables and miles per gallon (MPG).
We are particularly interested in the following two questions:
- Is an automatic or manual transmission better for MPG?
- What is the MPG difference between automatic and manual transmissions?
After performing a regression analysis with MPG as an outcome and the type of transmission as a predictor, we have determined that a manual transmission is better for fuel consumption and that vehicles fitted with manual transmission tend to achieve higher MPG than vehicles fitted with automatic transmissions.
Our calculations and the linear model predict that manual transmissions achieve on average an estimated 7.24 mpg more than automatic transmissions, with the average consumption of 17.15 mpg for automatic transmissions and 24.39 mpg for manual transmissions.
library(dplyr)
library(GGally)
cardata <- mtcars
dim(cardata)
## [1] 32 11
head(cardata)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
summary(cardata)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
class(cardata$mpg)
## [1] "numeric"
class(cardata$am)
## [1] "numeric"
#As the transmission variable "am" was is numeric, we transform it to a factor
cardata$am <- as.factor(cardata$am)
class(cardata$am)
## [1] "factor"
Because the outcome variable, “mpg”“, is a continuous variable, and the transmission-type variable,”am“”, is a factor variable, a boxplot best represents the data
with(cardata, plot(am, mpg, main = "MPG by Transmission", ylab = "Miles per Gallon", xlab = "Automatic : 0 Manual : 1"))
From this boxplot we see that the average mpg for automatic transmissions is around 17 mpg, and that of manual transmission is around 23 mpg.
We use a standard linear regression model, as the outcome/dependent variable, “mpg”, is a continuous variable, and the predictor/independent variable, “am”, is a factor variable.
Because our predictor variable, “am”, is a factor, we can remove the intercept.
lm_mpg <- lm(mpg ~ am - 1, data = cardata)
round(quantile(lm_mpg$residuals), 2)
## 0% 25% 50% 75% 100%
## -9.39 -3.09 -0.30 3.24 9.51
The residuals are approximately normally distributed.
summary(lm_mpg)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## am0 17.14737 1.124603 15.24749 1.133983e-15
## am1 24.39231 1.359578 17.94109 1.376283e-17
The “am0” indicates transmission type 0 (Automatic), and “am1” indicates transmission type 1 (Manual).
The coefficients have a low standard error, and the extremely low probability of the t-values indicate that the coefficients are statistically significant.
round(summary(lm_mpg)$r.squared, 3)
## [1] 0.949
The R-squared value is quite high, and indicates that using only type of transmission in the fitted model accounts for roughly 95% of the variation in miles per gallon.
Considering the coefficients of this model:
lm_mpg$coefficients
## am0 am1
## 17.14737 24.39231
“am0” indicates Automatic and “am1” indicates Manual. The coefficient values provided is the average MPG for each of the two types of transmission.
From the values presented in the coefficient table, we see that automatic transmissions have an average consumption of 17.15 mpg, and manual transmissions have an average consumption of 24.39 mpg.
confint(lm_mpg)
## 2.5 % 97.5 %
## am0 14.85062 19.44411
## am1 21.61568 27.16894
Considering the confidence intervals we can, with 95% confidence, say that the average MPG for automatic transmissions is in the range of 14.85 mpg to 19.44 mpg, and the average MPG for manual transmissions is in the range of 21.62 mpg to 27.17 mpg.
Anova computes the analysis of variance (or deviance) tables for one or more fitted model objects.
anova(lm_mpg)
## Analysis of Variance Table
##
## Response: mpg
## Df Sum Sq Mean Sq F value Pr(>F)
## am 2 13321.4 6660.7 277.18 < 2.2e-16 ***
## Residuals 30 720.9 24.0
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The Anova table confirms that our predictor variable, “am”, is statistically significant, and provides us with the breakdown of variance in the form of the SSR (sum of squared residuals).
We will identify points that have high leverage, and that have the potential to affect the outcome of the model.
round(summary(hatvalues(lm_mpg)),3)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.053 0.053 0.053 0.062 0.077 0.077
Judging by the summary output of the hat values we can see that there are no problematic data points, and that the range from minimum to maximum is quite narrow. It indicates that all data points have approximately the same leverage on the outcome of the model.
We will identify points that have high influence the outcome of the model.
These data points would most likely show up in the regression model’s diagnostic plots below.
influential <- as.data.frame(round(dfbetas(lm_mpg), 2))
influential$Vehicle <- row.names(dfbetas(lm_mpg))
influential <- select(influential, c(Vehicle, am1))
influential <- arrange(influential, desc(am1))
The three datapoints with the most positive leverage are:
head(influential, 2)
## Vehicle am1
## 1 Toyota Corolla 0.62
## 2 Fiat 128 0.51
The three datapoints with the most negative leverage are:
tail(influential, 2)
## Vehicle am1
## 31 Ford Pantera L -0.55
## 32 Maserati Bora -0.61
By plotting the various regression statistics against each other, we can diagnose the various factors at play in the selected model.
The near-horizontal level of the red line shows that the model fit is good.
plot(lm_mpg, 1)
(Although not as much as one would ideally require.)
plot(lm_mpg, 2)
One can clearly see that the influential points identified earlier have an impact on the residuals of the model.
plot(lm_mpg, 3)
For more accurate predication or model outcomes, the influential outliers are to be removed.
plot(lm_mpg, 4)
As the red line is nearly horizontal, we can see that the leverage for all points are approximately the same.
plot(lm_mpg, 5)
Other linear regressom model parameters had to be considered, as the model in scope of the questions posed only had an R-squared value of 0.949, i.e. the type of transmission only explained 94.9% of the variance in MPG. Considering all the various factors that could affect MPG, it is logical to combine these factors with the type of transmission used to see if the R-squared value can be increased, therefore accounting for more of the variance in the MPG outcome.
Looking at the data, we can see that there are other variables that could have a potenetial impact on the MPG outcome:
head(cardata, 2)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21 6 160 110 3.9 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4
Potential variables include:
- Displacement (disp)
- Weight (wt)
- Number of gears (gear)
We will be constructing a couple of additional models, adding each of these variables as predictors along with the type of transmission, and then see which of the models can predict MPG the best with the given data.
lm_td <- lm(mpg ~ am + disp - 1, data = cardata)
summary(lm_td)
##
## Call:
## lm(formula = mpg ~ am + disp - 1, data = cardata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.6382 -2.4751 -0.5631 2.2333 6.8386
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## am0 27.848081 1.834071 15.184 2.45e-15 ***
## am1 29.681539 1.218689 24.355 < 2e-16 ***
## disp -0.036851 0.005782 -6.373 5.75e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.218 on 29 degrees of freedom
## Multiple R-squared: 0.9786, Adjusted R-squared: 0.9764
## F-statistic: 442.4 on 3 and 29 DF, p-value: < 2.2e-16
The model shows that, with the addition of displacement to type of transmission, the R-squared value increases somewhat, and we can see that the displacement variable also has a significant impact on the outcome.
lm_tdw <- lm(mpg ~ am + disp + wt - 1, data = cardata)
summary(lm_tdw)
##
## Call:
## lm(formula = mpg ~ am + disp + wt - 1, data = cardata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4890 -2.4106 -0.7232 1.7503 6.3293
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## am0 34.675911 3.240609 10.700 2.12e-11 ***
## am1 34.853635 2.376440 14.666 1.14e-14 ***
## disp -0.017805 0.009375 -1.899 0.0679 .
## wt -3.279044 1.327509 -2.470 0.0199 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.967 on 28 degrees of freedom
## Multiple R-squared: 0.9824, Adjusted R-squared: 0.9799
## F-statistic: 391.7 on 4 and 28 DF, p-value: < 2.2e-16
By adding weight along with the other two variables, the R-squared value increases slightly, but displacement’s effect on the outcome isn’t as significant compared to the other variables.
That being said, this is the model with the highest R-squared value.
Note that the R-squared value only shows how much of the variance is explained by the predictor variables in the model, and should not be the sole consideration or deciding factor for model fit.
lm_dwg <- lm(mpg ~ disp + wt + gear - 1, data = cardata)
summary(lm_dwg)
##
## Call:
## lm(formula = mpg ~ disp + wt + gear - 1, data = cardata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.5520 -2.1574 0.4345 2.9138 10.3726
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## disp -0.02933 0.01587 -1.848 0.0748 .
## wt 1.85260 1.69342 1.094 0.2830
## gear 5.55331 0.67944 8.173 5.18e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.073 on 29 degrees of freedom
## Multiple R-squared: 0.9468, Adjusted R-squared: 0.9413
## F-statistic: 172.2 on 3 and 29 DF, p-value: < 2.2e-16
The R-squared value is high, explaining 94.7% of the variance in the MPG outcome, but the other models have more significant variables and higher R-squared values.
We will run an Anova diagnostic (as an example) on the model with the highest R-squared value:
anova(lm_tdw)
## Analysis of Variance Table
##
## Response: mpg
## Df Sum Sq Mean Sq F value Pr(>F)
## am 2 13321.4 6660.7 756.4188 < 2.2e-16 ***
## disp 1 420.6 420.6 47.7669 1.637e-07 ***
## wt 1 53.7 53.7 6.1013 0.01987 *
## Residuals 28 246.6 8.8
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From this Anova analysis we can see that all the variables are statistically significant.
Anova can be used when you next multiple models to compare them to see the effect each new additional variable (as added during the nesting process using the update function) has on each other.
As we are using many variables to determine the MPG outcome, we need to ensure that there isn’t multicollinearity.
There is a chance that there is a linear relationship between displacement and weight.
Let’s plot the pairs to see the relationship between the variables:
g <- ggpairs(cardata[, c(1,3,6,10)], lower = list(continuous = "smooth", method = c("loess")))
g
From this we can see that some variables have a linear relationship with each other, and thus we have multicollinearity.
When there is an intercept in the data we can use the VIF function to determine what variance inflation the multicollinear variables have caused. (The vif function can be found in the “car” or “HH” packages)
For the VIF example we will NOT remove the intercept, else the VIF output is not sensible.
testvif <- lm(mpg ~ am + disp + wt, data = cardata)
car::vif(testvif)
## am disp wt
## 1.931264 4.752561 5.939675
These show that adding displacement will increase variance with a factor of around 4.7, and by adding weight it will increase by a factor of around 5.9.