You work for Motor Trend, a magazine about the automobile industry. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome). They are particularly interested in the following two questions:
“Is an automatic or manual transmission better for MPG”
“Quantify the MPG difference between automatic and manual transmissions”
We are going to fit a logistic GLM to our data. The transmission variable will be our outcome and the miles per gallon variable our predictor. This will allow us to treat the former as a binary outcome and therefore apply a logistic model to it.
We are going to use the mtcars dataframe in the datasets library to do our analysis. From the documentation (see ?? mtcars), we learn that the code used for our categorical variable am is: Transmission (0 = automatic, 1 = manual. So let’s modify our variable to reflect that
There is a clear positive trend here. Let us verify this with a regression model
The logistic model looks like this: \[\ log(odds) = b_0 + b_1x \]
\[\ log (odds) = \frac {p} {1-p} \]
So we can see the clear linear trend between the log odds and the regressor variable. The model appears to explain almost all the variation but when we check the residual plot we find a pattern. I do not understand why this is.
There is a strange pattern in the residual plot which could indicate that the fit is not perfect even though it looks that way in the linear plot above.
We check our diagnostics to make sure we have a good fit
Owing to the kind of model we have (we have reversed the roles of the predictor and the outcome), it would not really make sense to explore the effect of more variables on the fit. As far as we can tell, what is important is how the transmission type affects miles per gallon; not how miles per gallon and other variables relate to the transmission. This would require, perhaps, an entirely different kind of model - one that takes a binary variable as a predictor and spits out a continuous variable as an outcome. It is not entirely clear how to do that at this stage.
This may require a different model from the one we have above. However, it might be useful to state what we have uncovered using our model.
Our fitted logistic GLM looks like this:
\[\ log(odds) = -6.6035 + 0.307mpg \] \[\ log(odds) = log(\frac {manual} {automatic}) \]
We can therefore say that for every unit increase in miles per gallon (mpg), there is a 0.3070282 increase in the log odds. The intercept -6.6035267 is the value of the log odds when mpg = 0. It is not really as important as the slope.
The first thing to note is that our results are statistically significant. Looking at the p values above, they are smaller than 0.5 so we can do inference with our model.
Second, from our plot A1 we can see visually that the probability that a car has an manual (1) transmission increases with increase in miles per gallon (mpg). That means, conversely, that an automatic transmission is better for mpg.
library(tidyverse)
library(datasets)
#Modifying am column
am_fact = sapply(mtcars$am, function(x){if (x == 1) {x = "Manual"} else {x = "Automatic"}})
mtcars = mutate(datasets::mtcars, am_fact = am_fact)
##Transmision variable levels
ggplot(data.frame(x = 1: nrow(mtcars), y = mtcars$am_fact), aes(x, y)) +
labs(title = "Transmission Variable Levels", tag = "A0") +
ylab("Transmission type") +
geom_point()
##Boxplot
ggplot(mtcars, aes(am_fact, mpg)) +
geom_boxplot()
#GLM
glm1 = glm(data = mtcars, am ~ mpg, family = 'binomial')
# Calculating the log odds
log_vals = log(glm1$fitted.values / (1 - glm1$fitted.values))
## Coefficients
coeff = glm1$coefficients
#Fitted Points Plot
g = ggplot(data = mtcars, aes(mpg, am)) +
geom_point() +
ylab("Probability (p)") +
labs(title = "Fitted Points Plot", tag = "A1") +
geom_smooth(method = 'lm', formula = y ~ x)
g + geom_point(aes(y = glm1$fitted.values, x = mpg), colour = 'Purple')
#GLM Plot
ggplot(data = data.frame(y = log_vals, mpg = mtcars$mpg), aes(mpg, log_vals)) +
geom_point(colour = 'Orange') +
labs(title = "GLM Plot", tag = "A2") +
ylab("Log Odds") +
geom_abline(intercept = coeff[1], slope = coeff[2])
#Residual Plots
ggplot(data = data.frame(e = resid(glm1), mpg = mtcars$mpg), aes(mpg, e)) +
labs(title = "Residual Plot", tag = "A3") +
ylab("Residuals") +
geom_point()
#Diagnostics
plot(glm1)
#Significance
t = summary(glm1)$coefficients
barplot(t[7:8], main = "Significance", horiz = TRUE, col = "steelblue",
xlab = "p-value", ylab = "Coefficients"
)