Using regression and the mtcars dataset in RStudio, we give a possible answer to this question.
The linear regression coefficient is 7.245, suggesting that a manual transmission increases mpg by 7.245 compared to 17.147 mpg with an automatic. The nested model fit shows that transmission type does not have a significant effect when combined with the other nine variables. The generalized linear model suggests that a manual transmission increases mpg by 3.5%. The ANOVA of the confounding and interaction of transmission type has a p-value of 0.001, and thus the transmission variable is necessary to the model.
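As a quick check of the first two figures, a regression of mpg on a single two-level factor reduces to a comparison of group means: the intercept is the mean mpg of the automatic cars and the coefficient is the difference between the manual and automatic means. A minimal sketch, assuming only base R and the built-in mtcars data (the object names `group_means` and `fit_am` are illustrative and not part of the original analysis):

```r
# Mean mpg for automatic (am = 0) and manual (am = 1) cars
group_means <- tapply(mtcars$mpg, mtcars$am, mean)

# Regression of mpg on transmission type alone
fit_am <- lm(mpg ~ factor(am), data = mtcars)

# Intercept = automatic mean; slope = manual mean minus automatic mean
round(coef(fit_am), 3)
round(c(automatic  = group_means[["0"]],
        difference = group_means[["1"]] - group_means[["0"]]), 3)
```

Both printed vectors should show the same two numbers, which is why the coefficient can be read directly as the average mpg gap between the two groups.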
The process starts with a nested linear model fit, beginning with the comparison of mpg versus transmission type and adding each variable one at a time. However, since some of the variables are continuous and some are factors, we will use the variance inflation factor (VIF) on the continuous variables and generalized linear models (glm) on the factors.
The first model (fit1) is the linear model of mpg on the factor am. The fit is updated by adding each additional variable in the following order: factor(cyl), disp, hp, drat, wt, qsec, factor(vs), factor(gear), and finally factor(carb).
| Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
|---|---|---|---|---|---|
| 30 | 720.8966 | NA | NA | NA | NA |
| 28 | 264.4957 | 2 | 456.4009213 | 28.4296590 | 0.0000079 |
| 27 | 230.4599 | 1 | 34.0358025 | 4.2402467 | 0.0572763 |
| 26 | 183.0392 | 1 | 47.4206687 | 5.9077595 | 0.0280914 |
| 25 | 182.3812 | 1 | 0.6580252 | 0.0819781 | 0.7785511 |
| 24 | 150.1005 | 1 | 32.2806726 | 4.0215892 | 0.0633096 |
| 23 | 141.2059 | 1 | 8.8946071 | 1.1081075 | 0.3091566 |
| 22 | 139.0230 | 1 | 2.1828581 | 0.2719447 | 0.6096443 |
| 20 | 134.0015 | 2 | 5.0215145 | 0.3127950 | 0.7360567 |
| 15 | 120.4027 | 5 | 13.5988573 | 0.3388344 | 0.8814442 |
The ANOVA suggests that the variable cyl is significant at the 0.001 level, hp is significant at the 0.05 level, and disp and wt are significant at the 0.1 level. This fitting process suggests that rear axle ratio (drat), 1/4 mile time (qsec), engine shape (vs), number of forward gears (gear), and number of carburetors (carb) do not add much to the model.
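When `anova()` compares more than two nested linear models, each row's F statistic divides that step's mean square by the residual mean square of the largest model in the sequence. As a sketch, the second row of the table (adding factor(cyl)) can be reproduced by hand from the printed values. The variable names are illustrative, and because the inputs are copied from the rounded table, the result agrees only up to rounding:

```r
# Values copied from the ANOVA table above
ss_step  <- 456.4009213  # Sum of Sq for adding factor(cyl)
df_step  <- 2            # Df used by that step
rss_full <- 120.4027     # RSS of the largest model (last row)
df_full  <- 15           # Res.Df of the largest model

# F = (step mean square) / (residual mean square of the largest model)
f_stat <- (ss_step / df_step) / (rss_full / df_full)

# Upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_stat, df_step, df_full, lower.tail = FALSE)

round(f_stat, 2)
signif(p_value, 2)
```

Using the largest model for the denominator means every row is judged against the same noise estimate, so the p-values in the table are not the same as those from comparing each adjacent pair of models on its own.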
We drop the factor variables (transmission type, number of cylinders, engine shape, number of forward gears, and number of carburetors) to determine which of the continuous variables are significant to the model. We do this by taking the square root of the VIF values; the square root indicates how many times larger a coefficient's standard error is than it would be if that predictor were uncorrelated with the others. Our criterion is that any value greater than 2 will be considered significant.
| variable | sqrt(VIF) |
|---|---|
| disp | 3.018 |
| hp | 2.281 |
| drat | 1.524 |
| wt | 2.648 |
| qsec | 1.787 |
From the table we see that disp, hp, and wt each have square root values greater than two.
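For a numeric predictor, the VIF is 1 / (1 - R²), where R² comes from regressing that predictor on the remaining predictors. The following sketch computes the square roots with base R only instead of `car::vif()`; the names `predictors` and `sqrt_vif` are illustrative, and the result is expected to match the table above up to rounding:

```r
# Continuous predictors used in the VIF step
predictors <- mtcars[, c("disp", "hp", "drat", "wt", "qsec")]

# For each predictor: regress it on the others, then take sqrt(1 / (1 - R^2))
sqrt_vif <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  r2 <- summary(lm(reformulate(others, response = v),
                   data = predictors))$r.squared
  sqrt(1 / (1 - r2))
})

round(sqrt_vif, 3)

# Predictors that exceed the cutoff of 2
names(sqrt_vif)[sqrt_vif > 2]
```

Note that mpg does not enter this calculation at all: the VIF depends only on how the predictors relate to each other.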
We use generalized linear models to test whether any of the factor variables are significantly related to mpg. The variable am uses 0 and 1 to represent automatic and manual transmissions, respectively, so we use the “binomial” family in the glm. The variable vs is also treated as binomial, as 0 is a V-shaped engine and 1 is the straight-line type. Since the number of cylinders (cyl), the number of gears (gear), and the number of carburetors (carb) each take multiple integer values, we use the “Poisson” family for these models and test each with an ANOVA.
| factor | p_value | mpg_coef |
|---|---|---|
| transmission type | 0.00023 | 1.03502 |
| number of cylinders | 0.00045 | 0.95706 |
| v-shaped or straight-line | 0.00002 | 1.03924 |
| number of gears | 0.30889 | 1.01668 |
| number of carburetors | 0.00215 | 0.94191 |
From the output in the table we see that transmission is significant and a manual increases mpg by 3.5%. The number of cylinders is significant and increasing the number of cylinders decreases mpg by 4.3%. The vs variable is significant and a straight-line engine adds about 3.9% to mpg. The number of gears does not appear to be significant. The number of carburetors is significant and for each increase in the number we expect a decrease of around 5.8% in mpg.
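The percentages follow from the `mpg_coef` column by subtracting 1 and multiplying by 100. A short sketch of that arithmetic, with the values copied from the table (the vector names are illustrative):

```r
# Exponentiated coefficients copied from the mpg_coef column
mpg_coef <- c(am = 1.03502, cyl = 0.95706, vs = 1.03924,
              gear = 1.01668, carb = 0.94191)

# A multiplicative coefficient k corresponds to a (k - 1) * 100 percent change
pct_change <- round((mpg_coef - 1) * 100, 1)
pct_change
```

Values above 1 give a positive percentage and values below 1 a negative one, so cyl and carb are the two factors associated with a decrease.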
Using the regression techniques described above, we find that the nested model fit suggested, and the VIF and GLM analyses confirmed, that the number of cylinders, horsepower, displacement, and weight were significant to the model. When we subset mtcars to these variables and rerun the linear regression, we find:
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| cyl | 5.356 | 1.418 | 3.778 | 0.001 |
| hp | -0.031 | 0.036 | -0.871 | 0.391 |
| disp | -0.121 | 0.023 | -5.350 | 0.000 |
| wt | 5.691 | 2.327 | 2.446 | 0.021 |
With the regression fitted without an intercept (the `-1` in the model formula), the number of cylinders, the displacement, and the weight are the significant variables; horsepower is not.
The question was whether transmission type has an effect on mpg, yet the previous results suggested we drop it from the model. We therefore test whether am is a confounder or an interaction variable in our final model. We tie this to weight because cars with automatic transmissions tend to weigh more than cars with manual transmissions.
The ANOVA comparing the additive model with the interaction model gives a p-value of 0.00102, which suggests that the interaction between weight and transmission type is necessary in the model.
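An interaction between a two-level factor and a continuous variable lets each group have its own slope. The sketch below, assuming base R and the mtcars data (the object names are illustrative), reads the weight slope for each transmission type off the interaction model's coefficients and runs the nested-model comparison behind the p-value:

```r
# Additive model (am as a possible confounder) and interaction model
additive         <- lm(mpg ~ factor(am) + wt, data = mtcars)
with_interaction <- lm(mpg ~ factor(am) * wt, data = mtcars)

b <- coef(with_interaction)

# Automatic is the reference level, so its slope is the wt coefficient;
# the manual slope adds the interaction coefficient
slope_auto   <- b[["wt"]]
slope_manual <- b[["wt"]] + b[["factor(am)1:wt"]]
c(automatic = slope_auto, manual = slope_manual)

# The manual slope equals the slope from fitting the manual cars alone
slope_manual_only <- coef(lm(mpg ~ wt, data = mtcars, subset = am == 1))[["wt"]]

# F-test: does adding the interaction term improve on the additive model?
p_interaction <- anova(additive, with_interaction)[2, "Pr(>F)"]
round(p_interaction, 5)
```

Because the interaction model reproduces the two separate regression lines, a small p-value here says that the two lines have detectably different slopes.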
```r
library(datasets)
library(dplyr)
library(ggplot2)
library(car)
library(broom)
library(tibble)
library(tidyr)
library(purrr)
library(knitr)

# Nested linear models: start with transmission type, add one variable at a time
fit1 <- lm(mpg ~ factor(am), data = mtcars)
fit2 <- update(fit1, mpg ~ factor(am) + factor(cyl))
fit3 <- update(fit1, mpg ~ factor(am) + factor(cyl) + disp)
fit4 <- update(fit1, mpg ~ factor(am) + factor(cyl) + disp + hp)
fit5 <- update(fit1, mpg ~ factor(am) + factor(cyl) + disp + hp + drat)
fit6 <- update(fit1, mpg ~ factor(am) + factor(cyl) + disp + hp + drat + wt)
fit7 <- update(fit1, mpg ~ factor(am) + factor(cyl) + disp + hp + drat + wt + qsec)
fit8 <- update(fit1, mpg ~ factor(am) + factor(cyl) + disp + hp + drat + wt + qsec + factor(vs))
fit9 <- update(fit1, mpg ~ factor(am) + factor(cyl) + disp + hp + drat + wt + qsec + factor(vs) + factor(gear))
fit10 <- update(fit1, mpg ~ factor(am) + factor(cyl) + disp + hp + drat + wt + qsec + factor(vs) + factor(gear) + factor(carb))
anova_fit <- data.frame(anova(fit1, fit2, fit3, fit4, fit5, fit6, fit7, fit8, fit9, fit10))
knitr::kable(anova_fit)

# Square root of the VIF for the continuous variables only
df_1 <- mtcars %>% select(mpg | disp:qsec)
fit_continuous <- lm(mpg ~ ., data = df_1)
continuous_vif <- round(sqrt(vif(fit_continuous)), 3)
knitr::kable(continuous_vif)

# GLMs with each factor as the response and mpg as the predictor.
# The p-value is from the chi-square test on the glm; the reported coefficient
# is exp() of the slope from a linear fit of log(factor) on mpg.
log_am <- glm(mtcars$am ~ mtcars$mpg, family = "binomial")
am_coef <- summary(log_am)$coefficients
am_exp <- round(exp(coef(lm(log(mtcars$am + 1) ~ mtcars$mpg))), 5)
am_anova <- anova(log_am, test = "Chisq")
am_p_val <- round(anova(log_am, test = "Chisq")[2, 5], 5)

log_cyl <- glm(mtcars$cyl ~ mtcars$mpg, family = poisson())
cyl_coef <- summary(log_cyl)$coefficients
cyl_exp <- round(exp(coef(lm(log(mtcars$cyl) ~ mtcars$mpg))), 5)
cyl_anova <- anova(log_cyl, test = "Chisq")
cyl_p_val <- round(anova(log_cyl, test = "Chisq")[2, 5], 5)

log_vs <- glm(mtcars$vs ~ mtcars$mpg, family = "binomial")
vs_p_val <- round(anova(log_vs, test = "Chisq")[2, 5], 5)
vs_exp <- round(exp(coef(lm(log(mtcars$vs + 1) ~ mtcars$mpg))), 5)

log_gear <- glm(mtcars$gear ~ mtcars$mpg, family = poisson())
gear_exp <- round(exp(coef(lm(log(mtcars$gear) ~ mtcars$mpg))), 5)
gear_p_val <- round(anova(log_gear, test = "Chisq")[2, 5], 5) # not significant

log_carb <- glm(mtcars$carb ~ mtcars$mpg, family = poisson())
carb_coef <- summary(log_carb)$coefficients
carb_exp <- round(exp(coef(lm(log(mtcars$carb) ~ mtcars$mpg))), 5) # about 0.942, read as a 5.8% decrease
carb_p_val <- round(anova(log_carb, test = "Chisq")[2, 5], 5)

# Collect the p-values and the slope (second) coefficient of each log-linear fit
factors <- c("transmission type", "number of cylinders", "v-shaped or straight-line", "number of gears", "number of carburetors")
p_values <- c(am_p_val, cyl_p_val, vs_p_val, gear_p_val, carb_p_val)
exp_coefficients <- as.matrix(c(am_exp, cyl_exp, vs_exp, gear_exp, carb_exp))[c(2, 4, 6, 8, 10), ]
factor_results <- data.frame(factor = factors, p_value = p_values, mpg_coef = exp_coefficients)
knitr::kable(factor_results)

# Final model on the selected variables, fitted without an intercept (-1)
sig_mtcars <- mtcars %>% select(mpg | cyl | hp | disp | wt)
final_lm <- round(summary(lm(mpg ~ . - 1, data = sig_mtcars))$coefficient, 3)
knitr::kable(final_lm)

# Transmission type as a confounder (additive) versus an interaction with weight
am_confounder <- lm(mpg ~ factor(am) + wt, data = mtcars)
am_interaction <- lm(mpg ~ factor(am) * wt, data = mtcars)
summary(am_confounder)$coefficient
summary(am_interaction)$coefficient
p_val <- round(anova(am_confounder, am_interaction)[2, 6], 5)
p_val

# Separate mpg ~ wt regression lines for manual and automatic cars
man_mtcars <- mtcars %>% filter(am == 1)
auto_mtcars <- mtcars %>% filter(am == 0)
man_line <- lm(mpg ~ wt, data = man_mtcars)
auto_line <- lm(mpg ~ wt, data = auto_mtcars)

ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(am))) +
  geom_point(size = 4) +
  labs(x = "Weight (in 1000 lbs)",
       y = "MPG",
       title = "Regression of MPG v. Weight by Transmission Type") +
  scale_color_manual(name = "Transmission",
                     labels = c("Automatic", "Manual"),
                     values = c("pink", "lightblue")) +
  geom_abline(intercept = coef(man_line)[1],
              slope = coef(man_line)[2],
              size = 1,
              colour = "blue") +
  geom_abline(intercept = coef(auto_line)[1],
              slope = coef(auto_line)[2],
              size = 1,
              colour = "red")
```