The purpose of this analysis is to find if fuel efficiency (miles / gallorn) is affected by transmission type (automatic vs. manual) and, if yes, by how much. The R dataset, ‘mtcars’ contains data from 32 cars (each row has 10 columns).
I found that a manual transmission significantly increased fuel efficiency by approximatibely 7.24 miles / gallon compared to automatic. However adjusting for the ohter covariates did not result in significant fuel usage differences (with the exception of a smaller selection of covariates). Adjusted R-squared values increased from 34% for the unadjusted model to 80-84% for adjusted and the inclusion of covariates significantly improved all models compared to the unadjusted one but restricting the number of covariates or the coding strategy (numeric or factor) did not.
These results suggest that fuel efficiency is better in cars with manual transmission than in cars with automatic transmission, but other factors are able to compensate for this difference.
Starting from the ‘mtcars’ dataset, I prepared two custom datasets for analysis. mtcars2
kept the numerical type of discrete variables with numerical levels: cyl
(Number of cylinders: 4, 6 or 8), gear
(Number of forward gears: 3, 4 or 5) and carb
(Number of carburetors: 1-4, 6 or 8). Therefore, regression coefficients for these variables show the expected difference of fuel efficiency for evey unit increase in their numeric value and the intercept uses values of 0, even when these values do not appear in the dataset. mtcars3
coded these variables as factors resulting in dummy variables compared by the regression algorithm to their first level.
In both datasets, variables am
(Transmission: manual compared to automatic) and vs
(Engine: straight compared to V-shaped) were coded as factors.
Variables disp
(Displacement) was rescaled from cubic inches to litters and hp
(Gross horsepower) was downscaled by a factor of 100 in order to make their coefficients easier to visualize graphically since their unscaled coefficients and confidence intervals were very close to 0, which made them appear as dots rather than ranges. Rescaling only affects the numerical values of their coefficients and their visual aspect but does not affect any related statisitcal inference.
All variables were labelled for aestethic reasons using the function set_variable_labels
from the package labelled
. This automatically applies the attribute label
to each column of a dataset, which is used by other functions to replace variable names to variable labels in their output.
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1 ✓ purrr 0.3.3
## ✓ tibble 2.1.3 ✓ dplyr 0.8.3
## ✓ tidyr 1.0.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.4.0
## ── Conflicts ───────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
data(mtcars)
# mtcars %>% summary()
# Prepare nice variable labels
labels <- c(
"Miles/(US) gallon",
"Number of cylinders",
"Displacement (Liters)", # from cubic inches
"Gross horsepower (100x)", # each 100 hp
"Rear axle ratio",
"Weight (1000 lbs)",
"1/4 mile time",
"Engine",
"Transmission",
"Number of forward gears",
"Number of carburetors"
)
# change the units for continuous variables to make coefficinets more visble on the chart
mtcars1 <- mtcars %>% mutate(
`disp` = `disp`/61.024, # cubic inches to liters
`hp`= `hp`/100 # each 100 hp
)
# make factors
mtcars2 <- mtcars1 %>% mutate(
# true factors
`am` = factor(`am`, labels = c(`0`="automatic", `1`="manual")),
`vs` = factor(`vs`, labels = c(`0`="V-shaped", `1` = "straight")),
) %>% labelled::set_variable_labels(.labels = labels)
mtcars3 <- mtcars2 %>% mutate(
# numeric factors: these could be coded as numeric
`cyl` = factor(`cyl`), # Levels: 4 6 8
`gear` = factor(`gear`), # Levels: 3 4 5
`carb` = factor(`carb`) # Levels: 1 2 3 4 6 8
) %>% labelled::set_variable_labels(.labels = labels)
# mtcars3 %>% summary()
A short summary of the data (factors coded as factors) is printed below.
mtcars3 %>% summary()
## mpg cyl disp hp drat
## Min. :10.40 4:11 Min. :1.165 Min. :0.520 Min. :2.760
## 1st Qu.:15.43 6: 7 1st Qu.:1.980 1st Qu.:0.965 1st Qu.:3.080
## Median :19.20 8:14 Median :3.217 Median :1.230 Median :3.695
## Mean :20.09 Mean :3.781 Mean :1.467 Mean :3.597
## 3rd Qu.:22.80 3rd Qu.:5.342 3rd Qu.:1.800 3rd Qu.:3.920
## Max. :33.90 Max. :7.735 Max. :3.350 Max. :4.930
## wt qsec vs am gear carb
## Min. :1.513 Min. :14.50 V-shaped:18 automatic:19 3:15 1: 7
## 1st Qu.:2.581 1st Qu.:16.89 straight:14 manual :13 4:12 2:10
## Median :3.325 Median :17.71 5: 5 3: 3
## Mean :3.217 Mean :17.85 4:10
## 3rd Qu.:3.610 3rd Qu.:18.90 6: 1
## Max. :5.424 Max. :22.90 8: 1
A more detailed summary of each variable in the dataset is provied by the function DescTools::Desc
which outputs decriptive statistics and charts based on data type.
DescTools::Desc(mtcars3)
## ------------------------------------------------------------------------------
## Describe mtcars3 (data.frame):
##
## data frame: 32 obs. of 11 variables
## 32 complete cases (100.0%)
##
## Nr ColName Class NAs Levels Label
## 1 mpg numeric . Miles/(US) gallon
## 2 cyl factor . (3): 1-4, 2-6, 3-8 Number of cylinders
## 3 disp numeric . Displacement (Liters)
## 4 hp numeric . Gross horsepower (100x)
## 5 drat numeric . Rear axle ratio
## 6 wt numeric . Weight (1000 lbs)
## 7 qsec numeric . 1/4 mile time
## 8 vs factor . (2): 1-V-shaped, Engine
## 2-straight
## 9 am factor . (2): 1-automatic, Transmission
## 2-manual
## 10 gear factor . (3): 1-3, 2-4, 3-5 Number of forward gears
## 11 carb factor . (6): 1-1, 2-2, 3-3, 4-4, Number of carburetors
## 5-6, ...
##
##
## ------------------------------------------------------------------------------
## 1 - mpg (numeric) :
## Miles/(US) gallon
##
##
## length n NAs unique 0s mean meanCI
## 32 32 0 25 0 20.091 17.918
## 100.0% 0.0% 0.0% 22.264
##
## .05 .10 .25 median .75 .90 .95
## 11.995 14.340 15.425 19.200 22.800 30.090 31.300
##
## range sd vcoef mad IQR skew kurt
## 23.500 6.027 0.300 5.411 7.375 0.611 -0.373
##
## lowest : 10.4 (2), 13.3, 14.3, 14.7, 15.0
## highest: 26.0, 27.3, 30.4 (2), 32.4, 33.9
## ------------------------------------------------------------------------------
## 2 - cyl (factor) :
## Number of cylinders
##
##
## length n NAs unique levels dupes
## 32 32 0 3 3 y
## 100.0% 0.0%
##
## level freq perc cumfreq cumperc
## 1 8 14 43.8% 14 43.8%
## 2 4 11 34.4% 25 78.1%
## 3 6 7 21.9% 32 100.0%
## ------------------------------------------------------------------------------
## 3 - disp (numeric) :
## Displacement (Liters)
##
##
## length n NAs unique 0s mean meanCI
## 32 32 0 27 0 3.780838 3.048591
## 100.0% 0.0% 0.0% 4.513086
##
## .05 .10 .25 median .75 .90 .95
## 1.267534 1.320956 1.979959 3.216767 5.342160 6.489250 7.357761
##
## range sd vcoef mad IQR skew kurt
## 6.569546 2.030983 0.537178 2.301985 3.362202 0.381657 -1.207212
##
## lowest : 1.165115, 1.240496, 1.289657, 1.294573, 1.558403
## highest: 5.899318 (2), 6.554798, 7.210278, 7.538018, 7.734662
## ------------------------------------------------------------------------------
## 4 - hp (numeric) :
## Gross horsepower (100x)
##
##
## length n NAs unique 0s mean meanCI
## 32 32 0 22 0 1.4669 1.2197
## 100.0% 0.0% 0.0% 1.7141
##
## .05 .10 .25 median .75 .90 .95
## 0.6365 0.6600 0.9650 1.2300 1.8000 2.4350 2.5355
##
## range sd vcoef mad IQR skew kurt
## 2.8300 0.6856 0.4674 0.7710 0.8350 0.7260 -0.1356
##
## lowest : 0.52, 0.62, 0.65, 0.66 (2), 0.91
## highest: 2.15, 2.3, 2.45 (2), 2.64, 3.35
## ------------------------------------------------------------------------------
## 5 - drat (numeric) :
## Rear axle ratio
##
##
## length n NAs unique 0s mean meanCI
## 32 32 0 22 0 3.5966 3.4038
## 100.0% 0.0% 0.0% 3.7893
##
## .05 .10 .25 median .75 .90 .95
## 2.8535 3.0070 3.0800 3.6950 3.9200 4.2090 4.3145
##
## range sd vcoef mad IQR skew kurt
## 2.1700 0.5347 0.1487 0.7042 0.8400 0.2659 -0.7147
##
## lowest : 2.76 (2), 2.93, 3.0, 3.07 (3), 3.08 (2)
## highest: 4.08 (2), 4.11, 4.22 (2), 4.43, 4.93
## ------------------------------------------------------------------------------
## 6 - wt (numeric) :
## Weight (1000 lbs)
##
##
## length n NAs unique 0s mean meanCI
## 32 32 0 29 0 3.21725 2.86448
## 100.0% 0.0% 0.0% 3.57002
##
## .05 .10 .25 median .75 .90 .95
## 1.73600 1.95550 2.58125 3.32500 3.61000 4.04750 5.29275
##
## range sd vcoef mad IQR skew kurt
## 3.91100 0.97846 0.30413 0.76725 1.02875 0.42315 -0.02271
##
## lowest : 1.513, 1.615, 1.835, 1.935, 2.14
## highest: 3.845, 4.07, 5.25, 5.345, 5.424
## ------------------------------------------------------------------------------
## 7 - qsec (numeric) :
## 1/4 mile time
##
##
## length n NAs unique 0s mean meanCI
## 32 32 0 30 0 17.8488 17.2045
## 100.0% 0.0% 0.0% 18.4930
##
## .05 .10 .25 median .75 .90 .95
## 15.0455 15.5340 16.8925 17.7100 18.9000 19.9900 20.1045
##
## range sd vcoef mad IQR skew kurt
## 8.4000 1.7869 0.1001 1.4159 2.0075 0.3690 0.3351
##
## lowest : 14.5, 14.6, 15.41, 15.5, 15.84
## highest: 19.9, 20.0, 20.01, 20.22, 22.9
## ------------------------------------------------------------------------------
## 8 - vs (factor - dichotomous) :
## Engine
##
##
## length n NAs unique
## 32 32 0 2
## 100.0% 0.0%
##
## freq perc lci.95 uci.95'
## V-shaped 18 56.2% 39.3% 71.8%
## straight 14 43.8% 28.2% 60.7%
##
## ' 95%-CI Wilson
## ------------------------------------------------------------------------------
## 9 - am (factor - dichotomous) :
## Transmission
##
##
## length n NAs unique
## 32 32 0 2
## 100.0% 0.0%
##
## freq perc lci.95 uci.95'
## automatic 19 59.4% 42.3% 74.5%
## manual 13 40.6% 25.5% 57.7%
##
## ' 95%-CI Wilson
## ------------------------------------------------------------------------------
## 10 - gear (factor) :
## Number of forward gears
##
##
## length n NAs unique levels dupes
## 32 32 0 3 3 y
## 100.0% 0.0%
##
## level freq perc cumfreq cumperc
## 1 3 15 46.9% 15 46.9%
## 2 4 12 37.5% 27 84.4%
## 3 5 5 15.6% 32 100.0%
## ------------------------------------------------------------------------------
## 11 - carb (factor) :
## Number of carburetors
##
##
## length n NAs unique levels dupes
## 32 32 0 6 6 y
## 100.0% 0.0%
##
## level freq perc cumfreq cumperc
## 1 2 10 31.2% 10 31.2%
## 2 4 10 31.2% 20 62.5%
## 3 1 7 21.9% 27 84.4%
## 4 3 3 9.4% 30 93.8%
## 5 6 1 3.1% 31 96.9%
## 6 8 1 3.1% 32 100.0%
A boxplot of fuel efficiency by type of transmission shows that manual cars use less fuel than automatic cars. This difference is statisticaly signifficant accoding to a T-test.
ggplot(mtcars3, aes(y=`mpg`, x=`am`, color=`am`)) +
geom_boxplot(varwidth = T) +
stat_summary(fun.y = mean, na.rm=T, geom="point", pch=23, size=3) +
ggpubr::stat_compare_means(method = "t.test") +
labs(x="Transmission type", y="Miles / gallon") +
theme(legend.position = "none")
I defined 5 regression model to study the relation between the type of transmission (binary variable am
) and fuel consumption (in miles / gallon, variable mpg
).
The unadjusted model is just another way to perform the same T-test as above. It is used as refference point for the other models.
I created 2 other models using all other variables as covariates. One of them uses numerical coding for numerical dicrete variables and the other codes them as factors.
In order to reduce the number of coeficients, I used a stepwise selection algorithm to select the most important covariates from both adjusted models.
# Unadjusted model: Miles / gallon by Transmission type
unadj <- lm(data=mtcars2, formula=`mpg`~`am`) # %>% summary
# Adjusted model: Miles / gallon by Transmission type, keepig all other covariates constant, retaing the numeric type of numeric factors
adj.num <- lm(data=mtcars2, formula=`mpg`~.) # %>% summary
# Adjusted model: Miles / gallon by Transmission type, keepig all other covariates constant, converting numeric factors to dummy variables
adj.cat <- lm(data=mtcars3, formula=`mpg`~.) # %>% summary
# Adjusted model with stepwise selection of covariates: Miles / gallon by Transmission type, keepig all other covariates constant, retaing the numeric type of numeric factors
adj.num.step <- step(lm(data=mtcars2, formula=`mpg`~.), trace = F) # %>% summary
# Adjusted model with stepwise selection of covariates: Miles / gallon by Transmission type, keepig all other covariates constant, converting numeric factors to dummy variables
adj.cat.step <- step(lm(data=mtcars3, formula=`mpg`~.), trace = F) # %>% summary
I used the sjPlot::plot_models
function to create a chart of the coefficients for the two adjusted models and the unadjusted model. This chart shows each coeficient and its 95% confidence interval. There are 3 models, each with a different color (green: unadjusted; blue: adjusted, numeric coding; red: adjusted, qualitative coding).
The chart shows that unadjusted, a manual transmission leads to a significant average increase in fuel efficiency of about 7 miles/ gallon. However, adjusted for all other covariates, under both coding methods, this average differnce approaches 0 and it is no longer statistically singificant. In fact, under both coding methods, confidence intervals for all other covariates cross 0 and, therefore, are not statistically signifficantly different from 0.
sjPlot::plot_models(unadj, adj.num, adj.cat,
std.est = NULL, # use "std" if you want to show standardized coefficients, which makes it easier to comparte variables measured on different scales (with very different numeric values).
show.intercept = F, # use TRUE to include the intercept
prefix.labels = "label", wrap.labels = 100,
title = "Miles/gallon by:",
axis.title = "Coefficients (95% conf. int.)",
legend.title = "Model",
m.labels = c("Unadjusted", "Adjusted, numeric", "Adjusted, qualitative")) +
# legend: top-right, transparent background
theme(legend.position = c(1, 1), legend.justification = c(1, 1),
legend.background = element_blank(), legend.key = element_blank())
The table below, created using the function sjPlot:tab_model
, shows the same coefficients with their 95% confidence interval and p-values, as well as R-squared values for each of the models. It shows that the unadjusted model only explains 33.8% of the variance of fuel usage while the adjsted models, using all variables, capture around 80% of the variance. The remaining 20% of the variance is not explained by any of the variabls in the mtcars
dataset.
sjPlot::tab_model(unadj, adj.num, adj.cat,
collapse.ci = T,
dv.labels = c("Unadjusted",
"Adjusted, numeric", "Adjusted, qualitative"))
Unadjusted | Adjusted, numeric | Adjusted, qualitative | ||||
---|---|---|---|---|---|---|
Predictors | Estimates | p | Estimates | p | Estimates | p |
(Intercept) |
17.15 (14.85 – 19.44) |
<0.001 |
12.30 (-26.62 – 51.23) |
0.518 |
23.88 (-18.89 – 66.65) |
0.253 |
Transmission: manual |
7.24 (3.64 – 10.85) |
<0.001 |
2.52 (-1.76 – 6.80) |
0.234 |
1.21 (-5.64 – 8.06) |
0.711 |
Number of cylinders |
-0.11 (-2.28 – 2.06) |
0.916 | ||||
Displacement(Liters) |
0.81 (-1.45 – 3.08) |
0.463 |
2.17 (-1.98 – 6.32) |
0.283 | ||
Gross horsepower(100 x) |
-2.15 (-6.68 – 2.38) |
0.335 |
-7.05 (-15.45 – 1.35) |
0.094 | ||
Rear axle ratio |
0.79 (-2.61 – 4.19) |
0.635 |
1.18 (-4.11 – 6.48) |
0.641 | ||
Weight(1000 lbs) |
-3.72 (-7.65 – 0.22) |
0.063 |
-4.53 (-9.94 – 0.88) |
0.095 | ||
1/4 mile time |
0.82 (-0.70 – 2.34) |
0.274 |
0.37 (-1.63 – 2.36) |
0.700 | ||
Engine: straight |
0.32 (-4.06 – 4.69) |
0.881 |
1.93 (-4.19 – 8.05) |
0.512 | ||
Number of forward gears |
0.66 (-2.45 – 3.76) |
0.665 | ||||
Number of carburetors |
-0.20 (-1.92 – 1.52) |
0.812 | ||||
Number of cylinders: cyl 6 |
-2.65 (-9.13 – 3.83) |
0.397 | ||||
Number of cylinders: cyl 8 |
-0.34 (-15.60 – 14.92) |
0.963 | ||||
Number of forward gears: gear 4 |
1.11 (-6.98 – 9.21) |
0.773 | ||||
Number of forward gears: gear 5 |
2.53 (-5.44 – 10.49) |
0.509 | ||||
Number of carburetors: carb 2 |
-0.98 (-5.92 – 3.96) |
0.679 | ||||
Number of carburetors: carb 3 |
3.00 (-6.15 – 12.15) |
0.495 | ||||
Number of carburetors: carb 4 |
1.09 (-8.39 – 10.58) |
0.810 | ||||
Number of carburetors: carb 6 |
4.48 (-9.13 – 18.08) |
0.494 | ||||
Number of carburetors: carb 8 |
7.25 (-10.57 – 25.07) |
0.399 | ||||
Observations | 32 | 32 | 32 | |||
R2 / R2 adjusted | 0.360 / 0.338 | 0.869 / 0.807 | 0.893 / 0.779 |
In order to reduce the number of coefficients, I used sepwise covariate selection to find the models that most efficiently explain fuel usage, starting from the two adjusted models.
Using numeric coding, I found that a manual transmission significantly increases fuel economy with only aproximativey 3 miles / gallon compared to automatic, when adjusted for car weight and 1/4 mile time. Also, each 1000lbs car weight reduces reduces fuel economy with approximatively 4 miles / gallon and a slower acceleration (each second added to the 1/4 mile time) increases fuel economy by about 1.2 miles / gallon.
Using factor coding, I found that a manual transmission did significantly increase fuel economy, adjusted for weight, horsepower and number of cylinders. However, larger weight reduced fuel economy by about 2.5 miles / gallon for every 1000 lbs, a more powerful motor reduced fuel economy by about 3.2 miles / gallon for every 100 hp and a 6 cylinder motor reduced fuel economy by about 3 miles / gallon compared to 4 cylinders.
sjPlot::plot_models(unadj, adj.num.step, adj.cat.step,
std.est = NULL, # use "std" if you want to show standardized coefficients, which makes it easier to comparte variables measured on different scales (with very different numeric values).
show.intercept = F, # use TRUE to include the intercept
prefix.labels = "label", wrap.labels = 100,
title = "Miles/gallon by:",
axis.title = "Coefficients (95% conf. int.)",
legend.title = "Model",
m.labels = c("Unadjusted",
"Adjusted, numeric, stepwise", "Adjusted, qualitative, stepwise")) +
# legend: top-right, transparent background
theme(legend.position = c(1, 1), legend.justification = c(1, 1),
legend.background = element_blank(), legend.key = element_blank())
The table below shows that the adjsted stepwise models capture 83.4% and 84.0% of the variance in fuel usage, respectively. Theit unadjusted R-squared values are smaller than those of their full models because they include fewer variables while their adjusted R-squared values are larger because they only include the most informative variables.
sjPlot::tab_model(unadj, adj.num.step, adj.cat.step,
collapse.ci = T,
dv.labels = c("Unadjusted",
"Adjusted, numeric, stepwise", "Adjusted, qualitative, stepwise"))
Unadjusted | Adjusted, numeric, stepwise | Adjusted, qualitative, stepwise | ||||
---|---|---|---|---|---|---|
Predictors | Estimates | p | Estimates | p | Estimates | p |
(Intercept) |
17.15 (14.85 – 19.44) |
<0.001 |
9.62 (-4.64 – 23.87) |
0.178 |
33.71 (28.35 – 39.06) |
<0.001 |
Transmission: manual |
7.24 (3.64 – 10.85) |
<0.001 |
2.94 (0.05 – 5.83) |
0.047 |
1.81 (-1.06 – 4.68) |
0.206 |
Weight(1000 lbs) |
-3.92 (-5.37 – -2.46) |
<0.001 |
-2.50 (-4.32 – -0.68) |
0.009 | ||
1/4 mile time |
1.23 (0.63 – 1.82) |
<0.001 | ||||
Number of cylinders: cyl 6 |
-3.03 (-5.92 – -0.14) |
0.041 | ||||
Number of cylinders: cyl 8 |
-2.16 (-6.86 – 2.53) |
0.352 | ||||
Gross horsepower(100 x) |
-3.21 (-6.03 – -0.40) |
0.027 | ||||
Observations | 32 | 32 | 32 | |||
R2 / R2 adjusted | 0.360 / 0.338 | 0.850 / 0.834 | 0.866 / 0.840 |
# Compare the unadjusted model to the adjusted models
anova(unadj, adj.num)
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 21 147.49 9 573.4 9.0711 1.779e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(unadj, adj.cat)
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.9
## 2 15 120.4 15 600.49 4.9874 0.001759 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Compare the adjusted qualitative model to the numeric model
anova(adj.num, adj.cat)
## Analysis of Variance Table
##
## Model 1: mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## Model 2: mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 21 147.49
## 2 15 120.40 6 27.092 0.5625 0.7536
The inclusion of other covariates significantly impoved the regression model, under both coding procedures (numeric: p<0.001, qualitative: p=0.0018). The two adjusted model did not significantly differ in their fit (p=0.7536).
# Compare the unadjusted model to the adjusted stepwise models
anova(unadj, adj.num.step)
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ wt + qsec + am
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 28 169.29 2 551.61 45.618 1.55e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(unadj, adj.cat.step)
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ cyl + hp + wt + am
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 26 151.03 4 569.87 24.527 1.688e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Compare the adjusted stepwise qualitative and numeric models
anova(adj.num.step, adj.cat.step)
## Analysis of Variance Table
##
## Model 1: mpg ~ wt + qsec + am
## Model 2: mpg ~ cyl + hp + wt + am
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 28 169.29
## 2 26 151.03 2 18.26 1.5718 0.2268
With stepwise selection, the inclusion of other covariates significantly impoved the regression model, under both coding procedures (numeric: p<0.001, qualitative: p<0.001). The two adjusted stepwise models did not significantly differ in their fit (p=0.227).
# Compare the stepwise models to the full models
anova(adj.num, adj.num.step)
## Analysis of Variance Table
##
## Model 1: mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## Model 2: mpg ~ wt + qsec + am
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 21 147.49
## 2 28 169.29 -7 -21.791 0.4432 0.8636
anova(adj.cat, adj.cat.step)
## Analysis of Variance Table
##
## Model 1: mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## Model 2: mpg ~ cyl + hp + wt + am
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 15 120.40
## 2 26 151.03 -11 -30.623 0.3468 0.9588
The stepwise models did not show signifficantly better fits compared to their models of origin (numeric: p=0.864, qualitative: p=0.959).
# A function that replicates the `plot(model)` functionality of base plotting system to ggplot
diagPlot <- function(model) {
# Residuals vs. Fitted
p1 <- ggplot(model, aes(x = .fitted, y = .resid)) +
geom_point() + stat_smooth() +
# geom_hline(yintercept=mean(model$residuals), color="black", linetype="solid") +
# geom_smooth(method="lm", se=F, color="black", linetype="solid", size=0.5)+
geom_hline(yintercept=0, color="red", linetype="dashed") +
labs(x="Fitted Values", y="Residuals", title="Residuals vs. Fitted")
# Normal Q-Q
p2 <- ggplot(model, aes(sample=.stdresid)) +
geom_qq() + geom_qq_line(color="red", linetype="dashed") +
labs(x="Theoretical Quantiles", y="Std. Residuals", title="Normal Q-Q")
# Scale-Location
p3 <- ggplot(model, aes(x = .fitted, y = sqrt(abs(.stdresid))))+
geom_point(na.rm=T) + stat_smooth(na.rm = T) +
labs(x="Fitted Value", y=expression(sqrt("|Std. Residuals|")), title="Scale-Location")
# Residuals vs. Leverage vs. Cook's distance
p4 <- ggplot(model, aes(x = .hat, y = .stdresid)) +
geom_point(aes(size=.cooksd, color=.cooksd), na.rm=T) +
stat_smooth(na.rm=T) +
geom_hline(yintercept=0, color="red", linetype="dashed") +
scale_size_area() +
guides(size=guide_legend("Cook's Distance"), color=guide_legend("Cook's Distance"))+
labs(x="Leverage", y="Std. Residuals", title="Residuals vs. Leverage") +
theme(legend.position="bottom")
return(list(`Residuals vs. Fitted` = p1,
`Normal Q-Q` = p2,
`Scale-Location` = p3,
`Residuals vs. Leverage vs. Cook's distance` = p4))
}
The full diagnostics chart-grid is less usefull for the unadjusted model. The chart below shows the density distribution of the residuals, which are approximatively normal and centerd at 0.
ggplot(unadj, aes(x=.resid)) +
geom_density(fill="grey80", color="grey50") +
ggpubr::stat_overlay_normal_density(color="black")+
geom_vline(xintercept = 0, color="red", linetype="dashed") +
labs(x="Residuals", y = "Density",
title="Residuals distribution", subtitle = "(Normal distribution overlay)")
The adjusted model, using numeric coding, does not show many allarming issues. The residuals are normally distributed and scattered around 0 but show a “U”-shaped trend as more cars are added. They show no pattern with regards to fitted values. As expected, larger residuals show the highest leverage.
cowplot::plot_grid(plotlist = diagPlot(adj.num), nrow=2)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
The adjusted model, using factor coding, does not show many allarming issues. The residuals are less normally distributed, scattered around 0 and do not show any particular pattern as more cars are added. There may be some heteroskedacity. As expected, larger residuals show the highest leverage.
cowplot::plot_grid(plotlist = diagPlot(adj.cat), nrow=2)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (stat_qq).
## Warning: Removed 2 rows containing non-finite values (stat_qq_line).
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
The diagnostics for adjusted stepwise model with numerical coding look very similar to its original model.
cowplot::plot_grid(plotlist = diagPlot(adj.num.step), nrow=2)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
The diagnostics for adjusted stepwise model with factor coding look very similar to its original model.
cowplot::plot_grid(plotlist = diagPlot(adj.cat.step), nrow=2)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'