Summary

The purpose of this analysis is to find if fuel efficiency (miles / gallorn) is affected by transmission type (automatic vs. manual) and, if yes, by how much. The R dataset, ‘mtcars’ contains data from 32 cars (each row has 10 columns).

I found that a manual transmission significantly increased fuel efficiency by approximatibely 7.24 miles / gallon compared to automatic. However adjusting for the ohter covariates did not result in significant fuel usage differences (with the exception of a smaller selection of covariates). Adjusted R-squared values increased from 34% for the unadjusted model to 80-84% for adjusted and the inclusion of covariates significantly improved all models compared to the unadjusted one but restricting the number of covariates or the coding strategy (numeric or factor) did not.

These results suggest that fuel efficiency is better in cars with manual transmission than in cars with automatic transmission, but other factors are able to compensate for this difference.

Data preparation

Starting from the ‘mtcars’ dataset, I prepared two custom datasets for analysis. mtcars2 kept the numerical type of discrete variables with numerical levels: cyl (Number of cylinders: 4, 6 or 8), gear (Number of forward gears: 3, 4 or 5) and carb (Number of carburetors: 1-4, 6 or 8). Therefore, regression coefficients for these variables show the expected difference of fuel efficiency for evey unit increase in their numeric value and the intercept uses values of 0, even when these values do not appear in the dataset. mtcars3 coded these variables as factors resulting in dummy variables compared by the regression algorithm to their first level.

In both datasets, variables am (Transmission: manual compared to automatic) and vs (Engine: straight compared to V-shaped) were coded as factors.

Variables disp (Displacement) was rescaled from cubic inches to litters and hp (Gross horsepower) was downscaled by a factor of 100 in order to make their coefficients easier to visualize graphically since their unscaled coefficients and confidence intervals were very close to 0, which made them appear as dots rather than ranges. Rescaling only affects the numerical values of their coefficients and their visual aspect but does not affect any related statisitcal inference.

All variables were labelled for aestethic reasons using the function set_variable_labels from the package labelled. This automatically applies the attribute label to each column of a dataset, which is used by other functions to replace variable names to variable labels in their output.

library(tidyverse)

## ── Attaching packages ────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.3
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0

## ── Conflicts ───────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

data(mtcars)
# mtcars %>% summary()

# Prepare nice variable labels
labels <- c(
  "Miles/(US) gallon",
  "Number of cylinders",
  "Displacement (Liters)", # from cubic inches
  "Gross horsepower (100x)", # each 100 hp
  "Rear axle ratio",
  "Weight (1000 lbs)",
  "1/4 mile time",
  "Engine",
  "Transmission",
  "Number of forward gears",
  "Number of carburetors"
)

# change the units for continuous variables to make coefficinets more visble on the chart
mtcars1 <- mtcars %>% mutate(
  `disp` = `disp`/61.024, # cubic inches to liters
  `hp`= `hp`/100 #  each 100 hp
)

# make factors
mtcars2 <- mtcars1 %>% mutate(
  # true factors
  `am` = factor(`am`, labels = c(`0`="automatic", `1`="manual")),
  `vs` = factor(`vs`, labels = c(`0`="V-shaped", `1` = "straight")), 
) %>% labelled::set_variable_labels(.labels = labels)

mtcars3 <- mtcars2 %>% mutate(
  # numeric factors: these could be coded as numeric
  `cyl` = factor(`cyl`), # Levels: 4 6 8
  `gear` = factor(`gear`), # Levels: 3 4 5
  `carb` = factor(`carb`) # Levels: 1 2 3 4 6 8
) %>% labelled::set_variable_labels(.labels = labels)

# mtcars3 %>% summary()

Exploratory analysis

Concise Summary

A short summary of the data (factors coded as factors) is printed below.

mtcars3 %>% summary()

##       mpg        cyl         disp             hp             drat      
##  Min.   :10.40   4:11   Min.   :1.165   Min.   :0.520   Min.   :2.760  
##  1st Qu.:15.43   6: 7   1st Qu.:1.980   1st Qu.:0.965   1st Qu.:3.080  
##  Median :19.20   8:14   Median :3.217   Median :1.230   Median :3.695  
##  Mean   :20.09          Mean   :3.781   Mean   :1.467   Mean   :3.597  
##  3rd Qu.:22.80          3rd Qu.:5.342   3rd Qu.:1.800   3rd Qu.:3.920  
##  Max.   :33.90          Max.   :7.735   Max.   :3.350   Max.   :4.930  
##        wt             qsec              vs             am     gear   carb  
##  Min.   :1.513   Min.   :14.50   V-shaped:18   automatic:19   3:15   1: 7  
##  1st Qu.:2.581   1st Qu.:16.89   straight:14   manual   :13   4:12   2:10  
##  Median :3.325   Median :17.71                                5: 5   3: 3  
##  Mean   :3.217   Mean   :17.85                                       4:10  
##  3rd Qu.:3.610   3rd Qu.:18.90                                       6: 1  
##  Max.   :5.424   Max.   :22.90                                       8: 1

Detailed Summary

A more detailed summary of each variable in the dataset is provied by the function DescTools::Desc which outputs decriptive statistics and charts based on data type.

DescTools::Desc(mtcars3)

## ------------------------------------------------------------------------------ 
## Describe mtcars3 (data.frame):
## 
## data frame:  32 obs. of  11 variables
##      32 complete cases (100.0%)
## 
##   Nr  ColName  Class    NAs  Levels                    Label                  
##   1   mpg      numeric  .                              Miles/(US) gallon      
##   2   cyl      factor   .    (3): 1-4, 2-6, 3-8        Number of cylinders    
##   3   disp     numeric  .                              Displacement (Liters)  
##   4   hp       numeric  .                              Gross horsepower (100x)
##   5   drat     numeric  .                              Rear axle ratio        
##   6   wt       numeric  .                              Weight (1000 lbs)      
##   7   qsec     numeric  .                              1/4 mile time          
##   8   vs       factor   .    (2): 1-V-shaped,          Engine                 
##                              2-straight                                       
##   9   am       factor   .    (2): 1-automatic,         Transmission           
##                              2-manual                                         
##   10  gear     factor   .    (3): 1-3, 2-4, 3-5        Number of forward gears
##   11  carb     factor   .    (6): 1-1, 2-2, 3-3, 4-4,  Number of carburetors  
##                              5-6, ...                                         
## 
## 
## ------------------------------------------------------------------------------ 
## 1 - mpg (numeric) :
##   Miles/(US) gallon
## 
## 
##   length       n     NAs  unique      0s    mean  meanCI
##       32      32       0      25       0  20.091  17.918
##           100.0%    0.0%            0.0%          22.264
##                                                         
##      .05     .10     .25  median     .75     .90     .95
##   11.995  14.340  15.425  19.200  22.800  30.090  31.300
##                                                         
##    range      sd   vcoef     mad     IQR    skew    kurt
##   23.500   6.027   0.300   5.411   7.375   0.611  -0.373
##                                                         
## lowest : 10.4 (2), 13.3, 14.3, 14.7, 15.0
## highest: 26.0, 27.3, 30.4 (2), 32.4, 33.9

## ------------------------------------------------------------------------------ 
## 2 - cyl (factor) :
##   Number of cylinders
## 
## 
##   length      n    NAs unique levels  dupes
##       32     32      0      3      3      y
##          100.0%   0.0%                     
## 
##    level  freq   perc  cumfreq  cumperc
## 1      8    14  43.8%       14    43.8%
## 2      4    11  34.4%       25    78.1%
## 3      6     7  21.9%       32   100.0%

## ------------------------------------------------------------------------------ 
## 3 - disp (numeric) :
##   Displacement (Liters)
## 
## 
##     length         n       NAs    unique        0s      mean     meanCI
##         32        32         0        27         0  3.780838   3.048591
##               100.0%      0.0%                0.0%             4.513086
##                                                                        
##        .05       .10       .25    median       .75       .90        .95
##   1.267534  1.320956  1.979959  3.216767  5.342160  6.489250   7.357761
##                                                                        
##      range        sd     vcoef       mad       IQR      skew       kurt
##   6.569546  2.030983  0.537178  2.301985  3.362202  0.381657  -1.207212
##                                                                        
## lowest : 1.165115, 1.240496, 1.289657, 1.294573, 1.558403
## highest: 5.899318 (2), 6.554798, 7.210278, 7.538018, 7.734662

## ------------------------------------------------------------------------------ 
## 4 - hp (numeric) :
##   Gross horsepower (100x)
## 
## 
##   length       n     NAs  unique      0s    mean   meanCI
##       32      32       0      22       0  1.4669   1.2197
##           100.0%    0.0%            0.0%           1.7141
##                                                          
##      .05     .10     .25  median     .75     .90      .95
##   0.6365  0.6600  0.9650  1.2300  1.8000  2.4350   2.5355
##                                                          
##    range      sd   vcoef     mad     IQR    skew     kurt
##   2.8300  0.6856  0.4674  0.7710  0.8350  0.7260  -0.1356
##                                                          
## lowest : 0.52, 0.62, 0.65, 0.66 (2), 0.91
## highest: 2.15, 2.3, 2.45 (2), 2.64, 3.35

## ------------------------------------------------------------------------------ 
## 5 - drat (numeric) :
##   Rear axle ratio
## 
## 
##   length       n     NAs  unique      0s    mean   meanCI
##       32      32       0      22       0  3.5966   3.4038
##           100.0%    0.0%            0.0%           3.7893
##                                                          
##      .05     .10     .25  median     .75     .90      .95
##   2.8535  3.0070  3.0800  3.6950  3.9200  4.2090   4.3145
##                                                          
##    range      sd   vcoef     mad     IQR    skew     kurt
##   2.1700  0.5347  0.1487  0.7042  0.8400  0.2659  -0.7147
##                                                          
## lowest : 2.76 (2), 2.93, 3.0, 3.07 (3), 3.08 (2)
## highest: 4.08 (2), 4.11, 4.22 (2), 4.43, 4.93

## ------------------------------------------------------------------------------ 
## 6 - wt (numeric) :
##   Weight (1000 lbs)
## 
## 
##    length        n      NAs   unique       0s     mean    meanCI
##        32       32        0       29        0  3.21725   2.86448
##             100.0%     0.0%              0.0%            3.57002
##                                                                 
##       .05      .10      .25   median      .75      .90       .95
##   1.73600  1.95550  2.58125  3.32500  3.61000  4.04750   5.29275
##                                                                 
##     range       sd    vcoef      mad      IQR     skew      kurt
##   3.91100  0.97846  0.30413  0.76725  1.02875  0.42315  -0.02271
##                                                                 
## lowest : 1.513, 1.615, 1.835, 1.935, 2.14
## highest: 3.845, 4.07, 5.25, 5.345, 5.424

## ------------------------------------------------------------------------------ 
## 7 - qsec (numeric) :
##   1/4 mile time
## 
## 
##    length        n      NAs   unique       0s     mean   meanCI
##        32       32        0       30        0  17.8488  17.2045
##             100.0%     0.0%              0.0%           18.4930
##                                                                
##       .05      .10      .25   median      .75      .90      .95
##   15.0455  15.5340  16.8925  17.7100  18.9000  19.9900  20.1045
##                                                                
##     range       sd    vcoef      mad      IQR     skew     kurt
##    8.4000   1.7869   0.1001   1.4159   2.0075   0.3690   0.3351
##                                                                
## lowest : 14.5, 14.6, 15.41, 15.5, 15.84
## highest: 19.9, 20.0, 20.01, 20.22, 22.9

## ------------------------------------------------------------------------------ 
## 8 - vs (factor - dichotomous) :
##   Engine
## 
## 
##   length      n    NAs unique
##       32     32      0      2
##          100.0%   0.0%       
## 
##           freq   perc  lci.95  uci.95'
## V-shaped    18  56.2%   39.3%   71.8%
## straight    14  43.8%   28.2%   60.7%
## 
## ' 95%-CI Wilson

## ------------------------------------------------------------------------------ 
## 9 - am (factor - dichotomous) :
##   Transmission
## 
## 
##   length      n    NAs unique
##       32     32      0      2
##          100.0%   0.0%       
## 
##            freq   perc  lci.95  uci.95'
## automatic    19  59.4%   42.3%   74.5%
## manual       13  40.6%   25.5%   57.7%
## 
## ' 95%-CI Wilson

## ------------------------------------------------------------------------------ 
## 10 - gear (factor) :
##   Number of forward gears
## 
## 
##   length      n    NAs unique levels  dupes
##       32     32      0      3      3      y
##          100.0%   0.0%                     
## 
##    level  freq   perc  cumfreq  cumperc
## 1      3    15  46.9%       15    46.9%
## 2      4    12  37.5%       27    84.4%
## 3      5     5  15.6%       32   100.0%

## ------------------------------------------------------------------------------ 
## 11 - carb (factor) :
##   Number of carburetors
## 
## 
##   length      n    NAs unique levels  dupes
##       32     32      0      6      6      y
##          100.0%   0.0%                     
## 
##    level  freq   perc  cumfreq  cumperc
## 1      2    10  31.2%       10    31.2%
## 2      4    10  31.2%       20    62.5%
## 3      1     7  21.9%       27    84.4%
## 4      3     3   9.4%       30    93.8%
## 5      6     1   3.1%       31    96.9%
## 6      8     1   3.1%       32   100.0%

MPG by Transmission

A boxplot of fuel efficiency by type of transmission shows that manual cars use less fuel than automatic cars. This difference is statisticaly signifficant accoding to a T-test.

ggplot(mtcars3, aes(y=`mpg`, x=`am`, color=`am`)) +
  geom_boxplot(varwidth = T) +
  stat_summary(fun.y = mean, na.rm=T, geom="point", pch=23, size=3) +
  ggpubr::stat_compare_means(method = "t.test") +
  labs(x="Transmission type", y="Miles / gallon") +
  theme(legend.position = "none")

Regression analysis

Models

I defined 5 regression model to study the relation between the type of transmission (binary variable am) and fuel consumption (in miles / gallon, variable mpg).

The unadjusted model is just another way to perform the same T-test as above. It is used as refference point for the other models.

I created 2 other models using all other variables as covariates. One of them uses numerical coding for numerical dicrete variables and the other codes them as factors.

In order to reduce the number of coeficients, I used a stepwise selection algorithm to select the most important covariates from both adjusted models.

# Unadjusted model: Miles / gallon by Transmission type
unadj <- lm(data=mtcars2, formula=`mpg`~`am`) # %>% summary

# Adjusted model: Miles / gallon by Transmission type, keepig all other covariates constant, retaing the numeric type of numeric factors
adj.num <- lm(data=mtcars2, formula=`mpg`~.) # %>% summary

# Adjusted model: Miles / gallon by Transmission type, keepig all other covariates constant, converting numeric factors to dummy variables
adj.cat <- lm(data=mtcars3, formula=`mpg`~.) # %>% summary

# Adjusted model with stepwise selection of covariates: Miles / gallon by Transmission type, keepig all other covariates constant, retaing the numeric type of numeric factors
adj.num.step <- step(lm(data=mtcars2, formula=`mpg`~.), trace = F) # %>% summary

# Adjusted model with stepwise selection of covariates: Miles / gallon by Transmission type, keepig all other covariates constant, converting numeric factors to dummy variables
adj.cat.step <- step(lm(data=mtcars3, formula=`mpg`~.), trace = F) # %>% summary

Full models coefficients

I used the sjPlot::plot_models function to create a chart of the coefficients for the two adjusted models and the unadjusted model. This chart shows each coeficient and its 95% confidence interval. There are 3 models, each with a different color (green: unadjusted; blue: adjusted, numeric coding; red: adjusted, qualitative coding).

The chart shows that unadjusted, a manual transmission leads to a significant average increase in fuel efficiency of about 7 miles/ gallon. However, adjusted for all other covariates, under both coding methods, this average differnce approaches 0 and it is no longer statistically singificant. In fact, under both coding methods, confidence intervals for all other covariates cross 0 and, therefore, are not statistically signifficantly different from 0.

sjPlot::plot_models(unadj, adj.num, adj.cat, 
                    std.est = NULL, # use "std" if you want to show standardized coefficients, which makes it easier to comparte variables measured on different scales (with very different numeric values).
                    show.intercept = F, # use TRUE to include the intercept
                    prefix.labels = "label", wrap.labels = 100,
                    title = "Miles/gallon by:",
                    axis.title = "Coefficients (95% conf. int.)",
                    legend.title = "Model", 
                    m.labels = c("Unadjusted", "Adjusted, numeric", "Adjusted, qualitative")) +
  
  # legend: top-right, transparent background
  theme(legend.position = c(1, 1), legend.justification = c(1, 1), 
        legend.background = element_blank(), legend.key = element_blank())

The table below, created using the function sjPlot:tab_model, shows the same coefficients with their 95% confidence interval and p-values, as well as R-squared values for each of the models. It shows that the unadjusted model only explains 33.8% of the variance of fuel usage while the adjsted models, using all variables, capture around 80% of the variance. The remaining 20% of the variance is not explained by any of the variabls in the mtcars dataset.

sjPlot::tab_model(unadj, adj.num, adj.cat,
                  collapse.ci = T,
                  dv.labels = c("Unadjusted", 
                                 "Adjusted, numeric", "Adjusted, qualitative"))

	Unadjusted		Adjusted, numeric		Adjusted, qualitative
Predictors	Estimates	p	Estimates	p	Estimates	p
(Intercept)	17.15 (14.85 – 19.44)	<0.001	12.30 (-26.62 – 51.23)	0.518	23.88 (-18.89 – 66.65)	0.253
Transmission: manual	7.24 (3.64 – 10.85)	<0.001	2.52 (-1.76 – 6.80)	0.234	1.21 (-5.64 – 8.06)	0.711
Number of cylinders			-0.11 (-2.28 – 2.06)	0.916
Displacement(Liters)			0.81 (-1.45 – 3.08)	0.463	2.17 (-1.98 – 6.32)	0.283
Gross horsepower(100 x)			-2.15 (-6.68 – 2.38)	0.335	-7.05 (-15.45 – 1.35)	0.094
Rear axle ratio			0.79 (-2.61 – 4.19)	0.635	1.18 (-4.11 – 6.48)	0.641
Weight(1000 lbs)			-3.72 (-7.65 – 0.22)	0.063	-4.53 (-9.94 – 0.88)	0.095
1/4 mile time			0.82 (-0.70 – 2.34)	0.274	0.37 (-1.63 – 2.36)	0.700
Engine: straight			0.32 (-4.06 – 4.69)	0.881	1.93 (-4.19 – 8.05)	0.512
Number of forward gears			0.66 (-2.45 – 3.76)	0.665
Number of carburetors			-0.20 (-1.92 – 1.52)	0.812
Number of cylinders: cyl 6					-2.65 (-9.13 – 3.83)	0.397
Number of cylinders: cyl 8					-0.34 (-15.60 – 14.92)	0.963
Number of forward gears: gear 4					1.11 (-6.98 – 9.21)	0.773
Number of forward gears: gear 5					2.53 (-5.44 – 10.49)	0.509
Number of carburetors: carb 2					-0.98 (-5.92 – 3.96)	0.679
Number of carburetors: carb 3					3.00 (-6.15 – 12.15)	0.495
Number of carburetors: carb 4					1.09 (-8.39 – 10.58)	0.810
Number of carburetors: carb 6					4.48 (-9.13 – 18.08)	0.494
Number of carburetors: carb 8					7.25 (-10.57 – 25.07)	0.399
Observations	32		32		32
R² / R² adjusted	0.360 / 0.338		0.869 / 0.807		0.893 / 0.779

Stepwise models coefficients

In order to reduce the number of coefficients, I used sepwise covariate selection to find the models that most efficiently explain fuel usage, starting from the two adjusted models.

Using numeric coding, I found that a manual transmission significantly increases fuel economy with only aproximativey 3 miles / gallon compared to automatic, when adjusted for car weight and 1/4 mile time. Also, each 1000lbs car weight reduces reduces fuel economy with approximatively 4 miles / gallon and a slower acceleration (each second added to the 1/4 mile time) increases fuel economy by about 1.2 miles / gallon.

Using factor coding, I found that a manual transmission did significantly increase fuel economy, adjusted for weight, horsepower and number of cylinders. However, larger weight reduced fuel economy by about 2.5 miles / gallon for every 1000 lbs, a more powerful motor reduced fuel economy by about 3.2 miles / gallon for every 100 hp and a 6 cylinder motor reduced fuel economy by about 3 miles / gallon compared to 4 cylinders.

sjPlot::plot_models(unadj, adj.num.step, adj.cat.step, 
                    std.est =  NULL, # use "std" if you want to show standardized coefficients, which makes it easier to comparte variables measured on different scales (with very different numeric values).
                    show.intercept = F, # use TRUE to include the intercept
                    prefix.labels = "label", wrap.labels = 100,
                    title = "Miles/gallon by:",
                    axis.title = "Coefficients (95% conf. int.)",
                    legend.title = "Model", 
                    m.labels = c("Unadjusted", 
                                 "Adjusted, numeric, stepwise", "Adjusted, qualitative, stepwise")) +
  
  # legend: top-right, transparent background
  theme(legend.position = c(1, 1), legend.justification = c(1, 1), 
        legend.background = element_blank(), legend.key = element_blank())

The table below shows that the adjsted stepwise models capture 83.4% and 84.0% of the variance in fuel usage, respectively. Theit unadjusted R-squared values are smaller than those of their full models because they include fewer variables while their adjusted R-squared values are larger because they only include the most informative variables.

sjPlot::tab_model(unadj, adj.num.step, adj.cat.step, 
                  collapse.ci = T,
                  dv.labels = c("Unadjusted", 
                                 "Adjusted, numeric, stepwise", "Adjusted, qualitative, stepwise"))

	Unadjusted		Adjusted, numeric, stepwise		Adjusted, qualitative, stepwise
Predictors	Estimates	p	Estimates	p	Estimates	p
(Intercept)	17.15 (14.85 – 19.44)	<0.001	9.62 (-4.64 – 23.87)	0.178	33.71 (28.35 – 39.06)	<0.001
Transmission: manual	7.24 (3.64 – 10.85)	<0.001	2.94 (0.05 – 5.83)	0.047	1.81 (-1.06 – 4.68)	0.206
Weight(1000 lbs)			-3.92 (-5.37 – -2.46)	<0.001	-2.50 (-4.32 – -0.68)	0.009
1/4 mile time			1.23 (0.63 – 1.82)	<0.001
Number of cylinders: cyl 6					-3.03 (-5.92 – -0.14)	0.041
Number of cylinders: cyl 8					-2.16 (-6.86 – 2.53)	0.352
Gross horsepower(100 x)					-3.21 (-6.03 – -0.40)	0.027
Observations	32		32		32
R² / R² adjusted	0.360 / 0.338		0.850 / 0.834		0.866 / 0.840

Model comparrisons

# Compare the unadjusted model to the adjusted models
anova(unadj, adj.num)

## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     30 720.90                                  
## 2     21 147.49  9     573.4 9.0711 1.779e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(unadj, adj.cat)

## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
##   Res.Df   RSS Df Sum of Sq      F   Pr(>F)   
## 1     30 720.9                                
## 2     15 120.4 15    600.49 4.9874 0.001759 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Compare the adjusted qualitative model to the numeric model
anova(adj.num, adj.cat)

## Analysis of Variance Table
## 
## Model 1: mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## Model 2: mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     21 147.49                           
## 2     15 120.40  6    27.092 0.5625 0.7536

The inclusion of other covariates significantly impoved the regression model, under both coding procedures (numeric: p<0.001, qualitative: p=0.0018). The two adjusted model did not significantly differ in their fit (p=0.7536).

# Compare the unadjusted model to the adjusted stepwise models
anova(unadj, adj.num.step)

## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ wt + qsec + am
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)    
## 1     30 720.90                                 
## 2     28 169.29  2    551.61 45.618 1.55e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(unadj, adj.cat.step)

## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ cyl + hp + wt + am
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     30 720.90                                  
## 2     26 151.03  4    569.87 24.527 1.688e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Compare the adjusted stepwise qualitative and numeric models
anova(adj.num.step, adj.cat.step)

## Analysis of Variance Table
## 
## Model 1: mpg ~ wt + qsec + am
## Model 2: mpg ~ cyl + hp + wt + am
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     28 169.29                           
## 2     26 151.03  2     18.26 1.5718 0.2268

With stepwise selection, the inclusion of other covariates significantly impoved the regression model, under both coding procedures (numeric: p<0.001, qualitative: p<0.001). The two adjusted stepwise models did not significantly differ in their fit (p=0.227).

# Compare the stepwise models to the full models
anova(adj.num, adj.num.step)

## Analysis of Variance Table
## 
## Model 1: mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## Model 2: mpg ~ wt + qsec + am
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     21 147.49                           
## 2     28 169.29 -7   -21.791 0.4432 0.8636

anova(adj.cat, adj.cat.step)

## Analysis of Variance Table
## 
## Model 1: mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## Model 2: mpg ~ cyl + hp + wt + am
##   Res.Df    RSS  Df Sum of Sq      F Pr(>F)
## 1     15 120.40                            
## 2     26 151.03 -11   -30.623 0.3468 0.9588

The stepwise models did not show signifficantly better fits compared to their models of origin (numeric: p=0.864, qualitative: p=0.959).

Model diagnostics

# A function that replicates the `plot(model)` functionality of base plotting system to ggplot

diagPlot <- function(model) {
  # Residuals vs. Fitted
  p1 <- ggplot(model, aes(x = .fitted, y = .resid)) + 
    geom_point() + stat_smooth() +
    # geom_hline(yintercept=mean(model$residuals), color="black", linetype="solid") +
    # geom_smooth(method="lm", se=F, color="black", linetype="solid", size=0.5)+
    geom_hline(yintercept=0, color="red", linetype="dashed") +
    labs(x="Fitted Values", y="Residuals", title="Residuals vs. Fitted")
  
  # Normal Q-Q
  p2 <- ggplot(model, aes(sample=.stdresid)) +
    geom_qq() + geom_qq_line(color="red", linetype="dashed") +
    labs(x="Theoretical Quantiles", y="Std. Residuals", title="Normal Q-Q")
  
  # Scale-Location
  p3 <- ggplot(model, aes(x = .fitted, y = sqrt(abs(.stdresid))))+
    geom_point(na.rm=T) + stat_smooth(na.rm = T) +
    labs(x="Fitted Value", y=expression(sqrt("|Std. Residuals|")), title="Scale-Location")
  
  # Residuals vs. Leverage vs. Cook's distance
  p4 <- ggplot(model, aes(x = .hat, y = .stdresid)) + 
    geom_point(aes(size=.cooksd, color=.cooksd), na.rm=T) + 
    stat_smooth(na.rm=T) + 
    geom_hline(yintercept=0, color="red", linetype="dashed") +
    scale_size_area() + 
    guides(size=guide_legend("Cook's Distance"), color=guide_legend("Cook's Distance"))+
    labs(x="Leverage", y="Std. Residuals", title="Residuals vs. Leverage") + 
    theme(legend.position="bottom")
  
  return(list(`Residuals vs. Fitted` = p1, 
              `Normal Q-Q` = p2, 
              `Scale-Location` = p3, 
              `Residuals vs. Leverage vs. Cook's distance` = p4))
}

Unadjusted model

The full diagnostics chart-grid is less usefull for the unadjusted model. The chart below shows the density distribution of the residuals, which are approximatively normal and centerd at 0.

ggplot(unadj, aes(x=.resid)) + 
  geom_density(fill="grey80", color="grey50") +
  ggpubr::stat_overlay_normal_density(color="black")+
  geom_vline(xintercept = 0, color="red", linetype="dashed") +
  labs(x="Residuals", y = "Density", 
       title="Residuals distribution", subtitle = "(Normal distribution overlay)")

Adjusted, numeric

The adjusted model, using numeric coding, does not show many allarming issues. The residuals are normally distributed and scattered around 0 but show a “U”-shaped trend as more cars are added. They show no pattern with regards to fitted values. As expected, larger residuals show the highest leverage.

cowplot::plot_grid(plotlist = diagPlot(adj.num), nrow=2)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Adjusted, qualitative

The adjusted model, using factor coding, does not show many allarming issues. The residuals are less normally distributed, scattered around 0 and do not show any particular pattern as more cars are added. There may be some heteroskedacity. As expected, larger residuals show the highest leverage.

cowplot::plot_grid(plotlist = diagPlot(adj.cat), nrow=2)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

## Warning: Removed 2 rows containing non-finite values (stat_qq).

## Warning: Removed 2 rows containing non-finite values (stat_qq_line).

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Adjusted, numeric, stepwise

The diagnostics for adjusted stepwise model with numerical coding look very similar to its original model.

cowplot::plot_grid(plotlist = diagPlot(adj.num.step), nrow=2)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Adjusted, qualitative, stepwise

The diagnostics for adjusted stepwise model with factor coding look very similar to its original model.

cowplot::plot_grid(plotlist = diagPlot(adj.cat.step), nrow=2)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Regression analysis of fuel efficiency using R dataset mtcars

Alex Istrate