Tutorial 6: Multivariate Linear Regression

knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)

# List of packages
packages <- c("tidyverse", "broom", "modelsummary", "sjPlot", "car", "knitr", "gt", "fst", "Lock5Data", "patchwork", "kableExtra", "broom", "sjmisc", "sjlabelled", "gt", "tinytable", "flextable", "webshot2", "officer")

# Install packages if they aren't installed already
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)

# Load the packages
lapply(packages, library, character.only = TRUE)

## [[1]]
##  [1] "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"     "readr"    
##  [7] "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"     "graphics" 
## [13] "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[2]]
##  [1] "broom"     "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"    
##  [7] "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"    
## [13] "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[3]]
##  [1] "modelsummary" "broom"        "lubridate"    "forcats"      "stringr"     
##  [6] "dplyr"        "purrr"        "readr"        "tidyr"        "tibble"      
## [11] "ggplot2"      "tidyverse"    "stats"        "graphics"     "grDevices"   
## [16] "utils"        "datasets"     "methods"      "base"        
## 
## [[4]]
##  [1] "sjPlot"       "modelsummary" "broom"        "lubridate"    "forcats"     
##  [6] "stringr"      "dplyr"        "purrr"        "readr"        "tidyr"       
## [11] "tibble"       "ggplot2"      "tidyverse"    "stats"        "graphics"    
## [16] "grDevices"    "utils"        "datasets"     "methods"      "base"        
## 
## [[5]]
##  [1] "car"          "carData"      "sjPlot"       "modelsummary" "broom"       
##  [6] "lubridate"    "forcats"      "stringr"      "dplyr"        "purrr"       
## [11] "readr"        "tidyr"        "tibble"       "ggplot2"      "tidyverse"   
## [16] "stats"        "graphics"     "grDevices"    "utils"        "datasets"    
## [21] "methods"      "base"        
## 
## [[6]]
##  [1] "knitr"        "car"          "carData"      "sjPlot"       "modelsummary"
##  [6] "broom"        "lubridate"    "forcats"      "stringr"      "dplyr"       
## [11] "purrr"        "readr"        "tidyr"        "tibble"       "ggplot2"     
## [16] "tidyverse"    "stats"        "graphics"     "grDevices"    "utils"       
## [21] "datasets"     "methods"      "base"        
## 
## [[7]]
##  [1] "gt"           "knitr"        "car"          "carData"      "sjPlot"      
##  [6] "modelsummary" "broom"        "lubridate"    "forcats"      "stringr"     
## [11] "dplyr"        "purrr"        "readr"        "tidyr"        "tibble"      
## [16] "ggplot2"      "tidyverse"    "stats"        "graphics"     "grDevices"   
## [21] "utils"        "datasets"     "methods"      "base"        
## 
## [[8]]
##  [1] "fst"          "gt"           "knitr"        "car"          "carData"     
##  [6] "sjPlot"       "modelsummary" "broom"        "lubridate"    "forcats"     
## [11] "stringr"      "dplyr"        "purrr"        "readr"        "tidyr"       
## [16] "tibble"       "ggplot2"      "tidyverse"    "stats"        "graphics"    
## [21] "grDevices"    "utils"        "datasets"     "methods"      "base"        
## 
## [[9]]
##  [1] "Lock5Data"    "fst"          "gt"           "knitr"        "car"         
##  [6] "carData"      "sjPlot"       "modelsummary" "broom"        "lubridate"   
## [11] "forcats"      "stringr"      "dplyr"        "purrr"        "readr"       
## [16] "tidyr"        "tibble"       "ggplot2"      "tidyverse"    "stats"       
## [21] "graphics"     "grDevices"    "utils"        "datasets"     "methods"     
## [26] "base"        
## 
## [[10]]
##  [1] "patchwork"    "Lock5Data"    "fst"          "gt"           "knitr"       
##  [6] "car"          "carData"      "sjPlot"       "modelsummary" "broom"       
## [11] "lubridate"    "forcats"      "stringr"      "dplyr"        "purrr"       
## [16] "readr"        "tidyr"        "tibble"       "ggplot2"      "tidyverse"   
## [21] "stats"        "graphics"     "grDevices"    "utils"        "datasets"    
## [26] "methods"      "base"        
## 
## [[11]]
##  [1] "kableExtra"   "patchwork"    "Lock5Data"    "fst"          "gt"          
##  [6] "knitr"        "car"          "carData"      "sjPlot"       "modelsummary"
## [11] "broom"        "lubridate"    "forcats"      "stringr"      "dplyr"       
## [16] "purrr"        "readr"        "tidyr"        "tibble"       "ggplot2"     
## [21] "tidyverse"    "stats"        "graphics"     "grDevices"    "utils"       
## [26] "datasets"     "methods"      "base"        
## 
## [[12]]
##  [1] "kableExtra"   "patchwork"    "Lock5Data"    "fst"          "gt"          
##  [6] "knitr"        "car"          "carData"      "sjPlot"       "modelsummary"
## [11] "broom"        "lubridate"    "forcats"      "stringr"      "dplyr"       
## [16] "purrr"        "readr"        "tidyr"        "tibble"       "ggplot2"     
## [21] "tidyverse"    "stats"        "graphics"     "grDevices"    "utils"       
## [26] "datasets"     "methods"      "base"        
## 
## [[13]]
##  [1] "sjmisc"       "kableExtra"   "patchwork"    "Lock5Data"    "fst"         
##  [6] "gt"           "knitr"        "car"          "carData"      "sjPlot"      
## [11] "modelsummary" "broom"        "lubridate"    "forcats"      "stringr"     
## [16] "dplyr"        "purrr"        "readr"        "tidyr"        "tibble"      
## [21] "ggplot2"      "tidyverse"    "stats"        "graphics"     "grDevices"   
## [26] "utils"        "datasets"     "methods"      "base"        
## 
## [[14]]
##  [1] "sjlabelled"   "sjmisc"       "kableExtra"   "patchwork"    "Lock5Data"   
##  [6] "fst"          "gt"           "knitr"        "car"          "carData"     
## [11] "sjPlot"       "modelsummary" "broom"        "lubridate"    "forcats"     
## [16] "stringr"      "dplyr"        "purrr"        "readr"        "tidyr"       
## [21] "tibble"       "ggplot2"      "tidyverse"    "stats"        "graphics"    
## [26] "grDevices"    "utils"        "datasets"     "methods"      "base"        
## 
## [[15]]
##  [1] "sjlabelled"   "sjmisc"       "kableExtra"   "patchwork"    "Lock5Data"   
##  [6] "fst"          "gt"           "knitr"        "car"          "carData"     
## [11] "sjPlot"       "modelsummary" "broom"        "lubridate"    "forcats"     
## [16] "stringr"      "dplyr"        "purrr"        "readr"        "tidyr"       
## [21] "tibble"       "ggplot2"      "tidyverse"    "stats"        "graphics"    
## [26] "grDevices"    "utils"        "datasets"     "methods"      "base"        
## 
## [[16]]
##  [1] "tinytable"    "sjlabelled"   "sjmisc"       "kableExtra"   "patchwork"   
##  [6] "Lock5Data"    "fst"          "gt"           "knitr"        "car"         
## [11] "carData"      "sjPlot"       "modelsummary" "broom"        "lubridate"   
## [16] "forcats"      "stringr"      "dplyr"        "purrr"        "readr"       
## [21] "tidyr"        "tibble"       "ggplot2"      "tidyverse"    "stats"       
## [26] "graphics"     "grDevices"    "utils"        "datasets"     "methods"     
## [31] "base"        
## 
## [[17]]
##  [1] "flextable"    "tinytable"    "sjlabelled"   "sjmisc"       "kableExtra"  
##  [6] "patchwork"    "Lock5Data"    "fst"          "gt"           "knitr"       
## [11] "car"          "carData"      "sjPlot"       "modelsummary" "broom"       
## [16] "lubridate"    "forcats"      "stringr"      "dplyr"        "purrr"       
## [21] "readr"        "tidyr"        "tibble"       "ggplot2"      "tidyverse"   
## [26] "stats"        "graphics"     "grDevices"    "utils"        "datasets"    
## [31] "methods"      "base"        
## 
## [[18]]
##  [1] "webshot2"     "flextable"    "tinytable"    "sjlabelled"   "sjmisc"      
##  [6] "kableExtra"   "patchwork"    "Lock5Data"    "fst"          "gt"          
## [11] "knitr"        "car"          "carData"      "sjPlot"       "modelsummary"
## [16] "broom"        "lubridate"    "forcats"      "stringr"      "dplyr"       
## [21] "purrr"        "readr"        "tidyr"        "tibble"       "ggplot2"     
## [26] "tidyverse"    "stats"        "graphics"     "grDevices"    "utils"       
## [31] "datasets"     "methods"      "base"        
## 
## [[19]]
##  [1] "officer"      "webshot2"     "flextable"    "tinytable"    "sjlabelled"  
##  [6] "sjmisc"       "kableExtra"   "patchwork"    "Lock5Data"    "fst"         
## [11] "gt"           "knitr"        "car"          "carData"      "sjPlot"      
## [16] "modelsummary" "broom"        "lubridate"    "forcats"      "stringr"     
## [21] "dplyr"        "purrr"        "readr"        "tidyr"        "tibble"      
## [26] "ggplot2"      "tidyverse"    "stats"        "graphics"     "grDevices"   
## [31] "utils"        "datasets"     "methods"      "base"

Understanding Multivariate Linear Regression

In our previous tutorial, we explored simple linear regression with a single predictor variable. Now, we’ll expand our toolkit by incorporating multiple predictors in our regression models.

In simple linear regression, we modeled an outcome as a function of one predictor:

$Y = \beta_0 + \beta_1X_1 + \varepsilon$

In multivariate regression, we extend this to include multiple predictors:

$Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_kX_k + \varepsilon$

Interpreting Coefficients in Multivariate Models

The transition from simple to multivariate regression changes how we interpret coefficients in important ways:

The Intercept (β₀): Represents the expected value of the outcome when all predictors equal zero (or their reference categories for categorical variables). This value may not always be meaningful in practical terms, especially if zero values for predictors aren’t realistic in your data.
Predictor Coefficients (β₁, β₂, etc.): Each coefficient represents the expected change in the outcome for a one-unit increase in that predictor while holding all other predictors constant. This “holding constant” aspect is crucial - it means we’re measuring the unique contribution of each variable.
Residual Term (ε): Captures all variation in the outcome not explained by the included predictors.

The Key Insight: “All Else Equal”

What makes multivariate regression powerful is its ability to isolate the relationship between each predictor and the outcome. When we say “β₁ equals 2.5,” we’re saying:

“For each one-unit increase in X₁, we expect Y to increase by 2.5 units, assuming all other predictors in the model remain unchanged.”

This helps us address confounding - when a relationship between two variables is influenced by a third variable related to both.

Visualizing Multivariate Regression

While simple regression can be easily visualized with a single line on a scatterplot, multivariate regression becomes more challenging to visualize. With two predictors, we can imagine a plane in three-dimensional space:

Y = β₀ + β₁X₁ + β₂X₂ + ε

Another way to visualize this is by showing how the relationship between Y and X₁ changes at different levels of X₂.

Here is a simulated example to help illustrate:

set.seed(123)
n <- 500
experience <- runif(n, 0, 30)  # Years of experience from 0-30

education <- sample(c("HS or Less", "College", "Graduate"), 
                    size = n, replace = TRUE, 
                    prob = c(0.4, 0.4, 0.2))  # Realistic distribution

income <- case_when(
  education == "HS or Less" ~ 25000 + 1200 * experience,
  education == "College" ~ 40000 + 1200 * experience,
  education == "Graduate" ~ 60000 + 1200 * experience
) + rnorm(n, 0, 10000)  # Add some random noise

sim_data <- data.frame(
  Income = income,
  Experience = experience,
  Education = factor(education, levels = c("HS or Less", "College", "Graduate"))
)

ggplot(sim_data, aes(x = Experience, y = Income, color = Education)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 1.2) +
  scale_y_continuous(labels = scales::dollar_format()) +
  scale_color_brewer(palette = "Set1") +
  theme_minimal(base_size = 12) +
  labs(
    title = "Effect of Work Experience on Income by Education Level",
    subtitle = "The slopes are parallel (same return to experience) but intercepts differ by education",
    x = "Years of Work Experience",
    y = "Annual Income",
    color = "Education Level"
  ) +
  theme(
    legend.position = "bottom",
    plot.title = element_text(face = "bold"),
    panel.grid.minor = element_blank()
  ) +
  annotate("text", x = 25, y = 30000, label = "Reference Category", 
          color = "#E41A1C", fontface = "bold") +
  annotate("text", x = 25, y = 55000, 
          label = "β2, second category = +$15,000\n(College effect)", color = "#377EB8") +
  annotate("text", x = 25, y = 80000, 
          label = "β2, third category = +$35,000\n(Graduate effect)", color = "#4DAF4A") +
  annotate("text", x = 5, y = 90000, 
          label = "β1 = +$1,200 per year\n(Experience effect)", 
          color = "black", fontface = "italic")

In this visualization:

Each colored line represents the relationship between a continuous predictor and the outcome at different levels of a categorical predictor (in our example, each line shows how income increases with experience for each education level)
The reference category (in our example, “HS or Less”) serves as the baseline against which other educational category levels are compared
The slopes are parallel across all category levels, meaning the effect of our continuous predictor (β₁) remains constant regardless of the categorical predictor’s value (in our example, each additional year of experience increases income by $1,200 regardless of education level)
The intercepts differ for each category level beyond the reference, represented by distinct coefficients (in our example, β₂, category two represents the $15,000 premium for “College” compared to “HS or Less”, while β₂, category two represents the $35,000 premium for “Graduate” compared to “HS or Less”)
Categorical variables shift the entire relationship up or down while preserving the slope of the continuous predictor (in our example, higher education levels shift the entire income trajectory upward)

In our regression equation form: $Y = \beta_0 + \beta_1X_1 + \beta_2D_2 + \beta_3D_3 + \varepsilon$ Where:

$\beta_0$ is the intercept for the reference category (expected outcome when X₁ is zero and categorical predictor is at reference level)
$\beta_1$ is the effect of the continuous predictor (slope of the line)
$\beta_2$ is the effect of the second category compared to reference
$\beta_3$ is the effect of the third category compared to reference
$D_2$ and $D_3$ are dummy variables (1 if observation belongs to that category, 0 otherwise)

This illustrates a key property of the standard multivariate regression model: the effect of each predictor (its slope) is constant across all values of other predictors. If this assumption doesn’t hold, we might need to consider interaction terms (which we’ll explore in later tutorials).

Let’s turn to a real-world example

Throughout this tutorial, we’ll focus on:

Understanding and interpreting coefficients in multivariate models
Comparing different models to evaluate improvement in explanatory power
Using model fit statistics to choose between competing models
Creating professional tables for multiple regression results

Let’s begin with the Happy Planet Index data from a teaching R package: https://search.r-project.org/CRAN/refmans/Lock5Data/html/HappyPlanetIndex.html

Example 1: Predicting the Happy Planet Index with Multiple Factors

Understanding the Happy Planet Index

The Happy Planet Index (HPI) provides a metric that measures sustainable well-being across nations. Unlike traditional measures like GDP, the HPI assesses a country’s performance in three key areas: life expectancy, self-reported well-being, and ecological footprint.

How it Works:
- It combines life expectancy, well-being, and ecological footprint data to create a score between 0 and 100.
- A score of 100 represents a country achieving a high life expectancy, high well-being, and a low ecological footprint.
- A score of 0 indicates the opposite: low life expectancy, low well-being, and a high ecological footprint.

For our analysis, we’ll examine what factors predict a country’s HPI score using variables that aren’t already part of the index calculation itself.

Exploring the Data

First, let’s load and examine the data:

data("HappyPlanetIndex")
World <- HappyPlanetIndex

str(World)

## 'data.frame':    143 obs. of  11 variables:
##  $ Country       : Factor w/ 143 levels "Albania","Algeria",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Region        : int  7 3 4 1 7 2 2 7 5 7 ...
##  $ Happiness     : num  5.5 5.6 4.3 7.1 5 7.9 7.8 5.3 5.3 5.8 ...
##  $ LifeExpectancy: num  76.2 71.7 41.7 74.8 71.7 80.9 79.4 67.1 63.1 68.7 ...
##  $ Footprint     : num  2.2 1.7 0.9 2.5 1.4 7.8 5 2.2 0.6 3.9 ...
##  $ HLY           : num  41.7 40.1 17.8 53.4 36.1 63.7 61.9 35.4 33.1 40.1 ...
##  $ HPI           : num  47.9 51.2 26.8 59 48.3 ...
##  $ HPIRank       : int  54 40 130 15 48 102 57 85 31 104 ...
##  $ GDPperCapita  : int  5316 7062 2335 14280 4945 31794 33700 5016 2053 7918 ...
##  $ HDI           : num  0.801 0.733 0.446 0.869 0.775 0.962 0.948 0.746 0.547 0.804 ...
##  $ Population    : num  3.15 32.85 16.1 38.75 3.02 ...

summary(World[, c("HPI", "GDPperCapita", "HDI", "Population", "Region")])

##       HPI         GDPperCapita        HDI           Population      
##  Min.   :16.59   Min.   :  667   Min.   :0.3360   Min.   :   0.290  
##  1st Qu.:34.47   1st Qu.: 2107   1st Qu.:0.5790   1st Qu.:   4.455  
##  Median :43.60   Median : 6632   Median :0.7720   Median :  10.480  
##  Mean   :43.38   Mean   :11275   Mean   :0.7291   Mean   :  44.145  
##  3rd Qu.:52.20   3rd Qu.:15711   3rd Qu.:0.8680   3rd Qu.:  31.225  
##  Max.   :76.12   Max.   :60228   Max.   :0.9680   Max.   :1304.500  
##                  NA's   :2       NA's   :2                          
##      Region     
##  Min.   :1.000  
##  1st Qu.:2.000  
##  Median :4.000  
##  Mean   :3.832  
##  3rd Qu.:6.000  
##  Max.   :7.000  
##

The key variables we’ll be working with include:

HPI: Happy Planet Index (0-100 scale) - our outcome variable
GDPperCapita: Gross domestic product per capita - measuring economic development
HDI: Human Development Index - measuring overall development
Population: Population in millions - capturing country size
Region: Geographic region of the country (1=Latin America, 2=Western nations, etc.)

Visualizing the HPI Distribution and Relationships

Let’s first look at the distribution of HPI scores across regions:

World$RegionLabel <- factor(World$Region, 
                          levels = 1:7,
                          labels = c("Latin America", "Western Nations", "Middle East", 
                                    "Sub-Saharan Africa", "South Asia", "East Asia", 
                                    "Former Communist"))

ggplot(World, aes(x = RegionLabel, y = HPI, fill = RegionLabel)) +
  geom_boxplot() +
  theme_minimal() +
  theme(axis.text.x = element_text(),
        legend.position = "none") +
  labs(x = "Region", y = "Happy Planet Index", 
       title = "Distribution of Happy Planet Index Scores by Region")

Now, let’s visualize the relationships between our outcome variable (HPI) and potential predictors:

theme_clean <- function() {
  theme_minimal(base_size = 12, base_family = "sans") +
    theme(
      plot.title = element_text(face = "bold", size = 14, margin = margin(b = 10)),
      plot.subtitle = element_text(size = 12, color = "gray30", margin = margin(b = 20)),
      axis.title = element_text(size = 11, color = "gray30"),
      axis.text = element_text(size = 10, color = "gray50"),
      panel.grid.minor = element_blank(),
      panel.grid.major = element_line(color = "gray90"),
      panel.background = element_rect(fill = "white", color = NA),
      strip.background = element_rect(fill = "gray95", color = NA),
      strip.text = element_text(face = "bold", size = 11, margin = margin(t = 5, b = 5)),
      legend.position = "bottom",
      legend.title = element_text(face = "bold", size = 10),
      legend.text = element_text(size = 9),
      plot.margin = margin(20, 20, 20, 20)
    )
}

p1 <- ggplot(World, aes(x = log10(GDPperCapita), y = HPI)) +
  geom_point(color = "#3182bd", alpha = 0.7, size = 2.5) +
  geom_smooth(method = "lm", color = "#e41a1c", fill = "#e41a1c", alpha = 0.1, se = TRUE) +
  labs(
    title = "HPI vs. GDP per Capita (Log Scale)",
    x = "Log10 GDP per Capita",
    y = "Happy Planet Index (0-100)"
  ) +
  theme_clean()

p2 <- ggplot(World, aes(x = log(Population), y = HPI)) +
  geom_point(color = "#3182bd", alpha = 0.7, size = 2.5) +
  geom_smooth(method = "lm", color = "#e41a1c", fill = "#e41a1c", alpha = 0.1, se = TRUE) +
  labs(
    title = "HPI vs. Population (Log Scale)",
    x = "Log Population",
    y = "Happy Planet Index (0-100)"
  ) +
  theme_clean()

(p1 + p2) +
  plot_annotation(
    title = "Relationships with Happy Planet Index",
    subtitle = "Exploring relations between HPI and various development indicators",
    theme = theme(
      plot.title = element_text(face = "bold", size = 16),
      plot.subtitle = element_text(size = 12, color = "gray30")
    )
  )

What is a logarithmic transformation?

A logarithmic transformation converts each value in a dataset to its logarithm, effectively compressing the scale of data that spans several orders of magnitude. For example, when we apply a log10 transformation to GDP per capita data, a value of 1,000 becomes 3, 10,000 becomes 4, and 100,000 becomes 5.

Why use logarithmic transformations?

We apply logarithmic transformations to our data for several important reasons:

Managing skewed distributions: Economic variables like GDP per capita and population are often right-skewed, with many countries having low to moderate values and a few having extremely high values. Logarithmic transformation helps normalize these distributions.
Revealing relationships: In many economic and social science contexts, relationships between variables are often multiplicative rather than additive. Log transformations can linearize these relationships, making patterns easier to detect.
Interpretive value: Changes in log-transformed data can be interpreted in terms of percentage changes or proportional differences, which is often more intuitive for economic variables. A change of 1 in log10 scale represents a 10-fold increase in the original variable.
Visual clarity: Without log transformation, countries with very high values (like the United States or Luxembourg for GDP per capita) would dominate the visualization, while differences among lower-income countries would be compressed and difficult to discern.

In our visualizations, we’ve applied log transformations to both GDP per capita and population to better visualize their relationships with the Happy Planet Index. This allows us to see patterns that might otherwise be obscured by the extreme ranges and skewed distributions of the original data.

Here are the visuals without log:

p1 <- ggplot(World, aes(x = GDPperCapita, y = HPI)) +
  geom_point(color = "#3182bd", alpha = 0.7, size = 2.5) +
  geom_smooth(method = "lm", color = "#e41a1c", fill = "#e41a1c", alpha = 0.1, se = TRUE) +
  labs(
    title = "HPI vs. GDP per Capita (No Log)",
    x = "GDP per Capita",
    y = "Happy Planet Index (0-100)"
  ) +
  theme_clean()

p2 <- ggplot(World, aes(x = Population, y = HPI)) +
  geom_point(color = "#3182bd", alpha = 0.7, size = 2.5) +
  geom_smooth(method = "lm", color = "#e41a1c", fill = "#e41a1c", alpha = 0.1, se = TRUE) +
  labs(
    title = "HPI vs. Population (No Log)",
    x = "Population",
    y = "Happy Planet Index (0-100)"
  ) +
  theme_clean()

(p1 + p2) +
  plot_annotation(
    title = "Relationships with Happy Planet Index",
    subtitle = "Exploring relations between HPI and various development indicators",
    theme = theme(
      plot.title = element_text(face = "bold", size = 16),
      plot.subtitle = element_text(size = 12, color = "gray30")
    )
  )

Building Progressive Regression Models

Now, let’s systematically build multiple regression models with an increasing number of predictors to see how adding variables improves our understanding of what contributes to a country’s Happy Planet Index score.

Model 1: Single Predictor - GDP per Capita

First, let’s start with a simple model using just GDP per Capita as a predictor:

model1 <- lm(HPI ~ log10(GDPperCapita), data = World)

summary(model1)

## 
## Call:
## lm(formula = HPI ~ log10(GDPperCapita), data = World)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.492  -9.489  -1.765  10.185  31.657 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)   
## (Intercept)           25.065      7.635   3.283   0.0013 **
## log10(GDPperCapita)    4.840      2.002   2.418   0.0169 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.32 on 139 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.04037,    Adjusted R-squared:  0.03347 
## F-statistic: 5.848 on 1 and 139 DF,  p-value: 0.01689

What do you note?

Model 2: Two Predictors - GDP per Capita and Population

Next, let’s add the Human Development Index as a second predictor:

model2 <- lm(HPI ~ log10(GDPperCapita) + log(Population), data = World)

summary(model2)

## 
## Call:
## lm(formula = HPI ~ log10(GDPperCapita) + log(Population), data = World)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.407  -8.980  -1.303   8.464  32.786 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)   
## (Intercept)          20.5791     8.0118   2.569   0.0113 * 
## log10(GDPperCapita)   5.2407     2.0007   2.619   0.0098 **
## log(Population)       1.1949     0.6902   1.731   0.0857 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.23 on 138 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.06077,    Adjusted R-squared:  0.04716 
## F-statistic: 4.464 on 2 and 138 DF,  p-value: 0.01322

What about now?

Model 3: Three Predictors - Adding Region

Finally, let’s add Region as a third predictor to capture geographic differences:

World <- World %>%
  mutate(Region = case_when(
    Region == 1 ~ "Latin America",
    Region == 2 ~ "Western Nations",
    Region == 3 ~ "Middle East",
    Region == 4 ~ "Sub-Saharan Africa",
    Region == 5 ~ "South Asia",
    Region == 6 ~ "East Asia",
    Region == 7 ~ "Former Communist Countries",
    TRUE ~ NA_character_
  ))

World$Region <- factor(World$Region, 
                              levels = c("Latin America", "Western Nations", "Middle East", 
                                         "Sub-Saharan Africa", "South Asia", "East Asia", 
                                         "Former Communist Countries"))

model3 <- lm(HPI ~ log10(GDPperCapita) + log(Population) + Region, data = World)

summary(model3)

## 
## Call:
## lm(formula = HPI ~ log10(GDPperCapita) + log(Population) + Region, 
##     data = World)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.7415  -4.9437   0.2663   4.3046  20.0580 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       79.7681     7.8446  10.169  < 2e-16 ***
## log10(GDPperCapita)               -6.1027     1.9975  -3.055  0.00272 ** 
## log(Population)                    0.5132     0.4300   1.193  0.23483    
## RegionWestern Nations            -11.4259     2.4849  -4.598 9.86e-06 ***
## RegionMiddle East                -10.9803     2.3438  -4.685 6.88e-06 ***
## RegionSub-Saharan Africa         -33.5979     2.2415 -14.989  < 2e-16 ***
## RegionSouth Asia                  -6.7016     3.2173  -2.083  0.03918 *  
## RegionEast Asia                   -5.7315     2.6208  -2.187  0.03051 *  
## RegionFormer Communist Countries -15.3240     2.0106  -7.622 4.35e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.084 on 132 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.6988, Adjusted R-squared:  0.6805 
## F-statistic: 38.28 on 8 and 132 DF,  p-value: < 2.2e-16

And now?

Model Comparison

Let’s systematically compare our three models to evaluate how adding predictors improves our ability to explain variations in national happiness.

tab_model(model1, model2, model3)

	HPI			HPI			HPI
Predictors	Estimates	CI	p	Estimates	CI	p	Estimates	CI	p
(Intercept)	25.07	9.97 – 40.16	0.001	20.58	4.74 – 36.42	0.011	79.77	64.25 – 95.29	<0.001
GDPperCapita [log10]	4.84	0.88 – 8.80	0.017	5.24	1.28 – 9.20	0.010	-6.10	-10.05 – -2.15	0.003
Population [log]				1.19	-0.17 – 2.56	0.086	0.51	-0.34 – 1.36	0.235
Region [Western Nations]							-11.43	-16.34 – -6.51	<0.001
Region [Middle East]							-10.98	-15.62 – -6.34	<0.001
Region [Sub-Saharan Africa]							-33.60	-38.03 – -29.16	<0.001
Region [South Asia]							-6.70	-13.07 – -0.34	0.039
Region [East Asia]							-5.73	-10.92 – -0.55	0.031
Region [Former Communist Countries]							-15.32	-19.30 – -11.35	<0.001
Observations	141			141			141
R² / R² adjusted	0.040 / 0.033			0.061 / 0.047			0.699 / 0.681

Interpreting the Multivariate Regression Models for Happy Planet Index

Model 1: GDP per Capita Only

In this basic model with GDP per Capita (log transformed) as our only predictor:

Intercept (25.07): When GDP per Capita (log10) equals zero (which is not a meaningful value in our dataset), the expected HPI score is 25.07.
GDP per Capita coefficient (4.84): For each 1-unit increase in log10 of GDP per Capita, the expected HPI score increases by 4.84 points. Since we’re using a log transformation, this means that a 10-fold increase in GDP per Capita is associated with a 4.84 point increase in HPI.
R² = 0.040: This model explains only 4% of the variation in HPI scores across countries, indicating GDP per Capita alone is a weak predictor of the Happy Planet Index.

Model 2: GDP per Capita and Population

When we add Population (log transformed) as a second predictor:

Intercept (20.58): When both GDP per Capita (log10) and Population (log) equal zero, the expected HPI score is 20.58 (again, not meaningful in practical terms).
GDP per Capita coefficient (5.24): After adjusting for Population, the relationship between GDP and HPI remains positive and slightly stronger. For each 1-unit increase in log10 of GDP per Capita (holding Population constant), the expected HPI score increases by 5.24 points.
Population coefficient (1.19): For each 1-unit increase in the natural log of Population (holding GDP per Capita constant), the expected HPI score increases by 1.19 points, although this effect is not statistically significant (p = 0.086). This suggests that larger countries might have slightly higher HPI scores, but the evidence is not conclusive.
R² = 0.061: This model explains 6.1% of the variation in HPI scores, a modest improvement over Model 1. The addition of Population as a predictor only marginally increases our ability to explain variations in HPI.

Model 3: GDP per Capita, Population, and Region

In our most comprehensive model, adding regional differences:

Intercept (79.77): This represents the expected HPI score for Latin American countries (the reference region) when GDP per Capita (log10) and Population (log) equal zero.
GDP per Capita coefficient (-6.10): The relationship between GDP and HPI becomes negative after controlling for Population and Region. For each 1-unit increase in log10 of GDP per Capita (holding Population and Region constant), the expected HPI score decreases by 6.10 points. This suggests that, within the same region and for countries of similar population size, higher GDP is associated with lower HPI scores.
Population coefficient (0.51): The effect of Population is reduced and remains non-significant. For each 1-unit increase in log Population (holding other variables constant), the expected HPI score increases by 0.51 points.
Region coefficients: These represent the average difference in HPI between each region and Latin America (the reference region), adjusting for GDP and Population:
- Western Nations: 11.43 points lower
- Middle East: 10.98 points lower
- Sub-Saharan Africa: 33.60 points lower
- South Asia: 6.70 points lower
- East Asia: 5.73 points lower
- Former Communist Countries: 15.32 points lower
All regional differences are statistically significant, indicating substantial variation in HPI scores across different parts of the world even after accounting for economic development and population size.
R² = 0.699: Our final model explains 69.9% of the variation in HPI scores, a substantial improvement over Model 2. This indicates that regional differences play a major role in explaining variations in HPI beyond economic and demographic factors.

Summary of Model Comparisons

The progression of our models reveals several important insights:

GDP alone is a poor predictor of HPI (R² = 0.04), suggesting that economic wealth by itself does not translate directly into sustainable well-being.
Adding Population provides minimal improvement (increasing R² only to 0.061), indicating that country size has little explanatory power for sustainable well-being outcomes.
Adding Region delivers the most comprehensive model (R² = 0.699), indicating that geographical, cultural, or policy differences between regions play a crucial role in determining HPI scores that cannot be explained by economic or demographic metrics alone.
The relationship between GDP and HPI changes direction when controlling for region, suggesting that the positive association in simpler models may be confounded by regional differences. Within regions, higher GDP may actually be associated with lower sustainable well-being.

This analysis suggests that creating conditions for sustainable well-being involves more than just economic growth - regional factors like cultural values, policy choices, and environmental practices appear to be the dominant influences on a country’s HPI score. The negative relationship between GDP and HPI within regions may reflect the environmental costs of economic development that are captured in the HPI’s ecological footprint component.

plot_model(model3, show.values = TRUE)

Understanding Model Fit Statistics

To decide which model provides the best balance between explanatory power and parsimony, we can examine several key statistics:

AIC and BIC

aic_bic_data <- data.frame(
  Model = c("Model 1", "Model 2", "Model 3"),
  AIC = c(AIC(model1), AIC(model2), AIC(model3)),
  BIC = c(BIC(model1), BIC(model2), BIC(model3))
)

kable(aic_bic_data, digits = 1, caption = "Comparison of AIC and BIC Across Models")

Comparison of AIC and BIC Across Models
Model	AIC	BIC
Model 1	1112.3	1121.2
Model 2	1111.3	1123.1
Model 3	962.9	992.4

Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) are measures used for model selection. Unlike R-squared, lower values indicate better model fit. Both balance goodness of fit against model complexity, with BIC penalizing additional parameters more strongly than AIC.

So both AIC and BIC are measures used for model selection:

Lower values of AIC and BIC indicate better model fit
Both metrics balance goodness of fit against model complexity
Both penalize models for additional parameters to prevent overfitting
BIC typically applies a stronger penalty for model complexity than AIC

Comparing the Models

Model 1: GDP per Capita Only

AIC: 1111.3
BIC: 1123.1

Barely any change!

Model 2: GDP per Capita and Life Expectancy

AIC: 1028.8
BIC: 1040.6
We see a substantial decrease! This represents strong evidence that Model 3 is much preferred to Model 1 and Model 2.

Model 3: GDP per Capita, Life Expectancy, and Region

AIC: 962.9
BIC: 992.4
Compared to Model 2, there’s a further decrease of 76.4 points in AIC and 58.7 points in BIC.
This again represents evidence favoring Model 3 over Model 2.

Conclusion from AIC and BIC

Even though Model 3 includes more parameters (which incurs penalties in both AIC and BIC calculations), the improvement in fit more than compensates for this added complexity. This aligns with our previous observation using R² that regional differences play a crucial role in explaining variations in HPI.

Example 2: Community Satisfaction in Atlantic Canada

For our second example, we will analyze factors that associated with a sense of community belonging using the Canadian Housing Survey (CHS) 2021 data. We will focus specifically on Atlantic Canada to understand regional variations in community satisfaction and how demographic and social factors relate to feeling part of one’s community.

Understanding the Outcome Variable

Our outcome variable measures community satisfaction based on the question:

“Using a scale of 0 to 10, where 0 means ‘Very dissatisfied’ and 10 means ‘Very satisfied’, how satisfied are you with feeling as part of your community?”

This variable (PCOS_05) captures an important dimension of social wellbeing that goes beyond mere housing characteristics, reflecting how connected individuals feel to their surrounding community.

Loading and Exploring the Data

First, let’s load the Canadian Housing Survey data and examine its structure:

chs_data <- read.csv("CHS2021ECL_PUMF.csv")

dim(chs_data)

## [1] 40988   109

Here, we really need to refer to the survey documentation to know how to handle the variables, which is found here: https://www23.statcan.gc.ca/imdb/p2SV.pl?Function=getSurvVariableList&Id=1405275

variables_of_interest <- c("PCOS_05", "PGEOGR", "PRSPGNDR", "PAGEP1", "CER_05", "EHA_10B", "PHHTTINC")
head(chs_data[, variables_of_interest])

##   PCOS_05 PGEOGR PRSPGNDR PAGEP1 CER_05 EHA_10B PHHTTINC
## 1       9      3        2      3      2       6 7.50e+04
## 2       5     26        2      3      1       2 9.25e+04
## 3       1     22        2      4      2       2 6.00e+04
## 4       6     22        1      2      2       6 1.90e+05
## 5       4     16        1      2      2       6 9.75e+04
## 6      99     10        1      1      2       6 1.00e+11

One variable check example:

table(chs_data$EHA_10B)

## 
##     1     2     6     9 
##  3822  3105 33852   209

Data Preparation for Atlantic Canada Analysis

Subset to Atlantic Canada regions
Select and recode relevant variables
Handle missing values

atlantic_chs <- chs_data %>%
  filter(PGEOGR %in% 1:6) %>%
  select(
    outcome = PCOS_05,              # Community satisfaction (0-10)
    region = PGEOGR,                # Geographic region
    gender = PRSPGNDR,              # Gender of reference person
    community_group = CER_05,       # Participation in community group
    economic_hardship = EHA_10B,    # Economic hardship due to COVID-19
    household_income = PHHTTINC     # Household income (for reference)
  )

Now let’s clean:

atlantic_chs <- atlantic_chs %>%
  mutate(
    # Clean outcome variable (community satisfaction 0-10)
    outcome = case_when(
      outcome < 0 ~ NA_real_,
      outcome > 10 ~ NA_real_,
      TRUE ~ as.numeric(outcome)
    ),
    
    # Create descriptive region names
    region_name = factor(region,
                       levels = 1:6,
                       labels = c("Newfoundland and Labrador", 
                                 "Prince Edward Island", 
                                 "Halifax", 
                                 "Outside Halifax - NS", 
                                 "Saint John and Moncton", 
                                 "Outside Saint John and Moncton - NB")),
    
    # Recode gender as factor
    gender = factor(gender,
                  levels = c(1, 2),
                  labels = c("Male", "Female")),
    
    # Recode community group participation as factor
    community_group = factor(community_group,
                          levels = c(1, 2),
                          labels = c("Yes", "No"))
  )

str(atlantic_chs)

## 'data.frame':    11263 obs. of  7 variables:
##  $ outcome          : num  9 9 4 9 9 7 9 8 7 4 ...
##  $ region           : int  3 4 1 4 4 1 1 2 5 3 ...
##  $ gender           : Factor w/ 2 levels "Male","Female": 2 2 1 2 1 1 2 1 1 2 ...
##  $ community_group  : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
##  $ economic_hardship: int  6 6 6 6 6 6 1 6 6 1 ...
##  $ household_income : num  7.50e+04 4.80e+04 1.55e+04 5.00e+04 1.00e+11 ...
##  $ region_name      : Factor w/ 6 levels "Newfoundland and Labrador",..: 3 4 1 4 4 1 1 2 5 3 ...

summary(atlantic_chs)

##     outcome          region         gender     community_group
##  Min.   :1.000   Min.   :1.000   Male  :4753   Yes :1720      
##  1st Qu.:5.000   1st Qu.:2.000   Female:6510   No  :9541      
##  Median :7.000   Median :4.000                 NA's:   2      
##  Mean   :6.439   Mean   :3.484                                
##  3rd Qu.:8.000   3rd Qu.:5.000                                
##  Max.   :9.000   Max.   :6.000                                
##  NA's   :124                                                  
##  economic_hardship household_income   
##  Min.   :1.000     Min.   :7.750e+02  
##  1st Qu.:6.000     1st Qu.:4.000e+04  
##  Median :6.000     Median :8.250e+04  
##  Mean   :5.219     Mean   :2.251e+10  
##  3rd Qu.:6.000     3rd Qu.:2.300e+05  
##  Max.   :9.000     Max.   :1.000e+11  
##                                       
##                               region_name  
##  Newfoundland and Labrador          :2446  
##  Prince Edward Island               :1714  
##  Halifax                            :1088  
##  Outside Halifax - NS               :1977  
##  Saint John and Moncton             :2037  
##  Outside Saint John and Moncton - NB:2001  
##

Exploratory Data Analysis

Let’s explore community satisfaction across regions and demographic groups:

atlantic_chs %>%
  group_by(region_name) %>%
  summarize(
    mean_satisfaction = mean(outcome, na.rm = TRUE),
    median_satisfaction = median(outcome, na.rm = TRUE),
    sd_satisfaction = sd(outcome, na.rm = TRUE),
    n = n()
  ) %>%
  arrange(desc(mean_satisfaction)) %>%
  kable(digits = 2, caption = "Community Satisfaction by Region in Atlantic Canada")

Community Satisfaction by Region in Atlantic Canada
region_name	mean_satisfaction	median_satisfaction	sd_satisfaction	n
Outside Halifax - NS	6.60	7	2.14	1977
Outside Saint John and Moncton - NB	6.59	7	2.16	2001
Newfoundland and Labrador	6.58	7	2.16	2446
Prince Edward Island	6.56	7	2.12	1714
Saint John and Moncton	6.20	7	2.24	2037
Halifax	5.83	6	2.25	1088

ggplot(atlantic_chs, aes(x = region_name, y = outcome, fill = region_name)) +
  geom_boxplot() +
  theme_minimal() +
  theme(axis.text.x = element_text(),
        legend.position = "none") +
  labs(x = "Region", y = "Community Satisfaction (0-10)", 
       title = "Community Satisfaction by Region in Atlantic Canada")

Building Progressive Regression Models

Now we’ll systematically build models to predict community satisfaction, starting with regional differences and then adding demographic and social factors:

Model 1: Regional Differences in Community Satisfaction

chs_model1 <- lm(outcome ~ region_name, data = atlantic_chs)

summary(chs_model1)

## 
## Call:
## lm(formula = outcome ~ region_name, data = atlantic_chs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.6019 -1.5769  0.4144  1.7989  3.1730 
## 
## Coefficients:
##                                                Estimate Std. Error t value
## (Intercept)                                     6.57686    0.04422 148.736
## region_namePrince Edward Island                -0.01750    0.06892  -0.254
## region_nameHalifax                             -0.74985    0.07958  -9.423
## region_nameOutside Halifax - NS                 0.02509    0.06618   0.379
## region_nameSaint John and Moncton              -0.37577    0.06561  -5.727
## region_nameOutside Saint John and Moncton - NB  0.00879    0.06593   0.133
##                                                Pr(>|t|)    
## (Intercept)                                     < 2e-16 ***
## region_namePrince Edward Island                   0.800    
## region_nameHalifax                              < 2e-16 ***
## region_nameOutside Halifax - NS                   0.705    
## region_nameSaint John and Moncton              1.05e-08 ***
## region_nameOutside Saint John and Moncton - NB    0.894    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.175 on 11133 degrees of freedom
##   (124 observations deleted due to missingness)
## Multiple R-squared:  0.01281,    Adjusted R-squared:  0.01237 
## F-statistic:  28.9 on 5 and 11133 DF,  p-value: < 2.2e-16

What do you note?

Model 2: Adding Gender

chs_model2 <- lm(outcome ~ region_name + gender, data = atlantic_chs)

summary(chs_model2)

## 
## Call:
## lm(formula = outcome ~ region_name + gender, data = atlantic_chs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.6819 -1.5464  0.3613  1.8572  3.2328 
## 
## Coefficients:
##                                                 Estimate Std. Error t value
## (Intercept)                                     6.655339   0.050378 132.109
## region_namePrince Edward Island                -0.016683   0.068892  -0.242
## region_nameHalifax                             -0.752638   0.079547  -9.462
## region_nameOutside Halifax - NS                 0.026554   0.066149   0.401
## region_nameSaint John and Moncton              -0.377098   0.065583  -5.750
## region_nameOutside Saint John and Moncton - NB  0.008755   0.065898   0.133
## genderFemale                                   -0.135464   0.041725  -3.247
##                                                Pr(>|t|)    
## (Intercept)                                     < 2e-16 ***
## region_namePrince Edward Island                 0.80866    
## region_nameHalifax                              < 2e-16 ***
## region_nameOutside Halifax - NS                 0.68812    
## region_nameSaint John and Moncton              9.16e-09 ***
## region_nameOutside Saint John and Moncton - NB  0.89431    
## genderFemale                                    0.00117 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.174 on 11132 degrees of freedom
##   (124 observations deleted due to missingness)
## Multiple R-squared:  0.01375,    Adjusted R-squared:  0.01322 
## F-statistic: 25.86 on 6 and 11132 DF,  p-value: < 2.2e-16

Model 3: Adding Community Group Participation

chs_model3 <- lm(outcome ~ region_name + gender + community_group, data = atlantic_chs)

summary(chs_model3)

## 
## Call:
## lm(formula = outcome ~ region_name + gender + community_group, 
##     data = atlantic_chs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.1261 -1.5538  0.4242  1.8739  3.3165 
## 
## Coefficients:
##                                                 Estimate Std. Error t value
## (Intercept)                                     7.118677   0.069993 101.706
## region_namePrince Edward Island                -0.022015   0.068617  -0.321
## region_nameHalifax                             -0.750103   0.079253  -9.465
## region_nameOutside Halifax - NS                 0.006897   0.065915   0.105
## region_nameSaint John and Moncton              -0.353285   0.065367  -5.405
## region_nameOutside Saint John and Moncton - NB  0.007439   0.065632   0.113
## genderFemale                                   -0.142206   0.041567  -3.421
## community_groupNo                              -0.542883   0.057132  -9.502
##                                                Pr(>|t|)    
## (Intercept)                                     < 2e-16 ***
## region_namePrince Edward Island                0.748340    
## region_nameHalifax                              < 2e-16 ***
## region_nameOutside Halifax - NS                0.916669    
## region_nameSaint John and Moncton              6.63e-08 ***
## region_nameOutside Saint John and Moncton - NB 0.909759    
## genderFemale                                   0.000626 ***
## community_groupNo                               < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.166 on 11130 degrees of freedom
##   (125 observations deleted due to missingness)
## Multiple R-squared:  0.02174,    Adjusted R-squared:  0.02113 
## F-statistic: 35.34 on 7 and 11130 DF,  p-value: < 2.2e-16

plot_model(chs_model3, show.values = TRUE)

Creating professional regression tables

pred_labels <- c(
  "(Intercept)" = "Intercept",
  "region_namePrince Edward Island" = "Prince Edward Island",
  "region_nameHalifax" = "Halifax",
  "region_nameOutside Halifax - NS" = "Outside Halifax (NS)",
  "region_nameSaint John and Moncton" = "Saint John & Moncton",
  "region_nameOutside Saint John and Moncton - NB" = "Outside SJ & Moncton (NB)",
  "genderFemale" = "Female",
  "community_groupNo" = "No Community Group"
)

tab_model(
  chs_model1, chs_model2, chs_model3,
  pred.labels = pred_labels,
  dv.labels = c("Model 1:\nRegional", "Model 2:\n+ Gender", "Model 3:\n+ Community"),
  string.pred = "Predictors",
  string.est = "β",
  string.ci = "95% CI",
  string.p = "p-value",
  p.style = "stars",
  p.threshold = c(0.05, 0.01, 0.001),
  show.reflvl = TRUE,
  show.aic = TRUE,
  show.r2 = TRUE,
  show.fstat = TRUE,
  show.loglik = FALSE,
  digits = 3,
  title = "Table 2. Community Satisfaction in Atlantic Canada",
  CSS = list(
    css.table = "font-size: 12px; width: auto; margin-bottom: 20px;",
    css.thead = "font-weight: bold; border-bottom: 1px solid #ddd;",
    css.tdata = "padding: 5px; border-bottom: 1px solid #f0f0f0;",
    css.caption = "font-weight: bold; color: #333333; font-size: 14px; margin-bottom: 10px;",
    css.footnote = "font-size: 11px; color: #555555;",
    css.depvarhead = "color: #333333; font-weight: bold; text-align: center; border-bottom: 1px solid #ddd;",
    css.centeralign = "text-align: center;",
    css.firsttablecol = "font-weight: bold; font-size: 12px;"
  )
)

Table 2. Community Satisfaction in Atlantic Canada
	Model 1: Regional		Model 2: + Gender		Model 3: + Community
Predictors	β	95% CI	β	95% CI	β	95% CI
Intercept	6.577 ^***	6.490 – 6.664	6.655 ^***	6.557 – 6.754	7.119 ^***	6.981 – 7.256
Prince Edward Island	-0.017	-0.153 – 0.118	-0.017	-0.152 – 0.118	-0.022	-0.157 – 0.112
Halifax	-0.750 ^***	-0.906 – -0.594	-0.753 ^***	-0.909 – -0.597	-0.750 ^***	-0.905 – -0.595
Outside Halifax (NS)	0.025	-0.105 – 0.155	0.027	-0.103 – 0.156	0.007	-0.122 – 0.136
Saint John & Moncton	-0.376 ^***	-0.504 – -0.247	-0.377 ^***	-0.506 – -0.249	-0.353 ^***	-0.481 – -0.225
Outside SJ & Moncton (NB)	0.009	-0.120 – 0.138	0.009	-0.120 – 0.138	0.007	-0.121 – 0.136
Female			-0.135 ^**	-0.217 – -0.054	-0.142 ^***	-0.224 – -0.061
No Community Group					-0.543 ^***	-0.655 – -0.431
Observations	11139		11139		11138
R² / R² adjusted	0.013 / 0.012		0.014 / 0.013		0.022 / 0.021
AIC	48932.313		48923.771		48830.355
p<0.05 p<0.01 * p<0.001

models <- list(
  "Regional" = lm(outcome ~ region_name, data = atlantic_chs),
  "Gender" = lm(outcome ~ region_name + gender, data = atlantic_chs),
  "Full" = lm(outcome ~ region_name + gender + community_group, data = atlantic_chs)
)

# Create custom coefficient labels
cm <- c(
  "(Intercept)" = "Intercept",
  "region_namePrince Edward Island" = "Prince Edward Island",
  "region_nameHalifax" = "Halifax",
  "region_nameOutside Halifax - NS" = "Outside Halifax (NS)",
  "region_nameSaint John and Moncton" = "Saint John and Moncton",
  "region_nameOutside Saint John and Moncton - NB" = "Outside SJ & Moncton (NB)",
  "genderFemale" = "Female",
  "community_groupNo" = "No Community Group"
)

# Create custom goodness-of-fit metrics
gm <- tribble(
  ~raw,        ~clean,          ~fmt,
  "nobs",      "N",             0,
  "r.squared", "R²",            2,
  "adj.r.squared", "Adjusted R²", 2,
  "AIC",       "AIC",           1,
  "BIC",       "BIC",           1,
  "rmse",      "RMSE",          2
)

Let’s try different styles:

table1 <- modelsummary(
  models,
  coef_map = cm,
  gof_map = gm,
  stars = TRUE,
  title = "Community Satisfaction in Atlantic Canada",
  notes = "Data source: Canadian Housing Survey (2021)"
)
table1

Community Satisfaction in Atlantic Canada
	Regional	Gender	Full
+ p < 0.1, * p < 0.05, p < 0.01, * p < 0.001
Data source: Canadian Housing Survey (2021)
Intercept	6.577***	6.655***	7.119***
	(0.044)	(0.050)	(0.070)
Prince Edward Island	-0.017	-0.017	-0.022
	(0.069)	(0.069)	(0.069)
Halifax	-0.750***	-0.753***	-0.750***
	(0.080)	(0.080)	(0.079)
Outside Halifax (NS)	0.025	0.027	0.007
	(0.066)	(0.066)	(0.066)
Saint John and Moncton	-0.376***	-0.377***	-0.353***
	(0.066)	(0.066)	(0.065)
Outside SJ & Moncton (NB)	0.009	0.009	0.007
	(0.066)	(0.066)	(0.066)
Female		-0.135**	-0.142***
		(0.042)	(0.042)
No Community Group			-0.543***
			(0.057)
N	11139	11139	11138
R²	0.01	0.01	0.02
Adjusted R²	0.01	0.01	0.02
RMSE	2.17	2.17	2.16

table2 <- modelsummary(
  models, 
  output = "gt",
  coef_map = cm,
  gof_map = gm,
  stars = TRUE,
  title = "Community Satisfaction in Atlantic Canada"
)

table2 <- table2 %>%
  gt::tab_options(
    heading.title.font.size = 14,
    heading.subtitle.font.size = 12,
    column_labels.font.weight = "bold",
    table.border.top.width = px(3),
    table.border.top.color = "black",
    table.border.bottom.color = "black",
    table.border.bottom.width = px(3),
    column_labels.border.bottom.width = px(2),
    column_labels.border.bottom.color = "black",
    stub.border.color = "white",
    data_row.padding = px(4)
  ) %>%
  gt::tab_style(
    style = list(
      gt::cell_fill(color = "#f5f5f5")
    ),
    locations = gt::cells_body(rows = seq(2, 20, 2))
  ) %>%
  gt::tab_source_note(source_note = "Data source: Canadian Housing Survey (2021)")

table2

Community Satisfaction in Atlantic Canada
	Regional	Gender	Full
Intercept	6.577***	6.655***	7.119***
	(0.044)	(0.050)	(0.070)
Prince Edward Island	-0.017	-0.017	-0.022
	(0.069)	(0.069)	(0.069)
Halifax	-0.750***	-0.753***	-0.750***
	(0.080)	(0.080)	(0.079)
Outside Halifax (NS)	0.025	0.027	0.007
	(0.066)	(0.066)	(0.066)
Saint John and Moncton	-0.376***	-0.377***	-0.353***
	(0.066)	(0.066)	(0.065)
Outside SJ & Moncton (NB)	0.009	0.009	0.007
	(0.066)	(0.066)	(0.066)
Female		-0.135**	-0.142***
		(0.042)	(0.042)
No Community Group			-0.543***
			(0.057)
N	11139	11139	11138
R²	0.01	0.01	0.02
Adjusted R²	0.01	0.01	0.02
RMSE	2.17	2.17	2.16
+ p < 0.1, * p < 0.05, p < 0.01, * p < 0.001
Data source: Canadian Housing Survey (2021)

table3 <- modelsummary(
  models,
  output = "kableExtra",
  coef_map = cm,
  gof_map = gm,
  stars = TRUE,
) %>%
  kableExtra::kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE,
    position = "center"
  ) %>%
  kableExtra::add_header_above(c(" " = 1, "Atlantic Canada Community Satisfaction Models" = 3)) %>%
  kableExtra::footnote(
    general = "Data source: Canadian Housing Survey (2021)",
    number = c(
      "Reference category for region is 'Newfoundland and Labrador'",
      "Reference category for gender is 'Male'",
      "Reference category for community group is 'Yes'"
    ),
    symbol = c("* p < 0.05, ** p < 0.01, *** p < 0.001")
  )
table3

	Atlantic Canada Community Satisfaction Models
	Regional	Gender	Full
Intercept	6.577***	6.655***	7.119***
	(0.044)	(0.050)	(0.070)
Prince Edward Island	−0.017	−0.017	−0.022
	(0.069)	(0.069)	(0.069)
Halifax	−0.750***	−0.753***	−0.750***
	(0.080)	(0.080)	(0.079)
Outside Halifax (NS)	0.025	0.027	0.007
	(0.066)	(0.066)	(0.066)
Saint John and Moncton	−0.376***	−0.377***	−0.353***
	(0.066)	(0.066)	(0.065)
Outside SJ & Moncton (NB)	0.009	0.009	0.007
	(0.066)	(0.066)	(0.066)
Female		−0.135**	−0.142***
		(0.042)	(0.042)
No Community Group			−0.543***
			(0.057)
N	11139	11139	11138
R²	0.01	0.01	0.02
Adjusted R²	0.01	0.01	0.02
RMSE	2.17	2.17	2.16
+ p < 0.1, * p < 0.05, p < 0.01, * p < 0.001
Note:
Data source: Canadian Housing Survey (2021)
¹ Reference category for region is ‘Newfoundland and Labrador’
² Reference category for gender is ‘Male’
³ Reference category for community group is ‘Yes’
p < 0.05, p < 0.01, * p < 0.001

table4 <- modelsummary(
  models,
  output = "flextable",
  coef_map = cm,
  gof_map = gm,
  stars = TRUE,
) %>%
  flextable::theme_booktabs() %>%
  flextable::add_header_row(
    values = c("", "Table 2. Atlantic Canada Community Satisfaction Models"),
    colwidths = c(1, 3)
  ) %>%
  flextable::bold(
    i = 1,
    part = "header"
  ) %>%
  flextable::fontsize(size = 10, part = "all") 
table4

	Table 2. Atlantic Canada Community Satisfaction Models
	Regional	Gender	Full
Intercept	6.577***	6.655***	7.119***
	(0.044)	(0.050)	(0.070)
Prince Edward Island	-0.017	-0.017	-0.022
	(0.069)	(0.069)	(0.069)
Halifax	-0.750***	-0.753***	-0.750***
	(0.080)	(0.080)	(0.079)
Outside Halifax (NS)	0.025	0.027	0.007
	(0.066)	(0.066)	(0.066)
Saint John and Moncton	-0.376***	-0.377***	-0.353***
	(0.066)	(0.066)	(0.065)
Outside SJ & Moncton (NB)	0.009	0.009	0.007
	(0.066)	(0.066)	(0.066)
Female		-0.135**	-0.142***
		(0.042)	(0.042)
No Community Group			-0.543***
			(0.057)
N	11139	11139	11138
R²	0.01	0.01	0.02
Adjusted R²	0.01	0.01	0.02
RMSE	2.17	2.17	2.16
+ p < 0.1, * p < 0.05, p < 0.01, * p < 0.001

Suppose you wanted to save that table to a word doc:

doc <- officer::read_docx() %>%
  flextable::body_add_flextable(table4)
print(doc, target = "atlantic_chs_models_flextable.docx")

Practice Exercise for Diary Reflection

For this week’s practice exercise, you will build an alternative set of regression models to explore different factors that might influence community satisfaction in Atlantic Canada. Instead of using the predictors we examined in class, you will work with three different variables:

Income: PHHTTINC (Household total income)
Education: PHGEDUC (Highest level of education completed)
Visible Minority Status: PVISMIN (Visible minority in household)

I will provide the full survey documentation as a PDF on Moodle.

Assignment Instructions:

Build Three Progressive Models:
- Model 1: Community satisfaction predicted by household income (PHHTTINC)
- Model 2: Add education level (PHGEDUC)
- Model 3: Add visible minority household member (PVISMIN)
Data Preparation:
- Make sure to properly clean and transform variables as needed
- For income, consider whether a log transformation is appropriate
- For categorical variables, create meaningful labels using the survey documentation
Create a Professional Regression Table:
- Use modelsummary or the sjPlot package to create a nice looking regression table
- Include all three models with appropriate labels and formatting
Reflection Questions:

Provide an interpretation of the three models. What do you note. As well consider:

How do these factors compare to the regional and social factors we analyzed in class?
What surprised you about your findings?
What challenges did you encounter working with these variables and data?

Good luck! This exercise will help you develop valuable skills in model specification, comparison, and interpretation.

Tutorial 6: Multivariate Linear Regression

Methods and Research II

Understanding Multivariate Linear Regression

Interpreting Coefficients in Multivariate Models

The Key Insight: “All Else Equal”

Visualizing Multivariate Regression

Let’s turn to a real-world example

Example 1: Predicting the Happy Planet Index with Multiple Factors

Understanding the Happy Planet Index

Exploring the Data

Visualizing the HPI Distribution and Relationships

What is a logarithmic transformation?

Why use logarithmic transformations?

Building Progressive Regression Models

Model 1: Single Predictor - GDP per Capita

Model 2: Two Predictors - GDP per Capita and Population

Model 3: Three Predictors - Adding Region

Model Comparison

Interpreting the Multivariate Regression Models for Happy Planet Index

Model 1: GDP per Capita Only

Model 2: GDP per Capita and Population

Model 3: GDP per Capita, Population, and Region

Summary of Model Comparisons

Understanding Model Fit Statistics

AIC and BIC

Comparing the Models

Conclusion from AIC and BIC

Example 2: Community Satisfaction in Atlantic Canada

Understanding the Outcome Variable

Loading and Exploring the Data

Data Preparation for Atlantic Canada Analysis

Exploratory Data Analysis

Building Progressive Regression Models

Model 1: Regional Differences in Community Satisfaction

Model 2: Adding Gender

Model 3: Adding Community Group Participation

Creating professional regression tables

Let’s try different styles:

Practice Exercise for Diary Reflection

Assignment Instructions:

END