Running and Plotting Regressions

Author

Jesse McDevitt-Irwin

Introduction

In this worksheet, you’ll investigate how adding a control variable can affect the relationship between two variables. You’ll:

Estimate a bivariate regression (e.g., child height-for-age on HWISE score).
Add a potential confounder (e.g., rural vs. urban) in a multivariable regression.
Compare the results visually using jtools::plot_summs().

Before regressions: Save your cleaned data

So far, we have been working from the raw DHS data. In previous assignments, you recoded (using mutate) your main variables. Now, save your dataset from previous assignments, by adding the command write.csv(dhs_clean, filename = "Temp/dhs_clean.csv") at the end of your script. Then run the whole script.

Click below to see an example of a script making dhs_clean.

Click to show code

# Load required packages
library(tidyverse)  # For data wrangling and visualization
library(haven)      # For reading Stata .dta files

rm(list = ls())


# Select relevant individual variables
vars_hr <- c(
  "hhid", "hv001", "hv002", "hv005", "hv008", "hv009", "hv024", "hv025", "hv204",
  "shwa17","hv025",
  "sh108g", "sh108o", "sh108h", "sh108i", "sh108j", "sh108k",
  "sh108l", "sh108m", "sh108n", "sh108p", "sh108q", "sh108r"
)

hr_raw <- read_dta("Raw/MZHR81FL.DTA",
                   col_select = all_of(vars_hr))

wise_vars <- c("sh108g", "sh108o", "sh108h", "sh108i", "sh108j", "sh108k",
               "sh108l", "sh108m", "sh108n", "sh108p", "sh108q", "sh108r")

# Use case_when to recode
hr_raw <- hr_raw %>%
  mutate(across(all_of(wise_vars), ~ case_when(
    .x == 1 ~ 0,
    .x == 2 ~ 1,
    .x == 3 ~ 2,
    .x %in% c(4,5) ~ 3,
    TRUE ~ NA_real_
  ), .names = "recoded_{.col}")) %>% 
  mutate(WISE_score = rowSums(across(starts_with("recoded_")))) %>% 
  mutate(WISE_category = case_when(
    WISE_score <= 2 ~ "No-to-marginal",
    WISE_score <= 11 ~ "Low",
    WISE_score <= 23 ~ "Moderate",
    WISE_score >= 24 ~ "High",
    TRUE ~ NA) %>% as.factor
  )

vars_kr <- c("v001", "v002", "v025", "hw70")
kr_raw <- read_dta("Raw/MZKR81FL.DTA",
                   col_select = all_of(vars_kr)) %>% 
  mutate(urban = case_when(
    v025 == 1 ~ 1,
    v025 == 2 ~ 0,
    TRUE ~ NA)) %>% 
  mutate(height_for_age = case_when(
      hw70 > 600 ~ NA,
      TRUE ~ hw70
    ))

dhs_clean <- hr_raw %>% 
  inner_join(kr_raw, by = c("hv001"="v001", 
                            "hv002" = "v002"))

write.csv(dhs_clean, file = "Temp/dhs_clean.csv")

Load Libraries and Data

Next create a new R script, named “regressions.R”. Save it in your “Code” folder.

library(tidyverse)

# Install jtools if not already installed
if (!requireNamespace("jtools", quietly = TRUE)) {
  install.packages("jtools")
}
library(jtools)

dhs_clean <- read.csv("Temp/dhs_clean.csv")

Choose Your Variables

# Example variables – replace with your own choices
dependent_var <- "height_for_age"
main_predictor <- "WISE_score"
confounder <- "urban"  # Re-coded: 1 = urban, 0 = rural

Rescale Your Variables

When running regressions, it is best to standardize your variables. This allows you to easily compare the magnitude of different coefficients.

Leave categorical variables as they are.
Divide your continuous variables by 2 times their standard deviation.

This makes continuous variables roughly comparable to binary variables.¹

sigma_height = sd(dhs_clean$height_for_age, na.rm = T)
sigma_wise = sd(dhs_clean$WISE_score, na.rm = T)

dhs_clean <- dhs_clean %>% 
  mutate(height_for_age = height_for_age/(2*sigma_height),
         WISE_score = WISE_score/(2*sigma_wise))

Run Bivariate and Multivariable Regressions

# Bivariate model
model_biv <- lm(height_for_age ~ WISE_score, data = dhs_clean)

# Multivariable model with confounder
model_mult <- lm(height_for_age ~ WISE_score + urban, data = dhs_clean)

Compare the Models Visually

# Plot both models side by side to compare coefficients for hwise_score
plot_summs(model_biv, model_mult,
           model.names = c("Bivariate", "Multivariable"),
           ci_level = .68)

Interpretation Questions

How is the independent related to the dependent?

Higher HWISE scores (more water insecurity) is associated with lower height-for-age.

How did adding the confounder change the estimated association between dependent and independent? Why do you think this is?

Adding the urban variable attenuated the association between the HWISE score and height-for-age. Urban households must have lower HWISE scores and also taller children.

Which variable seems to have a stronger relationship with the dependent: your independent or the confounder?

Urban-rural has a stronger relationship with height-for-age than HWISE does.

Footnotes

If you are interested in more explanation, see this paper.↩︎