Before we dive into the data, let’s make sure you’re comfortable with R Markdown and how to work through this document.
R Markdown (.Rmd files) combines three things:

- Narrative text, like this paragraph
- Code chunks containing R code you can run
- The output that code produces (numbers, tables, figures)
There are several ways to run code in R Markdown:
- Click the green “Run” (play) button at the top right of a chunk
- Place your cursor in the code and press Ctrl+Enter (Windows) or Cmd+Enter (Mac)

Tip: When working through a practice document, run chunks one at a time so you can check the output and make sure you understand it before moving on.
“Knitting” means converting your .Rmd file into a final document (HTML, PDF, or Word). When you knit:

- R runs every code chunk in order, from the top of the file, in a fresh session
- The text, code, and output are combined into one polished document
Common knitting errors and fixes:

- “could not find function”: a package wasn’t loaded; make sure the chunk that loads it (e.g., library(tidyverse)) comes before any code that uses it
- “object not found”: the chunk that creates the object hasn’t run yet; knitting executes chunks in order from the top, so check your chunk order

When you run a code chunk, the output appears directly below it. Learning to read this output is a key skill we’re building. You’ll see things like:

- Single numbers (such as a row count)
- Tables of results (tibbles)
- Column-by-column previews of a dataset
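For example, a chunk containing a single expression prints its result directly below the chunk when you run it; a minimal sketch:

# A tiny chunk: running it shows the result right below
2 + 2

## [1] 4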
This practice document walks you through exploring the 2021 Canadian Housing Survey (CHS) data. Your goal is not just to run code, but to interpret what you see and reflect on what the numbers mean.
Work through each section, running the code chunks and answering the journal prompts in your course journal. There are no “right” answers to the reflection questions — the goal is to practice the cycle of run, check, interpret.
Before analyzing data, we need to understand what we’re working with.
# Load the tidyverse package
library(tidyverse)
# Load the Canadian Housing Survey data
# Make sure the file is in your data/ folder
chs <- read_csv("data/CHS2021ECL_PUMF.csv")
What’s happening here?

- library(tidyverse) loads a collection of R packages for data analysis
- read_csv() reads the data file and stores it as an object called chs

# How many observations (rows)?
nrow(chs)
## [1] 40988
# How many variables (columns)?
ncol(chs)
## [1] 109
Reading this output:

- nrow(chs) returns 40,988 — this is the number of rows (observations)
- ncol(chs) returns 109 — this is the number of columns (variables)

# Get an overview of all variables
glimpse(chs)
## Rows: 40,988
## Columns: 109
## $ PUMFID <dbl> 63501, 63502, 63503, 63504, 63505, 63506, 63507, 63508, 63509…
## $ EHA_10 <dbl> 3, 2, 2, 5, 3, 3, 1, 3, 4, 4, 4, 2, 3, 3, 5, 1, 5, 3, 3, 5, 4…
## $ EHA_10A <dbl> 6, 6, 6, 6, 6, 6, 1, 6, 6, 6, 6, 6, 6, 6, 6, 2, 6, 6, 6, 6, 6…
## $ EHA_10B <dbl> 6, 2, 2, 6, 6, 6, 6, 6, 6, 6, 6, 1, 6, 6, 6, 6, 6, 6, 6, 6, 6…
## $ EHA_25 <dbl> 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2…
## $ DWS_05A <dbl> 3, 3, 2, 3, 1, 3, 3, 3, 3, 3, 3, 3, 3, 2, 3, 3, 3, 3, 3, 1, 3…
## $ DWI_05A <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ DWI_05B <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2…
## $ DWI_05C <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ DWI_05D <dbl> 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ NES_05A <dbl> 3, 2, 2, 3, 3, 3, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3…
## $ NSC_30A <dbl> 1, 1, 2, 3, 1, 2, 1, 2, 2, 1, 2, 2, 3, 1, 2, 1, 2, 1, 2, 1, 2…
## $ NSC_30B <dbl> 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 3, 1, 1, 3, 1, 1, 2, 1, 9…
## $ NSC_30C <dbl> 1, 3, 3, 3, 3, 3, 1, 3, 2, 2, 3, 2, 3, 2, 2, 3, 2, 3, 2, 1, 9…
## $ NEI_05A <dbl> 4, 4, 3, 4, 4, 3, 4, 4, 4, 4, 4, 4, 4, 2, 4, 4, 2, 4, 4, 2, 3…
## $ NEI_05B <dbl> 4, 3, 3, 4, 4, 4, 2, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 4, 4, 4…
## $ NEI_05C <dbl> 4, 3, 3, 4, 4, 3, 4, 4, 4, 4, 1, 4, 4, 4, 4, 4, 3, 3, 3, 4, 4…
## $ NEI_05D <dbl> 4, 3, 3, 4, 4, 3, 4, 4, 4, 4, 4, 2, 4, 1, 4, 4, 4, 2, 4, 4, 4…
## $ NEI_05E <dbl> 4, 2, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 2, 4, 4, 4…
## $ NEI_05F <dbl> 4, 1, 2, 4, 4, 4, 4, 4, 4, 4, 3, 4, 4, 4, 4, 4, 3, 2, 4, 4, 4…
## $ NEI_05G <dbl> 4, 1, 2, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 1, 4, 4, 4…
## $ NEI_05H <dbl> 4, 2, 2, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 4, 4, 4, 3, 4, 2, 4…
## $ NEI_05I <dbl> 4, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 4, 4, 4, 4, 4, 4, 4…
## $ WSA_05 <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2…
## $ SDH_05 <dbl> 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 6, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ CER_05 <dbl> 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 1…
## $ CER_20 <dbl> 3, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 3, 1, 2, 3, 3, 3, 2, 3…
## $ LIS_10 <dbl> 2, 1, 2, 1, 2, 2, 2, 3, 1, 3, 3, 2, 3, 2, 1, 2, 1, 2, 1, 1, 3…
## $ COS_10 <dbl> 3, 3, 2, 1, 2, 3, 3, 9, 3, 3, 2, 3, 3, 1, 1, 2, 1, 3, 3, 2, 3…
## $ COS_15 <dbl> 3, 2, 3, 3, 4, 3, 2, 2, 1, 1, 4, 4, 3, 2, 3, 3, 2, 4, 3, 1, 1…
## $ GH_05 <dbl> 4, 4, 4, 1, 1, 4, 5, 3, 2, 2, 1, 3, 3, 1, 2, 4, 3, 3, 2, 1, 2…
## $ GH_10 <dbl> 3, 4, 4, 3, 1, 2, 3, 3, 2, 2, 1, 4, 3, 1, 3, 4, 3, 4, 2, 2, 1…
## $ REGION <chr> "01", "05", "04", "04", "03", "02", "02", "03", "04", "01", "…
## $ PAGEGR1 <dbl> 2, 9, 2, 1, 2, 2, 9, 9, 1, 2, 9, 1, 2, 9, 1, 2, 1, 9, 1, 2, 2…
## $ PAGEGR2 <dbl> 1, 9, 2, 2, 2, 1, 9, 9, 2, 2, 9, 2, 2, 9, 2, 2, 2, 9, 1, 2, 2…
## $ PAGEGR3 <dbl> 1, 9, 2, 1, 1, 2, 9, 9, 1, 2, 9, 1, 1, 9, 1, 1, 1, 9, 1, 1, 2…
## $ PAGEGR4 <dbl> 2, 9, 1, 2, 2, 2, 9, 9, 2, 1, 9, 2, 2, 9, 2, 2, 2, 9, 2, 2, 1…
## $ PAGEP1 <dbl> 3, 3, 4, 2, 2, 1, 2, 1, 2, 4, 1, 2, 3, 3, 2, 3, 2, 2, 3, 3, 4…
## $ PCER_10 <dbl> 96, 2, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 2, 96,…
## $ PCER_15 <dbl> 6, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 2, 6, 6, 6, 6, 6, 3…
## $ PCHN <dbl> 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 9, 2, 2, 2, 2, 2, 2, 2…
## $ PCOS_05 <dbl> 9, 5, 1, 6, 4, 99, 5, 99, 9, 9, 8, 1, 4, 8, 4, 4, 5, 4, 7, 7,…
## $ PDCLASS <dbl> 1, 1, 0, 1, 0, 1, 2, 0, 0, 0, 9, 1, 2, 0, 1, 1, 1, 9, 0, 0, 0…
## $ PDCT_05 <dbl> 1, 2, 1, 1, 1, 2, 2, 2, 1, 1, 2, 2, 2, 9, 1, 1, 1, 2, 1, 1, 1…
## $ PDCT_20 <dbl> 4, 2, 3, 3, 2, 1, 3, 2, 4, 3, 2, 3, 2, 99, 4, 4, 2, 4, 3, 2, …
## $ PDCT_25 <dbl> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 9, 1, 1, 1, 1, 1, 2, 1…
## $ PDTYPER <dbl> 3, 9, 0, 1, 0, 1, 9, 0, 0, 0, 9, 1, 3, 0, 2, 2, 2, 9, 0, 0, 0…
## $ PDV_SAH <dbl> 6, 2, 6, 6, 6, 2, 2, 2, 6, 6, 2, 1, 9, 9, 6, 6, 6, 2, 6, 6, 6…
## $ PDV_SHCO <dbl> 4.7e+03, 1.4e+03, 3.4e+03, 3.5e+03, 2.1e+03, 1.0e+07, 1.0e+07…
## $ PDV_SUIT <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 9, 1, 1, 1, 1, 1, 1, 1…
## $ PDWLTYPE <chr> "01", "06", "01", "02", "01", "06", "04", "99", "01", "01", "…
## $ PDWS_05 <dbl> 3, 2, 1, 2, 3, 2, 3, 3, 3, 3, 3, 2, 3, 2, 3, 3, 3, 2, 3, 3, 3…
## $ PDWS_10A <dbl> 1, 4, 2, 2, 4, 3, 2, 2, 1, 1, 1, 1, 1, 2, 1, 1, 2, 4, 3, 2, 1…
## $ PDWS_10B <dbl> 1, 3, 2, 2, 4, 4, 2, 2, 1, 1, 1, 1, 1, 2, 1, 1, 2, 3, 3, 2, 1…
## $ PDWS_10C <dbl> 1, 2, 4, 3, 4, 2, 4, 2, 1, 1, 1, 1, 3, 3, 2, 1, 2, 4, 3, 1, 1…
## $ PDWS_10D <dbl> 1, 2, 2, 2, 1, 2, 2, 2, 2, 1, 1, 3, 2, 2, 1, 1, 2, 3, 3, 2, 1…
## $ PDWS_10E <dbl> 1, 2, 2, 2, 1, 3, 2, 2, 1, 1, 1, 2, 3, 2, 2, 1, 4, 3, 3, 1, 2…
## $ PDWS_10F <dbl> 1, 4, 3, 3, 3, 3, 4, 2, 2, 2, 4, 3, 3, 4, 3, 4, 3, 3, 3, 1, 1…
## $ PDWS_10G <dbl> 1, 2, 2, 2, 1, 3, 2, 2, 1, 1, 1, 2, 2, 4, 2, 2, 2, 3, 3, 1, 1…
## $ PDWS_10H <dbl> 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 1…
## $ PDWS_10I <dbl> 1, 2, 2, 2, 2, 3, 1, 2, 1, 1, 2, 4, 2, 3, 2, 2, 2, 2, 3, 3, 1…
## $ PDWS_10J <dbl> 1, 2, 2, 2, 4, 3, 1, 2, 1, 1, 2, 2, 2, 3, 2, 2, 2, 3, 3, 1, 2…
## $ PEHA_05A <dbl> 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 1, 1, 2, 2, 1, 2, 2, 2, 2, 2…
## $ PEHA_05B <dbl> 2, 1, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ PEHA_05C <dbl> 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ PEMPL <dbl> 1, 1, 2, 1, 1, 1, 9, 9, 1, 2, 9, 9, 2, 1, 1, 1, 1, 1, 1, 1, 2…
## $ PFTHB5YR <dbl> 2, 6, 2, 2, 2, 6, 6, 6, 2, 2, 6, 6, 6, 9, 2, 2, 2, 6, 2, 2, 2…
## $ PFWEIGHT <dbl> 338.5873, 44.6467, 1706.9443, 150.9932, 1683.8582, 205.0446, …
## $ PGEOGR <chr> "03", "26", "22", "22", "16", "10", "10", "16", "18", "04", "…
## $ PHGEDUC <dbl> 3, 4, 6, 7, 5, 2, 99, 99, 1, 6, 6, 2, 99, 2, 6, 4, 6, 99, 4, …
## $ PHHSIZE <dbl> 3, 2, 1, 3, 1, 2, 99, 99, 5, 1, 2, 4, 1, 99, 3, 1, 3, 99, 4, …
## $ PHHTTINC <dbl> 7.50e+04, 9.25e+04, 6.00e+04, 1.90e+05, 9.75e+04, 1.00e+11, 1…
## $ PHTYPE <chr> "01", "03", "05", "01", "05", "02", "99", "99", "01", "05", "…
## $ PLIS_05 <dbl> 6, 6, 1, 7, 7, 7, 2, 6, 9, 9, 8, 3, 8, 4, 7, 4, 8, 4, 7, 7, 1…
## $ PNES_05 <dbl> 3, 1, 1, 3, 2, 1, 3, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 3, 3, 1…
## $ PNSC_15 <dbl> 1, 4, 2, 2, 4, 2, 2, 1, 1, 1, 2, 2, 1, 4, 2, 2, 3, 3, 2, 1, 2…
## $ POWN_20 <dbl> 1, 6, 1, 1, 1, 6, 6, 6, 1, 1, 6, 6, 6, 9, 1, 1, 2, 6, 1, 1, 2…
## $ POWN_80 <dbl> 5.0e+04, 1.0e+08, 5.2e+05, 3.5e+05, 1.0e+05, 1.0e+08, 1.0e+08…
## $ PPAC_05 <dbl> 4, 4, 4, 3, 3, 1, 1, 1, 1, 4, 1, 2, 2, 3, 3, 3, 3, 99, 4, 1, …
## $ PPAC_10 <dbl> 1, 1, 2, 1, 2, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ PPAC_23 <dbl> 1, 1, 6, 1, 6, 2, 9, 9, 1, 6, 2, 1, 6, 9, 1, 6, 1, 9, 2, 6, 6…
## $ PPAC_30 <dbl> 1, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 2, 2, 2, 3, 1, 1…
## $ PPAC_35 <dbl> 6, 2, 2, 2, 2, 2, 2, 2, 6, 6, 2, 1, 2, 2, 6, 2, 2, 2, 6, 6, 6…
## $ PPAC_45A <dbl> 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2…
## $ PPAC_45C <dbl> 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ PPAC_45D <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ PPAC_45E <dbl> 2, 2, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2…
## $ PPAC_45F <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ PPAC_45G <dbl> 2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 1, 2, 2, 1, 2, 2, 2, 2, 1, 2…
## $ PPAC_45H <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ PPAC_45I <dbl> 1, 1, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2…
## $ PPAC_45J <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1…
## $ PPAC_45K <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2…
## $ PPAC_45L <dbl> 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ PPAC_45M <dbl> 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 1, 2, 2…
## $ PPAC_45N <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2…
## $ PPAC_45O <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ PPROV <dbl> 12, 59, 48, 48, 35, 24, 24, 35, 46, 12, 48, 47, 10, 35, 35, 4…
## $ PRSPGNDR <dbl> 2, 2, 2, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 2, 2, 1, 1, 1, 2, 2…
## $ PRSPIMST <dbl> 9, 1, 1, 1, 1, 1, 1, 9, 9, 9, 1, 9, 9, 2, 1, 9, 1, 9, 1, 1, 1…
## $ PSCR_05 <dbl> 6, 2, 6, 6, 6, 2, 1, 2, 6, 6, 2, 1, 9, 9, 6, 6, 6, 2, 6, 6, 6…
## $ PSCR_10 <dbl> 6, 2, 6, 6, 6, 2, 2, 2, 6, 6, 2, 1, 9, 9, 6, 6, 6, 2, 6, 6, 6…
## $ PSCR_25 <dbl> 96, 1, 96, 96, 96, 3, 3, 3, 96, 96, 3, 2, 99, 99, 96, 96, 96,…
## $ PSCR_35 <dbl> 6, 2, 6, 6, 6, 6, 2, 6, 6, 6, 6, 1, 6, 9, 6, 6, 6, 2, 6, 6, 6…
## $ PSCR_D40 <dbl> 999.6, 999.6, 999.6, 999.6, 999.6, 999.6, 999.6, 999.6, 999.6…
## $ PSTIR_GR <dbl> 3, 1, 3, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 9, 1, 1, 1, 1, 1, 1, 1…
## $ PVISMIN <dbl> 9, 2, 2, 1, 2, 2, 9, 9, 9, 9, 9, 9, 9, 9, 2, 9, 1, 9, 2, 2, 2…
## $ PWSA_D15 <dbl> 999.6, 999.6, 999.6, 999.6, 999.6, 999.6, 999.6, 999.6, 999.6…
## $ VERDATE <chr> "30/11/2022", "30/11/2022", "30/11/2022", "30/11/2022", "30/1…
Reading this output:

The glimpse() function shows you:

- Rows: 40,988 and Columns: 109 — confirming what we saw above
- Variable names (PUMFID, EHA_10, REGION, etc.)
- Variable types (<dbl> means numeric, <chr> means text/character)

In your journal, write:
- How many households are in this dataset?
- How many variables were measured for each household?
- What does each row represent? What does each column represent?
- Looking at the variable names, can you guess what any of them might measure?
40,988 households are in this dataset (this is the number of rows)
109 variables were measured for each household (this is the number of columns)
Each row represents one household — specifically, a private household in one of the 10 provinces or 3 territorial capitals. Each column represents one variable — a piece of information collected about that household (like their province, income, dwelling type, etc.)
Variable meanings from the codebook:

- REGION — Region of residence (Atlantic, Quebec, Ontario, Prairies, British Columbia, Territories)
- PPROV — Province of residence
- PHHSIZE — Household size (number of people)
- PHHTTINC — Total income of household
- GH_05 — General health (self-assessed)
- PDWLTYPE — Dwelling type
- EHA_10 — Economic hardship: level of difficulty experienced in past 12 months
- PCHN — Core housing need

Note: The codebook is your friend! These abbreviated variable names are common in survey data and the codebook tells you exactly what each one means.
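To see a few of these codebook variables side by side, you can combine select() with head(); a small sketch (this particular set of variables is just illustrative):

# Peek at a handful of codebook variables for the first few households
chs %>%
  select(REGION, PPROV, PHHSIZE, PHHTTINC, GH_05, PDWLTYPE, EHA_10, PCHN) %>%
  head()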
Categorical variables tell us how many fall into each group. Let’s explore several.
The variable PDCT_05 records whether the dwelling is
owned or rented.
From the codebook:

- 1 = Owned
- 2 = Rented
- 9 = Not stated
chs %>%
count(PDCT_05)
## # A tibble: 3 × 2
## PDCT_05 n
## <dbl> <int>
## 1 1 20821
## 2 2 18779
## 3 9 1388
Reading this output:

This table shows three rows:

- PDCT_05 = 1 (owned): 20,821 households
- PDCT_05 = 2 (rented): 18,779 households
- PDCT_05 = 9 (not stated): 1,388 households

Now let’s add percentages:
chs %>%
count(PDCT_05) %>%
mutate(percent = round(n / sum(n) * 100, 1))
## # A tibble: 3 × 3
## PDCT_05 n percent
## <dbl> <int> <dbl>
## 1 1 20821 50.8
## 2 2 18779 45.8
## 3 9 1388 3.4
Reading this output:
The mutate() function added a new column called
percent. Now we can see:
- What percentage of households in this sample own their home?
- What percentage rent?
- How many households didn’t answer this question (code 9)?
- Does this distribution surprise you? Why or why not?
50.8% of households own their home (or, if we exclude the “not stated” responses: 20,821 / 39,600 = 52.6%)
45.8% rent (or excluding “not stated”: 18,779 / 39,600 = 47.4%)
1,388 households didn’t answer this question (code 9 = “Not stated”)
Interpretation (answers will vary): you might compare the near-even split here with your expectations about homeownership in Canada, keeping in mind that these are unweighted sample percentages.
Important methodological note: When we report
percentages, we should be clear about our denominator and whether we’re
using weighted or unweighted data. The percentages we calculate here are
unweighted sample percentages. For population-level estimates,
survey weights (the PFWEIGHT variable) should be applied.
Again, we will get back to this!
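As a preview, here is a hedged sketch of what the weighted version could look like, using count()’s wt argument with PFWEIGHT (we will treat weighting properly later):

# Weighted tenure counts: each household contributes its survey weight, not 1
chs %>%
  count(PDCT_05, wt = PFWEIGHT) %>%
  mutate(percent = round(n / sum(n) * 100, 1))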
The variable GH_05 asks about self-assessed health.
From the codebook:

The question wording is: “In general, how is your health? Would you say:”

- 1 = Excellent
- 2 = Very good
- 3 = Good
- 4 = Fair
- 5 = Poor
- 9 = Not stated
chs %>%
count(GH_05) %>%
mutate(percent = round(n / sum(n) * 100, 1))
## # A tibble: 6 × 3
## GH_05 n percent
## <dbl> <int> <dbl>
## 1 1 5428 13.2
## 2 2 13124 32
## 3 3 13955 34
## 4 4 6208 15.1
## 5 5 2068 5
## 6 9 205 0.5
Reading this output:
This shows a distribution across six categories. Notice that the largest groups are in the middle (codes 2 and 3), with fewer at the extremes.
- What is the most common response?
- What percentage report “Fair” or “Poor” health (codes 4 and 5 combined)?
- In the codebook/original survey, what kind of variable is this?
“Good” (code 3) is the most common response, with 34.0% of respondents (13,955 households)
20.1% report “Fair” or “Poor” health (15.1% + 5.0% = 20.1%)
This is an ordinal variable — the categories have a meaningful order (Excellent > Very good > Good > Fair > Poor), but the “distances” between categories aren’t necessarily equal. We can say “Excellent” is better than “Good,” but we can’t say the difference between Excellent and Very good is the same as between Fair and Poor.
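Rather than adding rounded percentages by hand, you can let R compute the combined share directly; a small sketch (the result can differ from the hand-summed 20.1% in the last decimal because it avoids intermediate rounding):

# Share of all respondents reporting Fair (4) or Poor (5) health
chs %>%
  summarize(pct_fair_poor = round(mean(GH_05 %in% c(4, 5)) * 100, 1))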
Additional notes: the “not stated” group here is small (205 households, 0.5%), so including or excluding it barely changes the percentages.
The variable PDCT_25 asks about the condition of the
dwelling.
From the codebook:

The question asks: “Is this dwelling in need of any repairs?”

- 1 = No, only regular maintenance is needed
- 2 = Yes, minor repairs are needed
- 3 = Yes, major repairs are needed
- 9 = Not stated
chs %>%
count(PDCT_25) %>%
mutate(percent = round(n / sum(n) * 100, 1))
## # A tibble: 4 × 3
## PDCT_25 n percent
## <dbl> <int> <dbl>
## 1 1 28078 68.5
## 2 2 9095 22.2
## 3 3 3019 7.4
## 4 9 796 1.9
- What share of dwellings need major repairs? Can you tell accurately from the table above? Why or why not?
- If you were a policy analyst, what might these numbers tell you about housing quality in Canada?
- What might this variable miss if we want to assess housing quality? (Think about operationalization)
About 7.4% need major repairs — but we should be cautious about this figure. Including the “not stated” responses (1.9%) in the denominator slightly deflates the percentage. If we exclude non-responses: 3,019 / 40,192 = 7.5%. This is close enough that it doesn’t change our interpretation much, but precision matters in research.
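To make the denominator choice explicit, a small sketch computing the share both ways:

# Percent needing major repairs (code 3), with and without "not stated" (code 9)
chs %>%
  summarize(
    incl_not_stated = round(mean(PDCT_25 == 3) * 100, 1),
    excl_not_stated = round(sum(PDCT_25 == 3) / sum(PDCT_25 != 9) * 100, 1)
  )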
Policy implications: with roughly 7% of dwellings needing major repairs and another 22% needing minor repairs, a sizeable minority of the housing stock has quality problems worth policy attention.
Operationalization limitations:
- A single repairs question also misses specific dwelling problems that the survey records in separate variables (DWI_05A through DWI_05D), including pests (DWI_05B), undrinkable water (DWI_05C), and poor indoor air quality (DWI_05D)

For numerical variables, we calculate statistics like mean and median.
The variable PHHSIZE records the number of people in the
household.
From the codebook: valid values for PHHSIZE run from 1 to 5 (the largest category in this file), and 99 means “Not stated.”

Let’s calculate the mean, along with the median and maximum:
chs %>%
summarize(
mean_size = mean(PHHSIZE),
median_size = median(PHHSIZE),
max_size = max(PHHSIZE)
)
## # A tibble: 1 × 3
## mean_size median_size max_size
## <dbl> <dbl> <dbl>
## 1 15.5 2 99
🚨 Something looks wrong! A mean household size of 15.5 people? A maximum of 99? That doesn’t make sense.
- What is the calculated mean?
- What is the maximum value?
- Why might these numbers be problematic?
The calculated mean is 15.5 (which is clearly wrong)
The maximum value is 99
Why this is problematic:

- No household has 99 members; 99 is a special code meaning “not stated”
- Averaging in those 99s drags the mean far above any plausible household size
The lesson: Always check your codebook and handle special codes appropriately before calculating statistics.
The issue is that code 99 means “not stated” — it’s not a household of 99 people! We need to filter it out:
chs %>%
filter(PHHSIZE != 99) %>%
summarize(
mean_size = mean(PHHSIZE),
median_size = median(PHHSIZE),
min_size = min(PHHSIZE),
max_size = max(PHHSIZE),
n = n()
)
## # A tibble: 1 × 5
## mean_size median_size min_size max_size n
## <dbl> <dbl> <dbl> <dbl> <int>
## 1 2.02 2 1 5 35300
Reading this output:

Now we get sensible numbers:

- The mean household size is 2.02 and the median is 2
- Sizes range from 1 to 5
- 35,300 households remain after filtering out the “not stated” code
- What is the corrected mean household size?
- What is the median?
- Why is the total number of observations different from when we started?
- Which measure better represents the “typical” household? Why?
- Key lesson: What went wrong in the first calculation, and why does filtering matter?
The corrected mean household size is 2.02 people
The median is 2 people
We went from 40,988 to 35,300 observations because we filtered out households that didn’t state their household size (code 99). The difference (5,688 households, or about 14%) is quite substantial and worth noting when reporting results.
Both measures are reasonable here:

- The mean (2.02) and the median (2) nearly coincide, so the cleaned distribution is not badly skewed
- The median has the advantage of being robust to outliers and to stray special codes like the 99s we just removed
Key lesson:

The first calculation treated the “not stated” code (99) as if it were a real household size, inflating the mean from about 2 to 15.5. Filtering matters because summary functions have no way of knowing which values are codes rather than measurements; always check the codebook and handle special codes before computing statistics.
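Instead of filtering out whole rows, you can recode the special value to NA so the rest of each household’s data stays available; a minimal sketch using dplyr’s na_if() (hhsize_clean is just an illustrative name):

# Recode 99 ("not stated") to NA, then tell mean() to skip NAs
chs %>%
  mutate(hhsize_clean = na_if(PHHSIZE, 99)) %>%
  summarize(mean_size = mean(hhsize_clean, na.rm = TRUE))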
Now let’s look at variables that measure related concepts in different ways.
The variable PSTIR_GR measures what share of income goes
to shelter costs.
From the codebook:

- 1 = Less than 30% of income spent on shelter costs
- 2 = 30% to less than 50%
- 3 and 4 = 50% or more, split into two higher bands
- Higher codes mark “not applicable” and “not stated” responses, which the filter below excludes
chs %>%
filter(PSTIR_GR %in% c(1, 2, 3, 4)) %>%
count(PSTIR_GR) %>%
mutate(percent = round(n / sum(n) * 100, 1))
## # A tibble: 4 × 3
## PSTIR_GR n percent
## <dbl> <int> <dbl>
## 1 1 31869 80.4
## 2 2 5743 14.5
## 3 3 1669 4.2
## 4 4 379 1
A common benchmark: households spending 30% or more of income on shelter are considered “cost-burdened.”
Reading this output:
We filtered to only include codes 1-4 (excluding “not applicable” and “not stated”). Of these households:
- What percentage of households spend less than 30% on shelter (code 1)?
- What percentage are “cost-burdened” (codes 2, 3, and 4 combined)?
- What percentage spend 50% or more (codes 3 and 4)?
80.4% of households (among those with valid data) spend less than 30% of income on shelter
About 19.6% are “cost-burdened.” Summing the rounded values gives 14.5% + 4.2% + 1.0% = 19.7%, while 100% - 80.4% = 19.6%; the small discrepancy comes from rounding the individual percentages.
5.2% spend 50% or more on shelter (4.2% + 1.0% = 5.2%)
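These combined shares can also be computed directly from the valid responses, sidestepping rounding drift; a small sketch:

# Shares spending 30%+ and 50%+ of income on shelter, among valid responses
chs %>%
  filter(PSTIR_GR %in% c(1, 2, 3, 4)) %>%
  summarize(
    pct_30_plus = round(mean(PSTIR_GR %in% c(2, 3, 4)) * 100, 1),
    pct_50_plus = round(mean(PSTIR_GR %in% c(3, 4)) * 100, 1)
  )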
Additional context: as with the tenure percentages earlier, these are unweighted sample figures; population-level statements about cost burden would use the survey weights.
The variable PCHN is a composite indicator of housing
adequacy.
From the codebook:
A household is in “core housing need” if its housing fails to meet at least one of three standards established for housing adequacy (dwelling condition), suitability (enough bedrooms for household composition), and affordability (spending less than 30% on shelter), AND if its income before taxes is at or below the appropriate community-and-bedroom-specific income threshold.
chs %>%
filter(PCHN %in% c(1, 2)) %>%
count(PCHN) %>%
mutate(percent = round(n / sum(n) * 100, 1))
## # A tibble: 2 × 3
## PCHN n percent
## <dbl> <int> <dbl>
## 1 1 4477 11.4
## 2 2 34804 88.6
- What percentage of households are in core housing need?
- Both PSTIR_GR and PCHN relate to housing affordability. How are they different?
- Operationalization question: Which measure better captures “housing affordability”? What does each capture that the other might miss?
11.4% of households are in core housing need (among those examined — 4,477 out of 39,281 households)
How they differ:
| Feature | PSTIR_GR (Shelter Cost Ratio) | PCHN (Core Housing Need) |
|---|---|---|
| Focus | Single dimension: cost burden | Multiple dimensions: adequacy + suitability + affordability |
| Threshold | 30% of income to shelter | Fails standards AND can’t afford acceptable local housing |
| Complexity | Simple ratio | Composite indicator derived from multiple variables |
| What it measures | Current spending burden | Whether household NEEDS better housing AND can’t afford it |
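One way to see how the two measures relate in the data itself is a quick cross-tabulation; a hedged sketch (PCHN codes as above: 1 = in core housing need, 2 = not):

# Cross-tabulate shelter-cost ratio group by core housing need status
chs %>%
  filter(PSTIR_GR %in% c(1, 2, 3, 4), PCHN %in% c(1, 2)) %>%
  count(PSTIR_GR, PCHN)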
Operationalization assessment:
Shelter Cost Ratio (PSTIR_GR) captures:

- The current cost burden in a single, transparent ratio
- Gradations of burden (30%+, 50%+) rather than a single yes/no flag

Core Housing Need (PCHN) captures:

- All three housing standards at once: adequacy, suitability, and affordability
- Whether the household could actually afford acceptable housing in its local market (the income-threshold test)
Neither captures:

- How satisfied residents actually are with their housing, which the survey measures separately (PDWS_05 and PDWS_10A-J)

This section is optional but good practice.
Let’s look at just Atlantic Canadian households.
Province codes (from codebook - PPROV variable):

- 10 = Newfoundland and Labrador
- 11 = Prince Edward Island
- 12 = Nova Scotia
- 13 = New Brunswick
# Filter to Atlantic provinces
atlantic <- chs %>%
filter(PPROV %in% c(10, 11, 12, 13))
# How many Atlantic households?
nrow(atlantic)
## [1] 11263
# What share of the national sample?
nrow(atlantic) / nrow(chs) * 100
## [1] 27.47877
Reading this output:

- nrow(atlantic) returns 11,263, the number of Atlantic households in the sample
- Dividing by nrow(chs) shows they make up about 27.5% of the national sample
Let’s look at economic hardship across Atlantic provinces. The
variable EHA_10 asks: “In the past 12 months, how difficult
or easy was it for your household to meet its financial needs in terms
of transportation, housing, food, clothing and other necessary
expenses?”
From the codebook:

- 1 = Very difficult
- 2 = Difficult
- 3 = Neither difficult nor easy
- 4 = Easy
- 5 = Very easy
- 9 = Not stated
atlantic %>%
filter(EHA_10 != 9) %>%
group_by(PPROV) %>%
summarize(
n = n(),
pct_difficult = round(sum(EHA_10 %in% c(1, 2)) / n() * 100, 1)
)
## # A tibble: 4 × 3
## PPROV n pct_difficult
## <dbl> <int> <dbl>
## 1 10 2440 26
## 2 11 1708 19.7
## 3 12 3061 23
## 4 13 4031 22.9
Reading this output:

This table shows, for each Atlantic province:

- n: the number of households that answered the question (code 9, “not stated,” is filtered out)
- pct_difficult: the percentage reporting it was difficult or very difficult to meet financial needs (codes 1 or 2)
- How many Atlantic Canadian households are in the sample?
- Which province has the highest share reporting economic hardship (codes 1 or 2)?
- What differences do you notice across provinces?
11,263 Atlantic Canadian households are in the sample (sum of: NL 2,446 + PEI 1,714 + NS 3,065 + NB 4,038 from codebook)
Newfoundland and Labrador (province code 10) has the highest share at 26.0% reporting economic hardship (codes 1 or 2)
Provincial differences:

- Newfoundland and Labrador (10): 26.0% reporting difficulty
- Prince Edward Island (11): 19.7%
- Nova Scotia (12): 23.0%
- New Brunswick (13): 22.9%

Observations: the spread is about six percentage points, with Newfoundland and Labrador noticeably above the other three provinces and Prince Edward Island lowest.
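To rank the provinces directly rather than scanning the table, add arrange(); a small sketch:

# Same summary as above, sorted from highest to lowest hardship
atlantic %>%
  filter(EHA_10 != 9) %>%
  group_by(PPROV) %>%
  summarize(pct_difficult = round(sum(EHA_10 %in% c(1, 2)) / n() * 100, 1)) %>%
  arrange(desc(pct_difficult))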
Note: Atlantic Canada makes up 27.5% of this unweighted sample, but only 6.9% of the weighted population estimate (from the REGION variable in the codebook). This is because the CHS oversampled Atlantic Canada to ensure adequate sample sizes for regional analysis — a common survey design strategy.
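You can check the weighted share yourself using PFWEIGHT; a hedged sketch (the weighted figure should land near the 6.9% quoted from the codebook):

# Atlantic share: unweighted sample vs weighted population estimate
chs %>%
  summarize(
    unweighted_pct = round(mean(PPROV %in% c(10, 11, 12, 13)) * 100, 1),
    weighted_pct = round(sum(PFWEIGHT[PPROV %in% c(10, 11, 12, 13)]) / sum(PFWEIGHT) * 100, 1)
  )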