Getting Started: A Quick Guide to R Markdown

Before we dive into the data, let’s make sure you’re comfortable with R Markdown and how to work through this document.

What is R Markdown?

R Markdown (.Rmd files) combines three things:

  1. Regular text (like what you’re reading now) — for explanations and writing
  2. Code chunks (the gray boxes with R code) — for running analyses
  3. Output — the results that appear after you run code

How to Run Code

There are several ways to run code in R Markdown:

  1. Run a single chunk: Click the green “play” button (▶) in the top-right corner of any code chunk
  2. Run all chunks above: Click the downward-pointing arrow next to the play button
  3. Run selected lines: Highlight code and press Ctrl+Enter (Windows) or Cmd+Enter (Mac)
  4. Knit the whole document: Click the “Knit” button at the top to run everything and produce the final HTML output

Tip: When working through a practice document, run chunks one at a time so you can check the output and make sure you understand it before moving on.

What is “Knitting”?

“Knitting” means converting your .Rmd file into a final document (HTML, PDF, or Word). When you knit:

  • R runs ALL the code chunks from top to bottom
  • It combines the text, code, and output into one polished document
  • The output file appears in the same folder as your .Rmd file

Common knitting errors and fixes:

  • “Object not found” → You probably didn’t run an earlier chunk that creates that object
  • “Could not find function” → You need to load a package first (usually library(tidyverse))
  • “File not found” → Check that your data file is in the right folder

Reading Output

When you run a code chunk, the output appears directly below it. Learning to read this output is a key skill we’re building. You’ll see things like:

  • Tables — rows and columns of data
  • Messages — information from R (often about packages loading)
  • Warnings — something might not be quite right (but the code still ran)
  • Errors — the code didn’t work; you need to fix something
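
To see the difference between the last three, here is a minimal sketch in base R. The inputs are made up for illustration and are not part of the CHS data.

```r
# Message: informational only; nothing is wrong
message("Just letting you know: the chunk started")

# Warning: the code still runs and returns a value (NA here),
# but R flags that something may not be quite right
converted <- as.numeric("not a number")
is.na(converted)

# Error: the code stops. try() is used here only so that the rest of
# this example keeps running after the failure.
attempt <- try(log("a"), silent = TRUE)
inherits(attempt, "try-error")
```

Both checks return TRUE: the warning still produced a value (a missing one), while the error produced no usable result at all.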

Overview

This practice document walks you through exploring the 2021 Canadian Housing Survey (CHS) data. Your goal is not just to run code, but to interpret what you see and reflect on what the numbers mean.

Work through each section, running the code chunks and answering the journal prompts in your course journal. There are no “right” answers to the reflection questions — the goal is to practice the cycle of run, check, interpret.


Part 1: Getting Oriented

Before analyzing data, we need to understand what we’re working with.

Load packages and data

# Load the tidyverse package
library(tidyverse)

# Load the Canadian Housing Survey data
# Make sure the file is in your data/ folder
chs <- read_csv("data/CHS2021ECL_PUMF.csv")

What’s happening here?

  • library(tidyverse) loads a collection of R packages for data analysis
  • read_csv() reads the data file and stores it as an object called chs
  • If you get an error, check the troubleshooting tips above

Check the dimensions

# How many observations (rows)?
nrow(chs)
## [1] 40988

# How many variables (columns)?
ncol(chs)
## [1] 109

Reading this output:

  • nrow(chs) returns 40,988 — this is the number of rows (observations)
  • ncol(chs) returns 109 — this is the number of columns (variables)

Look at the structure

# Get an overview of all variables
glimpse(chs)
## Rows: 40,988
## Columns: 109
## $ PUMFID   <dbl> 63501, 63502, 63503, 63504, 63505, 63506, 63507, 63508, 63509…
## $ EHA_10   <dbl> 3, 2, 2, 5, 3, 3, 1, 3, 4, 4, 4, 2, 3, 3, 5, 1, 5, 3, 3, 5, 4…
## $ EHA_10A  <dbl> 6, 6, 6, 6, 6, 6, 1, 6, 6, 6, 6, 6, 6, 6, 6, 2, 6, 6, 6, 6, 6…
## $ EHA_10B  <dbl> 6, 2, 2, 6, 6, 6, 6, 6, 6, 6, 6, 1, 6, 6, 6, 6, 6, 6, 6, 6, 6…
## $ EHA_25   <dbl> 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2…
## $ DWS_05A  <dbl> 3, 3, 2, 3, 1, 3, 3, 3, 3, 3, 3, 3, 3, 2, 3, 3, 3, 3, 3, 1, 3…
## $ DWI_05A  <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ DWI_05B  <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2…
## $ DWI_05C  <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ DWI_05D  <dbl> 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ NES_05A  <dbl> 3, 2, 2, 3, 3, 3, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3…
## $ NSC_30A  <dbl> 1, 1, 2, 3, 1, 2, 1, 2, 2, 1, 2, 2, 3, 1, 2, 1, 2, 1, 2, 1, 2…
## $ NSC_30B  <dbl> 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 3, 1, 1, 3, 1, 1, 2, 1, 9…
## $ NSC_30C  <dbl> 1, 3, 3, 3, 3, 3, 1, 3, 2, 2, 3, 2, 3, 2, 2, 3, 2, 3, 2, 1, 9…
## $ NEI_05A  <dbl> 4, 4, 3, 4, 4, 3, 4, 4, 4, 4, 4, 4, 4, 2, 4, 4, 2, 4, 4, 2, 3…
## $ NEI_05B  <dbl> 4, 3, 3, 4, 4, 4, 2, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 4, 4, 4…
## $ NEI_05C  <dbl> 4, 3, 3, 4, 4, 3, 4, 4, 4, 4, 1, 4, 4, 4, 4, 4, 3, 3, 3, 4, 4…
## $ NEI_05D  <dbl> 4, 3, 3, 4, 4, 3, 4, 4, 4, 4, 4, 2, 4, 1, 4, 4, 4, 2, 4, 4, 4…
## $ NEI_05E  <dbl> 4, 2, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 2, 4, 4, 4…
## $ NEI_05F  <dbl> 4, 1, 2, 4, 4, 4, 4, 4, 4, 4, 3, 4, 4, 4, 4, 4, 3, 2, 4, 4, 4…
## $ NEI_05G  <dbl> 4, 1, 2, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 1, 4, 4, 4…
## $ NEI_05H  <dbl> 4, 2, 2, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 4, 4, 4, 3, 4, 2, 4…
## $ NEI_05I  <dbl> 4, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 4, 4, 4, 4, 4, 4, 4…
## $ WSA_05   <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2…
## $ SDH_05   <dbl> 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 6, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ CER_05   <dbl> 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 1…
## $ CER_20   <dbl> 3, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 3, 1, 2, 3, 3, 3, 2, 3…
## $ LIS_10   <dbl> 2, 1, 2, 1, 2, 2, 2, 3, 1, 3, 3, 2, 3, 2, 1, 2, 1, 2, 1, 1, 3…
## $ COS_10   <dbl> 3, 3, 2, 1, 2, 3, 3, 9, 3, 3, 2, 3, 3, 1, 1, 2, 1, 3, 3, 2, 3…
## $ COS_15   <dbl> 3, 2, 3, 3, 4, 3, 2, 2, 1, 1, 4, 4, 3, 2, 3, 3, 2, 4, 3, 1, 1…
## $ GH_05    <dbl> 4, 4, 4, 1, 1, 4, 5, 3, 2, 2, 1, 3, 3, 1, 2, 4, 3, 3, 2, 1, 2…
## $ GH_10    <dbl> 3, 4, 4, 3, 1, 2, 3, 3, 2, 2, 1, 4, 3, 1, 3, 4, 3, 4, 2, 2, 1…
## $ REGION   <chr> "01", "05", "04", "04", "03", "02", "02", "03", "04", "01", "…
## $ PAGEGR1  <dbl> 2, 9, 2, 1, 2, 2, 9, 9, 1, 2, 9, 1, 2, 9, 1, 2, 1, 9, 1, 2, 2…
## $ PAGEGR2  <dbl> 1, 9, 2, 2, 2, 1, 9, 9, 2, 2, 9, 2, 2, 9, 2, 2, 2, 9, 1, 2, 2…
## $ PAGEGR3  <dbl> 1, 9, 2, 1, 1, 2, 9, 9, 1, 2, 9, 1, 1, 9, 1, 1, 1, 9, 1, 1, 2…
## $ PAGEGR4  <dbl> 2, 9, 1, 2, 2, 2, 9, 9, 2, 1, 9, 2, 2, 9, 2, 2, 2, 9, 2, 2, 1…
## $ PAGEP1   <dbl> 3, 3, 4, 2, 2, 1, 2, 1, 2, 4, 1, 2, 3, 3, 2, 3, 2, 2, 3, 3, 4…
## $ PCER_10  <dbl> 96, 2, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 2, 96,…
## $ PCER_15  <dbl> 6, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 2, 6, 6, 6, 6, 6, 3…
## $ PCHN     <dbl> 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 9, 2, 2, 2, 2, 2, 2, 2…
## $ PCOS_05  <dbl> 9, 5, 1, 6, 4, 99, 5, 99, 9, 9, 8, 1, 4, 8, 4, 4, 5, 4, 7, 7,…
## $ PDCLASS  <dbl> 1, 1, 0, 1, 0, 1, 2, 0, 0, 0, 9, 1, 2, 0, 1, 1, 1, 9, 0, 0, 0…
## $ PDCT_05  <dbl> 1, 2, 1, 1, 1, 2, 2, 2, 1, 1, 2, 2, 2, 9, 1, 1, 1, 2, 1, 1, 1…
## $ PDCT_20  <dbl> 4, 2, 3, 3, 2, 1, 3, 2, 4, 3, 2, 3, 2, 99, 4, 4, 2, 4, 3, 2, …
## $ PDCT_25  <dbl> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 9, 1, 1, 1, 1, 1, 2, 1…
## $ PDTYPER  <dbl> 3, 9, 0, 1, 0, 1, 9, 0, 0, 0, 9, 1, 3, 0, 2, 2, 2, 9, 0, 0, 0…
## $ PDV_SAH  <dbl> 6, 2, 6, 6, 6, 2, 2, 2, 6, 6, 2, 1, 9, 9, 6, 6, 6, 2, 6, 6, 6…
## $ PDV_SHCO <dbl> 4.7e+03, 1.4e+03, 3.4e+03, 3.5e+03, 2.1e+03, 1.0e+07, 1.0e+07…
## $ PDV_SUIT <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 9, 1, 1, 1, 1, 1, 1, 1…
## $ PDWLTYPE <chr> "01", "06", "01", "02", "01", "06", "04", "99", "01", "01", "…
## $ PDWS_05  <dbl> 3, 2, 1, 2, 3, 2, 3, 3, 3, 3, 3, 2, 3, 2, 3, 3, 3, 2, 3, 3, 3…
## $ PDWS_10A <dbl> 1, 4, 2, 2, 4, 3, 2, 2, 1, 1, 1, 1, 1, 2, 1, 1, 2, 4, 3, 2, 1…
## $ PDWS_10B <dbl> 1, 3, 2, 2, 4, 4, 2, 2, 1, 1, 1, 1, 1, 2, 1, 1, 2, 3, 3, 2, 1…
## $ PDWS_10C <dbl> 1, 2, 4, 3, 4, 2, 4, 2, 1, 1, 1, 1, 3, 3, 2, 1, 2, 4, 3, 1, 1…
## $ PDWS_10D <dbl> 1, 2, 2, 2, 1, 2, 2, 2, 2, 1, 1, 3, 2, 2, 1, 1, 2, 3, 3, 2, 1…
## $ PDWS_10E <dbl> 1, 2, 2, 2, 1, 3, 2, 2, 1, 1, 1, 2, 3, 2, 2, 1, 4, 3, 3, 1, 2…
## $ PDWS_10F <dbl> 1, 4, 3, 3, 3, 3, 4, 2, 2, 2, 4, 3, 3, 4, 3, 4, 3, 3, 3, 1, 1…
## $ PDWS_10G <dbl> 1, 2, 2, 2, 1, 3, 2, 2, 1, 1, 1, 2, 2, 4, 2, 2, 2, 3, 3, 1, 1…
## $ PDWS_10H <dbl> 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 1…
## $ PDWS_10I <dbl> 1, 2, 2, 2, 2, 3, 1, 2, 1, 1, 2, 4, 2, 3, 2, 2, 2, 2, 3, 3, 1…
## $ PDWS_10J <dbl> 1, 2, 2, 2, 4, 3, 1, 2, 1, 1, 2, 2, 2, 3, 2, 2, 2, 3, 3, 1, 2…
## $ PEHA_05A <dbl> 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 1, 1, 2, 2, 1, 2, 2, 2, 2, 2…
## $ PEHA_05B <dbl> 2, 1, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ PEHA_05C <dbl> 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ PEMPL    <dbl> 1, 1, 2, 1, 1, 1, 9, 9, 1, 2, 9, 9, 2, 1, 1, 1, 1, 1, 1, 1, 2…
## $ PFTHB5YR <dbl> 2, 6, 2, 2, 2, 6, 6, 6, 2, 2, 6, 6, 6, 9, 2, 2, 2, 6, 2, 2, 2…
## $ PFWEIGHT <dbl> 338.5873, 44.6467, 1706.9443, 150.9932, 1683.8582, 205.0446, …
## $ PGEOGR   <chr> "03", "26", "22", "22", "16", "10", "10", "16", "18", "04", "…
## $ PHGEDUC  <dbl> 3, 4, 6, 7, 5, 2, 99, 99, 1, 6, 6, 2, 99, 2, 6, 4, 6, 99, 4, …
## $ PHHSIZE  <dbl> 3, 2, 1, 3, 1, 2, 99, 99, 5, 1, 2, 4, 1, 99, 3, 1, 3, 99, 4, …
## $ PHHTTINC <dbl> 7.50e+04, 9.25e+04, 6.00e+04, 1.90e+05, 9.75e+04, 1.00e+11, 1…
## $ PHTYPE   <chr> "01", "03", "05", "01", "05", "02", "99", "99", "01", "05", "…
## $ PLIS_05  <dbl> 6, 6, 1, 7, 7, 7, 2, 6, 9, 9, 8, 3, 8, 4, 7, 4, 8, 4, 7, 7, 1…
## $ PNES_05  <dbl> 3, 1, 1, 3, 2, 1, 3, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 3, 3, 1…
## $ PNSC_15  <dbl> 1, 4, 2, 2, 4, 2, 2, 1, 1, 1, 2, 2, 1, 4, 2, 2, 3, 3, 2, 1, 2…
## $ POWN_20  <dbl> 1, 6, 1, 1, 1, 6, 6, 6, 1, 1, 6, 6, 6, 9, 1, 1, 2, 6, 1, 1, 2…
## $ POWN_80  <dbl> 5.0e+04, 1.0e+08, 5.2e+05, 3.5e+05, 1.0e+05, 1.0e+08, 1.0e+08…
## $ PPAC_05  <dbl> 4, 4, 4, 3, 3, 1, 1, 1, 1, 4, 1, 2, 2, 3, 3, 3, 3, 99, 4, 1, …
## $ PPAC_10  <dbl> 1, 1, 2, 1, 2, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ PPAC_23  <dbl> 1, 1, 6, 1, 6, 2, 9, 9, 1, 6, 2, 1, 6, 9, 1, 6, 1, 9, 2, 6, 6…
## $ PPAC_30  <dbl> 1, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 2, 2, 2, 3, 1, 1…
## $ PPAC_35  <dbl> 6, 2, 2, 2, 2, 2, 2, 2, 6, 6, 2, 1, 2, 2, 6, 2, 2, 2, 6, 6, 6…
## $ PPAC_45A <dbl> 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2…
## $ PPAC_45C <dbl> 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ PPAC_45D <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ PPAC_45E <dbl> 2, 2, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2…
## $ PPAC_45F <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ PPAC_45G <dbl> 2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 1, 2, 2, 1, 2, 2, 2, 2, 1, 2…
## $ PPAC_45H <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ PPAC_45I <dbl> 1, 1, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2…
## $ PPAC_45J <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1…
## $ PPAC_45K <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2…
## $ PPAC_45L <dbl> 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ PPAC_45M <dbl> 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 1, 2, 2…
## $ PPAC_45N <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2…
## $ PPAC_45O <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ PPROV    <dbl> 12, 59, 48, 48, 35, 24, 24, 35, 46, 12, 48, 47, 10, 35, 35, 4…
## $ PRSPGNDR <dbl> 2, 2, 2, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 2, 2, 1, 1, 1, 2, 2…
## $ PRSPIMST <dbl> 9, 1, 1, 1, 1, 1, 1, 9, 9, 9, 1, 9, 9, 2, 1, 9, 1, 9, 1, 1, 1…
## $ PSCR_05  <dbl> 6, 2, 6, 6, 6, 2, 1, 2, 6, 6, 2, 1, 9, 9, 6, 6, 6, 2, 6, 6, 6…
## $ PSCR_10  <dbl> 6, 2, 6, 6, 6, 2, 2, 2, 6, 6, 2, 1, 9, 9, 6, 6, 6, 2, 6, 6, 6…
## $ PSCR_25  <dbl> 96, 1, 96, 96, 96, 3, 3, 3, 96, 96, 3, 2, 99, 99, 96, 96, 96,…
## $ PSCR_35  <dbl> 6, 2, 6, 6, 6, 6, 2, 6, 6, 6, 6, 1, 6, 9, 6, 6, 6, 2, 6, 6, 6…
## $ PSCR_D40 <dbl> 999.6, 999.6, 999.6, 999.6, 999.6, 999.6, 999.6, 999.6, 999.6…
## $ PSTIR_GR <dbl> 3, 1, 3, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 9, 1, 1, 1, 1, 1, 1, 1…
## $ PVISMIN  <dbl> 9, 2, 2, 1, 2, 2, 9, 9, 9, 9, 9, 9, 9, 9, 2, 9, 1, 9, 2, 2, 2…
## $ PWSA_D15 <dbl> 999.6, 999.6, 999.6, 999.6, 999.6, 999.6, 999.6, 999.6, 999.6…
## $ VERDATE  <chr> "30/11/2022", "30/11/2022", "30/11/2022", "30/11/2022", "30/1…

Reading this output:

The glimpse() function shows you:

  • Rows: 40,988 and Columns: 109 — confirming what we saw above
  • A list of all variable names (like PUMFID, EHA_10, REGION, etc.)
  • The data type of each variable (<dbl> means numeric, <chr> means text/character)
  • The first few values of each variable

📓 Journal Prompt 1 — ANSWER KEY

In your journal, write:

  1. How many households are in this dataset?
  2. How many variables were measured for each household?
  3. What does each row represent? What does each column represent?
  4. Looking at the variable names, can you guess what any of them might measure?

Suggested Answers:

  1. 40,988 households are in this dataset (this is the number of rows)

  2. 109 variables were measured for each household (this is the number of columns)

  3. Each row represents one household — specifically, a private household in one of the 10 provinces or 3 territorial capitals. Each column represents one variable — a piece of information collected about that household (like their province, income, dwelling type, etc.)

  4. Variable meanings from the codebook:

    • REGION — Region of residence (Atlantic, Quebec, Ontario, Prairies, British Columbia, Territories)
    • PPROV — Province of residence
    • PHHSIZE — Household size (number of people)
    • PHHTTINC — Total income of household
    • GH_05 — General health - self-assessed health
    • PDWLTYPE — Dwelling type
    • EHA_10 — Economic hardship - level of difficulty experienced in past 12 months
    • PCHN — Core housing need

    Note: The codebook is your friend! These abbreviated variable names are common in survey data and the codebook tells you exactly what each one means.


Part 2: Counting Categories

Categorical variables sort households into groups, and counting tells us how many households fall into each group. Let’s explore several.

Tenure: Owned or Rented?

The variable PDCT_05 records whether the dwelling is owned or rented.

From the codebook:

  • 1 = Yes (Owned by a member of the household)
  • 2 = No (Rented)
  • 9 = Not stated

chs %>%
  count(PDCT_05)
## # A tibble: 3 × 2
##   PDCT_05     n
##     <dbl> <int>
## 1       1 20821
## 2       2 18779
## 3       9  1388

Reading this output:

This table shows three rows:

  • 20,821 households have PDCT_05 = 1 (owned)
  • 18,779 households have PDCT_05 = 2 (rented)
  • 1,388 households have PDCT_05 = 9 (not stated)

Now let’s add percentages:

chs %>%
  count(PDCT_05) %>%
  mutate(percent = round(n / sum(n) * 100, 1))
## # A tibble: 3 × 3
##   PDCT_05     n percent
##     <dbl> <int>   <dbl>
## 1       1 20821    50.8
## 2       2 18779    45.8
## 3       9  1388     3.4

Reading this output:

The mutate() function added a new column called percent. Now we can see:

  • 50.8% of households in the sample own their home
  • 45.8% rent
  • 3.4% didn’t state their tenure

📓 Journal Prompt 2 — ANSWER KEY

  1. What percentage of households in this sample own their home?
  2. What percentage rent?
  3. How many households didn’t answer this question (code 9)?
  4. Does this distribution surprise you? Why or why not?

Suggested Answers:

  1. 50.8% of households own their home (or, if we exclude the “not stated” responses: 20,821 / 39,600 = 52.6%)

  2. 45.8% rent (or excluding “not stated”: 18,779 / 39,600 = 47.4%)

  3. 1,388 households didn’t answer this question (code 9 = “Not stated”)

  4. Interpretation (answers will vary):

    • This roughly 50/50 split in the unweighted sample might surprise you. However, the codebook shows that when properly weighted to represent the Canadian population, 66.8% of households own their home and 31.6% rent.
    • The difference between unweighted sample counts and weighted population estimates occurs because the CHS oversampled certain groups (like renters) to ensure adequate sample sizes for analysis. This is a common survey design strategy. We will talk more about this in future sessions!

Important methodological note: When we report percentages, we should be clear about our denominator and whether we’re using weighted or unweighted data. The percentages we calculate here are unweighted sample percentages. For population-level estimates, survey weights (the PFWEIGHT variable) should be applied. Again, we will get back to this!
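
The effect of the denominator can be checked directly. The sketch below retypes the three counts from the count(PDCT_05) output into a small tibble (the object names tenure, pct_all, and pct_valid are made up for this example) and computes the percentages both ways. It is still unweighted.

```r
library(dplyr)

# Counts copied from the count(PDCT_05) output above
tenure <- tibble(
  PDCT_05 = c(1, 2, 9),
  n       = c(20821, 18779, 1388)
)

# Denominator 1: every household in the sample (includes "not stated")
pct_all <- tenure %>%
  mutate(percent = round(n / sum(n) * 100, 1))

# Denominator 2: only households with a valid answer (drop code 9 first)
pct_valid <- tenure %>%
  filter(PDCT_05 != 9) %>%
  mutate(percent = round(n / sum(n) * 100, 1))

pct_all
pct_valid
```

Filtering before computing the percentage shrinks the denominator from 40,988 to 39,600, which is why the owner share moves from 50.8% to 52.6%.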


Self-Assessed Health

The variable GH_05 asks about self-assessed health.

From the codebook:

  • 1 = Excellent
  • 2 = Very good
  • 3 = Good
  • 4 = Fair
  • 5 = Poor
  • 9 = Not stated

The question wording is: “In general, how is your health? Would you say:”

chs %>%
  count(GH_05) %>%
  mutate(percent = round(n / sum(n) * 100, 1))
## # A tibble: 6 × 3
##   GH_05     n percent
##   <dbl> <int>   <dbl>
## 1     1  5428    13.2
## 2     2 13124    32  
## 3     3 13955    34  
## 4     4  6208    15.1
## 5     5  2068     5  
## 6     9   205     0.5

Reading this output:

This shows a distribution across six categories. Notice that the largest groups are in the middle (codes 2 and 3), with fewer at the extremes.


📓 Journal Prompt 3 — ANSWER KEY

  1. What is the most common response?
  2. What percentage report “Fair” or “Poor” health (codes 4 and 5 combined)?
  3. In the codebook/original survey, what kind of variable is this?

Suggested Answers:

  1. “Good” (code 3) is the most common response, with 34.0% of respondents (13,955 households)

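  2. About 20% report “Fair” or “Poor” health. Adding the rounded percentages gives 15.1% + 5.0% = 20.1%; computing from the counts gives (6,208 + 2,068) / 40,988 = 20.2%. The small gap is rounding.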

  3. This is an ordinal variable — the categories have a meaningful order (Excellent > Very good > Good > Fair > Poor), but the “distances” between categories aren’t necessarily equal. We can say “Excellent” is better than “Good,” but we can’t say the difference between Excellent and Very good is the same as between Fair and Poor.

Additional notes:

  • Notice the slight positive skew in the codes: responses cluster at the favorable end (codes 1 to 3), with a thinner tail toward “Fair” and “Poor”
  • Self-assessed health is a widely used measure in social science because it’s been shown to predict actual health outcomes (like mortality)
  • The “9 - Not stated” is very small here (0.5%), which is good for data quality
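
One possible way to make the ordinal nature of GH_05 explicit in R is an ordered factor. The sketch below retypes the counts from the output above into a tibble (the names health, label, and fair_poor_pct are assumptions for this example), attaches the codebook labels, and computes the combined Fair/Poor share from the counts rather than from rounded percentages.

```r
library(dplyr)

# Counts copied from the count(GH_05) output above
health <- tibble(
  GH_05 = c(1, 2, 3, 4, 5, 9),
  n     = c(5428, 13124, 13955, 6208, 2068, 205)
)

# Attach codebook labels. Code 9 is left out of the levels, so it becomes NA.
# ordered = TRUE records that the categories are ranked.
health <- health %>%
  mutate(
    label = factor(
      GH_05,
      levels  = 1:5,
      labels  = c("Excellent", "Very good", "Good", "Fair", "Poor"),
      ordered = TRUE
    )
  )

# Share reporting Fair or Poor, using all 40,988 households as the denominator
fair_poor_pct <- health %>%
  summarize(pct = round(sum(n[GH_05 %in% c(4, 5)]) / sum(n) * 100, 1)) %>%
  pull(pct)

health
fair_poor_pct
```

Computed this way the combined share is 20.2%, slightly above the 20.1% obtained by adding two already-rounded percentages.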

Dwelling Condition

The variable PDCT_25 asks about the condition of the dwelling.

From the codebook:

  • 1 = No, only regular maintenance is needed
  • 2 = Yes, minor repairs are needed
  • 3 = Yes, major repairs are needed
  • 9 = Not stated

The question asks: “Is this dwelling in need of any repairs?”

chs %>%
  count(PDCT_25) %>%
  mutate(percent = round(n / sum(n) * 100, 1))
## # A tibble: 4 × 3
##   PDCT_25     n percent
##     <dbl> <int>   <dbl>
## 1       1 28078    68.5
## 2       2  9095    22.2
## 3       3  3019     7.4
## 4       9   796     1.9

📓 Journal Prompt 4 — ANSWER KEY

  1. What share of dwellings need major repairs? Can you tell accurately from the output above? Why or why not?
  2. If you were a policy analyst, what might these numbers tell you about housing quality in Canada?
  3. What might this variable miss if we want to assess housing quality? (Think about operationalization)

Suggested Answers:

  1. About 7.4% need major repairs — but we should be cautious about this figure. Including the “not stated” responses (1.9%) in the denominator slightly deflates the percentage. If we exclude non-responses: 3,019 / 40,192 = 7.5%. This is close enough that it doesn’t change our interpretation much, but precision matters in research.

  2. Policy implications:

    • About 7% of dwellings in this unweighted sample (roughly 1 in 14) need major repairs, which is a significant housing quality issue
    • Combined with minor repairs (22.2%), about 30% of sampled dwellings need some repair work
    • This suggests potential targets for housing improvement programs
    • Regional breakdowns would help identify where repairs are most needed

  3. Operationalization limitations:

    • This is self-reported — what one person calls “major repairs” another might call “minor”
    • It doesn’t capture what kind of repairs (structural? cosmetic? health-related like mold?)
    • It doesn’t measure things people might not notice (like hidden mold, lead paint, radon)
    • It doesn’t account for ability/willingness to identify problems
    • A professional inspection would be more objective but far more expensive to collect
    • Note: The CHS does have separate variables for specific dwelling issues like mold/mildew (DWI_05A), pests (DWI_05B), undrinkable water (DWI_05C), and poor indoor air quality (DWI_05D)

Part 3: Numerical Summaries

For numerical variables, we calculate statistics like mean and median.

Household Size: First Attempt

The variable PHHSIZE records the number of people in the household.

From the codebook:

  • 1 = 1 person
  • 2 = 2 people
  • 3 = 3 people
  • 4 = 4 people
  • 5 = 5+ people
  • 99 = Not stated

Let’s calculate the mean, median, and maximum:

chs %>%
  summarize(
    mean_size = mean(PHHSIZE),
    median_size = median(PHHSIZE),
    max_size = max(PHHSIZE)
  )
## # A tibble: 1 × 3
##   mean_size median_size max_size
##       <dbl>       <dbl>    <dbl>
## 1      15.5           2       99

🚨 Something looks wrong! A mean household size of 15.5 people? A maximum of 99? That doesn’t make sense.


📓 Journal Prompt 5 — ANSWER KEY

Something looks wrong here.

  1. What is the calculated mean?
  2. What is the maximum value?
  3. Why might these numbers be problematic?

Suggested Answers:

  1. The calculated mean is 15.5 (which is clearly wrong)

  2. The maximum value is 99

  3. Why this is problematic:

    • The code 99 = “Not stated” is being treated as an actual number
    • R doesn’t know that 99 is a code for missing data — it just sees it as the number ninety-nine
    • When you include 99 in your calculations, it inflates the mean dramatically
    • This is a very common error in survey data analysis!
    • From the codebook, we can see that 5,688 households (13.9% of the unweighted sample) have code 99

    The lesson: Always check your codebook and handle special codes appropriately before calculating statistics.
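
A quick way to spot special codes before summarizing is to tabulate the distinct values. The sketch below uses a short made-up vector (hh_size is hypothetical, not the real PHHSIZE column) in which 99 stands in for “Not stated”.

```r
# Hypothetical household sizes; 99 stands in for "Not stated"
hh_size <- c(1, 2, 2, 3, 1, 99, 4, 2, 99, 5)

# Tabulate every distinct value: an isolated 99 far from the rest is a red flag
size_counts <- table(hh_size)
size_counts

# The range tells the same story at a glance
range(hh_size)
```

A value that sits far outside the plausible range and also appears in the codebook as a special code should be handled before any mean is calculated.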


Household Size: Corrected

The issue is that code 99 means “not stated” — it’s not a household of 99 people! We need to filter it out:

chs %>%
  filter(PHHSIZE != 99) %>%
  summarize(
    mean_size = mean(PHHSIZE),
    median_size = median(PHHSIZE),
    min_size = min(PHHSIZE),
    max_size = max(PHHSIZE),
    n = n()
  )
## # A tibble: 1 × 5
##   mean_size median_size min_size max_size     n
##       <dbl>       <dbl>    <dbl>    <dbl> <int>
## 1      2.02           2        1        5 35300

Reading this output:

Now we get sensible numbers:

  • Mean: 2.02 people per household
  • Median: 2 people
  • Range: 1 to 5 (remember, 5 means “five or more”)
  • n = 35,300 (we lost 5,688 observations by filtering out the 99s)

📓 Journal Prompt 6 — ANSWER KEY

  1. What is the corrected mean household size?
  2. What is the median?
  3. Why is the total number of observations different from when we started?
  4. Which measure better represents the “typical” household? Why?
  5. Key lesson: What went wrong in the first calculation, and why does filtering matter?

Suggested Answers:

  1. The corrected mean household size is 2.02 people

  2. The median is 2 people

  3. We went from 40,988 to 35,300 observations because we filtered out households that didn’t state their household size (code 99). The difference (5,688 households, or about 14%) is quite substantial and worth noting when reporting results.

  4. Both measures are reasonable here:

    • The median (2) tells us that at least half of households have 2 or fewer people
    • The mean (2.02) is almost identical to the median. That alone does not make the distribution symmetric: the frequencies below show it is right-skewed, with most households at 1 or 2 people
    • With a maximum code of 5 meaning “5 or more,” the mean is actually slightly underestimated (we don’t know the exact size of larger households)
    • For most purposes, reporting “the typical household has 2 people” captures the key insight
    • Looking at the codebook frequencies: 1-person (13,912), 2-person (12,998), 3-person (3,725), 4-person (3,173), 5+ person (1,492); 1- and 2-person households clearly dominate

  5. Key lesson:

    • What went wrong: R treated the code 99 (“not stated”) as an actual number, dragging up the mean
    • Why filtering matters: Survey data often uses special codes for missing values. If you don’t handle them properly, your statistics will be wrong — sometimes dramatically so
    • Best practice: Always check the codebook BEFORE calculating statistics, and filter or recode special values appropriately
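
Recoding is an alternative to filtering: the rows stay in the data, but the special code becomes NA. The sketch below rebuilds a stand-in for the PHHSIZE column from the codebook frequencies quoted above (sizes, PHHSIZE_clean, and result are names invented for this example) and uses dplyr’s na_if() together with na.rm = TRUE.

```r
library(dplyr)

# Rebuild a stand-in PHHSIZE column from the codebook frequencies quoted above
sizes <- tibble(
  PHHSIZE = rep(c(1, 2, 3, 4, 5, 99),
                times = c(13912, 12998, 3725, 3173, 1492, 5688))
)

result <- sizes %>%
  mutate(PHHSIZE_clean = na_if(PHHSIZE, 99)) %>%   # every 99 becomes NA
  summarize(
    mean_raw   = round(mean(PHHSIZE), 1),                      # 99s included
    mean_clean = round(mean(PHHSIZE_clean, na.rm = TRUE), 2),  # NAs ignored
    n_missing  = sum(is.na(PHHSIZE_clean))
  )

result
```

This reproduces both numbers seen earlier (15.5 with the 99s, 2.02 without) and keeps a count of how many values were set aside, which is easy to lose track of when rows are filtered out.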

Part 4: Putting It Together

Now let’s look at variables that measure related concepts in different ways.

Shelter-Cost-to-Income Ratio

The variable PSTIR_GR measures what share of income goes to shelter costs.

From the codebook:

  • 1 = Spending less than 30% of income on shelter costs
  • 2 = Spending 30% to less than 50% of income on shelter costs
  • 3 = Spending 50% to less than 100% of income on shelter costs
  • 4 = Spending over 100% of income on shelter costs
  • 5 = Not applicable (households on reserves, farm dwellings, and households that reported a zero or negative total household income)
  • 9 = Not stated

chs %>%
  filter(PSTIR_GR %in% c(1, 2, 3, 4)) %>%
  count(PSTIR_GR) %>%
  mutate(percent = round(n / sum(n) * 100, 1))
## # A tibble: 4 × 3
##   PSTIR_GR     n percent
##      <dbl> <int>   <dbl>
## 1        1 31869    80.4
## 2        2  5743    14.5
## 3        3  1669     4.2
## 4        4   379     1

A common benchmark: households spending 30% or more of income on shelter are considered “cost-burdened.”

Reading this output:

We filtered to only include codes 1-4 (excluding “not applicable” and “not stated”). Of these households:

  • 80.4% spend less than 30% (not cost-burdened)
  • 14.5% spend 30-50% (moderately cost-burdened)
  • 4.2% spend 50-100% (severely cost-burdened)
  • 1.0% spend 100% or more (extremely cost-burdened)

📓 Journal Prompt 7 — ANSWER KEY

  1. What percentage of households spend less than 30% on shelter (code 1)?
  2. What percentage are “cost-burdened” (codes 2, 3, and 4 combined)?
  3. What percentage spend 50% or more (codes 3 and 4)?

Suggested Answers:

  1. 80.4% of households (among those with valid data) spend less than 30% of income on shelter

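  2. 19.6% are “cost-burdened” (100% - 80.4% = 19.6%). Adding the rounded percentages instead gives 14.5% + 4.2% + 1.0% = 19.7%; the 0.1-point gap is rounding error, and the counts confirm 19.6% (7,791 / 39,660)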

  3. 5.2% spend 50% or more on shelter (4.2% + 1.0% = 5.2%)

Additional context:

  • The 30% threshold is a policy standard used by CMHC and Statistics Canada
  • Nearly 1 in 5 households being cost-burdened represents a significant affordability challenge
  • The 5.2% in severe cost burden (50%+) may face difficult choices between housing and other necessities
  • Shelter-cost-to-income ratio is calculated by dividing the average monthly shelter costs by the average monthly total household income and multiplying by 100
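
The ratio described in the last bullet can be sketched in a few lines. The helper functions stir() and stir_group() and the dollar amounts below are hypothetical; the grouping follows the PSTIR_GR categories, with the assumption that a ratio of exactly 100 falls in group 4. The sketch ignores the “not applicable” cases such as zero or negative income.

```r
# Hypothetical helper: shelter-cost-to-income ratio as a percentage
stir <- function(monthly_shelter_cost, monthly_income) {
  monthly_shelter_cost / monthly_income * 100
}

# Hypothetical helper: map a ratio onto PSTIR_GR-style groups 1 to 4.
# Assumption: a ratio of exactly 100 is placed in group 4.
stir_group <- function(ratio) {
  cut(ratio,
      breaks = c(-Inf, 30, 50, 100, Inf),
      labels = c(1, 2, 3, 4),
      right  = FALSE)   # intervals are [lower, upper)
}

# Four made-up households
shelter <- c(1000, 1500, 2000, 2500)
income  <- c(5000, 4000, 3000, 2000)

ratios <- stir(shelter, income)
groups <- stir_group(ratios)

round(ratios, 1)
groups
```

The four made-up households have ratios of 20, 37.5, 66.7, and 125, so they land in groups 1, 2, 3, and 4 respectively.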

Core Housing Need

The variable PCHN is a composite indicator of housing need that combines adequacy, suitability, and affordability.

From the codebook:

  • 1 = In core housing need
  • 2 = Not in core housing need
  • 3 = Not examined for core housing need
  • 9 = Not stated

A household is in “core housing need” if its housing fails to meet at least one of three standards established for housing adequacy (dwelling condition), suitability (enough bedrooms for household composition), and affordability (spending less than 30% on shelter), AND if its income before taxes is at or below the appropriate community-and-bedroom-specific income threshold.
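
The “at least one standard AND the income test” logic can be illustrated with a toy example. Everything in the sketch below is made up: the column names are hypothetical and are not CHS variables, and the real PCHN is derived by Statistics Canada from more detailed inputs.

```r
library(dplyr)

# Made-up households; every column name here is hypothetical
households <- tibble(
  id                  = 1:4,
  needs_major_repairs = c(FALSE, TRUE,  FALSE, TRUE),   # fails adequacy
  too_few_bedrooms    = c(FALSE, FALSE, FALSE, FALSE),  # fails suitability
  stir_30_or_more     = c(TRUE,  FALSE, FALSE, TRUE),   # fails affordability
  income_below_cutoff = c(TRUE,  FALSE, TRUE,  TRUE)    # at or below local threshold
)

households <- households %>%
  mutate(
    # Fails at least one of the three standards
    fails_a_standard = needs_major_repairs | too_few_bedrooms | stir_30_or_more,
    # Core need requires BOTH a failed standard AND low income
    core_need        = fails_a_standard & income_below_cutoff
  )

households %>% select(id, fails_a_standard, core_need)
```

Household 2 fails a standard but has income above the cutoff, and household 3 has low income but acceptable housing, so neither counts as being in core need in this toy version.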

chs %>%
  filter(PCHN %in% c(1, 2)) %>%
  count(PCHN) %>%
  mutate(percent = round(n / sum(n) * 100, 1))
## # A tibble: 2 × 3
##    PCHN     n percent
##   <dbl> <int>   <dbl>
## 1     1  4477    11.4
## 2     2 34804    88.6

📓 Journal Prompt 8 — ANSWER KEY

  1. What percentage of households are in core housing need?
  2. Both PSTIR_GR and PCHN relate to housing affordability. How are they different?
  3. Operationalization question: Which measure better captures “housing affordability”? What does each capture that the other might miss?

Suggested Answers:

  1. 11.4% of households are in core housing need (among those examined — 4,477 out of 39,281 households)

  2. How they differ:

    | Feature | PSTIR_GR (Shelter Cost Ratio) | PCHN (Core Housing Need) |
    |---|---|---|
    | Focus | Single dimension: cost burden | Multiple dimensions: adequacy + suitability + affordability |
    | Threshold | 30% of income to shelter | Fails standards AND can’t afford acceptable local housing |
    | Complexity | Simple ratio | Composite indicator derived from multiple variables |
    | What it measures | Current spending burden | Whether household NEEDS better housing AND can’t afford it |

  3. Operationalization assessment:

    Shelter Cost Ratio (PSTIR_GR) captures:

    • Simple, clear measure of current financial burden
    • Easy to understand and communicate
    • BUT: A household might be cost-burdened by choice (living in a nicer place than necessary) or might have adequate housing despite high costs

    Core Housing Need (PCHN) captures:

    • Multiple dimensions of housing problems (derived from dwelling condition, housing suitability, and shelter-cost-to-income ratio variables)
    • Only flags households who NEED improvement AND can’t afford it locally
    • BUT: More complex, harder to explain, may miss households who are struggling but haven’t hit all the thresholds

    Neither captures:

    • Housing insecurity or precarity (risk of losing housing)
    • Neighborhood quality
    • Distance from work/services
    • Subjective satisfaction with housing (though the CHS does measure this with variables like PDWS_05 and PDWS_10A-J)

Part 5: Challenge (Optional)

This section is optional but good practice.

Filtering to Atlantic Canada

Let’s look at just Atlantic Canadian households.

Province codes (from codebook - PPROV variable):

  • 10 = Newfoundland and Labrador
  • 11 = Prince Edward Island
  • 12 = Nova Scotia
  • 13 = New Brunswick

# Filter to Atlantic provinces
atlantic <- chs %>%
  filter(PPROV %in% c(10, 11, 12, 13))

# How many Atlantic households?
nrow(atlantic)
## [1] 11263

# What share of the national sample?
nrow(atlantic) / nrow(chs) * 100
## [1] 27.47877

Reading this output:

  • 11,263 Atlantic Canadian households in the sample
  • This represents 27.5% of the national sample

Compare a variable by province

Let’s look at economic hardship across Atlantic provinces. The variable EHA_10 asks: “In the past 12 months, how difficult or easy was it for your household to meet its financial needs in terms of transportation, housing, food, clothing and other necessary expenses?”

From the codebook:

  • 1 = Very difficult
  • 2 = Difficult
  • 3 = Neither difficult nor easy
  • 4 = Easy
  • 5 = Very easy
  • 9 = Not stated

atlantic %>%
  filter(EHA_10 != 9) %>%
  group_by(PPROV) %>%
  summarize(
    n = n(),
    pct_difficult = round(sum(EHA_10 %in% c(1, 2)) / n() * 100, 1)
  )
## # A tibble: 4 × 3
##   PPROV     n pct_difficult
##   <dbl> <int>         <dbl>
## 1    10  2440          26  
## 2    11  1708          19.7
## 3    12  3061          23  
## 4    13  4031          22.9

Reading this output:

This table shows, for each Atlantic province:

  • The number of households (after filtering out “not stated”)
  • The percentage reporting economic difficulty (codes 1 “Very difficult” or 2 “Difficult” on the EHA_10 variable)
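
The percentage column works because EHA_10 %in% c(1, 2) produces TRUE or FALSE for each household, and TRUE counts as 1. A small sketch on made-up data (the tibble toy and its values are hypothetical) shows the same pattern, written with mean() of the logical vector, which is equivalent to dividing the sum by n().

```r
library(dplyr)

# Made-up responses for two provinces (codes follow the EHA_10 scheme)
toy <- tibble(
  PPROV  = c(10, 10, 10, 10, 11, 11, 11, 11, 11),
  EHA_10 = c(1,  2,  4,  9,  2,  3,  5,  5,  4)
)

by_prov <- toy %>%
  filter(EHA_10 != 9) %>%            # drop "Not stated" before computing shares
  group_by(PPROV) %>%
  summarize(
    n             = n(),
    # mean of TRUE/FALSE values = proportion of TRUEs
    pct_difficult = round(mean(EHA_10 %in% c(1, 2)) * 100, 1)
  )

by_prov
```

In the toy data, province 10 keeps 3 valid responses with 2 reporting difficulty (66.7%), and province 11 keeps 5 with 1 reporting difficulty (20%).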

📓 Journal Prompt 9 (Challenge) — ANSWER KEY

  1. How many Atlantic Canadian households are in the sample?
  2. Which province has the highest share reporting economic hardship (codes 1 or 2)?
  3. What differences do you notice across provinces?

Suggested Answers:

  1. 11,263 Atlantic Canadian households are in the sample (sum of: NL 2,446 + PEI 1,714 + NS 3,065 + NB 4,038 from codebook)

  2. Newfoundland and Labrador (province code 10) has the highest share at 26.0% reporting economic hardship (codes 1 or 2)

  3. Provincial differences:

    • NL (code 10): 26.0% (highest) — n = 2,440 after filtering
    • NS (code 12): 23.0% — n = 3,061 after filtering
    • NB (code 13): 22.9% — n = 4,031 after filtering
    • PEI (code 11): 19.7% (lowest) — n = 1,708 after filtering

    Observations:

    • There’s about a 6 percentage point gap between the highest (NL) and lowest (PEI) provinces
    • All four provinces have substantial portions (roughly 1 in 4 to 1 in 5) reporting economic hardship
    • These patterns may reflect differences in employment opportunities, cost of living, wages, or population characteristics
    • NL’s higher rate may reflect economic challenges related to the oil industry downturn and outmigration

Note: Atlantic Canada makes up 27.5% of this unweighted sample, but only 6.9% of the weighted population estimate (from the REGION variable in the codebook). This is because the CHS oversampled Atlantic Canada to ensure adequate sample sizes for regional analysis — a common survey design strategy.
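
To see how oversampling and weights interact, consider a toy example with invented numbers. Region “A” is oversampled, so each of its households carries a small weight; the names toy, weight, and shares are made up for this illustration.

```r
library(dplyr)

# Made-up data: region "A" is oversampled, so each sampled household
# there represents fewer households in the population (smaller weight)
toy <- tibble(
  region = c("A", "A", "A", "B", "B"),
  weight = c(10, 10, 10, 100, 100)
)

shares <- toy %>%
  group_by(region) %>%
  summarize(
    n_sample   = n(),          # households in the sample
    n_weighted = sum(weight)   # households represented in the population
  ) %>%
  mutate(
    pct_sample   = round(n_sample / sum(n_sample) * 100, 1),
    pct_weighted = round(n_weighted / sum(n_weighted) * 100, 1)
  )

shares
```

Region “A” is 60% of the toy sample but only 13% of the weighted total. With the real data, a comparable first look would be count(chs, REGION, wt = PFWEIGHT); proper survey estimates also need design-based standard errors, which this sketch does not attempt.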