Getting Started: A Quick Guide to R Markdown

Before we dive into the data, let’s make sure you’re comfortable with R Markdown and how to work through this document.

What is R Markdown?

R Markdown (.Rmd files) combines three things:

Regular text (like what you’re reading now) — for explanations and writing
Code chunks (the gray boxes with R code) — for running analyses
Output — the results that appear after you run code

How to Run Code

There are several ways to run code in R Markdown:

Run a single chunk: Click the green “play” button (▶) in the top-right corner of any code chunk
Run all chunks above: Click the downward-pointing arrow next to the play button
Run selected lines: Highlight code and press Ctrl+Enter (Windows) or Cmd+Enter (Mac)
Knit the whole document: Click the “Knit” button at the top to run everything and produce the final HTML output

Tip: When working through a practice document, run chunks one at a time so you can check the output and make sure you understand it before moving on.

What is “Knitting”?

“Knitting” means converting your .Rmd file into a final document (HTML, PDF, or Word). When you knit:

R runs ALL the code chunks from top to bottom
It combines the text, code, and output into one polished document
The output file appears in the same folder as your .Rmd file

Common knitting errors and fixes:

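“Object not found” → Knitting starts from a fresh R session, so the object must be created by a chunk that appears earlier in the document. Objects that exist only in your Console or Environment pane are not available when you knit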
“Could not find function” → You need to load a package first (usually library(tidyverse))
“File not found” → Check that your data file is in the right folder
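
One way to diagnose the “File not found” case is to ask R where it is looking and whether the file is actually there. The sketch below uses only base R. The path is the one used by the loading chunk in Part 1 of this document; adjust it if your folder layout differs.

```r
# Path used by the read_csv() call in Part 1 (adjust if your file lives elsewhere)
data_path <- "data/CHS2021ECL_PUMF.csv"

# The folder R treats as the starting point for relative paths
working_dir <- getwd()
working_dir

# TRUE if the file is where the path says it is, FALSE if not
file_found <- file.exists(data_path)
file_found
```

If file_found is FALSE, compare the folder printed by getwd() with the place where you saved the data file.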

Reading Output

When you run a code chunk, the output appears directly below it. Learning to read this output is a key skill we’re building. You’ll see things like:

Tables — rows and columns of data
Messages — information from R (often about packages loading)
Warnings — something might not be quite right (but the code still ran)
Errors — the code didn’t work; you need to fix something
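
The three kinds of feedback can be triggered on purpose to see how they differ. This is a minimal base R sketch with made-up inputs; tryCatch() is used only so the example can display the error text and keep going.

```r
# Message: information only; the code keeps running
message("Loading finished")

# Warning: the code ran, but the result may not be what you wanted
converted <- as.numeric(c("12", "abc"))  # "abc" cannot be converted, so it becomes NA
converted

# Error: the code stops at this point.
# tryCatch() catches the error here so we can look at its text instead of halting.
error_text <- tryCatch(
  log("abc"),
  error = function(e) conditionMessage(e)
)
error_text
```

Notice that the warning still produces a result (with an NA in it), while the error produces no result at all.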

Overview

This practice document walks you through exploring the 2021 Canadian Housing Survey (CHS) data. Your goal is not just to run code, but to interpret what you see and reflect on what the numbers mean.

Work through each section, running the code chunks and answering the journal prompts in your course journal. There are no “right” answers to the reflection questions — the goal is to practice the cycle of run, check, interpret.

Part 1: Getting Oriented

Before analyzing data, we need to understand what we’re working with.

Load packages and data

# Load the tidyverse package
library(tidyverse)

# Load the Canadian Housing Survey data
# Make sure the file is in your data/ folder
chs <- read_csv("data/CHS2021ECL_PUMF.csv")

What’s happening here?

library(tidyverse) loads a collection of R packages for data analysis
read_csv() reads the data file and stores it as an object called chs
If you get an error, check the troubleshooting tips above

Check the dimensions

# How many observations (rows)?
nrow(chs)

## [1] 40988

# How many variables (columns)?
ncol(chs)

## [1] 109

Reading this output:

nrow(chs) returns 40,988 — this is the number of rows (observations)
ncol(chs) returns 109 — this is the number of columns (variables)
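
If you want both numbers at once, base R's dim() returns rows and columns together. The sketch below uses a small made-up table rather than the CHS file so it runs on its own; the column names are illustrative only.

```r
library(dplyr)

# Toy data (made up for illustration): 3 rows, 2 columns
toy <- tibble(
  household_id = c(1, 2, 3),
  tenure = c(1, 2, 1)
)

nrow(toy)  # 3
ncol(toy)  # 2
dim(toy)   # 3 2 (rows first, then columns)
```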

Look at the structure

# Get an overview of all variables
glimpse(chs)

## Rows: 40,988
## Columns: 109
## $ PUMFID   <dbl> 63501, 63502, 63503, 63504, 63505, 63506, 63507, 63508, 63509…
## $ EHA_10   <dbl> 3, 2, 2, 5, 3, 3, 1, 3, 4, 4, 4, 2, 3, 3, 5, 1, 5, 3, 3, 5, 4…
## $ EHA_10A  <dbl> 6, 6, 6, 6, 6, 6, 1, 6, 6, 6, 6, 6, 6, 6, 6, 2, 6, 6, 6, 6, 6…
## $ EHA_10B  <dbl> 6, 2, 2, 6, 6, 6, 6, 6, 6, 6, 6, 1, 6, 6, 6, 6, 6, 6, 6, 6, 6…
## $ EHA_25   <dbl> 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2…
## $ DWS_05A  <dbl> 3, 3, 2, 3, 1, 3, 3, 3, 3, 3, 3, 3, 3, 2, 3, 3, 3, 3, 3, 1, 3…
## $ DWI_05A  <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ DWI_05B  <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2…
## $ DWI_05C  <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ DWI_05D  <dbl> 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ NES_05A  <dbl> 3, 2, 2, 3, 3, 3, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3…
## $ NSC_30A  <dbl> 1, 1, 2, 3, 1, 2, 1, 2, 2, 1, 2, 2, 3, 1, 2, 1, 2, 1, 2, 1, 2…
## $ NSC_30B  <dbl> 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 3, 1, 1, 3, 1, 1, 2, 1, 9…
## $ NSC_30C  <dbl> 1, 3, 3, 3, 3, 3, 1, 3, 2, 2, 3, 2, 3, 2, 2, 3, 2, 3, 2, 1, 9…
## $ NEI_05A  <dbl> 4, 4, 3, 4, 4, 3, 4, 4, 4, 4, 4, 4, 4, 2, 4, 4, 2, 4, 4, 2, 3…
## $ NEI_05B  <dbl> 4, 3, 3, 4, 4, 4, 2, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 4, 4, 4…
## $ NEI_05C  <dbl> 4, 3, 3, 4, 4, 3, 4, 4, 4, 4, 1, 4, 4, 4, 4, 4, 3, 3, 3, 4, 4…
## $ NEI_05D  <dbl> 4, 3, 3, 4, 4, 3, 4, 4, 4, 4, 4, 2, 4, 1, 4, 4, 4, 2, 4, 4, 4…
## $ NEI_05E  <dbl> 4, 2, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 2, 4, 4, 4…
## $ NEI_05F  <dbl> 4, 1, 2, 4, 4, 4, 4, 4, 4, 4, 3, 4, 4, 4, 4, 4, 3, 2, 4, 4, 4…
## $ NEI_05G  <dbl> 4, 1, 2, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 1, 4, 4, 4…
## $ NEI_05H  <dbl> 4, 2, 2, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 4, 4, 4, 3, 4, 2, 4…
## $ NEI_05I  <dbl> 4, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 4, 4, 4, 4, 4, 4, 4…
## $ WSA_05   <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2…
## $ SDH_05   <dbl> 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 6, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ CER_05   <dbl> 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 1…
## $ CER_20   <dbl> 3, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 3, 1, 2, 3, 3, 3, 2, 3…
## $ LIS_10   <dbl> 2, 1, 2, 1, 2, 2, 2, 3, 1, 3, 3, 2, 3, 2, 1, 2, 1, 2, 1, 1, 3…
## $ COS_10   <dbl> 3, 3, 2, 1, 2, 3, 3, 9, 3, 3, 2, 3, 3, 1, 1, 2, 1, 3, 3, 2, 3…
## $ COS_15   <dbl> 3, 2, 3, 3, 4, 3, 2, 2, 1, 1, 4, 4, 3, 2, 3, 3, 2, 4, 3, 1, 1…
## $ GH_05    <dbl> 4, 4, 4, 1, 1, 4, 5, 3, 2, 2, 1, 3, 3, 1, 2, 4, 3, 3, 2, 1, 2…
## $ GH_10    <dbl> 3, 4, 4, 3, 1, 2, 3, 3, 2, 2, 1, 4, 3, 1, 3, 4, 3, 4, 2, 2, 1…
## $ REGION   <chr> "01", "05", "04", "04", "03", "02", "02", "03", "04", "01", "…
## $ PAGEGR1  <dbl> 2, 9, 2, 1, 2, 2, 9, 9, 1, 2, 9, 1, 2, 9, 1, 2, 1, 9, 1, 2, 2…
## $ PAGEGR2  <dbl> 1, 9, 2, 2, 2, 1, 9, 9, 2, 2, 9, 2, 2, 9, 2, 2, 2, 9, 1, 2, 2…
## $ PAGEGR3  <dbl> 1, 9, 2, 1, 1, 2, 9, 9, 1, 2, 9, 1, 1, 9, 1, 1, 1, 9, 1, 1, 2…
## $ PAGEGR4  <dbl> 2, 9, 1, 2, 2, 2, 9, 9, 2, 1, 9, 2, 2, 9, 2, 2, 2, 9, 2, 2, 1…
## $ PAGEP1   <dbl> 3, 3, 4, 2, 2, 1, 2, 1, 2, 4, 1, 2, 3, 3, 2, 3, 2, 2, 3, 3, 4…
## $ PCER_10  <dbl> 96, 2, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 2, 96,…
## $ PCER_15  <dbl> 6, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 2, 6, 6, 6, 6, 6, 3…
## $ PCHN     <dbl> 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 9, 2, 2, 2, 2, 2, 2, 2…
## $ PCOS_05  <dbl> 9, 5, 1, 6, 4, 99, 5, 99, 9, 9, 8, 1, 4, 8, 4, 4, 5, 4, 7, 7,…
## $ PDCLASS  <dbl> 1, 1, 0, 1, 0, 1, 2, 0, 0, 0, 9, 1, 2, 0, 1, 1, 1, 9, 0, 0, 0…
## $ PDCT_05  <dbl> 1, 2, 1, 1, 1, 2, 2, 2, 1, 1, 2, 2, 2, 9, 1, 1, 1, 2, 1, 1, 1…
## $ PDCT_20  <dbl> 4, 2, 3, 3, 2, 1, 3, 2, 4, 3, 2, 3, 2, 99, 4, 4, 2, 4, 3, 2, …
## $ PDCT_25  <dbl> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 9, 1, 1, 1, 1, 1, 2, 1…
## $ PDTYPER  <dbl> 3, 9, 0, 1, 0, 1, 9, 0, 0, 0, 9, 1, 3, 0, 2, 2, 2, 9, 0, 0, 0…
## $ PDV_SAH  <dbl> 6, 2, 6, 6, 6, 2, 2, 2, 6, 6, 2, 1, 9, 9, 6, 6, 6, 2, 6, 6, 6…
## $ PDV_SHCO <dbl> 4.7e+03, 1.4e+03, 3.4e+03, 3.5e+03, 2.1e+03, 1.0e+07, 1.0e+07…
## $ PDV_SUIT <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 9, 1, 1, 1, 1, 1, 1, 1…
## $ PDWLTYPE <chr> "01", "06", "01", "02", "01", "06", "04", "99", "01", "01", "…
## $ PDWS_05  <dbl> 3, 2, 1, 2, 3, 2, 3, 3, 3, 3, 3, 2, 3, 2, 3, 3, 3, 2, 3, 3, 3…
## $ PDWS_10A <dbl> 1, 4, 2, 2, 4, 3, 2, 2, 1, 1, 1, 1, 1, 2, 1, 1, 2, 4, 3, 2, 1…
## $ PDWS_10B <dbl> 1, 3, 2, 2, 4, 4, 2, 2, 1, 1, 1, 1, 1, 2, 1, 1, 2, 3, 3, 2, 1…
## $ PDWS_10C <dbl> 1, 2, 4, 3, 4, 2, 4, 2, 1, 1, 1, 1, 3, 3, 2, 1, 2, 4, 3, 1, 1…
## $ PDWS_10D <dbl> 1, 2, 2, 2, 1, 2, 2, 2, 2, 1, 1, 3, 2, 2, 1, 1, 2, 3, 3, 2, 1…
## $ PDWS_10E <dbl> 1, 2, 2, 2, 1, 3, 2, 2, 1, 1, 1, 2, 3, 2, 2, 1, 4, 3, 3, 1, 2…
## $ PDWS_10F <dbl> 1, 4, 3, 3, 3, 3, 4, 2, 2, 2, 4, 3, 3, 4, 3, 4, 3, 3, 3, 1, 1…
## $ PDWS_10G <dbl> 1, 2, 2, 2, 1, 3, 2, 2, 1, 1, 1, 2, 2, 4, 2, 2, 2, 3, 3, 1, 1…
## $ PDWS_10H <dbl> 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 1…
## $ PDWS_10I <dbl> 1, 2, 2, 2, 2, 3, 1, 2, 1, 1, 2, 4, 2, 3, 2, 2, 2, 2, 3, 3, 1…
## $ PDWS_10J <dbl> 1, 2, 2, 2, 4, 3, 1, 2, 1, 1, 2, 2, 2, 3, 2, 2, 2, 3, 3, 1, 2…
## $ PEHA_05A <dbl> 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 1, 1, 2, 2, 1, 2, 2, 2, 2, 2…
## $ PEHA_05B <dbl> 2, 1, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ PEHA_05C <dbl> 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ PEMPL    <dbl> 1, 1, 2, 1, 1, 1, 9, 9, 1, 2, 9, 9, 2, 1, 1, 1, 1, 1, 1, 1, 2…
## $ PFTHB5YR <dbl> 2, 6, 2, 2, 2, 6, 6, 6, 2, 2, 6, 6, 6, 9, 2, 2, 2, 6, 2, 2, 2…
## $ PFWEIGHT <dbl> 338.5873, 44.6467, 1706.9443, 150.9932, 1683.8582, 205.0446, …
## $ PGEOGR   <chr> "03", "26", "22", "22", "16", "10", "10", "16", "18", "04", "…
## $ PHGEDUC  <dbl> 3, 4, 6, 7, 5, 2, 99, 99, 1, 6, 6, 2, 99, 2, 6, 4, 6, 99, 4, …
## $ PHHSIZE  <dbl> 3, 2, 1, 3, 1, 2, 99, 99, 5, 1, 2, 4, 1, 99, 3, 1, 3, 99, 4, …
## $ PHHTTINC <dbl> 7.50e+04, 9.25e+04, 6.00e+04, 1.90e+05, 9.75e+04, 1.00e+11, 1…
## $ PHTYPE   <chr> "01", "03", "05", "01", "05", "02", "99", "99", "01", "05", "…
## $ PLIS_05  <dbl> 6, 6, 1, 7, 7, 7, 2, 6, 9, 9, 8, 3, 8, 4, 7, 4, 8, 4, 7, 7, 1…
## $ PNES_05  <dbl> 3, 1, 1, 3, 2, 1, 3, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 3, 3, 1…
## $ PNSC_15  <dbl> 1, 4, 2, 2, 4, 2, 2, 1, 1, 1, 2, 2, 1, 4, 2, 2, 3, 3, 2, 1, 2…
## $ POWN_20  <dbl> 1, 6, 1, 1, 1, 6, 6, 6, 1, 1, 6, 6, 6, 9, 1, 1, 2, 6, 1, 1, 2…
## $ POWN_80  <dbl> 5.0e+04, 1.0e+08, 5.2e+05, 3.5e+05, 1.0e+05, 1.0e+08, 1.0e+08…
## $ PPAC_05  <dbl> 4, 4, 4, 3, 3, 1, 1, 1, 1, 4, 1, 2, 2, 3, 3, 3, 3, 99, 4, 1, …
## $ PPAC_10  <dbl> 1, 1, 2, 1, 2, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ PPAC_23  <dbl> 1, 1, 6, 1, 6, 2, 9, 9, 1, 6, 2, 1, 6, 9, 1, 6, 1, 9, 2, 6, 6…
## $ PPAC_30  <dbl> 1, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 2, 2, 2, 3, 1, 1…
## $ PPAC_35  <dbl> 6, 2, 2, 2, 2, 2, 2, 2, 6, 6, 2, 1, 2, 2, 6, 2, 2, 2, 6, 6, 6…
## $ PPAC_45A <dbl> 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2…
## $ PPAC_45C <dbl> 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ PPAC_45D <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ PPAC_45E <dbl> 2, 2, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2…
## $ PPAC_45F <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ PPAC_45G <dbl> 2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 1, 2, 2, 1, 2, 2, 2, 2, 1, 2…
## $ PPAC_45H <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ PPAC_45I <dbl> 1, 1, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2…
## $ PPAC_45J <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1…
## $ PPAC_45K <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2…
## $ PPAC_45L <dbl> 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ PPAC_45M <dbl> 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 1, 2, 2…
## $ PPAC_45N <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2…
## $ PPAC_45O <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ PPROV    <dbl> 12, 59, 48, 48, 35, 24, 24, 35, 46, 12, 48, 47, 10, 35, 35, 4…
## $ PRSPGNDR <dbl> 2, 2, 2, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 2, 2, 1, 1, 1, 2, 2…
## $ PRSPIMST <dbl> 9, 1, 1, 1, 1, 1, 1, 9, 9, 9, 1, 9, 9, 2, 1, 9, 1, 9, 1, 1, 1…
## $ PSCR_05  <dbl> 6, 2, 6, 6, 6, 2, 1, 2, 6, 6, 2, 1, 9, 9, 6, 6, 6, 2, 6, 6, 6…
## $ PSCR_10  <dbl> 6, 2, 6, 6, 6, 2, 2, 2, 6, 6, 2, 1, 9, 9, 6, 6, 6, 2, 6, 6, 6…
## $ PSCR_25  <dbl> 96, 1, 96, 96, 96, 3, 3, 3, 96, 96, 3, 2, 99, 99, 96, 96, 96,…
## $ PSCR_35  <dbl> 6, 2, 6, 6, 6, 6, 2, 6, 6, 6, 6, 1, 6, 9, 6, 6, 6, 2, 6, 6, 6…
## $ PSCR_D40 <dbl> 999.6, 999.6, 999.6, 999.6, 999.6, 999.6, 999.6, 999.6, 999.6…
## $ PSTIR_GR <dbl> 3, 1, 3, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 9, 1, 1, 1, 1, 1, 1, 1…
## $ PVISMIN  <dbl> 9, 2, 2, 1, 2, 2, 9, 9, 9, 9, 9, 9, 9, 9, 2, 9, 1, 9, 2, 2, 2…
## $ PWSA_D15 <dbl> 999.6, 999.6, 999.6, 999.6, 999.6, 999.6, 999.6, 999.6, 999.6…
## $ VERDATE  <chr> "30/11/2022", "30/11/2022", "30/11/2022", "30/11/2022", "30/1…

Reading this output:

The glimpse() function shows you:

Rows: 40,988 and Columns: 109 — confirming what we saw above
A list of all variable names (like PUMFID, EHA_10, REGION, etc.)
The data type of each variable (<dbl> means numeric, <chr> means text/character)
The first few values of each variable

📓 Journal Prompt 1 — ANSWER KEY

In your journal, write:

How many households are in this dataset?

How many variables were measured for each household?

What does each row represent? What does each column represent?

Looking at the variable names, can you guess what any of them might measure?

Part 2: Counting Categories

Categorical variables sort households into groups, and counting tells us how many fall into each group. Let’s explore several.

Tenure: Owned or Rented?

The variable PDCT_05 records whether the dwelling is owned or rented.

From the codebook:

1 = Yes (Owned by a member of the household)
2 = No (Rented)
9 = Not stated

chs %>%
  count(PDCT_05)

## # A tibble: 3 × 2
##   PDCT_05     n
##     <dbl> <int>
## 1       1 20821
## 2       2 18779
## 3       9  1388

Reading this output:

This table shows three rows:

20,821 households have PDCT_05 = 1 (owned)
18,779 households have PDCT_05 = 2 (rented)
1,388 households have PDCT_05 = 9 (not stated)
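
Reading codes against the codebook gets tedious, so one option is to attach the labels before counting. This is a sketch on made-up data, not part of the original exercise; the labels are shortened versions of the codebook entries listed above, and the new column name tenure is hypothetical.

```r
library(dplyr)

# Toy data (made up): a few tenure codes in the same format as PDCT_05
toy <- tibble(PDCT_05 = c(1, 2, 2, 9, 1, 1))

tenure_counts <- toy %>%
  mutate(
    # Turn the numeric codes into a labelled factor
    tenure = factor(
      PDCT_05,
      levels = c(1, 2, 9),
      labels = c("Owned", "Rented", "Not stated")
    )
  ) %>%
  count(tenure)

tenure_counts
```

The counts are the same as counting the raw codes; only the row labels change.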

Now let’s add percentages:

chs %>%
  count(PDCT_05) %>%
  mutate(percent = round(n / sum(n) * 100, 1))

## # A tibble: 3 × 3
##   PDCT_05     n percent
##     <dbl> <int>   <dbl>
## 1       1 20821    50.8
## 2       2 18779    45.8
## 3       9  1388     3.4

Reading this output:

The mutate() function added a new column called percent. Now we can see:

50.8% of households in the sample own their home
45.8% rent
3.4% didn’t state their tenure
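
To check that you understand where these percentages come from, you can redo the arithmetic by hand. The sketch below copies the three counts from the output above and applies the same formula used inside mutate().

```r
# Counts copied from the count(PDCT_05) output above
n_owned      <- 20821
n_rented     <- 18779
n_not_stated <- 1388

total <- n_owned + n_rented + n_not_stated
total                                  # 40988, matching nrow(chs)

# Same formula as in mutate(): n / sum(n) * 100, rounded to one decimal
round(n_owned / total * 100, 1)        # 50.8
round(n_rented / total * 100, 1)       # 45.8
round(n_not_stated / total * 100, 1)   # 3.4
```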

📓 Journal Prompt 2 — ANSWER KEY

What percentage of households in this sample own their home?

What percentage rent?

How many households didn’t answer this question (code 9)?

Does this distribution surprise you? Why or why not?

Self-Assessed Health

The variable GH_05 asks about self-assessed health.

From the codebook:

1 = Excellent
2 = Very good
3 = Good
4 = Fair
5 = Poor
9 = Not stated

The question wording is: “In general, how is your health? Would you say:”

chs %>%
  count(GH_05) %>%
  mutate(percent = round(n / sum(n) * 100, 1))

## # A tibble: 6 × 3
##   GH_05     n percent
##   <dbl> <int>   <dbl>
## 1     1  5428    13.2
## 2     2 13124    32  
## 3     3 13955    34  
## 4     4  6208    15.1
## 5     5  2068     5  
## 6     9   205     0.5

Reading this output:

This shows a distribution across six categories. Notice that the two largest groups are codes 2 (“Very good”) and 3 (“Good”), which together make up about two thirds of responses, with fewer households at either end of the scale.
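
To combine categories, such as “Fair” and “Poor”, it is safer to add the counts and then compute the percentage than to add the rounded percentages. The sketch below rebuilds the table from the counts printed above; the summary column names are hypothetical.

```r
library(dplyr)

# Counts copied from the count(GH_05) output above
health <- tibble(
  GH_05 = c(1, 2, 3, 4, 5, 9),
  n     = c(5428, 13124, 13955, 6208, 2068, 205)
)

fair_or_poor <- health %>%
  summarize(
    n_fair_poor   = sum(n[GH_05 %in% c(4, 5)]),                          # 8276
    pct_of_all    = round(n_fair_poor / sum(n) * 100, 1),                # 20.2
    pct_of_stated = round(n_fair_poor / sum(n[GH_05 != 9]) * 100, 1)     # 20.3
  )

fair_or_poor
```

Adding the rounded percentages from the table (15.1 + 5.0) gives 20.1, which is slightly off because each value was rounded before adding.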

📓 Journal Prompt 3 — ANSWER KEY

What is the most common response?

What percentage report “Fair” or “Poor” health (codes 4 and 5 combined)?

In the codebook/original survey, what kind of variable is this?

Dwelling Condition

The variable PDCT_25 asks about the condition of the dwelling.

From the codebook:

1 = No, only regular maintenance is needed
2 = Yes, minor repairs are needed
3 = Yes, major repairs are needed
9 = Not stated

The question asks: “Is this dwelling in need of any repairs?”

chs %>%
  count(PDCT_25) %>%
  mutate(percent = round(n / sum(n) * 100, 1))

## # A tibble: 4 × 3
##   PDCT_25     n percent
##     <dbl> <int>   <dbl>
## 1       1 28078    68.5
## 2       2  9095    22.2
## 3       3  3019     7.4
## 4       9   796     1.9

📓 Journal Prompt 4 — ANSWER KEY

What share of dwellings need major repairs? Can you tell accurately based on the output above? Why or why not?

If you were a policy analyst, what might these numbers tell you about housing quality in Canada?

What might this variable miss if we want to assess housing quality? (Think about operationalization)
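
One thing to consider for the first question is that the percentages above include the “not stated” group in the denominator. The sketch below, which rebuilds the table from the printed counts, shows how the shares shift when code 9 is dropped first. Whether dropping it is appropriate is a judgment call, since we do not know the condition of those dwellings.

```r
library(dplyr)

# Counts copied from the count(PDCT_25) output above
condition <- tibble(
  PDCT_25 = c(1, 2, 3, 9),
  n       = c(28078, 9095, 3019, 796)
)

stated_only <- condition %>%
  filter(PDCT_25 != 9) %>%                        # drop "not stated"
  mutate(percent = round(n / sum(n) * 100, 1))    # shares among valid responses

stated_only
```

Among households that answered, the share needing major repairs is 7.5% rather than 7.4%. The difference is small here because few households are in the “not stated” group.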

Part 3: Numerical Summaries

For numerical variables, we calculate statistics like mean and median.

Household Size: First Attempt

The variable PHHSIZE records the number of people in the household.

From the codebook:

1 = 1 person
2 = 2 people
3 = 3 people
4 = 4 people
5 = 5+ people
99 = Not stated

Let’s calculate the mean:

chs %>%
  summarize(
    mean_size = mean(PHHSIZE),
    median_size = median(PHHSIZE),
    max_size = max(PHHSIZE)
  )

## # A tibble: 1 × 3
##   mean_size median_size max_size
##       <dbl>       <dbl>    <dbl>
## 1      15.5           2       99

🚨 Something looks wrong! A mean household size of 15.5 people? A maximum of 99? That doesn’t make sense.

📓 Journal Prompt 5 — ANSWER KEY

Something looks wrong here.

What is the calculated mean?

What is the maximum value?

Why might these numbers be problematic?

Household Size: Corrected

The issue is that code 99 means “not stated” — it’s not a household of 99 people! We need to filter it out:

chs %>%
  filter(PHHSIZE != 99) %>%
  summarize(
    mean_size = mean(PHHSIZE),
    median_size = median(PHHSIZE),
    min_size = min(PHHSIZE),
    max_size = max(PHHSIZE),
    n = n()
  )

## # A tibble: 1 × 5
##   mean_size median_size min_size max_size     n
##       <dbl>       <dbl>    <dbl>    <dbl> <int>
## 1      2.02           2        1        5 35300

Reading this output:

Now we get sensible numbers:

Mean: 2.02 people per household
Median: 2 people
Range: 1 to 5 (remember, 5 means “five or more”)
n = 35,300 (we lost 5,688 observations by filtering out the 99s)
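
An alternative to filtering is to recode 99 as a missing value (NA), which keeps the rows in the data for other analyses. This is a sketch on made-up values, not the approach used in the exercise; na_if() is a dplyr function and the summary column names are hypothetical.

```r
library(dplyr)

# Toy data (made up): household sizes with two "not stated" codes (99)
toy <- tibble(PHHSIZE = c(1, 2, 2, 3, 99, 5, 99, 1))

# Naive mean treats 99 as a real household size
naive_mean <- mean(toy$PHHSIZE)
naive_mean   # 26.5

cleaned <- toy %>%
  mutate(PHHSIZE = na_if(PHHSIZE, 99)) %>%   # 99 becomes NA; rows are kept
  summarize(
    mean_size   = mean(PHHSIZE, na.rm = TRUE),
    median_size = median(PHHSIZE, na.rm = TRUE),
    n_valid     = sum(!is.na(PHHSIZE)),
    n_missing   = sum(is.na(PHHSIZE))
  )

cleaned
```

Without na.rm = TRUE, mean() and median() would return NA as soon as one value is missing.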

📓 Journal Prompt 6 — ANSWER KEY

What is the corrected mean household size?

What is the median?

Why is the total number of observations different from when we started?

Which measure better represents the “typical” household? Why?

Key lesson: What went wrong in the first calculation, and why does filtering matter?

Part 4: Putting It Together

Now let’s look at variables that measure related concepts in different ways.

Shelter-Cost-to-Income Ratio

The variable PSTIR_GR measures what share of income goes to shelter costs.

From the codebook:

1 = Spending less than 30% of income on shelter costs
2 = Spending 30% to less than 50% of income on shelter costs
3 = Spending 50% to less than 100% of income on shelter costs
4 = Spending over 100% of income on shelter costs
5 = Not applicable (households on reserves, farm dwellings, and households that reported a zero or negative total household income)
9 = Not stated

chs %>%
  filter(PSTIR_GR %in% c(1, 2, 3, 4)) %>%
  count(PSTIR_GR) %>%
  mutate(percent = round(n / sum(n) * 100, 1))

## # A tibble: 4 × 3
##   PSTIR_GR     n percent
##      <dbl> <int>   <dbl>
## 1        1 31869    80.4
## 2        2  5743    14.5
## 3        3  1669     4.2
## 4        4   379     1

A common benchmark: households spending 30% or more of income on shelter are considered “cost-burdened.”

Reading this output:

We filtered to only include codes 1-4 (excluding “not applicable” and “not stated”). Of these households:

80.4% spend less than 30% (not cost-burdened)
14.5% spend 30-50% (moderately cost-burdened)
4.2% spend 50-100% (severely cost-burdened)
1.0% spend over 100% (extremely cost-burdened)
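
To get combined shares, such as all cost-burdened households, add the counts first and then divide. The base R sketch below copies the counts from the output above; the element names are hypothetical labels for codes 1 to 4.

```r
# Counts copied from the count(PSTIR_GR) output above (codes 1 to 4)
n_by_group <- c(
  under_30       = 31869,  # code 1
  from_30_to_50  = 5743,   # code 2
  from_50_to_100 = 1669,   # code 3
  over_100       = 379     # code 4
)

total <- sum(n_by_group)
total   # 39660 households with a valid ratio

# Cost-burdened: codes 2, 3 and 4 (30% or more of income)
burdened <- c("from_30_to_50", "from_50_to_100", "over_100")
pct_burdened <- round(sum(n_by_group[burdened]) / total * 100, 1)

# Spending 50% or more: codes 3 and 4
pct_50_plus <- round(sum(n_by_group[c("from_50_to_100", "over_100")]) / total * 100, 1)

pct_burdened   # 19.6
pct_50_plus    # 5.2
```

Adding the rounded percentages instead (14.5 + 4.2 + 1.0) gives 19.7, a small rounding difference from the 19.6 computed from counts.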

📓 Journal Prompt 7 — ANSWER KEY

What percentage of households spend less than 30% on shelter (code 1)?

What percentage are “cost-burdened” (codes 2, 3, and 4 combined)?

What percentage spend 50% or more (codes 3 and 4)?

Core Housing Need

The variable PCHN is a composite indicator of core housing need.

From the codebook:

1 = In core housing need
2 = Not in core housing need
3 = Not examined for core housing need
9 = Not stated

A household is in “core housing need” if its housing fails to meet at least one of three standards established for housing adequacy (dwelling condition), suitability (enough bedrooms for household composition), and affordability (spending less than 30% on shelter), AND if its income before taxes is at or below the appropriate community-and-bedroom-specific income threshold.
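
The definition combines an “at least one of three” condition with an income condition. The sketch below expresses that logic on four made-up households. The column names are hypothetical and are not CHS variables; PCHN itself arrives already derived in the data file.

```r
library(dplyr)

# Toy households (made up). TRUE means the household fails that standard.
toy <- tibble(
  household              = c("A", "B", "C", "D"),
  fails_adequacy         = c(FALSE, TRUE, FALSE, FALSE),
  fails_suitability      = c(FALSE, FALSE, FALSE, TRUE),
  fails_affordability    = c(TRUE, FALSE, FALSE, TRUE),
  income_below_threshold = c(TRUE, FALSE, TRUE, FALSE)
)

core_need <- toy %>%
  mutate(
    # "At least one of three standards" is an OR across the three flags
    fails_any_standard = fails_adequacy | fails_suitability | fails_affordability,
    # The income condition is joined with AND
    in_core_need = fails_any_standard & income_below_threshold
  ) %>%
  select(household, fails_any_standard, in_core_need)

core_need
```

Only household A is flagged: B and D fail a standard but are above the income threshold, and C is below the threshold but fails no standard.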

chs %>%
  filter(PCHN %in% c(1, 2)) %>%
  count(PCHN) %>%
  mutate(percent = round(n / sum(n) * 100, 1))

## # A tibble: 2 × 3
##    PCHN     n percent
##   <dbl> <int>   <dbl>
## 1     1  4477    11.4
## 2     2 34804    88.6

📓 Journal Prompt 8 — ANSWER KEY

What percentage of households are in core housing need?

Both PSTIR_GR and PCHN relate to housing affordability. How are they different?

Operationalization question: Which measure better captures “housing affordability”? What does each capture that the other might miss?

Feature	PSTIR_GR (Shelter Cost Ratio)	PCHN (Core Housing Need)
Focus	Single dimension: cost burden	Multiple dimensions: adequacy + suitability + affordability
Threshold	30% of income to shelter	Fails standards AND can’t afford acceptable local housing
Complexity	Simple ratio	Composite indicator derived from multiple variables
What it measures	Current spending burden	Whether household NEEDS better housing AND can’t afford it

Part 5: Challenge (Optional)

This section is optional but good practice.

Filtering to Atlantic Canada

Let’s look at just Atlantic Canadian households.

Province codes (from codebook - PPROV variable):

10 = Newfoundland and Labrador
11 = Prince Edward Island
12 = Nova Scotia
13 = New Brunswick

# Filter to Atlantic provinces
atlantic <- chs %>%
  filter(PPROV %in% c(10, 11, 12, 13))

# How many Atlantic households?
nrow(atlantic)

## [1] 11263

# What share of the national sample?
nrow(atlantic) / nrow(chs) * 100

## [1] 27.47877

Reading this output:

11,263 Atlantic Canadian households in the sample
This represents 27.5% of the national sample

Compare a variable by province

Let’s look at economic hardship across Atlantic provinces. The variable EHA_10 asks: “In the past 12 months, how difficult or easy was it for your household to meet its financial needs in terms of transportation, housing, food, clothing and other necessary expenses?”

From the codebook:

1 = Very difficult
2 = Difficult
3 = Neither difficult nor easy
4 = Easy
5 = Very easy
9 = Not stated

atlantic %>%
  filter(EHA_10 != 9) %>%
  group_by(PPROV) %>%
  summarize(
    n = n(),
    pct_difficult = round(sum(EHA_10 %in% c(1, 2)) / n() * 100, 1)
  )

## # A tibble: 4 × 3
##   PPROV     n pct_difficult
##   <dbl> <int>         <dbl>
## 1    10  2440          26  
## 2    11  1708          19.7
## 3    12  3061          23  
## 4    13  4031          22.9

Reading this output:

This table shows, for each Atlantic province:

The number of households (after filtering out “not stated”)
The percentage reporting economic difficulty (codes 1 “Very difficult” or 2 “Difficult” on the EHA_10 variable)
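
The expression sum(EHA_10 %in% c(1, 2)) / n() counts the TRUE values and divides by the group size. Taking the mean of the same TRUE/FALSE values gives an identical result, because R treats TRUE as 1 and FALSE as 0. The sketch below checks this on made-up data; pct_sum and pct_mean are hypothetical column names.

```r
library(dplyr)

# Toy data (made up): province codes and EHA_10 responses
toy <- tibble(
  PPROV  = c(10, 10, 10, 10, 12, 12, 12, 12),
  EHA_10 = c(1, 2, 4, 9, 2, 3, 5, 4)
)

by_province <- toy %>%
  filter(EHA_10 != 9) %>%          # drop "not stated"
  group_by(PPROV) %>%
  summarize(
    n = n(),
    pct_sum  = round(sum(EHA_10 %in% c(1, 2)) / n() * 100, 1),
    pct_mean = round(mean(EHA_10 %in% c(1, 2)) * 100, 1)
  )

by_province
```

Both columns show 66.7 for province 10 (2 of 3 valid responses) and 25 for province 12 (1 of 4).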

📓 Journal Prompt 9 (Challenge) — ANSWER KEY

How many Atlantic Canadian households are in the sample?

Which province has the highest share reporting economic hardship (codes 1 or 2)?

What differences do you notice across provinces?

Week 2 Practice: Exploring the Canadian Housing Survey

ANSWER KEY with Explanations

SOC3320 — Methods & Research II

Winter 2026
