Before we dive into the data, let’s make sure you’re set up to run this document and publish your work.
If you haven’t already installed the tidyverse package, you’ll need to do so first, for instance by placing the following in a code box. To create a new code box, click the +C button in the toolbar (to the left of Run) and then click R.
install.packages("tidyverse")
You only need to install packages once. After that, you just load
them with library().
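For example, near the top of every document or session:

library(tidyverse)  # load once per session; gives you dplyr, ggplot2, readr, and more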
Note: You do NOT need to install knitr
or rmarkdown separately — RStudio handles these
automatically.
Work through this document chunk by chunk, not all at once. For each gray code box:

- Click anywhere inside the chunk
- Press Ctrl+Enter (Windows) or Cmd+Enter (Mac), or click the green “Run” arrow at the top right of the chunk

After running each chunk: check the output, then interpret what it tells you before moving on.

Important: Run chunks in order from top to bottom. Later chunks depend on earlier ones!
“Knitting” means converting your .Rmd file into a final HTML document. When you knit, R starts a fresh session, runs every chunk in order from top to bottom, and assembles the results into an HTML file. To knit, click the Knit button or press Ctrl+Shift+K (Windows) or Cmd+Shift+K (Mac). The first time you knit, RStudio may prompt you to install additional packages. Click “Yes” to install them. This only happens once.
| Error Message | What It Means | How to Fix |
|---|---|---|
| “Object not found” | You’re using something that wasn’t created | Make sure earlier chunks ran successfully; check for typos |
| “Could not find function” | A package isn’t loaded | Add library(tidyverse) at the top |
| “File not found” | R can’t find your data file | Check the file path; make sure the file is in the right folder |
| “Unexpected symbol” | Syntax error in your code | Look for missing commas, parentheses, or quotes |
Remember: When you knit, R starts fresh. Even if a chunk worked when you ran it interactively, knitting might fail if the chunk depends on something you ran in the Console but didn’t include in the document.
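A hypothetical illustration of this pitfall (made-up workflow, not a step to follow):

# Typed in the Console only, never added to a chunk in the .Rmd:
# chs <- read_csv("data/CHS2021ECL_PUMF.csv")

# This chunk then works interactively, because chs exists in your session,
# but knitting fails with "object 'chs' not found": the fresh session that
# knitting starts never ran the read_csv() line.
chs %>% count(GH_05)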
RPubs is a free service for sharing R Markdown documents online. To publish:

1. Knit your document to HTML.
2. Click the Publish button at the top right of the preview window and choose RPubs.
3. Sign in (or create a free account), then give your document a title and a short slug.

Your document will now be live at rpubs.com/yourusername/yourslug. You can share this link with anyone!
Note: RPubs documents are public. Don’t include any personal information you wouldn’t want others to see.
This practice document builds on your skills from Week 2 by introducing visualization and measures of spread.
The core lesson from this week’s lecture: a single summary statistic hides as much as it reveals. To understand data, you need to see its shape.
We’ll also deepen our understanding of survey weights — why they matter and when to use them.
Work through each section, running the code chunks and answering the journal prompts in your course journal. Remember the cycle: run, check, interpret.
In Week 2, you learned an important lesson: survey data often contains special codes for missing or invalid responses. Remember the “household of 99 people” error?
The standard workflow for CHS variables: check the codebook for the variable’s special codes, filter those codes out, then calculate. Throughout this document, you’ll see this pattern:
chs %>%
  filter(VARIABLE != 9) %>%   # Remove "Not stated" (code 9)
  ...
The comment explains what’s being removed and why. Always check the codebook first to know which codes to filter.
In Week 2, we used count() to make frequency tables.
Tables are useful, but they don’t show us the shape of
a distribution. This week, we learn to visualize.
# Load the tidyverse package (includes ggplot2 for visualization)
library(tidyverse)
# Load the Canadian Housing Survey data
chs <- read_csv("data/CHS2021ECL_PUMF.csv")
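After loading, it’s worth a quick sanity check that the file read in correctly (glimpse() comes with the tidyverse):

# Peek at the data: every column with its type and first few values
glimpse(chs)
nrow(chs)  # the number of households in the sample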
Remember the Datasaurus Dozen from lecture? A dozen datasets had nearly identical means, but completely different patterns. The only way to see those patterns was to plot the data.
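If you want to see this for yourself, the datasauRus package (optional, not needed for this course) bundles those datasets:

# Optional: install once with install.packages("datasauRus")
library(datasauRus)
library(tidyverse)

# Nearly identical summary statistics across wildly different shapes
datasaurus_dozen %>%
  group_by(dataset) %>%
  summarize(mean_x = mean(x), mean_y = mean(y), sd_x = sd(x), sd_y = sd(y))

# Plotting reveals what the summaries hide
ggplot(datasaurus_dozen, aes(x = x, y = y)) +
  geom_point(size = 0.5) +
  facet_wrap(~dataset)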
Let’s start with a variable we explored last week:
self-assessed health (GH_05).
# From codebook: GH_05 (Self-assessed health)
# 1 = Excellent, 2 = Very good, 3 = Good, 4 = Fair, 5 = Poor
# 9 = Not stated
chs %>%
filter(GH_05 != 9) %>% # Remove "Not stated"
count(GH_05) %>%
mutate(percent = round(n / sum(n) * 100, 1))
## # A tibble: 5 × 3
## GH_05 n percent
## <dbl> <int> <dbl>
## 1 1 5428 13.3
## 2 2 13124 32.2
## 3 3 13955 34.2
## 4 4 6208 15.2
## 5 5 2068 5.1
This tells us the counts and percentages. But what does the distribution look like?
For categorical variables, we use
geom_bar().
chs %>%
filter(GH_05 != 9) %>% # Remove "Not stated"
ggplot(aes(x = factor(GH_05))) +
geom_bar(fill = "steelblue") +
scale_x_discrete(labels = c("1" = "Excellent", "2" = "Very good",
"3" = "Good", "4" = "Fair", "5" = "Poor")) +
labs(
title = "Self-Assessed Health",
subtitle = "Canadian Housing Survey, 2021",
x = "Health Rating",
y = "Number of Households"
) +
theme_minimal()
What’s happening in this code?

- filter(GH_05 != 9) removes “Not stated” responses
- ggplot(aes(x = factor(GH_05))) sets up the plot with GH_05 on the x-axis
- factor() tells R to treat the numbers as categories
- geom_bar() creates a bar chart
- scale_x_discrete(labels = ...) replaces the numeric codes with readable labels
- labs() adds titles and labels
- theme_minimal() gives us a clean look

For numerical variables, we use geom_histogram(). Let’s look at life satisfaction (PLIS_05), which asks respondents to rate their satisfaction on a 0-10 scale.
From the codebook (PLIS_05 - Life satisfaction):
The original question asked: “Using a scale of 0 to 10, where 0 means ‘Very dissatisfied’ and 10 means ‘Very satisfied’, how do you feel about your life as a whole right now?”
For the public use file, responses are grouped into categories: codes 1 through 9 stand for the responses 0-2, 3, 4, 5, 6, 7, 8, 9, and 10, and code 99 means “Not stated” (you can see this mapping in the axis labels of the histogram below).

Note: This grouping is common in public use microdata files (PUMFs) to protect respondent confidentiality. We’re working with binned data, not the original 0-10 responses.
chs %>%
filter(PLIS_05 != 99) %>% # Remove "Not stated"
ggplot(aes(x = PLIS_05)) +
geom_histogram(binwidth = 1, fill = "steelblue", color = "white") +
scale_x_continuous(breaks = 1:9,
labels = c("0-2", "3", "4", "5", "6", "7", "8", "9", "10")) +
labs(
title = "Life Satisfaction",
subtitle = "Canadian Housing Survey, 2021",
x = "Satisfaction Level (0 = Very Dissatisfied, 10 = Very Satisfied)",
y = "Number of Households"
) +
theme_minimal()
What’s different from a bar chart?

- geom_histogram() is for continuous/numerical data
- binwidth = 1 means each bar spans 1 unit on the scale (see the optional experiment below)

Look at both visualizations above.
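Binwidth is a choice, and it changes what you see. As an optional experiment, try re-running the histogram with a wider bin and compare the shapes:

chs %>%
  filter(PLIS_05 != 99) %>%  # Remove "Not stated"
  ggplot(aes(x = PLIS_05)) +
  geom_histogram(binwidth = 2, fill = "steelblue", color = "white") +
  labs(title = "Life Satisfaction, binwidth = 2")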
One of the key lessons from lecture: when the mean and median differ, that tells you something about the shape of the distribution.
The variable PDV_SHCO records monthly shelter costs in
dollars. Unlike the grouped variables we just saw, this one is truly
continuous.
From the codebook (PDV_SHCO - Shelter costs): the household’s monthly shelter cost in dollars, with 9999999.99 meaning “Not stated.”
Let’s first calculate the mean and median:
chs %>%
filter(PDV_SHCO < 9000) %>% # Filter out "not stated" (coded as 9999999.99)
summarize(
mean_cost = mean(PDV_SHCO),
median_cost = median(PDV_SHCO),
min_cost = min(PDV_SHCO),
max_cost = max(PDV_SHCO),
n = n()
)
## # A tibble: 1 × 5
## mean_cost median_cost min_cost max_cost n
## <dbl> <dbl> <dbl> <dbl> <int>
## 1 1117. 900 0 8900 31876
Notice anything? The mean is higher than the median. What does that suggest about the shape?
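A toy example (made-up numbers) shows why a few large values pull the mean above the median:

# Five hypothetical households, one with very high shelter costs
costs <- c(500, 700, 900, 1100, 5000)
mean(costs)    # 1640: dragged upward by the single large value
median(costs)  # 900: the middle value, unmoved by how extreme the top is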
chs %>%
filter(PDV_SHCO < 9000) %>% # Remove "not stated"
ggplot(aes(x = PDV_SHCO)) +
geom_histogram(binwidth = 200, fill = "steelblue", color = "white") +
geom_vline(aes(xintercept = mean(PDV_SHCO)),
color = "red", linetype = "dashed", linewidth = 1) +
geom_vline(aes(xintercept = median(PDV_SHCO)),
color = "darkgreen", linewidth = 1) +
labs(
title = "Monthly Shelter Costs",
subtitle = "Red dashed = Mean | Green solid = Median",
x = "Monthly Shelter Cost ($)",
y = "Number of Households"
) +
theme_minimal()
The red dashed line shows the mean; the green solid line shows the median.
Two distributions can have the same mean but tell completely different stories. What distinguishes them is spread — how much values vary from the center.
The standard deviation (SD) measures the typical distance of values from the mean.
Let’s calculate the SD for shelter costs:
chs %>%
filter(PDV_SHCO < 9000) %>% # Remove "not stated"
summarize(
mean_cost = mean(PDV_SHCO),
median_cost = median(PDV_SHCO),
sd_cost = sd(PDV_SHCO),
n = n()
)
## # A tibble: 1 × 4
## mean_cost median_cost sd_cost n
## <dbl> <dbl> <dbl> <int>
## 1 1117. 900 813. 31876
Reading this output:

- sd() calculates the standard deviation
- An SD of about $813 means a typical household’s shelter cost falls roughly $813 above or below the mean of about $1,117 (a sketch of the underlying formula follows below)
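If you’re curious how sd() connects to “typical distance from the mean,” here is a minimal sketch with made-up numbers (R divides by n - 1, the sample formula):

x <- c(2, 4, 4, 4, 5, 5, 7, 9)
sd(x)                                         # about 2.14
sqrt(sum((x - mean(x))^2) / (length(x) - 1))  # the same value, computed by hand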
Does shelter cost vary more for owners or renters? Let’s find out using group_by():
From the codebook (PDCT_05 - Tenure): 1 = Owned, 2 = Rented; any other codes are excluded by the filter below.
chs %>%
filter(PDV_SHCO < 9000, PDCT_05 %in% c(1, 2)) %>% # Remove "not stated" for both variables
mutate(tenure = ifelse(PDCT_05 == 1, "Owner", "Renter")) %>%
group_by(tenure) %>%
summarize(
mean_cost = mean(PDV_SHCO),
median_cost = median(PDV_SHCO),
sd_cost = sd(PDV_SHCO),
n = n()
)
## # A tibble: 2 × 5
## tenure mean_cost median_cost sd_cost n
## <chr> <dbl> <dbl> <dbl> <int>
## 1 Owner 1273. 1000 961. 17972
## 2 Renter 916. 800 502. 13904
What’s new here?

- mutate() creates a new variable tenure with readable labels
- group_by(tenure) calculates statistics separately for owners and renters

Now let’s visualize the two distributions side by side:

chs %>%
filter(PDV_SHCO < 9000, PDCT_05 %in% c(1, 2)) %>% # Remove "not stated"
mutate(tenure = ifelse(PDCT_05 == 1, "Owner", "Renter")) %>%
ggplot(aes(x = PDV_SHCO, fill = tenure)) +
geom_histogram(binwidth = 200, color = "white", alpha = 0.9) +
facet_wrap(~tenure, ncol = 1) +
scale_fill_manual(values = c("Owner" = "#E27D60", "Renter" = "#41B3A3")) +
labs(
title = "Monthly Shelter Costs by Tenure",
x = "Monthly Shelter Cost ($)",
y = "Number of Households"
) +
theme_minimal() +
theme(legend.position = "none")
What’s new here?

- facet_wrap(~tenure, ncol = 1) creates separate panels for each group
- scale_fill_manual() lets us choose specific colors for each group

Look at both the table and the visualization above.
In lecture, we learned that survey samples don’t always perfectly represent the population. Some groups get oversampled (more in sample than in population) to ensure enough cases for analysis.
Even with careful sampling, some groups end up:

- Overrepresented: a larger share of the sample than of the population
- Underrepresented: a smaller share of the sample than of the population

This happens because of:

- Deliberate oversampling of smaller groups (so there are enough cases to analyze)
- Differences in who agrees to respond
- Random chance in who gets selected
Survey weights adjust for these imbalances. Each
household gets a weight (PFWEIGHT) based on how many
households it represents in the population.
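A toy example (made-up weights) of what summing weights does:

# Three sampled households and how many households each one represents
w <- c(250, 800, 450)
length(w)  # 3: the unweighted sample count
sum(w)     # 1500: the estimated number of households they stand in for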
Let’s replicate what we saw in lecture. First, the unweighted percentages (raw sample counts):
chs %>%
filter(PDCT_05 %in% c(1, 2)) %>% # Keep only owners and renters
count(PDCT_05) %>%
mutate(
tenure = ifelse(PDCT_05 == 1, "Owner", "Renter"),
pct_unweighted = round(n / sum(n) * 100, 1)
) %>%
dplyr::select(tenure, n, pct_unweighted)
## # A tibble: 2 × 3
## tenure n pct_unweighted
## <chr> <int> <dbl>
## 1 Owner 20821 52.6
## 2 Renter 18779 47.4
This is what we calculated in Week 2. But is this what Canada actually looks like?
The variable PFWEIGHT contains the survey weights. We
sum the weights to get population estimates:
chs %>%
filter(PDCT_05 %in% c(1, 2)) %>% # Keep only owners and renters
mutate(tenure = ifelse(PDCT_05 == 1, "Owner", "Renter")) %>%
group_by(tenure) %>%
summarize(
n_unweighted = n(),
n_weighted = sum(PFWEIGHT)
) %>%
mutate(
pct_unweighted = round(n_unweighted / sum(n_unweighted) * 100, 1),
pct_weighted = round(n_weighted / sum(n_weighted) * 100, 1)
)
## # A tibble: 2 × 5
## tenure n_unweighted n_weighted pct_unweighted pct_weighted
## <chr> <int> <dbl> <dbl> <dbl>
## 1 Owner 20821 10086258. 52.6 67.9
## 2 Renter 18779 4768680. 47.4 32.1
What’s happening here?

- n() counts the raw number of households in the sample
- sum(PFWEIGHT) sums the weights, estimating the population count

Here’s a useful feature of the CHS codebook: it reports both Frequency (unweighted count) and Weighted Frequency for every variable. For PDCT_05:
| Category | Code | Frequency (n) | Weighted Frequency | Weighted % |
|---|---|---|---|---|
| Owned | 1 | 20,821 | 10,086,258 | 67.9% |
| Rented | 2 | 18,779 | 4,768,680 | 32.1% |
This tells you immediately that the weighted percentage of owners (67.9%) is much higher than the unweighted percentage (52.6%).
The CHS also oversampled Atlantic Canada. Let’s see this:
chs %>%
mutate(region = case_when(
REGION == "01" | REGION == 1 ~ "Atlantic",
REGION == "02" | REGION == 2 ~ "Quebec",
REGION == "03" | REGION == 3 ~ "Ontario",
REGION == "04" | REGION == 4 ~ "Prairies",
REGION == "05" | REGION == 5 ~ "British Columbia",
REGION == "06" | REGION == 6 ~ "Territories",
TRUE ~ NA_character_
)) %>%
filter(!is.na(region)) %>%
group_by(region) %>%
summarize(
n_unweighted = n(),
n_weighted = sum(PFWEIGHT)
) %>%
mutate(
pct_unweighted = round(n_unweighted / sum(n_unweighted) * 100, 1),
pct_weighted = round(n_weighted / sum(n_weighted) * 100, 1),
difference = pct_unweighted - pct_weighted
)
## # A tibble: 6 × 6
## region n_unweighted n_weighted pct_unweighted pct_weighted difference
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 Atlantic 11263 1043174. 27.5 6.9 20.6
## 2 British Columb… 4197 2077140. 10.2 13.8 -3.6
## 3 Ontario 7333 5694396. 17.9 37.7 -19.8
## 4 Prairies 11434 2551913. 27.9 16.9 11
## 5 Quebec 5973 3701158. 14.6 24.5 -9.9
## 6 Territories 788 25934. 1.9 0.2 1.7
Look at Atlantic Canada: It makes up about 27% of the unweighted sample but only about 7% of the weighted population estimate. That’s deliberate oversampling to ensure enough cases for regional analysis.
| Purpose | Use |
|---|---|
| Describing what Canada looks like | Weighted percentages |
| Describing your sample | Unweighted counts |
| Calculating population estimates | Weighted |
| Reporting how many respondents you analyzed | Unweighted |
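Weights matter for averages too, not just counts. You won’t need this for this week’s tasks, but as a sketch, base R’s weighted.mean() computes a population-weighted average:

chs %>%
  filter(PDV_SHCO < 9000) %>%  # Remove "not stated", as before
  summarize(
    mean_unweighted = mean(PDV_SHCO),
    mean_weighted = weighted.mean(PDV_SHCO, PFWEIGHT)
  )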
A key skill in quantitative reasoning is comparing how a variable is distributed across different groups. This connects to our discussion of polarization: groups can have the same overall mean but very different distributions.
Let’s compare economic hardship (EHA_10) across
regions.
From the codebook (EHA_10 - Economic hardship):
Question: “In the past 12 months, how difficult or easy was it for your household to meet its financial needs in terms of transportation, housing, food, clothing and other necessary expenses?” Responses: 1 = Very difficult, 2 = Difficult, 3 = Neither difficult nor easy, 4 = Easy, 5 = Very easy, 9 = Not stated.
First, let’s calculate summary statistics by region:
chs %>%
filter(EHA_10 != 9) %>% # Remove "Not stated"
mutate(region = case_when(
REGION == "01" | REGION == 1 ~ "Atlantic",
REGION == "02" | REGION == 2 ~ "Quebec",
REGION == "03" | REGION == 3 ~ "Ontario",
REGION == "04" | REGION == 4 ~ "Prairies",
REGION == "05" | REGION == 5 ~ "British Columbia",
REGION == "06" | REGION == 6 ~ "Territories",
TRUE ~ NA_character_
)) %>%
filter(!is.na(region)) %>%
group_by(region) %>%
summarize(
mean_hardship = round(mean(EHA_10), 2),
sd_hardship = round(sd(EHA_10), 2),
pct_difficult = round(sum(EHA_10 %in% c(1, 2)) / n() * 100, 1),
n = n()
)
## # A tibble: 6 × 5
## region mean_hardship sd_hardship pct_difficult n
## <chr> <dbl> <dbl> <dbl> <int>
## 1 Atlantic 3.19 1.02 23.1 11240
## 2 British Columbia 3.23 1.02 21.8 4190
## 3 Ontario 3.18 1.07 24.6 7316
## 4 Prairies 3.17 1.03 23.7 11405
## 5 Quebec 3.47 1.03 15.9 5958
## 6 Territories 3.42 1.04 17.8 788
What’s new here?

- case_when() lets us recode multiple values at once with readable labels
- pct_difficult calculates the percentage reporting "Very difficult" or "Difficult" (see the sketch after this list)
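Why does sum(EHA_10 %in% c(1, 2)) count responses? Summing a logical vector counts the TRUEs. A quick sketch with made-up values:

x <- c(1, 3, 2, 5, 4, 1)
x %in% c(1, 2)                         # TRUE FALSE TRUE FALSE FALSE TRUE
sum(x %in% c(1, 2))                    # 3: each TRUE counts as 1
sum(x %in% c(1, 2)) / length(x) * 100  # 50: the percentage in codes 1 or 2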
difficult” or “Difficult”chs %>%
filter(EHA_10 != 9) %>% # Remove "Not stated"
mutate(region = case_when(
REGION == "01" | REGION == 1 ~ "Atlantic",
REGION == "02" | REGION == 2 ~ "Quebec",
REGION == "03" | REGION == 3 ~ "Ontario",
REGION == "04" | REGION == 4 ~ "Prairies",
REGION == "05" | REGION == 5 ~ "British Columbia",
REGION == "06" | REGION == 6 ~ "Territories",
TRUE ~ NA_character_
)) %>%
filter(!is.na(region)) %>%
ggplot(aes(x = factor(EHA_10), fill = region)) +
geom_bar(show.legend = FALSE) +
facet_wrap(~region) +
scale_x_discrete(labels = c("1" = "Very\ndifficult", "2" = "Difficult",
"3" = "Neither", "4" = "Easy", "5" = "Very\neasy")) +
scale_fill_brewer(palette = "Set1") +
labs(
title = "Economic Hardship by Region",
subtitle = "Unweighted sample counts — note differences in sample sizes across regions",
x = "Level of Difficulty Meeting Financial Needs",
y = "Number of Households"
) +
theme_minimal() +
theme(axis.text.x = element_text(size = 7))
What’s new here?

- facet_wrap(~region) creates a separate panel for each region
- scale_fill_brewer(palette = "Set1") uses a nice color palette

Look at both the table and the visualization.
This section asks you to apply what you’ve learned to a new variable. The tasks below are similar to what you’ve done in earlier parts — look back at those sections for guidance.
The variable PCOS_05 measures sense of belonging to the
local community.
From the codebook (PCOS_05 - Community satisfaction):
Question: “Using a scale of 0 to 10, where 0 means ‘Very dissatisfied’ and 10 means ‘Very satisfied’, how satisfied are you with feeling as part of your community?”
Like life satisfaction, responses are grouped in the PUMF: codes 1 through 9 cover 0-2 up through 10, and code 99 means “Not stated.”
Task 1: Calculate summary statistics (mean, median, SD) for community belonging.
Hint: Look at the beginning of Part 2 where we calculated these
for shelter costs. You’ll need to filter out code 99, then use
summarize() with mean(),
median(), and sd().
# Your code here
# Replace PDV_SHCO with PCOS_05
# Remember to filter out the "not stated" code (99 instead of 9999999.99)
Task 2: Create a histogram of community belonging.
Hint: Look at the life satisfaction histogram in Part 1. The code structure is very similar — you just need to change the variable name and the filter condition.
# Your code here
# Use PCOS_05 instead of PLIS_05
# Filter out code 99
# Adjust the scale_x_continuous() labels to match the codebook categories
Task 3: Compare community belonging between owners and renters.
Hint: Look at Part 3 where we compared shelter costs by tenure.
You’ll use the same approach: filter both variables, create a tenure
label with mutate(), use group_by(), then
summarize().
# Your code here
# Filter: PCOS_05 != 99 and PDCT_05 %in% c(1, 2)
# Create tenure variable with mutate()
# Group by tenure
# Calculate mean, median, sd, and n
After completing the tasks above, take a few minutes to reflect on this practice session and record your answers to the journal prompts in your course journal.
| Function | What it does | Example |
|---|---|---|
| ggplot() | Initialize a plot | ggplot(data, aes(x = var)) |
| geom_bar() | Bar chart for categorical data | geom_bar(fill = "steelblue") |
| geom_histogram() | Histogram for numerical data | geom_histogram(binwidth = 1) |
| geom_vline() | Add vertical line | geom_vline(xintercept = 50) |
| facet_wrap() | Separate panels by group | facet_wrap(~group) |
| scale_x_continuous() | Customize x-axis for continuous data | scale_x_continuous(breaks = seq(0, 10, 2)) |
| scale_x_discrete() | Customize x-axis for categorical data | scale_x_discrete(labels = c(...)) |
| scale_fill_manual() | Set custom colors | scale_fill_manual(values = c("A" = "red")) |
| sd() | Calculate standard deviation | sd(variable) |
| case_when() | Recode multiple values | case_when(x == 1 ~ "A", x == 2 ~ "B") |
| sum(weights) | Weighted count | sum(PFWEIGHT) |
Always visualize — summary statistics can hide important patterns (remember the Datasaurus!)
Shape matters — symmetric, left-skewed, and right-skewed distributions tell different stories
Mean vs. median — when they differ, the distribution is skewed; the median often better represents the “typical” case
Spread is telling — standard deviation tells you how much variation exists; two groups can have the same mean but very different SDs
Filter first — always check the codebook and remove special codes (like 9, 99, 9999999.99) before calculating statistics
Weighted vs. unweighted — for population estimates, use weights; for describing your sample, unweighted counts are appropriate
Sample sizes matter — small samples (like the Territories) mean less certainty; oversampled regions (like Atlantic Canada) have more observations than their population share
Remember: The goal is not perfect code, but building intuition for what distributions reveal. If something confuses you, write it down — that’s valuable information for class discussion.