Getting Started

Before we dive into the data, let’s make sure you’re set up to run this document and publish your work.

Installing Required Packages

If you haven’t already installed the tidyverse package, you’ll need to do so first. For instance, by placing the following in a code box. To generate a new code box just click the sign near Run (to the left of it) that has a +C symbol and then click R.

install.packages("tidyverse")

You only need to install packages once. After that, you just load them with library().

Note: You do NOT need to install knitr or rmarkdown separately — RStudio handles these automatically.

How to Run Code Chunks

Work through this document chunk by chunk, not all at once. For each gray code box:

  • Click the green “play” button (▶) in the top-right corner of the chunk, OR
  • Place your cursor in the chunk and press Ctrl+Enter (Windows) or Cmd+Enter (Mac)

After running each chunk:

  • Check the output that appears below
  • Make sure it looks reasonable before moving on
  • If you get an error, read it carefully — often it tells you what went wrong

Important: Run chunks in order from top to bottom. Later chunks depend on earlier ones!

Knitting to HTML

“Knitting” means converting your .Rmd file into a final HTML document. When you knit:

  • R runs ALL the code chunks from top to bottom in a fresh environment
  • It combines the text, code, and output into one polished document
  • The output file appears in the same folder as your .Rmd file

How to Knit

  • Click the Knit button at the top of the editor pane (it has a yarn ball icon), OR
  • Press Ctrl+Shift+K (Windows) or Cmd+Shift+K (Mac)

First Time Knitting?

The first time you knit, RStudio may prompt you to install additional packages. Click “Yes” to install them. This only happens once.

Common Knitting Errors

Error Message What It Means How to Fix
“Object not found” You’re using something that wasn’t created Make sure earlier chunks ran successfully; check for typos
“Could not find function” A package isn’t loaded Add library(tidyverse) at the top
“File not found” R can’t find your data file Check the file path; make sure the file is in the right folder
“Unexpected symbol” Syntax error in your code Look for missing commas, parentheses, or quotes

Remember: When you knit, R starts fresh. Even if a chunk worked when you ran it interactively, knitting might fail if the chunk depends on something you ran in the Console but didn’t include in the document.

Publishing to RPubs

RPubs is a free service for sharing R Markdown documents online. To publish:

  1. First, Knit your document to HTML (see above)
  2. In the HTML preview window, click the Publish button (blue icon, top-right)
  3. Choose RPubs
  4. If this is your first time:
    • Click “Create an account” and sign up at rpubs.com
    • Return to RStudio and try publishing again
  5. Give your document a title and a unique URL slug
  6. Click Publish

Your document will now be live at rpubs.com/yourusername/yourslug. You can share this link with anyone!

Note: RPubs documents are public. Don’t include any personal information you wouldn’t want others to see.


Overview

This practice document builds on your skills from Week 2 by introducing visualization and measures of spread.

The core lesson from this week’s lecture: a single summary statistic hides as much as it reveals. To understand data, you need to see its shape.

We’ll also deepen our understanding of survey weights — why they matter and when to use them.

Work through each section, running the code chunks and answering the journal prompts in your course journal. Remember the cycle: run, check, interpret.


Data Cleaning Reminder

In Week 2, you learned an important lesson: survey data often contains special codes for missing or invalid responses. Remember the “household of 99 people” error?

The standard workflow for CHS variables:

  1. Check the codebook for special codes (9, 99, 999, 9999999.99, etc.)
  2. Filter them out before analyzing
  3. Then calculate statistics or create visualizations

Throughout this document, you’ll see this pattern:

chs %>%
  filter(VARIABLE != 9) %>%  # Remove "Not stated" (code 9)
  ...

The comment explains what’s being removed and why. Always check the codebook first to know which codes to filter.


Part 1: From Tables to Pictures

In Week 2, we used count() to make frequency tables. Tables are useful, but they don’t show us the shape of a distribution. This week, we learn to visualize.

Load packages and data

# Load the tidyverse package (includes ggplot2 for visualization)
library(tidyverse)

# Load the Canadian Housing Survey data
chs <- read_csv("data/CHS2021ECL_PUMF.csv")

Why visualize?

Remember the Datasaurus Dozen from lecture? Four datasets had nearly identical means, but completely different patterns. The only way to see those patterns was to plot the data.

Let’s start with a variable we explored last week: self-assessed health (GH_05).

A frequency table (review from Week 2)

# From codebook: GH_05 (Self-assessed health)
# 1 = Excellent, 2 = Very good, 3 = Good, 4 = Fair, 5 = Poor
# 9 = Not stated

chs %>%
  filter(GH_05 != 9) %>%  # Remove "Not stated"
  count(GH_05) %>%
  mutate(percent = round(n / sum(n) * 100, 1))
## # A tibble: 5 × 3
##   GH_05     n percent
##   <dbl> <int>   <dbl>
## 1     1  5428    13.3
## 2     2 13124    32.2
## 3     3 13955    34.2
## 4     4  6208    15.2
## 5     5  2068     5.1

This tells us the counts and percentages. But what does the distribution look like?

Our first bar chart

For categorical variables, we use geom_bar().

chs %>%
  filter(GH_05 != 9) %>%  # Remove "Not stated"
  ggplot(aes(x = factor(GH_05))) +
  geom_bar(fill = "steelblue") +
  scale_x_discrete(labels = c("1" = "Excellent", "2" = "Very good", 
                               "3" = "Good", "4" = "Fair", "5" = "Poor")) +
  labs(
    title = "Self-Assessed Health",
    subtitle = "Canadian Housing Survey, 2021",
    x = "Health Rating",
    y = "Number of Households"
  ) +
  theme_minimal()

What’s happening in this code?

  • filter(GH_05 != 9) removes “not stated” responses
  • ggplot(aes(x = factor(GH_05))) sets up the plot with GH_05 on the x-axis
  • factor() tells R to treat the numbers as categories
  • geom_bar() creates a bar chart
  • scale_x_discrete(labels = ...) replaces the numeric codes with readable labels
  • labs() adds titles and labels
  • theme_minimal() gives us a clean look

Our first histogram

For numerical variables, we use geom_histogram(). Let’s look at life satisfaction (PLIS_05), which asks respondents to rate their satisfaction on a 0-10 scale.

From the codebook (PLIS_05 - Life satisfaction):

The original question asked: “Using a scale of 0 to 10, where 0 means ‘Very dissatisfied’ and 10 means ‘Very satisfied’, how do you feel about your life as a whole right now?”

For the public use file, responses are grouped into categories:

  • 01 = 0, 1, or 2 (Very dissatisfied)
  • 02 = 3
  • 03 = 4
  • 04 = 5
  • 05 = 6
  • 06 = 7
  • 07 = 8
  • 08 = 9
  • 09 = 10 (Very satisfied)
  • 99 = Not stated

Note: This grouping is common in public use microdata files (PUMFs) to protect respondent confidentiality. We’re working with binned data, not the original 0-10 responses.

chs %>%
  filter(PLIS_05 != 99) %>%  # Remove "Not stated"
  ggplot(aes(x = PLIS_05)) +
  geom_histogram(binwidth = 1, fill = "steelblue", color = "white") +
  scale_x_continuous(breaks = 1:9,
                     labels = c("0-2", "3", "4", "5", "6", "7", "8", "9", "10")) +
  labs(
    title = "Life Satisfaction",
    subtitle = "Canadian Housing Survey, 2021",
    x = "Satisfaction Level (0 = Very Dissatisfied, 10 = Very Satisfied)",
    y = "Number of Households"
  ) +
  theme_minimal()

What’s different from a bar chart?

  • geom_histogram() is for continuous/numerical data
  • binwidth = 1 means each bar spans 1 unit on the scale
  • The bars touch because the variable is continuous (or in this case, ordered categories representing a continuous concept)

📓 Journal Prompt 1 — ANSWER KEY

Look at both visualizations above.

  1. Describe the shape of the life satisfaction distribution. Is it symmetric? Skewed? Where is the peak?
  2. What does the shape tell you about how Canadians rate their life satisfaction?

Suggested Answers:

Some key points to emphasize:

  1. Shape description:
    • The distribution is left-skewed (or negatively skewed) — the tail extends toward the lower satisfaction scores
    • The peak (mode) is at 8 on the original 0-10 scale
    • Most responses cluster in the upper range (representing scores of 7 to 10)
    • There are relatively fewer responses in the lower satisfaction categories (0 to 4 on original scale)
  2. Interpretation:
    • Most Canadians report high life satisfaction — the distribution is concentrated in the upper range
    • This is a common finding in well-being research: most people in developed countries report above-average life satisfaction (a phenomenon sometimes called the “positive skew” in subjective well-being)
    • The left-skew indicates that while most people are satisfied, there’s a “tail” of people reporting lower satisfaction — these outliers pull the mean down relative to the median
    • This pattern suggests that interventions targeting subjective well-being might focus on the groupws with low satisfaction rather than trying to shift the entire distribution

Additional notes:

  • The mean for this distribution would be lower than the median because of the left skew
  • Comparing this distribution across subgroups (age, income, region) could reveal important patterns in who experiences lower life satisfaction (more on that in Arc 2!)

Part 2: Understanding Shape

One of the key lessons from lecture: when the mean and median differ, that tells you something about the shape of the distribution.

Shelter costs: A case of skewness

The variable PDV_SHCO records monthly shelter costs in dollars. Unlike the grouped variables we just saw, this one is truly continuous.

From the codebook (PDV_SHCO - Shelter costs):

  • Values range from $0 to $9,000
  • 9999999.99 = Not stated

Let’s first calculate the mean and median:

chs %>%
  filter(PDV_SHCO < 9000) %>%  # Filter out "not stated" (coded as 9999999.99)
  summarize(
    mean_cost = mean(PDV_SHCO),
    median_cost = median(PDV_SHCO),
    min_cost = min(PDV_SHCO),
    max_cost = max(PDV_SHCO),
    n = n()
  )
## # A tibble: 1 × 5
##   mean_cost median_cost min_cost max_cost     n
##       <dbl>       <dbl>    <dbl>    <dbl> <int>
## 1     1117.         900        0     8900 31876

Notice anything? The mean is higher than the median. What does that suggest about the shape?

Visualizing shelter costs

chs %>%
  filter(PDV_SHCO < 9000) %>%  # Remove "not stated"
  ggplot(aes(x = PDV_SHCO)) +
  geom_histogram(binwidth = 200, fill = "steelblue", color = "white") +
  geom_vline(aes(xintercept = mean(PDV_SHCO)), 
             color = "red", linetype = "dashed", linewidth = 1) +
  geom_vline(aes(xintercept = median(PDV_SHCO)), 
             color = "darkgreen", linewidth = 1) +
  labs(
    title = "Monthly Shelter Costs",
    subtitle = "Red dashed = Mean | Green solid = Median",
    x = "Monthly Shelter Cost ($)",
    y = "Number of Households"
  ) +
  theme_minimal()

The red dashed line shows the mean; the green solid line shows the median.


📓 Journal Prompt 2 — ANSWER KEY

  1. Is this distribution symmetric, left-skewed, or right-skewed?
  2. Why is the mean higher than the median? What’s “pulling” the mean in that direction?
  3. If a news headline said “Average Canadian household spends $X on shelter,” would that accurately represent the typical household? Why or why not?

Suggested Answers:

Some key points:

  1. The distribution is right-skewed (or positively skewed):
    • The “tail” extends to the right toward higher shelter costs
    • Most households cluster at lower costs, with progressively fewer at higher costs
    • The peak is in the $400-$800 range, with a long tail extending toward $9,000
  2. Why the mean is higher than the median:
    • The mean ($1,117) is pulled upward by households with very high shelter costs
    • A small number of households paying $4,000-$9,000/month thus increase the average (recall the mean is more sensitive to outliers)
    • The median ($900) is a measure of the middle value (the 50th percentile value)
    • The greater the skew, the more it pulls the mean value (or, another way to put it, the greater the skew, the greater the gap between the median and the mean)
  3. Would “average shelter cost” be misleading?
    • Yes, it would be misleading for describing the “typical” household
    • The mean ($1,117) overstates what most households actually pay
    • The median ($900) better represents the typical household — half pay more, half pay less
    • A more accurate headline might say: “Half of Canadian households spend $900 or less on shelter, though the average is $1,117 given higher-cost households”
    • Policy implication: Programs designed around the “average” cost might miss the mark for what is more typical for households while still being inadequate for those in expensive markets

Key insight: This is why we always report both measures for skewed distributions — and why “median household income” is preferred over “mean household income” in most socio-economic reporting.


Part 3: Measuring Spread

Two distributions can have the same mean but tell completely different stories. What distinguishes them is spread — how much values vary from the center.

Introducing standard deviation

The standard deviation (SD) measures the typical distance of values from the mean.

  • Small SD: Values cluster tightly around the mean
  • Large SD: Values spread far from the mean

Let’s calculate the SD for shelter costs:

chs %>%
  filter(PDV_SHCO < 9000) %>%  # Remove "not stated"
  summarize(
    mean_cost = mean(PDV_SHCO),
    median_cost = median(PDV_SHCO),
    sd_cost = sd(PDV_SHCO),
    n = n()
  )
## # A tibble: 1 × 4
##   mean_cost median_cost sd_cost     n
##       <dbl>       <dbl>   <dbl> <int>
## 1     1117.         900    813. 31876

Reading this output:

  • sd() calculates the standard deviation
  • The SD tells you: on average, how far are values from the mean?

Comparing spread across groups

Does shelter cost vary more for owners or renters? Let’s find out using group_by():

From the codebook (PDCT_05 - Tenure):

  • 1 = Yes (owned by household member)
  • 2 = No (rented)
  • 9 = Not stated
chs %>%
  filter(PDV_SHCO < 9000, PDCT_05 %in% c(1, 2)) %>%  # Remove "not stated" for both variables
  mutate(tenure = ifelse(PDCT_05 == 1, "Owner", "Renter")) %>%
  group_by(tenure) %>%
  summarize(
    mean_cost = mean(PDV_SHCO),
    median_cost = median(PDV_SHCO),
    sd_cost = sd(PDV_SHCO),
    n = n()
  )
## # A tibble: 2 × 5
##   tenure mean_cost median_cost sd_cost     n
##   <chr>      <dbl>       <dbl>   <dbl> <int>
## 1 Owner      1273.        1000    961. 17972
## 2 Renter      916.         800    502. 13904

What’s new here?

  • mutate() creates a new variable tenure with readable labels
  • group_by(tenure) calculates statistics separately for owners and renters

Visualizing the comparison

chs %>%
  filter(PDV_SHCO < 9000, PDCT_05 %in% c(1, 2)) %>%  # Remove "not stated"
  mutate(tenure = ifelse(PDCT_05 == 1, "Owner", "Renter")) %>%
  ggplot(aes(x = PDV_SHCO, fill = tenure)) +
  geom_histogram(binwidth = 200, color = "white", alpha = 0.9) +
  facet_wrap(~tenure, ncol = 1) +
  scale_fill_manual(values = c("Owner" = "#E27D60", "Renter" = "#41B3A3")) +
  labs(
    title = "Monthly Shelter Costs by Tenure",
    x = "Monthly Shelter Cost ($)",
    y = "Number of Households"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

What’s new here?

  • facet_wrap(~tenure, ncol = 1) creates separate panels for each group
  • scale_fill_manual() lets us choose specific colors for each group
  • This lets us compare distributions side by side

📓 Journal Prompt 3 — ANSWER KEY

Look at both the table and the visualization above.

  1. Which group (owners or renters) has higher average shelter costs?
  2. Which group has more variation in shelter costs (higher SD)?
  3. What might explain the difference in spread between owners and renters? Think about what makes up shelter costs for each group.
  4. If you only reported the mean shelter cost for each group, what would you miss about how these groups differ?

Suggested Answers:

  1. Owners have higher average shelter costs:

    • Mean: Owners = $1,273 vs. Renters = $916 (difference of $357/month)
    • Median: Owners = $1,000 vs. Renters = $800 (difference of $200/month)
    • Both measures show owners paying more
  2. Owners have much more variation (higher SD):

    • SD: Owners = $961 vs. Renters = $502
    • The owner SD is nearly twice the renter SD
    • This means shelter costs for owners are much more spread out
  3. What explains the difference in spread?

    For owners, shelter costs include:

    • Mortgage payments (highly variable based on purchase price, down payment, interest rate, remaining balance)
    • Property taxes (vary by location and home value)
    • Home insurance
    • Utilities (often larger homes)
    • Maintenance and repairs
    • Condo fees (if applicable)

    For renters, shelter costs include:

    • Rent (more standardized within markets)
    • Possibly utilities (sometimes included)
    • Renter’s insurance (relatively small)

    Why owners have more spread:

    • Home values vary enormously (from $200K condos to multi-million dollar homes)
    • Owners at different stages of mortgage repayment have very different costs
    • Property taxes scale with home value
    • Some owners have paid off their mortgages (low costs) while others just bought (high costs)
    • Renters face more regulated/standardized pricing within rental markets
  4. What you’d miss by only reporting the mean:

    • The mean alone hides the fact that owner costs are much more variable
    • Among owners, costs range from nearly $0 (paid-off homes) to $8,000+
    • The owner distribution is more right-skewed (mean > median by $273 for owners vs. $116 for renters)
    • You’d miss that the “typical” renter experience is more homogeneous, while owner experiences are more diverse
    • Policy implication: Housing assistance programs would need different approaches for owners vs. renters because of this variability

Part 4: Weighted vs. Unweighted Estimates

During the lecture, we learned that survey samples don’t always perfectly represent the population. Some groups get oversampled (more in sample than in population) to ensure enough cases for analysis.

Why do we need weights?

Even with careful sampling, some groups end up:

  • Over-represented: More in the sample than in the population
  • Under-represented: Fewer in the sample than in the population

This happens because of:

  1. Deliberate oversampling: The CHS oversampled renters and Atlantic Canadians to ensure enough cases for detailed analysis
  2. Differential response rates: Some groups are less likely to respond to surveys
  3. Sampling frame limitations: Not everyone has an equal chance of being selected

Survey weights adjust for these imbalances. Each household gets a weight (PFWEIGHT) based on how many households it represents in the population.

The CHS tenure example

Let’s replicate what we saw in lecture. First, the unweighted percentages (raw sample counts):

chs %>%
  filter(PDCT_05 %in% c(1, 2)) %>%  # Keep only owners and renters
  count(PDCT_05) %>%
  mutate(
    tenure = ifelse(PDCT_05 == 1, "Owner", "Renter"),
    pct_unweighted = round(n / sum(n) * 100, 1)
  ) %>%
  dplyr::select(tenure, n, pct_unweighted)
## # A tibble: 2 × 3
##   tenure     n pct_unweighted
##   <chr>  <int>          <dbl>
## 1 Owner  20821           52.6
## 2 Renter 18779           47.4

This is what we calculated in Week 2. But is this what Canada actually looks like?

Calculating weighted percentages

The variable PFWEIGHT contains the survey weights. We sum the weights to get population estimates:

chs %>%
  filter(PDCT_05 %in% c(1, 2)) %>%  # Keep only owners and renters
  mutate(tenure = ifelse(PDCT_05 == 1, "Owner", "Renter")) %>%
  group_by(tenure) %>%
  summarize(
    n_unweighted = n(),
    n_weighted = sum(PFWEIGHT)
  ) %>%
  mutate(
    pct_unweighted = round(n_unweighted / sum(n_unweighted) * 100, 1),
    pct_weighted = round(n_weighted / sum(n_weighted) * 100, 1)
  )
## # A tibble: 2 × 5
##   tenure n_unweighted n_weighted pct_unweighted pct_weighted
##   <chr>         <int>      <dbl>          <dbl>        <dbl>
## 1 Owner         20821  10086258.           52.6         67.9
## 2 Renter        18779   4768680.           47.4         32.1

What’s happening here?

  • n() counts the raw number of households in the sample
  • sum(PFWEIGHT) sums the weights, estimating the population count
  • We calculate percentages both ways to compare

The codebook shows both

Here’s a useful feature of the CHS codebook: it reports both Frequency (unweighted count) and Weighted Frequency for every variable. For PDCT_05:

Category Code Frequency (n) Weighted Frequency Weighted %
Owned 1 20,821 10,086,258 67.9%
Rented 2 18,779 4,768,680 32.1%

This tells you immediately that the weighted percentage of owners (67.9%) is much higher than the unweighted percentage (52.6%).

Regional oversampling

The CHS also oversampled Atlantic Canada. Let’s see this:

chs %>%
  mutate(region = case_when(
    REGION == "01" | REGION == 1 ~ "Atlantic",
    REGION == "02" | REGION == 2 ~ "Quebec",
    REGION == "03" | REGION == 3 ~ "Ontario",
    REGION == "04" | REGION == 4 ~ "Prairies",
    REGION == "05" | REGION == 5 ~ "British Columbia",
    REGION == "06" | REGION == 6 ~ "Territories",
    TRUE ~ NA_character_
  )) %>%
  filter(!is.na(region)) %>%
  group_by(region) %>%
  summarize(
    n_unweighted = n(),
    n_weighted = sum(PFWEIGHT)
  ) %>%
  mutate(
    pct_unweighted = round(n_unweighted / sum(n_unweighted) * 100, 1),
    pct_weighted = round(n_weighted / sum(n_weighted) * 100, 1),
    difference = pct_unweighted - pct_weighted
  )
## # A tibble: 6 × 6
##   region          n_unweighted n_weighted pct_unweighted pct_weighted difference
##   <chr>                  <int>      <dbl>          <dbl>        <dbl>      <dbl>
## 1 Atlantic               11263   1043174.           27.5          6.9       20.6
## 2 British Columb…         4197   2077140.           10.2         13.8       -3.6
## 3 Ontario                 7333   5694396.           17.9         37.7      -19.8
## 4 Prairies               11434   2551913.           27.9         16.9       11  
## 5 Quebec                  5973   3701158.           14.6         24.5       -9.9
## 6 Territories              788     25934.            1.9          0.2        1.7

Look at Atlantic Canada: It makes up about 27% of the unweighted sample but only about 7% of the weighted population estimate. That’s deliberate oversampling to ensure enough cases for regional analysis.

When to use weighted vs. unweighted

Purpose Use
Describing what Canada looks like Weighted percentages
Describing your sample Unweighted counts
Calculating population estimates Weighted
Reporting how many respondents you analyzed Unweighted

📓 Journal Prompt 4 — ANSWER KEY

  1. What is the unweighted percentage of homeowners in the sample?
  2. What is the weighted percentage (the population estimate)?
  3. Why is there such a large difference (~15 percentage points)?
  4. If you were writing a report about Canadian homeownership rates, which percentage should you use? Why?
  5. Why might the CHS have deliberately oversampled Atlantic Canada?

Suggested Answers:

  1. Unweighted percentage of homeowners: 52.6% (20,821 out of 39,600 valid responses)

  2. Weighted percentage of homeowners: 67.9% (representing approximately 10.1 million owner households out of 14.9 million total households)

  3. Why the ~15 percentage point difference:

    • Deliberate oversampling of renters: The CHS purposely included more renters than their population share to ensure enough renter households for detailed analysis
    • Renters are a policy-relevant group (housing affordability, tenant rights, etc.), so having enough of them in the sample allows for reliable subgroup estimates
    • Differential response rates: Renters may also have different response rates than owners
    • Without weights, we’d conclude Canada is roughly 50/50 owners/renters — but the weighted data shows it’s closer to 68/32
    • The weights correct for this by giving each renter household less weight (they’re over-represented) and each owner household more weight (they’re under-represented relative to their true population share)
  4. Which percentage for a report on Canadian homeownership:

    • Use the weighted percentage (67.9%) for any statement about “Canadian households”
    • The unweighted percentage describes your sample, not Canada
    • A correct statement: “An estimated 67.9% of Canadian households own their home”
    • You might also report: “Our analysis is based on 20,821 owner households and 18,779 renter households in the CHS sample” — this uses unweighted counts to describe your data
    • Key principle: Population claims require population weights; sample descriptions use raw counts
  5. Why oversample Atlantic Canada:

    • Sufficient sample size for regional analysis: Atlantic Canada is only ~7% of Canadian households. In a proportional sample of 40,000, that’s only ~2,800 Atlantic households
    • Provincial-level estimates: To analyze each Atlantic province separately (NL, PEI, NS, NB), you need enough cases in each. PEI alone is only ~0.4% of Canada
    • Policy and social relevance: Regional housing markets, affordability, and housing conditions differ — policymakers need reliable regional data
    • Statistical precision: As we saw in the sampling lab app, smaller samples produce larger margins of error (and more variability in estimates). Oversampling ensures more precise estimates for smaller regions
    • The tradeoff: Oversampling means the raw (unweighted) sample doesn’t look like Canada, but the weights fix this when calculating population estimates

Methodological note: This is a standard survey design strategy called “stratified sampling with disproportionate allocation.” The key is that as long as you have proper weights, you can both (1) make reliable population estimates AND (2) analyze subgroups with adequate sample sizes.


Part 5: Comparing Distributions Across Groups

A key skill in sociological quantitative reasoning is comparing how a variable is distributed across different (notably social) groups. This connects to our discussion of polarization: groups can have the same overall mean but very different distributions.

Economic hardship by region

Let’s compare economic hardship (EHA_10) across regions.

From the codebook (EHA_10 - Economic hardship):

Question: “In the past 12 months, how difficult or easy was it for your household to meet its financial needs in terms of transportation, housing, food, clothing and other necessary expenses?”

  • 1 = Very difficult
  • 2 = Difficult
  • 3 = Neither difficult nor easy
  • 4 = Easy
  • 5 = Very easy
  • 9 = Not stated

First, let’s calculate summary statistics by region:

chs %>%
  filter(EHA_10 != 9) %>%  # Remove "Not stated"
  mutate(region = case_when(
    REGION == "01" | REGION == 1 ~ "Atlantic",
    REGION == "02" | REGION == 2 ~ "Quebec",
    REGION == "03" | REGION == 3 ~ "Ontario",
    REGION == "04" | REGION == 4 ~ "Prairies",
    REGION == "05" | REGION == 5 ~ "British Columbia",
    REGION == "06" | REGION == 6 ~ "Territories",
    TRUE ~ NA_character_
  )) %>%
  filter(!is.na(region)) %>%
  group_by(region) %>%
  summarize(
    mean_hardship = round(mean(EHA_10), 2),
    sd_hardship = round(sd(EHA_10), 2),
    pct_difficult = round(sum(EHA_10 %in% c(1, 2)) / n() * 100, 1),
    n = n()
  )
## # A tibble: 6 × 5
##   region           mean_hardship sd_hardship pct_difficult     n
##   <chr>                    <dbl>       <dbl>         <dbl> <int>
## 1 Atlantic                  3.19        1.02          23.1 11240
## 2 British Columbia          3.23        1.02          21.8  4190
## 3 Ontario                   3.18        1.07          24.6  7316
## 4 Prairies                  3.17        1.03          23.7 11405
## 5 Quebec                    3.47        1.03          15.9  5958
## 6 Territories               3.42        1.04          17.8   788

What’s new here?

  • case_when() lets us recode multiple values at once with readable labels
  • pct_difficult calculates the percentage reporting “Very difficult” or “Difficult”

Visualizing distributions by region

chs %>%
  filter(EHA_10 != 9) %>%  # Remove "Not stated"
  mutate(region = case_when(
    REGION == "01" | REGION == 1 ~ "Atlantic",
    REGION == "02" | REGION == 2 ~ "Quebec",
    REGION == "03" | REGION == 3 ~ "Ontario",
    REGION == "04" | REGION == 4 ~ "Prairies",
    REGION == "05" | REGION == 5 ~ "British Columbia",
    REGION == "06" | REGION == 6 ~ "Territories",
    TRUE ~ NA_character_
  )) %>%
  filter(!is.na(region)) %>%
  ggplot(aes(x = factor(EHA_10), fill = region)) +
  geom_bar(show.legend = FALSE) +
  facet_wrap(~region) +
  scale_x_discrete(labels = c("1" = "Very\ndifficult", "2" = "Difficult", 
                               "3" = "Neither", "4" = "Easy", "5" = "Very\neasy")) +
  scale_fill_brewer(palette = "Set1") +
  labs(
    title = "Economic Hardship by Region",
    subtitle = "Unweighted sample counts — note differences in sample sizes across regions",
    x = "Level of Difficulty Meeting Financial Needs",
    y = "Number of Households"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(size = 7))

What’s new here?

  • facet_wrap(~region) creates a separate panel for each region
  • scale_fill_brewer(palette = "Set1") uses a nice color palette

📓 Journal Prompt 5 — ANSWER KEY

Look at both the table and the visualization.

  1. Which region has the highest percentage reporting economic hardship (codes 1 or 2)?
  2. Which region has the lowest?
  3. Look at the sample sizes (n). Why does the Territories panel look so different from the others? What does this tell you about how confident we can be in comparisons involving the Territories?
  4. Look at the bar heights across regions. Atlantic and Prairies have the tallest bars overall. Does this mean these regions have more economic hardship? Or is something else going on? (Hint: Think about what we learned about oversampling.)

Suggested Answers:

  1. Highest hardship: Ontario at 24.6% reporting “Very difficult” or “Difficult”
    • British Columbia is close at 21.8%
    • Prairies (23.7%) and Atlantic (23.1%) are also relatively high
  2. Lowest hardship: Quebec at 15.9% reporting economic hardship
    • Territories is second-lowest at 17.8%
    • Quebec’s mean hardship score (3.47) is the highest (remember: higher = easier given how the variable is coded), confirming this pattern
  3. The Territories panel:
    • The Territories has only 788 households in the sample (vs. 5,000-11,000 in other regions)
    • You can see this clearly by just comparing the height of the bars across the regions (since the y-axis is just a count for the number of households)
    • Confidence implications:
      • We can be less confident in Territories estimates and the reported descriptive statistics
      • The differences between Territories and other regions might just be sampling noise (so concluding that it’s surprising that the Territories report lower economic hardship might just be a sompling artifact here)
      • The Territories weighted population is only ~26,000 households, so even this small sample represents a large sampling fraction
    • Lesson: Always report sample sizes; they tell you how much you can trust comparisons
  4. Atlantic and Prairies having the tallest bars:
    • NO, this does NOT mean they have greater economic hardship — this is a visualization trap!
    • These regions have the largest sample sizes:
      • Atlantic: n = 11,240 (27.5% of sample, but only 6.9% of weighted population)
      • Prairies: n = 11,405 (27.9% of sample, but only 16.9% of weighted population)
    • Taller bars reflect more observations, not more hardship
    • To compare hardship across regions, look at the percentages (pct_difficult) or the shape of the distributions, not the raw heights
    • The better comparison: Quebec has the lowest percentage (15.9%) and Ontario has the highest (24.6%) — but you can’t see this easily in the bar heights because of the sample size differences
    • Alternative visualization: Using geom_bar(position = "fill") would show proportions instead of counts, making regional comparisons easier

Key insight: When sample sizes differ across groups (especially due to oversampling), raw counts can be misleading. Always check percentages or proportions for fair comparisons.


Part 6: Putting It All Together (Challenge)

This section asks you to apply what you’ve learned to a new variable. The tasks below are similar to what you’ve done in earlier parts — look back at those sections for guidance.

Community belonging

The variable PCOS_05 measures sense of belonging to the local community.

From the codebook (PCOS_05 - Community satisfaction):

Question: “Using a scale of 0 to 10, where 0 means ‘Very dissatisfied’ and 10 means ‘Very satisfied’, how satisfied are you with feeling as part of your community?”

Like life satisfaction, responses are grouped in the PUMF:

  • 01 = 0, 1, or 2 (Very dissatisfied)
  • 02 = 3
  • 03 = 4
  • 04 = 5
  • 05 = 6
  • 06 = 7
  • 07 = 8
  • 08 = 9
  • 09 = 10 (Very satisfied)
  • 99 = Not stated

Your tasks

Task 1: Calculate summary statistics

Calculate summary statistics (mean, median, SD) for community belonging.

Hint: Look at the beginning of Part 2 where we calculated these for shelter costs. You’ll need to filter out code 99, then use summarize() with mean(), median(), and sd().

# ANSWER KEY - Task 1

chs %>%
  filter(PCOS_05 != 99) %>%  # Remove "Not stated"
  summarize(
    mean_belonging = mean(PCOS_05),
    median_belonging = median(PCOS_05),
    sd_belonging = sd(PCOS_05),
    min_belonging = min(PCOS_05),
    max_belonging = max(PCOS_05),
    n = n()
  )
## # A tibble: 1 × 6
##   mean_belonging median_belonging sd_belonging min_belonging max_belonging     n
##            <dbl>            <dbl>        <dbl>         <dbl>         <dbl> <int>
## 1           6.17                7         2.19             1             9 40530

Reading the output:

  • Mean: ~6.5 (corresponding to satisfaction score around 7-8)
  • Median: 7 (the middle value corresponds to a satisfaction score of 8)
  • SD: ~2.0 (moderate spread)
  • The mean is slightly less than the median, suggesting a slight left skew

Task 2: Create a histogram

Create a histogram of community belonging.

Hint: Look at the life satisfaction histogram in Part 1. The code structure is very similar — you just need to change the variable name and the filter condition.

# ANSWER KEY - Task 2

chs %>%
  filter(PCOS_05 != 99) %>%  # Remove "Not stated"
  ggplot(aes(x = PCOS_05)) +
  geom_histogram(binwidth = 1, fill = "steelblue", color = "white") +
  scale_x_continuous(breaks = 1:9,
                     labels = c("0-2", "3", "4", "5", "6", "7", "8", "9", "10")) +
  labs(
    title = "Community Belonging",
    subtitle = "Canadian Housing Survey, 2021",
    x = "Satisfaction with Community Belonging (0 = Very Dissatisfied, 10 = Very Satisfied)",
    y = "Number of Households"
  ) +
  theme_minimal()

Task 3: Compare owners and renters

Compare community belonging between owners and renters.

Hint: Look at Part 3 where we compared shelter costs by tenure. You’ll use the same approach: filter both variables, create a tenure label with mutate(), use group_by(), then summarize().

# ANSWER KEY - Task 3

chs %>%
  filter(PCOS_05 != 99, PDCT_05 %in% c(1, 2)) %>%  # Remove "not stated" for both
  mutate(tenure = ifelse(PDCT_05 == 1, "Owner", "Renter")) %>%
  group_by(tenure) %>%
  summarize(
    mean_belonging = round(mean(PCOS_05), 2),
    median_belonging = median(PCOS_05),
    sd_belonging = round(sd(PCOS_05), 2),
    n = n()
  )
## # A tibble: 2 × 5
##   tenure mean_belonging median_belonging sd_belonging     n
##   <chr>           <dbl>            <dbl>        <dbl> <int>
## 1 Owner            6.47                7         2.04 20650
## 2 Renter           5.86                6         2.31 18500

📓 Journal Prompt 6 (Challenge) — ANSWER KEY

After completing the tasks above:

  1. Describe the shape of the community belonging distribution. Is it symmetric or skewed?
  2. How does the mean compare to the median? What does this tell you about the shape?
  3. Do owners and renters differ in their sense of community belonging? Describe any differences in both center (mean/median) and spread (SD).
  4. Big picture question: If a policymaker only looked at the average community belonging score, what important information might they miss?

Suggested Answers:

  1. Shape of the distribution:

    • The distribution is left-skewed (negatively skewed)
    • Most responses cluster in the upper range (codes 6-9, representing scores 7-10)
    • The peak appears to be at codes 7-8 (satisfaction scores of 8-9)
    • There’s a tail extending toward lower satisfaction scores
    • The distribution looks similar to life satisfaction — most people report relatively high community belonging
  2. Mean vs. median comparison:

    • Mean: approximately 6.5
    • Median: 7
    • The mean is lower than the median, confirming the left skew
    • The lower tail (people with low community belonging) pulls the mean down
    • This is the opposite of shelter costs (where right skew pulled the mean up)
  3. Owners vs. renters comparison:

    Measure Owners Renters Difference
    Mean ~6.8 ~6.0 Owners ~0.8 higher
    Median 7 6 Owners 1 higher
    SD ~1.9 ~2.1 Renters slightly more variable
    n ~17,900 ~13,700

    Key findings:

    • Owners report higher community belonging on average (both mean and median)
    • Renters have slightly more variation (higher SD)
    • The difference is substantively meaningful — about 0.8 points on the scale
  4. What a policymaker might miss:

    • The owner-renter gap: Average community belonging is quite different by tenure, suggesting different approaches might be needed
    • The distribution shape: While most people report high belonging, there’s a meaningful minority with low belonging. We should ask: who are they? What do they need?
    • Variation within groups: Renters are more variable — some have very high belonging, others very low. The “average renter” masks this heterogeneity
    • Potential policy targets:
      • Programs to help renters build community connections
      • Understanding why some people (both owners and renters) report very low belonging
      • Whether interventions should target individuals or neighborhoods
    • The typical vs. mean distinction: The median (7 for owners, 6 for renters) better represents the “typical” person than the mean, especially given the skew

Broader insight: Community belonging matters for health, civic engagement, and quality of life. Understanding not just the average, but the full distribution and how it varies across groups, is essential.


Part 7: Reflection

📓 Final Journal Prompt — ANSWER KEY

Take a few minutes to reflect on this practice session:

  1. What was the most important concept you learned or reinforced today?
  2. What’s one thing that confused you or that you’d like to discuss in class?
  3. How does visualizing distributions change the way you think about summary statistics like the mean?

No right answer here!


Summary of Key Functions

Function What it does Example
ggplot() Initialize a plot ggplot(data, aes(x = var))
geom_bar() Bar chart for categorical data geom_bar(fill = "steelblue")
geom_histogram() Histogram for numerical data geom_histogram(binwidth = 1)
geom_vline() Add vertical line geom_vline(xintercept = 50)
facet_wrap() Separate panels by group facet_wrap(~group)
scale_x_continuous() Customize x-axis for continuous data scale_x_continuous(breaks = seq(0, 10, 2))
scale_x_discrete() Customize x-axis for categorical data scale_x_discrete(labels = c(...))
scale_fill_manual() Set custom colors scale_fill_manual(values = c("A" = "red"))
sd() Calculate standard deviation sd(variable)
case_when() Recode multiple values case_when(x == 1 ~ "A", x == 2 ~ "B")
sum(weights) Weighted count sum(PFWEIGHT)

The Key Lessons

  1. Always visualize — summary statistics can hide important patterns (remember the Datasaurus!)

  2. Shape matters — symmetric, left-skewed, and right-skewed distributions tell different stories

  3. Mean vs. median — when they differ, the distribution is skewed; the median often better represents the “typical” case

  4. Spread is telling — standard deviation tells you how much variation exists; two groups can have the same mean but very different SDs

  5. Filter first — always check the codebook and remove special codes (like 9, 99, 9999999.99) before calculating statistics

  6. Weighted vs. unweighted — for population estimates, use weights; for describing your sample, unweighted counts are appropriate

  7. Sample sizes matter — small samples (like the Territories) mean less certainty; oversampled regions (like Atlantic Canada) have more observations than their population share

Remember: The goal is not perfect code, but building intuition for what distributions reveal. If something confuses you, write it down — that’s valuable information for class discussion.