Before we dive into the data, let’s make sure you’re set up to run this document and publish your work.
If you haven’t already installed the tidyverse package, you’ll need to do so first, for instance by running the following in a code box. To create a new code box, click the button to the left of Run in the editor toolbar (the one with the +C symbol), then click R.
install.packages("tidyverse")
You only need to install packages once. After that, you just load
them with library().
Note: You do NOT need to install knitr
or rmarkdown separately — RStudio handles these
automatically.
Work through this document chunk by chunk, not all at once. For each gray code box:

- Place your cursor in the chunk and press Ctrl+Enter (Windows) or Cmd+Enter (Mac), or click the green Run arrow at the top right of the chunk.
- After running each chunk: check the output and make sure you understand it before moving on.
Important: Run chunks in order from top to bottom. Later chunks depend on earlier ones!
“Knitting” means converting your .Rmd file into a final HTML document. When you knit, R starts a fresh session, runs every chunk from top to bottom, and assembles the code, output, and text into a single HTML file. To knit, press Ctrl+Shift+K (Windows) or Cmd+Shift+K (Mac), or click the Knit button in the toolbar. The first time you knit, RStudio may prompt you to install additional packages. Click “Yes” to install them. This only happens once.
| Error Message | What It Means | How to Fix |
|---|---|---|
| “Object not found” | You’re using something that wasn’t created | Make sure earlier chunks ran successfully; check for typos |
| “Could not find function” | A package isn’t loaded | Add library(tidyverse) at the top |
| “File not found” | R can’t find your data file | Check the file path; make sure the file is in the right folder |
| “Unexpected symbol” | Syntax error in your code | Look for missing commas, parentheses, or quotes |
Remember: When you knit, R starts fresh. Even if a chunk worked when you ran it interactively, knitting might fail if the chunk depends on something you ran in the Console but didn’t include in the document.
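For example, suppose you defined an object only in the Console (a hypothetical name, just for illustration):

# Typed in the Console only, never saved in a chunk:
x <- c(1, 2, 3)

# A chunk like this works interactively...
mean(x)

# ...but knitting fails with "object 'x' not found",
# because the fresh knit session never ran the Console line.
# The fix: move the definition into a chunk in the document.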
RPubs is a free service for sharing R Markdown documents online. To publish:

1. Knit your document to HTML.
2. In the preview window, click the “Publish” button (top right) and choose RPubs.
3. Sign in or create a free account, then give your document a title and a slug (the short name that appears in the URL).

Your document will now be live at rpubs.com/yourusername/yourslug. You can share this link with anyone!
Note: RPubs documents are public. Don’t include any personal information you wouldn’t want others to see.
This practice document builds on your skills from Week 2 by introducing visualization and measures of spread.
The core lesson from this week’s lecture: a single summary statistic hides as much as it reveals. To understand data, you need to see its shape.
We’ll also deepen our understanding of survey weights — why they matter and when to use them.
Work through each section, running the code chunks and answering the journal prompts in your course journal. Remember the cycle: run, check, interpret.
In Week 2, you learned an important lesson: survey data often contains special codes for missing or invalid responses. Remember the “household of 99 people” error?
The standard workflow for CHS variables: check the codebook for special codes, filter those codes out, then analyze what remains.
Throughout this document, you’ll see this pattern:
chs %>%
filter(VARIABLE != 9) %>% # Remove "Not stated" (code 9)
...
The comment explains what’s being removed and why. Always check the codebook first to know which codes to filter.
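One quick habit that goes along with reading the codebook: scan the raw values of a variable before filtering. A minimal sketch, assuming the chs data has been loaded (which we do just below):

# List every value of the variable and how often it occurs;
# watch for suspicious codes like 9, 99, or 9999999.99
chs %>%
  count(GH_05)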
In Week 2, we used count() to make frequency tables.
Tables are useful, but they don’t show us the shape of
a distribution. This week, we learn to visualize.
# Load the tidyverse package (includes ggplot2 for visualization)
library(tidyverse)
# Load the Canadian Housing Survey data
chs <- read_csv("data/CHS2021ECL_PUMF.csv")
Remember the Datasaurus Dozen from lecture? The datasets had nearly identical means, but completely different patterns. The only way to see those patterns was to plot the data.
Let’s start with a variable we explored last week: self-assessed
health (GH_05).
# From codebook: GH_05 (Self-assessed health)
# 1 = Excellent, 2 = Very good, 3 = Good, 4 = Fair, 5 = Poor
# 9 = Not stated
chs %>%
filter(GH_05 != 9) %>% # Remove "Not stated"
count(GH_05) %>%
mutate(percent = round(n / sum(n) * 100, 1))
## # A tibble: 5 × 3
## GH_05 n percent
## <dbl> <int> <dbl>
## 1 1 5428 13.3
## 2 2 13124 32.2
## 3 3 13955 34.2
## 4 4 6208 15.2
## 5 5 2068 5.1
This tells us the counts and percentages. But what does the distribution look like?
For categorical variables, we use geom_bar().
chs %>%
filter(GH_05 != 9) %>% # Remove "Not stated"
ggplot(aes(x = factor(GH_05))) +
geom_bar(fill = "steelblue") +
scale_x_discrete(labels = c("1" = "Excellent", "2" = "Very good",
"3" = "Good", "4" = "Fair", "5" = "Poor")) +
labs(
title = "Self-Assessed Health",
subtitle = "Canadian Housing Survey, 2021",
x = "Health Rating",
y = "Number of Households"
) +
theme_minimal()
What’s happening in this code?
- filter(GH_05 != 9) removes “not stated” responses
- ggplot(aes(x = factor(GH_05))) sets up the plot with GH_05 on the x-axis
- factor() tells R to treat the numbers as categories
- geom_bar() creates a bar chart
- scale_x_discrete(labels = ...) replaces the numeric codes with readable labels
- labs() adds titles and labels
- theme_minimal() gives us a clean look

For numerical variables, we use geom_histogram(). Let’s look at life satisfaction (PLIS_05), which asks respondents to rate their satisfaction on a 0-10 scale.
From the codebook (PLIS_05 - Life satisfaction):
The original question asked: “Using a scale of 0 to 10, where 0 means ‘Very dissatisfied’ and 10 means ‘Very satisfied’, how do you feel about your life as a whole right now?”
For the public use file, responses are grouped into categories:

- 1 = 0 to 2
- 2 = 3
- 3 = 4
- 4 = 5
- 5 = 6
- 6 = 7
- 7 = 8
- 8 = 9
- 9 = 10
- 99 = Not stated
Note: This grouping is common in public use microdata files (PUMFs) to protect respondent confidentiality. We’re working with binned data, not the original 0-10 responses.
chs %>%
filter(PLIS_05 != 99) %>% # Remove "Not stated"
ggplot(aes(x = PLIS_05)) +
geom_histogram(binwidth = 1, fill = "steelblue", color = "white") +
scale_x_continuous(breaks = 1:9,
labels = c("0-2", "3", "4", "5", "6", "7", "8", "9", "10")) +
labs(
title = "Life Satisfaction",
subtitle = "Canadian Housing Survey, 2021",
x = "Satisfaction Level (0 = Very Dissatisfied, 10 = Very Satisfied)",
y = "Number of Households"
) +
theme_minimal()
What’s different from a bar chart?
- geom_histogram() is for continuous/numerical data
- binwidth = 1 means each bar spans 1 unit on the scale

Look at both visualizations above.
- Describe the shape of the life satisfaction distribution. Is it symmetric? Skewed? Where is the peak?
- What does the shape tell you about how Canadians rate their life satisfaction?
One of the key lessons from lecture: when the mean and median differ, that tells you something about the shape of the distribution.
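A tiny made-up example shows why (hypothetical numbers, not CHS data):

x <- c(30, 35, 40, 45, 200) # one extreme value among five
mean(x)   # 70: pulled upward by the single 200
median(x) # 40: the middle value, unaffected by how extreme 200 is

When the mean sits above the median like this, the distribution has a right tail.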
The variable PDV_SHCO records monthly shelter costs in
dollars. Unlike the grouped variables we just saw, this one is truly
continuous.
From the codebook (PDV_SHCO - Shelter costs): monthly shelter costs in dollars, where the code 9999999.99 means “Not stated.”
Let’s first calculate the mean and median:
chs %>%
filter(PDV_SHCO < 9000) %>% # Filter out "not stated" (coded as 9999999.99)
summarize(
mean_cost = mean(PDV_SHCO),
median_cost = median(PDV_SHCO),
min_cost = min(PDV_SHCO),
max_cost = max(PDV_SHCO),
n = n()
)
## # A tibble: 1 × 5
## mean_cost median_cost min_cost max_cost n
## <dbl> <dbl> <dbl> <dbl> <int>
## 1 1117. 900 0 8900 31876
Notice anything? The mean is higher than the median. What does that suggest about the shape?
chs %>%
filter(PDV_SHCO < 9000) %>% # Remove "not stated"
ggplot(aes(x = PDV_SHCO)) +
geom_histogram(binwidth = 200, fill = "steelblue", color = "white") +
geom_vline(aes(xintercept = mean(PDV_SHCO)),
color = "red", linetype = "dashed", linewidth = 1) +
geom_vline(aes(xintercept = median(PDV_SHCO)),
color = "darkgreen", linewidth = 1) +
labs(
title = "Monthly Shelter Costs",
subtitle = "Red dashed = Mean | Green solid = Median",
x = "Monthly Shelter Cost ($)",
y = "Number of Households"
) +
theme_minimal()
The red dashed line shows the mean; the green solid line shows the median.
- Is this distribution symmetric, left-skewed, or right-skewed?
- Why is the mean higher than the median? What’s “pulling” the mean in that direction?
- If a news headline said “Average Canadian household spends $X on shelter,” would that accurately represent the typical household? Why or why not?
Key insight: This is why we always report both measures for skewed distributions — and why “median household income” is preferred over “mean household income” in most socio-economic reporting.
Two distributions can have the same mean but tell completely different stories. What distinguishes them is spread — how much values vary from the center.
The standard deviation (SD) measures the typical distance of values from the mean.
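Here is a minimal sketch of what sd() computes, using a small made-up vector rather than CHS data:

# Hypothetical values, just to illustrate the formula behind sd()
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
deviations <- x - mean(x)                  # distance of each value from the mean
sqrt(sum(deviations^2) / (length(x) - 1))  # sample SD "by hand": about 2.14
sd(x)                                      # the built-in function gives the same answer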
Let’s calculate the SD for shelter costs:
chs %>%
filter(PDV_SHCO < 9000) %>% # Remove "not stated"
summarize(
mean_cost = mean(PDV_SHCO),
median_cost = median(PDV_SHCO),
sd_cost = sd(PDV_SHCO),
n = n()
)
## # A tibble: 1 × 4
## mean_cost median_cost sd_cost n
## <dbl> <dbl> <dbl> <int>
## 1 1117. 900 813. 31876
Reading this output:
- sd() calculates the standard deviation
- An SD of about $813 means a typical household’s shelter cost sits roughly $800 away from the mean of $1,117

Does shelter cost vary more for owners or renters? Let’s find out using group_by():
From the codebook (PDCT_05 - Tenure): 1 = Owned, 2 = Rented. Other codes (such as “Not stated”) are removed in the filter below.
chs %>%
filter(PDV_SHCO < 9000, PDCT_05 %in% c(1, 2)) %>% # Remove "not stated" for both variables
mutate(tenure = ifelse(PDCT_05 == 1, "Owner", "Renter")) %>%
group_by(tenure) %>%
summarize(
mean_cost = mean(PDV_SHCO),
median_cost = median(PDV_SHCO),
sd_cost = sd(PDV_SHCO),
n = n()
)
## # A tibble: 2 × 5
## tenure mean_cost median_cost sd_cost n
## <chr> <dbl> <dbl> <dbl> <int>
## 1 Owner 1273. 1000 961. 17972
## 2 Renter 916. 800 502. 13904
What’s new here?
- mutate() creates a new variable tenure with readable labels
- group_by(tenure) calculates statistics separately for owners and renters

chs %>%
filter(PDV_SHCO < 9000, PDCT_05 %in% c(1, 2)) %>% # Remove "not stated"
mutate(tenure = ifelse(PDCT_05 == 1, "Owner", "Renter")) %>%
ggplot(aes(x = PDV_SHCO, fill = tenure)) +
geom_histogram(binwidth = 200, color = "white", alpha = 0.9) +
facet_wrap(~tenure, ncol = 1) +
scale_fill_manual(values = c("Owner" = "#E27D60", "Renter" = "#41B3A3")) +
labs(
title = "Monthly Shelter Costs by Tenure",
x = "Monthly Shelter Cost ($)",
y = "Number of Households"
) +
theme_minimal() +
theme(legend.position = "none")
What’s new here?
- facet_wrap(~tenure, ncol = 1) creates separate panels for each group
- scale_fill_manual() lets us choose specific colors for each group

Look at both the table and the visualization above.
- Which group (owners or renters) has higher average shelter costs?
- Which group has more variation in shelter costs (higher SD)?
- What might explain the difference in spread between owners and renters? Think about what makes up shelter costs for each group.
- If you only reported the mean shelter cost for each group, what would you miss about how these groups differ?
Owners have higher average shelter costs: a mean of about $1,273 vs. $916 for renters, and a median of $1,000 vs. $800.

Owners have much more variation (higher SD): about $961 for owners vs. $502 for renters.

What explains the difference in spread?

For owners, shelter costs include: mortgage payments (which range from zero for mortgage-free owners to very large for recent buyers), plus property taxes, condo fees, and utilities.

For renters, shelter costs include: rent and, in some cases, utilities, a much narrower set of components.

Why owners have more spread: owner situations vary enormously, from mortgage-free households paying only taxes and utilities to recent buyers with large monthly payments, while rents cluster within a narrower market range.

What you’d miss by only reporting the mean: the two groups differ not just in typical cost but in how much costs vary; the mean alone hides that renters are a much more homogeneous group.
During the lecture, we learned that survey samples don’t always perfectly represent the population. Some groups get oversampled (more in sample than in population) to ensure enough cases for analysis.
Even with careful sampling, some groups end up:
This happens because of:
Survey weights adjust for these imbalances. Each household gets a
weight (PFWEIGHT) based on how many households it
represents in the population.
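A toy example may make the logic concrete (entirely made-up weights, three households):

# Hypothetical mini-sample: one owner representing 500 households,
# two renters representing 200 and 100 households
toy <- tibble(
  tenure = c("Owner", "Renter", "Renter"),
  weight = c(500, 200, 100)
)

toy %>%
  group_by(tenure) %>%
  summarize(n = n(), n_weighted = sum(weight)) %>%
  mutate(
    pct_unweighted = n / sum(n) * 100,                 # owners: 1/3 = 33.3%
    pct_weighted = n_weighted / sum(n_weighted) * 100  # owners: 500/800 = 62.5%
  )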
Let’s replicate what we saw in lecture. First, the unweighted percentages (raw sample counts):
chs %>%
filter(PDCT_05 %in% c(1, 2)) %>% # Keep only owners and renters
count(PDCT_05) %>%
mutate(
tenure = ifelse(PDCT_05 == 1, "Owner", "Renter"),
pct_unweighted = round(n / sum(n) * 100, 1)
) %>%
dplyr::select(tenure, n, pct_unweighted)
## # A tibble: 2 × 3
## tenure n pct_unweighted
## <chr> <int> <dbl>
## 1 Owner 20821 52.6
## 2 Renter 18779 47.4
This is what we calculated in Week 2. But is this what Canada actually looks like?
The variable PFWEIGHT contains the survey weights. We
sum the weights to get population estimates:
chs %>%
filter(PDCT_05 %in% c(1, 2)) %>% # Keep only owners and renters
mutate(tenure = ifelse(PDCT_05 == 1, "Owner", "Renter")) %>%
group_by(tenure) %>%
summarize(
n_unweighted = n(),
n_weighted = sum(PFWEIGHT)
) %>%
mutate(
pct_unweighted = round(n_unweighted / sum(n_unweighted) * 100, 1),
pct_weighted = round(n_weighted / sum(n_weighted) * 100, 1)
)
## # A tibble: 2 × 5
## tenure n_unweighted n_weighted pct_unweighted pct_weighted
## <chr> <int> <dbl> <dbl> <dbl>
## 1 Owner 20821 10086258. 52.6 67.9
## 2 Renter 18779 4768680. 47.4 32.1
What’s happening here?
- n() counts the raw number of households in the sample
- sum(PFWEIGHT) sums the weights, estimating the population count

Here’s a useful feature of the CHS codebook: it reports both Frequency (unweighted count) and Weighted Frequency for every variable. For PDCT_05:
| Category | Code | Frequency (n) | Weighted Frequency | Weighted % |
|---|---|---|---|---|
| Owned | 1 | 20,821 | 10,086,258 | 67.9% |
| Rented | 2 | 18,779 | 4,768,680 | 32.1% |
This tells you immediately that the weighted percentage of owners (67.9%) is much higher than the unweighted percentage (52.6%).
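You can verify the weighted percentage by hand from the table above:

# Weighted % of owners = weighted owner count / total weighted count
10086258 / (10086258 + 4768680) * 100 # about 67.9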
The CHS also oversampled Atlantic Canada. Let’s see this:
chs %>%
mutate(region = case_when(
REGION == "01" | REGION == 1 ~ "Atlantic",
REGION == "02" | REGION == 2 ~ "Quebec",
REGION == "03" | REGION == 3 ~ "Ontario",
REGION == "04" | REGION == 4 ~ "Prairies",
REGION == "05" | REGION == 5 ~ "British Columbia",
REGION == "06" | REGION == 6 ~ "Territories",
TRUE ~ NA_character_
)) %>%
filter(!is.na(region)) %>%
group_by(region) %>%
summarize(
n_unweighted = n(),
n_weighted = sum(PFWEIGHT)
) %>%
mutate(
pct_unweighted = round(n_unweighted / sum(n_unweighted) * 100, 1),
pct_weighted = round(n_weighted / sum(n_weighted) * 100, 1),
difference = pct_unweighted - pct_weighted
)
## # A tibble: 6 × 6
## region n_unweighted n_weighted pct_unweighted pct_weighted difference
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 Atlantic 11263 1043174. 27.5 6.9 20.6
## 2 British Columb… 4197 2077140. 10.2 13.8 -3.6
## 3 Ontario 7333 5694396. 17.9 37.7 -19.8
## 4 Prairies 11434 2551913. 27.9 16.9 11
## 5 Quebec 5973 3701158. 14.6 24.5 -9.9
## 6 Territories 788 25934. 1.9 0.2 1.7
Look at Atlantic Canada: It makes up about 27% of the unweighted sample but only about 7% of the weighted population estimate. That’s deliberate oversampling to ensure enough cases for regional analysis.
| Purpose | Use |
|---|---|
| Describing what Canada looks like | Weighted percentages |
| Describing your sample | Unweighted counts |
| Calculating population estimates | Weighted |
| Reporting how many respondents you analyzed | Unweighted |
- What is the unweighted percentage of homeowners in the sample?
- What is the weighted percentage (the population estimate)?
- Why is there such a large difference (~15 percentage points)?
- If you were writing a report about Canadian homeownership rates, which percentage should you use? Why?
- Why might the CHS have deliberately oversampled Atlantic Canada?
Unweighted percentage of homeowners: 52.6% (20,821 out of 39,600 valid responses)
Weighted percentage of homeowners: 67.9% (representing approximately 10.1 million owner households out of 14.9 million total households)
Why the ~15 percentage point difference: renters are overrepresented in the sample relative to their share of the population (47.4% of the sample vs. 32.1% of households), which pulls the unweighted owner percentage down. The weights correct for this.

Which percentage for a report on Canadian homeownership: the weighted one (67.9%), because it estimates what Canada actually looks like. The unweighted figure only describes who happens to be in the sample.

Why oversample Atlantic Canada: to ensure enough cases for reliable regional analysis. With strictly proportional sampling, the smaller Atlantic provinces would yield too few respondents to analyze on their own.
Methodological note: This is a standard survey design strategy called “stratified sampling with disproportionate allocation.” The key is that as long as you have proper weights, you can both (1) make reliable population estimates AND (2) analyze subgroups with adequate sample sizes.
A key skill in sociological quantitative reasoning is comparing how a variable is distributed across different (notably social) groups. This connects to our discussion of polarization: groups can have the same overall mean but very different distributions.
Let’s compare economic hardship (EHA_10) across
regions.
From the codebook (EHA_10 - Economic hardship):
Question: “In the past 12 months, how difficult or easy was it for your household to meet its financial needs in terms of transportation, housing, food, clothing and other necessary expenses?” Responses: 1 = Very difficult, 2 = Difficult, 3 = Neither difficult nor easy, 4 = Easy, 5 = Very easy; 9 = Not stated.
First, let’s calculate summary statistics by region:
chs %>%
filter(EHA_10 != 9) %>% # Remove "Not stated"
mutate(region = case_when(
REGION == "01" | REGION == 1 ~ "Atlantic",
REGION == "02" | REGION == 2 ~ "Quebec",
REGION == "03" | REGION == 3 ~ "Ontario",
REGION == "04" | REGION == 4 ~ "Prairies",
REGION == "05" | REGION == 5 ~ "British Columbia",
REGION == "06" | REGION == 6 ~ "Territories",
TRUE ~ NA_character_
)) %>%
filter(!is.na(region)) %>%
group_by(region) %>%
summarize(
mean_hardship = round(mean(EHA_10), 2),
sd_hardship = round(sd(EHA_10), 2),
pct_difficult = round(sum(EHA_10 %in% c(1, 2)) / n() * 100, 1),
n = n()
)
## # A tibble: 6 × 5
## region mean_hardship sd_hardship pct_difficult n
## <chr> <dbl> <dbl> <dbl> <int>
## 1 Atlantic 3.19 1.02 23.1 11240
## 2 British Columbia 3.23 1.02 21.8 4190
## 3 Ontario 3.18 1.07 24.6 7316
## 4 Prairies 3.17 1.03 23.7 11405
## 5 Quebec 3.47 1.03 15.9 5958
## 6 Territories 3.42 1.04 17.8 788
What’s new here?
- case_when() lets us recode multiple values at once with readable labels
- pct_difficult calculates the percentage reporting “Very difficult” or “Difficult”

chs %>%
filter(EHA_10 != 9) %>% # Remove "Not stated"
mutate(region = case_when(
REGION == "01" | REGION == 1 ~ "Atlantic",
REGION == "02" | REGION == 2 ~ "Quebec",
REGION == "03" | REGION == 3 ~ "Ontario",
REGION == "04" | REGION == 4 ~ "Prairies",
REGION == "05" | REGION == 5 ~ "British Columbia",
REGION == "06" | REGION == 6 ~ "Territories",
TRUE ~ NA_character_
)) %>%
filter(!is.na(region)) %>%
ggplot(aes(x = factor(EHA_10), fill = region)) +
geom_bar(show.legend = FALSE) +
facet_wrap(~region) +
scale_x_discrete(labels = c("1" = "Very\ndifficult", "2" = "Difficult",
"3" = "Neither", "4" = "Easy", "5" = "Very\neasy")) +
scale_fill_brewer(palette = "Set1") +
labs(
title = "Economic Hardship by Region",
subtitle = "Unweighted sample counts — note differences in sample sizes across regions",
x = "Level of Difficulty Meeting Financial Needs",
y = "Number of Households"
) +
theme_minimal() +
theme(axis.text.x = element_text(size = 7))
What’s new here?
- facet_wrap(~region) creates a separate panel for each region
- scale_fill_brewer(palette = "Set1") uses a nice color palette

Look at both the table and the visualization.
- Which region has the highest percentage reporting economic hardship (codes 1 or 2)?
- Which region has the lowest?
- Look at the sample sizes (n). Why does the Territories panel look so different from the others? What does this tell you about how confident we can be in comparisons involving the Territories?
- Look at the bar heights across regions. Atlantic and Prairies have the tallest bars overall. Does this mean these regions have more economic hardship? Or is something else going on? (Hint: Think about what we learned about oversampling.)
- geom_bar(position = "fill") would show proportions instead of counts, making regional comparisons easier (see the sketch after the key insight below)

Key insight: When sample sizes differ across groups (especially due to oversampling), raw counts can be misleading. Always check percentages or proportions for fair comparisons.
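Here is what that alternative could look like, as a sketch: it reuses the region recode from the chunk above and changes only the geom_bar() call.

chs %>%
  filter(EHA_10 != 9) %>% # Remove "Not stated"
  mutate(region = case_when(
    REGION == "01" | REGION == 1 ~ "Atlantic",
    REGION == "02" | REGION == 2 ~ "Quebec",
    REGION == "03" | REGION == 3 ~ "Ontario",
    REGION == "04" | REGION == 4 ~ "Prairies",
    REGION == "05" | REGION == 5 ~ "British Columbia",
    REGION == "06" | REGION == 6 ~ "Territories",
    TRUE ~ NA_character_
  )) %>%
  filter(!is.na(region)) %>%
  ggplot(aes(x = region, fill = factor(EHA_10))) +
  geom_bar(position = "fill") + # stack bars to a height of 1, i.e. proportions
  labs(
    title = "Economic Hardship by Region (Proportions)",
    x = NULL,
    y = "Proportion of Households",
    fill = "EHA_10 code"
  ) +
  theme_minimal()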
This section asks you to apply what you’ve learned to a new variable. The tasks below are similar to what you’ve done in earlier parts — look back at those sections for guidance.
The variable PCOS_05 measures sense of belonging to the
local community.
From the codebook (PCOS_05 - Community satisfaction):
Question: “Using a scale of 0 to 10, where 0 means ‘Very dissatisfied’ and 10 means ‘Very satisfied’, how satisfied are you with feeling as part of your community?”
Like life satisfaction, responses are grouped in the PUMF: codes 1 through 9 represent the ratings 0-2, 3, 4, 5, 6, 7, 8, 9, and 10, and 99 means “Not stated.”
Calculate summary statistics (mean, median, SD) for community belonging.
Hint: Look at the beginning of Part 2 where we
calculated these for shelter costs. You’ll need to filter out code 99,
then use summarize() with mean(),
median(), and sd().
# ANSWER KEY - Task 1
chs %>%
filter(PCOS_05 != 99) %>% # Remove "Not stated"
summarize(
mean_belonging = mean(PCOS_05),
median_belonging = median(PCOS_05),
sd_belonging = sd(PCOS_05),
min_belonging = min(PCOS_05),
max_belonging = max(PCOS_05),
n = n()
)
## # A tibble: 1 × 6
## mean_belonging median_belonging sd_belonging min_belonging max_belonging n
## <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 6.17 7 2.19 1 9 40530
Reading the output:

- The mean (6.17) is below the median (7), a sign of left skew: a tail of low scores pulls the mean down.
- The SD (2.19) shows substantial variation around the center.
- Remember these are grouped codes (1 = “0-2” through 9 = “10”), not raw 0-10 ratings.
Create a histogram of community belonging.
Hint: Look at the life satisfaction histogram in Part 1. The code structure is very similar — you just need to change the variable name and the filter condition.
# ANSWER KEY - Task 2
chs %>%
filter(PCOS_05 != 99) %>% # Remove "Not stated"
ggplot(aes(x = PCOS_05)) +
geom_histogram(binwidth = 1, fill = "steelblue", color = "white") +
scale_x_continuous(breaks = 1:9,
labels = c("0-2", "3", "4", "5", "6", "7", "8", "9", "10")) +
labs(
title = "Community Belonging",
subtitle = "Canadian Housing Survey, 2021",
x = "Satisfaction with Community Belonging (0 = Very Dissatisfied, 10 = Very Satisfied)",
y = "Number of Households"
) +
theme_minimal()
Compare community belonging between owners and renters.
Hint: Look at Part 3 where we compared shelter costs
by tenure. You’ll use the same approach: filter both variables, create a
tenure label with mutate(), use group_by(),
then summarize().
# ANSWER KEY - Task 3
chs %>%
filter(PCOS_05 != 99, PDCT_05 %in% c(1, 2)) %>% # Remove "not stated" for both
mutate(tenure = ifelse(PDCT_05 == 1, "Owner", "Renter")) %>%
group_by(tenure) %>%
summarize(
mean_belonging = round(mean(PCOS_05), 2),
median_belonging = median(PCOS_05),
sd_belonging = round(sd(PCOS_05), 2),
n = n()
)
## # A tibble: 2 × 5
## tenure mean_belonging median_belonging sd_belonging n
## <chr> <dbl> <dbl> <dbl> <int>
## 1 Owner 6.47 7 2.04 20650
## 2 Renter 5.86 6 2.31 18500
After completing the tasks above:
- Describe the shape of the community belonging distribution. Is it symmetric or skewed?
- How does the mean compare to the median? What does this tell you about the shape?
- Do owners and renters differ in their sense of community belonging? Describe any differences in both center (mean/median) and spread (SD).
- Big picture question: If a policymaker only looked at the average community belonging score, what important information might they miss?
Shape of the distribution: left-skewed. Most households report fairly high belonging, with a longer tail of low scores.

Mean vs. median comparison: the mean (6.17) is below the median (7), which is exactly what a left-skewed distribution produces.

Owners vs. renters comparison:

| Measure | Owners | Renters | Difference |
|---|---|---|---|
| Mean | 6.47 | 5.86 | Owners ~0.6 higher |
| Median | 7 | 6 | Owners 1 higher |
| SD | 2.04 | 2.31 | Renters more variable |
| n | 20,650 | 18,500 | — |
Key findings: owners report a higher sense of community belonging on average (in both mean and median), and renters’ responses are more spread out (higher SD).

What a policymaker might miss: the average alone hides the sizeable group reporting low belonging, and it hides that owners and renters differ in both level and variability.
Broader insight: Community belonging matters for health, civic engagement, and quality of life. Understanding not just the average, but the full distribution and how it varies across groups, is essential.
Take a few minutes to reflect on this practice session:
- What was the most important concept you learned or reinforced today?
- What’s one thing that confused you or that you’d like to discuss in class?
- How does visualizing distributions change the way you think about summary statistics like the mean?
| Function | What it does | Example |
|---|---|---|
| ggplot() | Initialize a plot | ggplot(data, aes(x = var)) |
| geom_bar() | Bar chart for categorical data | geom_bar(fill = "steelblue") |
| geom_histogram() | Histogram for numerical data | geom_histogram(binwidth = 1) |
| geom_vline() | Add vertical line | geom_vline(xintercept = 50) |
| facet_wrap() | Separate panels by group | facet_wrap(~group) |
| scale_x_continuous() | Customize x-axis for continuous data | scale_x_continuous(breaks = seq(0, 10, 2)) |
| scale_x_discrete() | Customize x-axis for categorical data | scale_x_discrete(labels = c(...)) |
| scale_fill_manual() | Set custom colors | scale_fill_manual(values = c("A" = "red")) |
| sd() | Calculate standard deviation | sd(variable) |
| case_when() | Recode multiple values | case_when(x == 1 ~ "A", x == 2 ~ "B") |
| sum(weights) | Weighted count | sum(PFWEIGHT) |
Always visualize — summary statistics can hide important patterns (remember the Datasaurus!)
Shape matters — symmetric, left-skewed, and right-skewed distributions tell different stories
Mean vs. median — when they differ, the distribution is skewed; the median often better represents the “typical” case
Spread is telling — standard deviation tells you how much variation exists; two groups can have the same mean but very different SDs
Filter first — always check the codebook and remove special codes (like 9, 99, 9999999.99) before calculating statistics
Weighted vs. unweighted — for population estimates, use weights; for describing your sample, unweighted counts are appropriate
Sample sizes matter — small samples (like the Territories) mean less certainty; oversampled regions (like Atlantic Canada) have more observations than their population share
Remember: The goal is not perfect code, but building intuition for what distributions reveal. If something confuses you, write it down — that’s valuable information for class discussion.