Before we dive into the data, let’s make sure you’re set up to run this document and publish your work.
If you haven’t already installed the tidyverse package, you’ll need to do so first, for instance by placing the following in a code box. To create a new code box, click the +C button in the toolbar (to the left of Run) and then click R.
install.packages("tidyverse")
You only need to install packages once. After that, you just load
them with library().
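For example, near the top of every document or session:

library(tidyverse)  # load once per session; gives you dplyr, ggplot2, readr, and more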
Note: You do NOT need to install knitr
or rmarkdown separately — RStudio handles these
automatically.
Work through this document chunk by chunk, not all at once. For each gray code box:

- Click anywhere inside the chunk
- Press Ctrl+Enter (Windows) or Cmd+Enter (Mac), or click the green “Run” arrow at the top right of the chunk

After running each chunk: check the output, then interpret what it tells you before moving on.

Important: Run chunks in order from top to bottom. Later chunks depend on earlier ones!
“Knitting” means converting your .Rmd file into a final HTML document. When you knit, R starts a fresh session, runs every chunk in order from top to bottom, and assembles the results into an HTML file. To knit, click the Knit button or press Ctrl+Shift+K (Windows) or Cmd+Shift+K (Mac). The first time you knit, RStudio may prompt you to install additional packages. Click “Yes” to install them. This only happens once.
| Error Message | What It Means | How to Fix |
|---|---|---|
| “Object not found” | You’re using something that wasn’t created | Make sure earlier chunks ran successfully; check for typos |
| “Could not find function” | A package isn’t loaded | Add library(tidyverse) at the top |
| “File not found” | R can’t find your data file | Check the file path; make sure the file is in the right folder |
| “Unexpected symbol” | Syntax error in your code | Look for missing commas, parentheses, or quotes |
Remember: When you knit, R starts fresh. Even if a chunk worked when you ran it interactively, knitting might fail if the chunk depends on something you ran in the Console but didn’t include in the document.
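A hypothetical illustration of this pitfall (made-up workflow, not a step to follow):

# Typed in the Console only, never added to a chunk in the .Rmd:
# chs <- read_csv("data/CHS2021ECL_PUMF.csv")

# This chunk then works interactively, because chs exists in your session,
# but knitting fails with "object 'chs' not found": the fresh session that
# knitting starts never ran the read_csv() line.
chs %>% count(GH_05)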
RPubs is a free service for sharing R Markdown documents online. To publish:

1. Knit your document to HTML.
2. Click the Publish button at the top right of the preview window and choose RPubs.
3. Sign in (or create a free account), then give your document a title and a short slug.

Your document will now be live at rpubs.com/yourusername/yourslug. You can share this link with anyone!
Note: RPubs documents are public. Don’t include any personal information you wouldn’t want others to see.
This practice document builds on your skills from Week 2 by introducing visualization and measures of spread.
The core lesson from this week’s lecture: a single summary statistic hides as much as it reveals. To understand data, you need to see its shape.
We’ll also deepen our understanding of survey weights — why they matter and when to use them.
Work through each section, running the code chunks and answering the journal prompts in your course journal. Remember the cycle: run, check, interpret.
In Week 2, you learned an important lesson: survey data often contains special codes for missing or invalid responses. Remember the “household of 99 people” error?
The standard workflow for CHS variables: check the codebook for the variable’s special codes, filter those codes out, then calculate. Throughout this document, you’ll see this pattern:
chs %>%
  filter(VARIABLE != 9) %>%   # Remove "Not stated" (code 9)
  ...
The comment explains what’s being removed and why. Always check the codebook first to know which codes to filter.
In Week 2, we used count() to make frequency tables.
Tables are useful, but they don’t show us the shape of
a distribution. This week, we learn to visualize.
# Load the tidyverse package (includes ggplot2 for visualization)
library(tidyverse)
# Load the Canadian Housing Survey data
chs <- read_csv("data/CHS2021ECL_PUMF.csv")
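After loading, it’s worth a quick sanity check that the file read in correctly (glimpse() comes with the tidyverse):

# Peek at the data: every column with its type and first few values
glimpse(chs)
nrow(chs)  # the number of households in the sample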
Remember the Datasaurus Dozen from lecture? A dozen datasets had nearly identical means, but completely different patterns. The only way to see those patterns was to plot the data.
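If you want to see this for yourself, the datasauRus package (optional, not needed for this course) bundles those datasets:

# Optional: install once with install.packages("datasauRus")
library(datasauRus)
library(tidyverse)

# Nearly identical summary statistics across wildly different shapes
datasaurus_dozen %>%
  group_by(dataset) %>%
  summarize(mean_x = mean(x), mean_y = mean(y), sd_x = sd(x), sd_y = sd(y))

# Plotting reveals what the summaries hide
ggplot(datasaurus_dozen, aes(x = x, y = y)) +
  geom_point(size = 0.5) +
  facet_wrap(~dataset)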
Let’s start with a variable we explored last week:
self-assessed health (GH_05).
# From codebook: GH_05 (Self-assessed health)
# 1 = Excellent, 2 = Very good, 3 = Good, 4 = Fair, 5 = Poor
# 9 = Not stated
chs %>%
filter(GH_05 != 9) %>% # Remove "Not stated"
count(GH_05) %>%
mutate(percent = round(n / sum(n) * 100, 1))
## # A tibble: 5 × 3
## GH_05 n percent
## <dbl> <int> <dbl>
## 1 1 5428 13.3
## 2 2 13124 32.2
## 3 3 13955 34.2
## 4 4 6208 15.2
## 5 5 2068 5.1
This tells us the counts and percentages. But what does the distribution look like?
For categorical variables, we use
geom_bar().
chs %>%
filter(GH_05 != 9) %>% # Remove "Not stated"
ggplot(aes(x = factor(GH_05))) +
geom_bar(fill = "steelblue") +
scale_x_discrete(labels = c("1" = "Excellent", "2" = "Very good",
"3" = "Good", "4" = "Fair", "5" = "Poor")) +
labs(
title = "Self-Assessed Health",
subtitle = "Canadian Housing Survey, 2021",
x = "Health Rating",
y = "Number of Households"
) +
theme_minimal()
What’s happening in this code?

- filter(GH_05 != 9) removes “Not stated” responses
- ggplot(aes(x = factor(GH_05))) sets up the plot with GH_05 on the x-axis
- factor() tells R to treat the numbers as categories
- geom_bar() creates a bar chart
- scale_x_discrete(labels = ...) replaces the numeric codes with readable labels
- labs() adds titles and labels
- theme_minimal() gives us a clean look

For numerical variables, we use geom_histogram(). Let’s look at life satisfaction (PLIS_05), which asks respondents to rate their satisfaction on a 0-10 scale.
From the codebook (PLIS_05 - Life satisfaction):
The original question asked: “Using a scale of 0 to 10, where 0 means ‘Very dissatisfied’ and 10 means ‘Very satisfied’, how do you feel about your life as a whole right now?”
For the public use file, responses are grouped into categories: codes 1 through 9 stand for the responses 0-2, 3, 4, 5, 6, 7, 8, 9, and 10, and code 99 means “Not stated” (you can see this mapping in the axis labels of the histogram below).

Note: This grouping is common in public use microdata files (PUMFs) to protect respondent confidentiality. We’re working with binned data, not the original 0-10 responses.
chs %>%
filter(PLIS_05 != 99) %>% # Remove "Not stated"
ggplot(aes(x = PLIS_05)) +
geom_histogram(binwidth = 1, fill = "steelblue", color = "white") +
scale_x_continuous(breaks = 1:9,
labels = c("0-2", "3", "4", "5", "6", "7", "8", "9", "10")) +
labs(
title = "Life Satisfaction",
subtitle = "Canadian Housing Survey, 2021",
x = "Satisfaction Level (0 = Very Dissatisfied, 10 = Very Satisfied)",
y = "Number of Households"
) +
theme_minimal()
What’s different from a bar chart?

- geom_histogram() is for continuous/numerical data
- binwidth = 1 means each bar spans 1 unit on the scale (see the optional experiment below)

Look at both visualizations above.
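Binwidth is a choice, and it changes what you see. As an optional experiment, try re-running the histogram with a wider bin and compare the shapes:

chs %>%
  filter(PLIS_05 != 99) %>%  # Remove "Not stated"
  ggplot(aes(x = PLIS_05)) +
  geom_histogram(binwidth = 2, fill = "steelblue", color = "white") +
  labs(title = "Life Satisfaction, binwidth = 2")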
One of the key lessons from lecture: when the mean and median differ, that tells you something about the shape of the distribution.
The variable PDV_SHCO records monthly shelter costs in
dollars. Unlike the grouped variables we just saw, this one is truly
continuous.
From the codebook (PDV_SHCO - Shelter costs): the household’s monthly shelter cost in dollars, with 9999999.99 meaning “Not stated.”
Let’s first calculate the mean and median:
chs %>%
filter(PDV_SHCO < 9000) %>% # Filter out "not stated" (coded as 9999999.99)
summarize(
mean_cost = mean(PDV_SHCO),
median_cost = median(PDV_SHCO),
min_cost = min(PDV_SHCO),
max_cost = max(PDV_SHCO),
n = n()
)
## # A tibble: 1 × 5
## mean_cost median_cost min_cost max_cost n
## <dbl> <dbl> <dbl> <dbl> <int>
## 1 1117. 900 0 8900 31876
Notice anything? The mean is higher than the median. What does that suggest about the shape?
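A toy example (made-up numbers) shows why a few large values pull the mean above the median:

# Five hypothetical households, one with very high shelter costs
costs <- c(500, 700, 900, 1100, 5000)
mean(costs)    # 1640: dragged upward by the single large value
median(costs)  # 900: the middle value, unmoved by how extreme the top is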
chs %>%
filter(PDV_SHCO < 9000) %>% # Remove "not stated"
ggplot(aes(x = PDV_SHCO)) +
geom_histogram(binwidth = 200, fill = "steelblue", color = "white") +
geom_vline(aes(xintercept = mean(PDV_SHCO)),
color = "red", linetype = "dashed", linewidth = 1) +
geom_vline(aes(xintercept = median(PDV_SHCO)),
color = "darkgreen", linewidth = 1) +
labs(
title = "Monthly Shelter Costs",
subtitle = "Red dashed = Mean | Green solid = Median",
x = "Monthly Shelter Cost ($)",
y = "Number of Households"
) +
theme_minimal()
The red dashed line shows the mean; the green solid line shows the median.
Two distributions can have the same mean but tell completely different stories. What distinguishes them is spread — how much values vary from the center.
The standard deviation (SD) measures the typical distance of values from the mean.
Let’s calculate the SD for shelter costs:
chs %>%
filter(PDV_SHCO < 9000) %>% # Remove "not stated"
summarize(
mean_cost = mean(PDV_SHCO),
median_cost = median(PDV_SHCO),
sd_cost = sd(PDV_SHCO),
n = n()
)
## # A tibble: 1 × 4
## mean_cost median_cost sd_cost n
## <dbl> <dbl> <dbl> <int>
## 1 1117. 900 813. 31876
Reading this output:

- sd() calculates the standard deviation
- An SD of about $813 means a typical household’s shelter cost falls roughly $813 above or below the mean of about $1,117 (a sketch of the underlying formula follows below)
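If you’re curious how sd() connects to “typical distance from the mean,” here is a minimal sketch with made-up numbers (R divides by n - 1, the sample formula):

x <- c(2, 4, 4, 4, 5, 5, 7, 9)
sd(x)                                         # about 2.14
sqrt(sum((x - mean(x))^2) / (length(x) - 1))  # the same value, computed by hand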
Does shelter cost vary more for owners or renters? Let’s find out using group_by():
From the codebook (PDCT_05 - Tenure): 1 = Owned, 2 = Rented; any other codes are excluded by the filter below.
chs %>%
filter(PDV_SHCO < 9000, PDCT_05 %in% c(1, 2)) %>% # Remove "not stated" for both variables
mutate(tenure = ifelse(PDCT_05 == 1, "Owner", "Renter")) %>%
group_by(tenure) %>%
summarize(
mean_cost = mean(PDV_SHCO),
median_cost = median(PDV_SHCO),
sd_cost = sd(PDV_SHCO),
n = n()
)
## # A tibble: 2 × 5
## tenure mean_cost median_cost sd_cost n
## <chr> <dbl> <dbl> <dbl> <int>
## 1 Owner 1273. 1000 961. 17972
## 2 Renter 916. 800 502. 13904
What’s new here?

- mutate() creates a new variable tenure with readable labels
- group_by(tenure) calculates statistics separately for owners and renters

Now let’s visualize the two distributions side by side:

chs %>%
filter(PDV_SHCO < 9000, PDCT_05 %in% c(1, 2)) %>% # Remove "not stated"
mutate(tenure = ifelse(PDCT_05 == 1, "Owner", "Renter")) %>%
ggplot(aes(x = PDV_SHCO, fill = tenure)) +
geom_histogram(binwidth = 200, color = "white", alpha = 0.9) +
facet_wrap(~tenure, ncol = 1) +
scale_fill_manual(values = c("Owner" = "#E27D60", "Renter" = "#41B3A3")) +
labs(
title = "Monthly Shelter Costs by Tenure",
x = "Monthly Shelter Cost ($)",
y = "Number of Households"
) +
theme_minimal() +
theme(legend.position = "none")
What’s new here?

- facet_wrap(~tenure, ncol = 1) creates separate panels for each group
- scale_fill_manual() lets us choose specific colors for each group

Look at both the table and the visualization above.
In lecture, we learned that survey samples don’t always perfectly represent the population. Some groups get oversampled (more in sample than in population) to ensure enough cases for analysis.
Even with careful sampling, some groups end up:

- Overrepresented: a larger share of the sample than of the population
- Underrepresented: a smaller share of the sample than of the population

This happens because of:

- Deliberate oversampling of smaller groups (so there are enough cases to analyze)
- Differences in who agrees to respond
- Random chance in who gets selected
Survey weights adjust for these imbalances. Each
household gets a weight (PFWEIGHT) based on how many
households it represents in the population.
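A toy example (made-up weights) of what summing weights does:

# Three sampled households and how many households each one represents
w <- c(250, 800, 450)
length(w)  # 3: the unweighted sample count
sum(w)     # 1500: the estimated number of households they stand in for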
Let’s replicate what we saw in lecture. First, the unweighted percentages (raw sample counts):
chs %>%
filter(PDCT_05 %in% c(1, 2)) %>% # Keep only owners and renters
count(PDCT_05) %>%
mutate(
tenure = ifelse(PDCT_05 == 1, "Owner", "Renter"),
pct_unweighted = round(n / sum(n) * 100, 1)
) %>%
dplyr::select(tenure, n, pct_unweighted)
## # A tibble: 2 × 3
## tenure n pct_unweighted
## <chr> <int> <dbl>
## 1 Owner 20821 52.6
## 2 Renter 18779 47.4
This is what we calculated in Week 2. But is this what Canada actually looks like?
The variable PFWEIGHT contains the survey weights. We
sum the weights to get population estimates:
chs %>%
filter(PDCT_05 %in% c(1, 2)) %>% # Keep only owners and renters
mutate(tenure = ifelse(PDCT_05 == 1, "Owner", "Renter")) %>%
group_by(tenure) %>%
summarize(
n_unweighted = n(),
n_weighted = sum(PFWEIGHT)
) %>%
mutate(
pct_unweighted = round(n_unweighted / sum(n_unweighted) * 100, 1),
pct_weighted = round(n_weighted / sum(n_weighted) * 100, 1)
)
## # A tibble: 2 × 5
## tenure n_unweighted n_weighted pct_unweighted pct_weighted
## <chr> <int> <dbl> <dbl> <dbl>
## 1 Owner 20821 10086258. 52.6 67.9
## 2 Renter 18779 4768680. 47.4 32.1
What’s happening here?

- n() counts the raw number of households in the sample
- sum(PFWEIGHT) sums the weights, estimating the population count

Here’s a useful feature of the CHS codebook: it reports both Frequency (unweighted count) and Weighted Frequency for every variable. For PDCT_05:
| Category | Code | Frequency (n) | Weighted Frequency | Weighted % |
|---|---|---|---|---|
| Owned | 1 | 20,821 | 10,086,258 | 67.9% |
| Rented | 2 | 18,779 | 4,768,680 | 32.1% |
This tells you immediately that the weighted percentage of owners (67.9%) is much higher than the unweighted percentage (52.6%).
The CHS also oversampled Atlantic Canada. Let’s see this:
chs %>%
mutate(region = case_when(
REGION == "01" | REGION == 1 ~ "Atlantic",
REGION == "02" | REGION == 2 ~ "Quebec",
REGION == "03" | REGION == 3 ~ "Ontario",
REGION == "04" | REGION == 4 ~ "Prairies",
REGION == "05" | REGION == 5 ~ "British Columbia",
REGION == "06" | REGION == 6 ~ "Territories",
TRUE ~ NA_character_
)) %>%
filter(!is.na(region)) %>%
group_by(region) %>%
summarize(
n_unweighted = n(),
n_weighted = sum(PFWEIGHT)
) %>%
mutate(
pct_unweighted = round(n_unweighted / sum(n_unweighted) * 100, 1),
pct_weighted = round(n_weighted / sum(n_weighted) * 100, 1),
difference = pct_unweighted - pct_weighted
)
## # A tibble: 6 × 6
## region n_unweighted n_weighted pct_unweighted pct_weighted difference
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 Atlantic 11263 1043174. 27.5 6.9 20.6
## 2 British Columb… 4197 2077140. 10.2 13.8 -3.6
## 3 Ontario 7333 5694396. 17.9 37.7 -19.8
## 4 Prairies 11434 2551913. 27.9 16.9 11
## 5 Quebec 5973 3701158. 14.6 24.5 -9.9
## 6 Territories 788 25934. 1.9 0.2 1.7
Look at Atlantic Canada: It makes up about 27% of the unweighted sample but only about 7% of the weighted population estimate. That’s deliberate oversampling to ensure enough cases for regional analysis.
| Purpose | Use |
|---|---|
| Describing what Canada looks like | Weighted percentages |
| Describing your sample | Unweighted counts |
| Calculating population estimates | Weighted |
| Reporting how many respondents you analyzed | Unweighted |
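Weights matter for averages too, not just counts. You won’t need this for this week’s tasks, but as a sketch, base R’s weighted.mean() computes a population-weighted average:

chs %>%
  filter(PDV_SHCO < 9000) %>%  # Remove "not stated", as before
  summarize(
    mean_unweighted = mean(PDV_SHCO),
    mean_weighted = weighted.mean(PDV_SHCO, PFWEIGHT)
  )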
A key skill in quantitative reasoning is comparing how a variable is distributed across different groups. This connects to our discussion of polarization: groups can have the same overall mean but very different distributions.
Let’s compare economic hardship (EHA_10) across
regions.
From the codebook (EHA_10 - Economic hardship):
Question: “In the past 12 months, how difficult or easy was it for your household to meet its financial needs in terms of transportation, housing, food, clothing and other necessary expenses?” Responses: 1 = Very difficult, 2 = Difficult, 3 = Neither difficult nor easy, 4 = Easy, 5 = Very easy, 9 = Not stated.
First, let’s calculate summary statistics by region:
chs %>%
filter(EHA_10 != 9) %>% # Remove "Not stated"
mutate(region = case_when(
REGION == "01" | REGION == 1 ~ "Atlantic",
REGION == "02" | REGION == 2 ~ "Quebec",
REGION == "03" | REGION == 3 ~ "Ontario",
REGION == "04" | REGION == 4 ~ "Prairies",
REGION == "05" | REGION == 5 ~ "British Columbia",
REGION == "06" | REGION == 6 ~ "Territories",
TRUE ~ NA_character_
)) %>%
filter(!is.na(region)) %>%
group_by(region) %>%
summarize(
mean_hardship = round(mean(EHA_10), 2),
sd_hardship = round(sd(EHA_10), 2),
pct_difficult = round(sum(EHA_10 %in% c(1, 2)) / n() * 100, 1),
n = n()
)
## # A tibble: 6 × 5
## region mean_hardship sd_hardship pct_difficult n
## <chr> <dbl> <dbl> <dbl> <int>
## 1 Atlantic 3.19 1.02 23.1 11240
## 2 British Columbia 3.23 1.02 21.8 4190
## 3 Ontario 3.18 1.07 24.6 7316
## 4 Prairies 3.17 1.03 23.7 11405
## 5 Quebec 3.47 1.03 15.9 5958
## 6 Territories 3.42 1.04 17.8 788
What’s new here?

- case_when() lets us recode multiple values at once with readable labels
- pct_difficult calculates the percentage reporting "Very difficult" or "Difficult" (see the sketch after this list)
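Why does sum(EHA_10 %in% c(1, 2)) count responses? Summing a logical vector counts the TRUEs. A quick sketch with made-up values:

x <- c(1, 3, 2, 5, 4, 1)
x %in% c(1, 2)                         # TRUE FALSE TRUE FALSE FALSE TRUE
sum(x %in% c(1, 2))                    # 3: each TRUE counts as 1
sum(x %in% c(1, 2)) / length(x) * 100  # 50: the percentage in codes 1 or 2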
difficult” or “Difficult”chs %>%
filter(EHA_10 != 9) %>% # Remove "Not stated"
mutate(region = case_when(
REGION == "01" | REGION == 1 ~ "Atlantic",
REGION == "02" | REGION == 2 ~ "Quebec",
REGION == "03" | REGION == 3 ~ "Ontario",
REGION == "04" | REGION == 4 ~ "Prairies",
REGION == "05" | REGION == 5 ~ "British Columbia",
REGION == "06" | REGION == 6 ~ "Territories",
TRUE ~ NA_character_
)) %>%
filter(!is.na(region)) %>%
ggplot(aes(x = factor(EHA_10), fill = region)) +
geom_bar(show.legend = FALSE) +
facet_wrap(~region) +
scale_x_discrete(labels = c("1" = "Very\ndifficult", "2" = "Difficult",
"3" = "Neither", "4" = "Easy", "5" = "Very\neasy")) +
scale_fill_brewer(palette = "Set1") +
labs(
title = "Economic Hardship by Region",
subtitle = "Unweighted sample counts — note differences in sample sizes across regions",
x = "Level of Difficulty Meeting Financial Needs",
y = "Number of Households"
) +
theme_minimal() +
theme(axis.text.x = element_text(size = 7))
What’s new here?

- facet_wrap(~region) creates a separate panel for each region
- scale_fill_brewer(palette = "Set1") uses a nice color palette

Look at both the table and the visualization.
This section asks you to apply what you’ve learned to a new variable. The tasks below are similar to what you’ve done in earlier parts — look back at those sections for guidance.
The variable PCOS_05 measures sense of belonging to the
local community.
From the codebook (PCOS_05 - Community satisfaction):
Question: “Using a scale of 0 to 10, where 0 means ‘Very dissatisfied’ and 10 means ‘Very satisfied’, how satisfied are you with feeling as part of your community?”
Like life satisfaction, responses are grouped in the PUMF: codes 1 through 9 cover 0-2 up through 10, and code 99 means “Not stated.”
Task 1: Calculate summary statistics (mean, median, SD) for community belonging.
Hint: Look at the beginning of Part 2 where we calculated these
for shelter costs. You’ll need to filter out code 99, then use
summarize() with mean(),
median(), and sd().
# Your code here
# Replace PDV_SHCO with PCOS_05
# Remember to filter out the "not stated" code (99 instead of 9999999.99)
Task 2: Create a histogram of community belonging.
Hint: Look at the life satisfaction histogram in Part 1. The code structure is very similar — you just need to change the variable name and the filter condition.
# Your code here
# Use PCOS_05 instead of PLIS_05
# Filter out code 99
# Adjust the scale_x_continuous() labels to match the codebook categories
Task 3: Compare community belonging between owners and renters.
Hint: Look at Part 3 where we compared shelter costs by tenure.
You’ll use the same approach: filter both variables, create a tenure
label with mutate(), use group_by(), then
summarize().
# Your code here
# Filter: PCOS_05 != 99 and PDCT_05 %in% c(1, 2)
# Create tenure variable with mutate()
# Group by tenure
# Calculate mean, median, sd, and n
After completing the tasks above, take a few minutes to reflect on this practice session and record your answers to the journal prompts in your course journal.
| Function | What it does | Example |
|---|---|---|
| ggplot() | Initialize a plot | ggplot(data, aes(x = var)) |
| geom_bar() | Bar chart for categorical data | geom_bar(fill = "steelblue") |
| geom_histogram() | Histogram for numerical data | geom_histogram(binwidth = 1) |
| geom_vline() | Add vertical line | geom_vline(xintercept = 50) |
| facet_wrap() | Separate panels by group | facet_wrap(~group) |
| scale_x_continuous() | Customize x-axis for continuous data | scale_x_continuous(breaks = seq(0, 10, 2)) |
| scale_x_discrete() | Customize x-axis for categorical data | scale_x_discrete(labels = c(...)) |
| scale_fill_manual() | Set custom colors | scale_fill_manual(values = c("A" = "red")) |
| sd() | Calculate standard deviation | sd(variable) |
| case_when() | Recode multiple values | case_when(x == 1 ~ "A", x == 2 ~ "B") |
| sum(weights) | Weighted count | sum(PFWEIGHT) |
Always visualize — summary statistics can hide important patterns (remember the Datasaurus!)
Shape matters — symmetric, left-skewed, and right-skewed distributions tell different stories
Mean vs. median — when they differ, the distribution is skewed; the median often better represents the “typical” case
Spread is telling — standard deviation tells you how much variation exists; two groups can have the same mean but very different SDs
Filter first — always check the codebook and remove special codes (like 9, 99, 9999999.99) before calculating statistics
Weighted vs. unweighted — for population estimates, use weights; for describing your sample, unweighted counts are appropriate
Sample sizes matter — small samples (like the Territories) mean less certainty; oversampled regions (like Atlantic Canada) have more observations than their population share
Remember: The goal is not perfect code, but building intuition for what distributions reveal. If something confuses you, write it down — that’s valuable information for class discussion.