Tutorial 4

Let’s begin with the package setup for today’s tutorial:

knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)

# List of packages
packages <- c("tidyverse", "fst", "gt", "scales", "viridis", "patchwork", "ggrepel")

# Install packages if they aren't installed already
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)

# Load the packages
lapply(packages, library, character.only = TRUE)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## 
## Attaching package: 'scales'
## 
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## 
## The following object is masked from 'package:readr':
## 
##     col_factor
## 
## 
## Loading required package: viridisLite
## 
## 
## Attaching package: 'viridis'
## 
## 
## The following object is masked from 'package:scales':
## 
##     viridis_pal

## [[1]]
##  [1] "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"     "readr"    
##  [7] "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"     "graphics" 
## [13] "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[2]]
##  [1] "fst"       "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"    
##  [7] "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"    
## [13] "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[3]]
##  [1] "gt"        "fst"       "lubridate" "forcats"   "stringr"   "dplyr"    
##  [7] "purrr"     "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse"
## [13] "stats"     "graphics"  "grDevices" "utils"     "datasets"  "methods"  
## [19] "base"     
## 
## [[4]]
##  [1] "scales"    "gt"        "fst"       "lubridate" "forcats"   "stringr"  
##  [7] "dplyr"     "purrr"     "readr"     "tidyr"     "tibble"    "ggplot2"  
## [13] "tidyverse" "stats"     "graphics"  "grDevices" "utils"     "datasets" 
## [19] "methods"   "base"     
## 
## [[5]]
##  [1] "viridis"     "viridisLite" "scales"      "gt"          "fst"        
##  [6] "lubridate"   "forcats"     "stringr"     "dplyr"       "purrr"      
## [11] "readr"       "tidyr"       "tibble"      "ggplot2"     "tidyverse"  
## [16] "stats"       "graphics"    "grDevices"   "utils"       "datasets"   
## [21] "methods"     "base"       
## 
## [[6]]
##  [1] "patchwork"   "viridis"     "viridisLite" "scales"      "gt"         
##  [6] "fst"         "lubridate"   "forcats"     "stringr"     "dplyr"      
## [11] "purrr"       "readr"       "tidyr"       "tibble"      "ggplot2"    
## [16] "tidyverse"   "stats"       "graphics"    "grDevices"   "utils"      
## [21] "datasets"    "methods"     "base"       
## 
## [[7]]
##  [1] "ggrepel"     "patchwork"   "viridis"     "viridisLite" "scales"     
##  [6] "gt"          "fst"         "lubridate"   "forcats"     "stringr"    
## [11] "dplyr"       "purrr"       "readr"       "tidyr"       "tibble"     
## [16] "ggplot2"     "tidyverse"   "stats"       "graphics"    "grDevices"  
## [21] "utils"       "datasets"    "methods"     "base"

Introduction to the General Social Survey

The General Social Survey (GSS) is one of the foundational data sources for understanding social change in the United States. Started in 1972, the GSS:

Assigns questionnaires to a nationally representative sample of U.S. adults
Covers a wide range of social, political, and economic topics
Provides crucial data for tracking long-term trends
Allows researchers to examine relationships between social characteristics

For this tutorial, we’ll use GSS data through 2022, focusing on understanding social patterns through careful data analysis and visualization.

# Load the full GSS dataset
gss <- read_fst("gss2022.fst")

# Initial look at data dimensions
dim(gss)

## [1] 72390  6646

Let’s begin by examining attitudes about gender roles in the family. The GSS has asked respondents whether they agree or disagree with the statement: “It is much better for everyone involved if the man is the achiever outside the home and the woman takes care of the home and family” (fefam)

Link here to documentation: https://gssdataexplorer.norc.org/variables/706/vshow

Understanding Our Data

First, let’s examine how this variable appears in our dataset:

# Initial examination
table(gss$fefam)

## 
##                strongly agree                         agree 
##                          2810                          9839 
##                      disagree             strongly disagree 
##                         15198                          7284 
##                    don't know                           iap 
##                             0                             0 
##            I don't have a job                   dk, na, iap 
##                             0                             0 
##                     no answer    not imputable_(2147483637) 
##                             0                             0 
##    not imputable_(2147483638)                       refused 
##                             0                             0 
##                skipped on web                    uncodeable 
##                             0                             0 
## not available in this release    not available in this year 
##                             0                             0 
##                  see codebook 
##                             0

# View unique values to understand coding
unique(gss$fefam)

## [1] <NA>              agree             strongly agree    strongly disagree
## [5] disagree         
## 17 Levels: strongly agree agree disagree strongly disagree don't know ... see codebook

Cleaning and Recoding

Before analyzing patterns, we need to clean our data and categories:

# Clean and recode fefam
gss_cleaned <- gss %>%
  mutate(
    gender_roles = case_when(
      fefam == "strongly agree" ~ "Strongly Agree",
      fefam == "agree" ~ "Agree",
      fefam == "disagree" ~ "Disagree",
      fefam == "strongly disagree" ~ "Strongly Disagree",
      TRUE ~ NA_character_
    ),
    # Create ordered factor for proper sorting
    gender_roles = factor(gender_roles, 
                         levels = c("Strongly Agree", "Agree", 
                                  "Disagree", "Strongly Disagree"))
  )
table(gss_cleaned$gender_roles)

## 
##    Strongly Agree             Agree          Disagree Strongly Disagree 
##              2810              9839             15198              7284

Let’s confirm the range of years the survey item was asked:

# Check years where fefam was asked
gss %>%
  filter(!is.na(fefam)) %>%
  distinct(year) %>%
  arrange(year) %>%
  pull()

##  [1] 1977 1985 1986 1988 1989 1990 1991 1993 1994 1996 1998 2000 2002 2004 2006
## [16] 2008 2010 2012 2014 2016 2018 2021 2022

This check reveals that the GSS has asked about gender role attitudes consistently from 1977 onward, providing us with over four decades of data to examine social change. But importantly it did not start at the beginning of the survey in 1972.

# Assign object name and refer to dataset name
gender_summary <- gss_cleaned %>%
  # Remove missing values from gender_roles variable
  filter(!is.na(gender_roles)) %>%
  # Count number of responses for each category
  count(gender_roles) %>%
  # Create new columns for proportions and percentages
  mutate(
    # Calculate proportion (dividing count by total sum)
    proportion = n/sum(n),
    # Convert to percentage and round to 1 decimal place
    percentage = round(proportion * 100, 1)
  )

# Display raw summary to check our work
gender_summary

##        gender_roles     n proportion percentage
## 1    Strongly Agree  2810 0.07998634        8.0
## 2             Agree  9839 0.28006604       28.0
## 3          Disagree 15198 0.43260938       43.3
## 4 Strongly Disagree  7284 0.20733825       20.7

Understanding the Basic Distribution

Looking at our initial summary of gender role attitudes, several key patterns emerge:

Overall:

A clear majority (64%) of respondents disagree with traditional gender roles, with 43.3% disagreeing and 20.7% strongly disagreeing. This suggests that most Americans reject the strict separation of gender roles where men work outside the home while women focus on domestic responsibilities. But we can ask: has this changed over time? Do some groups showcase clear divides or patterns?

Support for Traditional Roles:

Despite overall opposition, a substantial minority (36%) still supports traditional gender roles
This asymmetry in the strength of views suggests more opposition than support for traditional gender roles
Most of this support is moderate (28% “Agree”) rather than strong (8% “Strongly Agree”)

This distribution provides our baseline understanding, but to fully grasp patterns in gender role attitudes, we’ll need to examine how they vary across different social groups and how they’ve changed over time.

Creating Clear Tables

Let’s create a professional table showing these distributions:

# Create a formatted table using gt package
gender_summary %>%
  # Initialize gt table
  gt() %>%
  
  # Rename column headers for clearer presentation
  cols_label(
    gender_roles = "Response Category",  # Change from 'gender_roles' to 'Response Category'
    n = "N",                            # Change from 'n' to 'N'
    proportion = "Proportion",          
    percentage = "Percent"              # Change from 'percentage' to 'Percent'
  ) %>%
  
  # Format proportion column as percentages
  fmt_percent(
    columns = proportion,    # Specify which column to format
    decimals = 1            # Show one decimal place
  ) %>%
  
  # Format percentage column with one decimal
  fmt_number(
    columns = percentage,    # Specify which column to format
    decimals = 1            # Show one decimal place
  ) %>%
  
  # Add title and subtitle to table
  tab_header(
    # Use md() for markdown formatting - makes title bold
    title = md("**Table 1. Attitudes Toward Traditional Gender Roles**"),
    subtitle = "General Social Survey, 1977-2022"
  ) %>%
  
  # Add source note and question wording below table
  tab_source_note(
    source_note = md("*Note:* Responses to: 'It is much better for everyone involved if the man is the achiever outside the home and the woman takes care of the home and family.' Sample includes all valid responses across survey years.")
  ) %>%
  
  # Make column headers bold
  tab_style(
    style = cell_text(weight = "bold"),
    locations = cells_column_labels()
  ) %>%
  
  # Customize table appearance
  tab_options(
    # Add thick borders at top and bottom of table
    table.border.top.width = 2,
    table.border.bottom.width = 2,
    # Add thick border below column headers
    column_labels.border.bottom.width = 2,
    # Set font sizes for different table elements
    table.font.size = px(12),
    heading.title.font.size = px(14),
    heading.subtitle.font.size = px(12),
    source_notes.font.size = px(10),
    # Adjust spacing between rows
    data_row.padding = px(4)
  ) %>%
  
  # Format numbers with comma separator for thousands
  fmt_integer(
    columns = n,
    sep_mark = ","
  )

Response Category	N	Proportion	Percent
Table 1. Attitudes Toward Traditional Gender Roles
General Social Survey, 1977-2022
Strongly Agree	2,810	8.0%	8.0
Agree	9,839	28.0%	28.0
Disagree	15,198	43.3%	43.3
Strongly Disagree	7,284	20.7%	20.7
Note: Responses to: ‘It is much better for everyone involved if the man is the achiever outside the home and the woman takes care of the home and family.’ Sample includes all valid responses across survey years.

Visualization

Now let’s create a visualization of these proportions:

# Create an improved visualization of gender role attitudes
ggplot(gender_summary, 
       aes(x = gender_roles, y = proportion)) +
  # Add bars with a professional color and slight transparency
  geom_col(
    fill = "#2c3e50",    # Deep blue color
    alpha = 0.9,         # Slight transparency
    width = 0.7         # Slightly thinner bars
  ) +
  # Format y-axis as percentages and set limits
  scale_y_continuous(
    labels = scales::percent_format(),  # Convert to percentage format
    limits = c(0, 0.5),                # Set y-axis limits
    expand = c(0, 0)                   # Remove spacing at bottom
  ) +
  # Add clear labels
  labs(
    title = "Distribution of Attitudes Toward Traditional Gender Roles",
    subtitle = "General Social Survey, 1977-2022",
    x = NULL,   # Remove x-axis label (redundant with categories)
    y = "Proportion of Respondents"
  ) +
  # Apply clean theme
  theme_minimal() +
  # Customize theme elements
  theme(
    # Customize title appearance
    plot.title = element_text(
      face = "bold",
      size = 14,
      margin = margin(b = 10)
    ),
    plot.subtitle = element_text(
      color = "grey40",
      size = 12,
      margin = margin(b = 20)
    ),
    # Clean up axes
    axis.line = element_line(color = "black"),
    axis.ticks = element_line(color = "black"),
    # Remove gridlines
    panel.grid = element_blank(),
    # Adjust text sizing and appearance
    axis.text = element_text(size = 11),
    axis.title = element_text(size = 11),
    # Add some margin around the plot
    plot.margin = margin(20, 20, 20, 20)
  )

Let’s continue building from our foundational visualization to examine patterns over time. This next section will help us understand how gender role attitudes have evolved:

Examining Change Over Time

# Calculate proportions by year
gender_trends <- gss_cleaned %>%
  # Remove missing values 
  filter(!is.na(gender_roles), !is.na(year)) %>%
  # Group by year to get annual proportions
  group_by(year) %>%
  # Count responses per category per year
  count(gender_roles) %>%
  # Calculate proportions within each year
  mutate(
    total = sum(n),
    proportion = n/total
  ) %>%
  # Ungroup for further operations
  ungroup()
gender_trends

## # A tibble: 92 × 5
##     year gender_roles          n total proportion
##    <int> <fct>             <int> <int>      <dbl>
##  1  1977 Strongly Agree      275  1503     0.183 
##  2  1977 Agree               714  1503     0.475 
##  3  1977 Disagree            422  1503     0.281 
##  4  1977 Strongly Disagree    92  1503     0.0612
##  5  1985 Strongly Agree      150  1502     0.0999
##  6  1985 Agree               577  1502     0.384 
##  7  1985 Disagree            574  1502     0.382 
##  8  1985 Strongly Disagree   201  1502     0.134 
##  9  1986 Strongly Agree      132  1444     0.0914
## 10  1986 Agree               557  1444     0.386 
## # ℹ 82 more rows

# Get last points for each category for label placement
last_points <- gender_trends %>%
  group_by(gender_roles) %>%
  slice_max(order_by = year, n = 1)

# Create improved visualization with direct labels
ggplot(gender_trends, 
       aes(x = year, y = proportion, color = gender_roles)) +
  # Add trend lines
  geom_line(size = 1.2) +
  # Use Heiss-inspired color palette
  scale_color_manual(
    values = c(
      "Strongly Agree" = "#6C3C89",    # Deep purple
      "Agree" = "#C098D0",             # Light purple
      "Disagree" = "#48A0A8",          # Teal
      "Strongly Disagree" = "#005F73"   # Deep teal
    )
  ) +
  # Add direct labels at end of lines
  geom_label_repel(
    data = last_points,
    aes(label = gender_roles),
    nudge_x = 2,
    direction = "y",
    hjust = 0,
    segment.size = 0.3,
    segment.color = "grey70",
    box.padding = 0.5,
    point.padding = 0.5,
    size = 3.5,
    fontface = "bold",
    label.size = 0.1,
    label.r = unit(0.2, "lines"),
    fill = alpha("white", 0.7)
  ) +
  # Format axes
  scale_y_continuous(
    labels = scales::percent_format(),
    limits = c(0, 0.6),
    breaks = seq(0, 0.6, by = 0.1),
    expand = c(0.01, 0.01)
  ) +
  scale_x_continuous(
    breaks = seq(1975, 2020, by = 5),
    limits = c(1972, 2027),  # Extended for labels
    expand = c(0.01, 0.01)
  ) +
  # Add labels
  labs(
    title = "The Decline of Traditional Gender Role Attitudes",
    subtitle = "Agreement with: 'Better if man works, woman tends home'",
    x = NULL,
    y = "Proportion of Respondents"
  ) +
  # Apply clean theme
  theme_minimal() +
  # Customize theme elements
  theme(
    plot.title = element_text(
      family = "sans",
      face = "bold",
      size = 14,
      margin = margin(b = 10)
    ),
    plot.subtitle = element_text(
      family = "sans",
      color = "grey40",
      size = 12,
      margin = margin(b = 20)
    ),
    axis.line = element_line(color = "black", size = 0.5),
    axis.ticks = element_line(color = "black", size = 0.5),
    panel.grid.major.x = element_blank(),
    panel.grid.minor = element_blank(),
    panel.grid.major.y = element_line(color = "grey90", size = 0.3),
    legend.position = "none",  # Remove legend since using direct labels
    plot.margin = margin(20, 20, 20, 20)
  )

Understanding Changes in Gender Role Attitudes

Looking at our visualization, we can identify several clear patterns:

Overall Trend

Support for traditional gender roles has steadily declined since 1977, but it is driven by a stee “Agree” decline, falling from nearly 50% to under 20%
Opposition (both “Disagree” and “Strongly Disagree”) has become the majority position
By 2022, over 75% of respondents oppose traditional gender roles
Changes continue but at a slower pace after 2000

This example shows us how examining proportions over time helps identify social changes that raw counts might miss. The steady decline in traditional views reflects broader social transformations in American society.

Building on Patterns: Adding Educational Differences

Education has long been identified as one of the strongest predictors of gender attitudes (see Davis and Greenstein 2009 [Annual Review of Sociology] for review). Let’s examine this relationship in our data.

First, let’s check our education variable:

# Examine education variable
table(gss$degree)

## 
##         less than high school                   high school 
##                         14192                         36446 
##      associate/junior college                    bachelor's 
##                          4355                         11248 
##                      graduate                    don't know 
##                          5953                             0 
##                           iap            I don't have a job 
##                             0                             0 
##                   dk, na, iap                     no answer 
##                             0                             0 
##    not imputable_(2147483637)    not imputable_(2147483638) 
##                             0                             0 
##                       refused                skipped on web 
##                             0                             0 
##                    uncodeable not available in this release 
##                             0                             0 
##    not available in this year                  see codebook 
##                             0                             0

# Check years where both variables are available
gss %>%
  filter(!is.na(degree), !is.na(fefam)) %>%
  distinct(year) %>%
  arrange(year) %>%
  pull()

##  [1] 1977 1985 1986 1988 1989 1990 1991 1993 1994 1996 1998 2000 2002 2004 2006
## [16] 2008 2010 2012 2014 2016 2018 2021 2022

Following standard practice in the literature, we’ll slightly recode education into four categories:

Less than High School
High School Graduate
Some College
Bachelor’s Degree or Higher

Other standards include doing a 2 or 3 category recode (remember, always justify your choices!)

# Recode education into standard categories
gss_edu <- gss_cleaned %>%
  mutate(
    education = case_when(
      degree == "less than high school" ~ "Less than High School",
      degree == "high school" ~ "High School",
      degree %in% c("associate/junior college") ~ "Some College",
      degree %in% c("bachelor's", "graduate") ~ "College Graduate",
      TRUE ~ NA_character_
    ),
    # Create ordered factor for proper sorting
    education = factor(education, 
                      levels = c("Less than High School",
                               "High School", 
                               "Some College",
                               "College Graduate"))
  )

# Verify recoding and distribution
table(gss_edu$education, useNA = "ifany")

## 
## Less than High School           High School          Some College 
##                 14192                 36446                  4355 
##      College Graduate                  <NA> 
##                 17201                   196

This coding scheme aligns with standard practice in the literature (e.g., Davis and Greenstein 2009; Bolzendahl and Myers 2004) and allows for comparison with other studies while maintaining meaningful distinctions in educational attainment.

Now let’s calculate proportions of gender attitudes by education

edu_attitudes <- gss_edu %>%
  # Remove missing values
  filter(!is.na(education), !is.na(gender_roles)) %>%
  # Group by education level
  group_by(education) %>%
  # Calculate proportions within each education level
  count(gender_roles) %>%
  mutate(
    total = sum(n),
    proportion = n/total
  )
edu_attitudes

## # A tibble: 16 × 5
## # Groups:   education [4]
##    education             gender_roles          n total proportion
##    <fct>                 <fct>             <int> <int>      <dbl>
##  1 Less than High School Strongly Agree      915  5665     0.162 
##  2 Less than High School Agree              2478  5665     0.437 
##  3 Less than High School Disagree           1769  5665     0.312 
##  4 Less than High School Strongly Disagree   503  5665     0.0888
##  5 High School           Strongly Agree     1392 17752     0.0784
##  6 High School           Agree              5137 17752     0.289 
##  7 High School           Disagree           8068 17752     0.454 
##  8 High School           Strongly Disagree  3155 17752     0.178 
##  9 Some College          Strongly Agree      146  2437     0.0599
## 10 Some College          Agree               534  2437     0.219 
## 11 Some College          Disagree           1159  2437     0.476 
## 12 Some College          Strongly Disagree   598  2437     0.245 
## 13 College Graduate      Strongly Agree      347  9209     0.0377
## 14 College Graduate      Agree              1675  9209     0.182 
## 15 College Graduate      Disagree           4170  9209     0.453 
## 16 College Graduate      Strongly Disagree  3017  9209     0.328

And visualize:

# Create visualization of educational differences
ggplot(edu_attitudes, 
       aes(x = education, y = proportion, fill = gender_roles)) +
  # Create proportional stacked bars
  geom_col(width = 0.7) +
  # Use clean color scheme
  scale_fill_manual(
    values = c(
      "Strongly Agree" = "#6C3C89",    # Deep purple
      "Agree" = "#C098D0",             # Light purple
      "Disagree" = "#48A0A8",          # Teal
      "Strongly Disagree" = "#005F73"   # Deep teal
    )
  ) +
  # Format y-axis as percentages
  scale_y_continuous(
    labels = scales::percent_format(),
    expand = c(0, 0)
  ) +
  # Add clear labels
  labs(
    title = "Educational Differences in Gender Role Attitudes",
    subtitle = "Agreement with: 'Better if man works, woman tends home'",
    x = "Educational Attainment",
    y = "Proportion of Respondents",
    fill = "Response"
  ) +
  # Apply clean theme
  theme_minimal() +
  # Customize theme elements
  theme(
    plot.title = element_text(
      face = "bold",
      size = 14,
      margin = margin(b = 10)
    ),
    plot.subtitle = element_text(
      color = "grey40",
      size = 12,
      margin = margin(b = 20)
    ),
    axis.text.x = element_text(angle = 0),
    legend.position = "right",
    panel.grid.major.x = element_blank(),
    panel.grid.minor = element_blank(),
    panel.grid.major.y = element_line(color = "grey90"),
    axis.line = element_line(color = "black", size = 0.5),
    plot.margin = margin(20, 20, 20, 20)
  )

Interpreting Educational Differences in Gender Role Attitudes

Looking at our visualization and proportions, we see a strong pattern between education and gender role attitudes:

1. Clear Educational Gradient

Traditional views (combined “Strongly Agree” and “Agree”) decrease with education:
- Less than High School: 60% traditional (16.2% strongly, 43.7% agree)
- High School: 37% traditional (7.8% strongly, 29% agree)
- Some College: 28% traditional (6% strongly, 22% agree)
- College Graduate: 22% traditional (3.8% strongly, 18.2% agree)

2. Strength of Opposition

Strong disagreement increases markedly with education:
- Less than High School: 8.9%
- High School: 17.8%
- Some College: 24.5%
- College Graduate: 32.8%

3. Pattern of Responses

Less educated: More concentrated in moderate positions
More educated: Stronger tendency toward disagreement, especially strong disagreement
“Disagree” is most common response for all groups except less than high school

Let’s extend this analysis by examining whether these educational differences have remained stable over time:

# Calculate proportions by education and year
edu_time_trends <- gss_edu %>%
  # Remove missing values
  filter(!is.na(education), !is.na(gender_roles), !is.na(year)) %>%
  # Group by education and year
  group_by(education, year) %>%
  # Calculate the proportion who agree/strongly agree (traditional views)
  summarize(
    n = n(),
    traditional = mean(gender_roles %in% c("Strongly Agree", "Agree")),
    .groups = "drop"
  )
edu_time_trends

## # A tibble: 92 × 4
##    education              year     n traditional
##    <fct>                 <int> <int>       <dbl>
##  1 Less than High School  1977   525       0.804
##  2 Less than High School  1985   398       0.691
##  3 Less than High School  1986   393       0.695
##  4 Less than High School  1988   236       0.682
##  5 Less than High School  1989   210       0.576
##  6 Less than High School  1990   172       0.616
##  7 Less than High School  1991   203       0.645
##  8 Less than High School  1993   187       0.578
##  9 Less than High School  1994   320       0.569
## 10 Less than High School  1996   360       0.586
## # ℹ 82 more rows

# Get last points for label placement
last_points <- edu_time_trends %>%
  group_by(education) %>%
  slice_max(order_by = year, n = 1)

# Create improved visualization
ggplot(edu_time_trends, 
       aes(x = year, y = traditional, color = education)) +
  # Add smoothed lines
  geom_line(size = 1.2) +
  # Use clean color scheme
  scale_color_manual(
    values = c(
      "Less than High School" = "#6C3C89",
      "High School" = "#C098D0",
      "Some College" = "#48A0A8",
      "College Graduate" = "#005F73"
    )
  ) +
  # Add direct labels at end of lines
  geom_label_repel(
    data = last_points,
    aes(label = education),
    nudge_x = 2,
    direction = "y",
    hjust = 0,
    segment.size = 0.3,
    segment.color = "grey70",
    box.padding = 0.5,
    point.padding = 0.5,
    size = 3.5,
    fontface = "bold",
    label.size = 0.1,
    label.r = unit(0.2, "lines"),
    fill = alpha("white", 0.7)
  ) +
  # Format axes
  scale_y_continuous(
    labels = scales::percent_format(),
    limits = c(0, 1.0),
    breaks = seq(0, 1.0, by = 0.2),
    expand = c(0.01, 0.01)
  ) +
  scale_x_continuous(
    breaks = seq(1975, 2020, by = 5),
    limits = c(1972, 2027),  # Extended for labels
    expand = c(0.01, 0.01)
  ) +
  # Add labels
  labs(
    title = "Educational Differences in Gender Role Attitudes Over Time",
    subtitle = "Proportion agreeing: 'Better if man works, woman tends home'",
    x = NULL,
    y = "Proportion Supporting Traditional Roles"
  ) +
  # Apply clean theme
  theme_minimal() +
  # Customize theme elements
  theme(
    plot.title = element_text(
      face = "bold",
      size = 14,
      margin = margin(b = 10)
    ),
    plot.subtitle = element_text(
      color = "grey40",
      size = 12,
      margin = margin(b = 20)
    ),
    axis.line = element_line(color = "black", size = 0.5),
    axis.ticks = element_line(color = "black", size = 0.5),
    panel.grid = element_blank(),
    axis.text = element_text(size = 10),
    legend.position = "none",
    plot.margin = margin(20, 20, 20, 20)
  )

Let’s discuss, what would be your takeaway?

From Description to Group Differences

To understand group differences more systematically, we’ll:

Calculate differences between groups
Examine changes in these differences over time
Build measures of divergence/convergence

Let’s start with partisan differences in gender attitudes. Variable information available here:

https://gssdataexplorer.norc.org/variables/141/vshow

# First check party ID variable
table(gss$partyid)

## 
##                    strong democrat           not very strong democrat 
##                              11795                              14286 
##     independent, close to democrat independent (neither, no response) 
##                               8663                              11540 
##   independent, close to republican         not very strong republican 
##                               6378                              10678 
##                  strong republican                        other party 
##                               7273                               1292 
##                         don't know                                iap 
##                                  0                                  0 
##                 I don't have a job                        dk, na, iap 
##                                  0                                  0 
##                          no answer         not imputable_(2147483637) 
##                                  0                                  0 
##         not imputable_(2147483638)                            refused 
##                                  0                                  0 
##                     skipped on web                         uncodeable 
##                                  0                                  0 
##      not available in this release         not available in this year 
##                                  0                                  0 
##                       see codebook 
##                                  0

As per the standard in the literature, let’s group into three categories (R/I/D):

# Recode party ID into three categories
gss_party <- gss_edu %>%
  mutate(
    party = case_when(
      partyid %in% c("strong democrat", 
                     "not very strong democrat",
                     "independent, close to democrat") ~ "Democrat",
      partyid %in% c("independent (neither, no response)",
                     "other party") ~ "Independent",
      partyid %in% c("strong republican",
                     "not very strong republican", 
                     "independent, close to republican") ~ "Republican",
      TRUE ~ NA_character_
    ),
    # Create factor with meaningful order
    party = factor(party, levels = c("Democrat", "Independent", "Republican"))
  )

# Verify recoding
table(gss_party$party, useNA = "ifany")

## 
##    Democrat Independent  Republican        <NA> 
##       34744       12832       24329         485

# Calculate party differences in gender attitudes over time
party_trends <- gss_party %>%
  # Remove missing values and independents for D-R comparison
  filter(!is.na(party), 
         party != "Independent",
         !is.na(gender_roles),
         !is.na(year)) %>%
  # Group by party and year
  group_by(party, year) %>%
  # Calculate proportion traditional for each party-year
  summarize(
    n = n(),
    traditional = mean(gender_roles %in% c("Strongly Agree", "Agree")),
    .groups = "drop"
  )

Understanding Group Differences: Measuring the Partisan Gap

The code for calculating partisan differences demonstrates a fundamental approach to measuring group divergence:

First, we reshape our data from long to wide format using pivot_wider(), creating separate columns for Democratic and Republican proportions
Then, we calculate the difference between groups (Republican - Democrat)
A positive difference indicates Republicans are more traditional; negative means Democrats are more traditional
This simple subtraction provides an intuitive measure of partisan divergence at each time point (and whether it is increasing or decreasing)

# Calculate partisan difference by year
partisan_diff <- party_trends %>%
  select(-n) %>%
  pivot_wider(
    names_from = party,
    values_from = traditional
  ) %>%
  mutate(
    diff = Republican - Democrat  # Positive values = Republicans more traditional
  )
partisan_diff

## # A tibble: 23 × 4
##     year Democrat Republican     diff
##    <int>    <dbl>      <dbl>    <dbl>
##  1  1977    0.676      0.656 -0.0198 
##  2  1985    0.477      0.505  0.0280 
##  3  1986    0.467      0.492  0.0251 
##  4  1988    0.409      0.455  0.0459 
##  5  1989    0.404      0.432  0.0277 
##  6  1990    0.399      0.418  0.0197 
##  7  1991    0.415      0.411 -0.00370
##  8  1993    0.337      0.391  0.0539 
##  9  1994    0.324      0.388  0.0645 
## 10  1996    0.349      0.456  0.107  
## # ℹ 13 more rows

Understanding pivot_wider

Let’s look at a simple example to understand what pivot_wider() does to our data:

# Original data (long format)
party_example <- data.frame(
  year = c(2020, 2020),
  party = c("Democrat", "Republican"),
  traditional = c(0.32, 0.44)
)

# Show original format
print("Original (Long) Format:")

## [1] "Original (Long) Format:"

party_example

##   year      party traditional
## 1 2020   Democrat        0.32
## 2 2020 Republican        0.44

# Create wide format
party_example_wide <- party_example %>%
  pivot_wider(
    names_from = party,           # Column that becomes new columns
    values_from = traditional     # Values that fill new columns
  )

# Show transformed format
print("After pivot_wider (Wide Format):")

## [1] "After pivot_wider (Wide Format):"

party_example_wide

## # A tibble: 1 × 3
##    year Democrat Republican
##   <dbl>    <dbl>      <dbl>
## 1  2020     0.32       0.44

pivot_wider() transforms our data from “long” format (where Democrat and Republican values are in rows) to “wide” format (where they appear as separate columns). This makes it easy to calculate differences between groups – we can simply subtract one column from another.

In our case:

Before: Each row had a party and its proportion
After: Each row has both Democrat and Republican proportions
This allows us to calculate Republican - Democrat difference

Let’s visualize both the trends and difference:

# Create visualization of party trends
ggplot(party_trends, 
       aes(x = year, y = traditional, color = party)) +
  # Add smoothed lines
  geom_line(size = 1.2) +
  # Use clear party colors
  scale_color_manual(
    values = c(
      "Democrat" = "#2E74C0",     # Blue
      "Republican" = "#CB454A"     # Red
    )
  ) +
  # Add direct labels at end of lines
  geom_label_repel(
    data = party_trends %>% 
      group_by(party) %>% 
      slice_max(order_by = year, n = 1),
    aes(label = party),
    nudge_x = 2,
    direction = "y",
    hjust = 0,
    segment.size = 0.3,
    segment.color = "grey70",
    box.padding = 0.5,
    point.padding = 0.5,
    size = 3.5,
    fontface = "bold",
    label.size = 0.1,
    label.r = unit(0.2, "lines"),
    fill = alpha("white", 0.7)
  ) +
  # Format axes
  scale_y_continuous(
    labels = scales::percent_format(),
    limits = c(0, 0.8),
    breaks = seq(0, 0.8, by = 0.2)
  ) +
  scale_x_continuous(
    breaks = seq(1975, 2020, by = 5),
    limits = c(1975, 2027)
  ) +
  # Add labels
  labs(
    title = "Partisan Differences in Gender Role Attitudes",
    subtitle = "Proportion supporting traditional gender roles by party identification",
    x = NULL,
    y = "Proportion Supporting Traditional Roles"
  ) +
  # Clean theme
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(color = "grey40", size = 12),
    axis.line = element_line(color = "black", size = 0.5),
    axis.ticks = element_line(color = "black", size = 0.5),
    panel.grid = element_blank(),
    legend.position = "none",
    plot.margin = margin(20, 20, 20, 20)
  )

Distribution Changes: From Overlap to Divergence

When examining changes in attitudes between groups over time, density plots offer powerful insights into:

Overall distribution shape
Areas of overlap between groups
Central tendency differences
Spread of attitudes within groups

For gender attitudes, we can examine how Democrats and Republicans have potentially “sorted” over time by comparing distributions at two time points. This helps us understand not just average differences, but how the full range of attitudes has changed within each party.

# Prepare data for distribution comparison
party_dist <- gss_party %>%
  # Keep only Democrats and Republicans at start/end points
  filter(!is.na(party), 
         party != "Independent",
         !is.na(gender_roles),
         year %in% c(1977, 2022)) %>%
  # Convert year to factor and ensure proper ordering of attitudes
  mutate(
    year = factor(year),
    # Create numeric version of gender_roles for density estimation
    attitude_score = as.numeric(factor(gender_roles, 
      levels = c("Strongly Agree", "Agree", "Disagree", "Strongly Disagree")
    ))
  )

# Create density plot comparison
ggplot(party_dist, 
       aes(x = attitude_score, 
           fill = party)) +
  # Create separate panels for each time point
  facet_wrap(~year, 
             nrow = 2,
             labeller = labeller(year = c(
               "1977" = "1977: High Overlap in Views",
               "2022" = "2022: Clear Partisan Division"
             ))) +
  # Add density curves
  geom_density(alpha = 0.5) +
  # Use standard party colors
  scale_fill_manual(
    values = c(
      "Democrat" = "#2E74C0",
      "Republican" = "#CB454A"
    )
  ) +
  # Add labels for x-axis points
  scale_x_continuous(
    breaks = 1:4,
    labels = c("Strongly\nAgree", "Agree", "Disagree", "Strongly\nDisagree"),
    limits = c(0.5, 4.5)
  ) +
  # Clean theme
  theme_minimal() +
  theme(
    axis.line = element_line(color = "black"),
    panel.grid = element_blank(),
    legend.position = "right",
    plot.title = element_text(face = "bold"),
    axis.text.x = element_text(size = 9)
  ) +
  # Add clear labels
  labs(
    title = "Partisan Divergence in Gender Attitudes",
    subtitle = "Distribution of responses to: 'Better if man works, woman tends home'",
    x = "More Progressive →",
    y = "Density",
    fill = "Party ID"
  )

Let’s discuss: what do you observe?

Now let’s examine the partisan difference:

# Create plot to visualize partisan differences over time
ggplot(partisan_diff, 
       aes(x = year, y = diff)) +
  # Add main trend line
  geom_line(size = 1.2, color = "#2C3E50") +
  # Add reference line at 0 (no difference between parties)
  geom_hline(yintercept = 0, linetype = "dashed", color = "grey50") +
  # Format y-axis as percentages with reasonable range and breaks
  scale_y_continuous(
    labels = scales::percent_format(),
    limits = c(-0.2, 0.2),
    breaks = seq(-0.2, 0.2, by = 0.1)
  ) +
  # Set clear x-axis breaks by 5-year intervals
  scale_x_continuous(
    breaks = seq(1975, 2020, by = 5)
  ) +
  # Add informative labels
  labs(
    title = "Partisan Gap in Gender Role Attitudes",
    subtitle = "Difference in support for traditional roles (Republican - Democrat)",
    x = NULL,
    y = "Percentage Point Difference"
  ) +
  # Apply clean theme
  theme_minimal() +
  # Customize theme elements for professional presentation
  theme(
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(color = "grey40", size = 12),
    axis.line = element_line(color = "black", size = 0.5),
    axis.ticks = element_line(color = "black", size = 0.5),
    panel.grid = element_blank(),
    plot.margin = margin(20, 20, 20, 20)
  )

Interpreting Partisan Differences in Gender Attitudes

The data reveals a clear pattern of partisan divergence in gender role attitudes since 1977:

In 1977, virtually no partisan gap existed (-2 percentage points)
A consistent gap emerges in the late 1980s and early 1990s
The gap widens substantially in the 2000s and 2010s
By 2021-22, the partisan gap reaches its peak (~20 percentage points)
- Republicans: ~33-35% traditional
- Democrats: ~10-16% traditional

This growing divide reflects increasing partisan sorting on gender attitudes, with Republicans maintaining more traditional views while Democrats show steeper decline in traditional attitudes.

Understanding Partisan Sorting: From Overlap to Polarization

Before diving into specific measures, let’s understand what we’re trying to capture when comparing distributions. When examining whether groups have become more different over time, we want to know:

How far apart are their typical positions (central tendency)?
How much do group members vary in their views (spread)?
How much do the distributions overlap?

Statistical Foundations: Cohen’s d and Effect Size

Cohen’s d is a standardized measure of the difference between two groups that helps us understand how much they differ while accounting for variation within each group. The formula is:

Cohen's d = (Mean of Group 1 - Mean of Group 2) / Pooled Standard Deviation

where the pooled standard deviation combines the spread from both groups:

Pooled SD = sqrt((SD1² + SD2²) / 2)

This measure is valuable because it:

Accounts for both difference in means and spread within groups
Provides a standardized scale for comparison
Provides ~ benchmarks (~ small: 0.2, ~ medium: 0.5, ~ large: 0.8)

# Calculate detailed overlap statistics
overlap_detailed <- party_dist %>%
  group_by(year) %>%
  summarize(
    # Mean values for each party
    mean_dem = mean(attitude_score[party == "Democrat"]),
    mean_rep = mean(attitude_score[party == "Republican"]),
    # Standard deviations
    sd_dem = sd(attitude_score[party == "Democrat"]),
    sd_rep = sd(attitude_score[party == "Republican"]),
    # Calculate difference and Cohen's d
    mean_diff = mean_rep - mean_dem,
    pooled_sd = sqrt((sd_dem^2 + sd_rep^2) / 2),
    cohens_d = mean_diff / pooled_sd,
    # Sample sizes
    n_dem = sum(party == "Democrat"),
    n_rep = sum(party == "Republican")
  )

# Format results table
overlap_detailed %>%
  mutate(across(where(is.numeric), ~round(., 3))) %>%
  gt() %>%
  tab_header(
    title = "Measuring Partisan Sorting on Gender Attitudes",
    subtitle = "Comparing distributions in 1977 and 2022"
  ) %>%
  cols_label(
    mean_dem = "Democrat Mean",
    mean_rep = "Republican Mean",
    sd_dem = "Democrat SD",
    sd_rep = "Republican SD",
    mean_diff = "Mean Difference",
    pooled_sd = "Pooled SD",
    cohens_d = "Cohen's d",
    n_dem = "N (Dem)",
    n_rep = "N (Rep)"
  ) %>%
  tab_source_note(
    source_note = "Note: Higher means indicate more progressive views"
  )

year	Democrat Mean	Republican Mean	Democrat SD	Republican SD	Mean Difference	Pooled SD	Cohen's d	N (Dem)	N (Rep)
Measuring Partisan Sorting on Gender Attitudes
Comparing distributions in 1977 and 2022
1977	2.199	2.217	0.796	0.814	0.018	0.805	0.022	863	456
2022	3.277	2.756	0.796	0.862	-0.521	0.830	-0.628	960	750
Note: Higher means indicate more progressive views

Interpreting Partisan Sorting in Gender Attitudes (1977-2022)

Looking at our measurement of partisan differences in gender attitudes between 1977 and 2022, we find clear evidence of increasing partisan division:

1. From Unity to Division in Average Views

In 1977, Democrats (2.20) and Republicans (2.22) held nearly identical average positions
By 2022, a clear gap emerged:
- Democrats moved strongly progressive (3.28)
- Republicans showed modest change (2.76)
- Net difference of 0.52 points on our 4-point scale

2. Stability in Within-Party Variation

Democratic spread remained constant (SD ≈ 0.80)
Republican variation increased slightly (0.81 to 0.86)
This tells us sorting occurred through shifting positions, not increasing within-party consensus

3. Quantifying the Growing Divide

1977: Tiny partisan effect (d = 0.02)
- Nearly complete overlap in views
- Party identification told us almost nothing about gender attitudes
2022: Substantial partisan gap (d = -0.63)
- Clear separation in distributions
- Party now strongly predicts gender attitudes
- Negative value shows Democrats more progressive

This analysis reveals not just that parties diverged, but how they diverged: through a substantial progressive shift among Democrats while Republicans changed more modestly, all while maintaining similar levels of internal diversity of views.

Comparing to other Social/Political Attitudes

This approach helps us understand sorting across different issues. Looking at abortion views provides an important comparison:

# Prepare abortion attitude data
abortion_dist <- gss_party %>%
  # Keep only Democrats and Republicans at same time points
  filter(!is.na(party), 
         party != "Independent",
         !is.na(abany),
         year %in% c(1977, 2022)) %>%
  mutate(
    year = factor(year),
    # Create numeric score (1 = oppose, 2 = support)
    abortion_score = as.numeric(factor(abany, levels = c("no", "yes")))
  )

# Create comparison plot
ggplot(abortion_dist, 
       aes(x = abortion_score, 
           fill = party)) +
  facet_wrap(~year, 
             nrow = 2,
             labeller = labeller(year = c(
               "1977" = "1977: Mixed Views",
               "2022" = "2022: Party Alignment"
             ))) +
  geom_density(alpha = 0.5) +
  scale_fill_manual(
    values = c(
      "Democrat" = "#2E74C0",
      "Republican" = "#CB454A"
    )
  ) +
  scale_x_continuous(
    breaks = 1:2,
    labels = c("Oppose", "Support"),
    limits = c(0.5, 2.5)
  ) +
  theme_minimal() +
  theme(
    axis.line = element_line(color = "black"),
    panel.grid = element_blank(),
    legend.position = "right",
    plot.title = element_text(face = "bold")
  ) +
  labs(
    title = "Partisan Sorting on Abortion Rights",
    subtitle = "Distribution of support for legal abortion by party",
    x = "More Progressive →",
    y = "Density",
    fill = "Party ID"
  )

The abortion comparison raises a number of questions, notably whether gender attitudes are unique or part of broader sorting and how different issues may sort at different rates.

Turning to Religious Change

These partisan and educational patterns suggest deeper social transformations. Another key area where we see substantial change is religious identification, particularly the rise of the “religious nones.”

Let’s first examine our religious identification variable in the GSS:

# Initial check of religious categories
table(gss$relig)

## 
##                    protestant                      catholic 
##                         38707                         16498 
##                        jewish                          none 
##                          1360                          8918 
##                         other                      buddhism 
##                          1135                           245 
##                      hinduism       other eastern religions 
##                           130                            41 
##                  muslim/islam            orthodox-christian 
##                           178                           155 
##                     christian               native american 
##                           915                            34 
##       inter-nondenominational                    don't know 
##                           154                             0 
##                           iap            I don't have a job 
##                             0                             0 
##                   dk, na, iap                     no answer 
##                             0                             0 
##    not imputable_(2147483637)    not imputable_(2147483638) 
##                             0                             0 
##                       refused                skipped on web 
##                             0                             0 
##                    uncodeable not available in this release 
##                             0                             0 
##    not available in this year                  see codebook 
##                             0                             0

# Check years available
gss %>%
  filter(!is.na(relig)) %>%
  distinct(year) %>%
  arrange(year) %>%
  pull()

##  [1] 1972 1973 1974 1975 1976 1977 1978 1980 1982 1983 1984 1985 1986 1987 1988
## [16] 1989 1990 1991 1993 1994 1996 1998 2000 2002 2004 2006 2008 2010 2012 2014
## [31] 2016 2018 2021

Following research on religious change (e.g., Voas and Chaves 2016), we’ll focus on the key contrast between religious affiliation and non-affiliation:

# Create religious none indicator
gss_relig <- gss %>%
  mutate(
    religious_none = case_when(
      relig == "none" ~ "None",
      relig %in% c("protestant", "catholic", "jewish", 
                   "buddhism", "hinduism", "muslim/islam",
                   "orthodox-christian", "christian", 
                   "inter-nondenominational", "other") ~ "Affiliated",
      TRUE ~ NA_character_
    ),
    religious_none = factor(religious_none)
  )

# Calculate proportion "none" by year
none_trends <- gss_relig %>%
  filter(!is.na(religious_none)) %>%
  group_by(year) %>%
  summarize(
    n = n(),
    prop_none = mean(religious_none == "None"),
    .groups = "drop"
  )

none_trends

## # A tibble: 33 × 3
##     year     n prop_none
##    <int> <int>     <dbl>
##  1  1972  1608    0.0516
##  2  1973  1500    0.064 
##  3  1974  1483    0.0681
##  4  1975  1488    0.0759
##  5  1976  1497    0.0762
##  6  1977  1523    0.0611
##  7  1978  1528    0.0779
##  8  1980  1465    0.0717
##  9  1982  1848    0.0714
## 10  1983  1595    0.0734
## # ℹ 23 more rows

Let’s visualize this trend:

# Create visualization of religious none trend
ggplot(none_trends, 
       aes(x = year, y = prop_none)) +
  geom_line(size = 1.2, color = "#2C3E50") +
  scale_y_continuous(
    labels = scales::percent_format(),
    limits = c(0, 0.3),
    breaks = seq(0, 0.3, by = 0.05)
  ) +
  scale_x_continuous(
    breaks = seq(1975, 2020, by = 5)
  ) +
  labs(
    title = "The Rise of Religious 'Nones' in the United States",
    subtitle = "Proportion reporting no religious affiliation",
    x = NULL,
    y = "Proportion 'None'"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(color = "grey40", size = 12),
    axis.line = element_line(color = "black", size = 0.5),
    axis.ticks = element_line(color = "black", size = 0.5),
    panel.grid = element_blank(),
    plot.margin = margin(20, 20, 20, 20)
  )

Reference:

Voas, D., & Chaves, M. (2016). Is the United States a counterexample to the secularization thesis?. American Journal of Sociology, 121(5), 1517-1556.

Interpreting the Rise of Religious “Nones”

Looking at these proportions over time reveals a dramatic transformation in American religious identification:

1. Three Distinct Periods

Looking at the graph:

1972-1990: What do you notice?
1991-2010: What about now?
2011-2022: And now?

Pattern of Change - Not linear but accelerating

No evidence of plateau yet

Let’s examine how this religious change intersects with age groups:

# Create age groups and examine religious none proportions
gss_relig <- gss_relig %>%
  mutate(
    age_group = case_when(
      age < 30 ~ "18-29",
      age < 45 ~ "30-44",
      age < 60 ~ "45-59",
      TRUE ~ "60+"
    ),
    age_group = factor(age_group, 
                      levels = c("18-29", "30-44", "45-59", "60+"))
  )

# Calculate proportions by age group and year
age_none_trends <- gss_relig %>%
  filter(!is.na(religious_none), !is.na(age_group)) %>%
  group_by(age_group, year) %>%
  summarize(
    n = n(),
    prop_none = mean(religious_none == "None"),
    .groups = "drop"
  )
age_none_trends

## # A tibble: 132 × 4
##    age_group  year     n prop_none
##    <fct>     <int> <int>     <dbl>
##  1 18-29      1972   397    0.103 
##  2 18-29      1973   381    0.126 
##  3 18-29      1974   380    0.108 
##  4 18-29      1975   405    0.148 
##  5 18-29      1976   387    0.152 
##  6 18-29      1977   365    0.0959
##  7 18-29      1978   407    0.123 
##  8 18-29      1980   357    0.112 
##  9 18-29      1982   494    0.144 
## 10 18-29      1983   415    0.120 
## # ℹ 122 more rows

Let’s make the output clearer and more digestible:

# Create cleaner 5-year periods
age_none_periods <- age_none_trends %>%
  # Create 5-year periods
  mutate(
    period = case_when(
      year <= 1979 ~ "1975-1979",
      year <= 1984 ~ "1980-1984",
      year <= 1989 ~ "1985-1989",
      year <= 1994 ~ "1990-1994",
      year <= 1999 ~ "1995-1999",
      year <= 2004 ~ "2000-2004",
      year <= 2009 ~ "2005-2009",
      year <= 2014 ~ "2010-2014",
      year <= 2019 ~ "2015-2019",
      TRUE ~ "2020-2022"
    )
  ) %>%
  # Calculate average for each period and age group
  group_by(age_group, period) %>%
  summarize(
    n = sum(n),  # Total observations in period
    prop_none = mean(prop_none),  # Average proportion none
    .groups = "drop"
  ) %>%
  # Convert to percentage
  mutate(
    pct_none = round(prop_none * 100, 1)
  )

# Create wide format for clear period comparison
age_periods_wide <- age_none_periods %>%
  select(age_group, period, pct_none) %>%
  pivot_wider(
    names_from = period,
    values_from = pct_none
  )

# Display formatted table
age_periods_wide %>%
  knitr::kable(
    caption = "Religious Non-Affiliation by Age Group and Period (%)",
    align = c("l", rep("c", 10))
  )

Religious Non-Affiliation by Age Group and Period (%)
age_group	1975-1979	1980-1984	1985-1989	1990-1994	1995-1999	2000-2004	2005-2009	2010-2014	2015-2019	2020-2022
18-29	12.2	12.1	11.4	12.2	22.1	21.6	26.3	30.7	36.0	46.3
30-44	7.0	8.6	9.2	10.1	13.9	16.0	19.1	22.8	28.2	38.3
45-59	4.0	3.8	4.8	6.6	10.1	12.9	14.6	17.3	19.0	24.7
60+	3.3	3.2	3.0	4.0	6.0	6.8	9.0	11.8	13.4	20.3

Interpreting Religious Change Across Age Groups and Periods

Let’s create a clear interpretation followed by a visually enhanced table:

1. Key Historical Periods

1975-1989: Remarkable stability across all age groups
1990-1999: First notable increase, especially among young adults
2000-2014: Steady rises across all ages
2015-2022: Dramatic acceleration, particularly among under-45s

2. Age-Specific Patterns

Young Adults (18-29): Most dramatic change
- 1975-1989: Stable around 12%
- By 2020-2022: Reached 46.3%
Middle Adults (30-44): Similar pattern but lagged
- Started at 7%
- Reached 38.3% by 2020-2022
Older Groups: More modest but still substantial increases

# Create enhanced table with gt
age_periods_wide %>%
  gt() %>%
  # Format title and subtitle
  tab_header(
    title = md("**Religious Non-Affiliation Across Age Groups and Time**"),
    subtitle = "Percentage reporting no religious affiliation"
  ) %>%
  # Format columns
  fmt_number(
    columns = -1,
    decimals = 1
  ) %>%
  # Add color scales for values
  data_color(
    columns = -1,
    colors = scales::col_numeric(
      palette = c("#FFFFFF", "#f4a582", "#92c5de", "#0571b0"),
      domain = c(0, 50)
    )
  ) %>%
  # Format column headers
  cols_label(
    age_group = "Age Group"
  ) %>%
  # Add source note
  tab_source_note(
    source_note = "Source: General Social Survey, 1975-2022"
  ) %>%
  # Style elements
  tab_style(
    style = list(
      cell_text(weight = "bold")
    ),
    locations = cells_column_labels()
  )

Age Group	1975-1979	1980-1984	1985-1989	1990-1994	1995-1999	2000-2004	2005-2009	2010-2014	2015-2019	2020-2022
Religious Non-Affiliation Across Age Groups and Time
Percentage reporting no religious affiliation
18-29	12.2	12.1	11.4	12.2	22.1	21.6	26.3	30.7	36.0	46.3
30-44	7.0	8.6	9.2	10.1	13.9	16.0	19.1	22.8	28.2	38.3
45-59	4.0	3.8	4.8	6.6	10.1	12.9	14.6	17.3	19.0	24.7
60+	3.3	3.2	3.0	4.0	6.0	6.8	9.0	11.8	13.4	20.3
Source: General Social Survey, 1975-2022

3. Notable Transitions Highlighted

Major shift begins in mid-1990s
Acceleration in late 2000s
Most dramatic changes in post-2015 period
Pandemic period (2020-2022) shows highest recorded levels

This table highlights both cohort replacement (each generation more secular than previous) and period effects (all groups showing increases, especially recently).

Let’s visualize the trends:

ggplot(age_none_trends, 
       aes(x = year, y = prop_none, color = age_group)) +
  geom_line(size = 1.2) +
  # Use Nord color palette (inspired by Arctic ice and aurora colors)
  scale_color_manual(
    values = c(
      "18-29" = "#5E81AC",  # Steel blue
      "30-44" = "#81A1C1",  # Glacier blue
      "45-59" = "#88C0D0",  # Frost blue
      "60+" = "#B48EAD"     # Aurora purple
    )
  ) +
  scale_y_continuous(
    labels = scales::percent_format(),
    limits = c(0, 0.4),
    breaks = seq(0, 0.4, by = 0.1)
  ) +
  scale_x_continuous(
    breaks = seq(1975, 2020, by = 5)
  ) +
  labs(
    title = "Age Patterns in Religious Non-Affiliation",
    subtitle = "Young adults leading religious change",
    x = NULL,
    y = "Proportion 'None'",
    color = "Age Group"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(color = "grey40", size = 12),
    axis.line = element_line(color = "black", size = 0.5),
    axis.ticks = element_line(color = "black", size = 0.5),
    panel.grid = element_blank(),
    legend.position = "right",
    plot.margin = margin(20, 20, 20, 20)
  )

Understanding Probabilities in Social Research: From Basic to Conditional

The Foundation: Unconditional Probability

Let’s start with a fundamental concept: unconditional (or marginal) probability. This is the probability of an event occurring across an entire population, without considering any other factors.

Example with Religious Non-Affiliation: If we observe that 25% of Americans report no religious affiliation, we can write this as:

P(None) = 0.25

This tells us that if we randomly selected someone from the population, there’s a 25% chance they would be religiously unaffiliated. But this single number hides important variations across social groups.

Moving to Conditional Probability

Conditional probability helps us understand how the likelihood of an outcome varies across different groups. It answers questions like: “What’s the probability of being religiously unaffiliated among college graduates?” or “What’s the probability of being unaffiliated among those under 30?”

Mathematically, we write conditional probability as P(A|B), read as “probability of A given B.”

Building Intuition:

Imagine examining religious non-affiliation across education levels:

Overall (unconditional): 25% unaffiliated
Among those with:
- Less than high school: 15% unaffiliated
- High school: 20% unaffiliated
- College degree: 35% unaffiliated
- Graduate degree: 40% unaffiliated

We would write:

P(None) = 0.25                             # Unconditional probability
P(None|Less than HS) = 0.15               # Conditional on education
P(None|Graduate Degree) = 0.40

Why Conditional Probabilities Matter

Revealing Hidden Patterns
- The overall rate (in the example = 25%) masks significant variation
- Some groups have much higher/lower probabilities
- These differences often reveal important social processes
Understanding Group Differences
- Compare probabilities across meaningful social categories
- Identify which characteristics are strongly associated with outcomes
- Guide theoretical understanding of social phenomena

Calculating Conditional Probabilities

In practice, we calculate conditional probabilities by:

Identifying the specific group (condition)
Counting occurrences within that group
Dividing by the total size of the group

P(A|B) = Number of cases with both A and B / Total number of cases with B

For example:

P(None|College Grad) = Number of unaffiliated college graduates / Total number of college graduates

Building a Framework for Social Pattern Exploration: Starting with Key Demographics

When exploring social patterns, sociologists often begin by examining differences across key demographic characteristics. While the specific variables chosen should always be guided by theory and research questions, we’ll demonstrate our framework using five common demographic dimensions. This example should not be taken as prescriptive - different research questions will require different variables and approaches.

Let’s start by examining and cleaning our variables:

# First, let's examine our variables of interest
cat("Education Categories:\n")

## Education Categories:

table(gss$degree)

## 
##         less than high school                   high school 
##                         14192                         36446 
##      associate/junior college                    bachelor's 
##                          4355                         11248 
##                      graduate                    don't know 
##                          5953                             0 
##                           iap            I don't have a job 
##                             0                             0 
##                   dk, na, iap                     no answer 
##                             0                             0 
##    not imputable_(2147483637)    not imputable_(2147483638) 
##                             0                             0 
##                       refused                skipped on web 
##                             0                             0 
##                    uncodeable not available in this release 
##                             0                             0 
##    not available in this year                  see codebook 
##                             0                             0

cat("\nRacial Categories:\n")

## 
## Racial Categories:

table(gss$race)

## 
##                         white                         black 
##                         57657                         10215 
##                         other                    don't know 
##                          4411                             0 
##                           iap            I don't have a job 
##                             0                             0 
##                   dk, na, iap                     no answer 
##                             0                             0 
##    not imputable_(2147483637)    not imputable_(2147483638) 
##                             0                             0 
##                       refused                skipped on web 
##                             0                             0 
##                    uncodeable not available in this release 
##                             0                             0 
##    not available in this year                  see codebook 
##                             0                             0

cat("\nGender Categories:\n")

## 
## Gender Categories:

table(gss$sex)

## 
##                          male                        female 
##                         31977                         40301 
##                    don't know                           iap 
##                             0                             0 
##            I don't have a job                   dk, na, iap 
##                             0                             0 
##                     no answer    not imputable_(2147483637) 
##                             0                             0 
##    not imputable_(2147483638)                       refused 
##                             0                             0 
##                skipped on web                    uncodeable 
##                             0                             0 
## not available in this release    not available in this year 
##                             0                             0 
##                  see codebook 
##                             0

cat("\nAge Distribution:\n")

## 
## Age Distribution:

summary(gss$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   18.00   32.00   44.00   46.56   60.00   89.00     769

cat("\nIncome Distribution (in constant dollars):\n")

## 
## Income Distribution (in constant dollars):

summary(gss$realinc)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     218   12081   24139   32537   40756  162607    7478

Before recoding, let’s consider each variable carefully:

1. Education

Have both degrees and years
Need meaningful groupings that reflect educational transitions
Consider implications of combining categories

2. Race

Three broad categories in GSS
Important to note limitations of these categories
Consider sample size implications

3. Gender

Binary categories in GSS
Important to acknowledge limitations
Consider how to discuss this sensitively and insightfully

4. Age

Continuous variable (18-89)
Can consder/justify groupings
Consider life course stages

5. Income

Substantial missing data
Need to account for inflation
Consider relative vs. absolute measures

Let’s create our cleaned variables:

# Create clean demographic variables
gss_demo <- gss %>%
    mutate(
    # Create religious none indicator
    religious_none = case_when(
      relig == "none" ~ "None",
      relig %in% c("protestant", "catholic", "jewish", 
                   "buddhism", "hinduism", "muslim/islam",
                   "orthodox-christian", "christian", 
                   "inter-nondenominational", "other") ~ "Affiliated",
      TRUE ~ NA_character_
    ),
    # Education (4 meaningful categories)
    education = case_when(
      degree == "less than high school" ~ "Less than High School",
      degree == "high school" ~ "High School",
      degree %in% c("associate/junior college", "bachelor's") ~ "Some College/BA",
      degree == "graduate" ~ "Graduate Degree",
      TRUE ~ NA_character_
    ),
    # Create factor with meaningful order
    education = factor(education, 
                      levels = c("Less than High School", "High School",
                               "Some College/BA", "Graduate Degree")),
    
    # Race (maintain GSS categories while noting limitations)
    race = case_when(
      race == "white" ~ "White",
      race == "black" ~ "Black",
      race == "other" ~ "Other",
      TRUE ~ NA_character_
    ),
    
    # Gender (binary in GSS)
    gender = case_when(
      sex == "male" ~ "Male",
      sex == "female" ~ "Female",
      TRUE ~ NA_character_
    ),
    
    # Age groups (life course stages)
    age_group = case_when(
      age < 30 ~ "18-29",
      age < 45 ~ "30-44",
      age < 60 ~ "45-59",
      TRUE ~ "60+"
    ),
    age_group = factor(age_group, 
                      levels = c("18-29", "30-44", "45-59", "60+")),
    
    # Income quartiles (relative position)
    income_group = ntile(realinc, 4),
    income_group = factor(income_group,
                         labels = c("Bottom 25%", "Lower Middle",
                                  "Upper Middle", "Top 25%"))
  )

# Verify our cleaning
check_missing <- gss_demo %>%
  summarize(across(c(education, race, gender, age_group, income_group),
                  ~sum(is.na(.))))

print("Missing Values in Each Variable:")

## [1] "Missing Values in Each Variable:"

check_missing

##   education race gender age_group income_group
## 1       196  107    112         0         7478

Group Patterns: Proportions and Conditional Probabilities

Let’s start by examining educational differences:

# Calculate educational patterns with key statistics
education_patterns <- gss_demo %>%
  filter(!is.na(education), !is.na(religious_none)) %>%
  # First get total sample size
  mutate(total_n = n()) %>%
  # Then group by education
  group_by(education) %>%
  summarize(
    # Group size
    n = n(),
    # Distribution
    pct_of_sample = n/first(total_n) * 100,
    # Conditional probability
    n_none = sum(religious_none == "None"),
    pct_none = n_none/n * 100
  )

# Create formatted table
edu_table <- education_patterns %>%
  gt() %>%
  tab_header(
    title = md("**Educational Differences in Religious Non-Affiliation**"),
    subtitle = "Distribution and conditional probabilities"
  ) %>%
  fmt_number(
    columns = n,
    decimals = 0,
    use_seps = TRUE
  ) %>%
  fmt_number(
    columns = c(pct_of_sample, pct_none),
    decimals = 1
  ) %>%
  cols_label(
    education = "Education Level",
    n = "Group Size",
    pct_of_sample = "% of Sample",
    n_none = "Number Unaffiliated",
    pct_none = "% Unaffiliated"
  ) %>%
  tab_footnote(
    footnote = "'% Unaffiliated' shows conditional probability: P(None|Education Level)",
    locations = cells_column_labels(columns = pct_none)
  ) %>%
  tab_source_note(
    source_note = "Data: General Social Survey"
  )
edu_table

Education Level	Group Size	% of Sample	Number Unaffiliated	% Unaffiliated¹
Educational Differences in Religious Non-Affiliation
Distribution and conditional probabilities
Less than High School	13,742	20.1	1280	9.3
High School	34,611	50.7	4208	12.2
Some College/BA	14,445	21.2	2375	16.4
Graduate Degree	5,437	8.0	1034	19.0
Data: General Social Survey
¹ '% Unaffiliated' shows conditional probability: P(None\|Education Level)

Interpreting Educational Patterns

Looking at our table, we observe several key patterns:

1. Sample Distribution

High school graduates form the largest group (50.7% of sample)
About one-fifth have less than high school (20.1%)
Similar proportion have some college/BA (21.2%)
Graduate degree holders are the smallest group (8.0%)

2. Conditional Probabilities

We see a clear educational gradient in religious non-affiliation:

Less than high school: 9.3% unaffiliated
High school: 12.2% unaffiliated
Some college/BA: 16.4% unaffiliated
Graduate degree: 19.0% unaffiliated

This suggests:

Higher education associated with greater likelihood of non-affiliation
Each additional level of education shows increased probability
Nearly double the rate of non-affiliation between lowest and highest education

Now, let’s visualize these patterns:

# Calculate conditional probabilities for education
edu_conditional <- gss_demo %>%
  filter(!is.na(education), !is.na(religious_none)) %>%
  group_by(education) %>%
  summarize(
    n = n(),
    n_none = sum(religious_none == "None"),
    p_none = n_none/n,
    pct_none = p_none * 100
  )

# Create visualization
edu_plot <- ggplot(edu_conditional, 
       aes(x = pct_none, y = reorder(education, pct_none))) +
  # Add main bars
  geom_col(
    width = 0.6,
    fill = "#114B5F",
    alpha = 0.9
  ) +
  # Add point at end of each bar
  geom_point(
    size = 3,
    color = "#028090"
  ) +
  # Add value labels
  geom_text(
    aes(label = sprintf("%.1f%%", pct_none)),
    hjust = -0.5,
    family = "sans",
    size = 4,
    fontface = "bold",
    color = "#114B5F"
  ) +
  # Clean scales
  scale_x_continuous(
    limits = c(0, max(edu_conditional$pct_none) * 1.2),
    breaks = seq(0, 20, by = 5),
    labels = scales::label_number(suffix = "%")
  ) +
  # Add labels
  labs(
    title = "Educational Differences in Religious Non-Affiliation",
    subtitle = "Conditional Probability: P(None | Education Level)",
    caption = "Data: General Social Survey",
    x = "Probability of No Religious Affiliation",
    y = NULL
  ) +
  # Custom theme
  theme_minimal() +
  theme(
    # Text elements
    plot.title = element_text(
      family = "sans",
      face = "bold",
      size = 16,
      margin = margin(b = 10)
    ),
    plot.subtitle = element_text(
      family = "sans",
      color = "#114B5F",
      size = 12,
      margin = margin(b = 20)
    ),
    plot.caption = element_text(
      color = "grey40",
      margin = margin(t = 20)
    ),
    # Axis formatting
    axis.text = element_text(
      family = "sans",
      size = 11,
      color = "grey20"
    ),
    axis.text.y = element_text(
      face = "bold"
    ),
    axis.title.x = element_text(
      margin = margin(t = 10),
      color = "grey20"
    ),
    # Grid lines
    panel.grid.major.x = element_line(
      color = "grey95",
      size = 0.3
    ),
    panel.grid.minor = element_blank(),
    panel.grid.major.y = element_blank(),
    # Add breathing room
    plot.margin = margin(30, 30, 30, 30)
  )

edu_plot

Learning a fundamental skill: Saving

Suppose you were really happy with the table and visual you created. You want to save it so you do not need to always re-run the code or to insert in your report – well, you can save (and in your preferred format). Let’s look at how to save for Word and PDFs.

For the tables:

# Identify saved name object
edu_table

Education Level	Group Size	% of Sample	Number Unaffiliated	% Unaffiliated¹
Educational Differences in Religious Non-Affiliation
Distribution and conditional probabilities
Less than High School	13,742	20.1	1280	9.3
High School	34,611	50.7	4208	12.2
Some College/BA	14,445	21.2	2375	16.4
Graduate Degree	5,437	8.0	1034	19.0
Data: General Social Survey
¹ '% Unaffiliated' shows conditional probability: P(None\|Education Level)

# You can select and copy directly to Word

Now let’s save the table to PDF (you would remove hashtags):

# Save to PDF
#edu_table %>%
 # gtsave("education_patterns.pdf")

We can do the same for the visual (you would remove hashtags):

edu_plot

# Save to PDF
ggsave(
  "edu_plot.pdf",
  plot = edu_plot,
  width = 10,
  height = 6
)

Now, let’s examine age patterns in religious non-affiliation:

# Calculate age patterns with key statistics
age_patterns <- gss_demo %>%
  filter(!is.na(age_group), !is.na(religious_none)) %>%
  # First get total sample size
  mutate(total_n = n()) %>%
  # Then group by age
  group_by(age_group) %>%
  summarize(
    # Group size
    n = n(),
    # Distribution
    pct_of_sample = n/first(total_n) * 100,
    # Conditional probability
    n_none = sum(religious_none == "None"),
    pct_none = n_none/n * 100
  )

# Create formatted table
age_patterns %>%
  gt() %>%
  tab_header(
    title = md("**Age Differences in Religious Non-Affiliation**"),
    subtitle = "Distribution and conditional probabilities"
  ) %>%
  fmt_number(
    columns = n,
    decimals = 0,
    use_seps = TRUE
  ) %>%
  fmt_number(
    columns = c(pct_of_sample, pct_none),
    decimals = 1
  ) %>%
  cols_label(
    age_group = "Age Group",
    n = "Group Size",
    pct_of_sample = "% of Sample",
    n_none = "Number Unaffiliated",
    pct_none = "% Unaffiliated"
  ) %>%
  tab_footnote(
    footnote = "'% Unaffiliated' shows conditional probability: P(None|Age Group)",
    locations = cells_column_labels(columns = pct_none)
  ) %>%
  tab_source_note(
    source_note = "Data: General Social Survey"
  )

Age Group	Group Size	% of Sample	Number Unaffiliated	% Unaffiliated¹
Age Differences in Religious Non-Affiliation
Distribution and conditional probabilities
18-29	13,744	20.1	2623	19.1
30-44	20,695	30.3	3120	15.1
45-59	16,294	23.8	1798	11.0
60+	17,662	25.8	1377	7.8
Data: General Social Survey
¹ '% Unaffiliated' shows conditional probability: P(None\|Age Group)

Interpreting Age Patterns

Looking at our table, we observe several key patterns:

1. Sample Distribution

Middle-aged adults (30-44) form the largest group (30.3% of sample)
Relatively even distribution across other age groups:
- 60+ (25.8%)
- 45-59 (23.8%)
- 18-29 (20.1%)

2. Conditional Probabilities

We see a clear and strong age gradient in religious non-affiliation:

Young adults (18-29): 19.1% unaffiliated
Early midlife (30-44): 15.1% unaffiliated
Later midlife (45-59): 11.0% unaffiliated
Older adults (60+): 7.8% unaffiliated

This suggests:

More than double the rate of non-affiliation among youngest vs. oldest
Steady decline across age groups rather than sharp breaks

Let’s visualize these patterns:

# Create visualization with new color scheme
ggplot(age_patterns, 
       aes(x = pct_none, y = reorder(age_group, pct_none))) +
  # Add main bars
  geom_col(
    width = 0.6,
    fill = "#2E86AB",  # New color for age patterns
    alpha = 0.9
  ) +
  # Add point at end of each bar
  geom_point(
    size = 3,
    color = "#084B83"
  ) +
  # Add value labels
  geom_text(
    aes(label = sprintf("%.1f%%", pct_none)),
    hjust = -0.5,
    family = "sans",
    size = 4,
    fontface = "bold",
    color = "#2E86AB"
  ) +
  # Clean scales
  scale_x_continuous(
    limits = c(0, max(age_patterns$pct_none) * 1.2),
    breaks = seq(0, 20, by = 5),
    labels = scales::label_number(suffix = "%")
  ) +
  # Add labels
  labs(
    title = "Age Differences in Religious Non-Affiliation",
    subtitle = "Conditional Probability: P(None | Age Group)",
    caption = "Data: General Social Survey",
    x = "Probability of No Religious Affiliation",
    y = NULL
  ) +
  # Custom theme
  theme_minimal() +
  theme(
    # Text elements
    plot.title = element_text(
      family = "sans",
      face = "bold",
      size = 16,
      margin = margin(b = 10)
    ),
    plot.subtitle = element_text(
      family = "sans",
      color = "#2E86AB",
      size = 12,
      margin = margin(b = 20)
    ),
    plot.caption = element_text(
      color = "grey40",
      margin = margin(t = 20)
    ),
    # Axis formatting
    axis.text = element_text(
      family = "sans",
      size = 11,
      color = "grey20"
    ),
    axis.text.y = element_text(
      face = "bold"
    ),
    axis.title.x = element_text(
      margin = margin(t = 10),
      color = "grey20"
    ),
    # Grid lines
    panel.grid.major.x = element_line(
      color = "grey95",
      size = 0.3
    ),
    panel.grid.minor = element_blank(),
    panel.grid.major.y = element_blank(),
    # Add breathing room
    plot.margin = margin(30, 30, 30, 30)
  )

Let’s examine racial patterns in religious non-affiliation:

# Calculate racial patterns with key statistics
race_patterns <- gss_demo %>%
  filter(!is.na(race), !is.na(religious_none)) %>%
  # First get total sample size
  mutate(total_n = n()) %>%
  # Then group by race
  group_by(race) %>%
  summarize(
    # Group size
    n = n(),
    # Distribution
    pct_of_sample = n/first(total_n) * 100,
    # Conditional probability
    n_none = sum(religious_none == "None"),
    pct_none = n_none/n * 100
  )

# Create formatted table
race_patterns %>%
  gt() %>%
  tab_header(
    title = md("**Racial Differences in Religious Non-Affiliation**"),
    subtitle = "Distribution and conditional probabilities"
  ) %>%
  fmt_number(
    columns = n,
    decimals = 0,
    use_seps = TRUE
  ) %>%
  fmt_number(
    columns = c(pct_of_sample, pct_none),
    decimals = 1
  ) %>%
  cols_label(
    race = "Race",
    n = "Group Size",
    pct_of_sample = "% of Sample",
    n_none = "Number Unaffiliated",
    pct_none = "% Unaffiliated"
  ) %>%
  tab_footnote(
    footnote = "'% Unaffiliated' shows conditional probability: P(None|Race)",
    locations = cells_column_labels(columns = pct_none)
  ) %>%
  tab_source_note(
    source_note = "Data: General Social Survey"
  )

Race	Group Size	% of Sample	Number Unaffiliated	% Unaffiliated¹
Racial Differences in Religious Non-Affiliation
Distribution and conditional probabilities
Black	9,566	14.0	955	10.0
Other	3,916	5.7	687	17.5
White	54,860	80.3	7253	13.2
Data: General Social Survey
¹ '% Unaffiliated' shows conditional probability: P(None\|Race)

Interpreting Racial Patterns

Looking at our table, we observe two key dimensions:

1. Sample Distribution

White respondents form a clear majority (80.3% of sample)
Black respondents represent about one-seventh (14.0%)
Other racial groups comprise a small share (5.7%)

2. Conditional Probabilities

We see notable racial variation in religious non-affiliation:

Other racial groups: 17.5% unaffiliated
White respondents: 13.2% unaffiliated
Black respondents: 10.0% unaffiliated

Key patterns:

‘Other’ racial groups show highest rate of non-affiliation
7.5 percentage point gap between highest and lowest groups
Black respondents most likely to maintain religious affiliation

# Create visualization with new color scheme
ggplot(race_patterns, 
       aes(x = pct_none, y = reorder(race, pct_none))) +
  # Add main bars
  geom_col(
    width = 0.6,
    fill = "#7E6B8F",  # New purple color for racial patterns
    alpha = 0.9
  ) +
  # Add point at end of each bar
  geom_point(
    size = 3,
    color = "#4A4063"
  ) +
  # Add value labels
  geom_text(
    aes(label = sprintf("%.1f%%", pct_none)),
    hjust = -0.5,
    family = "sans",
    size = 4,
    fontface = "bold",
    color = "#7E6B8F"
  ) +
  # Clean scales
  scale_x_continuous(
    limits = c(0, max(race_patterns$pct_none) * 1.2),
    breaks = seq(0, 20, by = 5),
    labels = scales::label_number(suffix = "%")
  ) +
  # Add labels
  labs(
    title = "Racial Differences in Religious Non-Affiliation",
    subtitle = "Conditional Probability: P(None | Race)",
    caption = "Data: General Social Survey",
    x = "Probability of No Religious Affiliation",
    y = NULL
  ) +
  # Custom theme
  theme_minimal() +
  theme(
    # Text elements
    plot.title = element_text(
      family = "sans",
      face = "bold",
      size = 16,
      margin = margin(b = 10)
    ),
    plot.subtitle = element_text(
      family = "sans",
      color = "#7E6B8F",
      size = 12,
      margin = margin(b = 20)
    ),
    plot.caption = element_text(
      color = "grey40",
      margin = margin(t = 20)
    ),
    # Axis formatting
    axis.text = element_text(
      family = "sans",
      size = 11,
      color = "grey20"
    ),
    axis.text.y = element_text(
      face = "bold"
    ),
    axis.title.x = element_text(
      margin = margin(t = 10),
      color = "grey20"
    ),
    # Grid lines
    panel.grid.major.x = element_line(
      color = "grey95",
      size = 0.3
    ),
    panel.grid.minor = element_blank(),
    panel.grid.major.y = element_blank(),
    # Add breathing room
    plot.margin = margin(30, 30, 30, 30)
  )

# Calculate income patterns with key statistics
income_patterns <- gss_demo %>%
  filter(!is.na(income_group), !is.na(religious_none)) %>%
  # First get total sample size
  mutate(total_n = n()) %>%
  # Then group by income
  group_by(income_group) %>%
  summarize(
    # Group size
    n = n(),
    # Distribution
    pct_of_sample = n/first(total_n) * 100,
    # Conditional probability
    n_none = sum(religious_none == "None"),
    pct_none = n_none/n * 100
  )

# Create formatted table
income_patterns %>%
  gt() %>%
  tab_header(
    title = md("**Income Differences in Religious Non-Affiliation**"),
    subtitle = "Distribution and conditional probabilities"
  ) %>%
  fmt_number(
    columns = n,
    decimals = 0,
    use_seps = TRUE
  ) %>%
  fmt_number(
    columns = c(pct_of_sample, pct_none),
    decimals = 1
  ) %>%
  cols_label(
    income_group = "Income Group",
    n = "Group Size",
    pct_of_sample = "% of Sample",
    n_none = "Number Unaffiliated",
    pct_none = "% Unaffiliated"
  ) %>%
  tab_footnote(
    footnote = "'% Unaffiliated' shows conditional probability: P(None|Income Group)",
    locations = cells_column_labels(columns = pct_none)
  ) %>%
  tab_source_note(
    source_note = "Data: General Social Survey"
  )

Income Group	Group Size	% of Sample	Number Unaffiliated	% Unaffiliated¹
Income Differences in Religious Non-Affiliation
Distribution and conditional probabilities
Bottom 25%	15,359	24.9	2089	13.6
Lower Middle	15,397	25.0	1993	12.9
Upper Middle	15,628	25.4	1907	12.2
Top 25%	15,178	24.7	2121	14.0
Data: General Social Survey
¹ '% Unaffiliated' shows conditional probability: P(None\|Income Group)

Interpreting Income Patterns

Looking at our table, we observe two interesting dimensions:

1. Sample Distribution

Almost perfectly even quartiles (by design)
Each group represents approximately 25% of the sample
This enables clear comparison across income levels

2. Conditional Probabilities

We see a surprising pattern in religious non-affiliation:

Top 25%: 14.0% unaffiliated
Bottom 25%: 13.6% unaffiliated
Lower Middle: 12.9% unaffiliated
Upper Middle: 12.2% unaffiliated

Key insights:

Relatively flat relationship with income
Slight U-shape pattern
Only 1.8 percentage point difference between highest and lowest rates

# Create visualization with new color scheme
ggplot(income_patterns, 
       aes(x = pct_none, y = reorder(income_group, pct_none))) +
  # Add main bars
  geom_col(
    width = 0.6,
    fill = "#D4B483",  # Warm gold color for income patterns
    alpha = 0.9
  ) +
  # Add point at end of each bar
  geom_point(
    size = 3,
    color = "#B68D4C"
  ) +
  # Add value labels
  geom_text(
    aes(label = sprintf("%.1f%%", pct_none)),
    hjust = -0.5,
    family = "sans",
    size = 4,
    fontface = "bold",
    color = "#D4B483"
  ) +
  # Clean scales
  scale_x_continuous(
    limits = c(0, max(income_patterns$pct_none) * 1.2),
    breaks = seq(0, 15, by = 3),
    labels = scales::label_number(suffix = "%")
  ) +
  # Add labels
  labs(
    title = "Income Differences in Religious Non-Affiliation",
    subtitle = "Conditional Probability: P(None | Income Group)",
    caption = "Data: General Social Survey",
    x = "Probability of No Religious Affiliation",
    y = NULL
  ) +
  # Custom theme
  theme_minimal() +
  theme(
    # Text elements
    plot.title = element_text(
      family = "sans",
      face = "bold",
      size = 16,
      margin = margin(b = 10)
    ),
    plot.subtitle = element_text(
      family = "sans",
      color = "#D4B483",
      size = 12,
      margin = margin(b = 20)
    ),
    plot.caption = element_text(
      color = "grey40",
      margin = margin(t = 20)
    ),
    # Axis formatting
    axis.text = element_text(
      family = "sans",
      size = 11,
      color = "grey20"
    ),
    axis.text.y = element_text(
      face = "bold"
    ),
    axis.title.x = element_text(
      margin = margin(t = 10),
      color = "grey20"
    ),
    # Grid lines
    panel.grid.major.x = element_line(
      color = "grey95",
      size = 0.3
    ),
    panel.grid.minor = element_blank(),
    panel.grid.major.y = element_blank(),
    # Add breathing room
    plot.margin = margin(30, 30, 30, 30)
  )

Let’s examine gender patterns in religious non-affiliation:

# Calculate gender patterns with key statistics
gender_patterns <- gss_demo %>%
  filter(!is.na(gender), !is.na(religious_none)) %>%
  # First get total sample size
  mutate(total_n = n()) %>%
  # Then group by gender
  group_by(gender) %>%
  summarize(
    # Group size
    n = n(),
    # Distribution
    pct_of_sample = n/first(total_n) * 100,
    # Conditional probability
    n_none = sum(religious_none == "None"),
    pct_none = n_none/n * 100
  )

# Create formatted table
gender_patterns %>%
  gt() %>%
  tab_header(
    title = md("**Gender Differences in Religious Non-Affiliation**"),
    subtitle = "Distribution and conditional probabilities"
  ) %>%
  fmt_number(
    columns = n,
    decimals = 0,
    use_seps = TRUE
  ) %>%
  fmt_number(
    columns = c(pct_of_sample, pct_none),
    decimals = 1
  ) %>%
  cols_label(
    gender = "Gender",
    n = "Group Size",
    pct_of_sample = "% of Sample",
    n_none = "Number Unaffiliated",
    pct_none = "% Unaffiliated"
  ) %>%
  tab_footnote(
    footnote = "'% Unaffiliated' shows conditional probability: P(None|Gender)",
    locations = cells_column_labels(columns = pct_none)
  ) %>%
  tab_source_note(
    source_note = "Data: General Social Survey"
  )

Gender	Group Size	% of Sample	Number Unaffiliated	% Unaffiliated¹
Gender Differences in Religious Non-Affiliation
Distribution and conditional probabilities
Female	38,189	55.9	3936	10.3
Male	30,158	44.1	4969	16.5
Data: General Social Survey
¹ '% Unaffiliated' shows conditional probability: P(None\|Gender)

Interpreting Gender Patterns

Looking at our table, we observe two distinct dimensions:

1. Sample Distribution

Women form a majority of respondents (55.9% of sample)
Men comprise 44.1% of the sample
This roughly matches population gender distribution

2. Conditional Probabilities

We see a substantial gender gap in religious non-affiliation:

Men: 16.5% unaffiliated
Women: 10.3% unaffiliated
6.2 percentage point difference between genders

Key insights:

Clear gender difference in religious affiliation
Men about 60% more likely to be unaffiliated

# Create visualization with new color scheme
ggplot(gender_patterns, 
       aes(x = pct_none, y = reorder(gender, pct_none))) +
  # Add main bars
  geom_col(
    width = 0.6,
    fill = "#386FA4",  # Steel blue for gender patterns
    alpha = 0.9
  ) +
  # Add point at end of each bar
  geom_point(
    size = 3,
    color = "#133C55"
  ) +
  # Add value labels
  geom_text(
    aes(label = sprintf("%.1f%%", pct_none)),
    hjust = -0.5,
    family = "sans",
    size = 4,
    fontface = "bold",
    color = "#386FA4"
  ) +
  # Clean scales
  scale_x_continuous(
    limits = c(0, max(gender_patterns$pct_none) * 1.2),
    breaks = seq(0, 20, by = 5),
    labels = scales::label_number(suffix = "%")
  ) +
  # Add labels
  labs(
    title = "Gender Differences in Religious Non-Affiliation",
    subtitle = "Conditional Probability: P(None | Gender)",
    caption = "Data: General Social Survey",
    x = "Probability of No Religious Affiliation",
    y = NULL
  ) +
  # Custom theme
  theme_minimal() +
  theme(
    # Text elements
    plot.title = element_text(
      family = "sans",
      face = "bold",
      size = 16,
      margin = margin(b = 10)
    ),
    plot.subtitle = element_text(
      family = "sans",
      color = "#386FA4",
      size = 12,
      margin = margin(b = 20)
    ),
    plot.caption = element_text(
      color = "grey40",
      margin = margin(t = 20)
    ),
    # Axis formatting
    axis.text = element_text(
      family = "sans",
      size = 11,
      color = "grey20"
    ),
    axis.text.y = element_text(
      face = "bold"
    ),
    axis.title.x = element_text(
      margin = margin(t = 10),
      color = "grey20"
    ),
    # Grid lines
    panel.grid.major.x = element_line(
      color = "grey95",
      size = 0.3
    ),
    panel.grid.minor = element_blank(),
    panel.grid.major.y = element_blank(),
    # Add breathing room
    plot.margin = margin(30, 30, 30, 30)
  )

# Create separate trend calculations first
education_trends <- gss_demo %>%
  filter(!is.na(education), !is.na(religious_none)) %>%
  group_by(year, education) %>%
  summarize(
    n = n(),
    p_none = mean(religious_none == "None"),
    group_type = "Education",
    category = as.character(first(education)),
    .groups = "drop"
  )

age_trends <- gss_demo %>%
  filter(!is.na(age_group), !is.na(religious_none)) %>%
  group_by(year, age_group) %>%
  summarize(
    n = n(),
    p_none = mean(religious_none == "None"),
    group_type = "Age",
    category = as.character(first(age_group)),
    .groups = "drop"
  )

race_trends <- gss_demo %>%
  filter(!is.na(race), !is.na(religious_none)) %>%
  group_by(year, race) %>%
  summarize(
    n = n(),
    p_none = mean(religious_none == "None"),
    group_type = "Race",
    category = as.character(first(race)),
    .groups = "drop"
  )

gender_trends <- gss_demo %>%
  filter(!is.na(gender), !is.na(religious_none)) %>%
  group_by(year, gender) %>%
  summarize(
    n = n(),
    p_none = mean(religious_none == "None"),
    group_type = "Gender",
    category = as.character(first(gender)),
    .groups = "drop"
  )

# Combine all trends
all_trends <- bind_rows(
  education_trends,
  age_trends,
  race_trends,
  gender_trends
) %>%
  # Calculate start and end values
  group_by(group_type, category) %>%
  mutate(
    start_value = first(p_none[year == min(year)]),
    end_value = last(p_none[year == max(year)])
  ) %>%
  ungroup() %>%
  # Identify extreme categories
  mutate(
    highlight = case_when(
      start_value == max(start_value, na.rm = TRUE) ~ "High Start",
      start_value == min(start_value, na.rm = TRUE) ~ "Low Start",
      end_value == max(end_value, na.rm = TRUE) ~ "High End",
      end_value == min(end_value, na.rm = TRUE) ~ "Low End",
      TRUE ~ "Other"
    )
  )

# Create visualization
ggplot(all_trends, aes(x = year, y = p_none, 
                       group = interaction(group_type, category),
                       color = highlight)) +
  # Add background lines
  geom_line(data = filter(all_trends, highlight == "Other"),
            color = "grey80", alpha = 0.3, size = 0.5) +
  # Add highlighted lines
  geom_line(data = filter(all_trends, highlight != "Other"),
            size = 1.2) +
  # Add labels for highlighted groups
  geom_label_repel(
    data = filter(all_trends, 
                  highlight != "Other" & year == max(year)),
    aes(label = paste0(category, "\n(", group_type, ")")),
    direction = "y",
    nudge_x = 2,
    size = 3,
    fontface = "bold",
    box.padding = 0.5
  ) +
  # Color scheme
  scale_color_manual(
    values = c(
      "High Start" = "#E41A1C",
      "Low Start" = "#377EB8",
      "High End" = "#4DAF4A",
      "Low End" = "#984EA3",
      "Other" = "grey80"
    )
  ) +
  # Format scales
  scale_y_continuous(
    labels = scales::percent_format(),
    limits = c(0, 0.5)
  ) +
  scale_x_continuous(
    breaks = seq(1980, 2020, by = 5),
    limits = c(1980, 2027)
  ) +
  # Labels
  labs(
    title = "Demographic Patterns in Religious Non-Affiliation (1975-2022)",
    subtitle = "Highlighting groups with extreme starting and ending positions",
    x = NULL,
    y = "Proportion Unaffiliated",
    color = "Pattern Type"
  ) +
  # Theme
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(color = "grey40", size = 12),
    legend.position = "bottom",
    panel.grid.minor = element_blank(),
    panel.grid.major = element_line(color = "grey95"),
    axis.line = element_line(color = "black", size = 0.5)
  )

Tutorial 4

Sébastien Parker

SOC3320 - Winter 2025

Learning a fundamental skill: Saving

Interpreting Age Patterns

Interpreting Racial Patterns

Interpreting Income Patterns

Interpreting Gender Patterns

End