Introduction to the General Social Survey
The General Social Survey (GSS) is one of the foundational data
sources for understanding social change in the United States. Started in
1972, the GSS:
- Assigns questionnaires to a nationally representative sample of U.S.
adults
- Covers a wide range of social, political, and economic topics
- Provides crucial data for tracking long-term trends
- Allows researchers to examine relationships between social
characteristics
For this tutorial, we’ll use GSS data through 2022, focusing on
understanding social patterns through careful data analysis and
visualization.
# Load the full GSS dataset
gss <- read_fst("gss2022.fst")
# Initial look at data dimensions
dim(gss)
## [1] 72390 6646
Let’s begin by examining attitudes about gender roles in the family.
The GSS has asked respondents whether they agree or disagree with the
statement: “It is much better for everyone involved if the man is the
achiever outside the home and the woman takes care of the home and
family” (fefam
)
Link here to documentation: https://gssdataexplorer.norc.org/variables/706/vshow
Understanding Our Data
First, let’s examine how this variable appears in our dataset:
# Initial examination
table(gss$fefam)
##
## strongly agree agree
## 2810 9839
## disagree strongly disagree
## 15198 7284
## don't know iap
## 0 0
## I don't have a job dk, na, iap
## 0 0
## no answer not imputable_(2147483637)
## 0 0
## not imputable_(2147483638) refused
## 0 0
## skipped on web uncodeable
## 0 0
## not available in this release not available in this year
## 0 0
## see codebook
## 0
# View unique values to understand coding
unique(gss$fefam)
## [1] <NA> agree strongly agree strongly disagree
## [5] disagree
## 17 Levels: strongly agree agree disagree strongly disagree don't know ... see codebook
Cleaning and Recoding
Before analyzing patterns, we need to clean our data and
categories:
# Clean and recode fefam
gss_cleaned <- gss %>%
mutate(
gender_roles = case_when(
fefam == "strongly agree" ~ "Strongly Agree",
fefam == "agree" ~ "Agree",
fefam == "disagree" ~ "Disagree",
fefam == "strongly disagree" ~ "Strongly Disagree",
TRUE ~ NA_character_
),
# Create ordered factor for proper sorting
gender_roles = factor(gender_roles,
levels = c("Strongly Agree", "Agree",
"Disagree", "Strongly Disagree"))
)
table(gss_cleaned$gender_roles)
##
## Strongly Agree Agree Disagree Strongly Disagree
## 2810 9839 15198 7284
Let’s confirm the range of years the survey item was asked:
# Check years where fefam was asked
gss %>%
filter(!is.na(fefam)) %>%
distinct(year) %>%
arrange(year) %>%
pull()
## [1] 1977 1985 1986 1988 1989 1990 1991 1993 1994 1996 1998 2000 2002 2004 2006
## [16] 2008 2010 2012 2014 2016 2018 2021 2022
This check reveals that the GSS has asked about gender role attitudes
consistently from 1977 onward, providing us with over four decades of
data to examine social change. But importantly it did not start at the
beginning of the survey in 1972.
# Assign object name and refer to dataset name
gender_summary <- gss_cleaned %>%
# Remove missing values from gender_roles variable
filter(!is.na(gender_roles)) %>%
# Count number of responses for each category
count(gender_roles) %>%
# Create new columns for proportions and percentages
mutate(
# Calculate proportion (dividing count by total sum)
proportion = n/sum(n),
# Convert to percentage and round to 1 decimal place
percentage = round(proportion * 100, 1)
)
# Display raw summary to check our work
gender_summary
## gender_roles n proportion percentage
## 1 Strongly Agree 2810 0.07998634 8.0
## 2 Agree 9839 0.28006604 28.0
## 3 Disagree 15198 0.43260938 43.3
## 4 Strongly Disagree 7284 0.20733825 20.7
Understanding the Basic Distribution
Looking at our initial summary of gender role attitudes, several key
patterns emerge:
Overall:
A clear majority (64%) of respondents disagree with traditional
gender roles, with 43.3% disagreeing and 20.7% strongly disagreeing.
This suggests that most Americans reject the strict separation of gender
roles where men work outside the home while women focus on domestic
responsibilities. But we can ask: has this changed over time? Do some
groups showcase clear divides or patterns?
Support for Traditional Roles:
Despite overall opposition, a substantial minority (36%) still
supports traditional gender roles
This asymmetry in the strength of views suggests more opposition
than support for traditional gender roles
Most of this support is moderate (28% “Agree”) rather than strong
(8% “Strongly Agree”)
This distribution provides our baseline understanding, but to fully
grasp patterns in gender role attitudes, we’ll need to examine how they
vary across different social groups and how they’ve changed over
time.
Creating Clear Tables
Let’s create a professional table showing these distributions:
# Create a formatted table using gt package
gender_summary %>%
# Initialize gt table
gt() %>%
# Rename column headers for clearer presentation
cols_label(
gender_roles = "Response Category", # Change from 'gender_roles' to 'Response Category'
n = "N", # Change from 'n' to 'N'
proportion = "Proportion",
percentage = "Percent" # Change from 'percentage' to 'Percent'
) %>%
# Format proportion column as percentages
fmt_percent(
columns = proportion, # Specify which column to format
decimals = 1 # Show one decimal place
) %>%
# Format percentage column with one decimal
fmt_number(
columns = percentage, # Specify which column to format
decimals = 1 # Show one decimal place
) %>%
# Add title and subtitle to table
tab_header(
# Use md() for markdown formatting - makes title bold
title = md("**Table 1. Attitudes Toward Traditional Gender Roles**"),
subtitle = "General Social Survey, 1977-2022"
) %>%
# Add source note and question wording below table
tab_source_note(
source_note = md("*Note:* Responses to: 'It is much better for everyone involved if the man is the achiever outside the home and the woman takes care of the home and family.' Sample includes all valid responses across survey years.")
) %>%
# Make column headers bold
tab_style(
style = cell_text(weight = "bold"),
locations = cells_column_labels()
) %>%
# Customize table appearance
tab_options(
# Add thick borders at top and bottom of table
table.border.top.width = 2,
table.border.bottom.width = 2,
# Add thick border below column headers
column_labels.border.bottom.width = 2,
# Set font sizes for different table elements
table.font.size = px(12),
heading.title.font.size = px(14),
heading.subtitle.font.size = px(12),
source_notes.font.size = px(10),
# Adjust spacing between rows
data_row.padding = px(4)
) %>%
# Format numbers with comma separator for thousands
fmt_integer(
columns = n,
sep_mark = ","
)
Table 1. Attitudes Toward Traditional Gender Roles |
General Social Survey, 1977-2022 |
Response Category |
N |
Proportion |
Percent |
Strongly Agree |
2,810 |
8.0% |
8.0 |
Agree |
9,839 |
28.0% |
28.0 |
Disagree |
15,198 |
43.3% |
43.3 |
Strongly Disagree |
7,284 |
20.7% |
20.7 |
Note: Responses to: ‘It is much better for everyone involved if the man is the achiever outside the home and the woman takes care of the home and family.’ Sample includes all valid responses across survey years. |
Visualization
Now let’s create a visualization of these proportions:
# Create an improved visualization of gender role attitudes
ggplot(gender_summary,
aes(x = gender_roles, y = proportion)) +
# Add bars with a professional color and slight transparency
geom_col(
fill = "#2c3e50", # Deep blue color
alpha = 0.9, # Slight transparency
width = 0.7 # Slightly thinner bars
) +
# Format y-axis as percentages and set limits
scale_y_continuous(
labels = scales::percent_format(), # Convert to percentage format
limits = c(0, 0.5), # Set y-axis limits
expand = c(0, 0) # Remove spacing at bottom
) +
# Add clear labels
labs(
title = "Distribution of Attitudes Toward Traditional Gender Roles",
subtitle = "General Social Survey, 1977-2022",
x = NULL, # Remove x-axis label (redundant with categories)
y = "Proportion of Respondents"
) +
# Apply clean theme
theme_minimal() +
# Customize theme elements
theme(
# Customize title appearance
plot.title = element_text(
face = "bold",
size = 14,
margin = margin(b = 10)
),
plot.subtitle = element_text(
color = "grey40",
size = 12,
margin = margin(b = 20)
),
# Clean up axes
axis.line = element_line(color = "black"),
axis.ticks = element_line(color = "black"),
# Remove gridlines
panel.grid = element_blank(),
# Adjust text sizing and appearance
axis.text = element_text(size = 11),
axis.title = element_text(size = 11),
# Add some margin around the plot
plot.margin = margin(20, 20, 20, 20)
)

Let’s continue building from our foundational visualization to
examine patterns over time. This next section will help us understand
how gender role attitudes have evolved:
Examining Change Over Time
# Calculate proportions by year
gender_trends <- gss_cleaned %>%
# Remove missing values
filter(!is.na(gender_roles), !is.na(year)) %>%
# Group by year to get annual proportions
group_by(year) %>%
# Count responses per category per year
count(gender_roles) %>%
# Calculate proportions within each year
mutate(
total = sum(n),
proportion = n/total
) %>%
# Ungroup for further operations
ungroup()
gender_trends
## # A tibble: 92 × 5
## year gender_roles n total proportion
## <int> <fct> <int> <int> <dbl>
## 1 1977 Strongly Agree 275 1503 0.183
## 2 1977 Agree 714 1503 0.475
## 3 1977 Disagree 422 1503 0.281
## 4 1977 Strongly Disagree 92 1503 0.0612
## 5 1985 Strongly Agree 150 1502 0.0999
## 6 1985 Agree 577 1502 0.384
## 7 1985 Disagree 574 1502 0.382
## 8 1985 Strongly Disagree 201 1502 0.134
## 9 1986 Strongly Agree 132 1444 0.0914
## 10 1986 Agree 557 1444 0.386
## # ℹ 82 more rows
# Get last points for each category for label placement
last_points <- gender_trends %>%
group_by(gender_roles) %>%
slice_max(order_by = year, n = 1)
# Create improved visualization with direct labels
ggplot(gender_trends,
aes(x = year, y = proportion, color = gender_roles)) +
# Add trend lines
geom_line(size = 1.2) +
# Use Heiss-inspired color palette
scale_color_manual(
values = c(
"Strongly Agree" = "#6C3C89", # Deep purple
"Agree" = "#C098D0", # Light purple
"Disagree" = "#48A0A8", # Teal
"Strongly Disagree" = "#005F73" # Deep teal
)
) +
# Add direct labels at end of lines
geom_label_repel(
data = last_points,
aes(label = gender_roles),
nudge_x = 2,
direction = "y",
hjust = 0,
segment.size = 0.3,
segment.color = "grey70",
box.padding = 0.5,
point.padding = 0.5,
size = 3.5,
fontface = "bold",
label.size = 0.1,
label.r = unit(0.2, "lines"),
fill = alpha("white", 0.7)
) +
# Format axes
scale_y_continuous(
labels = scales::percent_format(),
limits = c(0, 0.6),
breaks = seq(0, 0.6, by = 0.1),
expand = c(0.01, 0.01)
) +
scale_x_continuous(
breaks = seq(1975, 2020, by = 5),
limits = c(1972, 2027), # Extended for labels
expand = c(0.01, 0.01)
) +
# Add labels
labs(
title = "The Decline of Traditional Gender Role Attitudes",
subtitle = "Agreement with: 'Better if man works, woman tends home'",
x = NULL,
y = "Proportion of Respondents"
) +
# Apply clean theme
theme_minimal() +
# Customize theme elements
theme(
plot.title = element_text(
family = "sans",
face = "bold",
size = 14,
margin = margin(b = 10)
),
plot.subtitle = element_text(
family = "sans",
color = "grey40",
size = 12,
margin = margin(b = 20)
),
axis.line = element_line(color = "black", size = 0.5),
axis.ticks = element_line(color = "black", size = 0.5),
panel.grid.major.x = element_blank(),
panel.grid.minor = element_blank(),
panel.grid.major.y = element_line(color = "grey90", size = 0.3),
legend.position = "none", # Remove legend since using direct labels
plot.margin = margin(20, 20, 20, 20)
)

Understanding Changes in Gender Role Attitudes
Looking at our visualization, we can identify several clear
patterns:
Overall Trend
Support for traditional gender roles has steadily declined since
1977, but it is driven by a stee “Agree” decline, falling from nearly
50% to under 20%
Opposition (both “Disagree” and “Strongly Disagree”) has become
the majority position
By 2022, over 75% of respondents oppose traditional gender
roles
Changes continue but at a slower pace after 2000
This example shows us how examining proportions over time helps
identify social changes that raw counts might miss. The steady decline
in traditional views reflects broader social transformations in American
society.
Building on Patterns: Adding Educational Differences
Education has long been identified as one of the strongest predictors
of gender attitudes (see Davis and Greenstein 2009 [Annual Review of
Sociology] for review). Let’s examine this relationship in our data.
First, let’s check our education variable:
# Examine education variable
table(gss$degree)
##
## less than high school high school
## 14192 36446
## associate/junior college bachelor's
## 4355 11248
## graduate don't know
## 5953 0
## iap I don't have a job
## 0 0
## dk, na, iap no answer
## 0 0
## not imputable_(2147483637) not imputable_(2147483638)
## 0 0
## refused skipped on web
## 0 0
## uncodeable not available in this release
## 0 0
## not available in this year see codebook
## 0 0
# Check years where both variables are available
gss %>%
filter(!is.na(degree), !is.na(fefam)) %>%
distinct(year) %>%
arrange(year) %>%
pull()
## [1] 1977 1985 1986 1988 1989 1990 1991 1993 1994 1996 1998 2000 2002 2004 2006
## [16] 2008 2010 2012 2014 2016 2018 2021 2022
Following standard practice in the literature, we’ll slightly recode
education into four categories:
Other standards include doing a 2 or 3 category recode (remember,
always justify your choices!)
# Recode education into standard categories
gss_edu <- gss_cleaned %>%
mutate(
education = case_when(
degree == "less than high school" ~ "Less than High School",
degree == "high school" ~ "High School",
degree %in% c("associate/junior college") ~ "Some College",
degree %in% c("bachelor's", "graduate") ~ "College Graduate",
TRUE ~ NA_character_
),
# Create ordered factor for proper sorting
education = factor(education,
levels = c("Less than High School",
"High School",
"Some College",
"College Graduate"))
)
# Verify recoding and distribution
table(gss_edu$education, useNA = "ifany")
##
## Less than High School High School Some College
## 14192 36446 4355
## College Graduate <NA>
## 17201 196
This coding scheme aligns with standard practice in the literature
(e.g., Davis and Greenstein 2009; Bolzendahl and Myers 2004) and allows
for comparison with other studies while maintaining meaningful
distinctions in educational attainment.
Now let’s calculate proportions of gender attitudes by education
edu_attitudes <- gss_edu %>%
# Remove missing values
filter(!is.na(education), !is.na(gender_roles)) %>%
# Group by education level
group_by(education) %>%
# Calculate proportions within each education level
count(gender_roles) %>%
mutate(
total = sum(n),
proportion = n/total
)
edu_attitudes
## # A tibble: 16 × 5
## # Groups: education [4]
## education gender_roles n total proportion
## <fct> <fct> <int> <int> <dbl>
## 1 Less than High School Strongly Agree 915 5665 0.162
## 2 Less than High School Agree 2478 5665 0.437
## 3 Less than High School Disagree 1769 5665 0.312
## 4 Less than High School Strongly Disagree 503 5665 0.0888
## 5 High School Strongly Agree 1392 17752 0.0784
## 6 High School Agree 5137 17752 0.289
## 7 High School Disagree 8068 17752 0.454
## 8 High School Strongly Disagree 3155 17752 0.178
## 9 Some College Strongly Agree 146 2437 0.0599
## 10 Some College Agree 534 2437 0.219
## 11 Some College Disagree 1159 2437 0.476
## 12 Some College Strongly Disagree 598 2437 0.245
## 13 College Graduate Strongly Agree 347 9209 0.0377
## 14 College Graduate Agree 1675 9209 0.182
## 15 College Graduate Disagree 4170 9209 0.453
## 16 College Graduate Strongly Disagree 3017 9209 0.328
And visualize:
# Create visualization of educational differences
ggplot(edu_attitudes,
aes(x = education, y = proportion, fill = gender_roles)) +
# Create proportional stacked bars
geom_col(width = 0.7) +
# Use clean color scheme
scale_fill_manual(
values = c(
"Strongly Agree" = "#6C3C89", # Deep purple
"Agree" = "#C098D0", # Light purple
"Disagree" = "#48A0A8", # Teal
"Strongly Disagree" = "#005F73" # Deep teal
)
) +
# Format y-axis as percentages
scale_y_continuous(
labels = scales::percent_format(),
expand = c(0, 0)
) +
# Add clear labels
labs(
title = "Educational Differences in Gender Role Attitudes",
subtitle = "Agreement with: 'Better if man works, woman tends home'",
x = "Educational Attainment",
y = "Proportion of Respondents",
fill = "Response"
) +
# Apply clean theme
theme_minimal() +
# Customize theme elements
theme(
plot.title = element_text(
face = "bold",
size = 14,
margin = margin(b = 10)
),
plot.subtitle = element_text(
color = "grey40",
size = 12,
margin = margin(b = 20)
),
axis.text.x = element_text(angle = 0),
legend.position = "right",
panel.grid.major.x = element_blank(),
panel.grid.minor = element_blank(),
panel.grid.major.y = element_line(color = "grey90"),
axis.line = element_line(color = "black", size = 0.5),
plot.margin = margin(20, 20, 20, 20)
)

Interpreting Educational Differences in Gender Role Attitudes
Looking at our visualization and proportions, we see a strong pattern
between education and gender role attitudes:
1. Clear Educational Gradient
2. Strength of Opposition
3. Pattern of Responses
Less educated: More concentrated in moderate positions
More educated: Stronger tendency toward disagreement, especially
strong disagreement
“Disagree” is most common response for all groups except less
than high school
Let’s extend this analysis by examining whether these educational
differences have remained stable over time:
# Calculate proportions by education and year
edu_time_trends <- gss_edu %>%
# Remove missing values
filter(!is.na(education), !is.na(gender_roles), !is.na(year)) %>%
# Group by education and year
group_by(education, year) %>%
# Calculate the proportion who agree/strongly agree (traditional views)
summarize(
n = n(),
traditional = mean(gender_roles %in% c("Strongly Agree", "Agree")),
.groups = "drop"
)
edu_time_trends
## # A tibble: 92 × 4
## education year n traditional
## <fct> <int> <int> <dbl>
## 1 Less than High School 1977 525 0.804
## 2 Less than High School 1985 398 0.691
## 3 Less than High School 1986 393 0.695
## 4 Less than High School 1988 236 0.682
## 5 Less than High School 1989 210 0.576
## 6 Less than High School 1990 172 0.616
## 7 Less than High School 1991 203 0.645
## 8 Less than High School 1993 187 0.578
## 9 Less than High School 1994 320 0.569
## 10 Less than High School 1996 360 0.586
## # ℹ 82 more rows
# Get last points for label placement
last_points <- edu_time_trends %>%
group_by(education) %>%
slice_max(order_by = year, n = 1)
# Create improved visualization
ggplot(edu_time_trends,
aes(x = year, y = traditional, color = education)) +
# Add smoothed lines
geom_line(size = 1.2) +
# Use clean color scheme
scale_color_manual(
values = c(
"Less than High School" = "#6C3C89",
"High School" = "#C098D0",
"Some College" = "#48A0A8",
"College Graduate" = "#005F73"
)
) +
# Add direct labels at end of lines
geom_label_repel(
data = last_points,
aes(label = education),
nudge_x = 2,
direction = "y",
hjust = 0,
segment.size = 0.3,
segment.color = "grey70",
box.padding = 0.5,
point.padding = 0.5,
size = 3.5,
fontface = "bold",
label.size = 0.1,
label.r = unit(0.2, "lines"),
fill = alpha("white", 0.7)
) +
# Format axes
scale_y_continuous(
labels = scales::percent_format(),
limits = c(0, 1.0),
breaks = seq(0, 1.0, by = 0.2),
expand = c(0.01, 0.01)
) +
scale_x_continuous(
breaks = seq(1975, 2020, by = 5),
limits = c(1972, 2027), # Extended for labels
expand = c(0.01, 0.01)
) +
# Add labels
labs(
title = "Educational Differences in Gender Role Attitudes Over Time",
subtitle = "Proportion agreeing: 'Better if man works, woman tends home'",
x = NULL,
y = "Proportion Supporting Traditional Roles"
) +
# Apply clean theme
theme_minimal() +
# Customize theme elements
theme(
plot.title = element_text(
face = "bold",
size = 14,
margin = margin(b = 10)
),
plot.subtitle = element_text(
color = "grey40",
size = 12,
margin = margin(b = 20)
),
axis.line = element_line(color = "black", size = 0.5),
axis.ticks = element_line(color = "black", size = 0.5),
panel.grid = element_blank(),
axis.text = element_text(size = 10),
legend.position = "none",
plot.margin = margin(20, 20, 20, 20)
)

Let’s discuss, what would be your takeaway?
From Description to Group Differences
To understand group differences more systematically, we’ll:
Calculate differences between groups
Examine changes in these differences over time
Build measures of divergence/convergence
Let’s start with partisan differences in gender attitudes. Variable
information available here:
https://gssdataexplorer.norc.org/variables/141/vshow
# First check party ID variable
table(gss$partyid)
##
## strong democrat not very strong democrat
## 11795 14286
## independent, close to democrat independent (neither, no response)
## 8663 11540
## independent, close to republican not very strong republican
## 6378 10678
## strong republican other party
## 7273 1292
## don't know iap
## 0 0
## I don't have a job dk, na, iap
## 0 0
## no answer not imputable_(2147483637)
## 0 0
## not imputable_(2147483638) refused
## 0 0
## skipped on web uncodeable
## 0 0
## not available in this release not available in this year
## 0 0
## see codebook
## 0
As per the standard in the literature, let’s group into three
categories (R/I/D):
# Recode party ID into three categories
gss_party <- gss_edu %>%
mutate(
party = case_when(
partyid %in% c("strong democrat",
"not very strong democrat",
"independent, close to democrat") ~ "Democrat",
partyid %in% c("independent (neither, no response)",
"other party") ~ "Independent",
partyid %in% c("strong republican",
"not very strong republican",
"independent, close to republican") ~ "Republican",
TRUE ~ NA_character_
),
# Create factor with meaningful order
party = factor(party, levels = c("Democrat", "Independent", "Republican"))
)
# Verify recoding
table(gss_party$party, useNA = "ifany")
##
## Democrat Independent Republican <NA>
## 34744 12832 24329 485
# Calculate party differences in gender attitudes over time
party_trends <- gss_party %>%
# Remove missing values and independents for D-R comparison
filter(!is.na(party),
party != "Independent",
!is.na(gender_roles),
!is.na(year)) %>%
# Group by party and year
group_by(party, year) %>%
# Calculate proportion traditional for each party-year
summarize(
n = n(),
traditional = mean(gender_roles %in% c("Strongly Agree", "Agree")),
.groups = "drop"
)
Understanding Group Differences: Measuring the Partisan Gap
The code for calculating partisan differences demonstrates a
fundamental approach to measuring group divergence:
First, we reshape our data from long to wide format using
pivot_wider()
, creating separate columns for Democratic and
Republican proportions
Then, we calculate the difference between groups (Republican -
Democrat)
A positive difference indicates Republicans are more traditional;
negative means Democrats are more traditional
This simple subtraction provides an intuitive measure of partisan
divergence at each time point (and whether it is increasing or
decreasing)
# Calculate partisan difference by year
partisan_diff <- party_trends %>%
select(-n) %>%
pivot_wider(
names_from = party,
values_from = traditional
) %>%
mutate(
diff = Republican - Democrat # Positive values = Republicans more traditional
)
partisan_diff
## # A tibble: 23 × 4
## year Democrat Republican diff
## <int> <dbl> <dbl> <dbl>
## 1 1977 0.676 0.656 -0.0198
## 2 1985 0.477 0.505 0.0280
## 3 1986 0.467 0.492 0.0251
## 4 1988 0.409 0.455 0.0459
## 5 1989 0.404 0.432 0.0277
## 6 1990 0.399 0.418 0.0197
## 7 1991 0.415 0.411 -0.00370
## 8 1993 0.337 0.391 0.0539
## 9 1994 0.324 0.388 0.0645
## 10 1996 0.349 0.456 0.107
## # ℹ 13 more rows
Understanding pivot_wider
Let’s look at a simple example to understand what
pivot_wider()
does to our data:
# Original data (long format)
party_example <- data.frame(
year = c(2020, 2020),
party = c("Democrat", "Republican"),
traditional = c(0.32, 0.44)
)
# Show original format
print("Original (Long) Format:")
## [1] "Original (Long) Format:"
## year party traditional
## 1 2020 Democrat 0.32
## 2 2020 Republican 0.44
# Create wide format
party_example_wide <- party_example %>%
pivot_wider(
names_from = party, # Column that becomes new columns
values_from = traditional # Values that fill new columns
)
# Show transformed format
print("After pivot_wider (Wide Format):")
## [1] "After pivot_wider (Wide Format):"
## # A tibble: 1 × 3
## year Democrat Republican
## <dbl> <dbl> <dbl>
## 1 2020 0.32 0.44
pivot_wider()
transforms our data from “long” format
(where Democrat and Republican values are in rows) to “wide” format
(where they appear as separate columns). This makes it easy to calculate
differences between groups – we can simply subtract one column from
another.
In our case:
Before: Each row had a party and its proportion
After: Each row has both Democrat and Republican
proportions
This allows us to calculate Republican - Democrat
difference
Let’s visualize both the trends and difference:
# Create visualization of party trends
ggplot(party_trends,
aes(x = year, y = traditional, color = party)) +
# Add smoothed lines
geom_line(size = 1.2) +
# Use clear party colors
scale_color_manual(
values = c(
"Democrat" = "#2E74C0", # Blue
"Republican" = "#CB454A" # Red
)
) +
# Add direct labels at end of lines
geom_label_repel(
data = party_trends %>%
group_by(party) %>%
slice_max(order_by = year, n = 1),
aes(label = party),
nudge_x = 2,
direction = "y",
hjust = 0,
segment.size = 0.3,
segment.color = "grey70",
box.padding = 0.5,
point.padding = 0.5,
size = 3.5,
fontface = "bold",
label.size = 0.1,
label.r = unit(0.2, "lines"),
fill = alpha("white", 0.7)
) +
# Format axes
scale_y_continuous(
labels = scales::percent_format(),
limits = c(0, 0.8),
breaks = seq(0, 0.8, by = 0.2)
) +
scale_x_continuous(
breaks = seq(1975, 2020, by = 5),
limits = c(1975, 2027)
) +
# Add labels
labs(
title = "Partisan Differences in Gender Role Attitudes",
subtitle = "Proportion supporting traditional gender roles by party identification",
x = NULL,
y = "Proportion Supporting Traditional Roles"
) +
# Clean theme
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(color = "grey40", size = 12),
axis.line = element_line(color = "black", size = 0.5),
axis.ticks = element_line(color = "black", size = 0.5),
panel.grid = element_blank(),
legend.position = "none",
plot.margin = margin(20, 20, 20, 20)
)

Distribution Changes: From Overlap to Divergence
When examining changes in attitudes between groups over time, density
plots offer powerful insights into:
Overall distribution shape
Areas of overlap between groups
Central tendency differences
Spread of attitudes within groups
For gender attitudes, we can examine how Democrats and Republicans
have potentially “sorted” over time by comparing distributions at two
time points. This helps us understand not just average differences, but
how the full range of attitudes has changed within each party.
# Prepare data for distribution comparison
party_dist <- gss_party %>%
# Keep only Democrats and Republicans at start/end points
filter(!is.na(party),
party != "Independent",
!is.na(gender_roles),
year %in% c(1977, 2022)) %>%
# Convert year to factor and ensure proper ordering of attitudes
mutate(
year = factor(year),
# Create numeric version of gender_roles for density estimation
attitude_score = as.numeric(factor(gender_roles,
levels = c("Strongly Agree", "Agree", "Disagree", "Strongly Disagree")
))
)
# Create density plot comparison
ggplot(party_dist,
aes(x = attitude_score,
fill = party)) +
# Create separate panels for each time point
facet_wrap(~year,
nrow = 2,
labeller = labeller(year = c(
"1977" = "1977: High Overlap in Views",
"2022" = "2022: Clear Partisan Division"
))) +
# Add density curves
geom_density(alpha = 0.5) +
# Use standard party colors
scale_fill_manual(
values = c(
"Democrat" = "#2E74C0",
"Republican" = "#CB454A"
)
) +
# Add labels for x-axis points
scale_x_continuous(
breaks = 1:4,
labels = c("Strongly\nAgree", "Agree", "Disagree", "Strongly\nDisagree"),
limits = c(0.5, 4.5)
) +
# Clean theme
theme_minimal() +
theme(
axis.line = element_line(color = "black"),
panel.grid = element_blank(),
legend.position = "right",
plot.title = element_text(face = "bold"),
axis.text.x = element_text(size = 9)
) +
# Add clear labels
labs(
title = "Partisan Divergence in Gender Attitudes",
subtitle = "Distribution of responses to: 'Better if man works, woman tends home'",
x = "More Progressive →",
y = "Density",
fill = "Party ID"
)

Let’s discuss: what do you observe?
Now let’s examine the partisan difference:
# Create plot to visualize partisan differences over time
ggplot(partisan_diff,
aes(x = year, y = diff)) +
# Add main trend line
geom_line(size = 1.2, color = "#2C3E50") +
# Add reference line at 0 (no difference between parties)
geom_hline(yintercept = 0, linetype = "dashed", color = "grey50") +
# Format y-axis as percentages with reasonable range and breaks
scale_y_continuous(
labels = scales::percent_format(),
limits = c(-0.2, 0.2),
breaks = seq(-0.2, 0.2, by = 0.1)
) +
# Set clear x-axis breaks by 5-year intervals
scale_x_continuous(
breaks = seq(1975, 2020, by = 5)
) +
# Add informative labels
labs(
title = "Partisan Gap in Gender Role Attitudes",
subtitle = "Difference in support for traditional roles (Republican - Democrat)",
x = NULL,
y = "Percentage Point Difference"
) +
# Apply clean theme
theme_minimal() +
# Customize theme elements for professional presentation
theme(
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(color = "grey40", size = 12),
axis.line = element_line(color = "black", size = 0.5),
axis.ticks = element_line(color = "black", size = 0.5),
panel.grid = element_blank(),
plot.margin = margin(20, 20, 20, 20)
)

Interpreting Partisan Differences in Gender Attitudes
The data reveals a clear pattern of partisan divergence in gender
role attitudes since 1977:
- In 1977, virtually no partisan gap existed (-2 percentage
points)
- A consistent gap emerges in the late 1980s and early 1990s
- The gap widens substantially in the 2000s and 2010s
- By 2021-22, the partisan gap reaches its peak (~20 percentage
points)
- Republicans: ~33-35% traditional
- Democrats: ~10-16% traditional
This growing divide reflects increasing partisan sorting on gender
attitudes, with Republicans maintaining more traditional views while
Democrats show steeper decline in traditional attitudes.
Understanding Partisan Sorting: From Overlap to Polarization
Before diving into specific measures, let’s understand what we’re
trying to capture when comparing distributions. When examining whether
groups have become more different over time, we want to know:
How far apart are their typical positions (central
tendency)?
How much do group members vary in their views (spread)?
How much do the distributions overlap?
Statistical Foundations: Cohen’s d and Effect Size
Cohen’s d is a standardized measure of the difference between two
groups that helps us understand how much they differ while accounting
for variation within each group. The formula is:
Cohen's d = (Mean of Group 1 - Mean of Group 2) / Pooled Standard Deviation
where the pooled standard deviation combines the spread from both
groups:
Pooled SD = sqrt((SD1² + SD2²) / 2)
This measure is valuable because it:
Accounts for both difference in means and spread within
groups
Provides a standardized scale for comparison
Provides ~ benchmarks (~ small: 0.2, ~ medium: 0.5, ~ large:
0.8)
# Calculate detailed overlap statistics
overlap_detailed <- party_dist %>%
group_by(year) %>%
summarize(
# Mean values for each party
mean_dem = mean(attitude_score[party == "Democrat"]),
mean_rep = mean(attitude_score[party == "Republican"]),
# Standard deviations
sd_dem = sd(attitude_score[party == "Democrat"]),
sd_rep = sd(attitude_score[party == "Republican"]),
# Calculate difference and Cohen's d
mean_diff = mean_rep - mean_dem,
pooled_sd = sqrt((sd_dem^2 + sd_rep^2) / 2),
cohens_d = mean_diff / pooled_sd,
# Sample sizes
n_dem = sum(party == "Democrat"),
n_rep = sum(party == "Republican")
)
# Format results table
overlap_detailed %>%
mutate(across(where(is.numeric), ~round(., 3))) %>%
gt() %>%
tab_header(
title = "Measuring Partisan Sorting on Gender Attitudes",
subtitle = "Comparing distributions in 1977 and 2022"
) %>%
cols_label(
mean_dem = "Democrat Mean",
mean_rep = "Republican Mean",
sd_dem = "Democrat SD",
sd_rep = "Republican SD",
mean_diff = "Mean Difference",
pooled_sd = "Pooled SD",
cohens_d = "Cohen's d",
n_dem = "N (Dem)",
n_rep = "N (Rep)"
) %>%
tab_source_note(
source_note = "Note: Higher means indicate more progressive views"
)
Measuring Partisan Sorting on Gender Attitudes |
Comparing distributions in 1977 and 2022 |
year |
Democrat Mean |
Republican Mean |
Democrat SD |
Republican SD |
Mean Difference |
Pooled SD |
Cohen's d |
N (Dem) |
N (Rep) |
1977 |
2.199 |
2.217 |
0.796 |
0.814 |
0.018 |
0.805 |
0.022 |
863 |
456 |
2022 |
3.277 |
2.756 |
0.796 |
0.862 |
-0.521 |
0.830 |
-0.628 |
960 |
750 |
Note: Higher means indicate more progressive views |
Interpreting Partisan Sorting in Gender Attitudes (1977-2022)
Looking at our measurement of partisan differences in gender
attitudes between 1977 and 2022, we find clear evidence of increasing
partisan division:
1. From Unity to Division in Average Views
In 1977, Democrats (2.20) and Republicans (2.22) held nearly
identical average positions
By 2022, a clear gap emerged:
Democrats moved strongly progressive (3.28)
Republicans showed modest change (2.76)
Net difference of 0.52 points on our 4-point scale
2. Stability in Within-Party Variation
Democratic spread remained constant (SD ≈ 0.80)
Republican variation increased slightly (0.81 to 0.86)
This tells us sorting occurred through shifting positions, not
increasing within-party consensus
3. Quantifying the Growing Divide
This analysis reveals not just that parties diverged, but how they
diverged: through a substantial progressive shift among Democrats while
Republicans changed more modestly, all while maintaining similar levels
of internal diversity of views.
Comparing to other Social/Political Attitudes
This approach helps us understand sorting across different issues.
Looking at abortion views provides an important comparison:
# Prepare abortion attitude data
abortion_dist <- gss_party %>%
# Keep only Democrats and Republicans at same time points
filter(!is.na(party),
party != "Independent",
!is.na(abany),
year %in% c(1977, 2022)) %>%
mutate(
year = factor(year),
# Create numeric score (1 = oppose, 2 = support)
abortion_score = as.numeric(factor(abany, levels = c("no", "yes")))
)
# Create comparison plot
ggplot(abortion_dist,
aes(x = abortion_score,
fill = party)) +
facet_wrap(~year,
nrow = 2,
labeller = labeller(year = c(
"1977" = "1977: Mixed Views",
"2022" = "2022: Party Alignment"
))) +
geom_density(alpha = 0.5) +
scale_fill_manual(
values = c(
"Democrat" = "#2E74C0",
"Republican" = "#CB454A"
)
) +
scale_x_continuous(
breaks = 1:2,
labels = c("Oppose", "Support"),
limits = c(0.5, 2.5)
) +
theme_minimal() +
theme(
axis.line = element_line(color = "black"),
panel.grid = element_blank(),
legend.position = "right",
plot.title = element_text(face = "bold")
) +
labs(
title = "Partisan Sorting on Abortion Rights",
subtitle = "Distribution of support for legal abortion by party",
x = "More Progressive →",
y = "Density",
fill = "Party ID"
)

The abortion comparison raises a number of questions, notably whether
gender attitudes are unique or part of broader sorting and how different
issues may sort at different rates.
Turning to Religious Change
These partisan and educational patterns suggest deeper social
transformations. Another key area where we see substantial change is
religious identification, particularly the rise of the “religious
nones.”
Let’s first examine our religious identification variable in the
GSS:
# Initial check of religious categories
table(gss$relig)
##
## protestant catholic
## 38707 16498
## jewish none
## 1360 8918
## other buddhism
## 1135 245
## hinduism other eastern religions
## 130 41
## muslim/islam orthodox-christian
## 178 155
## christian native american
## 915 34
## inter-nondenominational don't know
## 154 0
## iap I don't have a job
## 0 0
## dk, na, iap no answer
## 0 0
## not imputable_(2147483637) not imputable_(2147483638)
## 0 0
## refused skipped on web
## 0 0
## uncodeable not available in this release
## 0 0
## not available in this year see codebook
## 0 0
# Check years available
gss %>%
filter(!is.na(relig)) %>%
distinct(year) %>%
arrange(year) %>%
pull()
## [1] 1972 1973 1974 1975 1976 1977 1978 1980 1982 1983 1984 1985 1986 1987 1988
## [16] 1989 1990 1991 1993 1994 1996 1998 2000 2002 2004 2006 2008 2010 2012 2014
## [31] 2016 2018 2021
Following research on religious change (e.g., Voas and Chaves 2016),
we’ll focus on the key contrast between religious affiliation and
non-affiliation:
# Create religious none indicator
gss_relig <- gss %>%
mutate(
religious_none = case_when(
relig == "none" ~ "None",
relig %in% c("protestant", "catholic", "jewish",
"buddhism", "hinduism", "muslim/islam",
"orthodox-christian", "christian",
"inter-nondenominational", "other") ~ "Affiliated",
TRUE ~ NA_character_
),
religious_none = factor(religious_none)
)
# Calculate proportion "none" by year
none_trends <- gss_relig %>%
filter(!is.na(religious_none)) %>%
group_by(year) %>%
summarize(
n = n(),
prop_none = mean(religious_none == "None"),
.groups = "drop"
)
none_trends
## # A tibble: 33 × 3
## year n prop_none
## <int> <int> <dbl>
## 1 1972 1608 0.0516
## 2 1973 1500 0.064
## 3 1974 1483 0.0681
## 4 1975 1488 0.0759
## 5 1976 1497 0.0762
## 6 1977 1523 0.0611
## 7 1978 1528 0.0779
## 8 1980 1465 0.0717
## 9 1982 1848 0.0714
## 10 1983 1595 0.0734
## # ℹ 23 more rows
Let’s visualize this trend:
# Create visualization of religious none trend
ggplot(none_trends,
aes(x = year, y = prop_none)) +
geom_line(size = 1.2, color = "#2C3E50") +
scale_y_continuous(
labels = scales::percent_format(),
limits = c(0, 0.3),
breaks = seq(0, 0.3, by = 0.05)
) +
scale_x_continuous(
breaks = seq(1975, 2020, by = 5)
) +
labs(
title = "The Rise of Religious 'Nones' in the United States",
subtitle = "Proportion reporting no religious affiliation",
x = NULL,
y = "Proportion 'None'"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(color = "grey40", size = 12),
axis.line = element_line(color = "black", size = 0.5),
axis.ticks = element_line(color = "black", size = 0.5),
panel.grid = element_blank(),
plot.margin = margin(20, 20, 20, 20)
)

Reference:
- Voas, D., & Chaves, M. (2016). Is the United States a
counterexample to the secularization thesis?. American Journal of
Sociology, 121(5), 1517-1556.
Interpreting the Rise of Religious “Nones”
Looking at these proportions over time reveals a dramatic
transformation in American religious identification:
1. Three Distinct Periods
Looking at the graph:
Pattern of Change - Not linear but accelerating
- No evidence of plateau yet
Let’s examine how this religious change intersects with age
groups:
# Create age groups and examine religious none proportions
gss_relig <- gss_relig %>%
mutate(
age_group = case_when(
age < 30 ~ "18-29",
age < 45 ~ "30-44",
age < 60 ~ "45-59",
TRUE ~ "60+"
),
age_group = factor(age_group,
levels = c("18-29", "30-44", "45-59", "60+"))
)
# Calculate proportions by age group and year
age_none_trends <- gss_relig %>%
filter(!is.na(religious_none), !is.na(age_group)) %>%
group_by(age_group, year) %>%
summarize(
n = n(),
prop_none = mean(religious_none == "None"),
.groups = "drop"
)
age_none_trends
## # A tibble: 132 × 4
## age_group year n prop_none
## <fct> <int> <int> <dbl>
## 1 18-29 1972 397 0.103
## 2 18-29 1973 381 0.126
## 3 18-29 1974 380 0.108
## 4 18-29 1975 405 0.148
## 5 18-29 1976 387 0.152
## 6 18-29 1977 365 0.0959
## 7 18-29 1978 407 0.123
## 8 18-29 1980 357 0.112
## 9 18-29 1982 494 0.144
## 10 18-29 1983 415 0.120
## # ℹ 122 more rows
Let’s make the output clearer and more digestible:
# Create cleaner 5-year periods
age_none_periods <- age_none_trends %>%
# Create 5-year periods
mutate(
period = case_when(
year <= 1979 ~ "1975-1979",
year <= 1984 ~ "1980-1984",
year <= 1989 ~ "1985-1989",
year <= 1994 ~ "1990-1994",
year <= 1999 ~ "1995-1999",
year <= 2004 ~ "2000-2004",
year <= 2009 ~ "2005-2009",
year <= 2014 ~ "2010-2014",
year <= 2019 ~ "2015-2019",
TRUE ~ "2020-2022"
)
) %>%
# Calculate average for each period and age group
group_by(age_group, period) %>%
summarize(
n = sum(n), # Total observations in period
prop_none = mean(prop_none), # Average proportion none
.groups = "drop"
) %>%
# Convert to percentage
mutate(
pct_none = round(prop_none * 100, 1)
)
# Create wide format for clear period comparison
age_periods_wide <- age_none_periods %>%
select(age_group, period, pct_none) %>%
pivot_wider(
names_from = period,
values_from = pct_none
)
# Display formatted table
age_periods_wide %>%
knitr::kable(
caption = "Religious Non-Affiliation by Age Group and Period (%)",
align = c("l", rep("c", 10))
)
Religious Non-Affiliation by Age Group and Period (%)
18-29 |
12.2 |
12.1 |
11.4 |
12.2 |
22.1 |
21.6 |
26.3 |
30.7 |
36.0 |
46.3 |
30-44 |
7.0 |
8.6 |
9.2 |
10.1 |
13.9 |
16.0 |
19.1 |
22.8 |
28.2 |
38.3 |
45-59 |
4.0 |
3.8 |
4.8 |
6.6 |
10.1 |
12.9 |
14.6 |
17.3 |
19.0 |
24.7 |
60+ |
3.3 |
3.2 |
3.0 |
4.0 |
6.0 |
6.8 |
9.0 |
11.8 |
13.4 |
20.3 |
Interpreting Religious Change Across Age Groups and Periods
Let’s create a clear interpretation followed by a visually enhanced
table:
1. Key Historical Periods
1975-1989: Remarkable stability across all age groups
1990-1999: First notable increase, especially among young
adults
2000-2014: Steady rises across all ages
2015-2022: Dramatic acceleration, particularly among
under-45s
2. Age-Specific Patterns
Young Adults (18-29): Most dramatic change
Middle Adults (30-44): Similar pattern but lagged
Older Groups: More modest but still substantial
increases
# Create enhanced table with gt
age_periods_wide %>%
gt() %>%
# Format title and subtitle
tab_header(
title = md("**Religious Non-Affiliation Across Age Groups and Time**"),
subtitle = "Percentage reporting no religious affiliation"
) %>%
# Format columns
fmt_number(
columns = -1,
decimals = 1
) %>%
# Add color scales for values
data_color(
columns = -1,
colors = scales::col_numeric(
palette = c("#FFFFFF", "#f4a582", "#92c5de", "#0571b0"),
domain = c(0, 50)
)
) %>%
# Format column headers
cols_label(
age_group = "Age Group"
) %>%
# Add source note
tab_source_note(
source_note = "Source: General Social Survey, 1975-2022"
) %>%
# Style elements
tab_style(
style = list(
cell_text(weight = "bold")
),
locations = cells_column_labels()
)
Religious Non-Affiliation Across Age Groups and Time |
Percentage reporting no religious affiliation |
Age Group |
1975-1979 |
1980-1984 |
1985-1989 |
1990-1994 |
1995-1999 |
2000-2004 |
2005-2009 |
2010-2014 |
2015-2019 |
2020-2022 |
18-29 |
12.2 |
12.1 |
11.4 |
12.2 |
22.1 |
21.6 |
26.3 |
30.7 |
36.0 |
46.3 |
30-44 |
7.0 |
8.6 |
9.2 |
10.1 |
13.9 |
16.0 |
19.1 |
22.8 |
28.2 |
38.3 |
45-59 |
4.0 |
3.8 |
4.8 |
6.6 |
10.1 |
12.9 |
14.6 |
17.3 |
19.0 |
24.7 |
60+ |
3.3 |
3.2 |
3.0 |
4.0 |
6.0 |
6.8 |
9.0 |
11.8 |
13.4 |
20.3 |
Source: General Social Survey, 1975-2022 |
3. Notable Transitions Highlighted
Major shift begins in mid-1990s
Acceleration in late 2000s
Most dramatic changes in post-2015 period
Pandemic period (2020-2022) shows highest recorded
levels
This table highlights both cohort replacement (each generation more
secular than previous) and period effects (all groups showing increases,
especially recently).
Let’s visualize the trends:
ggplot(age_none_trends,
aes(x = year, y = prop_none, color = age_group)) +
geom_line(size = 1.2) +
# Use Nord color palette (inspired by Arctic ice and aurora colors)
scale_color_manual(
values = c(
"18-29" = "#5E81AC", # Steel blue
"30-44" = "#81A1C1", # Glacier blue
"45-59" = "#88C0D0", # Frost blue
"60+" = "#B48EAD" # Aurora purple
)
) +
scale_y_continuous(
labels = scales::percent_format(),
limits = c(0, 0.4),
breaks = seq(0, 0.4, by = 0.1)
) +
scale_x_continuous(
breaks = seq(1975, 2020, by = 5)
) +
labs(
title = "Age Patterns in Religious Non-Affiliation",
subtitle = "Young adults leading religious change",
x = NULL,
y = "Proportion 'None'",
color = "Age Group"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(color = "grey40", size = 12),
axis.line = element_line(color = "black", size = 0.5),
axis.ticks = element_line(color = "black", size = 0.5),
panel.grid = element_blank(),
legend.position = "right",
plot.margin = margin(20, 20, 20, 20)
)

Understanding Probabilities in Social Research: From Basic to
Conditional
The Foundation: Unconditional Probability
Let’s start with a fundamental concept: unconditional (or marginal)
probability. This is the probability of an event occurring across an
entire population, without considering any other factors.
Example with Religious Non-Affiliation: If we
observe that 25% of Americans report no religious affiliation, we can
write this as:
P(None) = 0.25
This tells us that if we randomly selected someone from the
population, there’s a 25% chance they would be religiously unaffiliated.
But this single number hides important variations across social
groups.
Moving to Conditional Probability
Conditional probability helps us understand how the likelihood of an
outcome varies across different groups. It answers questions like:
“What’s the probability of being religiously unaffiliated among college
graduates?” or “What’s the probability of being unaffiliated among those
under 30?”
Mathematically, we write conditional probability as P(A|B), read as
“probability of A given B.”
Building Intuition:
Imagine examining religious non-affiliation across education
levels:
We would write:
P(None) = 0.25 # Unconditional probability
P(None|Less than HS) = 0.15 # Conditional on education
P(None|Graduate Degree) = 0.40
Why Conditional Probabilities Matter
- Revealing Hidden Patterns
- The overall rate (in the example = 25%) masks significant
variation
- Some groups have much higher/lower probabilities
- These differences often reveal important social processes
- Understanding Group Differences
- Compare probabilities across meaningful social categories
- Identify which characteristics are strongly associated with
outcomes
- Guide theoretical understanding of social phenomena
Calculating Conditional Probabilities
In practice, we calculate conditional probabilities by:
Identifying the specific group (condition)
Counting occurrences within that group
Dividing by the total size of the group
P(A|B) = Number of cases with both A and B / Total number of cases with B
For example:
P(None|College Grad) = Number of unaffiliated college graduates / Total number of college graduates
Building a Framework for Social Pattern Exploration: Starting with
Key Demographics
When exploring social patterns, sociologists often begin by examining
differences across key demographic characteristics. While the specific
variables chosen should always be guided by theory and research
questions, we’ll demonstrate our framework using five common demographic
dimensions. This example should not be taken as prescriptive - different
research questions will require different variables and approaches.
Let’s start by examining and cleaning our variables:
# First, let's examine our variables of interest
cat("Education Categories:\n")
## Education Categories:
##
## less than high school high school
## 14192 36446
## associate/junior college bachelor's
## 4355 11248
## graduate don't know
## 5953 0
## iap I don't have a job
## 0 0
## dk, na, iap no answer
## 0 0
## not imputable_(2147483637) not imputable_(2147483638)
## 0 0
## refused skipped on web
## 0 0
## uncodeable not available in this release
## 0 0
## not available in this year see codebook
## 0 0
cat("\nRacial Categories:\n")
##
## Racial Categories:
##
## white black
## 57657 10215
## other don't know
## 4411 0
## iap I don't have a job
## 0 0
## dk, na, iap no answer
## 0 0
## not imputable_(2147483637) not imputable_(2147483638)
## 0 0
## refused skipped on web
## 0 0
## uncodeable not available in this release
## 0 0
## not available in this year see codebook
## 0 0
cat("\nGender Categories:\n")
##
## Gender Categories:
##
## male female
## 31977 40301
## don't know iap
## 0 0
## I don't have a job dk, na, iap
## 0 0
## no answer not imputable_(2147483637)
## 0 0
## not imputable_(2147483638) refused
## 0 0
## skipped on web uncodeable
## 0 0
## not available in this release not available in this year
## 0 0
## see codebook
## 0
cat("\nAge Distribution:\n")
##
## Age Distribution:
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 18.00 32.00 44.00 46.56 60.00 89.00 769
cat("\nIncome Distribution (in constant dollars):\n")
##
## Income Distribution (in constant dollars):
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 218 12081 24139 32537 40756 162607 7478
Before recoding, let’s consider each variable carefully:
1. Education
Have both degrees and years
Need meaningful groupings that reflect educational
transitions
Consider implications of combining categories
2. Race
Three broad categories in GSS
Important to note limitations of these categories
Consider sample size implications
3. Gender
4. Age
Continuous variable (18-89)
Can consder/justify groupings
Consider life course stages
5. Income
Let’s create our cleaned variables:
# Create clean demographic variables
gss_demo <- gss %>%
mutate(
# Create religious none indicator
religious_none = case_when(
relig == "none" ~ "None",
relig %in% c("protestant", "catholic", "jewish",
"buddhism", "hinduism", "muslim/islam",
"orthodox-christian", "christian",
"inter-nondenominational", "other") ~ "Affiliated",
TRUE ~ NA_character_
),
# Education (4 meaningful categories)
education = case_when(
degree == "less than high school" ~ "Less than High School",
degree == "high school" ~ "High School",
degree %in% c("associate/junior college", "bachelor's") ~ "Some College/BA",
degree == "graduate" ~ "Graduate Degree",
TRUE ~ NA_character_
),
# Create factor with meaningful order
education = factor(education,
levels = c("Less than High School", "High School",
"Some College/BA", "Graduate Degree")),
# Race (maintain GSS categories while noting limitations)
race = case_when(
race == "white" ~ "White",
race == "black" ~ "Black",
race == "other" ~ "Other",
TRUE ~ NA_character_
),
# Gender (binary in GSS)
gender = case_when(
sex == "male" ~ "Male",
sex == "female" ~ "Female",
TRUE ~ NA_character_
),
# Age groups (life course stages)
age_group = case_when(
age < 30 ~ "18-29",
age < 45 ~ "30-44",
age < 60 ~ "45-59",
TRUE ~ "60+"
),
age_group = factor(age_group,
levels = c("18-29", "30-44", "45-59", "60+")),
# Income quartiles (relative position)
income_group = ntile(realinc, 4),
income_group = factor(income_group,
labels = c("Bottom 25%", "Lower Middle",
"Upper Middle", "Top 25%"))
)
# Verify our cleaning
check_missing <- gss_demo %>%
summarize(across(c(education, race, gender, age_group, income_group),
~sum(is.na(.))))
print("Missing Values in Each Variable:")
## [1] "Missing Values in Each Variable:"
## education race gender age_group income_group
## 1 196 107 112 0 7478
Group Patterns: Proportions and Conditional Probabilities
Let’s start by examining educational differences:
# Calculate educational patterns with key statistics
education_patterns <- gss_demo %>%
filter(!is.na(education), !is.na(religious_none)) %>%
# First get total sample size
mutate(total_n = n()) %>%
# Then group by education
group_by(education) %>%
summarize(
# Group size
n = n(),
# Distribution
pct_of_sample = n/first(total_n) * 100,
# Conditional probability
n_none = sum(religious_none == "None"),
pct_none = n_none/n * 100
)
# Create formatted table
edu_table <- education_patterns %>%
gt() %>%
tab_header(
title = md("**Educational Differences in Religious Non-Affiliation**"),
subtitle = "Distribution and conditional probabilities"
) %>%
fmt_number(
columns = n,
decimals = 0,
use_seps = TRUE
) %>%
fmt_number(
columns = c(pct_of_sample, pct_none),
decimals = 1
) %>%
cols_label(
education = "Education Level",
n = "Group Size",
pct_of_sample = "% of Sample",
n_none = "Number Unaffiliated",
pct_none = "% Unaffiliated"
) %>%
tab_footnote(
footnote = "'% Unaffiliated' shows conditional probability: P(None|Education Level)",
locations = cells_column_labels(columns = pct_none)
) %>%
tab_source_note(
source_note = "Data: General Social Survey"
)
edu_table
Educational Differences in Religious Non-Affiliation |
Distribution and conditional probabilities |
Education Level |
Group Size |
% of Sample |
Number Unaffiliated |
% Unaffiliated |
Less than High School |
13,742 |
20.1 |
1280 |
9.3 |
High School |
34,611 |
50.7 |
4208 |
12.2 |
Some College/BA |
14,445 |
21.2 |
2375 |
16.4 |
Graduate Degree |
5,437 |
8.0 |
1034 |
19.0 |
Data: General Social Survey |
Interpreting Educational Patterns
Looking at our table, we observe several key patterns:
1. Sample Distribution
High school graduates form the largest group (50.7% of
sample)
About one-fifth have less than high school (20.1%)
Similar proportion have some college/BA (21.2%)
Graduate degree holders are the smallest group (8.0%)
2. Conditional Probabilities
We see a clear educational gradient in religious non-affiliation:
Less than high school: 9.3% unaffiliated
High school: 12.2% unaffiliated
Some college/BA: 16.4% unaffiliated
Graduate degree: 19.0% unaffiliated
This suggests:
Higher education associated with greater likelihood of
non-affiliation
Each additional level of education shows increased
probability
Nearly double the rate of non-affiliation between lowest and
highest education
Now, let’s visualize these patterns:
# Calculate conditional probabilities for education
edu_conditional <- gss_demo %>%
filter(!is.na(education), !is.na(religious_none)) %>%
group_by(education) %>%
summarize(
n = n(),
n_none = sum(religious_none == "None"),
p_none = n_none/n,
pct_none = p_none * 100
)
# Create visualization
edu_plot <- ggplot(edu_conditional,
aes(x = pct_none, y = reorder(education, pct_none))) +
# Add main bars
geom_col(
width = 0.6,
fill = "#114B5F",
alpha = 0.9
) +
# Add point at end of each bar
geom_point(
size = 3,
color = "#028090"
) +
# Add value labels
geom_text(
aes(label = sprintf("%.1f%%", pct_none)),
hjust = -0.5,
family = "sans",
size = 4,
fontface = "bold",
color = "#114B5F"
) +
# Clean scales
scale_x_continuous(
limits = c(0, max(edu_conditional$pct_none) * 1.2),
breaks = seq(0, 20, by = 5),
labels = scales::label_number(suffix = "%")
) +
# Add labels
labs(
title = "Educational Differences in Religious Non-Affiliation",
subtitle = "Conditional Probability: P(None | Education Level)",
caption = "Data: General Social Survey",
x = "Probability of No Religious Affiliation",
y = NULL
) +
# Custom theme
theme_minimal() +
theme(
# Text elements
plot.title = element_text(
family = "sans",
face = "bold",
size = 16,
margin = margin(b = 10)
),
plot.subtitle = element_text(
family = "sans",
color = "#114B5F",
size = 12,
margin = margin(b = 20)
),
plot.caption = element_text(
color = "grey40",
margin = margin(t = 20)
),
# Axis formatting
axis.text = element_text(
family = "sans",
size = 11,
color = "grey20"
),
axis.text.y = element_text(
face = "bold"
),
axis.title.x = element_text(
margin = margin(t = 10),
color = "grey20"
),
# Grid lines
panel.grid.major.x = element_line(
color = "grey95",
size = 0.3
),
panel.grid.minor = element_blank(),
panel.grid.major.y = element_blank(),
# Add breathing room
plot.margin = margin(30, 30, 30, 30)
)
edu_plot

Learning a fundamental skill: Saving
Suppose you were really happy with the table and visual you created.
You want to save it so you do not need to always re-run the code or to
insert in your report – well, you can save (and in your preferred
format). Let’s look at how to save for Word and PDFs.
For the tables:
# Identify saved name object
edu_table
Educational Differences in Religious Non-Affiliation |
Distribution and conditional probabilities |
Education Level |
Group Size |
% of Sample |
Number Unaffiliated |
% Unaffiliated |
Less than High School |
13,742 |
20.1 |
1280 |
9.3 |
High School |
34,611 |
50.7 |
4208 |
12.2 |
Some College/BA |
14,445 |
21.2 |
2375 |
16.4 |
Graduate Degree |
5,437 |
8.0 |
1034 |
19.0 |
Data: General Social Survey |
# You can select and copy directly to Word
Now let’s save the table to PDF (you would remove hashtags):
# Save to PDF
#edu_table %>%
# gtsave("education_patterns.pdf")
We can do the same for the visual (you would remove hashtags):

# Save to PDF
ggsave(
"edu_plot.pdf",
plot = edu_plot,
width = 10,
height = 6
)
Now, let’s examine age patterns in religious non-affiliation:
# Calculate age patterns with key statistics
age_patterns <- gss_demo %>%
filter(!is.na(age_group), !is.na(religious_none)) %>%
# First get total sample size
mutate(total_n = n()) %>%
# Then group by age
group_by(age_group) %>%
summarize(
# Group size
n = n(),
# Distribution
pct_of_sample = n/first(total_n) * 100,
# Conditional probability
n_none = sum(religious_none == "None"),
pct_none = n_none/n * 100
)
# Create formatted table
age_patterns %>%
gt() %>%
tab_header(
title = md("**Age Differences in Religious Non-Affiliation**"),
subtitle = "Distribution and conditional probabilities"
) %>%
fmt_number(
columns = n,
decimals = 0,
use_seps = TRUE
) %>%
fmt_number(
columns = c(pct_of_sample, pct_none),
decimals = 1
) %>%
cols_label(
age_group = "Age Group",
n = "Group Size",
pct_of_sample = "% of Sample",
n_none = "Number Unaffiliated",
pct_none = "% Unaffiliated"
) %>%
tab_footnote(
footnote = "'% Unaffiliated' shows conditional probability: P(None|Age Group)",
locations = cells_column_labels(columns = pct_none)
) %>%
tab_source_note(
source_note = "Data: General Social Survey"
)
Age Differences in Religious Non-Affiliation |
Distribution and conditional probabilities |
Age Group |
Group Size |
% of Sample |
Number Unaffiliated |
% Unaffiliated |
18-29 |
13,744 |
20.1 |
2623 |
19.1 |
30-44 |
20,695 |
30.3 |
3120 |
15.1 |
45-59 |
16,294 |
23.8 |
1798 |
11.0 |
60+ |
17,662 |
25.8 |
1377 |
7.8 |
Data: General Social Survey |
Interpreting Age Patterns
Looking at our table, we observe several key patterns:
1. Sample Distribution
2. Conditional Probabilities
We see a clear and strong age gradient in religious
non-affiliation:
Young adults (18-29): 19.1% unaffiliated
Early midlife (30-44): 15.1% unaffiliated
Later midlife (45-59): 11.0% unaffiliated
Older adults (60+): 7.8% unaffiliated
This suggests:
Let’s visualize these patterns:
# Create visualization with new color scheme
ggplot(age_patterns,
aes(x = pct_none, y = reorder(age_group, pct_none))) +
# Add main bars
geom_col(
width = 0.6,
fill = "#2E86AB", # New color for age patterns
alpha = 0.9
) +
# Add point at end of each bar
geom_point(
size = 3,
color = "#084B83"
) +
# Add value labels
geom_text(
aes(label = sprintf("%.1f%%", pct_none)),
hjust = -0.5,
family = "sans",
size = 4,
fontface = "bold",
color = "#2E86AB"
) +
# Clean scales
scale_x_continuous(
limits = c(0, max(age_patterns$pct_none) * 1.2),
breaks = seq(0, 20, by = 5),
labels = scales::label_number(suffix = "%")
) +
# Add labels
labs(
title = "Age Differences in Religious Non-Affiliation",
subtitle = "Conditional Probability: P(None | Age Group)",
caption = "Data: General Social Survey",
x = "Probability of No Religious Affiliation",
y = NULL
) +
# Custom theme
theme_minimal() +
theme(
# Text elements
plot.title = element_text(
family = "sans",
face = "bold",
size = 16,
margin = margin(b = 10)
),
plot.subtitle = element_text(
family = "sans",
color = "#2E86AB",
size = 12,
margin = margin(b = 20)
),
plot.caption = element_text(
color = "grey40",
margin = margin(t = 20)
),
# Axis formatting
axis.text = element_text(
family = "sans",
size = 11,
color = "grey20"
),
axis.text.y = element_text(
face = "bold"
),
axis.title.x = element_text(
margin = margin(t = 10),
color = "grey20"
),
# Grid lines
panel.grid.major.x = element_line(
color = "grey95",
size = 0.3
),
panel.grid.minor = element_blank(),
panel.grid.major.y = element_blank(),
# Add breathing room
plot.margin = margin(30, 30, 30, 30)
)

Let’s examine racial patterns in religious non-affiliation:
# Calculate racial patterns with key statistics
race_patterns <- gss_demo %>%
filter(!is.na(race), !is.na(religious_none)) %>%
# First get total sample size
mutate(total_n = n()) %>%
# Then group by race
group_by(race) %>%
summarize(
# Group size
n = n(),
# Distribution
pct_of_sample = n/first(total_n) * 100,
# Conditional probability
n_none = sum(religious_none == "None"),
pct_none = n_none/n * 100
)
# Create formatted table
race_patterns %>%
gt() %>%
tab_header(
title = md("**Racial Differences in Religious Non-Affiliation**"),
subtitle = "Distribution and conditional probabilities"
) %>%
fmt_number(
columns = n,
decimals = 0,
use_seps = TRUE
) %>%
fmt_number(
columns = c(pct_of_sample, pct_none),
decimals = 1
) %>%
cols_label(
race = "Race",
n = "Group Size",
pct_of_sample = "% of Sample",
n_none = "Number Unaffiliated",
pct_none = "% Unaffiliated"
) %>%
tab_footnote(
footnote = "'% Unaffiliated' shows conditional probability: P(None|Race)",
locations = cells_column_labels(columns = pct_none)
) %>%
tab_source_note(
source_note = "Data: General Social Survey"
)
Racial Differences in Religious Non-Affiliation |
Distribution and conditional probabilities |
Race |
Group Size |
% of Sample |
Number Unaffiliated |
% Unaffiliated |
Black |
9,566 |
14.0 |
955 |
10.0 |
Other |
3,916 |
5.7 |
687 |
17.5 |
White |
54,860 |
80.3 |
7253 |
13.2 |
Data: General Social Survey |
Interpreting Racial Patterns
Looking at our table, we observe two key dimensions:
1. Sample Distribution
White respondents form a clear majority (80.3% of
sample)
Black respondents represent about one-seventh (14.0%)
Other racial groups comprise a small share (5.7%)
2. Conditional Probabilities
We see notable racial variation in religious non-affiliation:
Other racial groups: 17.5% unaffiliated
White respondents: 13.2% unaffiliated
Black respondents: 10.0% unaffiliated
Key patterns:
‘Other’ racial groups show highest rate of
non-affiliation
7.5 percentage point gap between highest and lowest
groups
Black respondents most likely to maintain religious
affiliation
# Create visualization with new color scheme
ggplot(race_patterns,
aes(x = pct_none, y = reorder(race, pct_none))) +
# Add main bars
geom_col(
width = 0.6,
fill = "#7E6B8F", # New purple color for racial patterns
alpha = 0.9
) +
# Add point at end of each bar
geom_point(
size = 3,
color = "#4A4063"
) +
# Add value labels
geom_text(
aes(label = sprintf("%.1f%%", pct_none)),
hjust = -0.5,
family = "sans",
size = 4,
fontface = "bold",
color = "#7E6B8F"
) +
# Clean scales
scale_x_continuous(
limits = c(0, max(race_patterns$pct_none) * 1.2),
breaks = seq(0, 20, by = 5),
labels = scales::label_number(suffix = "%")
) +
# Add labels
labs(
title = "Racial Differences in Religious Non-Affiliation",
subtitle = "Conditional Probability: P(None | Race)",
caption = "Data: General Social Survey",
x = "Probability of No Religious Affiliation",
y = NULL
) +
# Custom theme
theme_minimal() +
theme(
# Text elements
plot.title = element_text(
family = "sans",
face = "bold",
size = 16,
margin = margin(b = 10)
),
plot.subtitle = element_text(
family = "sans",
color = "#7E6B8F",
size = 12,
margin = margin(b = 20)
),
plot.caption = element_text(
color = "grey40",
margin = margin(t = 20)
),
# Axis formatting
axis.text = element_text(
family = "sans",
size = 11,
color = "grey20"
),
axis.text.y = element_text(
face = "bold"
),
axis.title.x = element_text(
margin = margin(t = 10),
color = "grey20"
),
# Grid lines
panel.grid.major.x = element_line(
color = "grey95",
size = 0.3
),
panel.grid.minor = element_blank(),
panel.grid.major.y = element_blank(),
# Add breathing room
plot.margin = margin(30, 30, 30, 30)
)

# Calculate income patterns with key statistics
income_patterns <- gss_demo %>%
filter(!is.na(income_group), !is.na(religious_none)) %>%
# First get total sample size
mutate(total_n = n()) %>%
# Then group by income
group_by(income_group) %>%
summarize(
# Group size
n = n(),
# Distribution
pct_of_sample = n/first(total_n) * 100,
# Conditional probability
n_none = sum(religious_none == "None"),
pct_none = n_none/n * 100
)
# Create formatted table
income_patterns %>%
gt() %>%
tab_header(
title = md("**Income Differences in Religious Non-Affiliation**"),
subtitle = "Distribution and conditional probabilities"
) %>%
fmt_number(
columns = n,
decimals = 0,
use_seps = TRUE
) %>%
fmt_number(
columns = c(pct_of_sample, pct_none),
decimals = 1
) %>%
cols_label(
income_group = "Income Group",
n = "Group Size",
pct_of_sample = "% of Sample",
n_none = "Number Unaffiliated",
pct_none = "% Unaffiliated"
) %>%
tab_footnote(
footnote = "'% Unaffiliated' shows conditional probability: P(None|Income Group)",
locations = cells_column_labels(columns = pct_none)
) %>%
tab_source_note(
source_note = "Data: General Social Survey"
)
Income Differences in Religious Non-Affiliation |
Distribution and conditional probabilities |
Income Group |
Group Size |
% of Sample |
Number Unaffiliated |
% Unaffiliated |
Bottom 25% |
15,359 |
24.9 |
2089 |
13.6 |
Lower Middle |
15,397 |
25.0 |
1993 |
12.9 |
Upper Middle |
15,628 |
25.4 |
1907 |
12.2 |
Top 25% |
15,178 |
24.7 |
2121 |
14.0 |
Data: General Social Survey |
Interpreting Income Patterns
Looking at our table, we observe two interesting dimensions:
1. Sample Distribution
Almost perfectly even quartiles (by design)
Each group represents approximately 25% of the sample
This enables clear comparison across income levels
2. Conditional Probabilities
We see a surprising pattern in religious non-affiliation:
Top 25%: 14.0% unaffiliated
Bottom 25%: 13.6% unaffiliated
Lower Middle: 12.9% unaffiliated
Upper Middle: 12.2% unaffiliated
Key insights:
# Create visualization with new color scheme
ggplot(income_patterns,
aes(x = pct_none, y = reorder(income_group, pct_none))) +
# Add main bars
geom_col(
width = 0.6,
fill = "#D4B483", # Warm gold color for income patterns
alpha = 0.9
) +
# Add point at end of each bar
geom_point(
size = 3,
color = "#B68D4C"
) +
# Add value labels
geom_text(
aes(label = sprintf("%.1f%%", pct_none)),
hjust = -0.5,
family = "sans",
size = 4,
fontface = "bold",
color = "#D4B483"
) +
# Clean scales
scale_x_continuous(
limits = c(0, max(income_patterns$pct_none) * 1.2),
breaks = seq(0, 15, by = 3),
labels = scales::label_number(suffix = "%")
) +
# Add labels
labs(
title = "Income Differences in Religious Non-Affiliation",
subtitle = "Conditional Probability: P(None | Income Group)",
caption = "Data: General Social Survey",
x = "Probability of No Religious Affiliation",
y = NULL
) +
# Custom theme
theme_minimal() +
theme(
# Text elements
plot.title = element_text(
family = "sans",
face = "bold",
size = 16,
margin = margin(b = 10)
),
plot.subtitle = element_text(
family = "sans",
color = "#D4B483",
size = 12,
margin = margin(b = 20)
),
plot.caption = element_text(
color = "grey40",
margin = margin(t = 20)
),
# Axis formatting
axis.text = element_text(
family = "sans",
size = 11,
color = "grey20"
),
axis.text.y = element_text(
face = "bold"
),
axis.title.x = element_text(
margin = margin(t = 10),
color = "grey20"
),
# Grid lines
panel.grid.major.x = element_line(
color = "grey95",
size = 0.3
),
panel.grid.minor = element_blank(),
panel.grid.major.y = element_blank(),
# Add breathing room
plot.margin = margin(30, 30, 30, 30)
)

Let’s examine gender patterns in religious non-affiliation:
# Calculate gender patterns with key statistics
gender_patterns <- gss_demo %>%
filter(!is.na(gender), !is.na(religious_none)) %>%
# First get total sample size
mutate(total_n = n()) %>%
# Then group by gender
group_by(gender) %>%
summarize(
# Group size
n = n(),
# Distribution
pct_of_sample = n/first(total_n) * 100,
# Conditional probability
n_none = sum(religious_none == "None"),
pct_none = n_none/n * 100
)
# Create formatted table
gender_patterns %>%
gt() %>%
tab_header(
title = md("**Gender Differences in Religious Non-Affiliation**"),
subtitle = "Distribution and conditional probabilities"
) %>%
fmt_number(
columns = n,
decimals = 0,
use_seps = TRUE
) %>%
fmt_number(
columns = c(pct_of_sample, pct_none),
decimals = 1
) %>%
cols_label(
gender = "Gender",
n = "Group Size",
pct_of_sample = "% of Sample",
n_none = "Number Unaffiliated",
pct_none = "% Unaffiliated"
) %>%
tab_footnote(
footnote = "'% Unaffiliated' shows conditional probability: P(None|Gender)",
locations = cells_column_labels(columns = pct_none)
) %>%
tab_source_note(
source_note = "Data: General Social Survey"
)
Gender Differences in Religious Non-Affiliation |
Distribution and conditional probabilities |
Gender |
Group Size |
% of Sample |
Number Unaffiliated |
% Unaffiliated |
Female |
38,189 |
55.9 |
3936 |
10.3 |
Male |
30,158 |
44.1 |
4969 |
16.5 |
Data: General Social Survey |
Interpreting Gender Patterns
Looking at our table, we observe two distinct dimensions:
1. Sample Distribution
Women form a majority of respondents (55.9% of sample)
Men comprise 44.1% of the sample
This roughly matches population gender distribution
2. Conditional Probabilities
We see a substantial gender gap in religious non-affiliation:
Key insights:
# Create visualization with new color scheme
ggplot(gender_patterns,
aes(x = pct_none, y = reorder(gender, pct_none))) +
# Add main bars
geom_col(
width = 0.6,
fill = "#386FA4", # Steel blue for gender patterns
alpha = 0.9
) +
# Add point at end of each bar
geom_point(
size = 3,
color = "#133C55"
) +
# Add value labels
geom_text(
aes(label = sprintf("%.1f%%", pct_none)),
hjust = -0.5,
family = "sans",
size = 4,
fontface = "bold",
color = "#386FA4"
) +
# Clean scales
scale_x_continuous(
limits = c(0, max(gender_patterns$pct_none) * 1.2),
breaks = seq(0, 20, by = 5),
labels = scales::label_number(suffix = "%")
) +
# Add labels
labs(
title = "Gender Differences in Religious Non-Affiliation",
subtitle = "Conditional Probability: P(None | Gender)",
caption = "Data: General Social Survey",
x = "Probability of No Religious Affiliation",
y = NULL
) +
# Custom theme
theme_minimal() +
theme(
# Text elements
plot.title = element_text(
family = "sans",
face = "bold",
size = 16,
margin = margin(b = 10)
),
plot.subtitle = element_text(
family = "sans",
color = "#386FA4",
size = 12,
margin = margin(b = 20)
),
plot.caption = element_text(
color = "grey40",
margin = margin(t = 20)
),
# Axis formatting
axis.text = element_text(
family = "sans",
size = 11,
color = "grey20"
),
axis.text.y = element_text(
face = "bold"
),
axis.title.x = element_text(
margin = margin(t = 10),
color = "grey20"
),
# Grid lines
panel.grid.major.x = element_line(
color = "grey95",
size = 0.3
),
panel.grid.minor = element_blank(),
panel.grid.major.y = element_blank(),
# Add breathing room
plot.margin = margin(30, 30, 30, 30)
)

# Create separate trend calculations first
education_trends <- gss_demo %>%
filter(!is.na(education), !is.na(religious_none)) %>%
group_by(year, education) %>%
summarize(
n = n(),
p_none = mean(religious_none == "None"),
group_type = "Education",
category = as.character(first(education)),
.groups = "drop"
)
age_trends <- gss_demo %>%
filter(!is.na(age_group), !is.na(religious_none)) %>%
group_by(year, age_group) %>%
summarize(
n = n(),
p_none = mean(religious_none == "None"),
group_type = "Age",
category = as.character(first(age_group)),
.groups = "drop"
)
race_trends <- gss_demo %>%
filter(!is.na(race), !is.na(religious_none)) %>%
group_by(year, race) %>%
summarize(
n = n(),
p_none = mean(religious_none == "None"),
group_type = "Race",
category = as.character(first(race)),
.groups = "drop"
)
gender_trends <- gss_demo %>%
filter(!is.na(gender), !is.na(religious_none)) %>%
group_by(year, gender) %>%
summarize(
n = n(),
p_none = mean(religious_none == "None"),
group_type = "Gender",
category = as.character(first(gender)),
.groups = "drop"
)
# Combine all trends
all_trends <- bind_rows(
education_trends,
age_trends,
race_trends,
gender_trends
) %>%
# Calculate start and end values
group_by(group_type, category) %>%
mutate(
start_value = first(p_none[year == min(year)]),
end_value = last(p_none[year == max(year)])
) %>%
ungroup() %>%
# Identify extreme categories
mutate(
highlight = case_when(
start_value == max(start_value, na.rm = TRUE) ~ "High Start",
start_value == min(start_value, na.rm = TRUE) ~ "Low Start",
end_value == max(end_value, na.rm = TRUE) ~ "High End",
end_value == min(end_value, na.rm = TRUE) ~ "Low End",
TRUE ~ "Other"
)
)
# Create visualization
ggplot(all_trends, aes(x = year, y = p_none,
group = interaction(group_type, category),
color = highlight)) +
# Add background lines
geom_line(data = filter(all_trends, highlight == "Other"),
color = "grey80", alpha = 0.3, size = 0.5) +
# Add highlighted lines
geom_line(data = filter(all_trends, highlight != "Other"),
size = 1.2) +
# Add labels for highlighted groups
geom_label_repel(
data = filter(all_trends,
highlight != "Other" & year == max(year)),
aes(label = paste0(category, "\n(", group_type, ")")),
direction = "y",
nudge_x = 2,
size = 3,
fontface = "bold",
box.padding = 0.5
) +
# Color scheme
scale_color_manual(
values = c(
"High Start" = "#E41A1C",
"Low Start" = "#377EB8",
"High End" = "#4DAF4A",
"Low End" = "#984EA3",
"Other" = "grey80"
)
) +
# Format scales
scale_y_continuous(
labels = scales::percent_format(),
limits = c(0, 0.5)
) +
scale_x_continuous(
breaks = seq(1980, 2020, by = 5),
limits = c(1980, 2027)
) +
# Labels
labs(
title = "Demographic Patterns in Religious Non-Affiliation (1975-2022)",
subtitle = "Highlighting groups with extreme starting and ending positions",
x = NULL,
y = "Proportion Unaffiliated",
color = "Pattern Type"
) +
# Theme
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(color = "grey40", size = 12),
legend.position = "bottom",
panel.grid.minor = element_blank(),
panel.grid.major = element_line(color = "grey95"),
axis.line = element_line(color = "black", size = 0.5)
)
