In this tutorial we’ll use the 2024 General Social Survey (GSS) to
explore patterns of religious attendance across race and gender. Along
the way you’ll practice core skills: filtering data, recoding variables,
computing means, and building visualizations with
ggplot2.
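We start by loading the socsci package, which provides
several custom functions we’ll use throughout (ct(),
mean_ci(), frcode(), etc.) and loads ggplot2 and dplyr automatically.
The setup code also loads tidyverse and scales (for the percent axis
labels), defines an error_bar() helper that draws confidence-interval
bars from the lower and upper columns, and reads the 2024 GSS data from
a CSV file.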
library(socsci)
library(tidyverse)
library(scales)
# Error bars drawn from the lower/upper columns; wd sets the bar width (default .2)
error_bar <- function(wd = .2) {
  geom_errorbar(aes(ymin = lower, ymax = upper), width = wd, position = position_dodge(.9))
}
gss24 <- read.csv("https://www.dropbox.com/scl/fi/t83kf7w379cwu79f535fq/gss24.csv?rlkey=8t3a6n5csqg4x0zhgsuyfba94&st=zwkoggzs&dl=1")

ct()

The ct() function gives us a quick frequency table (counts and
percentages) for any variable. Let’s start with
sex to see the basic gender breakdown in the sample.
gss24 %>%
  ct(sex)
## sex n pct
## 1 1 1467 0.443
## 2 2 1823 0.551
## 3 NA 19 0.006
Notice that sex is coded numerically (1 = Male, 2 =
Female). We’ll recode that into labels later when we build our
chart.
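
For intuition, the counts and shares that ct() reports can be approximated in base R. This is a minimal sketch, not socsci’s implementation: the vector `sex_toy` is made up, and the column names simply mimic the output above.

```r
# Toy data (hypothetical): 1 = Male, 2 = Female, NA = missing
sex_toy <- c(1, 2, 2, 1, 2, NA, 2, 1, 2, 2)

# Count each value, keeping NA as its own category
counts <- table(sex_toy, useNA = "ifany")

# Assemble counts and rounded proportions into one table
freq <- data.frame(
  sex = names(counts),
  n   = as.integer(counts),
  pct = round(as.numeric(counts) / sum(counts), 3)
)
freq
```

The proportions are computed over all rows, including the missing one, which is also how the pct column above sums to 1 with the NA row included.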
Sometimes we want to look at a specific subgroup. We can chain
filter() calls to narrow the data before tabulating. Here
we look at church attendance only among White men.
gss24 %>%
  filter(sex == 1) %>%   # men only
  filter(race == 1) %>%  # White only
  ct(attend, show_na = FALSE)
## attend n pct
## 1 0 378 0.370
## 2 1 117 0.114
## 3 2 123 0.120
## 4 3 99 0.097
## 5 4 42 0.041
## 6 5 44 0.043
## 7 6 54 0.053
## 8 7 127 0.124
## 9 8 39 0.038
The show_na = FALSE argument drops any missing values
from the table so we can focus on valid responses.
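
Chained filter() calls keep only the rows that pass every condition, so they behave like a single call with both conditions. The sketch below checks this on a small made-up data frame (`toy` and its values are hypothetical). The final step drops missing values by hand with plain dplyr, which is only an approximation of what show_na = FALSE does.

```r
library(dplyr)

# Toy data (hypothetical values), using the same numeric coding as the tutorial
toy <- tibble(
  sex    = c(1, 1, 2, 1, 2, 1),
  race   = c(1, 2, 1, 1, 1, 1),
  attend = c(0, 7, 3, 8, 5, NA)
)

# Two chained filter() calls ...
chained <- toy %>%
  filter(sex == 1) %>%
  filter(race == 1)

# ... keep the same rows as one call with both conditions
combined <- toy %>%
  filter(sex == 1, race == 1)

all.equal(chained, combined)

# Drop missing attendance explicitly, then count the valid responses
combined %>%
  filter(!is.na(attend)) %>%
  count(attend)
```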

mutate() and case_when()

The attend variable has 9 categories (0–8). For many
analyses it’s useful to collapse this into a simpler binary:
weekly attenders vs. everyone else.
The GSS codes attendance as follows:
| Value | Label |
|---|---|
| 0 | Never |
| 1 | Less than once a year |
| 2 | About once or twice a year |
| 3 | Several times a year |
| 4 | About once a month |
| 5 | 2–3 times a month |
| 6 | Nearly every week |
| 7 | Every week |
| 8 | Several times a week |
We’ll define “weekly” as values 6, 7, or 8.
gss24 %>%
  mutate(wk = case_when(
    attend == 6 | attend == 7 | attend == 8 ~ 1,
    attend <= 5 ~ 0
  )) %>%
  mean_ci(wk)
## # A tibble: 1 × 8
## mean sd n n_eff se lower upper ci
## <dbl> <dbl> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 0.247 0.431 3276 3276 0.00753 0.232 0.261 0.95
mean_ci() returns the mean and 95% confidence interval.
Because wk is a 0/1 variable, the mean equals the
proportion of weekly attenders in the sample.
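
The numbers in that output can be reproduced by hand in base R. The sketch below rebuilds the 0/1 indicator from the attendance counts tabulated later in this tutorial (194 + 446 + 168 = 808 weekly attenders out of 3276 valid responses). The t-based interval is an assumption about how mean_ci() computes its bounds. With a sample this large, a normal-based interval would round to the same values.

```r
# Rebuild the 0/1 indicator from the tabulated counts
wk <- rep(c(1, 0), times = c(808, 3276 - 808))

n  <- length(wk)
m  <- mean(wk)           # for a 0/1 variable, the mean is the proportion of 1s
se <- sd(wk) / sqrt(n)   # standard error of the mean

# Assumption: a t-based 95% confidence interval
t_crit <- qt(0.975, df = n - 1)
est <- c(mean = m, se = se, lower = m - t_crit * se, upper = m + t_crit * se)
round(est, 5)
```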
Now let’s break that same weekly attendance estimate down by both
race and sex. We use group_by() before
mean_ci() so the calculation happens within each group.
We also use frcode() here — a socsci
wrapper around case_when() that automatically turns the
result into a factor with levels ordered as they appear
in your recoding statements. This is handy for controlling the order of
bars in a chart.
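
To see why level order matters, here is a base R sketch of the idea. It does not use frcode() itself; the `race_code` vector is made up, and the explicit `levels` argument stands in for the ordering that frcode() is described as handling for us.

```r
# Race codes as in the tutorial (1 = White, 2 = Black, 3 = Other); values are made up
race_code <- c(3, 1, 2, 1, 2)

# Look up a label for each code
labels <- c("White", "Black", "Other")[race_code]

# Default factor(): levels are sorted alphabetically
levels(factor(labels))   # "Black" "Other" "White"

# Explicit levels: the order follows the recode statements
race_f <- factor(labels, levels = c("White", "Black", "Other"))
levels(race_f)           # "White" "Black" "Other"
```

ggplot2 draws discrete axes and legends in level order, so the second version puts White first on the chart.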
gg1 <- gss24 %>%
  mutate(race = frcode(
    race == 1 ~ "White",
    race == 2 ~ "Black",
    race == 3 ~ "Other"
  )) %>%
  mutate(sex = frcode(
    sex == 1 ~ "Men",
    sex == 2 ~ "Women"
  )) %>%
  group_by(sex, race) %>%
  mutate(wk = case_when(
    attend == 6 | attend == 7 | attend == 8 ~ 1,
    attend <= 5 ~ 0
  )) %>%
  mean_ci(wk) %>%
  na.omit()
gg1
## # A tibble: 6 × 10
## sex race mean sd n n_eff se lower upper ci
## <fct> <fct> <dbl> <dbl> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 Men White 0.215 0.411 1023 1023 0.0129 0.190 0.240 0.95
## 2 Men Black 0.219 0.415 219 219 0.0280 0.164 0.274 0.95
## 3 Men Other 0.180 0.385 189 189 0.0280 0.125 0.235 0.95
## 4 Women White 0.259 0.438 1232 1232 0.0125 0.234 0.283 0.95
## 5 Women Black 0.300 0.459 343 343 0.0248 0.252 0.349 0.95
## 6 Women Other 0.294 0.457 194 194 0.0328 0.229 0.358 0.95
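
The same recode-then-group pattern can be sketched with plain dplyr on made-up data (`toy` is hypothetical). It also shows what happens to missing attendance: case_when() returns NA when no condition matches, so those rows have to be excluded from the mean. The earlier n of 3276 matches the number of valid attend responses, which suggests mean_ci() drops them as well, but that is an inference from the output rather than something confirmed here.

```r
library(dplyr)

# Toy data (hypothetical); attend uses the 0-8 coding, with one missing value
toy <- tibble(
  sex    = c("Men", "Men", "Men", "Women", "Women", "Women"),
  attend = c(7, 2, NA, 8, 6, 0)
)

out <- toy %>%
  mutate(wk = case_when(
    attend >= 6 ~ 1,   # nearly every week or more
    attend <= 5 ~ 0    # an NA matches neither condition, so wk stays NA
  )) %>%
  group_by(sex) %>%
  summarise(
    mean = mean(wk, na.rm = TRUE),   # share of weekly attenders among valid responses
    n    = sum(!is.na(wk))           # number of valid responses
  )
out
```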
Before we plot, we’ll define a helper function called
lab_bar() that adds percentage labels to bar charts
automatically. This uses tidy evaluation
(enquo() and !!) to accept column names as
arguments — a more advanced R concept, but the key idea is that it lets
us write lab_bar(type = mean) and have it work on whichever
column we pass in.
lab_bar <- function(type, pos = 0, sz = 8, above = TRUE) {
  type <- enquo(type)
  geom_text(
    aes(
      y = if (above) !!type + pos else pos,
      label = paste0(round(!!type, 2) * 100, '%')
    ),
    position = position_dodge(width = 0.9),
    size = sz
  )
}
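
The enquo() and !! pattern is easier to see in a smaller function. The sketch below uses a hypothetical helper, `col_mean()`, that averages whichever column the caller names. It is only an illustration of the capture-then-unquote idea and is not part of socsci.

```r
library(dplyr)
library(rlang)

# Hypothetical helper: average any column passed as a bare name
col_mean <- function(df, col) {
  col <- enquo(col)                     # capture the column expression
  df %>% summarise(avg = mean(!!col))   # unquote it inside summarise()
}

# Toy data (made up), with columns named like the ones we label later
toy <- tibble(mean = c(0.2, 0.3), pct = c(0.4, 0.6))

col_mean(toy, mean)   # averages the column named mean
col_mean(toy, pct)    # same function, different column
```

Recent versions of rlang also accept the `{{ col }}` shorthand, which combines the capture and unquote steps.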
Now we’re ready to visualize. We use a dodged bar chart so we can compare men and women within each racial group side by side.

Key elements of this chart:
- geom_col(position = "dodge"): places bars next to each other rather than stacking them
- error_bar(): adds 95% confidence interval lines, using the helper defined in the setup code
- lab_bar(): places percentage labels inside the bars
- scale_fill_manual(): sets custom colors for each gender

gg1 %>%
  ggplot(aes(x = race, y = mean, fill = sex)) +
  geom_col(position = "dodge", color = "black") +
  scale_y_continuous(labels = percent) +
  theme_minimal() +
  theme(
    legend.position = "bottom",
    legend.title = element_blank(),
    plot.title = element_text(size = 24),
    legend.text = element_text(size = 24)
  ) +
  error_bar() +
  lab_bar(above = FALSE, type = mean, pos = .03, sz = 12) +
  scale_fill_manual(name = NULL, values = c("#9b59b6", "#16a085")) +
  labs(
    x = "", y = "",
    title = "Weekly Attendance by Race and Gender",
    caption = "Data: General Social Survey, 2024"
  )
ggsave("wkattend_race_gender.png", bg = "white", width = 8, height = 6)
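What do you notice? In these estimates, women are more likely than men to attend weekly in every racial group. Black respondents have the highest weekly attendance among both men (21.9%) and women (30.0%), though their lead over the other groups is small relative to the confidence intervals.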
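
One way to read the error bars more carefully is to check whether the men’s and women’s intervals overlap within each racial group. The sketch below copies the lower and upper bounds from the gg1 output into a small data frame. The column names are made up for this example.

```r
# Interval bounds copied from the gg1 output above
ci <- data.frame(
  race        = c("White", "Black", "Other"),
  men_lower   = c(0.190, 0.164, 0.125),
  men_upper   = c(0.240, 0.274, 0.235),
  women_lower = c(0.234, 0.252, 0.229),
  women_upper = c(0.283, 0.349, 0.358)
)

# Two intervals overlap when each one starts before the other ends
ci$overlap <- ci$men_lower <= ci$women_upper &
  ci$women_lower <= ci$men_upper

ci[, c("race", "overlap")]
```

Overlapping 95% intervals are a rough visual check and not a formal test: two estimates can differ significantly even when their intervals overlap a little.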
Rather than collapsing attendance into a binary, we can visualize the full distribution. First, let’s tabulate it.
gss24 %>%
  ct(attend, show_na = FALSE)
## attend n pct
## 1 0 1028 0.314
## 2 1 356 0.109
## 3 2 371 0.113
## 4 3 353 0.108
## 5 4 152 0.046
## 6 5 208 0.063
## 7 6 194 0.059
## 8 7 446 0.136
## 9 8 168 0.051
Now we recode the numeric values into descriptive labels using
frcode(), which again preserves the order we specify (Never
→ Several Times per Week).
gg2 <- gss24 %>%
  mutate(attend = frcode(
    attend == 0 ~ "Never",
    attend == 1 ~ "Once or Less",
    attend == 2 ~ "Once or Twice",
    attend == 3 ~ "Several Times",
    attend == 4 ~ "Once a Month",
    attend == 5 ~ "2-3 Times per Month",
    attend == 6 ~ "Nearly Weekly",
    attend == 7 ~ "Weekly",
    attend == 8 ~ "Several Times per Week"
  )) %>%
  ct(attend, show_na = FALSE)
gg2
## attend n pct
## 1 Never 1028 0.314
## 2 Once or Less 356 0.109
## 3 Once or Twice 371 0.113
## 4 Several Times 353 0.108
## 5 Once a Month 152 0.046
## 6 2-3 Times per Month 208 0.063
## 7 Nearly Weekly 194 0.059
## 8 Weekly 446 0.136
## 9 Several Times per Week 168 0.051
A horizontal bar chart works well here because the attendance labels
are long. We use coord_flip() to rotate the chart, and a
gradient fill to encode the percentage visually (darker
= higher share).
gg2 %>%
  ggplot(aes(x = factor(attend), y = pct, fill = pct)) +
  geom_col(color = "black") +
  scale_fill_gradient(low = "lightblue", high = "darkblue") +
  lab_bar(above = TRUE, pos = .015, sz = 8, type = pct) +
  scale_y_continuous(labels = percent) +
  labs(
    x = "Attendance", y = "Percent",
    title = "Distribution of Annual Religious Attendance"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 20),
    legend.position = "none"
  ) +
  coord_flip()
ggsave("attend_distribution.png", bg = "white", width = 8, height = 6)
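
As an aside, recent versions of ggplot2 (3.3.0 and later) can draw horizontal bars without coord_flip() by mapping the category to y directly. This is an alternative to the approach used above, not a correction of it. The sketch uses a made-up three-category data frame (`toy`) and leaves out the labels.

```r
library(ggplot2)

# Toy shares (hypothetical), with the category stored as a factor in display order
toy <- data.frame(
  attend = factor(c("Never", "Monthly", "Weekly"),
                  levels = c("Never", "Monthly", "Weekly")),
  pct = c(0.5, 0.2, 0.3)
)

# Map the category to y and the share to x; no coord_flip() needed
p <- ggplot(toy, aes(x = pct, y = attend, fill = pct)) +
  geom_col(color = "black") +
  scale_fill_gradient(low = "lightblue", high = "darkblue") +
  theme_minimal() +
  theme(legend.position = "none")

p
```

With this mapping, anything that refers to the axes (such as the axis titles in labs()) uses the orientation you see on screen, which some people find easier to reason about.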
In this tutorial we covered:
- ct(): frequency tables with counts and percentages
- filter(): subsetting rows based on conditions
- mutate() + case_when(): creating new variables (including binary recodes)
- frcode(): recoding into factors with a controlled level order
- group_by() + mean_ci(): computing grouped means with confidence intervals

These are the core building blocks for most descriptive analyses of survey data. As you work with your own data, try swapping in different grouping variables or outcomes and see what patterns emerge.