Overview

In this class session, we worked with the Cooperative Election Study (CCES) — one of the largest and most widely-used surveys in American political science. The CCES is conducted every two years during election cycles, with sample sizes typically ranging from 50,000 to 60,000 respondents. That large sample size is what makes it so useful for studying smaller subgroups (like specific religious traditions or generations) without losing statistical precision.

Our goal today was straightforward: use the CCES to understand how church attendance has changed over time, and then break that trend down by race and generation.


Step 1: Load the Data

library(fst)
library(tidyverse)
library(socsci)
library(scales)
cces <- read_fst("E://data/full_cces24.fst")
  • tidyverse gives us tools for data manipulation (dplyr) and visualization (ggplot2)
  • socsci is a custom package that provides helper functions like frcode(), ct(), and mean_ci() designed specifically for survey data analysis

The file path in read_fst() will need to be updated to wherever you saved the data on your own machine.


Step 2: Recode Church Attendance

The CCES asks respondents how often they attend religious services. The raw variable (pew_attendance) is coded numerically, but the scale runs backwards — 1 means “More than once a week” and 6 means “Never.” We wrote a custom function to flip this and apply human-readable labels:

cces_attend <- function(df, var){
  var <- enquo(var)
  df %>% 
    mutate(att = case_when(
      !! var == 6 ~ 1,
      !! var == 5 ~ 2,
      !! var == 4 ~ 3,
      !! var == 3 ~ 4,
      !! var == 2 ~ 5,
      !! var == 1 ~ 6,
      TRUE ~ NA_real_)) %>% 
    mutate(att = frcode(
      att == 1 ~ "Never",
      att == 2 ~ "Seldom",
      att == 3 ~ "Yearly",
      att == 4 ~ "Monthly",
      att == 5 ~ "Weekly",
      att == 6 ~ "Weekly+")) 
}

A few things to notice here:

  • The first case_when() reverses the numeric scale so that higher numbers now mean more frequent attendance
  • The second mutate() uses frcode() to convert those numbers into labeled factor levels. The frcode() function (from socsci) is similar to case_when(), but it automatically sets the factor levels in the order they appear in your code — so our attendance variable will display in a logical order from “Never” to “Weekly+”
  • The function takes any data frame and any variable as inputs, making it reusable across different datasets or variable names

We also defined similar recoding functions for party ID, race, education, and income — all following the same logic of converting raw numeric codes into meaningful factor labels.


Step 3: Check What Years Are in the Data

Before diving into analysis, it’s always good practice to see what you’re working with:

cces %>% ct(year)
##    year     n   pct
## 1  2006 36421 0.053
## 2  2008 32800 0.047
## 3  2009 13800 0.020
## 4  2010 55400 0.080
## 5  2011 20150 0.029
## 6  2012 54535 0.079
## 7  2013 16400 0.024
## 8  2014 56200 0.081
## 9  2015 14250 0.021
## 10 2016 64600 0.093
## 11 2017 18200 0.026
## 12 2018 60000 0.087
## 13 2019 18000 0.026
## 14 2020 61000 0.088
## 15 2021 25700 0.037
## 16 2022 60000 0.087
## 17 2023 24500 0.035
## 18 2024 60000 0.087

The ct() function from socsci produces a quick frequency table with counts and percentages. This tells us which survey years are present in the cumulative dataset and how many respondents are in each wave.


Step 4: Compare Attendance in 2008 vs. 2024

Our first visualization compared the full distribution of church attendance across all six categories for just two years — 2008 and 2024 — to see how the picture has changed.

cc1 <- cces %>% 
  filter(year == 2008 | year == 2012 | year == 2016 | year == 2020 | year == 2024) %>% 
  cces_attend(pew_attendance) %>% 
  group_by(year) %>% 
  ct(att, show_na = FALSE, wt = weight)

Let’s break down what each step does:

  1. filter() — restricts the data to only the five election years we care about
  2. cces_attend() — applies our recoding function to create the att variable
  3. group_by(year) — tells R to perform calculations separately within each year
  4. ct(att, show_na = FALSE, wt = weight) — counts the frequency of each attendance category, drops missing values, and applies the survey weight so that our percentages reflect the actual U.S. population rather than just our raw sample

Then we visualized this as a stacked horizontal bar chart, with one bar per year:

cc1 %>% 
  mutate(lab = round(pct, 2)) %>% 
  ggplot(., aes(x = 1, y = pct, fill = fct_rev(att))) +
  geom_col(color = "black") + 
  coord_flip() +
  facet_wrap(~ year, ncol =1, strip.position = "left") +
  theme_minimal() +
  scale_fill_manual(values = c(
    "Never"   = "#7B2D8B",
    "Seldom"  = "#C0659A",
    "Yearly"  = "#E8A0B4",
    "Monthly" = "#A8C8E8",
    "Weekly"  = "#4A90C4",
    "Weekly+" = "#1A4F8A"
  )) +
  theme(legend.position = "bottom") +
  scale_y_continuous(labels = percent) +
  theme(strip.text.y.left = element_text(angle=0)) +
  guides(fill = guide_legend(reverse=T, nrow = 1)) +
  theme(axis.title.y=element_blank(), axis.text.y=element_blank(), axis.ticks.y=element_blank()) +
  theme(panel.grid.minor.y=element_blank(), panel.grid.major.y=element_blank()) +
  geom_text(aes(label = ifelse(pct >.05, paste0(lab*100, '%'), '')), 
            position = position_stack(vjust = 0.5), size = 8, color = "black") +
  geom_text(aes(label = ifelse(pct >.05 & att ==  "Never", paste0(lab*100, '%'), '')), 
            position = position_stack(vjust = 0.5), size = 8, color = "white") +
  geom_text(aes(label = ifelse(pct >.05 & att ==  "Weekly+", paste0(lab*100, '%'), '')), 
            position = position_stack(vjust = 0.5), size = 8, color = "white") +
  theme(strip.text.y.left = element_text(angle = 0, hjust = 1)) +
  labs(x = "", y = "", title = "Attendance in 2008 vs 2024") +
  theme(legend.title = element_blank())

Some design choices worth explaining:

  • fct_rev(att) reverses the factor order within the stacked bar so that “Never” appears on the left and “Weekly+” on the right — a more intuitive layout
  • facet_wrap(~ year, strip.position = "left") creates one bar per year, stacked vertically with the year label on the left side
  • coord_flip() rotates the whole chart so the bars run horizontally
  • The color palette uses purple shades for non-attenders and blue shades for regular attenders, making the shift visually obvious
  • We layer three geom_text() calls: one for all labels in black, then two overrides in white for the darkest segments (“Never” and “Weekly+”) so the text remains readable against dark backgrounds

Step 5: Track the “Never” Category Over Time

Rather than looking at all six categories, we next focused on just the “Never” group — arguably the most important indicator of religious disengagement — and tracked it across every year in the dataset.

cc2 <- cces %>% 
  mutate(never = case_when(pew_attendance == 6 ~ 1,
                           pew_attendance <= 5 ~ 0)) %>% 
  group_by(year) %>% 
  mean_ci(never, wt = weight)

Here we create a binary variable (never) coded 1 for non-attenders and 0 for everyone else. When you take a weighted mean of a 0/1 variable, you get the proportion of respondents who are in the “1” category — in this case, the share who never attend church. The mean_ci() function also calculates confidence intervals, which tells us how certain we can be about each estimate.

cc2 %>% 
  filter(year > 2006) %>% 
  ggplot(., aes(x = year, y = mean)) +
  geom_line() +
  geom_point(shape = 21, fill = "white", color = "goldenrod", size = 3) +
  theme_minimal()

This simple line chart makes the trend immediately visible — and the trend is striking.


Step 6: Break Down the “Never” Trend by Race

Does the rise in non-attendance look the same across racial groups? We added a group_by(year, race) to find out:

cces_race <- function(df, var){
  var <- enquo(var)
  df %>% 
    mutate(race = frcode(
      !! var == 1 ~ "White",
      !! var == 2 ~ "Black",
      !! var == 3 ~ "Hispanic",
      !! var == 4 ~ "Asian",
      TRUE ~ "All Others"))
}


cc2 <- cces %>% 
  cces_race(race) %>% 
  mutate(never = case_when(pew_attendance == 6 ~ 1,
                           pew_attendance <= 5 ~ 0)) %>% 
  group_by(year, race) %>% 
  mean_ci(never, wt = weight)

cc2 %>% 
  filter(year > 2006) %>% 
  filter(race != "All Others") %>% 
  ggplot(., aes(x = year, y = mean, color = race)) +
  geom_line() +
  geom_point(shape = 21, fill = "white", size = 3) +
  theme_minimal() +
  scale_y_continuous(labels = percent, limits = c(0, .42))

The cces_race() function works just like cces_attend() — it recodes the numeric race variable into labeled categories. We filter out “All Others” to keep the chart focused on the four largest groups. The key insight from this chart: the rise in non-attendance is not uniform across racial groups.


Step 7: Break Down the Trend by Generation

Finally, we created a generation variable from the respondent’s birth year and tracked non-attendance separately for each cohort:

gg1 <- cces %>% 
  mutate(gen = frcode(
    birthyr >= 1946 & birthyr <= 1964 ~ "Boomers",
    birthyr >= 1965 & birthyr <= 1980 ~ "Gen X", 
    birthyr >= 1981 & birthyr <= 1996 ~ "Millennials",
    birthyr >= 1997 ~ "Gen Z")) %>% 
  filter(gen != "NA") %>% 
  mutate(never = case_when(pew_attendance == 6 ~ 1,
                           pew_attendance <= 5 ~ 0)) %>% 
  group_by(year, gen) %>% 
  mean_ci(never, wt = weight)

The generational cutoffs we used:

Generation Birth Years
Baby Boomers 1946–1964
Gen X 1965–1980
Millennials 1981–1996
Gen Z 1997+

We then visualized this two ways — all generations on one chart to compare them directly, and then faceted to see each generation’s trajectory in isolation:

# All on one chart
gg1 %>% 
  filter(year > 2006) %>% 
  ggplot(., aes(x = year, y = mean, color = gen)) +
  geom_line() +
  geom_point(shape = 21, fill = "white", size = 3) +
  theme_minimal() +
  scale_y_continuous(labels = percent, limits = c(.1, .45))

# Faceted by generation
gg1 %>% 
  filter(year > 2006) %>% 
  ggplot(., aes(x = year, y = mean, color = gen)) +
  geom_line() +
  geom_point(shape = 21, fill = "white", size = 3) +
  facet_wrap(~ gen) +
  theme_minimal() +
  scale_y_continuous(labels = percent, limits = c(.1, .45))

The faceted version makes it easier to see each generation’s internal trend, while the combined version lets you compare levels across generations directly. Both are useful, and together they tell a richer story.


Key Takeaways

By the end of class, we had built a complete workflow for working with large survey data in R:

  1. Load a large dataset efficiently using fst
  2. Recode variables into meaningful categories using custom functions and frcode()
  3. Summarize data with weighted frequencies (ct()) and weighted means with confidence intervals (mean_ci())
  4. Visualize trends using ggplot2 with stacked bars, line charts, and facets
  5. Disaggregate trends by demographic subgroups to look for variation within the overall pattern

The broader finding from all of this: the rise in religious non-attendance is real, it spans racial groups and generations, but the pace and level differ meaningfully across those groups — which is exactly the kind of nuance that careful data analysis can reveal.


Data: Cooperative Election Study Cumulative Dataset, 2006–2024. All estimates are weighted.