Voter participation is a key indicator of civic engagement and a robust democracy, yet participation rates in the United States fluctuate substantially by region, socioeconomic level, and demographic factors. California, one of the most populous and diverse states in the country, presents a compelling case study for examining low voter participation. According to the U.S. Census Bureau (2016), California had the third-lowest voter participation in the country in the 2016 Presidential election at 48.2%, ranking just above Hawaii (43.3%) and Texas (47.7%).
California’s large, diverse population and vast geographic landscape produce distinctive contrasts in voting patterns. Large urban areas such as Los Angeles, San Diego, and San Francisco combine high population density with diverse populations, both of which influence voting behavior. Conversely, the expansive rural farming communities of the Central Valley and other agricultural areas present contrasting political, economic, and social influences on voter engagement. This project examines how voter turnout varies across California counties and explores the connections between engagement and key demographic and socioeconomic factors, such as income, educational attainment, age distribution, and urbanization. By comparing urban and rural counties, this study seeks to identify patterns of voter turnout and uncover insights that may help advocacy groups support communities that are underrepresented in civic engagement.
The 2016 Presidential election data was chosen because it offers a more representative baseline for examining voter engagement trends than 2020. With the expansion of mail-in voting during the 2020 election due to the COVID-19 pandemic, the 2020 election had atypically high participation rates (U.S. Census Bureau, 2021). Voter engagement in the US has historically followed predictable trends that are shaped by demographics, socioeconomic factors, and polling accessibility, and the significantly elevated participation in 2020 is treated here as an outlier because of the expanded mail-in and early voting options. To ensure a standardized comparison of voter behavior that aligns with historical trends, and therefore a more accurate analysis of socioeconomic and demographic impacts on participation, this report focuses solely on 2016 data.
This study will focus on voter turnout at the county level in California for the 2016 Presidential Election. County level was chosen as the unit of study because:
The purpose of this study is to analyze voter turnout rates in California at the county level for the 2016 Presidential Election. The research combines voter participation data, socioeconomic indicators (such as household income, education level attainment, and median age) along with population density to identify spatial patterns and relationships between these factors. I utilized a variety of statistical tests, spatial analysis, and regression modeling to determine meaningful predictors of voter engagement.
The correlation between voter participation and socioeconomic variables has been repeatedly documented in political science and social research (Wolfinger & Rosenstone, 1980; Leighley & Nagler, 2013). This analysis will focus on three key factors: income levels, educational attainment, and urbanization. These variables were chosen for their strong theoretical and empirical association with voter engagement.
Income
Earning a higher income is typically associated with higher voter participation. Citizens with a higher disposable income tend to have:
For this research, median household income from the American Community Survey (ACS) was used as a key variable to assess its relationship to voter turnout.
Education
Educational attainment strongly correlates with political participation. Research shows that individuals with higher educational levels tend to have higher levels of engagement due to:
To examine this correlation, data on percentage of the population with a bachelor’s degree or higher from the ACS was used in the study.
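ACS table B15003 reports counts rather than percentages, and the retrieval code in this report requests only B15003_022 (the count of bachelor's degree holders). A share with a bachelor's degree or higher therefore has to be derived. The following is a minimal sketch of that derivation, assuming the bachelor's, master's, professional, and doctorate counts (B15003_022 through B15003_025) and the population aged 25 and over (B15003_001) have already been pulled into one wide data frame. The column names and the counts are hypothetical and are not taken from the study data.

```r
# Hypothetical counts for two made-up counties; real values would come from
# the ACS B15003 variables noted in the comments.
edu <- data.frame(
  County       = c("Example A", "Example B"),
  total_25up   = c(1000, 2000),  # B15003_001: population 25 and over
  bachelors    = c(200, 300),    # B15003_022
  masters      = c(80, 100),     # B15003_023
  professional = c(15, 40),      # B15003_024
  doctorate    = c(5, 60)        # B15003_025
)

# Percentage of the 25-and-over population with a bachelor's degree or higher.
pct_bachelors_plus <- function(df) {
  higher <- df$bachelors + df$masters + df$professional + df$doctorate
  100 * higher / df$total_25up
}

edu$Pct_Bachelor_Plus <- pct_bachelors_plus(edu)
print(edu[, c("County", "Pct_Bachelor_Plus")])
```

Using a percentage rather than a raw count keeps the education variable from simply tracking county population size.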
Urbanization (Rural vs. Urban Differences)
Geographic location has an impact on voter participation. Urban areas tend to experience higher voter participation due to:
Conversely, rural areas tend to have:
For the purpose of this study, population density was used as a proxy for urbanization, with counties classified as “Urban” (density ≥ 500 people/km²) or “Rural” (density < 500 people/km²).
Through integrating these variables into the spatial analysis and regression models, the aim of the study is to identify patterns in voter participation and uncover potential disparities that can impact civic engagement.
This study will utilize three primary datasets:
# Load needed libraries
library(tidyverse)
library(dplyr)
library(ggplot2)
library(tidycensus)
library(sf)
library(readr)
library(spdep)
library(leaflet)
library(htmlwidgets)
library(spatialreg)
library(ppcor)
library(RColorBrewer)
# Retrieve ACS Data for Demographics
income_data <- get_acs(
geography = "county",
state = "CA",
variables = "B19013_001",
year = 2016,
survey = "acs5",
key = NULL
)
education_data <- get_acs(
geography = "county",
state = "CA",
variables = "B15003_022",
year = 2016,
survey = "acs5",
key = NULL
)
age_data <- get_acs(
geography = "county",
state = "CA",
variables = "B01002_001",
year = 2016,
survey = "acs5",
key = NULL
)
options(tigris_use_cache = TRUE)
population_data <- get_acs(
geography = "county",
state = "CA",
variables = "B01003_001",
year = 2016,
survey = "acs5",
geometry = TRUE,
key = NULL
)
# Clean population_data
population_data <- population_data %>%
rename(County = NAME)
population_data <- population_data %>%
mutate(County = gsub(" County, California", "", County)) # Remove " County, California"
age_data <- age_data %>%
rename(County = NAME)
age_data <- age_data %>%
mutate(County = gsub(" County, California", "", County))
education_data <- education_data %>%
rename(County = NAME)
education_data <- education_data %>%
mutate(County = gsub(" County, California", "", County))
income_data <- income_data %>%
rename(County = NAME)
income_data <- income_data %>%
mutate(County = gsub(" County, California", "", County))
population_data <- population_data %>%
mutate(land_area_km2 = as.numeric(st_area(geometry)) / 1e6) %>%
mutate(population_density = estimate / land_area_km2)
# Load voter turnout data
file_path <- "C:\\Users\\ranse\\OneDrive\\Desktop\\GEOG_588\\Term_Project\\Term_Project_Data\\CA_VoterTurnout_Cleaned.csv"
voter_data <- read_csv(file_path)
voter_data <- voter_data %>%
mutate(County = gsub(" County", "", County))
# Save County and geometry from the original sf object
geometry_data <- population_data %>%
dplyr::select(County, geometry)
# Remove geometry before Joining Data Frames
population_data <- st_drop_geometry(population_data)
# Perform the merge of Data Frames
population_data <- population_data %>%
left_join(voter_data, by = "County") %>%
left_join(income_data, by = "County") %>%
left_join(education_data, by = "County") %>%
left_join(age_data, by = "County")
# Rejoin saved geometry
population_data <- left_join(population_data, geometry_data, by = "County")
# Reconvert to sf object
population_data <- st_as_sf(population_data, sf_column_name = "geometry")
# Ensure correct Projection Coordinates
population_data <- st_transform(population_data, crs = 4326)
#Clean population_data after merge
population_data <- population_data %>%
rename(Participation_Perct = `Participation %`)
population_data <- population_data %>%
rename(Median_Income = estimate.y)
population_data <- population_data %>%
dplyr::select(
-variable.x, -moe.x, -GEOID.y,
-variable.y, -moe.y, -GEOID.x.x,
-variable.x.x, -moe.x.x, -GEOID.y.y,
-variable.y.y, -moe.y.y
)
population_data <- population_data %>%
rename(Bachelor_Degree = estimate.x.x)
population_data <- population_data %>%
rename(Median_Age = estimate.y.y)
population_data <- population_data %>%
rename(Population_total = estimate.x)
# Convert Participation_Perct to numeric values
is.numeric(population_data$Participation_Perct)
population_data$Participation_Perct <- as.numeric(gsub("%", "", population_data$Participation_Perct))
str(population_data$Participation_Perct)
#Urban to Rural Analysis ----
# Define urban vs. rural counties based on population density
population_data <- population_data %>%
mutate(urban_rural = case_when(
population_density >= 500 ~ "Urban",
population_density < 500 ~ "Rural",
TRUE ~ NA_character_
))
population_data %>%
group_by(urban_rural) %>%
summarise(avg_turnout = mean(Participation_Perct, na.rm = TRUE))
Data was retrieved in R using the ‘tidycensus’ package for ACS data and ‘readr’ for voter participation data. Once all the datasets were collected, several pre-processing steps were performed, and the datasets were then joined on county name to create a single cohesive dataset for analysis.
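Because a join on county names only matches exact strings, it can help to check for names that fail to match before merging. The sketch below shows one way to do that in base R. The name vectors are hypothetical stand-ins for the two sources and deliberately contain a trailing space and a misspelling.

```r
# Hypothetical county name vectors standing in for the two sources.
acs_names   <- c("Alameda", "Los Angeles", "San Francisco")
voter_names <- c("Alameda", "Los Angeles ", "San Fransisco")

# Trim stray whitespace before comparing, since a join on names is exact.
acs_clean   <- trimws(acs_names)
voter_clean <- trimws(voter_names)

# Names present in one source but absent from the other.
unmatched_in_voter <- setdiff(acs_clean, voter_clean)  # ACS counties with no turnout row
unmatched_in_acs   <- setdiff(voter_clean, acs_clean)  # turnout rows with no ACS county

print(unmatched_in_voter)
print(unmatched_in_acs)
```

Any name reported by either check would end up with NA values after a left join, so it is worth resolving these before the analysis.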
Each available dataset has some limitations that can impact reliability and accuracy of the analysis. This research relies on voter turnout data from the California Secretary of State and demographic and socioeconomic data from the U.S. Census Bureau’s American Community Survey (ACS). While these data sources provide comprehensive information, certain limitations should be recognized.
One of the challenges for this analysis is missing or incomplete data in both voter turnout data and data obtained from the ACS. Common issues include:
summary()) and missing value checks (is.na()).
The American Community Survey (ACS) is a sample-based survey, and its estimates are subject to sampling variability. Some key issues that have been found include:
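One way to gauge the sampling variability noted above is to convert the margin of error that accompanies each ACS estimate into a coefficient of variation (CV). ACS margins of error are published at the 90% confidence level, so the standard error is the margin of error divided by 1.645. The sketch below applies that conversion. The estimates and margins are hypothetical, and the helper name `acs_cv` is introduced only for illustration.

```r
# Convert an ACS margin of error (published at the 90% confidence level)
# into a standard error, then into a coefficient of variation in percent.
acs_cv <- function(estimate, moe) {
  se <- moe / 1.645
  100 * se / estimate
}

# Hypothetical median income estimates and margins of error.
estimate <- c(60000, 45000)
moe      <- c(1645, 7402.5)

cv <- acs_cv(estimate, moe)
print(round(cv, 2))
```

A higher CV indicates a less reliable estimate, which tends to occur in counties with small populations.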
Before starting analysis, data cleaning was required to standardize and merge datasets for consistency and accuracy.
# Standardize county names
#population_data <- population_data %>%
#mutate(County = gsub(" County, California", "", County))
Handling Non-Numeric Values: Voter turnout percentages, total votes, and registered voter counts were originally stored as text with commas or percentage symbols. These were converted to numeric format to allow for statistical analysis.
Addressing Missing Values:
- Counties with missing socioeconomic data (e.g., median income, educational levels) had their missing values replaced with the median of that variable.
- Any missing voter participation rates were kept as NA to avoid artificial inflation of turnout rates.
Computing Population Density:
- The ACS provides total population rather than a pre-calculated density, so population density was calculated from total population and land area estimates.
#population_data <- population_data %>%
#mutate(land_area_km2 = as.numeric(st_area(geometry)) / 1e6) %>%
#mutate(population_density = Population_total / land_area_km2)
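The text-to-numeric conversion and median replacement described above can be sketched with two small helpers. This is an illustration under stated assumptions rather than the exact code used for the study: the helper names `to_numeric` and `impute_median` are hypothetical, and so are all of the values.

```r
# Strip thousands separators and percent signs, then convert to numeric.
to_numeric <- function(x) as.numeric(gsub("[,%]", "", x))

# Replace missing values with the median of the observed values.
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Hypothetical values for illustration only.
turnout_raw <- c("48.2%", "55.1%", NA)
votes_raw   <- c("1,234,567", "89,012", "3,456")
income      <- c(50000, NA, 70000, 90000)

turnout       <- to_numeric(turnout_raw)  # the missing rate stays NA
votes         <- to_numeric(votes_raw)
income_filled <- impute_median(income)    # NA replaced by the median

print(turnout)
print(votes)
print(income_filled)
```

Keeping the two steps separate makes it easy to impute socioeconomic predictors while leaving missing turnout values untouched.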
The cleaned datasets were merged using county names as the common factor. This resulted in a single dataset containing:
The cleaned and merged dataset serves as the foundation for the spatial and statistical analysis in the following sections.
The histogram above shows the distribution of voter turnout rates across all counties in California.
The data shows:
Interpretation: The distribution of voter participation suggests that while most counties have moderate turnout rates, local factors impact community engagement levels. Particular counties may have high political engagement trends, while other counties may face impediments limiting voter participation.
The boxplot above compares urban and rural voter turnout and reveals some noteworthy differences:
Interpretation: The patterns of voter turnout across urban and rural communities showcase differences in engagement trends. Rural counties in California have a slightly higher median turnout but also show greater variability, with some counties showing exceptionally high engagement while others report significantly lower participation. In comparison, urban counties display more consistent engagement, but their rates generally fall below those of the highest-performing rural counties. These results suggest that local political culture, economic conditions, and accessibility affect voter participation differently across geographic settings. Although urban areas have historically offered greater access to polling locations and outreach programs, long wait times or lower engagement among transient populations can create barriers to participation. Likewise, while tight-knit communities and localized political influence may increase engagement in some rural counties, logistical challenges and low political interest may contribute to lower participation in others.
The choropleth map above illustrates voter turnout rates across California counties in the 2016 Presidential Election, uncovering distinct regional patterns in electoral participation.
Interpretation: The distribution of voter turnout shows disparities in electoral participation throughout California. Low turnout in the Central Valley and Southern California may indicate systemic barriers, such as reduced polling accessibility, lower socioeconomic conditions, or lower levels of political mobilization. In contrast, counties with higher turnout rates, especially in the north and coastal areas, may have higher engagement due to better access to polling locations and historically stronger civic participation. These findings support the influence of geographic and demographic nuances in voter turnout analysis, since local factors such as educational attainment, economic stability, and election policies can affect participation levels. Deeper analysis of the specific socioeconomic traits of low-turnout counties could give greater insight into voter engagement strategies for underrepresented communities.
The scatterplot above explores the relationship between median age and voter turnout rates across California counties in the 2016 Presidential Election.
Interpretation: The correlation between age and voter engagement reinforces the pattern that older individuals are more likely to participate in the election process. Younger populations may be affected by low political engagement or by time constraints due to school or work, while older citizens tend to show consistent voting habits, higher levels of political knowledge, and greater investment in policy outcomes that affect retirement, healthcare, and taxes. This analysis indicates that targeted outreach to younger demographics could increase engagement, especially in counties with lower median ages.
The scatterplot above examines the relationship between median household income and voter turnout rates across California counties in the 2016 Presidential Election.
Interpretation: The scatterplot suggests a positive relationship between income and voter turnout, implying that higher-income earners are more likely to participate in the election process. Citizens in wealthier areas may face fewer impediments to voting, such as inflexible work schedules, lack of transportation, or limited polling locations, while lower-income communities may face more systemic issues that reduce participation, such as economic instability, time constraints due to work schedules or multiple jobs, and generally lower levels of political engagement. These findings stress the significance of addressing socioeconomic disparities in voter engagement, since lower-income populations can benefit from increased polling access, early voting options, and focused outreach efforts that lower barriers to participation.
The scatterplot above examines the relationship between educational attainment (percentage of the population with a bachelor’s degree or higher) and voter turnout rates across California counties in the 2016 Presidential Election.
Interpretation: The correlation between educational attainment and voter turnout implies that turnout rates rise with higher levels of education. Citizens who hold a college degree may be more informed about political issues, have greater access to election resources, and feel a stronger sense of civic duty. In contrast, lower levels of educational attainment are associated with reduced political interest, limited access to reliable voting information, and other potential impediments to engagement. This analysis reinforces the importance of voter education programs, particularly in locations with lower education levels. It may be possible to bridge gaps in engagement by expanding civic education programs, voter mobilization efforts, and access to voting resources.
The map above visualizes the results of a Local Moran’s I analysis, identifying spatial clusters of voter turnout across California counties in the 2016 Presidential Election.
Interpretation: To account for spatial dependence in voting patterns, Local Moran’s I (Anselin, 1995) was used to identify spatial clusters, followed by spatial regression models to address autocorrelation in predictor variables and residuals (Anselin, 1988; Bivand et al., 2013). The spatial clustering of voter turnout rates reveals that electoral participation in California is not randomly distributed but is instead shaped by regional social, economic, and political influences. High-turnout clusters in northern and coastal California could be influenced by stronger socioeconomic conditions, increased voter outreach efforts, and generally higher political engagement, while the low-turnout clusters centered in the Central Valley and inland communities may be affected by lower education levels, economic struggles, and reduced access to voting resources. These findings showcase the need for geographically focused voter outreach programs, particularly in low-turnout clusters where turnout falls behind neighboring locations. Analysis of these spatial patterns could help policymakers and advocates reduce voting disparities across California.
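The code behind the cluster map is not shown in this report, so the following is only an illustration of how the Local Moran's I statistic itself is computed, not a reproduction of the workflow used here. It is a base R sketch on five hypothetical areas arranged in a row, with made-up turnout values and row-standardized contiguity weights. It computes the statistic and a simple quadrant label but no significance test. In practice the `localmoran()` function in the spdep package, which this report loads, can be applied to the county weights list and also reports inference.

```r
# Toy example: five hypothetical areas arranged in a row (1-2-3-4-5),
# where each area neighbors the ones directly beside it.
turnout <- c(60, 58, 45, 33, 30)  # made-up turnout percentages
n <- length(turnout)

# Binary contiguity matrix, then row-standardize so each row sums to 1.
adj <- matrix(0, n, n)
for (i in seq_len(n - 1)) {
  adj[i, i + 1] <- 1
  adj[i + 1, i] <- 1
}
W <- adj / rowSums(adj)

z   <- turnout - mean(turnout)  # deviations from the mean
m2  <- sum(z^2) / n             # variance with n in the denominator
lag <- as.vector(W %*% z)       # average deviation among neighbors

local_i <- (z / m2) * lag       # Local Moran's I for each area

# Label each area by the sign of its own and its neighbors' deviations.
quadrant <- ifelse(z >= 0 & lag >= 0, "High-High",
            ifelse(z < 0 & lag < 0, "Low-Low",
            ifelse(z >= 0, "High-Low", "Low-High")))

print(data.frame(turnout, local_i = round(local_i, 3), quadrant))
```

With row-standardized weights, the mean of the local values equals the global Moran's I, which is a convenient consistency check on the calculation.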
library(sf)
library(leaflet)
library(dplyr)
# Create palette
pal <- colorNumeric(
palette = "viridis",
domain = (population_data$Participation_Perct)
)
# Create leaflet map
leaflet(population_data) %>%
addTiles() %>%
addPolygons(
fillColor = ~pal(Participation_Perct),
color = "white",
weight = 1,
fillOpacity = 0.7,
highlightOptions = highlightOptions(
weight = 3,
color = "#666",
bringToFront = TRUE
),
popup = ~paste0(
"<div style='width:200px;'>",
"<strong>",County, "</strong><br>",
"Turnout: ", Participation_Perct, "%<br>",
"Population Density: ", round(population_density, 1), "<br>",
"Median Income: $", round(Median_Income, 0), "<br>",
"Bachelor's Degree Holders: ", round(Bachelor_Degree, 1), "<br>",
"Median Age: ", round(Median_Age, 1), "<br>",
"</div>"
),
labelOptions = labelOptions(
style = list("font-weight" = "bold", "color" = "black"),
direction = "auto"
)
) %>%
addLegend(pal = pal, values = population_data$Participation_Perct, title = "Voter Turnout (%)", opacity = 1)
The interactive leaflet map above visualizes voter turnout across California counties in the 2016 Presidential Election using a choropleth color scale.
Darker purple counties represent lower levels of voter participation, while lighter green to yellow counties indicate higher turnout percentages.
The color gradient corresponds to the percentage of eligible voters who participated in the election, with the legend displaying a range from under 25% to over 55%.
Interactive popups provide county-specific details, including turnout rate, population density, median income, educational attainment, and median age.
For example, Santa Clara County is shown with a turnout of 36.29%, a population density of 560, and a median income of $101,173, illustrating the kind of demographic and socioeconomic context that accompanies voter participation levels.
This visualization highlights geographic disparities in civic engagement and allows users to explore how structural and regional factors may influence turnout across the state.
Interpretation: The leaflet map shows regional disparities in voter participation throughout California, reinforcing and consolidating observations documented in the previous analyses. Counties with low engagement often have lower median incomes, lower educational attainment, and younger populations, which suggests impediments to participation. Counties with higher turnout show more college graduates, higher median incomes, and older populations, characteristics that are traditionally linked to high political engagement. These findings reinforce the importance of structured and targeted voter outreach programs, especially in low-turnout communities. Through an understanding of these patterns, policymakers can implement local interventions to increase voter accessibility and participation.
To explore the relationship between voter turnout and key socioeconomic factors, an Ordinary Least Squares (OLS) regression model was first constructed using Participation_Perct as the dependent variable, and population_density and Median_Income as predictors. This baseline model provided insight into the general strength and direction of the association between voter engagement and these variables.
However, voter behavior tends to exhibit spatial dependence, where turnout in one county may be influenced by neighboring counties. To test for spatial effects and improve model accuracy, two spatial regression models were constructed: a Spatial Lag Model (SAR) and a Spatial Error Model (SEM).
To support these models, a spatial weights matrix was created using poly2nb() to define county neighbors and nb2listw() to generate row-standardized weights. This structure captures the spatial relationships between California counties.
The Spatial Lag Model (SAR) incorporates the influence of neighboring counties’ turnout into each county’s prediction. This model is suitable when spillover effects are expected — for example, when political mobilization in one region affects turnout in adjacent areas.
The Spatial Error Model (SEM), by contrast, accounts for spatial autocorrelation in the error terms, assuming that unobserved factors influencing turnout are spatially clustered but not explicitly modeled.
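In conventional notation, the two specifications described above can be written as follows. The notation is an assumption chosen for illustration: y is the vector of county turnout rates, X holds the predictors, W is the row-standardized spatial weights matrix, and ε is an independent error term. The coefficients ρ and λ correspond to the "Rho" and "Lambda" values reported in the model summaries.

```latex
\begin{aligned}
\text{SAR:}\quad & y = \rho W y + X\beta + \varepsilon \\
\text{SEM:}\quad & y = X\beta + u, \qquad u = \lambda W u + \varepsilon
\end{aligned}
```

In the lag model the spatial term acts on turnout itself, while in the error model it acts only on the unexplained part of turnout.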
Each model was evaluated using the Akaike Information Criterion (AIC). A lower AIC value indicates a better model fit with a penalty for model complexity. Comparing AIC values across the OLS, SAR, and SEM models allowed for a more robust assessment of whether incorporating spatial dependence improved explanatory power.
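As a check on how the criterion is built, AIC can be recomputed by hand as twice the number of estimated parameters minus twice the log likelihood. The sketch below uses the log likelihoods and parameter counts printed in the spatial model summaries in this report. The helper name `aic_from_loglik` is hypothetical.

```r
# AIC = 2k - 2 * logLik, where k is the number of estimated parameters.
aic_from_loglik <- function(loglik, k) 2 * k - 2 * loglik

# Log likelihoods and parameter counts as printed in the model summaries.
sar_aic <- aic_from_loglik(-195.5523, 5)
sem_aic <- aic_from_loglik(-194.7851, 5)

print(c(SAR = sar_aic, SEM = sem_aic))
print(sar_aic - sem_aic)  # difference in AIC between the two spatial models
```

Because both spatial models estimate the same number of parameters, the difference in their AIC values comes entirely from the difference in log likelihood.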
ols_model <- lm(Participation_Perct ~ population_density + `Median_Income`, data = population_data)
summary(ols_model)
##
## Call:
## lm(formula = Participation_Perct ~ population_density + Median_Income,
## data = population_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.1073 -5.6995 -0.9064 5.1660 21.7053
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.895e+01 4.166e+00 6.950 4.55e-09 ***
## population_density -1.129e-03 1.279e-03 -0.882 0.381
## Median_Income 1.683e-04 7.142e-05 2.357 0.022 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.014 on 55 degrees of freedom
## Multiple R-squared: 0.09178, Adjusted R-squared: 0.05875
## F-statistic: 2.779 on 2 and 55 DF, p-value: 0.07084
# Create spatial neighbors
nb <- poly2nb(population_data)
lw <- nb2listw(nb, style="W")
# Spatial Lag Model
sar_model <- lagsarlm(Participation_Perct ~ population_density + `Median_Income`,
data = population_data,
listw = lw)
summary(sar_model)
##
## Call:lagsarlm(formula = Participation_Perct ~ population_density +
## Median_Income, data = population_data, listw = lw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.93584 -4.67022 -0.93859 4.00406 17.94192
##
## Type: lag
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 8.1042e+00 5.4298e+00 1.4925 0.135555
## population_density -1.3779e-03 1.0825e-03 -1.2729 0.203052
## Median_Income 1.6405e-04 6.1784e-05 2.6552 0.007926
##
## Rho: 0.55894, LR test value: 11.835, p-value: 0.00058131
## Asymptotic standard error: 0.12733
## z-value: 4.3898, p-value: 1.1347e-05
## Wald statistic: 19.27, p-value: 1.1347e-05
##
## Log likelihood: -195.5523 for lag model
## ML residual variance (sigma squared): 45.753, (sigma: 6.7641)
## Number of observations: 58
## Number of parameters estimated: 5
## AIC: 401.1, (AIC for lm: 410.94)
## LM test for residual autocorrelation
## test value: 1.5816, p-value: 0.20853
# Spatial Error Model
sem_model <- errorsarlm(Participation_Perct ~ population_density + `Median_Income`,
data = population_data,
listw = lw)
summary(sem_model)
##
## Call:errorsarlm(formula = Participation_Perct ~ population_density +
## Median_Income, data = population_data, listw = lw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.7081 -4.5238 -0.8234 4.1101 18.2139
##
## Type: error
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.6303e+01 4.8753e+00 5.3951 6.847e-08
## population_density -7.6895e-04 9.7590e-04 -0.7879 0.430735
## Median_Income 2.2148e-04 7.4584e-05 2.9695 0.002983
##
## Lambda: 0.57779, LR test value: 13.369, p-value: 0.00025577
## Asymptotic standard error: 0.12772
## z-value: 4.5239, p-value: 6.0708e-06
## Wald statistic: 20.466, p-value: 6.0708e-06
##
## Log likelihood: -194.7851 for error model
## ML residual variance (sigma squared): 44.262, (sigma: 6.653)
## Number of observations: 58
## Number of parameters estimated: 5
## AIC: 399.57, (AIC for lm: 410.94)
The comparison of AIC values revealed that the Spatial Error Model (SEM) had the lowest AIC (399.57), followed by the Spatial Lag Model (401.10) and the OLS model (410.94), indicating the best overall fit among the three. Both spatial models improve substantially on OLS, while the gap between SEM and SAR is comparatively small. This suggests that accounting for spatial autocorrelation in the error terms provided a more accurate representation of the factors influencing voter turnout than the basic OLS model, and a slightly better one than the Spatial Lag Model. The improvement in model performance implies that unobserved, spatially correlated factors, such as localized political culture or regional election infrastructure, may play a significant role in driving voter participation across California counties.
# Compare Model Performance
AIC(ols_model, sar_model, sem_model)
## df AIC
## ols_model 4 410.9396
## sar_model 5 401.1047
## sem_model 5 399.5702
Anselin, L. (1988). Spatial econometrics: Methods and models. Kluwer Academic Publishers. https://doi.org/10.1007/978-94-015-7799-1
Anselin, L. (1995). Local indicators of spatial association—LISA. Geographical Analysis, 27(2), 93–115. https://doi.org/10.1111/j.1538-4632.1995.tb00338.x
Bivand, R. S., Pebesma, E., & Gómez-Rubio, V. (2013). Applied spatial data analysis with R (2nd ed.). Springer. https://doi.org/10.1007/978-1-4614-7618-4
California Secretary of State. (2016). Statewide Voter Participation Statistics by County. Retrieved from https://www.sos.ca.gov/elections
Leighley, J. E., & Nagler, J. (2013). Who Votes Now? Demographics, Issues, Inequality, and Turnout in the United States. Princeton University Press.
U.S. Census Bureau. (2016). Voting and Registration in the Election of November 2016. Retrieved from https://www.census.gov/data/tables/time-series/demo/voting-and-registration/p20-580.html
U.S. Census Bureau. (2021). Voting and Registration in the Election of November 2020. Retrieved from https://www.census.gov/data/tables/time-series/demo/voting-and-registration/p20-585.html
U.S. Census Bureau. (2012–2016). American Community Survey 5-Year Estimates. Retrieved via R tidycensus package.
Wolfinger, R. E., & Rosenstone, S. J. (1980). Who Votes? Yale University Press.