BASC Data Exercise

Author

Collin Paschall

Objective

This is an initial exploration of the Social Vulnerability Index (SVI) data from the CDC.

Our analysis here is focused on census tracts that are completely or partially within the municipal boundaries of Columbus, Ohio. Our brief for this assignment is to provide a quick baseline of social vulnerability factors within the city.

Because one of our project’s areas of focus is racial wealth inequality, this quick skim of the data will investigate the relationships between social vulnerability factors and the racial composition of census tracts within the city.

This note addresses three questions:

To what extent are the components of social vulnerability correlated?
How is race associated with these measures in Columbus?
Is the relationship between social vulnerability and race in Columbus similar to that found in other geographic areas of the United States?

Data

The dataset includes multiple indicators of social vulnerability, measured at the census tract level. The full dataset from the CDC contains many more indicators, but it is currently unavailable because of an ongoing review of federal data sources by the new administration. We may have to consider enriching these data with additional sources in the future. Here we focus on the following measures:¹

Percent of people below 150% of the federal poverty line
Percent of households with no vehicle
Percent of overcrowded housing units
Percent of people 25+ w/o high school diploma

There are also several demographic measures for each tract:

Population
Number of housing units
Households
Percent of BIPOC residents

The table below displays tract-level data for census tracts that are entirely or partly within the city of Columbus, should you wish to sort or filter the data manually. The original data did not indicate whether a tract lay completely or partially within Columbus city limits. We used the tigris package in R to retrieve the city's geographic boundaries and used them to filter the original data to tracts at least partially in Columbus.²

# Import common data wrangling and visualization tools, along with some mapping libraries
library(tidyverse) 
library(tigris)
library(sf)
library(DT)
library(leaflet)
library(GGally)
library(ggExtra)
library(patchwork)
library(skimr)
library(stargazer)

# Import Data
svi <- read_csv("Data exercise data - Sheet1.csv")

# Calculate percentages

svi$'% HH w/o veh.' <- round(svi$`Households with no vehicle`/svi$Households*100,2)
svi$'% Ppl<150% Pvrty'<-round(svi$`Ppl Below 150% Poverty`/svi$Population*100,2)
svi$'% Ppl>25 w/o dip.'<-round(svi$`People 25+ w/o high school diploma`*100/svi$Population,2)
svi$'% BIPOC' <- round(svi$`BIPOC Residents`/svi$Population*100,2)

# Rename percent overcrowded for consistency
svi<-svi %>% rename('% Crowded Housing' = 'Percent of Overcrowded Housing Units')

# Filter to Franklin County, Ohio
fc <- svi %>% filter(County=="Franklin County" & State == "Ohio")

# Extract Census Tracts (stored as tract_labels so the tigris function tracts() is not masked)
location_vector<-unlist(str_split(fc$Location,";"))
tract_labels<-location_vector[grepl("Census",location_vector)]
fc$'Census Tract' <- tract_labels
fc$merge <- fc$'Census Tract'

# Drop the "Census Tract" Text
fc$tract_number <- str_remove(fc$`Census Tract`,"Census Tract ")

## visual check that Location and Census Tract is correct
#cbind(fc$Location,fc$`Census Tract`)

# Get the city boundaries

columbus <-places(state="Ohio",cb=TRUE,year=2020) %>% filter(NAME=="Columbus") %>% st_transform(crs=4326)

# Get the Census tracts

c_tracts <- tracts(state="Ohio",county = "Franklin",cb=TRUE,year = 2020) %>% st_transform(crs=4326)


# This code takes a while to find the intersections
city_tracts <- st_intersection(columbus,c_tracts)

city_tracts$merge <- city_tracts$NAMELSAD.1

dat<-left_join(city_tracts,fc,by="merge")

#save.image("good_stuff.Rdata")

#load("good_stuff.Rdata")

### Make a nice looking interactive table

browsing_dat <- dat %>%
  rename("Tract Number" = "tract_number") %>%
  select('Tract Number','Population','% BIPOC','Housing Units','Households',
         '% Ppl<150% Pvrty','% HH w/o veh.','% Crowded Housing','% Ppl>25 w/o dip.') %>%
  tibble() %>%
  select(-"geometry") %>%   # drop the geometry column that select() keeps on sf objects
  drop_na()

# There is one missing row we'll have to investigate in the future, probably from the TIGRIS data

datatable(data=browsing_dat,options=list(searching=FALSE))
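
The comment above flags one tract that fails to match. A minimal base-R sketch for locating such rows is to compare the two sets of join keys directly. The key vectors here are made-up stand-ins for city_tracts$merge and fc$merge, not values from the data.

```r
# Hypothetical join keys standing in for city_tracts$merge and fc$merge
boundary_keys <- c("Census Tract 1.10", "Census Tract 2", "Census Tract 3")
svi_keys      <- c("Census Tract 1.10", "Census Tract 3", "Census Tract 4")

# Keys present in the boundary data but absent from the SVI data;
# after a left join these rows carry NA in every SVI column
unmatched_in_svi <- setdiff(boundary_keys, svi_keys)

# Keys present in the SVI data that never appear in the boundary data
unmatched_in_boundaries <- setdiff(svi_keys, boundary_keys)

print(unmatched_in_svi)
print(unmatched_in_boundaries)
```

Running the same comparison on the real key columns would show which tract label is responsible for the row that drop_na() removes.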

This table provides some basic descriptive statistics for the variables of interest. More descriptive statistics could be generated upon request.

my_skim <- skim_with(
  #base=sfl(Missing = n_missing),
  base=sfl(),
  numeric=sfl(IQR=IQR,Mean=mean,SD = sd),
  append=FALSE)

summary_dat <- browsing_dat %>% select(-`Tract Number`)

summary_table <- tibble(my_skim(summary_dat)) %>% select(-skim_type)

names(summary_table) <- c("Variable","IQR","Mean","SD")

summary_table<-summary_table %>% mutate(across(where(is.numeric),~round(.x,digits=2)))

datatable(summary_table,options=list(searching=FALSE,dom='tpr'))

1. To what extent are the components of social vulnerability correlated?

Our first insight from these data is that the components of social vulnerability are correlated with one another, meaning that where one measure of social vulnerability is high, other measures are likely to be high. The converse holds as well; where one measure is low, others tend to be low.

We can see the evidence for this conclusion in the scatterplot matrix below. On the diagonal, this figure displays the univariate distribution of each of our measures of social vulnerability. The scatterplots to the left of the diagonal display the bivariate relationships between each pair of measures. The correlation coefficients to the right of the diagonal summarize the strength of those relationships. The correlations are all positive, which indicates that these measures tend to “run together” across Census tracts.³

We should be careful not to over-interpret these data, because there are plenty of examples of tracts that are high on one social vulnerability indicator while low on another. However, as a general tendency, these data provide evidence of a positive correlation.

Bottom line: Measures of social vulnerability are intertwined. A census tract is often high on multiple measures of social vulnerability simultaneously. Communities must contend with more than one structural challenge at a time.

browsing_dat %>% tibble() %>% select('% Ppl<150% Pvrty',
                            '% HH w/o veh.',
                            '% Crowded Housing',
                            '% Ppl>25 w/o dip.') %>% ggpairs(.)
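
The coefficients shown in a scatterplot matrix can also be computed directly as a correlation matrix. The sketch below uses base R's cor() on a small made-up data frame. The column names and values are hypothetical and stand in for the four percentage measures.

```r
# Hypothetical tract-level percentages (not values from the SVI data)
toy <- data.frame(
  pct_poverty    = c(12.1, 35.4, 8.0, 41.2, 22.5, 15.3),
  pct_no_vehicle = c(4.0, 18.2, 2.5, 25.1, 9.8, NA),
  pct_crowded    = c(1.2, 4.8, 0.5, 6.1, 2.0, 1.9),
  pct_no_diploma = c(5.5, 14.0, 3.1, 19.7, 9.2, 7.4)
)

# Pearson correlations for every pair of measures; rows with a missing
# value are dropped pair by pair rather than from the whole table
cor_matrix <- cor(toy, use = "pairwise.complete.obs")

print(round(cor_matrix, 2))
```

Each off-diagonal entry is the correlation between two measures. A matrix in which every entry is positive corresponds to the "run together" pattern described above.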

2. How is race associated with these measures in Columbus?

It is not only absolute levels of social vulnerability across a municipality that are relevant to understanding vulnerability and crafting policy responses. Context matters, and social vulnerability very often has a spatial component. Social vulnerability will not be evenly distributed across a city.

In the United States, the strong pattern of racial segregation often underlies the spatial distribution of vulnerability. Columbus is no different, as illustrated by the two maps below.

The map on the left shows each census tract in Columbus, with the fill color of each census tract indicating a summary measure of social vulnerability that combines the data for poverty, education, household overcrowding, and household vehicle ownership within that tract.⁴ In short, darker colors indicate higher levels of social vulnerability within the tract.

The map on the right again displays each census tract in the city, with the fill this time indicating the percent of the residents in the tract who report as BIPOC.

The spatial and racial dynamics of social vulnerability are immediately apparent. Columbus is deeply segregated along racial lines, with major concentrations of BIPOC residents on the east side of the city. The social vulnerability measures generally follow the same geographic pattern.

# Create a score by averaging the z-scores of the four measures (poverty, vehicle access, overcrowding, education)
dat$`SVI Score` <-(scale(dat$`% Ppl<150% Pvrty`)+scale(dat$`% HH w/o veh.`)+scale(dat$`% Crowded Housing`)+scale(dat$`% Ppl>25 w/o dip.`))/4

# Reset the scale to 0 as the minimum
dat$`SVI Score` <- dat$`SVI Score` + (-1*min(dat$`SVI Score`,na.rm=T))

# Make maps
svi_plot<-dat %>% ggplot()+geom_sf(aes(fill=`SVI Score`)) +scale_fill_viridis_c(direction=-1) + theme_minimal() + theme(axis.text = element_blank()) +
  theme(panel.grid.major = element_blank(),  
        panel.grid.minor = element_blank())
  
  
bipoc_plot<-dat %>% ggplot()+geom_sf(aes(fill=`% BIPOC`))+scale_fill_viridis_c(direction=-1) + theme_minimal() + theme(axis.text = element_blank())  + theme(panel.grid.major = element_blank(),  
        panel.grid.minor = element_blank())

# Display maps together
svi_plot+bipoc_plot
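
Written out, the summary score described in footnote 4 takes roughly the following form. The symbols are notation introduced here for illustration: x_{ik} is the value of measure k in tract i, and \bar{x}_k and s_k are the mean and standard deviation of measure k across tracts. The four measures are poverty, vehicle access, overcrowding, and education.

```latex
z_{ik} = \frac{x_{ik} - \bar{x}_k}{s_k}, \qquad
a_i = \frac{1}{4} \sum_{k=1}^{4} z_{ik}, \qquad
\text{SVI Score}_i = a_i - \min_{j} a_j
```

Because each measure is standardized before averaging, all four contribute on the same scale. Subtracting the minimum sets the least vulnerable tract to zero.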

Though the spatial pattern is clear, this relationship can also be summarized without respect to spatial relationships. The figure below is a scatterplot, where each point represents a census tract in the city. Its position in the plot corresponds with the percent of BIPOC residents in the tract (left to right, x-axis) and the tract’s SVI summary score (up and down, y-axis).

The distinctive pattern of the points moving from the lower left to the upper right of the plot reflects the positive correlation between the percent of BIPOC residents in a tract and its level of social vulnerability. The line superimposed on the plot is a line of best fit from a simple linear regression model, shown with its 95% confidence interval; the positive slope and narrow confidence interval signal the positive correlation.

While this pattern is clear and strongly supported by this evidence, we must be very careful about the conclusions we draw from it. These results do not mean that social vulnerability is isolated to BIPOC communities or households, or that all BIPOC communities or households are socially vulnerable. These data do not establish that racial identity is a “cause” of social vulnerability. However, they do suggest that we need to think carefully about the spatial and racial dimensions of inequality when we work to develop policy.

Bottom line: The percent of BIPOC residents in a census tract appears to be associated with a higher level of social vulnerability within that tract.

# Bivariate plot
dat %>% ggplot(aes(x=`% BIPOC`,y=`SVI Score`))+geom_point()+theme_classic() +geom_smooth(method='lm')
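
For readers who want the numbers behind a fitted line and its shaded band, the sketch below fits a simple linear regression in base R and computes a 95% confidence band with predict(). The data are simulated, and the variable names and the slope used to generate them are assumptions for illustration, not estimates from the SVI data.

```r
set.seed(42)

# Simulated tracts (hypothetical): percent BIPOC and a noisy score
n <- 275
pct_bipoc <- runif(n, min = 0, max = 100)
score <- 0.2 + 0.02 * pct_bipoc + rnorm(n, sd = 0.5)
toy <- data.frame(pct_bipoc, score)

# Simple linear regression of the score on percent BIPOC
fit <- lm(score ~ pct_bipoc, data = toy)

# Fitted line and 95% confidence band on an evenly spaced grid
grid <- data.frame(pct_bipoc = seq(0, 100, by = 10))
band <- predict(fit, newdata = grid, interval = "confidence", level = 0.95)

print(coef(fit))
print(round(cbind(grid, band), 3))
```

The lwr and upr columns bound the estimated mean score at each value of percent BIPOC. A narrow gap between them indicates that the fitted line is estimated precisely.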

3. Is the relationship between social vulnerability and race in Columbus similar to that relationship in other geographic areas of the United States?

While these data show clear evidence of an identity-based dimension of social vulnerability in Columbus, it may be of interest to compare this evidence against the results of similar analyses in other parts of the United States. We could search for cities of similar sizes and characteristics to Columbus (Indianapolis, Austin, Nashville, etc.) and compare our findings with the patterns in those cities.

Such an analysis would be outside the scope of this initial review, but we can quickly look at how these results in Columbus compare with patterns nationwide.

To summarize this comparison, below is a regression table that reports estimates from two linear models. The first model estimates the association between the social vulnerability score and percent BIPOC residents in a tract, using data from every census tract in the United States. The second model restricts the data to only census tracts in Columbus (note the huge difference in sample size).

The table indicates that the association in Columbus between the racial composition of a census tract and the level of social vulnerability in that tract is similar to the pattern we see nationwide. The regression coefficient is similar in both models.

Substantively, we should interpret this regression coefficient as a hypothetical comparison between two otherwise similar census tracts. Based on nationwide patterns (Model 1), we should expect a census tract whose BIPOC share is 1 percentage point higher to have a social vulnerability score that is 0.016 higher than the comparison tract. This coefficient value is substantively close to what we see when we focus just on the city of Columbus in the second model (0.019).⁵
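
To make the scale of these coefficients more concrete, the sketch below applies them to a larger hypothetical gap in racial composition. The two coefficients are the values reported in the regression table. The 10-point gap is an assumption chosen only for illustration.

```r
# Coefficients on % BIPOC as reported in the regression table
coef_national <- 0.016
coef_columbus <- 0.019

# Hypothetical comparison: two tracts 10 percentage points apart in % BIPOC
gap_pct_points <- 10

# Implied difference in the summary score under each model
implied_gap <- c(national = coef_national, columbus = coef_columbus) * gap_pct_points

print(implied_gap)
```

Under these assumptions, the implied difference in the summary score is 0.16 in the national model and 0.19 in the Columbus model.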

Bottom line: The racial dynamics of social vulnerability in Columbus appear consistent with and similar to national patterns.

# Estimate two regression models

# Additionally, we might prefer to use a per household rate instead of per capita. We can think about this more down the road.


# Same four-measure score as above, computed across all tracts nationwide
svi$`SVI Score` <-(scale(svi$`% Ppl<150% Pvrty`)+scale(svi$`% HH w/o veh.`)+scale(svi$`% Crowded Housing`)+scale(svi$`% Ppl>25 w/o dip.`))/4

mod1 <-lm(`SVI Score`~`% BIPOC`,data=svi)
mod2 <-lm(`SVI Score`~`% BIPOC`,data=dat)
stargazer(mod1,mod2,type='html',omit.stat=c("f","ser","adj.rsq"),star.cutoffs=NA,omit.table.layout = "n")


Dependent variable: SVI Score

	(1) United States	(2) Columbus
% BIPOC	0.016	0.019
	(0.0001)	(0.002)
Constant	-0.654	0.233
	(0.004)	(0.076)
Observations	83,282	275
R²	0.376	0.362

Standard errors in parentheses.

Footnotes

¹ While the original data include raw counts of people below 150% of the poverty line, households with no vehicle, people 25+ w/o a high school diploma, and BIPOC residents, we generated percent values for consistency, because census tracts vary substantially in population and number of households. Some caution is needed in interpreting the percent of the population 25 and older without a HS diploma, because the denominator for the ratio is the entire population across all ages, and not just people 25 and older. However, this should not substantively affect the value as an indicator of social vulnerability within the census tract.
² Technically, this might be slightly over-inclusive, because some tracts may lie within two municipalities. For example, a tract on the border of Columbus and Upper Arlington could span both municipalities. This is unlikely to affect any initial conclusions drawn from this note, and with more time, we could more carefully check for instances where tracts lie in more than one municipality.
³ Disregard the stars next to the correlation coefficient values. These indicate statistical significance, but because of the frequent misinterpretation of this concept, we will not discuss it in this note. It suffices to say here that there is reason to believe these correlations are genuine and not an artifact of this particular set of observations.
⁴ This is a crude score generated by standardizing these four measures, adding them together, dividing by 4, and resetting the value to start from zero. This is a “quick and dirty” method of scale construction, but adequate for a first cut at the data. See code for details.
⁵ The coefficients for % BIPOC are statistically significant in both models, but that information is omitted because of the frequent misinterpretation of this concept, as discussed above in footnote 3. Again, there is reason to believe this relationship is genuine and not an artifact of this particular set of observations.