Diversity probably isn’t the first thing that comes to mind when you think of Nashville, Tennessee.
Thirty minutes from Murfreesboro, where I teach journalism at Middle Tennessee State University, Nashville is famous for music - particularly country music. The city is home to the Grand Ole Opry, the Music Row recording studio and publishing district, and the Lower Broad neighborhood, with its honky tonk bars, live music, and wall-to-wall tourists.
Look past the facade of guitars, cowboy hats and beer, though, and you’ll find one of the most diverse cities in the state, progressive and inclusive, yet marked by geographic segregation left over from its painful Jim Crow past.
This interactive map lets you explore racial diversity across the city’s 35 neighborhoods.
Note, especially, the contrast between the rest of the city and heavily minority Northwest Nashville, where the occupying Union Army built fortifications during the American Civil War, where four prominent HBCUs were founded and still operate, where Dr. Martin Luther King Jr. spoke during his visits to the city, and where white supremacists dynamited the home of Black civil rights attorney and Nashville City councilman Z. Alexander Looby in April of 1960.
A lot has changed in Nashville. A lot hasn’t.
This session will introduce you to R, the free, open-source programming language that produced this map, and show you how you can use it, along with data from the U.S. Census Bureau's American Community Survey, to explore diversity - and much more - in your community.
I might as well acknowledge, up front, that writing R code likely will seem harder than pointing-and-clicking your way through data analysis with something like the Microsoft Excel spreadsheet. I’ll have more to say about that later. For now, trust me on two things:
- Mainly, it's harder only at first. Stick with coding, and it quickly will become as easy as using a spreadsheet. Before long, it will become easier, especially when you get into all the things R can do that Excel can't.
- Beginning is easier than you think. The Getting started with R section of this tutorial includes a couple of fun, engaging YouTube videos that will show you how to install R and begin using it. You'll also find links to more such videos.
Let me walk you through the script that produced the map you saw above. I’ll break the script into parts and explain at least a little about what each part does. I’ll put the whole script after this section, so you can easily copy and paste the whole thing later and try it out.
Most R scripts start by loading one or more add-on programs that expand what basic R can do and make R code easier to write and understand. These add-ons, called “packages,” are free and open-source, just like base R. My script uses six such packages. Here’s the code for setting up and activating them:
# Installing and loading required packages
if (!require("tidyverse")) install.packages("tidyverse")
if (!require("tidycensus")) install.packages("tidycensus")
if (!require("sf")) install.packages("sf")
if (!require("mapview")) install.packages("mapview")
if (!require("leaflet")) install.packages("leaflet")
if (!require("leaflet.extras2")) install.packages("leaflet.extras2")
library(tidyverse)
library(tidycensus)
library(sf)
library(mapview)
library(leaflet)
library(leafpop)
When you actually run this code, you'll see a whole lot of activity in RStudio's output pane. The first time will take a little longer, because R will be installing the packages. After the first time, R needs only to load them into memory.
Look for a small, red stop sign icon in the upper-right corner of the output pane. As long as that stop sign is visible, R is working. When it disappears, R will be done, and you can proceed.
You’ll need to get your own, unique access key from the U.S. Census Bureau, activate it, and plug it into the code below. Otherwise, the code won’t work.
Use this form to request one. The key is free, and you'll receive it via e-mail almost immediately after completing and submitting the form. When the e-mail with your key arrives, open it and click the activation link inside. Here's the e-mail that delivered my API key. I have blurred the key, because you're not supposed to share your key with anyone.
After you have activated your key, carefully copy and paste it into this code in place of PasteYourAPIKeyBetweenTheseQuoteMarks. The key will be a long string of letters and numbers. Be sure you don't accidentally delete the quote marks in the code or include spaces, punctuation, or anything other than the key's characters.
census_api_key("PasteYourAPIKeyBetweenTheseQuoteMarks")
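One optional convenience: if you'd rather not paste the key into every script, tidycensus can store it in your .Renviron file so R finds it automatically in future sessions. A minimal sketch, using tidycensus's install argument:

# Optional: store the key in .Renviron so future sessions load it
# automatically (add overwrite = TRUE to replace a previously stored key)
census_api_key("PasteYourAPIKeyBetweenTheseQuoteMarks", install = TRUE)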
I'm looking for the percentage of people in each Davidson County division who identify as "white." The American Community Survey estimates that percentage with a variable named DP05_0037P.
The ACS provides thousands of estimates, each with a unique variable name. I'll explain later how to find the variable name for whatever estimate you want to get. For now, just trust that DP05_0037P is the variable I need. This code shares that variable name with R:
VariableList =
c(Estimate_ = "DP05_0037P")
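An aside, in case you ever want more than one estimate at a time: you can name several variables in the same vector, and get_acs() will fetch them all, appending "E" and "M" to each prefix you supply. Here's a sketch; the White_ and Language_ prefixes are names I made up, and the rest of this tutorial sticks with the single-variable version above:

# A sketch: requesting two estimates at once. In wide output, get_acs()
# will create White_E, White_M, Language_E, and Language_M columns.
VariableList <-
  c(White_ = "DP05_0037P",
    Language_ = "DP02_0114P")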
Time to get some data! Loosely translated, this code uses the get_acs() function from the tidycensus package to:

- Connect to the Census Bureau's server (get_acs)
- Say, "I want county subdivision-level data" (geography = "county subdivision")
- Add, "For every county subdivision in Tennessee" (state = "TN")
- Add, "For the variables I specified" (variables = VariableList)
- Add, "As of 2023" (year = 2023)
- Add, "According to the five-year American Community Survey" (survey = "acs5")
- Add, "Laid out with a row for each subdivision and a column for each variable" (output = "wide")
- Add, "With the subdivision borders included" (geometry = TRUE)
- Add, "And put all of it in a data frame called mydata" (mydata <-)

Once that's done, the code takes the data frame's "NAME" variable, which contains the division number, county name, and state, all inconveniently lumped together, and splits it up so that there's a column for each part: one column called "Division," a second called "County," and a third called "State" (separate_wider_delim).
If you've ever imported external data into Excel, you might be getting an inkling of why code is so much better. Doing everything listed above in Excel would take considerable manual effort. R does all of it in seconds, fully automatically. Furthermore, Excel would have no idea what to do with the subdivision border information that comes included and ready to use in the R data.
# ---------- County subdivision analysis ----------
# Fetching data
mydata <- get_acs(
geography = "county subdivision",
state = "TN",
variables = VariableList,
year = 2023,
survey = "acs5",
output = "wide",
geometry = TRUE)
# Reformatting data
mydata <-
separate_wider_delim(mydata,
NAME,
delim = ", ",
names = c("Division", "County", "State"))
Once you run the above code, the mydata data frame will appear in the upper-right pane of RStudio under the "Environment" tab. Clicking on the data frame will open it in RStudio's viewer. It will look something like this:
If turned into a map, this file would show every county division in the entire state of Tennessee - all 844 of them. That might be useful for some purposes, but I’m after only the ones in Davidson County, where Nashville is.
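If you'd rather confirm that count in code than scroll through the viewer, one line will do it:

# Count the rows - one per county subdivision - in mydata
nrow(mydata)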
This code filters mydata for divisions in Davidson County, then puts them in a new data frame called filtereddata.
filtereddata <- mydata %>%
filter(County == "Davidson County")
Open the filtereddata data frame, and you'll see only the 35 divisions in Davidson County:

The first row contains data for Davidson County's "District 13" division. The Estimate_E value, 54.1, indicates that an estimated 54.1 percent of District 13's population is white. The "E" in Estimate_E stands for "estimated."
"Estimated" is an important adjective to keep in mind. ACS figures are not based on actual counts the way figures in the 10-year Census are. Instead, ACS estimates are derived from random-sample surveys. The surveys are high quality, but anything short of an actual count could be at least a little bit wrong. Each district's Estimate_M indicates how wrong.

Let's take District 13 as an example. District 13's Estimate_M of 6.3 means that the actual percentage of whites in District 13 is probably somewhere within 6.3 percentage points of the district's Estimate_E, which is 54.1 percent. Specifically, the actual figure for District 13 is probably between 54.1 - 6.3, or 47.8 percent, and 54.1 + 6.3, or 60.4 percent. This range is called the estimate's "margin of error." The "M" in Estimate_M stands for "margin."
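If you'd like those lower and upper bounds as columns you can sort and scan, a quick mutate() will compute them. A sketch; boundsdata, Lower, and Upper are names I made up:

# A sketch: compute each district's margin-of-error bounds
boundsdata <- filtereddata %>%
  mutate(Lower = Estimate_E - Estimate_M,
         Upper = Estimate_E + Estimate_M)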
One way to understand how helpful this information can be is to graph each district's estimate along with its error margin. Here is some code that will make such a graph for the estimated white percentages in Davidson County districts. The graph appears below the code:
# Plotting data
mygraph <- ggplot(filtereddata, aes(x = Estimate_E, y = reorder(Division, Estimate_E))) +
geom_errorbarh(aes(xmin = Estimate_E - Estimate_M, xmax = Estimate_E + Estimate_M)) +
geom_point(size = 3, color = "darkblue") +
theme_minimal(base_size = 12.5) +
labs(
title = "Pct. White Alone",
subtitle = "Davidson County subdivisions. Brackets show error margins.",
x = "2018-2022 ACS estimate",
y = "")
mygraph
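By the way, if you'd like to keep a copy of the chart, ggplot2's ggsave() function will write it to an image file. A sketch, with a file name and dimensions I chose arbitrarily:

# Save the chart as a .png file in the working directory
ggsave("DivisionGraph.png", mygraph, width = 8, height = 10)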
District 2, in the chart’s bottom-left corner, really stands out, doesn’t it? It has the lowest percentage of whites, just above 20 percent. That means it has the highest percentage of nonwhites - just under 80 percent. Furthermore, its error margin doesn’t overlap with the error margins for any of the other districts. It almost certainly has a lower percentage of whites than any other district in the county.
Districts 3, 21, 32, 9, 30, 1, and 28 all have error margins that overlap one another. That means their white population percentages probably aren't all that different from one another. But none of their error margins reach the 50 percent mark. So, all of these districts probably have nonwhite majorities. On the map, they are shaded purple and blue.
In the upper-right corner of the graph, meanwhile, districts 25, 23, 35, 24, and 34 stand out as having distinctly high proportions of white residents. These are the least-diverse districts in the county. On the map, they are bright yellow.
If Davidson County were more integrated geographically, these extremes would not be evident. Instead, more districts would have similar proportions and overlapping error margins.
You’ve seen the map already. Here is the code that creates and displays it:
# Mapping data
mapdata <- filtereddata %>%
rename(Estimate = Estimate_E,
Range = Estimate_M)
mapdata <- st_as_sf(mapdata)
mapviewOptions(basemaps.color.shuffle = FALSE)
DivisionMap <- mapview(mapdata,
zcol = "Estimate",
layer.name = "Estimate",
popup = popupTable(
mapdata,
feature.id = FALSE,
row.numbers = FALSE,
zcol = c(
"State",
"County",
"Division",
"Estimate",
"Range")))
DivisionMap
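A side note: if you want to share the interactive map outside RStudio, mapview's mapshot() function can save it as a self-contained web page. A sketch; the file name is mine:

# Save the interactive map as an HTML file any browser can open
mapshot(DivisionMap, url = "DivisionMap.html")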
Finally, suppose you wanted the data in a comma-separated value file that you could import into some other application. If you run this code, you’ll find the data in a file called DivisionData.csv stored on your computer in the same directory as your R script.
# Exporting data in .csv format
CSVdata <- st_drop_geometry(mapdata)
write_csv(CSVdata, "DivisionData.csv")
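And if you want to keep the boundary geometry too - say, for use in GIS software - sf's st_write() can export the data as GeoJSON instead. Again, the file name is mine:

# A sketch: export the data, boundaries included, as GeoJSON
st_write(mapdata, "DivisionData.geojson")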
As promised, here’s the whole script, without interruption:
# Installing and loading required packages
if (!require("tidyverse")) install.packages("tidyverse")
if (!require("tidycensus")) install.packages("tidycensus")
if (!require("sf")) install.packages("sf")
if (!require("mapview")) install.packages("mapview")
if (!require("leaflet")) install.packages("leaflet")
if (!require("leaflet.extras2")) install.packages("leaflet.extras2")
library(tidyverse)
library(tidycensus)
library(sf)
library(mapview)
library(leaflet)
library(leafpop)
# Transmitting API key
census_api_key("PasteYourAPIKeyBetweenTheseQuoteMarks")
# Fetching ACS codebooks
DetailedTables <- load_variables(2023, "acs5", cache = TRUE)
SubjectTables <- load_variables(2023, "acs5/subject", cache = TRUE)
ProfileTables <- load_variables(2023, "acs5/profile", cache = TRUE)
# Specifying target variables
VariableList =
c(Estimate_ = "DP05_0037P")
# ---------- County subdivision analysis ----------
# Fetching data
mydata <- get_acs(
geography = "county subdivision",
state = "TN",
variables = VariableList,
year = 2023,
survey = "acs5",
output = "wide",
geometry = TRUE)
# Reformatting data
mydata <-
separate_wider_delim(mydata,
NAME,
delim = ", ",
names = c("Division", "County", "State"))
# Filtering data
filtereddata <- mydata %>%
filter(County == "Davidson County")
# Plotting data
mygraph <- ggplot(filtereddata, aes(x = Estimate_E, y = reorder(Division, Estimate_E))) +
geom_errorbarh(aes(xmin = Estimate_E - Estimate_M, xmax = Estimate_E + Estimate_M)) +
geom_point(size = 3, color = "darkblue") +
theme_minimal(base_size = 12.5) +
labs(title = "Pct. White Alone",
subtitle = "Davidson County subdivisions. Brackets show error margins.",
x = "2019-2023 ACS estimate",
y = "")
mygraph
# Mapping data
mapdata <- filtereddata %>%
rename(Estimate = Estimate_E,
Range = Estimate_M)
mapdata <- st_as_sf(mapdata)
mapviewOptions(basemaps.color.shuffle = FALSE)
DivisionMap <- mapview(mapdata,
zcol = "Estimate",
layer.name = "Estimate",
popup = popupTable(
mapdata,
feature.id = FALSE,
row.numbers = FALSE,
zcol = c(
"State",
"County",
"Division",
"Estimate",
"Range")))
DivisionMap
# Exporting data in .csv format
CSVdata <- st_drop_geometry(mapdata)
write_csv(CSVdata, "DivisionData.csv")
But race isn't the only, or even the most substantial, way to think about diversity. Cultural diversity matters, too. The ACS variable DP02_0114P estimates the proportion of people age 5 and older who speak a language other than English while at home.
Yeah. Let's run the entire analysis again, but with the DP02_0114P language estimate instead of the DP05_0037P race estimate. Sound like a slog? It isn't, because all you have to do is change DP05_0037P to DP02_0114P in the VariableList code and tweak the graphic's headline. Here's the script. The <== This has changed notes indicate the (only) two lines of code that had to be edited:
# Specifying target variables
VariableList =
c(Estimate_ = "DP02_0114P") # <== This has changed
# ---------- County subdivision analysis ----------
# Fetching data
mydata <- get_acs(
geography = "county subdivision",
state = "TN",
variables = VariableList,
year = 2023,
survey = "acs5",
output = "wide",
geometry = TRUE)
# Reformatting data
mydata <-
separate_wider_delim(mydata,
NAME,
delim = ", ",
names = c("Division", "County", "State"))
# Filtering data
filtereddata <- mydata %>%
filter(County == "Davidson County")
# Plotting data
mygraph <- ggplot(filtereddata, aes(x = Estimate_E, y = reorder(Division, Estimate_E))) +
geom_errorbarh(aes(xmin = Estimate_E - Estimate_M, xmax = Estimate_E + Estimate_M)) +
geom_point(size = 3, color = "darkblue") +
theme_minimal(base_size = 12.5) +
labs(title = "Pct. non-English at home", # <== This has changed
subtitle = "Davidson County subdivisions. Brackets show error margins.",
x = "2019-2023 ACS estimate",
y = "")
mygraph
# Mapping data
mapdata <- filtereddata %>%
rename(Estimate = Estimate_E,
Range = Estimate_M)
mapdata <- st_as_sf(mapdata)
mapviewOptions(basemaps.color.shuffle = FALSE)
DivisionMap <- mapview(mapdata,
zcol = "Estimate",
layer.name = "Estimate",
popup = popupTable(
mapdata,
feature.id = FALSE,
row.numbers = FALSE,
zcol = c(
"State",
"County",
"Division",
"Estimate",
"Range")))
DivisionMap
# Exporting data in .csv format
CSVdata <- st_drop_geometry(mapdata)
write_csv(CSVdata, "DivisionData.csv")
Now, we can look at Davidson County's diversity from a whole different perspective. Two districts - 28 and 30 - stand out in the chart as having majorities of people who speak a language other than English while at home. In 11 more - 4, 15, 9, 33, 29, 16, 26, 13, 32, 27, and 31 - between 20 and 40 percent speak a language other than English at home. In the rest, fewer than 20 percent do.
The map shows that Davidson County's other-than-English population is concentrated in the southeastern fourth of the county. The two districts with proportions above 50 percent stand out in yellow. In Davidson County, these are culturally Hispanic and Latino areas.
But enough about Nashville/Davidson County. You’re probably more interested in exploring where you live than in exploring where I live. We’re from all over, of course. So let’s pick somewhere we all might have been, or would like to visit: New York City.
I'll stick with DP02_0114P, the variable that estimates the proportion of people age 5 and older who speak a language other than English while at home. Just as easily, though, I could go back to DP05_0037P, the race estimate, or even some other variable from the ACS. To switch to New York City, though, all I have to do is change state = "TN" to state = "NY" in the get_acs section of the code, like this:
mydata <- get_acs(
geography = "county subdivision",
state = "NY", # <== This has changed
variables = VariableList,
year = 2023,
survey = "acs5",
output = "wide",
geometry = TRUE)
… change the filter section of the code from filter(County == "Davidson County") to filter for the counties that make up New York City and its nearby suburbs. The | character can be used to indicate "or" within a filter() function. The code ends up looking like this:
filtereddata <- mydata %>%
filter(County == "Suffolk County"|
County == "Nassau County"|
County == "Queens County"|
County == "Kings County"|
County == "Richmond County"|
County == "New York County"|
County == "Bronx County"|
County == "Westchester County"|
County == "Rockland County")
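As an aside, a slightly tidier way to write the same filter uses the %in% operator with a vector of county names; the result is identical. A sketch; NYCCounties is a name I made up:

# A sketch: the same nine-county filter, written with %in%
NYCCounties <- c("Suffolk County", "Nassau County", "Queens County",
                 "Kings County", "Richmond County", "New York County",
                 "Bronx County", "Westchester County", "Rockland County")
filtereddata <- mydata %>%
  filter(County %in% NYCCounties)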
… and, finally, change the chart's subtitle code from subtitle = "Davidson County subdivisions. Brackets show error margins." to subtitle = "New York City subdivisions. Brackets show error margins."
Because New York has so many divisions, the chart is going to be
unreadably crowded. In practice, I’d either skip it or make a separate
chart for each county. But I’ll leave it in, as is. Here’s what the
revamped code looks like:
mygraph <- ggplot(filtereddata, aes(x = Estimate_E, y = reorder(Division, Estimate_E))) +
geom_errorbarh(aes(xmin = Estimate_E - Estimate_M, xmax = Estimate_E + Estimate_M)) +
geom_point(size = 3, color = "darkblue") +
theme_minimal(base_size = 12.5) +
labs(title = "Pct. non-English at home",
subtitle = "New York City subdivisions. Brackets show error margins.", # <== This has changed
x = "2019-2023 ACS estimate",
y = "")
Everything else can stay the same. Any data frames or .csv files you created with previous versions of the code will simply get overwritten. Here’s the full script, with the two changes detailed above:
# Specifying target variables
VariableList =
c(Estimate_ = "DP02_0114P")
# ---------- County subdivision analysis ----------
# Fetching data
mydata <- get_acs(
geography = "county subdivision",
state = "NY", # <== This has changed
variables = VariableList,
year = 2023,
survey = "acs5",
output = "wide",
geometry = TRUE)
# Reformatting data
mydata <-
separate_wider_delim(mydata,
NAME,
delim = ", ",
names = c("Division", "County", "State"))
# Filtering data
filtereddata <- mydata %>% # <== This section has changed
filter(County == "Suffolk County"|
County == "Nassau County"|
County == "Queens County"|
County == "Kings County"|
County == "Richmond County"|
County == "New York County"|
County == "Bronx County"|
County == "Westchester County"|
County == "Rockland County")
# Plotting data
mygraph <- ggplot(filtereddata, aes(x = Estimate_E, y = reorder(Division, Estimate_E))) +
geom_errorbarh(aes(xmin = Estimate_E - Estimate_M, xmax = Estimate_E + Estimate_M)) +
geom_point(size = 3, color = "darkblue") +
theme_minimal(base_size = 12.5) +
labs(title = "Pct. non-English at home", # <== This has changed
subtitle = "New York City subdivisions. Brackets show error margins.", # <== This has changed
x = "2019-2023 ACS estimate",
y = "")
mygraph
# Mapping data
mapdata <- filtereddata %>%
rename(Estimate = Estimate_E,
Range = Estimate_M)
mapdata <- st_as_sf(mapdata)
mapviewOptions(basemaps.color.shuffle = FALSE)
DivisionMap <- mapview(mapdata,
zcol = "Estimate",
layer.name = "Estimate",
popup = popupTable(
mapdata,
feature.id = FALSE,
row.numbers = FALSE,
zcol = c(
"State",
"County",
"Division",
"Estimate",
"Range")))
DivisionMap
# Exporting data in .csv format
CSVdata <- st_drop_geometry(mapdata)
write_csv(CSVdata, "DivisionData.csv")
I hope that by now you're interested in setting up R on your own computer and trying out the scripts shown above. This section provides some resources that can help you not only get started but also build your R coding skills.
These two videos, both by Dr. Greg Martin, a public health expert in Ireland and host of one of YouTube’s leading “how to” channels about R programming, can get you going.
The first video explains some basic things about R and how it works. The second shows you how to install and use RStudio, a programming environment that will let you easily write, test, and run R code and look at the data frames and output R produces.
“How to” videos, web pages and free e-books are available all over the Internet. If there’s something you need help doing in R, just Google “R” and a description of what you’re trying to figure out.
For a more structured approach, though, I suggest subscribing to these two excellent YouTube channels. The first is where I got the two videos embedded above. The second is from Andrew Gard, a mathematics professor at Lake Forest College. Gard's channel includes more material than Martin's does about advanced statistical analysis - something journalists may not need very often. But both offer excellent material for R beginners and videos with good production values.
RProgramming101, from Dr. Greg Martin.
Equitable Equations, from Dr. Andrew Gard.
Another good source for you is the tidycensus website built by tidycensus developer and super-nice guy Kyle Walker.
You also can reach out to me. You’ll find my contact information at drkblake.com.
I promised to tell you how I knew that DP05_0037P would estimate the percentage of whites and that DP02_0114P would estimate the percentage of people age 5 and older who speak a language other than English while at home.
You can look up these and other variable names in the set of codebooks for whatever American Community Survey dataset you are working with. In data analysis, a codebook is a document that lists each variable in a dataset and explains what the variable measures.
There are a few ways to get access to the American Community Survey's codebooks. They can be big files; the ACS measures nearly 50,000 variables. Fortunately, the tidycensus package has a built-in function, load_variables(), that makes fetching them easy.
Begin by ensuring, if you haven't already, that the tidycensus package is installed and loaded, and that you have supplied R with your Census API key. If you have successfully run any of the above code since opening RStudio, you almost certainly already have done both.
But if you haven't, this code will handle everything. Just be sure to replace PasteYourAPIKeyBetweenTheseQuoteMarks with your actual API key before you run the code.
# Make sure tidycensus is installed and loaded
# Make sure you have provided R with your Census API key
if (!require("tidycensus")) install.packages("tidycensus")
library(tidycensus)
census_api_key("PasteYourAPIKeyBetweenTheseQuoteMarks")
If tidycensus and your API key are good to go, running this code will download each of the 2023 five-year ACS codebooks, put them in three data frames called DetailedTables, SubjectTables, and ProfileTables, and place them in the environment pane in RStudio.
# Fetch ACS codebooks
DetailedTables <- load_variables(2023, "acs5", cache = TRUE)
SubjectTables <- load_variables(2023, "acs5/subject", cache = TRUE)
ProfileTables <- load_variables(2023, "acs5/profile", cache = TRUE)
Click any of the three data frames, and RStudio will open it under a tab in the upper-left pane. You'll be able to scroll through the codebook. The "name" column will contain the variable names. The "label" column will explain what the variable estimates (although not always terribly clearly). The "concept" column will give you the general topic that the variable fits into.
Some tips:
I recommend starting with the ProfileTables codebook. It’s the smallest of the three and contains the most-frequently-used variables. It also is the first place to look for variables that estimate percentages, rather than raw counts. For journalists, percentages often (but not always) are more helpful and informative than raw counts.
Clicking View / Panes / Zoom Source will give you a better view. So will using your mouse to grab the right border of the codebook’s “label” column and move the border farther to the right.
Paying attention to the concept column can be important. For example, the ProfileTables codebook contains variables for the United States and also for Puerto Rico. If you want to estimate something about a place in the U.S., make sure you’re not trying to do so with a variable for Puerto Rico, and vice versa. Another hint is that Puerto Rico variable names always end with “PR.”
Click the Filter icon to open a filter box at the top of each column. Typing a word or phrase into a box will filter the codebook for rows that contain, in that column, the word or phrase you typed. For example, here's the ProfileTables codebook, filtered for "United States" concept variables that estimate something about "language." Note the entry for the DP02_0114P variable, the one I used to estimate language diversity.
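You can do the same kind of filtering in code, too. Here's a sketch using dplyr's filter() and stringr's str_detect() and str_ends(), all loaded with the tidyverse; LanguageVars is a name I made up:

# A sketch: filter the ProfileTables codebook for language variables,
# dropping the Puerto Rico versions, whose names end with "PR"
LanguageVars <- ProfileTables %>%
  filter(str_detect(str_to_lower(label), "language"),
         !str_ends(name, "PR"))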
Changing acs5 to acs1 will get the codebooks for the single-year ACS instead of the five-year ACS. Single-year ACS datasets can be more up-to-date than five-year datasets. But they include data only for places that have 60,000 or more residents. Error margins on some single-year dataset variables also tend to be wider.
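For example, this sketch would fetch the 2023 single-year profile codebook (ProfileTables1yr is a name I made up):

# Fetch the single-year (acs1) profile codebook for 2023
ProfileTables1yr <- load_variables(2023, "acs1/profile", cache = TRUE)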
You also can pick a different year to get the single-year or five-year ACS dataset for that year. Right now, 2023 is the latest-available five-year dataset. It came out on Dec. 12, 2024. The single-year 2023 dataset is already out. See the Census Bureau’s data releases page for details.