Last week, the Practical R Part 1 YouTube video and RPubs page taught you how to produce and publish the first two elements below: the map showing district-level percentages of Rutherford County residents who hold a bachelor’s degree or higher, and the graph of the percentages and their error margins.
This week, as you continue working on your final paper, take a moment to learn how to produce the next two elements shown below: the multi-county map, and the table of county-level data. The Part 2 R code, below, will produce both. Part 3 of the series is coming up next week.
All you have to do to complete this assignment is integrate the code into an R Markdown document - so that the document shows the map and table when you knit it - then publish the document on RPubs.com and submit the URL via this week’s assignment drop box. Everything from “How to read R code” to the end of the page is optional material.
Percent with a bachelor’s degree or higher
Rutherford County, 2019-2023. Source: American Community Survey.
Percent with a bachelor’s degree or higher
Nashville MSA, 2019-2023. Source: American Community Survey.

Estimates by County
(2019–2023 ACS 5-Year ACS)

| County | Percent (%) | Margin of Error | Count |
|---|---|---|---|
| Williamson County, Tennessee | 61.8 | 1.2 | 167,620 |
| Davidson County, Tennessee | 47.3 | 0.7 | 493,272 |
| Wilson County, Tennessee | 37.2 | 1.2 | 105,888 |
| Rutherford County, Tennessee | 34.4 | 1.1 | 223,304 |
| Sumner County, Tennessee | 32.6 | 1.2 | 138,609 |
| Maury County, Tennessee | 26.5 | 1.5 | 73,091 |
| Cheatham County, Tennessee | 25.1 | 2.3 | 29,614 |
| Dickson County, Tennessee | 21.4 | 1.9 | 38,559 |
| Robertson County, Tennessee | 21.1 | 1.6 | 50,891 |
| Smith County, Tennessee | 17.9 | 2.3 | 14,180 |
| Cannon County, Tennessee | 16.9 | 3.2 | 10,397 |
| Hickman County, Tennessee | 13.6 | 2.2 | 18,044 |
| Macon County, Tennessee | 11.8 | 2.3 | 17,135 |
| Trousdale County, Tennessee | 10.1 | 2.7 | 8,490 |

Estimates by Region
(2019–2023 ACS 5-Year ACS)

| Region | Percent (%) | Margin of Error | Count |
|---|---|---|---|
| Davidson | 47.3 | 0.7 | 493,272 |
| Doughnut | 35.4 | 1.4 | 715,926 |
| Non-doughnut | 16.9 | 2.3 | 179,896 |
Percent with a bachelor’s degree or higher
Tennessee, 2019-2023. Source: American Community Survey.
Below is the code for creating the multi-county map and summary table. Again, all you have to do to complete this assignment is integrate the code into an R Markdown document - so that the document shows the map and table when you knit it - then publish the document on RPubs.com and submit the URL via this week’s assignment drop box. Everything from “How to read R code” to the end of the page is optional material.
Sections 0 through 5 of the code are identical to the code you ran last week. I left out sections 6 through 8, which produce the single-county map and the dot plot with error margins.
The new stuff starts with Section 9. Here, the code filters mydata, like last time, but not just for Rutherford County. Instead, it filters for all 14 counties in the Nashville Metropolitan Statistical Area. MSAs are federally defined areas (delineated by the Office of Management and Budget and used by the Census Bureau) consisting of one or more large metro areas and the surrounding counties that are economically and culturally integrated with them. An MSA can sometimes include counties that lie in different states, as is the case with the Memphis MSA.
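To see how the %in% filter in Section 9 behaves without downloading anything, here is a minimal sketch on a made-up data frame. The toy data frame, its values, and the ShortList vector are invented for illustration; only filter() and %in% come from the script.

```r
library(dplyr)

# Hypothetical stand-in for mydata: one row per county subdivision
toy <- tibble(
  Division = c("District 1", "District 2", "District 1", "District 1"),
  County   = c("Rutherford", "Rutherford", "Knox", "Davidson")
)

# A shortened, illustrative version of CountyNameList
ShortList <- c("Davidson", "Rutherford")

# Keep only the rows whose County value appears in ShortList
kept <- toy %>%
  filter(County %in% ShortList)

kept
```

The Knox row is dropped because "Knox" is not in ShortList. The script does the same thing with the full 14-county list.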
The tricky part about making this map is getting the county borders
to show along with the borders of the county subdivisions the map is
mainly about. Making that happen requires overlaying the county
subdivisions map with a map of the county borders. That’s why Section 10
includes a second get_acs() call that retrieves
county-level data, rather than county-subdivision-level data, and stashes
it in a data frame called county_borders. The
addPolygons() function puts the county subdivision borders
and data on the map by pulling them from the mapdata2 data frame. The
subsequent addPolylines() function adds the county borders
by pulling them from the county_borders data frame.
Pretty slick, huh?
The other new part starts in Section 11. Here, yet another
get_acs() call pulls county-level results for the
DP02_0059 and DP02_0068P variables, and the
gt() function formats them into a nice, neat table. The
gt() function is the same one used all semester to show
you tables of output from the statistical tests you have been
running.
# ============================================================
# 0. INSTALL AND LOAD REQUIRED PACKAGES
# ============================================================
if (!require("tidyverse")) install.packages("tidyverse")
if (!require("tidycensus")) install.packages("tidycensus")
if (!require("sf")) install.packages("sf")
if (!require("leaflet")) install.packages("leaflet")
if (!require("leaflet.extras2")) install.packages("leaflet.extras2")
if (!require("gt")) install.packages("gt")
if (!require("gtExtras")) install.packages("gtExtras")
if (!require("plotly")) install.packages("plotly")
library(tidyverse)
library(tidycensus)
library(sf)
library(leaflet)
library(leaflet.extras2)
library(gt)
library(gtExtras)
library(plotly)
# ============================================================
# 1. CENSUS API KEY
# ============================================================
census_api_key("PasteYourAPIKeyBetweenTheseQuoteMarks")
# ============================================================
# 2. LOAD ACS CODEBOOKS
# ============================================================
DetailedTables <- load_variables(2023, "acs5", cache = TRUE)
SubjectTables <- load_variables(2023, "acs5/subject", cache = TRUE)
ProfileTables <- load_variables(2023, "acs5/profile", cache = TRUE)
# ============================================================
# 3. DEFINE VARIABLES OF INTEREST
# ============================================================
VariableList <- c(
  Count_   = "DP02_0059",
  Percent_ = "DP02_0068P"
)
# ============================================================
# 4. FETCH COUNTY SUBDIVISION DATA (TENNESSEE)
# ============================================================
mydata <- get_acs(
  geography = "county subdivision",
  state     = "TN",
  variables = VariableList,
  year      = 2023,
  survey    = "acs5",
  output    = "wide",
  geometry  = TRUE
)
# ============================================================
# 5. CLEAN AND REFORMAT GEOGRAPHIC NAMES
# ============================================================
mydata <- mydata %>%
  separate_wider_delim(
    NAME,
    delim = ", ",
    names = c("Division", "County", "State")
  ) %>%
  mutate(County = str_remove(County, " County"))
# ============================================================
# NOTE: Sections 6 through 8 of this script have been omitted.
# See "Practical R (Part 1)" for those sections.
# ============================================================
# ============================================================
# 9. FILTER DATA FOR MULTI-COUNTY AREA
# ============================================================
CountyNameList <- c(
  "Cannon", "Cheatham", "Davidson", "Dickson", "Hickman",
  "Macon", "Maury", "Robertson", "Rutherford", "Smith",
  "Sumner", "Trousdale", "Williamson", "Wilson"
)

filtereddata2 <- mydata %>%
  filter(County %in% CountyNameList)
# ============================================================
# 10. MAP MULTI-COUNTY AREA (WITH COUNTY BORDERS)
# ============================================================
mapdata2 <- filtereddata2 %>%
  rename(
    Percent   = Percent_E,
    PercentEM = Percent_M,
    Count     = Count_E,
    CountEM   = Count_M
  ) %>%
  st_as_sf() %>%
  st_transform(4326)

# Get Tennessee county boundaries for overlay
county_borders <- get_acs(
  geography = "county",
  state     = "TN",
  variables = "B01001_001", # dummy population variable
  year      = 2023,
  geometry  = TRUE
) %>%
  st_as_sf() %>%
  st_transform(4326) %>%
  mutate(County = str_remove(NAME, " County, Tennessee")) %>%
  filter(County %in% CountyNameList)

# Viridis color palette
pal <- colorNumeric(
  palette = "viridis",
  domain  = mapdata2$Percent
)
# Build leaflet map with county borders
DivisionMap2 <- leaflet(mapdata2) %>%
  # Base maps
  addProviderTiles("CartoDB.Positron", group = "Positron (Light)") %>%
  addProviderTiles("Esri.WorldStreetMap", group = "ESRI Streets") %>%
  addProviderTiles("Esri.WorldImagery", group = "ESRI Satellite") %>%
  # Main polygons
  addPolygons(
    fillColor   = ~pal(Percent),
    fillOpacity = 0.7,
    color       = "#333333",
    weight      = 1,
    group       = "Data Layer",
    popup = ~paste0(
      "<b>", Division, "</b><br>",
      "County: ", County, "<br>",
      "Percent: ", Percent, "%<br>",
      "Percent MOE: ±", PercentEM, "<br>",
      "Count: ", Count, "<br>",
      "Count MOE: ±", CountEM
    )
  ) %>%
  # County borders overlay
  addPolylines(
    data    = county_borders,
    color   = "black",
    weight  = 1,
    opacity = 0.8,
    group   = "County Borders"
  ) %>%
  # Legend
  addLegend(
    pal      = pal,
    values   = mapdata2$Percent,
    position = "topright",
    title    = "Percent"
  ) %>%
  addLayersControl(
    baseGroups = c(
      "Positron (Light)",
      "ESRI Streets",
      "ESRI Satellite"
    ),
    overlayGroups = c("Data Layer", "County Borders"),
    options = layersControlOptions(
      collapsed = TRUE,
      position  = "bottomleft" # 👈 Control position
    )
  )

DivisionMap2
# ============================================================
# 11. COUNTY-LEVEL TABLE FOR SELECTED COUNTIES
# ============================================================
CountyLevelData <- get_acs(
  geography = "county",
  state     = "TN",
  variables = VariableList,
  year      = 2023,
  survey    = "acs5",
  output    = "wide",
  geometry  = FALSE
)

SelectedCounties <- CountyLevelData %>%
  mutate(NAME = str_remove(NAME, " County, Tennessee")) %>%
  filter(NAME %in% CountyNameList) %>%
  select(NAME, Percent_E, Percent_M, Count_E, Count_M) %>%
  rename(
    County   = NAME,
    Percent  = Percent_E,
    PctMOE   = Percent_M,
    Count    = Count_E,
    CountMOE = Count_M
  ) %>%
  arrange(desc(Percent))

CountyTable <- SelectedCounties %>%
  gt() %>%
  gt_theme_espn() %>%
  cols_label(
    Percent  = "Percent (%)",
    PctMOE   = "Percent MOE",
    Count    = "Count",
    CountMOE = "Count MOE"
  ) %>%
  fmt_number(columns = c(Percent, PctMOE), decimals = 1) %>%
  fmt_number(columns = c(Count), decimals = 0, use_seps = TRUE) %>%
  tab_header(
    title    = "Estimates by County",
    subtitle = "(2019–2023 ACS 5-Year ACS)"
  )

CountyTable
How to read R code
You can paste the new code into an R Markdown document, add your Census API key, publish the document, send me the URL and be done, if that’s all you have time for.
But if you can, it might be worth your while to try to understand how the script works. R code may look like gobbledygook to you (sometimes, it still looks that way to me). But it actually follows patterns that conform to precise rules. Understand the patterns and rules, and you can understand the code.
To see what I mean, take a closer look at this snippet from early in
the script. It’s the part that gets data from the American Community
Survey server and puts it into a data frame called mydata
in your RStudio window’s “Environment” tab, so that you can work with
it:
# ============================================================
# 4. FETCH COUNTY SUBDIVISION DATA (TENNESSEE)
# ============================================================
mydata <- get_acs(
  geography = "county subdivision",
  state     = "TN",
  variables = VariableList,
  year      = 2023,
  survey    = "acs5",
  output    = "wide",
  geometry  = TRUE
)
The first thing to understand is that anything preceded by a
# is a “Comment.” R will ignore it. It’s there to help you,
the code’s user, to understand what’s going on in the script. Here, the
three lines preceded by # form a code section heading with a title that
tells you what the section does. It retrieves county-subdivision-level
data for all county subdivisions in a specified state (here,
Tennessee).
The rest of the snippet illustrates the basic structure behind what most R code does: Fundamentally, a string of R code applies a function to an object.
A function is like a little mini program that R
knows how to run all by itself, kind of like how your body knows how to
raise your arm when you want your arm raised. You don’t have to
consciously tell your body to activate the nerves that tell specific
muscles in your arm to contract in all the specific ways and
sequences that lift your arm. That’s all “hard-coded,” so to speak. All
you have to do is think (or barely think, in most cases), “Arm, lift.”
In this snippet, the get_acs() function tells R to go get
data from the ACS server, using a bunch of specific steps R already
knows how to do.
An object is whatever thing the function affects, or
applies to. In this snippet, the object is mydata. Prior to
this snippet, mydata did not exist, so R creates it on the
fly. It becomes a data frame because a data frame is what the
get_acs() function returns.
The <- is called an assignment
operator. There are other assignment operators in R code. But
<- is the most common one for storing a function’s result in an
object. You type it by pressing the < key, followed by the - key. In
RStudio, you can get it by holding down the Alt key (Option on a Mac)
and pressing the - key. I usually just type it manually. It tells R to
take whatever the code on the right produces and store it in the object
named on the left. This particular line tells R to run the get_acs()
function and put the result in the mydata data frame.
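A small sketch of the same idea, using a function that needs no internet connection. sqrt() is a built-in R function; the object name answer is made up for this example.

```r
# Run the sqrt() function on 16 and store the result in an object called answer
answer <- sqrt(16)

# Typing an object's name by itself prints what it currently holds
answer

# Assigning to the same name again replaces the old contents,
# just as Section 5 of the script overwrites mydata with a cleaned-up version
answer <- answer + 1
answer
```

As with mydata, answer did not exist before the first line ran. R created it on the spot and made it a number because a number is what sqrt() returned.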
You can spot R functions by looking for their parentheses, like the
ones after the get_acs() function. The parentheses, which
always come in pairs, contain arguments that tell R
exactly how to do whatever the function is telling it to do. To go back
to the “raising your arm” analogy: You might not have to consciously
think about which nerves, muscles and so forth have to activate in order
to raise your arm. But you do have to think about things like
which arm to raise, how high to raise it, how quickly to raise it, and
so on. Those are the kinds of variations arguments describe for a
function. Here, all the things inside the get_acs()
function’s parentheses are arguments telling R specific things about how
to get_acs().
For example, the first argument listed,
geography = "county subdivision", tells R which geography
level to get ACS data for. Other available geographies include
geography = "state", geography = "county",
geography = "tract",
geography = "zip code tabulation area", and even things
like geography = "congressional district" and
geography = "school district (secondary)". See Section 2.2
of the TidyCensus documentation for more.
Commas separate each argument, like the comma between the
geography = "county subdivision" argument and the next
argument on the list, state = "TN". Forget one of the
commas, and R will write you a nasty error message.
Also: Most functions have default arguments that R will assume you want unless you specify otherwise. Some arguments have to be included overtly in the code. Others don’t have to be specified; R will quietly use the defaults. The function’s documentation will list all of the arguments and tell you about each one.
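Here is a minimal sketch of arguments and defaults, using two built-in functions rather than get_acs() so that it runs without an API key. The numbers and object names are invented for illustration.

```r
# round() takes the number to round and a digits argument.
# digits defaults to 0, so leaving it out rounds to a whole number.
rounded_default <- round(61.847)

# Specifying digits overrides the default
rounded_one <- round(61.847, digits = 1)

# mean() has an na.rm argument that defaults to FALSE,
# so a missing value (NA) makes the whole result NA ...
scores <- c(10, 20, NA)
mean_default <- mean(scores)

# ... unless you overtly tell R to remove missing values first
mean_no_na <- mean(scores, na.rm = TRUE)

rounded_default
rounded_one
mean_default
mean_no_na
```

In the script, get_acs() works the same way: arguments such as geography and variables are spelled out, while anything left unmentioned falls back to the default listed in the function’s documentation.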
So, in English, this code snippet is telling R to create a data frame called mydata, and fill it with the data that result from contacting the ACS server and asking for:
- County-subdivision-level data
- For all of Tennessee
- For the variables in the (previously defined) VariableList
- For the year 2023
- In that year’s five-year dataset
- In “wide” format (variables in columns and areas in rows, instead of the other way around)
- And - finally - with the “geometry” data included. The “geometry” data is the string of latitude and longitude coordinates needed to define the boundaries of each area on a map.
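The difference between “wide” and “long” layouts is easier to see on a tiny example. The wide_toy data frame and its numbers below are invented, and pivot_longer() from the tidyr package is used only to illustrate the two layouts; it is not part of this week’s script.

```r
library(tidyr)

# Hypothetical "wide" layout: one row per area, one column per variable
wide_toy <- data.frame(
  Area    = c("Area A", "Area B"),
  Count   = c(1000, 2500),
  Percent = c(34.4, 47.3)
)

# The same information in "long" layout: one row per area-variable pair
long_toy <- wide_toy %>%
  pivot_longer(
    cols      = c(Count, Percent),
    names_to  = "variable",
    values_to = "estimate"
  )

wide_toy
long_toy
```

The wide version has two rows (one per area); the long version has four (one per area and variable). The script asks for wide output because the map and table code expect each variable in its own column.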
Sometimes, you’ll see a curious-looking %>% string in
R code. That’s called the “pipe operator,” and it’s just a way of
stringing together a whole series of functions that you want R to apply
to the same object. Essentially, it means “and then, do this.” Its name
comes from the idea of “piping” the same object name into two or more
functions. In the script’s next section of code, for example:
# ============================================================
# 5. CLEAN AND REFORMAT GEOGRAPHIC NAMES
# ============================================================
mydata <- mydata %>%
  separate_wider_delim(
    NAME,
    delim = ", ",
    names = c("Division", "County", "State")
  ) %>%
  mutate(County = str_remove(County, " County"))
… the pipe operator tells R, “Take the mydata data
frame, and then apply the separate_wider_delim()
function to it, according to the arguments given between the function’s
parentheses, and then apply the mutate() function
to it, again according to the arguments between the function’s
parentheses.”
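A minimal sketch, on one made-up row, of what the Section 5 pipeline does, alongside the same steps written without the pipe. The toy data frame and its NAME value are hypothetical; the functions are the ones the script already uses.

```r
library(tidyverse)

# Hypothetical one-row stand-in for mydata, with NAME in the
# "Division, County, State" pattern the script expects
toy <- tibble(NAME = "District 1, Rutherford County, Tennessee")

# With the pipe: take toy, and then split NAME, and then clean up County
piped <- toy %>%
  separate_wider_delim(
    NAME,
    delim = ", ",
    names = c("Division", "County", "State")
  ) %>%
  mutate(County = str_remove(County, " County"))

# Without the pipe: the same two steps, nested inside one another
nested <- mutate(
  separate_wider_delim(
    toy,
    NAME,
    delim = ", ",
    names = c("Division", "County", "State")
  ),
  County = str_remove(County, " County")
)

piped

# Check whether the two versions produce the same result
identical(piped, nested)
```

The nested version has to be read from the inside out, while the piped version reads top to bottom in the order R does the work. That readability is the main reason the script uses the pipe.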
Once you know at least a little bit about how R code works, you can get your favorite artificial intelligence chat bot to explain lines, chunks, or whole scripts of R code to you. You can even get an AI bot to debug, modify, or even write code for you.
Getting AI to explain R code
Below is what I got when I gave ChatGPT the prompt, “List every function used in this R code, what the function does in general, and what the function does in this particular case. Also: For each function, include a URL to authoritative documentation for the function,” followed by the code in sections 4 and 5 of this week’s script. You might not get exactly the same thing; ChatGPT composes each answer on the fly, sometimes informed by previous conversations you have had with the platform. But you should get something similar:
SECTION 4: FETCH COUNTY SUBDIVISION DATA (TENNESSEE)
1. Function: get_acs()
Package: tidycensus
General purpose: Retrieves American Community Survey (ACS) data from the U.S. Census Bureau’s API, including both estimates and margins of error.
In this script: get_acs() is used to fetch ACS 5-year estimates (2023) for county subdivisions in Tennessee, returning the data in a wide format with associated geometry (spatial boundaries) for mapping.
Arguments used here:
- geography = "county subdivision": Requests data at the county subdivision level (minor civil divisions).
- state = "TN": Limits the data to Tennessee.
- variables = VariableList: Uses a pre-defined list of ACS variable IDs.
- year = 2023: Specifies the 2023 ACS dataset.
- survey = "acs5": Chooses the 5-year ACS survey.
- output = "wide": Returns a single-row-per-feature format (instead of long).
- geometry = TRUE: Includes spatial boundary information as sf objects.
Documentation: https://walker-data.com/tidycensus/reference/get_acs.html
2. Function: <- (Assignment Operator)
Package: base R
General purpose: Assigns a value or object to a variable name in R.
In this script: Used twice:
- mydata <- get_acs(...): Stores the downloaded ACS dataset in the variable mydata.
- mydata <- mydata %>% ...: Overwrites mydata with its cleaned and reformatted version.
Documentation: https://stat.ethz.ch/R-manual/R-devel/library/base/html/assignOps.html
SECTION 5: CLEAN AND REFORMAT GEOGRAPHIC NAMES
3. Function: %>% (Pipe Operator)
Package: magrittr (also loaded by tidyverse and dplyr)
General purpose: Passes the output of one expression as the first argument of the next function, improving readability and flow.
In this script: Used to create a pipeline that starts with mydata, then separates the NAME field into three parts and removes "County" from the County field.
Documentation: https://magrittr.tidyverse.org/reference/pipe.html
4. Function: separate_wider_delim()
Package: tidyr
General purpose: Splits a single character column into multiple columns using a specified delimiter.
In this script: Takes the column NAME (for example, "Smithville, DeKalb County, Tennessee") and splits it into three new columns—Division, County, and State—using the delimiter ", ".
Arguments used here:
- NAME: The column being split.
- delim = ", ": The delimiter separating values.
- names = c("Division", "County", "State"): The names of the new columns created.
Documentation: https://tidyr.tidyverse.org/reference/separate_wider_delim.html
5. Function: mutate()
Package: dplyr
General purpose: Creates or transforms columns in a data frame or tibble.
In this script: Creates a modified version of the County column by removing the trailing word "County" using str_remove().
Documentation: https://dplyr.tidyverse.org/reference/mutate.html
6. Function: str_remove()
Package: stringr
General purpose: Removes the first instance of a specified pattern (regular expression) from a string.
In this script: Removes the word " County" from entries in the County column (for example, converts "DeKalb County" to "DeKalb").
Arguments used here:
- County: The input string (column).
- " County": The literal substring to remove.
Documentation: https://stringr.tidyverse.org/reference/str_remove.html
SUMMARY TABLE
Function: get_acs()
Package: tidycensus
Purpose: Retrieves ACS data from Census API
Script Role: Downloads Tennessee county subdivision data (2023, ACS5, wide format)
Documentation: https://walker-data.com/tidycensus/reference/get_acs.html
Function: <-
Package: base R
Purpose: Assigns values to variables
Script Role: Stores or overwrites mydata
Documentation: https://stat.ethz.ch/R-manual/R-devel/library/base/html/assignOps.html
Function: %>%
Package: magrittr / tidyverse
Purpose: Pipes output between functions
Script Role: Chains data cleaning steps
Documentation: https://magrittr.tidyverse.org/reference/pipe.html
Function: separate_wider_delim()
Package: tidyr
Purpose: Splits a column by a delimiter into multiple columns
Script Role: Splits NAME into Division, County, and State
Documentation: https://tidyr.tidyverse.org/reference/separate_wider_delim.html
Function: mutate()
Package: dplyr
Purpose: Adds or transforms columns
Script Role: Cleans up county names
Documentation: https://dplyr.tidyverse.org/reference/mutate.html
Function: str_remove()
Package: stringr
Purpose: Removes text patterns from strings
Script Role: Removes “County” from county names
Documentation: https://stringr.tidyverse.org/reference/str_remove.html
That’s pretty informative, if you take time to compare it to the script. Another approach would be to work through a script by asking ChatGPT something like, “What would R do in response to this code?” then pasting in a section of code. Ask any follow-up questions you like, then ask, “What if I then add this code?” followed by the next section of code. ChatGPT will patiently explain what R will do next. You could work through an entire script that way.
Getting AI to fix, tweak, or even write code
But AI can do even more for you. Try feeding ChatGPT the entire
script for this week’s assignment, preceded by a prompt to make some
kind of change in the script, like, “Edit this R script to filter for
counties in the Knoxville, Tennessee, MSA instead of for those in the
Nashville, Tennessee, MSA.”
When I tried it, ChatGPT modified sections 9 and 10 of the script,
showed me its alterations, and offered to generate the entire script
with the changes included. I replied that I would like that, and the
resulting script worked as promised, showing me a map and table for the
12 counties in the Knoxville MSA.
If you’re methodical about it, you can even get AI to write code for you from scratch. Here’s a YouTube video from RProgramming101’s Dr. Greg Martin explaining how to do it: