Practical R (Part 2)

Intro

Last week, the Practical R Part 1 YouTube video and RPubs page taught you how to produce and publish the first two elements below: the map showing district-level percentages of Rutherford County residents who hold a bachelor’s degree or higher, and the graph of the percentages and their error margins.

This week, as you continue working on your final paper, take a moment to learn how to produce the next two elements shown below: the multi-county map, and the table of county-level data. The Part 2 R code, below, will produce both. Part 3 of the series is coming up next week.

All you have to do to complete this assignment is integrate the code into an R Markdown document - so that the document shows the map and table when you knit it - then publish the document on RPubs.com and submit the URL via this week’s assignment drop box. Everything from How to read R code to the end of the page is optional material.

Percent with a bachelor’s degree or higher
Rutherford County, 2019-2023. Source: American Community Survey

Percent with a bachelor’s degree or higher
Nashville MSA, 2019-2023. Source: American Community Survey.

Estimates by County
(2019–2023 ACS 5-Year ACS)
County	Percent (%)	Margin of Error	Count
Williamson County, Tennessee	61.8	1.2	167,620
Davidson County, Tennessee	47.3	0.7	493,272
Wilson County, Tennessee	37.2	1.2	105,888
Rutherford County, Tennessee	34.4	1.1	223,304
Sumner County, Tennessee	32.6	1.2	138,609
Maury County, Tennessee	26.5	1.5	73,091
Cheatham County, Tennessee	25.1	2.3	29,614
Dickson County, Tennessee	21.4	1.9	38,559
Robertson County, Tennessee	21.1	1.6	50,891
Smith County, Tennessee	17.9	2.3	14,180
Cannon County, Tennessee	16.9	3.2	10,397
Hickman County, Tennessee	13.6	2.2	18,044
Macon County, Tennessee	11.8	2.3	17,135
Trousdale County, Tennessee	10.1	2.7	8,490

Estimates by Region
(2019–2023 ACS 5-Year ACS)
Region	Percent (%)	Margin of Error	Count
Davidson	47.3	0.7	493,272
Doughnut	35.4	1.4	715,926
Non-doughnut	16.9	2.3	179,896

Percent with a bachelor’s degree or higher
Tennessee, 2019-2023. Source: American Community Survey.

Part 2 R code

Below is the code for creating the multi-county map and summary table. Again, all you have to do to complete this assignment is integrate the code into an R Markdown document - so that the document shows the map and table when you knit it - then publish the document on RPubs.com and submit the URL via this week’s assignment drop box. Everything from How to read R code to the end of the page is optional material.

What’s new

Sections 0 through 5 of the code are identical to the code you ran last week. I left out sections 6 through 8, which produce the single-county map and the dot plot with error margins.

The new stuff starts with Section 9. Here, the code filters mydata, like last time, but not just for Rutherford County. Instead, it filters for all 14 counties in the Nashville Metropolitan Statistical Area. MSAs are Census-defined areas consisting of one or more large metro areas and the surrounding counties that are economically and culturally integrated with them. An MSA can sometimes include counties that lie in different states, as is the case with the Memphis MSA.
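To make the filtering idea concrete, here is a minimal sketch that uses a made-up data frame and plain base R subsetting in place of the script's filter() function. The names toy, ShortList, keep and toy_filtered are invented for this illustration, and the rows are not real ACS data. The part that carries over to the script is the %in% operator, which checks each County value against a list of names.

```r
# Toy stand-in for mydata; these rows are invented for illustration only
toy <- data.frame(
  Division = c("District 1", "District 2", "District 1", "District 3"),
  County   = c("Rutherford", "Rutherford", "Knox", "Davidson")
)

# A shortened, hypothetical county list (the script's list has 14 names)
ShortList <- c("Davidson", "Rutherford")

# %in% tests each County value against the list and returns TRUE or FALSE
keep <- toy$County %in% ShortList
keep

# Keep only the rows where the test came back TRUE
toy_filtered <- toy[keep, ]
toy_filtered
```

The script's filter(County %in% CountyNameList) line applies the same kind of test to the real data, just with all 14 county names in the list.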

The tricky part about making this map is getting the county borders to show along with the borders of the county subdivisions the map is mainly about. Making that happen requires overlaying the county subdivisions map with a map of the county borders. That’s why Section 10 includes a second get_acs() function that retrieves county-level data, rather than county-subdivision-level data, and stashes it in a data frame called county_borders. The addPolygons() function puts the county subdivision borders and data on the map by pulling them from the mapdata2 data frame. The subsequent addPolylines() function adds the county borders by pulling them from the county_borders data frame.

Pretty slick, huh?

The other new part starts in Section 11. Here, yet another get_acs() function pulls county-level results for the DP02_0059 and DP02_0068P variables (the two defined in VariableList in Section 3), and the gt() function formats them into a nice, neat table. The gt() function is the same one used all semester to show you tables of output from the statistical tests you have been running.

The code (at last)

# ============================================================
# 0. INSTALL AND LOAD REQUIRED PACKAGES
# ============================================================

if (!require("tidyverse")) install.packages("tidyverse")
if (!require("tidycensus")) install.packages("tidycensus")
if (!require("sf")) install.packages("sf")
if (!require("leaflet")) install.packages("leaflet")
if (!require("leaflet.extras2")) install.packages("leaflet.extras2")
if (!require("gt")) install.packages("gt")
if (!require("gtExtras")) install.packages("gtExtras")
if (!require("plotly")) install.packages("plotly")

library(tidyverse)
library(tidycensus)
library(sf)
library(leaflet)
library(leaflet.extras2)
library(gt)
library(gtExtras)
library(plotly)

# ============================================================
# 1. CENSUS API KEY
# ============================================================

census_api_key("PasteYourAPIKeyBetweenTheseQuoteMarks")

# ============================================================
# 2. LOAD ACS CODEBOOKS
# ============================================================

DetailedTables <- load_variables(2023, "acs5", cache = TRUE)
SubjectTables  <- load_variables(2023, "acs5/subject", cache = TRUE)
ProfileTables  <- load_variables(2023, "acs5/profile", cache = TRUE)

# ============================================================
# 3. DEFINE VARIABLES OF INTEREST
# ============================================================

VariableList <- c(
  Count_   = "DP02_0059",
  Percent_ = "DP02_0068P"
)

# ============================================================
# 4. FETCH COUNTY SUBDIVISION DATA (TENNESSEE)
# ============================================================

mydata <- get_acs(
  geography = "county subdivision",
  state = "TN",
  variables = VariableList,
  year = 2023,
  survey = "acs5",
  output = "wide",
  geometry = TRUE
)

# ============================================================
# 5. CLEAN AND REFORMAT GEOGRAPHIC NAMES
# ============================================================

mydata <- mydata %>%
  separate_wider_delim(
    NAME,
    delim  = ", ",
    names  = c("Division", "County", "State")
  ) %>%
  mutate(County = str_remove(County, " County"))

# ============================================================
# NOTE: Sections 6 through 8 of this script have been omitted.
# See "Practical R (Part 1)" for those sections.
# ============================================================

# ============================================================
# 9. FILTER DATA FOR MULTI-COUNTY AREA
# ============================================================

CountyNameList <- c(
  "Cannon","Cheatham","Davidson","Dickson","Hickman",
  "Macon","Maury","Robertson","Rutherford","Smith",
  "Sumner","Trousdale","Williamson","Wilson"
)

filtereddata2 <- mydata %>%
  filter(County %in% CountyNameList)

# ============================================================
# 10. MAP MULTI-COUNTY AREA (WITH COUNTY BORDERS)
# ============================================================

mapdata2 <- filtereddata2 %>%
  rename(
    Percent   = Percent_E,
    PercentEM = Percent_M,
    Count     = Count_E,
    CountEM   = Count_M
  ) %>%
  st_as_sf() %>%
  st_transform(4326)

# Get Tennessee county boundaries for overlay
county_borders <- get_acs(
  geography = "county",
  state = "TN",
  variables = "B01001_001",  # dummy population variable
  year = 2023,
  geometry = TRUE
) %>%
  st_as_sf() %>%
  st_transform(4326) %>%
  mutate(County = str_remove(NAME, " County, Tennessee")) %>%
  filter(County %in% CountyNameList)

# Viridis color palette
pal <- colorNumeric(
  palette = "viridis",
  domain = mapdata2$Percent
)

# Build leaflet map with county borders
DivisionMap2 <- leaflet(mapdata2) %>%
  # Base maps
  addProviderTiles("CartoDB.Positron", group = "Positron (Light)") %>%
  addProviderTiles("Esri.WorldStreetMap", group = "ESRI Streets") %>%
  addProviderTiles("Esri.WorldImagery", group = "ESRI Satellite") %>%
  # Main polygons
  addPolygons(
    fillColor = ~pal(Percent),
    fillOpacity = 0.7,
    color = "#333333",
    weight = 1,
    group = "Data Layer",
    popup = ~paste0(
      "<b>", Division, "</b><br>",
      "County: ", County, "<br>",
      "Percent: ", Percent, "%<br>",
      "Percent MOE: ±", PercentEM, "<br>",
      "Count: ", Count, "<br>",
      "Count MOE: ±", CountEM
    )
  ) %>%
  # County borders overlay
  addPolylines(
    data = county_borders,
    color = "black",
    weight = 1,
    opacity = 0.8,
    group = "County Borders"
  ) %>%
  # Legend
  addLegend(
    pal = pal,
    values = mapdata2$Percent,
    position = "topright",
    title = "Percent"
  ) %>%
  addLayersControl(
    baseGroups = c(
      "Positron (Light)",
      "ESRI Streets",
      "ESRI Satellite"
    ),
    overlayGroups = c("Data Layer", "County Borders"),
    options = layersControlOptions(
      collapsed = TRUE,
      position = "bottomleft"   # 👈 Control position
    )
  )

DivisionMap2

# ============================================================
# 11. COUNTY-LEVEL TABLE FOR SELECTED COUNTIES
# ============================================================

CountyLevelData <- get_acs(
  geography = "county",
  state = "TN",
  variables = VariableList,
  year = 2023,
  survey = "acs5",
  output = "wide",
  geometry = FALSE
)

SelectedCounties <- CountyLevelData %>% 
  mutate(NAME = str_remove(NAME, " County, Tennessee")) %>%
  filter(NAME %in% CountyNameList) %>%
  select(NAME, Percent_E, Percent_M, Count_E, Count_M) %>%
  rename(
    County  = NAME,
    Percent = Percent_E,
    PctMOE  = Percent_M,
    Count   = Count_E,
    CountMOE = Count_M
  ) %>%
  arrange(desc(Percent))

CountyTable <- SelectedCounties %>%
  gt() %>%
  gt_theme_espn() %>%
  cols_label(
    Percent = "Percent (%)",
    PctMOE = "Percent MOE",
    Count = "Count",
    CountMOE = "Count MOE"
  ) %>%
  fmt_number(columns = c(Percent, PctMOE), decimals = 1) %>%
  fmt_number(columns = c(Count), decimals = 0, use_seps = TRUE) %>%
  tab_header(
    title = "Estimates by County",
    subtitle = "(2019–2023 ACS 5-Year ACS)"
  )

CountyTable

How to read R code

You can paste the new code into an R Markdown document, add your Census API key, publish the document, send me the URL and be done, if that’s all you have time for.

But if you can, it might be worth your while to try to understand how the script works. R code may look like gobbledygook to you (sometimes, it still looks that way to me). But it actually follows patterns that conform to precise rules. Understand the patterns and rules, and you can understand the code.

To see what I mean, take a closer look at this snippet from early in the script. It’s the part that gets data from the American Community Survey server and puts it into a data frame called mydata in your RStudio window’s “Environment” tab, so that you can work with it:

# ============================================================
# 4. FETCH COUNTY SUBDIVISION DATA (TENNESSEE)
# ============================================================

mydata <- get_acs(
  geography = "county subdivision",
  state = "TN",
  variables = VariableList,
  year = 2023,
  survey = "acs5",
  output = "wide",
  geometry = TRUE
)

The first thing to understand is that anything preceded by a # is a “Comment.” R will ignore it. It’s there to help you, the code’s user, to understand what’s going on in the script. Here, the three lines preceded by # form a code section heading with a title that tells you what the section does. It retrieves county-subdivision-level data for all county subdivisions in a specified state (here, Tennessee).
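If you want to see R ignore a comment for yourself, here is a tiny sketch you can run in the console. The object name x is just a placeholder made up for this example.

```r
# This whole line is a comment, so R skips it
x <- 2 + 3   # a comment can also follow code on the same line
# x <- 100   (this assignment is commented out, so it never runs)
x
```

The three # lines above the get_acs() call in the script work the same way: R steps right over them and runs only the get_acs() code.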

The rest of the snippet illustrates the basic structure behind what most R code does: Fundamentally, a string of R code applies a function to an object.

A function is like a little mini program that R knows how to run all by itself, kind of like how your body knows how to raise your arm when you want your arm raised. You don’t have to consciously tell your body to activate the nerves that tell specific muscles in your arm to contract in all the specific ways and sequences that lift your arm. That’s all “hard-coded,” so to speak. All you have to do is think (or barely think, in most cases), “Arm, lift.” In this snippet, the get_acs() function tells R to go get data from the ACS server, using a bunch of specific steps R already knows how to do.

An object is whatever thing the function affects, or applies to. In this snippet, the object is mydata. Prior to this snippet, mydata did not exist, so R creates it on the fly. R will understand that it needs to be a data frame, because only a data frame can hold the kind of data the get_acs() function provides.

The <- is called an assignment operator. There are other assignment operators in R, but <- is the most common one. You type it by pressing the < key, followed by the - key. In RStudio, you can also get it by holding down the Alt key and pressing the - key. I usually just type it manually. It tells R to take the result of whatever is described on the right and store it under the name specified on the left. This particular line tells R to run the get_acs() function and store what it returns in the mydata data frame.
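Here is a short sketch of assignment in action, using only functions built into R. The object names name_length and toy_df are made up for the example; the two percentages are borrowed from the county table near the top of the page.

```r
# Store the result of a function call under a new name
name_length <- nchar("Rutherford")
name_length

# Assigning to the same name again overwrites the old value
name_length <- name_length + 1
name_length

# An object can hold more than a single value, such as a small data frame
toy_df <- data.frame(
  County  = c("Davidson", "Wilson"),
  Percent = c(47.3, 37.2)
)
class(toy_df)
```

The overwriting step is the same thing that happens in Section 5 of the script, where mydata is assigned a cleaned-up version of itself.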

You can spot R functions by looking for their parentheses, like the ones after the get_acs() function. The parentheses, which always come in pairs, contain arguments that tell R exactly how to do whatever the function is telling it to do. To go back to the “raising your arm” analogy: You might not have to consciously think about which nerves, muscles and so forth have to activate in order to raise your arm. But you do have to think about things like which arm to raise, how high to raise it, how quickly to raise it, and so on. Those are the kinds of variations arguments describe for a function. Here, all the things inside the get_acs() function’s parentheses are arguments telling R specific things about how to get_acs().

For example, the first argument listed, geography = "county subdivision", tells R which geography level to get ACS data for. Other available geographies include geography = "state", geography = "county", geography = "tract", geography = "zip code tabulation area", and even things like geography = "congressional district" and geography = "school district (secondary)". See Section 2.2 of the TidyCensus documentation for more.

Commas separate each argument, like the comma between the geography = "county subdivision" argument and the next argument on the list, state = "TN". Forget one of the commas, and R will write you a nasty error message.

Also: Most functions have default arguments that R will assume you want unless you specify otherwise. Some arguments have to be included overtly in the code. Others don’t have to be specified; R will quietly use the defaults. The function’s documentation will list all of the arguments and tell you about each one.
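A quick way to watch arguments and defaults at work is to try them on a small built-in function. The sketch below uses sort() rather than get_acs(), so it runs without a Census API key. The object name pcts is made up, and the three numbers are percentages taken from the county table above.

```r
pcts <- c(34.4, 61.8, 47.3)

# decreasing is left out, so sort() uses its default (FALSE): smallest first
sort(pcts)

# decreasing is stated, so sort() does as it is told: largest first
sort(pcts, decreasing = TRUE)

# Arguments given by name can appear in any order
sort(decreasing = TRUE, x = pcts)

# args() prints a function's arguments along with any defaults
args(sort)
```

The script leans on defaults, too. The get_acs() call that builds county_borders in Section 10 leaves out the survey and output arguments that the Section 4 call spells out, so R quietly uses whatever get_acs() assumes by default.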

So, in English, this code snippet is telling R to create a data frame called mydata, and fill it with the data that result from contacting the ACS server and asking for:

County-subdivision-level data
For all of Tennessee
For the variables in the (previously defined) VariableList object
For the year 2023
In that year’s five-year dataset
In “wide” format (one row per area, with each variable in its own column, instead of a separate row for every area-and-variable pair)
And - finally - with the “geometry” data included. The “geometry” data is the string of latitude and longitude coordinates needed to define the boundaries of each area on a map.
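The “wide” idea is easier to see than to describe, so here is a minimal sketch with invented numbers. It is not real ACS output, and it uses base R's reshape() function as a stand-in for what get_acs() does internally when you ask for output = "wide". The object names long and wide, and the area names, are assumptions made up for the illustration.

```r
# Toy "long" data: one row per area-and-variable pair (values are invented)
long <- data.frame(
  NAME     = c("Area A", "Area A", "Area B", "Area B"),
  variable = c("Count_", "Percent_", "Count_", "Percent_"),
  estimate = c(1000, 25, 2000, 40)
)
long

# Reshape to "wide": one row per area, one column per variable
wide <- reshape(long, idvar = "NAME", timevar = "variable", direction = "wide")
wide
```

Four rows become two, because each area now gets a single row. The real wide output from get_acs() also includes margin-of-error columns, which is why the script refers to names like Percent_E and Percent_M.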

The pipe operator

Sometimes, you’ll see a curious-looking %>% string in R code. That’s called the “pipe operator,” and it’s just a way of stringing together a whole series of functions that you want R to apply to the same object. Essentially, it means “and then, do this.” Its name comes from the idea of “piping” the same object name into two or more functions. In the script’s next section of code, for example:

# ============================================================
# 5. CLEAN AND REFORMAT GEOGRAPHIC NAMES
# ============================================================

mydata <- mydata %>%
  separate_wider_delim(
    NAME,
    delim  = ", ",
    names  = c("Division", "County", "State")
  ) %>%
  mutate(County = str_remove(County, " County"))

… the pipe operator tells R, “Take the mydata data frame, and then apply the separate_wider_delim() function to it, according to the arguments given between the function’s parentheses, and then apply the mutate() function to it, again according to the arguments between the function’s parentheses.”
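To see that the pipe really is just “and then, do this,” compare a piped version of a small calculation with the same calculation written without the pipe. This is a sketch with made-up numbers, and it loads the magrittr package directly, on the assumption that you installed it along with tidyverse. The object name values is invented for the example.

```r
library(magrittr)  # provides %>%; library(tidyverse) makes it available too

values <- c(1, 4, 9)

# Without the pipe: functions nest inside one another and read inside-out
sum(sqrt(values))

# With the pipe: take values, and then sqrt(), and then sum()
values %>%
  sqrt() %>%
  sum()
```

Both versions give the same answer. The piped one simply reads in the order the steps happen, which is why long chains like the one in Section 5 are written that way.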

Getting coding help from AI

Once you know at least a little bit about how R code works, you can get your favorite artificial intelligence chat bot to explain lines, chunks, or whole scripts of R code to you. You can even get an AI bot to debug, modify, or even write code for you.

Getting AI to explain R code

Below is what I got when I gave ChatGPT the prompt, “List every function used in this R code, what the function does in general, and what the function does in this particular case. Also: For each function, include a URL to authoritative documentation for the function,” followed by the code in sections 4 and 5 of this week’s script. You might not get exactly the same thing; ChatGPT composes each answer on the fly, sometimes informed by previous conversations you have had with the platform. But you should get something similar:

SECTION 4: FETCH COUNTY SUBDIVISION DATA (TENNESSEE)

1. Function: get_acs()
Package: tidycensus
General purpose: Retrieves American Community Survey (ACS) data from the U.S. Census Bureau’s API, including both estimates and margins of error.
In this script: get_acs() is used to fetch ACS 5-year estimates (2023) for county subdivisions in Tennessee, returning the data in a wide format with associated geometry (spatial boundaries) for mapping.
Arguments used here:
- geography = "county subdivision": Requests data at the county subdivision level (minor civil divisions).
- state = "TN": Limits the data to Tennessee.
- variables = VariableList: Uses a pre-defined list of ACS variable IDs.
- year = 2023: Specifies the 2023 ACS dataset.
- survey = "acs5": Chooses the 5-year ACS survey.
- output = "wide": Returns a single-row-per-feature format (instead of long).
- geometry = TRUE: Includes spatial boundary information as sf objects.
Documentation: https://walker-data.com/tidycensus/reference/get_acs.html

2. Function: <- (Assignment Operator)
Package: base R
General purpose: Assigns a value or object to a variable name in R.
In this script: Used twice:
- mydata <- get_acs(...): Stores the downloaded ACS dataset in the variable mydata.
- mydata <- mydata %>% ...: Overwrites mydata with its cleaned and reformatted version.
Documentation: https://stat.ethz.ch/R-manual/R-devel/library/base/html/assignOps.html

SECTION 5: CLEAN AND REFORMAT GEOGRAPHIC NAMES

3. Function: %>% (Pipe Operator)
Package: magrittr (also loaded by tidyverse and dplyr)
General purpose: Passes the output of one expression as the first argument of the next function, improving readability and flow.
In this script: Used to create a pipeline that starts with mydata, then separates the NAME field into three parts and removes "County" from the County field.
Documentation: https://magrittr.tidyverse.org/reference/pipe.html

4. Function: separate_wider_delim()
Package: tidyr
General purpose: Splits a single character column into multiple columns using a specified delimiter.
In this script: Takes the column NAME (for example, "Smithville, DeKalb County, Tennessee") and splits it into three new columns—Division, County, and State—using the delimiter ", ".
Arguments used here:
- NAME: The column being split.
- delim = ", ": The delimiter separating values.
- names = c("Division", "County", "State"): The names of the new columns created.
Documentation: https://tidyr.tidyverse.org/reference/separate_wider_delim.html

5. Function: mutate()
Package: dplyr
General purpose: Creates or transforms columns in a data frame or tibble.
In this script: Creates a modified version of the County column by removing the trailing word "County" using str_remove().
Documentation: https://dplyr.tidyverse.org/reference/mutate.html

6. Function: str_remove()
Package: stringr
General purpose: Removes the first instance of a specified pattern (regular expression) from a string.
In this script: Removes the word " County" from entries in the County column (for example, converts "DeKalb County" to "DeKalb").
Arguments used here:
- County: The input string (column).
- " County": The literal substring to remove.
Documentation: https://stringr.tidyverse.org/reference/str_remove.html

SUMMARY TABLE

Function: get_acs()
Package: tidycensus
Purpose: Retrieves ACS data from Census API
Script Role: Downloads Tennessee county subdivision data (2023, ACS5, wide format)
Documentation: https://walker-data.com/tidycensus/reference/get_acs.html

Function: <-
Package: base R
Purpose: Assigns values to variables
Script Role: Stores or overwrites mydata
Documentation: https://stat.ethz.ch/R-manual/R-devel/library/base/html/assignOps.html

Function: %>%
Package: magrittr / tidyverse
Purpose: Pipes output between functions
Script Role: Chains data cleaning steps
Documentation: https://magrittr.tidyverse.org/reference/pipe.html

Function: separate_wider_delim()
Package: tidyr
Purpose: Splits a column by a delimiter into multiple columns
Script Role: Splits NAME into Division, County, and State
Documentation: https://tidyr.tidyverse.org/reference/separate_wider_delim.html

Function: mutate()
Package: dplyr
Purpose: Adds or transforms columns
Script Role: Cleans up county names
Documentation: https://dplyr.tidyverse.org/reference/mutate.html

Function: str_remove()
Package: stringr
Purpose: Removes text patterns from strings
Script Role: Removes “County” from county names
Documentation: https://stringr.tidyverse.org/reference/str_remove.html

That’s pretty informative, if you take time to compare it to the script. Another approach would be to work through a script by asking ChatGPT something like, “What would R do in response to this code?” then pasting in a section of code. Ask any follow-up questions you like, then ask, “What if I then add this code?” followed by the next section of code. ChatGPT will patiently explain what R will do next. You could work through an entire script that way.

Getting AI to fix, tweak, or even write code

But AI can do even more for you. Try feeding ChatGPT the entire script for this week’s assignment, preceded by a prompt to make some kind of change in the script, like, “Edit this R script to filter for counties in the Knoxville, Tennessee, MSA instead of for those in the Nashville, Tennessee, MSA.”

When I tried it, ChatGPT modified sections 9 and 10 of the script, showed me its alterations, and offered to generate the entire script with the changes included. I replied that I would like that, and the resulting script worked as promised, showing me a map and table for the 12 counties in the Knoxville MSA.

If you’re methodical about it, you can even get AI to write code for you from scratch. Here’s a YouTube video from RProgramming101’s Dr. Greg Martin explaining how to do it: