1 U.S. Census Data

  • Census Bureau conducts many surveys on a variety of subjects and shares data with the public
    • Economic and business surveys, housing surveys, international data, population estimates and projections, and more
  • Decennial Census – full count for apportionment every 10 years
  • American Community Survey – source of estimated demographic data about the US population, based on a sample of about 3% of the US population
    • 1-year ACS (areas with population of 65,000 or greater)
    • 5-year ACS (moving 5-year average for all areas down to block group)

1.1 Census Hierarchy

  • Data from the decennial Census, ACS, and other Census surveys are aggregated and shared at different geographies
    • Legal entities, such as states and counties
    • Statistical entities, not official jurisdictions but used to standardize data
    • Smallest unit for decennial is block, block group for ACS
    • Availability and accuracy of estimates varies depending on geography
      Source: U.S. Census Bureau

2 Census Data and R

  • There are many R packages designed to help analysts navigate this hierarchy
  • Which data do I need? From where? From when?
  • Downloading bulk files from the Census FTP site or using data.census.gov is not practical if you regularly work with Census data
    • That leaves Census API or 3rd party providers
    • Choice depends on specific data needs (variables, geography, survey, scope, level of detail)

2.1 Census API? Use tidycensus and tigris

  • Created by Kyle Walker (https://walker-data.com/)
  • tidycensus
    • Allows R users to access Census data in a format that can be plugged into tidyverse workflows
    • Makes data wrangling, analysis, and visualization of Census data simpler and easier
  • tigris
    • Pulls official geographic boundary shapefiles for mapping and spatial analysis

2.2 Getting started with tidycensus

  • To use tidycensus, you will need to sign up for a free Census API Key from https://api.census.gov/data/key_signup.html and activate it
    • After the key is installed in your .Renviron file, it no longer needs to be included in your code
install.packages("tidycensus", repos = "http://cran.us.r-project.org")
install.packages("tigris", repos = "http://cran.us.r-project.org")
library(tidycensus)
library(tigris)
# cache downloaded boundary files locally so they are not re-downloaded each session
options(tigris_use_cache = TRUE)
census_api_key("YOUR KEY GOES HERE", install = TRUE)
# only need to include install = TRUE once
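    • After restarting R, you can confirm the key is available by checking the environment variable that census_api_key() writes to .Renviron
# check that the stored key is visible to R
Sys.getenv("CENSUS_API_KEY")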

2.3 Pulling Data: Decennial Census

  • get_decennial() can query decennial Census data from the 2000, 2010, and 2020 decennial U.S. Censuses
  • Users must include:
    • a geographic area
    • a vector of Census variable IDs
  • The code below gets data on total population by state from the 2010 decennial Census
total_population_10 <- get_decennial(geography = "state", variables = "P001001",
    year = 2010)
GEOID NAME variable value
01 Alabama P001001 4779736
02 Alaska P001001 710231
04 Arizona P001001 6392017
05 Arkansas P001001 2915918
06 California P001001 37253956
22 Louisiana P001001 4533372
21 Kentucky P001001 4339367
08 Colorado P001001 5029196
09 Connecticut P001001 3574097
10 Delaware P001001 897934

2.4 Pulling Data: American Community Survey

  • get_acs() pulls data from annual ACS estimates
  • Parameters:
    • geographic area
    • variable(s)
    • table id (optional)
    • year (defaults to the most recent five-year ACS sample)
    • survey (defaults to the 5-year ACS; can be changed to the 1-year ACS with survey = "acs1")
  • Similar to data from get_decennial(), but includes columns for estimate and margin of error instead of value
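  • As a minimal sketch of these parameters, the call below requests median household income (variable B19013_001, used again later in these notes) for all states from the default 5-year ACS
# median household income by state from the 2015-2019 5-year ACS (acs5 is the default survey)
state_income <- get_acs(geography = "state", variables = "B19013_001", year = 2019)
head(state_income)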

2.5 Selecting Variables

  • load_variables() returns a dataset of variables from the Census Bureau website and formats it for fast searching
  • two required arguments
    • year of the Census dataset or ACS sample
    • dataset name (sf1, sf3, pl, acs1, or acs5)
      • For ACS datasets, append "/profile" for the Data Profile and "/summary" for the Summary Tables
  • user can include cache = TRUE in the function call to store the data in the user’s cache directory for faster processing
v19 <- load_variables(2019, "acs5", cache = TRUE)
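  • For example, we can search the returned label column for a phrase of interest; the quick sketch below assumes dplyr and stringr are loaded (e.g. via library(tidyverse)) and that the label text contains the phrase as written
library(tidyverse)
# find variables whose label mentions median household income
v19 %>%
    filter(str_detect(label, "Median household income")) %>%
    select(name, label, concept)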

2.6 Querying Variables

  • Variables from the ACS detailed tables, data profiles, summary tables, and supplemental estimates are available
  • Supplying a table name to the table parameter returns all variables for that table
  • The code below returns all variables associated with table B01001 (sex by age), from the 2015-2019 5-year ACS:
dp_table <- get_acs(geography = "state", table = "B01001", year = 2019)
GEOID NAME variable estimate moe
01 Alabama B01001_001 4876250 NA
01 Alabama B01001_002 2359355 1270
01 Alabama B01001_003 149090 704
01 Alabama B01001_004 153494 2290
01 Alabama B01001_005 158617 2274
01 Alabama B01001_006 98257 468
01 Alabama B01001_007 64980 834
01 Alabama B01001_008 35870 1436
01 Alabama B01001_009 35040 1472
01 Alabama B01001_010 95065 1916

2.7 Narrowing Our Query

  • To get data from within specific states and/or counties, users can pass state and county names to their respective parameters
ga_income <- get_acs(geography = "county", variables = "B19013_001", state = "GA",
    year = 2019)
GEOID NAME variable estimate moe
13001 Appling County, Georgia B19013_001 40304 5180
13003 Atkinson County, Georgia B19013_001 37197 3686
13005 Bacon County, Georgia B19013_001 37519 5492
13007 Baker County, Georgia B19013_001 32917 6967
13009 Baldwin County, Georgia B19013_001 43672 3736
13011 Banks County, Georgia B19013_001 47811 3240
13013 Barrow County, Georgia B19013_001 62345 2204
13015 Bartow County, Georgia B19013_001 57423 2743
13017 Ben Hill County, Georgia B19013_001 32229 3845
13019 Berrien County, Georgia B19013_001 40415 3024
fulton_income <- get_acs(geography = "tract", variables = "B19013_001", state = "GA",
    county = "Fulton")
GEOID NAME variable estimate moe
13121000100 Census Tract 1, Fulton County, Georgia B19013_001 168396 18644
13121000200 Census Tract 2, Fulton County, Georgia B19013_001 158011 37856
13121000400 Census Tract 4, Fulton County, Georgia B19013_001 97257 30528
13121000500 Census Tract 5, Fulton County, Georgia B19013_001 90140 13217
13121000600 Census Tract 6, Fulton County, Georgia B19013_001 53686 9322
13121000700 Census Tract 7, Fulton County, Georgia B19013_001 84107 33365
13121001001 Census Tract 10.01, Fulton County, Georgia B19013_001 103846 10554
13121001002 Census Tract 10.02, Fulton County, Georgia B19013_001 31042 9051
13121001100 Census Tract 11, Fulton County, Georgia B19013_001 109426 7959
13121001201 Census Tract 12.01, Fulton County, Georgia B19013_001 80476 16521

2.8 Renaming Variables

  • We can rename variables from within our function call if we know the default names
ga_rename <- get_acs(geography = "county", state = "Georgia", variables = c(medinc = "B19013_001",
    medage = "B01002_001"), output = "wide")
GEOID NAME medincE medincM medageE medageM
13005 Bacon County, Georgia 37519 5492 36.7 0.7
13025 Brantley County, Georgia 38857 3480 41.1 0.8
13017 Ben Hill County, Georgia 32229 3845 39.9 1.1
13033 Burke County, Georgia 44151 2438 37.4 0.6
13047 Catoosa County, Georgia 56235 2290 40.4 0.4
13053 Chattahoochee County, Georgia 47096 5158 24.5 0.5
13055 Chattooga County, Georgia 36807 2268 39.4 0.7
13073 Columbia County, Georgia 82339 3532 36.9 0.4
13087 Decatur County, Georgia 41481 3584 37.8 0.6
13115 Floyd County, Georgia 48336 2266 38.3 0.3

2.9 Choosing a Data Structure

  • By default, tidycensus returns a tibble in “tidy” (or long) format:
    • Each observation forms a row
    • Each variable forms a column with one data type
    • Each observational unit forms a table
hhinc_tidy <- get_acs(geography = "state", table = "B19001", survey = "acs1", year = 2019)
GEOID NAME variable estimate moe
01 Alabama B19001_001 1897576 10370
01 Alabama B19001_002 154558 5883
01 Alabama B19001_003 103653 6001
01 Alabama B19001_004 108500 5926
01 Alabama B19001_005 98706 6491
01 Alabama B19001_006 90916 5859
01 Alabama B19001_007 105146 4149
01 Alabama B19001_008 85014 5417
01 Alabama B19001_009 87118 5163
01 Alabama B19001_010 82323 4231
  • We can specify "wide" format if we prefer, with columns for each variable
hhinc_wide <- get_acs(geography = "state", table = "B19001", survey = "acs1", year = 2019,
    output = "wide")
GEOID NAME B19001_001E B19001_001M B19001_002E B19001_002M B19001_003E B19001_003M B19001_004E B19001_004M B19001_005E B19001_005M B19001_006E B19001_006M B19001_007E B19001_007M B19001_008E B19001_008M B19001_009E B19001_009M B19001_010E B19001_010M B19001_011E B19001_011M B19001_012E B19001_012M B19001_013E B19001_013M B19001_014E B19001_014M B19001_015E B19001_015M B19001_016E B19001_016M B19001_017E B19001_017M
28 Mississippi 1100229 9887 102072 5891 72195 5563 74167 4559 63780 4127 64942 4703 58033 4432 55748 4483 51897 4901 43423 3597 86214 5966 105304 5907 122825 6357 72151 4143 47726 3282 44171 3353 35581 2826
29 Missouri 2458337 10406 155827 6687 111813 5577 115140 5621 122095 5489 111981 5528 121608 5756 113730 5553 116231 5683 104380 4723 198458 7477 260646 7646 318344 8489 216220 6681 135532 5371 129949 6087 126383 5183
30 Montana 437651 4637 23489 2194 21550 2415 21335 1726 23022 2418 18577 1955 21148 2408 20959 2769 21917 2316 20023 2670 36769 3056 44994 3051 58858 3855 39064 2723 23818 2653 23210 2351 18918 2211
31 Nebraska 771444 4475 37219 2805 30395 2679 26514 2433 35113 2753 30390 2643 36654 2792 32908 2509 37925 3333 31514 2681 64039 3492 86318 3863 111438 4655 75244 3690 47119 3029 47279 2722 41375 2731
32 Nevada 1143557 8281 68825 4164 45099 3502 39946 3403 50808 4299 44720 3607 48167 3496 46762 3920 51108 3854 51006 4531 90046 5207 123473 5807 159361 6715 116598 5686 64746 3977 69365 4085 73527 4035
33 New Hampshire 541396 5389 21229 2566 16539 2663 16216 2271 19054 2165 18377 2294 18368 2010 15983 1838 22257 2391 21203 2169 39230 3351 52252 3758 76433 4378 61021 4132 40422 3563 49858 3039 52954 3332
34 New Jersey 3286264 9466 154374 7219 95123 5069 91608 5576 113018 5398 96889 6271 112428 6077 98003 5978 104770 6257 93426 5753 202459 7998 290552 8825 401067 11429 343771 10134 250060 8064 338989 9700 499727 8735
35 New Mexico 793420 7074 65613 4413 42519 3479 41543 3752 48909 4239 41728 4027 39967 3532 32647 3073 37765 2701 28960 3159 63845 4301 79144 4473 96945 5032 60173 4229 38540 3416 39437 3428 35685 2595
36 New York 7446812 13820 497771 10405 329325 10854 288805 8790 286239 9953 261940 10138 277219 9361 246739 9135 268533 10075 242835 9161 472300 12232 658024 13405 883154 14989 709376 12558 488838 11556 634178 12227 901536 12818
37 North Carolina 4046348 15467 251158 9332 187653 8080 187757 7905 200849 7810 196035 8288 194732 6745 187488 8731 195142 7604 176142 7460 317444 9175 412432 12009 513941 13478 350665 10038 207679 8060 224461 7108 242770 6452

3 Manipulating Data

  • We can refine our queries even more by using pipes and tidyverse functions
    • Filter based on certain conditions
    • Rearrange and sort the data
    • Create new variables
    • Generate summary statistics
library(tidyverse)
library(dplyr)

median_age <- get_acs(geography = "county", variables = "B01002_001", year = 2019) %>%
    # arrange data by largest estimate value to smallest
arrange(desc(estimate)) %>%
    # only records with median age 50 or older
filter(estimate >= 50) %>%
    # create individual county and state columns
separate(NAME, into = c("county", "state"), sep = ", ")
GEOID county state variable estimate moe
12119 Sumter County Florida B01002_001 67.4 0.2
51091 Highland County Virginia B01002_001 60.9 3.5
08027 Custer County Colorado B01002_001 59.7 2.6
12015 Charlotte County Florida B01002_001 59.1 0.2
41069 Wheeler County Oregon B01002_001 59.0 3.3
51133 Northumberland County Virginia B01002_001 58.9 0.7
26131 Ontonagon County Michigan B01002_001 58.6 0.4
35021 Harding County New Mexico B01002_001 58.5 5.5
53031 Jefferson County Washington B01002_001 58.3 0.7
26001 Alcona County Michigan B01002_001 58.2 0.3

3.1 Summary Variables

  • tidycensus can calculate summary variables for us
    • these can be useful in calculating percentages and totals
# useful to assign lengthy and/or frequently used sets of variables to a vector
race_vars <- c(White = "B03002_003", Black = "B03002_004", Native = "B03002_005",
    Asian = "B03002_006", HIPI = "B03002_007", Hispanic = "B03002_012")
# this returns race/ethnicity counts for all GA counties, with total population as a summary variable
ga_race <- get_acs(geography = "county", state = "GA", variables = race_vars, summary_var = "B03002_001")
GEOID NAME variable estimate moe summary_est summary_moe
13001 Appling County, Georgia White 12745 21 18440 NA
13001 Appling County, Georgia Black 3555 209 18440 NA
13001 Appling County, Georgia Native 94 140 18440 NA
13001 Appling County, Georgia Asian 74 111 18440 NA
13001 Appling County, Georgia HIPI 0 21 18440 NA
13001 Appling County, Georgia Hispanic 1830 NA 18440 NA
13003 Atkinson County, Georgia White 4657 18 8239 NA
13003 Atkinson County, Georgia Black 1434 112 8239 NA
13003 Atkinson County, Georgia Native 81 111 8239 NA
13003 Atkinson County, Georgia Asian 0 19 8239 NA

3.2 Creating New Variables

  • We can use mutate() from dplyr to create new variables, like creating percentages from population estimates by race
ga_race_percent <- ga_race %>%
    # divide estimates for each observation by the summary estimate
mutate(percent = 100 * (estimate/summary_est)) %>%
    select(NAME, variable, percent)
NAME variable percent
Appling County, Georgia White 69.1160521
Appling County, Georgia Black 19.2787419
Appling County, Georgia Native 0.5097614
Appling County, Georgia Asian 0.4013015
Appling County, Georgia HIPI 0.0000000
Appling County, Georgia Hispanic 9.9240781
Atkinson County, Georgia White 56.5238500
Atkinson County, Georgia Black 17.4050249
Atkinson County, Georgia Native 0.9831290
Atkinson County, Georgia Asian 0.0000000

4 Analysis

4.1 Creating and Comparing Groups

  • Once data is loaded and processed, we can begin our analysis
    • This analysis often involves creating and/or comparing groups
      • demographic groups described by variables, geographic areas, time periods, etc.
    • The structure of tidycensus data makes it simple to identify and create groups
    • In this example, we will obtain county income data from Alabama and then create new income brackets
# median household income for Alabama counties
al_hh_income <- get_acs(geography = "county", table = "B19001", state = "AL", year = 2019)
GEOID NAME variable estimate moe
01001 Autauga County, Alabama B19001_001 21397 325
01001 Autauga County, Alabama B19001_002 1417 254
01001 Autauga County, Alabama B19001_003 1172 249
01001 Autauga County, Alabama B19001_004 1337 248
01001 Autauga County, Alabama B19001_005 882 214
01001 Autauga County, Alabama B19001_006 985 245
01001 Autauga County, Alabama B19001_007 699 210
01001 Autauga County, Alabama B19001_008 1050 299
01001 Autauga County, Alabama B19001_009 995 234
01001 Autauga County, Alabama B19001_010 676 180
  • Tables will often have variables with a variety of units
    • Here, we filter out the total households variable before recategorizing our income variables
al_hh_income_recode <- al_hh_income %>%
    # remove total households variable
filter(variable != "B19001_001") %>%
    # combine categories into households making less than $35,000, between
    # $35,000 and $75,000, and over $75,000
mutate(incgroup = case_when(variable < "B19001_008" ~ "below35k", variable < "B19001_013" ~
    "bw35kand75k", TRUE ~ "above75k"))
GEOID NAME variable estimate moe incgroup
01001 Autauga County, Alabama B19001_002 1417 254 below35k
01001 Autauga County, Alabama B19001_003 1172 249 below35k
01001 Autauga County, Alabama B19001_004 1337 248 below35k
01001 Autauga County, Alabama B19001_005 882 214 below35k
01001 Autauga County, Alabama B19001_006 985 245 below35k
01001 Autauga County, Alabama B19001_007 699 210 below35k
01001 Autauga County, Alabama B19001_008 1050 299 bw35kand75k
01001 Autauga County, Alabama B19001_009 995 234 bw35kand75k
01001 Autauga County, Alabama B19001_010 676 180 bw35kand75k
01001 Autauga County, Alabama B19001_011 1620 351 bw35kand75k
  • We can pass our groups to summarize() to calculate the estimated number of households in each income bracket
al_group_sums <- al_hh_income_recode %>%
    # group by new income brackets
group_by(GEOID, incgroup) %>%
    # get new sums for each group
summarize(estimate = sum(estimate))
GEOID incgroup estimate
01001 above75k 8367
01001 below35k 6492
01001 bw35kand75k 6538
01003 above75k 31509
01003 below35k 23248
01003 bw35kand75k 26173
01005 above75k 1855
01005 below35k 4920
01005 bw35kand75k 2570
01007 above75k 2139

4.2 Time-Series

  • We will often need to analyze data across a range of years
    • Rather than getting samples for each year separately and combining them, we can tell get_acs() what years we want and return a single dataset with all of them
college_vars <- c("B15002_015", "B15002_016", "B15002_017", "B15002_018", "B15002_032",
    "B15002_033", "B15002_034", "B15002_035")
# create named vector of years from 2010-2019
years <- 2010:2019
names(years) <- years
# map_dfr() maps the get_acs() function to each year, resulting in data with
# estimates for each county for each specified year
college_by_year <- map_dfr(years, ~{
    get_acs(geography = "county", variables = college_vars, state = "FL", summary_var = "B15002_001",
        survey = "acs1", year = .x)
    # organize data by year
}, .id = "year") %>%
    # arrange data by place, variable, then year
arrange(NAME, variable, year)
year GEOID NAME variable estimate moe summary_est summary_moe
2010 12001 Alachua County, Florida B15002_015 13368 1780 145526 564
2011 12001 Alachua County, Florida B15002_015 13472 1840 144360 1607
2012 12001 Alachua County, Florida B15002_015 14944 1985 148044 966
2013 12001 Alachua County, Florida B15002_015 14107 1929 149906 226
2014 12001 Alachua County, Florida B15002_015 14951 1926 151519 571
2015 12001 Alachua County, Florida B15002_015 16416 1969 156666 159
2016 12001 Alachua County, Florida B15002_015 14255 2034 159567 342
2017 12001 Alachua County, Florida B15002_015 15048 2047 161896 351
2018 12001 Alachua County, Florida B15002_015 18820 1971 167364 2346
2019 12001 Alachua County, Florida B15002_015 19122 2058 165792 708

5 Visualization

  • Another advantage of tidycensus is that it yields data that can quickly be supplied to ggplot for visualization
    • This allows for the efficient creation of both exploratory and informational plots
    • Here, we use a bar chart to visualize public transit commuting in the 20 largest U.S. metro areas
metro_transit <-  get_acs(
# core-based statistical area
  geography = "cbsa",
# percentage of people who take public transit
  variables = "DP03_0021P",
# total population for selecting only largest metros
  summary_var = "B01003_001",
  survey = "acs1",
  year = 2019
) %>%
# slice the data to include only the largest metros by population
  slice_max(summary_est, n = 20)
# simple bar chart
metrobar <- ggplot(metro_transit, aes(x = NAME, y = estimate)) +
  geom_col()

5.1 Refining Plots

  • Less time spent on acquiring data leaves more time for refining our visualizations for sharing and publication
# Use just the primary city in the metro for the label
metrobarfancy <- metro_transit %>%
    mutate(NAME = str_remove(NAME, "-.*$")) %>%
    mutate(NAME = str_remove(NAME, ",.*$")) %>%
    ggplot(aes(y = reorder(NAME, estimate), x = estimate)) + geom_col(color = "navy",
    fill = "navy", alpha = 0.5, width = 0.85) + theme_minimal(base_size = 12) + scale_x_continuous(breaks = c(0,
    10, 20, 30), labels = c("0%", "10%", "20%", "30%")) + labs(title = "Metro Public Transit Commuting",
    subtitle = "2019 ACS 1-year Estimates", y = "", x = "Share of All Commuters Using Public Transit",
    caption = "Source: U.S. Census Bureau via the tidycensus R Package")

5.2 Visualizing Margin of Error

  • The inclusion of the margin of error in ACS data makes it simple to examine the reliability of estimates
    • This will be particularly useful given concerns about data quality from the 2020 Census
    • Here, we look at median household income estimates in Louisiana parishes
# median household income for Louisiana parishes
la_income <- get_acs(state = "Louisiana", geography = "county", variables = c(hhincome = "B19013_001"),
    year = 2019) %>%
    # simplify parish name
mutate(NAME = str_remove(NAME, " Parish, Louisiana")) %>%
    # arrange by margin of error largest to smallest
arrange(desc(moe))
GEOID NAME variable estimate moe
22023 Cameron hhincome 53423 11624
22025 Catahoula hhincome 40129 9276
22007 Assumption hhincome 43759 7458
22091 St. Helena hhincome 43886 7437
22021 Caldwell hhincome 37691 7351
22059 LaSalle hhincome 42104 6936
22123 West Carroll hhincome 38500 6370
22029 Concordia hhincome 32500 6319
22127 Winn hhincome 38353 5742
22125 West Feliciana hhincome 59637 5661
  • ggplot has geoms, such as geom_errorbarh(), that let us draw the margin of error around each estimate as an interval
# reorder by estimate, add error bar to show confidence interval from margin of
# error
la_income_moe <- ggplot(la_income, aes(x = estimate, y = reorder(NAME, estimate))) +
    geom_errorbarh(aes(xmin = estimate - moe, xmax = estimate + moe)) + geom_point(size = 3,
    color = "darkgreen") + theme_minimal(base_size = 12.5) + labs(title = "Median Household Income",
    subtitle = "Counties in Louisiana", x = "2015-2019 ACS 5-year estimate", y = "") +
    scale_x_continuous(labels = scales::dollar)

5.3 Showing Change Over Time

  • We may also want to see how a variable and/or the margin of error changes over time
    • Here we create another multi-year dataset for median home values in Fulton County, GA
# supply years we want samples for
years <- 2005:2019
names(years) <- years
# use map_dfr() to supply each year to get_acs() and return median home value
# by county for 2005-2019
fulton_value <- map_dfr(years, ~{
    get_acs(geography = "county", variables = "B25077_001", state = "GA", county = "Fulton",
        year = .x, survey = "acs1")
}, .id = "year")
year GEOID NAME variable estimate moe
2005 13121 Fulton County, Georgia B25077_001 243600 7046
2006 13121 Fulton County, Georgia B25077_001 270000 9456
2007 13121 Fulton County, Georgia B25077_001 267800 12865
2008 13121 Fulton County, Georgia B25077_001 284400 8583
2009 13121 Fulton County, Georgia B25077_001 257700 8968
2010 13121 Fulton County, Georgia B25077_001 244300 7932
2011 13121 Fulton County, Georgia B25077_001 230500 9223
2012 13121 Fulton County, Georgia B25077_001 229900 8839
2013 13121 Fulton County, Georgia B25077_001 231800 8471
2014 13121 Fulton County, Georgia B25077_001 248800 9176
  • We can see how home values changed from 2005 to 2019 along with the margin of error
# group = 1 overrides the default grouping so ggplot treats the dataset as a
# whole
fulton_timeseries <- ggplot(fulton_value, aes(x = year, y = estimate, group = 1)) +
    geom_ribbon(aes(ymax = estimate + moe, ymin = estimate - moe), fill = "navy",
        alpha = 0.4) + geom_line(color = "navy") + geom_point(color = "navy", size = 2) +
    theme_minimal(base_size = 12) + scale_y_continuous(labels = scales::dollar) +
    labs(title = "Median Home Value in Fulton County, GA", x = "Year", y = "AMedian Home Value",
        caption = "Shaded Area Represents the Margin of Error of the ACS Estimate")

5.4 Group Visualizations

  • We created groups from income levels earlier, but the ability to filter our geographies by state and/or county allows us to compare groups of different geographies
    • Here we look at median home values across six core Atlanta metro counties
housing_val <- get_acs(
  geography = "tract",
# median home value
  variables = "B25077_001",
  state = "GA",
# only these counties
  county = c(
    "Fulton",
    "DeKalb",
    "Clayton",
    "Cobb",
    "Gwinnett",
    "Henry"
  )
) %>%
# separate NAME column into columns for tract, county, and state
  separate(
  NAME,
  into = c("tract", "county", "state"),
  sep = ", "
)
GEOID tract county state variable estimate moe
13063040202 Census Tract 402.02 Clayton County Georgia B25077_001 82600 10685
13063040203 Census Tract 402.03 Clayton County Georgia B25077_001 135900 21955
13063040204 Census Tract 402.04 Clayton County Georgia B25077_001 118000 9363
13063040302 Census Tract 403.02 Clayton County Georgia B25077_001 60300 7287
13063040303 Census Tract 403.03 Clayton County Georgia B25077_001 66400 9785
13063040306 Census Tract 403.06 Clayton County Georgia B25077_001 66400 30049
13063040307 Census Tract 403.07 Clayton County Georgia B25077_001 89900 10646
13063040308 Census Tract 403.08 Clayton County Georgia B25077_001 59000 19547
13063040407 Census Tract 404.07 Clayton County Georgia B25077_001 94700 10923
13063040408 Census Tract 404.08 Clayton County Georgia B25077_001 119600 16981
  • ggplot allows us to quickly generate multiple comparisons
  • We can display groups on one plot using color
# density curve colored by county
county_density <- ggplot(housing_val, aes(x = estimate, fill = county)) + geom_density(alpha = 0.3)

  • Or we can create plots for each county using facet_wrap()
wrapped_density <- ggplot(housing_val, aes(x = estimate)) + geom_density(fill = "darkgreen",
    color = "darkgreen", alpha = 0.5) + facet_wrap(~county) + scale_x_continuous(labels = function(x) paste0("$",
    x/1000, "k")) + theme_minimal(base_size = 14) + theme(axis.text.y = element_blank()) +
    labs(x = "Median Home Value", y = "", title = "Home Values by Census Tract, 2015-2019 ACS")

6 Bringing in Geographic Data

  • Census and ACS data are associated with/aggregated at geographies
    • Legal entities (states and counties)
    • Statistical entities (census tracts and block groups)
    • Geographic features (roads and water features)
  • Shapefiles are available from the U.S. Census Bureau’s TIGER/Line database (Topologically Integrated Geographic Encoding and Referencing)
    • Ready for mapping and spatial analysis
  • The tigris package allows us to access these files programmatically
  • The sf (simple features) package represents spatial data much like an R data frame, but with a special geometry column that stores the shape of each feature

6.1 Structure of sf objects from tigris

  • sf objects from tigris contain geographic identifiers, names, coordinate information, and land and water area
    • The id and name columns can be used for joining with tidycensus data
library(tigris)
# get boundaries of all Florida counties
fl_counties <- counties("FL")
# print the sf object
fl_counties
# show object type
class(fl_counties)
## Simple feature collection with 67 features and 17 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -87.6349 ymin: 24.39631 xmax: -79.97431 ymax: 31.00097
## CRS:            4269
## First 10 features:
##     STATEFP COUNTYFP COUNTYNS GEOID       NAME          NAMELSAD LSAD CLASSFP
## 72       12      053 00295751 12053   Hernando   Hernando County   06      H1
## 73       12      129 00306912 12129    Wakulla    Wakulla County   06      H1
## 74       12      131 00295727 12131     Walton     Walton County   06      H1
## 79       12      127 00306921 12127    Volusia    Volusia County   06      H1
## 196      12      051 00307626 12051     Hendry     Hendry County   06      H1
## 227      12      095 00295750 12095     Orange     Orange County   06      H1
## 284      12      011 00295753 12011    Broward    Broward County   06      H1
## 312      12      099 00295761 12099 Palm Beach Palm Beach County   06      H1
## 367      12      041 00303633 12041  Gilchrist  Gilchrist County   06      H1
## 371      12      086 00295755 12086 Miami-Dade Miami-Dade County   06      H1
##     MTFCC CSAFP CBSAFP METDIVFP FUNCSTAT      ALAND     AWATER    INTPTLAT
## 72  G4020  <NA>  45300     <NA>        A 1224962206  300591492 +28.5730426
## 73  G4020  <NA>  45220     <NA>        A 1570615733  334902363 +30.1394320
## 74  G4020  <NA>  18880     <NA>        A 2687740188  522795160 +30.6312106
## 79  G4020   422  19660     <NA>        A 2852204344  857791723 +29.0577690
## 196 G4020   163  17500     <NA>        A 2994619266   87037213 +26.5399670
## 227 G4020   422  36740     <NA>        A 2335820971  262462219 +28.5143906
## 284 G4020   370  33100    22744        A 3115289087  310801098 +26.1935353
## 312 G4020   370  33100    48424        A 5087468869 1084901519 +26.6491257
## 367 G4020   264  23540     <NA>        A  905705573   14362143 +29.7234556
## 371 G4020   370  33100    33124        A 4920565755 1376144237 +25.6105799
##         INTPTLON                       geometry
## 72  -082.4662272 MULTIPOLYGON (((-82.25329 2...
## 73  -084.3748463 MULTIPOLYGON (((-84.50434 3...
## 74  -086.1766139 MULTIPOLYGON (((-86.3916 30...
## 79  -081.1617920 MULTIPOLYGON (((-81.5028 29...
## 196 -081.1521142 MULTIPOLYGON (((-81.27172 2...
## 227 -081.3232839 MULTIPOLYGON (((-81.65739 2...
## 284 -080.4766834 MULTIPOLYGON (((-80.29699 2...
## 312 -080.4483542 MULTIPOLYGON (((-80.4824 26...
## 367 -082.7958011 MULTIPOLYGON (((-82.95908 2...
## 371 -080.4970989 MULTIPOLYGON (((-80.44093 2...
## [1] "sf"         "data.frame"
# plot geometry
plot(fl_counties$geometry)
plot(sf::st_geometry(fl_counties))
  • Using the sf object's geometry column or the st_geometry() function returns the same result

6.2 Polygons, Points, and Lines

  • We can grab a variety of geometry types from tigris and quickly plot them
# get boundaries for all tracts in Washington, D.C.
dc_tracts <- tracts("DC")
# all points of interest in DC
dc_landmarks <- landmarks("DC", type = "point")
# and all major roads in DC
dc_roads <- primary_secondary_roads("DC")
# plot each geometry
par(mar = c(4, 4, 0.1, 0.1))
plot(dc_tracts$geometry)
plot(dc_landmarks$geometry)
plot(sf::st_geometry(dc_roads))

7 Mapping Census Data

  • Combining tidycensus data and tigris geographies offers many mapping possibilities
    • It is often useful to examine basic geometry plots before adding data
library(ggplot2)
# cowplot allows us to combine separate plots
library(cowplot)
# get Broward County, FL census tracts from tigris
broward_tracts <- tracts("FL", "Broward")
# get Broward block groups from tigris
broward_block_groups <- block_groups("FL", "Broward")
# plot Broward census tracts with ggplot
gg1 <- ggplot(broward_tracts) + geom_sf() + theme_void() + labs(title = "Census tracts")
# plot Broward block groups with ggplot
gg2 <- ggplot(broward_block_groups) + geom_sf() + theme_void() + labs(title = "Block groups")
# combine two plots into one grid with cowplot
cowplot::plot_grid(gg1, gg2)

7.1 TIGER/Line vs. Cartographic Boundaries

  • TIGER/Line shapefiles and cartographic boundary files (cb = TRUE) differ in their level of detail
ms_counties <- counties("MS")
ms_counties_cb <- counties("MS", cb = TRUE)

ms_tiger_gg <- ggplot(ms_counties) + geom_sf() + theme_void() + labs(title = "TIGER/Line")

ms_cb_gg <- ggplot(ms_counties_cb) + geom_sf() + theme_void() + labs(title = "Cartographic boundary")
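  • As with the Broward example, the two plots can be placed side by side with cowplot (assuming it is still loaded) to compare the level of boundary detail
cowplot::plot_grid(ms_tiger_gg, ms_cb_gg)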

7.2 Adding Census Data

  • If we pass geometry = TRUE to tidycensus functions such as get_acs(), the returned data can go straight to ggplot for mapping
    • geom_sf() reads and plots geographic coordinates from sf objects
    • Here, we map census tracts in Montgomery County, Alabama and shade them by Black population
library(tidycensus)
library(tidyverse)
library(tigris)
options(tigris_use_cache = TRUE)

racevars <- c(White = "B03002_003", Black = "B03002_004", Asian = "B03002_006", Hispanic = "B03002_012")

montgomery <- get_acs(state = "AL", county = "Montgomery", geography = "tract", variables = racevars,
    geometry = TRUE, summary_var = "B03002_001")

head(montgomery)

# map of Montgomery County tracts shaded by Black population estimate
montgomery_map <- montgomery %>%
    filter(variable == "Black") %>%
    ggplot(aes(fill = estimate)) + geom_sf(color = NA) + scale_fill_viridis_c(option = "magma") +
    theme_void()

# create different map for each race showing population share
montgomery_facetmap <- montgomery %>%
    mutate(pct = 100 * (estimate/summary_est)) %>%
    ggplot(aes(fill = pct)) + facet_wrap(~variable) + geom_sf(color = NA) + scale_fill_viridis_c() +
    theme_void()
  • Just as we split an earlier plot by county, we can use facet_wrap to split this plot by each racial group contained in the data

7.3 Modeling

  • tidycensus also makes it easy to model the characteristics of our data
    • First, we pull data for metro Miami counties
library(tidycensus)
library(sf)
# counties of interest
mia_counties <- c("Miami-Dade", "Broward", "Palm Beach")
# labeled vector of variables
variables_to_get <- c(median_value = "B25077_001", median_rooms = "B25018_001", median_income = "DP03_0062",
    total_population = "B01003_001", median_age = "B01002_001", pct_college = "DP02_0068P",
    pct_foreign_born = "DP02_0094P", pct_white = "DP05_0077P", median_year_built = "B25037_001",
    percent_ooh = "DP04_0046P")

mia_data <- get_acs(geography = "tract", variables = variables_to_get, state = "FL",
    county = mia_counties, geometry = TRUE, output = "wide", year = 2019) %>%
    select(-NAME)
GEOID median_valueE median_valueM median_roomsE median_roomsM total_populationE total_populationM median_ageE median_ageM median_year_builtE median_year_builtM median_incomeE median_incomeM pct_collegeE pct_collegeM pct_foreign_bornE pct_foreign_bornM pct_whiteE pct_whiteM percent_oohE percent_oohM geometry
12011010102 440900 30194 5.2 0.3 2898 242 49.4 2.8 1966 2 92734 17427 46.1 7.8 306 NA 91.7 3.6 84.6 5.5 MULTIPOLYGON (((-80.09418 2…
12011010103 348000 23554 4.4 0.2 3144 298 58.5 1.6 1967 2 69583 18900 41.0 6.6 419 NA 88.9 4.0 75.3 6.0 MULTIPOLYGON (((-80.09123 2…
12011010104 345500 32173 3.7 0.2 2156 331 57.0 3.8 1972 5 55612 6151 47.8 8.6 314 NA 84.6 6.5 61.9 9.7 MULTIPOLYGON (((-80.08109 2…
12011010200 166400 26243 4.3 0.2 6462 685 43.9 3.2 1973 2 48442 3829 25.9 6.3 2815 NA 65.6 8.0 58.0 8.1 MULTIPOLYGON (((-80.10748 2…
12011010304 193600 13564 4.8 0.3 4584 756 34.3 3.9 1974 2 47168 7151 8.6 4.1 1071 NA 8.3 4.3 55.4 8.4 MULTIPOLYGON (((-80.11944 2…
12011010305 96900 22040 3.8 0.2 4319 587 35.7 4.6 1985 2 44193 5169 18.0 7.1 1482 NA 38.6 10.5 12.5 3.6 MULTIPOLYGON (((-80.12199 2…
12011010306 126700 16456 4.1 0.2 2708 354 41.4 5.8 1973 3 40843 6934 21.3 7.2 476 NA 25.5 5.9 42.9 6.1 MULTIPOLYGON (((-80.10789 2…
12011010307 165900 13992 4.1 0.3 4028 353 38.6 4.9 1980 2 38105 4475 10.7 3.7 802 NA 27.7 5.5 23.7 6.3 MULTIPOLYGON (((-80.12833 2…
12011010308 296700 28258 4.3 0.2 5088 547 47.5 6.3 1986 2 64821 12747 41.1 7.8 1186 NA 68.1 10.0 43.2 6.7 MULTIPOLYGON (((-80.15345 2…
12011010401 333700 12144 5.3 0.3 4973 331 38.6 3.4 1986 2 76940 7263 34.3 5.3 1146 NA 64.6 7.8 70.0 6.2 MULTIPOLYGON (((-80.17015 2…

7.4 Examining the Data

  • Then we create a map of median home values and visualize its distribution with a histogram
library(cowplot)

mhv_map <- ggplot(mia_data, aes(fill = median_valueE)) + geom_sf(color = NA) + scale_fill_viridis_c(labels = scales::dollar) +
    theme_void() + labs(fill = "Median home value")

mhv_histogram <- ggplot(mia_data, aes(x = median_valueE)) + geom_histogram(alpha = 0.5,
    fill = "navy", color = "navy", bins = 100) + theme_minimal() + scale_x_continuous(labels = scales::label_number_si(accuracy = NULL)) +
    labs(x = "Median home value")

cowplot::plot_grid(mhv_map, mhv_histogram)

7.5 Comparing Distributions

  • Because the data is right-skewed, we can apply a log transformation and redraw both the map and the histogram
library(tidyverse)
library(cowplot)

mhv_map_log <- ggplot(mia_data, aes(fill = log(median_valueE))) + geom_sf(color = NA) +
    scale_fill_viridis_c() + theme_void() + labs(fill = "Median home\nvalue (log)")

mhv_histogram_log <- ggplot(mia_data, aes(x = log(median_valueE))) + geom_histogram(alpha = 0.5,
    fill = "navy", color = "navy", bins = 100) + theme_minimal() + scale_x_continuous() +
    labs(x = "Median home value (log)")

plot_grid(mhv_map_log, mhv_histogram_log)

7.6 Looking at Correlations

  • We may also want to know how variables in our data are correlated
    • Compatibility with tidyverse functions makes it easy to drop unnecessary variables, remove NA values, and format variables of interest for calculating correlations
library(corrr)
library(units)

mia_data_for_model <- mia_data %>%
# use tract area to calculate population density in persons per square kilometer
  mutate(pop_density = as.numeric(set_units(total_populationE / st_area(.), "1/km2")),
# subtract median year built from 2017, the midpoint of the 2015-2019 sample range
         median_structure_age = 2017 - median_year_builtE) %>%
# select estimates but not margins of error
  select(!ends_with("M")) %>%
# remove "E" at end of variable names
  rename_with(.fn = ~str_remove(.x, "E$")) %>%
# remove NA values
  na.omit()
GEOID median_value median_rooms total_population median_age median_year_built median_income pct_college pct_foreign_born pct_white percent_ooh geometry pop_density median_structure_age
12011010102 440900 5.2 2898 49.4 1966 92734 46.1 306 91.7 84.6 MULTIPOLYGON (((-80.09418 2… 1726.025 51
12011010103 348000 4.4 3144 58.5 1967 69583 41.0 419 88.9 75.3 MULTIPOLYGON (((-80.09123 2… 1371.705 50
12011010104 345500 3.7 2156 57.0 1972 55612 47.8 314 84.6 61.9 MULTIPOLYGON (((-80.08109 2… 2605.186 45
12011010200 166400 4.3 6462 43.9 1973 48442 25.9 2815 65.6 58.0 MULTIPOLYGON (((-80.10748 2… 2344.356 44
12011010304 193600 4.8 4584 34.3 1974 47168 8.6 1071 8.3 55.4 MULTIPOLYGON (((-80.11944 2… 3985.536 43
12011010305 96900 3.8 4319 35.7 1985 44193 18.0 1482 38.6 12.5 MULTIPOLYGON (((-80.12199 2… 1754.610 32
12011010306 126700 4.1 2708 41.4 1973 40843 21.3 476 25.5 42.9 MULTIPOLYGON (((-80.10789 2… 2186.491 44
12011010307 165900 4.1 4028 38.6 1980 38105 10.7 802 27.7 23.7 MULTIPOLYGON (((-80.12833 2… 1317.291 37
12011010308 296700 4.3 5088 47.5 1986 64821 41.1 1186 68.1 43.2 MULTIPOLYGON (((-80.15345 2… 1904.333 31
12011010401 333700 5.3 4973 38.6 1986 76940 34.3 1146 64.6 70.0 MULTIPOLYGON (((-80.17015 2… 2665.559 31
  • We need to make sure to use the sf function st_drop_geometry() before calculating correlations
    • This turns the sf object into a regular data frame, an essential step because the geometry column is incompatible with many R functions
mia_estimates <- mia_data_for_model %>%
    # remove GEOID, median_value, and median_year_built variables from data
select(-GEOID, -median_value, -median_year_built) %>%
    # remove geometry column (converts sf object to regular data frame)
st_drop_geometry()
# calculate pearson correlation
correlations <- correlate(mia_estimates, method = "pearson")
# network plot of correlation
network_plot(correlations)
median_rooms total_population median_age median_income pct_college pct_foreign_born pct_white percent_ooh pop_density median_structure_age
5.2 2898 49.4 92734 46.1 306 91.7 84.6 1726.025 51
4.4 3144 58.5 69583 41.0 419 88.9 75.3 1371.705 50
3.7 2156 57.0 55612 47.8 314 84.6 61.9 2605.186 45
4.3 6462 43.9 48442 25.9 2815 65.6 58.0 2344.356 44
4.8 4584 34.3 47168 8.6 1071 8.3 55.4 3985.536 43
3.8 4319 35.7 44193 18.0 1482 38.6 12.5 1754.610 32
4.1 2708 41.4 40843 21.3 476 25.5 42.9 2186.491 44
4.1 4028 38.6 38105 10.7 802 27.7 23.7 1317.291 37
4.3 5088 47.5 64821 41.1 1186 68.1 43.2 1904.333 31
5.3 4973 38.6 76940 34.3 1146 64.6 70.0 2665.559 31
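  • To follow up on the correlations, we can sketch a simple linear model of log median home value on a few of the tract characteristics assembled in mia_data_for_model (the predictors chosen here are illustrative)
# illustrative linear model of log median home value on selected tract characteristics
home_value_model <- lm(log(median_value) ~ median_rooms + median_income + pct_college +
    percent_ooh + pop_density + median_structure_age, data = mia_data_for_model)
summary(home_value_model)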

8 Microdata

  • ACS tables are individual responses aggregated to a geography
  • For most purposes, these tables have the data we need
  • Census Bureau also releases microdata from the ACS via the Public Use Microdata Sample (PUMS)
    • Individual- and household-level responses to the ACS are available with the get_pums() function
    • One row per respondent instead of geography
    • Smallest geography is Public Use Microdata Area (PUMA)
  • Allows for custom estimates and more granular analysis

8.1 PUMS variables

  • Like with ACS data, tidycensus allows us to load, filter, and inspect PUMS variables
install.packages(c("survey", "srvyr"))
library(tidyverse)
library(tidycensus)
# 2019 ACS 1-year variables
pums_vars_2019 <- pums_variables %>%
    filter(year == 2019, survey == "acs1")
# keep only unique variables
pums_vars_2019 %>%
    distinct(var_code, var_label, data_type, level)
  • PUMS variables come with codes, value labels, and levels (person or household)
    var_code var_label data_type level
    RT Record Type chr NA
    SERIALNO Housing unit/GQ person serial number chr NA
    DIVISION Division code based on 2010 Census definitions chr NA
    PUMA Public use microdata area code (PUMA) based on 2010 Census definition (areas with population of 100,000 or more, use with ST for unique code) chr NA
    REGION Region code based on 2010 Census definitions chr NA
    ST State Code based on 2010 Census definitions chr NA
    ADJHSG Adjustment factor for housing dollar amounts (6 implied decimal places) chr housing
    ADJINC Adjustment factor for income and earnings dollar amounts (6 implied decimal places) chr housing
    WGTP Housing Unit Weight num housing
    NP Number of persons in this household num housing
# only individual level records
pums_vars_2019 %>%
    distinct(var_code, var_label, data_type, level) %>%
    filter(level == "person")
  • Many variables are only available at one level, so we need to make sure our variable of interest matches the level of detail we want to use
    var_code var_label data_type level
    SPORDER Person number num person
    PWGTP Person’s weight num person
    AGEP Age num person
    CIT Citizenship status chr person
    CITWP Year of naturalization write-in num person
    COW Class of worker chr person
    DDRS Self-care difficulty chr person
    DEAR Hearing difficulty chr person
    DEYE Vision difficulty chr person
    DOUT Independent living difficulty chr person

8.2 Loading PUMS Data and Recoding

  • Many PUMS variables have numeric values that represent categories
ga_pums <- get_pums(variables = c("PUMA", "SEX", "AGEP", "SCHL"), state = "GA", survey = "acs1",
    year = 2018)
  • Here, we load PUMS data for Georgia
    • May be unclear what values mean for some variables
      SERIALNO SPORDER WGTP PWGTP AGEP PUMA ST SCHL SEX
      2018GQ0000025 1 0 68 51 03700 13 13 1
      2018GQ0000043 1 0 89 23 04000 13 20 2
      2018GQ0000061 1 0 10 43 00500 13 17 1
      2018GQ0000076 1 0 11 20 04300 13 19 2
      2018GQ0000145 1 0 4 23 04200 13 13 1
      2018GQ0000336 1 0 90 36 01800 13 19 1
      2018GQ0000482 1 0 76 92 04400 13 16 2
      2018GQ0000518 1 0 76 48 02600 13 17 1
      2018GQ0000530 1 0 15 19 03200 13 19 2
      2018GQ0000563 1 0 50 87 03900 13 21 2
    • By including recode = TRUE in our function call, any labeled variables will be added alongside their unlabeled versions
    ga_pums_recoded <- get_pums(variables = c("PUMA", "SEX", "AGEP", "SCHL"), state = "GA",
        survey = "acs1", year = 2018, recode = TRUE)
    SERIALNO SPORDER WGTP PWGTP AGEP PUMA ST SCHL SEX ST_label SCHL_label SEX_label
    2018GQ0000025 1 0 68 51 03700 13 13 1 Georgia/GA Grade 10 Male
    2018GQ0000043 1 0 89 23 04000 13 20 2 Georgia/GA Associate’s degree Female
    2018GQ0000061 1 0 10 43 00500 13 17 1 Georgia/GA GED or alternative credential Male
    2018GQ0000076 1 0 11 20 04300 13 19 2 Georgia/GA 1 or more years of college credit, no degree Female
    2018GQ0000145 1 0 4 23 04200 13 13 1 Georgia/GA Grade 10 Male
    2018GQ0000336 1 0 90 36 01800 13 19 1 Georgia/GA 1 or more years of college credit, no degree Male
    2018GQ0000482 1 0 76 92 04400 13 16 2 Georgia/GA Regular high school diploma Female
    2018GQ0000518 1 0 76 48 02600 13 17 1 Georgia/GA GED or alternative credential Male
    2018GQ0000530 1 0 15 19 03200 13 19 2 Georgia/GA 1 or more years of college credit, no degree Female
    2018GQ0000563 1 0 50 87 03900 13 21 2 Georgia/GA Bachelor’s degree Female
    • PUMS data uses weights to allow estimation of population-level statistics from a relatively small sample
    • With the recoded labels, we can create a population count by PUMA and sex by summing the person weight (PWGTP)
    # weighted population count by PUMA and sex
    ga_pums_recoded %>%
        count(PUMA, SEX_label, wt = PWGTP)
    PUMA SEX_label n
    00100 Male 74196
    00100 Female 79139
    00200 Male 59547
    00200 Female 58650
    00300 Male 77869
    00300 Female 77896
    00401 Male 77689
    00401 Female 87415
    00402 Male 61002
    00402 Female 63279
    • In addition to weighted counts, we can also create weighted means and percentages
      • Here, we calculate summary statistics for male and female individuals 25 and older with a Bachelor’s degree or above
    # create a new variable that is whether or not the person has a Bachelor’s
    # degree or above, group by PUMA and sex, then calculate the total population,
    # average age, total with BA or above (only for people 25 and older), and
    # percent with BA or above.
    ga_pums_recoded %>%
        mutate(ba_above = SCHL %in% c("21", "22", "23", "24")) %>%
        group_by(PUMA, SEX_label) %>%
        summarize(total_pop = sum(PWGTP), mean_age = weighted.mean(AGEP, PWGTP), ba_above = sum(PWGTP[ba_above ==
            TRUE & AGEP >= 25]), ba_above_pct = ba_above/sum(PWGTP[AGEP >= 25]))
    • This yields population counts, average age, and counts and percentages of educational attainment
      PUMA SEX_label total_pop mean_age ba_above ba_above_pct
      00100 Male 74196 38.47365 12848 0.2598233
      00100 Female 79139 40.34941 16232 0.2947468
      00200 Male 59547 31.83490 6436 0.1846613
      00200 Female 58650 34.55173 9715 0.2611489
      00300 Male 77869 34.31346 9234 0.2053871
      00300 Female 77896 36.61063 10139 0.2078729
      00401 Male 77689 35.36655 14187 0.2845881
      00401 Female 87415 37.84550 16109 0.2770106
      00402 Male 61002 38.24930 14338 0.3405215
      00402 Female 63279 41.18079 19165 0.4163951

8.3 Modeling Microdata

  • Because PUMS data sample sizes can get very small when we filter or create groups, we will often want to calculate standard errors and check the reliability of the data
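    • A minimal sketch of that workflow: re-request the Georgia PUMS data with person replicate weights, convert it to a survey design object with to_survey(), and let srvyr compute standard errors (the grouping and summary below are illustrative)
library(srvyr)
# re-download the Georgia PUMS data, this time including person replicate weights
ga_pums_rep <- get_pums(variables = c("PUMA", "SEX", "AGEP", "SCHL"), state = "GA",
    survey = "acs1", year = 2018, recode = TRUE, rep_weights = "person")
# convert to a survey design object so standard errors can be calculated
ga_survey <- to_survey(ga_pums_rep)
# weighted mean age by sex, with standard errors
ga_survey %>%
    group_by(SEX_label) %>%
    summarize(mean_age = survey_mean(AGEP))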

8.4 Mapping Microdata

  • PUMS data is mapped at the PUMA level, so it is useful to examine how that data differs from more common geographies
    • Here we use tigris to pull PUMA boundaries for six states
# create vector of southeastern state abbreviations
se_states <- c("GA", "FL", "AL", "TN", "LA", "MS")
# use map() to pass each state to the tigris pumas() function to get their puma
# geographic files
se_pumas <- map(se_states, tigris::pumas, class = "sf", cb = TRUE) %>%
    reduce(rbind)
  • We can see that PUMA naming conventions are quite different from those of other geographies
    STATEFP10 PUMACE10 AFFGEOID10 GEOID10 NAME10 LSAD10 ALAND10 AWATER10 geometry
    13 00700 7950000US1300700 1300700 Southern Georgia Regional Commission (West) P0 6023715229 99859689 MULTIPOLYGON (((-83.33829 3…
    13 01800 7950000US1301800 1301800 River Valley Regional Commission (Outside Muscogee & Chattahoochee Counties) P0 12319655378 221465656 MULTIPOLYGON (((-85.18491 3…
    13 03900 7950000US1303900 1303900 Northeast Georgia Regional Commission (Southwest)–Walton, Morgan & Jasper Counties P0 2700385304 40042206 MULTIPOLYGON (((-83.98154 3…
    13 02700 7950000US1302700 1302700 Northwest Georgia Regional Commission (North Central)–Whitfield County P0 752230670 1597015 MULTIPOLYGON (((-85.16705 3…
    13 04000 7950000US1304000 1304000 Central Savannah River Area Regional Commission (East Central)–Richmond County P0 840018751 11014326 MULTIPOLYGON (((-82.3503 33…
  • Here we select the PUMA and POVPIP (income-to-poverty-ratio) variables for states in the Fed 6th District
# PUMA and income-to-poverty ratio
se_pums <- get_pums(variables = c("PUMA", "POVPIP"), state = se_states, survey = "acs1",
    year = 2018)
SERIALNO SPORDER WGTP PWGTP POVPIP PUMA ST
2018GQ0000025 1 0 68 -1 03700 13
2018GQ0000028 1 0 20 -1 01800 47
2018GQ0000043 1 0 89 -1 04000 13
2018GQ0000049 1 0 75 -1 01600 01
2018GQ0000052 1 0 42 -1 01700 28
  • Before mapping, we group by state and PUMA and calculate the total population and the share of the population that is below 200% of the poverty line
# calculate share of population below 200% of poverty line
se_pov <- se_pums %>%
    group_by(ST, PUMA) %>%
    summarize(total_pop = sum(PWGTP), pct_in_pov = sum(PWGTP[POVPIP < 200])/total_pop)
ST PUMA total_pop pct_in_pov
01 00100 184908 0.3595410
01 00200 196543 0.2522807
01 00301 132862 0.3144240
01 00302 101612 0.3541708
01 00400 123128 0.5115408
  • We then join our microdata to the PUMA boundaries and shade our map with the poverty variable
# join data and create map shaded by population share below 200% poverty line
se_pumas %>%
    left_join(se_pov, by = c(STATEFP10 = "ST", PUMACE10 = "PUMA")) %>%
    ggplot(aes(fill = pct_in_pov)) + geom_sf() + scale_fill_viridis_b(name = NULL,
    option = "magma", labels = scales::label_percent(1)) + labs(title = "Percentage of Population Below 200% of the Poverty Line") +
    theme_void()

9 Census Microdata from IPUMS

Source: Minnesota Population Center

  • We may choose to work with data from 3rd party providers rather than the Census API depending on our needs and preferences
  • IPUMS is easily the most comprehensive
    • Maintained by the Institute for Social Research and Data Innovation at the University of Minnesota
    • Data from over 750 censuses and surveys representing over one billion individuals
      • Including the Current Population Survey, the American Community Survey, the National Health Interview Survey, and the Demographic and Health Surveys, among others
  • We will focus on ACS IPUMS data, but the same syntax can be applied to CPS and other sources

9.1 What is ipumsr?

Source: Minnesota Population Center

  • Created by Greg Freedman Ellis
  • Allows users to:
    • Read IPUMS data extracts into R
      • Manage the large size, hierarchical structure, and variable labeling conventions of IPUMS extracts
    • Use IPUMS metadata to transform data and conduct analysis
    • In the near future, query the IPUMS API in a manner similar to tidycensus

9.2 IPUMS Extracts

Source: Minnesota Population Center

  • IPUMS USA
    • US Census and American Community Survey microdata from 1850 to the present
      • Millions of unique individual records from decennial census and ACS, including historical records from 1850 to 1940
      • https://usa.ipums.org/usa/
  • IPUMS CPS
    • Current Population Survey microdata from 1962 to the present

9.3 Creating an Extract

Source: Minnesota Population Center

  • Select variables and samples from IPUMS site
  • Choose format (file type and data structure)
  • Download extract and DDI codebook (XML metadata)
    • Save both files in the same folder

9.4 Reading in Your Extract

  • Package ecosystem
# CRAN version
install.packages("ipumsr")
# development version
if (!require(remotes)) install.packages("remotes")
remotes::install_github("mnpopcenter/ipumsr", ref = "api-alpha-dev")
# tidyverse packages
install.packages("dplyr")
install.packages("ggplot2")
install.packages("stringr")
install.packages("purrr")
# HTML tables
install.packages("DT")
# spatial analysis and mapping
install.packages("sf")

library(ipumsr)
library(dplyr)
library(ggplot2)
library(stringr)
library(sf)
library(purrr)

9.5 Reading in Your Extract

  • Import the data from the extract using the metadata as a map
  • Many IPUMS variable names and values are not self-explanatory, so we use ipumsr functions to view labels, descriptions, and value labels
# read in xml metadata codebook
ddi <- read_ipums_ddi("usa_00013.xml")
# use metadata to create data frame from IPUMS extract
data <- read_ipums_micro(ddi)
  • The data file is just raw, uncoded data
    • The ddi codebook provides a map for assembling and analyzing it
names(ddi)
##  [1] "file_name"        "file_path"        "file_type"        "ipums_project"   
##  [5] "extract_date"     "extract_notes"    "rectypes"         "rectype_idvar"   
##  [9] "rectypes_keyvars" "var_info"         "conditions"       "citation"        
## [13] "file_encoding"
  • IPUMS variable codes are not always intuitive
# print variable names
names(data)
##  [1] "YEAR"     "DATANUM"  "SERIAL"   "HHWT"     "STATEFIP" "CONSPUMA"
##  [7] "GQ"       "PHONE"    "PERNUM"   "PERWT"    "EDUC"     "EDUCD"   
## [13] "EDUCD2"   "EDUCD3"   "PHONE2"   "PHONE3"
  • We can use ipumsr functions to view full variable names
# view full name of variable PHONE
ipums_var_label(ddi, PHONE)
## [1] "Telephone availability"
  • detailed variable descriptions
# view variable descriptions, strwrap() transforms them into paragraphs
ipums_var_desc(ddi, PHONE) %>%
    strwrap(60)
## [1] "PHONE indicates whether residents of the housing unit had"
## [2] "telephone access."
  • and the values and labels for different variables
# view values and labels for variable
ipums_val_labels(ddi, PHONE)
val lbl
0 N/A
1 No, no phone available
2 Yes, phone available
8 Suppressed (2012 and 2015 ACS)

9.6 Handling Labels

  • IPUMS value labels don’t translate perfectly to the R factor object type
    • All factor values must be labeled
    • Factor values must count up from 1
  • ipumsr has functions to help work with values and labels while following R logic
head(data)
  • We can see that our data has both labeled and unlabeled variables
    YEAR DATANUM SERIAL HHWT STATEFIP CONSPUMA GQ PHONE PERNUM PERWT EDUC EDUCD EDUCD2 EDUCD3 PHONE2 PHONE3
    1960 1 336455 100 27 NA 1 2 1 100 5 50 Grade 11 Less than High School Yes, phone available Yes, phone available
    1960 1 336455 100 27 NA 1 2 2 100 6 60 Grade 12 High school Yes, phone available Yes, phone available
    1960 1 336456 100 27 NA 1 2 1 100 6 60 Grade 12 High school Yes, phone available Yes, phone available
    1960 1 336456 100 27 NA 1 2 2 100 6 60 Grade 12 High school Yes, phone available Yes, phone available
    1960 1 336456 100 27 NA 1 2 3 100 2 22 Grade 5, 6, 7, or 8 Less than High School Yes, phone available Yes, phone available
    1960 1 336456 100 27 NA 1 2 4 100 0 1 N/A or no schooling N/A or no schooling Yes, phone available Yes, phone available
    1960 1 336457 99 27 NA 1 2 1 99 6 60 Grade 12 High school Yes, phone available Yes, phone available
    1960 1 336457 99 27 NA 1 2 2 99 4 40 Grade 10 Less than High School Yes, phone available Yes, phone available
  • The as_factor() function converts labelled values into factors using their value labels
# add factor levels to labels
ipumsr::as_factor(data)
YEAR DATANUM SERIAL HHWT STATEFIP CONSPUMA GQ PHONE PERNUM PERWT EDUC EDUCD EDUCD2 EDUCD3 PHONE2 PHONE3
1960 1 336455 100 Minnesota NA Households under 1970 definition Yes, phone available 1 100 Grade 11 Grade 11 Grade 11 Less than High School Yes, phone available Yes, phone available
1960 1 336455 100 Minnesota NA Households under 1970 definition Yes, phone available 2 100 Grade 12 Grade 12 Grade 12 High school Yes, phone available Yes, phone available
1960 1 336456 100 Minnesota NA Households under 1970 definition Yes, phone available 1 100 Grade 12 Grade 12 Grade 12 High school Yes, phone available Yes, phone available
1960 1 336456 100 Minnesota NA Households under 1970 definition Yes, phone available 2 100 Grade 12 Grade 12 Grade 12 High school Yes, phone available Yes, phone available
1960 1 336456 100 Minnesota NA Households under 1970 definition Yes, phone available 3 100 Grade 5, 6, 7, or 8 Grade 5 Grade 5, 6, 7, or 8 Less than High School Yes, phone available Yes, phone available
1960 1 336456 100 Minnesota NA Households under 1970 definition Yes, phone available 4 100 N/A or no schooling N/A N/A or no schooling N/A or no schooling Yes, phone available Yes, phone available
1960 1 336457 99 Minnesota NA Households under 1970 definition Yes, phone available 1 99 Grade 12 Grade 12 Grade 12 High school Yes, phone available Yes, phone available
1960 1 336457 99 Minnesota NA Households under 1970 definition Yes, phone available 2 99 Grade 10 Grade 10 Grade 10 Less than High School Yes, phone available Yes, phone available
  • We can also use zap_labels() to remove labels and show raw values
# remove labels to show values
zap_labels(data)
YEAR DATANUM SERIAL HHWT STATEFIP CONSPUMA GQ PHONE PERNUM PERWT EDUC EDUCD EDUCD2 EDUCD3 PHONE2 PHONE3
1960 1 336455 100 27 NA 1 2 1 100 5 50 Grade 11 Less than High School Yes, phone available Yes, phone available
1960 1 336455 100 27 NA 1 2 2 100 6 60 Grade 12 High school Yes, phone available Yes, phone available
1960 1 336456 100 27 NA 1 2 1 100 6 60 Grade 12 High school Yes, phone available Yes, phone available
1960 1 336456 100 27 NA 1 2 2 100 6 60 Grade 12 High school Yes, phone available Yes, phone available
1960 1 336456 100 27 NA 1 2 3 100 2 22 Grade 5, 6, 7, or 8 Less than High School Yes, phone available Yes, phone available
1960 1 336456 100 27 NA 1 2 4 100 0 1 N/A or no schooling N/A or no schooling Yes, phone available Yes, phone available
1960 1 336457 99 27 NA 1 2 1 99 6 60 Grade 12 High school Yes, phone available Yes, phone available
1960 1 336457 99 27 NA 1 2 2 99 4 40 Grade 10 Less than High School Yes, phone available Yes, phone available
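  • Both functions also work on a single column if you only want to convert one variable; a minimal sketch using columns from the extract above:
# convert just one labelled column to a factor
ipumsr::as_factor(data$STATEFIP)
# or strip labels from one column, keeping the underlying integer codes
zap_labels(data$PHONE)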

9.7 Handling Labels: Consolidating

  • lbl_collapse() allows you to take advantage of the hierarchical structure of value labels
ipums_val_labels(data$EDUCD)
  • We have far too many categories for useful analysis
    val lbl
    0 N/A or no schooling
    1 N/A
    2 No schooling completed
    10 Nursery school to grade 4
    11 Nursery school, preschool
    12 Kindergarten
    13 Grade 1, 2, 3, or 4
    14 Grade 1
    15 Grade 2
    16 Grade 3
    17 Grade 4
    20 Grade 5, 6, 7, or 8
    21 Grade 5 or 6
    22 Grade 5
    23 Grade 6
    24 Grade 7 or 8
    25 Grade 7
    26 Grade 8
    30 Grade 9
    40 Grade 10
    50 Grade 11
    60 Grade 12
    61 12th grade, no diploma
    62 High school graduate or GED
    63 Regular high school diploma
    64 GED or alternative credential
    65 Some college, but less than 1 year
    70 1 year of college
    71 1 or more years of college credit, no degree
    80 2 years of college
    81 Associate’s degree, type not specified
    82 Associate’s degree, occupational program
    83 Associate’s degree, academic program
    90 3 years of college
    100 4 years of college
    101 Bachelor’s degree
    110 5+ years of college
    111 6 years of college (6+ in 1960-1970)
    112 7 years of college
    113 8+ years of college
    114 Master’s degree
    115 Professional degree beyond a bachelor’s degree
    116 Doctoral degree
    999 Missing
  • The variable values are hierarchical: each value divisible by ten is a general category that contains the more detailed values beneath it
    • We simplify the variable by collapsing the detailed values into those general categories (a quick illustration of the integer division follows the output below)
# still too detailed
data$EDUCD %>%
    lbl_collapse(~.val %/% 10) %>%
    ipums_val_labels()
val lbl
0 N/A or no schooling
1 Nursery school to grade 4
2 Grade 5, 6, 7, or 8
3 Grade 9
4 Grade 10
5 Grade 11
6 Grade 12
7 1 year of college
8 2 years of college
9 3 years of college
10 4 years of college
11 5+ years of college
99 Missing
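  • To see why dividing by ten works, note that integer division maps each detailed code onto its general category; a quick illustration with codes taken from the table above:
# integer division by 10 maps detailed EDUCD codes onto their general group
c(63, 71, 101, 114) %/% 10
## [1]  6  7 10 11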

9.8 Handling Labels: Consolidating

  • We still have too many categories, so we may want to combine some into broader groups that are easier to interpret
  • By using some string detection and relabeling, we reduce the number of categories to six
# expression for filtering variable labels with a particular string
college_regex <- "^[123] year(s)? of college$"
data$EDUCD3 <- data$EDUCD %>%
    # divide values by 10 to make single digit
    lbl_collapse(~.val %/% 10) %>%
    # combine all less than high school values into one,
    # classify any amount of college without completion as 'Some college',
    # and create one category for all college degree types
    lbl_relabel(
        lbl(2, "Less than High School") ~ .val > 0 & .val < 6,
        lbl(3, "High school") ~ .lbl == "Grade 12",
        lbl(4, "Some college") ~ str_detect(.lbl, college_regex),
        lbl(5, "College or more") ~ .val %in% c(10, 11)
    ) %>%
    as_factor()
# new levels
levels(data$EDUCD3)
## [1] "N/A or no schooling"   "Less than High School" "High school"          
## [4] "Some college"          "College or more"       "Missing"

9.9 Removing Missing Data

  • lbl_na_if() allows you to set certain values or labels to missing
  • In this variable, we have codes for N/A and suppressed results that we need to convert to NA before analysis
ipums_val_labels(data$PHONE)
val lbl
0 N/A
1 No, no phone available
2 Yes, phone available
8 Suppressed (2012 and 2015 ACS)
  • We can do this using the numeric values
# convert these values to NAs
data$PHONE2 <- lbl_na_if(data$PHONE, ~.val %in% c(0, 8)) %>%
    as_factor()
levels(data$PHONE2)
## [1] "No, no phone available" "Yes, phone available"
  • or the string labels
# works with labels or values
drop_labels <- c("N/A", "Suppressed (2012 and 2015 ACS)")
data$PHONE3 <- lbl_na_if(data$PHONE, ~.lbl %in% drop_labels) %>%
    as_factor()
# updated levels
levels(data$PHONE3)
## [1] "No, no phone available" "Yes, phone available"

10 Analysis

  • Now the extract is factored, relabeled and ready for graphing

10.1 Selecting the Right Level, Weight, and Aggregation

  • Extracts are samples, so records come with both household-level and person-level weights (HHWT and PERWT)
  • Here, we use the person weight (PERWT) to estimate the share of people with phone lines by year (a small worked illustration follows the table below)
# prepare data for graphing by grouping by year, and using the person weight to
# calculate a weighted mean. This yields the estimated share of the Minnesota
# population who have a phone
graph_data <- data %>%
    group_by(YEAR) %>%
    summarize(`% with phone` = weighted.mean(PHONE2 == "Yes, phone available", PERWT,
        na.rm = TRUE), .groups = "drop")
YEAR % with phone
1960 0.8863756
1970 0.9579866
1980 0.9745530
1990 0.9780473
2000 0.9904893
2001 0.9903932
2002 0.9843541
2003 0.9871095
2004 0.9771358
2005 0.9742268
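  • weighted.mean() here is just the sum of the weighted indicator divided by the sum of the weights; a minimal illustration with made-up values:
# weighted.mean(x, w) is equivalent to sum(x * w) / sum(w)
x <- c(1, 1, 0)      # 1 = has a phone, 0 = does not
w <- c(100, 99, 50)  # person weights
weighted.mean(x, w)  # ~0.7992
sum(x * w) / sum(w)  # same result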
  • We also have educational attainment data in the extract, so we prepare another grouped dataset by both year and education
# group data by both year and education level
graph_data2 <- data %>%
    group_by(YEAR, EDUCD3) %>%
    summarize(`% with phone` = weighted.mean(PHONE2 == "Yes, phone available", PERWT,
        na.rm = TRUE), .groups = "drop")
YEAR EDUCD3 % with phone
1960 N/A or no schooling 0.8826448
1960 Less than High School 0.8590952
1960 High school 0.9217318
1960 Some college 0.9581475
1960 College or more 0.9831610
1970 N/A or no schooling 0.9566989
1970 Less than High School 0.9458146
1970 High school 0.9698355
1970 Some college 0.9758722
1970 College or more 0.9888245

10.2 Plotting the Results

  • Like with tidycensus data, we can plot just one variable over time
# Plotting one variable year over year
ggplot(graph_data, aes(x = YEAR, y = `% with phone`)) +
    geom_point() +
    geom_line() +
    # Use variable description from metadata as caption
    labs(title = "Percent of Minnesota with phone line",
         subtitle = paste0("Data source: ", ddi$ipums_project),
         caption = paste(strwrap(ipums_var_desc(ddi, PHONE), 90), collapse = "\n"))

  • or use facet_wrap to create plots for different groups
# Plotting one variable year over year with individual plots for education level
ggplot(graph_data2, aes(x = YEAR, y = `% with phone`)) +
    geom_point() +
    geom_line() +
    labs(title = "Percent of Minnesota with Phone Line by Education",
         subtitle = paste0("Source: ", ddi$ipums_project)) +
    facet_wrap(~EDUCD3)
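  • An alternative to faceting (not in the original slides) is to map education level to color and overlay the groups on one plot:
# overlay education groups on a single plot instead of faceting
ggplot(graph_data2, aes(x = YEAR, y = `% with phone`, color = EDUCD3)) +
    geom_line() +
    labs(title = "Percent of Minnesota with Phone Line by Education",
         color = "Education")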

11 ipumsr and Mapping

  • IPUMS provides geographic boundary files for IPUMS USA and other sources
  • ipumsr provides support for both sf and sp data (we will use sf)
  • Load with the read_ipums_sf() function
# read in CONSPUMA shapefile from IPUMS
shape_data <- read_ipums_sf("shape/")
# view data structure
as_tibble(shape_data)
CONSPUMA STATEFIP State geometry
540 02 Alaska MULTIPOLYGON (((-3043869 34…
541 02 Alaska MULTIPOLYGON (((-2291901 23…
542 15 Hawaii MULTIPOLYGON (((-6021813 59…
491 51 Virginia MULTIPOLYGON (((1720696 104…
492 51 Virginia MULTIPOLYGON (((1745400 116…
493 51 Virginia MULTIPOLYGON (((1728178 119…
494 51 Virginia MULTIPOLYGON (((1717473 129…
495 53 Washington MULTIPOLYGON (((-1584859 14…
496 53 Washington MULTIPOLYGON (((-2031282 13…
497 53 Washington MULTIPOLYGON (((-1970094 14…
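  • Before mapping, it can be worth checking the projection of the boundary file; an optional quick check with sf:
# inspect the coordinate reference system of the IPUMS boundary file
sf::st_crs(shape_data)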

11.1 Joining Extract and Spatial Data

  • ipumsr has functions for merging extracts with spatial data
# group extract data by CONSPUMA and year, create weighted variable, and
# ungroup
conspuma_data <- data %>%
    group_by(CONSPUMA, YEAR) %>%
    summarize(`% with phone` = weighted.mean(PHONE2 == "Yes, phone available", PERWT,
        na.rm = TRUE), .groups = "drop")
CONSPUMA YEAR % with phone
245 1980 0.9641267
245 1990 0.9669292
245 2000 0.9859847
245 2005 0.9689021
245 2006 0.9566019
245 2007 0.9544157
245 2008 0.9850778
245 2009 0.9852029
245 2010 0.9831482
245 2011 0.9811322
  • Similar to the GEOID in ACS data, we can use the CONSPUMA code as our joining variable
# join extract data with spatial data using CONSPUMA id
conspuma_data_join <- ipums_shape_inner_join(conspuma_data, shape_data, by = "CONSPUMA")
CONSPUMA YEAR % with phone STATEFIP State geometry
245 1980 0.9641267 27 Minnesota MULTIPOLYGON (((62837.34 13…
245 1990 0.9669292 27 Minnesota MULTIPOLYGON (((62837.34 13…
245 2000 0.9859847 27 Minnesota MULTIPOLYGON (((62837.34 13…
245 2005 0.9689021 27 Minnesota MULTIPOLYGON (((62837.34 13…
245 2006 0.9566019 27 Minnesota MULTIPOLYGON (((62837.34 13…
245 2007 0.9544157 27 Minnesota MULTIPOLYGON (((62837.34 13…
245 2008 0.9850778 27 Minnesota MULTIPOLYGON (((62837.34 13…
245 2009 0.9852029 27 Minnesota MULTIPOLYGON (((62837.34 13…
245 2010 0.9831482 27 Minnesota MULTIPOLYGON (((62837.34 13…
245 2011 0.9811322 27 Minnesota MULTIPOLYGON (((62837.34 13…
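  • The same join can also be done with a standard dplyr join on the sf object; a sketch assuming the CONSPUMA key is stored with compatible types in both tables (one of the details ipums_shape_inner_join() smooths over for IPUMS data):
# alternative: join the extract summary onto the sf boundary data with dplyr
library(dplyr)
library(sf)
conspuma_data_join2 <- shape_data %>%
    inner_join(zap_labels(conspuma_data), by = "CONSPUMA")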

11.2 Mapping with ipumsr, ggplot2, and sf

  • Once the data is joined, we can filter it by year to measure change across decades
# filter data to show every decade
graph_conspuma <- conspuma_data_join %>%
    filter(YEAR %in% c(1980, 1990, 2000, 2010))
CONSPUMA YEAR % with phone STATEFIP State geometry
245 1980 0.9641267 27 Minnesota MULTIPOLYGON (((62837.34 13…
245 1990 0.9669292 27 Minnesota MULTIPOLYGON (((62837.34 13…
245 2000 0.9859847 27 Minnesota MULTIPOLYGON (((62837.34 13…
245 2010 0.9831482 27 Minnesota MULTIPOLYGON (((62837.34 13…
246 1980 0.9881295 27 Minnesota MULTIPOLYGON (((254440.1 86…
246 1990 0.9961230 27 Minnesota MULTIPOLYGON (((254440.1 86…
246 2000 0.9959761 27 Minnesota MULTIPOLYGON (((254440.1 86…
246 2010 0.9895193 27 Minnesota MULTIPOLYGON (((254440.1 86…
247 1980 0.9673522 27 Minnesota MULTIPOLYGON (((227254.6 12…
247 1990 0.9760921 27 Minnesota MULTIPOLYGON (((227254.6 12…
  • and visualize the share of Minnesota’s population with phone access
# color CONSPUMA polygons by the % with phone value,
# make one plot for each year,
# and remove grid lines and coordinates
ggplot(graph_conspuma, aes(fill = `% with phone`)) +
    geom_sf() +
    facet_wrap(~YEAR) +
    theme_void()
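  • As an optional refinement (not in the original slides), a continuous viridis fill scale often reads more clearly when most shares are close to 1:
# same map with a perceptually uniform color scale
ggplot(graph_conspuma, aes(fill = `% with phone`)) +
    geom_sf() +
    facet_wrap(~YEAR) +
    scale_fill_viridis_c() +
    theme_void()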

12 IPUMS API

  • Currently in internal testing
  • Beta testing before the end of 2021
  • IPUMS USA public launch early 2022
  • Will allow users to
    • Define and submit extract requests
    • Check extract status or “wait” for an extract to finish
    • Download completed extracts
    • Get info on past extracts
    • Share extract definitions

12.1 Other IPUMS Data Sources

  • NHGIS
  • Longitudinal and Employer-Household Dynamics (LEHD) Origin-Destination Employment Statistics (LODES) data
  • Small Area Health Insurance Estimates
  • Economic Census
  • Bureau of Labor Statistics
  • CPS
  • LAUS
  • USDA
  • Department of Housing and Urban Development Comprehensive Affordable Housing Strategy (CHAS)

13 References

14 Thank You!

  • Email: