1 U.S. Census Data

  • Census Bureau conducts many surveys on a variety of subjects and shares data with the public
    • Economic and business surveys, housing surveys, international data, population estimates and projections, and more
  • Decennial Census – full count for apportionment every 10 years
  • American Community Survey – source of estimated demographic data about the US population, based on a sample of about 3% of the US population
    • 1-year ACS (areas with population of 65,000 or greater)
    • 5-year ACS (moving 5-year average for all areas down to block group)

1.1 Census Hierarchy

  • Data from the decennial Census, ACS, and other Census surveys are aggregated and shared at different geographies
    • Legal entities, such as states and counties
    • Statistical entities, not official jurisdictions but used to standardize data
    • Smallest unit for decennial is block, block group for ACS
    • Availability and accuracy of estimates varies depending on geography
      Source: U.S. Census Bureau

2 Census Data and R

  • There are many R packages designed to help analysts navigate this hierarchy
  • Which data do I need? From where? From when?
  • Downloading bulk files from the Census FTP site or using data.census.gov is not practical if you regularly work with Census data
    • That leaves Census API or 3rd party providers
    • Choice depends on specific data needs (variables, geography, survey, scope, level of detail)

2.1 Census API? Use tidycensus and tigris

  • Created by Kyle Walker (https://walker-data.com/)
  • tidycensus
    • Allows R users to access Census data in a format that can be plugged into tidyverse workflows
    • Makes data wrangling, analysis, and visualization of Census data simpler and easier
  • tigris
    • Pulls official geographic boundary shapefiles for mapping and spatial analysis

2.2 Getting started with tidycensus

  • To use tidycensus, you will need to sign up for a free Census API Key from https://api.census.gov/data/key_signup.html and activate it
    • After the key is installed in your .Renviron file, it no longer needs to be included in your code
install.packages("tidycensus", repos = "http://cran.us.r-project.org")
install.packages("tigris", repos = "http://cran.us.r-project.org")
library(tidycensus)
library(tigris)
# cache downloaded boundary files locally so they are not re-downloaded each session
options(tigris_use_cache = TRUE)
census_api_key("YOUR KEY GOES HERE", install = TRUE)
# only need to include install = TRUE once
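    • After restarting R, you can confirm the key is available by checking the environment variable that census_api_key() writes to .Renviron
# check that the stored key is visible to R
Sys.getenv("CENSUS_API_KEY")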

2.3 Pulling Data: Decennial Census

  • get_decennial() can query decennial Census data from the 2000, 2010, and 2020 decennial U.S. Censuses
  • Users must include:
    • a geographic area
    • a vector of Census variable IDs
  • The code below gets data on total population by state from the 2010 decennial Census
total_population_10 <- get_decennial(geography = "state", variables = "P001001",
    year = 2010)
GEOID NAME variable value
01 Alabama P001001 4779736
02 Alaska P001001 710231
04 Arizona P001001 6392017
05 Arkansas P001001 2915918
06 California P001001 37253956
22 Louisiana P001001 4533372
21 Kentucky P001001 4339367
08 Colorado P001001 5029196
09 Connecticut P001001 3574097
10 Delaware P001001 897934

2.4 Pulling Data: American Community Survey

  • get_acs() pulls data from annual ACS estimates
  • Parameters:
    • geographic area
    • variable(s)
    • table id (optional)
    • year (defaults to the most recent five-year ACS sample)
    • survey (defaults to the 5-year ACS; can be changed to the 1-year ACS with survey = "acs1")
  • Similar to data from get_decennial(), but includes columns for estimate and margin of error instead of value
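  • As a minimal sketch of these parameters, the call below requests median household income (variable B19013_001, used again later in these notes) for all states from the default 5-year ACS
# median household income by state from the 2015-2019 5-year ACS (acs5 is the default survey)
state_income <- get_acs(geography = "state", variables = "B19013_001", year = 2019)
head(state_income)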

2.5 Selecting Variables

  • load_variables() returns a dataset of variables from the Census Bureau website and formats it for fast searching
  • two required arguments
    • year of the Census dataset or ACS sample
    • dataset name (sf1, sf3, pl, acs1, or acs5)
      • For ACS datasets, append "/profile" for the Data Profile and "/summary" for the Summary Tables
  • user can include cache = TRUE in the function call to store the data in the user’s cache directory for faster processing
v19 <- load_variables(2019, "acs5", cache = TRUE)
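  • For example, we can search the returned label column for a phrase of interest; the quick sketch below assumes dplyr and stringr are loaded (e.g. via library(tidyverse)) and that the label text contains the phrase as written
library(tidyverse)
# find variables whose label mentions median household income
v19 %>%
    filter(str_detect(label, "Median household income")) %>%
    select(name, label, concept)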

2.6 Querying Variables

  • Variables from the ACS detailed tables, data profiles, summary tables, and supplemental estimates are available
  • Supplying a table name to the table parameter returns all variables for that table
  • The code below returns all variables associated with table B01001 (sex by age), from the 2015-2019 5-year ACS:
dp_table <- get_acs(geography = "state", table = "B01001", year = 2019)
GEOID NAME variable estimate moe
01 Alabama B01001_001 4876250 NA
01 Alabama B01001_002 2359355 1270
01 Alabama B01001_003 149090 704
01 Alabama B01001_004 153494 2290
01 Alabama B01001_005 158617 2274
01 Alabama B01001_006 98257 468
01 Alabama B01001_007 64980 834
01 Alabama B01001_008 35870 1436
01 Alabama B01001_009 35040 1472
01 Alabama B01001_010 95065 1916

2.7 Narrowing Our Query

  • To get data from within specific states and/or counties, users can pass state and county names to their respective parameters
ga_income <- get_acs(geography = "county", variables = "B19013_001", state = "GA",
    year = 2019)
GEOID NAME variable estimate moe
13001 Appling County, Georgia B19013_001 40304 5180
13003 Atkinson County, Georgia B19013_001 37197 3686
13005 Bacon County, Georgia B19013_001 37519 5492
13007 Baker County, Georgia B19013_001 32917 6967
13009 Baldwin County, Georgia B19013_001 43672 3736
13011 Banks County, Georgia B19013_001 47811 3240
13013 Barrow County, Georgia B19013_001 62345 2204
13015 Bartow County, Georgia B19013_001 57423 2743
13017 Ben Hill County, Georgia B19013_001 32229 3845
13019 Berrien County, Georgia B19013_001 40415 3024
fulton_income <- get_acs(geography = "tract", variables = "B19013_001", state = "GA",
    county = "Fulton")
GEOID NAME variable estimate moe
13121000100 Census Tract 1, Fulton County, Georgia B19013_001 168396 18644
13121000200 Census Tract 2, Fulton County, Georgia B19013_001 158011 37856
13121000400 Census Tract 4, Fulton County, Georgia B19013_001 97257 30528
13121000500 Census Tract 5, Fulton County, Georgia B19013_001 90140 13217
13121000600 Census Tract 6, Fulton County, Georgia B19013_001 53686 9322
13121000700 Census Tract 7, Fulton County, Georgia B19013_001 84107 33365
13121001001 Census Tract 10.01, Fulton County, Georgia B19013_001 103846 10554
13121001002 Census Tract 10.02, Fulton County, Georgia B19013_001 31042 9051
13121001100 Census Tract 11, Fulton County, Georgia B19013_001 109426 7959
13121001201 Census Tract 12.01, Fulton County, Georgia B19013_001 80476 16521

2.8 Renaming Variables

  • We can rename variables from within our function call if we know the default names
ga_rename <- get_acs(geography = "county", state = "Georgia", variables = c(medinc = "B19013_001",
    medage = "B01002_001"), output = "wide")
GEOID NAME medincE medincM medageE medageM
13005 Bacon County, Georgia 37519 5492 36.7 0.7
13025 Brantley County, Georgia 38857 3480 41.1 0.8
13017 Ben Hill County, Georgia 32229 3845 39.9 1.1
13033 Burke County, Georgia 44151 2438 37.4 0.6
13047 Catoosa County, Georgia 56235 2290 40.4 0.4
13053 Chattahoochee County, Georgia 47096 5158 24.5 0.5
13055 Chattooga County, Georgia 36807 2268 39.4 0.7
13073 Columbia County, Georgia 82339 3532 36.9 0.4
13087 Decatur County, Georgia 41481 3584 37.8 0.6
13115 Floyd County, Georgia 48336 2266 38.3 0.3

2.9 Choosing a Data Structure

  • By default, tidycensus returns a tibble in “tidy” (or long) format:
    • Each observation forms a row
    • Each variable forms a column with one data type
    • Each observational unit forms a table
hhinc_tidy <- get_acs(geography = "state", table = "B19001", survey = "acs1", year = 2019)
GEOID NAME variable estimate moe
01 Alabama B19001_001 1897576 10370
01 Alabama B19001_002 154558 5883
01 Alabama B19001_003 103653 6001
01 Alabama B19001_004 108500 5926
01 Alabama B19001_005 98706 6491
01 Alabama B19001_006 90916 5859
01 Alabama B19001_007 105146 4149
01 Alabama B19001_008 85014 5417
01 Alabama B19001_009 87118 5163
01 Alabama B19001_010 82323 4231
  • We can specify "wide" format if we prefer, with columns for each variable
hhinc_wide <- get_acs(geography = "state", table = "B19001", survey = "acs1", year = 2019,
    output = "wide")
GEOID NAME B19001_001E B19001_001M B19001_002E B19001_002M B19001_003E B19001_003M B19001_004E B19001_004M B19001_005E B19001_005M B19001_006E B19001_006M B19001_007E B19001_007M B19001_008E B19001_008M B19001_009E B19001_009M B19001_010E B19001_010M B19001_011E B19001_011M B19001_012E B19001_012M B19001_013E B19001_013M B19001_014E B19001_014M B19001_015E B19001_015M B19001_016E B19001_016M B19001_017E B19001_017M
28 Mississippi 1100229 9887 102072 5891 72195 5563 74167 4559 63780 4127 64942 4703 58033 4432 55748 4483 51897 4901 43423 3597 86214 5966 105304 5907 122825 6357 72151 4143 47726 3282 44171 3353 35581 2826
29 Missouri 2458337 10406 155827 6687 111813 5577 115140 5621 122095 5489 111981 5528 121608 5756 113730 5553 116231 5683 104380 4723 198458 7477 260646 7646 318344 8489 216220 6681 135532 5371 129949 6087 126383 5183
30 Montana 437651 4637 23489 2194 21550 2415 21335 1726 23022 2418 18577 1955 21148 2408 20959 2769 21917 2316 20023 2670 36769 3056 44994 3051 58858 3855 39064 2723 23818 2653 23210 2351 18918 2211
31 Nebraska 771444 4475 37219 2805 30395 2679 26514 2433 35113 2753 30390 2643 36654 2792 32908 2509 37925 3333 31514 2681 64039 3492 86318 3863 111438 4655 75244 3690 47119 3029 47279 2722 41375 2731
32 Nevada 1143557 8281 68825 4164 45099 3502 39946 3403 50808 4299 44720 3607 48167 3496 46762 3920 51108 3854 51006 4531 90046 5207 123473 5807 159361 6715 116598 5686 64746 3977 69365 4085 73527 4035
33 New Hampshire 541396 5389 21229 2566 16539 2663 16216 2271 19054 2165 18377 2294 18368 2010 15983 1838 22257 2391 21203 2169 39230 3351 52252 3758 76433 4378 61021 4132 40422 3563 49858 3039 52954 3332
34 New Jersey 3286264 9466 154374 7219 95123 5069 91608 5576 113018 5398 96889 6271 112428 6077 98003 5978 104770 6257 93426 5753 202459 7998 290552 8825 401067 11429 343771 10134 250060 8064 338989 9700 499727 8735
35 New Mexico 793420 7074 65613 4413 42519 3479 41543 3752 48909 4239 41728 4027 39967 3532 32647 3073 37765 2701 28960 3159 63845 4301 79144 4473 96945 5032 60173 4229 38540 3416 39437 3428 35685 2595
36 New York 7446812 13820 497771 10405 329325 10854 288805 8790 286239 9953 261940 10138 277219 9361 246739 9135 268533 10075 242835 9161 472300 12232 658024 13405 883154 14989 709376 12558 488838 11556 634178 12227 901536 12818
37 North Carolina 4046348 15467 251158 9332 187653 8080 187757 7905 200849 7810 196035 8288 194732 6745 187488 8731 195142 7604 176142 7460 317444 9175 412432 12009 513941 13478 350665 10038 207679 8060 224461 7108 242770 6452

3 Manipulating Data

  • We can refine our queries even more by using pipes and tidyverse functions
    • Filter based on certain conditions
    • Rearrange and sort the data
    • Create new variables
    • Generate summary statistics
library(tidyverse)
library(dplyr)

median_age <- get_acs(geography = "county", variables = "B01002_001", year = 2019) %>%
    # arrange data by largest estimate value to smallest
arrange(desc(estimate)) %>%
    # only records with median age 50 or older
filter(estimate >= 50) %>%
    # create individual county and state columns
separate(NAME, into = c("county", "state"), sep = ", ")
GEOID county state variable estimate moe
12119 Sumter County Florida B01002_001 67.4 0.2
51091 Highland County Virginia B01002_001 60.9 3.5
08027 Custer County Colorado B01002_001 59.7 2.6
12015 Charlotte County Florida B01002_001 59.1 0.2
41069 Wheeler County Oregon B01002_001 59.0 3.3
51133 Northumberland County Virginia B01002_001 58.9 0.7
26131 Ontonagon County Michigan B01002_001 58.6 0.4
35021 Harding County New Mexico B01002_001 58.5 5.5
53031 Jefferson County Washington B01002_001 58.3 0.7
26001 Alcona County Michigan B01002_001 58.2 0.3

3.1 Summary Variables

  • tidycensus can calculate summary variables for us
    • these can be useful in calculating percentages and totals
# useful to assign lengthy and/or frequently used sets of variables to a vector
race_vars <- c(White = "B03002_003", Black = "B03002_004", Native = "B03002_005",
    Asian = "B03002_006", HIPI = "B03002_007", Hispanic = "B03002_012")
# this returns race/ethnicity counts for all GA counties, with total population as a summary variable
ga_race <- get_acs(geography = "county", state = "GA", variables = race_vars, summary_var = "B03002_001")
GEOID NAME variable estimate moe summary_est summary_moe
13001 Appling County, Georgia White 12745 21 18440 NA
13001 Appling County, Georgia Black 3555 209 18440 NA
13001 Appling County, Georgia Native 94 140 18440 NA
13001 Appling County, Georgia Asian 74 111 18440 NA
13001 Appling County, Georgia HIPI 0 21 18440 NA
13001 Appling County, Georgia Hispanic 1830 NA 18440 NA
13003 Atkinson County, Georgia White 4657 18 8239 NA
13003 Atkinson County, Georgia Black 1434 112 8239 NA
13003 Atkinson County, Georgia Native 81 111 8239 NA
13003 Atkinson County, Georgia Asian 0 19 8239 NA

3.2 Creating New Variables

  • We can use mutate() from dplyr to create new variables, like creating percentages from population estimates by race
ga_race_percent <- ga_race %>%
    # divide estimates for each observation by the summary estimate
mutate(percent = 100 * (estimate/summary_est)) %>%
    select(NAME, variable, percent)
NAME variable percent
Appling County, Georgia White 69.1160521
Appling County, Georgia Black 19.2787419
Appling County, Georgia Native 0.5097614
Appling County, Georgia Asian 0.4013015
Appling County, Georgia HIPI 0.0000000
Appling County, Georgia Hispanic 9.9240781
Atkinson County, Georgia White 56.5238500
Atkinson County, Georgia Black 17.4050249
Atkinson County, Georgia Native 0.9831290
Atkinson County, Georgia Asian 0.0000000

4 Analysis

4.1 Creating and Comparing Groups

  • Once data is loaded and processed, we can begin our analysis
    • This analysis often involves creating and/or comparing groups
      • demographic groups described by variables, geographic areas, time periods, etc.
    • The structure of tidycensus data makes it simple to identify and create groups
    • In this example, we will obtain county income data from Alabama and then create new income brackets
# median household income for Alabama counties
al_hh_income <- get_acs(geography = "county", table = "B19001", state = "AL", year = 2019)
GEOID NAME variable estimate moe
01001 Autauga County, Alabama B19001_001 21397 325
01001 Autauga County, Alabama B19001_002 1417 254
01001 Autauga County, Alabama B19001_003 1172 249
01001 Autauga County, Alabama B19001_004 1337 248
01001 Autauga County, Alabama B19001_005 882 214
01001 Autauga County, Alabama B19001_006 985 245
01001 Autauga County, Alabama B19001_007 699 210
01001 Autauga County, Alabama B19001_008 1050 299
01001 Autauga County, Alabama B19001_009 995 234
01001 Autauga County, Alabama B19001_010 676 180
  • Tables will often have variables with a variety of units
    • Here, we filter out the total households variable before recategorizing our income variables
al_hh_income_recode <- al_hh_income %>%
    # remove total households variable
filter(variable != "B19001_001") %>%
    # combine categories into households making less than $35,000, between
    # $35,000 and $75,000, and over $75,000
mutate(incgroup = case_when(variable < "B19001_008" ~ "below35k", variable < "B19001_013" ~
    "bw35kand75k", TRUE ~ "above75k"))
GEOID NAME variable estimate moe incgroup
01001 Autauga County, Alabama B19001_002 1417 254 below35k
01001 Autauga County, Alabama B19001_003 1172 249 below35k
01001 Autauga County, Alabama B19001_004 1337 248 below35k
01001 Autauga County, Alabama B19001_005 882 214 below35k
01001 Autauga County, Alabama B19001_006 985 245 below35k
01001 Autauga County, Alabama B19001_007 699 210 below35k
01001 Autauga County, Alabama B19001_008 1050 299 bw35kand75k
01001 Autauga County, Alabama B19001_009 995 234 bw35kand75k
01001 Autauga County, Alabama B19001_010 676 180 bw35kand75k
01001 Autauga County, Alabama B19001_011 1620 351 bw35kand75k
  • We can pass our groups to summarize() to calculate the estimated number of households in each income bracket
al_group_sums <- al_hh_income_recode %>%
    # group by new income brackets
group_by(GEOID, incgroup) %>%
    # get new sums for each group
summarize(estimate = sum(estimate))
GEOID incgroup estimate
01001 above75k 8367
01001 below35k 6492
01001 bw35kand75k 6538
01003 above75k 31509
01003 below35k 23248
01003 bw35kand75k 26173
01005 above75k 1855
01005 below35k 4920
01005 bw35kand75k 2570
01007 above75k 2139

4.2 Time-Series

  • We will often need to analyze data across a range of years
    • Rather than getting samples for each year separately and combining them, we can tell get_acs() what years we want and return a single dataset with all of them
college_vars <- c("B15002_015", "B15002_016", "B15002_017", "B15002_018", "B15002_032",
    "B15002_033", "B15002_034", "B15002_035")
# create named vector of years from 2010-2019
years <- 2010:2019
names(years) <- years
# map_dfr() maps the get_acs() function to each year, resulting in data with
# estimates for each county for each specified year
college_by_year <- map_dfr(years, ~{
    get_acs(geography = "county", variables = college_vars, state = "FL", summary_var = "B15002_001",
        survey = "acs1", year = .x)
    # organize data by year
}, .id = "year") %>%
    # arrange data by place, variable, then year
arrange(NAME, variable, year)
year GEOID NAME variable estimate moe summary_est summary_moe
2010 12001 Alachua County, Florida B15002_015 13368 1780 145526 564
2011 12001 Alachua County, Florida B15002_015 13472 1840 144360 1607
2012 12001 Alachua County, Florida B15002_015 14944 1985 148044 966
2013 12001 Alachua County, Florida B15002_015 14107 1929 149906 226
2014 12001 Alachua County, Florida B15002_015 14951 1926 151519 571
2015 12001 Alachua County, Florida B15002_015 16416 1969 156666 159
2016 12001 Alachua County, Florida B15002_015 14255 2034 159567 342
2017 12001 Alachua County, Florida B15002_015 15048 2047 161896 351
2018 12001 Alachua County, Florida B15002_015 18820 1971 167364 2346
2019 12001 Alachua County, Florida B15002_015 19122 2058 165792 708

5 Visualization

  • Another advantage of tidycensus is that it yields data that can quickly be supplied to ggplot for visualization
    • This allows for the efficient creation of both exploratory and informational plots
    • Here, we use a bar chart to visualize public transit commuting in the 20 largest U.S. metro areas
metro_transit <-  get_acs(
# core-based statistical area
  geography = "cbsa",
# percentage of people who take public transit
  variables = "DP03_0021P",
# total population for selecting only largest metros
  summary_var = "B01003_001",
  survey = "acs1",
  year = 2019
) %>%
# slice the data to include only the largest metros by population
  slice_max(summary_est, n = 20)
# simple bar chart
metrobar <- ggplot(metro_transit, aes(x = NAME, y = estimate)) +
  geom_col()

5.1 Refining Plots

  • Less time spent on acquiring data leaves more time for refining our visualizations for sharing and publication
# Use just the primary city in the metro for the label
metrobarfancy <- metro_transit %>%
    mutate(NAME = str_remove(NAME, "-.*$")) %>%
    mutate(NAME = str_remove(NAME, ",.*$")) %>%
    ggplot(aes(y = reorder(NAME, estimate), x = estimate)) + geom_col(color = "navy",
    fill = "navy", alpha = 0.5, width = 0.85) + theme_minimal(base_size = 12) + scale_x_continuous(breaks = c(0,
    10, 20, 30), labels = c("0%", "10%", "20%", "30%")) + labs(title = "Metro Public Transit Commuting",
    subtitle = "2019 ACS 1-year Estimates", y = "", x = "Share of All Commuters Using Public Transit",
    caption = "Source: U.S. Census Bureau via the tidycensus R Package")

5.2 Visualizing Margin of Error

  • The inclusion of the margin of error in ACS data makes it simple to examine the reliability of estimates
    • This will be particularly useful given concerns about data quality from the 2020 Census
    • Here, we look at median household income estimates in Louisiana parishes
# median household income for Louisiana parishes
la_income <- get_acs(state = "Louisiana", geography = "county", variables = c(hhincome = "B19013_001"),
    year = 2019) %>%
    # simplify parish name
mutate(NAME = str_remove(NAME, " Parish, Louisiana")) %>%
    # arrange by margin of error largest to smallest
arrange(desc(moe))
GEOID NAME variable estimate moe
22023 Cameron hhincome 53423 11624
22025 Catahoula hhincome 40129 9276
22007 Assumption hhincome 43759 7458
22091 St. Helena hhincome 43886 7437
22021 Caldwell hhincome 37691 7351
22059 LaSalle hhincome 42104 6936
22123 West Carroll hhincome 38500 6370
22029 Concordia hhincome 32500 6319
22127 Winn hhincome 38353 5742
22125 West Feliciana hhincome 59637 5661
  • ggplot has geoms, such as geom_errorbarh(), that let us draw the margin of error around each estimate as an interval
# reorder by estimate, add error bar to show confidence interval from margin of
# error
la_income_moe <- ggplot(la_income, aes(x = estimate, y = reorder(NAME, estimate))) +
    geom_errorbarh(aes(xmin = estimate - moe, xmax = estimate + moe)) + geom_point(size = 3,
    color = "darkgreen") + theme_minimal(base_size = 12.5) + labs(title = "Median Household Income",
    subtitle = "Counties in Louisiana", x = "2015-2019 ACS 5-year estimate", y = "") +
    scale_x_continuous(labels = scales::dollar)

5.3 Showing Change Over Time

  • We may also want to see how a variable and/or the margin of error changes over time
    • Here we create another multi-year dataset for median home values in Fulton County, GA
# supply years we want samples for
years <- 2005:2019
names(years) <- years
# use map_dfr() to supply each year to get_acs() and return median home value
# by county for 2005-2019
fulton_value <- map_dfr(years, ~{
    get_acs(geography = "county", variables = "B25077_001", state = "GA", county = "Fulton",
        year = .x, survey = "acs1")
}, .id = "year")
year GEOID NAME variable estimate moe
2005 13121 Fulton County, Georgia B25077_001 243600 7046
2006 13121 Fulton County, Georgia B25077_001 270000 9456
2007 13121 Fulton County, Georgia B25077_001 267800 12865
2008 13121 Fulton County, Georgia B25077_001 284400 8583
2009 13121 Fulton County, Georgia B25077_001 257700 8968
2010 13121 Fulton County, Georgia B25077_001 244300 7932
2011 13121 Fulton County, Georgia B25077_001 230500 9223
2012 13121 Fulton County, Georgia B25077_001 229900 8839
2013 13121 Fulton County, Georgia B25077_001 231800 8471
2014 13121 Fulton County, Georgia B25077_001 248800 9176
  • We can see how home values changed from 2005 to 2019 along with the margin of error
# group = 1 overrides the default grouping so ggplot treats the dataset as a
# whole
fulton_timeseries <- ggplot(fulton_value, aes(x = year, y = estimate, group = 1)) +
    geom_ribbon(aes(ymax = estimate + moe, ymin = estimate - moe), fill = "navy",
        alpha = 0.4) + geom_line(color = "navy") + geom_point(color = "navy", size = 2) +
    theme_minimal(base_size = 12) + scale_y_continuous(labels = scales::dollar) +
    labs(title = "Median Home Value in Fulton County, GA", x = "Year", y = "AMedian Home Value",
        caption = "Shaded Area Represents the Margin of Error of the ACS Estimate")

5.4 Group Visualizations

  • We created groups from income levels earlier, but the ability to filter our geographies by state and/or county allows us to compare groups of different geographies
    • Here we look at median home values across six core Atlanta metro counties
housing_val <- get_acs(
  geography = "tract",
# median home value
  variables = "B25077_001",
  state = "GA",
# only these counties
  county = c(
    "Fulton",
    "DeKalb",
    "Clayton",
    "Cobb",
    "Gwinnett",
    "Henry"
  )
) %>%
# separate NAME column into columns for tract, county, and state
  separate(
  NAME,
  into = c("tract", "county", "state"),
  sep = ", "
)
GEOID tract county state variable estimate moe
13063040202 Census Tract 402.02 Clayton County Georgia B25077_001 82600 10685
13063040203 Census Tract 402.03 Clayton County Georgia B25077_001 135900 21955
13063040204 Census Tract 402.04 Clayton County Georgia B25077_001 118000 9363
13063040302 Census Tract 403.02 Clayton County Georgia B25077_001 60300 7287
13063040303 Census Tract 403.03 Clayton County Georgia B25077_001 66400 9785
13063040306 Census Tract 403.06 Clayton County Georgia B25077_001 66400 30049
13063040307 Census Tract 403.07 Clayton County Georgia B25077_001 89900 10646
13063040308 Census Tract 403.08 Clayton County Georgia B25077_001 59000 19547
13063040407 Census Tract 404.07 Clayton County Georgia B25077_001 94700 10923
13063040408 Census Tract 404.08 Clayton County Georgia B25077_001 119600 16981
  • ggplot allows us to quickly generate multiple comparisons
  • We can display groups on one plot using color
# density curve colored by county
county_density <- ggplot(housing_val, aes(x = estimate, fill = county)) + geom_density(alpha = 0.3)

  • Or we can create plots for each county using facet_wrap()
wrapped_density <- ggplot(housing_val, aes(x = estimate)) + geom_density(fill = "darkgreen",
    color = "darkgreen", alpha = 0.5) + facet_wrap(~county) + scale_x_continuous(labels = function(x) paste0("$",
    x/1000, "k")) + theme_minimal(base_size = 14) + theme(axis.text.y = element_blank()) +
    labs(x = "Median Home Value", y = "", title = "Home Values by Census Tract, 2015-2019 ACS")

6 Bringing in Geographic Data

  • Census and ACS data are associated with/aggregated at geographies
    • Legal entities (states and counties)
    • Statistical entities (census tracts and block groups)
    • Geographic features (roads and water features)
  • Shapefiles are available from the U.S. Census Bureau’s TIGER/Line database (Topologically Integrated Geographic Encoding and Referencing)
    • Ready for mapping and spatial analysis
  • The tigris package allows us to access these files programmatically
  • The sf (simple features) package represents spatial data much like an R data frame, but with a special geometry column that stores the shape of each feature

6.1 Structure of sf objects from tigris

  • sf objects from tigris contain geographic identifiers, names, coordinate information, and land and water area
    • The id and name columns can be used for joining with tidycensus data
library(tigris)
# get boundaries of all Florida counties
fl_counties <- counties("FL")
# print the sf object
fl_counties
# show object type
class(fl_counties)
## Simple feature collection with 67 features and 17 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -87.6349 ymin: 24.39631 xmax: -79.97431 ymax: 31.00097
## CRS:            4269
## First 10 features:
##     STATEFP COUNTYFP COUNTYNS GEOID       NAME          NAMELSAD LSAD CLASSFP
## 72       12      053 00295751 12053   Hernando   Hernando County   06      H1
## 73       12      129 00306912 12129    Wakulla    Wakulla County   06      H1
## 74       12      131 00295727 12131     Walton     Walton County   06      H1
## 79       12      127 00306921 12127    Volusia    Volusia County   06      H1
## 196      12      051 00307626 12051     Hendry     Hendry County   06      H1
## 227      12      095 00295750 12095     Orange     Orange County   06      H1
## 284      12      011 00295753 12011    Broward    Broward County   06      H1
## 312      12      099 00295761 12099 Palm Beach Palm Beach County   06      H1
## 367      12      041 00303633 12041  Gilchrist  Gilchrist County   06      H1
## 371      12      086 00295755 12086 Miami-Dade Miami-Dade County   06      H1
##     MTFCC CSAFP CBSAFP METDIVFP FUNCSTAT      ALAND     AWATER    INTPTLAT
## 72  G4020  <NA>  45300     <NA>        A 1224962206  300591492 +28.5730426
## 73  G4020  <NA>  45220     <NA>        A 1570615733  334902363 +30.1394320
## 74  G4020  <NA>  18880     <NA>        A 2687740188  522795160 +30.6312106
## 79  G4020   422  19660     <NA>        A 2852204344  857791723 +29.0577690
## 196 G4020   163  17500     <NA>        A 2994619266   87037213 +26.5399670
## 227 G4020   422  36740     <NA>        A 2335820971  262462219 +28.5143906
## 284 G4020   370  33100    22744        A 3115289087  310801098 +26.1935353
## 312 G4020   370  33100    48424        A 5087468869 1084901519 +26.6491257
## 367 G4020   264  23540     <NA>        A  905705573   14362143 +29.7234556
## 371 G4020   370  33100    33124        A 4920565755 1376144237 +25.6105799
##         INTPTLON                       geometry
## 72  -082.4662272 MULTIPOLYGON (((-82.25329 2...
## 73  -084.3748463 MULTIPOLYGON (((-84.50434 3...
## 74  -086.1766139 MULTIPOLYGON (((-86.3916 30...
## 79  -081.1617920 MULTIPOLYGON (((-81.5028 29...
## 196 -081.1521142 MULTIPOLYGON (((-81.27172 2...
## 227 -081.3232839 MULTIPOLYGON (((-81.65739 2...
## 284 -080.4766834 MULTIPOLYGON (((-80.29699 2...
## 312 -080.4483542 MULTIPOLYGON (((-80.4824 26...
## 367 -082.7958011 MULTIPOLYGON (((-82.95908 2...
## 371 -080.4970989 MULTIPOLYGON (((-80.44093 2...
## [1] "sf"         "data.frame"
# plot geometry
plot(fl_counties$geometry)
plot(sf::st_geometry(fl_counties))
  • Using the sf object's geometry column or the st_geometry() function returns the same result

6.2 Polygons, Points, and Lines

  • We can grab a variety of geometry types from tigris and quickly plot them
# get boundaries for all tracts in Washington, D.C.
dc_tracts <- tracts("DC")
# all points of interest in DC
dc_landmarks <- landmarks("DC", type = "point")
# and all major roads in DC
dc_roads <- primary_secondary_roads("DC")
# plot each geometry
par(mar = c(4, 4, 0.1, 0.1))
plot(dc_tracts$geometry)
plot(dc_landmarks$geometry)
plot(sf::st_geometry(dc_roads))

7 Mapping Census Data

  • Combining tidycensus data and tigris geographies offers many mapping possibilities
    • It is often useful to examine basic geometry plots before adding data
library(ggplot2)
# cowplot allows us to combine separate plots
library(cowplot)
# get Broward County, FL census tracts from tigris
broward_tracts <- tracts("FL", "Broward")
# get Broward block groups from tigris
broward_block_groups <- block_groups("FL", "Broward")
# plot Broward census tracts with ggplot
gg1 <- ggplot(broward_tracts) + geom_sf() + theme_void() + labs(title = "Census tracts")
# plot Broward block groups with ggplot
gg2 <- ggplot(broward_block_groups) + geom_sf() + theme_void() + labs(title = "Block groups")
# combine two plots into one grid with cowplot
cowplot::plot_grid(gg1, gg2)

7.1 TIGER/Line vs. Cartographic Boundaries

  • TIGER/Line shapefiles and cartographic boundary files (cb = TRUE) differ in their level of detail
ms_counties <- counties("MS")
ms_counties_cb <- counties("MS", cb = TRUE)

ms_tiger_gg <- ggplot(ms_counties) + geom_sf() + theme_void() + labs(title = "TIGER/Line")

ms_cb_gg <- ggplot(ms_counties_cb) + geom_sf() + theme_void() + labs(title = "Cartographic boundary")
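  • As with the Broward example, the two plots can be placed side by side with cowplot (assuming it is still loaded) to compare the level of boundary detail
cowplot::plot_grid(ms_tiger_gg, ms_cb_gg)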

7.2 Adding Census Data

  • If we pass geometry = TRUE to tidycensus functions such as get_acs(), the returned data can go straight to ggplot for mapping
    • geom_sf() reads and plots geographic coordinates from sf objects
    • Here, we map census tracts in Montgomery County, Alabama and shade them by Black population
library(tidycensus)
library(tidyverse)
library(tigris)
options(tigris_use_cache = TRUE)

racevars <- c(White = "B03002_003", Black = "B03002_004", Asian = "B03002_006", Hispanic = "B03002_012")

montgomery <- get_acs(state = "AL", county = "Montgomery", geography = "tract", variables = racevars,
    geometry = TRUE, summary_var = "B03002_001")

head(montgomery)

# map of Montgomery County tracts shaded by Black population estimate
montgomery_map <- montgomery %>%
    filter(variable == "Black") %>%
    ggplot(aes(fill = estimate)) + geom_sf(color = NA) + scale_fill_viridis_c(option = "magma") +
    theme_void()

# create different map for each race showing population share
montgomery_facetmap <- montgomery %>%
    mutate(pct = 100 * (estimate/summary_est)) %>%
    ggplot(aes(fill = pct)) + facet_wrap(~variable) + geom_sf(color = NA) + scale_fill_viridis_c() +
    theme_void()
  • Just as we split an earlier plot by county, we can use facet_wrap to split this plot by each racial group contained in the data

7.3 Modeling

  • tidycensus also makes it easy to model the characteristics of our data
    • First, we pull data for metro Miami counties
library(tidycensus)
library(sf)
# counties of interest
mia_counties <- c("Miami-Dade", "Broward", "Palm Beach")
# labeled vector of variables
variables_to_get <- c(median_value = "B25077_001", median_rooms = "B25018_001", median_income = "DP03_0062",
    total_population = "B01003_001", median_age = "B01002_001", pct_college = "DP02_0068P",
    pct_foreign_born = "DP02_0094P", pct_white = "DP05_0077P", median_year_built = "B25037_001",
    percent_ooh = "DP04_0046P")

mia_data <- get_acs(geography = "tract", variables = variables_to_get, state = "FL",
    county = mia_counties, geometry = TRUE, output = "wide", year = 2019) %>%
    select(-NAME)
GEOID median_valueE median_valueM median_roomsE median_roomsM total_populationE total_populationM median_ageE median_ageM median_year_builtE median_year_builtM median_incomeE median_incomeM pct_collegeE pct_collegeM pct_foreign_bornE pct_foreign_bornM pct_whiteE pct_whiteM percent_oohE percent_oohM geometry
12011010102 440900 30194 5.2 0.3 2898 242 49.4 2.8 1966 2 92734 17427 46.1 7.8 306 NA 91.7 3.6 84.6 5.5 MULTIPOLYGON (((-80.09418 2…
12011010103 348000 23554 4.4 0.2 3144 298 58.5 1.6 1967 2 69583 18900 41.0 6.6 419 NA 88.9 4.0 75.3 6.0 MULTIPOLYGON (((-80.09123 2…
12011010104 345500 32173 3.7 0.2 2156 331 57.0 3.8 1972 5 55612 6151 47.8 8.6 314 NA 84.6 6.5 61.9 9.7 MULTIPOLYGON (((-80.08109 2…
12011010200 166400 26243 4.3 0.2 6462 685 43.9 3.2 1973 2 48442 3829 25.9 6.3 2815 NA 65.6 8.0 58.0 8.1 MULTIPOLYGON (((-80.10748 2…
12011010304 193600 13564 4.8 0.3 4584 756 34.3 3.9 1974 2 47168 7151 8.6 4.1 1071 NA 8.3 4.3 55.4 8.4 MULTIPOLYGON (((-80.11944 2…
12011010305 96900 22040 3.8 0.2 4319 587 35.7 4.6 1985 2 44193 5169 18.0 7.1 1482 NA 38.6 10.5 12.5 3.6 MULTIPOLYGON (((-80.12199 2…
12011010306 126700 16456 4.1 0.2 2708 354 41.4 5.8 1973 3 40843 6934 21.3 7.2 476 NA 25.5 5.9 42.9 6.1 MULTIPOLYGON (((-80.10789 2…
12011010307 165900 13992 4.1 0.3 4028 353 38.6 4.9 1980 2 38105 4475 10.7 3.7 802 NA 27.7 5.5 23.7 6.3 MULTIPOLYGON (((-80.12833 2…
12011010308 296700 28258 4.3 0.2 5088 547 47.5 6.3 1986 2 64821 12747 41.1 7.8 1186 NA 68.1 10.0 43.2 6.7 MULTIPOLYGON (((-80.15345 2…
12011010401 333700 12144 5.3 0.3 4973 331 38.6 3.4 1986 2 76940 7263 34.3 5.3 1146 NA 64.6 7.8 70.0 6.2 MULTIPOLYGON (((-80.17015 2…

7.4 Examining the Data

  • Then we create a map of median home values and visualize its distribution with a histogram
library(cowplot)

mhv_map <- ggplot(mia_data, aes(fill = median_valueE)) + geom_sf(color = NA) + scale_fill_viridis_c(labels = scales::dollar) +
    theme_void() + labs(fill = "Median home value")

mhv_histogram <- ggplot(mia_data, aes(x = median_valueE)) + geom_histogram(alpha = 0.5,
    fill = "navy", color = "navy", bins = 100) + theme_minimal() + scale_x_continuous(labels = scales::label_number_si(accuracy = NULL)) +
    labs(x = "Median home value")

cowplot::plot_grid(mhv_map, mhv_histogram)

7.5 Comparing Distributions

  • Because the data is right-skewed, we can apply a log transformation and redraw both the map and the histogram
library(tidyverse)
library(cowplot)

mhv_map_log <- ggplot(mia_data, aes(fill = log(median_valueE))) + geom_sf(color = NA) +
    scale_fill_viridis_c() + theme_void() + labs(fill = "Median home\nvalue (log)")

mhv_histogram_log <- ggplot(mia_data, aes(x = log(median_valueE))) + geom_histogram(alpha = 0.5,
    fill = "navy", color = "navy", bins = 100) + theme_minimal() + scale_x_continuous() +
    labs(x = "Median home value (log)")

plot_grid(mhv_map_log, mhv_histogram_log)

7.6 Looking at Correlations

  • We may also want to know how variables in our data are correlated
    • Compatibility with tidyverse functions makes it easy to drop unnecessary variables, remove NA values, and format variables of interest for calculating correlations
library(corrr)
library(units)

mia_data_for_model <- mia_data %>%
# use tract area to calculate population density in persons per square kilometer
  mutate(pop_density = as.numeric(set_units(total_populationE / st_area(.), "1/km2")),
# subtract median year built from 2017, the midpoint of the 2015-2019 sample range
         median_structure_age = 2017 - median_year_builtE) %>%
# select estimates but not margins of error
  select(!ends_with("M")) %>%
# remove "E" at end of variable names
  rename_with(.fn = ~str_remove(.x, "E$")) %>%
# remove NA values
  na.omit()
GEOID median_value median_rooms total_population median_age median_year_built median_income pct_college pct_foreign_born pct_white percent_ooh geometry pop_density median_structure_age
12011010102 440900 5.2 2898 49.4 1966 92734 46.1 306 91.7 84.6 MULTIPOLYGON (((-80.09418 2… 1726.025 51
12011010103 348000 4.4 3144 58.5 1967 69583 41.0 419 88.9 75.3 MULTIPOLYGON (((-80.09123 2… 1371.705 50
12011010104 345500 3.7 2156 57.0 1972 55612 47.8 314 84.6 61.9 MULTIPOLYGON (((-80.08109 2… 2605.186 45
12011010200 166400 4.3 6462 43.9 1973 48442 25.9 2815 65.6 58.0 MULTIPOLYGON (((-80.10748 2… 2344.356 44
12011010304 193600 4.8 4584 34.3 1974 47168 8.6 1071 8.3 55.4 MULTIPOLYGON (((-80.11944 2… 3985.536 43
12011010305 96900 3.8 4319 35.7 1985 44193 18.0 1482 38.6 12.5 MULTIPOLYGON (((-80.12199 2… 1754.610 32
12011010306 126700 4.1 2708 41.4 1973 40843 21.3 476 25.5 42.9 MULTIPOLYGON (((-80.10789 2… 2186.491 44
12011010307 165900 4.1 4028 38.6 1980 38105 10.7 802 27.7 23.7 MULTIPOLYGON (((-80.12833 2… 1317.291 37
12011010308 296700 4.3 5088 47.5 1986 64821 41.1 1186 68.1 43.2 MULTIPOLYGON (((-80.15345 2… 1904.333 31
12011010401 333700 5.3 4973 38.6 1986 76940 34.3 1146 64.6 70.0 MULTIPOLYGON (((-80.17015 2… 2665.559 31
  • We need to make sure to use the sf function st_drop_geometry() before calculating correlations
    • This turns the sf object into a regular data frame, an essential step because the geometry column is incompatible with many R functions
mia_estimates <- mia_data_for_model %>%
    # remove GEOID, median_value, and median_year_built variables from data
select(-GEOID, -median_value, -median_year_built) %>%
    # remove geometry column (converts sf object to regular data frame)
st_drop_geometry()
# calculate pearson correlation
correlations <- correlate(mia_estimates, method = "pearson")
# network plot of correlation
network_plot(correlations)
median_rooms total_population median_age median_income pct_college pct_foreign_born pct_white percent_ooh pop_density median_structure_age
5.2 2898 49.4 92734 46.1 306 91.7 84.6 1726.025 51
4.4 3144 58.5 69583 41.0 419 88.9 75.3 1371.705 50
3.7 2156 57.0 55612 47.8 314 84.6 61.9 2605.186 45
4.3 6462 43.9 48442 25.9 2815 65.6 58.0 2344.356 44
4.8 4584 34.3 47168 8.6 1071 8.3 55.4 3985.536 43
3.8 4319 35.7 44193 18.0 1482 38.6 12.5 1754.610 32
4.1 2708 41.4 40843 21.3 476 25.5 42.9 2186.491 44
4.1 4028 38.6 38105 10.7 802 27.7 23.7 1317.291 37
4.3 5088 47.5 64821 41.1 1186 68.1 43.2 1904.333 31
5.3 4973 38.6 76940 34.3 1146 64.6 70.0 2665.559 31
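  • To follow up on the correlations, we can sketch a simple linear model of log median home value on a few of the tract characteristics assembled in mia_data_for_model (the predictors chosen here are illustrative)
# illustrative linear model of log median home value on selected tract characteristics
home_value_model <- lm(log(median_value) ~ median_rooms + median_income + pct_college +
    percent_ooh + pop_density + median_structure_age, data = mia_data_for_model)
summary(home_value_model)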

8 Microdata

  • ACS tables are individual responses aggregated to a geography
  • For most purposes, these tables have the data we need
  • Census Bureau also releases microdata from the ACS via the Public Use Microdata Sample (PUMS)
    • Individual- and household-level responses to the ACS are available with the get_pums() function
    • One row per respondent instead of geography
    • Smallest geography is Public Use Microdata Area (PUMA)
  • Allows for custom estimates and more granular analysis

8.1 PUMS variables

  • Like with ACS data, tidycensus allows us to load, filter, and inspect PUMS variables
install.packages(c("survey", "srvyr"))
library(tidyverse)
library(tidycensus)
# 2019 ACS 1-year variables
pums_vars_2019 <- pums_variables %>%
    filter(year == 2019, survey == "acs1")
# keep only unique variables
pums_vars_2019 %>%
    distinct(var_code, var_label, data_type, level)
  • PUMS variables come with codes, value labels, and levels (person or household)
    var_code var_label data_type level
    RT Record Type chr NA
    SERIALNO Housing unit/GQ person serial number chr NA
    DIVISION Division code based on 2010 Census definitions chr NA
    PUMA Public use microdata area code (PUMA) based on 2010 Census definition (areas with population of 100,000 or more, use with ST for unique code) chr NA
    REGION Region code based on 2010 Census definitions chr NA
    ST State Code based on 2010 Census definitions chr NA
    ADJHSG Adjustment factor for housing dollar amounts (6 implied decimal places) chr housing
    ADJINC Adjustment factor for income and earnings dollar amounts (6 implied decimal places) chr housing
    WGTP Housing Unit Weight num housing
    NP Number of persons in this household num housing
# only individual level records
pums_vars_2019 %>%
    distinct(var_code, var_label, data_type, level) %>%
    filter(level == "person")
  • Many variables are only available at one level, so we need to make sure our variable of interest matches the level of detail we want to use
    var_code var_label data_type level
    SPORDER Person number num person
    PWGTP Person’s weight num person
    AGEP Age num person
    CIT Citizenship status chr person
    CITWP Year of naturalization write-in num person
    COW Class of worker chr person
    DDRS Self-care difficulty chr person
    DEAR Hearing difficulty chr person
    DEYE Vision difficulty chr person
    DOUT Independent living difficulty chr person

8.2 Loading PUMS Data and Recoding

  • Many PUMS variables have numeric values that represent categories
ga_pums <- get_pums(variables = c("PUMA", "SEX", "AGEP", "SCHL"), state = "GA", survey = "acs1",
    year = 2018)
  • Here, we load PUMS data for Georgia
    • May be unclear what values mean for some variables
      SERIALNO SPORDER WGTP PWGTP AGEP PUMA ST SCHL SEX
      2018GQ0000025 1 0 68 51 03700 13 13 1
      2018GQ0000043 1 0 89 23 04000 13 20 2
      2018GQ0000061 1 0 10 43 00500 13 17 1
      2018GQ0000076 1 0 11 20 04300 13 19 2
      2018GQ0000145 1 0 4 23 04200 13 13 1
      2018GQ0000336 1 0 90 36 01800 13 19 1
      2018GQ0000482 1 0 76 92 04400 13 16 2
      2018GQ0000518 1 0 76 48 02600 13 17 1
      2018GQ0000530 1 0 15 19 03200 13 19 2
      2018GQ0000563 1 0 50 87 03900 13 21 2
    • By including recode = TRUE in our function call, any labeled variables will be added alongside their unlabeled versions
    ga_pums_recoded <- get_pums(variables = c("PUMA", "SEX", "AGEP", "SCHL"), state = "GA",
        survey = "acs1", year = 2018, recode = TRUE)
    SERIALNO SPORDER WGTP PWGTP AGEP PUMA ST SCHL SEX ST_label SCHL_label SEX_label
    2018GQ0000025 1 0 68 51 03700 13 13 1 Georgia/GA Grade 10 Male
    2018GQ0000043 1 0 89 23 04000 13 20 2 Georgia/GA Associate’s degree Female
    2018GQ0000061 1 0 10 43 00500 13 17 1 Georgia/GA GED or alternative credential Male
    2018GQ0000076 1 0 11 20 04300 13 19 2 Georgia/GA 1 or more years of college credit, no degree Female
    2018GQ0000145 1 0 4 23 04200 13 13 1 Georgia/GA Grade 10 Male
    2018GQ0000336 1 0 90 36 01800 13 19 1 Georgia/GA 1 or more years of college credit, no degree Male
    2018GQ0000482 1 0 76 92 04400 13 16 2 Georgia/GA Regular high school diploma Female
    2018GQ0000518 1 0 76 48 02600 13 17 1 Georgia/GA GED or alternative credential Male
    2018GQ0000530 1 0 15 19 03200 13 19 2 Georgia/GA 1 or more years of college credit, no degree Female
    2018GQ0000563 1 0 50 87 03900 13 21 2 Georgia/GA Bachelor’s degree Female
    • PUMS data uses weights to allow estimation of population-level statistics from a relatively small sample
    • With the recoded labels, we can create a population count by PUMA and sex by summing the person weight (PWGTP)
    # weighted population count by PUMA and sex
    ga_pums_recoded %>%
        count(PUMA, SEX_label, wt = PWGTP)
    PUMA SEX_label n
    00100 Male 74196
    00100 Female 79139
    00200 Male 59547
    00200 Female 58650
    00300 Male 77869
    00300 Female 77896
    00401 Male 77689
    00401 Female 87415
    00402 Male 61002
    00402 Female 63279
    • In addition to weighted counts, we can also create weighted means and percentages
      • Here, we calculate summary statistics for male and female individuals 25 and older with a Bachelor’s degree or above
    # create a new variable that is whether or not the person has a Bachelor’s
    # degree or above, group by PUMA and sex, then calculate the total population,
    # average age, total with BA or above (only for people 25 and older), and
    # percent with BA or above.
    ga_pums_recoded %>%
        mutate(ba_above = SCHL %in% c("21", "22", "23", "24")) %>%
        group_by(PUMA, SEX_label) %>%
        summarize(total_pop = sum(PWGTP), mean_age = weighted.mean(AGEP, PWGTP), ba_above = sum(PWGTP[ba_above ==
            TRUE & AGEP >= 25]), ba_above_pct = ba_above/sum(PWGTP[AGEP >= 25]))
    • This yields population counts, average age, and counts and percentages of educational attainment
      PUMA SEX_label total_pop mean_age ba_above ba_above_pct
      00100 Male 74196 38.47365 12848 0.2598233
      00100 Female 79139 40.34941 16232 0.2947468
      00200 Male 59547 31.83490 6436 0.1846613
      00200 Female 58650 34.55173 9715 0.2611489
      00300 Male 77869 34.31346 9234 0.2053871
      00300 Female 77896 36.61063 10139 0.2078729
      00401 Male 77689 35.36655 14187 0.2845881
      00401 Female 87415 37.84550 16109 0.2770106
      00402 Male 61002 38.24930 14338 0.3405215
      00402 Female 63279 41.18079 19165 0.4163951

8.3 Modeling Microdata

  • Because PUMS data sample sizes can get very small when we filter or create groups, we will often want to calculate standard errors and check the reliability of the data
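    • A minimal sketch of that workflow: re-request the Georgia PUMS data with person replicate weights, convert it to a survey design object with to_survey(), and let srvyr compute standard errors (the grouping and summary below are illustrative)
library(srvyr)
# re-download the Georgia PUMS data, this time including person replicate weights
ga_pums_rep <- get_pums(variables = c("PUMA", "SEX", "AGEP", "SCHL"), state = "GA",
    survey = "acs1", year = 2018, recode = TRUE, rep_weights = "person")
# convert to a survey design object so standard errors can be calculated
ga_survey <- to_survey(ga_pums_rep)
# weighted mean age by sex, with standard errors
ga_survey %>%
    group_by(SEX_label) %>%
    summarize(mean_age = survey_mean(AGEP))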

8.4 Mapping Microdata

  • PUMS data is mapped at the PUMA level, so it is useful to examine how that data differs from more common geographies
    • Here we use tigris to pull PUMA boundaries for six states
# create vector of southeastern state abbreviations
se_states <- c("GA", "FL", "AL", "TN", "LA", "MS")
# use map() to pass each state to the tigris pumas() function to get their puma
# geographic files
se_pumas <- map(se_states, tigris::pumas, class = "sf", cb = TRUE) %>%
    reduce(rbind)
  • We can see that PUMA naming conventions are quite different from those of other geographies
    STATEFP10 PUMACE10 AFFGEOID10 GEOID10 NAME10 LSAD10 ALAND10 AWATER10 geometry
    13 00700 7950000US1300700 1300700 Southern Georgia Regional Commission (West) P0 6023715229 99859689 MULTIPOLYGON (((-83.33829 3…
    13 01800 7950000US1301800 1301800 River Valley Regional Commission (Outside Muscogee & Chattahoochee Counties) P0 12319655378 221465656 MULTIPOLYGON (((-85.18491 3…
    13 03900 7950000US1303900 1303900 Northeast Georgia Regional Commission (Southwest)–Walton, Morgan & Jasper Counties P0 2700385304 40042206 MULTIPOLYGON (((-83.98154 3…
    13 02700 7950000US1302700 1302700 Northwest Georgia Regional Commission (North Central)–Whitfield County P0 752230670 1597015 MULTIPOLYGON (((-85.16705 3…
    13 04000 7950000US1304000 1304000 Central Savannah River Area Regional Commission (East Central)–Richmond County P0 840018751 11014326 MULTIPOLYGON (((-82.3503 33…
  • Here we select the PUMA and POVPIP (income-to-poverty-ratio) variables for states in the Fed 6th District
# PUMA and income-to-poverty ratio
se_pums <- get_pums(variables = c("PUMA", "POVPIP"), state = se_states, survey = "acs1",
    year = 2018)
SERIALNO SPORDER WGTP PWGTP POVPIP PUMA ST
2018GQ0000025 1 0 68 -1 03700 13
2018GQ0000028 1 0 20 -1 01800 47
2018GQ0000043 1 0 89 -1 04000 13
2018GQ0000049 1 0 75 -1 01600 01
2018GQ0000052 1 0 42 -1 01700 28
  • Before mapping, we group by state and PUMA and calculate the total population and the share of the population that is below 200% of the poverty line
# calculate share of population below 200% of poverty line
se_pov <- se_pums %>%
    group_by(ST, PUMA) %>%
    summarize(total_pop = sum(PWGTP), pct_in_pov = sum(PWGTP[POVPIP < 200])/total_pop)
ST PUMA total_pop pct_in_pov
01 00100 184908 0.3595410
01 00200 196543 0.2522807
01 00301 132862 0.3144240
01 00302 101612 0.3541708
01 00400 123128 0.5115408
  • We then join our microdata to the PUMA boundaries and shade our map with the poverty variable
# join data and create map shaded by population share below 200% poverty line
se_pumas %>%
    left_join(se_pov, by = c(STATEFP10 = "ST", PUMACE10 = "PUMA")) %>%
    ggplot(aes(fill = pct_in_pov)) + geom_sf() + scale_fill_viridis_b(name = NULL,
    option = "magma", labels = scales::label_percent(1)) + labs(title = "Percentage of Population Below 200% of the Poverty Line") +
    theme_void()

9 Census Microdata from IPUMS

Source: Minnesota Population Center

  • We may choose to work with data from 3rd party providers rather than the Census API depending on our needs and preferences
  • IPUMS is easily the most comprehensive
    • Maintained by the Institute for Social Research and Data Innovation at the University of Minnesota
    • Data from over 750 censuses and surveys representing over one billion individuals
      • Including the Current Population Survey, the American Community Survey, the National Health Interview Survey, and the Demographic and Health Surveys, among others
  • We will focus on ACS IPUMS data, but the same syntax can be applied to CPS and other sources

9.1 What is ipumsr?

Source: Minnesota Population Center

  • Created by Greg Freedman Ellis
  • Allows users to:
    • Read IPUMS data extracts into R
      • Manage the large size, hierarchical structure, and variable labeling conventions of IPUMS extracts
    • Use IPUMS metadata to transform data and conduct analysis
    • In the near future, query the IPUMS API in a manner similar to tidycensus

9.2 IPUMS Extracts

Source: Minnesota Population Center

  • IPUMS USA
    • US Census and American Community Survey microdata from 1850 to the present
      • Millions of unique individual records from decennial census and ACS, including historical records from 1850 to 1940
      • https://usa.ipums.org/usa/
  • IPUMS CPS
    • Current Population Survey microdata from 1962 to the present

9.3 Creating an Extract

Source: Minnesota Population Center

  • Select variables and samples from IPUMS site
  • Choose format (file type and data structure)
  • Download extract and DDI codebook (XML metadata)
    • Save both files in the same folder

9.4 Reading in Your Extract

  • Package ecosystem
# CRAN version
install.packages("ipumsr")
# development version
if (!require(remotes)) install.packages("remotes")
remotes::install_github("mnpopcenter/ipumsr", ref = "api-alpha-dev")
# tidyverse packages
install.packages("dplyr")
install.packages("ggplot2")
install.packages("stringr")
install.packages("purrr")
# HTML tables
install.packages("DT")
# spatial analysis and mapping
install.packages("sf")

library(ipumsr)
library(dplyr)
library(ggplot2)
library(stringr)
library(sf)
library(purrr)

9.5 Reading in Your Extract

  • Import the data from the extract using the metadata as a map
  • Many IPUMS variable names and values are not self-explanatory, so we use ipumsr functions to view labels, descriptions, and value labels
# read in xml metadata codebook
ddi <- read_ipums_ddi("usa_00013.xml")
# use metadata to create data frame from IPUMS extract
data <- read_ipums_micro(ddi)
  • The data file is just raw, uncoded data
    • The ddi codebook provides a map for assembling and analyzing it
names(ddi)
##  [1] "file_name"        "file_path"        "file_type"        "ipums_project"   
##  [5] "extract_date"     "extract_notes"    "rectypes"         "rectype_idvar"   
##  [9] "rectypes_keyvars" "var_info"         "conditions"       "citation"        
## [13] "file_encoding"
  • IPUMS variable codes are not always intuitive
# print variable names
names(data)
##  [1] "YEAR"     "DATANUM"  "SERIAL"   "HHWT"     "STATEFIP" "CONSPUMA"
##  [7] "GQ"       "PHONE"    "PERNUM"   "PERWT"    "EDUC"     "EDUCD"   
## [13] "EDUCD2"   "EDUCD3"   "PHONE2"   "PHONE3"
  • We can use ipumsr functions to view full variable names
# view full name of variable PHONE
ipums_var_label(ddi, PHONE)
## [1] "Telephone availability"
  • detailed variable descriptions
# view variable descriptions, strwrap() transforms them into paragraphs
ipums_var_desc(ddi, PHONE) %>%
    strwrap(60)
## [1] "PHONE indicates whether residents of the housing unit had"
## [2] "telephone access."
  • and the values and labels for different variables
# view values and labels for variable
ipums_val_labels(ddi, PHONE)
val lbl
0 N/A
1 No, no phone available
2 Yes, phone available
8 Suppressed (2012 and 2015 ACS)

9.6 Handling Labels

  • IPUMS value labels don’t translate perfectly to the R factor object type
    • All factor values must be labeled
    • Factor values must count up from 1
  • ipumsr has functions to help work with values and labels while following R logic
head(data)
  • We can see that our data has both labeled and unlabeled variables
    YEAR DATANUM SERIAL HHWT STATEFIP CONSPUMA GQ PHONE PERNUM PERWT EDUC EDUCD EDUCD2 EDUCD3 PHONE2 PHONE3
    1960 1 336455 100 27 NA 1 2 1 100 5 50 Grade 11 Less than High School Yes, phone available Yes, phone available
    1960 1 336455 100 27 NA 1 2 2 100 6 60 Grade 12 High school Yes, phone available Yes, phone available
    1960 1 336456 100 27 NA 1 2 1 100 6 60 Grade 12 High school Yes, phone available Yes, phone available
    1960 1 336456 100 27 NA 1 2 2 100 6 60 Grade 12 High school Yes, phone available Yes, phone available
    1960 1 336456 100 27 NA 1 2 3 100 2 22 Grade 5, 6, 7, or 8 Less than High School Yes, phone available Yes, phone available
    1960 1 336456 100 27 NA 1 2 4 100 0 1 N/A or no schooling N/A or no schooling Yes, phone available Yes, phone available
    1960 1 336457 99 27 NA 1 2 1 99 6 60 Grade 12 High school Yes, phone available Yes, phone available
    1960 1 336457 99 27 NA 1 2 2 99 4 40 Grade 10 Less than High School Yes, phone available Yes, phone available
  • The as_factor() function converts labelled values into factors using their value labels
# add factor levels to labels
ipumsr::as_factor(data)
YEAR DATANUM SERIAL HHWT STATEFIP CONSPUMA GQ PHONE PERNUM PERWT EDUC EDUCD EDUCD2 EDUCD3 PHONE2 PHONE3
1960 1 336455 100 Minnesota NA Households under 1970 definition Yes, phone available 1 100 Grade 11 Grade 11 Grade 11 Less than High School Yes, phone available Yes, phone available
1960 1 336455 100 Minnesota NA Households under 1970 definition Yes, phone available 2 100 Grade 12 Grade 12 Grade 12 High school Yes, phone available Yes, phone available
1960 1 336456 100 Minnesota NA Households under 1970 definition Yes, phone available 1 100 Grade 12 Grade 12 Grade 12 High school Yes, phone available Yes, phone available
1960 1 336456 100 Minnesota NA Households under 1970 definition Yes, phone available 2 100 Grade 12 Grade 12 Grade 12 High school Yes, phone available Yes, phone available
1960 1 336456 100 Minnesota NA Households under 1970 definition Yes, phone available 3 100 Grade 5, 6, 7, or 8 Grade 5 Grade 5, 6, 7, or 8 Less than High School Yes, phone available Yes, phone available
1960 1 336456 100 Minnesota NA Households under 1970 definition Yes, phone available 4 100 N/A or no schooling N/A N/A or no schooling N/A or no schooling Yes, phone available Yes, phone available
1960 1 336457 99 Minnesota NA Households under 1970 definition Yes, phone available 1 99 Grade 12 Grade 12 Grade 12 High school Yes, phone available Yes, phone available
1960 1 336457 99 Minnesota NA Households under 1970 definition Yes, phone available 2 99 Grade 10 Grade 10 Grade 10 Less than High School Yes, phone available Yes, phone available
  • We can also use zap_labels() to remove labels and show raw values
# remove labels to show values
zap_labels(data)
YEAR DATANUM SERIAL HHWT STATEFIP CONSPUMA GQ PHONE PERNUM PERWT EDUC EDUCD EDUCD2 EDUCD3 PHONE2 PHONE3
1960 1 336455 100 27 NA 1 2 1 100 5 50 Grade 11 Less than High School Yes, phone available Yes, phone available
1960 1 336455 100 27 NA 1 2 2 100 6 60 Grade 12 High school Yes, phone available Yes, phone available
1960 1 336456 100 27 NA 1 2 1 100 6 60 Grade 12 High school Yes, phone available Yes, phone available
1960 1 336456 100 27 NA 1 2 2 100 6 60 Grade 12 High school Yes, phone available Yes, phone available
1960 1 336456 100 27 NA 1 2 3 100 2 22 Grade 5, 6, 7, or 8 Less than High School Yes, phone available Yes, phone available
1960 1 336456 100 27 NA 1 2 4 100 0 1 N/A or no schooling N/A or no schooling Yes, phone available Yes, phone available
1960 1 336457 99 27 NA 1 2 1 99 6 60 Grade 12 High school Yes, phone available Yes, phone available
1960 1 336457 99 27 NA 1 2 2 99 4 40 Grade 10 Less than High School Yes, phone available Yes, phone available
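  • Both functions also work on a single column if you only want to convert one variable; a minimal sketch using columns from the extract above:
# convert just one labelled column to a factor
ipumsr::as_factor(data$STATEFIP)
# or strip labels from one column, keeping the underlying integer codes
zap_labels(data$PHONE)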

9.7 Handling Labels: Consolidating

  • lbl_collapse() allows you to take advantage of the hierarchical structure of value labels
ipums_val_labels(data$EDUCD)
  • We have far too many categories for useful analysis
    val lbl
    0 N/A or no schooling
    1 N/A
    2 No schooling completed
    10 Nursery school to grade 4
    11 Nursery school, preschool
    12 Kindergarten
    13 Grade 1, 2, 3, or 4
    14 Grade 1
    15 Grade 2
    16 Grade 3
    17 Grade 4
    20 Grade 5, 6, 7, or 8
    21 Grade 5 or 6
    22 Grade 5
    23 Grade 6
    24 Grade 7 or 8
    25 Grade 7
    26 Grade 8
    30 Grade 9
    40 Grade 10
    50 Grade 11
    60 Grade 12
    61 12th grade, no diploma
    62 High school graduate or GED
    63 Regular high school diploma
    64 GED or alternative credential
    65 Some college, but less than 1 year
    70 1 year of college
    71 1 or more years of college credit, no degree
    80 2 years of college
    81 Associate’s degree, type not specified
    82 Associate’s degree, occupational program
    83 Associate’s degree, academic program
    90 3 years of college
    100 4 years of college
    101 Bachelor’s degree
    110 5+ years of college
    111 6 years of college (6+ in 1960-1970)
    112 7 years of college
    113 8+ years of college
    114 Master’s degree
    115 Professional degree beyond a bachelor’s degree
    116 Doctoral degree
    999 Missing
  • The variable values are hierarchical: each value divisible by ten is a general category that contains the more detailed values beneath it
    • We simplify the variable by collapsing the detailed values into those general categories (a quick illustration of the integer division follows the output below)
# still too detailed
data$EDUCD %>%
    lbl_collapse(~.val %/% 10) %>%
    ipums_val_labels()
val lbl
0 N/A or no schooling
1 Nursery school to grade 4
2 Grade 5, 6, 7, or 8
3 Grade 9
4 Grade 10
5 Grade 11
6 Grade 12
7 1 year of college
8 2 years of college
9 3 years of college
10 4 years of college
11 5+ years of college
99 Missing
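  • To see why dividing by ten works, note that integer division maps each detailed code onto its general category; a quick illustration with codes taken from the table above:
# integer division by 10 maps detailed EDUCD codes onto their general group
c(63, 71, 101, 114) %/% 10
## [1]  6  7 10 11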

9.8 Handling Labels: Consolidating

  • We still have too many categories, so we may want to combine some into broader groups that are easier to interpret
  • By using some string detection and relabeling, we reduce the number of categories to six
# expression for filtering variable labels with a particular string
college_regex <- "^[123] year(s)? of college$"
data$EDUCD3 <- data$EDUCD %>%
    # divide values by 10 to make single digit
    lbl_collapse(~.val %/% 10) %>%
    # combine all less than high school values into one,
    # classify any amount of college without completion as 'Some college',
    # and create one category for all college degree types
    lbl_relabel(
        lbl(2, "Less than High School") ~ .val > 0 & .val < 6,
        lbl(3, "High school") ~ .lbl == "Grade 12",
        lbl(4, "Some college") ~ str_detect(.lbl, college_regex),
        lbl(5, "College or more") ~ .val %in% c(10, 11)
    ) %>%
    as_factor()
# new levels
levels(data$EDUCD3)
## [1] "N/A or no schooling"   "Less than High School" "High school"          
## [4] "Some college"          "College or more"       "Missing"

9.9 Removing Missing Data

  • lbl_na_if() allows you to set certain values or labels to missing
  • In this variable, we have codes for N/A and suppressed results that we need to convert to NA before analysis
ipums_val_labels(data$PHONE)
val lbl
0 N/A
1 No, no phone available
2 Yes, phone available
8 Suppressed (2012 and 2015 ACS)
  • We can do this using the numeric values
# convert these values to NAs
data$PHONE2 <- lbl_na_if(data$PHONE, ~.val %in% c(0, 8)) %>%
    as_factor()
levels(data$PHONE2)
## [1] "No, no phone available" "Yes, phone available"
  • or the string labels
# works with labels or values
drop_labels <- c("N/A", "Suppressed (2012 and 2015 ACS)")
data$PHONE3 <- lbl_na_if(data$PHONE, ~.lbl %in% drop_labels) %>%
    as_factor()
# updated levels
levels(data$PHONE3)
## [1] "No, no phone available" "Yes, phone available"

10 Analysis

  • Now the extract is factored, relabeled and ready for graphing

10.1 Selecting the Right Level, Weight, and Aggregation

  • Extracts are samples, so records come with both household-level and person-level weights (HHWT and PERWT)
  • Here, we use the person weight (PERWT) to estimate the share of people with phone lines by year (a small worked illustration follows the table below)
# prepare data for graphing by grouping by year, and using the person weight to
# calculate a weighted mean. This yields the estimated share of the Minnesota
# population who have a phone
graph_data <- data %>%
    group_by(YEAR) %>%
    summarize(`% with phone` = weighted.mean(PHONE2 == "Yes, phone available", PERWT,
        na.rm = TRUE), .groups = "drop")
YEAR % with phone
1960 0.8863756
1970 0.9579866
1980 0.9745530
1990 0.9780473
2000 0.9904893
2001 0.9903932
2002 0.9843541
2003 0.9871095
2004 0.9771358
2005 0.9742268
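  • weighted.mean() here is just the sum of the weighted indicator divided by the sum of the weights; a minimal illustration with made-up values:
# weighted.mean(x, w) is equivalent to sum(x * w) / sum(w)
x <- c(1, 1, 0)      # 1 = has a phone, 0 = does not
w <- c(100, 99, 50)  # person weights
weighted.mean(x, w)  # ~0.7992
sum(x * w) / sum(w)  # same result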
  • We also have educational attainment data in the extract, so we prepare another grouped dataset by both year and education
# group data by both year and education level
graph_data2 <- data %>%
    group_by(YEAR, EDUCD3) %>%
    summarize(`% with phone` = weighted.mean(PHONE2 == "Yes, phone available", PERWT,
        na.rm = TRUE), .groups = "drop")
YEAR EDUCD3 % with phone
1960 N/A or no schooling 0.8826448
1960 Less than High School 0.8590952
1960 High school 0.9217318
1960 Some college 0.9581475
1960 College or more 0.9831610
1970 N/A or no schooling 0.9566989
1970 Less than High School 0.9458146
1970 High school 0.9698355
1970 Some college 0.9758722
1970 College or more 0.9888245

10.2 Plotting the Results

  • Like with tidycensus data, we can plot just one variable over time
# Plotting one variable year over year
ggplot(graph_data, aes(x = YEAR, y = `% with phone`)) +
    geom_point() +
    geom_line() +
    # Use variable description from metadata as caption
    labs(title = "Percent of Minnesota with phone line",
         subtitle = paste0("Data source: ", ddi$ipums_project),
         caption = paste(strwrap(ipums_var_desc(ddi, PHONE), 90), collapse = "\n"))

  • or use facet_wrap to create plots for different groups
# Plotting one variable year over year with individual plots for education level
ggplot(graph_data2, aes(x = YEAR, y = `% with phone`)) +
    geom_point() +
    geom_line() +
    labs(title = "Percent of Minnesota with Phone Line by Education",
         subtitle = paste0("Source: ", ddi$ipums_project)) +
    facet_wrap(~EDUCD3)
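  • An alternative to faceting (not in the original slides) is to map education level to color and overlay the groups on one plot:
# overlay education groups on a single plot instead of faceting
ggplot(graph_data2, aes(x = YEAR, y = `% with phone`, color = EDUCD3)) +
    geom_line() +
    labs(title = "Percent of Minnesota with Phone Line by Education",
         color = "Education")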

11 ipumsr and Mapping

  • IPUMS provides geographic boundary files for IPUMS USA and other sources
  • ipumsr provides support for both sf and sp data (we will use sf)
  • Load with the read_ipums_sf() function
# read in CONSPUMA shapefile from IPUMS
shape_data <- read_ipums_sf("shape/")
# view data structure
as_tibble(shape_data)
CONSPUMA STATEFIP State geometry
540 02 Alaska MULTIPOLYGON (((-3043869 34…
541 02 Alaska MULTIPOLYGON (((-2291901 23…
542 15 Hawaii MULTIPOLYGON (((-6021813 59…
491 51 Virginia MULTIPOLYGON (((1720696 104…
492 51 Virginia MULTIPOLYGON (((1745400 116…
493 51 Virginia MULTIPOLYGON (((1728178 119…
494 51 Virginia MULTIPOLYGON (((1717473 129…
495 53 Washington MULTIPOLYGON (((-1584859 14…
496 53 Washington MULTIPOLYGON (((-2031282 13…
497 53 Washington MULTIPOLYGON (((-1970094 14…
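  • Before mapping, it can be worth checking the projection of the boundary file; an optional quick check with sf:
# inspect the coordinate reference system of the IPUMS boundary file
sf::st_crs(shape_data)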

11.1 Joining Extract and Spatial Data

  • ipumsr has functions for merging extracts with spatial data
# group extract data by CONSPUMA and year, create weighted variable, and
# ungroup
conspuma_data <- data %>%
    group_by(CONSPUMA, YEAR) %>%
    summarize(`% with phone` = weighted.mean(PHONE2 == "Yes, phone available", PERWT,
        na.rm = TRUE), .groups = "drop")
CONSPUMA YEAR % with phone
245 1980 0.9641267
245 1990 0.9669292
245 2000 0.9859847
245 2005 0.9689021
245 2006 0.9566019
245 2007 0.9544157
245 2008 0.9850778
245 2009 0.9852029
245 2010 0.9831482
245 2011 0.9811322
  • Similar to the GEOID in ACS data, we can use the CONSPUMA code as our joining variable
# join extract data with spatial data using CONSPUMA id
conspuma_data_join <- ipums_shape_inner_join(conspuma_data, shape_data, by = "CONSPUMA")
CONSPUMA YEAR % with phone STATEFIP State geometry
245 1980 0.9641267 27 Minnesota MULTIPOLYGON (((62837.34 13…
245 1990 0.9669292 27 Minnesota MULTIPOLYGON (((62837.34 13…
245 2000 0.9859847 27 Minnesota MULTIPOLYGON (((62837.34 13…
245 2005 0.9689021 27 Minnesota MULTIPOLYGON (((62837.34 13…
245 2006 0.9566019 27 Minnesota MULTIPOLYGON (((62837.34 13…
245 2007 0.9544157 27 Minnesota MULTIPOLYGON (((62837.34 13…
245 2008 0.9850778 27 Minnesota MULTIPOLYGON (((62837.34 13…
245 2009 0.9852029 27 Minnesota MULTIPOLYGON (((62837.34 13…
245 2010 0.9831482 27 Minnesota MULTIPOLYGON (((62837.34 13…
245 2011 0.9811322 27 Minnesota MULTIPOLYGON (((62837.34 13…
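  • The same join can also be done with a standard dplyr join on the sf object; a sketch assuming the CONSPUMA key is stored with compatible types in both tables (one of the details ipums_shape_inner_join() smooths over for IPUMS data):
# alternative: join the extract summary onto the sf boundary data with dplyr
library(dplyr)
library(sf)
conspuma_data_join2 <- shape_data %>%
    inner_join(zap_labels(conspuma_data), by = "CONSPUMA")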

11.2 Mapping with ipumsr, ggplot2, and sf

  • Once the data is joined, we can filter it by year to measure change across decades
# filter data to show every decade
graph_conspuma <- conspuma_data_join %>%
    filter(YEAR %in% c(1980, 1990, 2000, 2010))
CONSPUMA YEAR % with phone STATEFIP State geometry
245 1980 0.9641267 27 Minnesota MULTIPOLYGON (((62837.34 13…
245 1990 0.9669292 27 Minnesota MULTIPOLYGON (((62837.34 13…
245 2000 0.9859847 27 Minnesota MULTIPOLYGON (((62837.34 13…
245 2010 0.9831482 27 Minnesota MULTIPOLYGON (((62837.34 13…
246 1980 0.9881295 27 Minnesota MULTIPOLYGON (((254440.1 86…
246 1990 0.9961230 27 Minnesota MULTIPOLYGON (((254440.1 86…
246 2000 0.9959761 27 Minnesota MULTIPOLYGON (((254440.1 86…
246 2010 0.9895193 27 Minnesota MULTIPOLYGON (((254440.1 86…
247 1980 0.9673522 27 Minnesota MULTIPOLYGON (((227254.6 12…
247 1990 0.9760921 27 Minnesota MULTIPOLYGON (((227254.6 12…
  • and visualize the share of Minnesota’s population with phone access
# color CONSPUMA polygons by the % with phone value,
# make one plot for each year,
# and remove grid lines and coordinates
ggplot(graph_conspuma, aes(fill = `% with phone`)) +
    geom_sf() +
    facet_wrap(~YEAR) +
    theme_void()
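  • As an optional refinement (not in the original slides), a continuous viridis fill scale often reads more clearly when most shares are close to 1:
# same map with a perceptually uniform color scale
ggplot(graph_conspuma, aes(fill = `% with phone`)) +
    geom_sf() +
    facet_wrap(~YEAR) +
    scale_fill_viridis_c() +
    theme_void()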

12 IPUMS API

  • Currently in internal testing
  • Beta testing before the end of 2021
  • IPUMS USA public launch early 2022
  • Will allow users to
    • Define and submit extract requests
    • Check extract status or “wait” for an extract to finish
    • Download completed extracts
    • Get info on past extracts
    • Share extract definitions

12.1 Other IPUMS Data Sources

  • NHGIS
  • Longitudinal and Employer-Household Dynamics (LEHD) Origin-Destination Employment Statistics (LODES) data
  • Small Area Health Insurance Estimates
  • Economic Census
  • Bureau of Labor Statistics
  • CPS
  • LAUS
  • USDA
  • Department of Housing and Urban Development Comprehensive Affordable Housing Strategy (CHAS)

13 References

14 Thank You!

  • Email: