Census Data and R
- There are many R packages designed to help analysts navigate this hierarchy
- Which data do I need? From where? From when?
- Downloading bulk files from the Census FTP site or using data.census.gov is not practical if you regularly work with Census data
- That leaves the Census API or third-party providers
- Choice depends on specific data needs (variables, geography, survey, scope, level of detail)
Census API? Use tidycensus and tigris

- Created by Kyle Walker (https://walker-data.com/)
- tidycensus
  - Allows R users to access Census data in a format that can be plugged into tidyverse workflows
  - Makes data wrangling, analysis, and visualization of Census data simpler and easier
- tigris
  - Pulls official geographic boundary shapefiles for mapping and spatial analysis
Getting started with tidycensus
- To use tidycensus, you will need to sign up for a free Census API Key from https://api.census.gov/data/key_signup.html and activate it
- After the key is installed in your .Renviron file, it no longer needs to be included in your code
install.packages("tidycensus", repos = "http://cran.us.r-project.org")
install.packages("tigris", repos = "http://cran.us.r-project.org")
library(tidycensus)
library(tigris)
# cache downloaded boundary files so they are not re-downloaded each session
options(tigris_use_cache = TRUE)
census_api_key("YOUR KEY GOES HERE", install = TRUE)
# only need to include install = TRUE once
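To confirm the key was stored, you can check the environment variable that tidycensus reads on startup (a quick sanity check; restart R first so .Renviron is reloaded):
# the key is saved as CENSUS_API_KEY in .Renviron
Sys.getenv("CENSUS_API_KEY")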
Pulling Data: Decennial Census
- get_decennial() queries data from the 2000, 2010, and 2020 decennial U.S. Censuses
- Users must include:
  - a geographic area
  - a vector of Census variable IDs
- The code below gets total population by state from the 2010 decennial Census
total_population_10 <- get_decennial(geography = "state", variables = "P001001",
year = 2010)
| GEOID | NAME | variable | value |
|---|---|---|---|
| 01 | Alabama | P001001 | 4779736 |
| 02 | Alaska | P001001 | 710231 |
| 04 | Arizona | P001001 | 6392017 |
| 05 | Arkansas | P001001 | 2915918 |
| 06 | California | P001001 | 37253956 |
| 22 | Louisiana | P001001 | 4533372 |
| 21 | Kentucky | P001001 | 4339367 |
| 08 | Colorado | P001001 | 5029196 |
| 09 | Connecticut | P001001 | 3574097 |
| 10 | Delaware | P001001 | 897934 |
Selecting Variables
- load_variables() returns a dataset of variables from the Census Bureau website and formats it for fast searching
- two required arguments:
  - year of the Census dataset or ACS sample
  - dataset name (sf1, sf3, pl, acs1, or acs5)
- For ACS datasets, append “/profile” for the Data Profile, and “/summary” for the Summary Tables.
- Users can include cache = TRUE in the function call to store the data in their cache directory for faster loading; a quick example of searching the loaded variables follows the code below
v19 <- load_variables(2019, "acs5", cache = TRUE)
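The resulting tibble can be searched with ordinary tidyverse verbs; a short sketch, assuming the v19 object above and the name, label, and concept columns returned by load_variables():
library(dplyr)
library(stringr)
# find variables whose label mentions median household income
v19 %>%
  filter(str_detect(label, "Median household income")) %>%
  select(name, label, concept)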
Querying Variables
- Variables from the ACS detailed tables, data profiles, summary tables, and supplemental estimates are available
- Supplying a table name to the table parameter returns all variables for that table
- The code below returns all variables associated with table B01001 (sex by age), from the 2015-2019 5-year ACS:
dp_table <- get_acs(geography = "state", table = "B01001", year = 2019)
| GEOID | NAME | variable | estimate | moe |
|---|---|---|---|---|
| 01 | Alabama | B01001_001 | 4876250 | NA |
| 01 | Alabama | B01001_002 | 2359355 | 1270 |
| 01 | Alabama | B01001_003 | 149090 | 704 |
| 01 | Alabama | B01001_004 | 153494 | 2290 |
| 01 | Alabama | B01001_005 | 158617 | 2274 |
| 01 | Alabama | B01001_006 | 98257 | 468 |
| 01 | Alabama | B01001_007 | 64980 | 834 |
| 01 | Alabama | B01001_008 | 35870 | 1436 |
| 01 | Alabama | B01001_009 | 35040 | 1472 |
| 01 | Alabama | B01001_010 | 95065 | 1916 |
Narrowing Our Query
- To get data from within specific states and/or counties, users can pass state and county names to their respective parameters
ga_income <- get_acs(geography = "county", variables = "B19013_001", state = "GA",
year = 2019)
| GEOID | NAME | variable | estimate | moe |
|---|---|---|---|---|
| 13001 | Appling County, Georgia | B19013_001 | 40304 | 5180 |
| 13003 | Atkinson County, Georgia | B19013_001 | 37197 | 3686 |
| 13005 | Bacon County, Georgia | B19013_001 | 37519 | 5492 |
| 13007 | Baker County, Georgia | B19013_001 | 32917 | 6967 |
| 13009 | Baldwin County, Georgia | B19013_001 | 43672 | 3736 |
| 13011 | Banks County, Georgia | B19013_001 | 47811 | 3240 |
| 13013 | Barrow County, Georgia | B19013_001 | 62345 | 2204 |
| 13015 | Bartow County, Georgia | B19013_001 | 57423 | 2743 |
| 13017 | Ben Hill County, Georgia | B19013_001 | 32229 | 3845 |
| 13019 | Berrien County, Georgia | B19013_001 | 40415 | 3024 |
fulton_income <- get_acs(geography = "tract", variables = "B19013_001", state = "GA",
county = "Fulton")
| GEOID | NAME | variable | estimate | moe |
|---|---|---|---|---|
| 13121000100 | Census Tract 1, Fulton County, Georgia | B19013_001 | 168396 | 18644 |
| 13121000200 | Census Tract 2, Fulton County, Georgia | B19013_001 | 158011 | 37856 |
| 13121000400 | Census Tract 4, Fulton County, Georgia | B19013_001 | 97257 | 30528 |
| 13121000500 | Census Tract 5, Fulton County, Georgia | B19013_001 | 90140 | 13217 |
| 13121000600 | Census Tract 6, Fulton County, Georgia | B19013_001 | 53686 | 9322 |
| 13121000700 | Census Tract 7, Fulton County, Georgia | B19013_001 | 84107 | 33365 |
| 13121001001 | Census Tract 10.01, Fulton County, Georgia | B19013_001 | 103846 | 10554 |
| 13121001002 | Census Tract 10.02, Fulton County, Georgia | B19013_001 | 31042 | 9051 |
| 13121001100 | Census Tract 11, Fulton County, Georgia | B19013_001 | 109426 | 7959 |
| 13121001201 | Census Tract 12.01, Fulton County, Georgia | B19013_001 | 80476 | 16521 |
Renaming Variables
- We can rename variables from within our function call by supplying our own names for the variable IDs
ga_rename <- get_acs(geography = "county", state = "Georgia", variables = c(medinc = "B19013_001",
medage = "B01002_001"), output = "wide")
| GEOID | NAME | medincE | medincM | medageE | medageM |
|---|---|---|---|---|---|
| 13005 | Bacon County, Georgia | 37519 | 5492 | 36.7 | 0.7 |
| 13025 | Brantley County, Georgia | 38857 | 3480 | 41.1 | 0.8 |
| 13017 | Ben Hill County, Georgia | 32229 | 3845 | 39.9 | 1.1 |
| 13033 | Burke County, Georgia | 44151 | 2438 | 37.4 | 0.6 |
| 13047 | Catoosa County, Georgia | 56235 | 2290 | 40.4 | 0.4 |
| 13053 | Chattahoochee County, Georgia | 47096 | 5158 | 24.5 | 0.5 |
| 13055 | Chattooga County, Georgia | 36807 | 2268 | 39.4 | 0.7 |
| 13073 | Columbia County, Georgia | 82339 | 3532 | 36.9 | 0.4 |
| 13087 | Decatur County, Georgia | 41481 | 3584 | 37.8 | 0.6 |
| 13115 | Floyd County, Georgia | 48336 | 2266 | 38.3 | 0.3 |
Choosing a Data Structure
- By default, tidycensus returns a tibble in “tidy” (or long) format:
  - Each observation forms a row
  - Each variable forms a column with one data type
  - Each observational unit forms a table
hhinc_tidy <- get_acs(geography = "state", table = "B19001", survey = "acs1", year = 2019)
| GEOID | NAME | variable | estimate | moe |
|---|---|---|---|---|
| 01 | Alabama | B19001_001 | 1897576 | 10370 |
| 01 | Alabama | B19001_002 | 154558 | 5883 |
| 01 | Alabama | B19001_003 | 103653 | 6001 |
| 01 | Alabama | B19001_004 | 108500 | 5926 |
| 01 | Alabama | B19001_005 | 98706 | 6491 |
| 01 | Alabama | B19001_006 | 90916 | 5859 |
| 01 | Alabama | B19001_007 | 105146 | 4149 |
| 01 | Alabama | B19001_008 | 85014 | 5417 |
| 01 | Alabama | B19001_009 | 87118 | 5163 |
| 01 | Alabama | B19001_010 | 82323 | 4231 |
- We can specify “wide” format if we prefer, with columns for each variable
hhinc_wide <- get_acs(geography = "state", table = "B19001", survey = "acs1", year = 2019,
output = "wide")
| GEOID | NAME | B19001_001E | B19001_001M | B19001_002E | B19001_002M | B19001_003E | B19001_003M | B19001_004E | B19001_004M | B19001_005E | B19001_005M | B19001_006E | B19001_006M | B19001_007E | B19001_007M | B19001_008E | B19001_008M | B19001_009E | B19001_009M | B19001_010E | B19001_010M | B19001_011E | B19001_011M | B19001_012E | B19001_012M | B19001_013E | B19001_013M | B19001_014E | B19001_014M | B19001_015E | B19001_015M | B19001_016E | B19001_016M | B19001_017E | B19001_017M |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 28 | Mississippi | 1100229 | 9887 | 102072 | 5891 | 72195 | 5563 | 74167 | 4559 | 63780 | 4127 | 64942 | 4703 | 58033 | 4432 | 55748 | 4483 | 51897 | 4901 | 43423 | 3597 | 86214 | 5966 | 105304 | 5907 | 122825 | 6357 | 72151 | 4143 | 47726 | 3282 | 44171 | 3353 | 35581 | 2826 |
| 29 | Missouri | 2458337 | 10406 | 155827 | 6687 | 111813 | 5577 | 115140 | 5621 | 122095 | 5489 | 111981 | 5528 | 121608 | 5756 | 113730 | 5553 | 116231 | 5683 | 104380 | 4723 | 198458 | 7477 | 260646 | 7646 | 318344 | 8489 | 216220 | 6681 | 135532 | 5371 | 129949 | 6087 | 126383 | 5183 |
| 30 | Montana | 437651 | 4637 | 23489 | 2194 | 21550 | 2415 | 21335 | 1726 | 23022 | 2418 | 18577 | 1955 | 21148 | 2408 | 20959 | 2769 | 21917 | 2316 | 20023 | 2670 | 36769 | 3056 | 44994 | 3051 | 58858 | 3855 | 39064 | 2723 | 23818 | 2653 | 23210 | 2351 | 18918 | 2211 |
| 31 | Nebraska | 771444 | 4475 | 37219 | 2805 | 30395 | 2679 | 26514 | 2433 | 35113 | 2753 | 30390 | 2643 | 36654 | 2792 | 32908 | 2509 | 37925 | 3333 | 31514 | 2681 | 64039 | 3492 | 86318 | 3863 | 111438 | 4655 | 75244 | 3690 | 47119 | 3029 | 47279 | 2722 | 41375 | 2731 |
| 32 | Nevada | 1143557 | 8281 | 68825 | 4164 | 45099 | 3502 | 39946 | 3403 | 50808 | 4299 | 44720 | 3607 | 48167 | 3496 | 46762 | 3920 | 51108 | 3854 | 51006 | 4531 | 90046 | 5207 | 123473 | 5807 | 159361 | 6715 | 116598 | 5686 | 64746 | 3977 | 69365 | 4085 | 73527 | 4035 |
| 33 | New Hampshire | 541396 | 5389 | 21229 | 2566 | 16539 | 2663 | 16216 | 2271 | 19054 | 2165 | 18377 | 2294 | 18368 | 2010 | 15983 | 1838 | 22257 | 2391 | 21203 | 2169 | 39230 | 3351 | 52252 | 3758 | 76433 | 4378 | 61021 | 4132 | 40422 | 3563 | 49858 | 3039 | 52954 | 3332 |
| 34 | New Jersey | 3286264 | 9466 | 154374 | 7219 | 95123 | 5069 | 91608 | 5576 | 113018 | 5398 | 96889 | 6271 | 112428 | 6077 | 98003 | 5978 | 104770 | 6257 | 93426 | 5753 | 202459 | 7998 | 290552 | 8825 | 401067 | 11429 | 343771 | 10134 | 250060 | 8064 | 338989 | 9700 | 499727 | 8735 |
| 35 | New Mexico | 793420 | 7074 | 65613 | 4413 | 42519 | 3479 | 41543 | 3752 | 48909 | 4239 | 41728 | 4027 | 39967 | 3532 | 32647 | 3073 | 37765 | 2701 | 28960 | 3159 | 63845 | 4301 | 79144 | 4473 | 96945 | 5032 | 60173 | 4229 | 38540 | 3416 | 39437 | 3428 | 35685 | 2595 |
| 36 | New York | 7446812 | 13820 | 497771 | 10405 | 329325 | 10854 | 288805 | 8790 | 286239 | 9953 | 261940 | 10138 | 277219 | 9361 | 246739 | 9135 | 268533 | 10075 | 242835 | 9161 | 472300 | 12232 | 658024 | 13405 | 883154 | 14989 | 709376 | 12558 | 488838 | 11556 | 634178 | 12227 | 901536 | 12818 |
| 37 | North Carolina | 4046348 | 15467 | 251158 | 9332 | 187653 | 8080 | 187757 | 7905 | 200849 | 7810 | 196035 | 8288 | 194732 | 6745 | 187488 | 8731 | 195142 | 7604 | 176142 | 7460 | 317444 | 9175 | 412432 | 12009 | 513941 | 13478 | 350665 | 10038 | 207679 | 8060 | 224461 | 7108 | 242770 | 6452 |
Manipulating Data
- We can refine our queries even more by using pipes and tidyverse functions
  - Filter based on certain conditions
  - Rearrange and sort the data
  - Create new variables
  - Generate summary statistics
library(tidyverse)
library(dplyr)
median_age <- get_acs(geography = "county", variables = "B01002_001", year = 2019) %>%
# arrange data by largest estimate value to smallest
arrange(desc(estimate)) %>%
# only records with median age 50 or older
filter(estimate >= 50) %>%
# create individual county and state columns
separate(NAME, into = c("county", "state"), sep = ", ")
| GEOID | county | state | variable | estimate | moe |
|---|---|---|---|---|---|
| 12119 | Sumter County | Florida | B01002_001 | 67.4 | 0.2 |
| 51091 | Highland County | Virginia | B01002_001 | 60.9 | 3.5 |
| 08027 | Custer County | Colorado | B01002_001 | 59.7 | 2.6 |
| 12015 | Charlotte County | Florida | B01002_001 | 59.1 | 0.2 |
| 41069 | Wheeler County | Oregon | B01002_001 | 59.0 | 3.3 |
| 51133 | Northumberland County | Virginia | B01002_001 | 58.9 | 0.7 |
| 26131 | Ontonagon County | Michigan | B01002_001 | 58.6 | 0.4 |
| 35021 | Harding County | New Mexico | B01002_001 | 58.5 | 5.5 |
| 53031 | Jefferson County | Washington | B01002_001 | 58.3 | 0.7 |
| 26001 | Alcona County | Michigan | B01002_001 | 58.2 | 0.3 |
Summary Variables
- tidycensus can calculate summary variables for us
- These can be useful in calculating percentages and totals
# useful to assign lengthy and/or frequently used sets of variables to a vector
race_vars <- c(White = "B03002_003", Black = "B03002_004", Native = "B03002_005",
Asian = "B03002_006", HIPI = "B03002_007", Hispanic = "B03002_012")
# population estimates by race for all GA counties, with the total population as a summary variable
ga_race <- get_acs(geography = "county", state = "GA", variables = race_vars, summary_var = "B03002_001")
| GEOID | NAME | variable | estimate | moe | summary_est | summary_moe |
|---|---|---|---|---|---|---|
| 13001 | Appling County, Georgia | White | 12745 | 21 | 18440 | NA |
| 13001 | Appling County, Georgia | Black | 3555 | 209 | 18440 | NA |
| 13001 | Appling County, Georgia | Native | 94 | 140 | 18440 | NA |
| 13001 | Appling County, Georgia | Asian | 74 | 111 | 18440 | NA |
| 13001 | Appling County, Georgia | HIPI | 0 | 21 | 18440 | NA |
| 13001 | Appling County, Georgia | Hispanic | 1830 | NA | 18440 | NA |
| 13003 | Atkinson County, Georgia | White | 4657 | 18 | 8239 | NA |
| 13003 | Atkinson County, Georgia | Black | 1434 | 112 | 8239 | NA |
| 13003 | Atkinson County, Georgia | Native | 81 | 111 | 8239 | NA |
| 13003 | Atkinson County, Georgia | Asian | 0 | 19 | 8239 | NA |
Creating New Variables
- We can use mutate() from dplyr to create new variables, like creating percentages from population estimates by race
ga_race_percent <- ga_race %>%
# divide estimates for each observation by the summary estimate
mutate(percent = 100 * (estimate/summary_est)) %>%
select(NAME, variable, percent)
| NAME | variable | percent |
|---|---|---|
| Appling County, Georgia | White | 69.1160521 |
| Appling County, Georgia | Black | 19.2787419 |
| Appling County, Georgia | Native | 0.5097614 |
| Appling County, Georgia | Asian | 0.4013015 |
| Appling County, Georgia | HIPI | 0.0000000 |
| Appling County, Georgia | Hispanic | 9.9240781 |
| Atkinson County, Georgia | White | 56.5238500 |
| Atkinson County, Georgia | Black | 17.4050249 |
| Atkinson County, Georgia | Native | 0.9831290 |
| Atkinson County, Georgia | Asian | 0.0000000 |
Analysis
Creating and Comparing Groups
- Once data is loaded and processed, we can begin our analysis
- This analysis often involves creating and/or comparing groups
  - Demographic groups described by variables, geographic areas, time periods, etc.
- The structure of tidycensus data makes it simple to identify and create groups
- In this example, we will obtain county income data from Alabama and then create new income brackets
# household income distribution (table B19001) for Alabama counties
al_hh_income <- get_acs(geography = "county", table = "B19001", state = "AL", year = 2019)
| GEOID | NAME | variable | estimate | moe |
|---|---|---|---|---|
| 01001 | Autauga County, Alabama | B19001_001 | 21397 | 325 |
| 01001 | Autauga County, Alabama | B19001_002 | 1417 | 254 |
| 01001 | Autauga County, Alabama | B19001_003 | 1172 | 249 |
| 01001 | Autauga County, Alabama | B19001_004 | 1337 | 248 |
| 01001 | Autauga County, Alabama | B19001_005 | 882 | 214 |
| 01001 | Autauga County, Alabama | B19001_006 | 985 | 245 |
| 01001 | Autauga County, Alabama | B19001_007 | 699 | 210 |
| 01001 | Autauga County, Alabama | B19001_008 | 1050 | 299 |
| 01001 | Autauga County, Alabama | B19001_009 | 995 | 234 |
| 01001 | Autauga County, Alabama | B19001_010 | 676 | 180 |
- Tables will often have variables with a variety of units
- Here, we filter out the total households variable before recategorizing our income variables
al_hh_income_recode <- al_hh_income %>%
# remove total households variable
filter(variable != "B19001_001") %>%
# combine categories into households making less than $35,000, between
# $35,000 and $75,000, and over $75,000
mutate(incgroup = case_when(variable < "B19001_008" ~ "below35k", variable < "B19001_013" ~
"bw35kand75k", TRUE ~ "above75k"))
| GEOID | NAME | variable | estimate | moe | incgroup |
|---|---|---|---|---|---|
| 01001 | Autauga County, Alabama | B19001_002 | 1417 | 254 | below35k |
| 01001 | Autauga County, Alabama | B19001_003 | 1172 | 249 | below35k |
| 01001 | Autauga County, Alabama | B19001_004 | 1337 | 248 | below35k |
| 01001 | Autauga County, Alabama | B19001_005 | 882 | 214 | below35k |
| 01001 | Autauga County, Alabama | B19001_006 | 985 | 245 | below35k |
| 01001 | Autauga County, Alabama | B19001_007 | 699 | 210 | below35k |
| 01001 | Autauga County, Alabama | B19001_008 | 1050 | 299 | bw35kand75k |
| 01001 | Autauga County, Alabama | B19001_009 | 995 | 234 | bw35kand75k |
| 01001 | Autauga County, Alabama | B19001_010 | 676 | 180 | bw35kand75k |
| 01001 | Autauga County, Alabama | B19001_011 | 1620 | 351 | bw35kand75k |
- We can pass our groups to summarize() to calculate the estimated population size of each income bracket
al_group_sums <- al_hh_income_recode %>%
# group by new income brackets
group_by(GEOID, incgroup) %>%
# get new sums for each group
summarize(estimate = sum(estimate))
| GEOID | incgroup | estimate |
|---|---|---|
| 01001 | above75k | 8367 |
| 01001 | below35k | 6492 |
| 01001 | bw35kand75k | 6538 |
| 01003 | above75k | 31509 |
| 01003 | below35k | 23248 |
| 01003 | bw35kand75k | 26173 |
| 01005 | above75k | 1855 |
| 01005 | below35k | 4920 |
| 01005 | bw35kand75k | 2570 |
| 01007 | above75k | 2139 |
Time-Series
- We will often need to analyze data across a range of years
- Rather than getting samples for each year separately and combining them, we can tell get_acs() what years we want and return a single dataset with all of them
college_vars <- c("B15002_015", "B15002_016", "B15002_017", "B15002_018", "B15002_032",
"B15002_033", "B15002_034", "B15002_035")
# create named vector of years from 2010-2019
years <- 2010:2019
names(years) <- years
# map_dfr() maps the get_acs() function to each year, resulting in data with
# estimates for each county for each specified year
college_by_year <- map_dfr(years, ~{
get_acs(geography = "county", variables = college_vars, state = "FL", summary_var = "B15002_001",
survey = "acs1", year = .x)
# organize data by year
}, .id = "year") %>%
# arrange data by place, variable, then year
arrange(NAME, variable, year)
| year | GEOID | NAME | variable | estimate | moe | summary_est | summary_moe |
|---|---|---|---|---|---|---|---|
| 2010 | 12001 | Alachua County, Florida | B15002_015 | 13368 | 1780 | 145526 | 564 |
| 2011 | 12001 | Alachua County, Florida | B15002_015 | 13472 | 1840 | 144360 | 1607 |
| 2012 | 12001 | Alachua County, Florida | B15002_015 | 14944 | 1985 | 148044 | 966 |
| 2013 | 12001 | Alachua County, Florida | B15002_015 | 14107 | 1929 | 149906 | 226 |
| 2014 | 12001 | Alachua County, Florida | B15002_015 | 14951 | 1926 | 151519 | 571 |
| 2015 | 12001 | Alachua County, Florida | B15002_015 | 16416 | 1969 | 156666 | 159 |
| 2016 | 12001 | Alachua County, Florida | B15002_015 | 14255 | 2034 | 159567 | 342 |
| 2017 | 12001 | Alachua County, Florida | B15002_015 | 15048 | 2047 | 161896 | 351 |
| 2018 | 12001 | Alachua County, Florida | B15002_015 | 18820 | 1971 | 167364 | 2346 |
| 2019 | 12001 | Alachua County, Florida | B15002_015 | 19122 | 2058 | 165792 | 708 |
Visualization
- Another advantage of tidycensus is that it yields data that can quickly be supplied to ggplot for visualization
- This allows for the efficient creation of both exploratory and informational plots
- Here, we use a bar chart to visualize public transit commuting in the 20 largest U.S. metro areas
metro_transit <- get_acs(
# core-based statistical area
geography = "cbsa",
# percentage of people who take public transit
variables = "DP03_0021P",
# total population for selecting only largest metros
summary_var = "B01003_001",
survey = "acs1",
year = 2019
) %>%
# slice the data to include only the largest metros by population
slice_max(summary_est, n = 20)
# simple bar chart
metrobar <- ggplot(metro_transit, aes(x = NAME, y = estimate)) +
geom_col()

Refining Plots
- Less time spent on acquiring data leaves more time for refining our visualizations for sharing and publication
# Use just the primary city in the metro for the label
metrobarfancy <- metro_transit %>%
mutate(NAME = str_remove(NAME, "-.*$")) %>%
mutate(NAME = str_remove(NAME, ",.*$")) %>%
ggplot(aes(y = reorder(NAME, estimate), x = estimate)) + geom_col(color = "navy",
fill = "navy", alpha = 0.5, width = 0.85) + theme_minimal(base_size = 12) + scale_x_continuous(breaks = c(0,
10, 20, 30), labels = c("0%", "10%", "20%", "30%")) + labs(title = "Metro Public Transit Commuting",
subtitle = "2019 ACS 1-year Estimates", y = "", x = "Share of All Commuters Using Public Transit",
caption = "Source: U.S. Census Bureau via the tidycensus R Package")

Visualizing Margin of Error
- The inclusion of the margin of error in ACS data makes it simple to examine the reliability of estimates
- This will be particularly useful given concerns about data quality from the 2020 Census
- Here, we look at median household income estimates in Louisiana parishes
# median household income for Louisiana parishes (county equivalents)
la_income <- get_acs(state = "Louisiana", geography = "county", variables = c(hhincome = "B19013_001"),
year = 2019) %>%
# simplify parish name
mutate(NAME = str_remove(NAME, " Parish, Louisiana")) %>%
# arrange by margin of error largest to smallest
arrange(desc(moe))
| GEOID | NAME | variable | estimate | moe |
|---|---|---|---|---|
| 22023 | Cameron Parish, Louisiana | hhincome | 53423 | 11624 |
| 22025 | Catahoula Parish, Louisiana | hhincome | 40129 | 9276 |
| 22007 | Assumption Parish, Louisiana | hhincome | 43759 | 7458 |
| 22091 | St. Helena Parish, Louisiana | hhincome | 43886 | 7437 |
| 22021 | Caldwell Parish, Louisiana | hhincome | 37691 | 7351 |
| 22059 | LaSalle Parish, Louisiana | hhincome | 42104 | 6936 |
| 22123 | West Carroll Parish, Louisiana | hhincome | 38500 | 6370 |
| 22029 | Concordia Parish, Louisiana | hhincome | 32500 | 6319 |
| 22127 | Winn Parish, Louisiana | hhincome | 38353 | 5742 |
| 22125 | West Feliciana Parish, Louisiana | hhincome | 59637 | 5661 |
- ggplot has features that allow for the creation of confidence intervals around our plots
# reorder by estimate, add error bar to show confidence interval from margin of
# error
la_income_moe <- ggplot(la_income, aes(x = estimate, y = reorder(NAME, estimate))) +
geom_errorbarh(aes(xmin = estimate - moe, xmax = estimate + moe)) + geom_point(size = 3,
color = "darkgreen") + theme_minimal(base_size = 12.5) + labs(title = "Median Household Income",
subtitle = "Counties in Louisiana", x = "2015-2019 ACS 5-year estimate", y = "") +
scale_x_continuous(labels = scales::dollar)

Showing Change Over Time
- We may also want to see how a variable and/or the margin of error changes over time
- Here we create another multi-year dataset for median home values in Fulton County, GA
# supply years we want samples for
years <- 2005:2019
names(years) <- years
# use map_dfr() to supply each year to get_acs() and return median home value
# by county for 2005-2019
fulton_value <- map_dfr(years, ~{
get_acs(geography = "county", variables = "B25077_001", state = "GA", county = "Fulton",
year = .x, survey = "acs1")
}, .id = "year")
| year | GEOID | NAME | variable | estimate | moe |
|---|---|---|---|---|---|
| 2005 | 13121 | Fulton County, Georgia | B25077_001 | 243600 | 7046 |
| 2006 | 13121 | Fulton County, Georgia | B25077_001 | 270000 | 9456 |
| 2007 | 13121 | Fulton County, Georgia | B25077_001 | 267800 | 12865 |
| 2008 | 13121 | Fulton County, Georgia | B25077_001 | 284400 | 8583 |
| 2009 | 13121 | Fulton County, Georgia | B25077_001 | 257700 | 8968 |
| 2010 | 13121 | Fulton County, Georgia | B25077_001 | 244300 | 7932 |
| 2011 | 13121 | Fulton County, Georgia | B25077_001 | 230500 | 9223 |
| 2012 | 13121 | Fulton County, Georgia | B25077_001 | 229900 | 8839 |
| 2013 | 13121 | Fulton County, Georgia | B25077_001 | 231800 | 8471 |
| 2014 | 13121 | Fulton County, Georgia | B25077_001 | 248800 | 9176 |
- We can see how home values changed from 2005 to 2019 along with the margin of error
# group = 1 overrides the default grouping so ggplot treats the dataset as a
# whole
fulton_timeseries <- ggplot(fulton_value, aes(x = year, y = estimate, group = 1)) +
geom_ribbon(aes(ymax = estimate + moe, ymin = estimate - moe), fill = "navy",
alpha = 0.4) + geom_line(color = "navy") + geom_point(color = "navy", size = 2) +
theme_minimal(base_size = 12) + scale_y_continuous(labels = scales::dollar) +
labs(title = "Median Home Value in Fulton County, GA", x = "Year", y = "AMedian Home Value",
caption = "Shaded Area Represents the Margin of Error of the ACS Estimate")

Group Visualizations
- We created groups from income levels earlier; the ability to filter our geographies by state and/or county also lets us compare groups of geographies
- Here we look at median home values across six core Atlanta metro counties
housing_val <- get_acs(
geography = "tract",
# median home value
variables = "B25077_001",
state = "GA",
# only these counties
county = c(
"Fulton",
"DeKalb",
"Clayton",
"Cobb",
"Gwinnett",
"Henry"
)
) %>%
# separate NAME column into columns for tract, county, and state
separate(
NAME,
into = c("tract", "county", "state"),
sep = ", "
)
| GEOID | tract | county | state | variable | estimate | moe |
|---|---|---|---|---|---|---|
| 13063040202 | Census Tract 402.02 | Clayton County | Georgia | B25077_001 | 82600 | 10685 |
| 13063040203 | Census Tract 402.03 | Clayton County | Georgia | B25077_001 | 135900 | 21955 |
| 13063040204 | Census Tract 402.04 | Clayton County | Georgia | B25077_001 | 118000 | 9363 |
| 13063040302 | Census Tract 403.02 | Clayton County | Georgia | B25077_001 | 60300 | 7287 |
| 13063040303 | Census Tract 403.03 | Clayton County | Georgia | B25077_001 | 66400 | 9785 |
| 13063040306 | Census Tract 403.06 | Clayton County | Georgia | B25077_001 | 66400 | 30049 |
| 13063040307 | Census Tract 403.07 | Clayton County | Georgia | B25077_001 | 89900 | 10646 |
| 13063040308 | Census Tract 403.08 | Clayton County | Georgia | B25077_001 | 59000 | 19547 |
| 13063040407 | Census Tract 404.07 | Clayton County | Georgia | B25077_001 | 94700 | 10923 |
| 13063040408 | Census Tract 404.08 | Clayton County | Georgia | B25077_001 | 119600 | 16981 |
- ggplot allows us to quickly generate multiple comparisons
- We can display groups on one plot using color
# density curve colored by county
county_density <- ggplot(housing_val, aes(x = estimate, fill = county)) + geom_density(alpha = 0.3)

- Or we can create plots for each county using facet_wrap()
wrapped_density <- ggplot(housing_val, aes(x = estimate)) + geom_density(fill = "darkgreen",
color = "darkgreen", alpha = 0.5) + facet_wrap(~county) + scale_x_continuous(labels = function(x) paste0("$",
x/1000, "k")) + theme_minimal(base_size = 14) + theme(axis.text.y = element_blank()) +
labs(x = "Median Home Value", y = "", title = "Home Values by Census Tract, 2015-2019 ACS")

Bringing in Geographic Data
- Census and ACS data are associated with and aggregated to geographies:
  - Legal entities (states and counties)
  - Statistical entities (census tracts and block groups)
  - Geographic features (roads and water features)
- Shapefiles are available from the U.S. Census Bureau’s TIGER/Line database (Topologically Integrated Geographic Encoding and Referencing)
- Ready for mapping and spatial analysis
- The tigris package allows us to access these files programmatically (a small feature-data example follows this list)
- The simple features (sf) package represents spatial data much like an R data frame, but with a special geometry column that describes the shape of each feature
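Besides boundaries, tigris also has functions for geographic features such as roads and water; a small sketch using its roads() and area_water() functions (the county choice here is just an example):
library(tigris)
# line and water features for Fulton County, GA
fulton_roads <- roads("GA", "Fulton")
fulton_water <- area_water("GA", "Fulton")
# quick look at the road network geometry
plot(fulton_roads$geometry)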
Structure of sf objects from tigris
- sf objects from tigris contain geographic identifiers, names, coordinate information, and land and water area
- The id and name columns can be used for joining with tidycensus data
library(tigris)
# get boundaries of all Florida counties
fl_counties <- counties("FL")
# print the sf object
fl_counties
## Simple feature collection with 67 features and 17 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -87.6349 ymin: 24.39631 xmax: -79.97431 ymax: 31.00097
## CRS: 4269
## First 10 features:
## STATEFP COUNTYFP COUNTYNS GEOID NAME NAMELSAD LSAD CLASSFP
## 72 12 053 00295751 12053 Hernando Hernando County 06 H1
## 73 12 129 00306912 12129 Wakulla Wakulla County 06 H1
## 74 12 131 00295727 12131 Walton Walton County 06 H1
## 79 12 127 00306921 12127 Volusia Volusia County 06 H1
## 196 12 051 00307626 12051 Hendry Hendry County 06 H1
## 227 12 095 00295750 12095 Orange Orange County 06 H1
## 284 12 011 00295753 12011 Broward Broward County 06 H1
## 312 12 099 00295761 12099 Palm Beach Palm Beach County 06 H1
## 367 12 041 00303633 12041 Gilchrist Gilchrist County 06 H1
## 371 12 086 00295755 12086 Miami-Dade Miami-Dade County 06 H1
## MTFCC CSAFP CBSAFP METDIVFP FUNCSTAT ALAND AWATER INTPTLAT
## 72 G4020 <NA> 45300 <NA> A 1224962206 300591492 +28.5730426
## 73 G4020 <NA> 45220 <NA> A 1570615733 334902363 +30.1394320
## 74 G4020 <NA> 18880 <NA> A 2687740188 522795160 +30.6312106
## 79 G4020 422 19660 <NA> A 2852204344 857791723 +29.0577690
## 196 G4020 163 17500 <NA> A 2994619266 87037213 +26.5399670
## 227 G4020 422 36740 <NA> A 2335820971 262462219 +28.5143906
## 284 G4020 370 33100 22744 A 3115289087 310801098 +26.1935353
## 312 G4020 370 33100 48424 A 5087468869 1084901519 +26.6491257
## 367 G4020 264 23540 <NA> A 905705573 14362143 +29.7234556
## 371 G4020 370 33100 33124 A 4920565755 1376144237 +25.6105799
## INTPTLON geometry
## 72 -082.4662272 MULTIPOLYGON (((-82.25329 2...
## 73 -084.3748463 MULTIPOLYGON (((-84.50434 3...
## 74 -086.1766139 MULTIPOLYGON (((-86.3916 30...
## 79 -081.1617920 MULTIPOLYGON (((-81.5028 29...
## 196 -081.1521142 MULTIPOLYGON (((-81.27172 2...
## 227 -081.3232839 MULTIPOLYGON (((-81.65739 2...
## 284 -080.4766834 MULTIPOLYGON (((-80.29699 2...
## 312 -080.4483542 MULTIPOLYGON (((-80.4824 26...
## 367 -082.7958011 MULTIPOLYGON (((-82.95908 2...
## 371 -080.4970989 MULTIPOLYGON (((-80.44093 2...
## [1] "sf" "data.frame"
# plot geometry
plot(fl_counties$geometry)
plot(sf::st_geometry(fl_counties))
- Using the sf object’s geometry column or the st_geometry() function returns the same result

Mapping Census Data
- Combining tidycensus data with tigris geographies offers many mapping possibilities
- It is often useful to examine basic geometry plots before adding data
library(ggplot2)
# cowplot allows us to combine separate plots
library(cowplot)
# get Broward County, FL census tracts from tigris
broward_tracts <- tracts("FL", "Broward")
# get Broward block groups from tigris
broward_block_groups <- block_groups("FL", "Broward")
# plot Broward census tracts with ggplot
gg1 <- ggplot(broward_tracts) + geom_sf() + theme_void() + labs(title = "Census tracts")
# plot Broward block groups with ggplot
gg2 <- ggplot(broward_block_groups) + geom_sf() + theme_void() + labs(title = "Block groups")
# combine two plots into one grid with cowplot
cowplot::plot_grid(gg1, gg2)

TIGER/Line vs. Cartographic Boundaries
- TIGER/Line shapefiles and cartographic boundary files from tigris have different levels of detail; cb = TRUE returns the generalized cartographic boundaries (the two are compared side by side below)
ms_counties <- counties("MS")
ms_counties_cb <- counties("MS", cb = TRUE)
ms_tiger_gg <- ggplot(ms_counties) + geom_sf() + theme_void() + labs(title = "TIGER/Line")
ms_cb_gg <- ggplot(ms_counties_cb) + geom_sf() + theme_void() + labs(title = "Cartographic boundary")
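As before, the two plots can be placed side by side with cowplot (loaded earlier) to compare the boundary detail:
cowplot::plot_grid(ms_tiger_gg, ms_cb_gg)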

Adding Census Data
- If we pass geometry = TRUE to tidycensus, we can pass the data to ggplot for mapping
- geom_sf() reads and plots geographic coordinates from sf objects
- Here, we map census tracts in Montgomery County, Alabama and shade them by Black population
library(tidycensus)
library(tidyverse)
library(tigris)
options(tigris_use_cache = TRUE)
racevars <- c(White = "B03002_003", Black = "B03002_004", Asian = "B03002_006", Hispanic = "B03002_012")
montgomery <- get_acs(state = "AL", county = "Montgomery", geography = "tract", variables = racevars,
geometry = TRUE, summary_var = "B03002_001")
head(montgomery)
# map of Montgomery County tracts shaded by estimated Black population
montgomery_map <- montgomery %>%
filter(variable == "Black") %>%
ggplot(aes(fill = estimate)) + geom_sf(color = NA) + scale_fill_viridis_c(option = "magma") +
theme_void()

# create different map for each race showing population share
montgomery_facetmap <- montgomery %>%
mutate(pct = 100 * (estimate/summary_est)) %>%
ggplot(aes(fill = pct)) + facet_wrap(~variable) + geom_sf(color = NA) + scale_fill_viridis_c() +
theme_void()
- Just as we split an earlier plot by county, we can use facet_wrap to split this plot by each racial group contained in the data

Modeling
- tidycensus also makes it easy to model the characteristics of our data (a simple regression sketch follows the correlation analysis below)
- First, we pull data for metro Miami counties
library(tidycensus)
library(sf)
# counties of interest
mia_counties <- c("Miami-Dade", "Broward", "Palm Beach")
# labeled vector of variables
variables_to_get <- c(median_value = "B25077_001", median_rooms = "B25018_001", median_income = "DP03_0062",
total_population = "B01003_001", median_age = "B01002_001", pct_college = "DP02_0068P",
pct_foreign_born = "DP02_0094P", pct_white = "DP05_0077P", median_year_built = "B25037_001",
percent_ooh = "DP04_0046P")
mia_data <- get_acs(geography = "tract", variables = variables_to_get, state = "FL",
county = mia_counties, geometry = TRUE, output = "wide", year = 2019) %>%
select(-NAME)
| GEOID | median_valueE | median_valueM | median_roomsE | median_roomsM | total_populationE | total_populationM | median_ageE | median_ageM | median_year_builtE | median_year_builtM | median_incomeE | median_incomeM | pct_collegeE | pct_collegeM | pct_foreign_bornE | pct_foreign_bornM | pct_whiteE | pct_whiteM | percent_oohE | percent_oohM | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12011010102 | 440900 | 30194 | 5.2 | 0.3 | 2898 | 242 | 49.4 | 2.8 | 1966 | 2 | 92734 | 17427 | 46.1 | 7.8 | 306 | NA | 91.7 | 3.6 | 84.6 | 5.5 | MULTIPOLYGON (((-80.09418 2… |
| 12011010103 | 348000 | 23554 | 4.4 | 0.2 | 3144 | 298 | 58.5 | 1.6 | 1967 | 2 | 69583 | 18900 | 41.0 | 6.6 | 419 | NA | 88.9 | 4.0 | 75.3 | 6.0 | MULTIPOLYGON (((-80.09123 2… |
| 12011010104 | 345500 | 32173 | 3.7 | 0.2 | 2156 | 331 | 57.0 | 3.8 | 1972 | 5 | 55612 | 6151 | 47.8 | 8.6 | 314 | NA | 84.6 | 6.5 | 61.9 | 9.7 | MULTIPOLYGON (((-80.08109 2… |
| 12011010200 | 166400 | 26243 | 4.3 | 0.2 | 6462 | 685 | 43.9 | 3.2 | 1973 | 2 | 48442 | 3829 | 25.9 | 6.3 | 2815 | NA | 65.6 | 8.0 | 58.0 | 8.1 | MULTIPOLYGON (((-80.10748 2… |
| 12011010304 | 193600 | 13564 | 4.8 | 0.3 | 4584 | 756 | 34.3 | 3.9 | 1974 | 2 | 47168 | 7151 | 8.6 | 4.1 | 1071 | NA | 8.3 | 4.3 | 55.4 | 8.4 | MULTIPOLYGON (((-80.11944 2… |
| 12011010305 | 96900 | 22040 | 3.8 | 0.2 | 4319 | 587 | 35.7 | 4.6 | 1985 | 2 | 44193 | 5169 | 18.0 | 7.1 | 1482 | NA | 38.6 | 10.5 | 12.5 | 3.6 | MULTIPOLYGON (((-80.12199 2… |
| 12011010306 | 126700 | 16456 | 4.1 | 0.2 | 2708 | 354 | 41.4 | 5.8 | 1973 | 3 | 40843 | 6934 | 21.3 | 7.2 | 476 | NA | 25.5 | 5.9 | 42.9 | 6.1 | MULTIPOLYGON (((-80.10789 2… |
| 12011010307 | 165900 | 13992 | 4.1 | 0.3 | 4028 | 353 | 38.6 | 4.9 | 1980 | 2 | 38105 | 4475 | 10.7 | 3.7 | 802 | NA | 27.7 | 5.5 | 23.7 | 6.3 | MULTIPOLYGON (((-80.12833 2… |
| 12011010308 | 296700 | 28258 | 4.3 | 0.2 | 5088 | 547 | 47.5 | 6.3 | 1986 | 2 | 64821 | 12747 | 41.1 | 7.8 | 1186 | NA | 68.1 | 10.0 | 43.2 | 6.7 | MULTIPOLYGON (((-80.15345 2… |
| 12011010401 | 333700 | 12144 | 5.3 | 0.3 | 4973 | 331 | 38.6 | 3.4 | 1986 | 2 | 76940 | 7263 | 34.3 | 5.3 | 1146 | NA | 64.6 | 7.8 | 70.0 | 6.2 | MULTIPOLYGON (((-80.17015 2… |
Examining the Data
- Then we create a map of median home values and visualize its distribution with a histogram
library(cowplot)
mhv_map <- ggplot(mia_data, aes(fill = median_valueE)) + geom_sf(color = NA) + scale_fill_viridis_c(labels = scales::dollar) +
theme_void() + labs(fill = "Median home value")
mhv_histogram <- ggplot(mia_data, aes(x = median_valueE)) + geom_histogram(alpha = 0.5,
fill = "navy", color = "navy", bins = 100) + theme_minimal() + scale_x_continuous(labels = scales::label_number_si(accuracy = NULL)) +
labs(x = "Median home value")
cowplot::plot_grid(mhv_map, mhv_histogram)

Comparing Distributions
- As the data is skewed, we can apply a log transformation to the data in both forms
library(tidyverse)
library(cowplot)
mhv_map_log <- ggplot(mia_data, aes(fill = log(median_valueE))) + geom_sf(color = NA) +
scale_fill_viridis_c() + theme_void() + labs(fill = "Median home\nvalue (log)")
mhv_histogram_log <- ggplot(mia_data, aes(x = log(median_valueE))) + geom_histogram(alpha = 0.5,
fill = "navy", color = "navy", bins = 100) + theme_minimal() + scale_x_continuous() +
labs(x = "Median home value (log)")
plot_grid(mhv_map_log, mhv_histogram_log)

Looking at Correlations
- We may also want to know how variables in our data are correlated
- Compatibility with tidyverse functions makes it easy to drop unnecessary variables, remove NA values, and format variables of interest for calculating correlations
library(corrr)
library(units)
mia_data_for_model <- mia_data %>%
# use area variable to calculate population density, specify units as square kilometers
mutate(pop_density = as.numeric(set_units(total_populationE / st_area(.), "1/km2")),
# structure age = 2017 (midpoint of the 2015-2019 sample range) minus median year built
median_structure_age = 2017 - median_year_builtE) %>%
# select estimates but not margins of error
select(!ends_with("M")) %>%
# remove "E" at end of variable names
rename_with(.fn = ~str_remove(.x, "E$")) %>%
# remove NA values
na.omit()
| GEOID | median_value | median_rooms | total_population | median_age | median_year_built | median_income | pct_college | pct_foreign_born | pct_white | percent_ooh | geometry | pop_density | median_structure_age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12011010102 | 440900 | 5.2 | 2898 | 49.4 | 1966 | 92734 | 46.1 | 306 | 91.7 | 84.6 | MULTIPOLYGON (((-80.09418 2… | 1726.025 | 51 |
| 12011010103 | 348000 | 4.4 | 3144 | 58.5 | 1967 | 69583 | 41.0 | 419 | 88.9 | 75.3 | MULTIPOLYGON (((-80.09123 2… | 1371.705 | 50 |
| 12011010104 | 345500 | 3.7 | 2156 | 57.0 | 1972 | 55612 | 47.8 | 314 | 84.6 | 61.9 | MULTIPOLYGON (((-80.08109 2… | 2605.186 | 45 |
| 12011010200 | 166400 | 4.3 | 6462 | 43.9 | 1973 | 48442 | 25.9 | 2815 | 65.6 | 58.0 | MULTIPOLYGON (((-80.10748 2… | 2344.356 | 44 |
| 12011010304 | 193600 | 4.8 | 4584 | 34.3 | 1974 | 47168 | 8.6 | 1071 | 8.3 | 55.4 | MULTIPOLYGON (((-80.11944 2… | 3985.536 | 43 |
| 12011010305 | 96900 | 3.8 | 4319 | 35.7 | 1985 | 44193 | 18.0 | 1482 | 38.6 | 12.5 | MULTIPOLYGON (((-80.12199 2… | 1754.610 | 32 |
| 12011010306 | 126700 | 4.1 | 2708 | 41.4 | 1973 | 40843 | 21.3 | 476 | 25.5 | 42.9 | MULTIPOLYGON (((-80.10789 2… | 2186.491 | 44 |
| 12011010307 | 165900 | 4.1 | 4028 | 38.6 | 1980 | 38105 | 10.7 | 802 | 27.7 | 23.7 | MULTIPOLYGON (((-80.12833 2… | 1317.291 | 37 |
| 12011010308 | 296700 | 4.3 | 5088 | 47.5 | 1986 | 64821 | 41.1 | 1186 | 68.1 | 43.2 | MULTIPOLYGON (((-80.15345 2… | 1904.333 | 31 |
| 12011010401 | 333700 | 5.3 | 4973 | 38.6 | 1986 | 76940 | 34.3 | 1146 | 64.6 | 70.0 | MULTIPOLYGON (((-80.17015 2… | 2665.559 | 31 |
- We need to make sure to use the sf function st_drop_geometry() before calculating correlations
- This turns the sf object into a regular data frame and is an essential step, as the geometry column is incompatible with many R functions
mia_estimates <- mia_data_for_model %>%
# remove GEOID, median_value, and median_year_built variables from data
select(-GEOID, -median_value, -median_year_built) %>%
# remove geometry column (converts sf object to regular data frame)
st_drop_geometry()
# calculate pearson correlation
correlations <- correlate(mia_estimates, method = "pearson")
# network plot of correlation
network_plot(correlations)
| median_rooms | total_population | median_age | median_income | pct_college | pct_foreign_born | pct_white | percent_ooh | pop_density | median_structure_age |
|---|---|---|---|---|---|---|---|---|---|
| 5.2 | 2898 | 49.4 | 92734 | 46.1 | 306 | 91.7 | 84.6 | 1726.025 | 51 |
| 4.4 | 3144 | 58.5 | 69583 | 41.0 | 419 | 88.9 | 75.3 | 1371.705 | 50 |
| 3.7 | 2156 | 57.0 | 55612 | 47.8 | 314 | 84.6 | 61.9 | 2605.186 | 45 |
| 4.3 | 6462 | 43.9 | 48442 | 25.9 | 2815 | 65.6 | 58.0 | 2344.356 | 44 |
| 4.8 | 4584 | 34.3 | 47168 | 8.6 | 1071 | 8.3 | 55.4 | 3985.536 | 43 |
| 3.8 | 4319 | 35.7 | 44193 | 18.0 | 1482 | 38.6 | 12.5 | 1754.610 | 32 |
| 4.1 | 2708 | 41.4 | 40843 | 21.3 | 476 | 25.5 | 42.9 | 2186.491 | 44 |
| 4.1 | 4028 | 38.6 | 38105 | 10.7 | 802 | 27.7 | 23.7 | 1317.291 | 37 |
| 4.3 | 5088 | 47.5 | 64821 | 41.1 | 1186 | 68.1 | 43.2 | 1904.333 | 31 |
| 5.3 | 4973 | 38.6 | 76940 | 34.3 | 1146 | 64.6 | 70.0 | 2665.559 | 31 |
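One way to follow up on these correlations is a simple linear model of home values; the specification below is an illustrative sketch, not part of the original analysis:
# drop geometry so lm() receives a plain data frame
mia_model_data <- sf::st_drop_geometry(mia_data_for_model)
# regress log median home value on a few tract-level predictors
home_value_model <- lm(log(median_value) ~ median_rooms + median_income + pct_college +
percent_ooh + pop_density + median_structure_age, data = mia_model_data)
summary(home_value_model)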

Microdata
- ACS tables are individual responses aggregated to a geography
- For most purposes, these tables have the data we need
- Census Bureau also releases microdata from the ACS via the Public Use Microdata Sample (PUMS)
- Individual- and household-level responses to the ACS are available with the get_pums() function
  - One row per respondent instead of one row per geography
  - Smallest geography is the Public Use Microdata Area (PUMA)
  - Allows for custom estimates and more granular analysis (a minimal query sketch follows this list)
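A minimal query sketch (the variables and state here are arbitrary illustrations; a fuller Georgia example follows in the next slides):
library(tidycensus)
# one row per respondent: age and sex from the 2019 1-year Wyoming PUMS
wy_pums <- get_pums(variables = c("AGEP", "SEX"), state = "WY", survey = "acs1", year = 2019)
head(wy_pums)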
PUMS variables
- Like with ACS data, tidycensus allows us to load, filter, and inspect PUMS variables
install.packages(c("survey", "srvyr"))
library(tidyverse)
library(tidycensus)
# 2019 ACS 1-year variables
pums_vars_2019 <- pums_variables %>%
filter(year == 2019, survey == "acs1")
# only unique variables
pums_vars_2019 %>%
distinct(var_code, var_label, data_type, level)
- PUMS variables come with codes, value labels, and levels (person or household)
| var_code | var_label | data_type | level |
|---|---|---|---|
| RT | Record Type | chr | NA |
| SERIALNO | Housing unit/GQ person serial number | chr | NA |
| DIVISION | Division code based on 2010 Census definitions | chr | NA |
| PUMA | Public use microdata area code (PUMA) based on 2010 Census definition (areas with population of 100,000 or more, use with ST for unique code) | chr | NA |
| REGION | Region code based on 2010 Census definitions | chr | NA |
| ST | State Code based on 2010 Census definitions | chr | NA |
| ADJHSG | Adjustment factor for housing dollar amounts (6 implied decimal places) | chr | housing |
| ADJINC | Adjustment factor for income and earnings dollar amounts (6 implied decimal places) | chr | housing |
| WGTP | Housing Unit Weight | num | housing |
| NP | Number of persons in this household | num | housing |
# only individual level records
pums_vars_2019 %>%
distinct(var_code, var_label, data_type, level) %>%
filter(level == "person")
- Many variables are only available at one level, so we need to make sure our variable of interest matches the level of detail we want to use
| var_code | var_label | data_type | level |
|---|---|---|---|
| SPORDER | Person number | num | person |
| PWGTP | Person’s weight | num | person |
| AGEP | Age | num | person |
| CIT | Citizenship status | chr | person |
| CITWP | Year of naturalization write-in | num | person |
| COW | Class of worker | chr | person |
| DDRS | Self-care difficulty | chr | person |
| DEAR | Hearing difficulty | chr | person |
| DEYE | Vision difficulty | chr | person |
| DOUT | Independent living difficulty | chr | person |
Loading PUMS Data and Recoding
- Many PUMS variables have numeric values that represent categories
ga_pums <- get_pums(variables = c("PUMA", "SEX", "AGEP", "SCHL"), state = "GA", survey = "acs1",
year = 2018)
- Here, we load PUMS data for Georgia
- The meaning of the coded values may be unclear for some variables
| SERIALNO | SPORDER | WGTP | PWGTP | AGEP | PUMA | ST | SCHL | SEX |
|---|---|---|---|---|---|---|---|---|
| 2018GQ0000025 | 1 | 0 | 68 | 51 | 03700 | 13 | 13 | 1 |
| 2018GQ0000043 | 1 | 0 | 89 | 23 | 04000 | 13 | 20 | 2 |
| 2018GQ0000061 | 1 | 0 | 10 | 43 | 00500 | 13 | 17 | 1 |
| 2018GQ0000076 | 1 | 0 | 11 | 20 | 04300 | 13 | 19 | 2 |
| 2018GQ0000145 | 1 | 0 | 4 | 23 | 04200 | 13 | 13 | 1 |
| 2018GQ0000336 | 1 | 0 | 90 | 36 | 01800 | 13 | 19 | 1 |
| 2018GQ0000482 | 1 | 0 | 76 | 92 | 04400 | 13 | 16 | 2 |
| 2018GQ0000518 | 1 | 0 | 76 | 48 | 02600 | 13 | 17 | 1 |
| 2018GQ0000530 | 1 | 0 | 15 | 19 | 03200 | 13 | 19 | 2 |
| 2018GQ0000563 | 1 | 0 | 50 | 87 | 03900 | 13 | 21 | 2 |
- By including recode = TRUE in our function call, any labeled variables will be added alongside their unlabeled versions
ga_pums_recoded <- get_pums(variables = c("PUMA", "SEX", "AGEP", "SCHL"), state = "GA",
survey = "acs1", year = 2018, recode = TRUE)
| SERIALNO | SPORDER | WGTP | PWGTP | AGEP | PUMA | ST | SCHL | SEX | ST_label | SCHL_label | SEX_label |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2018GQ0000025 | 1 | 0 | 68 | 51 | 03700 | 13 | 13 | 1 | Georgia/GA | Grade 10 | Male |
| 2018GQ0000043 | 1 | 0 | 89 | 23 | 04000 | 13 | 20 | 2 | Georgia/GA | Associate’s degree | Female |
| 2018GQ0000061 | 1 | 0 | 10 | 43 | 00500 | 13 | 17 | 1 | Georgia/GA | GED or alternative credential | Male |
| 2018GQ0000076 | 1 | 0 | 11 | 20 | 04300 | 13 | 19 | 2 | Georgia/GA | 1 or more years of college credit, no degree | Female |
| 2018GQ0000145 | 1 | 0 | 4 | 23 | 04200 | 13 | 13 | 1 | Georgia/GA | Grade 10 | Male |
| 2018GQ0000336 | 1 | 0 | 90 | 36 | 01800 | 13 | 19 | 1 | Georgia/GA | 1 or more years of college credit, no degree | Male |
| 2018GQ0000482 | 1 | 0 | 76 | 92 | 04400 | 13 | 16 | 2 | Georgia/GA | Regular high school diploma | Female |
| 2018GQ0000518 | 1 | 0 | 76 | 48 | 02600 | 13 | 17 | 1 | Georgia/GA | GED or alternative credential | Male |
| 2018GQ0000530 | 1 | 0 | 15 | 19 | 03200 | 13 | 19 | 2 | Georgia/GA | 1 or more years of college credit, no degree | Female |
| 2018GQ0000563 | 1 | 0 | 50 | 87 | 03900 | 13 | 21 | 2 | Georgia/GA | Bachelor’s degree | Female |
- PUMS data uses weighting to allow estimation of population-level statistics from a relatively small sample
- With the recoded labels, we can create a count of population by PUMA and sex by summing the person weights
# weighted count of population by PUMA and sex
ga_pums_recoded %>%
count(PUMA, SEX_label, wt = PWGTP)
| PUMA | SEX_label | n |
|---|---|---|
| 00100 | Male | 74196 |
| 00100 | Female | 79139 |
| 00200 | Male | 59547 |
| 00200 | Female | 58650 |
| 00300 | Male | 77869 |
| 00300 | Female | 77896 |
| 00401 | Male | 77689 |
| 00401 | Female | 87415 |
| 00402 | Male | 61002 |
| 00402 | Female | 63279 |
- In addition to weighted counts, we can also create weighted means and percentages
- Here, we calculate summary statistics for male and female individuals 25 and older with a Bachelor’s degree or above
# create a new variable that is whether or not the person has a Bachelor’s
# degree or above, group by PUMA and sex, then calculate the total population,
# average age, total with BA or above (only for people 25 and older), and
# percent with BA or above.
ga_pums_recoded %>%
mutate(ba_above = SCHL %in% c("21", "22", "23", "24")) %>%
group_by(PUMA, SEX_label) %>%
summarize(total_pop = sum(PWGTP), mean_age = weighted.mean(AGEP, PWGTP), ba_above = sum(PWGTP[ba_above ==
TRUE & AGEP >= 25]), ba_above_pct = ba_above/sum(PWGTP[AGEP >= 25]))
- This yields population counts, average age, and counts and percentages of educational attainment
| PUMA | SEX_label | total_pop | mean_age | ba_above | ba_above_pct |
|---|---|---|---|---|---|
| 00100 | Male | 74196 | 38.47365 | 12848 | 0.2598233 |
| 00100 | Female | 79139 | 40.34941 | 16232 | 0.2947468 |
| 00200 | Male | 59547 | 31.83490 | 6436 | 0.1846613 |
| 00200 | Female | 58650 | 34.55173 | 9715 | 0.2611489 |
| 00300 | Male | 77869 | 34.31346 | 9234 | 0.2053871 |
| 00300 | Female | 77896 | 36.61063 | 10139 | 0.2078729 |
| 00401 | Male | 77689 | 35.36655 | 14187 | 0.2845881 |
| 00401 | Female | 87415 | 37.84550 | 16109 | 0.2770106 |
| 00402 | Male | 61002 | 38.24930 | 14338 | 0.3405215 |
| 00402 | Female | 63279 | 41.18079 | 19165 | 0.4163951 |
Modeling Microdata
- Because PUMS sample sizes can get very small when we filter or create groups, we will often want to calculate standard errors and check the reliability of our estimates, as sketched below
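A sketch of one approach using the survey and srvyr packages installed earlier together with tidycensus helpers; the variables, state, and grouping here are assumptions for illustration:
library(tidycensus)
library(srvyr)
# request replicate weights so standard errors can be estimated
ga_pums_rep <- get_pums(variables = c("PUMA", "AGEP", "SEX"), state = "GA", survey = "acs1",
year = 2018, rep_weights = "person")
# convert to a survey design object
ga_survey <- to_survey(ga_pums_rep, type = "person")
# weighted mean age by PUMA, with a standard error column
ga_survey %>%
group_by(PUMA) %>%
summarize(mean_age = survey_mean(AGEP))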
Mapping Microdata
- PUMS data is mapped at the PUMA level, so it is useful to examine how that data differs from more common geographies
- Here we use tigris to pull PUMA boundaries for six states
# create vector of southeastern state abbreviations
se_states <- c("GA", "FL", "AL", "TN", "LA", "MS")
# use map() to pass each state to the tigris pumas() function to get their puma
# geographic files
se_pumas <- map(se_states, tigris::pumas, class = "sf", cb = TRUE) %>%
reduce(rbind)
- We can see that the naming conventions are quite different from those of other geographies
| STATEFP10 | PUMACE10 | AFFGEOID10 | GEOID10 | NAME10 | LSAD10 | ALAND10 | AWATER10 | geometry |
|---|---|---|---|---|---|---|---|---|
| 13 | 00700 | 7950000US1300700 | 1300700 | Southern Georgia Regional Commission (West) | P0 | 6023715229 | 99859689 | MULTIPOLYGON (((-83.33829 3… |
| 13 | 01800 | 7950000US1301800 | 1301800 | River Valley Regional Commission (Outside Muscogee & Chattahoochee Counties) | P0 | 12319655378 | 221465656 | MULTIPOLYGON (((-85.18491 3… |
| 13 | 03900 | 7950000US1303900 | 1303900 | Northeast Georgia Regional Commission (Southwest)–Walton, Morgan & Jasper Counties | P0 | 2700385304 | 40042206 | MULTIPOLYGON (((-83.98154 3… |
| 13 | 02700 | 7950000US1302700 | 1302700 | Northwest Georgia Regional Commission (North Central)–Whitfield County | P0 | 752230670 | 1597015 | MULTIPOLYGON (((-85.16705 3… |
| 13 | 04000 | 7950000US1304000 | 1304000 | Central Savannah River Area Regional Commission (East Central)–Richmond County | P0 | 840018751 | 11014326 | MULTIPOLYGON (((-82.3503 33… |
- Here we select the PUMA and POVPIP (income-to-poverty-ratio) variables for states in the Fed 6th District
# PUMA and income-to-poverty ratio
se_pums <- get_pums(variables = c("PUMA", "POVPIP"), state = se_states, survey = "acs1",
year = 2018)
| SERIALNO | SPORDER | WGTP | PWGTP | POVPIP | PUMA | ST |
|---|---|---|---|---|---|---|
| 2018GQ0000025 | 1 | 0 | 68 | -1 | 03700 | 13 |
| 2018GQ0000028 | 1 | 0 | 20 | -1 | 01800 | 47 |
| 2018GQ0000043 | 1 | 0 | 89 | -1 | 04000 | 13 |
| 2018GQ0000049 | 1 | 0 | 75 | -1 | 01600 | 01 |
| 2018GQ0000052 | 1 | 0 | 42 | -1 | 01700 | 28 |
- Before mapping, we group by state and PUMA and calculate the total population and the share of the population that is below 200% of the poverty line
# calculate share of population below 200% of poverty line
se_pov <- se_pums %>%
group_by(ST, PUMA) %>%
summarize(total_pop = sum(PWGTP), pct_in_pov = sum(PWGTP[POVPIP < 200])/total_pop)
| ST | PUMA | total_pop | pct_in_pov |
|---|---|---|---|
| 01 | 00100 | 184908 | 0.3595410 |
| 01 | 00200 | 196543 | 0.2522807 |
| 01 | 00301 | 132862 | 0.3144240 |
| 01 | 00302 | 101612 | 0.3541708 |
| 01 | 00400 | 123128 | 0.5115408 |
- We then join our microdata to the PUMA boundaries and shade our map with the poverty variable
# join data and create map shaded by population share below 200% poverty line
se_pumas %>%
left_join(se_pov, by = c(STATEFP10 = "ST", PUMACE10 = "PUMA")) %>%
ggplot(aes(fill = pct_in_pov)) + geom_sf() + scale_fill_viridis_b(name = NULL,
option = "magma", labels = scales::label_percent(1)) + labs(title = "Percentage of Population Below 200% of the Poverty Line") +
theme_void()

Census Microdata from IPUMS
- We may choose to work with data from 3rd party providers rather than the Census API depending on our needs and preferences
- IPUMS is easily the most comprehensive
- Maintained by the Institute for Social Research and Data Innovation at the University of Minnesota
- Data from over 750 censuses and surveys representing over one billion individuals
  - These include the Current Population Survey, the American Community Survey, the National Health Interview Survey, and the Demographic and Health Surveys, among others
- We will focus on ACS IPUMS data, but same syntax can be applied to CPS and other sources
What is ipumsr?
- Created by Greg Freedman Ellis
- Allows users to:
  - Read IPUMS data extracts into R
  - Manage the large size, hierarchical structure, and variable labeling conventions of IPUMS extracts
  - Use IPUMS metadata to transform data and conduct analysis
  - In the near future, query the IPUMS API in a manner similar to tidycensus
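ipumsr is available on CRAN, so getting started only requires installing and loading the package:
install.packages("ipumsr", repos = "http://cran.us.r-project.org")
library(ipumsr)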
Handling Labels
- IPUMS value labels don’t translate perfectly to the R factor object type
- All factor values must be labeled
- Factor values must count up from 1
- ipumsr has functions to help work with values and labels while following R logic
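The examples that follow assume an IPUMS USA extract has already been downloaded and read into an object named data; a sketch of that step, with hypothetical extract file names:
library(ipumsr)
# the DDI codebook (.xml) ships alongside the data file in every IPUMS extract
ddi <- read_ipums_ddi("usa_00001.xml")
data <- read_ipums_micro(ddi)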
head(data)
- We can see that our data has both labeled and unlabeled variables
| YEAR | DATANUM | SERIAL | HHWT | STATEFIP | CONSPUMA | GQ | PHONE | PERNUM | PERWT | EDUC | EDUCD | EDUCD2 | EDUCD3 | PHONE2 | PHONE3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1960 | 1 | 336455 | 100 | 27 | NA | 1 | 2 | 1 | 100 | 5 | 50 | Grade 11 | Less than High School | Yes, phone available | Yes, phone available |
| 1960 | 1 | 336455 | 100 | 27 | NA | 1 | 2 | 2 | 100 | 6 | 60 | Grade 12 | High school | Yes, phone available | Yes, phone available |
| 1960 | 1 | 336456 | 100 | 27 | NA | 1 | 2 | 1 | 100 | 6 | 60 | Grade 12 | High school | Yes, phone available | Yes, phone available |
| 1960 | 1 | 336456 | 100 | 27 | NA | 1 | 2 | 2 | 100 | 6 | 60 | Grade 12 | High school | Yes, phone available | Yes, phone available |
| 1960 | 1 | 336456 | 100 | 27 | NA | 1 | 2 | 3 | 100 | 2 | 22 | Grade 5, 6, 7, or 8 | Less than High School | Yes, phone available | Yes, phone available |
| 1960 | 1 | 336456 | 100 | 27 | NA | 1 | 2 | 4 | 100 | 0 | 1 | N/A or no schooling | N/A or no schooling | Yes, phone available | Yes, phone available |
| 1960 | 1 | 336457 | 99 | 27 | NA | 1 | 2 | 1 | 99 | 6 | 60 | Grade 12 | High school | Yes, phone available | Yes, phone available |
| 1960 | 1 | 336457 | 99 | 27 | NA | 1 | 2 | 2 | 99 | 4 | 40 | Grade 10 | Less than High School | Yes, phone available | Yes, phone available |
- The as_factor() function converts labelled values into factors, using the value labels as the factor levels
# add factor levels to labels
ipumsr::as_factor(data)
| YEAR | DATANUM | SERIAL | HHWT | STATEFIP | CONSPUMA | GQ | PHONE | PERNUM | PERWT | EDUC | EDUCD | EDUCD2 | EDUCD3 | PHONE2 | PHONE3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1960 | 1 | 336455 | 100 | Minnesota | NA | Households under 1970 definition | Yes, phone available | 1 | 100 | Grade 11 | Grade 11 | Grade 11 | Less than High School | Yes, phone available | Yes, phone available |
| 1960 | 1 | 336455 | 100 | Minnesota | NA | Households under 1970 definition | Yes, phone available | 2 | 100 | Grade 12 | Grade 12 | Grade 12 | High school | Yes, phone available | Yes, phone available |
| 1960 | 1 | 336456 | 100 | Minnesota | NA | Households under 1970 definition | Yes, phone available | 1 | 100 | Grade 12 | Grade 12 | Grade 12 | High school | Yes, phone available | Yes, phone available |
| 1960 | 1 | 336456 | 100 | Minnesota | NA | Households under 1970 definition | Yes, phone available | 2 | 100 | Grade 12 | Grade 12 | Grade 12 | High school | Yes, phone available | Yes, phone available |
| 1960 | 1 | 336456 | 100 | Minnesota | NA | Households under 1970 definition | Yes, phone available | 3 | 100 | Grade 5, 6, 7, or 8 | Grade 5 | Grade 5, 6, 7, or 8 | Less than High School | Yes, phone available | Yes, phone available |
| 1960 | 1 | 336456 | 100 | Minnesota | NA | Households under 1970 definition | Yes, phone available | 4 | 100 | N/A or no schooling | N/A | N/A or no schooling | N/A or no schooling | Yes, phone available | Yes, phone available |
| 1960 | 1 | 336457 | 99 | Minnesota | NA | Households under 1970 definition | Yes, phone available | 1 | 99 | Grade 12 | Grade 12 | Grade 12 | High school | Yes, phone available | Yes, phone available |
| 1960 | 1 | 336457 | 99 | Minnesota | NA | Households under 1970 definition | Yes, phone available | 2 | 99 | Grade 10 | Grade 10 | Grade 10 | Less than High School | Yes, phone available | Yes, phone available |
- We can also use zap_labels() to remove labels and show raw values
# remove labels to show values
zap_labels(data)
| YEAR | DATANUM | SERIAL | HHWT | STATEFIP | CONSPUMA | GQ | PHONE | PERNUM | PERWT | EDUC | EDUCD | EDUCD2 | EDUCD3 | PHONE2 | PHONE3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1960 | 1 | 336455 | 100 | 27 | NA | 1 | 2 | 1 | 100 | 5 | 50 | Grade 11 | Less than High School | Yes, phone available | Yes, phone available |
| 1960 | 1 | 336455 | 100 | 27 | NA | 1 | 2 | 2 | 100 | 6 | 60 | Grade 12 | High school | Yes, phone available | Yes, phone available |
| 1960 | 1 | 336456 | 100 | 27 | NA | 1 | 2 | 1 | 100 | 6 | 60 | Grade 12 | High school | Yes, phone available | Yes, phone available |
| 1960 | 1 | 336456 | 100 | 27 | NA | 1 | 2 | 2 | 100 | 6 | 60 | Grade 12 | High school | Yes, phone available | Yes, phone available |
| 1960 | 1 | 336456 | 100 | 27 | NA | 1 | 2 | 3 | 100 | 2 | 22 | Grade 5, 6, 7, or 8 | Less than High School | Yes, phone available | Yes, phone available |
| 1960 | 1 | 336456 | 100 | 27 | NA | 1 | 2 | 4 | 100 | 0 | 1 | N/A or no schooling | N/A or no schooling | Yes, phone available | Yes, phone available |
| 1960 | 1 | 336457 | 99 | 27 | NA | 1 | 2 | 1 | 99 | 6 | 60 | Grade 12 | High school | Yes, phone available | Yes, phone available |
| 1960 | 1 | 336457 | 99 | 27 | NA | 1 | 2 | 2 | 99 | 4 | 40 | Grade 10 | Less than High School | Yes, phone available | Yes, phone available |
Handling Labels: Consolidating
- lbl_collapse() allows you to take advantage of the hierarchical structure of value labels
- Listing the value labels shows we have far too many categories for useful analysis
# view the full set of value labels for EDUCD
ipums_val_labels(data$EDUCD)
| val | lbl |
|---|---|
| 0 | N/A or no schooling |
| 1 | N/A |
| 2 | No schooling completed |
| 10 | Nursery school to grade 4 |
| 11 | Nursery school, preschool |
| 12 | Kindergarten |
| 13 | Grade 1, 2, 3, or 4 |
| 14 | Grade 1 |
| 15 | Grade 2 |
| 16 | Grade 3 |
| 17 | Grade 4 |
| 20 | Grade 5, 6, 7, or 8 |
| 21 | Grade 5 or 6 |
| 22 | Grade 5 |
| 23 | Grade 6 |
| 24 | Grade 7 or 8 |
| 25 | Grade 7 |
| 26 | Grade 8 |
| 30 | Grade 9 |
| 40 | Grade 10 |
| 50 | Grade 11 |
| 60 | Grade 12 |
| 61 | 12th grade, no diploma |
| 62 | High school graduate or GED |
| 63 | Regular high school diploma |
| 64 | GED or alternative credential |
| 65 | Some college, but less than 1 year |
| 70 | 1 year of college |
| 71 | 1 or more years of college credit, no degree |
| 80 | 2 years of college |
| 81 | Associate’s degree, type not specified |
| 82 | Associate’s degree, occupational program |
| 83 | Associate’s degree, academic program |
| 90 | 3 years of college |
| 100 | 4 years of college |
| 101 | Bachelor’s degree |
| 110 | 5+ years of college |
| 111 | 6 years of college (6+ in 1960-1970) |
| 112 | 7 years of college |
| 113 | 8+ years of college |
| 114 | Master’s degree |
| 115 | Professional degree beyond a bachelor’s degree |
| 116 | Doctoral degree |
| 999 | Missing |
- The values are hierarchical: each code divisible by ten is a parent category that contains the more detailed codes beneath it
- We simplify the variable by collapsing the detailed codes onto those parent values (a quick arithmetic illustration follows the collapsed label table below)
# still too detailed
data$EDUCD %>%
  lbl_collapse(~.val %/% 10) %>%
  ipums_val_labels()
| val | lbl |
|---|---|
| 0 | N/A or no schooling |
| 1 | Nursery school to grade 4 |
| 2 | Grade 5, 6, 7, or 8 |
| 3 | Grade 9 |
| 4 | Grade 10 |
| 5 | Grade 11 |
| 6 | Grade 12 |
| 7 | 1 year of college |
| 8 | 2 years of college |
| 9 | 3 years of college |
| 10 | 4 years of college |
| 11 | 5+ years of college |
| 99 | Missing |
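- The collapse works because integer division by ten maps each detailed code onto its parent category; a quick illustration at the console (values taken from the label tables above):
# integer division maps detailed codes onto their parent category
62 %/% 10   # 6  ("High school graduate or GED" collapses into "Grade 12")
101 %/% 10  # 10 ("Bachelor's degree" collapses into "4 years of college")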
Handling Labels: Consolidating
- We still have too many categories, so we combine some of them to make the variable easier to interpret
- Using some string detection and relabeling, we reduce the number of categories to six
# regular expression for matching value labels that contain a particular string
college_regex <- "^[123] year(s)? of college$"

data$EDUCD3 <- data$EDUCD %>%
  # divide values by 10 to collapse onto single-digit parent categories
  lbl_collapse(~.val %/% 10) %>%
  # combine all less-than-high-school values into one label, classify any amount
  # of college without completion as 'Some college', and create one category for
  # all college degree types (str_detect() comes from the stringr package)
  lbl_relabel(
    lbl(2, "Less than High School") ~ .val > 0 & .val < 6,
    lbl(3, "High school") ~ .lbl == "Grade 12",
    lbl(4, "Some college") ~ str_detect(.lbl, college_regex),
    lbl(5, "College or more") ~ .val %in% c(10, 11)
  ) %>%
  as_factor()
# new levels
levels(data$EDUCD3)
## [1] "N/A or no schooling" "Less than High School" "High school"
## [4] "Some college" "College or more" "Missing"
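- A quick unweighted frequency check (not part of the original workflow) is a simple way to confirm the recode behaved as expected:
# unweighted counts of the new categories, including any NAs
table(data$EDUCD3, useNA = "ifany")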
Removing Missing Data
- lbl_na_if() allows you to set certain values or labels to missing
- In this variable, the N/A and suppressed-result values need to be converted to NA before analysis
ipums_val_labels(data$PHONE)
| val | lbl |
|---|---|
| 0 | N/A |
| 1 | No, no phone available |
| 2 | Yes, phone available |
| 8 | Suppressed (2012 and 2015 ACS) |
- We can do this using the numeric values
# convert these values to NAs
data$PHONE2 <- lbl_na_if(data$PHONE, ~.val %in% c(0, 8)) %>%
  as_factor()
levels(data$PHONE2)
## [1] "No, no phone available" "Yes, phone available"

# works with labels or values
drop_labels <- c("N/A", "Suppressed (2012 and 2015 ACS)")
data$PHONE3 <- lbl_na_if(data$PHONE, ~.lbl %in% drop_labels) %>%
  as_factor()
# updated levels
levels(data$PHONE3)
## [1] "No, no phone available" "Yes, phone available"
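- A simple count of the resulting missing values (a generic check, not from the original workflow) shows how many records were set to NA:
# how many records were converted to NA by lbl_na_if()
sum(is.na(data$PHONE2))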
Analysis
- Now the extract is factored, relabeled and ready for graphing
Selecting the Right Level, Weight, and Aggregation
- Extracts are samples, with records at both the household and person level, each carrying its own weight
- Here, we use the person weight (PERWT) to estimate the share of people with a phone line by year; a household-level variant is sketched after the table below
# prepare data for graphing: group by year and use the person weight to
# calculate a weighted mean, i.e. the estimated share of the Minnesota
# population with a phone
graph_data <- data %>%
  group_by(YEAR) %>%
  summarize(`% with phone` = weighted.mean(PHONE2 == "Yes, phone available",
    PERWT, na.rm = TRUE), .groups = "drop")
| YEAR | % with phone |
|---|---|
| 1960 | 0.8863756 |
| 1970 | 0.9579866 |
| 1980 | 0.9745530 |
| 1990 | 0.9780473 |
| 2000 | 0.9904893 |
| 2001 | 0.9903932 |
| 2002 | 0.9843541 |
| 2003 | 0.9871095 |
| 2004 | 0.9771358 |
| 2005 | 0.9742268 |
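- For comparison, a household-level estimate would keep one record per household and use the household weight instead; a minimal sketch (the hhwt_data name is ours, not from the original):
# estimated share of Minnesota *households* with a phone, using the household
# weight (HHWT) and one record (PERNUM == 1) per household
hhwt_data <- data %>%
  filter(PERNUM == 1) %>%
  group_by(YEAR) %>%
  summarize(`% of households with phone` = weighted.mean(PHONE2 == "Yes, phone available",
    HHWT, na.rm = TRUE), .groups = "drop")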
- We also have educational attainment data in the extract, so we prepare another dataset grouped by both year and education level
# group data by both year and education level
graph_data2 <- data %>%
  group_by(YEAR, EDUCD3) %>%
  summarize(`% with phone` = weighted.mean(PHONE2 == "Yes, phone available",
    PERWT, na.rm = TRUE), .groups = "drop")
| YEAR | EDUCD3 | % with phone |
|---|---|---|
| 1960 | N/A or no schooling | 0.8826448 |
| 1960 | Less than High School | 0.8590952 |
| 1960 | High school | 0.9217318 |
| 1960 | Some college | 0.9581475 |
| 1960 | College or more | 0.9831610 |
| 1970 | N/A or no schooling | 0.9566989 |
| 1970 | Less than High School | 0.9458146 |
| 1970 | High school | 0.9698355 |
| 1970 | Some college | 0.9758722 |
| 1970 | College or more | 0.9888245 |
Plotting the Results
- As with tidycensus data, we can plot a single variable over time
# plot one variable year over year
ggplot(graph_data, aes(x = YEAR, y = `% with phone`)) +
  geom_point() +
  geom_line() +
  # use the variable description from the metadata as the caption
  labs(title = "Percent of Minnesota with phone line",
       subtitle = paste0("Data source: ", ddi$ipums_project),
       caption = paste(strwrap(ipums_var_desc(ddi, PHONE), 90), collapse = "\n"))
[Plot: Percent of Minnesota with phone line, by year]
- Or use facet_wrap() to create a separate panel for each group
# plot one variable year over year, with a separate panel for each education level
ggplot(graph_data2, aes(x = YEAR, y = `% with phone`)) +
  geom_point() +
  geom_line() +
  labs(title = "Percent of Minnesota with Phone Line by Education",
       subtitle = paste0("Source: ", ddi$ipums_project)) +
  facet_wrap(~EDUCD3)
[Plot: Percent of Minnesota with Phone Line by Education, one panel per education level]
ipumsr and Mapping
- IPUMS provides geographic boundary files for IPUMS USA and other sources
- ipumsr provides support for both sf and sp data (we will use sf)
- Load boundary files with the read_ipums_sf() function
# read in CONSPUMA shapefile from IPUMS
shape_data <- read_ipums_sf("shape/")
# view data structure
as_tibble(shape_data)
| CONSPUMA | STATEFIP | State | geometry |
|---|---|---|---|
| 540 | 02 | Alaska | MULTIPOLYGON (((-3043869 34… |
| 541 | 02 | Alaska | MULTIPOLYGON (((-2291901 23… |
| 542 | 15 | Hawaii | MULTIPOLYGON (((-6021813 59… |
| 491 | 51 | Virginia | MULTIPOLYGON (((1720696 104… |
| 492 | 51 | Virginia | MULTIPOLYGON (((1745400 116… |
| 493 | 51 | Virginia | MULTIPOLYGON (((1728178 119… |
| 494 | 51 | Virginia | MULTIPOLYGON (((1717473 129… |
| 495 | 53 | Washington | MULTIPOLYGON (((-1584859 14… |
| 496 | 53 | Washington | MULTIPOLYGON (((-2031282 13… |
| 497 | 53 | Washington | MULTIPOLYGON (((-1970094 14… |
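- Because the boundaries load as an sf object, standard sf helpers work on them; for example, checking the coordinate reference system before mapping (a generic sf check, not part of the original workflow):
# inspect the projection of the IPUMS boundary file
sf::st_crs(shape_data)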
Joining Extract and Spatial Data
- ipumsr has functions for merging extracts with spatial data
# group extract data by CONSPUMA and year, create the weighted variable, and
# ungroup
conspuma_data <- data %>%
  group_by(CONSPUMA, YEAR) %>%
  summarize(`% with phone` = weighted.mean(PHONE2 == "Yes, phone available",
    PERWT, na.rm = TRUE), .groups = "drop")
| CONSPUMA | YEAR | % with phone |
|---|---|---|
| 245 | 1980 | 0.9641267 |
| 245 | 1990 | 0.9669292 |
| 245 | 2000 | 0.9859847 |
| 245 | 2005 | 0.9689021 |
| 245 | 2006 | 0.9566019 |
| 245 | 2007 | 0.9544157 |
| 245 | 2008 | 0.9850778 |
| 245 | 2009 | 0.9852029 |
| 245 | 2010 | 0.9831482 |
| 245 | 2011 | 0.9811322 |
- Similar to the GEOID in ACS data, we can use the CONSPUMA code as our joining variable
# join extract data with spatial data using CONSPUMA id
conspuma_data_join <- ipums_shape_inner_join(conspuma_data, shape_data, by = "CONSPUMA")
| CONSPUMA | YEAR | % with phone | STATEFIP | State | geometry |
|---|---|---|---|---|---|
| 245 | 1980 | 0.9641267 | 27 | Minnesota | MULTIPOLYGON (((62837.34 13… |
| 245 | 1990 | 0.9669292 | 27 | Minnesota | MULTIPOLYGON (((62837.34 13… |
| 245 | 2000 | 0.9859847 | 27 | Minnesota | MULTIPOLYGON (((62837.34 13… |
| 245 | 2005 | 0.9689021 | 27 | Minnesota | MULTIPOLYGON (((62837.34 13… |
| 245 | 2006 | 0.9566019 | 27 | Minnesota | MULTIPOLYGON (((62837.34 13… |
| 245 | 2007 | 0.9544157 | 27 | Minnesota | MULTIPOLYGON (((62837.34 13… |
| 245 | 2008 | 0.9850778 | 27 | Minnesota | MULTIPOLYGON (((62837.34 13… |
| 245 | 2009 | 0.9852029 | 27 | Minnesota | MULTIPOLYGON (((62837.34 13… |
| 245 | 2010 | 0.9831482 | 27 | Minnesota | MULTIPOLYGON (((62837.34 13… |
| 245 | 2011 | 0.9811322 | 27 | Minnesota | MULTIPOLYGON (((62837.34 13… |
Mapping with ipumsr, ggplot2, and sf
- Once the data is joined, we can filter it by year to measure change across decades
# filter data to show every decade
graph_conspuma <- conspuma_data_join %>%
  filter(YEAR %in% c(1980, 1990, 2000, 2010))
| CONSPUMA | YEAR | % with phone | STATEFIP | State | geometry |
|---|---|---|---|---|---|
| 245 | 1980 | 0.9641267 | 27 | Minnesota | MULTIPOLYGON (((62837.34 13… |
| 245 | 1990 | 0.9669292 | 27 | Minnesota | MULTIPOLYGON (((62837.34 13… |
| 245 | 2000 | 0.9859847 | 27 | Minnesota | MULTIPOLYGON (((62837.34 13… |
| 245 | 2010 | 0.9831482 | 27 | Minnesota | MULTIPOLYGON (((62837.34 13… |
| 246 | 1980 | 0.9881295 | 27 | Minnesota | MULTIPOLYGON (((254440.1 86… |
| 246 | 1990 | 0.9961230 | 27 | Minnesota | MULTIPOLYGON (((254440.1 86… |
| 246 | 2000 | 0.9959761 | 27 | Minnesota | MULTIPOLYGON (((254440.1 86… |
| 246 | 2010 | 0.9895193 | 27 | Minnesota | MULTIPOLYGON (((254440.1 86… |
| 247 | 1980 | 0.9673522 | 27 | Minnesota | MULTIPOLYGON (((227254.6 12… |
| 247 | 1990 | 0.9760921 | 27 | Minnesota | MULTIPOLYGON (((227254.6 12… |
- and visualize the share of Minnesota’s population with phone access
# color CONSPUMA polygons by the % with phone value, make one panel per year,
# and remove grid lines and coordinates
ggplot(graph_conspuma, aes(fill = `% with phone`)) +
  geom_sf() +
  facet_wrap(~YEAR) +
  theme_void()
[Map: share of population with a phone line by CONSPUMA, one panel per decade, 1980-2010]
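- A continuous fill scale with percent-formatted labels can make the facetted map easier to read; a minimal sketch (the viridis scale, percent labels, and title are our additions, not from the original, and assume the scales package that ships with ggplot2):
# same map with a percent-formatted viridis fill scale
ggplot(graph_conspuma, aes(fill = `% with phone`)) +
  geom_sf() +
  facet_wrap(~YEAR) +
  scale_fill_viridis_c(labels = scales::percent) +
  labs(title = "Share of population with a phone line by CONSPUMA") +
  theme_void()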