Today we talk about the US Census and the tidycensus package.
Now that you’ve had some practice, we’re going to create a new project so you can start fresh, without your beginner scripts.
install.packages("tidycensus")
The census helps determine how federal funds are distributed across the country.
The American Community Survey (ACS) is a demographics survey conducted by the US Census Bureau every year since 2005.
ACS data are estimates. Every estimate has a margin of error, which is larger for areas with smaller populations.
It always depends on your analysis questions!
Decennial vs ACS
If you use ACS, never forget that its values are estimates.
The questions and methods of collection change over time.
The Census Bureau counts incarcerated people as residents of the district where they are confined.
Resources:
tidycensus is a package that imports census data directly from the U.S. Census Bureau as tidyverse-ready dataframes. It is very nice.
There are two major functions implemented in tidycensus:
get_decennial(): grants access to the 2000, 2010, and 2020 decennial US Census APIs
get_acs(): grants access to the 1-year and 5-year American Community Survey APIs
get_decennial()
get_decennial(geography, variables, year, sumfile)
The available geographies are listed in the tidycensus documentation.
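The 2020 Census data release is still delayed because of COVID-19. Summary files available so far:
pl: redistricting data (P.L. 94-171)
dhc: Demographic and Housing Characteristics File
dp: Demographic Profile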
To view all of the variables in any of these tables, use the load_variables() function.
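As a rough sketch, the call below lists the variables in the 2020 pl summary file and keeps only the housing (H1) variables. The object name vars_pl is just illustrative, and the call needs an internet connection.

```r
library(tidycensus)
library(dplyr)

# Download the variable list for the 2020 "pl" summary file
vars_pl <- load_variables(year = 2020, dataset = "pl", cache = TRUE)

# Each row describes one variable: its code (name), label, and concept.
# Keep only the variables whose code starts with "H1_".
vars_pl %>%
  filter(grepl("^H1_", name))
```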
cache = TRUE means it’s faster to load the next time.
GEOID | NAME | variable | value |
---|---|---|---|
42 | Pennsylvania | housing_units | 5742828 |
06 | California | housing_units | 14392140 |
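A table like the one above could be produced along these lines. This is a sketch, not necessarily the exact call used in class: it assumes a Census API key is already set up, downloads every state, and then filters to two of them. The object name housing_units is illustrative.

```r
library(tidycensus)
library(dplyr)

# Assumes a Census API key has already been set with census_api_key()
housing_units <- get_decennial(
  geography = "state",
  variables = c(housing_units = "H1_001N"),  # the name replaces the code in the variable column
  year = 2020,
  sumfile = "pl"
) %>%
  filter(NAME %in% c("Pennsylvania", "California"))

housing_units
```

Passing a named vector to variables is what turns the code H1_001N into the readable label housing_units in the output.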
What percentage of housing units receive an American Community Survey each year?
GEOID | NAME | variable | value |
---|---|---|---|
42 | Pennsylvania | H1_001N | 5742828 |
06 | California | H1_001N | 14392140 |
54 | West Virginia | H1_001N | 855635 |
49 | Utah | H1_001N | 1151414 |
36 | New York | H1_001N | 8488066 |
11 | District of Columbia | H1_001N | 350364 |
Each state has 3 rows, one for each of the 3 housing variables. This is called long format.
We want the data to be in wide format to make it easier to work with.
GEOID | NAME | H1_001N | H1_002N | H1_003N |
---|---|---|---|---|
42 | Pennsylvania | 5742828 | 5210598 | 532230 |
06 | California | 14392140 | 13475623 | 916517 |
54 | West Virginia | 855635 | 743442 | 112193 |
49 | Utah | 1151414 | 1057252 | 94162 |
36 | New York | 8488066 | 7715172 | 772894 |
11 | District of Columbia | 350364 | 312448 | 37916 |
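One way to go from long to wide format is pivot_wider() from tidyr. The sketch below rebuilds a small long table by hand from two of the states shown above, so it runs without an API call; the object names housing_long and housing_wide are illustrative.

```r
library(tibble)
library(tidyr)

# Long format: one row per state per variable (values copied from the tables above)
housing_long <- tribble(
  ~GEOID, ~NAME,          ~variable, ~value,
  "42",   "Pennsylvania", "H1_001N",  5742828,
  "42",   "Pennsylvania", "H1_002N",  5210598,
  "42",   "Pennsylvania", "H1_003N",   532230,
  "06",   "California",   "H1_001N", 14392140,
  "06",   "California",   "H1_002N", 13475623,
  "06",   "California",   "H1_003N",   916517
)

# Wide format: one row per state, one column per variable
housing_wide <- pivot_wider(
  housing_long,
  names_from = variable,
  values_from = value
)

housing_wide
```

get_decennial() also accepts output = "wide", which returns the wide layout directly.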
Rename the variables and calculate percent occupied and percent vacant.
GEOID | state | tot_housing_units | occupied | vacant | pct_occupied | pct_vacant |
---|---|---|---|---|---|---|
42 | Pennsylvania | 5742828 | 5210598 | 532230 | 0.907 | 0.093 |
06 | California | 14392140 | 13475623 | 916517 | 0.936 | 0.064 |
54 | West Virginia | 855635 | 743442 | 112193 | 0.869 | 0.131 |
49 | Utah | 1151414 | 1057252 | 94162 | 0.918 | 0.082 |
36 | New York | 8488066 | 7715172 | 772894 | 0.909 | 0.091 |
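The renaming and percentage steps can be sketched with rename() and mutate() from dplyr. The example starts from a hand-built wide table for two states so it runs on its own; the rounding to three decimals is an assumption made to match the table above.

```r
library(tibble)
library(dplyr)

# Wide table for two states (values copied from the tables above)
housing_wide <- tribble(
  ~GEOID, ~NAME,          ~H1_001N, ~H1_002N, ~H1_003N,
  "42",   "Pennsylvania",  5742828,  5210598,   532230,
  "06",   "California",   14392140, 13475623,   916517
)

housing_2020 <- housing_wide %>%
  # rename() takes new_name = old_name
  rename(
    state = NAME,
    tot_housing_units = H1_001N,
    occupied = H1_002N,
    vacant = H1_003N
  ) %>%
  # Shares of all housing units, rounded to three decimals
  mutate(
    pct_occupied = round(occupied / tot_housing_units, 3),
    pct_vacant = round(vacant / tot_housing_units, 3)
  )

housing_2020
```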
Use geom_col to create a bar plot of percent vacant.
Use the reorder() function to order the states by percent vacant instead of alphabetically.
library(ggplot2)
library(scales)  # provides percent_format()
ggplot(data = housing_2020, aes(x = reorder(state, pct_vacant), y = pct_vacant)) +
  geom_col() +
  scale_y_continuous(labels = percent_format(accuracy = 1)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
  labs(x = "State", y = "Percent Vacant",
       title = "Proportion of Housing Units That Are Vacant")
Let’s look at the list of variables again.
Find the variable for:
Use the get_decennial function to create a state dataframe with the following variables:
There are a lot of race/ethnicity variables. It is not easy to determine which one to use!
Create a bar chart of the percent Black population, with the states ordered by population.
Read the census questionnaire for 2020 (and previous years if you want to).
Read Black Is Over (Or, Special Black) by Tressie McMillan Cottom.
Create a dataframe of estimated Median Household Income and selected race/ethnicity variables for every county in one state. Use this data to understand the relationship between race/ethnicity and income in this state. Explore the dataframe by:
Write a paragraph explaining at least 3 things you have learned about your state by exploring the data. Include plots and/or statistics to support your conclusions. (You can upload the plots separately or create a PDF with text and images.)
Use the get_decennial function to create a dataframe of all counties in one state (pick any state) with the following variables:
Use the get_acs function to create a dataframe of the estimated median household income for all counties in the same state. Use the code below. We’ll learn more about ACS next week.
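A minimal sketch of such a get_acs call follows. It is an illustration rather than the exact assignment code: the state ("PA"), the year, the use of the 5-year survey, and the names county_income and median_hh_income are all assumptions. B19013_001 is the ACS variable for median household income.

```r
library(tidycensus)

# Assumes a Census API key has already been set with census_api_key()
county_income <- get_acs(
  geography = "county",
  variables = c(median_hh_income = "B19013_001"),
  state = "PA",      # example state; swap in your own
  year = 2020,
  survey = "acs5"    # 5-year estimates cover every county
)

# Columns: GEOID, NAME, variable, estimate, moe (margin of error)
county_income
```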
Join these two dataframes together. Explore as described in the assignment overview.
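Because both dataframes identify counties by GEOID, a join on that column is one reasonable approach. The sketch below uses two tiny hand-made dataframes with placeholder numbers (not real census figures) so it runs on its own; all object and column names are illustrative.

```r
library(tibble)
library(dplyr)

# Placeholder values only, standing in for the get_decennial() result
county_pop <- tibble(
  GEOID = c("42001", "42003"),
  total_pop = c(1000, 2000)
)

# Placeholder values only, standing in for the get_acs() result
county_income <- tibble(
  GEOID = c("42001", "42003"),
  median_hh_income = c(50000, 60000)
)

# Keep every county from the decennial data and attach the matching income
county_data <- left_join(county_pop, county_income, by = "GEOID")

county_data
```

With real tidycensus output, both dataframes also carry a NAME column, so it can help to drop it from one of them before joining to avoid NAME.x and NAME.y duplicates.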