Methods 1, Week 7

Outline

File structure
Homework questions and overview
US Census Race/Ethnicity variables
ACS data
In-class exercise
Homework

File structure

File organization is very important for being able to pick up a project later. So far we’ve been practicing and learning R. Starting today, we’ll start organizing our files so we can find them later.

Download the main data and project_template folders and save them in your methods1 folder.

main_data folder

The main_data folder will store all of the census data that we process and any other data we might want to use again.

Create a new R project in the main_data folder.
We’ll always work in this project when we process and explore data.

project_template folder

You will create a project folder for any projects that you work on (like your final project!). We’ll get into this in more detail later.

Homework: select census variables

library(tidyverse)
library(tidycensus)
library(scales)
library(viridis)
options(scipen = 999)

# load a list of all decennial census data from the Redistricting File
pl_vars_2020 <- load_variables(2020, "pl", cache = T) 

# load a list of all acs variables
acs_vars_2020 <- load_variables(2020, "acs5", cache = T)

# Total Population - P1_001N
# Percent Hispanic or Latino - P2_002N/P1_001N
# Percent White alone, not Hispanic or Latino - P2_005N/P1_001N
# Percent Black alone, not Hispanic or Latino - P2_006N/P1_001N
# Percent Asian alone, not Hispanic or Latino - P2_008N/P1_001N

# import population by race for each county in Mississippi
raw_race_2020 = get_decennial(geography = "county", 
                            variables=c("P1_001N", "P2_002N", "P2_005N", "P2_006N", 
                                        "P2_007N", "P2_008N", "P2_009N", "P2_010N"),
                            state='MS', 
                            geometry = FALSE, 
                            year = 2020)

Homework: tidy race/ethnicity data

#########  Create Tidy Data ############

pop20 <- raw_race_2020 %>%
  pivot_wider(names_from = variable, values_from = value) %>% 
  rename(pop20 = P1_001N, #  !!Total:, universe =   RACE
         hisp_pop20 = P2_002N, #  Hispanic or Latino
         black_pop20 = P2_006N, #  Not Hispanic or Latino:!!Population of one race:!!Black or African American alone
         asian_pop20 = P2_008N, #  Not Hispanic or Latino:!!Population of one race:!!Asian alone
         white_pop20 = P2_005N, # Not Hispanic or Latino:!!Population of one race:!!White alone
         native_pop20 = P2_007N, #  Not Hispanic or Latino:!!Population of one race:!!American Indian and Alaska Native alone
         haw_pi_pop20 = P2_009N, #  Not Hispanic or Latino:!!Population of one race:!!Native Hawaiian and Other Pacific Islander alone
         other_pop20 = P2_010N) %>% #  Not Hispanic or Latino:!!Population of one race:!!Some Other Race alone
  mutate(bipoc_pop20 = pop20 - white_pop20,
         pct_hisp= round(hisp_pop20/pop20, 3), 
         pct_white_alone_not_hisp = round(white_pop20/pop20, 3), 
         pct_black_alone_not_hisp = round(black_pop20/pop20, 3), 
         pct_asian_alone_not_hisp = round(asian_pop20/pop20, 3), 
         pct_indiginous_alone_not_hisp = round(native_pop20/pop20, 3), 
         pct_haw_pi_alone_not_hisp = round(haw_pi_pop20/pop20, 3), 
         pct_other_alone_not_hisp = round(other_pop20/pop20, 3), 
         pct_bipoc_20 = round(bipoc_pop20/pop20, 3))

Homework: import ACS data

get_acs() defaults to 5-year data

# import mhi estimate for Mississippi counties
raw_mhi_2020 <- get_acs(geography = "county", 
                        variables = c(mhi = "B19013_001"), 
                        state = "MS",
                        year = 2020)

## create data frame with estimate, margin of error, and GEOID for joining
mhi <- raw_mhi_2020 %>% 
  rename(mhi20 = estimate,
         mhi_moe = moe) %>% 
  select(GEOID, mhi20, mhi_moe)

# join data
miss_income_race <- pop20 %>% 
  left_join(mhi, by = "GEOID")

GEOID	NAME	pop20	hisp_pop20	white_pop20	black_pop20	native_pop20	asian_pop20	haw_pi_pop20	other_pop20	bipoc_pop20	pct_hisp	pct_white_alone_not_hisp	pct_black_alone_not_hisp	pct_asian_alone_not_hisp	pct_indiginous_alone_not_hisp	pct_haw_pi_alone_not_hisp	pct_other_alone_not_hisp	pct_bipoc_20	mhi20	mhi_moe
28001	Adams County, Mississippi	29538	1012	10926	16709	56	165	7	35	18612	0.034	0.370	0.566	0.006	0.002	0.000	0.001	0.630	30633	1459
28003	Alcorn County, Mississippi	34740	1248	27738	4316	61	180	19	50	7002	0.036	0.798	0.124	0.005	0.002	0.001	0.001	0.202	40938	2944

Homework: create summary stats

######### Create summary stats 
## using the percent function from the scales package to format nicely
income_race_stats <- miss_income_race %>% 
  summarise(Counties = n(),
            `Average Median Household Income` = dollar(mean(mhi20)),
            `Percent Latinx` = percent(sum(hisp_pop20)/sum(pop20)),
            `Percent White Alone, Not Latinx` = percent(sum(white_pop20)/sum(pop20)),
            `Percent Black Alone, Not Latinx` = percent(sum(black_pop20)/sum(pop20)),
            `Percent Asian Alone, Not Latinx` = 
              percent(sum(asian_pop20)/sum(pop20)),
            `Percent Other Alone, Not Latinx` = percent(sum(other_pop20)/sum(pop20)),
            `Percent BIPOC` = percent(sum(bipoc_pop20)/sum(pop20)))

Counties	Average Median Household Income	Percent Latinx	Percent White Alone, Not Latinx	Percent Black Alone, Not Latinx	Percent Asian Alone, Not Latinx	Percent Other Alone, Not Latinx	Percent BIPOC
82	$40,818.78	4%	55%	36%	1%	0%	45%

Defining colors in R: hex codes

Hex colors are 6-digit way to represent a color that is common in web development. You can pick colors and find their hex codes here:

Defining colors in R: Color names

R also recognizes hundreds of predefined color names like “blue” and “ivory”. You can find a full list of them here: ggplot colors

Homework: plot mhi and percent bipoc

Homework: code mhi and percent bipoc

ms_scatter <- ggplot(data = miss_income_race,
         aes(x = pct_bipoc_20, y = mhi20, 
             size = pop20, color = pct_bipoc_20)) +
  geom_point(alpha = .85) +
  # adjust color range
  scale_color_viridis(direction = -1,
                      name = "Percent BIPOC",
                      labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = dollar_format(accuracy = 1)) +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  # change legend label formatting
  scale_size_area(labels = comma, max_size = 10) +
  labs(x = "Percent BIPOC Population", y = "Estimated Median Household Income",
       title = "Counties in Mississippi, Median Household Income and Percent BIPOC",
       caption = "Sources: US Census, 2020; American Community Survey",
       # add nice label for size element
       size = "Population",
       color = "Percent BIPOC Population") +
  theme_bw()

ms_scatter

Homework: calculate difference from average MHI

Use the average median household income value from summary stats table to calculate difference from average

Counties	Average Median Household Income	Percent Latinx	Percent White Alone, Not Latinx	Percent Black Alone, Not Latinx	Percent Asian Alone, Not Latinx	Percent Other Alone, Not Latinx	Percent BIPOC
82	$40,818.78	4%	55%	36%	1%	0%	45%

miss_income_race_diff <- miss_income_race %>% 
  mutate(County = gsub(" County, Mississippi", "", NAME),
           mhi_diff_from_avg = mhi20 - 40819)

gsub(): function to replace text, use it in a mutate

gsub(pattern, replacement, x)

Homework: bar chart of deviation from average

Homework: code, bar chart of deviation from average

scale_fill_gradient(): creates a gradient palette based on two colors

define low and high colors

ms_income_diff <- ggplot(data = miss_income_race_diff, 
       aes(x=mhi_diff_from_avg,
           y= reorder(County, mhi_diff_from_avg),
           fill = pct_bipoc_20)) + 
  geom_col(width = 1) +
  # create color palette with hex code for high and low value
  scale_fill_gradient(low = "#ffffff", 
                      high = "#5e1b81",
                      name = "Percent BIPOC",
                      labels = percent_format(accuracy = 1)) + 
  scale_x_continuous(labels = dollar_format(accuracy = 1)) +
  labs(x = "Median Household Income, Difference from Average", 
       y = "",
       title= "Mississippi Counties, Median Household Income ",
       subtitle="Difference from Average ($49,230)") +
  theme_bw() +
  theme(axis.text.y = element_text(size = 5))  # change the text size for y-axis labels

ms_income_diff

Write to main_data folder

Write data to raw and processed folder in main_data

write_csv(raw_race_2020, "raw/county/mississippi_county_race_pl_selected_vars_2020.csv")

write_csv(raw_mhi_2020, "raw/county/mississippi_county_mhi_acs_2016-20.csv")

write_csv(miss_income_race, "processed/county/mississippi_county_race_2020_mhi_1620.csv")

Write summary tables to main_data/output/summary_tables

write_csv(income_race_stats, "output/summary_tables/county/mississippi_income_race_county_summary.csv")

Write plots to main_data/output/plots

ggsave("output/plots/county/mississippi_county_mhi_diff_plot.png",
       plot = ms_income_diff,
       units = "in",
       height = 5, width = 7)

ggsave("output/plots/county/mississippi_county_income_race_scatter.png",
       plot = ms_income_diff,
       units = "in",
       height = 5, width = 7)

ACS - import mhi for all counties

Import median household income for all counties in the country. Create a chart to compare MHI and look at the margin of error.

# import mhi estimate for all counties
raw_mhi_2020_all_states <- get_acs(geography = "county", 
                        variables = c(mhi = "B19013_001"), 
                        year = 2020)

GEOID	NAME	variable	estimate	moe
01001	Autauga County, Alabama	mhi	57982	4839
01003	Baldwin County, Alabama	mhi	61756	2268
01005	Barbour County, Alabama	mhi	34990	2909
01007	Bibb County, Alabama	mhi	51721	6237
01009	Blount County, Alabama	mhi	48922	2269

ACS example: process ACS mhi data for plot

separate(): divide column into multiple multiple columns

separate(col, into, sep = ““)

## create data frame with county and state column
## calculate margin of error as proportion of value
mhi <- raw_mhi_2020_all_states %>% 
  rename(mhi20 = estimate,
         mhi_moe = moe,) %>% 
  separate(NAME, into = c("county", "state"), sep = ",") %>% 
  select(GEOID,county, state, mhi20, mhi_moe) %>% 
  mutate(moe_ratio = round(mhi_moe/mhi20, 4))

GEOID	county	state	mhi20	mhi_moe	moe_ratio
01001	Autauga County	Alabama	57982	4839	0.0835
01003	Baldwin County	Alabama	61756	2268	0.0367

Count plot

First we’ll compare the range in MHI in each state with a count plot. A count plot is a scatterplot with categories as one of the variables. You can see the range of values for each category, and compare them.

Count plot, code

ggplot(data = mhi, 
           aes(x=state, 
               y=mhi20)) + 
  geom_count(alpha=0.5, size = 2, color = "green4") +
  scale_y_continuous(labels = dollar_format()) +
  theme(axis.title=element_blank(),
        axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) +
  labs(title = "Estimated Median Household Income by County",
       caption = "Source: American Community Survey, 2016-20")

Count plot - to explore margin of error

You should always look at the margin of error and assess whether the estimate is good enough for you to keep in your analysis. See the margin of error, sorted by highest proportion). The margin of error is VERY high in some counties.

GEOID	county	state	mhi20	mhi_moe	moe_ratio
28055	Issaquena County	Mississippi	28333	21120	0.7454
48271	Kinney County	Texas	39972	26645	0.6666
48017	Bailey County	Texas	55038	35559	0.6461
47095	Lake County	Tennessee	34230	19077	0.5573
20067	Grant County	Kansas	60625	29159	0.4810
48127	Dimmit County	Texas	25996	12247	0.4711
35011	De Baca County	New Mexico	31532	13343	0.4232
48033	Borden County	Texas	83281	34136	0.4099
32011	Eureka County	Nevada	67478	27402	0.4061
30069	Petroleum County	Montana	40000	16091	0.4023
16023	Butte County	Idaho	37404	14947	0.3996
48173	Glasscock County	Texas	74375	29366	0.3948
13309	Wheeler County	Georgia	28864	11145	0.3861
49031	Piute County	Utah	29125	11083	0.3805
48261	Kenedy County	Texas	40083	14958	0.3732
49009	Daggett County	Utah	74911	27810	0.3712

Count plot, code - to explore margin of error

ggplot(data = mhi, 
       aes(x=state, 
           y=mhi_moe)) + 
  geom_count(alpha=0.5, size = 2, color = "green4") +
  scale_y_continuous(labels = dollar_format()) +
  theme(axis.title=element_blank(),
        axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) +
  labs(title = "Margin of Error, Median Household Income by County",
       caption = "Source: American Community Survey, 2016-20")

Count plot - to explore margin of error

Count plot - to explore margin of error as proportion

Count plot EXCLUDING HIGH MARGIN OF ERROR

There is not a rule about margin or error. In general, you get to know your data, and determine what margin of error you can tolerate depending on the question you are asking.

In this case, I have a high tolerance since I am just getting a snapshot of the country with a chart. I think that if the margin of error is higher than 1/5th of the MHI, the estimate is not precise enough to include.

mhi_final <- mhi %>% 
  filter(moe_ratio < .2)

In-class 1 - create data and plot of educational attainment

Download the educational attainment data from the American Community Survey and create a dataframe and plot of the percent of people with at least a Bachelors Degree in each state. See instructions on this and the next 3 slides:

If you haven’t already, create a new R project in existing folder: main_data
create a new script, save it in main_data/scripts/data_processing as edicational_attainment_by_state

In-class 2 - find EDUCATIONAL ATTAINMENT data

Use the load_variables() function to look at the variables in the available variables in the 5-year ACS. There are a lot of them!
- acs201620 <- load_variables(2020, "acs5", cache = T)
Open the dataframe and search the variables for ‘educational attainment’
- There will be a lot of them, find the variables where: CONCEPT = “EDUCATIONAL ATTAINMENT FOR THE POPULATION 25 YEARS AND OVER” without any other words
- Look at the variable names - they are in the format Table_variable
- ex. B15003_002:
  - table = B15003
  - variable = 002 (No schooling completed)

In-class 2 - find EDUCATIONAL ATTAINMENT data

Create a dataframe of variable names with only the variables in the EDUCATIONAL ATTAINMENT FOR THE POPULATION 25 YEARS AND OVER table
use str_detect() to filter rows with “B15003” in the name
- str_detect(column, “text to filter based on”)

acs_vars <- load_variables(2020, "acs5", cache = T)

b15003_vars <- acs_vars %>% 
  filter(str_detect(name, "B15003"))

name	label	concept	geography
B15003_001	Estimate!!Total:	EDUCATIONAL ATTAINMENT FOR THE POPULATION 25 YEARS AND OVER	block group
B15003_002	Estimate!!Total:!!No schooling completed	EDUCATIONAL ATTAINMENT FOR THE POPULATION 25 YEARS AND OVER	block group
B15003_003	Estimate!!Total:!!Nursery school	EDUCATIONAL ATTAINMENT FOR THE POPULATION 25 YEARS AND OVER	block group
B15003_004	Estimate!!Total:!!Kindergarten	EDUCATIONAL ATTAINMENT FOR THE POPULATION 25 YEARS AND OVER	block group

In-class 3 - import EDUCATIONAL ATTAINMENT data for all states and plot

The b15003_vars dataframe is a more manageable way to look at the variables about educational attainment.

Use it to identify the variables you need to calculate the percent of people with at least a bachelors degree
Import those variables for every state using the get_acs() function
Process the data to create a tidy dataframe with the percent bachelors degree
Create a bar chart to display the percent of people with at least a bachelors degree for every state, ordered from lowest to highest
write out the raw data, processed data, and bar chart to the correct folders in your main_data folder

Homework 7a.

Clean up your script from the in-class assignment processing educational attainment data. Submit the script on canvas.

Homework 7b.

Download census data to answer a research question about access to home ownership in cities across the country inspired by this article about the decline in access to home ownership in San Diego from the New York Times.

When Ms. Coats moved into the Baxter Street house, a family needed right around the area’s median income to afford the $82 monthly mortgage payment — the definition of middle class. Today a typical Clairemont home costs $850,000, up 30 percent from 2019. A family would need to make about double San Diego’s median income to afford one.

Homework 7b overview:

Read the article
We’re going to do a back-of-the-envelope calculation to determine which cities the “average” household could afford to buy the “average” home.
We will use:
- the general rule of thumb that you can afford a house that is 2.5 times your income source
- Median Household Income from ACS 2016-20 to identify the “average” household
- Median owner-occupied property value from ACS 2016-20 to identify the “average” home

Homework 7b steps:

In your main_data/scripts/data_processing create a new script called housing_affordability_variables_by_place
Download the following variables by “place” (aka city) from the 2016-20 ACS 5-yr
- median household income (MHI)
- median owner-occupied property value (MPV)
- total population
Create a tidy dataframe with all 3 variables for every “place” in the country with population > 50,000
Create a new column with the price of home that a household making the median income could afford in that city (2.5 * mhi)
Determine in which cities the median household can afford to buy the median home
Create a plot to show them
Import at least one additional dataset (think poverty, race, educational attainment)
Create a plot to explore the characteristics of cities
write out the raw dataframes to main_data/raw/place and the processed dataframe to main_data/processed/place
write out your plotsto main_data/output/plots
submit your script on Canvas

Methods 1, Week 7

Outline

File structure

Homework questions and overview

US Census Race/Ethnicity variables

ACS data

In-class exercise

Homework

File structure

main_data folder

project_template folder

Homework: select census variables

Homework: tidy race/ethnicity data

Homework: import ACS data

Homework: create summary stats

Defining colors in R: hex codes

Defining colors in R: Color names

Homework: plot mhi and percent bipoc

Homework: code mhi and percent bipoc

Homework: calculate difference from average MHI

Homework: bar chart of deviation from average

Homework: code, bar chart of deviation from average

Write to main_data folder

ACS - import mhi for all counties

ACS example: process ACS mhi data for plot

Count plot

Count plot, code

Count plot - to explore margin of error

Count plot, code - to explore margin of error

Count plot - to explore margin of error

Count plot - to explore margin of error as proportion

Count plot EXCLUDING HIGH MARGIN OF ERROR

In-class 1 - create data and plot of educational attainment

In-class 2 - find EDUCATIONAL ATTAINMENT data

In-class 2 - find EDUCATIONAL ATTAINMENT data

In-class 3 - import EDUCATIONAL ATTAINMENT data for all states and plot

Homework 7a.

Homework 7b.

Homework 7b overview:

Homework 7b steps: