Methods 1, Week 7

Outline

  • File structure

  • Homework questions and overview

  • US Census Race/Ethnicity variables

  • ACS data

  • In-class exercise

  • Homework

File structure

File organization is very important for being able to pick up a project later. So far we’ve been practicing and learning R. Starting today, we’ll start organizing our files so we can find them later.

main_data folder

The main_data folder will store all of the census data that we process and any other data we might want to use again.

  • Create a new R project in the main_data folder.
  • We’ll always work in this project when we process and explore data.

project_template folder

You will create a project folder for any projects that you work on (like your final project!). We’ll get into this in more detail later.

Homework: select census variables

library(tidyverse)
library(tidycensus)
library(scales)
library(viridis)
options(scipen = 999)

# load a list of all decennial census data from the Redistricting File
pl_vars_2020 <- load_variables(2020, "pl", cache = T) 

# load a list of all acs variables
acs_vars_2020 <- load_variables(2020, "acs5", cache = T)

# Total Population - P1_001N
# Percent Hispanic or Latino - P2_002N/P1_001N
# Percent White alone, not Hispanic or Latino - P2_005N/P1_001N
# Percent Black alone, not Hispanic or Latino - P2_006N/P1_001N
# Percent Asian alone, not Hispanic or Latino - P2_008N/P1_001N

# import population by race for each county in Mississippi
raw_race_2020 = get_decennial(geography = "county", 
                            variables=c("P1_001N", "P2_002N", "P2_005N", "P2_006N", 
                                        "P2_007N", "P2_008N", "P2_009N", "P2_010N"),
                            state='MS', 
                            geometry = FALSE, 
                            year = 2020)

Homework: tidy race/ethnicity data


#########  Create Tidy Data ############

pop20 <- raw_race_2020 %>%
  pivot_wider(names_from = variable, values_from = value) %>% 
  rename(pop20 = P1_001N, #  !!Total:, universe =   RACE
         hisp_pop20 = P2_002N, #  Hispanic or Latino
         black_pop20 = P2_006N, #  Not Hispanic or Latino:!!Population of one race:!!Black or African American alone
         asian_pop20 = P2_008N, #  Not Hispanic or Latino:!!Population of one race:!!Asian alone
         white_pop20 = P2_005N, # Not Hispanic or Latino:!!Population of one race:!!White alone
         native_pop20 = P2_007N, #  Not Hispanic or Latino:!!Population of one race:!!American Indian and Alaska Native alone
         haw_pi_pop20 = P2_009N, #  Not Hispanic or Latino:!!Population of one race:!!Native Hawaiian and Other Pacific Islander alone
         other_pop20 = P2_010N) %>% #  Not Hispanic or Latino:!!Population of one race:!!Some Other Race alone
  mutate(bipoc_pop20 = pop20 - white_pop20,
         pct_hisp= round(hisp_pop20/pop20, 3), 
         pct_white_alone_not_hisp = round(white_pop20/pop20, 3), 
         pct_black_alone_not_hisp = round(black_pop20/pop20, 3), 
         pct_asian_alone_not_hisp = round(asian_pop20/pop20, 3), 
         pct_indiginous_alone_not_hisp = round(native_pop20/pop20, 3), 
         pct_haw_pi_alone_not_hisp = round(haw_pi_pop20/pop20, 3), 
         pct_other_alone_not_hisp = round(other_pop20/pop20, 3), 
         pct_bipoc_20 = round(bipoc_pop20/pop20, 3))

Homework: import ACS data

get_acs() defaults to 5-year data

# import mhi estimate for Mississippi counties
raw_mhi_2020 <- get_acs(geography = "county", 
                        variables = c(mhi = "B19013_001"), 
                        state = "MS",
                        year = 2020)

## create data frame with estimate, margin of error, and GEOID for joining
mhi <- raw_mhi_2020 %>% 
  rename(mhi20 = estimate,
         mhi_moe = moe) %>% 
  select(GEOID, mhi20, mhi_moe)

# join data
miss_income_race <- pop20 %>% 
  left_join(mhi, by = "GEOID")
GEOID NAME pop20 hisp_pop20 white_pop20 black_pop20 native_pop20 asian_pop20 haw_pi_pop20 other_pop20 bipoc_pop20 pct_hisp pct_white_alone_not_hisp pct_black_alone_not_hisp pct_asian_alone_not_hisp pct_indiginous_alone_not_hisp pct_haw_pi_alone_not_hisp pct_other_alone_not_hisp pct_bipoc_20 mhi20 mhi_moe
28001 Adams County, Mississippi 29538 1012 10926 16709 56 165 7 35 18612 0.034 0.370 0.566 0.006 0.002 0.000 0.001 0.630 30633 1459
28003 Alcorn County, Mississippi 34740 1248 27738 4316 61 180 19 50 7002 0.036 0.798 0.124 0.005 0.002 0.001 0.001 0.202 40938 2944

Homework: create summary stats

######### Create summary stats 
## using the percent function from the scales package to format nicely
income_race_stats <- miss_income_race %>% 
  summarise(Counties = n(),
            `Average Median Household Income` = dollar(mean(mhi20)),
            `Percent Latinx` = percent(sum(hisp_pop20)/sum(pop20)),
            `Percent White Alone, Not Latinx` = percent(sum(white_pop20)/sum(pop20)),
            `Percent Black Alone, Not Latinx` = percent(sum(black_pop20)/sum(pop20)),
            `Percent Asian Alone, Not Latinx` = 
              percent(sum(asian_pop20)/sum(pop20)),
            `Percent Other Alone, Not Latinx` = percent(sum(other_pop20)/sum(pop20)),
            `Percent BIPOC` = percent(sum(bipoc_pop20)/sum(pop20)))
Counties Average Median Household Income Percent Latinx Percent White Alone, Not Latinx Percent Black Alone, Not Latinx Percent Asian Alone, Not Latinx Percent Other Alone, Not Latinx Percent BIPOC
82 $40,818.78 4% 55% 36% 1% 0% 45%

Defining colors in R: hex codes

Hex colors are 6-digit way to represent a color that is common in web development. You can pick colors and find their hex codes here:

Defining colors in R: Color names

R also recognizes hundreds of predefined color names like “blue” and “ivory”. You can find a full list of them here: ggplot colors

Homework: plot mhi and percent bipoc

Homework: code mhi and percent bipoc

ms_scatter <- ggplot(data = miss_income_race,
         aes(x = pct_bipoc_20, y = mhi20, 
             size = pop20, color = pct_bipoc_20)) +
  geom_point(alpha = .85) +
  # adjust color range
  scale_color_viridis(direction = -1,
                      name = "Percent BIPOC",
                      labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = dollar_format(accuracy = 1)) +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  # change legend label formatting
  scale_size_area(labels = comma, max_size = 10) +
  labs(x = "Percent BIPOC Population", y = "Estimated Median Household Income",
       title = "Counties in Mississippi, Median Household Income and Percent BIPOC",
       caption = "Sources: US Census, 2020; American Community Survey",
       # add nice label for size element
       size = "Population",
       color = "Percent BIPOC Population") +
  theme_bw()

ms_scatter

Homework: calculate difference from average MHI


Use the average median household income value from summary stats table to calculate difference from average

Counties Average Median Household Income Percent Latinx Percent White Alone, Not Latinx Percent Black Alone, Not Latinx Percent Asian Alone, Not Latinx Percent Other Alone, Not Latinx Percent BIPOC
82 $40,818.78 4% 55% 36% 1% 0% 45%


miss_income_race_diff <- miss_income_race %>% 
  mutate(County = gsub(" County, Mississippi", "", NAME),
           mhi_diff_from_avg = mhi20 - 40819)  


gsub(): function to replace text, use it in a mutate

gsub(pattern, replacement, x)

Homework: bar chart of deviation from average

Homework: code, bar chart of deviation from average

scale_fill_gradient(): creates a gradient palette based on two colors

  • define low and high colors
ms_income_diff <- ggplot(data = miss_income_race_diff, 
       aes(x=mhi_diff_from_avg,
           y= reorder(County, mhi_diff_from_avg),
           fill = pct_bipoc_20)) + 
  geom_col(width = 1) +
  # create color palette with hex code for high and low value
  scale_fill_gradient(low = "#ffffff", 
                      high = "#5e1b81",
                      name = "Percent BIPOC",
                      labels = percent_format(accuracy = 1)) + 
  scale_x_continuous(labels = dollar_format(accuracy = 1)) +
  labs(x = "Median Household Income, Difference from Average", 
       y = "",
       title= "Mississippi Counties, Median Household Income ",
       subtitle="Difference from Average ($49,230)") +
  theme_bw() +
  theme(axis.text.y = element_text(size = 5))  # change the text size for y-axis labels

ms_income_diff

Write to main_data folder

Write data to raw and processed folder in main_data

write_csv(raw_race_2020, "raw/county/mississippi_county_race_pl_selected_vars_2020.csv")

write_csv(raw_mhi_2020, "raw/county/mississippi_county_mhi_acs_2016-20.csv")

write_csv(miss_income_race, "processed/county/mississippi_county_race_2020_mhi_1620.csv")


Write summary tables to main_data/output/summary_tables

write_csv(income_race_stats, "output/summary_tables/county/mississippi_income_race_county_summary.csv")


Write plots to main_data/output/plots

ggsave("output/plots/county/mississippi_county_mhi_diff_plot.png",
       plot = ms_income_diff,
       units = "in",
       height = 5, width = 7)

ggsave("output/plots/county/mississippi_county_income_race_scatter.png",
       plot = ms_income_diff,
       units = "in",
       height = 5, width = 7)

ACS - import mhi for all counties


Import median household income for all counties in the country. Create a chart to compare MHI and look at the margin of error.

# import mhi estimate for all counties
raw_mhi_2020_all_states <- get_acs(geography = "county", 
                        variables = c(mhi = "B19013_001"), 
                        year = 2020)


GEOID NAME variable estimate moe
01001 Autauga County, Alabama mhi 57982 4839
01003 Baldwin County, Alabama mhi 61756 2268
01005 Barbour County, Alabama mhi 34990 2909
01007 Bibb County, Alabama mhi 51721 6237
01009 Blount County, Alabama mhi 48922 2269

ACS example: process ACS mhi data for plot


separate(): divide column into multiple multiple columns

  • separate(col, into, sep = ““)
## create data frame with county and state column
## calculate margin of error as proportion of value
mhi <- raw_mhi_2020_all_states %>% 
  rename(mhi20 = estimate,
         mhi_moe = moe,) %>% 
  separate(NAME, into = c("county", "state"), sep = ",") %>% 
  select(GEOID,county, state, mhi20, mhi_moe) %>% 
  mutate(moe_ratio = round(mhi_moe/mhi20, 4))


GEOID county state mhi20 mhi_moe moe_ratio
01001 Autauga County Alabama 57982 4839 0.0835
01003 Baldwin County Alabama 61756 2268 0.0367

Count plot

First we’ll compare the range in MHI in each state with a count plot. A count plot is a scatterplot with categories as one of the variables. You can see the range of values for each category, and compare them.

Count plot, code

ggplot(data = mhi, 
           aes(x=state, 
               y=mhi20)) + 
  geom_count(alpha=0.5, size = 2, color = "green4") +
  scale_y_continuous(labels = dollar_format()) +
  theme(axis.title=element_blank(),
        axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) +
  labs(title = "Estimated Median Household Income by County",
       caption = "Source: American Community Survey, 2016-20")

Count plot - to explore margin of error

You should always look at the margin of error and assess whether the estimate is good enough for you to keep in your analysis. See the margin of error, sorted by highest proportion). The margin of error is VERY high in some counties.

GEOID county state mhi20 mhi_moe moe_ratio
28055 Issaquena County Mississippi 28333 21120 0.7454
48271 Kinney County Texas 39972 26645 0.6666
48017 Bailey County Texas 55038 35559 0.6461
47095 Lake County Tennessee 34230 19077 0.5573
20067 Grant County Kansas 60625 29159 0.4810
48127 Dimmit County Texas 25996 12247 0.4711
35011 De Baca County New Mexico 31532 13343 0.4232
48033 Borden County Texas 83281 34136 0.4099
32011 Eureka County Nevada 67478 27402 0.4061
30069 Petroleum County Montana 40000 16091 0.4023
16023 Butte County Idaho 37404 14947 0.3996
48173 Glasscock County Texas 74375 29366 0.3948
13309 Wheeler County Georgia 28864 11145 0.3861
49031 Piute County Utah 29125 11083 0.3805
48261 Kenedy County Texas 40083 14958 0.3732
49009 Daggett County Utah 74911 27810 0.3712

Count plot, code - to explore margin of error

ggplot(data = mhi, 
       aes(x=state, 
           y=mhi_moe)) + 
  geom_count(alpha=0.5, size = 2, color = "green4") +
  scale_y_continuous(labels = dollar_format()) +
  theme(axis.title=element_blank(),
        axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) +
  labs(title = "Margin of Error, Median Household Income by County",
       caption = "Source: American Community Survey, 2016-20")

Count plot - to explore margin of error

Count plot - to explore margin of error as proportion

Count plot EXCLUDING HIGH MARGIN OF ERROR

There is not a rule about margin or error. In general, you get to know your data, and determine what margin of error you can tolerate depending on the question you are asking.

In this case, I have a high tolerance since I am just getting a snapshot of the country with a chart. I think that if the margin of error is higher than 1/5th of the MHI, the estimate is not precise enough to include.

mhi_final <- mhi %>% 
  filter(moe_ratio < .2)

In-class 1 - create data and plot of educational attainment

Download the educational attainment data from the American Community Survey and create a dataframe and plot of the percent of people with at least a Bachelors Degree in each state. See instructions on this and the next 3 slides:

  • If you haven’t already, create a new R project in existing folder: main_data
  • create a new script, save it in main_data/scripts/data_processing as edicational_attainment_by_state

In-class 2 - find EDUCATIONAL ATTAINMENT data

  • Use the load_variables() function to look at the variables in the available variables in the 5-year ACS. There are a lot of them!
    • acs201620 <- load_variables(2020, "acs5", cache = T)
  • Open the dataframe and search the variables for ‘educational attainment’
    • There will be a lot of them, find the variables where: CONCEPT = “EDUCATIONAL ATTAINMENT FOR THE POPULATION 25 YEARS AND OVER” without any other words
    • Look at the variable names - they are in the format Table_variable
    • ex. B15003_002:
      • table = B15003
      • variable = 002 (No schooling completed)

In-class 2 - find EDUCATIONAL ATTAINMENT data

  • Create a dataframe of variable names with only the variables in the EDUCATIONAL ATTAINMENT FOR THE POPULATION 25 YEARS AND OVER table
  • use str_detect() to filter rows with “B15003” in the name
    • str_detect(column, “text to filter based on”)
acs_vars <- load_variables(2020, "acs5", cache = T)

b15003_vars <- acs_vars %>% 
  filter(str_detect(name, "B15003"))


name label concept geography
B15003_001 Estimate!!Total: EDUCATIONAL ATTAINMENT FOR THE POPULATION 25 YEARS AND OVER block group
B15003_002 Estimate!!Total:!!No schooling completed EDUCATIONAL ATTAINMENT FOR THE POPULATION 25 YEARS AND OVER block group
B15003_003 Estimate!!Total:!!Nursery school EDUCATIONAL ATTAINMENT FOR THE POPULATION 25 YEARS AND OVER block group
B15003_004 Estimate!!Total:!!Kindergarten EDUCATIONAL ATTAINMENT FOR THE POPULATION 25 YEARS AND OVER block group

In-class 3 - import EDUCATIONAL ATTAINMENT data for all states and plot


The b15003_vars dataframe is a more manageable way to look at the variables about educational attainment.

  • Use it to identify the variables you need to calculate the percent of people with at least a bachelors degree
  • Import those variables for every state using the get_acs() function
  • Process the data to create a tidy dataframe with the percent bachelors degree
  • Create a bar chart to display the percent of people with at least a bachelors degree for every state, ordered from lowest to highest
  • write out the raw data, processed data, and bar chart to the correct folders in your main_data folder

Homework 7a.

Clean up your script from the in-class assignment processing educational attainment data. Submit the script on canvas.

Homework 7b.


Download census data to answer a research question about access to home ownership in cities across the country inspired by this article about the decline in access to home ownership in San Diego from the New York Times.


When Ms. Coats moved into the Baxter Street house, a family needed right around the area’s median income to afford the $82 monthly mortgage payment — the definition of middle class. Today a typical Clairemont home costs $850,000, up 30 percent from 2019. A family would need to make about double San Diego’s median income to afford one.

Homework 7b overview:

  • Read the article
  • We’re going to do a back-of-the-envelope calculation to determine which cities the “average” household could afford to buy the “average” home.
  • We will use:
    • the general rule of thumb that you can afford a house that is 2.5 times your income source
    • Median Household Income from ACS 2016-20 to identify the “average” household
    • Median owner-occupied property value from ACS 2016-20 to identify the “average” home

Homework 7b steps:

  • In your main_data/scripts/data_processing create a new script called housing_affordability_variables_by_place
  • Download the following variables by “place” (aka city) from the 2016-20 ACS 5-yr
    • median household income (MHI)
    • median owner-occupied property value (MPV)
    • total population
  • Create a tidy dataframe with all 3 variables for every “place” in the country with population > 50,000
  • Create a new column with the price of home that a household making the median income could afford in that city (2.5 * mhi)
  • Determine in which cities the median household can afford to buy the median home
  • Create a plot to show them
  • Import at least one additional dataset (think poverty, race, educational attainment)
  • Create a plot to explore the characteristics of cities
  • write out the raw dataframes to main_data/raw/place and the processed dataframe to main_data/processed/place
  • write out your plotsto main_data/output/plots
  • submit your script on Canvas