Methods 1, Week 5

Download folder for class 5 data

Download the data in the class5 folder and save it in the ny_poverty_analysis/data/raw folder.

Open project:

  • methods1/class3/ny_poverty_analysis

Install the scales package

  • install.packages("scales"): for formatting text

Outline

  • County joins questions

  • Homework review

  • Visualization

    • Iterating your way to beauty and understanding ggplot2

  • County, school district poverty analysis continues

  • Assignment 5

    • Describe your visualization results

    • New York county visualizations

County data joins

library(tidyverse)
library(readxl)

# import raw county data
raw_atms <- read_csv("data/raw/Bank-Owned_ATM_Locations_in_New_York_State.csv")
raw_lottery <- read_csv("data/raw/NYS_Lottery_Retailers.csv")
raw_asthma <- read_excel("data/raw/Asthma-SubCountyData.xlsx", 
                         sheet = "AD21", skip = 6)

# import our processed county dataset
county_pov <- read_csv("data/processed/county_pov_rate_2019.csv")

# process atm data, number of atms per county
atms_by_county <- raw_atms %>% 
  group_by(County) %>% 
  summarise(atms = n()) %>% 
  mutate(County = paste0(County, " County"))

# process lottery data - count of lottery retailers per county
lottery_count <- raw_lottery %>% 
  group_by(County, GEOID) %>% 
  summarise(lottery_retailers = n(), .groups = "drop") %>% 
  mutate(GEOID = as.numeric(GEOID))

# process asthma data - total asthma hospitalizations per county (the rate per 10,000 people is calculated in the join below)
asthma <- raw_asthma %>% 
  group_by(County) %>% 
  summarise(asthma_hospitalizations = sum(Numerator)) %>% 
  mutate(County = paste0(County, " County"))

county_data <- county_pov %>% 
  left_join(atms_by_county, by = c("COUNTY" = "County")) %>% 
  mutate(banks_per10k = atms/county_pop*10000) %>% 
  left_join(lottery_count, by = "GEOID")  %>%
  mutate(lottery_per10k = lottery_retailers/county_pop*10000) %>%
  left_join(asthma, by = c("COUNTY" = "County")) %>% 
  mutate(asthma_per10k = asthma_hospitalizations/county_pop*10000) %>% 
  select(-County)

write_csv(county_data, "data/processed/county_all_data_2019.csv")
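
A common question with joins like these is whether every county actually found a match. The following is a minimal sketch, not part of the class script: the data frames `pov_toy` and `atms_toy` and all of their values are made up for illustration. It shows how `anti_join()` can list the rows that have no match in the other table, for example because a county name is spelled differently.

```r
library(dplyr)

# Hypothetical toy data, for illustration only (not the class data)
pov_toy <- tibble(COUNTY = c("Albany County", "Bronx County", "St. Lawrence County"),
                  county_pop = c(300000, 1400000, 100000))
atms_toy <- tibble(County = c("Albany County", "Bronx County", "Saint Lawrence County"),
                   atms = c(120, 400, 30))

# keep the rows of pov_toy that have no matching county name in atms_toy
unmatched <- pov_toy %>% 
  anti_join(atms_toy, by = c("COUNTY" = "County"))

unmatched$COUNTY
```

If the result has zero rows, every county matched. Any rows that do appear are the ones that would get NA values after a left_join().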

Homework


Electoral Votes = 538

  • Seats in the U.S. House of Representatives = 435
  • Seats in the Senate = 100
  • D.C. Electoral Votes = 3

Seats in the U.S. House of Representatives

  • allocated to each state by population
    • U.S. population (2020) ~ 331 million
    • Each House District ~ 761,000

Seats in the U.S. Senate

  • each state has 2 Senators, regardless of population
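
The arithmetic behind these numbers can be checked in R. This is a small sketch that uses only the rounded figures listed above; the variable names are chosen here for illustration.

```r
# electoral votes = House seats + Senate seats + D.C. electoral votes
house_seats <- 435
senate_seats <- 100
dc_votes <- 3
electoral_votes <- house_seats + senate_seats + dc_votes
electoral_votes

# approximate people per House district, using the rounded 2020 population
us_pop <- 331000000
people_per_district <- us_pop / house_seats
round(people_per_district, -3) # round to the nearest thousand
```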

Homework script

library(tidyverse)
library(readxl)

## remove scientific notation
options(scipen = 999)

# import apportionment and race/ethnicity data
raw_apportion <- read_excel("data/raw/apportionment-2020-table01.xlsx", 
                                                      skip = 3)
raw_race <- read_csv("data/raw/DECENNIALPL2020.P2_Hispanic_Latino_by_race/DECENNIALPL2020.P2_data.csv")

# process race data
race <- raw_race %>% 
  mutate(percent_latinx = P2_003N/P2_001N,
         percent_white = P2_005N/P2_001N,
         percent_bipoc = 1 - percent_white) %>% 
  select(GEO_ID, NAME, percent_latinx, percent_white, percent_bipoc)

# process apportionment
apportion <- raw_apportion %>% 
  select(GEO_ID, STATE, `POPULATION`,
         `APPORTIONED REPRESENTATIVES`) %>% 
  rename(pop = `POPULATION`,
         representatives = `APPORTIONED REPRESENTATIVES`) %>% 
  mutate(electoral_votes = representatives + 2,
         pop_per_electoral_vote = round(pop/electoral_votes, 0)) %>% 
  full_join(race, by = "GEO_ID")
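
To see what pop_per_electoral_vote measures, here is a minimal sketch with two hypothetical states. The state names and numbers are invented for illustration and are not Census figures. Because every state gets 2 electoral votes for its Senators regardless of population, the smaller state ends up with far fewer people per electoral vote.

```r
library(dplyr)

# Hypothetical states and numbers, for illustration only
toy_apportion <- tibble(STATE = c("Small State", "Large State"),
                        pop = c(600000, 39000000),
                        representatives = c(1, 52)) %>% 
  mutate(electoral_votes = representatives + 2, # House seats + 2 Senators
         pop_per_electoral_vote = round(pop/electoral_votes, 0))

toy_apportion
```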

Homework plot

plot(apportion$percent_white, apportion$pop_per_electoral_vote)

Visualization as part of analysis


Visualization is a tool to:

  • explore our datasets
  • check results
  • share with colleagues
  • share final analysis results

ggplot scatterplot

ggplot code

library(tidyverse)
library(scales)

# use ggplot
ggplot(apportion, aes(x = percent_white, y = pop_per_electoral_vote)) +
  geom_point() + 
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = comma) + 
  labs(x = "Percent White", y = "People per Electoral Vote",
       title = "Race and Electoral Power",
       caption = "Source: U.S. Census, 2020")

ggplot2

Tidyverse package for producing statistical graphics

Every ggplot has 3 key components:

  • data: the information you want to visualize
  • aesthetic mappings that indicate how to visualize the data’s variables
    • examples: color, size
  • at least one layer to display the data
    • examples: points, bars, lines

and we often add:

  • theme elements to control other display elements
    • examples: font, background color
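
As a rough illustration of the components listed above, here is a sketch that uses R's built-in mtcars dataset rather than the class data. The choice of dataset, variables, and theme is an assumption made only for this example.

```r
library(ggplot2)

p <- ggplot(data = mtcars,           # data
            aes(x = wt, y = mpg)) +  # aesthetic mappings
  geom_point() +                     # layer: points
  theme_minimal()                    # theme elements

# print(p) draws the plot
```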

ggplot scatterplot example

ggplot(apportion, # data
       aes(x = percent_white, 
           y = pop_per_electoral_vote)) + # aesthetics
  geom_point() # point layer

ggplot line example

A line graph doesn’t make sense for this data, but as an example:

  • the layer type determines how you display the data

ggplot(data = apportion, # data
       aes(x = percent_white, 
           y = pop_per_electoral_vote)) + # aesthetics
  geom_line() # line layer

ggplot scatterplot example

ggplot(data = apportion, 
       aes(x = percent_white, 
           y = pop_per_electoral_vote)) +
  geom_point() + 
  scale_y_continuous(labels = comma) + # y-axis labels, format as numbers with commas
  scale_x_continuous(labels = percent_format(accuracy = 1)) + # x-axis labels
  labs(x = "Percent White", y = "People per Electoral Vote",
       title = "Race and Electoral Power",
       subtitle = "One person, one vote means the number of people per electoral vote should be the same for each state",
       caption = "Source: U.S. Census, 2020") # titles

Iterating your way to beauty with ggplot

Analysis graphics should follow a simple, iterative workflow:

  1. Create a basic plot
  2. Address any missing values
  3. Clean up formatting of chart elements
  4. Add a new layer of data
  5. Tidy up formatting
  6. Repeat steps 4-5 as needed
  7. Save output
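
The workflow above could look something like this in code. This is a sketch only: the data frame `toy` and its values are hypothetical, and "add a new layer of data" is interpreted here as mapping one more variable (enrollment) to the size of the dots.

```r
library(ggplot2)
library(scales)

# Hypothetical toy data, one row per district
toy <- data.frame(pov_rate = c(0.05, 0.12, NA, 0.30),
                  pct_bipoc = c(0.10, 0.25, 0.40, 0.70),
                  enrollment = c(500, 1200, 800, 3000))

# 1. create a basic plot
p <- ggplot(toy, aes(x = pov_rate, y = pct_bipoc)) + geom_point()

# 2. address missing values: here, drop rows without a poverty rate
toy_complete <- toy[!is.na(toy$pov_rate), ]
p <- ggplot(toy_complete, aes(x = pov_rate, y = pct_bipoc)) + geom_point()

# 3. clean up formatting of chart elements
p <- p + scale_x_continuous(labels = percent_format(accuracy = 1))

# 4. add a new layer of data: size the dots by enrollment
p <- p + aes(size = enrollment)

# 5. tidy up formatting
p <- p + scale_size_area(labels = comma) + labs(size = "Enrollment")

# 7. save output with ggsave() once the plot looks right
```

Storing the plot in an object and adding to it with + is one way to iterate. Rewriting the full ggplot() call at each step works just as well.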

New York Poverty Analysis

Explore the level of economic inequality in school districts across New York State.

  • What is the difference between the student poverty rate in each school district and:
    • the poverty rate of the county as a whole?
    • the poverty rate of the state as a whole?
  • What counties have the most economic inequality, as measured by the student poverty rate of school districts?

Analysis plan

  • Create dataframe of poverty rate by county
  • Create dataframe of student poverty rate by school district
  • Calculate the statewide student poverty rate
  • Join the school district and county poverty dataframes to compare the poverty rates
  • Measure the difference in poverty rates of each school district and its county and the state
  • Use summary statistics to explore and gain understanding
  • Use visualizations to explore and gain understanding

Analysis so far

We have 3 scripts for our analysis so far:

  • 1_process_school_district_data_2019
  • 2_process_county_data_2019
  • 3_student_poverty_analysis

You’ll continue with ny_county_dataset.R for homework

Student Poverty Analysis script

In 3_student_poverty_analysis we’ll remove NYC and write out the data:

library(tidyverse)

# Import processed dataframes
county_pov <- read_csv("data/processed/county_all_data_2019.csv")
sd_pov <- read_csv("data/processed/school_district_student_pov_rate_2019.csv")

sd_county_pov <- sd_pov %>% 
  left_join(county_pov, by = c("CONUM"="GEOID")) %>% 
  select(district_id, district, County, CONUM, tpop, stpop, stpov, 
         stpovrate, county_pop, county_pov_rate, state_stpovrate) %>% 
  mutate(pov_diff_county = round(stpovrate - county_pov_rate, 3),
         pov_diff_state = round(stpovrate - state_stpovrate, 3)) 

# Calculate summary statistics of poverty rate
ny_pov_stats <- sd_county_pov %>%
  summarise(districts = n(),
            kids = sum(stpop),
            kids_in_pov = sum(stpov),
            stud_poverty_rate = kids_in_pov/kids,
            mean_sd_stpovrate = mean(stpovrate),
            max_sd_stpovrate = max(stpovrate), 
            min_sd_stpovrate = min(stpovrate),
            poverty_range = max_sd_stpovrate - min_sd_stpovrate)

ny_county_pov_stats <- sd_county_pov %>%
  group_by(County) %>% 
  summarise(districts = n(),
            kids = sum(stpop),
            kids_in_pov = sum(stpov),
            stud_poverty_rate = round(kids_in_pov/kids, 3),
            mean_sd_stpovrate = round(mean(stpovrate), 3),
            max_sd_stpovrate = round(max(stpovrate), 3),
            min_sd_stpovrate = round(min(stpovrate), 3),
            poverty_range = max_sd_stpovrate - min_sd_stpovrate) %>% 
  filter(County != "New York County")

# write out dataframes
write_csv(sd_county_pov, "data/processed/school_district_county_poverty_2019.csv")
write_csv(ny_county_pov_stats, "data/output/ny_county_poverty_stats.csv")

Visualization plan

What counties have the most economic inequality, as measured by the student poverty rate of school districts?

  • Add school district enrollment data for context
  • Create a scatterplot to explore the county with the largest range in student poverty
  • Create scatterplots to explore other counties
  • Create scatterplots to explore the state as a whole

Visualization script

Create a new script 4_visualize_poverty_analysis.R

  • add necessary packages
  • import data
  • join school district enrollment data

library(tidyverse)
library(scales)
library(viridis)

### Import the summary data
county_stats <- read_csv("data/output/ny_county_poverty_stats.csv")

# import some extra school district data
sd_enroll <- read_csv("data/raw/ny_sd_enrollment_2019.csv")

# import the school district - county poverty data and join the school district data
sd_county_pov <- read_csv("data/processed/school_district_county_poverty_2019.csv") %>% 
  left_join(sd_enroll, by = "district_id")

What county has the largest range in student poverty?
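
One way to answer this from the county summary table is to pick the row with the largest poverty_range. The sketch below uses a made-up table (`stats_toy`, with hypothetical county names and values) in place of county_stats, and assumes the same column names as the summary created in the analysis script.

```r
library(dplyr)

# Hypothetical county summary, for illustration only
stats_toy <- tibble(County = c("County A", "County B", "County C"),
                    min_sd_stpovrate = c(0.05, 0.10, 0.08),
                    max_sd_stpovrate = c(0.20, 0.45, 0.15)) %>% 
  mutate(poverty_range = max_sd_stpovrate - min_sd_stpovrate)

# keep the county with the largest range in student poverty
largest_range <- stats_toy %>% 
  slice_max(poverty_range, n = 1)

largest_range$County
```

Sorting with arrange(desc(poverty_range)) would show the full ranking instead of only the top county.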

Orange County scatterplot, v1

### Create a scatterplot to explore the county with the largest range in student poverty 
ggplot(data = sd_county_pov %>% 
         filter(County == "Orange County"), 
  aes(x = stpovrate, y = pct_bipoc)) +
  geom_point()

Orange County scatterplot, v2 script

Format the labels as percent, with no decimal place

  • use scale_x_continuous() and scale_y_continuous() to format the x-axis and y-axis
  • percent_format() is a scales package function
  • accuracy = 1 rounds to a whole number
    • accuracy = .1 includes one decimal place

### Create a scatterplot to explore the county with the largest range in student poverty 

ggplot(data = sd_county_pov %>% 
         filter(County == "Orange County"), 
  aes(x = stpovrate, y = pct_bipoc)) +
  geom_point() +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) 
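
To see what the accuracy argument does outside of a plot, the label functions can be called directly on a number. This is a small sketch with arbitrary example values; `whole_pct` and `one_decimal_pct` are names chosen here for illustration.

```r
library(scales)

# percent_format() returns a function that turns proportions into labels
whole_pct <- percent_format(accuracy = 1)
one_decimal_pct <- percent_format(accuracy = .1)

whole_pct(0.256)       # "26%"
one_decimal_pct(0.256) # "25.6%"

# comma() adds thousands separators
comma(12345)           # "12,345"
```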

Orange County scatterplot, v2 plot

Orange County scatterplot, v3 code

  • Within the aesthetic mapping aes(), size the dots by enrollment
  • Within geom_point(), make the dots 50% transparent with alpha = .5

ggplot(data = sd_county_pov %>% 
         filter(County == "Orange County"), 
  aes(x = stpovrate, 
      y = pct_bipoc,
      size = denroll_district)) +
  geom_point(alpha = .5) +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) 
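
The difference between the two settings in this version is where they go. A value that changes with the data (size by enrollment) goes inside aes(), while a value that is the same for every dot (alpha) goes outside aes(), as an argument to the layer. A minimal sketch with hypothetical toy data:

```r
library(ggplot2)

# Hypothetical toy data, for illustration only
toy <- data.frame(x = c(1, 2, 3),
                  y = c(2, 4, 3),
                  enrollment = c(100, 400, 900))

p <- ggplot(toy, aes(x = x, y = y,
                     size = enrollment)) + # mapped: varies with the data
  geom_point(alpha = .5)                   # set: same for every dot
```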

Orange County scatterplot, v3 plot

Orange County scatterplot, v4 code

  • Within the aesthetic mapping aes(), color the dots by “urbanicity category”

ggplot(data = sd_county_pov %>% 
         filter(County == "Orange County"), 
  aes(x = stpovrate, 
      y = pct_bipoc,
      size = denroll_district,
      color = urbanicity)) +
  geom_point(alpha = .5) +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) 

Orange County scatterplot, v4 plot

Orange County scatterplot, v5 code

Add formatted axis labels, title, and caption within the labs function

ggplot(data = sd_county_pov %>% 
         filter(County == "Orange County"), 
  aes(x = stpovrate, 
      y = pct_bipoc,
      size = denroll_district,
      color = urbanicity)) +
  geom_point(alpha = .5) +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) + 
  labs(x = "Student Poverty Rate", y = "Percent BIPOC",
       title = "Racial Diversity and Student Poverty in Orange County School Districts",
       caption = "Sources: NCES, 2019 and SAIPE, 2019") 

Orange County scatterplot, v5 plot

Orange County scatterplot, v6 code

Add a theme to add some standard styling

ggplot(data = sd_county_pov %>% 
         filter(County == "Orange County"), 
  aes(x = stpovrate, 
      y = pct_bipoc,
      size = denroll_district,
      color = urbanicity)) +
  geom_point(alpha = .5) +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) + 
  labs(x = "Student Poverty Rate", y = "Percent BIPOC",
       title = "Racial Diversity and Student Poverty in Orange County School Districts",
       caption = "Sources: NCES, 2019 and SAIPE, 2019") +
  theme_bw()

Orange County scatterplot, v6 plot

Orange County scatterplot, v7 code

Fix the legend

  • In scale_size_area(), format the enrollment numbers with commas
  • In labs(), add nice legend titles

ggplot(data = sd_county_pov %>% 
         filter(County == "Orange County"), 
  aes(x = stpovrate, 
      y = pct_bipoc,
      size = denroll_district,
      color = urbanicity)) +
  geom_point(alpha = .5) +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) + 
  scale_size_area(labels = comma) +
  labs(x = "Student Poverty Rate", y = "Percent BIPOC",
       title = "Racial Diversity and Student Poverty in Orange County School Districts",
       caption = "Sources: NCES, 2019 and SAIPE, 2019",
       size = "Enrollment",
       color = "Urbanicity") +
  theme_bw()

Orange County scatterplot, v7 plot

New York scatterplot code

Remove the filter to look at New York as a whole

ggplot(data = sd_county_pov, 
  aes(x = stpovrate, 
      y = pct_bipoc,
      size = denroll_district,
      color = urbanicity)) +
  geom_point(alpha = .5) +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) + 
  scale_size_area(labels = comma) +
  labs(x = "Student Poverty Rate", y = "Percent BIPOC",
       title = "Racial Diversity and Student Poverty in New York School Districts",
       caption = "Sources: NCES, 2019 and SAIPE, 2019",
       size = "Enrollment",
       color = "Urbanicity") +
  theme_bw()

New York scatterplot

New York scatterplot (no nyc) code

Remove New York City to see how it changes

ggplot(data = sd_county_pov %>% 
         filter(district != "New York City Department Of Education"), 
  aes(x = stpovrate, 
      y = pct_bipoc,
      size = denroll_district,
      color = urbanicity)) +
  geom_point(alpha = .5) +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) + 
  scale_size_area(labels = comma) +
  labs(x = "Student Poverty Rate", y = "Percent BIPOC",
       title = "Racial Diversity and Student Poverty in New York School Districts",
       subtitle = "Excluding New York City",
       caption = "Sources: NCES, 2019 and SAIPE, 2019",
       size = "Enrollment",
       color = "Urbanicity") +
  theme_bw()

New York scatterplot (no nyc)

Save a plot

To save a plot:

  • first save it as an object
  • use ggsave() to save it

ny_scatter <- ggplot(data = sd_county_pov %>% 
         filter(district != "New York City Department Of Education"), 
  aes(x = stpovrate, 
      y = pct_bipoc,
      size = denroll_district,
      color = urbanicity)) +
  geom_point(alpha = .5) +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) + 
  scale_size_area(labels = comma) +
  labs(x = "Student Poverty Rate", y = "Percent BIPOC",
       title = "Racial Diversity and Student Poverty in New York School Districts",
       caption = "Sources: NCES, 2019 and SAIPE, 2019",
       size = "Enrollment",
       color = "Urbanicity") +
  theme_bw()

# example code to save the stored plot as a 5" by 7" .png file
ggsave("data/output/NewYork_school_district_poverty.png", # specify the file path/name/type
       plot = ny_scatter, # specify the ggplot object you stored
       units = "in", # specify the units for your image
       height = 5, width = 7) # specify the image dimensions
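
To try ggsave() without touching the project folders, a plot can be written to a temporary file and then checked. This sketch uses the built-in mtcars dataset, and the file name example_plot.png is arbitrary. It assumes your R installation can write PNG files.

```r
library(ggplot2)

# a small plot stored as an object
p <- ggplot(mtcars, aes(x = wt, y = mpg)) + 
  geom_point()

# save to a temporary folder instead of the project folder
out_path <- file.path(tempdir(), "example_plot.png")
ggsave(out_path, plot = p, units = "in", height = 5, width = 7)

# confirm the file was written
file.exists(out_path)
```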

Homework 5a.

Save a plot from the in-class exercise that shows the poverty range in one county in New York. Upload it to Canvas with a short paragraph describing what the scatterplot shows.

Homework 5b.

Use the visualization skills you learned today to create 3 plots to explore the New York county data from last week (county poverty, asthma hospitalization rates, lottery retailers, ATMs).

Follow your interest. Add more data if you desire. Upload your plots to Canvas with a short paragraph describing what each plot shows.