Methods 1, Week 5

Download folder for class 5 data

Download the data in the class5 folder and save it in the ny_poverty_analysis/data/raw folder.

Open project:

methods1/class3/ny_poverty_analysis

Install the scales package

install.packages("scales") # for formatting text

Outline

County joins questions
Homework review
Visualization
- Iterating your way to beauty and understanding ggplot2
County, school district poverty analysis continues
Assignment 5
- Describe your visualization results
- New York county visualizations

County data joins

library(tidyverse)
library(readxl)

# import raw county data
raw_atms <- read_csv("data/raw/Bank-Owned_ATM_Locations_in_New_York_State.csv")
raw_lottery <- read_csv("data/raw/NYS_Lottery_Retailers.csv")
raw_asthma <- read_excel("data/raw/Asthma-SubCountyData.xlsx", 
                         sheet = "AD21", skip = 6)

# import our processed county dataset
county_pov <- read_csv("data/processed/county_pov_rate_2019.csv")

# process atm data - number of atms per county
atms_by_county <- raw_atms %>% 
  group_by(County) %>% 
  summarise(atms = n()) %>% 
  mutate(County = paste0(County, " County"))

# process lottery data - count of lottery retailers per county
lottery_count <- raw_lottery %>% 
  group_by(County, GEOID) %>% 
  summarise(lottery_retailers = n(), .groups = "drop") %>% # drop grouping after counting
  mutate(GEOID = as.numeric(GEOID))

# process asthma data - total asthma hospitalizations per county (the per-10,000 rate is calculated after the join)
asthma <- raw_asthma %>% 
  group_by(County) %>% 
  summarise(asthma_hospitalizations = sum(Numerator)) %>% 
  mutate(County = paste0(County, " County"))

county_data <- county_pov %>% 
  left_join(atms_by_county, by = c("COUNTY" = "County")) %>% 
  mutate(banks_per10k = atms/county_pop*10000) %>% 
  left_join(lottery_count, by = "GEOID")  %>%
  mutate(lottery_per10k = lottery_retailers/county_pop*10000) %>%
  left_join(asthma, by = c("COUNTY" = "County")) %>% 
  mutate(asthma_per10k = asthma_hospitalizations/county_pop*10000) %>% 
  select(-County)

write_csv(county_data, "data/processed/county_all_data_2019.csv")
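
A common question with joins like these is whether every county actually found a match, since the ATM and asthma joins depend on the county names being spelled identically in both tables. The sketch below shows one way to check, using anti_join() from dplyr. The two toy tables and the spelling mismatch are invented for illustration and are not taken from the class data.

```r
library(dplyr)

# Hypothetical toy tables (invented values, not the class data)
county_pov_toy <- tibble(
  COUNTY     = c("Albany County", "Orange County", "St. Lawrence County"),
  county_pop = c(300000, 380000, 110000)
)

atms_toy <- tibble(
  County = c("Albany", "Orange", "St Lawrence"),  # note the missing period
  atms   = c(120, 95, 20)
) %>% 
  mutate(County = paste0(County, " County"))      # same key fix used above

# left_join keeps every county; unmatched rows get NA for atms
joined <- county_pov_toy %>% 
  left_join(atms_toy, by = c("COUNTY" = "County"))

# anti_join returns the rows of the first table with no match in the second
unmatched <- county_pov_toy %>% 
  anti_join(atms_toy, by = c("COUNTY" = "County"))

unmatched
```

If unmatched has any rows, compare the spellings of those county names in the two tables before trusting the per-10k calculations.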

Homework

Electoral Votes = 538

Seats in the U.S. House of Representatives = 435
Seats in the Senate = 100
D.C. Electoral Votes = 3

Seats in the U.S. House of Representatives

allocated to each state by population
- U.S. population (2020) ~ 331 million
- Each House District ~ 761,000

Seats in the U.S. Senate

each state has 2 Senators, regardless of population
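
The figures above can be checked with a little arithmetic. This is a minimal sketch that uses only the numbers quoted on these slides; the variable names are illustrative.

```r
# Figures quoted above
house_seats  <- 435
senate_seats <- 100
dc_votes     <- 3

# Total electoral votes
electoral_votes <- house_seats + senate_seats + dc_votes
electoral_votes  # 538

# Approximate number of people per House district in 2020
us_pop <- 331e6  # ~331 million, rounded
people_per_district <- us_pop / house_seats
round(people_per_district, -3)  # 761000, rounded to the nearest thousand
```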

Homework script

library(tidyverse)
library(readxl)

## remove scientific notation
options(scipen = 999)

# import apportionment and race/ethnicity data
raw_apportion <- read_excel("data/raw/apportionment-2020-table01.xlsx", 
                            skip = 3)
raw_race <- read_csv("data/raw/DECENNIALPL2020.P2_Hispanic_Latino_by_race/DECENNIALPL2020.P2_data.csv")

# process race data
race <- raw_race %>% 
  mutate(percent_latinx = P2_003N/P2_001N,
         percent_white = P2_005N/P2_001N,
         percent_bipoc = 1 - percent_white) %>% 
  select(GEO_ID, NAME, percent_latinx, percent_white, percent_bipoc)

# process apportionment
apportion <- raw_apportion %>% 
  select(GEO_ID, STATE, `POPULATION`,
         `APPORTIONED REPRESENTATIVES`) %>% 
  rename(pop = `POPULATION`,
         representatives = `APPORTIONED REPRESENTATIVES`) %>% 
  mutate(electoral_votes = representatives + 2,
         pop_per_electoral_vote = round(pop/electoral_votes, 0)) %>% 
  full_join(race, by = "GEO_ID")

Homework plot

plot(apportion$percent_white, apportion$pop_per_electoral_vote)

Visualization as part of analysis

Visualization is a tool to:

explore our datasets
check results
share with colleagues
share final analysis results

ggplot scatterplot

ggplot code

library(tidyverse)
library(scales)

# use ggplot
ggplot(apportion, aes(x = percent_white, y = pop_per_electoral_vote)) +
  geom_point() + 
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = comma) + 
  labs(x = "Percent White", y = "People per Electoral Vote",
       title = "Race and Electoral Power",
       caption = "Source: U.S. Census, 2020")

`ggplot2`

Tidyverse package for producing statistical graphics

Every ggplot has 3 key components:

data: the information you want to visualize
aesthetic mappings that indicate how to visualize the data’s variables
- examples: color, size
at least one layer to display the data
- examples: points, bars, lines

and we often add:

theme elements to control other display elements
- examples: font, background color
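
To make the components concrete, here is a minimal sketch that labels each one in a single plot. The toy data frame and its column names (x_var, y_var, group) are made up for illustration.

```r
library(ggplot2)

# Hypothetical toy data, for illustration only
toy <- data.frame(
  x_var = c(1, 2, 3, 4),
  y_var = c(2, 4, 3, 5),
  group = c("a", "a", "b", "b")
)

p <- ggplot(data = toy,                # 1. data
            aes(x = x_var, 
                y = y_var,
                color = group)) +      # 2. aesthetic mappings
  geom_point(size = 3) +               # 3. a layer (points)
  theme_minimal()                      # optional: a theme

p  # print the plot
```

Note that color = group sits inside aes() because it maps a variable to color, while size = 3 sits outside aes() because it is a fixed value for every point.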

ggplot scatterplot example

ggplot(apportion, # data
       aes(x = percent_white, 
           y = pop_per_electoral_vote)) + # aesthetics
  geom_point() # point layer

ggplot line example

A line graph doesn’t make sense for this data, but it shows that the layer type determines how the data is displayed:

ggplot(data = apportion, # data
       aes(x = percent_white, 
           y = pop_per_electoral_vote)) + # aesthetics
  geom_line() # line layer

ggplot scatterplot example

ggplot(data = apportion, 
       aes(x = percent_white, 
           y = pop_per_electoral_vote)) +
  geom_point() + 
  scale_y_continuous(labels = comma) + # y-axis labels, format as numbers with commas
  scale_x_continuous(labels = percent_format(accuracy = 1)) + # x-axis labels
  labs(x = "Percent White", y = "People per Electoral Vote",
       title = "Race and Electoral Power",
       subtitle = "One person, one vote means the number of people per electoral vote should be the same for each state",
       caption = "Source: U.S. Census, 2020") # titles

Iterating your way to beauty with ggplot

Analysis graphics should follow a simple, iterative workflow:

1. Create a basic plot
2. Address any missing values
3. Clean up formatting of chart elements
4. Add a new layer of data
5. Tidy up formatting
6. Repeat steps 4-5 as needed
7. Save output
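
One way to work through this loop is to store the plot in an object and keep adding to it. The sketch below walks the workflow on a made-up data frame. The column names, the reference line, and the temporary output file are all assumptions for illustration.

```r
library(ggplot2)
library(scales)

# Hypothetical toy data with one missing value
toy <- data.frame(
  rate   = c(0.10, 0.25, NA, 0.40),
  people = c(12000, 45000, 30000, 8000)
)

# Create a basic plot (ggplot2 warns about the missing value when it is drawn)
p <- ggplot(toy, aes(x = rate, y = people)) +
  geom_point()

# Address missing values: here we simply drop incomplete rows
toy_complete <- toy[complete.cases(toy), ]
p <- ggplot(toy_complete, aes(x = rate, y = people)) +
  geom_point()

# Clean up formatting of chart elements
p <- p +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = comma)

# Add a new layer: a dashed reference line at the mean
p <- p +
  geom_hline(yintercept = mean(toy_complete$people), linetype = "dashed")

# Tidy up formatting
p <- p +
  labs(x = "Rate", y = "People", title = "Toy example") +
  theme_bw()

# Save output (to a temporary file in this sketch)
out_file <- tempfile(fileext = ".png")
ggsave(out_file, plot = p, units = "in", height = 5, width = 7)
```

Because each step adds to p, you can rerun the script up to any point and print p to see the plot at that stage.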

New York Poverty Analysis

Explore the level of economic inequality in school districts across New York State.

What is the difference between the student poverty rate in each school district and:
- the poverty rate of the county as a whole?
- the poverty rate of the state as a whole?

What counties have the most economic inequality, as measured by the student poverty rate of school districts?

Analysis plan

Create dataframe of poverty rate by county
Create dataframe of student poverty rate by school district
Calculate the statewide student poverty rate
Join the school district and county poverty dataframes to compare the poverty rates
Measure the difference in poverty rates between each school district and its county and the state
Use summary statistics to explore and gain understanding
Use visualizations to explore and gain understanding
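
The join-and-compare steps of this plan can be sketched on a tiny example. The two toy tables and all of the rates below are invented. The column names follow the ones used in the analysis script, but treat the sketch as illustrative rather than as the class code.

```r
library(dplyr)

# Hypothetical toy data: two districts in the same county
districts <- tibble(
  district  = c("District A", "District B"),
  CONUM     = c(36001, 36001),
  stpovrate = c(0.30, 0.10)
)

counties <- tibble(
  GEOID           = 36001,
  county_pov_rate = 0.15
)

state_stpovrate <- 0.18  # made-up statewide student poverty rate

compared <- districts %>% 
  left_join(counties, by = c("CONUM" = "GEOID")) %>% 
  mutate(pov_diff_county = round(stpovrate - county_pov_rate, 3),
         pov_diff_state  = round(stpovrate - state_stpovrate, 3))

compared
```

A positive difference means the district's student poverty rate is higher than the county or state rate, and a negative difference means it is lower.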

Analysis so far

We have 3 scripts for our analysis so far:

1_process_school_district_data_2019
2_process_county_data_2019
3_student_poverty_analysis

You’ll continue with ny_county_dataset.R for homework

Student Poverty Analysis script

In 3_student_poverty_analysis we’ll remove NYC and write out the data:

library(tidyverse)

# Import processed dataframes
county_pov <- read_csv("data/processed/county_all_data_2019.csv")
sd_pov <- read_csv("data/processed/school_district_student_pov_rate_2019.csv")

sd_county_pov <- sd_pov %>% 
  left_join(county_pov, by = c("CONUM"="GEOID")) %>% 
  select(district_id, district, County, CONUM, tpop, stpop, stpov, 
         stpovrate, county_pop, county_pov_rate, state_stpovrate) %>% 
  mutate(pov_diff_county = round(stpovrate - county_pov_rate, 3),
         pov_diff_state = round(stpovrate - state_stpovrate, 3)) 

# Calculate summary statistics of poverty rate
ny_pov_stats <- sd_county_pov %>%
  summarise(districts = n(),
            kids = sum(stpop),
            kids_in_pov = sum(stpov),
            stud_poverty_rate = kids_in_pov/kids,
            mean_sd_stpovrate = mean(stpovrate),
            max_sd_stpovrate = max(stpovrate), 
            min_sd_stpovrate = min(stpovrate),
            poverty_range = max_sd_stpovrate - min_sd_stpovrate)

ny_county_pov_stats <- sd_county_pov %>%
  group_by(County) %>% 
  summarise(districts = n(),
            kids = sum(stpop),
            kids_in_pov = sum(stpov),
            stud_poverty_rate = round(kids_in_pov/kids, 3),
            mean_sd_stpovrate = round(mean(stpovrate), 3),
            max_sd_stpovrate = round(max(stpovrate), 3),
            min_sd_stpovrate = round(min(stpovrate), 3),
            poverty_range = max_sd_stpovrate - min_sd_stpovrate) %>% 
  filter(County != "New York County")

# write out dataframes
write_csv(sd_county_pov, "data/processed/school_district_county_poverty_2019.csv")
write_csv(ny_county_pov_stats, "data/output/ny_county_poverty_stats.csv")

Visualization plan

What counties have the most economic inequality, as measured by the student poverty rate of school districts?

Add school district enrollment data for context
Create a scatterplot to explore the county with the largest range in student poverty
Create scatterplots to explore other counties
Create scatterplots to explore the state as a whole

Visualization script

Create a new script 4_visualize_poverty_analysis.R

add necessary packages
import data
join school district enrollment data

library(tidyverse)
library(scales)
library(viridis)

### Import the summary data
county_stats <- read_csv("data/output/ny_county_poverty_stats.csv")

# import some extra school district data
sd_enroll <- read_csv("data/raw/ny_sd_enrollment_2019.csv")

# import the school district - county poverty data and join the school district data
sd_county_pov <- read_csv("data/output/ny_sd_county_pov_data.csv") %>% 
  left_join(sd_enroll, by = "district_id")

What county has the largest range in student poverty?

Orange County scatterplot, v1

### Create a scatterplot to explore the county with the largest range in student poverty 
ggplot(data = sd_county_pov %>% 
         filter(County == "Orange County"), 
  aes(x = stpovrate, y = pct_bipoc)) +
  geom_point()

Orange County scatterplot, v2 script

Format the labels as percent, with no decimal place

use scale_x_continuous() to format the x-axis
percent_format() is a scales package function
accuracy = 1 rounds to a whole number
- accuracy = .1 includes one decimal place
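
To see what these label formatters do on their own, you can call them directly on a number. This is a small sketch with an arbitrary example value.

```r
library(scales)

rate <- 0.1234  # arbitrary example value

# percent_format() returns a function, which is then applied to rate
percent_format(accuracy = 1)(rate)    # "12%"
percent_format(accuracy = 0.1)(rate)  # "12.3%"

# comma() formats large numbers with thousands separators
comma(761000)                         # "761,000"
```
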

### Create a scatterplot to explore the county with the largest range in student poverty 

ggplot(data = sd_county_pov %>% 
         filter(County == "Orange County"), 
  aes(x = stpovrate, y = pct_bipoc)) +
  geom_point() +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = percent_format(accuracy = 1))

Orange County scatterplot, v2 plot

Orange County scatterplot, v3 code

Within the aesthetic mapping aes(), size the dots by enrollment
Within geom_point() make the dots 50% transparent with alpha = .5

ggplot(data = sd_county_pov %>% 
         filter(County == "Orange County"), 
  aes(x = stpovrate, 
      y = pct_bipoc,
      size = denroll_district)) +
  geom_point(alpha = .5) +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = percent_format(accuracy = 1))

Orange County scatterplot, v3 plot

Orange County scatterplot, v4 code

Within the aesthetic mapping aes(), color the dots by “urbanicity category”

ggplot(data = sd_county_pov %>% 
         filter(County == "Orange County"), 
  aes(x = stpovrate, 
      y = pct_bipoc,
      size = denroll_district,
      color = urbanicity)) +
  geom_point(alpha = .5) +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = percent_format(accuracy = 1))

Orange County scatterplot, v4 plot

Orange County scatterplot, v5 code

Add formatted axis labels, a title, and a caption with the labs() function

ggplot(data = sd_county_pov %>% 
         filter(County == "Orange County"), 
  aes(x = stpovrate, 
      y = pct_bipoc,
      size = denroll_district,
      color = urbanicity)) +
  geom_point(alpha = .5) +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) + 
  labs(x = "Student Poverty Rate", y = "Percent BIPOC",
       title = "Racial Diversity and Student Poverty in Orange County School Districts",
       caption = "Sources: NCES, 2019 and SAIPE, 2019")

Orange County scatterplot, v5 plot

Orange County scatterplot, v6 code

Add a theme to add some standard styling

see other themes

ggplot(data = sd_county_pov %>% 
         filter(County == "Orange County"), 
  aes(x = stpovrate, 
      y = pct_bipoc,
      size = denroll_district,
      color = urbanicity)) +
  geom_point(alpha = .5) +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) + 
  labs(x = "Student Poverty Rate", y = "Percent BIPOC",
       title = "Racial Diversity and Student Poverty in Orange County School Districts",
       caption = "Sources: NCES, 2019 and SAIPE, 2019") +
  theme_bw()

Orange County scatterplot, v6 plot

Orange County scatterplot, v7 code

Fix the legend

Format the enrollment numbers with commas using scale_size_area(labels = comma)
In labs(), add readable legend titles for size and color

ggplot(data = sd_county_pov %>% 
         filter(County == "Orange County"), 
  aes(x = stpovrate, 
      y = pct_bipoc,
      size = denroll_district,
      color = urbanicity)) +
  geom_point(alpha = .5) +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) + 
  scale_size_area(labels = comma) +
  labs(x = "Student Poverty Rate", y = "Percent BIPOC",
       title = "Racial Diversity and Student Poverty in Orange County School Districts",
       caption = "Sources: NCES, 2019 and SAIPE, 2019",
       size = "Enrollment",
       color = "Urbanicity") +
  theme_bw()

Orange County scatterplot, v7 plot

New York scatterplot code

Remove the filter to look at New York as a whole

ggplot(data = sd_county_pov, 
  aes(x = stpovrate, 
      y = pct_bipoc,
      size = denroll_district,
      color = urbanicity)) +
  geom_point(alpha = .5) +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) + 
  scale_size_area(labels = comma) +
  labs(x = "Student Poverty Rate", y = "Percent BIPOC",
       title = "Racial Diversity and Student Poverty in New York School Districts",
       caption = "Sources: NCES, 2019 and SAIPE, 2019",
       size = "Enrollment",
       color = "Urbanicity") +
  theme_bw()

New York scatterplot

New York scatterplot (no nyc) code

Remove New York City to see how it changes

ggplot(data = sd_county_pov %>% 
         filter(district != "New York City Department Of Education"), 
  aes(x = stpovrate, 
      y = pct_bipoc,
      size = denroll_district,
      color = urbanicity)) +
  geom_point(alpha = .5) +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) + 
  scale_size_area(labels = comma) +
  labs(x = "Student Poverty Rate", y = "Percent BIPOC",
       title = "Racial Diversity and Student Poverty in New York School Districts",
       subtitle = "Excluding New York City",
       caption = "Sources: NCES, 2019 and SAIPE, 2019",
       size = "Enrollment",
       color = "Urbanicity") +
  theme_bw()

New York scatterplot (no nyc)

Save a plot

To save a plot:

first save it as an object
use ggsave to save it

ny_scatter <- ggplot(data = sd_county_pov %>% 
         filter(district != "New York City Department Of Education"), 
  aes(x = stpovrate, 
      y = pct_bipoc,
      size = denroll_district,
      color = urbanicity)) +
  geom_point(alpha = .5) +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) + 
  scale_size_area(labels = comma) +
  labs(x = "Student Poverty Rate", y = "Percent BIPOC",
       title = "Racial Diversity and Student Poverty in New York School Districts",
       caption = "Sources: NCES, 2019 and SAIPE, 2019",
       size = "Enrollment",
       color = "Urbanicity") +
  theme_bw()

# example code to save the stored plot as a 5" high by 7" wide .png file
ggsave("data/output/NewYork_school_district_poverty.png", #specify the file path/name/type
       plot = ny_scatter, # specify the ggplot object you stored
       units = "in", # specify the units for your image
       height = 5, width = 7) # specify the image dimensions

Homework 5a.

Save a plot from the in-class exercise that shows the poverty range in one county in New York. Upload it to Canvas with a short paragraph describing what the scatterplot shows.

Homework 5b.

Use the visualization skills you learned today to create 3 plots to explore the New York county data from last week (county poverty, asthma hospitalization rates, lottery retailers, ATMs).

Follow your interest. Add more data if you desire. Upload your plots to Canvas with a short paragraph describing what each plot shows.
