Download the data in the class5 folder and save it in the ny_poverty_analysis/data/raw folder.
Open project:
methods1/class3/ny_poverty_analysis
Install the scales package
install.packages("scales") # for formatting text
Outline
County joins questions
Homework review
Visualization
Iterating your way to beauty and understanding ggplot2
County, school district poverty analysis continues
Assignment 5
Describe your visualization results
New York county visualizations
County data joins
library(tidyverse)
library(readxl)

# import raw county data
raw_atms <- read_csv("data/raw/Bank-Owned_ATM_Locations_in_New_York_State.csv")
raw_lottery <- read_csv("data/raw/NYS_Lottery_Retailers.csv")
raw_asthma <- read_excel("data/raw/Asthma-SubCountyData.xlsx", sheet = "AD21", skip = 6)

# import our processed county dataset
county_pov <- read_csv("data/processed/county_pov_rate_2019.csv")

# process atm data - number of atms per county
atms_by_county <- raw_atms %>%
  group_by(County) %>%
  summarise(atms = n()) %>%
  mutate(County = paste0(County, " County"))

# process lottery data - count of lottery retailers per county
lottery_count <- raw_lottery %>%
  group_by(County, GEOID) %>%
  summarise(lottery_retailers = n()) %>%
  mutate(GEOID = as.numeric(GEOID))

# process asthma data - total asthma hospitalizations per county
# (converted to a rate per 10,000 people in the join below)
asthma <- raw_asthma %>%
  group_by(County) %>%
  summarise(asthma_hospitalizations = sum(Numerator)) %>%
  mutate(County = paste0(County, " County"))

# join everything to the county poverty data and calculate rates per 10,000 people
county_data <- county_pov %>%
  left_join(atms_by_county, by = c("COUNTY" = "County")) %>%
  mutate(banks_per10k = atms / county_pop * 10000) %>%
  left_join(lottery_count, by = "GEOID") %>%
  mutate(lottery_per10k = lottery_retailers / county_pop * 10000) %>%
  left_join(asthma, by = c("COUNTY" = "County")) %>%
  mutate(asthma_per10k = asthma_hospitalizations / county_pop * 10000) %>%
  select(-County)

write_csv(county_data, "data/processed/county_all_data_2019.csv")
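The joins above only match when the county names are spelled the same way in both tables, which is why " County" is pasted onto the ATM and asthma county names. A minimal sketch of how one might check for unmatched keys with anti_join(), using small made-up tibbles (pov_toy, atms_toy and their values are hypothetical, not the class data):

```r
library(dplyr)
library(tibble)

# Hypothetical stand-ins for county_pov and the raw ATM counts (made-up values)
pov_toy <- tibble(COUNTY = c("Albany County", "Orange County", "Kings County"),
                  county_pop = c(300000, 380000, 2500000))
atms_toy <- tibble(County = c("Albany", "Orange"),
                   atms = c(120, 95))

# Add the " County" suffix so the join keys match, as in the script above
atms_toy <- atms_toy %>%
  mutate(County = paste0(County, " County"))

# left_join keeps every row of pov_toy; counties with no match get NA
joined_toy <- pov_toy %>%
  left_join(atms_toy, by = c("COUNTY" = "County"))

# anti_join lists the rows of pov_toy that found no match in atms_toy
unmatched <- pov_toy %>%
  anti_join(atms_toy, by = c("COUNTY" = "County"))
```

Here unmatched contains only the made-up "Kings County" row. Running the same check on the real tables is one way to spot naming mismatches before calculating rates.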
Homework
Electoral Votes = 538
Seats in the U.S. House of Representatives = 435
Seats in the Senate = 100
D.C. Electoral Votes = 3
Seats in the U.S. House of Representatives
allocated to each state by population
U.S. population (2020) ~ 331 million
Each House District ~ 761,000
Seats in the U.S. Senate
each state has 2 Senators, regardless of population
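The arithmetic behind these numbers can be checked in R. This is a small sketch using the rounded figures from the slides. The single state at the end is hypothetical (made-up population and seat count) and only illustrates how a value like pop_per_electoral_vote could be computed.

```r
# Electoral votes = House seats + Senate seats + D.C. electoral votes
house_seats <- 435
senate_seats <- 100
dc_votes <- 3
electoral_votes <- house_seats + senate_seats + dc_votes

# Average number of people per House district, using the rounded 2020 population
us_pop <- 331e6
people_per_district <- us_pop / house_seats

# Hypothetical state (made-up numbers): 1 House seat plus 2 Senate seats
state_pop <- 580000
state_votes <- 1 + 2
pop_per_electoral_vote <- state_pop / state_votes

electoral_votes
round(people_per_district, -3) # rounded to the nearest thousand
round(pop_per_electoral_vote)
```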
library(tidyverse)
library(scales)

# use ggplot
ggplot(apportion, aes(x = percent_white, y = pop_per_electoral_vote)) +
  geom_point() +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = comma) +
  labs(x = "Percent White", y = "People per Electoral Vote",
       title = "Race and Electoral Power",
       caption = "Source: U.S. Census, 2020")
ggplot2
Tidyverse package for producing statistical graphics
Every ggplot has 3 key components:
data: the information you want to visualize
aesthetic mappings that indicate how to visualize the data’s variables
examples: color, size
at least one layer to display the data
examples: points, bars, lines
and we often add:
theme elements to control other display elements
examples: font, background color
ggplot scatterplot example
ggplot(apportion, # data
       aes(x = percent_white, y = pop_per_electoral_vote)) + # aesthetics
  geom_point() # point layer
ggplot line example
A line graph doesn’t make sense for this data, but as an example:
the layer type determines how you display the data
ggplot(data = apportion, # data
       aes(x = percent_white, y = pop_per_electoral_vote)) + # aesthetics
  geom_line() # line layer
ggplot scatterplot example
ggplot(data = apportion, aes(x = percent_white, y = pop_per_electoral_vote)) +
  geom_point() +
  scale_y_continuous(labels = comma) + # y-axis labels, format as numbers with commas
  scale_x_continuous(labels = percent_format(accuracy = 1)) + # x-axis labels
  labs(x = "Percent White", y = "People per Electoral Vote",
       title = "Race and Electoral Power",
       subtitle = "One person, one vote means the number of people per electoral vote should be the same for each state",
       caption = "Source: U.S. Census, 2020") # titles
Iterating your way to beauty with ggplot
Analysis graphics should follow a simple, iterative workflow:
1. Create a basic plot
2. Address any missing values
3. Clean up formatting of chart elements
4. Add a new layer of data
5. Tidy up formatting
6. Repeat steps 4-5 as needed
7. Save output
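The workflow above can be sketched by storing each iteration as its own object. The toy tibble and its column names are made up for illustration; mapping enrollment to point size is just one possible reading of "add a new layer of data".

```r
library(ggplot2)
library(dplyr)
library(tibble)
library(scales)

# Made-up data for illustration only (not the class data)
toy <- tibble(
  pov_rate = c(0.08, 0.15, NA, 0.22, 0.30),
  pct_bipoc = c(0.20, 0.35, 0.40, 0.55, 0.70),
  enrollment = c(1200, 5400, 800, 15000, 3000)
)

# 1. Create a basic plot
p1 <- ggplot(toy, aes(x = pov_rate, y = pct_bipoc)) +
  geom_point()

# 2. Address missing values: drop rows with no poverty rate
toy_complete <- toy %>% filter(!is.na(pov_rate))
p2 <- ggplot(toy_complete, aes(x = pov_rate, y = pct_bipoc)) +
  geom_point()

# 3. Clean up formatting of chart elements
p3 <- p2 +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = percent_format(accuracy = 1))

# 4. Add more data: size the points by enrollment
p4 <- p3 + aes(size = enrollment)

# 5. Tidy up formatting
p5 <- p4 +
  scale_size_area(labels = comma) +
  labs(x = "Poverty Rate", y = "Percent BIPOC", size = "Enrollment") +
  theme_bw()

# Print the latest version to view it
p5
```

Keeping each version in its own object makes it easy to compare iterations or step back when a change does not help.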
New York Poverty Analysis
Explore the level of economic inequality in school districts across New York State.
What is the difference between the student poverty rate in each school district and:
the poverty rate of the county as a whole?
the poverty rate of the state as a whole?
What counties have the most economic inequality, as measured by the student poverty rate of school districts?
Analysis plan
Create dataframe of poverty rate by county
Create dataframe of student poverty rate by school district
Calculate the statewide student poverty rate
Join the school district and county poverty dataframes to compare the poverty rates
Measure the difference between the poverty rate of each school district and the rates of its county and the state
Use summary statistics to explore and gain understanding
Use visualizations to explore and gain understanding
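A rough sketch of the join and difference steps in this plan, using made-up tables. The names sd_toy, county_toy, county_pov_rate and students are hypothetical, and weighting the statewide rate by student counts is an assumption; the class scripts may calculate it differently.

```r
library(dplyr)
library(tibble)

# Hypothetical school district and county tables (made-up names and values)
sd_toy <- tibble(
  district = c("District A", "District B", "District C"),
  County = c("County X", "County X", "County Y"),
  stpovrate = c(0.10, 0.30, 0.20),
  students = c(1000, 3000, 2000)
)
county_toy <- tibble(
  County = c("County X", "County Y"),
  county_pov_rate = c(0.15, 0.12)
)

# Statewide student poverty rate: total students in poverty / total students
state_stpovrate <- sum(sd_toy$stpovrate * sd_toy$students) / sum(sd_toy$students)

# Join the county rate onto each district and measure the differences
sd_compare <- sd_toy %>%
  left_join(county_toy, by = "County") %>%
  mutate(diff_county = stpovrate - county_pov_rate,
         diff_state = stpovrate - state_stpovrate)
```

A positive diff_county means the district's student poverty rate is higher than the poverty rate of its county as a whole.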
Analysis so far
We have 3 scripts for our analysis so far:
1_process_school_district_data_2019
2_process_county_data_2019
3_student_poverty_analysis
You’ll continue with ny_county_dataset.R for homework
Student Poverty Analysis script
In 3_student_poverty_analysis we’ll remove NYC and write out the data:
What counties have the most economic inequality, as measured by the student poverty rate of school districts?
Add school district enrollment data for context
Create a scatterplot to explore the county with the largest range in student poverty
Create scatterplots to explore other counties
Create scatterplots to explore the state as a whole
Visualization script
Create a new script 4_visualize_poverty_analysis.R
add necessary packages
import data
join school district enrollment data
library(tidyverse)
library(scales)
library(viridis)

### Import the summary data
county_stats <- read_csv("data/output/ny_county_poverty_stats.csv")

# import some extra school district data
sd_enroll <- read_csv("data/raw/ny_sd_enrollment_2019.csv")

# import the school district - county poverty data and join the school district data
sd_county_pov <- read_csv("data/output/ny_sd_county_pov_data.csv") %>%
  left_join(sd_enroll, by = "district_id")
What county has the largest range in student poverty?
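One way to answer this question is to summarise the minimum, maximum and range of the student poverty rate within each county and then sort by the range. The sketch below uses a made-up tibble; only the County and stpovrate column names are borrowed from the class data, and the summary column names are hypothetical.

```r
library(dplyr)
library(tibble)

# Made-up district-level data for illustration only
sd_toy <- tibble(
  County = c("County X", "County X", "County X", "County Y", "County Y"),
  stpovrate = c(0.05, 0.18, 0.40, 0.12, 0.20)
)

# Range of student poverty rates within each county, largest range first
county_range <- sd_toy %>%
  group_by(County) %>%
  summarise(min_stpov = min(stpovrate),
            max_stpov = max(stpovrate),
            range_stpov = max_stpov - min_stpov,
            n_districts = n()) %>%
  arrange(desc(range_stpov))
```

The first row of county_range is the county with the widest gap between its lowest and highest district poverty rates.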
Orange County scatterplot, v1
### Create a scatterplot to explore the county with the largest range in student poverty
ggplot(data = sd_county_pov %>% filter(County == "Orange County"),
       aes(x = stpovrate, y = pct_bipoc)) +
  geom_point()
Orange County scatterplot, v2 script
Format the labels as percent, with no decimal place
use scale_x_continuous() to format the x-axis
percent_format() is a scales package function
accuracy = 1 rounds to a whole number
accuracy = .1 includes one decimal place
### Create a scatterplot to explore the county with the largest range in student poverty
ggplot(data = sd_county_pov %>% filter(County == "Orange County"),
       aes(x = stpovrate, y = pct_bipoc)) +
  geom_point() +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = percent_format(accuracy = 1))
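To see what the accuracy argument does outside of a plot, the formatting functions from scales can be called directly on a number. A short sketch (the object names whole_pct and one_decimal_pct are made up here):

```r
library(scales)

# percent_format() returns a function that turns proportions into percent labels
whole_pct <- percent_format(accuracy = 1)
one_decimal_pct <- percent_format(accuracy = .1)

whole_pct(0.1234)       # "12%"
one_decimal_pct(0.1234) # "12.3%"

# comma() adds thousands separators, useful for counts such as enrollment
comma(15000)            # "15,000"
```

Passing these functions to the labels argument of a scale applies the same formatting to every axis or legend label.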
Orange County scatterplot, v2 plot
Orange County scatterplot, v3 code
Within the aesthetic mapping aes(), size the dots by enrollment
Within geom_point() make the dots 50% transparent with alpha = .5
ggplot(data = sd_county_pov %>% filter(County == "Orange County"),
       aes(x = stpovrate, y = pct_bipoc,
           size = denroll_district,
           color = urbanicity)) +
  geom_point(alpha = .5) +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) +
  labs(x = "Student Poverty Rate", y = "Percent BIPOC",
       title = "Racial Diversity and Student Poverty in Orange County School Districts",
       caption = "Sources: NCES, 2019 and SAIPE, 2019") +
  theme_bw()
Orange County scatterplot, v6 plot
Orange County scatterplot, v7 code
Fix the legend
format the enrollment number with commas
In labs() add nice legend titles
ggplot(data = sd_county_pov %>% filter(County == "Orange County"),
       aes(x = stpovrate, y = pct_bipoc,
           size = denroll_district,
           color = urbanicity)) +
  geom_point(alpha = .5) +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) +
  scale_size_area(labels = comma) +
  labs(x = "Student Poverty Rate", y = "Percent BIPOC",
       title = "Racial Diversity and Student Poverty in Orange County School Districts",
       caption = "Sources: NCES, 2019 and SAIPE, 2019",
       size = "Enrollment",
       color = "Urbanicity") +
  theme_bw()
Orange County scatterplot, v7 plot
New York scatterplot code
Remove the filter to look at New York as a whole
ggplot(data = sd_county_pov,
       aes(x = stpovrate, y = pct_bipoc,
           size = denroll_district,
           color = urbanicity)) +
  geom_point(alpha = .5) +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) +
  scale_size_area(labels = comma) +
  labs(x = "Student Poverty Rate", y = "Percent BIPOC",
       title = "Racial Diversity and Student Poverty in New York School Districts",
       caption = "Sources: NCES, 2019 and SAIPE, 2019",
       size = "Enrollment",
       color = "Urbanicity") +
  theme_bw()
New York scatterplot
New York scatterplot (no nyc) code
Remove New York City to see how it changes
ggplot(data = sd_county_pov %>% filter(district != "New York City Department Of Education"),
       aes(x = stpovrate, y = pct_bipoc,
           size = denroll_district,
           color = urbanicity)) +
  geom_point(alpha = .5) +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) +
  scale_size_area(labels = comma) +
  labs(x = "Student Poverty Rate", y = "Percent BIPOC",
       title = "Racial Diversity and Student Poverty in New York School Districts",
       subtitle = "Excluding New York City",
       caption = "Sources: NCES, 2019 and SAIPE, 2019",
       size = "Enrollment",
       color = "Urbanicity") +
  theme_bw()
New York scatterplot (no nyc)
Save a plot
To save a plot:
first save it as an object
use ggsave to save it
ny_scatter <- ggplot(data = sd_county_pov %>% filter(district != "New York City Department Of Education"),
                     aes(x = stpovrate, y = pct_bipoc,
                         size = denroll_district,
                         color = urbanicity)) +
  geom_point(alpha = .5) +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) +
  scale_size_area(labels = comma) +
  labs(x = "Student Poverty Rate", y = "Percent BIPOC",
       title = "Racial Diversity and Student Poverty in New York School Districts",
       caption = "Sources: NCES, 2019 and SAIPE, 2019",
       size = "Enrollment",
       color = "Urbanicity") +
  theme_bw()

# example code to save the stored plot as a 5" by 7" .png file
ggsave("data/output/NewYork_school_district_poverty.png", # specify the file path/name/type
       plot = ny_scatter, # specify the ggplot object you stored
       units = "in", # specify the units for your image
       height = 5, width = 7) # specify the image dimensions
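To try the save step without the class data, here is a minimal sketch that stores a toy plot and writes it to a temporary folder. The objects toy, toy_plot and out_file are made up for illustration; in the project you would save to data/output instead.

```r
library(ggplot2)
library(tibble)

# Made-up data and a simple plot stored as an object
toy <- tibble(x = c(1, 2, 3), y = c(2, 4, 8))
toy_plot <- ggplot(toy, aes(x = x, y = y)) +
  geom_point()

# Write the stored plot to a temporary .png file, 5 inches high by 7 inches wide
out_file <- file.path(tempdir(), "toy_plot.png")
ggsave(out_file, plot = toy_plot, units = "in", height = 5, width = 7)

# Confirm the file was written
file.exists(out_file)
```

ggsave() picks the file type from the extension in the file name, so changing ".png" to ".pdf" would save a PDF instead.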
Homework 5a.
Save a plot from the in-class exercise that shows the poverty range in one county in New York. Upload it to Canvas with a short paragraph describing what the scatterplot shows.
Homework 5b.
Use the visualization skills you learned today to create 3 plots that explore the New York county data from last week (county poverty, asthma hospitalization rates, lottery retailers, ATMs).
Follow your interest. Add more data if you like. Upload your plots to Canvas with a short paragraph describing what each plot shows.