Download the week 5 data and save it in the part1/data/raw folder.
Open the part1 project.
Install the scales package (used for formatting text):
install.packages("scales")
Outline
County joins questions
Homework review
Visualization
Iterating your way to beauty and understanding ggplot2
County, school district poverty analysis continues
Assignment 5
Describe your visualization results
New York county visualizations
County data joins
library(tidyverse)
library(readxl)

# import raw county data
raw_atms <- read_csv("data/raw/Bank-Owned_ATM_Locations_in_New_York_State.csv")
raw_lottery <- read_csv("data/raw/NYS_Lottery_Retailers.csv")
raw_asthma <- read_excel("data/raw/Asthma-SubCountyData.xlsx", skip = 8)

# import our processed county dataset
county_pov <- read_csv("data/processed/county_pov_rate_2019.csv")

# process atm data, number of atms per county
atms_by_county <- raw_atms |>
  group_by(County) |>
  summarise(atms = n()) |>
  mutate(County = paste0(County, " County"))

# process lottery data - count of lottery retailers per county
lottery_count <- raw_lottery |>
  group_by(County, GEOID) |>
  summarise(lottery_retailers = n()) |>
  mutate(GEOID = as.numeric(GEOID))

# process asthma data - number of hospitalizations per 10,000 people
asthma <- raw_asthma |>
  filter(Numerator != "s") |>
  mutate(Numerator = as.numeric(Numerator)) |>
  group_by(County) |>
  summarise(asthma_hospitalizations = sum(Numerator)) |>
  mutate(County = paste0(County, " County"))

# join each dataset to the county poverty data and calculate rates per 10,000 people
county_data <- county_pov |>
  left_join(atms_by_county, by = c("COUNTY" = "County")) |>
  mutate(banks_per10k = atms / county_pop * 10000) |>
  left_join(lottery_count, by = c("CONUM" = "GEOID")) |>
  mutate(lottery_per10k = lottery_retailers / county_pop * 10000) |>
  left_join(asthma, by = c("COUNTY" = "County")) |>
  mutate(asthma_per10k = asthma_hospitalizations / county_pop * 10000) |>
  select(-County)

write_csv(county_data, "data/processed/county_all_data_2019.csv")
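A quick way to check a join on county names is to look for rows that found no match. The sketch below uses small made-up data frames (county_pov_toy and atms_toy are invented for illustration and are not the class datasets). anti_join() returns the rows of the left table that have no match in the right table, and the same rows end up with NA in the joined columns after left_join().

```r
library(dplyr)

# toy data, invented for illustration only
county_pov_toy <- tibble(
  COUNTY = c("Albany County", "Kings County", "Orange County"),
  county_pop = c(300000, 2500000, 380000)
)
atms_toy <- tibble(
  County = c("Albany", "Kings"),
  atms = c(40, 300)
) |>
  mutate(County = paste0(County, " County"))  # match the COUNTY naming style

# rows in the poverty data that have no ATM match
unmatched <- county_pov_toy |>
  anti_join(atms_toy, by = c("COUNTY" = "County"))

# after a left join, unmatched counties have NA in the joined columns
joined <- county_pov_toy |>
  left_join(atms_toy, by = c("COUNTY" = "County")) |>
  mutate(atms_per10k = atms / county_pop * 10000)

unmatched
joined
```

If unmatched has rows, compare the spelling of the key columns in both tables before trusting the per-10k rates.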
Homework
Electoral Votes = 538
Seats in the U.S. House of Representatives = 435
Seats in the Senate = 100
D.C. Electoral Votes = 3
Seats in the U.S. House of Representatives
allocated to each state by population
U.S. population (2020) ~ 331 million
Each House District ~ 761,000
Seats in the U.S. Senate
each state has 2 Senators, regardless of population
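The numbers above can be checked with a few lines of arithmetic. This is only a sketch using the rounded figures from the slides; the variable names are made up for illustration.

```r
# figures from the slides above
house_seats <- 435
senate_seats <- 100
dc_votes <- 3
us_pop_2020 <- 331e6  # approximate 2020 population

# total electoral votes: House seats + Senate seats + D.C.
electoral_votes <- house_seats + senate_seats + dc_votes

# average number of people per House district
people_per_district <- us_pop_2020 / house_seats

electoral_votes                 # 538
round(people_per_district, -3)  # 761000
```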
continental_us <- apportion |>
  filter(STATE != "Alaska", STATE != "Hawaii")

plot(continental_us$percent_white, continental_us$pop_per_electoral_vote)
Visualization as part of analysis
Visualization is a tool:
to explore our datasets
check results
share with colleagues
share final analysis results
ggplot scatterplot
ggplot code
library(tidyverse)
library(scales)

# use ggplot
ggplot(apportion, aes(x = percent_white, y = pop_per_electoral_vote)) +
  geom_point() +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = comma) +
  labs(x = "Percent White", y = "People per Electoral Vote",
       title = "Race and Electoral Power",
       caption = "Source: U.S. Census, 2020. ** Note, excludes Alaska, Hawaii, and Washington D.C.")
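The label helpers from scales can be tried on their own to see what they do to axis values. A small sketch, with example numbers chosen only for illustration (pct_label is a made-up name):

```r
library(scales)

# percent_format() returns a function that formats proportions as percents
pct_label <- percent_format(accuracy = 1)
pct_out <- pct_label(c(0.256, 0.5))
pct_out    # "26%" "50%"

# comma() adds thousands separators
comma_out <- comma(761000)
comma_out  # "761,000"
```

Inside scale_x_continuous() or scale_y_continuous(), the labels argument accepts a function like these and applies it to each axis break.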
ggplot2
Tidyverse package for producing plots of your dataframe
Every ggplot has 3 required components:
data: the dataframe you want to visualize
aes: variables in the dataframe that you want to visualize
at least one layer that defines what type of plot you want to create
examples: points, bars, lines
you can add many more elements to make it look nicer
A line graph doesn’t make sense for this data, but as an example:
the layer type determines how you display the data
ggplot(data = apportion,                                       # data
       aes(x = percent_white, y = pop_per_electoral_vote)) +   # aesthetics
  geom_line()                                                  # line layer
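To see that only the layer changes, the same data and aesthetics can be stored once and reused with different geoms. A minimal sketch with a made-up data frame (toy, base, p_points, and p_line are illustrative names, not part of the class scripts):

```r
library(ggplot2)

# toy data, invented for illustration only
toy <- data.frame(x = c(1, 2, 3, 4), y = c(2, 4, 3, 5))

# data and aesthetics are shared; only the layer changes
base <- ggplot(data = toy, aes(x = x, y = y))

p_points <- base + geom_point()  # one dot per row
p_line   <- base + geom_line()   # rows connected in order of x

p_points
p_line
```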
Analysis plan
Create dataframe of poverty rate by county
Create dataframe of student poverty rate by school district
Calculate the statewide student poverty rate
Join the school district and county poverty dataframes to compare the poverty rates
Measure the difference between each school district's poverty rate and the rates of its county and the state
Use summary statistics to explore and gain understanding
Use visualizations to explore and gain understanding
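The join and difference steps in the plan could look roughly like this. It is a sketch with invented values: stpovrate and COUNTY appear in the class data, but county_povrate, state_stpovrate, diff_county, and diff_state are assumed names used only for illustration.

```r
library(dplyr)

# toy data, invented for illustration only
sd_pov <- tibble(
  district = c("A", "B", "C"),
  COUNTY = c("Orange County", "Orange County", "Albany County"),
  stpovrate = c(0.10, 0.30, 0.15)
)
county_pov_toy <- tibble(
  COUNTY = c("Orange County", "Albany County"),
  county_povrate = c(0.12, 0.11)  # assumed column name
)
state_stpovrate <- 0.18  # made-up statewide student poverty rate

# join county rates onto districts, then measure the differences
sd_compare <- sd_pov |>
  left_join(county_pov_toy, by = "COUNTY") |>
  mutate(
    diff_county = stpovrate - county_povrate,
    diff_state = stpovrate - state_stpovrate
  )

sd_compare
```

A positive diff_county means the district's student poverty rate is higher than its county's poverty rate.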
Analysis so far
We have 3 scripts for our analysis so far:
new_york_student_poverty_2019
ny_county_poverty_rate_19
analyze_ny_poverty
You’ll continue with analyze_ny_poverty for homework
Visualization plan
What counties have the most economic inequality, as measured by the student poverty rate of school districts?
Add school district enrollment data for context
Create a scatterplot to explore the county with the largest range in student poverty
Create scatterplots to explore other counties
Create scatterplots to explore the state as a whole
Visualization script
Create a new script visualize_poverty_analysis.R
add necessary packages
import data
join school district enrollment data
library(tidyverse)
library(scales)
library(viridis)

### Import the summary data so we can look at it to pick the counties we want to focus on
county_stats <- read_csv("data/output/ny_county_poverty_stats.csv")

# import some extra school district data
sd_enroll <- read_csv("data/raw/ny_sd_enrollment_2019.csv")

# import the school district - county poverty data and join the school district data
sd_county_pov <- read_csv("data/processed/ny_sd_county_pov_data.csv") |>
  left_join(sd_enroll, by = "id")
What county has the largest range in student poverty?
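One way to answer this kind of question from district-level data is to summarise the minimum and maximum rate per county and sort by the gap. A sketch with invented values; min_pov, max_pov, and pov_range are assumed column names, not necessarily the ones in county_stats.

```r
library(dplyr)

# toy data, invented for illustration only
sd_toy <- tibble(
  COUNTY = c("Orange County", "Orange County", "Albany County", "Albany County"),
  stpovrate = c(0.05, 0.35, 0.10, 0.20)
)

# range of district student poverty rates within each county, largest first
county_range <- sd_toy |>
  group_by(COUNTY) |>
  summarise(
    min_pov = min(stpovrate),
    max_pov = max(stpovrate),
    pov_range = max_pov - min_pov
  ) |>
  arrange(desc(pov_range))

county_range
```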
Orange County scatterplot, v1
### Create a scatterplot to explore the county with the largest range in student poverty
ggplot(data = sd_county_pov |> filter(COUNTY == "Orange County"),
       aes(x = stpovrate, y = pct_bipoc)) +
  geom_point()
ggplot(data = sd_county_pov |> filter(COUNTY == "Orange County"),
       aes(x = stpovrate, y = pct_bipoc,
           size = denroll_district,
           color = urbanicity)) +
  geom_point(alpha = .5) +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) +
  labs(x = "Student Poverty Rate", y = "Percent BIPOC",
       title = "Racial Diversity and Student Poverty in Orange County School Districts",
       caption = "Sources: NCES, 2019 and SAIPE, 2019",
       size = "Enrollment",
       color = "Urbanicity") +
  theme_bw()
Orange County scatterplot, v6 plot
New York scatterplot code
Remove the filter to look at New York as a whole
ggplot(data = sd_county_pov,
       aes(x = stpovrate, y = pct_bipoc,
           size = denroll_district,
           color = urbanicity)) +
  geom_point(alpha = .5) +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) +
  scale_size_area(labels = comma) +
  labs(x = "Student Poverty Rate", y = "Percent BIPOC",
       title = "Racial Diversity and Student Poverty in New York School Districts",
       caption = "Sources: NCES, 2019 and SAIPE, 2019",
       size = "Enrollment",
       color = "Urbanicity") +
  theme_bw()
New York scatterplot
New York scatterplot (no nyc) code
Remove New York City to see how it changes
ggplot(data = sd_county_pov |> filter(district != "New York City Department Of Education"),
       aes(x = stpovrate, y = pct_bipoc,
           size = denroll_district,
           color = urbanicity)) +
  geom_point(alpha = .5) +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) +
  scale_size_area(labels = comma) +
  labs(x = "Student Poverty Rate", y = "Percent BIPOC",
       title = "Racial Diversity and Student Poverty in New York School Districts",
       subtitle = "Excluding New York City",
       caption = "Sources: NCES, 2019 and SAIPE, 2019",
       size = "Enrollment",
       color = "Urbanicity") +
  theme_bw()
New York scatterplot (no nyc)
Save a plot
To save a plot:
first save it as an object
use ggsave to save it
ny_scatter <- ggplot(data = sd_county_pov |> filter(district != "New York City Department Of Education"),
                     aes(x = stpovrate, y = pct_bipoc,
                         size = denroll_district,
                         color = urbanicity)) +
  geom_point(alpha = .5) +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) +
  scale_size_area(labels = comma) +
  labs(x = "Student Poverty Rate", y = "Percent BIPOC",
       title = "Racial Diversity and Student Poverty in New York School Districts",
       caption = "Sources: NCES, 2019 and SAIPE, 2019",
       size = "Enrollment",
       color = "Urbanicity") +
  theme_bw()

# example code to save the stored plot as a 5" by 7" .png file
ggsave("data/output/NewYork_school_district_poverty.png",  # specify the file path/name/type
       plot = ny_scatter,                                  # specify the ggplot object you stored
       units = "in",                                       # specify the units for your image
       height = 5, width = 7)                              # specify the image dimensions
Homework 5a.
Save a plot from the in-class exercise that shows the range of student poverty in one New York county. Upload it to Canvas with a short paragraph describing what the scatterplot shows.
Homework 5b.
Use the visualization skills you learned today to create 3 plots that explore the New York county data from last week (county poverty, asthma hospitalization rates, lottery retailers, ATMs).
Follow your interest. Add more data if you desire. Upload your plots to Canvas with a short paragraph describing what each plot shows.