Download folder for class 3 and create a new project
Download the class 3 folder and save them in your methods1 folder on your computer
Create new project from existing folder:
methods1/class3/ny_poverty_analysis
Install the readxl package
install.packages("readxl")
Learning goals
Over the next two classes (week 3 & 4) we’re going to work on one analysis together.
It will be a template for how to organize small research projects.
We’ll use the two datasets you created for the homework assignment to explore economic segregation across New York.
Outline
In-class analysis overview
Homework review
Importing excel files
Joining data
Summary statistics
Simple visualizatoion
Readings discussion
Assignment 3
In-class Analysis
Explore the level of economic inequality in school districts across New York State.
What is the difference between the student poverty rate in each school district and:
the poverty rate of the county as a whole?
the poverty rate of the state as a whole?
What counties have the most economic inequality, as measured by the student poverty rate of school districts?
New York School Districts and Counties
Analysis plan
Create dataframe of poverty rate by county
Create dataframe of student poverty rate by school district
Calculate the statewide student poverty rate
Join the school district and county poverty dataframes to compare the poverty rates
Measure the difference in poverty rates of each school district and it’s county and the state
Use summary statistics to explore and gain understanding
Use visualizations to explore and gain understanding
Analysis steps
Script 1. Process data
Create dataframe of poverty rate by county
Script 2. Process data
Create dataframe of student poverty rate by school district
Calculate statewide student poverty rate
Script 3. Create analysis dataframe
Join county poverty dataframe to school district data
Create new variables
difference from county rate = district poverty rate - county poverty rate
difference from state rate = district poverty rate - state poverty rate
Calculate summary statistics
Script 4. Visualize and explore your analysis dataframe
Script 1. Poverty rate by county
Create a new script named 1.process_ny_county_poverty_data.R
save it to methods1/ny_poverty_analysis
# Processing and exploring the 2019 poverty data from the American Community Survey for New Yorklibrary(tidyverse)# import the county poverty rate datasetraw_county_pov19 <-read_csv("data/raw/nhgis0042_csv/nhgis0042_ds244_20195_2019_county.csv") ## process the data to create New York poverty rate datasetcounty_pov_19 <- raw_county_pov19 %>%select(STATE, COUNTY, GEOID, ALWVE001, ALWVE002, ALWVE003) %>%filter(STATE =="New York") %>%rename(county_pop = ALWVE001) %>%mutate(county_pov_rate = (ALWVE002 + ALWVE003)/county_pop) %>%select(STATE, COUNTY, GEOID, county_pop, county_pov_rate) # write out county poverty ratewrite_csv(county_pov_19, "data/processed/county_pov_rate_2019.csv")
Script 2. Student poverty rate by school district
Create a new script named 2.process_ny_sd_poverty_data.R
save it to methods1/ny_poverty_analysis
# Create dataframe of school district poverty ratelibrary(tidyverse)# import raw student poverty datasetraw_stpov19 <-read_csv("data/raw/ny_student_poverty_2019.csv")# create new dataframe from raw_stpov19 to process the datastpov19 <- raw_stpov19 %>%rename(Estimated_Pop_5_17 =`Estimated Population 5-17`) %>%mutate(stud_pov_rate = Estimated_relevant_5_17_in_poverty/Estimated_Pop_5_17,year ="2019") %>%select(Postal, County, CONUM, district_id, Name, Estimated_Total_Pop, Estimated_Pop_5_17, Estimated_relevant_5_17_in_poverty, stud_pov_rate, year) %>%# filter(Estimated_Pop_5_17 >= 100) %>% # remove filterrename(district = Name,tpop = Estimated_Total_Pop,stpop = Estimated_Pop_5_17,stpov = Estimated_relevant_5_17_in_poverty,stpovrate = stud_pov_rate)
We’re keeping 2 columns that weren’t in the homework file:
County and CONUM (county id)
Next steps
Calculate the statewide student poverty rate
Figure out how to connect the school district and county data to compare the poverty rates.
Calculate the statewide student poverty rate
# Create dataframe of school district poverty ratelibrary(tidyverse)# import raw student poverty datasetraw_stpov19 <-read_csv("data/raw/ny_student_poverty_2019.csv")# create new dataframe from raw_stpov19 to process the datastpov19 <- raw_stpov19 %>%rename(Estimated_Pop_5_17 =`Estimated Population 5-17`) %>%mutate(stud_pov_rate = Estimated_relevant_5_17_in_poverty/Estimated_Pop_5_17,year ="2019",state_stpovrate =sum(Estimated_relevant_5_17_in_poverty)/sum(Estimated_Pop_5_17)) %>%select(Postal, County, CONUM, district_id, Name, Estimated_Total_Pop, Estimated_Pop_5_17, Estimated_relevant_5_17_in_poverty, stud_pov_rate, state_stpovrate, year) %>%rename(district = Name,tpop = Estimated_Total_Pop,stpop = Estimated_Pop_5_17,stpov = Estimated_relevant_5_17_in_poverty,stpovrate = stud_pov_rate)# write out school district poverty ratewrite_csv(stpov19, "data/processed/school_district_student_pov_rate_2019.csv")
Notice
to calculate the statewide poverty rate:
add all of the students living in poverty statewide with: sum(Estimated_relevant_5_17_in_poverty)
add all of the students statewide with: sum(Estimated_Pop_5_17)
divide to calculate poverty rate: students living in poverty statewide / students statewide
Consider how to connect our datasets
To connect two dataframes, you need at least one id variable that is in both dataframes - called a key.
County dataframe
STATE
COUNTY
GEOID
county_pop
county_pov_rate
New York
Albany County
36001
291149
0.1187365
New York
Allegany County
36003
41742
0.1688228
School district dataframe
Postal
County
CONUM
district_id
district
tpop
stpop
stpov
stpovrate
state_stpovrate
year
NY
Albany County
36001
3602460
Albany City School District
98257
10895
2897
0.2659018
0.1673951
2019
NY
Orleans County
36073
3602520
Albion Central School District
14487
1847
354
0.1916622
0.1673951
2019
These two datasets do have a common key - though the column name is different.
Script 3. Create analysis dataframe
Create a new script named 3.analyze_ny_poverty.R
save it to methods1/ny_poverty_analysis
list the analysis steps (with a hash before every line so that it doesn’t run)
# Create analysis dataframelibrary(tidyverse)# Import processed dataframes# Join county poverty dataframe to school district data# Create new variables ### difference from county rate = district poverty rate - county poverty rate### difference from state rate = district poverty rate - state poverty rate# Calculate summary statistics of poverty rate
Script 3. import data
Import your 2 datasets
# Create analysis dataframelibrary(tidyverse)# Import processed dataframescounty_pov <-read_csv("data/processed/county_pov_rate_2019.csv")sd_pov <-read_csv("data/processed/school_district_student_pov_rate_2019.csv")# Join county poverty dataframe to school district data# Create new variables ### difference from county rate = district poverty rate - county poverty rate### difference from state rate = district poverty rate - state poverty rate# Calculate summary statistics of poverty rate
STATE
COUNTY
GEOID
county_pop
county_pov_rate
New York
Albany County
36001
291149
0.1187365
New York
Allegany County
36003
41742
0.1688228
Postal
County
CONUM
district_id
district
tpop
stpop
stpov
stpovrate
state_stpovrate
year
NY
Steuben County
36101
3602370
Addison Central School District
6856
1210
283
0.2338843
0.1673951
2019
NY
Oneida County
36065
3605040
Adirondack Central School District
8424
1346
189
0.1404160
0.1673951
2019
Script 3. left join
Join the county data to the school district poverty data
# Create analysis dataframelibrary(tidyverse)# Import processed dataframescounty_pov <-read_csv("data/processed/county_pov_rate_2019.csv")sd_pov <-read_csv("data/processed/school_district_student_pov_rate_2019.csv")# Join county poverty dataframe to school district datasd_county_pov <- sd_pov %>%left_join(county_pov, by =c("CONUM"="GEOID"))# Create new variables ### difference from county rate = district poverty rate - county poverty rate### difference from state rate = district poverty rate - state poverty rate# Calculate summary statistics of poverty rate
Notice
The by = argument indicates which variableto connect the two dataframes
If the key variable has a different name in the two dataframes
use by = c("CONUM"="GEOID")
If the key variable has the same name in the two dataframes
use by = "CONUM"
Joins explained
When you join two dataframes together, you need to think about which one is your primary table.
In the school district - county example:
Each school district has 1 county
Each county has many school districts
We want to keep all of the school district rows and add a column with the county poverty rate. So we:
create a new dataframe from the school districts
use a left_join() to add the county data to the school district data if they have the same county id.
sd_county_pov <- sd_pov %>%left_join(county_pov, by =c("CONUM"="GEOID"))
Types of joins
We will discuss this more later! For now, just know that there are other types of joins.
Script 3 - Select your variables
# Create analysis dataframelibrary(tidyverse)# Import processed dataframescounty_pov <-read_csv("data/processed/county_pov_rate_2019.csv")sd_pov <-read_csv("data/processed/school_district_student_pov_rate_2019.csv")# Join county poverty dataframe to school district datasd_county_pov <- sd_pov %>%left_join(county_pov, by =c("CONUM"="GEOID")) %>%select(district_id, district, County, CONUM, tpop, stpop, stpov, stpovrate, county_pop, county_pov_rate)# Create new variables ### difference from county rate = district poverty rate - county poverty rate### difference from state rate = district poverty rate - state poverty rate# Calculate summary statistics of poverty rate
Script 3 - Calculate poverty rate difference
# Create analysis dataframelibrary(tidyverse)# Import processed dataframescounty_pov <-read_csv("data/processed/county_pov_rate_2019.csv")sd_pov <-read_csv("data/processed/school_district_student_pov_rate_2019.csv")# Join county poverty dataframe to school district data# Create new variables ### difference from county rate = district poverty rate - county poverty rate### difference from state rate = district poverty rate - state poverty ratesd_county_pov <- sd_pov %>%left_join(county_pov, by =c("CONUM"="GEOID")) %>%select(district_id, district, County, CONUM, tpop, stpop, stpov, stpovrate, county_pop, county_pov_rate, state_stpovrate) %>%mutate(pov_diff_county =round(stpovrate - county_pov_rate, 3),pov_diff_state =round(stpovrate - state_stpovrate, 3))# Calculate summary statistics of poverty rate
Analysis dataframe created
selected columns from the analysis dataframe:
district
County
stpop
stpovrate
county_pov_rate
state_stpovrate
pov_diff_county
pov_diff_state
Addison Central School District
Steuben County
1210
0.2338843
0.1369303
0.1673951
0.097
0.066
Adirondack Central School District
Oneida County
1346
0.1404160
0.1547234
0.1673951
-0.014
-0.027
Afton Central School District
Chenango County
583
0.1972556
0.1353454
0.1673951
0.062
0.030
Akron Central School District
Erie County
1503
0.0951430
0.1419954
0.1673951
-0.047
-0.072
Albany City School District
Albany County
10895
0.2659018
0.1187365
0.1673951
0.147
0.099
Research Questions
We have created the dataframes to answer these questions!
What is the difference between the student poverty rate in each school district and:
the poverty rate of the county as a whole?
the poverty rate of the state as a whole?
Still to do
What counties have the most economic inequality, as measured by the student poverty rate of school districts?
In-class work time
Work through this analysis on your own up to this point for ~ 30 minutes.
I’ll come around to help!
We’ll come back together and answer these questions together:
What district has the highest student poverty rate?
What county is that district in?
What county has the highest poverty rate?
What school district has the lowest student poverty rate in that county?
If you finish, try working through the joins_simple_examples.R script in today’s folder.
Exploratory Analysis
Now that we have our analysis dataset, it’s time to use it to learn more about poverty in New York state. We’ll use descriptive statistics and visualization to interpret the data.
Summary statistics for each variable with the summary() function
A very simple scatterplot & histogram to understand the shape of our data
Calculate our own descriptive statistics with the summarise() function
for the state
for each county
Calculate some quick summary statistics
summary(sd_county_pov)
district_id district County CONUM
Min. :3600001 Length:681 Length:681 Min. :36001
1st Qu.:3609330 Class :character Class :character 1st Qu.:36033
Median :3616980 Mode :character Mode :character Median :36065
Mean :3616755 Mean :36066
3rd Qu.:3624240 3rd Qu.:36103
Max. :3632010 Max. :36123
tpop stpop stpov stpovrate
Min. : 104 Min. : 1 Min. : 0.0 Min. :0.00000
1st Qu.: 5280 1st Qu.: 782 1st Qu.: 84.0 1st Qu.:0.06954
Median : 9629 Median : 1441 Median : 156.0 Median :0.12479
Mean : 28935 Mean : 4260 Mean : 713.2 Mean :0.13102
3rd Qu.: 20556 3rd Qu.: 3306 3rd Qu.: 294.0 3rd Qu.:0.18085
Max. :8336817 Max. :1193045 Max. :259012.0 Max. :0.50109
county_pop county_pov_rate state_stpovrate pov_diff_county
Min. : 4479 Min. :0.04981 Min. :0.1674 Min. :-0.11000
1st Qu.: 63306 1st Qu.:0.08840 1st Qu.:0.1674 1st Qu.:-0.03300
Median : 153415 Median :0.12097 Median :0.1674 Median : 0.00000
Mean : 457515 Mean :0.11926 Mean :0.1674 Mean : 0.01177
3rd Qu.: 893029 3rd Qu.:0.14392 3rd Qu.:0.1674 3rd Qu.: 0.04800
Max. :1584693 Max. :0.19774 Max. :0.1674 Max. : 0.38500
pov_diff_state
Min. :-0.16700
1st Qu.:-0.09800
Median :-0.04300
Mean :-0.03638
3rd Qu.: 0.01300
Max. : 0.33400
Look at a histogram
A histogram is a chart that shows the distribution of your data.
The height of each bar indicates how many district’s poverty difference is within that range.
hist(sd_county_pov$pov_diff_county)
( pov_diff_county = stpovrate - county_pov_rate )
Look at a scatterplot
A scatterplot is a chart that is used to look at the relationship between two variables.
Each dot is a school district.
The pattern of dots helps you to determine whether a relationship exists between two variables.
Here, you can say there is a positive relationship between County Poverty Rate and School District Poverty Rate
meaning as County Poverty Rate increases, the School District Poverty Rate tends to be higher also
plot(sd_county_pov$county_pov_rate, sd_county_pov$stpovrate, xlab="County Poverty Rate", ylab="School District Poverty Rate")
summarise()
We can use the summarise() function to create our own statistics to help answer our questions.
we’ll create a new dataframe that is a single row summarizing our dataframe