Methods 1, Week 3

Download folder for class 3 and create a new project

Download the class 3 folder and save them in your methods1 folder on your computer

Create new project from existing folder:

methods1/class3/ny_poverty_analysis

Install the readxl package

install.packages("readxl")

Learning goals

Over the next two classes (week 3 & 4) we’re going to work on one analysis together.

It will be a template for how to organize small research projects.
We’ll use the two datasets you created for the homework assignment to explore economic segregation across New York.

Outline

In-class analysis overview
Homework review
Importing excel files
Joining data
Summary statistics
Simple visualizatoion
Readings discussion
Assignment 3

In-class Analysis

Explore the level of economic inequality in school districts across New York State.

What is the difference between the student poverty rate in each school district and:
- the poverty rate of the county as a whole?
- the poverty rate of the state as a whole?
What counties have the most economic inequality, as measured by the student poverty rate of school districts?

New York School Districts and Counties

Analysis plan

Create dataframe of poverty rate by county
Create dataframe of student poverty rate by school district
Calculate the statewide student poverty rate
Join the school district and county poverty dataframes to compare the poverty rates
Measure the difference in poverty rates of each school district and it’s county and the state
Use summary statistics to explore and gain understanding
Use visualizations to explore and gain understanding

Analysis steps

Script 1. Process data
- Create dataframe of poverty rate by county
Script 2. Process data
- Create dataframe of student poverty rate by school district
- Calculate statewide student poverty rate
Script 3. Create analysis dataframe
- Join county poverty dataframe to school district data
- Create new variables
  - difference from county rate = district poverty rate - county poverty rate
  - difference from state rate = district poverty rate - state poverty rate
- Calculate summary statistics
Script 4. Visualize and explore your analysis dataframe

Script 1. Poverty rate by county

Create a new script named 1.process_ny_county_poverty_data.R

save it to methods1/ny_poverty_analysis

# Processing and exploring the 2019 poverty data from the American Community Survey for New York

library(tidyverse)

# import the county poverty rate dataset
raw_county_pov19 <- read_csv("data/raw/nhgis0042_csv/nhgis0042_ds244_20195_2019_county.csv") 

## process the data to create New York poverty rate dataset
county_pov_19 <- raw_county_pov19 %>% 
  select(STATE, COUNTY, GEOID, ALWVE001, ALWVE002, ALWVE003) %>% 
  filter(STATE == "New York") %>% 
  rename(county_pop = ALWVE001) %>% 
  mutate(county_pov_rate = (ALWVE002 + ALWVE003)/county_pop) %>% 
  select(STATE, COUNTY, GEOID, county_pop, county_pov_rate) 

# write out county poverty rate
write_csv(county_pov_19, "data/processed/county_pov_rate_2019.csv")

Script 2. Student poverty rate by school district

Create a new script named 2.process_ny_sd_poverty_data.R

save it to methods1/ny_poverty_analysis

# Create dataframe of school district poverty rate

library(tidyverse)
# import raw student poverty dataset
raw_stpov19 <- read_csv("data/raw/ny_student_poverty_2019.csv")

# create new dataframe from raw_stpov19 to process the data
stpov19 <- raw_stpov19 %>%
  rename(Estimated_Pop_5_17 = `Estimated Population 5-17`) %>% 
  mutate(stud_pov_rate = Estimated_relevant_5_17_in_poverty/Estimated_Pop_5_17,
         year = "2019") %>% 
  select(Postal, County, CONUM, district_id, Name, Estimated_Total_Pop, Estimated_Pop_5_17,
         Estimated_relevant_5_17_in_poverty, stud_pov_rate, year) %>%
  # filter(Estimated_Pop_5_17 >= 100) %>% # remove filter
  rename(district = Name,
         tpop = Estimated_Total_Pop,
         stpop = Estimated_Pop_5_17,
         stpov = Estimated_relevant_5_17_in_poverty,
         stpovrate = stud_pov_rate)

We’re keeping 2 columns that weren’t in the homework file:

County and CONUM (county id)

Next steps

Calculate the statewide student poverty rate
Figure out how to connect the school district and county data to compare the poverty rates.

Calculate the statewide student poverty rate

# Create dataframe of school district poverty rate

library(tidyverse)
# import raw student poverty dataset
raw_stpov19 <- read_csv("data/raw/ny_student_poverty_2019.csv")

# create new dataframe from raw_stpov19 to process the data
stpov19 <- raw_stpov19 %>%
  rename(Estimated_Pop_5_17 = `Estimated Population 5-17`) %>% 
  mutate(stud_pov_rate = Estimated_relevant_5_17_in_poverty/Estimated_Pop_5_17,
         year = "2019",
         state_stpovrate = sum(Estimated_relevant_5_17_in_poverty)/sum(Estimated_Pop_5_17)) %>% 
  select(Postal, County, CONUM, district_id, Name, Estimated_Total_Pop, Estimated_Pop_5_17,
         Estimated_relevant_5_17_in_poverty, stud_pov_rate,
         state_stpovrate, year) %>%
  rename(district = Name,
         tpop = Estimated_Total_Pop,
         stpop = Estimated_Pop_5_17,
         stpov = Estimated_relevant_5_17_in_poverty,
         stpovrate = stud_pov_rate)

# write out school district poverty rate
write_csv(stpov19, "data/processed/school_district_student_pov_rate_2019.csv")

Notice

to calculate the statewide poverty rate:
- add all of the students living in poverty statewide with: sum(Estimated_relevant_5_17_in_poverty)
- add all of the students statewide with: sum(Estimated_Pop_5_17)
- divide to calculate poverty rate: students living in poverty statewide / students statewide

Consider how to connect our datasets

To connect two dataframes, you need at least one id variable that is in both dataframes - called a key.

County dataframe

STATE	COUNTY	GEOID	county_pop	county_pov_rate
New York	Albany County	36001	291149	0.1187365
New York	Allegany County	36003	41742	0.1688228

School district dataframe

Postal	County	CONUM	district_id	district	tpop	stpop	stpov	stpovrate	state_stpovrate	year
NY	Albany County	36001	3602460	Albany City School District	98257	10895	2897	0.2659018	0.1673951	2019
NY	Orleans County	36073	3602520	Albion Central School District	14487	1847	354	0.1916622	0.1673951	2019

These two datasets do have a common key - though the column name is different.

Script 3. Create analysis dataframe

Create a new script named 3.analyze_ny_poverty.R

save it to methods1/ny_poverty_analysis
list the analysis steps (with a hash before every line so that it doesn’t run)

# Create analysis dataframe

library(tidyverse)
# Import processed dataframes
# Join county poverty dataframe to school district data
# Create new variables 
### difference from county rate = district poverty rate - county poverty rate
### difference from state rate = district poverty rate - state poverty rate
# Calculate summary statistics of poverty rate

Script 3. import data

Import your 2 datasets

# Create analysis dataframe
library(tidyverse)

# Import processed dataframes
county_pov <- read_csv("data/processed/county_pov_rate_2019.csv")
sd_pov <- read_csv("data/processed/school_district_student_pov_rate_2019.csv")

# Join county poverty dataframe to school district data
# Create new variables 
### difference from county rate = district poverty rate - county poverty rate
### difference from state rate = district poverty rate - state poverty rate
# Calculate summary statistics of poverty rate

STATE	COUNTY	GEOID	county_pop	county_pov_rate
New York	Albany County	36001	291149	0.1187365
New York	Allegany County	36003	41742	0.1688228

Postal	County	CONUM	district_id	district	tpop	stpop	stpov	stpovrate	state_stpovrate	year
NY	Steuben County	36101	3602370	Addison Central School District	6856	1210	283	0.2338843	0.1673951	2019
NY	Oneida County	36065	3605040	Adirondack Central School District	8424	1346	189	0.1404160	0.1673951	2019

Script 3. left join

Join the county data to the school district poverty data

# Create analysis dataframe
library(tidyverse)

# Import processed dataframes
county_pov <- read_csv("data/processed/county_pov_rate_2019.csv")
sd_pov <- read_csv("data/processed/school_district_student_pov_rate_2019.csv")

# Join county poverty dataframe to school district data
sd_county_pov <- sd_pov %>% 
  left_join(county_pov, by = c("CONUM"="GEOID"))

# Create new variables 
### difference from county rate = district poverty rate - county poverty rate
### difference from state rate = district poverty rate - state poverty rate
# Calculate summary statistics of poverty rate

Notice

The by = argument indicates which variableto connect the two dataframes
If the key variable has a different name in the two dataframes
- use by = c("CONUM"="GEOID")
If the key variable has the same name in the two dataframes
- use by = "CONUM"

Joins explained

When you join two dataframes together, you need to think about which one is your primary table.

In the school district - county example:

Each school district has 1 county
Each county has many school districts

We want to keep all of the school district rows and add a column with the county poverty rate. So we:

create a new dataframe from the school districts
use a left_join() to add the county data to the school district data if they have the same county id.

sd_county_pov <- sd_pov %>% 
  left_join(county_pov, by = c("CONUM"="GEOID"))

Types of joins

We will discuss this more later! For now, just know that there are other types of joins.

Script 3 - Select your variables

# Create analysis dataframe
library(tidyverse)

# Import processed dataframes
county_pov <- read_csv("data/processed/county_pov_rate_2019.csv")
sd_pov <- read_csv("data/processed/school_district_student_pov_rate_2019.csv")


# Join county poverty dataframe to school district data
sd_county_pov <- sd_pov %>% 
  left_join(county_pov, by = c("CONUM"="GEOID")) %>% 
  select(district_id, district, County, CONUM, tpop, stpop, stpov, 
         stpovrate, county_pop, county_pov_rate)

# Create new variables 
### difference from county rate = district poverty rate - county poverty rate
### difference from state rate = district poverty rate - state poverty rate
# Calculate summary statistics of poverty rate

Script 3 - Calculate poverty rate difference

# Create analysis dataframe
library(tidyverse)

# Import processed dataframes
county_pov <- read_csv("data/processed/county_pov_rate_2019.csv")
sd_pov <- read_csv("data/processed/school_district_student_pov_rate_2019.csv")

# Join county poverty dataframe to school district data
# Create new variables 
### difference from county rate = district poverty rate - county poverty rate
### difference from state rate = district poverty rate - state poverty rate
sd_county_pov <- sd_pov %>% 
  left_join(county_pov, by = c("CONUM"="GEOID")) %>% 
  select(district_id, district, County, CONUM, tpop, stpop, stpov, 
         stpovrate, county_pop, county_pov_rate, state_stpovrate) %>% 
  mutate(pov_diff_county = round(stpovrate - county_pov_rate, 3),
         pov_diff_state = round(stpovrate - state_stpovrate, 3))
# Calculate summary statistics of poverty rate

Analysis dataframe created

selected columns from the analysis dataframe:

district	County	stpop	stpovrate	county_pov_rate	state_stpovrate	pov_diff_county	pov_diff_state
Addison Central School District	Steuben County	1210	0.2338843	0.1369303	0.1673951	0.097	0.066
Adirondack Central School District	Oneida County	1346	0.1404160	0.1547234	0.1673951	-0.014	-0.027
Afton Central School District	Chenango County	583	0.1972556	0.1353454	0.1673951	0.062	0.030
Akron Central School District	Erie County	1503	0.0951430	0.1419954	0.1673951	-0.047	-0.072
Albany City School District	Albany County	10895	0.2659018	0.1187365	0.1673951	0.147	0.099

Research Questions

We have created the dataframes to answer these questions!

What is the difference between the student poverty rate in each school district and:
- the poverty rate of the county as a whole?
- the poverty rate of the state as a whole?

Still to do

What counties have the most economic inequality, as measured by the student poverty rate of school districts?

In-class work time

Work through this analysis on your own up to this point for ~ 30 minutes.

I’ll come around to help!

We’ll come back together and answer these questions together:

What district has the highest student poverty rate?
What county is that district in?
What county has the highest poverty rate?
What school district has the lowest student poverty rate in that county?

If you finish, try working through the joins_simple_examples.R script in today’s folder.

Exploratory Analysis

Now that we have our analysis dataset, it’s time to use it to learn more about poverty in New York state. We’ll use descriptive statistics and visualization to interpret the data.

Summary statistics for each variable with the summary() function
A very simple scatterplot & histogram to understand the shape of our data
Calculate our own descriptive statistics with the summarise() function
- for the state
- for each county

Calculate some quick summary statistics

summary(sd_county_pov)

  district_id        district            County              CONUM      
 Min.   :3600001   Length:681         Length:681         Min.   :36001  
 1st Qu.:3609330   Class :character   Class :character   1st Qu.:36033  
 Median :3616980   Mode  :character   Mode  :character   Median :36065  
 Mean   :3616755                                         Mean   :36066  
 3rd Qu.:3624240                                         3rd Qu.:36103  
 Max.   :3632010                                         Max.   :36123  
      tpop             stpop             stpov            stpovrate      
 Min.   :    104   Min.   :      1   Min.   :     0.0   Min.   :0.00000  
 1st Qu.:   5280   1st Qu.:    782   1st Qu.:    84.0   1st Qu.:0.06954  
 Median :   9629   Median :   1441   Median :   156.0   Median :0.12479  
 Mean   :  28935   Mean   :   4260   Mean   :   713.2   Mean   :0.13102  
 3rd Qu.:  20556   3rd Qu.:   3306   3rd Qu.:   294.0   3rd Qu.:0.18085  
 Max.   :8336817   Max.   :1193045   Max.   :259012.0   Max.   :0.50109  
   county_pop      county_pov_rate   state_stpovrate  pov_diff_county   
 Min.   :   4479   Min.   :0.04981   Min.   :0.1674   Min.   :-0.11000  
 1st Qu.:  63306   1st Qu.:0.08840   1st Qu.:0.1674   1st Qu.:-0.03300  
 Median : 153415   Median :0.12097   Median :0.1674   Median : 0.00000  
 Mean   : 457515   Mean   :0.11926   Mean   :0.1674   Mean   : 0.01177  
 3rd Qu.: 893029   3rd Qu.:0.14392   3rd Qu.:0.1674   3rd Qu.: 0.04800  
 Max.   :1584693   Max.   :0.19774   Max.   :0.1674   Max.   : 0.38500  
 pov_diff_state    
 Min.   :-0.16700  
 1st Qu.:-0.09800  
 Median :-0.04300  
 Mean   :-0.03638  
 3rd Qu.: 0.01300  
 Max.   : 0.33400

Look at a histogram

A histogram is a chart that shows the distribution of your data.

The height of each bar indicates how many district’s poverty difference is within that range.

hist(sd_county_pov$pov_diff_county)

( pov_diff_county = stpovrate - county_pov_rate )

Look at a scatterplot

A scatterplot is a chart that is used to look at the relationship between two variables.

Each dot is a school district.
The pattern of dots helps you to determine whether a relationship exists between two variables.
Here, you can say there is a positive relationship between County Poverty Rate and School District Poverty Rate
- meaning as County Poverty Rate increases, the School District Poverty Rate tends to be higher also

plot(sd_county_pov$county_pov_rate, sd_county_pov$stpovrate, 
     xlab="County Poverty Rate", 
     ylab="School District Poverty Rate")

summarise()

We can use the summarise() function to create our own statistics to help answer our questions.
- we’ll create a new dataframe that is a single row summarizing our dataframe
- useful arguments within summarize:
  - mean(), median(), min(), max(), sum(), n()
  - n() returns the number of rows

summarise() New York poverty

# calculate student poverty statistics for New York

ny_pov_stats <- sd_county_pov %>%
  summarise(districts = n(),
            kids = sum(stpop),
            kids_in_pov = sum(stpov),
            stud_poverty_rate = kids_in_pov/kids,
            mean_sd_stpovrate = mean(stpovrate),
            max_sd_stpovrate = max(stpovrate), 
            min_sd_stpovrate = min(stpovrate),
            poverty_range = max_sd_stpovrate - min_sd_stpovrate)

districts	kids	kids_in_pov	stud_poverty_rate	mean_sd_stpovrate	max_sd_stpovrate	min_sd_stpovrate	poverty_range
681	2901286	485661	0.1673951	0.1310178	0.5010861	0	0.5010861

group_by() & summarise()

We can also calculate the same statistics for each county by using:
- group_by() and summarise() together
- group_by() indicates that you want to summarize based on another variable rather than the whole dataset

ny_county_pov_stats <- sd_county_pov %>%
  group_by(County) %>% 
  summarise(districts = n(),
            kids = sum(stpop),
            kids_in_pov = sum(stpov),
            stud_poverty_rate = round(kids_in_pov/kids, 3),
            mean_sd_stpovrate = round(mean(stpovrate), 3),
            max_sd_stpovrate = round(max(stpovrate), 3),
            min_sd_stpovrate = round(min(stpovrate), 3),
            poverty_range = max_sd_stpovrate - min_sd_stpovrate)

summarize poverty in new york counties

County	districts	kids	kids_in_pov	stud_poverty_rate	mean_sd_stpovrate	max_sd_stpovrate	min_sd_stpovrate	poverty_range
Albany County	12	40249	5403	0.134	0.129	0.266	0.041	0.225
Allegany County	12	6796	1633	0.240	0.241	0.315	0.185	0.130
Broome County	12	27673	5916	0.214	0.182	0.370	0.079	0.291
Cattaraugus County	12	13775	2594	0.188	0.183	0.239	0.126	0.113
Cayuga County	7	9694	1728	0.178	0.152	0.225	0.090	0.135
Chautauqua County	18	19226	4335	0.225	0.190	0.333	0.088	0.245
Chemung County	3	12057	2267	0.188	0.173	0.248	0.088	0.160
Chenango County	8	7478	1225	0.164	0.166	0.197	0.125	0.072
Clinton County	8	11080	1647	0.149	0.144	0.211	0.097	0.114
Columbia County	6	7212	894	0.124	0.118	0.185	0.079	0.106
Cortland County	5	6632	1179	0.178	0.189	0.242	0.114	0.128
Delaware County	13	5774	1012	0.175	0.165	0.211	0.106	0.105
Dutchess County	13	41543	3759	0.090	0.098	0.239	0.055	0.184
Erie County	28	134940	24775	0.184	0.124	0.381	0.041	0.340
Essex County	10	3533	578	0.164	0.160	0.246	0.054	0.192
Franklin County	7	7803	1733	0.222	0.218	0.316	0.130	0.186
Fulton County	6	7651	1643	0.215	0.185	0.309	0.131	0.178
Genesee County	8	8609	972	0.113	0.101	0.165	0.072	0.093
Greene County	6	5922	959	0.162	0.177	0.315	0.106	0.209
Hamilton County	7	413	51	0.123	0.120	0.222	0.000	0.222
Herkimer County	10	9636	1673	0.174	0.168	0.237	0.078	0.159
Jefferson County	11	17784	3598	0.202	0.191	0.293	0.127	0.166
Lewis County	5	4254	731	0.172	0.167	0.181	0.128	0.053
Livingston County	8	7917	1168	0.148	0.154	0.256	0.091	0.165
Madison County	10	10139	1190	0.117	0.125	0.214	0.066	0.148
Monroe County	18	113152	19831	0.175	0.113	0.379	0.034	0.345
Montgomery County	5	8413	2014	0.239	0.230	0.268	0.151	0.117
Nassau County	56	216350	12320	0.057	0.054	0.193	0.015	0.178
New York County	1	1193045	259012	0.217	0.217	0.217	0.217	0.000
Niagara County	10	30500	4912	0.161	0.129	0.314	0.048	0.266
Oneida County	15	35425	6159	0.174	0.119	0.321	0.045	0.276
Onondaga County	18	70736	12000	0.170	0.115	0.364	0.035	0.329
Ontario County	9	17119	1754	0.102	0.112	0.181	0.051	0.130
Orange County	17	72854	11745	0.161	0.126	0.501	0.051	0.450
Orleans County	5	5895	1027	0.174	0.169	0.192	0.123	0.069
Oswego County	9	19552	3686	0.189	0.186	0.239	0.152	0.087
Otsego County	12	6822	1067	0.156	0.155	0.235	0.090	0.145
Putnam County	6	14400	780	0.054	0.056	0.073	0.038	0.035
Rensselaer County	12	22228	3147	0.142	0.124	0.262	0.046	0.216
Rockland County	8	65876	12842	0.195	0.110	0.342	0.047	0.295
Saratoga County	12	34936	2350	0.067	0.088	0.151	0.041	0.110
Schenectady County	6	23088	4600	0.199	0.132	0.335	0.045	0.290
Schoharie County	6	4268	652	0.153	0.169	0.285	0.104	0.181
Schuyler County	3	2105	415	0.197	0.200	0.213	0.190	0.023
Seneca County	4	4926	917	0.186	0.192	0.234	0.157	0.077
St. Lawrence County	17	15926	3519	0.221	0.212	0.335	0.121	0.214
Steuben County	12	15402	2732	0.177	0.195	0.291	0.120	0.171
Suffolk County	68	230466	17818	0.077	0.081	0.417	0.000	0.417
Sullivan County	8	10351	2537	0.245	0.214	0.288	0.133	0.155
Tioga County	6	7354	1038	0.141	0.141	0.177	0.102	0.075
Tompkins County	6	11569	1457	0.126	0.137	0.175	0.101	0.074
Ulster County	9	23786	3177	0.134	0.134	0.215	0.074	0.141
Warren County	9	9039	1183	0.131	0.149	0.261	0.053	0.208
Washington County	11	8399	1283	0.153	0.163	0.267	0.102	0.165
Wayne County	11	14309	2098	0.147	0.153	0.233	0.052	0.181
Westchester County	40	157456	13588	0.086	0.058	0.165	0.015	0.150
Wyoming County	5	4410	510	0.116	0.118	0.155	0.088	0.067
Yates County	2	3339	828	0.248	0.260	0.305	0.216	0.089

Research Questions

We have created the dataframes to answer these questions!

What is the difference between the student poverty rate in each school district and:

the poverty rate of the county as a whole?
the poverty rate of the state as a whole?

What counties have the most economic inequality, as measured by the student poverty rate of school districts?

In-class work time

Work through the rest of the analysis on your own for ~ 30 minutes.

I’ll come around to help!

We’ll come back together and answer these questions together:

What county has the largest range in student poverty rate?
What county has the smallest range in student poverty rate?
What other questions could you answer?

If you finish, take a look at the homework assignments.

Write out your csvs

Save these processed dataframe and summary stats dataframe to your computer

# write out poverty rate stat
write_csv(ny_pov_stats, "data/output/ny_student_poverty_state_stats_2019.csv")

write_csv(ny_county_pov_stats, "data/ny_student_poverty_state_stats_by_county_2019.csv")

To Be Continued…

Next week we’ll use much prettier plots to explore this dataset more.

Assignment 3a

Please read and be prepared to discuss the following readings:

Listen to the CodeSwitch episode Who You Calling Hispanic

Optionally, read Chapter 13 of R for Data Science on joining data if you would like more about relational data and joining data with R.

Assignment 3b

Research Journal

Assignment 3c

Work through the in-class exercise and answer the following questions:

What district has the highest student poverty rate?
What county is that district in?
What county has the highest poverty rate? (you should use the county poverty dataset created in script 1 to answer this question)
What school district has the lowest student poverty rate in Montgomery County?
What county has the largest range in student poverty rate?
What county has the smallest range in student poverty rate?

On Canvas, submit a text document with the answer to these questions and your 3 scripts.

Methods 1, Week 3

Download folder for class 3 and create a new project

Learning goals

Outline

In-class analysis overview

Homework review

Importing excel files

Joining data

Summary statistics

Simple visualizatoion

Readings discussion

Assignment 3

In-class Analysis

Explore the level of economic inequality in school districts across New York State.

New York School Districts and Counties

Analysis plan

Analysis steps

Script 1. Poverty rate by county

Script 2. Student poverty rate by school district

Next steps

Calculate the statewide student poverty rate

Consider how to connect our datasets

Script 3. Create analysis dataframe

Script 3. import data

Script 3. left join

Joins explained

Types of joins

Script 3 - Select your variables

Script 3 - Calculate poverty rate difference

Analysis dataframe created

Research Questions

In-class work time

Exploratory Analysis

Calculate some quick summary statistics

Look at a histogram

Look at a scatterplot

summarise()

summarise() New York poverty

group_by() & summarise()

summarize poverty in new york counties

Research Questions

In-class work time

Write out your csvs

To Be Continued…

Assignment 3a

Assignment 3b

Assignment 3c