Methods 1, Week 3

Download folder for class 3 and create a new project

Download the class 3 folder and save them in your methods1 folder on your computer

Create new project from existing folder:

  • methods1/class3/ny_poverty_analysis

Install the readxl package

install.packages("readxl")

Learning goals


Over the next two classes (week 3 & 4) we’re going to work on one analysis together.

  • It will be a template for how to organize small research projects.

  • We’ll use the two datasets you created for the homework assignment to explore economic segregation across New York.

Outline

  • In-class analysis overview

  • Homework review

  • Importing excel files

  • Joining data

  • Summary statistics

  • Simple visualizatoion

  • Readings discussion

  • Assignment 3

In-class Analysis

Explore the level of economic inequality in school districts across New York State.

  • What is the difference between the student poverty rate in each school district and:
    • the poverty rate of the county as a whole?
    • the poverty rate of the state as a whole?
  • What counties have the most economic inequality, as measured by the student poverty rate of school districts?

New York School Districts and Counties

Analysis plan

  • Create dataframe of poverty rate by county
  • Create dataframe of student poverty rate by school district
  • Calculate the statewide student poverty rate
  • Join the school district and county poverty dataframes to compare the poverty rates
  • Measure the difference in poverty rates of each school district and it’s county and the state
  • Use summary statistics to explore and gain understanding
  • Use visualizations to explore and gain understanding

Analysis steps

  • Script 1. Process data
    • Create dataframe of poverty rate by county
  • Script 2. Process data
    • Create dataframe of student poverty rate by school district
    • Calculate statewide student poverty rate
  • Script 3. Create analysis dataframe
    • Join county poverty dataframe to school district data
    • Create new variables
      • difference from county rate = district poverty rate - county poverty rate
      • difference from state rate = district poverty rate - state poverty rate
    • Calculate summary statistics
  • Script 4. Visualize and explore your analysis dataframe

Script 1. Poverty rate by county

Create a new script named 1.process_ny_county_poverty_data.R

  • save it to methods1/ny_poverty_analysis
# Processing and exploring the 2019 poverty data from the American Community Survey for New York

library(tidyverse)

# import the county poverty rate dataset
raw_county_pov19 <- read_csv("data/raw/nhgis0042_csv/nhgis0042_ds244_20195_2019_county.csv") 

## process the data to create New York poverty rate dataset
county_pov_19 <- raw_county_pov19 %>% 
  select(STATE, COUNTY, GEOID, ALWVE001, ALWVE002, ALWVE003) %>% 
  filter(STATE == "New York") %>% 
  rename(county_pop = ALWVE001) %>% 
  mutate(county_pov_rate = (ALWVE002 + ALWVE003)/county_pop) %>% 
  select(STATE, COUNTY, GEOID, county_pop, county_pov_rate) 

# write out county poverty rate
write_csv(county_pov_19, "data/processed/county_pov_rate_2019.csv")

Script 2. Student poverty rate by school district

Create a new script named 2.process_ny_sd_poverty_data.R

  • save it to methods1/ny_poverty_analysis
# Create dataframe of school district poverty rate

library(tidyverse)
# import raw student poverty dataset
raw_stpov19 <- read_csv("data/raw/ny_student_poverty_2019.csv")

# create new dataframe from raw_stpov19 to process the data
stpov19 <- raw_stpov19 %>%
  rename(Estimated_Pop_5_17 = `Estimated Population 5-17`) %>% 
  mutate(stud_pov_rate = Estimated_relevant_5_17_in_poverty/Estimated_Pop_5_17,
         year = "2019") %>% 
  select(Postal, County, CONUM, district_id, Name, Estimated_Total_Pop, Estimated_Pop_5_17,
         Estimated_relevant_5_17_in_poverty, stud_pov_rate, year) %>%
  # filter(Estimated_Pop_5_17 >= 100) %>% # remove filter
  rename(district = Name,
         tpop = Estimated_Total_Pop,
         stpop = Estimated_Pop_5_17,
         stpov = Estimated_relevant_5_17_in_poverty,
         stpovrate = stud_pov_rate)

We’re keeping 2 columns that weren’t in the homework file:

  • County and CONUM (county id)

Next steps

  1. Calculate the statewide student poverty rate
  2. Figure out how to connect the school district and county data to compare the poverty rates.

Calculate the statewide student poverty rate

# Create dataframe of school district poverty rate

library(tidyverse)
# import raw student poverty dataset
raw_stpov19 <- read_csv("data/raw/ny_student_poverty_2019.csv")

# create new dataframe from raw_stpov19 to process the data
stpov19 <- raw_stpov19 %>%
  rename(Estimated_Pop_5_17 = `Estimated Population 5-17`) %>% 
  mutate(stud_pov_rate = Estimated_relevant_5_17_in_poverty/Estimated_Pop_5_17,
         year = "2019",
         state_stpovrate = sum(Estimated_relevant_5_17_in_poverty)/sum(Estimated_Pop_5_17)) %>% 
  select(Postal, County, CONUM, district_id, Name, Estimated_Total_Pop, Estimated_Pop_5_17,
         Estimated_relevant_5_17_in_poverty, stud_pov_rate,
         state_stpovrate, year) %>%
  rename(district = Name,
         tpop = Estimated_Total_Pop,
         stpop = Estimated_Pop_5_17,
         stpov = Estimated_relevant_5_17_in_poverty,
         stpovrate = stud_pov_rate)

# write out school district poverty rate
write_csv(stpov19, "data/processed/school_district_student_pov_rate_2019.csv")

Notice

  • to calculate the statewide poverty rate:
    • add all of the students living in poverty statewide with: sum(Estimated_relevant_5_17_in_poverty)
    • add all of the students statewide with: sum(Estimated_Pop_5_17)
    • divide to calculate poverty rate: students living in poverty statewide / students statewide

Consider how to connect our datasets

  • To connect two dataframes, you need at least one id variable that is in both dataframes - called a key.

County dataframe

STATE COUNTY GEOID county_pop county_pov_rate
New York Albany County 36001 291149 0.1187365
New York Allegany County 36003 41742 0.1688228


School district dataframe

Postal County CONUM district_id district tpop stpop stpov stpovrate state_stpovrate year
NY Albany County 36001 3602460 Albany City School District 98257 10895 2897 0.2659018 0.1673951 2019
NY Orleans County 36073 3602520 Albion Central School District 14487 1847 354 0.1916622 0.1673951 2019


  • These two datasets do have a common key - though the column name is different.

Script 3. Create analysis dataframe


Create a new script named 3.analyze_ny_poverty.R

  • save it to methods1/ny_poverty_analysis
  • list the analysis steps (with a hash before every line so that it doesn’t run)
# Create analysis dataframe

library(tidyverse)
# Import processed dataframes
# Join county poverty dataframe to school district data
# Create new variables 
### difference from county rate = district poverty rate - county poverty rate
### difference from state rate = district poverty rate - state poverty rate
# Calculate summary statistics of poverty rate

Script 3. import data

Import your 2 datasets

# Create analysis dataframe
library(tidyverse)

# Import processed dataframes
county_pov <- read_csv("data/processed/county_pov_rate_2019.csv")
sd_pov <- read_csv("data/processed/school_district_student_pov_rate_2019.csv")

# Join county poverty dataframe to school district data
# Create new variables 
### difference from county rate = district poverty rate - county poverty rate
### difference from state rate = district poverty rate - state poverty rate
# Calculate summary statistics of poverty rate
STATE COUNTY GEOID county_pop county_pov_rate
New York Albany County 36001 291149 0.1187365
New York Allegany County 36003 41742 0.1688228


Postal County CONUM district_id district tpop stpop stpov stpovrate state_stpovrate year
NY Steuben County 36101 3602370 Addison Central School District 6856 1210 283 0.2338843 0.1673951 2019
NY Oneida County 36065 3605040 Adirondack Central School District 8424 1346 189 0.1404160 0.1673951 2019

Script 3. left join

Join the county data to the school district poverty data

# Create analysis dataframe
library(tidyverse)

# Import processed dataframes
county_pov <- read_csv("data/processed/county_pov_rate_2019.csv")
sd_pov <- read_csv("data/processed/school_district_student_pov_rate_2019.csv")

# Join county poverty dataframe to school district data
sd_county_pov <- sd_pov %>% 
  left_join(county_pov, by = c("CONUM"="GEOID"))

# Create new variables 
### difference from county rate = district poverty rate - county poverty rate
### difference from state rate = district poverty rate - state poverty rate
# Calculate summary statistics of poverty rate

Notice

  • The by = argument indicates which variableto connect the two dataframes
  • If the key variable has a different name in the two dataframes
    • use by = c("CONUM"="GEOID")
  • If the key variable has the same name in the two dataframes
    • use by = "CONUM"

Joins explained

When you join two dataframes together, you need to think about which one is your primary table.

In the school district - county example:

  • Each school district has 1 county
  • Each county has many school districts

We want to keep all of the school district rows and add a column with the county poverty rate. So we:

  • create a new dataframe from the school districts
  • use a left_join() to add the county data to the school district data if they have the same county id.
sd_county_pov <- sd_pov %>% 
  left_join(county_pov, by = c("CONUM"="GEOID"))

Types of joins

We will discuss this more later! For now, just know that there are other types of joins.

Script 3 - Select your variables

# Create analysis dataframe
library(tidyverse)

# Import processed dataframes
county_pov <- read_csv("data/processed/county_pov_rate_2019.csv")
sd_pov <- read_csv("data/processed/school_district_student_pov_rate_2019.csv")


# Join county poverty dataframe to school district data
sd_county_pov <- sd_pov %>% 
  left_join(county_pov, by = c("CONUM"="GEOID")) %>% 
  select(district_id, district, County, CONUM, tpop, stpop, stpov, 
         stpovrate, county_pop, county_pov_rate)

# Create new variables 
### difference from county rate = district poverty rate - county poverty rate
### difference from state rate = district poverty rate - state poverty rate
# Calculate summary statistics of poverty rate

Script 3 - Calculate poverty rate difference

# Create analysis dataframe
library(tidyverse)

# Import processed dataframes
county_pov <- read_csv("data/processed/county_pov_rate_2019.csv")
sd_pov <- read_csv("data/processed/school_district_student_pov_rate_2019.csv")

# Join county poverty dataframe to school district data
# Create new variables 
### difference from county rate = district poverty rate - county poverty rate
### difference from state rate = district poverty rate - state poverty rate
sd_county_pov <- sd_pov %>% 
  left_join(county_pov, by = c("CONUM"="GEOID")) %>% 
  select(district_id, district, County, CONUM, tpop, stpop, stpov, 
         stpovrate, county_pop, county_pov_rate, state_stpovrate) %>% 
  mutate(pov_diff_county = round(stpovrate - county_pov_rate, 3),
         pov_diff_state = round(stpovrate - state_stpovrate, 3))
# Calculate summary statistics of poverty rate

Analysis dataframe created

selected columns from the analysis dataframe:

district County stpop stpovrate county_pov_rate state_stpovrate pov_diff_county pov_diff_state
Addison Central School District Steuben County 1210 0.2338843 0.1369303 0.1673951 0.097 0.066
Adirondack Central School District Oneida County 1346 0.1404160 0.1547234 0.1673951 -0.014 -0.027
Afton Central School District Chenango County 583 0.1972556 0.1353454 0.1673951 0.062 0.030
Akron Central School District Erie County 1503 0.0951430 0.1419954 0.1673951 -0.047 -0.072
Albany City School District Albany County 10895 0.2659018 0.1187365 0.1673951 0.147 0.099

Research Questions

We have created the dataframes to answer these questions!

  • What is the difference between the student poverty rate in each school district and:
    • the poverty rate of the county as a whole?
    • the poverty rate of the state as a whole?

Still to do

  • What counties have the most economic inequality, as measured by the student poverty rate of school districts?

In-class work time


Work through this analysis on your own up to this point for ~ 30 minutes.

I’ll come around to help!

We’ll come back together and answer these questions together:

  • What district has the highest student poverty rate?
  • What county is that district in?
  • What county has the highest poverty rate?
  • What school district has the lowest student poverty rate in that county?

If you finish, try working through the joins_simple_examples.R script in today’s folder.

Exploratory Analysis


Now that we have our analysis dataset, it’s time to use it to learn more about poverty in New York state. We’ll use descriptive statistics and visualization to interpret the data.

  • Summary statistics for each variable with the summary() function
  • A very simple scatterplot & histogram to understand the shape of our data
  • Calculate our own descriptive statistics with the summarise() function
    • for the state
    • for each county

Calculate some quick summary statistics

summary(sd_county_pov)
  district_id        district            County              CONUM      
 Min.   :3600001   Length:681         Length:681         Min.   :36001  
 1st Qu.:3609330   Class :character   Class :character   1st Qu.:36033  
 Median :3616980   Mode  :character   Mode  :character   Median :36065  
 Mean   :3616755                                         Mean   :36066  
 3rd Qu.:3624240                                         3rd Qu.:36103  
 Max.   :3632010                                         Max.   :36123  
      tpop             stpop             stpov            stpovrate      
 Min.   :    104   Min.   :      1   Min.   :     0.0   Min.   :0.00000  
 1st Qu.:   5280   1st Qu.:    782   1st Qu.:    84.0   1st Qu.:0.06954  
 Median :   9629   Median :   1441   Median :   156.0   Median :0.12479  
 Mean   :  28935   Mean   :   4260   Mean   :   713.2   Mean   :0.13102  
 3rd Qu.:  20556   3rd Qu.:   3306   3rd Qu.:   294.0   3rd Qu.:0.18085  
 Max.   :8336817   Max.   :1193045   Max.   :259012.0   Max.   :0.50109  
   county_pop      county_pov_rate   state_stpovrate  pov_diff_county   
 Min.   :   4479   Min.   :0.04981   Min.   :0.1674   Min.   :-0.11000  
 1st Qu.:  63306   1st Qu.:0.08840   1st Qu.:0.1674   1st Qu.:-0.03300  
 Median : 153415   Median :0.12097   Median :0.1674   Median : 0.00000  
 Mean   : 457515   Mean   :0.11926   Mean   :0.1674   Mean   : 0.01177  
 3rd Qu.: 893029   3rd Qu.:0.14392   3rd Qu.:0.1674   3rd Qu.: 0.04800  
 Max.   :1584693   Max.   :0.19774   Max.   :0.1674   Max.   : 0.38500  
 pov_diff_state    
 Min.   :-0.16700  
 1st Qu.:-0.09800  
 Median :-0.04300  
 Mean   :-0.03638  
 3rd Qu.: 0.01300  
 Max.   : 0.33400  

Look at a histogram

A histogram is a chart that shows the distribution of your data.

  • The height of each bar indicates how many district’s poverty difference is within that range.


hist(sd_county_pov$pov_diff_county)

( pov_diff_county = stpovrate - county_pov_rate )

Look at a scatterplot

A scatterplot is a chart that is used to look at the relationship between two variables.

  • Each dot is a school district.
  • The pattern of dots helps you to determine whether a relationship exists between two variables.
  • Here, you can say there is a positive relationship between County Poverty Rate and School District Poverty Rate
    • meaning as County Poverty Rate increases, the School District Poverty Rate tends to be higher also
plot(sd_county_pov$county_pov_rate, sd_county_pov$stpovrate, 
     xlab="County Poverty Rate", 
     ylab="School District Poverty Rate")

summarise()


  • We can use the summarise() function to create our own statistics to help answer our questions.
    • we’ll create a new dataframe that is a single row summarizing our dataframe
    • useful arguments within summarize:
      • mean(), median(), min(), max(), sum(), n()
      • n() returns the number of rows

summarise() New York poverty


# calculate student poverty statistics for New York

ny_pov_stats <- sd_county_pov %>%
  summarise(districts = n(),
            kids = sum(stpop),
            kids_in_pov = sum(stpov),
            stud_poverty_rate = kids_in_pov/kids,
            mean_sd_stpovrate = mean(stpovrate),
            max_sd_stpovrate = max(stpovrate), 
            min_sd_stpovrate = min(stpovrate),
            poverty_range = max_sd_stpovrate - min_sd_stpovrate)



districts kids kids_in_pov stud_poverty_rate mean_sd_stpovrate max_sd_stpovrate min_sd_stpovrate poverty_range
681 2901286 485661 0.1673951 0.1310178 0.5010861 0 0.5010861

group_by() & summarise()

  • We can also calculate the same statistics for each county by using:
    • group_by() and summarise() together
    • group_by() indicates that you want to summarize based on another variable rather than the whole dataset
ny_county_pov_stats <- sd_county_pov %>%
  group_by(County) %>% 
  summarise(districts = n(),
            kids = sum(stpop),
            kids_in_pov = sum(stpov),
            stud_poverty_rate = round(kids_in_pov/kids, 3),
            mean_sd_stpovrate = round(mean(stpovrate), 3),
            max_sd_stpovrate = round(max(stpovrate), 3),
            min_sd_stpovrate = round(min(stpovrate), 3),
            poverty_range = max_sd_stpovrate - min_sd_stpovrate) 

summarize poverty in new york counties

County districts kids kids_in_pov stud_poverty_rate mean_sd_stpovrate max_sd_stpovrate min_sd_stpovrate poverty_range
Albany County 12 40249 5403 0.134 0.129 0.266 0.041 0.225
Allegany County 12 6796 1633 0.240 0.241 0.315 0.185 0.130
Broome County 12 27673 5916 0.214 0.182 0.370 0.079 0.291
Cattaraugus County 12 13775 2594 0.188 0.183 0.239 0.126 0.113
Cayuga County 7 9694 1728 0.178 0.152 0.225 0.090 0.135
Chautauqua County 18 19226 4335 0.225 0.190 0.333 0.088 0.245
Chemung County 3 12057 2267 0.188 0.173 0.248 0.088 0.160
Chenango County 8 7478 1225 0.164 0.166 0.197 0.125 0.072
Clinton County 8 11080 1647 0.149 0.144 0.211 0.097 0.114
Columbia County 6 7212 894 0.124 0.118 0.185 0.079 0.106
Cortland County 5 6632 1179 0.178 0.189 0.242 0.114 0.128
Delaware County 13 5774 1012 0.175 0.165 0.211 0.106 0.105
Dutchess County 13 41543 3759 0.090 0.098 0.239 0.055 0.184
Erie County 28 134940 24775 0.184 0.124 0.381 0.041 0.340
Essex County 10 3533 578 0.164 0.160 0.246 0.054 0.192
Franklin County 7 7803 1733 0.222 0.218 0.316 0.130 0.186
Fulton County 6 7651 1643 0.215 0.185 0.309 0.131 0.178
Genesee County 8 8609 972 0.113 0.101 0.165 0.072 0.093
Greene County 6 5922 959 0.162 0.177 0.315 0.106 0.209
Hamilton County 7 413 51 0.123 0.120 0.222 0.000 0.222
Herkimer County 10 9636 1673 0.174 0.168 0.237 0.078 0.159
Jefferson County 11 17784 3598 0.202 0.191 0.293 0.127 0.166
Lewis County 5 4254 731 0.172 0.167 0.181 0.128 0.053
Livingston County 8 7917 1168 0.148 0.154 0.256 0.091 0.165
Madison County 10 10139 1190 0.117 0.125 0.214 0.066 0.148
Monroe County 18 113152 19831 0.175 0.113 0.379 0.034 0.345
Montgomery County 5 8413 2014 0.239 0.230 0.268 0.151 0.117
Nassau County 56 216350 12320 0.057 0.054 0.193 0.015 0.178
New York County 1 1193045 259012 0.217 0.217 0.217 0.217 0.000
Niagara County 10 30500 4912 0.161 0.129 0.314 0.048 0.266
Oneida County 15 35425 6159 0.174 0.119 0.321 0.045 0.276
Onondaga County 18 70736 12000 0.170 0.115 0.364 0.035 0.329
Ontario County 9 17119 1754 0.102 0.112 0.181 0.051 0.130
Orange County 17 72854 11745 0.161 0.126 0.501 0.051 0.450
Orleans County 5 5895 1027 0.174 0.169 0.192 0.123 0.069
Oswego County 9 19552 3686 0.189 0.186 0.239 0.152 0.087
Otsego County 12 6822 1067 0.156 0.155 0.235 0.090 0.145
Putnam County 6 14400 780 0.054 0.056 0.073 0.038 0.035
Rensselaer County 12 22228 3147 0.142 0.124 0.262 0.046 0.216
Rockland County 8 65876 12842 0.195 0.110 0.342 0.047 0.295
Saratoga County 12 34936 2350 0.067 0.088 0.151 0.041 0.110
Schenectady County 6 23088 4600 0.199 0.132 0.335 0.045 0.290
Schoharie County 6 4268 652 0.153 0.169 0.285 0.104 0.181
Schuyler County 3 2105 415 0.197 0.200 0.213 0.190 0.023
Seneca County 4 4926 917 0.186 0.192 0.234 0.157 0.077
St. Lawrence County 17 15926 3519 0.221 0.212 0.335 0.121 0.214
Steuben County 12 15402 2732 0.177 0.195 0.291 0.120 0.171
Suffolk County 68 230466 17818 0.077 0.081 0.417 0.000 0.417
Sullivan County 8 10351 2537 0.245 0.214 0.288 0.133 0.155
Tioga County 6 7354 1038 0.141 0.141 0.177 0.102 0.075
Tompkins County 6 11569 1457 0.126 0.137 0.175 0.101 0.074
Ulster County 9 23786 3177 0.134 0.134 0.215 0.074 0.141
Warren County 9 9039 1183 0.131 0.149 0.261 0.053 0.208
Washington County 11 8399 1283 0.153 0.163 0.267 0.102 0.165
Wayne County 11 14309 2098 0.147 0.153 0.233 0.052 0.181
Westchester County 40 157456 13588 0.086 0.058 0.165 0.015 0.150
Wyoming County 5 4410 510 0.116 0.118 0.155 0.088 0.067
Yates County 2 3339 828 0.248 0.260 0.305 0.216 0.089

Research Questions

We have created the dataframes to answer these questions!

  • What is the difference between the student poverty rate in each school district and:
    • the poverty rate of the county as a whole?
    • the poverty rate of the state as a whole?
  • What counties have the most economic inequality, as measured by the student poverty rate of school districts?

In-class work time


Work through the rest of the analysis on your own for ~ 30 minutes.

I’ll come around to help!

We’ll come back together and answer these questions together:

  • What county has the largest range in student poverty rate?
  • What county has the smallest range in student poverty rate?
  • What other questions could you answer?

If you finish, take a look at the homework assignments.

Write out your csvs

Save these processed dataframe and summary stats dataframe to your computer

# write out poverty rate stat
write_csv(ny_pov_stats, "data/output/ny_student_poverty_state_stats_2019.csv")

write_csv(ny_county_pov_stats, "data/ny_student_poverty_state_stats_by_county_2019.csv")

To Be Continued…


Next week we’ll use much prettier plots to explore this dataset more.

Assignment 3a

Please read and be prepared to discuss the following readings:

Listen to the CodeSwitch episode Who You Calling Hispanic

Optionally, read Chapter 13 of R for Data Science on joining data if you would like more about relational data and joining data with R.

Assignment 3b


Research Journal

Assignment 3c

Work through the in-class exercise and answer the following questions:

  • What district has the highest student poverty rate?
  • What county is that district in?
  • What county has the highest poverty rate? (you should use the county poverty dataset created in script 1 to answer this question)
  • What school district has the lowest student poverty rate in Montgomery County?
  • What county has the largest range in student poverty rate?
  • What county has the smallest range in student poverty rate?

On Canvas, submit a text document with the answer to these questions and your 3 scripts.