Methods 1, Week 2

Outline

Readings discussion
Research Journal discussion
Homework 1 overview and questions
File paths
Data exploration and processing with the tidyverse
Exporting data frames
Getting help
Assignment 2

Readings

Data Feminism

Why the Bronx Burned

Data Feminism

Identification of bias

Matrix of domination
Privilege hazard
Missing data
Excessive surveillance

Mitigation of bias

Minoritized
Who questions
Other ideas?

The Matrix of Domination

Privilege hazard

The idea that those in the positions to make decisions are unaware of the potential harms of their biases and blind spots.

They are the ones deciding:

What is worth counting?
What problems are worth addressing?

Missing data

“What we choose to measure is a statement of what [who] we value”

Minoritized

Who (and why, and how) questions…

To help identify and mitigate bias in data and research

Who created the data?
- And why? For what purpose?
Who performed the analysis?
- And why? For what purpose?
Who benefits, and who may be harmed, from the data and analysis?
How does that impact the quality of the data and/or analysis?
(How) can you mitigate that impact?

US Census race and ethnicity categories

Measuring Racial and Ethnic Diversity for the 2020 Census

The US Census follows “standards on race and ethnicity set by the U.S. Office of Management and Budget (OMB) in 1997. These standards guide how the federal government collects and presents data on these topics.”

Future Readings

We will use race/ethnicity data from the US Census

- it is an approximation of identity
- meant to be used at scale to see patterns

We will read writings by people talking about their identity.

Research Journal discussion

Anyone want to share?

Homework overview and questions

Homework (or Other) Questions?

File paths and working directories

The specification of the list of folders to get to a file on your computer is called a path.

Absolute path: starts at the root folder of the computer

"/Users/sarahodges/spatial/methods1/part1/data/raw/invol_data_propublica.csv"

Relative path: starts at a given folder (here, the part1 folder) and lists only the folders from there to the file

"data/raw/invol_data_propublica.csv"

In this class, we created an R project called part1.Rproj that defines the folder that your relative path starts with.

Relative paths make it easier to:

avoid using very long absolute paths
share your script with others more easily
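
A minimal sketch of how the two kinds of path relate, using only base R. It assumes your working directory is the project folder; the file does not need to exist for the path-building lines to run.

```r
# Build a relative path from its folder names; file.path() joins them with "/"
rel_path <- file.path("data", "raw", "invol_data_propublica.csv")
print(rel_path)

# The working directory is the folder that relative paths start from
print(getwd())

# Joining the two gives the absolute path R will actually look in
abs_path <- file.path(getwd(), rel_path)
print(abs_path)

# Check whether the file is really there before trying to read it
print(file.exists(rel_path))
```

If file.exists() returns FALSE, the usual cause is that the script was opened outside the R project, so the working directory is not the part1 folder.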

Absolute File Paths

If you are using them

Mac

/Users/sarahodges/spatial/Data/tabular/msa/nhgis_fam_pov_20197bsa.csv

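Windows (R doesn’t like the backslashes)

Windows paths use backslashes, which R treats as escape characters, so write them as \\ or use / instead. You can also build the path with the file.path() function if you have issues:

file.path("C:", "Users", "sarahodges", "spatial", "Data", "tabular", "msa", "nhgis_fam_pov_20197.csv")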

Windows filepath explainer

File structure for this class

You already have all the data you need for the first few classes. Let’s talk through the file structure we’ll all use.

any additional data that is directly downloaded from a source should be stored in data/raw

Script to explore student poverty in NY

Create a new R script: File > New File > R Script
Save it in methods1/part1/scripts as ny_school_districts_student_poverty_2019.R
At the top of your script, write a comment describing the purpose of this file and the source:

# Process the 2019 student-age poverty data from SAIPE for New York
# source = https://www.census.gov/data/datasets/2019/demo/saipe/2019-school-districts.html

Add tidyverse package

Next, load the tidyverse collection of packages to your environment

library(tidyverse)
library(readxl)

Notice

The conflicts message when you load the tidyverse is fine.
- It tells you that dplyr masks some functions that have the same name in other packages (for example, stats::filter()).
- dplyr is one of the packages that makes up the tidyverse
You only need to install a package once, but you must load a package into your R session every time.
- Type in your console install.packages("package_name") to install the package on your computer - ONLY ONCE
- use library(package_name) to load the packages you are using into the current R environment - IN EVERY SCRIPT
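
One way to keep "install once, load every time" straight is sketched below. The requireNamespace() check is an addition for illustration, not something the class script requires; it only skips the install when the package is already on your computer.

```r
# Install only when the package is missing (a one-time step per computer)
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}

# Load the package in every script and every new R session
library(tidyverse)

# dplyr is one of the packages that library(tidyverse) attaches
print("package:dplyr" %in% search())
```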

tidyverse data processing functions

We’ll learn four functions that are the backbone of data transformation in R

mutate() - create or redefine a variable (column)
select() - subset variables (columns) by their names
filter() - subset observations (rows) by their values
rename() - rename the variables

Read in our dataset

Use read_csv() to import a dataset into your R environment.
Name it raw_stpov19 to indicate that this is the original form of the data.

raw_stpov19 <- read_csv("data/raw/school_district_child_poverty_2019.csv")

...1	Postal	FIPS	County	CONUM	district_id	Name	Estimated_Total_Pop	Estimated Population 5-17	Estimated_relevant_5_17_in_poverty
1	NY	36	Steuben County	36101	3602370	Addison Central School District	6856	1210	283
2	NY	36	Oneida County	36065	3605040	Adirondack Central School District	8424	1346	189
3	NY	36	Chenango County	36017	3602400	Afton Central School District	3696	583	115
4	NY	36	Erie County	36029	3602430	Akron Central School District	9618	1503	143
5	NY	36	Albany County	36001	3602460	Albany City School District	98257	10895	2897
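
The first column in the preview has no real name. A likely explanation (an assumption, since the raw file is not shown here) is that the CSV has a blank first header cell, for example because row numbers were saved with the data; read_csv() then fills in a name of the form ...1. The sketch below reproduces that with a small made-up file written to a temporary location.

```r
library(tidyverse)

# Hypothetical two-row file whose first header cell is blank
tmp <- tempfile(fileext = ".csv")
writeLines(c(",Postal,Name",
             "1,NY,Addison Central School District",
             "2,NY,Adirondack Central School District"), tmp)

demo <- read_csv(tmp, show_col_types = FALSE)

names(demo)    # column names, including the auto-generated one
glimpse(demo)  # column types and the first few values
```

Running names() and glimpse() on raw_stpov19 right after importing it is a quick way to confirm the column names and types before you start processing.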

Read in the metadata for our dataset

# import variable definitions to review
stpov_meta <- read_excel("data/raw/ny_student_poverty_metadata.xlsx")

variable	definition
Postal	State postal code
FIPS	State FIPS code
district_id	unique school district identifier
Name	school district name
Estimated_Total_Pop	estimated population
Estimated_Pop_5_17	estimated population of school-age children, aged 5 to 17 years old
Estimated_relevant_5_17_in_poverty	estimated population of school-age children, aged 5 to 17 years old that are living in poverty
NA	NA
NA	NA
data source: U.S. Census Small Area Income and Poverty Estimates (SAIPE) Program, https://www.census.gov/programs-surveys/saipe.html	NA

mutate() - create or redefine a variable

Create new data frame from raw_stpov19
New column that equals the student poverty rate

stpov19 <- raw_stpov19  |> 
  mutate(stud_pov_rate = Estimated_relevant_5_17_in_poverty/`Estimated Population 5-17`)
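
To see what the new column contains, here is a small worked sketch using three of the rows from the preview above, typed in by hand. The object names preview and preview_rates are made up for this illustration.

```r
library(tidyverse)

# Three rows copied from the preview of raw_stpov19
preview <- tibble(
  Name = c("Addison Central School District",
           "Adirondack Central School District",
           "Albany City School District"),
  `Estimated Population 5-17` = c(1210, 1346, 10895),
  Estimated_relevant_5_17_in_poverty = c(283, 189, 2897)
)

# Same calculation as the class script: children in poverty / all children
preview_rates <- preview |>
  mutate(stud_pov_rate = Estimated_relevant_5_17_in_poverty / `Estimated Population 5-17`)

# Addison: 283 / 1210 is roughly 0.23, or about 23% of school-age children
preview_rates |>
  select(Name, stud_pov_rate)
```

The rate is a proportion between 0 and 1; multiply by 100 if you want to report a percentage.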

Notice

In R you can use a pipe operator “|>” to string together multiple commands
- think of it as meaning “and then”
- you’ll eventually love the “|>”
- keystroke = cmd/ctrl - shift - m
- you may see older examples that use the “%>%” pipe operator
  - they are (basically) interchangeable but you should use “|>”
To create a new variable, use mutate(var_name = expression)
- you use the equal sign within dplyr functions
Within dplyr functions, you can refer to a column by its name alone; you don’t need the dollar sign ($) prefix
If there is a space in your variable name, you have to wrap it in backticks ``
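
The points above can be sketched on a tiny made-up table (the names toy, nested, piped and in_poverty are for illustration only). The piped version and the nested version do the same work; the pipe just lets you read the steps in order.

```r
library(tidyverse)

toy <- tibble(
  `Estimated Population 5-17` = c(1210, 1346),
  in_poverty = c(283, 189)
)

# Without a pipe: functions are nested and read from the inside out
nested <- select(mutate(toy, rate = in_poverty / `Estimated Population 5-17`), rate)

# With the pipe: take toy, and then mutate, and then select
piped <- toy |>
  mutate(rate = in_poverty / `Estimated Population 5-17`) |>
  select(rate)

# Both approaches give the same data frame
print(isTRUE(all.equal(nested, piped)))

# Outside dplyr functions you do need the $ (and backticks for names with spaces)
toy$in_poverty / toy$`Estimated Population 5-17`
```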

Create a text column

year = "2019"

stpov19 <- raw_stpov19 |>
  mutate(stud_pov_rate = Estimated_relevant_5_17_in_poverty/`Estimated Population 5-17`,
         year = "2019")

Notice

You can create more than one new column in the same code chunk
- separate each new variable by a comma
Use double quotes around the value if you want the new variable’s data type to be character
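
A quick way to check the effect of the quotes is to create the column both ways and compare the data types. This is a sketch on a made-up two-row table; year_chr and year_num are hypothetical column names.

```r
library(tidyverse)

toy <- tibble(district = c("A", "B")) |>
  mutate(year_chr = "2019",  # quotes: stored as text (character)
         year_num = 2019)    # no quotes: stored as a number

class(toy$year_chr)
class(toy$year_num)
```

A single value in mutate() is repeated for every row, which is why each district gets the same year.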

select()

subset columns by their names

Our student poverty table has 9 columns; we may not need all of them.

Type names(stpov19) in your console to see column names

stpov19 <- raw_stpov19 |>
  mutate(stud_pov_rate = Estimated_relevant_5_17_in_poverty
         /`Estimated Population 5-17`,
         year = "2019") |>
  select(Postal, district_id, Name, Estimated_Total_Pop, `Estimated Population 5-17`, stud_pov_rate, year)
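
select() can either list the columns to keep or, with a minus sign, the columns to drop. The sketch below shows both on a one-row made-up table built from values in the preview; kept and dropped are hypothetical names.

```r
library(tidyverse)

toy <- tibble(
  Postal = "NY",
  FIPS = 36,
  district_id = 3602370,
  Name = "Addison Central School District"
)

# Keep columns by listing their names
kept <- toy |>
  select(Postal, district_id, Name)

# Drop a column by putting a minus sign in front of its name
dropped <- toy |>
  select(-FIPS)

names(kept)
names(dropped)
```

Listing the columns to keep is usually the safer choice in a processing script, because the output stays the same even if the raw file gains extra columns later.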

script so far

# Process the 2019 student-age poverty data from SAIPE for New York
# source = https://www.census.gov/data/datasets/2019/demo/saipe/2019-school-districts.html

library(tidyverse)

raw_stpov19 <- read_csv("data/raw/school_district_child_poverty_2019.csv")

# import variable definitions to review
stpov_meta <- read_excel("data/raw/ny_student_poverty_metadata.xlsx")

# create student poverty rate & year column
# select necessary variables
stpov19 <- raw_stpov19 |>
  mutate(stud_pov_rate = Estimated_relevant_5_17_in_poverty
         /`Estimated Population 5-17`,
         year = "2019") |>
  select(Postal, district_id, Name, Estimated_Total_Pop, 
         `Estimated Population 5-17`, stud_pov_rate, year)

Notice

Script format:
- purpose at the top
- all packages needed for the script next
- import all raw data next (name the dataframe “raw_…”)
- new dataframe to create dataset you want for your analysis
Use a pipe operator |> to add a new function
- mutate() all of your variables, then |> to use select()

filter()

subset observations (rows) by their values

To remove districts with fewer than 100 school-age children, we keep only the rows where the school-age population is 100 or more

stpov19 <- raw_stpov19 |>
  mutate(stud_pov_rate = Estimated_relevant_5_17_in_poverty
         /`Estimated Population 5-17`,
         year = "2019") |>
  select(Postal, district_id, Name, Estimated_Total_Pop, 
         `Estimated Population 5-17`, stud_pov_rate, year) |>
  filter(`Estimated Population 5-17` >= 100)

Notice

We used the filter function in homework 1 but it was just for viewing the dataframe
You can filter on a character variable too
You can have multiple conditions in a filter
Use “&” (AND) if every expression must be true for a row to be included in the output
Use “|” (OR) if any expression can be true for a row to be included in the output
“!” means not

filter examples

Try these!

# filter based on text value
nyc <- stpov19 |> 
  filter(Name == "New York City Department Of Education")

# remove new york city
ny_no_nyc <- stpov19 |> 
  filter(Name != "New York City Department Of Education")

# remove all districts with more than 10,000 people
ny_no_large_districts <- stpov19 |> 
  filter(Estimated_Total_Pop <= 10000)

# keep districts with more than 500 people and no more than 10,000 people
ny_medium_districts <- stpov19 |> 
  filter(Estimated_Total_Pop <= 10000 & Estimated_Total_Pop > 500)
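
The examples above use & and !=. Here is a sketch of “|” (OR) and “!” (NOT) on four rows from the preview, with district names shortened for the illustration. The cutoffs of 5,000 and 50,000 are arbitrary.

```r
library(tidyverse)

toy <- tibble(
  Name = c("Addison", "Afton", "Akron", "Albany"),
  Estimated_Total_Pop = c(6856, 3696, 9618, 98257)
)

# OR: keep rows where either condition is true
small_or_large <- toy |>
  filter(Estimated_Total_Pop < 5000 | Estimated_Total_Pop > 50000)

# NOT: keep rows where the condition in parentheses is false
not_medium <- toy |>
  filter(!(Estimated_Total_Pop >= 5000 & Estimated_Total_Pop <= 50000))

# Conditions separated by a comma are combined with AND
medium <- toy |>
  filter(Estimated_Total_Pop >= 5000, Estimated_Total_Pop <= 50000)

small_or_large$Name
medium$Name
```

The first two filters describe the same set of rows in two different ways, which is a handy check when you are unsure whether a condition says what you mean.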

rename()

Rename your columns using new_name = old_name. Avoid spaces in variable names.

stpov19 <- raw_stpov19 |>
  mutate(stud_pov_rate = Estimated_relevant_5_17_in_poverty
         /`Estimated Population 5-17`,
         year = "2019") |>
  select(Postal, district_id, Name, Estimated_Total_Pop, 
         `Estimated Population 5-17`, Estimated_relevant_5_17_in_poverty, stud_pov_rate, year) |>
  filter(`Estimated Population 5-17` >= 100) |>
  rename(id = district_id,
         district = Name,
         tpop = Estimated_Total_Pop,
         stpop = `Estimated Population 5-17`,
         stpov = Estimated_relevant_5_17_in_poverty,
         stpovrate = stud_pov_rate)
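
A small sketch to confirm the direction of the rename (new name on the left, old name on the right), using a one-row made-up table with values from the preview:

```r
library(tidyverse)

toy <- tibble(
  district_id = 3602370,
  `Estimated Population 5-17` = 1210
)

# new_name = old_name; backticks are only needed for the old name with spaces
renamed <- toy |>
  rename(id = district_id,
         stpop = `Estimated Population 5-17`)

names(renamed)
```

Only the names change; the values and the column order stay the same.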

Write out your csvs

Changes that you make to a data frame exist only in your R session until you save the data frame to a file on your computer

In R we say “write it out”
The script is a recipe to create the data

We’ll use the write_csv() function to save the processed data frame to your computer.

# write out processed student poverty data for 2019
write_csv(stpov19, "data/processed/ny_school_district_student_poverty_rate_2019.csv")
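
write_csv() does not create folders for you, so the output folder has to exist first. The sketch below writes a made-up two-row table to a temporary folder (a stand-in for data/processed) and reads it back to confirm the file was saved. The file name and the values are hypothetical.

```r
library(tidyverse)

# Temporary stand-in for the project's data/processed folder
out_dir <- file.path(tempdir(), "data", "processed")
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)

toy <- tibble(id = c(3602370, 3605040),
              stpovrate = c(0.25, 0.125))

out_path <- file.path(out_dir, "toy_student_poverty.csv")
write_csv(toy, out_path)

# Read the file back in to check that it matches what was written
check <- read_csv(out_path, show_col_types = FALSE)
file.exists(out_path)
nrow(check) == nrow(toy)
```

In your own script you would skip the temporary folder and write straight to data/processed, as in the line above.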

Readings

Getting help in R presentation.
OPTIONAL: For additional context and information on R, review Chapters 4 and 5 of R for Data Science by Hadley Wickham and Garrett Grolemund

R Assignment 2b.

Use the 2019 student poverty processing script to process the same dataset for 2022.

Save your 2022 script as ny_school_districts_student_poverty_2022_your_name.R
Use it to import the 2022 student poverty data and process it in the same way
Write out the 2022 dataset with the school districts that have 100 or more school-age children, just like 2019
Create another data frame of New York districts with more than 500 children (variable = stpop) and fewer than 5000 children living in poverty (variable = stpov)
- no need to write this one out, it is just to practice creating filters

R Assignment 2c.

Import the 2022 ACS New York Poverty Data by County and create a data frame of the poverty rate

Create a new script called ny_county_poverty_rate_22_your_name.R
Write a comment at the top describing the purpose of the script
Load the tidyverse
Read in the raw poverty data (county_child_poverty_2022.csv), name it raw_county_pov22
rename:
- County FIPS Code to conum
- Poverty Estimate, Age 5-17 in Families to county_child_poverty_count
select the following columns: NAME, conum, county_child_poverty_count, county_child_poverty_rate
write out your csv to the data/processed folder (name it ny_school_district_student_poverty_rate_2022.csv)

Upload both of your scripts to the assignment in Canvas
