Identification of bias
Mitigation of bias
The idea that those in the positions to make decisions are unaware of the potential harms of their biases and blind spots.
They are the ones deciding:
To help identify and mitigate bias in data and research
The US Census follows “standards on race and ethnicity set by the U.S. Office of Management and Budget (OMB) in 1997. These standards guide how the federal government collects and presents data on these topics.”
Anyone want to share?
The specification of the list of folders to get to a file on your computer is called a path.
"/Users/sarahodges/spatial/methods1/part1/data/raw/invol_data_propublica.csv"
"data/raw/invol_data_propublica.csv"
In this class, we created an R project called part1.Rproj that defines the folder that your relative path starts with.
If you are using them
/Users/sarahodges/spatial/Data/tabular/msa/nhgis_fam_pov_20197bsa.csv
Use the file.path() function if you have issues:
file.path("C:\\Users\\sarahodges\\spatial\\Data\\tabular\\msa\\nhgis_fam_pov_20197.csv")
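As a rough sketch of how relative and absolute paths relate, the snippet below builds the relative path from the example above out of its folder names, then prepends the working directory to get an absolute path. It assumes the working directory is the project folder, which is what opening the .Rproj file sets up; the variable names rel_path and abs_path are illustrative only.

```r
# Build the relative path piece by piece; file.path() joins the parts with "/"
rel_path <- file.path("data", "raw", "invol_data_propublica.csv")
print(rel_path)

# A relative path is resolved against the working directory.
# Inside an R project, getwd() is the project folder.
abs_path <- file.path(getwd(), rel_path)
print(abs_path)
```

Because the relative path never mentions the folders above the project, the same script can run on a classmate's computer without edits.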
You already have all the data you need for the first few classes. Let’s talk through the file structure we’ll all use.
data/raw
Create a new R script: File > New File > R Script
Save it in methods1/part1/scripts as new_york_student_poverty_2018.R
At the top of your script, write a comment describing the purpose of this file and the source:
Next, load the tidyverse collection of packages to your environment
Notice: you must load a package into your R session every time.
install.packages(package_name) to install the package on your computer - ONLY ONCE
library(package_name) to load the packages you are using into the current R environment - IN EVERY SCRIPT
We’ll learn four functions that are the backbone of data transformation in R
Use read_csv() to import a dataset, and its metadata, into your R environment.
Name the imported data raw_stpov18 to indicate that this is the original form of the data.
Postal | FIPS | district_id | Name | Estimated_Total_Pop | Estimated_Pop_5_17 | Estimated_relevant_5_17_in_poverty |
---|---|---|---|---|---|---|
NY | 36 | 3602370 | Addison Central School District | 6887 | 1228 | 277 |
NY | 36 | 3605040 | Adirondack Central School District | 8456 | 1346 | 209 |
NY | 36 | 3602400 | Afton Central School District | 3723 | 589 | 111 |
NY | 36 | 3602430 | Akron Central School District | 9644 | 1517 | 157 |
NY | 36 | 3602460 | Albany City School District | 98781 | 11020 | 3259 |
variable | definition |
---|---|
Postal | State postal code |
FIPS | State FIPS code |
district_id | unique school district identifier |
Name | school district name |
Estimated_Total_Pop | estimated population |
Estimated_Pop_5_17 | estimated population of school-age children, aged 5 to 17 years old |
Estimated_relevant_5_17_in_poverty | estimated population of school-age children, aged 5 to 17 years old that are living in poverty |
Data source: U.S. Census Small Area Income and Poverty Estimates (SAIPE) Program, https://www.census.gov/programs-surveys/saipe.html
Our student poverty table has 9 columns; we may not need all of them.
Type names(raw_stpov18) in your console to see the column names.
# Process the 2018 student-age poverty data from SAIPE for New York
# source = https://www.census.gov/data/datasets/2018/demo/saipe/2018-school-districts.html
library(tidyverse)
raw_stpov18 <- read_csv("data/raw/ny_student_poverty_2018.csv")
# import variable definitions to review
stpov_meta <- read_csv("data/raw/ny_student_poverty_metadata.csv")
# create student poverty rate & year column
# select necessary variables
stpov18 <- raw_stpov18 |>
  mutate(stud_pov_rate = Estimated_relevant_5_17_in_poverty /
           Estimated_Pop_5_17,
         year = "2018") |>
  select(Postal, district_id, Name, Estimated_Total_Pop,
         Estimated_Pop_5_17, stud_pov_rate, year)
Notice: use |> to add a new function. First mutate() all of your variables, then use |> to pass the result to select().
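To see what mutate() and select() each contribute, here is a small sketch that runs the same kind of pipeline on three rows copied from the sample table above. The data frame names sample_stpov and sample_rates are made up for this illustration, and the tibble is typed in by hand rather than read from the course file.

```r
library(dplyr)

# Three rows copied by hand from the sample table (not read from disk)
sample_stpov <- tibble(
  Postal = c("NY", "NY", "NY"),
  FIPS = c(36, 36, 36),
  Name = c("Addison Central School District",
           "Adirondack Central School District",
           "Afton Central School District"),
  Estimated_Pop_5_17 = c(1228, 1346, 589),
  Estimated_relevant_5_17_in_poverty = c(277, 209, 111)
)

sample_rates <- sample_stpov |>
  # mutate() adds columns; existing columns are left in place
  mutate(stud_pov_rate = Estimated_relevant_5_17_in_poverty /
           Estimated_Pop_5_17,
         year = "2018") |>
  # select() keeps only the named columns, in the order listed
  select(Name, Estimated_Pop_5_17, stud_pov_rate, year)

print(sample_rates)
```

For the first district the rate is 277 / 1228, or roughly 0.226. Postal and FIPS no longer appear because select() did not name them.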
Try these!
# filter based on text value
nyc <- stpov18 |>
  filter(Name == "New York City Department Of Education")
# remove new york city
ny_no_nyc <- stpov18 |>
  filter(Name != "New York City Department Of Education")
# remove all districts with more than 10,000 people
ny_no_large_districts <- stpov18 |>
  filter(Estimated_Total_Pop <= 10000)
# keep districts with more than 500 people AND at most 10,000 people
ny_medium_districts <- stpov18 |>
  filter(Estimated_Total_Pop <= 10000 & Estimated_Total_Pop > 500)
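One way to check that a filter keeps the rows you expect is to count rows before and after. The sketch below does that on the five districts from the sample table; sample_districts and the other object names are invented for this example, and the values are typed in by hand.

```r
library(dplyr)

# Five districts copied by hand from the sample table
sample_districts <- tibble(
  Name = c("Addison Central School District",
           "Adirondack Central School District",
           "Afton Central School District",
           "Akron Central School District",
           "Albany City School District"),
  Estimated_Total_Pop = c(6887, 8456, 3723, 9644, 98781)
)

# == keeps exact text matches; != keeps every other row
albany <- sample_districts |>
  filter(Name == "Albany City School District")
not_albany <- sample_districts |>
  filter(Name != "Albany City School District")

# & keeps a row only when both conditions are TRUE
medium <- sample_districts |>
  filter(Estimated_Total_Pop <= 10000 & Estimated_Total_Pop > 500)

nrow(albany)      # 1
nrow(not_albany)  # 4
nrow(medium)      # 4
```

Albany is the only district here above 10,000 people, so the medium filter drops it and keeps the other four.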
Rename your columns with rename(new_name = old_name). Do not use spaces in variable names.
stpov18 <- raw_stpov18 |>
  mutate(stud_pov_rate = Estimated_relevant_5_17_in_poverty /
           Estimated_Pop_5_17,
         year = "2018") |>
  select(Postal, district_id, Name, Estimated_Total_Pop,
         Estimated_Pop_5_17, Estimated_relevant_5_17_in_poverty, stud_pov_rate, year) |>
  filter(Estimated_Pop_5_17 >= 100) |>
  rename(id = district_id,
         district = Name,
         tpop = Estimated_Total_Pop,
         stpop = Estimated_Pop_5_17,
         stpov = Estimated_relevant_5_17_in_poverty,
         stpovrate = stud_pov_rate)
Changes that you make to a data frame exist only in your R session until you write the data frame to a file on your computer.
We’ll use the write_csv() function to save the processed data frame to your computer.
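A minimal sketch of saving and then checking a file, assuming the readr package (loaded as part of the tidyverse) is installed. The small data frame and the temporary file path are stand-ins for illustration; in your own script you would pass your processed data frame and a path inside your project instead.

```r
library(readr)

# A tiny stand-in for a processed data frame (values from the sample table)
processed <- data.frame(
  id = c(3602370, 3605040),
  district = c("Addison Central School District",
               "Adirondack Central School District"),
  stpop = c(1228, 1346)
)

# tempfile() is a placeholder; replace it with a path in your project
out_path <- tempfile(fileext = ".csv")
write_csv(processed, out_path)

# Read the file back to confirm that it was written as expected
check <- read_csv(out_path, show_col_types = FALSE)
print(check)
```

Reading the file back is optional, but it is a quick way to confirm that the column names and row count survived the round trip.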
Getting help in R presentation.
OPTIONAL: For additional context and information on R, review Chapters 4 and 5 of R for Data Science by Hadley Wickham and Garrett Grolemund
First, complete the filter practice that we skipped in class in your 2018 script.
Use the 2018 student poverty processing script to process the same dataset for 2019.
new_york_student_poverty_2019_your_name.R
Import the 2019 ACS New York Poverty Data by County and create a data frame of the poverty rate
ny_county_poverty_rate_19_your_name.R
raw_county_pov19
select the following columns: STATE, COUNTY, GEOID, ALWVE001, ALWVE002, ALWVE003
filter to keep New York
rename ALWVE001 to total_pop
mutate to create a new variable poverty_rate where
select the following columns: STATE, COUNTY, GEOID, total_pop, poverty_rate
Upload both of your scripts to the assignment in Canvas