The US Census follows “standards on race and ethnicity set by the U.S. Office of Management and Budget (OMB) in 1997. These standards guide how the federal government collects and presents data on these topics.”
Future Readings
We will use race/ethnicity data from the US Census
it is an approximation of identity
meant to be used at scale to see patterns
We will read writings by people talking about their identity.
Research Journal discussion
Anyone want to share?
Homework overview and questions, 1-2
calculate four plus seven
calculation1 <- 4 + 7
# calcalution1
calculation1
[1] 11
calculate five plus seven
five <- 5
seven <- "7"
# calculation2 <- five + seven
calculation2 <- five + as.numeric(seven)
calculation2
[1] 12
Homework overview and questions, 3
Import the dataset of desegregation orders from 1957 to 2014 from ProPublica
library(tidyverse)
# all_deseg_pp <- read_csv(data/invol_data_propublica.csv)
all_deseg_pp <- read_csv("data/invol_data_propublica.csv")
## View the column and first rows of the data frame
glimpse(all_deseg_pp)
Homework overview and questions, 4
Create a data frame of open desegregation orders from the dataset imported in question 3
# open_deseg_pp <- subset(all_deseg_pp, Year_Lifted == "STILL OPEN") ## UNCORRECTED
##### CORRECTED
# use the names() to see how to spell each column
names(all_deseg_pp)
# correct the name of the Year.Lifted column
open_deseg_pp <- subset(all_deseg_pp, Year.Lifted == "STILL OPEN")
Homework overview and questions, 5.
Calculate how long the desegregation orders have been in effect, for all open desegregation orders
# UNCORRECTED
# open_deseg_pp$duration <- 2014 - open_deseg_pp$Year.Placed
## Note, expect a red warning about NAs for the correct answer - it is a warning, not an error
##### CORRECTED
open_deseg_pp$duration <- 2014 - as.numeric(open_deseg_pp$Year.Placed)
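Why as.numeric() is needed can be seen in a minimal sketch. The year values below are made up for illustration and are not taken from the ProPublica data; the sketch assumes the column was read in as text because at least one entry is not a number.

```r
# Hypothetical values: years stored as text, one entry is not a number
year_placed <- c("1969", "1970", "unknown")

# 2014 - year_placed would stop with an error because the values are character

# as.numeric() converts what it can and returns NA (with a warning) for the rest
duration <- 2014 - as.numeric(year_placed)
duration
```

The warning only reports that some values could not be converted; the numeric rows are still calculated.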
Homework overview and questions, 6.
How many of the desegregation orders were still open in 2014?
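One way to answer a "how many" question is to count rows, since each row is one order. This is a sketch on a made-up stand-in for open_deseg_pp; the district names and the name open_example are hypothetical.

```r
# Hypothetical stand-in for open_deseg_pp; the real data frame comes from question 4
open_example <- data.frame(
  District = c("District A", "District B", "District C"),
  Year.Lifted = c("STILL OPEN", "STILL OPEN", "STILL OPEN")
)

# Each row is one open order, so the row count is the number of open orders
open_count <- nrow(open_example)
open_count
```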
Homework overview and questions, 7.
What desegregation order that was still open in 2014 had been open longest?
View(open_deseg_pp)
## Sorted descending by the duration column to find that District of Columbia had deseg order from 1954 that was still open in 2014
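As an alternative to sorting by hand in the viewer, the same lookup can be sketched in code with dplyr's arrange() and desc(). The data frame below is a made-up stand-in; its district names and years are hypothetical.

```r
library(tidyverse)

# Hypothetical stand-in for open_deseg_pp with a duration column
open_example <- tibble(
  District = c("District A", "District B", "District C"),
  Year.Placed = c(1970, 1963, 1981),
  duration = 2014 - Year.Placed
)

# Sort so the longest-running order comes first, then keep the first row
longest_open <- open_example %>%
  arrange(desc(duration)) %>%
  slice(1)

longest_open
```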
Homework overview and questions, 8.
Is that desegregation order still open?
# import the Civil Rights Data Collection dataset
crdc <- read_csv("data/lea_deseg_CRDC_2017_18.csv")
# look for DC in it
View(crdc)
## District of Columbia is not in the CRDC dataset - in 2017-18 school year, the order was not standing
Homework overview and questions, 9.
How many open desegregation orders are there in Alabama, according to the ProPublica dataset and the CRDC dataset?
# create a data frame of open desegregation orders in Alabama
al_pp <- subset(open_deseg_pp, State == "AL")
# count the number of rows
pp_al_count <- nrow(al_pp)
pp_al_count # 47 open orders in the ProPublica dataset
[1] 47
# create a data frame of open desegregation orders in Alabama in the CRDC dataset
al_crdc <- subset(crdc, LEA_STATE == "AL")
# count the number of rows
crdc_al_count <- nrow(al_crdc)
crdc_al_count # 18 open orders in the CRDC dataset
[1] 18
Homework (or Other) Questions?
File paths and working directories
The specification of the list of folders to get to a file on your computer is called a path.
Absolute path: starts at the root folder of the computer
Relative path: starts at the current working directory
In this class, we will create an R project for each class, which defines the folder that your relative paths start from.
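The difference between the two kinds of path can be sketched as follows. The folder and file names here are hypothetical and are created in a temporary directory; setwd() stands in for what opening an R project does for you.

```r
# Hypothetical project folder with a data subfolder, created in a temp directory
project_dir <- file.path(tempdir(), "methods1", "class2")
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)
writeLines(c("id,value", "1,10"), file.path(project_dir, "data", "example.csv"))

# Opening an R project sets the working directory; setwd() imitates that here
old_wd <- setwd(project_dir)

# Relative path: starts from the working directory
relative_path <- file.path("data", "example.csv")
relative_found <- file.exists(relative_path)

# Absolute path: the same file, spelled out from the root of the computer
absolute_path <- normalizePath(relative_path)

# Restore the previous working directory
setwd(old_wd)

absolute_path
```

A relative path keeps working when the project folder is moved or shared, which is why scripts in this class use them.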
File structure
For each class, there will be a folder to download with the data for the in-class exercises and homework. You should download it and put it in your methods1 folder so it looks something like this:
Save your new script in the class2 folder as new_york_student_poverty_2018.R
At the top of your script, write a comment describing the purpose of this file and the source:
# Process the 2018 student-age poverty data from SAIPE for New York
# source = https://www.census.gov/data/datasets/2018/demo/saipe/2018-school-districts.html
Add tidyverse package
Next, load the tidyverse collection of packages to your environment
library(tidyverse)
Notice
The conflicts message when you load the tidyverse is fine.
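It tells you that dplyr masks some functions that have the same name in base R packages, such as filter() and lag() from stats; the dplyr versions are used by default.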
You only need to install a package once, but you must load a package into your R session every time.
install.packages("package_name") installs the package on your computer
library("package_name") loads the package into the current R environment
tidyverse data processing functions
We’ll learn five functions that are the backbone of data transformation in R
mutate() - create or redefine a variable (column)
select() - subset variables (columns) by their names
filter() - subset observations (rows) by their values
rename() - rename the variables
summarise() - collapse many values down to a single summary
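The five functions can be sketched together on a small made-up data frame. The column names and values below are hypothetical and are not the SAIPE data.

```r
library(tidyverse)

# Hypothetical districts, for illustration only
districts <- tibble(
  Name = c("District A", "District B", "District C", "District D"),
  State = c("NY", "NY", "NJ", "NY"),
  Children = c(1000, 400, 2500, 8000),
  Children_In_Poverty = c(150, 100, 250, 2000)
)

ny_districts <- districts %>%
  select(Name, State, Children, Children_In_Poverty) %>%     # keep these columns
  filter(State == "NY") %>%                                  # keep New York rows
  rename(stpop = Children, stpov = Children_In_Poverty) %>%  # new_name = old_name
  mutate(stpov_rate = stpov / stpop)                         # add a rate column

# Collapse the rate column down to a single value
ny_summary <- ny_districts %>%
  summarise(median_rate = median(stpov_rate))

ny_summary
```

Each step takes a data frame and returns a data frame, which is what lets the steps be chained with %>%.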
Read in our dataset
Use read_csv() to import a dataset into your R Environment.
Name it raw_stpov18 to indicate that this is the original form of the data
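A minimal sketch of the import step, using a small stand-in file written to a temporary location so it runs anywhere. The file contents and the name raw_example are hypothetical; in class, the path points to the csv in your data folder.

```r
library(tidyverse)

# Hypothetical stand-in file; the class file lives in your data folder instead
example_path <- tempfile(fileext = ".csv")
writeLines(
  c("Name,Estimated_Total_Pop", "District A,1200", "District B,450"),
  example_path
)

# read_csv() guesses a type for each column and returns a data frame (tibble)
raw_example <- read_csv(example_path)

# Check the column names and types before doing anything else
glimpse(raw_example)
```

Keeping the raw data frame untouched and saving processed versions under new names makes it easy to start over if a later step goes wrong.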
filter() does the same thing as the base R subset() function we used in homework 1
You can filter on a character variable too
You can have multiple arguments in a filter
Use “&” (AND) if every expression must be true for a row to be included in the output
Use “|” (OR) if any expression can be true for a row to be included in the output
“!” means not
filter examples
Try these!
# filter based on text value
nyc <- stpov18 %>%
  filter(Name == "New York City Department Of Education")
# remove new york city
ny_no_nyc <- stpov18 %>%
  filter(Name != "New York City Department Of Education")
# remove all districts with more than 10,000 people
ny_no_large_districts <- stpov18 %>%
  filter(Estimated_Total_Pop <= 10000)
# remove all districts with more than 10,000 people or with 500 or fewer people
ny_medium_districts <- stpov18 %>%
  filter(Estimated_Total_Pop <= 10000 & Estimated_Total_Pop > 500)
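The OR and NOT operators can be sketched the same way. The data frame below is a made-up stand-in with hypothetical district names and populations.

```r
library(tidyverse)

# Hypothetical stand-in for stpov18
stpov_example <- tibble(
  Name = c("District A", "District B", "District C", "District D"),
  Estimated_Total_Pop = c(300, 5000, 12000, 800)
)

# OR: keep a row if either expression is true (very small or very large districts)
small_or_large <- stpov_example %>%
  filter(Estimated_Total_Pop <= 500 | Estimated_Total_Pop > 10000)

# NOT: keep every row where the expression in parentheses is false
not_small <- stpov_example %>%
  filter(!(Estimated_Total_Pop <= 500))
```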
rename()
rename your columns
new_name = old_name
no spaces in variable names
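A minimal sketch of rename() on made-up columns. The column names are hypothetical; old names that contain spaces have to be wrapped in backticks.

```r
library(tidyverse)

# Hypothetical raw data whose column names contain spaces
raw_example <- tibble(
  `District Name` = c("District A", "District B"),
  `Estimated Total Population` = c(1200, 450)
)

# new_name = old_name
renamed_example <- raw_example %>%
  rename(
    district_name = `District Name`,
    total_pop = `Estimated Total Population`
  )

names(renamed_example)
```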
OPTIONAL: For additional context and information on R, review Chapters 4 and 5 of R for Data Science by Hadley Wickham and Garrett Grolemund
R Assignment 2b.
Use the 2018 student poverty processing script to process the same dataset for 2019.
Save your 2018 script as new_york_student_poverty_2019_your_name.R
Use it to import the 2019 student poverty data and process it in the same way
Create another data frame of New York districts with more than 500 children (variable = stpop) and fewer than 5000 children living in poverty (variable = stpov)
OPTIONAL (+1 point since it requires the summarise() function): Answer the following questions in commented-out text at the bottom of your script:
Did the student poverty rate increase or decrease in New York in 2019?
What was the median Student Population in 2019?
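Questions like these call for summarise(), which can compute several summary values at once. This is a sketch on made-up numbers, not the 2019 data; it assumes the statewide rate is total children in poverty divided by total children, which is one reasonable definition.

```r
library(tidyverse)

# Hypothetical districts using the stpop and stpov variable names
stpov_example <- tibble(
  district = c("District A", "District B", "District C"),
  stpop = c(1000, 400, 2500),
  stpov = c(150, 100, 250)
)

# One output row, one column per summary
state_summary <- stpov_example %>%
  summarise(
    median_stpop = median(stpop),
    total_stpov_rate = sum(stpov) / sum(stpop)
  )

state_summary
```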
R Assignment 2c.
Import the 2019 ACS New York Poverty Data by County and create a data frame of the poverty rate
Create a new script called ny_county_poverty_rate_19_your_name.R
Write a comment at the top describing the purpose of the script
Load the tidyverse
Read in the raw poverty data (nhgis0042_ds244_20195_2019_county.csv), name it raw_county_pov19
Read the data dictionary documentation in the same folder as the data
select the following columns: STATE, COUNTY, GEOID, ALWVE001, ALWVE002, ALWVE003
filter to keep New York
rename ALWVE001 to total_pop
mutate to create a new variable poverty_rate where
poverty_rate = (ALWVE002 + ALWVE003)/total_pop
select the following columns: STATE, COUNTY, GEOID, total_pop, poverty_rate
write out your csv to the output folder
Upload both of your scripts to their assignments in Canvas
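The order of steps for Assignment 2c can be sketched on a made-up table. All values below are hypothetical, the sketch assumes the STATE column holds the state name, and it writes to a temporary folder rather than the output folder. Check the data dictionary for what each ALWVE column actually means.

```r
library(tidyverse)

# Hypothetical stand-in for raw_county_pov19
raw_example <- tibble(
  STATE = c("New York", "New York", "New Jersey"),
  COUNTY = c("County A", "County B", "County C"),
  GEOID = c("00001", "00002", "00003"),
  ALWVE001 = c(1000, 2000, 3000),
  ALWVE002 = c(50, 100, 150),
  ALWVE003 = c(100, 300, 150)
)

county_pov_example <- raw_example %>%
  select(STATE, COUNTY, GEOID, ALWVE001, ALWVE002, ALWVE003) %>%
  filter(STATE == "New York") %>%
  rename(total_pop = ALWVE001) %>%
  mutate(poverty_rate = (ALWVE002 + ALWVE003) / total_pop) %>%
  select(STATE, COUNTY, GEOID, total_pop, poverty_rate)

# Write the result to a csv; a temp folder stands in for the output folder
output_path <- file.path(tempdir(), "county_poverty_rate_example.csv")
write_csv(county_pov_example, output_path)
```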