Methods 1, Week 2

In today’s class we’ll write scripts to begin to explore the desegregation datasets that we used in the homework assignment.

By the end of today’s class you’ll have your first analysis script.

Outline

Readings discussion
Research Journal discussion
Homework 1 overview and questions
File paths
Data exploration and processing with the tidyverse
Exporting data frames
Getting help
Assignment 2

Readings

Data Feminism

Why the Bronx Burned

Data Feminism

Missing data

“What we choose to measure is a statement of what [who] we value”

Minoritized

US Census race and ethnicity categories

Measuring Racial and Ethnic Diversity for the 2020 Census

The US Census follows “standards on race and ethnicity set by the U.S. Office of Management and Budget (OMB) in 1997. These standards guide how the federal government collects and presents data on these topics.”

Future Readings

We will use race/ethnicity data from the US Census

- it is an approximation of identity
- meant to be used at scale to see patterns

We will read writings by people talking about their identity.

Research Journal discussion

Anyone want to share?

Homework overview and questions, 1-2

calculate four plus seven

calculation1 <- 4 + 7

# calcalution1
calculation1

[1] 11

calculate five plus seven

five <- 5
seven <- "7"
# calculation2 <- five + seven
calculation2 <- five + as.numeric(seven)

calculation2

[1] 12
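The commented-out line fails because R will not add a number to a character string. A minimal sketch of why, reusing the variable names above; class() and tryCatch() are added here only for illustration and were not part of the homework:

```r
five <- 5
seven <- "7"

# class() reports the data type of each object
class(five)   # "numeric"
class(seven)  # "character"

# five + seven stops with a "non-numeric argument" error;
# tryCatch() captures that error here so the script keeps running
result <- tryCatch(five + seven, error = function(e) "error")
result

# converting the character value to a number makes the addition work
calculation2 <- five + as.numeric(seven)
calculation2
```

Checking class() is a quick first step whenever arithmetic on a column or variable fails unexpectedly.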

Homework overview and questions, 3

Import the dataset of desegregation orders from 1957 to 2014 from ProPublica

library(tidyverse)

# all_deseg_pp <- read_csv(data/invol_data_propublica.csv)
all_deseg_pp <- read_csv("data/invol_data_propublica.csv")

## View the columns and first rows of the data frame
glimpse(all_deseg_pp)

Homework overview and questions, 4

Create a data frame of open desegregation orders from the dataset imported in question 3

# open_deseg_pp <- subset(all_deseg_pp, Year_Lifted == "STILL OPEN") ## UNCORRECTED

##### CORRECTED
# use the names() to see how to spell each column
names(all_deseg_pp)
# correct the name of the Year.Lifted column
open_deseg_pp <- subset(all_deseg_pp, Year.Lifted == "STILL OPEN")

Homework overview and questions, 5.

Calculate how long the desegregation orders have been in effect, for all open desegregation orders

# UNCORRECTED
# open_deseg_pp$duration <- 2014 - open_deseg_pp$Year.Placed

##### CORRECTED
# convert Year.Placed to a number before subtracting
# Note: expect a red message about NAs here - it is a warning, not an error
open_deseg_pp$duration <- 2014 - as.numeric(open_deseg_pp$Year.Placed)
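The conversion matters because a column stored as text cannot be subtracted from a number. The sketch below uses a small made-up vector (not the ProPublica data) to show what as.numeric() does with values that are not numbers:

```r
# hypothetical values standing in for a Year.Placed column stored as text
year_placed <- c("1954", "1970", "Unknown")

# text that cannot be read as a number becomes NA, with a warning
year_numeric <- as.numeric(year_placed)
year_numeric

# the subtraction works on the numeric values and stays NA where the year is missing
duration <- 2014 - year_numeric
duration
```

The NA values are worth a look afterwards, since they mark rows where the year could not be read.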

Homework overview and questions, 6.

How many of the desegregation orders were still open in 2014?

pp_open_count <- nrow(open_deseg_pp)
pp_open_count # 330

[1] 330

Homework overview and questions, 7.

What desegregation order that was still open in 2014 had been open longest?

View(open_deseg_pp)
## Sort descending by the duration column in the viewer: the District of Columbia had a desegregation order from 1954 that was still open in 2014
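Sorting can also be done in code rather than by clicking a column header in the viewer. A sketch using dplyr's arrange() and desc() on a made-up data frame; the District column and all rows are invented for illustration, and only the duration column name comes from the homework:

```r
library(dplyr)  # dplyr is also loaded by library(tidyverse)

# hypothetical stand-in for open_deseg_pp
open_orders <- data.frame(
  District = c("District A", "District B", "District C"),
  duration = c(44, 60, 31)
)

# sort from longest to shortest duration
longest_first <- open_orders %>%
  arrange(desc(duration))

# the first row is the order that has been open longest
longest_first$District[1]
```

Writing the sort into the script keeps a record of how the answer was found, which clicking in View() does not.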

Homework overview and questions, 8.

Is that desegregation order still open?

# import the Civil Rights Data Collection dataset 
crdc <- read_csv("data/lea_deseg_CRDC_2017_18.csv")

# look for DC in it
View(crdc)
## District of Columbia is not in the CRDC dataset, so in the 2017-18 school year the order was no longer standing

Homework overview and questions, 9.

How many open desegregation orders are there in Alabama, according to the ProPublica dataset and the CRDC dataset?

# create a data frame of open desegregation orders in Alabama
al_pp <- subset(open_deseg_pp, State == "AL")

# count the number of rows
pp_al_count <- nrow(al_pp)
pp_al_count # 47 open orders in the ProPublica dataset

[1] 47

# create a data frame of open desegregation orders in Alabama in the CRDC dataset
al_crdc <- subset(crdc, LEA_STATE == "AL")

# count the number of rows
crdc_al_count <- nrow(al_crdc)
crdc_al_count # 18 open orders in the CRDC dataset

[1] 18

Homework (or Other) Questions?

File paths and working directories

The specification of the list of folders to get to a file on your computer is called a path.

Absolute path: starts at the root folder of the computer

"/Users/sarahodges/spatial/Data/tabular/msa/nhgis_fam_pov_2019/nhgis0032_ds244_20195_2019_cbsa.csv"

Relative path: starts at a given folder and provides the path starting from that folder

"tabular/msa/nhgis_fam_pov_2019/nhgis0032_ds244_20195_2019_cbsa.csv"

Relative paths make it easier to:

avoid mistakes
share your script with others

In this class, we will create an R project for each class that defines the folder that your relative path starts with.
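To see where a relative path starts, you can ask R for the current working directory. A short sketch using base R functions; the file name is the one used later in this class and is only assumed to exist in your project:

```r
# the folder that relative paths start from
# (your project folder when an R project is open)
working_directory <- getwd()
working_directory

# a relative path is read starting from that folder
relative_path <- "data/ny_student_poverty_2018.csv"

# joining the two gives the absolute path R will look in
absolute_path <- file.path(working_directory, relative_path)
absolute_path

# check whether the file is really there before trying to read it
file.exists(relative_path)
```

If file.exists() returns FALSE, the usual causes are a project that is not open or a data folder saved in a different place.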

File structure

For each class, there will be a folder to download with the data for the in-class exercises and homework. You should download it and put it in your methods1 folder so it looks something like this:

Absolute File Paths

Mac

/Users/sarahodges/spatial/Data/tabular/msa/nhgis_fam_pov_2019/nhgis0032_ds244_20195_2019_cbsa.csv

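Windows (R doesn’t like single backslashes, because it reads a backslash as an escape character)

use doubled backslashes (or forward slashes) inside the file.path() function if you have issues: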

file.path("C:\\Users\\sarahodges\\spatial\\Data\\tabular\\msa\\nhgis_fam_pov_2019\\nhgis0032_ds244_20195_2019_cbsa.csv")

Windows filepath explainer

Class 2 project and data

Create New Project from New Directory in your class folder
- name it class2

Download the class 2 data folder and save the data folder in your class2 folder.

Class 2 script

Create new R script File > R Script
Save it in class2 as new_york_student_poverty_2018.R
At the top of your script, write a comment describing the purpose of this file and the source:

# Process the 2018 student-age poverty data from SAIPE for New York
# source = https://www.census.gov/data/datasets/2018/demo/saipe/2018-school-districts.html

Add tidyverse package

Next, load the tidyverse collection of packages to your environment

library(tidyverse)

Notice

The conflicts message when you load the tidyverse is fine.
- It tells you that dplyr masks some functions in base R that share a name, such as filter() and lag().
You only need to install a package once, but you must load a package into your R session every time.
- install.packages("package_name") installs the package on your computer
- use library("package_name") to load the package into the current R environment

tidyverse data processing functions

We’ll learn five functions that are the backbone of data transformation in R

mutate() - create or redefine a variable (column)
select() - subset variables (columns) by their names
filter() - subset observations (rows) by their values
rename() - rename the variables
summarise() - collapse many values down to a single summary
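summarise() is the one function in this list that the walkthrough that follows does not demonstrate. A minimal sketch on a made-up data frame; the rows and the summary column names are invented for illustration:

```r
library(dplyr)  # dplyr is also loaded by library(tidyverse)

# hypothetical districts with a student population and students in poverty
districts <- data.frame(
  district = c("A", "B", "C"),
  stpop = c(1200, 300, 4500),
  stpov = c(240, 30, 900)
)

# collapse all rows into a single row of summary values
summary_stats <- districts %>%
  summarise(
    total_students = sum(stpop),
    median_students = median(stpop),
    overall_pov_rate = sum(stpov) / sum(stpop)
  )

summary_stats
```

The result is still a data frame, with one row and one column per summary, so it can be piped onward or written out like any other.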

Read in our dataset

Use read_csv() to import a dataset into your R Environment.
Name it raw_stpov18 to indicate that this is the original form of the data

raw_stpov18 <- read_csv("data/ny_student_poverty_2018.csv")

Postal	FIPS	district_id	Name	Estimated_Total_Pop	Estimated_Pop_5_17	Estimated_relevant_5_17_in_poverty
NY	36	02370	Addison Central School District	6887	1228	277
NY	36	05040	Adirondack Central School District	8456	1346	209

mutate() - create or redefine a variable

Create new data frame from raw_stpov18
New column that equals the student poverty rate

stpov18 <- raw_stpov18 %>%
  mutate(stud_pov_rate = Estimated_relevant_5_17_in_poverty/Estimated_Pop_5_17)

Notice

The tidyverse uses a pipe operator “%>%” to string together multiple commands
- think of it as meaning “and then”
- you’ll eventually love the “%>%”
- keystroke = cmd/ctrl - shift - m
To create a new variable mutate(var_name = equation)
- you use the equal sign within dplyr functions
Within a dplyr function, you don’t need to use the dollar sign ($) before a column name
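To see what the pipe does, compare a piped call with the same call written without it. A sketch on a small invented data frame:

```r
library(dplyr)  # dplyr is also loaded by library(tidyverse)

# hypothetical data
scores <- data.frame(correct = c(8, 5), total = c(10, 10))

# with the pipe: take scores, and then mutate
piped <- scores %>%
  mutate(share = correct / total)

# without the pipe: the data frame is the first argument
unpiped <- mutate(scores, share = correct / total)

# both approaches give the same data frame
identical(piped, unpiped)
```

The pipe pays off once several steps are chained, because each step reads top to bottom instead of inside out.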

Create a text column

year = "2018"

stpov18 <- raw_stpov18 %>%
  mutate(stud_pov_rate = Estimated_relevant_5_17_in_poverty/Estimated_Pop_5_17,
         year = "2018")

Notice

You can create more than one new column in the same code chunk
- separate each new variable by a comma
Use double quotes around the value if you want the new variable’s data type to be character

select()

subset columns by their names

Our student poverty table has 9 columns; we may not need all of them.

Type names(stpov18) in your console to see column names

stpov18 <- raw_stpov18 %>%
  mutate(stud_pov_rate = Estimated_relevant_5_17_in_poverty
         /Estimated_Pop_5_17,
         year = "2018") %>%
  select(Postal, district_id, Name, Estimated_Total_Pop, Estimated_Pop_5_17, stud_pov_rate, year)

script so far

# Process the 2018 student-age poverty data from SAIPE for New York
# source = https://www.census.gov/data/datasets/2018/demo/saipe/2018-school-districts.html

library(tidyverse)

raw_stpov18 <- read_csv("data/ny_student_poverty_2018.csv")

# create student poverty rate & year column
# select necessary variables
stpov18 <- raw_stpov18 %>%
  mutate(stud_pov_rate = Estimated_relevant_5_17_in_poverty
         /Estimated_Pop_5_17,
         year = "2018") %>%
  select(Postal, district_id, Name, Estimated_Total_Pop, 
         Estimated_Pop_5_17, stud_pov_rate, year)

Notice

Script format:
- purpose at the top
- all packages needed for the script next
- import all raw data next (name the data frame "raw_...")
- create a new data frame with the dataset you want for your analysis
Use a pipe operator %>% to add a new function
- mutate() all of your variables, then %>% to use select()

filter()

subset observations (rows) by their values

Remove rows with fewer than 100 children

stpov18 <- raw_stpov18 %>%
  mutate(stud_pov_rate = Estimated_relevant_5_17_in_poverty
         /Estimated_Pop_5_17,
         year = "2018") %>%
  select(Postal, district_id, Name, Estimated_Total_Pop, 
         Estimated_Pop_5_17, stud_pov_rate, year) %>%
  filter(Estimated_Pop_5_17 >= 100)

Notice

Filter does the same thing as the base R subset() we used in homework 1
You can filter on a character variable too
You can have multiple arguments in a filter
Use “&” (AND) if every expression must be true for a row to be included in the output
Use “|” (OR) if any expression can be true for a row to be included in the output
“!” means not

filter examples

Try these!

# filter based on text value
nyc <- stpov18 %>% 
  filter(Name == "New York City Department Of Education")

# remove new york city
ny_no_nyc <- stpov18 %>% 
  filter(Name != "New York City Department Of Education")

# remove all districts with more than 10,000 people
ny_no_large_districts <- stpov18 %>% 
  filter(Estimated_Total_Pop <= 10000)

# keep only districts with more than 500 people AND no more than 10,000 people
ny_medium_districts <- stpov18 %>% 
  filter(Estimated_Total_Pop <= 10000 & Estimated_Total_Pop > 500)
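The examples above cover ==, !=, and &. A sketch of | (OR) and ! (NOT) on an invented data frame; the column names match the student poverty data, but the rows are made up:

```r
library(dplyr)  # dplyr is also loaded by library(tidyverse)

# hypothetical districts
districts <- data.frame(
  Name = c("District A", "District B", "District C", "District D"),
  Estimated_Total_Pop = c(400, 2500, 12000, 60000)
)

# OR: keep rows where either condition is true (very small or very large)
small_or_large <- districts %>%
  filter(Estimated_Total_Pop <= 500 | Estimated_Total_Pop > 10000)

# NOT: keep rows where the condition in parentheses is false
not_small <- districts %>%
  filter(!(Estimated_Total_Pop <= 500))

small_or_large$Name
not_small$Name
```

Wrapping the condition in parentheses after ! makes it clear what is being negated.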

rename()

rename your columns
- new_name = old_name
- no spaces in variable names

stpov18 <- raw_stpov18 %>%
  mutate(stud_pov_rate = Estimated_relevant_5_17_in_poverty
         /Estimated_Pop_5_17,
         year = "2018") %>%
  select(Postal, district_id, Name, Estimated_Total_Pop, 
         Estimated_Pop_5_17, Estimated_relevant_5_17_in_poverty, stud_pov_rate, year) %>%
  filter(Estimated_Pop_5_17 >= 100) %>%
  rename(id = district_id,
         district = Name,
         tpop = Estimated_Total_Pop,
         stpop = Estimated_Pop_5_17,
         stpov = Estimated_relevant_5_17_in_poverty,
         stpovrate = stud_pov_rate)

Write out your csvs

Save the processed data frame and the summary stats data frame to your computer

# write out processed student poverty data for 2018
write_csv(stpov18, "data/output/ny_student_poverty_rate_2018.csv")
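write_csv() stops with an error if the output folder does not exist yet. A sketch that creates the folder first; it writes to a temporary folder so it does not touch your project, and the data frame and file name are placeholders rather than the class data:

```r
library(readr)  # readr is also loaded by library(tidyverse)

# hypothetical processed data
processed <- data.frame(district = c("A", "B"), stpovrate = c(0.2, 0.1))

# tempdir() is used only for this illustration;
# in your own script the folder would be "data/output"
output_dir <- file.path(tempdir(), "data", "output")

# create the folder (and any missing parent folders) if it is not there
if (!dir.exists(output_dir)) {
  dir.create(output_dir, recursive = TRUE)
}

output_file <- file.path(output_dir, "example_output.csv")
write_csv(processed, output_file)

# read the file back to confirm it was written
check <- read_csv(output_file, show_col_types = FALSE)
check
```

Creating the folder by hand in your file browser works just as well; the code version simply makes the script runnable on someone else's computer.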

Readings

Chapter 6, “Never a Real Democracy,” of The Sum of Us by Heather McGhee
Getting help in R presentation.
OPTIONAL: For additional context and information on R, review Chapters 4 and 5 of R for Data Science by Hadley Wickham and Garrett Grolemund

R Assignment 2b.

Use the 2018 student poverty processing script to process the same dataset for 2019.

Save your 2018 script as new_york_student_poverty_2019_your_name.R
Use it to import the 2019 student poverty data and process it in the same way
Create another data frame of New York districts with more than 500 children (variable = stpop) and fewer than 5000 children living in poverty (variable = stpov)
OPTIONAL (+1 point since it requires the summarise() function) Answer the following questions in commented-out text at the bottom of your script:
- Did the student poverty rate increase or decrease in New York in 2019?
- What was the median Student Population in 2019?

R Assignment 2c.

Import the 2019 ACS New York Poverty Data by County and create a data frame of the poverty rate

Create a new script called ny_county_poverty_rate_19_your_name.R
Write a comment at the top describing the purpose of the script
Load the tidyverse
Read in the raw poverty data (nhgis0042_ds244_20195_2019_county.csv), name it raw_county_pov19
Read the data dictionary documentation in the same folder as the data
select the following columns: STATE, COUNTY, GEOID, ALWVE001, ALWVE002, ALWVE003
filter to keep New York
rename ALWVE001 to total_pop
mutate to create a new variable poverty_rate where
- poverty_rate = (ALWVE002 + ALWVE003)/total_pop
select the following columns: STATE, COUNTY, GEOID, total_pop, poverty_rate
write out your csv to the output folder

Upload both of your scripts to the assignment in Canvas
