Week 2

In today’s class we’ll write scripts to begin to explore the desegregation datasets that we used in the homework assignment.

By the end of today’s class you’ll have your first analysis script.

Outline

Readings discussion

Research Journal discussion

Homework 1 overview and questions

File paths

Data exploration and processing with the tidyverse

Exporting data frames

Getting help

Assignment 2

Readings

Data Feminism

“What we choose to measure is a statement of what [who] we value”

Missing data

Minoritized

Black Is Over (Or, Special Black)

US Census race and ethnicity categories

Measuring Racial and Ethnic Diversity for the 2020 Census

Why the Bronx Burned

Research Journal discussion

Homework overview and questions

Homework 1 solution

File paths and working directories

The specification of the list of folders to get to a file on your computer is called a path.

Absolute path: starts at the root folder of the computer
Relative path: starts at a given folder and provides the path starting from that folder

Using relative paths makes it easier to avoid mistakes and to share your script with others. In this class, we will create an R project for each class that defines the folder that your relative path starts with. For each class, there will be a folder to download with the data for the in-class exercises and homework. You should download it and put it in your methods1 folder so it looks something like this:

Starting with class2, we’ll all try to have the same file structure. Don’t worry if your class1 folder looks different.

If you need to use an absolute path to import data, note that absolute paths look different depending on your operating system. See the examples below of the absolute path to a csv in my main data folder:

Mac "/Users/sarahodges/spatial/Data/tabular/msa/nhgis_fam_pov_2019/nhgis0032_ds244_20195_2019_cbsa.csv"
Windows "C:\Users\sarahodges\spatial\Data\tabular\msa\nhgis_fam_pov_2019\nhgis0032_ds244_20195_2019_cbsa.csv"
- Many absolute file paths on Windows machines contain spaces; these work as long as the whole path is inside quotes
- R treats a backslash inside quotes as the start of an escape sequence, so a Windows path pasted directly into a script will cause an error
- If you have trouble importing data on a Windows machine, replace each backslash with a forward slash (or a double backslash), or use the file.path() function to build the path from its folder names:
  - file.path("C:", "Users", "sarahodges", "spatial", "Data", "tabular", "msa", "nhgis_fam_pov_2019", "nhgis0032_ds244_20195_2019_cbsa.csv")
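As a minimal sketch of how relative and absolute paths relate, the lines below build a relative path with file.path() and show where R would look for it. The data/ny_student_poverty_2018.csv path is the one used in the import step later in this class; nothing here reads the file, so the code runs even if the file is not on your computer yet.

```r
# the working directory is the folder that relative paths start from
# (inside an R project, this is the project folder)
wd <- getwd()

# file.path() joins folder and file names with forward slashes,
# which R accepts on both Mac and Windows
rel_path <- file.path("data", "ny_student_poverty_2018.csv")
rel_path

# the absolute path R would use when given the relative path
abs_path <- file.path(wd, rel_path)
abs_path

# check whether the file exists before trying to import it
file.exists(rel_path)
```

If file.exists() returns FALSE, the usual cause is that the working directory is not the project folder you expected.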

Download folder for class 2 and create a new project

Download the class 2 folder and save it in your methods1 folder on your computer

Create new project from existing folder:

methods1/class2

Create new R script File > R Script

Save it in class2 as new_york_student_poverty_2018.R

At the top of your script, write a comment describing the purpose of this file:

# Processing and exploring the 2018 student-age poverty data from SAIPE for New York
# source = https://www.census.gov/data/datasets/2019/demo/saipe/2019-school-districts.html

Data exploration and processing with the Tidyverse

First load the tidyverse collection of packages to your environment

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.4     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Things to notice:

Notice the conflicts message that’s printed when you load the tidyverse. Those are just fine. It tells you that dplyr overwrites some functions in base R. If you want to use the base version of these functions after loading dplyr, you’ll need to use their full names: stats::filter() and stats::lag().

You only need to install a package once, but you must load it into your R session every time you start R

install.packages("package_name") installs the package on your computer

library("package_name") loads the package into the current R session

dplyr package

We’ll learn five dplyr functions that are the backbone of data transformation in R

mutate() - create or redefine a variable (column)

select() - subset variables (columns) by their names

filter() - subset observations (rows) by their values

rename() - rename the variables

summarise() - collapse many values down to a single summary

These, and other dplyr functions, can be used with group_by() which changes the scope of each function from operating on the full data frame, to operating on the data frame by group. (This will become more clear with examples)
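Here is a small sketch of what group_by() changes. The data frame and its county column are made up for illustration (the class dataset has no county column); the point is only that the same summarise() call returns one row for the whole table without group_by() and one row per group with it.

```r
library(dplyr)

# toy data, made up for illustration
districts <- data.frame(
  county = c("A", "A", "B"),
  stpop  = c(100, 300, 200)
)

# without group_by(): one row summarising the whole data frame
districts %>%
  summarise(kids = sum(stpop))

# with group_by(): one row per county
kids_by_county <- districts %>%
  group_by(county) %>%
  summarise(kids = sum(stpop))

kids_by_county
```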

Import the dataset that we’ll explore, name it raw_stpov18 to indicate that this is the original form of the data

# import the 2018 New York student poverty dataset (SAIPE)
raw_stpov18 <- read_csv("data/ny_student_poverty_2018.csv")

## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   Postal = col_character(),
##   FIPS = col_double(),
##   district_id = col_character(),
##   Name = col_character(),
##   Estimated_Total_Pop = col_double(),
##   Estimated_Pop_5_17 = col_double(),
##   Estimated_relevant_5_17_in_poverty = col_double()
## )

glimpse(raw_stpov18)

mutate() - create or redefine a variable

# create new data frame from raw_stpov18 to process the data
# create a new column that calculates the student poverty rate
stpov18 <- raw_stpov18 %>%
  mutate(stud_pov_rate = Estimated_relevant_5_17_in_poverty/Estimated_Pop_5_17)

Things to notice:

The tidyverse uses a pipe operator “%>%” to string together multiple commands

think of it as meaning “and then”

you’ll eventually love the “%>%”

keystroke = cmd/ctrl - shift - m

To create a new variable mutate(var_name = equation)

you use the equal sign within dplyr functions

Within dplyr functions, you don’t need to use the dollar sign ($) before a column name
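To make the “and then” reading concrete, the sketch below writes the same two steps twice: once as nested function calls and once with the pipe. The toy data frame and its column names are made up for illustration.

```r
library(dplyr)

# toy data, made up for illustration
toy <- data.frame(kids = c(200, 50), kids_in_pov = c(40, 5))

# nested: you have to read from the inside out
nested <- filter(mutate(toy, pov_rate = kids_in_pov / kids), kids >= 100)

# piped: read top to bottom as "take toy, and then mutate, and then filter"
piped <- toy %>%
  mutate(pov_rate = kids_in_pov / kids) %>%
  filter(kids >= 100)

# compare the two results
all.equal(nested, piped)
```

Both versions do the same work; the piped one is usually easier to read and to extend with another step.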

# create a new column that calculates the student poverty rate
# create a new 'year' column
stpov18 <- raw_stpov18 %>%
  mutate(stud_pov_rate = Estimated_relevant_5_17_in_poverty/Estimated_Pop_5_17,
         year = "2018")

Things to notice:

You can create more than one new column in the same code chunk

separate each new variable by a comma

Use double quotes around the value if you want the new variable’s data type to be character
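A quick way to see the effect of the quotes is to create the same value both ways and check the type with class(). The toy data frame and the column names year_chr and year_num are made up for illustration.

```r
library(dplyr)

# toy data, made up for illustration
toy <- data.frame(kids = c(200, 50))

toy_years <- toy %>%
  mutate(year_chr = "2018",   # with quotes: stored as text
         year_num = 2018)     # without quotes: stored as a number

class(toy_years$year_chr)
class(toy_years$year_num)
```

Storing the year as text is a reasonable choice when it is a label rather than something you will do arithmetic on.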

select() - subset columns by their names

Our student poverty table has 9 columns, we may not need all of those.

Type names(stpov18) in your console to look at the names of the columns

# then, add to the previous code chunk to select the variables you want to keep
stpov18 <- raw_stpov18 %>%
  mutate(stud_pov_rate = Estimated_relevant_5_17_in_poverty/Estimated_Pop_5_17,
         year = "2018") %>%
  select(Postal, district_id, Name, Estimated_Total_Pop, Estimated_Pop_5_17, stud_pov_rate)

Things to notice:

Add a pipe operator to perform a new type of transformation on your data frame

filter() - subset observations (rows) by their values

Remove districts with fewer than 100 children

Type View(stpov18) in your console to view your data frame and see how many districts have fewer than 100 children

# then, add to the previous code chunk to keep only districts with at least 100 children
# (this version also prepends the state FIPS code to district_id)
stpov18 <- raw_stpov18 %>%
  mutate(stud_pov_rate = Estimated_relevant_5_17_in_poverty/Estimated_Pop_5_17,
         year = "2018",
         district_id = paste0(FIPS, district_id)) %>%
  select(Postal, district_id, Name, Estimated_Total_Pop, Estimated_Pop_5_17, stud_pov_rate, year) %>%
  filter(Estimated_Pop_5_17 >= 100)

Things to notice:

Filter does the same thing as the base R subset() we used in homework 1

You can filter on a character variable too

You can have multiple arguments in a filter

Use “&” (AND) if every expression must be true for a row to be included in the output

Use “|” (OR) if any expression can be true for a row to be included in the output

“!” means not
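The three operators can be tried side by side on a tiny table. This is a sketch on made-up data (the district names and populations are invented), not the class dataset.

```r
library(dplyr)

# toy data, made up for illustration
toy <- data.frame(
  district = c("North", "South", "East", "West"),
  tpop     = c(300, 8000, 25000, 12000)
)

# & (AND): both conditions must be true
medium <- toy %>%
  filter(tpop > 500 & tpop <= 10000)

# | (OR): either condition can be true
small_or_large <- toy %>%
  filter(tpop <= 500 | tpop > 10000)

# ! (NOT): flips a condition, so this keeps every row except "East"
not_east <- toy %>%
  filter(!(district == "East"))
```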

# examples of other filters

# filter based on text value
nyc <- stpov18 %>% 
  filter(Name == "New York City Department Of Education")

# remove new york city
ny_no_nyc <- stpov18 %>% 
  filter(Name != "New York City Department Of Education")

# remove all districts with more than 10,000 people
ny_no_large_districts <- stpov18 %>% 
  filter(Estimated_Total_Pop <= 10000)

# keep only districts with more than 500 people AND no more than 10,000 people
ny_medium_districts <- stpov18 %>% 
  filter(Estimated_Total_Pop <= 10000 & Estimated_Total_Pop > 500)

# remove these test data frames from your environment
rm(nyc)
rm(ny_no_nyc)
rm(ny_no_large_districts)
rm(ny_medium_districts)

Things to notice:

rm() removes a data frame from your environment

rename() - rename your columns

# add to the previous code chunk to give your variables shorter names
stpov18 <- raw_stpov18 %>%
  mutate(stud_pov_rate = Estimated_relevant_5_17_in_poverty/Estimated_Pop_5_17,
         year = "2018",
         district_id = paste0(FIPS, district_id)) %>%
  select(Postal, district_id, Name, Estimated_Total_Pop, Estimated_Pop_5_17, Estimated_relevant_5_17_in_poverty, stud_pov_rate, year) %>%
  filter(Estimated_Pop_5_17 >= 100) %>%
  rename(id = district_id,
         district = Name,
         tpop = Estimated_Total_Pop,
         stpop = Estimated_Pop_5_17,
         stpov = Estimated_relevant_5_17_in_poverty,
         stpovrate = stud_pov_rate)


# look at your table to make sure you got it right

Things to notice:

When you rename, new_name = old_name

Never have spaces or dashes in your column name

If you are renaming multiple variables, separate them by a comma and new line for readability

Calculate some quick summary statistics

summary(stpov18)

summarise() - create a data frame of summary statistics that you define

# calculate the student poverty rate in New York and the average student poverty rate of school districts in New York

ny_pov_stats <- stpov18 %>%
  summarise(districts = n(),
            kids = sum(stpop),
            kids_in_pov = sum(stpov),
            stud_povrate = round(kids_in_pov/kids, 3),
            mean_sd_stpovrate = round(mean(stpovrate), 3),
            max_sd_stpovrate = round(max(stpovrate), 3),
            min_sd_stpovrate = round(min(stpovrate), 3))

Things to notice:

Use summarise (with an s!) to calculate summary statistics for the entire data frame

Put each summary statistic on a new line, separated by commas
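The summary above reports two different rates: stud_povrate divides all children in poverty by all children, while mean_sd_stpovrate averages the district rates, so a small district counts as much as a large one. A toy example with made-up numbers shows how far apart the two can be.

```r
library(dplyr)

# toy data, made up for illustration: one large and one small district
toy <- data.frame(
  stpop = c(1000, 100),
  stpov = c(100, 50)
) %>%
  mutate(stpovrate = stpov / stpop)   # district rates: 0.10 and 0.50

toy_stats <- toy %>%
  summarise(overall_rate = sum(stpov) / sum(stpop),   # 150 / 1100
            mean_district_rate = mean(stpovrate))     # (0.10 + 0.50) / 2

toy_stats
```

Which rate to report depends on the question: the overall rate describes children, the mean district rate describes districts.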

arrange() - reorder the rows

Use arrange() if you want to permanently change the order of the rows in your data frame

# order the processed data frame by student poverty rate (lowest first)
stpov18 <- stpov18 %>%
  arrange(stpovrate)

Questions:

What district has the highest student-age poverty rate?

What district has the lowest student-age poverty rate?
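One way to approach questions like these is to sort in both directions and look at the first row. The sketch below uses made-up districts and rates; desc() is the dplyr helper that reverses the sort order.

```r
library(dplyr)

# toy data, made up for illustration
toy <- data.frame(
  district  = c("North", "South", "East"),
  stpovrate = c(0.12, 0.31, 0.05)
)

# arrange() sorts from lowest to highest by default
lowest_first <- toy %>%
  arrange(stpovrate)

# wrap the column in desc() to sort from highest to lowest
highest_first <- toy %>%
  arrange(desc(stpovrate))

head(lowest_first, 1)    # first row = lowest rate
head(highest_first, 1)   # first row = highest rate
```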

Write out your csvs

Save the processed data frame and the summary stats data frame to your computer

# write out processed student poverty data for 2018
write_csv(stpov18, "data/output/ny_student_poverty_rate_2018.csv")
# write out poverty rate stats for 2018
write_csv(ny_pov_stats, "data/output/student_poverty_state_stats.csv")
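write_csv() does not create folders, so the calls above fail if data/output does not exist yet. Below is a minimal sketch of creating the folder first. It writes a made-up table to a temporary folder (an assumption, so the example does not touch your project); in your own script you would use "data/output" directly.

```r
library(readr)

# a temporary stand-in for the project's data/output folder
out_dir <- file.path(tempdir(), "data", "output")

# create the folder (and any missing parent folders) if it isn't there yet
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)

# toy data, made up for illustration
toy <- data.frame(district = c("North", "South"),
                  stpovrate = c(0.12, 0.31))

out_file <- file.path(out_dir, "toy_poverty_rates.csv")
write_csv(toy, out_file)

# confirm the file was written
file.exists(out_file)
```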

Getting help

See slides on how to get help for R

Homework

Readings

The discussion in our next session will be based around access to democracy. Please read and be prepared to discuss the following reading:

Chapter 6: Never a Real Democracy of The Sum of Us by Heather McGhee

For additional context and information on R, review Chapters 4 and 5 of R for Data Science by Hadley Wickham and Garrett Grolemund

R Assignments

2b. Use the 2018 student poverty processing script to process the same dataset for another year.

Save your 2018 script as new_york_student_poverty_2019_your_name.R
Use it to import the 2019 student poverty data and process it in the same way
Create another data frame of New York districts with more than 500 children (variable = stpop) and fewer than 5000 children living in poverty (variable = stpov)
Answer the following questions in commented out text at the bottom of your script:
- Did the student poverty rate increase or decrease in New York in 2019?
- What was the median Student Population in 2019?

2c. Import the 2019 ACS New York Poverty Data by County and create a data frame of the poverty rate

Create a new script called ny_county_poverty_rate_19_your_name.R
Write a comment at the top describing the purpose of the script
Load the tidyverse
Read in the raw poverty data (nhgis0042_ds244_20195_2019_county.csv), name it raw_county_pov19
Read the data dictionary documentation in the same folder as the data
select the following columns: STATE, COUNTY, GEOID, ALWVE001, ALWVE002, ALWVE003
filter to keep New York
rename ALWVE001 to total_pop
mutate to create a new variable poverty_rate where
- poverty_rate = (ALWVE002 + ALWVE003)/total_pop
select the following columns: STATE, COUNTY, GEOID, total_pop, poverty_rate
write out your csv to the output folder

Upload both of your scripts to their assignment in Canvas