In today’s class we’ll write scripts to begin to explore the desegregation datasets that we used in the homework assignment.

By the end of today’s class you’ll have your first analysis script.


Outline

  • Readings discussion

  • Research Journal discussion

  • Homework 1 overview and questions

  • File paths

  • Data exploration and processing with the tidyverse

  • Exporting data frames

  • Getting help

  • Assignment 2




Readings




Data Feminism

  • “What we choose to measure is a statement of what [who] we value”





  • Missing data



  • Minoritized




Black Is Over (Or, Special Black)

  • US Census race and ethnicity categories




Why the Bronx Burned




Research Journal discussion







Homework overview and questions

Homework 1 solution




File paths and working directories

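A path is the list of folders you go through to get to a file on your computer. An absolute path starts from the root of your computer; a relative path starts from your current working directory.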

Using relative paths makes it easier to avoid mistakes and to share your script with others. In this class, we will create an R project for each class session; the project folder is where your relative paths start. For each class, there will be a folder to download with the data for the in-class exercises and homework. You should download it and put it in your methods1 folder so it looks something like this:

Starting with class2, we’ll all try to have the same file structure. Don’t worry if your class1 folder looks different.

If you need to use an absolute path to import data, note that absolute paths look different depending on your operating system. See examples below of the absolute path to a csv in my main data folder:
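As a rough sketch of how the two kinds of path differ, the snippet below uses hypothetical user and folder names (`your_name` is an assumption, not an actual folder from this class). Only the relative path matches the one used later in this script.

```r
# Hypothetical absolute paths for illustration only; your user name and folders will differ
mac_path <- "/Users/your_name/methods1/class2/data/ny_student_poverty_2018.csv"
win_path <- "C:/Users/your_name/methods1/class2/data/ny_student_poverty_2018.csv"

# relative path: starts from the project folder (methods1/class2)
rel_path <- "data/ny_student_poverty_2018.csv"

# file.path() builds a path from its parts, joined with forward slashes
built_path <- file.path("data", "ny_student_poverty_2018.csv")
identical(built_path, rel_path)
```

Forward slashes work in R on Windows too, which avoids having to escape backslashes.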




Download folder for class 2 and create a new project

Download the class 2 folder and save it in your methods1 folder on your computer

Create new project from existing folder:

  • methods1/class2


Create a new R script: File > New File > R Script

Save it in class2 as new_york_student_poverty_2018.R

At the top of your script, write a comment describing the purpose of this file:

# Processing and exploring the 2018 student-age poverty data from SAIPE for New York
# source = https://www.census.gov/data/datasets/2019/demo/saipe/2019-school-districts.html


Data exploration and processing with the Tidyverse





First load the tidyverse collection of packages to your environment

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.4     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Things to notice:

  • Notice the conflicts message that’s printed when you load the tidyverse. It’s nothing to worry about: it tells you that dplyr masks some functions from base R. If you want to use the base version of these functions after loading dplyr, you’ll need to use their full names: stats::filter() and stats::lag().
  • You only need to install a package once, but you must load it into your R session every time you start a new session
    • install.packages("package_name") installs the package on your computer
    • library("package_name") loads the package into the current R session
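A minimal sketch of the install-once, load-every-session pattern, and of calling a masked base function by its full name. It loads dplyr alone (one of the tidyverse packages), and the numbers are made up for illustration.

```r
# install once per computer; left commented out so the script
# does not reinstall the package on every run
# install.packages("dplyr")

# load in every new R session
library(dplyr)

# after loading dplyr, filter() means dplyr::filter();
# the base version is still available under its full name
moving_avg <- stats::filter(1:5, rep(1/3, 3))  # 3-value moving average
as.numeric(moving_avg)
```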



dplyr package


We’ll learn five dplyr functions that are the backbone of data transformation in R

  • mutate() - create or redefine a variable (column)
  • select() - subset variables (columns) by their names
  • filter() - subset observations (rows) by their values
  • rename() - rename the variables
  • summarise() - collapse many values down to a single summary

These, and other dplyr functions, can be used with group_by() which changes the scope of each function from operating on the full data frame, to operating on the data frame by group. (This will become more clear with examples)
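To make the effect of group_by() concrete, here is a small sketch on made-up data. The `county` column and its values are hypothetical and are not part of the student poverty dataset.

```r
library(dplyr)

# made-up data: three districts in two hypothetical counties
toy <- data.frame(
  county = c("A", "A", "B"),
  stpop  = c(100, 200, 300),
  stpov  = c(10, 40, 30)
)

# without group_by(): one row summarising the whole data frame
overall <- toy %>%
  summarise(kids = sum(stpop),
            kids_in_pov = sum(stpov))

# with group_by(): one row per county
by_county <- toy %>%
  group_by(county) %>%
  summarise(kids = sum(stpop),
            kids_in_pov = sum(stpov))
```

The summarise() call is identical in both cases; only the scope it operates on changes.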


Import the dataset that we’ll explore, name it raw_stpov18 to indicate that this is the original form of the data

# import the 2018 New York student poverty dataset
raw_stpov18 <- read_csv("data/ny_student_poverty_2018.csv")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   Postal = col_character(),
##   FIPS = col_double(),
##   district_id = col_character(),
##   Name = col_character(),
##   Estimated_Total_Pop = col_double(),
##   Estimated_Pop_5_17 = col_double(),
##   Estimated_relevant_5_17_in_poverty = col_double()
## )
glimpse(raw_stpov18)



mutate() - create or redefine a variable

# create new data frame from raw_stpov18 to process the data
# create a new column that calculates the student poverty rate
stpov18 <- raw_stpov18 %>%
  mutate(stud_pov_rate = Estimated_relevant_5_17_in_poverty/Estimated_Pop_5_17)

Things to notice:

  • The tidyverse uses a pipe operator “%>%” to string together multiple commands
    • think of it as meaning “and then”
    • you’ll eventually love the “%>%”
    • keystroke = cmd/ctrl - shift - m
  • To create a new variable mutate(var_name = equation)
    • you use the equal sign within dplyr functions
  • Within dplyr functions, you don’t need to use the dollar sign ($) before a column name
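One way to see what the pipe does is to write the same step with and without it. This is a sketch on a made-up two-row data frame; the column names are for illustration only.

```r
library(dplyr)

# made-up data for illustration
toy <- data.frame(kids = c(200, 400),
                  kids_in_pov = c(50, 40))

# with the pipe: take toy, and then mutate it
piped <- toy %>%
  mutate(pov_rate = kids_in_pov / kids)

# without the pipe: the data frame is the first argument
nested <- mutate(toy, pov_rate = kids_in_pov / kids)

# both give the same result
identical(piped, nested)
```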



# create a new column that calculates the student poverty rate
# create a new 'year' column
stpov18 <- raw_stpov18 %>%
  mutate(stud_pov_rate = Estimated_relevant_5_17_in_poverty/Estimated_Pop_5_17,
         year = "2018")

Things to notice:

  • You can create more than one new column in the same code chunk
    • separate each new variable by a comma
  • Use double quotes around the value if you want the new variable’s data type to be character




select() - subset columns by their names

Our student poverty table has 9 columns; we may not need all of them.

Type names(stpov18) in your console to look at the names of the columns

# then, add to the previous code chunk to select the variables you want to keep
stpov18 <- raw_stpov18 %>%
  mutate(stud_pov_rate = Estimated_relevant_5_17_in_poverty/Estimated_Pop_5_17,
         year = "2018") %>%
  select(Postal, district_id, Name, Estimated_Total_Pop, Estimated_Pop_5_17, stud_pov_rate)

Things to notice:

  • Add a pipe operator to perform a new type of transformation on your data frame
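select() can also drop columns by putting a minus sign in front of their names, which is handy when you want to keep almost everything. A sketch on made-up data (the `notes` column is hypothetical):

```r
library(dplyr)

toy <- data.frame(district = c("A", "B"),
                  tpop = c(1000, 2000),
                  notes = c("x", "y"))

# keep columns by naming them
keep_named <- toy %>%
  select(district, tpop)

# or drop a column with a minus sign
drop_named <- toy %>%
  select(-notes)

names(keep_named)
names(drop_named)
```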



filter() - subset observations (rows) by their values

Remove rows with fewer than 100 children

Type View(stpov18) in your console to view your data frame and see how many districts have fewer than 100 children

# then, add to the previous code chunk to remove districts with fewer than 100 children
# (this chunk also combines FIPS and district_id into a single district_id)
stpov18 <- raw_stpov18 %>%
  mutate(stud_pov_rate = Estimated_relevant_5_17_in_poverty/Estimated_Pop_5_17,
         year = "2018",
         district_id = paste0(FIPS, district_id)) %>%
  select(Postal, district_id, Name, Estimated_Total_Pop, Estimated_Pop_5_17, stud_pov_rate, year) %>%
  filter(Estimated_Pop_5_17 >= 100)

Things to notice:

  • Filter does the same thing as the base R subset() we used in homework 1
  • You can filter on a character variable too
  • You can have multiple arguments in a filter
  • Use “&” if every expression must be true for a row to be included in the output
  • Use “|” (OR) if any expression can be true for a row to be included in the output
  • “!” means not
# examples of other filters

# filter based on text value
nyc <- stpov18 %>% 
  filter(Name == "New York City Department Of Education")

# remove new york city
ny_no_nyc <- stpov18 %>% 
  filter(Name != "New York City Department Of Education")

# remove all districts with more than 10,000 people
ny_no_large_districts <- stpov18 %>% 
  filter(Estimated_Total_Pop <= 10000)

# keep only districts with more than 500 and at most 10,000 people
ny_medium_districts <- stpov18 %>% 
  filter(Estimated_Total_Pop <= 10000 & Estimated_Total_Pop > 500)

# remove these test data frames from your environment
rm(nyc)
rm(ny_no_nyc)
rm(ny_no_large_districts)
rm(ny_medium_districts)

Things to notice:

  • rm() removes an object, such as a data frame, from your environment
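The examples above use “&”; the sketch below shows “|” and “!” on made-up data, with hypothetical district names and populations.

```r
library(dplyr)

toy <- data.frame(district = c("A", "B", "C", "D"),
                  tpop = c(300, 800, 12000, 5000))

# OR: keep districts that are very small or very large
small_or_large <- toy %>%
  filter(tpop <= 500 | tpop > 10000)

# NOT: keep districts that are not very small
not_small <- toy %>%
  filter(!(tpop <= 500))
```

Here small_or_large keeps districts A and C, and not_small keeps B, C, and D.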



rename() - rename your columns

# add to the previous code chunk to give your variables shorter names
stpov18 <- raw_stpov18 %>%
  mutate(stud_pov_rate = Estimated_relevant_5_17_in_poverty/Estimated_Pop_5_17,
         year = "2018",
         district_id = paste0(FIPS, district_id)) %>%
  select(Postal, district_id, Name, Estimated_Total_Pop, Estimated_Pop_5_17, Estimated_relevant_5_17_in_poverty, stud_pov_rate, year) %>%
  filter(Estimated_Pop_5_17 >= 100) %>%
  rename(id = district_id,
         district = Name,
         tpop = Estimated_Total_Pop,
         stpop = Estimated_Pop_5_17,
         stpov = Estimated_relevant_5_17_in_poverty,
         stpovrate = stud_pov_rate)


# look at your table to make sure you got it right

Things to notice:

  • When you rename, new_name = old_name
  • Never use spaces or dashes in your column names
  • If you are renaming multiple variables, separate them by a comma and new line for readability
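A small sketch of the new_name = old_name order on a made-up data frame that borrows two column names from the raw data:

```r
library(dplyr)

toy <- data.frame(Name = c("A", "B"),
                  Estimated_Total_Pop = c(1000, 2000))

# new name on the left, old name on the right
renamed <- toy %>%
  rename(district = Name,
         tpop = Estimated_Total_Pop)

names(renamed)
```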



Calculate some quick summary statistics

summary(stpov18)



summarise() - create a data frame of summary statistics that you define

# calculate the student poverty rate in New York and the average student poverty rate of school districts in New York

ny_pov_stats <- stpov18 %>%
  summarise(districts = n(),
            kids = sum(stpop),
            kids_in_pov = sum(stpov),
            stud_povrate = round(kids_in_pov/kids, 3),
            mean_sd_stpovrate = round(mean(stpovrate), 3),
            max_sd_stpovrate = round(max(stpovrate), 3),
            min_sd_stpovrate = round(min(stpovrate), 3))

Things to notice:

  • Use summarise (with an s!) to calculate summary statistics for the entire data frame
  • Put each summary statistic on a new line, separated by commas
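The overall poverty rate and the average of the district rates are different statistics, and they can be far apart when districts differ in size. A sketch with two made-up districts (the numbers are invented to make the gap obvious):

```r
library(dplyr)

# one small high-poverty district and one large low-poverty district
toy <- data.frame(stpop = c(100, 900),
                  stpov = c(50, 90)) %>%
  mutate(stpovrate = stpov / stpop)

toy_stats <- toy %>%
  summarise(kids = sum(stpop),
            kids_in_pov = sum(stpov),
            stud_povrate = kids_in_pov / kids,     # all kids pooled together
            mean_sd_stpovrate = mean(stpovrate))   # each district counts equally
```

In this toy case the pooled rate is 0.14, while the mean of the two district rates is 0.3, because the small district counts as much as the large one in the mean.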



arrange() - reorder the rows

Use arrange() if you want to permanently change the order of the rows in your data frame

# order the processed data frame by student poverty rate, lowest to highest
stpov18 <- stpov18 %>%
  arrange(stpovrate)
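arrange() sorts from lowest to highest by default; wrapping the column in desc() reverses the order. A sketch on made-up districts:

```r
library(dplyr)

toy <- data.frame(district = c("A", "B", "C"),
                  stpovrate = c(0.25, 0.05, 0.40))

# lowest rate first (the default)
lowest_first <- toy %>%
  arrange(stpovrate)

# highest rate first
highest_first <- toy %>%
  arrange(desc(stpovrate))
```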



Questions:

What district has the highest student-age poverty rate?
What district has the lowest student-age poverty rate?


Write out your csvs

Save the processed data frame and the summary stats data frame to your computer

# write out processed student poverty data for 2018
write_csv(stpov18, "data/output/ny_student_poverty_rate_2018.csv")

# write out poverty rate stats for 2018
write_csv(ny_pov_stats, "data/output/student_poverty_state_stats.csv")
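write_csv() needs the destination folder to exist already, so if data/output is missing the calls above will fail. The sketch below shows one way to create a folder first; it writes a made-up data frame to a temporary folder (an assumption for illustration) so that it does not touch your project files.

```r
library(readr)

# hypothetical output folder inside R's temporary directory
out_dir <- file.path(tempdir(), "output")

# create the folder if it does not exist yet
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

toy <- data.frame(district = c("A", "B"),
                  stpovrate = c(0.25, 0.1))

out_file <- file.path(out_dir, "toy_poverty.csv")
write_csv(toy, out_file)

# read the file back in to check that it was written
read_back <- read_csv(out_file, col_types = cols())
```

In your own script, the equivalent would be calling dir.create("data/output") once before writing.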



Getting help

See slides on how to get help for R


Homework


Readings

The discussion in our next session will be based around access to democracy. Please read and be prepared to discuss the following reading:

Chapter 6, “Never a Real Democracy,” of The Sum of Us by Heather McGhee

For additional context and information on R, review Chapters 4 and 5 of R for Data Science by Hadley Wickham and Garrett Grolemund


R Assignments

2b. Use the 2018 student poverty processing script to process the same dataset for another year.

  • Save your 2018 script as new_york_student_poverty_2019_your_name.R
  • Use it to import the 2019 student poverty data and process it in the same way
  • Create another data frame of New York districts with more than 500 children (variable = stpop) and fewer than 5000 children living in poverty (variable = stpov)
  • Answer the following questions in commented-out text at the bottom of your script:
    • Did the student poverty rate increase or decrease in New York in 2019?
    • What was the median Student Population in 2019?


2c. Import the 2019 ACS New York Poverty Data by County and create a data frame of the poverty rate

  • Create a new script called ny_county_poverty_rate_19_your_name.R
  • Write a comment at the top describing the purpose of the script
  • Load the tidyverse
  • Read in the raw poverty data (nhgis0042_ds244_20195_2019_county.csv), name it raw_county_pov19
  • Read the data dictionary documentation in the same folder as the data
  • select the following columns: STATE, COUNTY, GEOID, ALWVE001, ALWVE002, ALWVE003
  • filter to keep New York
  • rename ALWVE001 to total_pop
  • mutate to create a new variable poverty_rate where
    • poverty_rate = (ALWVE002 + ALWVE003)/total_pop
  • select the following columns: STATE, COUNTY, GEOID, total_pop, poverty_rate
  • write out your csv to the output folder

Upload both of your scripts to the assignment in Canvas