In today’s class we’ll write scripts to begin to explore the desegregation datasets that we used in the homework assignment.
By the end of today’s class you’ll have your first analysis script.
Readings discussion
Research Journal discussion
Homework 1 overview and questions
File paths
Data exploration and processing with the tidyverse
Exporting data frames
Getting help
Assignment 2
- “What we choose to measure is a statement of what [who] we value”
Missing data
Minoritized
US Census race and ethnicity categories
The specification of the list of folders to get to a file on your computer is called a path.
Using relative paths makes it easier to avoid mistakes and to share your script with others. In this class, we will create an R project for each class that defines the folder that your relative paths start from. For each class, there will be a folder to download with the data for the in-class exercises and homework. You should download it and put it in your methods1 folder so it looks something like this:
Starting with class2, we’ll all try to have the same file structure. Don’t worry if your class1 folder looks different.
If you need to use an absolute path to import data, it will look different depending on your operating system. See examples below of the absolute path to a csv in my main data folder:
Use the file.path() function to convert the path to something that R can understand:
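As a rough illustration of the difference between relative and absolute paths, here is a minimal sketch using file.path(). The folder names in the absolute path (Documents, methods1, class2) are hypothetical and will differ on your computer.

```r
# Relative path: starts from the project folder (for example methods1/class2)
rel_path <- file.path("data", "ny_student_poverty_2018.csv")
print(rel_path)

# Absolute path: starts from the top of your file system.
# The folder names below are hypothetical; yours will differ.
abs_path <- file.path("~", "Documents", "methods1", "class2", "data",
                      "ny_student_poverty_2018.csv")
print(abs_path)

# file.exists() is a quick way to check a path before importing a file
file.exists(rel_path)
```

Because file.path() joins the pieces with a forward slash, the same line of code works on Mac and Windows.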
Download the class 2 folder and save it in your methods1 folder on your computer
Create new project from existing folder:
- methods1/class2
Create a new R script: File > New File > R Script
Save it in class2 as new_york_student_poverty_2018.R
At the top of your script, write a comment describing the purpose of this file:
# Processing and exploring the 2018 student-age poverty data from SAIPE for New York
# source = https://www.census.gov/data/datasets/2019/demo/saipe/2019-school-districts.html
First, load the tidyverse collection of packages into your environment
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.4 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Things to notice:
- Notice the conflicts message that’s printed when you load the tidyverse. Those are just fine. It tells you that dplyr overwrites some functions in base R. If you want to use the base version of these functions after loading dplyr, you’ll need to use their full names: stats::filter() and stats::lag().
- You only need to install a package once, but you must load it every time you start a new R session
- install.packages("package_name") installs the package on your computer
- use library("package_name") to load the package into the current R environment
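The notes above can be sketched with a small example. It uses only dplyr (one of the tidyverse packages) and a made-up vector x, so the numbers are purely illustrative.

```r
# install.packages("dplyr")  # run once per computer, then leave commented out
library(dplyr)               # run in every new R session

x <- c(1, 2, 3, 4, 5)
toy <- data.frame(x = x)

# After loading dplyr, filter() means dplyr::filter(): keep rows where x > 3
kept <- toy %>% filter(x > 3)

# The base R version is still available under its full name;
# here it computes a moving average over 3 values at a time
moving_avg <- stats::filter(x, rep(1 / 3, 3))
```

The two functions share a name but do unrelated jobs, which is why the conflicts message is worth reading once and then safe to ignore.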
We’ll learn five dplyr functions that are the backbone of data transformation in R
- mutate() - create or redefine a variable (column)
- select() - subset variables (columns) by their names
- filter() - subset observations (rows) by their values
- rename() - rename the variables
- summarise() - collapse many values down to a single summary
These, and other dplyr functions, can be used with group_by() which changes the scope of each function from operating on the full data frame, to operating on the data frame by group. (This will become more clear with examples)
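To make the idea of "by group" concrete, here is a minimal sketch on a made-up data frame. The county column and all of the numbers are hypothetical; they are not part of the class dataset.

```r
library(dplyr)

# hypothetical toy data: three districts in two counties
toy <- data.frame(
  county = c("A", "A", "B"),
  stpop  = c(100, 300, 200),  # school-age children
  stpov  = c(10, 60, 50)      # school-age children in poverty
)

# without group_by(): one row summarising the whole data frame
overall <- toy %>%
  summarise(districts = n(),
            kids = sum(stpop))

# with group_by(): one row per county
by_county <- toy %>%
  group_by(county) %>%
  summarise(districts = n(),
            kids = sum(stpop),
            kids_in_pov = sum(stpov),
            povrate = kids_in_pov / kids)
```

The summarise() call is the same idea in both cases; group_by() only changes how many rows come back.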
Import the dataset that we’ll explore and name it raw_stpov18 to indicate that this is the original form of the data
# import the 2018 SAIPE student poverty dataset for New York
raw_stpov18 <- read_csv("data/ny_student_poverty_2018.csv")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## Postal = col_character(),
## FIPS = col_double(),
## district_id = col_character(),
## Name = col_character(),
## Estimated_Total_Pop = col_double(),
## Estimated_Pop_5_17 = col_double(),
## Estimated_relevant_5_17_in_poverty = col_double()
## )
glimpse(raw_stpov18)
# create new data frame from raw_stpov18 to process the data
# create a new column that calculates the student poverty rate
stpov18 <- raw_stpov18 %>%
  mutate(stud_pov_rate = Estimated_relevant_5_17_in_poverty / Estimated_Pop_5_17)
Things to notice:
- The tidyverse uses a pipe operator “%>%” to string together multiple commands
- think of it as meaning “and then”
- you’ll eventually love the “%>%”
- keyboard shortcut: Cmd/Ctrl + Shift + M
- To create a new variable, use mutate(var_name = equation)
- you use the equal sign within dplyr functions
- Within dplyr functions, you refer to a column by its name alone; you don’t need the data frame name and dollar sign ($) before a column name
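The points above can be sketched on a made-up data frame. The column names pop and pov are hypothetical stand-ins for the longer names in the class dataset.

```r
library(dplyr)

toy <- data.frame(pop = c(200, 400), pov = c(20, 100))

# with the pipe: "take toy, and then add a rate column"
with_pipe <- toy %>%
  mutate(rate = pov / pop)

# the same call without the pipe: the data frame is the first argument
without_pipe <- mutate(toy, rate = pov / pop)

# the base R way needs the data frame name and $ before every column
base_way <- toy
base_way$rate <- base_way$pov / base_way$pop

# the piped and unpiped calls give the same result
identical(with_pipe, without_pipe)
```

The pipe does not change what mutate() does; it only changes how the code reads when several steps are chained.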
# create a new column that calculates the student poverty rate
# create a new 'year' column
stpov18 <- raw_stpov18 %>%
  mutate(stud_pov_rate = Estimated_relevant_5_17_in_poverty / Estimated_Pop_5_17,
         year = "2018")
Things to notice:
- You can create more than one new column in the same code chunk
- separate each new variable by a comma
- Use double quotes around the value if you want the new variable’s data type to be character
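A quick way to see the effect of the quotes is to create the same value both ways and check each column's type with class(). This is a small sketch on a made-up data frame; the column names year_chr and year_num are hypothetical.

```r
library(dplyr)

toy <- data.frame(district = c("A", "B"))

toy <- toy %>%
  mutate(year_chr = "2018",  # quoted: stored as text
         year_num = 2018)    # unquoted: stored as a number

class(toy$year_chr)
class(toy$year_num)
```

A single value given to mutate() is repeated for every row, which is why one "2018" fills the whole column.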
Our student poverty table has 9 columns; we may not need all of them.
Type names(stpov18) in your console to look at the names of the columns
# then, add to the previous code chunk to select the variables you want to keep
stpov18 <- raw_stpov18 %>%
  mutate(stud_pov_rate = Estimated_relevant_5_17_in_poverty / Estimated_Pop_5_17,
         year = "2018") %>%
  select(Postal, district_id, Name, Estimated_Total_Pop, Estimated_Pop_5_17, stud_pov_rate)
Things to notice:
- Add a pipe operator to perform a new type of transformation on your data frame
Remove rows with fewer than 100 children
Type View(stpov18) in your console to view your data frame and see how many districts have fewer than 100 children
# then, add to the previous code chunk to remove districts with fewer than 100 children
stpov18 <- raw_stpov18 %>%
  mutate(stud_pov_rate = Estimated_relevant_5_17_in_poverty / Estimated_Pop_5_17,
         year = "2018",
         # combine the state FIPS code and the district id into a single id
         district_id = paste0(FIPS, district_id)) %>%
  select(Postal, district_id, Name, Estimated_Total_Pop, Estimated_Pop_5_17, stud_pov_rate, year) %>%
  filter(Estimated_Pop_5_17 >= 100)
Things to notice:
- Filter does the same thing as the base R subset() we used in homework 1
- You can filter on a character variable too
- You can have multiple arguments in a filter
- Use “&” (AND) if every expression must be true for a row to be included in the output
- Use “|” (OR) if any expression can be true for a row to be included in the output
- “!” means not
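As a small sketch of how these operators behave, here is a made-up data frame of four districts. The names and numbers are hypothetical and chosen only so that each filter keeps a different set of rows.

```r
library(dplyr)

toy <- data.frame(
  district = c("A", "B", "C", "D"),
  tpop     = c(300, 5000, 20000, 800),
  stpop    = c(50, 900, 4000, 150)
)

# OR: keep very small or very large districts (A and C)
small_or_large <- toy %>%
  filter(tpop <= 500 | tpop > 10000)

# NOT: keep districts that do not have fewer than 100 children (B, C, D)
not_tiny <- toy %>%
  filter(!(stpop < 100))

# AND: a comma between expressions works the same way as "&" (B and D)
medium <- toy %>%
  filter(tpop > 500, tpop <= 10000)
```

Wrapping the expression after "!" in parentheses makes it clear what is being negated.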
# examples of other filters
# filter based on text value
nyc <- stpov18 %>%
  filter(Name == "New York City Department Of Education")
# remove new york city
ny_no_nyc <- stpov18 %>%
  filter(Name != "New York City Department Of Education")
# remove all districts with more than 10,000 people
ny_no_large_districts <- stpov18 %>%
  filter(Estimated_Total_Pop <= 10000)
# keep only districts with more than 500 people AND no more than 10,000 people
ny_medium_districts <- stpov18 %>%
  filter(Estimated_Total_Pop <= 10000 & Estimated_Total_Pop > 500)
# remove these test data frames from your environment
rm(nyc)
rm(ny_no_nyc)
rm(ny_no_large_districts)
rm(ny_medium_districts)
Things to notice:
- rm() removes a data frame from your environment
# add to the previous code chunk to give your variables shorter names
stpov18 <- raw_stpov18 %>%
  mutate(stud_pov_rate = Estimated_relevant_5_17_in_poverty / Estimated_Pop_5_17,
         year = "2018",
         district_id = paste0(FIPS, district_id)) %>%
  select(Postal, district_id, Name, Estimated_Total_Pop, Estimated_Pop_5_17, Estimated_relevant_5_17_in_poverty, stud_pov_rate, year) %>%
  filter(Estimated_Pop_5_17 >= 100) %>%
  rename(id = district_id,
         district = Name,
         tpop = Estimated_Total_Pop,
         stpop = Estimated_Pop_5_17,
         stpov = Estimated_relevant_5_17_in_poverty,
         stpovrate = stud_pov_rate)
# look at your table to make sure you got it right
Things to notice:
- When you rename, the pattern is new_name = old_name
- Never have spaces or dashes in your column names
- If you are renaming multiple variables, separate them with a comma and a new line for readability
summary(stpov18)
# calculate the student poverty rate in New York and the average student poverty rate of school districts in New York
ny_pov_stats <- stpov18 %>%
  summarise(districts = n(),
            kids = sum(stpop),
            kids_in_pov = sum(stpov),
            stud_povrate = round(kids_in_pov / kids, 3),
            mean_sd_stpovrate = round(mean(stpovrate), 3),
            max_sd_stpovrate = round(max(stpovrate), 3),
            min_sd_stpovrate = round(min(stpovrate), 3))
Things to notice:
- Use summarise (with an s!) to calculate summary statistics for the entire data frame
- Put a comma and a new line between each summary statistic
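The summary above computes two different things: the statewide poverty rate (all children in poverty divided by all children) and the average of the district rates. A minimal sketch with two made-up districts shows why these need not match; the numbers are hypothetical.

```r
library(dplyr)

# hypothetical toy data: one small high-poverty district, one large low-poverty district
toy <- data.frame(
  district = c("small", "large"),
  stpop    = c(100, 900),
  stpov    = c(50, 90)
)

toy_stats <- toy %>%
  summarise(kids = sum(stpop),
            kids_in_pov = sum(stpov),
            pooled_rate = kids_in_pov / kids,         # 140 / 1000
            mean_district_rate = mean(stpov / stpop)) # average of 0.5 and 0.1
```

The pooled rate weights every child equally, while the mean of district rates weights every district equally, so small districts count for more in the second number.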
Use arrange() if you want to permanently change the order of the rows in your data frame
# order the processed data frame by student poverty rate, lowest to highest
stpov18 <- stpov18 %>%
  arrange(stpovrate)
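By default arrange() sorts from lowest to highest. As a small sketch on a made-up data frame (hypothetical districts and rates), wrapping the column in desc() reverses the order:

```r
library(dplyr)

toy <- data.frame(
  district  = c("A", "B", "C"),
  stpovrate = c(0.20, 0.05, 0.40)
)

# lowest poverty rate first (the default)
low_to_high <- toy %>%
  arrange(stpovrate)

# highest poverty rate first
high_to_low <- toy %>%
  arrange(desc(stpovrate))
```

The new order only sticks if you assign the result to a name, as in the code above.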
Save the processed data frame and the summary stats data frame to your computer
# write out processed student poverty data for 2018
write_csv(stpov18, "data/output/ny_student_poverty_rate_2018.csv")

# write out poverty rate stats for 2018
write_csv(ny_pov_stats, "data/output/student_poverty_state_stats.csv")
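write_csv() writes to a folder that already exists, so it can help to create the output folder from your script first. The sketch below assumes this and uses a temporary folder and a made-up data frame so it can run anywhere; in your own script you would use your project's data/output folder instead.

```r
library(readr)

# hypothetical output folder inside R's temporary directory
out_dir <- file.path(tempdir(), "data", "output")

# create the folder (and any parent folders) if it is not there yet
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)

toy <- data.frame(id = c("a", "b"), rate = c(0.1, 0.2))
out_file <- file.path(out_dir, "toy_stats.csv")

write_csv(toy, out_file)

# read the file back in to confirm it was written as expected
check <- read.csv(out_file)
```

Reading the file back in is an easy habit for confirming that an export contains what you think it does.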
The discussion in our next session will be based around access to democracy. Please read and be prepared to discuss the following reading:
Chapter 6: Never a Real Democracy of The Sum of Us by Heather McGhee
For additional context and information on R, review Chapters 4 and 5 of R for Data Science by Hadley Wickham and Garrett Grolemund
2b. Use the 2018 student poverty processing script to process the same dataset for another year.
- new_york_student_poverty_2019_your_name.R
2c. Import the 2019 ACS New York Poverty Data by County and create a data frame of the poverty rate
- ny_county_poverty_rate_19_you_name.R
- raw_county_pov19
- select the following columns: STATE, COUNTY, GEOID, ALWVE001, ALWVE002, ALWVE003
- filter to keep New York
- rename ALWVE001 to total_pop
- mutate to create a new variable poverty_rate where
- select the following columns: STATE, COUNTY, GEOID, total_pop, poverty_rate
Upload both of your scripts to their assignment in Canvas