In today’s class we are going to work on one analysis together. This week’s class folder will contain all of the data, scripts and output and you will continue to work on it for the homework assignment. It will be a template for how to organize small research projects. We’ll use the two datasets you created for the homework assignment to explore economic segregation across New York.
In the process you’ll learn how to do lots of new things in R.
Download the class 3 folder and save them in your methods1 folder on your computer
Create new project from existing folder:
- methods1/class3_poverty_analysis
Install the readxl package
install.packages("readxl")
File paths review
In-class analysis overview
Homework review
Importing excel files
Joining data
Summary statistics
Simple visualizatoion
Readings discussion
Assignment 3
The specification of the list of folders to get to a file on your computer is called a path.
Using relative paths makes it easier to avoid mistakes and to share your script with others. If you need to use an absolute path to import data, they look different depending on your operating system. See examples below of the absolute path to a csv in my main data folder:
"C:\\Users\\sarahodges\\spatial\\Data\\tabular\\msa\\nhgis_fam_pov_2019\\nhgis0032_ds244_20195_2019_cbsa.csv"
file.path() function to convert the path to something that R can understand:
file.path("C:\\Users\\sarahodges\\spatial\\Data\\tabular\\msa\\nhgis_fam_pov_2019\\nhgis0032_ds244_20195_2019_cbsa.csv")getwd()
That means get working directory and R will print in your console what directory it assumes all of your relative file paths are starting from. Try it now. If you have created an R project called class3_poverty_analysis, it should be the file path to your class3_poverty_analysis folder.
- What is the difference between the student poverty rate in each school district and the poverty rate of the county as a whole? and the state as a whole?
- What counties have the most economic inequality, as measured by student poverty rate of school districts?
If you haven’t already, create new project from existing folder:
Process the data in individual scripts, and write them to the data/processed folder (you’ll read them into your analysis scripts later)
These scripts are in your class3_poverty_analysis folder. Open them for us to review as a group.
First, we need to know what county each school district is in. We’ll join our school district data to a dataset that identifies each school district’s county. And then use that column to join our school district data to our county data.
Create a new script, add an overview of the purpose of the script, and save it as 3_student_poverty_analysis.R
# Create student poverty analysis data frame
# + Read in school district poverty data frame
# + Read in county poverty data frame
# + Read in school district/county data frame
# + Read in state poverty stats data frame
# + Add county to student poverty data
# + Join county poverty data frame to school district data frame by county
# + Create new variables:
# + difference from county rate = school district poverty rate - county poverty rate
# + difference from state rate = school district poverty rate - statewide poverty rate
library(tidyverse)
library(readxl)
# Read in school district poverty data frame
sd_pov19 <- read_csv("data/processed/ny_student_poverty_rate_2019.csv")
# Read in county poverty data frame
county_pov19 <- read_csv("data/processed/county_pov_rate_2019.csv")
# Read in state poverty stats data frame
pov_stats <- read_csv("data/processed/student_poverty_state_stats.csv")
Use the ‘Import dataset’ window to import an excel file of school districts and counties
We will use the import dataset user interface to create the code to import, and then copy it into our script. That way the next time you run the script you are sure you import the data in the same way.
# Read in school district/county data frame
sd_county <- read_excel("data/raw/ELSI_excel_export.xlsx", skip = 6)
# Write out processed school district-county key
write_csv(sd_county, "data/processed/sd_county_key_19.csv")
Find the common variable in sd_pov19 and sd_county, and join them together
# Join sd_county to sd_pov19 to add COunty to school district data frame
sd_pov19 <- sd_pov19 %>%
left_join(sd_county, by = c("id" = "NCESID"))
# View the data frame to see if the join worked
# Count the NA's to determine how well the join worked - look at the important variables from each data frame to check those most carefully
colSums(is.na(sd_pov19))
Joins
So far we’ve processed individual datasets into individual data frames in R and written them to our computer. Often there are variables from other datasets that you want to add to your data frame to complete your analysis. To connect two data frames, they need a common key - a variable that is present and unique in both data frames. The graphic below is an inner join – the resulting data frame only includes rows where the common key matches.
You can keep the rows from either data frame, with a left_join() or right_join(). Or you can keep the rows from both data frames with a full_join().
In-class exercise
Open the joins_testing script to practice joining data.
Now we can use the same process to join the county poverty data to the school district data frame as well
# create analysis data frame of school district and county poverty
sd_county_pov <- sd_pov19 %>%
mutate(CONUM = as.numeric(CONUM)) %>%
left_join(county_pov19, by = "CONUM") %>%
# Explore the data frame, count NAs, generally make sure it worked, then continue
select(id, district, County, CONUM, tpop, stpov, stpovrate, county_pop, county_pov_rate) %>%
# Create new variables
# difference from county rate = school district poverty rate - county poverty rate
# difference from state rate = school district poverty rate - statewide poverty rate
mutate(pov_rate_diff_county = round(stpovrate - county_pov_rate, 3),
pov_rate_diff_state = round(stpovrate - pov_stats$stud_povrate, 3)) %>%
filter(!is.na(County)) # remove the 3 districts with no County for now - we will research these later
# View the data frame to see if the join worked
# Count the NA's to determine how well the join worked - look at the important variables from each data frame to check those most carefully
colSums(is.na(sd_county_pov))
# Explore data frame
# summary(sd_county_pov)
summary(sd_county_pov$pov_rate_diff_county)
summary(sd_county_pov$pov_rate_diff_state)
hist(sd_county_pov$pov_rate_diff_state)
hist(sd_county_pov$pov_rate_diff_county)
You can calculate summary statistics on a group of variables just like we did for the whole state using the function group_by()
# Explore through county summary statistics
county_stats <- sd_county_pov %>%
group_by(County) %>%
summarise(Districts = n(),
`County Poverty Rate` = round(first(county_pov_rate), 2),
`Average School District Poverty Rate` = round(mean(stpovrate), 2),
`Maximum Poverty Rate` = round(max(stpovrate), 2),
`Minimum Poverty Rate` = round(min(stpovrate), 2),
`Poverty Rate Range` = round(max(stpovrate) - min(stpovrate), 2),
`Average Poverty Rate Difference` = round(mean(pov_rate_diff_county), 2),
`Maximum Poverty Rate Difference` = round(max(pov_rate_diff_county), 2))
paste0()substr()read_excel()left_join() (and all the other joins)colSums()is.na()round()group_by()Learn more about each of these by googling them, or searching for them in the R Studio help by typing a question mark before the function name in your Console, ex: ?paste0.