Week 3 Lesson

In today’s class we are going to work on one analysis together. This week’s class folder will contain all of the data, scripts and output and you will continue to work on it for the homework assignment. It will be a template for how to organize small research projects. We’ll use the two datasets you created for the homework assignment to explore economic segregation across New York.

In the process you’ll learn how to do lots of new things in R.

Download folder for class 3 and create a new project

Download the class 3 folder and save them in your methods1 folder on your computer

Create new project from existing folder:

methods1/class3_poverty_analysis

Install the readxl package

install.packages("readxl")

Outline

File paths review

In-class analysis overview

Homework review

Importing excel files

Joining data

Summary statistics

Simple visualizatoion

Readings discussion

Assignment 3

File paths and working directories

The specification of the list of folders to get to a file on your computer is called a path.

Absolute path: starts at the root folder of the computer
Relative path: starts at a given folder and provides the path starting from that folder

Using relative paths makes it easier to avoid mistakes and to share your script with others. If you need to use an absolute path to import data, they look different depending on your operating system. See examples below of the absolute path to a csv in my main data folder:

Mac “/Users/sarahodges/spatial/Data/tabular/msa/nhgis_fam_pov_2019/nhgis0032_ds244_20195_2019_cbsa.csv”
Windows "C:\\Users\\sarahodges\\spatial\\Data\\tabular\\msa\\nhgis_fam_pov_2019\\nhgis0032_ds244_20195_2019_cbsa.csv"
- File paths on Windows machines often have spaces and double backslashes which R doesn’t like
- If you have trouble importing data on a Windows machine use the file.path() function to convert the path to something that R can understand:
  - file.path("C:\\Users\\sarahodges\\spatial\\Data\\tabular\\msa\\nhgis_fam_pov_2019\\nhgis0032_ds244_20195_2019_cbsa.csv")

If you want to know what folder your R Studio session thinks you are in - called your working directory - type this in your console:

getwd()

That means get working directory and R will print in your console what directory it assumes all of your relative file paths are starting from. Try it now. If you have created an R project called class3_poverty_analysis, it should be the file path to your class3_poverty_analysis folder.

In-class Analysis

Explore the level of economic inequality in school districts across New York State.

Research Questions

What is the difference between the student poverty rate in each school district and the poverty rate of the county as a whole? and the state as a whole?

What counties have the most economic inequality, as measured by student poverty rate of school districts?

Analysis steps

Process the data

Create data frame of student poverty rate by school district
Calculate statewide poverty rate
Create data frame of poverty rate by county

Create your analysis data frame

Add county to student poverty data
Join county poverty data frame to school district data frame by county
Create new variables
- difference from county rate = school district poverty rate - county poverty rate
- difference from state rate = school district poverty rate - statewide poverty rate

Visualize and explore your analysis data frame

Calculate summary statistics

If you haven’t already, create new project from existing folder:

methods1/class3_poverty_analysis

Process the data

Process the data in individual scripts, and write them to the data/processed folder (you’ll read them into your analysis scripts later)

These scripts are in your class3_poverty_analysis folder. Open them for us to review as a group.

Script 1 - Create data frame of student poverty rate by school district
- Calculate statewide poverty rate
Script 2 - Create data frame of poverty rate by county

Create your analysis data frame

First, we need to know what county each school district is in. We’ll join our school district data to a dataset that identifies each school district’s county. And then use that column to join our school district data to our county data.

Script 3 - Create student poverty analysis data frame

Create a new script, add an overview of the purpose of the script, and save it as 3_student_poverty_analysis.R

# Create student poverty analysis data frame
    
   # + Read in school district poverty data frame
   # + Read in county poverty data frame
   # + Read in school district/county data frame
   # + Read in state poverty stats data frame
   # + Add county to student poverty data
   # + Join county poverty data frame to school district data frame by county
   # + Create new variables:
   #     + difference from county rate = school district poverty rate - county poverty rate
   #     + difference from state rate = school district poverty rate - statewide poverty rate

library(tidyverse)
library(readxl)

# Read in school district poverty data frame
sd_pov19 <- read_csv("data/processed/ny_student_poverty_rate_2019.csv")

# Read in county poverty data frame
county_pov19 <- read_csv("data/processed/county_pov_rate_2019.csv")

# Read in state poverty stats data frame
pov_stats <- read_csv("data/processed/student_poverty_state_stats.csv")

Importing excel files

Use the ‘Import dataset’ window to import an excel file of school districts and counties

We will use the import dataset user interface to create the code to import, and then copy it into our script. That way the next time you run the script you are sure you import the data in the same way.

# Read in school district/county data frame
sd_county <- read_excel("data/raw/ELSI_excel_export.xlsx", skip = 6) 

# Write out processed school district-county key
write_csv(sd_county, "data/processed/sd_county_key_19.csv")

Joining data

Find the common variable in sd_pov19 and sd_county, and join them together

Open each data frame and find the common variable
- ALWAYS try to avoid joing on a text column - capitalization and mispelling make this a dangerous way to join

# Join sd_county to sd_pov19 to add COunty to school district data frame
sd_pov19 <- sd_pov19 %>% 
  left_join(sd_county, by = c("id" = "NCESID"))

# View the data frame to see if the join worked
# Count the NA's to determine how well the join worked - look at the important variables from each data frame to check those most carefully
colSums(is.na(sd_pov19))

Joins

So far we’ve processed individual datasets into individual data frames in R and written them to our computer. Often there are variables from other datasets that you want to add to your data frame to complete your analysis. To connect two data frames, they need a common key - a variable that is present and unique in both data frames. The graphic below is an inner join – the resulting data frame only includes rows where the common key matches.

You can keep the rows from either data frame, with a left_join() or right_join(). Or you can keep the rows from both data frames with a full_join().

In-class exercise

Open the joins_testing script to practice joining data.

Now we can use the same process to join the county poverty data to the school district data frame as well

# create analysis data frame of school district and county poverty 
sd_county_pov <- sd_pov19 %>% 
  mutate(CONUM = as.numeric(CONUM)) %>% 
  left_join(county_pov19, by = "CONUM") %>% 
  # Explore the data frame, count NAs, generally make sure it worked, then continue
  select(id, district, County, CONUM, tpop, stpov, stpovrate, county_pop, county_pov_rate) %>% 
  # Create new variables
  # difference from county rate = school district poverty rate - county poverty rate
  # difference from state rate = school district poverty rate - statewide poverty rate
  mutate(pov_rate_diff_county = round(stpovrate - county_pov_rate, 3),
         pov_rate_diff_state = round(stpovrate - pov_stats$stud_povrate, 3)) %>% 
  filter(!is.na(County)) # remove the 3 districts with no County for now - we will research these later

# View the data frame to see if the join worked
# Count the NA's to determine how well the join worked - look at the important variables from each data frame to check those most carefully
colSums(is.na(sd_county_pov))

Visualize and explore your analysis data frame

# Explore data frame
# summary(sd_county_pov)
summary(sd_county_pov$pov_rate_diff_county)
summary(sd_county_pov$pov_rate_diff_state)

hist(sd_county_pov$pov_rate_diff_state)

hist(sd_county_pov$pov_rate_diff_county)

Calculate summary statistics

You can calculate summary statistics on a group of variables just like we did for the whole state using the function group_by()

# Explore through county summary statistics

county_stats <- sd_county_pov %>% 
  group_by(County) %>% 
  summarise(Districts = n(),
            `County Poverty Rate` = round(first(county_pov_rate), 2),
            `Average School District Poverty Rate` = round(mean(stpovrate), 2),
            `Maximum Poverty Rate` = round(max(stpovrate), 2),
            `Minimum Poverty Rate` = round(min(stpovrate), 2),
            `Poverty Rate Range` = round(max(stpovrate) - min(stpovrate), 2),
            `Average Poverty Rate Difference` = round(mean(pov_rate_diff_county), 2),
            `Maximum Poverty Rate Difference` = round(max(pov_rate_diff_county), 2))

Functions we learned in today’s class

paste0()
substr()
read_excel()
left_join() (and all the other joins)
colSums()
is.na()
round()
group_by()

Learn more about each of these by googling them, or searching for them in the R Studio help by typing a question mark before the function name in your Console, ex: ?paste0.