Methods 1, Week 6

Download folder for class 5 data

Download the data in the week 6 data folder and save it in the part1/data/raw folder.

Open the part1 project.

Install the scales package

  • install.packages("scales"): for formatting text, such as axis labels

Outline

  • Visualization with ggplot

  • County, school district poverty analysis continues

  • Assignment 5

Visualization as part of analysis


Visualization is a tool:

  • to explore our datasets
  • check results
  • share with colleagues
  • share final analysis results

ggplot2

Tidyverse package for producing plots of your dataframe

Every ggplot has 3 required components:

  • data: the dataframe you want to visualize
  • aes: variables in the dataframe that you want to visualize
  • at least one layer that defines what type of plot you want to create
    • examples: points, bars, lines
  • you can add many more elements to make it look nicer

basic scatterplot with ggplot2

library(tidyverse)
# read in data from apportionment processing
apportion <- read_csv("data/processed/electoral_college_race.csv")

# create simple plot
ggplot(data = apportion, # data
       aes(x = percent_white, 
           y = pop_per_electoral_vote)) + # aesthetics
  geom_point() # layer

ggplot line example

A line graph doesn’t make sense for this data, but as an example:

  • the layer type determines how you display the data
ggplot(data = apportion, # data
       aes(x = percent_white, 
           y = pop_per_electoral_vote)) + # aesthetics
  geom_line() # line layer

Analysis plan

  • Create dataframe of poverty rate by county
  • Create dataframe of student poverty rate by school district
  • Calculate the statewide student poverty rate
  • Join the school district and county poverty dataframes to compare the poverty rates
  • Measure the difference between each school district's poverty rate and the rates of its county and the state
  • Use summary statistics to explore and gain understanding
  • Use visualizations to explore and gain understanding
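
As a rough illustration of the join-and-compare steps in this plan, here is a minimal sketch on toy data. The dataframes, column names (other than stpovrate) and numbers are hypothetical and are not taken from the class scripts.

```r
library(tidyverse)

# Hypothetical toy data; names and values are illustrative only
county_pov <- tibble(
  county = c("A County", "B County"),
  county_povrate = c(0.10, 0.20)
)

sd_pov <- tibble(
  district = c("District 1", "District 2", "District 3"),
  county = c("A County", "A County", "B County"),
  students = c(1000, 3000, 2000),
  students_in_poverty = c(50, 600, 500)
) |>
  mutate(stpovrate = students_in_poverty / students)

# Statewide rate: total students in poverty over total students
state_povrate <- sum(sd_pov$students_in_poverty) / sum(sd_pov$students)

# Join county rates onto districts and measure the differences
sd_compare <- sd_pov |>
  left_join(county_pov, by = "county") |>
  mutate(diff_county = stpovrate - county_povrate,
         diff_state = stpovrate - state_povrate)

sd_compare
```

Note that the statewide rate is computed from the totals, not by averaging the district rates, so large districts count for more.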

Analysis so far

We have 3 scripts for our analysis so far:

  • new_york_student_poverty_2019
  • ny_county_poverty_rate_19
  • analyze_ny_poverty

Visualization plan

Which counties have the most economic inequality, as measured by the student poverty rate of school districts?

  • Add school district enrollment data for context
  • Create a scatterplot to explore the county with the largest range in student poverty
  • Create scatterplots to explore other counties
  • Create scatterplots to explore the state as a whole

Visualization script

Create a new script visualize_poverty_analysis.R

  • add necessary packages
  • import data
  • join school district enrollment data
library(tidyverse)
library(scales)

### Import the summary data so we can look at it to pick the counties we want to focus on
county_stats <- read_csv("data/output/ny_county_poverty_stats.csv")

# import some extra school district data
sd_enroll <- read_csv("data/raw/ny_sd_enrollment_2019.csv")

# import the school district - county poverty data and join the school district data
sd_county_pov <- read_csv("data/processed/ny_sd_county_pov_data.csv") |> 
  left_join(sd_enroll, by = "id")
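
After a join it can help to check that no rows were added and to see which districts found no match. The sketch below uses small hypothetical stand-ins for the two files; only the "id" key and the denroll_district and stpovrate names come from the script.

```r
library(tidyverse)

# Hypothetical stand-ins for the poverty and enrollment data
pov_toy <- tibble(id = c("001", "002", "003"),
                  stpovrate = c(0.12, 0.30, 0.08))
enroll_toy <- tibble(id = c("001", "002"),
                     denroll_district = c(1500, 420))

joined <- pov_toy |>
  left_join(enroll_toy, by = "id")

# With unique ids on the right side, the row count should not change
nrow(joined) == nrow(pov_toy)

# Districts with no enrollment record (they get NA after the join)
unmatched <- pov_toy |>
  anti_join(enroll_toy, by = "id")
unmatched
```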

What county has the largest range in student poverty?
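
One possible way to answer this question from district-level data is to compute the range within each county and sort. This is a sketch on hypothetical toy data that borrows the COUNTY and stpovrate column names; the imported county_stats file may already contain an equivalent column.

```r
library(tidyverse)

# Hypothetical toy data; values are illustrative only
sd_toy <- tibble(
  COUNTY = c("A County", "A County", "A County", "B County", "B County"),
  stpovrate = c(0.05, 0.20, 0.40, 0.10, 0.15)
)

# Range = highest district rate minus lowest district rate, per county
county_range <- sd_toy |>
  group_by(COUNTY) |>
  summarise(districts = n(),
            min_stpov = min(stpovrate),
            max_stpov = max(stpovrate),
            range_stpov = max_stpov - min_stpov) |>
  arrange(desc(range_stpov))

county_range
```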

Orange County scatterplot, v1

### Create a scatterplot to explore the county with the largest range in student poverty 
ggplot(data = sd_county_pov |> 
         filter(COUNTY == "Orange County"), 
  aes(x = stpovrate, y = pct_bipoc)) +
  geom_point()

Orange County scatterplot, v2 code

  • Within the aesthetic mapping aes():

    • size the dots by enrollment
    • color by “urbanicity category”
ggplot(data = sd_county_pov |> 
         filter(COUNTY == "Orange County"), 
  aes(x = stpovrate, 
      y = pct_bipoc,
      size = denroll_district,
      color = urbanicity)) +
  geom_point() 

Orange County scatterplot, v2 plot

Orange County scatterplot, v3 code

  • Within geom_point() make the dots partially transparent with alpha = .65 (65% opaque)
ggplot(data = sd_county_pov |> 
         filter(COUNTY == "Orange County"), 
  aes(x = stpovrate, 
      y = pct_bipoc,
      size = denroll_district,
      color = urbanicity)) +
  geom_point(alpha = .65) 

Orange County scatterplot, v3 plot

Orange County scatterplot, v4 code

Format the axis labels as percent, with no decimal place

  • use scale_x_continuous() and scale_y_continuous() to format the x- and y-axes
  • percent_format() is a scales package function
  • accuracy = 1 rounds to a whole number
    • accuracy = .1 includes one decimal place
ggplot(data = sd_county_pov |> 
         filter(COUNTY == "Orange County"), 
  aes(x = stpovrate, 
      y = pct_bipoc,
      size = denroll_district,
      color = urbanicity)) +
  geom_point(alpha = .5) +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) 
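
To see what the accuracy argument does, the formatter can be called directly on a few numbers outside of a plot. A minimal sketch with made-up rates:

```r
library(scales)

# Made-up proportions for illustration
rates <- c(0.123, 0.5, 0.876)

# percent_format() returns a function; call it on the values to format them
whole <- percent_format(accuracy = 1)(rates)
one_decimal <- percent_format(accuracy = .1)(rates)

whole        # "12%" "50%" "88%"
one_decimal  # "12.3%" "50.0%" "87.6%"
```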

Orange County scatterplot, v4 plot

Orange County scatterplot, v5 code

Define a nice-looking title, caption, axis labels and legend labels within the labs() function

ggplot(data = sd_county_pov |> 
         filter(COUNTY == "Orange County"), 
  aes(x = stpovrate, 
      y = pct_bipoc,
      size = denroll_district,
      color = urbanicity)) +
  geom_point(alpha = .5) +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) + 
  labs(x = "Student Poverty Rate", y = "Percent BIPOC",
       title = "Racial Diversity and Student Poverty in Orange County School Districts",
       caption = "Sources: NCES, 2019 and SAIPE, 2019",
       size = "Enrollment",
       color = "Urbanicity") 

Orange County scatterplot, v5 plot

Orange County scatterplot, v6 code

Apply a theme, here theme_bw(), to add some standard styling

ggplot(data = sd_county_pov |> 
         filter(COUNTY == "Orange County"), 
  aes(x = stpovrate, 
      y = pct_bipoc,
      size = denroll_district,
      color = urbanicity)) +
  geom_point(alpha = .5) +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) + 
  labs(x = "Student Poverty Rate", y = "Percent BIPOC",
       title = "Racial Diversity and Student Poverty in Orange County School Districts",
       caption = "Sources: NCES, 2019 and SAIPE, 2019",
       size = "Enrollment",
       color = "Urbanicity") +
  theme_bw()

Orange County scatterplot, v6 plot

New York scatterplot code

Remove the filter to look at New York as a whole, and add scale_size_area(labels = comma) so the enrollment legend is formatted with commas

ggplot(data = sd_county_pov, 
  aes(x = stpovrate, 
      y = pct_bipoc,
      size = denroll_district,
      color = urbanicity)) +
  geom_point(alpha = .5) +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) + 
  scale_size_area(labels = comma) +
  labs(x = "Student Poverty Rate", y = "Percent BIPOC",
       title = "Racial Diversity and Student Poverty in New York School Districts",
       caption = "Sources: NCES, 2019 and SAIPE, 2019",
       size = "Enrollment",
       color = "Urbanicity") +
  theme_bw()
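
The comma formatter passed to the size scale comes from the scales package. As a quick sketch with made-up enrollment numbers, this is the kind of legend label it produces:

```r
library(scales)

# Made-up enrollment counts for illustration
enrollment <- c(420, 15000, 1000000)

enroll_labels <- comma(enrollment)
enroll_labels  # "420" "15,000" "1,000,000"
```

scale_size_area() also scales each dot so that its area is proportional to the value, with a value of zero drawn at zero size.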

New York scatterplot

New York scatterplot (no nyc) code

Remove New York City to see how the plot changes

ggplot(data = sd_county_pov |> 
         filter(district != "New York City Department Of Education"), 
  aes(x = stpovrate, 
      y = pct_bipoc,
      size = denroll_district,
      color = urbanicity)) +
  geom_point(alpha = .5) +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) + 
  scale_size_area(labels = comma) +
  labs(x = "Student Poverty Rate", y = "Percent BIPOC",
       title = "Racial Diversity and Student Poverty in New York School Districts",
       subtitle = "Excluding New York City",
       caption = "Sources: NCES, 2019 and SAIPE, 2019",
       size = "Enrollment",
       color = "Urbanicity") +
  theme_bw()

New York scatterplot (no nyc)

Save a plot

To save a plot:

  • first save it as an object
  • use ggsave() to write it to a file
ny_scatter <- ggplot(data = sd_county_pov |> 
         filter(district != "New York City Department Of Education"), 
  aes(x = stpovrate, 
      y = pct_bipoc,
      size = denroll_district,
      color = urbanicity)) +
  geom_point(alpha = .5) +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) + 
  scale_size_area(labels = comma) +
  labs(x = "Student Poverty Rate", y = "Percent BIPOC",
       title = "Racial Diversity and Student Poverty in New York School Districts",
       caption = "Sources: NCES, 2019 and SAIPE, 2019",
       size = "Enrollment",
       color = "Urbanicity") +
  theme_bw()

# example code to save the stored plot as a 5" by 7" .png file
ggsave("data/output/NewYork_school_district_poverty.png", #specify the file path/name/type
       plot = ny_scatter, # specify the ggplot object you stored
       units = "in", # specify the units for your image
       height = 5, width = 7) # specify the image dimensions
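
To try ggsave() without writing into the project folders, the same call can be pointed at a temporary file. This is a sketch with a hypothetical toy plot standing in for ny_scatter:

```r
library(tidyverse)

# Hypothetical toy plot standing in for ny_scatter
toy_plot <- ggplot(data = tibble(x = c(0.1, 0.2, 0.3), y = c(0.4, 0.6, 0.5)),
                   aes(x = x, y = y)) +
  geom_point()

# Write to a temporary file so nothing in the project folder is touched
out_file <- tempfile(fileext = ".png")

ggsave(out_file,
       plot = toy_plot,
       units = "in",
       height = 5, width = 7)

# Confirm the file was written
file.exists(out_file)
```

ggsave() picks the file type from the extension, so changing ".png" to ".pdf" would save a PDF.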

In-class

Create 3 scatterplots, one for each area numbered below, using the same dataset with these values:

  • x-axis: Student Poverty Rate
  • y-axis: Education Revenue Per Pupil
    • change the axis formatting from percent_format to dollar_format
  • size: Enrollment
  • color: Percent BIPOC
  1. Orange County
  2. Nassau County (this is Long Island, right next to Queens)
  3. All of New York
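
As a hint for the axis formatting, dollar_format() works the same way as percent_format(). A small sketch with made-up revenue values:

```r
library(scales)

# Made-up per-pupil revenue values for illustration
revenue <- c(15000, 23456.7)

# accuracy = 1 rounds to whole dollars
dollar_labels <- dollar_format(accuracy = 1)(revenue)
dollar_labels  # "$15,000" "$23,457"
```

In a plot, this would go in the y-axis scale in place of the percent formatter, for example scale_y_continuous(labels = dollar_format(accuracy = 1)).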

We’ll come together at the end of class to discuss what these show.

Homework 5b.

Use the visualization skills you learned today to create 3 plots to explore the New York County data from last week (county poverty, asthma hospitalization rates, lottery retailers, ATMs).

Follow your interests. Add more data if you like. On Canvas, upload your plots and a short paragraph describing what each plot shows.

Before class next week


Next week we’ll talk about the US Census and the tidycensus package.

  • tidycensus requires a Census API key: get it here.
  • Save the email you receive with your key, and we will install the key together in class next week