Methods 1, Week 6

Download folder for class 5 data

Download the week 6 data and save it in the part1/data/raw folder.

Open part1 project:

Install the scales package

install.packages("scales") # for formatting text

Outline

Visualization with ggplot
County, school district poverty analysis continues
Assignment 5

Visualization as part of analysis

Visualization is a tool:

to explore our datasets
check results
share with colleagues
share final analysis results

`ggplot2`

Tidyverse package for producing plots of your dataframe

Every ggplot has 3 required components:

data: the dataframe you want to visualize
aes: variables in the dataframe that you want to visualize
at least one layer that defines what type of plot you want to create
- examples: points, bars, lines
you can add many more elements to make it look nicer

basic scatterplot with ggplot2

library(tidyverse)
# read in data from apportionment processing
apportion <- read_csv("data/processed/electoral_college_race.csv")

# create simple plot
ggplot(data = apportion, # data
       aes(x = percent_white, 
           y = pop_per_electoral_vote)) + # aesthetics
  geom_point() # layer

ggplot line example

A line graph doesn’t make sense for this data, but as an example:

the layer type determines how you display the data

ggplot(data = apportion, # data
       aes(x = percent_white, 
           y = pop_per_electoral_vote)) + # aesthetics
  geom_line() # line layer
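
To make the "layer type" point concrete, here is a minimal sketch that builds the data and aesthetics once and then swaps the layer. The `toy` data frame and its values are made up for illustration and are not part of the course data.

```r
library(ggplot2)

# Toy data (hypothetical values, not the course dataset)
toy <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2.1, 3.9, 6.2, 7.8, 10.1)
)

# Data and aesthetics are shared; no layer has been added yet
base <- ggplot(data = toy, aes(x = x, y = y))

points_plot <- base + geom_point() # one dot per row
line_plot   <- base + geom_line()  # rows connected in order of x

# Layers can also be stacked on the same plot
both_plot <- base + geom_line() + geom_point()
```

Printing `points_plot`, `line_plot`, or `both_plot` in the console draws each version, so you can compare how the same data looks under different layers.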

Analysis plan

Create dataframe of poverty rate by county
Create dataframe of student poverty rate by school district
Calculate the statewide student poverty rate
Join the school district and county poverty dataframes to compare the poverty rates
Measure the difference in poverty rates between each school district and its county and the state
Use summary statistics to explore and gain understanding
Use visualizations to explore and gain understanding
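
As a rough sketch of the rate and difference steps in this plan, the code below uses a made-up three-district table. Every column name here except `stpovrate` is a hypothetical placeholder, and the numbers are invented; the actual course scripts may do this differently.

```r
library(dplyr)

# Toy district-level data (values and most column names are hypothetical)
districts <- tibble(
  district       = c("A", "B", "C"),
  county         = c("X", "X", "Y"),
  students       = c(1000, 3000, 2000),
  students_pov   = c(100, 900, 200),
  county_povrate = c(0.15, 0.15, 0.08)
)

# Statewide rate: total students in poverty divided by total students
state_povrate <- sum(districts$students_pov) / sum(districts$students)

gaps <- districts |>
  mutate(
    stpovrate   = students_pov / students,  # district student poverty rate
    diff_county = stpovrate - county_povrate, # gap to the district's county
    diff_state  = stpovrate - state_povrate   # gap to the state as a whole
  )
```

Summing the counts before dividing weights large districts more heavily than averaging the district rates would, which is usually what a statewide rate is meant to do.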

Analysis so far

We have 3 scripts for our analysis so far:

new_york_student_poverty_2019
ny_county_poverty_rate_19
analyze_ny_poverty

Visualization plan

What counties have the most economic inequality, as measured by the student poverty rate of school districts?

Add school district enrollment data for context
Create a scatterplot to explore the county with the largest range in student poverty
Create scatterplots to explore other counties
Create scatterplots to explore the state as a whole

Visualization script

Create a new script visualize_poverty_analysis.R

add necessary packages
import data
join school district enrollment data

library(tidyverse)
library(scales)

### Import the summary data so we can look at it to pick the counties we want to focus on
county_stats <- read_csv("data/output/ny_county_poverty_stats.csv")

# import some extra school district data
sd_enroll <- read_csv("data/raw/ny_sd_enrollment_2019.csv")

# import the school district - county poverty data and join the school district data
sd_county_pov <- read_csv("data/processed/ny_sd_county_pov_data.csv") |> 
  left_join(sd_enroll, by = "id")
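
After a join it is worth checking that nothing unexpected happened. The sketch below uses two small made-up tables in place of the real files; only the `id`, `stpovrate`, and `denroll_district` names come from the slides, and the values are invented.

```r
library(dplyr)

# Toy stand-ins for the two tables (hypothetical ids and values)
pov <- tibble(
  id        = c("01", "02", "03"),
  stpovrate = c(0.10, 0.25, 0.40)
)
enroll <- tibble(
  id               = c("01", "02", "04"),
  denroll_district = c(500, 1200, 800)
)

joined <- left_join(pov, enroll, by = "id")

# A left join keeps every row of the left table
# (the row count only grows if id repeats in the right table)
same_rows <- nrow(joined) == nrow(pov)

# Districts with no enrollment match get NA; anti_join lists them
unmatched <- anti_join(pov, enroll, by = "id")
```

If `unmatched` has rows, those districts will be dropped from any plot that maps size to enrollment, with a warning about missing values.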

What county has the largest range in student poverty?
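
The imported county summary presumably already contains the answer. If you wanted to recompute it from district-level rows, one possible approach is sketched below on made-up data; the county names and rates are invented for illustration.

```r
library(dplyr)

# Toy district rows (hypothetical counties and rates)
toy <- tibble(
  COUNTY    = c("County A", "County A", "County B", "County B"),
  stpovrate = c(0.05, 0.45, 0.10, 0.20)
)

county_range <- toy |>
  group_by(COUNTY) |>
  summarise(
    min_rate   = min(stpovrate, na.rm = TRUE),
    max_rate   = max(stpovrate, na.rm = TRUE),
    range_rate = max_rate - min_rate # spread between richest and poorest district
  ) |>
  arrange(desc(range_rate)) # largest range first
```

The first row of `county_range` is the county with the widest gap between its lowest and highest district poverty rates.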

Orange County scatterplot, v1

### Create a scatterplot to explore the county with the largest range in student poverty 
ggplot(data = sd_county_pov |> 
         filter(COUNTY == "Orange County"), 
  aes(x = stpovrate, y = pct_bipoc)) +
  geom_point()

Orange County scatterplot, v2 code

Within the aesthetic mapping aes():
- size the dots by enrollment
- color by “urbanicity category”

ggplot(data = sd_county_pov |> 
         filter(COUNTY == "Orange County"), 
  aes(x = stpovrate, 
      y = pct_bipoc,
      size = denroll_district,
      color = urbanicity)) +
  geom_point()

Orange County scatterplot, v2 plot

Orange County scatterplot, v3 code

Within geom_point(), make the dots partially transparent with alpha = .65 (65% opaque, 35% transparent)

ggplot(data = sd_county_pov |> 
         filter(COUNTY == "Orange County"), 
  aes(x = stpovrate, 
      y = pct_bipoc,
      size = denroll_district,
      color = urbanicity)) +
  geom_point(alpha = .65)
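
The difference between mapping inside aes() and setting a value inside the geom can be seen in a small sketch. The `toy` data frame and its columns are made up; `layer_data()` is a ggplot2 helper that returns the values a layer will actually draw.

```r
library(ggplot2)

# Toy data (hypothetical values)
toy <- data.frame(
  x      = c(1, 2, 3),
  y      = c(3, 1, 2),
  enroll = c(100, 400, 900),
  type   = c("City", "Suburb", "Rural")
)

p <- ggplot(toy, aes(x = x, y = y,
                     size = enroll,  # mapped: varies with the data
                     color = type)) + # mapped: varies with the data
  geom_point(alpha = .65)             # set: same value for every dot

# Inspect what the point layer will draw
built <- layer_data(p)
built[, c("colour", "size", "alpha")]
```

In `built`, colour and size differ by row because they are mapped to columns, while alpha is .65 everywhere because it was set as a constant.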

Orange County scatterplot, v3 plot

Orange County scatterplot, v4 code

Format the axis labels as percent, with no decimal place

use scale_x_continuous() to format the x-axis and scale_y_continuous() to format the y-axis
percent_format() is a scales package function
accuracy = 1 rounds to a whole number
- accuracy = .1 includes one decimal place

ggplot(data = sd_county_pov |> 
         filter(COUNTY == "Orange County"), 
  aes(x = stpovrate, 
      y = pct_bipoc,
      size = denroll_district,
      color = urbanicity)) +
  geom_point(alpha = .5) +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = percent_format(accuracy = 1))
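
To see what the formatter does on its own, you can call it directly on a number. This sketch assumes the rates are stored as proportions (0.256 rather than 25.6); if your columns hold whole percentages, the `scale = 1` variant at the end is one way to handle that.

```r
library(scales)

# percent_format() returns a function that turns numbers into labels
fmt_whole <- percent_format(accuracy = 1)  # whole percent
fmt_tenth <- percent_format(accuracy = .1) # one decimal place

fmt_whole(0.256) # "26%"
fmt_tenth(0.256) # "25.6%"

# If rates were stored as 25.6 instead of 0.256,
# scale = 1 skips the multiplication by 100
fmt_scaled <- percent_format(accuracy = 1, scale = 1)
fmt_scaled(25.6) # "26%"
```

Passing one of these functions to the `labels` argument of a scale applies it to every axis tick.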

Orange County scatterplot, v4 plot

Orange County scatterplot, v5 code

Define a nice-looking title, caption, axis labels, and legend labels within the labs() function

ggplot(data = sd_county_pov |> 
         filter(COUNTY == "Orange County"), 
  aes(x = stpovrate, 
      y = pct_bipoc,
      size = denroll_district,
      color = urbanicity)) +
  geom_point(alpha = .5) +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) + 
  labs(x = "Student Poverty Rate", y = "Percent BIPOC",
       title = "Racial Diversity and Student Poverty in Orange County School Districts",
       caption = "Sources: NCES, 2019 and SAIPE, 2019",
       size = "Enrollment",
       color = "Urbanicity")

Orange County scatterplot, v5 plot

Orange County scatterplot, v6 code

Add a theme to add some standard styling

see other themes

ggplot(data = sd_county_pov |> 
         filter(COUNTY == "Orange County"), 
  aes(x = stpovrate, 
      y = pct_bipoc,
      size = denroll_district,
      color = urbanicity)) +
  geom_point(alpha = .5) +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) + 
  labs(x = "Student Poverty Rate", y = "Percent BIPOC",
       title = "Racial Diversity and Student Poverty in Orange County School Districts",
       caption = "Sources: NCES, 2019 and SAIPE, 2019",
       size = "Enrollment",
       color = "Urbanicity") +
  theme_bw()
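
ggplot2 ships with several complete themes besides theme_bw(). A quick way to compare them is to store a plot once and add different themes to it, as in this sketch on made-up data.

```r
library(ggplot2)

# Toy data (hypothetical values)
toy <- data.frame(x = c(1, 2, 3), y = c(2, 4, 3))

p <- ggplot(toy, aes(x = x, y = y)) +
  geom_point()

p_bw      <- p + theme_bw()      # white background, gray grid, border
p_minimal <- p + theme_minimal() # no border or background
p_classic <- p + theme_classic() # axis lines, no grid

# base_size changes the text size for the whole plot
p_large <- p + theme_bw(base_size = 16)
```

Print each object to see the styling; the data and layers are identical across all four.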

Orange County scatterplot, v6 plot

New York scatterplot code

Remove the filter to look at New York as a whole

ggplot(data = sd_county_pov, 
  aes(x = stpovrate, 
      y = pct_bipoc,
      size = denroll_district,
      color = urbanicity)) +
  geom_point(alpha = .5) +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) + 
  scale_size_area(labels = comma) +
  labs(x = "Student Poverty Rate", y = "Percent BIPOC",
       title = "Racial Diversity and Student Poverty in New York School Districts",
       caption = "Sources: NCES, 2019 and SAIPE, 2019",
       size = "Enrollment",
       color = "Urbanicity") +
  theme_bw()
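
The statewide plot adds scale_size_area(labels = comma). As a small sketch of what that line contributes: `comma` is a scales function that adds thousands separators to the legend labels, and scale_size_area() maps a value of 0 to a size of 0. The enrollment figures below are invented.

```r
library(ggplot2)
library(scales)

# comma() formats large numbers for display
comma(1234567) # "1,234,567"

# Toy data with a very wide enrollment range (hypothetical values)
toy <- data.frame(
  x      = c(1, 2, 3),
  y      = c(2, 1, 3),
  enroll = c(500, 20000, 1000000)
)

p <- ggplot(toy, aes(x = x, y = y, size = enroll)) +
  geom_point(alpha = .5) +
  scale_size_area(labels = comma) # legend shows 250,000 style labels
```

Without `labels = comma`, large legend values may be printed in scientific notation, which is harder to read.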

New York scatterplot

New York scatterplot (no nyc) code

Remove New York City to see how it changes

ggplot(data = sd_county_pov |> 
         filter(district != "New York City Department Of Education"), 
  aes(x = stpovrate, 
      y = pct_bipoc,
      size = denroll_district,
      color = urbanicity)) +
  geom_point(alpha = .5) +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) + 
  scale_size_area(labels = comma) +
  labs(x = "Student Poverty Rate", y = "Percent BIPOC",
       title = "Racial Diversity and Student Poverty in New York School Districts",
       subtitle = "Excluding New York City",
       caption = "Sources: NCES, 2019 and SAIPE, 2019",
       size = "Enrollment",
       color = "Urbanicity") +
  theme_bw()
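
Filtering on an exact name is easy to get wrong, because the match is case-sensitive (note the capital "Of" in the district name). A quick sanity check, sketched here on made-up rows, is to count how many rows the filter removes.

```r
library(dplyr)

# Toy rows; the enrollment numbers are hypothetical
toy <- tibble(
  district = c("New York City Department Of Education",
               "District B",
               "District C"),
  denroll_district = c(900000, 5000, 2000)
)

no_nyc <- toy |>
  filter(district != "New York City Department Of Education")

# Exactly one row should drop out; zero means the name did not match
rows_removed <- nrow(toy) - nrow(no_nyc)
rows_removed
```

Keep in mind that filter() also drops rows where the condition is NA, so districts with a missing name would disappear as well.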

New York scatterplot (no nyc)

Save a plot

To save a plot:

first save it as an object
use ggsave to save it

ny_scatter <- ggplot(data = sd_county_pov |> 
         filter(district != "New York City Department Of Education"), 
  aes(x = stpovrate, 
      y = pct_bipoc,
      size = denroll_district,
      color = urbanicity)) +
  geom_point(alpha = .5) +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) + 
  scale_size_area(labels = comma) +
  labs(x = "Student Poverty Rate", y = "Percent BIPOC",
       title = "Racial Diversity and Student Poverty in New York School Districts",
       caption = "Sources: NCES, 2019 and SAIPE, 2019",
       size = "Enrollment",
       color = "Urbanicity") +
  theme_bw()

# example code to save the stored plot as a 7" wide by 5" tall .png file
ggsave("data/output/NewYork_school_district_poverty.png", #specify the file path/name/type
       plot = ny_scatter, # specify the ggplot object you stored
       units = "in", # specify the units for your image
       height = 5, width = 7) # specify the image dimensions
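
If you want to try ggsave() without touching the project folders, this sketch writes a toy plot to a temporary file and confirms the file was created. The plot, file name, and `dpi` value are illustrative choices, not part of the course script.

```r
library(ggplot2)

# Toy plot (hypothetical values)
toy <- data.frame(x = c(1, 2, 3), y = c(2, 4, 3))
toy_plot <- ggplot(toy, aes(x = x, y = y)) +
  geom_point()

# Write to a temporary location instead of data/output
out_file <- file.path(tempdir(), "toy_plot.png")

ggsave(out_file,
       plot = toy_plot, # the stored ggplot object
       units = "in",
       height = 5, width = 7,
       dpi = 300)       # resolution; 300 is the ggsave default

file.exists(out_file)
```

The file type is taken from the extension, so changing ".png" to ".pdf" in the file name saves a PDF instead.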

In-class

Create 3 scatterplots using the same dataset with these values:

x-axis: Student Poverty Rate
y-axis: Education Revenue Per Pupil
- change the axis formatting from percent_format to dollar_format
size: Enrollment
color: Percent BIPOC

Orange County
Nassau County (this is Long Island, right next to Queens)
All of New York
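
For the axis formatting change, dollar_format() works the same way as percent_format(): it returns a function that turns numbers into labels. A minimal sketch with invented revenue figures:

```r
library(scales)

fmt_dollar <- dollar_format()
fmt_dollar(12345) # "$12,345"

# accuracy = 1 rounds to whole dollars
fmt_whole_dollar <- dollar_format(accuracy = 1)
fmt_whole_dollar(12345.67) # "$12,346"
```

You would pass it to the scale for the revenue axis, for example `scale_y_continuous(labels = dollar_format())`.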

We’ll come together at the end of class to discuss what these show.

Homework 5b.

Use the visualization skills you learned today to create 3 plots to explore the New York County data from last week (county poverty, asthma hospitalization rates, lottery retailers, ATMs).

Follow your interest. Add more data if you desire. Upload your plots to Canvas with a short paragraph describing what each plot shows.

Before class next week

Next week we’ll talk about the US Census and the tidycensus package.

tidycensus requires a key: get it here.
Save the email that contains your key, and we will install the key together in class next week