In today’s class we are going to continue working on the same analysis as last week, and we’ll keep working in the class 3 project and folder. We will explore the student poverty dataset that we created last week, and use plots and charts to do it.




Download folder for class 4 data and add the data to your class 3 folder

Download the class 4 folder and move the data into your class3/data/raw folder

Open project:

  • methods1/class3

Install the hrbrthemes, scales and viridis packages
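
If you’re not sure which of these packages you already have, something like the sketch below installs only the missing ones. The names pkgs, installed and missing_pkgs are just for illustration, and the interactive() check is my own assumption so that nothing gets installed when a script is run in the background.

```r
# packages needed for today's class
pkgs <- c("hrbrthemes", "scales", "viridis")

# which of them are not installed yet?
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]
missing_pkgs

# install only what is missing (and only when working interactively)
if (interactive() && length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}
```

You only need to install a package once per computer, but you need to load it with library() in every script that uses it.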




Outline

  • In-class analysis continues

  • Visualization

    • Defining colors in R
    • Iterating your way to beauty and understanding ggplot2

  • Homework review

  • Readings discussion

  • Assignment 4







Today we continue with the student poverty analysis for New York school districts.

In-class Analysis continues

Explore the level of economic inequality in school districts across New York State.


Research Questions
  • What is the difference between the student poverty rate in each school district and the poverty rate of the county as a whole? Of the state as a whole?
  • Which counties have the most economic inequality, as measured by the student poverty rates of their school districts?
Analysis steps
  • Process the data
    • Create data frame of student poverty rate by school district
    • Calculate statewide poverty rate
    • Create data frame of poverty rate by county
  • Create your analysis data frame
    • Add county to student poverty data
    • Join county poverty data frame to school district data frame by county
    • Create new variables
      • difference from county rate = school district poverty rate - county poverty rate
      • difference from state rate = school district poverty rate - statewide poverty rate
        TODAY
  • Calculate summary statistics
  • Visualize and explore your analysis data frame



We have 3 scripts for our analysis so far:

  • 1_process_school_district_data_2019
  • 2_process_county_data_2019
  • 3_student_poverty_analysis

We’ll pick up in 3_student_poverty_analysis where we left off:

First we’ll look at the county stats and then perform some quick stats to figure out our next analysis steps

# What is the range of student poverty rates in counties across New York
hist(county_stats$`Poverty Rate Range`)

### select the min max and median range for school districts
summary(county_stats$`Poverty Rate Range`)

Now that I see the county data, I realize that I want some more data about the school districts to get a fuller picture of the economic inequality of counties in New York. I’ll import some enrollment data from the National Center for Education Statistics to join to my poverty dataset.

### Import some extra data for school districts to use to explore
sd_enroll <- read_csv("data/raw/ny_sd_enrollment.csv")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   id = col_double(),
##   urbanicity = col_character(),
##   denroll_district = col_double(),
##   dwhite = col_double(),
##   dblack = col_double(),
##   dhispanic = col_double(),
##   dasian_pi = col_double(),
##   dhawaiian_pi = col_double(),
##   damindian_aknative = col_double(),
##   d2plus_races = col_double(),
##   pct_bipoc = col_double(),
##   median_propert_value = col_double()
## )
# Join to school district data
sd_data <- sd_county_pov %>% 
  left_join(sd_enroll, by = "id")

### Check for na
colSums(is.na(sd_data))

#write out the data
write_csv(sd_data, "data/output/NY_stpov_analysis_data.csv")
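
To see why the NA check matters after a join, here is a small sketch with made-up numbers. The pov and enroll data frames are toy stand-ins, not the real files. It also shows what happens if you join the same table twice: dplyr keeps both copies of any shared column and adds .x and .y suffixes.

```r
library(dplyr)

# toy stand-ins for the real data (values are made up)
pov <- tibble(id = c(1, 2, 3),
              stpovrate = c(0.10, 0.25, 0.18))
enroll <- tibble(id = c(1, 2),
                 denroll_district = c(1200, 450))

# left_join keeps every row of pov; district 3 has no match in enroll
joined <- pov %>% 
  left_join(enroll, by = "id")

# count missing values in each column
na_counts <- colSums(is.na(joined))
na_counts

# joining the same table again duplicates the shared column
# with .x and .y suffixes, so join each table only once
rejoined <- joined %>% 
  left_join(enroll, by = "id")
names(rejoined)
```

If you see suffixed columns in your own data, it is usually a sign that a join was run on a data frame that already contained those columns.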






Visualization

Defining colors in R

We’re going to begin to learn how to create visualizations in R today. There are lots of different ways to name a color. The two we’ll use most are hex colors and predefined color names.

Hex colors are a 6-digit way to represent a color that is common in web development. You can pick colors and find their hex codes here:

R also recognizes hundreds of predefined color names like “blue” and “ivory”. You can find a full list of them here:
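
You can also explore both ways of naming a color from the R console. This is a minimal sketch using base R functions; blue_rgb and blue_hex are just illustrative names.

```r
# the predefined color names R knows about
length(colors())
head(colors())

# check whether a name is valid
"ivory" %in% colors()

# convert a color name to its hex code
blue_rgb <- col2rgb("blue")    # red, green, blue values from 0 to 255
blue_hex <- rgb(t(blue_rgb), maxColorValue = 255)
blue_hex                       # "#0000FF"
```

Anywhere R asks for a color, you can use either form, so "blue" and "#0000FF" give the same result.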







ggplot

Iterating your way to beauty with ggplot2

Looking at ggplot2 code for production-ready graphics can be intimidating, but understanding the process can dramatically reduce plot-related panic.

Developing ggplot2 graphics should generally follow a simple, iterative workflow:

  1. Create a basic plot
  2. Address any missing values
  3. Clean up formatting of chart elements
  4. Add a new layer of data
  5. Tidy up formatting
  6. Repeat steps 4-5 as needed
  7. Save output



Create a new script, 4_visualize_poverty_analysis.R, to explore the data and create visualizations

library(tidyverse)
library(scales)
library(viridis)

### Questions we want to answer

# What is the range of student poverty rates in counties across New York
# What does the county with the least economic inequality look like?
# What does the county with the most economic inequality look like?
# What does the county with the median economic inequality look like?

### Import the data
county_stats <- read_csv("data/output/NY_county_poverty_stats.csv")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   County = col_character(),
##   Districts = col_double(),
##   `County Poverty Rate` = col_double(),
##   `Average School District Poverty Rate` = col_double(),
##   `Maximum Poverty Rate` = col_double(),
##   `Minimum Poverty Rate` = col_double(),
##   `Poverty Rate Range` = col_double(),
##   `Average Poverty Rate Difference` = col_double(),
##   `Maximum Poverty Rate Difference` = col_double()
## )
sd_data <- read_csv("data/output/NY_stpov_analysis_data.csv")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   .default = col_double(),
##   district = col_character(),
##   County = col_character(),
##   urbanicity.x = col_character(),
##   urbanicity.y = col_character(),
##   urbanicity.x.x = col_character(),
##   urbanicity.y.y = col_character(),
##   urbanicity.x.x.x = col_character(),
##   urbanicity.y.y.y = col_character(),
##   urbanicity.x.x.x.x = col_character(),
##   urbanicity.y.y.y.y = col_character(),
##   urbanicity = col_character()
## )
## ℹ Use `spec()` for the full column specifications.
## First select only the counties with at least 5 school districts
county_stats_analysis <- county_stats %>% 
  filter(Districts >= 5) # keep counties with at least 5 districts

# What is the range of student poverty rates in counties across New York
hist(county_stats$`Poverty Rate Range`)

### select the min max and median range for school districts
summary(county_stats$`Poverty Rate Range`)
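
summary() gives us the minimum, median and maximum values, but not which counties they belong to. One way to look those up is sketched here with a made-up data frame. The county names and numbers in toy_stats are placeholders, and the median lookup assumes an odd number of rows so that one county sits exactly at the median.

```r
library(dplyr)

# toy stand-in for county_stats (values are made up)
toy_stats <- tibble(County = c("A County", "B County", "C County"),
                    `Poverty Rate Range` = c(0.05, 0.30, 0.12))

# county with the largest range
most <- toy_stats %>% 
  slice_max(`Poverty Rate Range`, n = 1)

# county with the smallest range
least <- toy_stats %>% 
  slice_min(`Poverty Rate Range`, n = 1)

# county sitting at the median range
middle <- toy_stats %>% 
  filter(`Poverty Rate Range` == median(`Poverty Rate Range`))

most$County
least$County
middle$County
```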

Step 1: Create a basic plot

# first plot
max_plot <- sd_data %>% 
  filter(County == "Suffolk County") %>% 
  ggplot(aes(x = stpovrate, y = pct_bipoc)) +
  geom_point()
max_plot
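
The second step in the workflow, addressing missing values, is sketched here with a small made-up data frame, since I’m not assuming anything about where the missing values in sd_data are. The toy, toy_complete and toy_plot names are just for illustration.

```r
library(dplyr)    # both packages are part of the tidyverse
library(ggplot2)

# toy data with one missing value (numbers are made up)
toy <- tibble(stpovrate = c(0.10, 0.25, NA, 0.18),
              pct_bipoc = c(0.30, 0.55, 0.40, 0.20))

# how many values are missing in each column?
colSums(is.na(toy))

# keep only the rows that have both variables before plotting
toy_complete <- toy %>% 
  filter(!is.na(stpovrate), !is.na(pct_bipoc))

toy_plot <- toy_complete %>% 
  ggplot(aes(x = stpovrate, y = pct_bipoc)) +
  geom_point()
toy_plot
```

If you skip the filter, ggplot2 still draws the plot but warns you that it removed rows with missing values. Filtering first makes that choice explicit.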

Step 3: Clean up formatting of chart elements

Now, let’s take care of some formatting issues. Our axes don’t look great - the decimals ought to be percentages. Here, the scales package provides some help.

# format axes

max_plot <- sd_data %>% 
  filter(County == "Suffolk County") %>% 
  ggplot(aes(x = stpovrate, y = pct_bipoc)) +
  geom_point() +
  # make sure you have the `scales` package loaded!
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) 
max_plot
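
It can help to see what percent_format() does on its own, outside of a plot. This is a small sketch; as_percent is just an illustrative name.

```r
library(scales)

# percent_format() returns a function that turns decimals into labels
as_percent <- percent_format(accuracy = 1)
as_percent(c(0.05, 0.256, 1))           # "5%" "26%" "100%"

# accuracy controls rounding: 0.1 keeps one decimal place
one_decimal <- percent_format(accuracy = 0.1)(0.256)
one_decimal                             # "25.6%"
```

ggplot2 calls this function on the axis breaks, which is why we pass it to the labels argument.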

Next, we should add some labels to our axes that make sense, along with a title for our plot and a caption that details our data sources.

# add labels

max_plot <- sd_data %>% 
  filter(County == "Suffolk County") %>% 
  ggplot(aes(x = stpovrate, y = pct_bipoc)) +
  geom_point() +
  # make sure you have the `scales` package loaded!
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) + 
  labs(x = "Student Poverty Rate", y = "Percent BIPOC",
       title = "Racial Diversity and Student Poverty in Suffolk County School Districts",
       caption = "Sources: NCES, 2019 and SAIPE, 2019")
max_plot

Themes can be used to change the appearance of elements in your plot. There are many stock options, but I prefer theme_bw() for its clean appearance and helpful and unobtrusive gridlines.

# change theme
max_plot <- sd_data %>% 
  filter(County == "Suffolk County") %>% 
  ggplot(aes(x = stpovrate, y = pct_bipoc)) +
  geom_point() +
  # make sure you have the `scales` package loaded!
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) + 
  labs(x = "Student Poverty Rate", y = "Percent BIPOC",
       title = "Racial Diversity and Student Poverty in Suffolk County School Districts",
       caption = "Sources: NCES, 2019 and SAIPE, 2019") +
  theme_bw()

max_plot

Step 4: Add a new layer of data

Now that we have a decent-looking graph, let’s add in a new data element to vary point size by enrollment, and then make the points semi-transparent.

# add size element

max_plot <- sd_data %>% 
  filter(County == "Suffolk County") %>% 
  ggplot(aes(x = stpovrate, y = pct_bipoc, size = denroll_district)) +
  geom_point() +
  # make sure you have the `scales` package loaded!
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) + 
  labs(x = "Student Poverty Rate", y = "Percent BIPOC",
       title = "Racial Diversity and Student Poverty in Suffolk County School Districts",
       caption = "Sources: NCES, 2019 and SAIPE, 2019") +
  theme_bw()

max_plot

We see some overlap in the points. Reducing the opacity of the points can be accomplished by setting the alpha parameter in geom_point() to a value less than 1. Setting it to .5 will make data points 50% translucent.

# make points semi-transparent

max_plot <- sd_data %>% 
  filter(County == "Suffolk County") %>% 
  ggplot(aes(x = stpovrate, y = pct_bipoc, size = denroll_district)) +
  geom_point(alpha = .5) +
  # make sure you have the `scales` package loaded!
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) + 
  labs(x = "Student Poverty Rate", y = "Percent BIPOC",
       title = "Racial Diversity and Student Poverty in Suffolk County School Districts",
       caption = "Sources: NCES, 2019 and SAIPE, 2019") +
  theme_bw()

max_plot

Step 5: Tidy up formatting

Adding a new variable for size creates a legend. We need to tidy the legend’s labels and the title.

# clean up legend

max_plot <- sd_data %>% 
  filter(County == "Suffolk County") %>% 
  ggplot(aes(x = stpovrate, y = pct_bipoc, size = denroll_district)) +
  geom_point(alpha = .5) +
  # make sure you have the `scales` package loaded!
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) + 
  # change legend label formatting
  scale_size(labels = comma) +
  labs(x = "Student Poverty Rate", y = "Percent BIPOC",
       title = "Racial Diversity and Student Poverty in Suffolk County School Districts",
       caption = "Sources: NCES, 2019 and SAIPE, 2019", 
       # add nice label for size element
       size = "Enrollment") +
  theme_bw()

max_plot

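We can also adjust some parameters to allow for more visual contrast in size. The default scale_size() gives the smallest value a minimum point size rather than scaling up from zero, so point areas are not proportional to enrollment. scale_size_area() makes each point’s area proportional to its value, which is a more visually honest way to represent the data, so let’s make that change.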

# create more contrast in size

max_plot <- sd_data %>% 
  filter(County == "Suffolk County") %>% 
  ggplot(aes(x = stpovrate, y = pct_bipoc, size = denroll_district)) +
  geom_point(alpha = .5) +
  # make sure you have the `scales` package loaded!
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) + 
  # change legend label formatting
  scale_size_area(labels = comma, max_size = 10) +
  labs(x = "Student Poverty Rate", y = "Percent BIPOC",
       title = "Racial Diversity and Student Poverty in Suffolk County School Districts",
       caption = "Sources: NCES, 2019 and SAIPE, 2019", 
       # add nice label for size element
       size = "Enrollment") +
  theme_bw()

max_plot
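
To check how a size scale treats a variable, layer_data() can pull out the sizes ggplot2 actually computes. This sketch uses made-up enrollment numbers, and toy, area_plot and sizes are illustrative names. With scale_size_area(), the squared sizes should be proportional to enrollment.

```r
library(ggplot2)

# toy data: each district has 4 times the enrollment of the previous one
toy <- data.frame(x = 1:3, y = 1:3, enroll = c(100, 400, 1600))

area_plot <- ggplot(toy, aes(x = x, y = y, size = enroll)) +
  geom_point() +
  scale_size_area(max_size = 10)

# layer_data() shows the point sizes ggplot2 computed
sizes <- layer_data(area_plot)$size
sizes

# point area grows with the square of size, so these two sets of
# ratios should match
sizes^2 / max(sizes)^2
toy$enroll / max(toy$enroll)
```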

Step 6: Repeat steps 4-5 as needed

Color can be helpful, too - let’s add in color based on urbanicity.

# add in color based on urbanicity

max_plot <- sd_data %>% 
  filter(County == "Suffolk County") %>% 
  ggplot(aes(x = stpovrate, y = pct_bipoc, size = denroll_district,
             color = urbanicity)) +
  geom_point(alpha = .5) +
  # make sure you have the `scales` package loaded!
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) + 
  # change legend label formatting
  scale_size_area(labels = comma, max_size = 10) +
  labs(x = "Student Poverty Rate", y = "Percent BIPOC",
       title = "Racial Diversity and Student Poverty in Suffolk County School Districts",
       caption = "Sources: NCES, 2019 and SAIPE, 2019", 
       # add nice label for size element
       size = "Enrollment",
       color = "Urbanicity") +
  theme_bw()

max_plot

We can and should adjust the colors used. R recognizes some pretty funky color names, which can be found in this helpful cheatsheet.

# adjust colors manually

max_plot <- sd_data %>% 
  filter(County == "Suffolk County") %>% 
  ggplot(aes(x = stpovrate, y = pct_bipoc, size = denroll_district,
             color = urbanicity)) +
  geom_point(alpha = .5) +
  # create manual color palette
  # color names pulled from a pdf y'all should bookmark
  # http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf
  scale_color_manual(values = c("tomato3", "steelblue2",
                                "seagreen3", "orchid1")) +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) + 
  # change legend label formatting
  scale_size_area(labels = comma, max_size = 10) +
  labs(x = "Student Poverty Rate", y = "Percent BIPOC",
       title = "Racial Diversity and Student Poverty in Suffolk County School Districts",
       caption = "Sources: NCES, 2019 and SAIPE, 2019", 
       # add nice label for size element
       size = "Enrollment",
       color = "Urbanicity") +
  theme_bw()

max_plot
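
When colors are listed without names, ggplot2 assigns them to the categories in order. If you want a specific color for a specific category, a named vector is one option. This is a sketch with toy data. The urbanicity labels below are placeholders I made up, so check the real ones with unique(sd_data$urbanicity) before copying this.

```r
library(ggplot2)

# toy data; these urbanicity labels are placeholders
toy <- data.frame(x = 1:4, y = 1:4,
                  urbanicity = c("City", "Suburb", "Town", "Rural"))

# a named vector ties each category to a specific color
urban_colors <- c("City" = "tomato3", "Suburb" = "steelblue2",
                  "Town" = "seagreen3", "Rural" = "orchid1")

named_plot <- ggplot(toy, aes(x = x, y = y, color = urbanicity)) +
  geom_point(size = 4) +
  scale_color_manual(values = urban_colors)
named_plot
```

The names have to match the values in the data exactly, including capitalization.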

We should strive to make our analyses as accessible as possible. The viridis package includes some color palettes that are friendly for folks with color blindness, which affects roughly 1 in 12 men and 1 in 200 women.

# use colors better for visual impairments

max_plot <- sd_data %>% 
  filter(County == "Suffolk County") %>% 
  ggplot(aes(x = stpovrate, y = pct_bipoc, size = denroll_district,
             color = urbanicity)) +
  geom_point(alpha = .5) +
  # use a colorblind-friendly palette
  scale_color_viridis_d() +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) + 
  # change legend label formatting
  scale_size_area(labels = comma, max_size = 10) +
  labs(x = "Student Poverty Rate", y = "Percent BIPOC",
       title = "Racial Diversity and Student Poverty in Suffolk County School Districts",
       caption = "Sources: NCES, 2019 and SAIPE, 2019", 
       # add nice label for size element
       size = "Enrollment",
       color = "Urbanicity") +
  theme_bw()

max_plot

Let’s adjust the range of colors used to exclude that hard-to-see yellow.

# that yellow is hard to see - let's adjust the range
max_plot <- sd_data %>% 
  filter(County == "Suffolk County") %>%
  ggplot(aes(x = stpovrate, y = pct_bipoc, size = denroll_district,
             color = urbanicity)) +
  geom_point(alpha = .5) +
  # adjust color range
  scale_color_viridis_d(end = .8) +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) + 
  # change legend label formatting
  scale_size_area(labels = comma, max_size = 10) +
  labs(x = "Student Poverty Rate", y = "Percent BIPOC",
       title = "Racial Diversity and Student Poverty in Suffolk County School Districts",
       caption = "Sources: NCES, 2019 and SAIPE, 2019", 
       # add nice label for size element
       size = "Enrollment",
       color = "Urbanicity") +
  theme_bw()

max_plot
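
To see what the end argument changes, you can print the hex codes the palette produces. This is a small sketch using the viridis() function from the viridis package; full_range and trimmed are illustrative names.

```r
library(viridis)

# four colors across the full palette; the last one is bright yellow
full_range <- viridis(4)
full_range

# stopping at 80% of the palette ends on a green instead
trimmed <- viridis(4, end = .8)
trimmed
```

Both palettes start from the same dark purple. Only the light end is cut off.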

Step 7: Save output

When you’re satisfied with a graphic, you can save it as an image file. By default, ggsave() writes out the last plot you displayed. You can also store a ggplot2 object and pass it to ggsave() with the plot parameter, as in the example here.

# example code to save a stored plot as a 7" wide by 5" tall .png file
ggsave("figures/SuffolkCounty_poverty.png", #specify the file path/name/type
       plot = max_plot, # specify the ggplot object you stored
       units = "in", # specify the units for your image
       height = 5, width = 7) # specify the image dimensions
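
ggsave() needs the folder in the file path to exist already. Here is a sketch that creates the folder first. It uses a temporary folder and a toy plot so it runs anywhere; fig_dir, out_file and toy_plot are illustrative names. In your project you would use "figures" instead.

```r
library(ggplot2)

# a small stand-in plot
toy_plot <- ggplot(data.frame(x = 1:3, y = 1:3), aes(x = x, y = y)) +
  geom_point()

# a temporary folder stands in for "figures" here
fig_dir <- file.path(tempdir(), "figures")

# create the folder if it does not exist yet
if (!dir.exists(fig_dir)) {
  dir.create(fig_dir, recursive = TRUE)
}

out_file <- file.path(fig_dir, "toy_plot.png")
ggsave(out_file, plot = toy_plot, units = "in", height = 5, width = 7)

# confirm the file was written
file.exists(out_file)
```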

Finally, let’s look at the whole state, now that I have some county-level context.

state_plot <- sd_data %>% 
  filter(denroll_district < 500000) %>%
  ggplot(aes(x = stpovrate, y = pct_bipoc, size = denroll_district,
             color = urbanicity)) +
  geom_point(alpha = .5) +
  # adjust color range
  scale_color_viridis_d(end = .8) +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) + 
  # change legend label formatting
  scale_size_area(labels = comma, max_size = 10) +
  labs(x = "Student Poverty Rate", y = "Percent BIPOC",
       title = "Racial Diversity and Student Poverty in New York School Districts",
       caption = "Note: excludes New York City School District", 
       # add nice label for size element
       size = "Enrollment",
       color = "Urbanicity") +
  theme_bw()

state_plot




Homework overview




Readings Discussion




Assignment 4

4a. Readings

4b. R Assignment

Use the visualization skills you learned today to create plots that explore the state-level apportionment and race data from last week’s homework. There are many more variables in the decennial census dataset. Add more variables and create some finalized plots exploring the race and apportionment data. There are no right plots, so just follow your interest. If you feel inspired, you can look for more state-level data to add to your data frame to deepen the analysis.

When you have finished your exploratory visual analysis, go back through your script and clean it up: delete any lines of code that are not pertinent to your final analysis, write comments to explain all of your steps. Make this a script that you could open in a year and rerun.

Upload your finalized script. If you import any additional data, please also upload it so that I can run your script.