In today’s class we are going to continue working on the same analysis as last week, and we’ll keep working in the class 3 project and folder. We will explore the student poverty dataset that we created last week, and use plots and charts to do it.
Download the class 4 folder and move the data into your class3/data/raw folder
Open project:
- methods1/class3
Install the hrbrthemes, scales and viridis packages
install.packages("hrbrthemes")
install.packages("scales")
install.packages("viridis")
In-class analysis continues
Visualization
- Defining colors in R
Iterating your way to beauty and understanding ggplot2
Homework review
Readings discussion
Assignment 4
Today we continue with the student poverty analysis for New York school districts.
- What is the difference between the student poverty rate in each school district and the poverty rate of the county as a whole? and the state as a whole?
- What counties have the most economic inequality, as measured by student poverty rate of school districts?
We have 3 scripts for our analysis so far:
We’ll pick up in 3_student_poverty_analysis where we left off:
First we’ll look at the county stats and then perform some quick stats to figure out our next analysis steps
# What is the range of student poverty rates in counties across New York
hist(county_stats$`Poverty Rate Range`)
### select the min max and median range for school districts
summary(county_stats$`Poverty Rate Range`)
Now that I see the county data, I realize that I want some more data about the school districts to get a fuller picture of the economic inequality of counties in New York. I’ll import some enrollment data from the National Center for Education Statistics to join to my poverty dataset.
### Import some extra data for school districts to use to explore
sd_enroll <- read_csv("data/raw/ny_sd_enrollment.csv")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## id = col_double(),
## urbanicity = col_character(),
## denroll_district = col_double(),
## dwhite = col_double(),
## dblack = col_double(),
## dhispanic = col_double(),
## dasian_pi = col_double(),
## dhawaiian_pi = col_double(),
## damindian_aknative = col_double(),
## d2plus_races = col_double(),
## pct_bipoc = col_double(),
## median_propert_value = col_double()
## )
# Join to school district data
sd_data <- sd_county_pov %>%
left_join(sd_enroll, by = "id")
### Check for na
colSums(is.na(sd_data))
#write out the data
write_csv(sd_data, "data/output/NY_stpov_analysis_data.csv")
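One thing to watch for with joins: if both tables share a column name that isn't the join key, left_join() keeps both copies and tacks on .x and .y suffixes. The sketch below uses two made-up toy tables (not our real data) to show how that happens, and how a quick anti_join() can flag districts that have no enrollment match.

```r
library(dplyr)

# toy tables for illustration only; ids and values are made up
pov <- tibble(id = c(1, 2, 3),
              stpovrate = c(0.10, 0.25, 0.18),
              urbanicity = c("City", "Rural", "Suburb"))
enroll <- tibble(id = c(1, 2),
                 denroll_district = c(1200, 300),
                 urbanicity = c("City", "Rural"))

# both tables have `urbanicity`, so the join keeps both copies with suffixes
joined <- pov %>%
  left_join(enroll, by = "id")
names(joined)

# rows in pov with no match in enroll (these get NA after the join)
unmatched <- pov %>%
  anti_join(enroll, by = "id")
unmatched
```

If you see columns like urbanicity.x.x in your own output, that usually means a join was re-run on a data frame that had already been joined. Re-running the script from the top on a fresh import is the simplest way to check.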
We’re going to begin to learn how to create visualizations in R today. There are lots of different ways to name a color. The two we’ll use most are hex colors and predefined color names.
Hex colors are a 6-digit way to represent a color that is common in web development. You can pick colors and find their hex codes here:
R also recognizes hundreds of predefined color names like “blue” and “ivory”. You can find a full list of them here:
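Here is a quick sketch of how the two naming systems relate, using only base R functions. The color "tomato" is just an example I picked; any name from colors() works the same way.

```r
# R ships with a long list of built-in color names
head(colors())

# look up the red/green/blue values behind a color name
col2rgb("tomato")

# build a hex code from red/green/blue values (0-255)
tomato_hex <- rgb(255, 99, 71, maxColorValue = 255)
tomato_hex

# a hex code converts back to the same red/green/blue values
col2rgb(tomato_hex)
```

Anywhere R expects a color, you can pass either the name or the hex code.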
Looking at ggplot2 code for production-ready graphics can be intimidating, but understanding the process can dramatically reduce plot-related panic.
Developing ggplot2 graphics should generally follow a simple, iterative workflow:
Create a new script 4_visualize_poverty_analysis.R to explore the data and create visualizations
library(tidyverse)
library(scales)
library(viridis)
### Questions we want to answer
# What is the range of student poverty rates in counties across New York
# What does the county with the least economic inequality look like?
# What does the county with the most economic inequality look like?
# What does the county with the median economic inequality look like?
### Import the data
county_stats <- read_csv("data/output/NY_county_poverty_stats.csv")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## County = col_character(),
## Districts = col_double(),
## `County Poverty Rate` = col_double(),
## `Average School District Poverty Rate` = col_double(),
## `Maximum Poverty Rate` = col_double(),
## `Minimum Poverty Rate` = col_double(),
## `Poverty Rate Range` = col_double(),
## `Average Poverty Rate Difference` = col_double(),
## `Maximum Poverty Rate Difference` = col_double()
## )
sd_data <- read_csv("data/output/NY_stpov_analysis_data.csv")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## .default = col_double(),
## district = col_character(),
## County = col_character(),
## urbanicity.x = col_character(),
## urbanicity.y = col_character(),
## urbanicity.x.x = col_character(),
## urbanicity.y.y = col_character(),
## urbanicity.x.x.x = col_character(),
## urbanicity.y.y.y = col_character(),
## urbanicity.x.x.x.x = col_character(),
## urbanicity.y.y.y.y = col_character(),
## urbanicity = col_character()
## )
## ℹ Use `spec()` for the full column specifications.
## First select only the counties with at least 5 school districts
county_stats_analysis <- county_stats %>%
filter(Districts >= 5) # select counties with at least 5 districts
# What is the range of student poverty rates in counties across New York
hist(county_stats$`Poverty Rate Range`)
### select the min max and median range for school districts
summary(county_stats$`Poverty Rate Range`)
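The summary tells us the min, median and max range, but not which counties they belong to. One possible way to pull those counties out is sketched below with a made-up toy table (the county names and values are placeholders, and slice_min() / slice_max() assume dplyr 1.0.0 or later).

```r
library(dplyr)

# toy county table for illustration only; names and values are made up
toy_stats <- tibble(
  County = c("A County", "B County", "C County", "D County", "E County"),
  `Poverty Rate Range` = c(0.05, 0.30, 0.12, 0.21, 0.18)
)

# county with the largest range of district poverty rates
most_unequal <- toy_stats %>%
  slice_max(`Poverty Rate Range`, n = 1)

# county with the smallest range
least_unequal <- toy_stats %>%
  slice_min(`Poverty Rate Range`, n = 1)

# county whose range sits closest to the median range
median_county <- toy_stats %>%
  mutate(dist_from_median = abs(`Poverty Rate Range` -
                                  median(`Poverty Rate Range`))) %>%
  slice_min(dist_from_median, n = 1)

most_unequal
least_unequal
median_county
```

Swapping toy_stats for county_stats_analysis should give you candidate counties to filter on in the plots that follow.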
# first plot
max_plot <- sd_data %>%
filter(County == "Monroe County") %>%
ggplot(aes(x = stpovrate, y = pct_bipoc)) +
geom_point()
max_plot
Now, let’s take care of some formatting issues. Our axes don’t look great - the decimals ought to be percentages. Here, the scales package provides some help.
# format axes
max_plot <- sd_data %>%
filter(County == "Monroe County") %>%
ggplot(aes(x = stpovrate, y = pct_bipoc)) +
geom_point() +
# make sure you have the `scales` package loaded!
scale_x_continuous(labels = percent_format(accuracy = 1)) +
scale_y_continuous(labels = percent_format(accuracy = 1))
max_plot
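If you're curious what percent_format() is doing, it builds a function that turns numbers into text labels. You can call that function directly on a few values to preview what your axis will show. The numbers here are arbitrary examples.

```r
library(scales)

# percent_format() returns a labelling function
pct <- percent_format(accuracy = 1)

# accuracy = 1 rounds to the nearest whole percent
pct(c(0.05, 0.123, 0.5))

# accuracy = 0.1 keeps one decimal place instead
pct_decimal <- percent_format(accuracy = 0.1)
pct_decimal(0.123)
```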
Next, we should add some labels to our axes that make sense, along with a title for our plot and a caption that details our data sources.
# add labels
max_plot <- sd_data %>%
filter(County == "Suffolk County") %>%
ggplot(aes(x = stpovrate, y = pct_bipoc)) +
geom_point() +
# make sure you have the `scales` package loaded!
scale_x_continuous(labels = percent_format(accuracy = 1)) +
scale_y_continuous(labels = percent_format(accuracy = 1)) +
labs(x = "Student Poverty Rate", y = "Percent BIPOC",
title = "Racial Diversity and Student Poverty in Suffolk County School Districts",
caption = "Sources: NCES, 2019 and SAIPE, 2019")
max_plot
Themes can be used to change the appearance of elements in your plot. There are many stock options, but I prefer theme_bw() for its clean appearance and helpful and unobtrusive gridlines.
# change theme
max_plot <- sd_data %>%
filter(County == "Suffolk County") %>%
ggplot(aes(x = stpovrate, y = pct_bipoc)) +
geom_point() +
# make sure you have the `scales` package loaded!
scale_x_continuous(labels = percent_format(accuracy = 1)) +
scale_y_continuous(labels = percent_format(accuracy = 1)) +
labs(x = "Student Poverty Rate", y = "Percent BIPOC",
title = "Racial Diversity and Student Poverty in Suffolk County School Districts",
caption = "Sources: NCES, 2019 and SAIPE, 2019") +
theme_bw()
max_plot
Now that we have a decent-looking graph, let’s add in a new data element to vary point size by enrollment.
# add size element
max_plot <- sd_data %>%
filter(County == "Suffolk County") %>%
ggplot(aes(x = stpovrate, y = pct_bipoc, size = denroll_district)) +
geom_point() +
# make sure you have the `scales` package loaded!
scale_x_continuous(labels = percent_format(accuracy = 1)) +
scale_y_continuous(labels = percent_format(accuracy = 1)) +
labs(x = "Student Poverty Rate", y = "Percent BIPOC",
title = "Racial Diversity and Student Poverty in Suffolk County School Districts",
caption = "Sources: NCES, 2019 and SAIPE, 2019") +
theme_bw()
max_plot
We see some overlap in the points. Reducing the opacity of the points can be accomplished by setting the alpha parameter in geom_point() to a value less than 1. Setting it to .5 will make data points 50% translucent.
# reduce point opacity
max_plot <- sd_data %>%
filter(County == "Suffolk County") %>%
ggplot(aes(x = stpovrate, y = pct_bipoc, size = denroll_district)) +
geom_point(alpha = .5) +
# make sure you have the `scales` package loaded!
scale_x_continuous(labels = percent_format(accuracy = 1)) +
scale_y_continuous(labels = percent_format(accuracy = 1)) +
labs(x = "Student Poverty Rate", y = "Percent BIPOC",
title = "Racial Diversity and Student Poverty in Suffolk County School Districts",
caption = "Sources: NCES, 2019 and SAIPE, 2019") +
theme_bw()
max_plot
Adding a new variable for size creates a legend. We need to tidy the legend’s labels and the title.
# clean up legend
max_plot <- sd_data %>%
filter(County == "Suffolk County") %>%
ggplot(aes(x = stpovrate, y = pct_bipoc, size = denroll_district)) +
geom_point(alpha = .5) +
# make sure you have the `scales` package loaded!
scale_x_continuous(labels = percent_format(accuracy = 1)) +
scale_y_continuous(labels = percent_format(accuracy = 1)) +
# change legend label formatting
scale_size(labels = comma) +
labs(x = "Student Poverty Rate", y = "Percent BIPOC",
title = "Racial Diversity and Student Poverty in Suffolk County School Districts",
caption = "Sources: NCES, 2019 and SAIPE, 2019",
# add nice label for size element
size = "Enrollment") +
theme_bw()
max_plot
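Notice that we wrote labels = comma with no parentheses: the labels argument accepts a function, and ggplot2 calls it on the legend breaks for us. A small sketch with made-up enrollment numbers shows what comma() returns when you call it yourself.

```r
library(scales)

# made-up enrollment values for illustration
enrollment <- c(950, 12500, 1034000)

# comma() adds thousands separators and returns text labels
enrollment_labels <- comma(enrollment)
enrollment_labels
```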
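We can also adjust some parameters to allow for more visual contrast in size. By default, scale_size() maps the size variable to point area, but it starts from a small nonzero minimum size, so point areas are not strictly proportional to enrollment. scale_size_area() makes area proportional to the value (a value of 0 gets a size of 0), which is a more visually honest way to represent the data, and max_size lets us enlarge the largest points. Let’s make that change.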
# create more contrast in size
max_plot <- sd_data %>%
filter(County == "Suffolk County") %>%
ggplot(aes(x = stpovrate, y = pct_bipoc, size = denroll_district)) +
geom_point(alpha = .5) +
# make sure you have the `scales` package loaded!
scale_x_continuous(labels = percent_format(accuracy = 1)) +
scale_y_continuous(labels = percent_format(accuracy = 1)) +
# change legend label formatting
scale_size_area(labels = comma, max_size = 10) +
labs(x = "Student Poverty Rate", y = "Percent BIPOC",
title = "Racial Diversity and Student Poverty in Suffolk County School Districts",
caption = "Sources: NCES, 2019 and SAIPE, 2019",
# add nice label for size element
size = "Enrollment") +
theme_bw()
max_plot
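To see what scale_size_area() does to the points, here is a minimal sketch with three made-up values. layer_data() is a ggplot2 helper that returns the values ggplot2 computed for a layer, including the size assigned to each point.

```r
library(ggplot2)

# toy data for illustration only
toy <- data.frame(x = 1:3, y = 1, enroll = c(0, 25, 100))

p_area <- ggplot(toy, aes(x = x, y = y, size = enroll)) +
  geom_point() +
  scale_size_area(max_size = 10)

# sizes ggplot2 assigned to each point;
# a value of 0 is drawn at size 0 and the largest value at max_size
point_sizes <- layer_data(p_area)$size
point_sizes
```

Because size works like a diameter, the point for 25 should come out at half the size of the point for 100, which is one quarter of the area.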
Color can be helpful, too - let’s add in color based on urbanicity.
# add in color based on urbanicity
max_plot <- sd_data %>%
filter(County == "Suffolk County") %>%
ggplot(aes(x = stpovrate, y = pct_bipoc, size = denroll_district,
color = urbanicity)) +
geom_point(alpha = .5) +
# make sure you have the `scales` package loaded!
scale_x_continuous(labels = percent_format(accuracy = 1)) +
scale_y_continuous(labels = percent_format(accuracy = 1)) +
# change legend label formatting
scale_size_area(labels = comma, max_size = 10) +
labs(x = "Student Poverty Rate", y = "Percent BIPOC",
title = "Racial Diversity and Student Poverty in Suffolk County School Districts",
caption = "Sources: NCES, 2019 and SAIPE, 2019",
# add nice label for size element
size = "Enrollment",
color = "Urbanicity") +
theme_bw()
max_plot
We can and should adjust the colors used. R recognizes some pretty funky color names, which can be found in this helpful cheatsheet.
# adjust colors manually
max_plot <- sd_data %>%
filter(County == "Suffolk County") %>%
ggplot(aes(x = stpovrate, y = pct_bipoc, size = denroll_district,
color = urbanicity)) +
geom_point(alpha = .5) +
# create manual color palette
# color names pulled from a pdf y'all should bookmark
# http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf
scale_color_manual(values = c("tomato3", "steelblue2",
"seagreen3", "orchid1")) +
scale_x_continuous(labels = percent_format(accuracy = 1)) +
scale_y_continuous(labels = percent_format(accuracy = 1)) +
# change legend label formatting
scale_size_area(labels = comma, max_size = 10) +
labs(x = "Student Poverty Rate", y = "Percent BIPOC",
title = "Racial Diversity and Student Poverty in Suffolk County School Districts",
caption = "Sources: NCES, 2019 and SAIPE, 2019",
# add nice label for size element
size = "Enrollment",
color = "Urbanicity") +
theme_bw()
max_plot
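When you pass a plain vector of colors, ggplot2 assigns them to categories in order, which can be hard to keep track of. One option is a named vector, so each category is pinned to a specific color. The category names below are my guess at what urbanicity contains; run unique(sd_data$urbanicity) to check the real values before borrowing this.

```r
library(ggplot2)

# hypothetical category names; check unique(sd_data$urbanicity) for the real ones
urban_colors <- c("City" = "tomato3", "Suburb" = "steelblue2",
                  "Town" = "seagreen3", "Rural" = "orchid1")

# toy data for illustration only
toy <- data.frame(x = c(0.1, 0.2, 0.3, 0.4),
                  y = c(0.5, 0.2, 0.3, 0.1),
                  urbanicity = c("Rural", "City", "Town", "Suburb"))

# colors are matched to categories by name, not by position
named_color_plot <- ggplot(toy, aes(x = x, y = y, color = urbanicity)) +
  geom_point(size = 4) +
  scale_color_manual(values = urban_colors)

named_color_plot
```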
We should strive to make our analyses as accessible as possible. The viridis package includes some color palettes that are friendly for folks with color blindness, which affects 5-10 percent of the US population.
# use colors better for visual impairments
max_plot <- sd_data %>%
filter(County == "Suffolk County") %>%
ggplot(aes(x = stpovrate, y = pct_bipoc, size = denroll_district,
color = urbanicity)) +
geom_point(alpha = .5) +
# use a colorblind-friendly palette
scale_color_viridis_d() +
scale_x_continuous(labels = percent_format(accuracy = 1)) +
scale_y_continuous(labels = percent_format(accuracy = 1)) +
# change legend label formatting
scale_size_area(labels = comma, max_size = 10) +
labs(x = "Student Poverty Rate", y = "Percent BIPOC",
title = "Racial Diversity and Student Poverty in Suffolk County School Districts",
caption = "Sources: NCES, 2019 and SAIPE, 2019",
# add nice label for size element
size = "Enrollment",
color = "Urbanicity") +
theme_bw()
max_plot
Let’s adjust the range of colors used to exclude that hard-to-see yellow.
# that yellow is hard to see - let's adjust the range
max_plot <- sd_data %>%
filter(County == "Suffolk County") %>%
ggplot(aes(x = stpovrate, y = pct_bipoc, size = denroll_district,
color = urbanicity)) +
geom_point(alpha = .5) +
# adjust color range
scale_color_viridis_d(end = .8) +
scale_x_continuous(labels = percent_format(accuracy = 1)) +
scale_y_continuous(labels = percent_format(accuracy = 1)) +
# change legend label formatting
scale_size_area(labels = comma, max_size = 10) +
labs(x = "Student Poverty Rate", y = "Percent BIPOC",
title = "Racial Diversity and Student Poverty in Suffolk County School Districts",
caption = "Sources: NCES, 2019 and SAIPE, 2019",
# add nice label for size element
size = "Enrollment",
color = "Urbanicity") +
theme_bw()
max_plot
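The end argument trims the top of the palette. As a rough sketch, you can print the hex codes the viridis() function generates with and without the trim and compare the last color in each (4 colors is an arbitrary choice here).

```r
library(viridis)

# four colors spanning the full viridis palette
full_range <- viridis(4)

# four colors stopping at 80% of the palette, before the bright yellow
trimmed <- viridis(4, end = 0.8)

full_range
trimmed
```

Both palettes start from the same dark purple; only the light end changes.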
When you’re satisfied with a graphic, you can save it as an image file. ggsave() will write out an image file based on the last plot you ran. Or you can store a ggplot2 object and pass that as a parameter to ggsave().
# example code to save a stored plot as a 5" by 7" .png file
ggsave("figures/SuffolkCounty_poverty.png", #specify the file path/name/type
plot = max_plot, # specify the ggplot object you stored
units = "in", # specify the units for your image
height = 5, width = 7) # specify the image dimensions
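Depending on your ggplot2 version, ggsave() may complain if the folder you're saving into doesn't exist yet. A minimal sketch of one way to guard against that is below. It saves a throwaway toy plot into a temporary folder so it won't touch your project; in your own script you would use "figures" in place of out_dir.

```r
library(ggplot2)

# toy plot for illustration only
toy_plot <- ggplot(data.frame(x = 1:3, y = c(2, 4, 3)), aes(x = x, y = y)) +
  geom_point()

# create the output folder if it isn't there already
out_dir <- file.path(tempdir(), "figures")
if (!dir.exists(out_dir)) {
  dir.create(out_dir, recursive = TRUE)
}

# save the stored plot and confirm the file was written
out_file <- file.path(out_dir, "toy_plot.png")
ggsave(out_file, plot = toy_plot, units = "in", height = 5, width = 7)
file.exists(out_file)
```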
state_plot <- sd_data %>%
filter(denroll_district < 500000) %>%
ggplot(aes(x = stpovrate, y = pct_bipoc, size = denroll_district,
color = urbanicity)) +
geom_point(alpha = .5) +
# adjust color range
scale_color_viridis_d(end = .8) +
scale_x_continuous(labels = percent_format(accuracy = 1)) +
scale_y_continuous(labels = percent_format(accuracy = 1)) +
# change legend label formatting
scale_size_area(labels = comma, max_size = 10) +
labs(x = "Student Poverty Rate", y = "Percent BIPOC",
title = "Racial Diversity and Student Poverty in New York School Districts",
caption = "Note: excludes New York City School District",
# add nice label for size element
size = "Enrollment",
color = "Urbanicity") +
theme_bw()
state_plot
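We've now copied the same plotting code many times with only the filter and title changing. If you want to make the same chart for several counties, one option is to wrap it in a function. This is only a sketch: plot_county() is a hypothetical helper I made up for illustration, and the toy data stands in for sd_data with the same column names.

```r
library(dplyr)
library(ggplot2)
library(scales)

# hypothetical helper: build the finished scatterplot for one county
plot_county <- function(data, county_name) {
  data %>%
    filter(County == county_name) %>%
    ggplot(aes(x = stpovrate, y = pct_bipoc, size = denroll_district,
               color = urbanicity)) +
    geom_point(alpha = .5) +
    scale_color_viridis_d(end = .8) +
    scale_x_continuous(labels = percent_format(accuracy = 1)) +
    scale_y_continuous(labels = percent_format(accuracy = 1)) +
    scale_size_area(labels = comma, max_size = 10) +
    labs(x = "Student Poverty Rate", y = "Percent BIPOC",
         title = paste("Racial Diversity and Student Poverty in",
                       county_name, "School Districts"),
         size = "Enrollment",
         color = "Urbanicity") +
    theme_bw()
}

# toy data for illustration only; values are made up
toy <- tibble(
  County = c("A County", "A County", "B County"),
  stpovrate = c(0.10, 0.25, 0.18),
  pct_bipoc = c(0.30, 0.55, 0.20),
  denroll_district = c(1200, 4500, 800),
  urbanicity = c("City", "Suburb", "Rural")
)

toy_county_plot <- plot_county(toy, "A County")
toy_county_plot
```

With the real data, a call like plot_county(sd_data, "Suffolk County") should reproduce the chart we built step by step above.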
4a. Readings
4b. R Assignment
Use the visualization skills you learned today to create plots to explore the state-level apportionment and race data from last week’s homework. There are many more variables in the decennial census dataset. Add more variables and create some finalized plots exploring the race and apportionment data. There are no right plots, just follow your interest. If you feel inspired, you can look for more state-level data to add to your data frame to deepen the analysis.
When you have finished your exploratory visual analysis, go back through your script and clean it up: delete any lines of code that are not pertinent to your final analysis, and write comments to explain all of your steps. Make this a script that you could open in a year and rerun.
Upload your finalized script. If you import any additional data, please also upload it so that I can run your script.