Hello and welcome back to PHW251B: Data Visualization for Public Health! This RMarkdown document will serve as your third coding assignment this semester, meaning that you will complete the assignment to your best of your ability, knit the file to HTML, and then submit it to bCourses. Please feel free to reach out to the teaching team if you have any questions: GSI email
In this assignment, we’re building directly off of the Module #9: Interactive Plots and creating our own dynamic charts and tables. It will be helpful to have worked through that Module before beginning this assignment!
Today, we’ll be revisiting CalEnviroScreen (CES) 4.0, which is a composite dataset with a host of environmental exposure and health outcome markers produced and maintained by the Office of Environmental Health Hazards Assessment (OEHHA), a branch of California’s EPA. It’s a relevant dataset that has a continuing influence on how health equity programs are funded and distributed!
You can read more about CES here: CES 4.0 website
CES 4.0 is available as both a tabular spreadsheet (.xlsx) and a geospatial shapefile (.shp). Today, we will be using the former, but do feel free to go to the linked website above to download the shapefile for your own practice/exploration.
There are too many variables to list out descriptions in this .rmd, but we have converted the data dictionary for your viewing.
# import dataset (first sheet of Excel file)
ces_data <- read.xlsx("calenviroscreen40resultsdatadictionary.xlsx", sheet = 1) %>%
mutate(California.County = str_trim(California.County))
#interactive table to peak at data
datatable(head(ces_data))
# import dictionary (third sheet of Excel file)
ces_dict <- read.xlsx("calenviroscreen40resultsdatadictionary.xlsx", sheet = 3)
# rename data dictionary columns
colnames(ces_dict) <- c("Variable", "Description")
#interactive table to peak at dictionary
datatable(head(ces_dict))
Let’s dive in! Today we’ll be asking you to construct an interactive plot/graph and an interactive table using the CES dataset.
Task #1
Please construct a boxplot that depicts the distribution of the
Low.Birth.Weight variable across Bay Area counties using
ggplotly, plotly, or ggiraph.
You’ll want to filter the dataset for the nine Bay Area counties first,
and then construct your plot. Feel free to add any visual modification
that you deem necessary to make it a publication-ready graphic.
#define bay area counties to subset data
bay_area_counties <- c("Alameda", "Contra Costa", "Marin", "Napa", "San Francisco", "San Mateo", "Santa Clara", "Solano", "Sonoma")
# Filter the dataset for Bay Area counties
filtered_data <- ces_data %>%
filter(California.County %in% bay_area_counties)
# Create the boxplot
lbw_boxplot <- ggplot(filtered_data, aes(x = California.County,
y = Low.Birth.Weight,
fill = California.County)) +
scale_fill_brewer(palette = "Pastel2") +
geom_boxplot() +
labs(x = "County", y = "Low Birth Weight",
title = "Distribution of Low Birth Weight in Bay Area Counties") +
theme_minimal() +
theme(legend.position = "none",
axis.title = element_text(size = 14),
axis.text = element_text(size = 10))
# Convert the ggplot to a plotly object
lbw_plotly <- ggplotly(lbw_boxplot)
lbw_plotly
Task #2
Next, let’s make an interactive table that lets a user view the CES
dataset on their own. Construct the table using DT or
reactable.
The entire dataset is likely too large to explore as a table, but let’s first summarize the following variables at the County level to simplify the data:
Population-weighted averages for: - CES 4.0 Score - PM2.5 - Lead - Traffic - Pollution Burden - Asthma - Cardiovascular Disease - Housing Burden
Note: see the weighted.mean function for
weighted averages
We’d also like you to round all columns, except for the County name, to 2 decimal points, bold the County column, and turn off rownames.
data_table <- filtered_data %>%
group_by(California.County) %>%
summarise(
CES_4_Score = round(weighted.mean(CES.4.0.Score,
Total.Population,
na.rm = TRUE), 2),
PM2.5 = round(weighted.mean(PM2.5,
Total.Population,
na.rm = TRUE), 2),
Lead = round(weighted.mean(Lead,
Total.Population,
na.rm = TRUE), 2),
Traffic = round(weighted.mean(Traffic,
Total.Population,
na.rm = TRUE), 2),
Pollution_Burden = round(weighted.mean(Pollution.Burden,
Total.Population,
na.rm = TRUE), 2),
Asthma = round(weighted.mean(Asthma,
Total.Population, na.rm = TRUE), 2),
Cardiovascular_Disease = round(weighted.mean(Cardiovascular.Disease,
Total.Population,
na.rm = TRUE), 2),
Housing_Burden = round(weighted.mean(Housing.Burden,
Total.Population,
na.rm = TRUE), 2)
)
datatable(data_table)
datatable(data_table,
class = 'cell-border stripe',
rownames = F,
colnames = c("County", "CES 4.0 Score", "PM2.5", "Lead", "Traffic",
"Pollution Burden",
"Asthma", "Cardiovascular Disease", "Housing Burden"
)) %>%
formatStyle("California.County", fontWeight = "bold")
And that brings us to the end of Assignment 3 – congrats!Please knit the PDF to HTML and submit it to bCourses whenever you are done.