# A tibble: 6 × 6
county `ai/an_visits` asian_visits black_visits hispanic_visits white_visits
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 alameda 145. 22.3 175. 46.5 24.6
2 alpine 0 0 0 0 0
3 amador 0 0 0 51.3 53.4
4 butte 0 20.0 223. 25.9 40.6
5 calaver… 0 0 0 32.3 50.7
6 colusa 0 0 0 31.8 43.8
Milestone 4 Assignment
Visualizations
Your task is to utilize these datasets to create county-level measures to accomplish two goals. The first is to compare county asthma ED rates with a county CES measure to assess if there appears to be a correlation. The second is to compare asthma ED rates with county-level summaries for specific environmental measures to determine if those specific measures may be worth further investigation.
Please submit an Rmd or Qmd and publish an html file on RPubs with the following:
Final datasets for creation of visualization
Join all datasets together
Calculate any remaining data elements needed for analysis
Show code used to create joined dataset, but please do not print full data frame output (showing data structure with
str()is okay)
Visualizations (at least one per group member)
One print quality table as requested in scenario
One print quality plot or chart as requested in scenario
For groups of 3, one additional print quality table or plot of your choice (can support the requested data in the scenario, or answer a different question using the same data sources)
For each visual, include
Code used to generate visual
Legend (if necessary)
1-2 sentence interpretation
NOTE:
Please make sure the visual can stand-alone, meaning it includes enough information in title, legend, and footnote so that a person who sees only the visualization could understand what is being presented.
Please also make sure column names, axis labels, and any other labels are meaningful and not just the name of the variable (ex: “County” rather than “county_name”)
Html file that is professionally prepared for presentation and published to RPubs
Only the necessary information is in the output (e.g., suppress entire data frame outputs but showing data structure with
str()is okay)Show your work by “echoing” code used in Rmd/Qmd file to create your tables and graphs. {r, echo = TRUE} in code chunk preferences.
Use of headers and sub headers to create an organized document
For this milestone please make sure code for visualizations are included in the final file
Milestone 3 work:
CHHS Asthma
Cal Scores
# A tibble: 6 × 9
county median_ces_4_0_score hispanic_pop white_pop african_america_pop
<chr> <dbl> <dbl> <dbl> <dbl>
1 alameda 20.6 370874 519984 171113
2 alpine 13.6 130 551 9
3 amador 19.6 5340 30035 848
4 butte 21.4 36916 162537 3342
5 calaveras 16.1 5486 36409 311
6 colusa 27.4 12738 7576 256
# ℹ 4 more variables: native_american_pop <dbl>, asian_american_pop <dbl>,
# other_multiple_pop <dbl>, total_population <dbl>
Cal Measure
# A tibble: 6 × 4
county pm_2.5 housing_burden pm2.5_level
<chr> <dbl> <dbl> <chr>
1 alameda 8.87 15.9 Good
2 alpine 3.05 17.4 Good
3 amador 8.01 14.8 Good
4 butte 8.22 18.0 Good
5 calaveras 8.12 16.3 Good
6 colusa 7.54 10.6 Good
Data Dictionary
| Variable Name | Description |
|---|---|
| Variable Name | California county that the census tract falls within |
| Variable Name | Annual mean PM 2.5 concentrations |
| Variable Name | Percent housing burdened low income households |
| PM 2.5 Level | Rating of average PM 2.5 level per CA county based on mean county value compared to EPA pm 2.5 level ratings |
| CES 4.0 Level | Category of median CES 4.0 level per CA county based on CDPH groupings |
| Housing Burden Level | Category of mean percent housing burdened low income households based on cutoff points set by The Justice Gap |
| Ai/An visits | Asthma Emergency visit rates per 10,000 Native American residents |
| Asian visits | Asthma Emergency visit rates per 10,000 Asian residents |
| Black visits | Asthma Emergency visit rates per 10,000 African American residents |
| Hispanic visits | Asthma Emergency visit rates per 10,000 Hispanic residents |
| White visits | Asthma Emergency visit rates per 10,000 White residents |
| Age adjusted ed visit rate | Age Adjusted Asthma Emergency Department Visit Rate for each County |
| CES 4.0 Score | CalEnviroScreen Score, Pollution Score multiplied by Population Characteristics Score |
| California County | California county that the census tract falls within |
| Total Population | 2019 ACS population estimates in census tracts |
| Hispanic (%) | 2019 ACS population estimates of the percent per county of those who identify as Hispanic or Latino |
| White (%) | 2019 ACS population estimates of the percent per county of those who identify as non-Hispanic white |
| African American (%) | 2019 ACS population estimates of the percent per county of those who identify as non-Hispanic African American or black |
| Native American (%) | 2019 ACS population estimates of the percent per county of those who identify as non-Hispanic Native American |
| Asian American (%) | 2019 ACS population estimates of the percent per county of those who identify as non-Hispanic Asian or Pacific Islander |
| Other/Multiple (%) | 2019 ACS population estimates of the percent per county of those who identify as non-Hispanic "other" or as multiple races |
| median_ces_4_0_score | median of the census tract level CalEnviroScreen Score, also called CES 4.0 Score, for each county |
Milestone 4
Joining the datasets
#joining cal_measures_cleaned, cal_scores_data, and df datasets by county
cal_ms <- inner_join(cal_measure_cleaned, cal_scores_data, by = "county")
cal_msa_1 <- inner_join(cal_ms, df, by="county")
str(cal_msa_1)tibble [58 × 17] (S3: tbl_df/tbl/data.frame)
$ county : chr [1:58] "alameda" "alpine" "amador" "butte" ...
$ pm_2.5 : num [1:58] 8.87 3.05 8.01 8.22 8.12 ...
$ housing_burden : num [1:58] 15.9 17.4 14.8 18 16.3 ...
$ pm2.5_level : chr [1:58] "Good" "Good" "Good" "Good" ...
$ median_ces_4_0_score: num [1:58] 20.6 13.6 19.6 21.4 16.1 ...
$ hispanic_pop : num [1:58] 370874 130 5340 36916 5486 ...
$ white_pop : num [1:58] 519984 551 30035 162537 36409 ...
$ african_america_pop : num [1:58] 171113 9 848 3342 311 ...
$ native_american_pop : num [1:58] 5207 314 249 1699 184 ...
$ asian_american_pop : num [1:58] 508798 10 529 10534 693 ...
$ other_multiple_pop : num [1:58] 79607 25 1428 10790 1975 ...
$ total_population : num [1:58] 1655590 1039 38429 225817 45057 ...
$ ai/an_visits : num [1:58] 145 0 0 0 0 ...
$ asian_visits : num [1:58] 22.3 0 0 20 0 ...
$ black_visits : num [1:58] 175 0 0 223 0 ...
$ hispanic_visits : num [1:58] 46.5 0 51.3 25.9 32.3 ...
$ white_visits : num [1:58] 24.6 0 53.4 40.6 50.7 ...
head(cal_msa_1)# A tibble: 6 × 17
county pm_2.5 housing_burden pm2.5_level median_ces_4_0_score hispanic_pop
<chr> <dbl> <dbl> <chr> <dbl> <dbl>
1 alameda 8.87 15.9 Good 20.6 370874
2 alpine 3.05 17.4 Good 13.6 130
3 amador 8.01 14.8 Good 19.6 5340
4 butte 8.22 18.0 Good 21.4 36916
5 calaveras 8.12 16.3 Good 16.1 5486
6 colusa 7.54 10.6 Good 27.4 12738
# ℹ 11 more variables: white_pop <dbl>, african_america_pop <dbl>,
# native_american_pop <dbl>, asian_american_pop <dbl>,
# other_multiple_pop <dbl>, total_population <dbl>, `ai/an_visits` <dbl>,
# asian_visits <dbl>, black_visits <dbl>, hispanic_visits <dbl>,
# white_visits <dbl>
Visualization 1: Interactive Bubble Chart showing the relationship between Annual PM 2.5 concentration and Asthma Emergency Department Visit Rates by California County
# Data Viz: Relationship between PM 2.5 and Asthma visit rates by county
chhs_asthma$county <- tolower(chhs_asthma$county)
chhs_data_1 <- chhs_asthma_filter %>%
group_by(county, age_adjusted_ed_visit_rate) %>%
summarise(age_adjusted_ed_visit_rate = mean(age_adjusted_ed_visit_rate),
.groups = "drop")%>%
distinct(county, .keep_all = TRUE)
cal_enviro <- inner_join(cal_msa_1, chhs_data_1, by = "county")
# rounding the variables to 2 digits.
# Filtering out the counties with 0.00 ED visit rates(assumption: the data for these counties was minimal, hence suppressed)
# Renaming the counties to start with upper case letters
cal_enviro$pm_2.5 <- round(cal_enviro$pm_2.5, 2)
cal_enviro$housing_burden <- round(cal_enviro$housing_burden, 2)
cal_msa_filter <- cal_enviro %>%
filter(age_adjusted_ed_visit_rate != 0.000)
cal_msa_filter$county <- str_to_title(cal_msa_filter$county)
# Defining the range and size of the population and color of counties in the plot_ly
size_factor <- 0.0001
num_counties <- 58
county_colors <- rainbow(num_counties, s = 0.6, v = 0.7)
# Inside the plot_ly function to make an interactive visualization
color = ~county_colors
plot_ly(
data = cal_msa_filter,
x = ~pm_2.5,
y = ~age_adjusted_ed_visit_rate,
size = ~total_population * size_factor,
color = ~county,
colors = county_colors,
text = ~paste("County: ", county,
"<br>PM 2.5: ", pm_2.5,
#"<br>Housing burden: ", housing_burden,
"<br>Population: ", total_population,
"<br>Asthma ED rate: ", age_adjusted_ed_visit_rate),
type = 'scatter',
mode = 'markers',
marker = list(sizemode = 'diameter'),
hoverinfo = 'text') %>%
layout(title = list(
text = paste0('Interactive Plot: Relationship between PM 2.5 and Asthma ED Visit Rates \n in California Counties',
'<br>',
'<sup>',
'Sources: CalEnviroScreen 4.0, 2021 & California Health and Human Services Department',
'</sup>')),
xaxis = list(title = "Annual Particulate Matter 2.5 concentrations μg/m³"),
yaxis = list(title = "Asthma ED Visits/10,000 Residents"),
showlegend = F,
annotations = list(x = 1, y = -0.08,
text = "*Size of the bubbles represents Population size",
showarrow = F, xref='paper', yref='paper',
xanchor='right', yanchor='auto', xshift=0, yshift=0,
font=list(size=10, color="blue"))
)Interpretation: The data from 29 California counties in 2021 doesn’t show a clear relationship between Annual Particulate Matter 2.5 concentration, population size, and Asthma Emergency Department Visit rates. This lack of a definitive relationship suggests the possibility of confounding factors, including socio-demographic and environmental elements, contributing to high visit rates in certain counties despite small population size and low PM 2.5 concentration.
Interactive Table 1: Interactive table showing demographic and environmental data of California counties.
housing_types<- c("High", "Moderate", "Low")
housing_factors <- factor(housing_types, levels = c("High", "Moderate", "Low"))
msa_tbl <- cal_msa_1 %>%
mutate(ces_4.0_level = case_when(median_ces_4_0_score <= 30 ~ "Low",
median_ces_4_0_score <= 50 ~ "Moderate",
TRUE ~ "High")) %>%
mutate(housing_level = case_when(housing_burden <= 12 ~ housing_factors[3],
housing_burden <= 16 ~ housing_factors[2],
TRUE ~ housing_factors[1])) %>%
mutate(percent_pop_hispanic = (hispanic_pop/total_population) * 100,
percent_pop_white = (white_pop/total_population)*100,
percent_pop_afam = (african_america_pop/total_population)*100,
percent_pop_na = (native_american_pop/total_population)*100,
percent_pop_aa = (asian_american_pop/total_population)*100,
multiple_pop = (other_multiple_pop/total_population)*100) %>%
select(county,pm2.5_level,ces_4.0_level,housing_level,percent_pop_hispanic, percent_pop_white, percent_pop_afam, percent_pop_na, percent_pop_aa, multiple_pop) %>%
mutate(county = str_to_title(county))%>%
mutate_if(is.numeric, ~ round(., digits = 3))
datatable(msa_tbl,
options = list(
pageLength=10,
lengthMenu=c(10,20,30,40,50,60),
columnDefs=list(
list(className='dt-center',targets=1:3),
list(orderable=FALSE, targets=c(1:3))
),
dom = 'ltip',
ordering=T
),
rownames=FALSE,
colnames=c("County","Average PM 2.5 Level", "Median CES 4.0 Score", "Housing Burden Level", "Population Hispanic %","Population White %", "Population Black %","Population Indigenous %", "Population Asian %", "Population Mixed Race %"),
caption="Demographic and Environmental Data of California Counties",
filter="top",
editable=list(target='cell',disable=list(columns=1:3))) %>%
formatStyle(columns = "housing_level",
color = styleEqual(c(housing_factors[3], housing_factors[2], housing_factors[1] ), c("burlywood","darkorange","red" ))) %>%
formatStyle(columns = "pm2.5_level",
color = styleEqual(c("Good", "Moderate"), c("cadetblue","blue" ))) %>%
formatStyle(columns = "ces_4.0_level",
color = styleEqual(c("Low", "Moderate"), c("darkseagreen","darkgreen" )))Interpretation: This interactive chart shows data from all 58 California counties including average PM 2.5 levels, median CES score, average housing burden, and percentage of the county population belonging to each race, and allows comparisons between different subsets. PM 2.5 levels were defined as Good or Moderate based on EPA standards (here), CES scores were defined as Low (0-30), Moderate (30-50), and High (50+) based on data from CDPH (here), and housing burden levels were defined as low (0-12%), medium (12-16%), and high (16+%), with cutoff points based on information from The Justice Gap (here).
Visualization 3: Faceted Scatterplots showing Associations between the Percentage Population of Six Racial Categories and Average CES 4.0 Scores by County-level Data.
#calculate proportion of each race/ethnicity at county level
cal_msa_1_plots <- cal_msa_1 %>%
mutate(percent_pop_hispanic = (hispanic_pop/total_population)*100,
percent_pop_white = (white_pop/total_population)*100,
percent_pop_aa = (african_america_pop/total_population)*100,
percent_pop_na = (native_american_pop/total_population)*100,
percent_pop_asian = (asian_american_pop/total_population)*100,
percent_pop_other = (other_multiple_pop/total_population)*100)
#Scatter plots for correlation between each race category and its corresponding Mean CES Score
plot_1 <- ggplot(cal_msa_1_plots,
aes(x = percent_pop_hispanic,
y = median_ces_4_0_score))+
geom_point(aes(color = county))+
geom_smooth(method = "lm", color = "red", se = F)+
theme_minimal()+
labs(x = "Hispanic population %",
y = "Mean CES Score")+
theme(legend.position = "none")
plot_2 <- ggplot(cal_msa_1_plots,
aes(x = percent_pop_white,
y = median_ces_4_0_score))+
geom_point(aes(color = county))+
geom_smooth(method = "lm", color = "red", se = F)+
theme_minimal()+
labs(x = "White population %",
y = "Mean CES Score")+
theme(legend.position = "none")
plot_3 <- ggplot(cal_msa_1_plots,
aes(x = percent_pop_aa,
y = median_ces_4_0_score))+
geom_point(aes(color = county))+
geom_smooth(method = "lm", color = "red", se = F)+
theme_minimal()+
labs(x = "Black population %",
y = "Mean CES Score")+
theme(legend.position = "none")
plot_4 <- ggplot(cal_msa_1_plots,
aes(x = percent_pop_na,
y = median_ces_4_0_score))+
geom_point(aes(color = county))+
geom_smooth(method = "lm", color = "red", se = F)+
theme_minimal()+
labs(x = "Native American population %",
y = "Mean CES Score")+
theme(legend.position = "none")
plot_5 <- ggplot(cal_msa_1_plots,
aes(x = percent_pop_asian,
y = median_ces_4_0_score))+
geom_point(aes(color = county))+
geom_smooth(method = "lm", color = "red", se = F)+
theme_minimal()+
labs(x = "Asian population %",
y = "Mean CES Score")+
theme(legend.position = "none")
plot_6 <- ggplot(cal_msa_1_plots,
aes(x = percent_pop_other,
y = median_ces_4_0_score))+
geom_point(aes(color = county))+
geom_smooth(method = "lm", color = "red", se = F)+
theme_minimal()+
labs(x = "Unknown population %",
y = "Mean CES Score")+
theme(legend.position = "none")
# putting all race categories's plots in one grid for easy comparison.
p = list(plot_1,plot_2,plot_3,plot_4, plot_5, plot_6) %>%
map(~.x + labs(y=NULL))
grid.arrange(grobs=p, ncol = 3, nrow = 2,
left = textGrob("Mean CES Score",
rot = 90,
gp = gpar(fontsize=12)),
top = "Relationship between Different Race Category Populaton Percentages with Mean CES Scores \n Sources: CalEnviroScreen 4.0",
bottom = textGrob("Each data point represents a county",
x = 1, hjust = 1,
gp = gpar(fontface = 3L, fontsize = 9)))Interpretation: These scatter plots map the Mean CES Score on the x-axis and percentage of county-level population data by race/ethnicity category, as indicated on the y-axis and title for each plot. These six plots allow comparisons to be made between the x and y covariates, and CalEnviroScreen is the source for data categories listed for race/ethnicity. “Unknown Population,” used here, is analogous to the CalEnviroScreen variable name “other_multiple.” A detailed explanation of CES Score components can be found here: https://oehha.ca.gov/calenviroscreen/scoring-model.