Milestone 4 Assignment

Visualizations

Your task is to utilize these datasets to create county-level measures to accomplish two goals. The first is to compare county asthma ED rates with a county CES measure to assess if there appears to be a correlation. The second is to compare asthma ED rates with county-level summaries for specific environmental measures to determine if those specific measures may be worth further investigation.

Please submit an Rmd or Qmd and publish an html file on RPubs with the following:

Final datasets for creation of visualization
- Join all datasets together
- Calculate any remaining data elements needed for analysis
- Show code used to create joined dataset, but please do not print full data frame output (showing data structure with str() is okay)
Visualizations (at least one per group member)
- One print quality table as requested in scenario
- One print quality plot or chart as requested in scenario
- For groups of 3, one additional print quality table or plot of your choice (can support the requested data in the scenario, or answer a different question using the same data sources)
For each visual, include
- Code used to generate visual
- Legend (if necessary)
- 1-2 sentence interpretation
- NOTE:
  - Please make sure the visual can stand-alone, meaning it includes enough information in title, legend, and footnote so that a person who sees only the visualization could understand what is being presented.
  - Please also make sure column names, axis labels, and any other labels are meaningful and not just the name of the variable (ex: “County” rather than “county_name”)
Html file that is professionally prepared for presentation and published to RPubs
- Only the necessary information is in the output (e.g., suppress entire data frame outputs but showing data structure with str() is okay)
- Show your work by “echoing” code used in Rmd/Qmd file to create your tables and graphs. {r, echo = TRUE} in code chunk preferences.
- Use of headers and sub headers to create an organized document
- For this milestone please make sure code for visualizations are included in the final file

Milestone 3 work:

CHHS Asthma

# A tibble: 6 × 6
  county   `ai/an_visits` asian_visits black_visits hispanic_visits white_visits
  <chr>             <dbl>        <dbl>        <dbl>           <dbl>        <dbl>
1 alameda            145.         22.3         175.            46.5         24.6
2 alpine               0           0             0              0            0  
3 amador               0           0             0             51.3         53.4
4 butte                0          20.0         223.            25.9         40.6
5 calaver…             0           0             0             32.3         50.7
6 colusa               0           0             0             31.8         43.8

Cal Scores

# A tibble: 6 × 9
  county    median_ces_4_0_score hispanic_pop white_pop african_america_pop
  <chr>                    <dbl>        <dbl>     <dbl>               <dbl>
1 alameda                   20.6       370874    519984              171113
2 alpine                    13.6          130       551                   9
3 amador                    19.6         5340     30035                 848
4 butte                     21.4        36916    162537                3342
5 calaveras                 16.1         5486     36409                 311
6 colusa                    27.4        12738      7576                 256
# ℹ 4 more variables: native_american_pop <dbl>, asian_american_pop <dbl>,
#   other_multiple_pop <dbl>, total_population <dbl>

Cal Measure

# A tibble: 6 × 4
  county    pm_2.5 housing_burden pm2.5_level
  <chr>      <dbl>          <dbl> <chr>      
1 alameda     8.87           15.9 Good       
2 alpine      3.05           17.4 Good       
3 amador      8.01           14.8 Good       
4 butte       8.22           18.0 Good       
5 calaveras   8.12           16.3 Good       
6 colusa      7.54           10.6 Good

Data Dictionary

Data Dictionary: Environment and Asthma in California
Variable Name	Description
Variable Name	California county that the census tract falls within
Variable Name	Annual mean PM 2.5 concentrations
Variable Name	Percent housing burdened low income households
PM 2.5 Level	Rating of average PM 2.5 level per CA county based on mean county value compared to EPA pm 2.5 level ratings
CES 4.0 Level	Category of median CES 4.0 level per CA county based on CDPH groupings
Housing Burden Level	Category of mean percent housing burdened low income households based on cutoff points set by The Justice Gap
Ai/An visits	Asthma Emergency visit rates per 10,000 Native American residents
Asian visits	Asthma Emergency visit rates per 10,000 Asian residents
Black visits	Asthma Emergency visit rates per 10,000 African American residents
Hispanic visits	Asthma Emergency visit rates per 10,000 Hispanic residents
White visits	Asthma Emergency visit rates per 10,000 White residents
Age adjusted ed visit rate	Age Adjusted Asthma Emergency Department Visit Rate for each County
CES 4.0 Score	CalEnviroScreen Score, Pollution Score multiplied by Population Characteristics Score
California County	California county that the census tract falls within
Total Population	2019 ACS population estimates in census tracts
Hispanic (%)	2019 ACS population estimates of the percent per county of those who identify as Hispanic or Latino
White (%)	2019 ACS population estimates of the percent per county of those who identify as non-Hispanic white
African American (%)	2019 ACS population estimates of the percent per county of those who identify as non-Hispanic African American or black
Native American (%)	2019 ACS population estimates of the percent per county of those who identify as non-Hispanic Native American
Asian American (%)	2019 ACS population estimates of the percent per county of those who identify as non-Hispanic Asian or Pacific Islander
Other/Multiple (%)	2019 ACS population estimates of the percent per county of those who identify as non-Hispanic "other" or as multiple races
median_ces_4_0_score	median of the census tract level CalEnviroScreen Score, also called CES 4.0 Score, for each county

Milestone 4

Joining the datasets

#joining cal_measures_cleaned, cal_scores_data, and df datasets by county

cal_ms <- inner_join(cal_measure_cleaned, cal_scores_data, by = "county")
cal_msa_1 <- inner_join(cal_ms, df, by="county")

str(cal_msa_1)

tibble [58 × 17] (S3: tbl_df/tbl/data.frame)
 $ county              : chr [1:58] "alameda" "alpine" "amador" "butte" ...
 $ pm_2.5              : num [1:58] 8.87 3.05 8.01 8.22 8.12 ...
 $ housing_burden      : num [1:58] 15.9 17.4 14.8 18 16.3 ...
 $ pm2.5_level         : chr [1:58] "Good" "Good" "Good" "Good" ...
 $ median_ces_4_0_score: num [1:58] 20.6 13.6 19.6 21.4 16.1 ...
 $ hispanic_pop        : num [1:58] 370874 130 5340 36916 5486 ...
 $ white_pop           : num [1:58] 519984 551 30035 162537 36409 ...
 $ african_america_pop : num [1:58] 171113 9 848 3342 311 ...
 $ native_american_pop : num [1:58] 5207 314 249 1699 184 ...
 $ asian_american_pop  : num [1:58] 508798 10 529 10534 693 ...
 $ other_multiple_pop  : num [1:58] 79607 25 1428 10790 1975 ...
 $ total_population    : num [1:58] 1655590 1039 38429 225817 45057 ...
 $ ai/an_visits        : num [1:58] 145 0 0 0 0 ...
 $ asian_visits        : num [1:58] 22.3 0 0 20 0 ...
 $ black_visits        : num [1:58] 175 0 0 223 0 ...
 $ hispanic_visits     : num [1:58] 46.5 0 51.3 25.9 32.3 ...
 $ white_visits        : num [1:58] 24.6 0 53.4 40.6 50.7 ...

head(cal_msa_1)

# A tibble: 6 × 17
  county    pm_2.5 housing_burden pm2.5_level median_ces_4_0_score hispanic_pop
  <chr>      <dbl>          <dbl> <chr>                      <dbl>        <dbl>
1 alameda     8.87           15.9 Good                        20.6       370874
2 alpine      3.05           17.4 Good                        13.6          130
3 amador      8.01           14.8 Good                        19.6         5340
4 butte       8.22           18.0 Good                        21.4        36916
5 calaveras   8.12           16.3 Good                        16.1         5486
6 colusa      7.54           10.6 Good                        27.4        12738
# ℹ 11 more variables: white_pop <dbl>, african_america_pop <dbl>,
#   native_american_pop <dbl>, asian_american_pop <dbl>,
#   other_multiple_pop <dbl>, total_population <dbl>, `ai/an_visits` <dbl>,
#   asian_visits <dbl>, black_visits <dbl>, hispanic_visits <dbl>,
#   white_visits <dbl>

Visualization 1: Interactive Bubble Chart showing the relationship between Annual PM 2.5 concentration and Asthma Emergency Department Visit Rates by California County

# Data Viz: Relationship between PM 2.5 and Asthma visit rates by county

chhs_asthma$county <- tolower(chhs_asthma$county)
chhs_data_1 <- chhs_asthma_filter %>%
  group_by(county, age_adjusted_ed_visit_rate) %>%
  summarise(age_adjusted_ed_visit_rate = mean(age_adjusted_ed_visit_rate), 
            .groups = "drop")%>%
  distinct(county, .keep_all = TRUE)

cal_enviro <- inner_join(cal_msa_1, chhs_data_1, by = "county")

# rounding the variables to 2 digits. 
# Filtering out the counties with 0.00 ED visit rates(assumption: the data for these counties was minimal, hence suppressed)
# Renaming the counties to start with upper case letters

cal_enviro$pm_2.5 <- round(cal_enviro$pm_2.5, 2)
cal_enviro$housing_burden <- round(cal_enviro$housing_burden, 2)
cal_msa_filter <- cal_enviro %>%
  filter(age_adjusted_ed_visit_rate != 0.000)
cal_msa_filter$county <- str_to_title(cal_msa_filter$county)

# Defining the range and size of the population and color of counties in the plot_ly

size_factor <- 0.0001  
num_counties <- 58
county_colors <- rainbow(num_counties, s = 0.6, v = 0.7)

# Inside the plot_ly function to make an interactive visualization

color = ~county_colors
 plot_ly(
  data = cal_msa_filter,
  x = ~pm_2.5,
  y = ~age_adjusted_ed_visit_rate,

  size = ~total_population * size_factor,
  color = ~county, 
  colors = county_colors,
  text = ~paste("County: ", county, 
                "<br>PM 2.5: ",  pm_2.5,
                #"<br>Housing burden: ", housing_burden,
                "<br>Population: ", total_population,
                "<br>Asthma ED rate: ", age_adjusted_ed_visit_rate),
  type = 'scatter',
  mode = 'markers',
  marker = list(sizemode = 'diameter'),
  hoverinfo = 'text') %>%
  layout(title = list(
             text = paste0('Interactive Plot: Relationship between PM 2.5 and Asthma ED Visit Rates \n in California Counties',
    '<br>',
    '<sup>',
    'Sources: CalEnviroScreen 4.0, 2021 & California Health and Human Services Department',
    '</sup>')),
    xaxis = list(title = "Annual Particulate Matter 2.5 concentrations μg/m³"),
    yaxis = list(title = "Asthma ED Visits/10,000 Residents"),
    showlegend = F,
    annotations = list(x = 1, y = -0.08,
                       text = "*Size of the bubbles represents Population size", 
                       showarrow = F, xref='paper', yref='paper', 
                       xanchor='right', yanchor='auto', xshift=0, yshift=0,
                       font=list(size=10, color="blue"))
 )

Interpretation: The data from 29 California counties in 2021 doesn’t show a clear relationship between Annual Particulate Matter 2.5 concentration, population size, and Asthma Emergency Department Visit rates. This lack of a definitive relationship suggests the possibility of confounding factors, including socio-demographic and environmental elements, contributing to high visit rates in certain counties despite small population size and low PM 2.5 concentration.

Interactive Table 1: Interactive table showing demographic and environmental data of California counties.

housing_types<- c("High", "Moderate", "Low")
housing_factors <- factor(housing_types, levels = c("High", "Moderate", "Low"))

msa_tbl <- cal_msa_1 %>%
  mutate(ces_4.0_level = case_when(median_ces_4_0_score <= 30 ~ "Low",
                                   median_ces_4_0_score <= 50 ~ "Moderate",
                                   TRUE ~ "High")) %>%
  mutate(housing_level = case_when(housing_burden <= 12 ~ housing_factors[3],
                                   housing_burden <= 16 ~ housing_factors[2],
                                   TRUE ~ housing_factors[1])) %>%
  mutate(percent_pop_hispanic = (hispanic_pop/total_population) * 100,
         percent_pop_white = (white_pop/total_population)*100,
         percent_pop_afam = (african_america_pop/total_population)*100,
         percent_pop_na = (native_american_pop/total_population)*100,
         percent_pop_aa = (asian_american_pop/total_population)*100,
         multiple_pop = (other_multiple_pop/total_population)*100) %>%
  select(county,pm2.5_level,ces_4.0_level,housing_level,percent_pop_hispanic, percent_pop_white, percent_pop_afam, percent_pop_na, percent_pop_aa, multiple_pop) %>%
  mutate(county = str_to_title(county))%>%
  mutate_if(is.numeric, ~ round(., digits = 3))

datatable(msa_tbl,
          options = list(
            pageLength=10,
            lengthMenu=c(10,20,30,40,50,60),
            columnDefs=list(
              list(className='dt-center',targets=1:3),
              list(orderable=FALSE, targets=c(1:3))
            ),
            dom = 'ltip',
            ordering=T
          ),
          rownames=FALSE,
          colnames=c("County","Average PM 2.5 Level", "Median CES 4.0 Score", "Housing Burden Level", "Population Hispanic %","Population White %", "Population Black %","Population Indigenous %", "Population Asian %", "Population Mixed Race %"),
          caption="Demographic and Environmental Data of California Counties",
          filter="top",
          editable=list(target='cell',disable=list(columns=1:3))) %>%
  formatStyle(columns = "housing_level", 
              color = styleEqual(c(housing_factors[3], housing_factors[2], housing_factors[1] ), c("burlywood","darkorange","red" ))) %>%
   formatStyle(columns = "pm2.5_level", 
              color = styleEqual(c("Good", "Moderate"), c("cadetblue","blue" ))) %>%
   formatStyle(columns = "ces_4.0_level", 
              color = styleEqual(c("Low", "Moderate"), c("darkseagreen","darkgreen" )))

Interpretation: This interactive chart shows data from all 58 California counties including average PM 2.5 levels, median CES score, average housing burden, and percentage of the county population belonging to each race, and allows comparisons between different subsets. PM 2.5 levels were defined as Good or Moderate based on EPA standards (here), CES scores were defined as Low (0-30), Moderate (30-50), and High (50+) based on data from CDPH (here), and housing burden levels were defined as low (0-12%), medium (12-16%), and high (16+%), with cutoff points based on information from The Justice Gap (here).

Visualization 3: Faceted Scatterplots showing Associations between the Percentage Population of Six Racial Categories and Average CES 4.0 Scores by County-level Data.

#calculate proportion of each race/ethnicity at county level

cal_msa_1_plots <- cal_msa_1 %>%
  mutate(percent_pop_hispanic = (hispanic_pop/total_population)*100,
         percent_pop_white = (white_pop/total_population)*100,
         percent_pop_aa = (african_america_pop/total_population)*100,
         percent_pop_na = (native_american_pop/total_population)*100,
         percent_pop_asian = (asian_american_pop/total_population)*100,
         percent_pop_other = (other_multiple_pop/total_population)*100)

#Scatter plots for correlation between each race category and its corresponding Mean CES Score

plot_1 <- ggplot(cal_msa_1_plots,
                 aes(x = percent_pop_hispanic,
                     y = median_ces_4_0_score))+
  geom_point(aes(color = county))+
  geom_smooth(method = "lm", color = "red", se = F)+
  theme_minimal()+
  labs(x = "Hispanic population %",
       y = "Mean CES Score")+
  theme(legend.position = "none")

plot_2 <- ggplot(cal_msa_1_plots, 
                 aes(x = percent_pop_white,
                     y = median_ces_4_0_score))+
  geom_point(aes(color = county))+
  geom_smooth(method = "lm", color = "red", se = F)+
   theme_minimal()+
  labs(x = "White population %",
       y = "Mean CES Score")+
  theme(legend.position = "none")


plot_3 <- ggplot(cal_msa_1_plots,
                 aes(x = percent_pop_aa, 
                     y = median_ces_4_0_score))+
  geom_point(aes(color = county))+
  geom_smooth(method = "lm", color = "red", se = F)+
   theme_minimal()+
  labs(x = "Black population %",
       y = "Mean CES Score")+
  theme(legend.position = "none")


plot_4 <- ggplot(cal_msa_1_plots, 
                 aes(x = percent_pop_na,
                     y = median_ces_4_0_score))+
  geom_point(aes(color = county))+
  geom_smooth(method = "lm", color = "red", se = F)+
   theme_minimal()+
  labs(x = "Native American population %",
       y = "Mean CES Score")+
  theme(legend.position = "none")

plot_5 <- ggplot(cal_msa_1_plots, 
                 aes(x = percent_pop_asian,
                     y = median_ces_4_0_score))+
  geom_point(aes(color = county))+
  geom_smooth(method = "lm", color = "red", se = F)+
   theme_minimal()+
  labs(x = "Asian population %",
       y = "Mean CES Score")+
  theme(legend.position = "none")

plot_6 <- ggplot(cal_msa_1_plots,
                 aes(x = percent_pop_other,
                     y = median_ces_4_0_score))+
  geom_point(aes(color = county))+
  geom_smooth(method = "lm", color = "red", se = F)+
   theme_minimal()+
  labs(x = "Unknown population %",
       y = "Mean CES Score")+
  theme(legend.position = "none")

# putting all race categories's plots in one grid for easy comparison. 

p = list(plot_1,plot_2,plot_3,plot_4, plot_5, plot_6) %>% 
    map(~.x + labs(y=NULL))

grid.arrange(grobs=p, ncol = 3, nrow = 2, 
                    left = textGrob("Mean CES Score",
                     rot = 90, 
                    gp = gpar(fontsize=12)),
     top = "Relationship between Different Race Category Populaton Percentages with Mean CES Scores \n  Sources: CalEnviroScreen 4.0",
    bottom = textGrob("Each data point represents a county",
                      x = 1,  hjust = 1, 
                      gp = gpar(fontface = 3L, fontsize = 9)))

Interpretation: These scatter plots map the Mean CES Score on the x-axis and percentage of county-level population data by race/ethnicity category, as indicated on the y-axis and title for each plot. These six plots allow comparisons to be made between the x and y covariates, and CalEnviroScreen is the source for data categories listed for race/ethnicity. “Unknown Population,” used here, is analogous to the CalEnviroScreen variable name “other_multiple.” A detailed explanation of CES Score components can be found here: https://oehha.ca.gov/calenviroscreen/scoring-model.