Assignment 4

---
title: 'Assignment 4'
subtitle: 'PHW251B: Data Visualization for Public Health'
output: 
  flexdashboard::flex_dashboard:
    vertical_layout: scroll
    theme: lumen
    social: menu
    source: embed
---


Tin Ho - PHW251b DataViz - 2024-Fall



```{r setup, include=FALSE}

  #^^^^^ text before any page definition in dashboard appears as header on every page.


#### published to https://rpubs.com/tin6150/phw_251b
#### slug cannot be reused (old one seems to remain forever, maybe future they will have edit options)
#### acc is tin6150, email tin6150@gmail.com 
#### used personal address as berkeley.edu state is up in the air after graduation, they keep changing their mind.
 
  #// Note the different arguments in lines 1-8 at the top of this RMD, which specify that the output when you knit this document will be a flex_dashboard organized in columns. In a flexdashboard RMD, the hashes (#) we have previously used to create HTML headers now dictate different parts of the dashboard:

  # Creates a new page
  
  ## Creates a column or a row (columns/rows are determined by line 6!).  These can take an additional argument for width, as seen in this document.
      
  ### Creates a subsection within that column or row

  #// Doc: https://pkgs.rstudio.com/flexdashboard/articles/layouts.html
  #// vertical_layout: fill   #// constant page size, panels' text and plots tends to overun.
  #//     vertical_layout: scroll  #// usable by hacking it, panel of text, panel for chart.

  #// story board would be similar to RMD doc single col scrolling page, if all headings are equalized to ### 

#this is an alternative way to library(package): it installs and loads packages automatically for you, if they're needed!
pacman::p_load(
  rio,             # data import/export     
  here,            # locate files
  tidyverse,       # data management and visualization
  flexdashboard,   # dashboard versions of R Markdown reports
  shiny,           # interactive figures
  plotly,          # interactive figures
  sf,              # mapping
  tmap,            # mapping
  DT               # datatable
)

library(pacman)

#p_load( xlsx )
library(openxlsx)

p_load(wesanderson)

p_load(sf)
p_load(tigris)

p_load(png)
p_load(ggspatial)

# heatmaply is plotly version of heatmap, ie add interactivity
p_load(heatmaply)
p_load(RColorBrewer)

p_load(ggplot2)


#Next, download the dataset "ebola_simulated.rds" and save it to a folder named "data" in the same folder on your local drive that contains this RMD.
#Them, import the dataset
data <- import("data/ebola_simulated.rds")

#datFile   = "../assign3/calenviroscreen40resultsdatadictionary.xlsx"
cesDatFile = "../assign3/calenviroscreen40resultsdatadictionary.xlsx"

snData = read.xlsx( cesDatFile )


# import dataset (first sheet of Excel file)
#ces_data <- read.xlsx("calenviroscreen40resultsdatadictionary.xlsx", sheet = 1) %>%
ces_data <- read.xlsx( cesDatFile, sheet = 1) %>%
    mutate(California.County = str_trim(California.County))


## removing witdth clause below
## Column 1 {data-width=550}

```


# Overview 

## Column 1 

### Introduction

Structural Determinant of Health (SDoH) are supra-individual barriers that can prevent a person from archiving the best health s/he can get.  SDoH are often found in the build environment where one lives or work, they are beyond the control of an individual to create remedial action.  An example of SDoH could be air pollution from the freeway traffic that a person may live next to.  Because of their individual-level intractability, California's Office of Environmental Health Hazard Assessment created CalEnviroScreen, a score that helps provide a simple way to understand the combined SDoH of a neighborhood for the whole of California.

In this Flexdashboard, we are going to see whether Asthma is correlated with other diseases and possibly socio-economic score as measured by CalEnviroScreen (CES 4.0, released Oct 13, 2021), with special emphasis in the 9 counties of the San Francisco Bay Area.  The Weighted Average score is calculated based on the frequency of the disease against the population size of the county.  Thru the discussion, we will demonstrate that SDoH shown by poor CES 4.0 score plays an important factor in the outcome of Asthma.    

### Figure 1: Distribution of asthma frequency by SF bay area counties

```{r, massage_ces_data}


#interactive table to peak at data
#//datatable(head(ces_data))

# import dictionary (third sheet of Excel file)
#ces_dict <- read.xlsx("calenviroscreen40resultsdatadictionary_F_2021.xlsx", sheet = 3)
#//ces_dict <- read.xlsx("calenviroscreen40resultsdatadictionary.xlsx", sheet = 3)

# rename data dictionary columns
#//colnames(ces_dict) <- c("Variable", "Description")

#interactive table to peak at dictionary
#//datatable(head(ces_dict))

# ces data summarized by county (use CORPoration to reduce search match for count)
NotUsed__ces_data_byCorp = ces_data %>% 
  group_by( California.County )  %>%
  summarize( avgAshtmaCase = mean( Asthma.Pctl, na.rm=T ))

#define bay area counties to subset data
bay_area_counties <- c("Alameda", "Contra Costa", "Marin", "Napa", "San Francisco", "San Mateo", "Santa Clara", "Solano", "Sonoma")
  
ces_data_SfBay = ces_data %>%
  filter( California.County %in% bay_area_counties )

SnFig1_boxPlot = 
  plot_ly(
    type = "box",
    color = "#A52A2A",      #  brown here doesn't show the same brown as in scatter plot
    data = ces_data_SfBay,
    x = ~Asthma.Pctl,
    y = ~California.County
  ) %>%
  layout(
    xaxis = list(title = 'Asthma Percentile'),
    yaxis = list(title = 'County')
  )

SnFig1_boxPlot

# color codes: https://r-charts.com/colors/
```

###

As seen in Figure 1 above, the burden of Asthma is not equally distributed across the 9 San Francisco bay area counties.

In Figure 2 below, we see that the pollution burden score for these counties are about the same

### Figure 2: Pollution burden score by county.

```{r, SnPlot2, warning=F}

SnFig2_boxPlot = 
  plot_ly(
    type = "box",
    color = "#A52A2A",      #  brown here doesn't show the same brown as in scatter plot
    data = ces_data_SfBay,
    x = ~Pollution.Burden.Score,
    y = ~California.County
  ) %>%
  layout(
    xaxis = list(title = 'Pollution Bruden Score'),
    yaxis = list(title = 'County')
  )

SnFig2_boxPlot
```




## Column 2


### Correlation between Asthma and CES 4.0 Score

In Figure 3 below, we show a scatter plot where can see a general correlation between high CalEnviroScreen (CES 4.0) score and the weighted average of Asthma Percentile.  
One possible explanation for such correlation is that neighborhood suffering from poor CES score have high determinant of poor, which include factors that causes more asthma.

The blue dots are non San Francisco Bay Area counties, with an overall trend represented by blue line.

The brown cross are San Francisco Bay Area counties, again with the brown line representing the trend in these 9 counties.  Note that this is a much sharper line than the rest of the California counties.  The implication is likely that because of the expense of living and density of the SF Bay area, CalEnviroScreen score is an even more drastic predictor of Asthma health bay bay area residents.  

### Figure 3: Weighted Mean Asthma Percentile vs. CalEnviroScreen 4.0 Score.


```{r, SnPlotWeigtedAvg, warning = F}

## removed data-width clause above 
## Column 2 {data-width=450}



# use the weigted average table generated in assignment 3...

table_df = ces_data %>%
#table_df = ces_data_SfBay %>%
  group_by( California.County ) %>%
  summarize( weightedAvgCes4                  = round( weighted.mean( x=CES.4.0.Score,          w=Total.Population, na.rm=T), 2),
             weightedAvgPM2.5                 = round( weighted.mean( x=PM2.5,                  w=Total.Population, na.rm=T), 2),
             weightedAvgLead                  = round( weighted.mean( x=Lead,                   w=Total.Population, na.rm=T), 2),
             weightedAvgTraffic               = round( weighted.mean( x=Traffic,                w=Total.Population, na.rm=T), 2),
             weightedAvgPollutionBurden       = round( weighted.mean( x=Pollution.Burden,       w=Total.Population, na.rm=T), 2),
             weightedAvgAsthma                = round( weighted.mean( x=Asthma,                 w=Total.Population, na.rm=T), 2),
             weightedAvgCardiovascularDisease = round( weighted.mean( x=Cardiovascular.Disease, w=Total.Population, na.rm=T), 2),
             weightedAvgHousingBurden         = round( weighted.mean( x=Housing.Burden,         w=Total.Population, na.rm=T), 2)
             )

# table_df %>% head(5)


table_pctl_df = ces_data %>%
  group_by( California.County ) %>%
  summarize( weightedAvgCes4                  = round( weighted.mean( x=CES.4.0.Percentile,          w=Total.Population, na.rm=T), 2),
             weightedAvgPM2.5                 = round( weighted.mean( x=PM2.5.Pctl,                  w=Total.Population, na.rm=T), 2),
             weightedAvgLead                  = round( weighted.mean( x=Lead.Pctl,                   w=Total.Population, na.rm=T), 2),
             weightedAvgTraffic               = round( weighted.mean( x=Traffic.Pctl,                w=Total.Population, na.rm=T), 2),
             weightedAvgPollutionBurden       = round( weighted.mean( x=Pollution.Burden.Pctl,       w=Total.Population, na.rm=T), 2),
             weightedAvgAsthma                = round( weighted.mean( x=Asthma.Pctl,                 w=Total.Population, na.rm=T), 2),
             weightedAvgCardiovascularDisease = round( weighted.mean( x=Cardiovascular.Disease.Pctl, w=Total.Population, na.rm=T), 2),
             weightedAvgHousingBurden         = round( weighted.mean( x=Housing.Burden.Pctl,         w=Total.Population, na.rm=T), 2)
            # ) %>% mutate(
            #   bay_counties = case_when( California.County %in% bay_area_counties ~ "SF Bay Area Counties" , 
            #                             TRUE ~ "Non SF Bay Area Counties" )
             )

# table_pctl_df %>% head(5)

# create 2 tables for SF Bay cities vs the rest, for use with plotly::add_markers()
`%nin%` = Negate(`%in%`)
table_pctl_df_SFcorp    = table_pctl_df %>% filter( California.County  %in% bay_area_counties ) 
table_pctl_df_nonSFcorp = table_pctl_df %>% filter( California.County %nin% bay_area_counties )

# create model for trend line
model_linear_SF    = lm( weightedAvgAsthma ~ weightedAvgCes4, data=table_pctl_df_SFcorp )
line.fmt.SF        = list(dash="solid", width = 1.5, color="brown")
model_linear_nonSF = lm( weightedAvgAsthma ~ weightedAvgCes4, data=table_pctl_df_nonSFcorp )
line.fmt.nonSF     = list(dash="solid", width = 1.5, color="blue")



# create scatter plot, for use with plotly::add_lines()
# ref: https://www.geeksforgeeks.org/getting-started-with-plotly-in-r/

# add_marker for each layer of data, ref: https://community.plotly.com/t/how-to-specify-marker-symbol-color-based-on-value/4873/2 



table_pctl_df %>%
plot_ly() %>%
  plotly::add_markers(
    data = table_pctl_df_SFcorp, 
    name = "SF Bay Area Counties",
    #which( baycounties == "SF Bay Area Counties" ), 
    y = ~weightedAvgAsthma,
    x = ~weightedAvgCes4,       # seems to have correlation
     #    color = ~bay_counties, 
    text = ~California.County,
    type = "scatter", mode = "markers", 
    marker = list(size = 12, opacity = 0.6, symbol='cross', color='#A52A2A')) %>%  # brown
  
  plotly::add_markers(
    data = table_pctl_df_nonSFcorp, 
    name = "Non SF Bay Area Counties",
    y = ~weightedAvgAsthma,
    x = ~weightedAvgCes4,       # seems to have correlation
               # y = weightedAvgPollutionBurden   # minor
               # y = weightedAvgPM2.5   # hard to see
               # y = weightedAvgCardiovascularDisease  # probably minor 
     #    color = ~bay_counties, 
    text = ~California.County,
         
    type = "scatter", mode = "markers", 
    marker = list(size = 10, opacity = 0.8, symbol='circle', color='blue')) %>%
 
   plotly::add_lines( 
    data = table_pctl_df_SFcorp,
    x    = ~weightedAvgCes4,
    y    = predict(model_linear_SF), 
    line = line.fmt.SF,
    name = "Linear Prediction 49 non SF bay area counties"
      ) %>%
 
   plotly::add_lines( 
    data = table_pctl_df_nonSFcorp,
    x    = ~weightedAvgCes4,
    y    = predict(model_linear_nonSF), 
    line = line.fmt.nonSF,
    name = "Linear Prediction 9 SF bay area counties "
      ) %>%
 
   layout(  #title = "Weighted Mean Asthma Percentile vs. CalEnviroScreen 4.0 Score",  # place title in section/panel header
        yaxis = list(title = "Weighted Average Asthma Percentile" ),
        xaxis = list(title = "Weighted Average CES 4.0 score" ),
        legend = list(
                     orientation = "v",    # show entries h = horizontally
                     xanchor = "left",     # alt: "center" of legend as anchor
                     y = .06, 
                     x = .55)              # position of legend in range 0..1
        )

```




### Comorbidity by SF Bay Are counties 


In the following heattile, we see frequency of various comorbidity in the 9 SF Bay Area counties.  
While we see that various indicator such as traffic and pollution burden varies among Bay area counties, they are not covariating with Asthma.  
This again lead to more weight that socio-economic factor measured in the CalEnviroScreen score, and that social determinants, play an important role in the outcome of asthma.  

### FIgure 4: Heat tiles of health burden in Bay Area counties

```{r, SnPlotHeatTile, warning = F}

# heattile, aka heatmap
# heat map of county vs multiple weighted avg result ... that should be doable.

# p_load(rcartocolor)


ERASE_OLD__table_SfBay = table_df %>%
  filter( California.County %in% bay_area_counties )


table_SfBay = table_pctl_df %>%
  filter( California.County %in% bay_area_counties ) %>%
  select( California.County,  
          weightedAvgPollutionBurden,
          weightedAvgTraffic,
          weightedAvgPM2.5,                
          weightedAvgLead,           
          weightedAvgAsthma,
          #weightedAvgCardiovascularDisease,
  ) %>%
  rename( # new = old name
          `Weighted Mean Pollution Burden` = weightedAvgPollutionBurden,
          `Weighted Mean Traffic` = weightedAvgTraffic,
          `Weighted Mean PM 2.5` = weightedAvgPM2.5,                
          `Weighted Mean Lead` = weightedAvgLead,           
          `Weighted Mean Asthma` = weightedAvgAsthma,
          #"weightedAvgHousingBurden"  
  )


# table_matrix = as.matrix(table_df)
# remove non numerical column, and column which not interested in the data (traffic skew the data too much)
table_matrix = table_SfBay %>%
  select( - California.County ) %>%  # weightedAvgTraffic
  as.matrix()
rownames( table_matrix ) = table_SfBay$California.County


# ref: https://www.datanovia.com/en/blog/how-to-create-a-beautiful-interactive-heatmap-in-r/#change-color-palettes
# heatmaply is plotly version of heatmap, ie add interactivity
# p_load(heatmaply)
# p_load(RColorBrewer)

# more color palette--named sequential ones are tried and tested :) 
# https://stackoverflow.com/questions/57153428/r-plot-color-combinations-that-are-colorblind-accessible
# https://colorspace.r-forge.r-project.org/articles/approximations.html


heatmaply(  table_matrix, 
            #colors = viridis(n = 256,  option = "magma"),
            #colors = colorRampPalette(brewer.pal(3, "RdBu"))(256),
            colors = colorRampPalette(brewer.pal(3, "BuPu"))(256),
            #colors =  colorRampPalette(rcartocolor::carto_pal(7, "ag_Sunset"))(256),
            dendrogram = c("none")
) %>%
   layout(  
     title = " " # want some space at top, but didn't do the trick.  
        )


```

### Conclusion

As we have seen above, 
the overall trend of poor CalEnviroScreen score and poor health outcome in Asthma is correlated.
In the SF Bay area, Solano county has the highest CES 4.0 score and the highest weighted average asthma percentile.  When compared to the general trend line of CES score vs Asthma percentile, Solano county is almost an outlier, laying above the trend line among all 9 Bay area counties (note that San Francisco and San Mateo county has almost exact same CES:Asthma ratio, the data points overlay exactly).  Solano county is not that far from Marin county.  As a community, to tackle Asthma as a disease burden of the community, inequalities and SDoH need to be removed to best accomplish this goal.   

While detail and nuances exist, and the following two pages provide additional tables and figures for further data exploration, 
we should start to address CES score disparity and their effects on health.
More research  could help, but mitigation actions should start now and not wait for "all" details to be ironed out, least we aggravate the problem to point of no return.



### 

&nbsp;
&nbsp;
&nbsp;
&nbsp;
&nbsp;
&nbsp;
&nbsp;
&nbsp;





# Data Explorer (SF Bay Area counties)

###

```{r, fig.height = 4.8}

## css-clause removed.  see stock RMD line 165
## {r, fig.length = 9}
## {r, fig.height = 9}   # these maybe relative to 10
## https://pkgs.rstudio.com/flexdashboard/articles/using.html#sizing says determine by experimentation, without much more elaboration!


table_pctl_df_withRename = table_pctl_df %>%
    rename(
          `Weighted Mean Pollution Burden` = weightedAvgPollutionBurden,
          `Weighted Mean Traffic` = weightedAvgTraffic,
          `Weighted Mean PM 2.5` = weightedAvgPM2.5,
          `Weighted Mean Lead` = weightedAvgLead,
          `Weighted Mean Asthma` = weightedAvgAsthma,
  ) %>%
  rename(
          `County` = California.County,
          `Weighted Mean CalEnviroScreen Score`  = weightedAvgCes4,
          `Weighted Mean Housing Burden`         = weightedAvgHousingBurden,
          `Weighted Mean Cardiovascular Disease` = weightedAvgCardiovascularDisease
    )


datatable(table_pctl_df_withRename %>%  filter( County %in% bay_area_counties ), 
          #remove numbered index
          rownames = F,
          extensions = 'Buttons', options = list(dom = 't'),
          height = "fit-content") %>%
  #bold the county column
  formatStyle("County", fontWeight = "bold") %>%
    #round the CES column to one decimal place
  formatRound("Weighted Mean CalEnviroScreen Score", 1)

# 2.4 DOM elements describe options https://rstudio.github.io/DT/options.html#:~:text=
```

###

```{r, fig.height = 5}

## {r, barPlot_9_corporations, include=T}


#p_load(gcookbook)
#p_load(scales)

##  pivot_longer( c("weightedAvgPM2.5", "weightedAvgLead", "weightedAvgTraffic", "weightedAvgPollutionBurden", "weightedAvgAsthma", "weightedAvgCardiovascularDisease", "weightedAvgHousingBurden" ),

#  ggplotly() result doesn't seems to work well in the dashboard, panel size is NOT correct and cut off things!
table_pctl_long = table_pctl_df %>%
    rename(
          `Pollution Burden` = weightedAvgPollutionBurden,
          `Traffic` = weightedAvgTraffic,
          `PM 2.5` = weightedAvgPM2.5,
          `Lead` = weightedAvgLead,
          `Asthma` = weightedAvgAsthma,
  ) %>% 
  pivot_longer( c("PM 2.5", "Lead", "Traffic", "Pollution Burden", "Asthma",  ),
                names_to="Burden",
                values_to="Percentile_Value") 

# View(table_pctl_long)
# table_pctl_long %>% head(5)


table_pctl_long_9corp = table_pctl_long %>%
  filter( California.County %in% bay_area_counties )

allBurden9_Plot =
table_pctl_long_9corp %>%
  ggplot( aes( x=California.County, 
                y=Percentile_Value, 
               #colour = weightedAvgCes4, 
  ) ) + 
  geom_col( 
    aes(fill=weightedAvgCes4 ), 
            ) + 
  #scale_fill_brewer() +
  #scale_colour_gradientn(colours = c("darkred", "orange", "yellow", "white")) +
  #scale_fill_gradientn(colours = c("darkred", "orange", "yellow", "white")) +
  #scale_fill_gradient(colours = c("yellow", "orange" )) +
  #scale_fill_viridis_c( option = "magma" ) +
  scale_fill_distiller( palette = rev("RdPu") ) +  # rev() was supposed to reverse the gradient scale.
  coord_flip() +
  labs(
    title = "Weighted Average Health Burden by County",
    fill="Weighted Mean CES 4.0 Score"
  ) +
  ylab( "Weighted Mean Percentile" ) +
  xlab( "County" ) +
  #guides(fill=guide_legend(title="Weighted Mean CES 4.0 Score")) +
  facet_wrap(  ~Burden, ncol = 3 ) 


## allBurden9_Plot
ggplotly( allBurden9_Plot )

# fill by continuous variable, #12.6  https://r-graphics.org/recipe-colors-palette-continuous 
# https://ggplot2-book.org/scales-colour.html#sec-particular-palettes

# RColorBrewer::display.brewer.all()

```

&nbsp;


# Data Explorer (All 58 CA counties)

###

```{r, fig.height = 9.5}

# rm css clause
# rm {r, fig.length = 15}
#     {r, fig.length = 58}

datatable(table_pctl_df_withRename, 
          #remove numbered index
          rownames = F,
          #add buttons to export the data within the html (cool!)
          #note that the `dom` argument can accept many, check out documentation for more
          #by calling `Btp`, we keep buttons (B), the table itself (t), and pagination control (p)  ; f=filter/search option
          extensions = 'Buttons', options = list(dom = 'Btplf',
                                                 buttons = c('copy', 
                                                             'csv', 
                                                             'excel', 
                                                             'pdf', 
                                                             'print'),
                                                 #how many entries per page
                                                 pageLength = 20,   # 15 dont work well, leave gab in table above control menu
                                                 #options for "Show how many entries"
                                                 lengthMenu = c(10, 20, 30, 40, 50, 58)),
          height = "fit-content") %>%
  #bold 
  formatStyle("County", fontWeight = "bold") %>%
  #round the CES column to one decimal place
  formatRound("Weighted Mean CalEnviroScreen Score", 1)



```