Overview

Row

Project Overview

Health outcomes across Chicago are not randomly distributed. Instead, they emerge from a layered interaction between environmental exposure, socioeconomic conditions, and neighborhood-level structural factors.

This project investigates how chronic diseases vary across Chicago communities and where expected relationships between environment and health begin to break down. While pollution and population density are often used to explain disease burden, these variables alone do not fully capture the uneven patterns observed across neighborhoods.

Our central research question is:

Where do environmental and social determinants of health fail to explain health outcomes, and what might explain these gaps?

To address this, our team examines four major health indicators: obesity, hypertension, asthma, and diabetes. Each condition captures a different dimension of health risk. Obesity reflects lifestyle and access to resources, asthma is closely tied to environmental exposure, diabetes represents long-term metabolic health, and hypertension serves as a cumulative indicator of both environmental and structural stress.

Together, these measures allow us to move beyond single-variable explanations and instead identify where expected relationships break down. We introduce a Mismatch Index to capture these deviations, highlighting neighborhoods that experience either unexpectedly high or unexpectedly low health burdens.

By combining environmental, demographic, and health data, this project aims to reveal patterns of vulnerability, resilience, and inequality embedded within Chicago’s geography.

Obesity Mismatch Index

Row

Obesity Mismatch Index

Obesity Mismatch Index

Row

Resilient Neighborhoods

10

Vulnerable Neighborhoods

9

Neighborhoods Outside Expected Pattern

19
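The counts above come from sorting each neighborhood into one of three categories. The sketch below mirrors the median-split logic in the dashboard's setup code (high hardship with low obesity is "Resilient", low hardship with high obesity is "Vulnerable"). The six areas and their values are hypothetical and used only for illustration.

```r
# Hypothetical toy data (not the project's values)
toy <- data.frame(
  area         = c("A", "B", "C", "D", "E", "F"),
  hard_index   = c(10, 20, 30, 60, 70, 80),
  obesity_rate = c(20, 34, 22, 36, 24, 38),
  stringsAsFactors = FALSE
)

# City-wide medians serve as the cut points
med_hardship <- median(toy$hard_index, na.rm = TRUE)
med_obesity  <- median(toy$obesity_rate, na.rm = TRUE)

# Resilient: above-median hardship but below-median obesity
# Vulnerable: below-median hardship but above-median obesity
toy$category <- ifelse(
  toy$hard_index > med_hardship & toy$obesity_rate < med_obesity, "Resilient",
  ifelse(toy$hard_index < med_hardship & toy$obesity_rate > med_obesity,
         "Vulnerable", "Expected")
)

# Neighborhoods outside the expected pattern = resilient + vulnerable
outside_expected <- sum(toy$category != "Expected")
print(table(toy$category))
print(outside_expected)
```

Because the cut points are medians, the labels are relative: an area is resilient or vulnerable only in comparison with the other areas in the same dataset.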

Variance in Obesity Explained by Hardship

39%
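A "variance explained" figure like the one above is typically the R² of a regression. The sketch below assumes a simple linear model of obesity rate on hardship index alone; the dashboard code that produced the 39% is not shown here, so the model form is an assumption and the data are hypothetical.

```r
# Hypothetical toy data (not the project's values)
toy <- data.frame(
  hardship     = c(10, 20, 30, 40, 50, 60, 70, 80),
  obesity_rate = c(22, 30, 24, 35, 28, 40, 33, 38)
)

# Simple linear regression: obesity rate as a function of hardship
fit <- lm(obesity_rate ~ hardship, data = toy)

# R-squared is the share of variance in obesity explained by hardship
r2 <- summary(fit)$r.squared

# Format as a percentage for a value box
r2_label <- paste0(round(r2 * 100), "%")
print(r2_label)
```

With a single predictor, R² equals the squared Pearson correlation between the two variables, which gives a quick way to cross-check the number.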

Row

Neighborhoods of Vulnerability and Resilience

Diabetes-Specific Vulnerability and Resilience

Hypertension

Hypertension is used as a central indicator because it reflects long-term exposure to both environmental and structural conditions, making it a powerful measure of inequality across space.

Row

Environmental Exposure (hypertension)

ANALYSIS

PM2.5 exposure across Chicago shows clear clusters of elevated pollution in specific neighborhoods. These areas represent communities that are consistently subjected to higher environmental risk, which can contribute to long-term health consequences. The non-random distribution of pollution suggests that environmental burden is structurally embedded within the urban landscape. This pattern raises important concerns about environmental justice and unequal exposure. On average, a PM2.5 level of around 9.3 suggests moderate air quality across Chicago neighborhoods; however, a slight health risk remains for individuals who are sensitive to pollutants or prone to chronic disease. Establishing this baseline is critical for interpreting how environmental conditions shape health outcomes.

PM2.5 Map: Environmental Risk Baseline

Row

Hypertension Analysis (spatial distribution) (hypertension)

ANALYSIS

Hypertension exhibits strong spatial clustering across Chicago, with certain neighborhoods consistently recording more cases than others: Austin, for example, has 27,500 recorded cases compared with 2,000 in Riverdale. This pattern suggests that health outcomes are shaped by localized structural conditions rather than random variation. The persistence of these clusters indicates long-term exposure to risk factors such as economic stress and limited access to healthcare. Not all high-risk areas align perfectly with pollution patterns, pointing to additional underlying influences. This reinforces the importance of area-based analysis in understanding health disparities.

Spatial Distribution of Hypertension

Row

Pollution Relationship (hypertension)

ANALYSIS

The relationship between PM2.5 and hypertension shows a general downward trend across PM2.5 values from 8 to 9.8, suggesting at most a slight link between environmental exposure and health risk. However, the variability around the trend line suggests that this relationship is neither strong nor statistically reliable. Some neighborhoods experience higher-than-expected hypertension despite lower pollution levels. This indicates that additional structural or social factors are influencing outcomes. The results highlight the limitations of relying solely on environmental variables to explain health disparities.
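One way to check whether a trend line like this is statistically reliable is to test the slope of a simple linear regression. The sketch below is an illustration under assumptions: the values are hypothetical, and the dashboard's own trend line may have been fitted differently.

```r
# Hypothetical toy data (not the project's values)
toy <- data.frame(
  pm25              = c(8.7, 8.9, 9.0, 9.2, 9.3, 9.4, 9.5, 9.7),
  hypertension_rate = c(30, 22, 35, 25, 31, 20, 33, 24)
)

# Fit the trend line
fit   <- lm(hypertension_rate ~ pm25, data = toy)
coefs <- summary(fit)$coefficients

# Direction of the trend and the p-value for the slope
slope   <- coefs["pm25", "Estimate"]
p_value <- coefs["pm25", "Pr(>|t|)"]

# Pearson correlation test gives the same p-value for one predictor
ct <- cor.test(toy$pm25, toy$hypertension_rate)

print(c(slope = slope, p_value = p_value, r = unname(ct$estimate)))
```

A large p-value alongside a small correlation would support reading the trend as weak rather than as evidence of a real relationship.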

Row

Mismatch Data (hypertension)

ANALYSIS

The mismatch index highlights where observed hypertension diverges from expected patterns based on environmental and demographic factors. These deviations are spatially clustered, indicating localized influences beyond pollution. Areas with high mismatch values may experience structural disadvantages that amplify health risk. Conversely, lower-than-expected values suggest the presence of protective community factors across the city of Chicago. This approach can help provide a deeper understanding of inequality by identifying where standard explanations fall short.

Mismatch Map

Row

Conclusion (hypertension)

Hypertension across Chicago is shaped by both environmental exposure and structural inequality. While pollution contributes to risk, it does not fully explain the observed variation.

The mismatch framework reveals that health outcomes are influenced by a broader set of factors, including socioeconomic conditions and neighborhood context. These findings emphasize the need for comprehensive approaches to public health.

Future Directions (hypertension)

Future work can expand this analysis by incorporating additional variables such as access to healthcare, green space, and community-level trust.

Temporal analysis could reveal how these relationships evolve over time. More advanced spatial models may also better capture neighborhood-level effects.

Understanding these dynamics more deeply can help design targeted interventions that address both environmental and structural drivers of health inequality.

Asthma

Row

Abstract (Asthma)

Introduction: This research explores whether traffic pollution affects some Chicago neighborhoods disproportionately. According to the American Lung Association, previous findings have revealed correlations between air quality and health outcomes in the city of Chicago, predominantly affecting densely populated, disadvantaged areas. Based on this research, we hypothesize that neighborhoods with higher levels of environmental burden will have higher asthma levels.

Methods: Descriptive analysis was used to explore Chicago pollution metrics and their correlation with health outcomes. Data from the Chicago Health Atlas and Chicago Housing Authority were used to create four figures in RStudio with the ggplot2, tidyverse, janitor, and plotly packages. The figures are a leaflet map, a mismatch graph, a standard scatter plot, and an environmental burden plot. Results were drawn from interpreting these graphs.

Row

Figure 1: Leaflet Map of Asthma Burden of Chicago (asthma)

The leaflet map establishes a baseline by showing where asthma is distributed in the city. This interactive map represents asthma prevalence in 72 Chicago neighborhoods. Asthma concentration is represented by large yellow circles that decrease in size and shade from dark green (greater concentrations) to light green (lower concentrations). Viewers can zoom in on specific neighborhoods and hover over them to reveal the neighborhood name, asthma prevalence, and traffic pollution level. The map helps us identify the spatial distribution of asthma throughout the city and whether pollution and asthma are correlated in specific neighborhoods. It shows that asthma levels are relatively evenly distributed across Chicago, with some higher concentrations on the North Side, such as in Lakeview and Albany Park. However, there is no clear relationship between traffic pollution and asthma prevalence, as both vary across neighborhoods.

Row

Figure 2: Mismatch Graph of Pollution Risk (asthma)

Description: This graph compares actual asthma levels with the levels expected from environmental pollution for 77 Chicago neighborhoods. Expected asthma levels were derived from the traffic burden data. Lime green points at the top represent positive values (>0), where asthma is higher than expected. Dark green points at the bottom represent resilient neighborhoods with negative values (<0), where asthma is lower than expected. Grey circles mark neighborhoods where outcomes match expectations. Because the dataset is large, a tooltip lets the viewer inspect individual points: hovering over a point shows its data so the viewer can verify the mismatch directly.

Analysis: The graph suggests that asthma outcomes have a low-to-moderate correlation with traffic-related pollution: there is a general positive trend with a few positive outliers such as Lakeview and Austin. However, the variability in the dataset could mean that multiple factors contribute to asthma outcomes.

Pollution Risk

Row

Figure 3: Standard Scatter Plot of Asthma vs Traffic Pollution (asthma)

Description: The graph shows the relationship between traffic-related pollution and asthma prevalence across 77 Chicago neighborhoods. The x-axis represents traffic-related pollution and the y-axis represents the percentage of adults with asthma (prevalence). The color gradient darkens to a deeper green as pollution levels rise. The dashed lines represent city averages, which divide the plot into four quadrants. Each point represents a specific Chicago neighborhood, and a tooltip provides interactivity by showing the neighborhood name along with its asthma and pollution levels.

Analysis: Overall, there is a general positive relationship in the data, as points trend upward, suggesting that pollution could contribute to asthma levels. However, the points are widely scattered rather than following a clear linear relationship, meaning other factors could also contribute to asthma levels. The quadrants help reveal patterns: the top right shows expected burden, the top left shows unexpected vulnerability, the bottom right shows resilience, and the bottom left shows the highest resilience. Some outliers are observed, but there is no clear pattern. In conclusion, the graph helps the viewer see vulnerability patterns beyond environmental exposure.
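The quadrant reading described above can be reproduced by comparing each neighborhood with the two city averages. This is a minimal sketch with hypothetical values; the quadrant labels follow the wording of the analysis, and the column names are assumptions.

```r
# Hypothetical toy data (not the project's values)
toy <- data.frame(
  area      = c("A", "B", "C", "D"),
  pollution = c(100, 500, 150, 450),
  asthma    = c(12, 11, 6, 5),
  stringsAsFactors = FALSE
)

# City averages define the dashed lines that split the plot
avg_pollution <- mean(toy$pollution, na.rm = TRUE)
avg_asthma    <- mean(toy$asthma, na.rm = TRUE)

high_pollution <- toy$pollution > avg_pollution
high_asthma    <- toy$asthma > avg_asthma

# Assign each neighborhood to a quadrant
toy$quadrant <- ifelse(high_asthma & high_pollution, "Expected burden",
                ifelse(high_asthma & !high_pollution, "Unexpected vulnerability",
                ifelse(!high_asthma & high_pollution, "Resilience",
                       "Highest resilience")))
print(toy)
```

Counting the rows in each quadrant with `table(toy$quadrant)` gives a quick summary of how many neighborhoods depart from the expected pattern.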

Scatter plot

Row

Figure 4: Environmental Justice Burden Plot (asthma)

Description: The graph shows the relationship between environmental justice (EJ) burden and asthma mismatch. It plots environmental justice burden on the x-axis and asthma mismatch on the y-axis. Asthma mismatch is calculated as actual minus expected asthma levels based on traffic pollution. The graph explores whether certain neighborhoods are vulnerable and disproportionately affected by asthma. Neighborhoods above the dotted line (>0) have higher-than-expected asthma levels given their traffic pollution. Neighborhoods below the dotted line (<0) have lower-than-expected asthma levels, indicating resilience. The graph also includes an interactive feature that allows the viewer to hover over a specific neighborhood to see its name, environmental justice burden, mismatch score, asthma prevalence, and pollution level.

Analysis: This graph suggests a slight positive relationship between EJ burden and positive mismatch values. However, variability remains and the distribution appears mostly even. It is possible that neighborhood context plays a role in asthma levels.

Environmental Justice Burden and Asthma

Row

Conclusion (asthma)

Overall, I wanted to show the relationship between pollution and asthma levels. I aimed to explore whether asthma cases were higher where traffic pollution was higher and how these rates relate to environmental burden. My first figure is a leaflet map of asthma prevalence across Chicago. It showed somewhat higher concentrations of asthma in the northern region, although these differences were not significant. I made this map to establish baseline levels of asthma prevalence throughout the city. The second figure is a mismatch graph showing whether certain neighborhoods have higher or lower asthma levels than expected based on traffic pollution. The neighborhoods with higher asthma levels in Figure 1 were also the neighborhoods with more vulnerability than expected: Lakeview, Austin, and Albany Park. There was a slight positive correlation between asthma levels and pollution, with some variability. Figure 3 is a scatter plot of traffic-related pollution against asthma prevalence, with dashed lines marking the city averages. It shows whether asthma tends to increase as pollution increases, establishing a baseline relationship between the two variables. The same outliers as in the previous figures have higher asthma levels than their pollution levels would suggest, which indicates that pollution does not fully explain asthma outcomes. Figure 4 examines whether environmental justice burden predicts whether a neighborhood will have higher or lower asthma levels than expected. There is no strong pattern between the two variables, as points are widely scattered with a few outliers. For example, Lakeview has a low environmental justice burden but a high positive mismatch, while Austin has a high environmental justice burden and a high positive mismatch. These variations reveal no real pattern between the two.

In conclusion, there is no strong correlation between traffic pollution and asthma levels. However, a few neighborhoods are disproportionately affected for unknown reasons, which could potentially be attributed to traffic pollution. The results suggest that further analysis is needed to identify the causes of asthma.

Diabetes

Row

INSPECTING THE DATA

 [1] "Layer"             "Name"              "GEOID"            
 [4] "Population"        "Longitude"         "Latitude"         
 [7] "CHABXHK_2023.2024" "CHAKNKC_2023"      "CHARIPZ_2023.2024"
[10] "CHASBQJ_2023.2024" "CHASWYW_2023.2024" "CHAVCNN_2023"     
[13] "HCSNL_2023.2024"   "HCSNLP_2023.2024"  "PMC_2020"         
[16] "TRF_2020"          "LNG_2023"          "HCSOB_2023.2024"  
[19] "HCSHYT_2023.2024"  "HCSDIA_2023.2024"  "HCSATH_2023.2024" 
[22] "PCT.W_2020.2024"   "POP_2020.2024"    
           Layer           Name GEOID Population Longitude Latitude
1 Community area    Rogers Park     1      55454 -87.67017 42.00963
2 Community area   Norwood Park    10      41069 -87.80345 41.98525
3 Community area Jefferson Park    11      26201 -87.77116 41.97884
4 Community area    Forest Glen    12      19579 -87.75836 41.99394
5 Community area     North Park    13      17522 -87.72358 41.98365
6 Community area    Albany Park    14      48549 -87.72156 41.96808
  CHABXHK_2023.2024 CHAKNKC_2023 CHARIPZ_2023.2024 CHASBQJ_2023.2024
1          32.29722            0             33600          62.96973
2          46.05079            0             16300          63.68579
3          39.89650            0             15200          64.34170
4          47.45437            0             10300          77.62760
5          39.96379            0              9400          69.81005
6          41.69300            0             22100          52.11844
  CHASWYW_2023.2024 CHAVCNN_2023 HCSNL_2023.2024 HCSNLP_2023.2024 PMC_2020
1             17100            0           27000         50.62136 8.664523
2             11800            0           18600         72.77283 9.004155
3              9400            0           16400         69.32133 8.994516
4              6300            0           11000         83.17369 8.906694
5              5400            0            7800         57.81274 8.908190
6             17700            0           18000         42.46526 8.970502
  TRF_2020 LNG_2023 HCSOB_2023.2024 HCSHYT_2023.2024 HCSDIA_2023.2024
1 243.4689 5.371190           15500            15600             5200
2 427.2974 5.648500            5700             9500             4600
3 417.9095 5.111926            6900             6600             3600
4 449.4197 4.606952            3100             5600             1300
5 204.3365 5.831718            3800             6100             2300
6 231.7647 4.675789           12500            10500             8000
  HCSATH_2023.2024 PCT.W_2020.2024 POP_2020.2024
1             4600        44.90257      54023.51
2             2100        71.70276      42638.43
3               NA        55.98983      26634.54
4             1300        70.67180      19886.38
5              900        44.94954      18307.61
6             6800        34.35977      45707.82
'data.frame':   77 obs. of  23 variables:
 $ Layer            : chr  "Community area" "Community area" "Community area" "Community area" ...
 $ Name             : chr  "Rogers Park" "Norwood Park" "Jefferson Park" "Forest Glen" ...
 $ GEOID            : int  1 10 11 12 13 14 15 16 17 18 ...
 $ Population       : int  55454 41069 26201 19579 17522 48549 63038 51911 43120 14412 ...
 $ Longitude        : num  -87.7 -87.8 -87.8 -87.8 -87.7 ...
 $ Latitude         : num  42 42 42 42 42 ...
 $ CHABXHK_2023.2024: num  32.3 46.1 39.9 47.5 40 ...
 $ CHAKNKC_2023     : num  0 0 0 0 0 ...
 $ CHARIPZ_2023.2024: int  33600 16300 15200 10300 9400 22100 34600 23700 20000 7300 ...
 $ CHASBQJ_2023.2024: num  63 63.7 64.3 77.6 69.8 ...
 $ CHASWYW_2023.2024: int  17100 11800 9400 6300 5400 17700 21000 14900 14800 4200 ...
 $ CHAVCNN_2023     : num  0 0 0 0 0 ...
 $ HCSNL_2023.2024  : int  27000 18600 16400 11000 7800 18000 29200 22600 20700 4800 ...
 $ HCSNLP_2023.2024 : num  50.6 72.8 69.3 83.2 57.8 ...
 $ PMC_2020         : num  8.66 9 8.99 8.91 8.91 ...
 $ TRF_2020         : num  243 427 418 449 204 ...
 $ LNG_2023         : num  5.37 5.65 5.11 4.61 5.83 ...
 $ HCSOB_2023.2024  : int  15500 5700 6900 3100 3800 12500 23700 11500 11400 5000 ...
 $ HCSHYT_2023.2024 : int  15600 9500 6600 5600 6100 10500 19200 12800 12500 4200 ...
 $ HCSDIA_2023.2024 : int  5200 4600 3600 1300 2300 8000 6700 2900 8000 1600 ...
 $ HCSATH_2023.2024 : int  4600 2100 NA 1300 900 6800 3900 3800 1600 NA ...
 $ PCT.W_2020.2024  : num  44.9 71.7 56 70.7 44.9 ...
 $ POP_2020.2024    : num  54024 42638 26635 19886 18308 ...
    Layer               Name               GEOID      Population    
 Length:77          Length:77          Min.   : 1   Min.   :  2514  
 Class :character   Class :character   1st Qu.:20   1st Qu.: 18633  
 Mode  :character   Mode  :character   Median :39   Median : 29899  
                                       Mean   :39   Mean   : 35571  
                                       3rd Qu.:58   3rd Qu.: 45141  
                                       Max.   :77   Max.   :103048  
                                                                    
   Longitude         Latitude     CHABXHK_2023.2024  CHAKNKC_2023  
 Min.   :-87.89   Min.   :41.66   Min.   :22.71     Min.   :    0  
 1st Qu.:-87.72   1st Qu.:41.76   1st Qu.:30.83     1st Qu.:    0  
 Median :-87.67   Median :41.83   Median :38.37     Median : 2899  
 Mean   :-87.68   Mean   :41.84   Mean   :37.42     Mean   : 9832  
 3rd Qu.:-87.62   3rd Qu.:41.93   3rd Qu.:42.90     3rd Qu.:13993  
 Max.   :-87.53   Max.   :42.01   Max.   :57.15     Max.   :71934  
                                                                   
 CHARIPZ_2023.2024 CHASBQJ_2023.2024 CHASWYW_2023.2024  CHAVCNN_2023   
 Min.   : 1500     Min.   :17.28     Min.   : 1100     Min.   :  0.00  
 1st Qu.: 5300     1st Qu.:35.24     1st Qu.: 5100     1st Qu.:  0.00  
 Median :10700     Median :48.71     Median : 8200     Median : 13.01  
 Mean   :14448     Mean   :49.18     Mean   :10342     Mean   : 31.98  
 3rd Qu.:18900     3rd Qu.:62.51     3rd Qu.:13800     3rd Qu.: 69.23  
 Max.   :73700     Max.   :84.21     Max.   :36500     Max.   :100.00  
                                                                       
 HCSNL_2023.2024 HCSNLP_2023.2024    PMC_2020        TRF_2020      
 Min.   : 1100   Min.   :12.48    Min.   :8.665   Min.   :  46.79  
 1st Qu.: 5000   1st Qu.:35.80    1st Qu.:9.129   1st Qu.: 166.28  
 Median : 9800   Median :46.19    Median :9.333   Median : 251.53  
 Mean   :13018   Mean   :46.33    Mean   :9.308   Mean   : 502.64  
 3rd Qu.:18600   3rd Qu.:57.81    3rd Qu.:9.520   3rd Qu.: 567.00  
 Max.   :64400   Max.   :84.19    Max.   :9.726   Max.   :3403.19  
                                                                   
    LNG_2023      HCSOB_2023.2024 HCSHYT_2023.2024 HCSDIA_2023.2024
 Min.   : 2.410   Min.   : 1400   Min.   : 1700    Min.   :  400   
 1st Qu.: 4.744   1st Qu.: 4500   1st Qu.: 5000    1st Qu.: 1650   
 Median : 5.660   Median : 7200   Median : 7000    Median : 2950   
 Mean   : 6.362   Mean   : 9297   Mean   : 8735    Mean   : 3745   
 3rd Qu.: 8.226   3rd Qu.:11900   3rd Qu.:12500    3rd Qu.: 5275   
 Max.   :16.078   Max.   :35100   Max.   :27500    Max.   :15900   
                                                   NA's   :3       
 HCSATH_2023.2024 PCT.W_2020.2024    POP_2020.2024   
 Min.   :  300    Min.   : 0.00195   Min.   :  2293  
 1st Qu.: 1250    1st Qu.: 4.93035   1st Qu.: 18428  
 Median : 2550    Median :13.59523   Median : 29219  
 Mean   : 3057    Mean   :26.44317   Mean   : 35111  
 3rd Qu.: 4375    3rd Qu.:45.88356   3rd Qu.: 43861  
 Max.   :14000    Max.   :82.07389   Max.   :102825  
 NA's   :5                                           

Overview of Data

The dataset includes 77 Chicago community areas with 23 variables describing population, environmental exposure, and health outcomes. Initial inspection shows that diabetes is recorded as counts rather than rates, meaning that values are influenced by population size. There is also variation in environmental exposure and environmental justice burden across neighborhoods, suggesting that both environmental and structural factors differ across the city of Chicago and may contribute to uneven health outcomes.
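To see why counts and rates can tell different stories, the sketch below converts diabetes counts to cases per 100 residents for four of the community areas printed above, using the same count / population * 100 formula as the dashboard's setup code. The column name `HCSDIA` is shortened from `HCSDIA_2023.2024` for readability.

```r
# Population and diabetes counts copied from the data preview above
areas <- data.frame(
  Name       = c("Rogers Park", "Norwood Park", "Forest Glen", "Albany Park"),
  Population = c(55454, 41069, 19579, 48549),
  HCSDIA     = c(5200, 4600, 1300, 8000),
  stringsAsFactors = FALSE
)

# Convert counts to cases per 100 residents
areas$diabetes_rate <- areas$HCSDIA / areas$Population * 100

# Rank 1 = highest burden, by raw count and by rate
areas$rank_by_count <- rank(-areas$HCSDIA)
areas$rank_by_rate  <- rank(-areas$diabetes_rate)

print(areas)
```

In this small sample, Rogers Park has more cases than Norwood Park but a lower rate, so the two areas swap places once population is taken into account.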

Row

PART 3: VISUALIZATION 1: EXPECTATION (diabetes)

The expectation is that higher environmental exposure will lead to worse health outcomes, specifically higher diabetes prevalence. However, the observed relationship is weak and does not show a clear trend, suggesting that environmental exposure alone may not fully explain patterns in diabetes across neighborhoods.

Traffic Risk vs Diabetes Prevalence

Row

PART 4: VISUALIZATION 2: REALITY (MULTI-LAYERED APPROACH) (diabetes)

When additional variables are included in the model, the relationship between exposure and health becomes more complex. Park access and environmental justice burden add important context, but patterns still remain inconsistent. This suggests that health outcomes are shaped by multiple interacting environmental and structural variables.
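A common way to judge whether extra variables add explanatory power is to compare a single-predictor model with a multi-predictor one. The sketch below assumes ordinary linear regression and uses hypothetical values and column names; the dashboard's actual model is not shown here.

```r
# Hypothetical toy data (not the project's values)
toy <- data.frame(
  traffic_risk  = c(100, 200, 300, 400, 500, 600, 700, 800),
  park_access   = c(80, 60, 75, 40, 55, 30, 45, 20),
  ej_burden     = c(5, 20, 10, 40, 15, 60, 30, 70),
  diabetes_rate = c(8, 11, 9, 13, 9.5, 15, 12, 14)
)

# Exposure-only model versus a multi-layered model
fit_single <- lm(diabetes_rate ~ traffic_risk, data = toy)
fit_multi  <- lm(diabetes_rate ~ traffic_risk + park_access + ej_burden,
                 data = toy)

# Plain R-squared never decreases when predictors are added,
# so adjusted R-squared is the fairer comparison
r2_single  <- summary(fit_single)$r.squared
r2_multi   <- summary(fit_multi)$r.squared
adj_single <- summary(fit_single)$adj.r.squared
adj_multi  <- summary(fit_multi)$adj.r.squared

print(round(c(r2_single = r2_single, r2_multi = r2_multi,
              adj_single = adj_single, adj_multi = adj_multi), 3))
```

If adjusted R² barely moves after adding predictors, the additional variables provide context but little predictive gain, which matches the inconsistent patterns described above.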

Environmental Exposure Alone Does Not Explain Diabetes Patterns

Row

PART 5: VISUALIZATION 3: MISMATCH INDEX (PRIMARY ANALYSIS) (diabetes)

The mismatch index captures the difference between observed and expected diabetes outcomes based on environmental exposure. This reveals neighborhoods where diabetes prevalence is lower or higher than originally predicted. These deviations highlight areas of vulnerability and resilience that cannot be explained by exposure alone, showcasing that additional structural factors influence health outcomes.
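The observed-minus-expected idea can be sketched as the residual from a regression of diabetes rate on exposure. This example assumes that "expected" comes from a simple linear model on traffic risk, which the text does not state explicitly, and it uses hypothetical values.

```r
# Hypothetical toy data (not the project's values)
toy <- data.frame(
  area          = c("A", "B", "C", "D", "E", "F"),
  traffic_risk  = c(100, 200, 300, 400, 500, 600),
  diabetes_rate = c(8.0, 12.5, 9.0, 11.0, 15.0, 10.5),
  stringsAsFactors = FALSE
)

# Expected diabetes rate given exposure alone (assumed linear model)
fit <- lm(diabetes_rate ~ traffic_risk, data = toy)
toy$expected <- unname(fitted(fit))

# Mismatch index: observed minus expected (the regression residual)
toy$mismatch <- toy$diabetes_rate - toy$expected

# Positive mismatch = more diabetes than exposure predicts
toy$category <- ifelse(toy$mismatch > 0,
                       "Higher than expected", "Lower than expected")
print(toy)
```

Because least-squares residuals sum to zero, the index is centered: positive and negative mismatches balance out across the city, so each value is relative to the overall pattern.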

Mismatch Index

Row

PART 6: IDENTIFYING EXTREME CASES (diabetes)

Identifying neighborhoods with the most extreme mismatch values highlights patterns of inequality. Some areas show much higher diabetes prevalence than expected, indicating additional risk factors beyond environmental exposures. Other areas performed better than expected, demonstrating that both environmental and social factors influence health outcomes.
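Finding the extreme cases amounts to sorting neighborhoods by mismatch and taking both ends of the list. A minimal sketch with hypothetical mismatch values; the cutoff of three areas per group is an arbitrary choice for illustration.

```r
# Hypothetical mismatch values (not the project's results)
mm <- data.frame(
  area     = c("A", "B", "C", "D", "E", "F", "G", "H"),
  mismatch = c(2.1, -0.4, -3.2, 0.8, 4.5, -1.7, 0.1, -2.6),
  stringsAsFactors = FALSE
)

# Sort from most positive to most negative mismatch
ranked <- mm[order(mm$mismatch, decreasing = TRUE), ]

# Largest positive values: more disease than expected
most_vulnerable <- head(ranked, 3)

# Largest negative values: less disease than expected
most_resilient <- tail(ranked, 3)

print(most_vulnerable)
print(most_resilient)
```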

Neighborhoods with Strongest Mismatch

Row

PART 7: FINAL TAKEAWAY

Overall, the cleaned and transformed data show that diabetes patterns across Chicago are not fully explained by environmental exposure alone. While pollution is an important variable, it does not consistently predict health outcomes. The mismatch index calculated in Part 5 suggests that socioeconomic conditions also play a critical role, emphasizing the importance of considering several kinds of variables when studying health disparities.

---
title: "When Environment, Health, and Wealth Don’t Align"
author: ""
date: "`r Sys.Date()`"
output:
  flexdashboard::flex_dashboard:
    orientation: rows
    vertical_layout: scroll
    source_code: embed
    toc: true
    navbar_fixed: false
    theme:
      version: 4
      bootswatch: flatly
      primary: "#cc0000"
      secondary: "#48bea1"
      base_font:
        google: "Source Sans Pro"
      heading_font:
        google: "Montserrat"
---

```{css}
/* I am putting this chunk here in case I need to create a functional knitr::kable() table */
.chart-wrapper {
  overflow-y: auto;
}

/* set a custom font for banner */
.navbar { font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; }
```

```{r setup, include=FALSE}

#Note: This runs first and prepares the environment.
#Note: We suppress warnings/messages for a clean dashboard.

knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE)

#Note: Core libraries for data manipulation and visualization
library(tidyverse)
library(plotly)
library(flexdashboard)
library(broom)
library(DT)
library(leaflet)
library(readr)
library(dplyr)
library(sf)
library(ggplot2)
library(janitor)
library(scales)
library(ggrepel)
library(ggcorrplot)
library(patchwork)

#Note: SUNSET COLOR PALETTE
#Note: Consistent colors across all visuals improves readability and professionalism
sun_orange <- "#FF7A00"
sun_yellow <- "#FFC300"
sun_red    <- "#FF3B3B"
sun_pink   <- "#FF8FA3"
dark_gray  <- "#2F2F2F"

# load data
url <- "https://raw.githubusercontent.com/fitzley/miniproj2/refs/heads/main/health_data.csv"
health_data2 <- read.csv(url)

# read shapefile of Chicago
chicago_sf <- st_read("https://raw.githubusercontent.com/fitzley/miniproj2/refs/heads/main/chi_comm_areas.geojson", quiet = TRUE)

# join health data with shapefile
chicago_map <- chicago_sf %>%
  left_join(health_data2 %>%
              mutate(GEOID = as.character(GEOID)), by = c("area_numbe" = "GEOID"))

# pull hardship index from most recent census data from github repository
url2 <- "https://raw.githubusercontent.com/fitzley/miniproj2/refs/heads/main/h_index.csv"
h_index <- read.csv(url2) %>%
  slice(-1) # remove first row

# join hardship index to health + map data
chicago_map <- chicago_map %>%
  left_join(
    h_index %>%
      select(GEOID, `HDX_2020.2024`) %>%
      mutate(GEOID = as.character(GEOID),
             `HDX_2020.2024` = as.numeric(`HDX_2020.2024`)),
    by = c("area_numbe" = "GEOID")
  )

# transform data into percents
chicago_map <- chicago_map %>%
  mutate(
    diabetes_rate = HCSDIA_2023.2024 / Population * 100,
    obesity_rate = HCSOB_2023.2024 / Population * 100,
    hypertension_rate = HCSHYT_2023.2024 / Population * 100,
    asthma_rate = HCSATH_2023.2024 / Population * 100,
    hard_index = HDX_2020.2024,
    pct_white = PCT.W_2020.2024,
    disease_burden = rowMeans(cbind(diabetes_rate, obesity_rate, hypertension_rate, asthma_rate), na.rm = TRUE)
  ) %>% # add resilience and vulnerability scores
  mutate(
    resilient = hard_index > quantile(hard_index, 0.5, na.rm = TRUE) &
                disease_burden < quantile(disease_burden, 0.5, na.rm = TRUE),
    vulnerable = hard_index < quantile(hard_index, 0.5, na.rm = TRUE) &
                 disease_burden > quantile(disease_burden, 0.5, na.rm = TRUE)
  ) %>% # add obesity-specific resilience and vulnerability scores
  mutate(
    resilient_ob = hard_index > quantile(hard_index, 0.5, na.rm = TRUE) &
                    obesity_rate < quantile(obesity_rate, 0.5, na.rm = TRUE),
    vulnerable_ob = hard_index < quantile(hard_index, 0.5, na.rm = TRUE) &
                     obesity_rate > quantile(obesity_rate, 0.5, na.rm = TRUE),
    category_ob = case_when(
      resilient_ob == TRUE ~ "Resilient",
      vulnerable_ob == TRUE ~ "Vulnerable",
      TRUE ~ "Expected"
    )
  )

# remove Fuller Park and Burnside - raw number data do not match CHI Atlas rates, suspected data quality issue
chicago_map <- chicago_map %>%
  filter(!community %in% c("FULLER PARK", "BURNSIDE"))

# set up cor data
cor_data <- chicago_map %>%
  st_drop_geometry() %>%
  select(
    Diabetes = diabetes_rate,
    Obesity = obesity_rate,
    Hypertension = hypertension_rate,
    Asthma = asthma_rate,
    Pct_White = PCT.W_2020.2024,
    Air_Quality = CHASBQJ_2023.2024,
    Traffic_Risk = TRF_2020,
    Env_Justice = CHAKNKC_2023,
    Hardship = hard_index,
    Community = Name
  ) %>%
  drop_na()
```

Overview
=====================================

## Row {data-height="600"}

### Project Overview

Health outcomes across Chicago are not randomly distributed. Instead, they emerge from a layered interaction between environmental exposure, socioeconomic conditions, and neighborhood-level structural factors.

This project investigates how chronic diseases vary across Chicago communities and where expected relationships between environment and health begin to break down. While pollution and population density are often used to explain disease burden, these variables alone do not fully capture the uneven patterns observed across neighborhoods.

Our central research question is:

**Where do environmental and social determinants of health fail to explain health outcomes, and what might explain these gaps?**

To address this, our team examines four major health indicators: obesity, hypertension, asthma, and diabetes. Each condition captures a different dimension of health risk. Obesity reflects lifestyle and access to resources, asthma is closely tied to environmental exposure, diabetes represents long-term metabolic health, and hypertension serves as a cumulative indicator of both environmental and structural stress.

Together, these measures allow us to move beyond single-variable explanations and instead identify where expected relationships break down. We introduce a *Mismatch Index* to capture these deviations, highlighting neighborhoods that experience either unexpectedly high or unexpectedly low health burdens.

By combining environmental, demographic, and health data, this project aims to reveal patterns of vulnerability, resilience, and inequality embedded within Chicago’s geography.

Overall Disease Trends
=====================================

## Row {data-height="400"}

### Disease Burden by Community Area

```{r disease maps, echo=FALSE, out.width="100%", results='asis'}
# create choropleth function
make_map <- function(var, title, midpt) {
  ggplot(chicago_map) +
    geom_sf(aes(fill = .data[[var]],
                text = paste0("<b>", Name, "</b><br>",
                              title, ": ", round(.data[[var]], 1), "%"))) +
    scale_fill_gradient2(low = "#2e4f4f",
                         mid = "#f4bb8f",
                         high = "#ff0000",
                         midpoint = midpt,
                         na.value = "black",
                         name = "% Adults") +
    theme_void() +
    theme(legend.position = "none")
}

# construct maps
pd <- make_map("diabetes_rate", "Diabetes", mean(chicago_map$diabetes_rate, na.rm = TRUE))
po <- make_map("obesity_rate", "Obesity", mean(chicago_map$obesity_rate, na.rm = TRUE))
ph <- make_map("hypertension_rate", "Hypertension", mean(chicago_map$hypertension_rate, na.rm = TRUE))
pa <- make_map("asthma_rate", "Asthma", mean(chicago_map$asthma_rate, na.rm = TRUE))

# generate interactive maps of spatial patterns of disease
pd2 <- ggplotly(pd, tooltip = "text") %>% layout(autosize = TRUE)
po2 <- ggplotly(po, tooltip = "text") %>% layout(autosize = TRUE)
ph2 <- ggplotly(ph, tooltip = "text") %>% layout(autosize = TRUE)
pa2 <- ggplotly(pa, tooltip = "text") %>% layout(autosize = TRUE)

# arrange plots in a line
subplot(pd2, po2, ph2, pa2, nrows = 1) %>%
  layout(
    autosize = TRUE,
    annotations = list(
      list(x = 0.125, y = 1.0, text = "<b>Diabetes</b>",
           showarrow = FALSE, xref = "paper", yref = "paper", xanchor = "center",
           font = list(size = 14, color = "#7b4419")),
      list(x = 0.375, y = 1.0, text = "<b>Obesity</b>",
           showarrow = FALSE, xref = "paper", yref = "paper", xanchor = "center",
           font = list(size = 14, color = "#7b4419")),
      list(x = 0.625, y = 1.0, text = "<b>Hypertension</b>",
           showarrow = FALSE, xref = "paper", yref = "paper", xanchor = "center",
           font = list(size = 14, color = "#7b4419")),
      list(x = 0.875, y = 1.0, text = "<b>Asthma</b>",
           showarrow = FALSE, xref = "paper", yref = "paper", xanchor = "center",
           font = list(size = 14, color = "#7b4419"))
    )
  ) %>%
  config(displayModeBar = FALSE)
```

## Row {data-height="120"}

### Mean Diabetes Rate

```{r}
valueBox(
  value = paste0(round(mean(chicago_map$diabetes_rate, na.rm = TRUE), 1), "%"),
  caption = "Mean Diabetes Rate",
  icon = "fa-droplet",
  color = "#2d4f4f"
)
```

### Mean Obesity Rate

```{r}
valueBox(
  value = paste0(round(mean(chicago_map$obesity_rate, na.rm = TRUE), 1), "%"),
  caption = "Mean Obesity Rate",
  icon = "fa-weight-hanging",
  color = "#f4bb8f"
)
```

### Mean Hypertension Rate

```{r}
valueBox(
  value = paste0(round(mean(chicago_map$hypertension_rate, na.rm = TRUE), 1), "%"),
  caption = "Mean Hypertension Rate",
  icon = "fa-heart-pulse",
  color = "#2d4f4f"
)
```

### Mean Asthma Rate

```{r}
valueBox(
  value = paste0(round(mean(chicago_map$asthma_rate, na.rm = TRUE), 1), "%"),
  caption = "Mean Asthma Rate",
  icon = "fa-lungs",
  color = "#f4bb8f"
)
```

## Row {data-height="600"}

### Summary Table of Disease Burden {data-width=400}

```{r top diabetes neighborhoods}
chicago_map %>%
  st_drop_geometry() %>%
  filter(!is.na(diabetes_rate)) %>%
  arrange(desc(diabetes_rate)) %>%
  select(Community = community, Diabetes = diabetes_rate, Obesity = obesity_rate, Hypertension = hypertension_rate, Asthma = asthma_rate) %>%
  mutate(across(where(is.numeric), ~round(., 1))) %>%
  DT::datatable(options = list(
  pageLength = 67,
  dom = 't',
  scrollY = "450px",
  scrollCollapse = TRUE
))
```

### Spatial Determinants of Disease Burden 

``` {r chloropleth disease burden }
# calculate mean of all disease rates
chicago_map <- chicago_map %>%
  mutate(disease_burden = rowMeans(cbind(diabetes_rate, obesity_rate, hypertension_rate, asthma_rate), na.rm = TRUE))

# visualize on map
disease_burden <- ggplot(chicago_map) +
  geom_sf(aes(fill = disease_burden,
              text = paste0("<b>", Name, "</b><br>",
              "<b>Disease Burden:</b> ", round(disease_burden, 1), "%"))) +
  scale_fill_gradient2(low = "#2e4f4f", mid = "#f4bb8f", high = "#ef0300",
                       midpoint = mean(chicago_map$disease_burden, na.rm = TRUE),
                       na.value = "black", name = NULL) +
  theme_void() +
  theme(plot.title = element_text(size = 14, color = "#7b4419", face = "bold", hjust = 0.5)) +
  labs(title = "Overall Disease Burden by Community Area")

ggplotly(disease_burden, tooltip = "text") %>%
  layout(autosize = TRUE) %>%
  config(displayModeBar = FALSE)
```

## Row {data-height="120"}

### Highest Overall Burden

```{r}
top_burden <- chicago_map %>%
  st_drop_geometry() %>%
  filter(!is.na(disease_burden)) %>%
  slice_max(disease_burden, n = 1)

valueBox(
  value = paste0(round(top_burden$disease_burden, 1), "%"),
  caption = paste0("Highest Overall Burden — ", top_burden$community),
  icon = "fa-location-dot",
  color = "#7b4419"
)
```

### Lowest Overall Burden

```{r}
bot_burden <- chicago_map %>%
  st_drop_geometry() %>%
  filter(!is.na(disease_burden)) %>%
  slice_min(disease_burden, n = 1)

valueBox(
  value = paste0(round(bot_burden$disease_burden, 1), "%"),
  caption = paste0("Lowest Overall Burden — ", bot_burden$community),
  icon = "fa-location-dot",
  color = "#48bea1"
)
```

### Highest Hardship Index

```{r}
top_hardship <- chicago_map %>%
  st_drop_geometry() %>%
  filter(!is.na(hard_index)) %>%
  slice_max(hard_index, n = 1)

valueBox(
  value = round(top_hardship$hard_index, 1),
  caption = paste0("Highest Hardship — ", top_hardship$community),
  icon = "fa-location-dot",
  color = "#ed4a1a"
)
```

### Lowest Hardship Index

```{r}
bot_hardship <- chicago_map %>%
  st_drop_geometry() %>%
  filter(!is.na(hard_index)) %>%
  slice_min(hard_index, n = 1)

valueBox(
  value = round(bot_hardship$hard_index, 1),
  caption = paste0("Lowest Hardship — ", bot_hardship$community),
  icon = "fa-location-dot",
  color = "#167b2b"
)
```

## Row {data-height="450"}

### Racial Determinants of Disease Burden and Hardship

``` {r scatterplot disease burden vs hardship with percent white}
scatter <- ggplot(chicago_map, aes(x = hard_index, y = disease_burden, 
                                    color = pct_white,
                                    text = paste0("<b>", community, "</b><br>",
                                                  "Hardship: ", round(hard_index, 1), "<br>",
                                                  "Disease Burden: ", round(disease_burden, 1), "%<br>",
                                                  "% White: ", round(PCT.W_2020.2024, 1), "%"))) +
  geom_point(size = 3) +
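  # group = 1 fits a single trend line; the per-point text aesthetic would otherwise split the data into one-point groups
  geom_smooth(aes(group = 1), method = "lm", color = "black") +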
  scale_color_gradient(low = "#753D12", high = "#ffffff", name = "% White") +
  labs(x = "Hardship Index", y = "Disease Burden (%)") +
  theme_minimal()

ggplotly(scatter, tooltip = "text") %>%
  layout(autosize = TRUE) %>%
  config(displayModeBar = FALSE)
```

### Correlation Matrix

```{r corr plot}
# create a correlation matrix of all variables
cor_matrix <- cor(cor_data %>%
                    select(-Community)
                  )

p_cor <- ggcorrplot(cor_matrix,
           method = "circle",
           type = "lower",
           lab = FALSE,
           colors = c("red2", "white", "darkslategray"),
           title = "Correlation Matrix") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggplotly(p_cor) %>%
  layout(autosize = TRUE) %>%
  config(displayModeBar = FALSE)
```

## Row {data-height="120"}

### Correlation: % White & Disease Burden

```{r}
cor_val <- cor(chicago_map$PCT.W_2020.2024, chicago_map$disease_burden, use = "complete.obs")

valueBox(
  value = round(cor_val, 2),
  caption = "Correlation: % White & Disease Burden",
  icon = "fa-chart-line",
  color = "#2d4f4f"
)
```

### Correlation: % White & Hardship

```{r}
cor_val2 <- cor(chicago_map$PCT.W_2020.2024, chicago_map$hard_index, use = "complete.obs")

valueBox(
  value = round(cor_val2, 2),
  caption = "Correlation: % White & Hardship Index",
  icon = "fa-chart-line",
  color = "#ed4a1a"
)
```

### Correlation: Hardship & Disease Burden

```{r}
cor_val3 <- cor(chicago_map$hard_index, chicago_map$disease_burden, use = "complete.obs")

valueBox(
  value = round(cor_val3, 2),
  caption = "Correlation: Hardship & Disease Burden",
  icon = "fa-chart-line",
  color = "#f4bb8f"
)
```

### Number of Neighborhoods Above Trend

```{r}
model_burden <- lm(disease_burden ~ hard_index, data = chicago_map)
above <- sum(residuals(model_burden) > 0, na.rm = TRUE)

valueBox(
  value = above,
  caption = "Neighborhoods Above Expected Burden",
  icon = "fa-triangle-exclamation",
  color = "#7b4419"
)
```

## Row {data-height="600"}

### Intra-Categorical Correlation Plot: Disease

``` {r intracat corr disease plot}
# corrplot disease
cor_matrix_disease <- cor(cor_data %>%
                            select(Diabetes, Obesity, Hypertension, Asthma)
                          )

#plot with ggcorrplot
p_cor_disease <- ggcorrplot(cor_matrix_disease,
           method = "circle",
           type = "lower",
           lab = FALSE,
           colors = c("red2", "white", "darkslategray")) +
  labs(x = NULL, y = NULL) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "none")

# ggplotly it
ggplotly(p_cor_disease) %>%
  layout(autosize = TRUE) %>%
  config(displayModeBar = FALSE)

```

### Disease vs. Environmental Factors

``` {r disease v environment}

# create vectors of two categories
sub_disease <- c("Diabetes", "Obesity", "Hypertension", "Asthma")
sub_environ <- c("Pct_White", "Air_Quality", "Traffic_Risk", "Env_Justice", "Hardship")

cor_matrix2 <- cor(cor_data %>%
                    select(all_of(c(sub_disease, sub_environ))),
                  use = "complete.obs")

p_cor_matrix2 <- ggcorrplot(cor_matrix2[sub_disease, sub_environ],
           colors = c("darkslategray", "white", "red2")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        panel.grid = element_line(color = "white", linewidth = 1))

ggplotly(p_cor_matrix2) %>%
  layout(autosize = TRUE) %>%
  config(displayModeBar = FALSE)
```

### Intra-Categorical Correlation Plot: Social Determinants of Health

``` {r intra corr environmental factors plot}
cor_matrix_sdoh <- cor(cor_data %>%
                            select(Pct_White, Air_Quality, Traffic_Risk, Env_Justice, Hardship)
                          )

p_cor_sdoh <- ggcorrplot(cor_matrix_sdoh,
           method = "circle",
           type = "lower",
           lab = FALSE,
           lab_size = 3,
           colors = c("red2", "white", "darkslategray")) +
  scale_x_discrete(limits = rev) +
  scale_y_discrete(position = "right") +
  labs(x = NULL, y = NULL) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggplotly(p_cor_sdoh) %>%
  layout(autosize = TRUE) %>%
  config(displayModeBar = FALSE)

```

Obesity Mismatch Index
=====================================

## Row {data-height="450"}

### Obesity Mismatch Index

``` {r obesity residual map and coef plot}
# Fit a linear regression model: obesity ~ hardship.
# na.exclude pads rows with missing values with NA, so residuals() stays aligned with chicago_map
ob_model <- lm(obesity_rate ~ hard_index, data = chicago_map, na.action = na.exclude)
# extract coefficient table with 95% confidence intervals
ob_tbl <- tidy(ob_model, conf.int = TRUE)
# add residuals to df
chicago_map$residual <- residuals(ob_model)
# Remove intercept and plot pointranges (estimate + CI) with a vertical line at 0
ob_tbl <- ob_tbl %>%
  filter(term != "(Intercept)")

# plot linear model
p_ob_hard <- ggplot(chicago_map, aes(x = obesity_rate, y = hard_index,
                                    color = residual,
                                    text = paste0("<b>", community, "</b><br>",
                                                  "Obesity Rate: ", round(obesity_rate, 1), "%<br>",
                                                  "Hardship: ", round(hard_index, 1), "<br>",
                                                  "Residual: ", round(residual, 1)))) +
  geom_point() +
  scale_color_gradient2(low = "#48bea1", mid = "#f4bb8f", high = "#7b4419",
                        midpoint = 0) +
  theme_minimal() +
  labs(title = "Obesity Prevalence Correlates with Socioeconomic Hardship",
       x = "Obesity Rate",
       y = "Hardship Index")

# interactive scatterplot
ggplotly(p_ob_hard, tooltip = "text") %>%
  layout(autosize = TRUE) %>%
  config(displayModeBar = FALSE)

# ggplot of regression coef plot
ob_hard <- ggplot(ob_tbl, aes(x = term, y = estimate)) +
  geom_pointrange(aes( ymin = conf.low, ymax = conf.high)) +
  geom_hline(yintercept = 0, linetype = "solid", color = "#e03000") +
  coord_flip() +
  theme_minimal() +
  labs(title = "Coefficient plot with 95% CI",
       y = "Estimate",
       x = "Term")

# interactive coefficient plot 
ggplotly(ob_hard) %>%
  layout(autosize = TRUE) %>%
  config(displayModeBar = FALSE)
```

### Obesity Mismatch Index

``` {r obesity residuals}
# residual tooltip
chicago_map <- chicago_map %>%
  filter(!is.na(obesity_rate)) %>%
  mutate(
    tooltip_text = paste0(
      "<b>", community, "</b><br>",
      "Obesity Rate: ", round(obesity_rate, 1), "%<br>",
      "Predicted: ", round(obesity_rate - residual, 1), "%<br>",
      "Residual: ", round(residual, 1)
    )
  )

ob_resid <- ggplot(chicago_map) +
  geom_sf(aes(fill = residual, text = tooltip_text)) +
  scale_fill_gradient2(
    low = "#48bea1", mid = "white", high = "#7b4419",
    midpoint = 0, name = "Residual"
  ) +
  theme_minimal() +
  labs(title = "Obesity Mismatch Index",
       subtitle = "Brown = higher than expected (worse) | Teal = lower than expected (better)")

ggplotly(ob_resid, tooltip = "text") %>%
  layout(autosize = TRUE) %>%
  config(displayModeBar = FALSE)
```

## Row {data-height="120"}

### Resilient Neighborhoods

```{r}
n_resilient <- sum(chicago_map$resilient_ob == TRUE, na.rm = TRUE)
valueBox(
  value = n_resilient,
  caption = "Resilient Neighborhoods",
  icon = "fa-shield",
  color = "#48bea1"
)
```

### Vulnerable Neighborhoods

```{r}
n_vulnerable <- sum(chicago_map$vulnerable_ob == TRUE, na.rm = TRUE)
valueBox(
  value = n_vulnerable,
  caption = "Vulnerable Neighborhoods",
  icon = "fa-triangle-exclamation",
  color = "#7b4419"
)
```

### Neighborhoods Outside Expected Pattern

```{r}
n_mismatch <- sum(chicago_map$resilient_ob == TRUE | chicago_map$vulnerable_ob == TRUE, na.rm = TRUE)
valueBox(
  value = n_mismatch,
  caption = "Neighborhoods Outside Expected Pattern",
  icon = "fa-shuffle",
  color = "#a7e831"
)
```

### Variance in Obesity Explained by Hardship

```{r}
valueBox(
  value = paste0(round(100 * summary(ob_model)$r.squared), "%"),
  caption = "Variance in Obesity Explained by Hardship",
  icon = "fa-chart-line",
  color = "#167b2b"
)
```

## Row {data-height="450"}

### Neighborhoods of Vulnerability and Resilience

```{r mismatch scatterplot}
scatter_mismatch <- ggplot(chicago_map, aes(x = hard_index, y = disease_burden,
                                             color = category_ob,
                                             text = paste0("<b>", community, "</b><br>",
                                                           "Hardship: ", round(hard_index, 1), "<br>",
                                                           "Disease Burden: ", round(disease_burden, 1), "%<br>",
                                                           category_ob))) +
  geom_point(size = 3) +
  geom_vline(xintercept = median(chicago_map$hard_index, na.rm = TRUE), 
             linetype = "dashed", color = "grey50") +
  geom_hline(yintercept = median(chicago_map$disease_burden, na.rm = TRUE), 
             linetype = "dashed", color = "grey50") +
  scale_color_manual(values = c("Resilient" = "#48bea1", 
                                "Vulnerable" = "#D10300", 
                                "Expected" = "grey70")) +
  labs(x = "Hardship Index", y = "Disease Burden (%)", color = NULL) +
  theme_minimal()

ggplotly(scatter_mismatch, tooltip = "text") %>%
  layout(autosize = TRUE) %>%
  config(displayModeBar = FALSE)
```

### Obesity-Specific Vulnerability and Resilience

``` {r scatterplot of diabetes vs hardship}
scatter_obesity <- ggplot(chicago_map, aes(x = hard_index, y = obesity_rate,
                                             color = category_ob,
                                             text = paste0("<b>", community, "</b><br>",
                                                           "Hardship: ", round(hard_index, 1), "<br>",
                                                           "Obesity Rate: ", round(obesity_rate, 1), "%<br>",
                                                           category_ob))) +
  geom_point(size = 3) +
  geom_vline(xintercept = median(chicago_map$hard_index, na.rm = TRUE), 
             linetype = "dashed", color = "grey50") +
  geom_hline(yintercept = median(chicago_map$obesity_rate, na.rm = TRUE), 
             linetype = "dashed", color = "grey50") +
  scale_color_manual(values = c("Resilient" = "#48bea1", 
                                "Vulnerable" = "#7b4419", 
                                "Expected" = "grey70")) +
  labs(x = "Hardship Index", y = "Obesity Rate (%)", color = NULL) +
  theme_minimal()

ggplotly(scatter_obesity, tooltip = "text") %>%
  layout(autosize = TRUE) %>%
  config(displayModeBar = FALSE)
```

Hypertension
=====================================

Hypertension is used as a central indicator because it reflects long-term exposure to both environmental and structural conditions, making it a powerful measure of inequality across space.

## Row {data-height="600"}

```{r load-data, include=FALSE}

### Data Preparation (hypertension)

#Note: These data sets come from the Chicago Health Atlas

# load data
url3 <- "https://raw.githubusercontent.com/fitzley/miniproj2/refs/heads/main/Chi_Health_Atlas_Data(1).csv"
health_data <- read.csv(url3)
##### health_data <- read_csv("Chi_Health_Atlas_Data(1).csv")

url4 <- "https://raw.githubusercontent.com/fitzley/miniproj2/refs/heads/main/Chicago%20Health%20Atlas%20Data%20Download%20-%20Census%20Tracts%20(2).csv"
poverty_data <- read.csv(url4)

##### poverty_data <- read_csv("Chicago Health Atlas Data Download - Census Tracts (2).csv")
```

```{r clean-data, include=FALSE}
#Note: I selected only the variables needed for analysis.
#Note: This reduces errors and keeps the data set focused.

analysis_df <- health_data %>%
  select(
    community = Name,
    lat = Latitude,
    lon = Longitude,
    hypertension = `HCSHYT_2023.2024`,
    pm25 = `PMC_2020`,
    population = `POP_2020.2024`
  ) %>%
  mutate(community = toupper(str_trim(community)))

#Note: Cleaned poverty data for merging
poverty_data <- poverty_data %>%
  select(
    community = Name,
    poverty_rate = `POV_2020.2024`
  ) %>%
  mutate(community = toupper(str_trim(community)))

#Note: Joined data sets
analysis_df <- analysis_df %>%
  left_join(poverty_data, by = "community")

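#Note: Ensured poverty is numeric (prevents plotting errors).
#Note: as.character() lets parse_number() accept a column that read.csv already parsed as numeric.
analysis_df$poverty_rate <- readr::parse_number(as.character(analysis_df$poverty_rate))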
```

### Environmental Exposure (hypertension)

ANALYSIS

PM2.5 exposure across Chicago shows clear clusters of elevated pollution in specific neighborhoods. These areas represent communities that are consistently subjected to higher environmental risk, which can contribute to long-term health consequences. The non-random distribution of pollution suggests that environmental burden is structurally embedded within the urban landscape, raising concerns about environmental justice and unequal exposure. The average PM2.5 level of around 9.2 suggests moderate air quality across Chicago neighborhoods; however, a slight health risk remains for individuals who are sensitive to pollutants or prone to chronic disease. Establishing this baseline is critical for interpreting how environmental conditions shape health outcomes.

### PM2.5 Map: Environmental Risk Baseline

```{r pm25_map}
#Note: This PM2.5 map establishes environmental risk baseline

pm_df <- analysis_df %>%
  filter(!is.na(lat), !is.na(lon), !is.na(pm25))

plot_ly(
  pm_df,
  type = "scattermapbox",
  lat = ~lat,
  lon = ~lon,
  mode = "markers",
  marker = list(
    size = 10,
    color = ~pm25,
    colorscale = list(
      c(0, sun_yellow),
      c(0.5, sun_orange),
      c(1, sun_red)
    ),
    showscale = TRUE,
    line = list(color = dark_gray, width = 1)
  ),
  text = ~paste("Community:", community,
                "<br>PM2.5:", round(pm25,2))
) %>%
layout(
  mapbox = list(
    style = "carto-positron",
    zoom = 9,
    center = list(lat = 41.85, lon = -87.68)
  )
)
```


## Row {data-height="600"}

### Hypertension Analysis (spatial distribution) (hypertension)

ANALYSIS

Hypertension exhibits strong spatial clustering across Chicago, with certain neighborhoods consistently experiencing a higher burden than others; Austin, for example, recorded 27,500 cases, compared with 2,000 in Riverdale. This pattern suggests that health outcomes are shaped by localized structural conditions rather than random variation. The persistence of these clusters indicates long-term exposure to risk factors such as economic stress and limited access to healthcare. Not all high-risk areas align with pollution patterns, pointing to additional underlying influences. This reinforces the importance of area-based analysis in understanding health disparities.

### Spatial Distribution of Hypertension

```{r mapbox-hypertension}
#Note: Uses Plotly's Mapbox engine for smooth zooming and better aesthetics.

#Note: IMPORTANT:
#Note: Keep only rows with coordinates and a hypertension value to avoid NA issues.
map_df <- analysis_df %>%
  filter(!is.na(lat), !is.na(lon), !is.na(hypertension))

#Note: Create interactive map
plot_ly(
  data = map_df,
  type = "scattermapbox",
  lat = ~lat,
  lon = ~lon,
  mode = "markers",

  #Note: Color encodes hypertension intensity
  marker = list(
    size = 10,
    color = ~hypertension,
    colorscale = list(
      c(0, "#FFC300"),   # yellow
      c(0.5, "#FF7A00"), # orange
      c(1, "#FF3B3B")    # red
    ),
    showscale = TRUE,
    line = list(color = "#2F2F2F", width = 1)
  ),

  #Note: Tooltip info
  text = ~paste(
    "Community:", community,
    "<br>Hypertension:", round(hypertension, 2)
  ),
  hoverinfo = "text"
) %>%
layout(
  mapbox = list(
    style = "carto-positron",   #Note: clean grayscale base map
    zoom = 9,
    center = list(lat = 41.85, lon = -87.68)
  ),
  margin = list(l = 0, r = 0, t = 40, b = 0),
  title = "Spatial Distribution of Hypertension Across Chicago"
)
```


## Row

### Pollution Relationship (hypertension)

ANALYSIS

The relationship between PM2.5 and hypertension shows a general downward trend across the observed PM2.5 range of 8 to 9.8, suggesting that environmental exposure has at most a slight association with health risk. However, the wide variability around the trend line suggests that this relationship is weak and should not be read as a reliable effect. Some neighborhoods experience higher-than-expected hypertension despite lower pollution levels, which indicates that additional structural or social factors are influencing outcomes. The results highlight the limitations of relying solely on environmental variables to explain health disparities.
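
One way to back up a statement about the strength of a trend like this is to read the slope's confidence interval, its p-value, and the model's R² directly from the fit. The sketch below uses simulated values, not the project data; `sim`, `pm25`, and `hyp` are hypothetical names chosen for illustration.

```r
set.seed(1)

# Simulated data: a weak negative trend buried in noise (illustrative only)
sim <- data.frame(pm25 = runif(60, min = 8, max = 9.8))
sim$hyp <- 30 - 1.5 * sim$pm25 + rnorm(60, sd = 4)

fit <- lm(hyp ~ pm25, data = sim)

# Slope estimate with its 95% confidence interval
slope_ci <- confint(fit)["pm25", ]

# Share of variance in the outcome explained by the predictor
r2 <- summary(fit)$r.squared

# p-value for the slope (null hypothesis: the slope is zero)
p_slope <- coef(summary(fit))["pm25", "Pr(>|t|)"]

list(slope = unname(coef(fit)["pm25"]), ci = slope_ci,
     r_squared = r2, p_value = p_slope)
```

A confidence interval that spans zero, or an R² close to zero, would support describing the trend as weak.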

```{r pollution-scatter}
#Note: This tests the expected environmental relationship.

scatter_df <- analysis_df %>%
  filter(!is.na(pm25), !is.na(hypertension))

p <- ggplot(scatter_df, aes(pm25, hypertension)) +
  geom_point(color = sun_orange, alpha = 0.7) +
  geom_smooth(method = "lm", color = sun_red) +
  theme_minimal()

ggplotly(p)
```


## Row {data-height="450"}

### Mismatch Data (hypertension)

ANALYSIS

The mismatch index highlights where observed hypertension diverges from the pattern expected from environmental and demographic factors. These deviations are spatially clustered, indicating localized influences beyond pollution. Areas with high mismatch values may experience structural disadvantages that amplify health risk. Conversely, lower-than-expected values suggest the presence of protective community factors. By identifying where standard explanations fall short, this approach provides a deeper understanding of inequality.
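
To turn mismatch values into labels, one possible rule is to flag residuals that lie more than one standard deviation from zero. The sketch below is only an illustration under that assumption: the cutoff, the function name `classify_mismatch`, and the example residuals are hypothetical and are not the rule or data used in this project.

```r
# Label each residual as "Vulnerable", "Resilient", or "Expected".
# The one-standard-deviation cutoff (k = 1) is an assumption chosen for illustration.
classify_mismatch <- function(resid, k = 1) {
  cutoff <- k * sd(resid, na.rm = TRUE)
  ifelse(is.na(resid), NA_character_,
         ifelse(resid > cutoff, "Vulnerable",
                ifelse(resid < -cutoff, "Resilient", "Expected")))
}

# Hypothetical residuals (observed minus expected), including a missing value
example_resid <- c(-6.0, -1.2, 0.3, NA, 1.1, 5.8)
labels <- classify_mismatch(example_resid)

# Count of neighborhoods in each category, keeping missing values visible
table(labels, useNA = "ifany")
```

A stricter cutoff (for example `k = 2`) would flag fewer neighborhoods, so the choice of `k` should be stated alongside any counts of resilient or vulnerable areas.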

```{r mismatch}
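#Note: Residuals capture deviation from expected outcomes.
#Note: na.exclude pads rows with missing inputs with NA so the residuals stay aligned with analysis_df.

model <- lm(hypertension ~ pm25 + population, data = analysis_df, na.action = na.exclude)

analysis_df <- analysis_df %>%
  mutate(mismatch_index = resid(model))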

```

### Mismatch Map

```{r mismatch-map}
#Note: This plot is less detailed than intended; a richer map ran into issues, so this version was kept as the final product.

mismatch_df <- analysis_df %>%
  filter(!is.na(mismatch_index), !is.na(lat), !is.na(lon))

plot_ly(
  mismatch_df,
  x = ~lon,
  y = ~lat,
  type = "scatter",
  mode = "markers",
  color = ~mismatch_index,
  colors = c(sun_pink, sun_orange, sun_red)
)
```

## Row

### Conclusion (hypertension)
Hypertension across Chicago is shaped by both environmental exposure and structural inequality. While pollution contributes to risk, it does not fully explain the observed variation.

The mismatch framework reveals that health outcomes are influenced by a broader set of factors, including socioeconomic conditions and neighborhood context. These findings emphasize the need for comprehensive approaches to public health. 


### Future Directions (hypertension)

Future work can expand this analysis by incorporating additional variables such as access to healthcare, green space, and community-level trust.

Temporal analysis could reveal how these relationships evolve over time. More advanced spatial models may also better capture neighborhood-level effects.

Understanding these dynamics more deeply can help design targeted interventions that address both environmental and structural drivers of health inequality.


Asthma
=====================================

## Row {data-height="300"}

### Abstract (Asthma)

**Introduction: This research explores whether traffic pollution affects some Chicago neighborhoods disproportionately. According to the American Lung Association, previous findings have revealed correlations between air quality/pollution and health outcomes in the city of Chicago, predominantly affecting densely populated, disadvantaged areas. Based on this research, we hypothesize that neighborhoods with higher levels of environmental burden will be more prone to higher asthma levels.**

**Methods: Descriptive analysis was used to explore Chicago pollution metrics and their correlation with health outcomes. Data from the Chicago Health Atlas and the Chicago Housing Authority were used to create four figures in RStudio with the ggplot2, tidyverse, janitor, and plotly packages. The figures are a leaflet map, a mismatch graph, a standard plot, and an environmental burden plot. Results were drawn from interpretation of these graphs.**

```{r}
#Note: Loaded data.
url5 <- "https://raw.githubusercontent.com/fitzley/miniproj2/refs/heads/main/Chi_Health_Atlas_Data.csv"
health_data <- read.csv(url5) %>%
  clean_names()

####health_data <- read_csv("Chi_Health_Atlas_Data.csv") %>%
####  clean_names()

#Note: Kept only the columns we need. "Name" holds the name of each neighborhood.
#Note: "HCSATH_2023-2024" holds the asthma outcome for each neighborhood and is used to show real health outcomes.
#Note: "TRF_2020" holds the environmental burden caused by traffic and is used as a pollution predictor of expected health outcomes.
asthma_plot <- health_data %>%
  select(name,
    asthma = hcsath_2023_2024,
    pollution = trf_2020
  )
```

## Row {data-height="600"}

### Figure 1: Leaflet Map of Asthma Burden of Chicago (asthma)

**The leaflet map sets the baseline by showing where asthma is distributed in the city. This interactive map represents the prevalence of asthma in 72 Chicago neighborhoods. Asthma concentration is shown as large yellow circles, which decrease in size and shade from dark green (greater concentrations) to light green (lower concentrations). Viewers can zoom into specific neighborhoods and click on them to reveal the neighborhood name, asthma prevalence, and traffic pollution level. This map helps us identify the spatial distribution of asthma throughout the city and whether pollution and asthma are correlated in specific neighborhoods. It shows that asthma levels are distributed relatively evenly across Chicago, with some higher concentrations on the North Side, such as in Lakeview and Albany Park. However, there is no clear relationship between traffic pollution and asthma prevalence, as these factors vary across neighborhoods.**

### Spatial Trends of Asthma in Chicago

```{r}
#### redundant so commented
# #Note: Loaded data and cleaned it.
# health_data <- read_csv("Chi_Health_Atlas_Data.csv") %>%
#   clean_names()

#Note: Included the variables needed for the map.
map_data <- health_data %>%
  select(
    name,
    latitude,
    longitude,
    asthma = hcsath_2023_2024,
    pollution = trf_2020
  ) %>%
  drop_na()

#Note: Created a green palette with various hues based on asthma prevalence, including my color choice for the project. 
pal <- colorNumeric(
  palette = c("#e8f5e9", "#a5d6a7", "#43a047", "#1b5e20"),
  domain = map_data$asthma
)

#Note: Built interactive asthma distribution plot using the map of Illinois and surrounding areas, making 72 Chicago neighborhoods interactive.
leaflet(map_data) %>%
  addProviderTiles(providers$CartoDB.Positron) %>%
  #Note: Added grey background to emphasize neighborhoods, added neighborhood markers as well as longitude and latitude to map points to their correct location.
  addCircleMarkers(
    lng = ~longitude,
    lat = ~latitude,
    #Note: Used scale and color gradient so neighborhoods with larger/darker concentrations of asthma have larger interactive circles.
    radius = ~rescale(asthma, to = c(6, 18)),
    fillColor = ~pal(asthma),
    #Note: Customized the circles color, weight and stroke.
    fillOpacity = 0.9,
    color = "black",
    weight = 1,
    stroke = TRUE,
    #Note: Groups markers/circles together when zooming out and separate when zooming in. The legend also matches the color palette from above and matches data set.
     clusterOptions = markerClusterOptions(),
    popup = ~paste0(
      "<b>", name, "</b><br>",
      "Asthma prevalence: ", round(asthma, 2), "<br>",
      "Traffic pollution: ", round(pollution, 2)
    )
  ) %>%
  addLegend(
    "bottomright",
    pal = pal,
    values = ~asthma,
    title = "Asthma prevalence",
    opacity = 0.9
  )
```
## Row {data-height="1000"}

### Figure 2: Mismatch Graph of Pollution Risk (asthma)

**Description: This graph compares the asthma levels expected for 77 Chicago neighborhoods, based on environmental pollution, with the asthma levels actually observed. Expected asthma levels are derived from the traffic burden data. Lime green bars, at the top, show positive mismatch values (>0): neighborhoods where asthma is higher than expected. Dark green bars, at the bottom, show negative values (<0): resilient neighborhoods where asthma is lower than expected. Grey marks neighborhoods where outcomes exactly match expectations. The tooltip lets the viewer see the data behind individual plot points, which is helpful given the size of the data set. By hovering over each point, the viewer can directly verify the mismatch and confirm the underlying values.**

**Analysis: The graph suggests that asthma outcomes have a low-to-moderate correlation with traffic-related pollution: there is a general positive trend, with a few positive outliers such as Lakeview and Austin. However, the variability in the data set suggests that multiple factors contribute to asthma outcomes.**
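
Because a handful of outlying neighborhoods can drive a correlation, it can help to compare a Pearson coefficient with a rank-based Spearman coefficient. The sketch below uses made-up numbers, not the project data; `toy`, `pollution`, and `asthma` are hypothetical names, and the last two rows are deliberately set high to act as outliers.

```r
# Hypothetical traffic-burden and asthma values; the last two rows act as
# high outliers, loosely mimicking a few unusually high neighborhoods.
toy <- data.frame(
  pollution = c(1, 2, 3, 4, 5, 6, 7, 8),
  asthma    = c(8.1, 8.4, 8.0, 8.6, 8.3, 8.7, 12.5, 13.0)
)

# Pearson measures linear association and is sensitive to outliers
pearson <- cor(toy$pollution, toy$asthma, method = "pearson")

# Spearman works on ranks, so extreme values carry less weight
spearman <- cor(toy$pollution, toy$asthma, method = "spearman")

# Pearson again with the two outlying rows removed
pearson_trimmed <- cor(toy$pollution[1:6], toy$asthma[1:6])

round(c(pearson = pearson, spearman = spearman,
        pearson_trimmed = pearson_trimmed), 2)
```

If the coefficients differ noticeably, the outliers are shaping the apparent trend and the correlation is best reported alongside that caveat.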

### Pollution Risk

```{r, fig.height=20, fig.width=12}
#Note: loaded the data into R.
health_data <- read.csv(url5)

# Note: Filtered the data set to only the necessary columns, I renamed the asthma and pollution variables to create simpler names and I removed neighborhoods missing asthma or pollution data using "drop_na".
asthma_plot <- health_data %>%
  select(
    Name,
    asthma = HCSATH_2023.2024,
    pollution = TRF_2020
  ) %>%
  drop_na()

# Note: Created a linear model that predicts asthma from pollution based on the traffic burden, predicting what asthma should look like. Created a custom string for each plot point in order to incorporate interactivity, used tooltip in order to display data when hovering.
mismatch_model <- lm(asthma ~ pollution, data = asthma_plot)

asthma_plot <- asthma_plot %>%
  mutate(
    expected_asthma = predict(mismatch_model),
    mismatch = asthma - expected_asthma,
    category = case_when(
      mismatch > 0 ~ "More vulnerable than expected",
      mismatch < 0 ~ "More resilient than expected",
      TRUE ~ "About as expected"
    )
  ) %>%
  arrange(mismatch) %>%
  mutate(
    Name = factor(Name, levels = Name),
    hover_text = paste(
      "Neighborhood:", Name,
      "<br>Asthma:", round(asthma, 2),
      "<br>Expected asthma:", round(expected_asthma, 2),
      "<br>Mismatch:", round(mismatch, 2),
      "<br>Traffic burden:", round(pollution, 2),
      "<br>Category:", category
    )
  )

#Note: Created the graph, making x=mismatch value and y=neighborhood name, the fill showcases the resilience/vulnerability category.
p <- ggplot(asthma_plot, aes(x = mismatch, y = Name, fill = category, text = hover_text)) +
  
  # Note: Created horizontal bars for each neighborhood and made thinner bars in order to reduce crowding.
  geom_col(width = 0.5) +
  
  # Note: Added a vertical dashed line at 0 to separate higher than expected from lower than expected.
  geom_vline(xintercept = 0, linetype = "dashed", linewidth = 0.7, color = "black") +
  
  # Note: Included points based on pollution burden to see the amount of traffic burden in each neighborhood.
  geom_point(aes(size = pollution), color = "black", alpha = 0.65) +
  
  # Note: Added white space to prevent crowding on the x-axis bars.
  scale_x_continuous(expand = expansion(mult = c(0.08, 0.12))) +
  
  # Note: Chose my custom color theme to represent each category.
  scale_fill_manual(values = c(
    "More vulnerable than expected" = "#92F96A",
    "More resilient than expected" = "#74AC64",
    "About as expected" = "gray70"
  )) +
  
  # Note: Added titles to each axis, legends, main title and subtitle.
  labs(
    title = "Asthma Burden Beyond Expected Pollution Risk in Chicago",
    subtitle = "Positive values show neighborhoods with higher asthma than predicted from traffic-related pollution burden",
    x = "Asthma mismatch (actual - expected)",
    y = NULL,
    fill = NULL,
    size = "Traffic burden"
  ) +
  
  # Note: Used a minimal theme and customized multiple elements in order to improve readability and avoid crowding.
  scale_y_discrete(expand = expansion(mult = c(0.02, 0.02))) +
  
  theme_minimal(base_size = 12) +
  theme(
    panel.grid.major.y = element_blank(),
    panel.grid.minor = element_blank(),
    plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
    plot.subtitle = element_text(size = 11, hjust = 0.5),
    axis.text.y = element_text(size = 11),
    axis.text.x = element_text(size = 11),
    legend.title = element_text(size = 11),
    legend.text = element_text(size = 10),
    plot.margin = margin(15, 20, 15, 15)
  )
#Note: Converted static ggplot into an interactive plot by including the tooltip feature.
ggplotly(p, tooltip = "text")

```


## Row

### Figure 3: Standard Scatter Plot of Asthma vs Traffic Pollution (asthma)

**Description: The graph shows the relationship between traffic-related pollution and asthma prevalence across 77 Chicago neighborhoods. The x-axis represents traffic-related pollution and the y-axis represents the percentage of adults with asthma (prevalence). Points shade from light to dark green as pollution increases. The dashed lines mark the city averages and divide the plot into four quadrants. Each point represents one Chicago neighborhood, and hovering over a point shows a tooltip with the neighborhood name and its asthma and pollution levels.**

**Analysis: Overall, the points drift upward from left to right, a general positive relationship suggesting that pollution could contribute to asthma levels. However, the points are widely scattered rather than following a clear line, so other factors likely contribute as well. The quadrants help reveal patterns: the top right shows expected burden (high pollution, high asthma), the top left shows unexpected vulnerability (low pollution, high asthma), the bottom right shows resilience (high pollution, low asthma), and the bottom left shows neighborhoods that are low on both measures. Some outliers are observed, but they follow no clear pattern. In conclusion, the graph helps the viewer see vulnerability patterns beyond environmental exposure.**
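
The quadrant reading above can be made explicit by labeling each neighborhood relative to the two city averages. This is only a sketch: the data frame `toy` and its values are hypothetical, and the quadrant labels are illustrative names rather than categories used elsewhere in the dashboard.

```r
# Hypothetical toy data: values are invented for illustration only.
toy <- data.frame(
  name      = c("A", "B", "C", "D"),
  pollution = c(10, 40, 12, 38),
  asthma    = c(12, 13, 7, 6),
  stringsAsFactors = FALSE
)

# City-wide averages act as the dashed reference lines.
avg_pollution <- mean(toy$pollution)
avg_asthma    <- mean(toy$asthma)

# Assign each neighborhood to one of the four quadrants.
toy$quadrant <- ifelse(
  toy$asthma >= avg_asthma,
  ifelse(toy$pollution >= avg_pollution,
         "High pollution, high asthma",
         "Low pollution, high asthma"),
  ifelse(toy$pollution >= avg_pollution,
         "High pollution, low asthma",
         "Low pollution, low asthma")
)

# Count how many neighborhoods fall in each quadrant.
print(table(toy$quadrant))
```

Counting neighborhoods per quadrant gives a quick check on the visual impression: if pollution explained asthma well, most points would sit in the high/high and low/low quadrants.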

### Scatter plot

```{r}
#Note: Cleaned the column names of the already loaded data set using the janitor package.
health_data <- health_data %>%
  clean_names()

#Note: Selected only the variables needed for the graph. 
plot_data <- health_data %>%
  select(
    name,
    asthma = hcsath_2023_2024,
    pollution = trf_2020
  ) %>%
  drop_na()

#Note: Calculated averages for asthma and pollution across the city.
avg_asthma <- mean(plot_data$asthma, na.rm = TRUE)
avg_pollution <- mean(plot_data$pollution, na.rm = TRUE)

# Note: Created the base ggplot and added the x and y axes.
p <- ggplot(
  plot_data,
  aes(
    x = pollution,
    y = asthma,
    color = pollution,
    text = paste(
      "Neighborhood:", name,
      "<br>Pollution:", round(pollution, 2),
      "<br>Asthma:", round(asthma, 2)
    )
  )
) +
  
  #Note: Plotted each neighborhood as a separate data point.
  geom_point(size = 3, alpha = 0.85) +
  
  #Note: Added dashed reference lines at city averages
  geom_vline(xintercept = avg_pollution, linetype = "dashed", color = "gray40") +
  geom_hline(yintercept = avg_asthma, linetype = "dashed", color = "gray40") +
  
  #Note: Used my custom green gradient colors instead of category colors
  scale_color_gradientn(
    colors = c("#dff3e3", "#9bd49f", "#4caf50", "#1b5e20"),
    name = "Traffic Pollution"
  ) +
  
  #Note: Added axes titles, main title and labels, kept minimal theme.
  labs(
    title = "Asthma vs Traffic Pollution in Chicago Neighborhoods",
    x = "Traffic Pollution",
    y = "Asthma Prevalence"
  ) +
  

  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 15, hjust = 0.0),
    axis.title = element_text(size = 12),
    legend.position = "right"
  )

#Note: Used a tooltip to make plot interactive.
ggplotly(p, tooltip = "text")
```


## Row

### Figure 4: Environmental Justice Burden Plot (asthma)

**Description: The graph shows the relationship between environmental justice (EJ) burden and asthma mismatch. EJ burden is plotted on the x-axis and asthma mismatch on the y-axis. Asthma mismatch is the actual asthma level minus the level expected from traffic pollution. The graph explores whether certain neighborhoods are vulnerable and disproportionately affected by asthma. Neighborhoods above the dashed line (>0) have higher asthma than expected given their traffic pollution. Neighborhoods below the dashed line (<0) have lower asthma than expected and are more resilient. An interactive tooltip lets the viewer hover over a neighborhood to see its name, environmental justice burden, mismatch score, asthma prevalence, and pollution level.**

**Analysis: This graph suggests a slight positive relationship between EJ burden and asthma mismatch. However, the points are widely and fairly evenly scattered around the trend line, so the relationship is weak. It is possible that neighborhood-level conditions play a role in asthma levels.**
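
One way to put a number on a "slight" relationship like this is a correlation test between EJ burden and mismatch. The sketch below uses base R's `cor.test()` on a hypothetical data frame `toy`; the values are invented and do not come from the Chicago data.

```r
# Hypothetical toy data: values are invented for illustration only.
toy <- data.frame(
  ej_burden = c(1, 2, 3, 4, 5, 6, 7, 8),
  mismatch  = c(-1.2, 0.4, -0.6, 0.9, -0.3, 1.1, 0.2, -0.5)
)

# Pearson correlation with a p-value and confidence interval.
ej_test <- cor.test(toy$ej_burden, toy$mismatch, method = "pearson")

# Slope of the trend line, matching what geom_smooth(method = "lm") draws.
ej_slope <- coef(lm(mismatch ~ ej_burden, data = toy))[["ej_burden"]]

print(round(c(
  r       = unname(ej_test$estimate),
  p_value = ej_test$p.value,
  slope   = ej_slope
), 3))
```

A small correlation with a large p-value would support reading the trend line as weak evidence rather than a clear pattern.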

### Environmental Justice Burden and Asthma

```{r}
#Note: Loaded and cleaned data set.
#### health_data <- read_csv("Chi_Health_Atlas_Data.csv") %>%
####  clean_names()

#Note: Selected only the variables required for the graph. 
plot_data <- health_data %>%
  select(
    name,
    asthma = hcsath_2023_2024,
    pollution = trf_2020,
    ej_burden = chaknkc_2023
  ) %>%
  drop_na()

#Note: Incorporated mismatch model to calculate expected asthma levels based on traffic pollution.
mismatch_model <- lm(asthma ~ pollution, data = plot_data)

#Note: Calculated mismatch and categorized neighborhoods.
plot_data <- plot_data %>%
  mutate(
    expected_asthma = predict(mismatch_model),
    mismatch = asthma - expected_asthma,
    category = case_when(
      mismatch > 0 ~ "More vulnerable than expected",
      mismatch < 0 ~ "More resilient than expected",
      TRUE ~ "About as expected"
    ),
    hover_text = paste(
      "Neighborhood:", name,
      "<br>EJ burden:", ej_burden,
      "<br>Mismatch:", round(mismatch, 2),
      "<br>Asthma:", round(asthma, 2),
      "<br>Pollution:", round(pollution, 2),
      "<br>Category:", category
    )
  )

#Note: Created the mismatch plot.
p <- ggplot(plot_data,
            aes(x = ej_burden, y = mismatch,
                color = category,
                text = hover_text)) +
  
  #Note: Plotted points for each neighborhood.
  geom_point(size = 3, alpha = 0.8) +
  
  #Note: Added regression line to show overall trend
  geom_smooth(method = "lm", se = FALSE, color = "black", linewidth = 0.8) +
  
  #Note: Added a dashed horizontal line at zero to mark where actual asthma equals expected asthma.
  geom_hline(yintercept = 0, linetype = "dashed") +
  
  #Note:Used custom green color scheme.
  scale_color_manual(values = c(
    "More vulnerable than expected" = "#2e7d32",
    "More resilient than expected" = "#81c784",
    "About as expected" = "gray70"
  )) +
  
  #Note: Added axis labels, a title, legends and subtitles, kept minimal theme.
  labs(
    title = "Environmental Justice Burden and Asthma Mismatch",
    subtitle = "Exploring whether vulnerable neighborhoods face disproportionate health impacts",
    x = "Environmental Justice Burden",
    y = "Asthma Mismatch (Actual - Expected)",
    color = "Category"
  ) +
  
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 16, hjust = 0.5),
    plot.subtitle = element_text(size = 11, hjust = 0.5),
    legend.position = "bottom"
  )

#Note: Converted to interactive by incorporating a tooltip.
ggplotly(p, tooltip = "text")
```



## Row

### Conclusion (asthma)

**Overall, I wanted to show the relationship between pollution and asthma levels. I aimed to explore whether asthma prevalence was higher where traffic pollution was higher, and how these rates relate to environmental justice burden.**

**My first graph is a leaflet map that shows asthma prevalence in the Chicagoland area. This figure showed somewhat higher concentrations of asthma in the northern region, but these differences were not significant. I made this map to establish baseline levels of asthma prevalence throughout the city.**

**The second figure is a mismatch graph made to see whether certain neighborhoods experience higher or lower asthma levels than expected based on traffic pollution. The neighborhoods with higher asthma levels in Figure 1 were also the neighborhoods with more vulnerability than expected: Lakeview, Austin, and Albany Park. There was a slight positive correlation between asthma levels and pollution, with some variability.**

**Figure 3 is a scatter plot of traffic-related pollution against asthma prevalence, with dashed lines marking the city averages. This figure shows whether asthma tends to increase as pollution increases, establishing a baseline relationship between the two variables. The same outliers as in the previous figures have higher asthma levels than their pollution levels would predict. This suggests pollution does not fully explain asthma outcomes.**

**The final figure (4) examines whether environmental justice burden predicts whether a neighborhood will have higher or lower asthma levels than expected. There is no strong pattern between the two variables: points are widely scattered, with a few outliers. For example, Lakeview has a low environmental justice burden but a high positive mismatch, while Austin has a high environmental justice burden and a high positive mismatch. These variations reveal no real pattern between the two.**

**In conclusion, traffic pollution is at most weakly correlated with asthma levels. However, a few neighborhoods are disproportionately affected for reasons that are not yet known, although traffic pollution may still be a contributing factor. The results suggest that further analysis is needed to identify the causes of asthma.**




Diabetes
=====================================
## Row

```{r}
# PART 1: SETUP + DATA CLEANING
# Define pink/purple color palette for consistency.
pink_purple <- c("#FF4FA3","#C77DFF", "#7B2CBF", "#3A0CA3")

# Load in dataset:
df <- read.csv(url5)

#### df <- read.csv("Chi_Health_Atlas_Data (1).csv")

# Clean dataset: remove missing values for important variables 
df_clean <- df %>%
  filter(!is.na(TRF_2020),
         !is.na(HCSDIA_2023.2024),
         !is.na(PMC_2020),
         !is.na(CHAKNKC_2023))
```

### INSPECTING THE DATA

```{r data-check}
# Inspect dataset:
colnames(df)
head(df)
str(df)
summary(df)
```
### Overview of Data

The dataset includes 77 Chicago community areas with 23 variables describing population, environmental exposure, and health outcomes. Initial inspection shows that diabetes is recorded as counts rather than rates, meaning that values are influenced by population size. There is also variation in environmental exposure and environmental justice burden across neighborhoods, suggesting that both environmental and structural factors differ across Chicago and may contribute to uneven health outcomes.
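
To illustrate why counts and rates can tell different stories, the sketch below converts counts to a percentage of the population. The data frame `toy`, its column names, and its values are hypothetical and are not drawn from the dataset.

```r
# Hypothetical toy data: values are invented for illustration only.
toy <- data.frame(
  name           = c("A", "B", "C"),
  population     = c(50000, 20000, 80000),
  diabetes_count = c(5000, 3000, 6400),
  stringsAsFactors = FALSE
)

# Rate per 100 residents removes the effect of population size.
toy$diabetes_rate <- toy$diabetes_count / toy$population * 100

# Neighborhood with the most cases versus the highest rate.
most_cases   <- toy$name[which.max(toy$diabetes_count)]
highest_rate <- toy$name[which.max(toy$diabetes_rate)]

print(toy[order(-toy$diabetes_rate), ])
print(c(most_cases = most_cases, highest_rate = highest_rate))
```

In this toy table the largest neighborhood has the most cases but the lowest rate, which is the kind of distortion that comparing raw counts across neighborhoods can introduce.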


## Row {data-height="450"}

### PART 3: VISUALIZATION 1: EXPECTATION (diabetes)

The expectation is that higher environmental exposure will lead to worse health outcomes, specifically higher diabetes prevalence. However, the observed relationship is weak and does not show a clear trend, suggesting that environmental exposure alone may not fully explain patterns in diabetes across neighborhoods. 

### Traffic Risk vs Diabetes Prevalence

```{r visualization 1}
# VISUALIZATION 1: EXPECTATION
# Hypothesis: Higher environmental exposure -> worse health
# Specifically: Traffic risk should increase diabetes prevalence

ggplot(df_clean, aes(
  x = TRF_2020,
  y = HCSDIA_2023.2024
)) +
  geom_point(
    color = "#C77DFF",
    size = 3,
    alpha = 0.7
  ) +
  geom_smooth(
    method = "lm",
    color = "#7B2CBF"
  ) +
  scale_x_log10() +
  labs(
    title = "Expectation: Environmental Risk Predicts Diabetes",
    x = "Traffic Risk (Environmental Exposure)",
    y = "Diabetes Prevalence (%)"
  ) +
  theme_minimal(base_family = "Times New Roman")
```


## Row {data-height="450"}

### PART 4: VISUALIZATION 2: REALITY (MULTI-LAYERED APPROACH) (diabetes)

When additional variables are added to the plot, the relationship between exposure and health becomes more complex. Park access and environmental justice burden add important context, but the patterns remain inconsistent. This suggests that health outcomes are shaped by multiple interacting environmental and structural variables.
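
One way to check whether extra variables add explanatory power, rather than judging by eye, is to compare nested linear models. The sketch below runs on simulated data: the data frame `toy`, its column names, and the coefficients used to generate it are all assumptions made for illustration and say nothing about the real Chicago values.

```r
# Simulated toy data: generated for illustration only, not real measurements.
set.seed(1)
n <- 77
toy <- data.frame(
  traffic     = runif(n, 1, 100),
  park_access = runif(n, 0, 1),
  ej_burden   = sample(0:10, n, replace = TRUE)
)
toy$diabetes <- 8 + 0.01 * toy$traffic - 2 * toy$park_access +
  0.3 * toy$ej_burden + rnorm(n)

# Model 1: traffic exposure only (log scale, as on the plot's x-axis).
m_exposure <- lm(diabetes ~ log10(traffic), data = toy)

# Model 2: exposure plus park access and environmental justice burden.
m_full <- lm(diabetes ~ log10(traffic) + park_access + ej_burden, data = toy)

# Share of variance explained by each model.
r2 <- c(
  exposure_only = summary(m_exposure)$r.squared,
  full_model    = summary(m_full)$r.squared
)
print(round(r2, 3))

# F-test for whether the added variables improve the fit.
print(anova(m_exposure, m_full))
```

R-squared can never decrease when predictors are added to a nested model, so the F-test from `anova()` is the more informative check on whether the improvement is meaningful.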

### Environmental Exposure Alone Does Not Explain Diabetes Patterns

```{r visualization 2}
# VISUALIZATION 2: REALITY 
# Goal: Show that the relationship is more complex
# Adds multiple variables: park access + environmental justice burden
df_clean <- df_clean %>%
  mutate(
    diabetes_rate = (HCSDIA_2023.2024 / Population) * 100
  )

p <- ggplot(df_clean, aes(
  x = TRF_2020,
  y = diabetes_rate,
  color = PMC_2020,
  size = CHAKNKC_2023,
  text = paste(
    "Neighborhood:", Name,
    "<br>Traffic Risk:", TRF_2020,
    "<br>Diabetes (%):", round(diabetes_rate, 2),
    "<br>Park Access:", PMC_2020,
    "<br>Env. Justice:", CHAKNKC_2023
  )
)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", color = "#3A0CA3", se=FALSE) +
  scale_color_gradientn(
    colors = pink_purple,
    name = "Park Access (Higher = More Green Space)"
  ) +
  scale_size(
    range = c(4, 14),
    name = "Environmental Justice Burden (Higher = More Exposure)"
  ) +
  scale_x_log10() + 
  labs(
    title = "Reality: Environmental Exposure Alone Does Not Explain Diabetes Patterns",
    x = "Traffic Risk (log scale)",
    y = "Diabetes Prevalence (%)"
  ) +
  theme_minimal(base_family = "Times New Roman")

ggplotly(p, tooltip = "text")
```


## Row {data-height="600"}

### PART 5: VISUALIZATION 3: MISMATCH INDEX (PRIMARY ANALYSIS) (diabetes)

The mismatch index captures the difference between observed and expected diabetes outcomes based on environmental exposure. This reveals neighborhoods where diabetes prevalence is lower or higher than originally predicted. These deviations highlight areas of vulnerability and resilience that cannot be explained by exposure alone, showcasing that additional structural factors influence health outcomes. 

### Mismatch Index

```{r visualization 3}
# VISUALIZATION 3: MISMATCH INDEX 
# Goal: Identify where pollution fails to predict diabetes
# Method: Use regression residuals as a mismatch index
model <- lm(HCSDIA_2023.2024 ~ TRF_2020, data = df_clean)

df_clean <- df_clean %>%
  mutate(mismatch = resid(model))

ggplot(df_clean, aes(
  x = TRF_2020,
  y = mismatch,
  color = mismatch
)) +
  geom_point(size = 4, alpha = 0.9) + 
  scale_color_gradient2(
    low = "#FF4FA3",
    mid = "#C77DFF",
    high = "#3A0CA3",
    midpoint = 0,
    name = "Mismatch Index"
  ) +
  geom_hline(
    yintercept = 0,
    linetype = "dashed"
  ) +
  labs(
    title = "Mismatch Index: Where Pollution Fails to Predict Diabetes",
    x = "Traffic Risk",
    y = "Mismatch (Observed - Expected Diabetes)"
  ) +
  theme_minimal(base_family = "Times New Roman")
```

## Row {data-height="450"}

### PART 6: IDENTIFYING EXTREME CASES (diabetes)

Identifying neighborhoods with the most extreme mismatch values highlights patterns of inequality. Some areas show much higher diabetes prevalence than expected, indicating additional risk factors beyond environmental exposures. Other areas performed better than expected, demonstrating that both environmental and social factors influence health outcomes. 
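
As a sketch of how the most extreme cases could be singled out, the block below ranks neighborhoods by mismatch and keeps labels only for the top and bottom few. The data frame `toy`, its values, and the cutoff `n_extreme` are hypothetical choices made for illustration.

```r
# Hypothetical toy data: values are invented for illustration only.
toy <- data.frame(
  name     = c("A", "B", "C", "D", "E", "F"),
  mismatch = c(-3.1, 0.2, 2.8, -0.4, 1.5, -1.9),
  stringsAsFactors = FALSE
)

# Number of neighborhoods to flag at each end (an arbitrary choice).
n_extreme <- 2

# Sort from most negative to most positive mismatch.
ranked <- toy[order(toy$mismatch), ]
most_resilient  <- head(ranked, n_extreme)
most_vulnerable <- tail(ranked, n_extreme)

# Keep the name only for extreme cases; blank labels are not drawn.
extreme_names <- c(most_resilient$name, most_vulnerable$name)
toy$label <- ifelse(toy$name %in% extreme_names, toy$name, "")

print(most_resilient)
print(most_vulnerable)
```

A column like `label` could be mapped to the `label` aesthetic of `geom_text_repel()` so that only the flagged neighborhoods are named on the plot, which keeps a 77-point chart readable.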

### Neighborhoods with Strongest Mismatch

```{r identifying extreme cases}
# Highlights: Neighborhoods with strongest mismatch
# Helps identify resilience vs vulnerability 

ggplot(df_clean, aes(
  x = TRF_2020,
  y = mismatch,
  label = Name
)) +
  geom_point(color = "#C77DFF", size = 3) +
  geom_text_repel(size = 3) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  theme_minimal(base_family = "Times New Roman")
```


## Row

### PART 7: FINAL TAKEAWAY

Overall, after cleaning and transforming the data, diabetes patterns across Chicago are not fully explained by environmental exposure alone. While pollution is an important variable, it does not consistently predict health outcomes. The mismatch index calculated in Part 5 shows where exposure-based predictions fail, which suggests that other factors, such as socioeconomic conditions, also play a critical role. This emphasizes the importance of considering multiple types of variables when studying health disparities.