The Impact of Community Design on Public Health: A Focus on Obesity and COPD

Introduction

Chronic health issues like obesity and Chronic Obstructive Pulmonary Disease (COPD) are major concerns in the United States. The obesity rate currently sits at 42%, with associated healthcare costs exceeding $173 billion annually [5, 6]. COPD, which is often linked to smoking and frequently occurs alongside heart disease, further burdens the healthcare system. Research suggests a connection between community design and the prevalence of these health problems. Studies have shown that adults living in walkable neighborhoods with good street connectivity and green spaces tend to engage in more physical activity, have lower BMIs, and potentially experience better overall heart health [1]. However, historical policies like redlining have disproportionately affected communities of color, often leaving neighborhoods with limited walkability, which hinders physical activity and may contribute to higher health risks [1].

This project examines the relationship between community design and public health outcomes, focusing specifically on obesity and COPD rates. By examining how walkability and design features relate to physical activity levels and overall health, this research aims to highlight the potential of community design as a tool for promoting public health and reducing healthcare burdens.

Facts

Adults in the US who live in highly walkable neighborhoods are more likely to get the recommended amount of physical activity, walk more often, and have a lower BMI (Morris, 2023).

This is important because in many areas of the US, past racial segregation and policies such as redlining have reduced walkability, street connectivity, and green space in neighborhoods where many people of color live (Morris, 2023).

Chronic obstructive pulmonary disease (COPD) is a lung disease that causes obstructed airflow and coughing. People with COPD often also have heart disease, which makes the prevalence of COPD a possible indicator of a population’s overall heart health (Mayo Clinic, 2020).

COPD and heart disease often occur together in the same patient. According to some research, individuals with COPD are twice as likely to develop cardiovascular problems, and smoking is often a contributing factor (Harvard Health, 2022).

Load Libraries/Set Directory

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(sf)
Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
setwd("/Users/blossomanyanwu/Documents/MATH 217 HM")

Load in Data sets

community<-read_csv("finalmerge.csv")
New names:
• `` -> `...1`
Rows: 1406 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): State, Census Tract, copdrates, 95% Confidence Interval, Confidence...
dbl (6): ...1, StateFIPS, CensusTract, Year, Number, parkdistancepopulation
lgl (1): ...11
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
walkability<- read_csv("marylandwalk.csv")
New names:
• `` -> `...1`
Rows: 3926 Columns: 118
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (2): CSA_Name, CBSA_Name
dbl (116): ...1, OBJECTID, GEOID10, GEOID20, STATEFP, COUNTYFP, TRACTCE, BLK...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Clean Datasets

Use gsub() to remove the percent signs from the percentage columns so they can be converted to numbers.

Visualization 1 (Walkability Data)

# This scatter plot shows the relationship between the working-age population of a census tract (P_WrkAge) and the tract's Walkability Index (NatWalkInd)
ggplot(walkability, aes(x = P_WrkAge, y = NatWalkInd)) +
  geom_point(alpha = 0.5) +
  labs(x = "Working-age population (P_WrkAge)", y = "Walkability Score") +
  theme_minimal()

Visual 2: Walkability Index Scores versus Percentage of Households Owning Two or More Cars

# Pct_AO2p: % of homes in the census tract that own 2+ cars
ggplot(walkability, aes(x = Pct_AO2p, y = NatWalkInd)) +
  geom_point(alpha = 0.5) +
  labs(x = "% of 2+ Car Owning Homes", y = "Walkability Score") +
  theme_minimal()

Visual 3: Workers earning less than 1250 USD monthly and Walkability Index

ggplot(walkability, aes(x = E_LowWageWk, y = NatWalkInd)) +
  geom_point(alpha = 0.5) +
  labs(x = "Low-wage workers, under 1250 USD monthly (E_LowWageWk)", y = "Walkability Score") +
  theme_minimal()

Preliminary Linear Model

model2 <- lm(NatWalkInd ~ E_LowWageWk, data = walkability)  # y ~ x represents dependent variable ~ independent
summary(model2)

Call:
lm(formula = NatWalkInd ~ E_LowWageWk, data = walkability)

Residuals:
    Min      1Q  Median      3Q     Max 
-11.778  -3.867   0.559   3.381   8.895 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.010e+01  7.240e-02  139.54   <2e-16 ***
E_LowWageWk 2.737e-03  2.171e-04   12.61   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.123 on 3924 degrees of freedom
Multiple R-squared:  0.03894,   Adjusted R-squared:  0.03869 
F-statistic:   159 on 1 and 3924 DF,  p-value: < 2.2e-16
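
For a regression with a single predictor, R-squared equals the square of the Pearson correlation between the two variables, so the reported R-squared of 0.03894 implies a correlation of only about 0.2 between E_LowWageWk and NatWalkInd (positive, since the slope is positive). The sketch below checks that identity on simulated data; the variables x and y are made up for illustration and are not drawn from marylandwalk.csv.

```r
# Simulated data (hypothetical, for illustration only)
set.seed(1)
x <- rnorm(200)
y <- 0.3 * x + rnorm(200)

# Simple linear regression and its R-squared
fit <- lm(y ~ x)
r_squared <- summary(fit)$r.squared

# Pearson correlation between the same two variables
r <- cor(x, y)

# With one predictor, R-squared and r^2 agree up to floating-point error
print(all.equal(r_squared, r^2))

# Applying the identity to the R-squared reported for model2
implied_r <- sqrt(0.03894)
print(round(implied_r, 3))
```

A correlation this weak is consistent with the wide scatter in Visual 3, even though the slope is statistically significant with 3926 tracts.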

Histogram

ggplot(walkability, aes(x = R_MedWageWk)) +
  geom_histogram(binwidth = 2, fill = "red", color = "black", alpha = 0.7) +
  labs(title = "Income Distribution (more than 1250 but less than 3300 USD)",
       y = "Frequency") +
  theme_minimal()

Subset MoCo

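# Montgomery County, MD has county FIPS code 031; COUNTYFP is read as a number, so it is stored as 31
# (a value of 3 corresponds to FIPS 003, Anne Arundel County)
moco_walkability <- walkability %>%
  filter(COUNTYFP == 31)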

ggplot(moco_walkability, aes(x = R_MedWageWk, y = NatWalkInd)) +
  geom_point(alpha = 0.5) +
  labs(x = "Workers earning between 1250 and 3300 USD (R_MedWageWk)", y = "Walkability Score") +
  theme_minimal()

library(DataExplorer)
plot_correlation(community)
6 features with more than 20 categories ignored!
Census.Tract: 1406 categories
copdrates: 106 categories
X95..Confidence.Interval: 677 categories
Confidence.Interval.Low: 96 categories
Confidence.Interval.High: 118 categories
oldhousing: 1164 categories
Warning in cor(x = structure(list(...1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, : the
standard deviation is zero
Warning: Removed 80 rows containing missing values (`geom_text()`).

Visual 4: Healthcare Jobs/Walkability

E8_Hlth10

ggplot(moco_walkability, aes(x = Pct_AO2p, y = NatWalkInd)) +
  geom_point(alpha = 0.5) +
  labs(x = "% of 2+ Car Owning Homes", y = "Walkability Score") +
 # stat_ellipse()
 # geom_smooth()
  theme_minimal()

Remove the % signs from the oldhousing column (the percentage of housing in a census tract built before 1980) and from the COPD rates column

# Remove "%" symbol using gsub
community$oldhousing <- gsub("%", "", community$oldhousing)
community$copdrates <- gsub("%", "", community$copdrates)

# Convert both columns to numeric; entries that are not plain numbers become NA

community$oldhousing <- as.numeric(community$oldhousing)
community$copdrates <- as.numeric(community$copdrates)
Warning: NAs introduced by coercion
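
The coercion warning means that some entries were still not plain numbers after the percent sign was removed. A minimal sketch of how such entries could be located before converting is shown here. It uses a small made-up vector; the name raw_rates and its values (including the "*" placeholder) are assumptions for illustration, not values taken from finalmerge.csv.

```r
# Hypothetical raw values; the real column may hold different non-numeric entries
raw_rates <- c("5.2%", "7.9%", "*", NA, "6.0%")

# Strip the percent sign, then convert. suppressWarnings() hides the
# "NAs introduced by coercion" warning, which is expected in this check
stripped <- gsub("%", "", raw_rates, fixed = TRUE)
rates <- suppressWarnings(as.numeric(stripped))

# Entries that were present in the raw data but failed to convert
failed <- !is.na(raw_rates) & is.na(rates)
bad_values <- unique(raw_rates[failed])
print(bad_values)
```

Running the same check on the real columns before overwriting them would show whether the NAs come from suppressed values, blanks, or stray text.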

Visual 5

# Refer to columns by bare name inside aes(); ggplot looks them up in `community`
ggplot(community, aes(x = parkdistancepopulation, y = Number)) +
  geom_bar(stat = "identity") + 
  labs(title = "Number of People Living Near a Park per County",
       x = "Counties",
       y = "Frequency") +
  theme_classic()  
Warning: Removed 133 rows containing missing values (`position_stack()`).

Visual 6 (New with % housing data) + Linear Model

model3 <- lm(copdrates ~ oldhousing, data = community)  # y ~ x represents dependent variable ~ independent
summary(model3)

Call:
lm(formula = copdrates ~ oldhousing, data = community)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.9315 -1.2323 -0.3673  1.0400 10.1918 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 3.993742   0.126692   31.52   <2e-16 ***
oldhousing  0.024456   0.001935   12.64   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.844 on 1266 degrees of freedom
  (138 observations deleted due to missingness)
Multiple R-squared:  0.1121,    Adjusted R-squared:  0.1114 
F-statistic: 159.8 on 1 and 1266 DF,  p-value: < 2.2e-16
# Model Analysis: The coefficient of 0.024 indicates that each one-percentage-point increase in oldhousing (the share of housing built before 1980) is associated with a 0.024-percentage-point increase in the COPD rate. The p-values are far below 0.05, which suggests a statistically significant relationship between housing age and COPD rates in MD census tracts. However, the R-squared value is low: the model explains only about 11.2% of the variance in COPD rates, so other variables likely have more of an impact. I will therefore likely use a multiple regression model that includes more variables to improve the fit.
ggplot(community, aes(x = oldhousing, y = copdrates)) +
  geom_point(color = "red") +
  # method = "lm" refits copdrates ~ oldhousing internally, which matches model3
  geom_smooth(method = "lm", aes(linetype = "Linear Model"), color = "blue") +
  labs(title = "COPD Rate versus Old Housing by Census Tract",
       x = "% of Old Housing In Census Tract",
       y = "Crude Percent of COPD In Census Tract")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 138 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 138 rows containing missing values (`geom_point()`).

Data Analysis & Methodology

Guiding Question 1: Is there a measurable correlation between a community’s Walk Score (a standardized measure of walkability) and its average resident BMI and COPD rates?

Statistical Method (1): Pearson Correlation Coefficient Analysis
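
A minimal sketch of the Pearson analysis named above, using base R's cor.test(). The data are simulated: walk_index and copd_rate are hypothetical stand-ins for the tract-level walkability score and COPD rate, and the negative relationship is built in by construction rather than taken from the project data.

```r
# Simulated tract-level data (hypothetical values, for illustration only)
set.seed(42)
n <- 100
walk_index <- runif(n, min = 1, max = 20)
copd_rate <- 10 - 0.3 * walk_index + rnorm(n, sd = 1.5)

# Pearson correlation with a significance test and 95% confidence interval
result <- cor.test(walk_index, copd_rate, method = "pearson")

print(result$estimate)  # sample correlation coefficient r
print(result$conf.int)  # 95% confidence interval for r
print(result$p.value)   # p-value for the null hypothesis r = 0
```

With the real data, rows with missing COPD rates would need to be dropped or handled first; cor.test() uses only complete pairs.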

Guiding Question 2: Can we identify specific design variables (e.g., age of housing, distance from a park) within the community design dataset that are statistically associated with higher walkability scores (from the walkability dataset)?

Statistical Method (2): Multiple Linear Regression, which analyzes the relationship between multiple independent variables (community design features such as age of housing and park distance) and a single dependent variable (Walk Score).
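
A minimal sketch of such a multiple regression with lm(). The data frame tracts and its columns are simulated stand-ins: oldhousing and parkdistance mimic the design features, walk_index mimics the walkability score, and the effect sizes are invented for illustration.

```r
# Simulated tract-level data (hypothetical values, for illustration only)
set.seed(7)
n <- 200
tracts <- data.frame(
  oldhousing = runif(n, 0, 100),    # % of housing built before 1980
  parkdistance = runif(n, 0, 100)   # % of population living near a park
)
tracts$walk_index <- 5 + 0.05 * tracts$oldhousing +
  0.03 * tracts$parkdistance + rnorm(n, sd = 2)

# Two predictors entered together; each coefficient is the association
# with walk_index while holding the other predictor fixed
fit <- lm(walk_index ~ oldhousing + parkdistance, data = tracts)

coefs <- coef(summary(fit))  # estimates, standard errors, t values, p-values
print(coefs)

# Adjusted R-squared penalizes additional predictors
adj_r2 <- summary(fit)$adj.r.squared
print(adj_r2)
```

Applying this to the project data would first require joining the community and walkability datasets on a shared census tract identifier.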

Guiding Question 3: Is there a statistically significant correlation between a community’s median household income and its walkability score (from the walkability dataset)?

Statistical Method (3): Bootstrapping (to estimate how confident I can be in my findings)
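
A minimal sketch of a percentile bootstrap for a correlation coefficient, written in base R. The vectors income and walk_index are simulated stand-ins for median household income and the walkability score, and the choice of 2000 resamples is an assumption, not a requirement.

```r
# Simulated data (hypothetical values, for illustration only)
set.seed(123)
n <- 150
income <- rnorm(n, mean = 80, sd = 20)               # income in thousands
walk_index <- 6 + 0.05 * income + rnorm(n, sd = 3)

observed_r <- cor(income, walk_index)

# Resample tracts with replacement and recompute the correlation each time
n_boot <- 2000
boot_r <- replicate(n_boot, {
  idx <- sample.int(n, size = n, replace = TRUE)
  cor(income[idx], walk_index[idx])
})

# Percentile bootstrap 95% confidence interval
ci <- quantile(boot_r, probs = c(0.025, 0.975))
print(observed_r)
print(ci)
```

If the interval excludes zero, the correlation is unlikely to be an artifact of which tracts happened to be sampled.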

Guiding Question 4: Are low-income areas more likely to have residents experiencing the public health problems studied here (high COPD rates and high obesity rates)?

Statistical Method (4): Chi-Square Test or Logistic Regression
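
A minimal sketch of both options on simulated data. The indicators low_income and high_copd are hypothetical binary flags (for example, tracts below an income cutoff and tracts above a COPD-rate cutoff); the cutoffs and probabilities are assumptions for illustration only.

```r
# Simulated binary indicators per tract (hypothetical, for illustration only)
set.seed(99)
n <- 300
low_income <- rbinom(n, 1, 0.4)
p_high <- ifelse(low_income == 1, 0.6, 0.3)
high_copd <- rbinom(n, 1, p_high)

# Option 1: chi-square test of independence on the 2x2 table
tab <- table(low_income = low_income, high_copd = high_copd)
chi <- chisq.test(tab)
print(tab)
print(chi$p.value)

# Option 2: logistic regression; exp(coefficient) is the odds ratio of
# high COPD for low-income tracts relative to other tracts
logit <- glm(high_copd ~ low_income, family = binomial)
odds_ratio <- exp(coef(logit)[["low_income"]])
print(odds_ratio)
```

The chi-square test only reports whether the two flags are associated, while logistic regression also gives the size of the association and can take additional predictors.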

Guiding Question 5: What are policy considerations that can be developed based on this data?