Have you ever wondered if the air you breathe could affect your health in more ways than just a scratchy throat on a smoggy day? Researchers have been probing this very question, unraveling the potential ties between air pollution and our long-term health. This study takes a brief dive into one specific concern: could tiny particles lurking in the air be linked to an increased risk of returning to the hospital after treatment for certain respiratory and heart conditions?
We’ll be investigating this connection for five different medical
conditions, from acute myocardial infarctions (commonly known as heart
attacks) to pneumonia, all while focusing on two air pollution
measures: lead in fine particulate matter (PM2.5) and carbon monoxide. In
the process, we will also be using the purrr package’s
mapping functionality with the tidymodels package to
demonstrate how easy it can be to use regression models to measure
correlations between multiple pairs of independent and dependent
variables. Buckle up, data detectives, as we embark on a journey through
hospital readmission rates, air quality measurements, and statistical
sleuthing!
In short, our research question can be stated as, “Is there a correlation between the amount of particulate matter measured in the air and the rate of hospital readmissions for people who received treatment for respiratory and cardiac ailments?” That is the overall question of interest. The data and methodology we use allow us to break it into multiple, more precise research questions. Our data consists of the readmission rates for five treatments (Acute Myocardial Infarction (AMI) 30-Day Readmission Rate, Rate of readmission for CABG, Rate of readmission for chronic obstructive pulmonary disease (COPD) patients, Heart failure (HF) 30-Day Readmission Rate, and Pneumonia (PN) 30-Day Readmission Rate) and two air quality measures (Carbon Monoxide and Lead PM2.5 LC). All measures are at the county level for California. For this analysis we will use a threshold of p < 0.05 to determine statistical significance. We want to check for correlation between each pairing of a readmission rate with an air quality measure, resulting in ten research questions, or pairs of hypotheses, to test.
\(H_0\) : There is no relationship between the amount of Lead PM 2.5 LC measured in the air and the 30-day readmission rate for acute myocardial infarction in California counties.
\(H_a\) : There is a relationship between the amount of Lead PM 2.5 LC measured in the air and the 30-day readmission rate for acute myocardial infarction in California counties.
\(H_0\) : There is no relationship between the amount of Lead PM 2.5 LC measured in the air and the rate of readmission for CABG in California counties.
\(H_a\) : There is a relationship between the amount of Lead PM 2.5 LC measured in the air and the rate of readmission for CABG in California counties.
\(H_0\) : There is no relationship between the amount of Lead PM 2.5 LC measured in the air and the rate of readmission for chronic obstructive pulmonary disease (COPD) patients in California counties.
\(H_a\) : There is a relationship between the amount of Lead PM 2.5 LC measured in the air and the rate of readmission for chronic obstructive pulmonary disease (COPD) patients in California counties.
\(H_0\) : There is no relationship between the amount of Lead PM 2.5 LC measured in the air and the Heart failure (HF) 30-Day readmission rate in California counties.
\(H_a\) : There is a relationship between the amount of Lead PM 2.5 LC measured in the air and the Heart failure (HF) 30-Day readmission rate in California counties.
\(H_0\) : There is no relationship between the amount of Lead PM 2.5 LC measured in the air and the pneumonia (PN) 30-day readmission rate in California counties.
\(H_a\) : There is a relationship between the amount of Lead PM 2.5 LC measured in the air and the pneumonia (PN) 30-day readmission rate in California counties.
\(H_0\) : There is no relationship between the amount of Carbon Monoxide measured in the air and the 30-day readmission rate for acute myocardial infarction in California counties.
\(H_a\) : There is a relationship between the amount of Carbon Monoxide measured in the air and the 30-day readmission rate for acute myocardial infarction in California counties.
\(H_0\) : There is no relationship between the amount of Carbon Monoxide measured in the air and the rate of readmission for CABG in California counties.
\(H_a\) : There is a relationship between the amount of Carbon Monoxide measured in the air and the rate of readmission for CABG in California counties.
\(H_0\) : There is no relationship between the amount of Carbon Monoxide measured in the air and the rate of readmission for chronic obstructive pulmonary disease (COPD) patients in California counties.
\(H_a\) : There is a relationship between the amount of Carbon Monoxide measured in the air and the rate of readmission for chronic obstructive pulmonary disease (COPD) patients in California counties.
\(H_0\) : There is no relationship between the amount of Carbon Monoxide measured in the air and the Heart failure (HF) 30-Day readmission rate in California counties.
\(H_a\) : There is a relationship between the amount of Carbon Monoxide measured in the air and the Heart failure (HF) 30-Day readmission rate in California counties.
\(H_0\) : There is no relationship between the amount of Carbon Monoxide measured in the air and the pneumonia (PN) 30-day readmission rate in California counties.
\(H_a\) : There is a relationship between the amount of Carbon Monoxide measured in the air and the pneumonia (PN) 30-day readmission rate in California counties.
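The ten hypothesis pairs listed above are simply every combination of the five readmission measures with the two air quality measures. A minimal sketch in base R, using the variable names that appear in the summary output further down; the `pairs` data frame and its `formula` column are illustrative names, not taken from the original analysis:

```r
# Readmission measures (dependent variables) and air quality measures
# (independent variables), named as in the summary output.
outcomes   <- c("READM_30_AMI", "READM_30_CABG", "READM_30_COPD",
                "READM_30_HF", "READM_30_PN")
pollutants <- c("carbon_monoxide", "lead_pm2_5_lc")

# Every outcome-pollutant combination: 5 x 2 = 10 hypothesis pairs.
pairs <- expand.grid(outcome = outcomes, pollutant = pollutants,
                     stringsAsFactors = FALSE)

# One regression formula per pair, e.g. "READM_30_AMI ~ carbon_monoxide".
pairs$formula <- paste(pairs$outcome, "~", pairs$pollutant)
print(pairs)
```

Each row then corresponds to one null/alternative pair and one regression to fit.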
With our research question defined and our threshold set, let’s take a look at the data we will be working with. Imagine two bustling data warehouses, each holding crucial pieces of this puzzle. One, brimming with hospital records, whispers stories of readmission rates across counties in California. The other, a repository of environmental data, chronicles the dance of PM2.5 and carbon monoxide in the very air we breathe. With careful merging and cleaning, these two datasets join forces, forming the foundation for our investigation.
But the story isn’t without its twists. Missing data here and there, like shy guests at a party, add a touch of intrigue. Yet, with a bit of statistical know-how, we’ll wrangle these datasets into submission, coaxing out their secrets and unveiling potential connections between air quality and hospital readmissions.
The data used in this analysis comes from two sources: the Unplanned Hospital Visits dataset provided by the Centers for Medicare & Medicaid Services (CMS) and annual summary data by county from the Environmental Protection Agency (EPA).
For the CMS data, hospital scores for each measure were averaged by county, with each hospital weighted by its denominator value for that measure. The EPA data provided multiple readings for counties where more than one device was recording particulate matter; in these cases the readings were averaged and the average is used. The data from both the EPA and CMS were collected in the year 2022.
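The two county-level aggregations described above can be sketched in base R. All values and column names here (`score`, `denominator`, `reading`) are made up for illustration and are not the actual CMS or EPA field names:

```r
# Hypothetical hospital-level scores for one readmission measure.
hospitals <- data.frame(
  county      = c("A", "A", "B"),
  score       = c(14.0, 16.0, 13.0),
  denominator = c(100, 300, 50)
)

# Denominator-weighted mean score per county.
# County A: (14 * 100 + 16 * 300) / 400 = 15.5
cms_by_county <- sapply(
  split(hospitals, hospitals$county),
  function(d) weighted.mean(d$score, w = d$denominator)
)

# Hypothetical monitor readings; counties with several monitors get a
# plain (unweighted) average.
monitors <- data.frame(
  county  = c("A", "A", "B"),
  reading = c(0.30, 0.40, 0.25)
)
epa_by_county <- tapply(monitors$reading, monitors$county, mean)

print(cms_by_county)
print(epa_by_county)
```

Weighting by the denominator means a hospital with more eligible cases contributes more to its county's score than a small one.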
This results in a data set of 56 rows with 7 variables used in the
analysis and one variable used for information only
(county_parish), although not all variables are available
for every county. Missing values range from four in the READM_30_PN
measure to 32 in the carbon_monoxide measure. Across the pairings of
air quality measures and readmission rate measures, complete
cases range from 15 to 24 observations, with most pairings having 22.
While the sample sizes are small, the regressions can still prove
insightful. A table of the hospital readmission measures was also
maintained to make it easy to track which variable goes with which
measure.
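Because each regression only uses rows where both variables are present, the usable sample size differs by pair. A small sketch of how those counts could be tallied, using a made-up five-row data frame rather than the real 56-county table:

```r
# Hypothetical data with missing values scattered across columns.
toy <- data.frame(
  READM_30_AMI    = c(14, NA, 13, 15, NA),
  READM_30_PN     = c(17, 16, NA, 18, 17),
  carbon_monoxide = c(0.3, 0.2, NA, NA, 0.4),
  lead_pm2_5_lc   = c(NA, 0.001, 0.002, 0.003, NA)
)

outcomes   <- c("READM_30_AMI", "READM_30_PN")
pollutants <- c("carbon_monoxide", "lead_pm2_5_lc")

# Rows of the result are outcomes, columns are pollutants; each cell is
# the number of rows where both variables are non-missing.
n_complete <- sapply(pollutants, function(x) {
  sapply(outcomes, function(y) sum(complete.cases(toy[, c(y, x)])))
})
print(n_complete)
```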
| measure_id | measure_name |
|---|---|
| READM_30_AMI | Acute Myocardial Infarction (AMI) 30-Day Readmission Rate |
| READM_30_CABG | Rate of readmission for CABG |
| READM_30_COPD | Rate of readmission for chronic obstructive pulmonary disease (COPD) patients |
| READM_30_HF | Heart failure (HF) 30-Day Readmission Rate |
| READM_30_PN | Pneumonia (PN) 30-Day Readmission Rate |
Before diving into the nitty-gritty of statistics, let’s get a feel for our data. Boxplots emerge, painting pictures of the spread and quirks of each variable. Density plots, like gentle hills and valleys, reveal the underlying distributions of hospital readmission rates and air quality measures.
We discover some interesting details. The data leans towards normality, a good sign for our statistical tests. But, oh, those carbon monoxide and PM2.5 measurements! They’re a bit skewed, like stubborn party hats refusing to sit quite right. But fear not, we have a secret weapon: transformations! A sprinkle of squaring and a dash of logarithms straighten these mischievous variables, readying them for the statistical showdown.
The variables we will be using in our regressions will be numeric. As such, it is a good idea to get summary statistics for each variable. We can easily see the quartiles and means for each variable below as well as missing values.
## county_parish READM_30_AMI READM_30_CABG READM_30_COPD
## Length:56 Min. :11.60 Min. : 9.40 Min. :16.80
## Class :character 1st Qu.:13.40 1st Qu.:10.77 1st Qu.:18.72
## Mode :character Median :13.99 Median :10.99 Median :19.16
## Mean :13.83 Mean :11.04 Mean :19.18
## 3rd Qu.:14.33 3rd Qu.:11.30 3rd Qu.:19.73
## Max. :15.98 Max. :12.50 Max. :21.92
## NA's :23 NA's :27 NA's :10
## READM_30_HF READM_30_PN carbon_monoxide lead_pm2_5_lc
## Min. :18.61 Min. :15.10 Min. :0.02721 Min. :0.000120
## 1st Qu.:19.90 1st Qu.:16.50 1st Qu.:0.24565 1st Qu.:0.000250
## Median :20.32 Median :16.93 Median :0.31134 Median :0.001341
## Mean :20.38 Mean :16.99 Mean :0.29628 Mean :0.001490
## 3rd Qu.:20.77 3rd Qu.:17.23 3rd Qu.:0.35978 3rd Qu.:0.002095
## Max. :23.04 Max. :21.73 Max. :0.42149 Max. :0.008125
## NA's :7 NA's :4 NA's :32 NA's :31
We can also view boxplots of the numeric variables to get an idea of their spread.
Finally, we can view density plots to get an idea of how well our data fits assumptions of normality.
We can see that most of our measures of hospital readmissions data
follow patterns relatively close to normal distributions. The EPA
measures, on the other hand, do not. The carbon_monoxide
variable is negatively skewed and the lead_pm2_5_lc
variable is very positively skewed. Because of this, we transform these
variables by squaring the carbon_monoxide
measures and taking the log of the lead_pm2_5_lc measures.
This greatly improves the skew and kurtosis of both
variables so that they more closely resemble normal distributions. You
can see the improvement below.
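To make the direction of each transformation concrete, here is a base R sketch with a hand-written moment-based skewness function. The two vectors are constructed (not real EPA values) so that the effect is exact: squaring pulls in a left tail, while a log pulls in a right tail.

```r
# Moment-based sample skewness: third central moment over variance^1.5.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Constructed left-skewed values (square roots of an even grid).
co <- sqrt(seq(0.001, 0.18, length.out = 24))
# Constructed right-skewed values (exponentials of an even grid).
lead <- exp(seq(-9, -5, length.out = 25))

skew_before <- c(co = skewness(co),   lead = skewness(lead))
skew_after  <- c(co = skewness(co^2), lead = skewness(log(lead)))
print(round(rbind(skew_before, skew_after), 3))
```

Real measurements will not become perfectly symmetric the way these constructed vectors do; the point is only which transformation counteracts which direction of skew.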
This transformation also helps reduce the skew from the outlier in
the lead_pm2_5_lc variable. The outlier comes from Imperial
County, which is also a heavy outlier in the 2018 data. I have not
found any explanation for why Imperial County would have such a high
measure of lead particulate matter. If any readers have insight on
this, I would be greatly interested in learning about it. However, since the
county is repeatedly an outlier when a broader timeline is taken into
account, the observation is left in. Results will be reported with and
without the observation.
Now, the moment of truth arrives. We unleash a battery of statistical models, each one pairing a hospital readmission rate with an air pollutant, testing for potential correlations. Imagine these models as tiny sleuths, meticulously sifting through the data, searching for whispers of connection.
With bated breath, we await the results. And, well, none of the sleuths crack the case completely. While some relationships show a hint of intrigue, none reach the level of statistical significance. This means, sadly, we haven’t unearthed a definitive link between air pollution and increased hospital readmissions for these specific conditions.
However, two pairs of sleuths stand out from the crowd: PM2.5 paired with readmission rates for acute myocardial infarction (AMI) and for pneumonia, the two models carried forward in the model summaries below. Their whispers are the loudest, suggesting a need for further investigation. And then there’s that pesky outlier, Imperial County, with its unusually high PM2.5 levels. Should it stay or should it go? We’ll explore this dilemma later…
Using purrr’s map function makes it easy
for us to use tidymodels to fit a regression model for each pair of
independent and dependent variables we are interested in. The result is
the table below.
## # A tibble: 20 × 6
## term estimate std.error statistic p.value dependent_vars
## <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 (Intercept) 14.1 0.363 38.7 2.77e-20 READM_30_AMI
## 2 carbon_monoxide -0.729 3.25 -0.224 8.25e- 1 READM_30_AMI
## 3 (Intercept) 16.5 1.26 13.0 7.75e- 9 READM_30_CABG
## 4 lead_pm2_5_lc 0.348 0.187 1.86 8.54e- 2 READM_30_CABG
## 5 (Intercept) 11.1 0.349 31.9 1.24e-18 READM_30_COPD
## 6 carbon_monoxide -1.80 3.12 -0.577 5.71e- 1 READM_30_COPD
## 7 (Intercept) 10.7 0.985 10.8 7.18e- 8 READM_30_HF
## 8 lead_pm2_5_lc -0.0616 0.146 -0.423 6.79e- 1 READM_30_HF
## 9 (Intercept) 19.1 0.401 47.6 6.97e-23 READM_30_PN
## 10 carbon_monoxide 3.81 3.62 1.05 3.04e- 1 READM_30_PN
## 11 (Intercept) 20.8 1.33 15.7 1.54e-11 READM_30_AMI
## 12 lead_pm2_5_lc 0.184 0.191 0.963 3.49e- 1 READM_30_AMI
## 13 (Intercept) 20.3 0.513 39.6 5.92e-22 READM_30_CABG
## 14 carbon_monoxide -0.988 4.72 -0.209 8.36e- 1 READM_30_CABG
## 15 (Intercept) 22.4 1.20 18.7 1.04e-13 READM_30_COPD
## 16 lead_pm2_5_lc 0.267 0.169 1.58 1.30e- 1 READM_30_COPD
## 17 (Intercept) 17.2 0.527 32.7 3.90e-20 READM_30_HF
## 18 carbon_monoxide 0.0967 4.85 0.0200 9.84e- 1 READM_30_HF
## 19 (Intercept) 19.9 1.45 13.7 1.22e-11 READM_30_PN
## 20 lead_pm2_5_lc 0.385 0.206 1.87 7.58e- 2 READM_30_PN
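A coefficient table like the one above could be produced along these lines. This is a sketch under assumptions, not the original code: the data are simulated stand-ins, only two of the five outcomes are used, and `model_data`, `fit_pair`, and `dependent_var` are illustrative names. It uses `linear_reg()`, `set_engine()`, and `fit()` from parsnip (part of tidymodels), `tidy()` from broom, and `map2()` from purrr.

```r
library(purrr)
library(dplyr)
library(parsnip)  # model specification layer of tidymodels
library(broom)    # tidy() turns a fitted model into a coefficient table

# Simulated stand-in for the merged county table (already transformed).
set.seed(42)
model_data <- data.frame(
  READM_30_AMI    = rnorm(30, mean = 14, sd = 1),
  READM_30_PN     = rnorm(30, mean = 17, sd = 1),
  carbon_monoxide = runif(30, 0.02, 0.42)^2,
  lead_pm2_5_lc   = log(runif(30, 0.0001, 0.008))
)

outcomes   <- c("READM_30_AMI", "READM_30_PN")
pollutants <- c("carbon_monoxide", "lead_pm2_5_lc")
pairs <- expand.grid(outcome = outcomes, pollutant = pollutants,
                     stringsAsFactors = FALSE)

lm_spec <- linear_reg() |> set_engine("lm")

# Fit one simple regression and label its rows with the outcome name.
fit_pair <- function(y, x) {
  lm_spec |>
    fit(reformulate(x, response = y), data = model_data) |>
    tidy() |>
    mutate(dependent_var = y)
}

# Walk the outcome and pollutant columns in parallel, then stack the tables.
results <- bind_rows(map2(pairs$outcome, pairs$pollutant, fit_pair))
print(results)
```

Attaching the outcome label inside the mapped function, where the outcome is known, keeps each label tied to the model that produced it, which is safer than adding a label column to the stacked table afterwards.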
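The p-values are in scientific notation. We can make them easier to read
by rounding the values to 8 digits. The table below shows us that none
of the dependent-independent variable pairs meets our criterion for being
significantly correlated. The two lead_pm2_5_lc models with p-values of
0.0854 and 0.0758 come closest (it’s important to remember that
we performed a log transformation on the lead_pm2_5_lc
variable). The dependent_vars labels in these tables appear to be offset
from the model rows: for example, the carbon_monoxide row labeled
READM_30_COPD has an intercept of 11.1 and a slope of -1.80, which gives
fitted values near 11, well below the minimum observed COPD rate of
16.80. The model summaries further down attribute the two p-values to
READM_30_AMI and READM_30_PN, so those names are used here. For
illustration we will continue with these
two models, but again, none of the models, including these two, met our
predetermined criterion for significance.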
## # A tibble: 20 × 6
## term estimate std.error statistic p.value dependent_vars
## <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 (Intercept) 14.1 0.363 38.7 0 READM_30_AMI
## 2 carbon_monoxide -0.729 3.25 -0.224 0.825 READM_30_AMI
## 3 (Intercept) 16.5 1.26 13.0 0.00000001 READM_30_CABG
## 4 lead_pm2_5_lc 0.348 0.187 1.86 0.0854 READM_30_CABG
## 5 (Intercept) 11.1 0.349 31.9 0 READM_30_COPD
## 6 carbon_monoxide -1.80 3.12 -0.577 0.571 READM_30_COPD
## 7 (Intercept) 10.7 0.985 10.8 0.00000007 READM_30_HF
## 8 lead_pm2_5_lc -0.0616 0.146 -0.423 0.679 READM_30_HF
## 9 (Intercept) 19.1 0.401 47.6 0 READM_30_PN
## 10 carbon_monoxide 3.81 3.62 1.05 0.304 READM_30_PN
## 11 (Intercept) 20.8 1.33 15.7 0 READM_30_AMI
## 12 lead_pm2_5_lc 0.184 0.191 0.963 0.349 READM_30_AMI
## 13 (Intercept) 20.3 0.513 39.6 0 READM_30_CABG
## 14 carbon_monoxide -0.988 4.72 -0.209 0.836 READM_30_CABG
## 15 (Intercept) 22.4 1.20 18.7 0 READM_30_COPD
## 16 lead_pm2_5_lc 0.267 0.169 1.58 0.130 READM_30_COPD
## 17 (Intercept) 17.2 0.527 32.7 0 READM_30_HF
## 18 carbon_monoxide 0.0967 4.85 0.0200 0.984 READM_30_HF
## 19 (Intercept) 19.9 1.45 13.7 0 READM_30_PN
## 20 lead_pm2_5_lc 0.385 0.206 1.87 0.0758 READM_30_PN
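The rounding step can be reproduced on a handful of p-values copied from the first coefficient table. A short sketch; the vector name `p_values` is illustrative:

```r
# A few p-values from the unrounded coefficient table.
p_values <- c(2.77e-20, 7.75e-9, 7.18e-8, 8.54e-2, 7.58e-2)

# Rounding to 8 decimal places collapses the tiny intercept p-values.
rounded <- round(p_values, 8)
print(format(rounded, scientific = FALSE))

# Compare against the threshold using the unrounded values, so that the
# display choice never affects the decision.
significant <- p_values < 0.05
print(significant)
```

The first three values belong to intercepts, which only test whether the intercept differs from zero; the last two are the slope p-values that matter for the hypotheses.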
We can access the r-squared values for these models below. Looking at the r-squared values, we can see that not much of the variation in our dependent variables is explained by our independent variables. This means that even if we had found the relationships significant, they would not have been particularly strong, since the r-squared values are closer to 0 than to 1.
## # A tibble: 2 × 13
## dependent_var r.squared adj.r.squared sigma statistic p.value df logLik
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 READM_30_AMI 0.211 0.150 0.686 3.47 0.0854 1 -14.5
## 2 READM_30_PN 0.149 0.107 1.11 3.51 0.0758 1 -32.5
## # ℹ 5 more variables: AIC <dbl>, BIC <dbl>, deviance <dbl>, df.residual <int>,
## # nobs <int>
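In a regression with a single predictor, the model-level F-test and the slope's t-test are the same test. A base R sketch on simulated data (the values and the names `x`, `y`, `fit` are illustrative only) shows the relationship:

```r
# Simulated data with 22 observations, similar in size to most pairs.
set.seed(1)
x <- rnorm(22)
y <- 17 + 0.4 * x + rnorm(22)

fit <- lm(y ~ x)
s   <- summary(fit)

# Slope t statistic and its p-value from the coefficient table.
slope_t <- coef(s)["x", "t value"]
slope_p <- coef(s)["x", "Pr(>|t|)"]

# Model-level F statistic and its p-value.
f       <- s$fstatistic
model_p <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)

print(c(r.squared = s$r.squared, F = f[["value"]], t_squared = slope_t^2))
print(c(slope_p = slope_p, model_p = model_p))
```

This is why the p.value of 0.0854 in the summary above matches the slope's p-value in the coefficient table, and why the statistic of 3.47 is roughly 1.86 squared.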
Finally, let’s visualize these models with regression lines added to scatter plots.
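A scatter plot with a fitted line can be sketched with ggplot2. The points below are simulated around the reported pneumonia estimates (intercept 19.9, slope 0.385) purely to have something to draw; they are not the county data, and `plot_data` is an illustrative name.

```r
library(ggplot2)

# Simulated stand-in values on the log scale of lead_pm2_5_lc.
set.seed(3)
plot_data <- data.frame(lead_pm2_5_lc = log(runif(22, 0.0001, 0.008)))
plot_data$READM_30_PN <- 19.9 + 0.385 * plot_data$lead_pm2_5_lc +
  rnorm(22, sd = 1)

# Points plus an ordinary least squares line with its confidence band.
p <- ggplot(plot_data, aes(x = lead_pm2_5_lc, y = READM_30_PN)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE) +
  labs(x = "log(Lead PM2.5 LC)",
       y = "Pneumonia (PN) 30-Day Readmission Rate")
p
```

With so few points, the shaded confidence band is usually wide, which is a visual counterpart to the large p-values.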
We can view all the same results with the outlier from Imperial
County removed. The dependent_vars labels in the coefficient tables do
not line up with the model summaries at the end of this section, so the
lead_pm2_5_lc models are named here by matching their p-values to those
summaries. Omitting the observation leaves the READM_30_AMI result
unchanged (p = 0.0854) and shifts the READM_30_PN result only slightly
(p = 0.0758 to 0.0636). It drastically changes the result for
READM_30_COPD (p = 0.349 to 0.0887), bringing it close to the level
observed for the two models above, and it takes READM_30_HF
past our threshold of significance (p = 0.130 to 0.0480). It also raises the
r-squared value for READM_30_PN from 0.149 to 0.170, but all four values remain closer to 0
than to 1, with the highest being ~0.21, suggesting that while a
relationship may be significant, it is not strong.
## # A tibble: 20 × 6
## term estimate std.error statistic p.value dependent_vars
## <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 (Intercept) 14.6 0.444 32.9 2.42e-12 READM_30_AMI
## 2 carbon_monoxide -3.65 3.93 -0.929 3.73e- 1 READM_30_AMI
## 3 (Intercept) 16.5 1.26 13.0 7.75e- 9 READM_30_CABG
## 4 lead_pm2_5_lc 0.348 0.187 1.86 8.54e- 2 READM_30_CABG
## 5 (Intercept) 11.2 0.324 34.5 1.47e-12 READM_30_COPD
## 6 carbon_monoxide -1.73 2.87 -0.603 5.58e- 1 READM_30_COPD
## 7 (Intercept) 10.7 0.985 10.8 7.18e- 8 READM_30_HF
## 8 lead_pm2_5_lc -0.0616 0.146 -0.423 6.79e- 1 READM_30_HF
## 9 (Intercept) 19.3 0.595 32.4 2.86e-12 READM_30_PN
## 10 carbon_monoxide 3.94 5.27 0.747 4.70e- 1 READM_30_PN
## 11 (Intercept) 22.1 1.36 16.2 2.40e-11 READM_30_AMI
## 12 lead_pm2_5_lc 0.350 0.193 1.81 8.87e- 2 READM_30_AMI
## 13 (Intercept) 21.0 0.600 34.9 1.95e-13 READM_30_CABG
## 14 carbon_monoxide -2.34 5.51 -0.424 6.79e- 1 READM_30_CABG
## 15 (Intercept) 23.3 1.29 18.0 5.64e-13 READM_30_COPD
## 16 lead_pm2_5_lc 0.381 0.180 2.12 4.80e- 2 READM_30_COPD
## 17 (Intercept) 17.5 0.802 21.8 5.14e-11 READM_30_HF
## 18 carbon_monoxide -1.15 7.36 -0.157 8.78e- 1 READM_30_HF
## 19 (Intercept) 20.4 1.63 12.5 1.25e-10 READM_30_PN
## 20 lead_pm2_5_lc 0.449 0.228 1.97 6.36e- 2 READM_30_PN
## # A tibble: 20 × 6
## term estimate std.error statistic p.value dependent_vars
## <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 (Intercept) 14.6 0.444 32.9 0 READM_30_AMI
## 2 carbon_monoxide -3.65 3.93 -0.929 0.373 READM_30_AMI
## 3 (Intercept) 16.5 1.26 13.0 0.00000001 READM_30_CABG
## 4 lead_pm2_5_lc 0.348 0.187 1.86 0.0854 READM_30_CABG
## 5 (Intercept) 11.2 0.324 34.5 0 READM_30_COPD
## 6 carbon_monoxide -1.73 2.87 -0.603 0.558 READM_30_COPD
## 7 (Intercept) 10.7 0.985 10.8 0.00000007 READM_30_HF
## 8 lead_pm2_5_lc -0.0616 0.146 -0.423 0.679 READM_30_HF
## 9 (Intercept) 19.3 0.595 32.4 0 READM_30_PN
## 10 carbon_monoxide 3.94 5.27 0.747 0.470 READM_30_PN
## 11 (Intercept) 22.1 1.36 16.2 0 READM_30_AMI
## 12 lead_pm2_5_lc 0.350 0.193 1.81 0.0887 READM_30_AMI
## 13 (Intercept) 21.0 0.600 34.9 0 READM_30_CABG
## 14 carbon_monoxide -2.34 5.51 -0.424 0.679 READM_30_CABG
## 15 (Intercept) 23.3 1.29 18.0 0 READM_30_COPD
## 16 lead_pm2_5_lc 0.381 0.180 2.12 0.0480 READM_30_COPD
## 17 (Intercept) 17.5 0.802 21.8 0 READM_30_HF
## 18 carbon_monoxide -1.15 7.36 -0.157 0.878 READM_30_HF
## 19 (Intercept) 20.4 1.63 12.5 0 READM_30_PN
## 20 lead_pm2_5_lc 0.449 0.228 1.97 0.0636 READM_30_PN
## # A tibble: 4 × 13
## dependent_var r.squared adj.r.squared sigma statistic p.value df logLik
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 READM_30_AMI 0.211 0.150 0.686 3.47 0.0854 1 -14.5
## 2 READM_30_COPD 0.170 0.118 0.877 3.28 0.0887 1 -22.1
## 3 READM_30_HF 0.200 0.156 0.874 4.50 0.0480 1 -24.6
## 4 READM_30_PN 0.170 0.126 1.13 3.88 0.0636 1 -31.3
## # ℹ 5 more variables: AIC <dbl>, BIC <dbl>, deviance <dbl>, df.residual <int>,
## # nobs <int>
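Reporting results with and without a suspected outlier amounts to fitting the same model twice. A base R sketch with made-up values; only the Imperial label mirrors the text, and the other county names and all numbers are hypothetical:

```r
# Hypothetical county rows; lead values are logged as in the analysis.
counties <- data.frame(
  county_parish = c("County A", "County B", "Imperial",
                    "County C", "County D", "County E"),
  lead_pm2_5_lc = log(c(0.0012, 0.0015, 0.0081, 0.0020, 0.0009, 0.0004)),
  READM_30_COPD = c(19.0, 19.4, 18.9, 19.9, 18.8, 18.5)
)

# Slope p-value for a simple regression on whatever rows are passed in.
slope_p <- function(d) {
  fit <- lm(READM_30_COPD ~ lead_pm2_5_lc, data = d)
  coef(summary(fit))["lead_pm2_5_lc", "Pr(>|t|)"]
}

without_imperial <- subset(counties, county_parish != "Imperial")

comparison <- c(
  with_outlier    = slope_p(counties),
  without_outlier = slope_p(without_imperial)
)
print(comparison)
```

A single high-leverage point can move a slope and its p-value substantially at these sample sizes, which is why both versions are worth reporting side by side.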
While our initial quest for a definitive link hasn’t been entirely successful, we’ve made some valuable discoveries. We’ve delved into the intricate relationship between air quality and health, highlighting potential avenues for further research. The whisper of a connection between PM2.5 and some respiratory conditions deserves deeper exploration, with larger datasets and refined analyses.
And now, for the elephant in the room: Imperial County. Does its outlier status distort the picture? Removing it from the analysis reveals a shift, bringing the PM2.5-COPD relationship closer to significance. This adds another layer of intrigue, suggesting the need for further studies that factor in unique county-level characteristics.
This journey may not have ended with a resounding triumph, but it has opened doors to fascinating possibilities. We’ve planted seeds of knowledge, watered them with data, and watched them sprout into questions begging for further exploration. The quest to understand the relationship between air pollution and health continues, and this study has added its unique chapter to the ongoing narrative.
And who knows, perhaps somewhere down the line, another study will build upon our findings, finally cracking the case and revealing the full story of air pollution’s impact on our health. Until then, we remain vigilant, continuing to explore the intricate links between our environment and our well-being.