Particulate Matter Measures and Hospital Readmission Rates

Eddy Harrity

2023-12-29


The Research Question

In short, our research question can be stated as, “Is there a correlation between the amount of particulate matter measured in the air and the rate of hospital readmissions for people who received treatment for respiratory and cardiac ailments?” That is the overall question of interest. The data and methodology we use allow us to break it into several more precise research questions. Our data consists of the readmission rates for five treatments (Acute Myocardial Infarction (AMI) 30-Day Readmission Rate, Rate of readmission for CABG, Rate of readmission for chronic obstructive pulmonary disease (COPD) patients, Heart failure (HF) 30-Day Readmission Rate, and Pneumonia (PN) 30-Day Readmission Rate) and two air quality measures (Carbon Monoxide and Lead PM2.5 LC). Both sets of measures are for counties in California. For this analysis we will use a threshold of p < 0.05 to determine statistical significance. We want to check for correlation between each treatment-pollutant pair, resulting in ten research questions, or pairs of hypotheses, to test.

\(H_0\) : There is no relationship between the amount of Lead PM 2.5 LC measured in the air and the 30-day readmission rate for acute myocardial infarction in California counties.

\(H_a\) : There is a relationship between the amount of Lead PM 2.5 LC measured in the air and the 30-day readmission rate for acute myocardial infarction in California counties.

\(H_0\) : There is no relationship between the amount of Lead PM 2.5 LC measured in the air and the rate of readmission for CABG in California counties.

\(H_a\) : There is a relationship between the amount of Lead PM 2.5 LC measured in the air and the rate of readmission for CABG in California counties.

\(H_0\) : There is no relationship between the amount of Lead PM 2.5 LC measured in the air and the rate of readmission for chronic obstructive pulmonary disease (COPD) patients in California counties.

\(H_a\) : There is a relationship between the amount of Lead PM 2.5 LC measured in the air and the rate of readmission for chronic obstructive pulmonary disease (COPD) patients in California counties.

\(H_0\) : There is no relationship between the amount of Lead PM 2.5 LC measured in the air and the heart failure (HF) 30-day readmission rate in California counties.

\(H_a\) : There is a relationship between the amount of Lead PM 2.5 LC measured in the air and the heart failure (HF) 30-day readmission rate in California counties.

\(H_0\) : There is no relationship between the amount of Lead PM 2.5 LC measured in the air and the pneumonia (PN) 30-day readmission rate in California counties.

\(H_a\) : There is a relationship between the amount of Lead PM 2.5 LC measured in the air and the pneumonia (PN) 30-day readmission rate in California counties.

\(H_0\) : There is no relationship between the amount of Carbon Monoxide measured in the air and the 30-day readmission rate for acute myocardial infarction in California counties.

\(H_a\) : There is a relationship between the amount of Carbon Monoxide measured in the air and the 30-day readmission rate for acute myocardial infarction in California counties.

\(H_0\) : There is no relationship between the amount of Carbon Monoxide measured in the air and the rate of readmission for CABG in California counties.

\(H_a\) : There is a relationship between the amount of Carbon Monoxide measured in the air and the rate of readmission for CABG in California counties.

\(H_0\) : There is no relationship between the amount of Carbon Monoxide measured in the air and the rate of readmission for chronic obstructive pulmonary disease (COPD) patients in California counties.

\(H_a\) : There is a relationship between the amount of Carbon Monoxide measured in the air and the rate of readmission for chronic obstructive pulmonary disease (COPD) patients in California counties.

\(H_0\) : There is no relationship between the amount of Carbon Monoxide measured in the air and the heart failure (HF) 30-day readmission rate in California counties.

\(H_a\) : There is a relationship between the amount of Carbon Monoxide measured in the air and the heart failure (HF) 30-day readmission rate in California counties.

\(H_0\) : There is no relationship between the amount of Carbon Monoxide measured in the air and the pneumonia (PN) 30-day readmission rate in California counties.

\(H_a\) : There is a relationship between the amount of Carbon Monoxide measured in the air and the pneumonia (PN) 30-day readmission rate in California counties.
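The ten hypothesis pairs above are simply every combination of the five readmission measures with the two air quality measures. A minimal sketch in base R of how they could be enumerated, assuming the column names that appear in the summary output later in this report; the object names `outcomes`, `pollutants`, `pairs`, and `alpha` are illustrative and not taken from the original analysis code:

```r
# Column names as they appear in the summary statistics of this report.
outcomes   <- c("READM_30_AMI", "READM_30_CABG", "READM_30_COPD",
                "READM_30_HF", "READM_30_PN")
pollutants <- c("carbon_monoxide", "lead_pm2_5_lc")

# One row per outcome-pollutant combination; each row corresponds to
# one null/alternative hypothesis pair.
pairs <- expand.grid(outcome = outcomes, pollutant = pollutants,
                     stringsAsFactors = FALSE)

alpha <- 0.05  # significance threshold stated in the research question

nrow(pairs)    # number of hypothesis pairs to test
```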

The Data: Tales of Two Datasets

Description of the Data

With our research question defined and our threshold set, let’s take a look at the data we will be working with. Imagine two bustling data warehouses, each holding crucial pieces of this puzzle. One, brimming with hospital records, whispers stories of readmission rates across counties in California. The other, a repository of environmental data, chronicles the dance of lead PM2.5 and carbon monoxide in the very air we breathe. With careful merging and cleaning, these two datasets join forces, forming the foundation for our investigation.

But the story isn’t without its twists. Missing data here and there, like shy guests at a party, add a touch of intrigue. Yet, with a bit of statistical know-how, we’ll wrangle these datasets into submission, coaxing out their secrets and unveiling potential connections between air quality and hospital readmissions.

The data used in this analysis come from the Unplanned Hospital Visits dataset provided by the Centers for Medicare & Medicaid Services (CMS) and from annual county-level summary data published by the Environmental Protection Agency (EPA).

For the CMS data, hospital scores for each measure were averaged by county, with each hospital's score weighted by its denominator value for that measure. The EPA data provided multiple readings for some counties where more than one device recorded the pollutant; in these cases the average of the readings was calculated and is used. The data from both the EPA and CMS were collected in 2022.
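The two aggregation steps described above can be sketched in base R as follows. The data frames `hospitals` and `monitors`, their column names, and all values are hypothetical stand-ins; the layout of the real CMS and EPA files is not shown in this report.

```r
# Hypothetical hospital-level rows: two hospitals in county "A", one in "B".
hospitals <- data.frame(
  county      = c("A", "A", "B"),
  score       = c(14.0, 16.0, 13.0),
  denominator = c(100, 300, 50)
)

# Denominator-weighted mean score per county (the CMS step).
county_scores <- sapply(
  split(hospitals, hospitals$county),
  function(d) weighted.mean(d$score, d$denominator)
)

# Hypothetical monitor-level readings: county "A" has two devices.
monitors <- data.frame(
  county          = c("A", "A", "B"),
  carbon_monoxide = c(0.30, 0.34, 0.25)
)

# Plain mean across devices per county (the EPA step).
county_co <- tapply(monitors$carbon_monoxide, monitors$county, mean)

county_scores
county_co
```

In this toy data the larger hospital in county "A" pulls the weighted score toward its own value, which is the point of weighting by the denominator.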

This results in a data set of 56 rows with 7 variables used in the analysis and one variable used for information only (county_parish), although not all variables are available for every county. Missing values range from 4 in the READM_30_PN measure to 32 in the carbon_monoxide measure. Across the pairings of air quality measures and readmission rate measures, the number of complete observations ranges from 15 to 24, with most pairs having 22. While small, these samples still prove insightful for the regressions. A table of the hospital readmission measures was also maintained to make it easy to track which variable goes with which measure.

measure_id      measure_name
READM_30_AMI    Acute Myocardial Infarction (AMI) 30-Day Readmission Rate
READM_30_CABG   Rate of readmission for CABG
READM_30_COPD   Rate of readmission for chronic obstructive pulmonary disease (COPD) patients
READM_30_HF     Heart failure (HF) 30-Day Readmission Rate
READM_30_PN     Pneumonia (PN) 30-Day Readmission Rate
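Because each regression only uses counties where both variables in a pair are present, the usable sample size differs from pair to pair. A minimal base R sketch of counting complete cases per pair; the data frame `toy` and its values are hypothetical, and only the column names are borrowed from this report:

```r
# Hypothetical toy data with missing values in every column.
toy <- data.frame(
  READM_30_AMI    = c(13.5, NA, 14.2, 13.9, NA),
  READM_30_PN     = c(16.5, 17.0, NA, 16.9, 17.3),
  carbon_monoxide = c(0.31, 0.25, NA, 0.36, 0.28),
  lead_pm2_5_lc   = c(NA, 0.0013, 0.0021, 0.0003, NA)
)

pairs <- expand.grid(outcome   = c("READM_30_AMI", "READM_30_PN"),
                     pollutant = c("carbon_monoxide", "lead_pm2_5_lc"),
                     stringsAsFactors = FALSE)

# For each pair, count the rows where both the outcome and the pollutant
# are non-missing; these are the rows a regression on that pair can use.
pairs$n_complete <- mapply(
  function(y, x) sum(complete.cases(toy[, c(y, x)])),
  pairs$outcome, pairs$pollutant
)

pairs
```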

Exploratory Analysis: A Peek Behind the Curtain

Before diving into the nitty-gritty of statistics, let’s get a feel for our data. Boxplots emerge, painting pictures of the spread and quirks of each variable. Density plots, like gentle hills and valleys, reveal the underlying distributions of hospital readmission rates and air quality measures.

We discover some interesting details. The data leans towards normality, a good sign for our statistical tests. But, oh, those carbon monoxide and PM2.5 measurements! They’re a bit skewed, like stubborn party hats refusing to sit quite right. But fear not, we have a secret weapon: transformations! A sprinkle of squaring and a dash of logarithms straighten these mischievous variables, readying them for the statistical showdown.

Summary Statistics

The variables we will be using in our regressions will be numeric. As such, it is a good idea to get summary statistics for each variable. We can easily see the quartiles and means for each variable below as well as missing values.

##  county_parish       READM_30_AMI   READM_30_CABG   READM_30_COPD  
##  Length:56          Min.   :11.60   Min.   : 9.40   Min.   :16.80  
##  Class :character   1st Qu.:13.40   1st Qu.:10.77   1st Qu.:18.72  
##  Mode  :character   Median :13.99   Median :10.99   Median :19.16  
##                     Mean   :13.83   Mean   :11.04   Mean   :19.18  
##                     3rd Qu.:14.33   3rd Qu.:11.30   3rd Qu.:19.73  
##                     Max.   :15.98   Max.   :12.50   Max.   :21.92  
##                     NA's   :23      NA's   :27      NA's   :10     
##   READM_30_HF     READM_30_PN    carbon_monoxide   lead_pm2_5_lc     
##  Min.   :18.61   Min.   :15.10   Min.   :0.02721   Min.   :0.000120  
##  1st Qu.:19.90   1st Qu.:16.50   1st Qu.:0.24565   1st Qu.:0.000250  
##  Median :20.32   Median :16.93   Median :0.31134   Median :0.001341  
##  Mean   :20.38   Mean   :16.99   Mean   :0.29628   Mean   :0.001490  
##  3rd Qu.:20.77   3rd Qu.:17.23   3rd Qu.:0.35978   3rd Qu.:0.002095  
##  Max.   :23.04   Max.   :21.73   Max.   :0.42149   Max.   :0.008125  
##  NA's   :7       NA's   :4       NA's   :32        NA's   :31

Exploratory Graphs

We can also view boxplots of the numeric variables to get an idea of their spread.

Finally, we can view density plots to get an idea of how well our data fits assumptions of normality.

More Normal Distributions

We can see that most of our hospital readmission measures follow patterns relatively close to normal distributions. The EPA measures, on the other hand, do not. The carbon_monoxide variable is negatively skewed and the lead_pm2_5_lc variable is very positively skewed. Because of this, we transform these variables by squaring the carbon_monoxide values and taking the log of the lead_pm2_5_lc values. This greatly improves the skew and kurtosis of both variables, so that they more closely resemble normal distributions. You can see the improvement below.
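To make the effect of the two transformations concrete, here is a small base R sketch. Base R has no built-in skewness function, so the helper `skewness` below is an illustrative moment-based definition, and the two vectors are hypothetical values chosen only to mimic a right tail and a left tail; they are not the report's data.

```r
# Moment-based sample skewness: third central moment divided by the
# 1.5 power of the second central moment. Missing values are dropped.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Hypothetical values with a long right tail, loosely like lead_pm2_5_lc.
right_skewed <- c(0.0001, 0.0002, 0.0003, 0.0010, 0.0015, 0.0020, 0.0080)

# Hypothetical values with a left tail, loosely like carbon_monoxide.
left_skewed <- c(0.03, 0.22, 0.28, 0.31, 0.33, 0.36, 0.42)

# A log pulls in a right tail; squaring stretches the upper end and so
# counteracts a left tail.
c(raw = skewness(right_skewed), logged  = skewness(log(right_skewed)))
c(raw = skewness(left_skewed),  squared = skewness(left_skewed^2))
```

For both toy vectors the transformed version has skewness closer to zero than the raw one, which is the behaviour the transformations are chosen for.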

Outliers

This transformation also helps reduce the skew from the outlier in the lead_pm2_5_lc variable. The outlier comes from Imperial County, which is also a heavy outlier in the 2018 data. I haven’t found an explanation for why Imperial County would have such a high measure of lead particulate matter. If readers have any insight on this, I would be greatly interested in learning it. However, since the county is repeatedly an outlier when a broader timeline is taken into account, the observation is left in. Results will be reported with and without the observation.

Regression Analysis: The Models Take the Stage

Now, the moment of truth arrives. We unleash a battery of statistical models, each one pairing a hospital readmission rate with an air pollutant, testing for potential correlations. Imagine these models as tiny sleuths, meticulously sifting through the data, searching for whispers of connection.

With bated breath, we await the results. And, well, none of the sleuths crack the case completely. While some relationships show a hint of intrigue, none reach the level of statistical significance. This means, sadly, we haven’t unearthed a definitive link between air pollution and increased hospital readmissions for these specific conditions.

However, two pairs of sleuths stand out from the crowd: lead PM2.5 and readmission rates for both acute myocardial infarction (AMI) and pneumonia. Their whispers are the loudest, suggesting a need for further investigation. And, there’s that pesky outlier, Imperial County, with its unusually high lead PM2.5 levels. Should it stay or should it go? We’ll explore this dilemma later…

Using purrr’s map function makes it easy for us to use tidy models to get a regression model for each pair of independent and dependent variables we are interested in. The result is the table below.

## # A tibble: 20 × 6
##    term            estimate std.error statistic  p.value dependent_vars
##    <chr>              <dbl>     <dbl>     <dbl>    <dbl> <chr>         
##  1 (Intercept)      14.1        0.363   38.7    2.77e-20 READM_30_AMI  
##  2 carbon_monoxide  -0.729      3.25    -0.224  8.25e- 1 READM_30_AMI  
##  3 (Intercept)      16.5        1.26    13.0    7.75e- 9 READM_30_AMI  
##  4 lead_pm2_5_lc     0.348      0.187    1.86   8.54e- 2 READM_30_AMI  
##  5 (Intercept)      11.1        0.349   31.9    1.24e-18 READM_30_CABG 
##  6 carbon_monoxide  -1.80       3.12    -0.577  5.71e- 1 READM_30_CABG 
##  7 (Intercept)      10.7        0.985   10.8    7.18e- 8 READM_30_CABG 
##  8 lead_pm2_5_lc    -0.0616     0.146   -0.423  6.79e- 1 READM_30_CABG 
##  9 (Intercept)      19.1        0.401   47.6    6.97e-23 READM_30_COPD 
## 10 carbon_monoxide   3.81       3.62     1.05   3.04e- 1 READM_30_COPD 
## 11 (Intercept)      20.8        1.33    15.7    1.54e-11 READM_30_COPD 
## 12 lead_pm2_5_lc     0.184      0.191    0.963  3.49e- 1 READM_30_COPD 
## 13 (Intercept)      20.3        0.513   39.6    5.92e-22 READM_30_HF   
## 14 carbon_monoxide  -0.988      4.72    -0.209  8.36e- 1 READM_30_HF   
## 15 (Intercept)      22.4        1.20    18.7    1.04e-13 READM_30_HF   
## 16 lead_pm2_5_lc     0.267      0.169    1.58   1.30e- 1 READM_30_HF   
## 17 (Intercept)      17.2        0.527   32.7    3.90e-20 READM_30_PN   
## 18 carbon_monoxide   0.0967     4.85     0.0200 9.84e- 1 READM_30_PN   
## 19 (Intercept)      19.9        1.45    13.7    1.22e-11 READM_30_PN   
## 20 lead_pm2_5_lc     0.385      0.206    1.87   7.58e- 2 READM_30_PN
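A table of this shape can be produced by fitting one simple linear model per outcome-pollutant pair and stacking the coefficient rows. The sketch below is a base R stand-in for the purrr-based workflow mentioned above, not the original code: it uses `mapply` in place of `purrr::map`, and `summary()` in place of a tidying function. The data are simulated, and the helper name `fit_pair` is hypothetical.

```r
set.seed(1)  # simulated data for illustration only, not the report's dataset
n <- 30
dat <- data.frame(
  READM_30_AMI    = rnorm(n, 14, 0.8),
  READM_30_PN     = rnorm(n, 17, 1.0),
  carbon_monoxide = runif(n, 0.05, 0.40)^2,       # squared, as in the report
  lead_pm2_5_lc   = log(runif(n, 0.0001, 0.008))  # logged, as in the report
)

pairs <- expand.grid(pollutant = c("carbon_monoxide", "lead_pm2_5_lc"),
                     outcome   = c("READM_30_AMI", "READM_30_PN"),
                     stringsAsFactors = FALSE)

# Fit outcome ~ pollutant and return its coefficient table as a data frame.
fit_pair <- function(outcome, pollutant) {
  fit <- lm(reformulate(pollutant, response = outcome), data = dat)
  out <- as.data.frame(summary(fit)$coefficients)
  names(out) <- c("estimate", "std.error", "statistic", "p.value")
  # The outcome label is attached here, next to the model it belongs to.
  data.frame(term = rownames(out), out, dependent_var = outcome,
             row.names = NULL)
}

results <- do.call(rbind, mapply(fit_pair, pairs$outcome, pairs$pollutant,
                                 SIMPLIFY = FALSE, USE.NAMES = FALSE))
rownames(results) <- NULL
results
```

Attaching the outcome label inside the fitting function, rather than adding a label column after the rows are stacked, keeps each label tied to the model that produced the row.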

The p-values are in scientific notation. We can make them easier to read by rounding the values to 8 digits. In the table below, each outcome occupies four consecutive rows: the intercept and slope for carbon_monoxide, then the intercept and slope for lead_pm2_5_lc. The table shows us that none of the dependent-independent variable pairs meets our criterion for being significantly correlated. The model of READM_30_AMI on lead_pm2_5_lc (p = 0.0854) and the model of READM_30_PN on lead_pm2_5_lc (p = 0.0758) come closest (it’s important to remember that we performed a log transformation on the lead_pm2_5_lc variable). For illustration we will continue the process with these two models, but again, none of the models, including these two, met our predetermined criterion for significance.

## # A tibble: 20 × 6
##    term            estimate std.error statistic    p.value dependent_vars
##    <chr>              <dbl>     <dbl>     <dbl>      <dbl> <chr>         
##  1 (Intercept)      14.1        0.363   38.7    0          READM_30_AMI  
##  2 carbon_monoxide  -0.729      3.25    -0.224  0.825      READM_30_AMI  
##  3 (Intercept)      16.5        1.26    13.0    0.00000001 READM_30_AMI  
##  4 lead_pm2_5_lc     0.348      0.187    1.86   0.0854     READM_30_AMI  
##  5 (Intercept)      11.1        0.349   31.9    0          READM_30_CABG 
##  6 carbon_monoxide  -1.80       3.12    -0.577  0.571      READM_30_CABG 
##  7 (Intercept)      10.7        0.985   10.8    0.00000007 READM_30_CABG 
##  8 lead_pm2_5_lc    -0.0616     0.146   -0.423  0.679      READM_30_CABG 
##  9 (Intercept)      19.1        0.401   47.6    0          READM_30_COPD 
## 10 carbon_monoxide   3.81       3.62     1.05   0.304      READM_30_COPD 
## 11 (Intercept)      20.8        1.33    15.7    0          READM_30_COPD 
## 12 lead_pm2_5_lc     0.184      0.191    0.963  0.349      READM_30_COPD 
## 13 (Intercept)      20.3        0.513   39.6    0          READM_30_HF   
## 14 carbon_monoxide  -0.988      4.72    -0.209  0.836      READM_30_HF   
## 15 (Intercept)      22.4        1.20    18.7    0          READM_30_HF   
## 16 lead_pm2_5_lc     0.267      0.169    1.58   0.130      READM_30_HF   
## 17 (Intercept)      17.2        0.527   32.7    0          READM_30_PN   
## 18 carbon_monoxide   0.0967     4.85     0.0200 0.984      READM_30_PN   
## 19 (Intercept)      19.9        1.45    13.7    0          READM_30_PN   
## 20 lead_pm2_5_lc     0.385      0.206    1.87   0.0758     READM_30_PN
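The rounding step is a one-liner in base R. A small sketch using four p-values copied from the first coefficient table; the vector name `p_values` is illustrative:

```r
# Four p-values copied from the first coefficient table above.
p_values <- c(2.77e-20, 8.25e-1, 7.75e-9, 8.54e-2)

# Round to 8 decimal places so the values print in fixed notation.
rounded <- round(p_values, 8)

format(rounded, scientific = FALSE)
```

A p-value that prints as 0 after rounding is not exactly zero; it is only smaller than the rounding precision.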

We can access the r-squared values for these models below. Looking at the r-squared values, we can see that not much of the variation in our dependent variables is explained by our independent variables. This means that even if we had found the relationships significant, they would not have been particularly strong, since the r-squared values are closer to 0 than to 1.

## # A tibble: 2 × 13
##   dependent_var r.squared adj.r.squared sigma statistic p.value    df logLik
##   <chr>             <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl>
## 1 READM_30_AMI      0.211         0.150 0.686      3.47  0.0854     1  -14.5
## 2 READM_30_PN       0.149         0.107 1.11       3.51  0.0758     1  -32.5
## # ℹ 5 more variables: AIC <dbl>, BIC <dbl>, deviance <dbl>, df.residual <int>,
## #   nobs <int>
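For a model with a single predictor, the columns of this table are tightly linked: r-squared is the squared Pearson correlation between the two variables, and the model's F statistic is the square of the slope's t statistic (for example, 1.86 squared is about 3.47). A base R sketch on simulated data that checks both identities; the variable names and values are hypothetical:

```r
set.seed(2)  # simulated data for illustration only
x <- log(runif(25, 0.0001, 0.008))       # stand-in for a logged pollutant
y <- 17 + 0.4 * x + rnorm(25, sd = 1)    # stand-in for a readmission rate

fit <- lm(y ~ x)
s   <- summary(fit)

r2     <- s$r.squared       # share of variance in y explained by x
adj_r2 <- s$adj.r.squared   # r-squared penalised for the number of predictors

# Identity 1: r-squared equals the squared correlation.
all.equal(r2, cor(x, y)^2)

# Identity 2: the F statistic equals the squared t statistic of the slope.
f_stat <- unname(s$fstatistic["value"])
t_stat <- s$coefficients["x", "t value"]
all.equal(f_stat, t_stat^2)
```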

Finally, let’s visualize these models with regression lines added to scatter plots.

We can view all the same results with the outlier from Imperial County removed. Below we can see that omitting that observation leaves the lead_pm2_5_lc result for READM_30_AMI unchanged (p = 0.0854) and changes the result for READM_30_PN only slightly (p = 0.0758 to 0.0636). It does, however, drastically change the lead_pm2_5_lc results for READM_30_COPD, bringing it close to the significance level of the two models above (p = 0.349 to 0.0887), and for READM_30_HF, taking it just past our threshold of significance (p = 0.130 to 0.0480). It also increases the r-squared values for the models that change, but they all remain closer to 0 than 1, with the highest value being ~0.21, suggesting that while a relationship may be significant, it is not strong.

## # A tibble: 20 × 6
##    term            estimate std.error statistic  p.value dependent_vars
##    <chr>              <dbl>     <dbl>     <dbl>    <dbl> <chr>         
##  1 (Intercept)      14.6        0.444    32.9   2.42e-12 READM_30_AMI  
##  2 carbon_monoxide  -3.65       3.93     -0.929 3.73e- 1 READM_30_AMI  
##  3 (Intercept)      16.5        1.26     13.0   7.75e- 9 READM_30_AMI  
##  4 lead_pm2_5_lc     0.348      0.187     1.86  8.54e- 2 READM_30_AMI  
##  5 (Intercept)      11.2        0.324    34.5   1.47e-12 READM_30_CABG 
##  6 carbon_monoxide  -1.73       2.87     -0.603 5.58e- 1 READM_30_CABG 
##  7 (Intercept)      10.7        0.985    10.8   7.18e- 8 READM_30_CABG 
##  8 lead_pm2_5_lc    -0.0616     0.146    -0.423 6.79e- 1 READM_30_CABG 
##  9 (Intercept)      19.3        0.595    32.4   2.86e-12 READM_30_COPD 
## 10 carbon_monoxide   3.94       5.27      0.747 4.70e- 1 READM_30_COPD 
## 11 (Intercept)      22.1        1.36     16.2   2.40e-11 READM_30_COPD 
## 12 lead_pm2_5_lc     0.350      0.193     1.81  8.87e- 2 READM_30_COPD 
## 13 (Intercept)      21.0        0.600    34.9   1.95e-13 READM_30_HF   
## 14 carbon_monoxide  -2.34       5.51     -0.424 6.79e- 1 READM_30_HF   
## 15 (Intercept)      23.3        1.29     18.0   5.64e-13 READM_30_HF   
## 16 lead_pm2_5_lc     0.381      0.180     2.12  4.80e- 2 READM_30_HF   
## 17 (Intercept)      17.5        0.802    21.8   5.14e-11 READM_30_PN   
## 18 carbon_monoxide  -1.15       7.36     -0.157 8.78e- 1 READM_30_PN   
## 19 (Intercept)      20.4        1.63     12.5   1.25e-10 READM_30_PN   
## 20 lead_pm2_5_lc     0.449      0.228     1.97  6.36e- 2 READM_30_PN
## # A tibble: 20 × 6
##    term            estimate std.error statistic    p.value dependent_vars
##    <chr>              <dbl>     <dbl>     <dbl>      <dbl> <chr>         
##  1 (Intercept)      14.6        0.444    32.9   0          READM_30_AMI  
##  2 carbon_monoxide  -3.65       3.93     -0.929 0.373      READM_30_AMI  
##  3 (Intercept)      16.5        1.26     13.0   0.00000001 READM_30_AMI  
##  4 lead_pm2_5_lc     0.348      0.187     1.86  0.0854     READM_30_AMI  
##  5 (Intercept)      11.2        0.324    34.5   0          READM_30_CABG 
##  6 carbon_monoxide  -1.73       2.87     -0.603 0.558      READM_30_CABG 
##  7 (Intercept)      10.7        0.985    10.8   0.00000007 READM_30_CABG 
##  8 lead_pm2_5_lc    -0.0616     0.146    -0.423 0.679      READM_30_CABG 
##  9 (Intercept)      19.3        0.595    32.4   0          READM_30_COPD 
## 10 carbon_monoxide   3.94       5.27      0.747 0.470      READM_30_COPD 
## 11 (Intercept)      22.1        1.36     16.2   0          READM_30_COPD 
## 12 lead_pm2_5_lc     0.350      0.193     1.81  0.0887     READM_30_COPD 
## 13 (Intercept)      21.0        0.600    34.9   0          READM_30_HF   
## 14 carbon_monoxide  -2.34       5.51     -0.424 0.679      READM_30_HF   
## 15 (Intercept)      23.3        1.29     18.0   0          READM_30_HF   
## 16 lead_pm2_5_lc     0.381      0.180     2.12  0.0480     READM_30_HF   
## 17 (Intercept)      17.5        0.802    21.8   0          READM_30_PN   
## 18 carbon_monoxide  -1.15       7.36     -0.157 0.878      READM_30_PN   
## 19 (Intercept)      20.4        1.63     12.5   0          READM_30_PN   
## 20 lead_pm2_5_lc     0.449      0.228     1.97  0.0636     READM_30_PN
## # A tibble: 4 × 13
##   dependent_var r.squared adj.r.squared sigma statistic p.value    df logLik
##   <chr>             <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl>
## 1 READM_30_AMI      0.211         0.150 0.686      3.47  0.0854     1  -14.5
## 2 READM_30_COPD     0.170         0.118 0.877      3.28  0.0887     1  -22.1
## 3 READM_30_HF       0.200         0.156 0.874      4.50  0.0480     1  -24.6
## 4 READM_30_PN       0.170         0.126 1.13       3.88  0.0636     1  -31.3
## # ℹ 5 more variables: AIC <dbl>, BIC <dbl>, deviance <dbl>, df.residual <int>,
## #   nobs <int>
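The with-and-without comparison amounts to fitting the same model twice, once on all rows and once with the outlying county dropped, and then lining up the slope rows. A base R sketch on simulated data; the county labels, the extreme point, and the helper `slope_row` are all hypothetical and only illustrate the mechanics:

```r
set.seed(3)  # simulated data for illustration only
x <- c(rnorm(20, -7, 0.8), -4.8)  # last value: an extreme pollutant reading
y <- c(17 + 0.4 * (x[1:20] + 7) + rnorm(20, sd = 1), 15.5)
dat <- data.frame(
  county = c(paste0("county_", 1:20), "outlier_county"),
  x = x,
  y = y
)

# Return the slope's estimate, standard error, t statistic and p-value.
slope_row <- function(d) summary(lm(y ~ x, data = d))$coefficients["x", ]

comparison <- rbind(
  with_outlier    = slope_row(dat),
  without_outlier = slope_row(dat[dat$county != "outlier_county", ])
)
comparison
```

Dropping the row by county name, rather than by a cutoff on the pollutant value, makes it explicit that exactly one observation leaves the sample.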

Conclusion: A Pause for Reflection, Not a Full Stop

While our initial quest for a definitive link hasn’t been entirely successful, we’ve made some valuable discoveries. We’ve delved into the intricate relationship between air quality and health, highlighting potential avenues for further research. The whisper of a connection between lead PM2.5 and readmissions for conditions such as pneumonia and acute myocardial infarction deserves deeper exploration, with larger datasets and refined analyses.

And now, for the elephant in the room: Imperial County. Does its outlier status distort the picture? Removing it from the analysis reveals a shift, bringing the lead PM2.5-COPD relationship closer to significance and the lead PM2.5-heart failure relationship just past our threshold (p = 0.048). This adds another layer of intrigue, suggesting the need for further studies that factor in unique county-level characteristics.

This journey may not have ended with a resounding triumph, but it has opened doors to fascinating possibilities. We’ve planted seeds of knowledge, watered them with data, and watched them sprout into questions begging for further exploration. The quest to understand the relationship between air pollution and health continues, and this study has added its unique chapter to the ongoing narrative.

And who knows, perhaps somewhere down the line, another study will build upon our findings, finally cracking the case and revealing the full story of air pollution’s impact on our health. Until then, we remain vigilant, continuing to explore the intricate links between our environment and our well-being.