Introduction/Summary

This document uses chi-squared hypothesis tests to determine whether hospitals in Southern California’s Inland Empire achieve levels of care that are significantly different, whether better or worse, from hospitals elsewhere in California.

We first describe the data, which is collected from the Centers for Medicare and Medicaid Services (CMS) and covers measures of healthcare-associated infections, patient surveys on communication by doctors and nurses and on overall staff responsiveness, and the volume of patients seen by the hospitals’ emergency departments.

Exploratory graphs and contingency tables are also developed from the data to show differences between hospitals in the Inland Empire and elsewhere in California on these measures and what the values would be expected to be if there were no relationship.

Finally, we perform the chi-squared hypothesis tests. Because some of the expected frequencies are less than five, we generate simulated null distributions and use them for the tests. We also visualize how extreme our observed statistics are in comparison to these null distributions.

Our findings suggest that missing values are correlated with our variables of interest, so finding data with fewer missing values, or determining a reliable method to impute the missing values, would be valuable for future research. With the data on hand, though, across most of our variables we don’t find enough evidence to reject the null hypothesis that hospitals in the Inland Empire provide the same level of care as hospitals elsewhere in California.

We do find enough evidence to reject the null hypothesis that emergency departments at hospitals in the Inland Empire experience the same volumes of patients as emergency departments at hospitals elsewhere in California. The data suggests that hospitals in the Inland Empire experience significantly higher volumes of patients in their emergency departments than hospitals elsewhere in California.

Research Question

Before we do anything else we want to establish a research question we will try to answer. This will involve coming up with a hypothesis or hypotheses we can test. We will also want to determine a threshold of evidence we would consider as enough to reject our hypothesis.

Our predetermined significance threshold will be 0.05: we will need a p-value below 0.05 to conclude that we have enough evidence to reject a null hypothesis.

The research question we are trying to answer is: do hospitals in Southern California’s Inland Empire provide a quality of care that is different from the quality of care provided by hospitals elsewhere in California? We are interested in whether the quality of care provided by hospitals in the Inland Empire is either better or worse than the quality of care provided by hospitals elsewhere in California, so our alternative hypotheses are two-sided (non-directional) rather than one-sided.

In this particular analysis we will be using data that will allow us to test the following ten pairs of null and alternative hypotheses.

\(H_0\) : Hospitals in the Inland Empire achieve the same level of Central Line Associated Bloodstream Infections compared to the national average as hospitals elsewhere in California.

\(H_a\) : Hospitals in the Inland Empire achieve a different level of Central Line Associated Bloodstream Infections compared to the national average than hospitals elsewhere in California.

\(H_0\) : Hospitals in the Inland Empire achieve the same level of Catheter Associated Urinary Tract Infections compared to the national average as hospitals elsewhere in California.

\(H_a\) : Hospitals in the Inland Empire achieve a different level of Catheter Associated Urinary Tract Infections compared to the national average than hospitals elsewhere in California.

\(H_0\) : Hospitals in the Inland Empire achieve the same level of surgical site infections from Colon Surgery compared to the national average as hospitals elsewhere in California.

\(H_a\) : Hospitals in the Inland Empire achieve a different level of surgical site infections from Colon Surgery compared to the national average than hospitals elsewhere in California.

\(H_0\) : Hospitals in the Inland Empire achieve the same level of surgical site infections from Abdominal Hysterectomy compared to the national average as hospitals elsewhere in California.

\(H_a\) : Hospitals in the Inland Empire achieve a different level of surgical site infections from Abdominal Hysterectomy compared to the national average than hospitals elsewhere in California.

\(H_0\) : Hospitals in the Inland Empire achieve the same level of methicillin-resistant Staphylococcus aureus (MRSA) bloodstream infections compared to the national average as hospitals elsewhere in California.

\(H_a\) : Hospitals in the Inland Empire achieve a different level of methicillin-resistant Staphylococcus aureus (MRSA) bloodstream infections compared to the national average than hospitals elsewhere in California.

\(H_0\) : Hospitals in the Inland Empire achieve the same level of Clostridium difficile (C. diff) intestinal infections compared to the national average as hospitals elsewhere in California.

\(H_a\) : Hospitals in the Inland Empire achieve a different level of Clostridium difficile (C. diff) intestinal infections compared to the national average than hospitals elsewhere in California.

\(H_0\) : Patients participating in the HCAHPS survey for hospitals in the Inland Empire report an average level of satisfaction with communication from nurses that is the same as patients participating in the HCAHPS survey for hospitals elsewhere in California.

\(H_a\) : Patients participating in the HCAHPS survey for hospitals in the Inland Empire report an average level of satisfaction with communication from nurses that is different from patients participating in the HCAHPS survey for hospitals elsewhere in California.

\(H_0\) : Patients participating in the HCAHPS survey for hospitals in the Inland Empire report an average level of satisfaction with communication from doctors that is the same as patients participating in the HCAHPS survey for hospitals elsewhere in California.

\(H_a\) : Patients participating in the HCAHPS survey for hospitals in the Inland Empire report an average level of satisfaction with communication from doctors that is different from patients participating in the HCAHPS survey for hospitals elsewhere in California.

\(H_0\) : Patients participating in the HCAHPS survey for hospitals in the Inland Empire report an average level of satisfaction with staff responsiveness that is the same as patients participating in the HCAHPS survey for hospitals elsewhere in California.

\(H_a\) : Patients participating in the HCAHPS survey for hospitals in the Inland Empire report an average level of satisfaction with staff responsiveness that is different from patients participating in the HCAHPS survey for hospitals elsewhere in California.

\(H_0\) : Hospitals in the Inland Empire experience the same volumes of patients in their emergency departments relative to the national average as hospitals elsewhere in California.

\(H_a\) : Hospitals in the Inland Empire experience different volumes of patients in their emergency departments relative to the national average than hospitals elsewhere in California.

The Data

Description of the Data

The data used in this analysis is collected from the Centers for Medicare and Medicaid Services (CMS) at https://data.cms.gov/. Census data was also collected to assign zip codes to Core-Based Statistical Areas (CBSAs) and Urban Areas (UAs). The datasets used for this analysis, which tests for differences in outcomes between the Inland Empire and the rest of California, include hospital-level results for healthcare-associated infection measures, hospital-level results for the Hospital Consumer Assessment of Healthcare Providers and Systems (HCAHPS) survey, and hospital-level results for process of care measures.

Specific measures on nurse and doctor communication as well as staff responsiveness were collected from the HCAHPS survey data. The measure of emergency department volume was collected from the process of care measures. All measures of healthcare-associated infections were collected. The data was limited to hospitals in California. Hospitals assigned by the Census Bureau to the CBSA of “Riverside-San Bernardino-Ontario, CA” and also to the UAs of “Riverside–San Bernardino, CA”, “Los Angeles–Long Beach–Anaheim, CA”, “Victorville–Hesperia–Apple Valley, CA”, “Indio–Palm Desert–Palm Springs, CA”, or “Temecula–Murrieta–Menifee, CA” were labeled as being in the Inland Empire of Southern California.
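The labeling rule described above can be sketched in a few lines. This is an illustrative Python sketch, not the code used to produce this report: the function name `label_inland_empire` is hypothetical, the double-hyphen spelling of the urban area titles is assumed from the `ua_title` values in the data, and treating hospitals with a missing CBSA or urban area as "no" is an assumption.

```python
# Hypothetical helper illustrating the Inland Empire labeling rule.
IE_CBSA = "Riverside-San Bernardino-Ontario, CA"
IE_URBAN_AREAS = {
    "Riverside--San Bernardino, CA",
    "Los Angeles--Long Beach--Anaheim, CA",
    "Victorville--Hesperia--Apple Valley, CA",
    "Indio--Palm Desert--Palm Springs, CA",
    "Temecula--Murrieta--Menifee, CA",
}


def label_inland_empire(cbsa_title, ua_title):
    """Return "yes" when both the CBSA and the urban area match, else "no"."""
    # Assumption: a hospital with a missing CBSA or urban area is labeled "no".
    if cbsa_title is None or ua_title is None:
        return "no"
    # Normalize en-dashes to the double-hyphen spelling used in the set above.
    ua = ua_title.replace("\u2013", "--")
    return "yes" if cbsa_title == IE_CBSA and ua in IE_URBAN_AREAS else "no"
```

Requiring both conditions keeps Los Angeles urban-area hospitals outside the Riverside-San Bernardino-Ontario CBSA out of the Inland Empire group.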

This resulted in a dataset of 340 hospitals covering 11 variables of interest with 8 other informative variables (e.g. facility_name, zip_code). 27 hospitals are labeled as being in the Inland Empire. A table of the measure names and their IDs was also maintained to make the measure fields easier to understand.

## Rows: 340
## Columns: 19
## $ facility_id                 <chr> "050002", "050006", "050007", "050008", "0…
## $ HAI_1_SIR                   <chr> NA, "Better than the National Benchmark", …
## $ HAI_2_SIR                   <chr> NA, "Better than the National Benchmark", …
## $ HAI_3_SIR                   <chr> NA, "No Different than National Benchmark"…
## $ HAI_4_SIR                   <chr> NA, NA, NA, NA, NA, NA, NA, "No Different …
## $ HAI_5_SIR                   <chr> NA, "No Different than National Benchmark"…
## $ HAI_6_SIR                   <chr> "No Different than National Benchmark", "B…
## $ H_COMP_1_STAR_RATING        <chr> "medium", "medium", "medium", "medium", "m…
## $ H_COMP_2_STAR_RATING        <chr> "medium", "medium", "medium", "medium", "m…
## $ H_COMP_3_STAR_RATING        <chr> "medium", "medium", "medium", "medium", "m…
## $ emergency_department_volume <chr> "medium", "medium", "high", "low", "medium…
## $ facility_name               <chr> "ST ROSE HOSPITAL", "PROVIDENCE ST JOSEPH …
## $ city_town                   <chr> "HAYWARD", "EUREKA", "BURLINGAME", "SAN FR…
## $ zip_code                    <dbl> 94545, 95501, 94010, 94117, 94558, 94574, …
## $ cbsa                        <dbl> 41860, 21700, 41860, 41860, 34900, 34900, …
## $ cbsa_title                  <chr> "San Francisco-Oakland-Fremont, CA", "Eure…
## $ urban_area                  <chr> "78904", "28198", "78904", "78904", "61057…
## $ ua_title                    <chr> "San Francisco--Oakland, CA", "Eureka, CA"…
## $ inland_empire               <chr> "no", "no", "no", "no", "no", "no", "no", …
measure_name                                                         measure_id
Central Line Associated Bloodstream Infection (ICU + select Wards)   HAI_1_SIR
Catheter Associated Urinary Tract Infections (ICU + select Wards)    HAI_2_SIR
SSI - Colon Surgery                                                  HAI_3_SIR
SSI - Abdominal Hysterectomy                                         HAI_4_SIR
MRSA Bacteremia                                                      HAI_5_SIR
Clostridium Difficile (C.Diff)                                       HAI_6_SIR
Nurse communication - star rating                                    H_COMP_1_STAR_RATING
Doctor communication - star rating                                   H_COMP_2_STAR_RATING
Staff responsiveness - star rating                                   H_COMP_3_STAR_RATING

Exploratory Analysis of the data

Missing values

The most significant challenge with the current dataset is missing values, that is, data that went unreported for whatever reason. Unfortunately, the data pulled from CMS is the best we have found so far, so it is the data we will work with. Still, we can observe from the summary below that a substantial share of records are incomplete. As such, we will report results with the missing records removed from the analysis, as well as results with them categorized as “missing”.

Data summary
Name hospital_care_data
Number of rows 340
Number of columns 19
_______________________
Column type frequency:
character 17
numeric 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
facility_id 0 1.00 6 6 0 340 0
HAI_1_SIR 112 0.67 33 36 0 3 0
HAI_2_SIR 92 0.73 33 36 0 3 0
HAI_3_SIR 161 0.53 33 36 0 3 0
HAI_4_SIR 294 0.14 33 36 0 2 0
HAI_5_SIR 149 0.56 33 36 0 3 0
HAI_6_SIR 62 0.82 34 36 0 2 0
H_COMP_1_STAR_RATING 49 0.86 4 9 0 3 0
H_COMP_2_STAR_RATING 49 0.86 4 9 0 3 0
H_COMP_3_STAR_RATING 49 0.86 4 9 0 3 0
emergency_department_volume 90 0.74 3 9 0 4 0
facility_name 0 1.00 13 66 0 338 0
city_town 0 1.00 4 19 0 230 0
cbsa_title 18 0.95 8 36 0 33 0
urban_area 17 0.95 5 5 0 102 0
ua_title 17 0.95 8 47 0 102 0
inland_empire 0 1.00 2 3 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
zip_code 0 1 93245.02 1813.56 90015 92005.5 93235 94903.25 96161 ▅▇▇▆▇
cbsa 0 1 38917.54 15569.39 12540 31080.0 40140 41860.00 99999 ▁▇▁▁▁
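The two treatments of missing values described above (dropping incomplete records, or keeping them under their own category) can be sketched as follows. This is an illustrative Python sketch rather than the report's own code; the function names and the four example rows are made up, and only the 112-of-340 figure comes from the summary above.

```python
# Made-up example rows; None marks an unreported value.
rows = [
    {"inland_empire": "no", "HAI_6_SIR": "Better than the National Benchmark"},
    {"inland_empire": "no", "HAI_6_SIR": None},
    {"inland_empire": "yes", "HAI_6_SIR": "No Different than National Benchmark"},
    {"inland_empire": "yes", "HAI_6_SIR": None},
]


def drop_missing(rows, measure):
    """Keep only the rows where the measure was reported."""
    return [r for r in rows if r[measure] is not None]


def recode_missing(rows, measure, label="missing"):
    """Keep every row, replacing unreported values with an explicit category."""
    return [
        {**r, measure: r[measure] if r[measure] is not None else label}
        for r in rows
    ]


def complete_rate(n_missing, n_rows):
    """Share of rows with a reported value, as in the summary table."""
    return 1 - n_missing / n_rows


dropped = drop_missing(rows, "HAI_6_SIR")
recoded = recode_missing(rows, "HAI_6_SIR")
hai_1_rate = round(complete_rate(112, 340), 2)  # HAI_1_SIR: 112 of 340 missing
```

Dropping rows shrinks the table for each measure separately, while recoding adds one extra column to every contingency table, which is why the second set of tests has one more degree of freedom.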

Exploratory Graphs

Now let’s begin taking an exploratory look at our data through graphs, keeping in mind our research question of whether hospitals located in the Inland Empire provide care that results in a different level of healthcare outcomes than other hospitals in California.

The graph below indicates the number of hospitals that perform above, below, and at the national benchmark for various measures of healthcare-associated infections (the names of the measures can be found in the table of measure_names and measure_ids provided above). At first glance there don’t seem to be any dramatic differences. There are far fewer hospitals in the Inland Empire than in the rest of the state, which is to be expected.

The following graph also compares hospital performance in California, this time based on patient surveys of communication from doctors and nurses and of staff responsiveness. Here again, there doesn’t appear to be any dramatic difference between the performance of hospitals in the Inland Empire and hospitals in the rest of California.

Finally, the graph below compares the levels of emergency department volume at hospitals in the Inland Empire and the rest of California. We can see here that hospitals in the Inland Empire experience “high” and particularly “very high” volumes of patients in their emergency departments at what appears to be a noticeably greater rate than hospitals in the rest of California.

Contingency Tables

In this analysis we will be using chi-squared hypothesis testing to determine if there is a difference in healthcare outcomes between hospitals in the Inland Empire of Southern California and hospitals in the rest of California. For this we will want to see contingency tables of the measures we plan to test. We build them below.

First we look at the values we observe in our data. The result below is a list of contingency tables comparing our measures of interest across hospitals in the Inland Empire and hospitals not in the Inland Empire. We notice some slight differences in how hospitals in the Inland Empire and hospitals in the rest of California are spread across the categories of these variables.

## [[1]]
##              HAI_1_SIR
## inland_empire Better than the National Benchmark
##           no                                  26
##           yes                                  4
##              HAI_1_SIR
## inland_empire No Different than National Benchmark
##           no                                   162
##           yes                                   17
##              HAI_1_SIR
## inland_empire Worse than the National Benchmark
##           no                                 17
##           yes                                 2
## 
## [[2]]
##              HAI_2_SIR
## inland_empire Better than the National Benchmark
##           no                                  27
##           yes                                  4
##              HAI_2_SIR
## inland_empire No Different than National Benchmark
##           no                                   188
##           yes                                   21
##              HAI_2_SIR
## inland_empire Worse than the National Benchmark
##           no                                  8
##           yes                                 0
## 
## [[3]]
##              HAI_3_SIR
## inland_empire Better than the National Benchmark
##           no                                  14
##           yes                                  2
##              HAI_3_SIR
## inland_empire No Different than National Benchmark
##           no                                   138
##           yes                                   14
##              HAI_3_SIR
## inland_empire Worse than the National Benchmark
##           no                                 10
##           yes                                 1
## 
## [[4]]
##              HAI_4_SIR
## inland_empire No Different than National Benchmark
##           no                                    38
##           yes                                    6
##              HAI_4_SIR
## inland_empire Worse than the National Benchmark
##           no                                  2
##           yes                                 0
## 
## [[5]]
##              HAI_5_SIR
## inland_empire Better than the National Benchmark
##           no                                  13
##           yes                                  2
##              HAI_5_SIR
## inland_empire No Different than National Benchmark
##           no                                   145
##           yes                                   18
##              HAI_5_SIR
## inland_empire Worse than the National Benchmark
##           no                                 12
##           yes                                 1
## 
## [[6]]
##              HAI_6_SIR
## inland_empire Better than the National Benchmark
##           no                                 149
##           yes                                 20
##              HAI_6_SIR
## inland_empire No Different than National Benchmark
##           no                                   102
##           yes                                    7
## 
## [[7]]
##              H_COMP_1_STAR_RATING
## inland_empire excellent medium poor
##           no          7    242   15
##           yes         0     27    0
## 
## [[8]]
##              H_COMP_2_STAR_RATING
## inland_empire excellent medium poor
##           no          5    220   39
##           yes         0     21    6
## 
## [[9]]
##              H_COMP_3_STAR_RATING
## inland_empire excellent medium poor
##           no         17    227   20
##           yes         0     24    3
## 
## [[10]]
##              emergency_department_volume
## inland_empire high low medium very high
##           no    50  61     69        46
##           yes    7   2      4        11
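A contingency table like the ones above is simply a count of hospitals for every combination of group and category. The following Python sketch shows one way to build such a table; it is illustrative only, the function name `contingency_table` is hypothetical, and the six example records are made up rather than taken from the CMS data.

```python
from collections import Counter


def contingency_table(pairs):
    """Count (group, category) pairs into a nested dict: table[group][category]."""
    counts = Counter(pairs)
    groups = sorted({g for g, _ in counts})
    categories = sorted({c for _, c in counts})
    # Every group gets every category, so combinations never seen appear as 0.
    return {g: {c: counts.get((g, c), 0) for c in categories} for g in groups}


# Made-up records: (inland_empire, emergency_department_volume)
pairs = [
    ("no", "high"),
    ("no", "low"),
    ("no", "low"),
    ("yes", "very high"),
    ("yes", "high"),
    ("yes", "very high"),
]
table = contingency_table(pairs)
```

Filling in explicit zeros matters here: several of the observed tables above have an empty cell for the Inland Empire, and that cell still contributes to the test statistic.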

The list of tables below shows the expected frequencies. It is important to note that some of the expected frequencies in these tables are less than 5, meaning that in order to use chi-squared testing we will want to simulate the sampling distribution of the test statistic. Still, the expected frequencies let us see where the observed frequencies differ most from what we would expect if there were no relationship between being located in the Inland Empire and the healthcare outcome measures available to us from CMS.

## [[1]]
##              HAI_1_SIR
## inland_empire Better than the National Benchmark
##           no                           26.973684
##           yes                           3.026316
##              HAI_1_SIR
## inland_empire No Different than National Benchmark
##           no                             160.94298
##           yes                             18.05702
##              HAI_1_SIR
## inland_empire Worse than the National Benchmark
##           no                          17.083333
##           yes                          1.916667
## 
## [[2]]
##              HAI_2_SIR
## inland_empire Better than the National Benchmark
##           no                              27.875
##           yes                              3.125
##              HAI_2_SIR
## inland_empire No Different than National Benchmark
##           no                             187.93145
##           yes                             21.06855
##              HAI_2_SIR
## inland_empire Worse than the National Benchmark
##           no                          7.1935484
##           yes                         0.8064516
## 
## [[3]]
##              HAI_3_SIR
## inland_empire Better than the National Benchmark
##           no                           14.480447
##           yes                           1.519553
##              HAI_3_SIR
## inland_empire No Different than National Benchmark
##           no                             137.56425
##           yes                             14.43575
##              HAI_3_SIR
## inland_empire Worse than the National Benchmark
##           no                           9.955307
##           yes                          1.044693
## 
## [[4]]
##              HAI_4_SIR
## inland_empire No Different than National Benchmark
##           no                              38.26087
##           yes                              5.73913
##              HAI_4_SIR
## inland_empire Worse than the National Benchmark
##           no                          1.7391304
##           yes                         0.2608696
## 
## [[5]]
##              HAI_5_SIR
## inland_empire Better than the National Benchmark
##           no                           13.350785
##           yes                           1.649215
##              HAI_5_SIR
## inland_empire No Different than National Benchmark
##           no                             145.07853
##           yes                             17.92147
##              HAI_5_SIR
## inland_empire Worse than the National Benchmark
##           no                          11.570681
##           yes                          1.429319
## 
## [[6]]
##              HAI_6_SIR
## inland_empire Better than the National Benchmark
##           no                           152.58633
##           yes                           16.41367
##              HAI_6_SIR
## inland_empire No Different than National Benchmark
##           no                              98.41367
##           yes                             10.58633
## 
## [[7]]
##              H_COMP_1_STAR_RATING
## inland_empire excellent    medium      poor
##           no  6.3505155 244.04124 13.608247
##           yes 0.6494845  24.95876  1.391753
## 
## [[8]]
##              H_COMP_2_STAR_RATING
## inland_empire excellent    medium      poor
##           no  4.5360825 218.63918 40.824742
##           yes 0.4639175  22.36082  4.175258
## 
## [[9]]
##              H_COMP_3_STAR_RATING
## inland_empire excellent    medium      poor
##           no   15.42268 227.71134 20.865979
##           yes   1.57732  23.28866  2.134021
## 
## [[10]]
##              emergency_department_volume
## inland_empire   high    low medium very high
##           no  51.528 56.952 65.992    51.528
##           yes  5.472  6.048  7.008     5.472
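The expected frequencies above follow from the row and column totals of each observed table: each cell is its row total times its column total, divided by the table total. The Python sketch below reproduces two of the tables from the observed counts and counts the cells that fall below 5. It is an illustration only, and the function names are hypothetical.

```python
def expected_counts(observed):
    """Expected cell counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def count_small_cells(expected, threshold=5):
    """Number of cells whose expected count is below the threshold."""
    return sum(1 for row in expected for e in row if e < threshold)


# Observed counts copied from the contingency tables (rows: no, yes).
# emergency_department_volume columns: high, low, medium, very high
ed_observed = [[50, 61, 69, 46], [7, 2, 4, 11]]
# HAI_2_SIR columns: better, no different, worse
hai_2_observed = [[27, 188, 8], [4, 21, 0]]

ed_expected = expected_counts(ed_observed)
hai_2_expected = expected_counts(hai_2_observed)
hai_2_small = count_small_cells(hai_2_expected)
```

For HAI_2_SIR two of the six expected counts are below 5 (both in the Inland Empire row), which is the situation that motivates the simulated null distributions.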

Hypothesis Testing

Now we move on to the actual analysis of our data using chi-squared (\(\chi^2\)) hypothesis testing.
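For reference, the statistic used throughout is the standard Pearson chi-squared statistic. The notation here is introduced for illustration and does not appear elsewhere in the report: \(O_{ij}\) is the observed count in row \(i\) and column \(j\), \(R_i\) and \(C_j\) are the row and column totals, \(n\) is the number of hospitals in the table, and \(r\) and \(c\) are the numbers of rows and columns.

\[
E_{ij} = \frac{R_i C_j}{n}, \qquad \chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad \text{df} = (r - 1)(c - 1)
\]

For example, the expected count of Inland Empire hospitals with “high” emergency department volume is \(24 \times 57 / 250 = 5.472\), which matches the expected-frequency table above.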

Calculating Chi-squared Test Statistic of Observed data

The first thing we will want to do in our hypothesis testing is calculate the observed \(\chi^2\) statistic for each of our variables. The lists below give the statistics first for the data with missing values removed, and then for a new dataset that recodes missing observations as an unknown category.

## $HAI_1_SIR
## Response: HAI_1_SIR (factor)
## Explanatory: inland_empire (factor)
## Null Hypothesis: independence
## # A tibble: 1 × 1
##    stat
##   <dbl>
## 1 0.421
## 
## $HAI_2_SIR
## Response: HAI_2_SIR (factor)
## Explanatory: inland_empire (factor)
## Null Hypothesis: independence
## # A tibble: 1 × 1
##    stat
##   <dbl>
## 1  1.17
## 
## $HAI_3_SIR
## Response: HAI_3_SIR (factor)
## Explanatory: inland_empire (factor)
## Null Hypothesis: independence
## # A tibble: 1 × 1
##    stat
##   <dbl>
## 1 0.184
## 
## $HAI_4_SIR
## Response: HAI_4_SIR (factor)
## Explanatory: inland_empire (factor)
## Null Hypothesis: independence
## # A tibble: 1 × 1
##       stat
##      <dbl>
## 1 1.82e-30
## 
## $HAI_5_SIR
## Response: HAI_5_SIR (factor)
## Explanatory: inland_empire (factor)
## Null Hypothesis: independence
## # A tibble: 1 × 1
##    stat
##   <dbl>
## 1 0.229
## 
## $HAI_6_SIR
## Response: HAI_6_SIR (factor)
## Explanatory: inland_empire (factor)
## Null Hypothesis: independence
## # A tibble: 1 × 1
##    stat
##   <dbl>
## 1  1.64
## 
## $H_COMP_1_STAR_RATING
## Response: H_COMP_1_STAR_RATING (factor)
## Explanatory: inland_empire (factor)
## Null Hypothesis: independence
## # A tibble: 1 × 1
##    stat
##   <dbl>
## 1  2.43
## 
## $H_COMP_2_STAR_RATING
## Response: H_COMP_2_STAR_RATING (factor)
## Explanatory: inland_empire (factor)
## Null Hypothesis: independence
## # A tibble: 1 × 1
##    stat
##   <dbl>
## 1  1.48
## 
## $H_COMP_3_STAR_RATING
## Response: H_COMP_3_STAR_RATING (factor)
## Explanatory: inland_empire (factor)
## Null Hypothesis: independence
## # A tibble: 1 × 1
##    stat
##   <dbl>
## 1  2.15
## 
## $emergency_department_volume
## Response: emergency_department_volume (factor)
## Explanatory: inland_empire (factor)
## Null Hypothesis: independence
## # A tibble: 1 × 1
##    stat
##   <dbl>
## 1  11.1
## $HAI_1_SIR
## Response: HAI_1_SIR (factor)
## Explanatory: inland_empire (factor)
## Null Hypothesis: independence
## # A tibble: 1 × 1
##    stat
##   <dbl>
## 1  4.89
## 
## $HAI_2_SIR
## Response: HAI_2_SIR (factor)
## Explanatory: inland_empire (factor)
## Null Hypothesis: independence
## # A tibble: 1 × 1
##    stat
##   <dbl>
## 1  7.19
## 
## $HAI_3_SIR
## Response: HAI_3_SIR (factor)
## Explanatory: inland_empire (factor)
## Null Hypothesis: independence
## # A tibble: 1 × 1
##    stat
##   <dbl>
## 1  1.47
## 
## $HAI_4_SIR
## Response: HAI_4_SIR (factor)
## Explanatory: inland_empire (factor)
## Null Hypothesis: independence
## # A tibble: 1 × 1
##    stat
##   <dbl>
## 1  2.38
## 
## $HAI_5_SIR
## Response: HAI_5_SIR (factor)
## Explanatory: inland_empire (factor)
## Null Hypothesis: independence
## # A tibble: 1 × 1
##    stat
##   <dbl>
## 1  5.87
## 
## $HAI_6_SIR
## Response: HAI_6_SIR (factor)
## Explanatory: inland_empire (factor)
## Null Hypothesis: independence
## # A tibble: 1 × 1
##    stat
##   <dbl>
## 1  9.20
## 
## $H_COMP_1_STAR_RATING
## Response: H_COMP_1_STAR_RATING (factor)
## Explanatory: inland_empire (factor)
## Null Hypothesis: independence
## # A tibble: 1 × 1
##    stat
##   <dbl>
## 1  7.74
## 
## $H_COMP_2_STAR_RATING
## Response: H_COMP_2_STAR_RATING (factor)
## Explanatory: inland_empire (factor)
## Null Hypothesis: independence
## # A tibble: 1 × 1
##    stat
##   <dbl>
## 1  6.64
## 
## $H_COMP_3_STAR_RATING
## Response: H_COMP_3_STAR_RATING (factor)
## Explanatory: inland_empire (factor)
## Null Hypothesis: independence
## # A tibble: 1 × 1
##    stat
##   <dbl>
## 1  7.41
## 
## $emergency_department_volume
## Response: emergency_department_volume (factor)
## Explanatory: inland_empire (factor)
## Null Hypothesis: independence
## # A tibble: 1 × 1
##    stat
##   <dbl>
## 1  16.7
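The statistics for the data with missing values removed can be reproduced by hand from the observed contingency tables. The Python sketch below is an illustration, not the report's code. One point is an inference from the numbers rather than something the report states: for the two 2×2 tables (HAI_4_SIR and HAI_6_SIR) the reported values match the statistic with a continuity correction applied, which is why HAI_4_SIR comes out as essentially zero.

```python
def chi_squared_stat(observed, continuity_correction=False):
    """Pearson chi-squared statistic for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n
            diff = abs(o - e)
            if continuity_correction:
                # Shrink each |O - E| by 0.5, but never below zero.
                diff = max(diff - 0.5, 0.0)
            stat += diff ** 2 / e
    return stat


# Observed counts copied from the contingency tables (rows: no, yes).
ed = [[50, 61, 69, 46], [7, 2, 4, 11]]   # emergency_department_volume
hai_1 = [[26, 162, 17], [4, 17, 2]]      # HAI_1_SIR
hai_4 = [[38, 2], [6, 0]]                # HAI_4_SIR (2x2)
hai_6 = [[149, 102], [20, 7]]            # HAI_6_SIR (2x2)

ed_stat = chi_squared_stat(ed)
hai_1_stat = chi_squared_stat(hai_1)
hai_4_stat = chi_squared_stat(hai_4, continuity_correction=True)
hai_6_stat = chi_squared_stat(hai_6, continuity_correction=True)
```

Rounded, these give about 11.1, 0.421, 0 and 1.64, in line with the first list above.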

Calculate Null-distributions

The next step in our hypothesis testing is to generate null distributions. We can do this by simulation or by assuming a theoretical distribution. We will generate both a simulation-based distribution and a theoretical \(\chi^2\) distribution for both datasets. We print only the theoretical \(\chi^2\) distributions below, first for the data with missing values removed and then for the data with missing values recoded as unknown.

## $HAI_1_SIR
## A Chi-squared distribution with 2 degrees of freedom.
## $HAI_2_SIR
## A Chi-squared distribution with 2 degrees of freedom.
## $HAI_3_SIR
## A Chi-squared distribution with 2 degrees of freedom.
## $HAI_4_SIR
## A Chi-squared distribution with 1 degree of freedom.
## $HAI_5_SIR
## A Chi-squared distribution with 2 degrees of freedom.
## $HAI_6_SIR
## A Chi-squared distribution with 1 degree of freedom.
## $H_COMP_1_STAR_RATING
## A Chi-squared distribution with 2 degrees of freedom.
## $H_COMP_2_STAR_RATING
## A Chi-squared distribution with 2 degrees of freedom.
## $H_COMP_3_STAR_RATING
## A Chi-squared distribution with 2 degrees of freedom.
## $emergency_department_volume
## A Chi-squared distribution with 3 degrees of freedom.
## $HAI_1_SIR
## A Chi-squared distribution with 3 degrees of freedom.
## $HAI_2_SIR
## A Chi-squared distribution with 3 degrees of freedom.
## $HAI_3_SIR
## A Chi-squared distribution with 3 degrees of freedom.
## $HAI_4_SIR
## A Chi-squared distribution with 2 degrees of freedom.
## $HAI_5_SIR
## A Chi-squared distribution with 3 degrees of freedom.
## $HAI_6_SIR
## A Chi-squared distribution with 2 degrees of freedom.
## $H_COMP_1_STAR_RATING
## A Chi-squared distribution with 3 degrees of freedom.
## $H_COMP_2_STAR_RATING
## A Chi-squared distribution with 3 degrees of freedom.
## $H_COMP_3_STAR_RATING
## A Chi-squared distribution with 3 degrees of freedom.
## $emergency_department_volume
## A Chi-squared distribution with 4 degrees of freedom.
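One common way to build a simulation-based null distribution for a test of independence is permutation: shuffle the outcome labels across hospitals, recompute the statistic, and repeat. The Python sketch below illustrates the idea on the emergency department volume counts. Whether the report's simulation used exactly this scheme is an assumption, and the function names, number of repetitions and seed are all illustrative.

```python
import random
from collections import Counter


def chi_squared_from_labels(groups, outcomes):
    """Pearson chi-squared statistic computed from paired label lists."""
    n = len(groups)
    cell = Counter(zip(groups, outcomes))
    group_totals = Counter(groups)
    outcome_totals = Counter(outcomes)
    stat = 0.0
    for g, g_total in group_totals.items():
        for o, o_total in outcome_totals.items():
            e = g_total * o_total / n
            stat += (cell[(g, o)] - e) ** 2 / e
    return stat


def permutation_null(groups, outcomes, reps=1000, seed=0):
    """Statistics obtained after shuffling outcomes, which breaks any association."""
    rng = random.Random(seed)
    shuffled = list(outcomes)
    stats = []
    for _ in range(reps):
        rng.shuffle(shuffled)
        stats.append(chi_squared_from_labels(groups, shuffled))
    return stats


def degrees_of_freedom(n_rows, n_cols):
    """Degrees of freedom of the theoretical chi-squared distribution."""
    return (n_rows - 1) * (n_cols - 1)


# Rebuild one label pair per hospital from the observed counts.
ed_counts = {
    "no": {"high": 50, "low": 61, "medium": 69, "very high": 46},
    "yes": {"high": 7, "low": 2, "medium": 4, "very high": 11},
}
groups, outcomes = [], []
for g, row in ed_counts.items():
    for o, k in row.items():
        groups += [g] * k
        outcomes += [o] * k

observed = chi_squared_from_labels(groups, outcomes)
null_stats = permutation_null(groups, outcomes, reps=500, seed=1)
```

The degrees of freedom printed above follow the same rule: a 2×3 table gives 2, a 2×2 table gives 1, and adding an unknown category adds one column and therefore one degree of freedom.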

Calculating the P-values

Now that we have these distributions, we can calculate the p-values for our variables based on the simulated distributions (remember that some of our variables had categories with expected frequencies below 5).

The table below indicates that the variation we see between the Inland Empire and the rest of California in most of these variables does not meet our threshold for statistical significance. Thus, for most of these variables, we do not reject the null hypothesis that there is no difference between the performance of hospitals in the Inland Empire and elsewhere in California. However, in the dataset where we categorize missing values as unknown, HAI_6_SIR becomes statistically significant (p = 0.026) and several other p-values drop sharply (for example, HAI_2_SIR falls from 0.908 to 0.098), suggesting that unknown values are correlated with our inland_empire variable. Further research using data with fewer missing values, or a reliable method to impute missing values, may be of significant value for our research question.

We do, however, find in both datasets that the p-value for the relationship between our inland_empire variable and emergency department volume meets the threshold we set for statistical significance. This suggests that the difference we observed, with hospitals in the Inland Empire experiencing high volumes of patients in their emergency departments more often than hospitals elsewhere in California, is unlikely to be due to random variation alone and that there is a relationship between the two variables. Further research may want to investigate what causes hospitals in the Inland Empire to experience high and very high emergency department volumes more often than hospitals in other parts of California, and to work with surrounding communities to reduce those emergencies and/or be as prepared as possible to handle them.

measure p_value_missing p_value_unknown
HAI_1_SIR 0.282 0.376
HAI_2_SIR 0.908 0.098
HAI_3_SIR 0.220 0.494
HAI_4_SIR 1.000 0.648
HAI_5_SIR 0.178 0.210
HAI_6_SIR 0.280 0.026
H_COMP_1_STAR_RATING 0.628 0.110
H_COMP_2_STAR_RATING 0.894 0.210
H_COMP_3_STAR_RATING 0.630 0.100
emergency_department_volume 0.024 0.004
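As a cross-check on the table, the p-values can be approximated from the theoretical chi-squared distribution. Because the chi-squared statistic grows with departures from independence in any direction, the conventional p-value is the upper-tail area beyond the observed statistic. The Python sketch below is illustrative only; it uses closed-form tail formulas that hold for 1 to 4 degrees of freedom, and the function names are hypothetical. The values in the table appear broadly consistent with two-sided p-values (twice the smaller tail area): for emergency_department_volume twice the theoretical upper tail is about 0.023 against the reported 0.024. This is an inference from the numbers, not something the report states. An upper-tail calculation leads to the same decision for that variable at the 0.05 threshold.

```python
import math


def chi2_upper_tail(x, df):
    """P(X >= x) for a chi-squared variable; closed forms for df = 1..4 only."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    if df == 1:
        return math.erfc(math.sqrt(h))
    if df == 2:
        return math.exp(-h)
    if df == 3:
        return math.erfc(math.sqrt(h)) + 2.0 * math.sqrt(h / math.pi) * math.exp(-h)
    if df == 4:
        return math.exp(-h) * (1.0 + h)
    raise ValueError("this sketch only covers 1 to 4 degrees of freedom")


def simulated_p_value(null_stats, observed):
    """Share of simulated statistics at least as large as the observed one."""
    return sum(1 for s in null_stats if s >= observed) / len(null_stats)


# Observed statistics for the data with missing values removed.
ed_p = chi2_upper_tail(11.0749, df=3)   # emergency_department_volume
hai_3_p = chi2_upper_tail(0.184, df=2)  # HAI_3_SIR
```

The same `simulated_p_value` function applies to a simulation-based null distribution, which is the safer choice for the tables with small expected counts.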

Visualize our results

A useful feature of the infer package is that it lets us visualize the null distribution together with our observed test statistic and its p-value, giving a better idea of how extreme our observed data is under the null hypothesis. For this analysis we will look at only two variables: emergency_department_volume, and HAI_3_SIR, the measure of surgical site infections from colon surgery.

First, let’s visualize the p-value for emergency department volume. The graph below is built from the null distribution and observed statistic for the data where missing values were removed. The grey bars show the simulation-based null distribution, while the curve shows the theoretical chi-squared distribution. The red bar is our observed test statistic. The graph illustrates that if our null hypothesis were true, it would be very unlikely (although still possible) to observe a test statistic as extreme as the one from our data. This visualization supports the conclusion that hospitals in the Inland Empire most likely experience higher volumes of patients in their emergency departments for a reason other than random variation.

Now, let’s visualize the null distribution and our observed test statistic for our measure of healthcare-associated infections at the site of colon surgeries. In our contingency table we saw that hospitals in the Inland Empire performed better than the national average more often than would be expected, but not by much. This visualization helps confirm that while the bulk of the test statistics from our simulated and theoretical null distributions are not as extreme as the one we observed, a fair share of the null distribution is still at least as extreme as our observed statistic. This suggests that while it is possible that Inland Empire hospitals perform better than the national average more often than hospitals elsewhere in California on this measure, we don’t have enough evidence to confidently conclude that the difference isn’t due to random variation. Because our p-value did not meet our pre-determined threshold, we fail to reject the null hypothesis that hospitals in the Inland Empire provide a level of care on this measure that is no different from that of hospitals located elsewhere in California. However, someone who is particularly prone to action and requires a lower threshold of evidence might look at this same data and decide that hospitals elsewhere in California should consider looking to hospitals in the Inland Empire for practices that help reduce healthcare-associated infections from colon surgery.
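
The "more often than would be expected" comparison, and the concern about small expected frequencies that motivates the simulated null distributions, both rest on the expected counts under independence: each cell's expected count is its row total times its column total, divided by the grand total. The sketch below shows that calculation in plain Python. The category labels and counts are hypothetical and chosen only for illustration, and the threshold of 5 is the rule of thumb referred to in this analysis.

```python
def expected_counts(observed):
    """Expected cell counts under independence.

    observed maps each row label to a dict of column label -> count.
    """
    row_totals = {r: sum(cols.values()) for r, cols in observed.items()}
    col_totals = {}
    for cols in observed.values():
        for c, count in cols.items():
            col_totals[c] = col_totals.get(c, 0) + count
    n = sum(row_totals.values())
    return {
        r: {c: row_totals[r] * col_totals[c] / n for c in col_totals}
        for r in observed
    }

# Hypothetical contingency table (not the CMS data).
observed = {
    "inland_empire": {"better": 3, "no_different": 15, "worse": 2},
    "elsewhere": {"better": 17, "no_different": 150, "worse": 13},
}

expected = expected_counts(observed)

# Cells whose expected count falls below the rule-of-thumb minimum of 5.
small_cells = [
    (row, col)
    for row, cols in expected.items()
    for col, value in cols.items()
    if value < 5
]
print(expected)
print(small_cells)
```

When any cell is flagged this way, the theoretical chi-squared curve may be a poor approximation, which is why comparing the observed statistic to a simulated null distribution is the more cautious choice.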

Conclusion

Throughout this analysis we have described our data and its source and performed an exploratory analysis to visualize differences in the outcomes of care between hospitals in Southern California’s Inland Empire and hospitals elsewhere in California. We made contingency tables of the observed values and their expected frequencies and determined that, due to the number of missing values in our variables, we would analyze two datasets: one where the missing values were removed, and one where they were categorized as “unknown” so they could be included in the analysis. Finally, we performed chi-squared hypothesis tests across our variables to determine whether observed differences in outcomes between hospitals in the Inland Empire and elsewhere in California might reflect a meaningful difference in hospital performance or just random variation. Because some of our variables had expected frequencies of less than 5, which violates a key assumption of the theoretical chi-squared test, we generated simulation-based null distributions and used them to perform the tests.

Our results show that for most of the variables of interest, differences in performance between hospitals in the Inland Empire and hospitals elsewhere in California are not large enough for us to rule out random variation. Thus we fail to reject the null hypotheses that hospitals in the Inland Empire achieve outcomes of care that are no different from those achieved by other hospitals in California. However, we do find enough evidence to reject the null hypothesis that emergency departments at hospitals in the Inland Empire experience the same volumes of patients as those at hospitals elsewhere in California. The evidence suggests that hospitals in the Inland Empire experience meaningfully higher volumes of patients in their emergency departments.