Biostats Final Project

Part 1 - Introduction

This data was collected from multiple regions in the US. US territories along with Alaska and Hawaii are listed in other. The purpose of this study is to see if the region affects the wait time in the ER for people who are on medicaid or some type of government aid. A further study could be done to see if the amount of cities also affects the wait time. Our major question is does the region of the US affect the wait time in the ER? Our explanatory variable is the region and the response variable is the wait time. In this section we loaded all necessary packages, fixed the data, and grouped our states/US territories into regions.

#load data
care_state <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-04-08/care_state.csv')

## Rows: 1232 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (5): state, condition, measure_id, measure_name, footnote
## dbl  (1): score
## date (2): start_date, end_date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

#look at data
View(care_state)

care_state_summary <- care_state |> 
  group_by(state) |> 
  summarise(
    avg_score = mean(score, na.rm = TRUE),
    n = n()
  ) |> 
  arrange(desc(avg_score))
care_state_summary

## # A tibble: 56 × 3
##    state avg_score     n
##    <chr>     <dbl> <int>
##  1 RI         179.    22
##  2 MD         179.    22
##  3 DC         178.    22
##  4 MA         168.    22
##  5 DE         168.    22
##  6 ME         165.    22
##  7 PR         165.    22
##  8 VT         156.    22
##  9 IL         154.    22
## 10 NC         152.    22
## # ℹ 46 more rows

## R doesn't understand some of these variables!
## Let's help R understand which variables are factors and which are numerical
is.factor(care_state$condition)

## [1] FALSE

care_state$condition <- as.factor(care_state$condition)
levels(care_state$condition)

## [1] "Cataract surgery outcome"            "Colonoscopy care"                   
## [3] "Electronic Clinical Quality Measure" "Emergency Department"               
## [5] "Healthcare Personnel Vaccination"    "Sepsis Care"

is.factor(care_state$measure_name)

## [1] FALSE

care_state$measure_name <- as.factor(care_state$measure_name)
levels(care_state$measure_name)

##  [1] "Average (median) time patients spent in the emergency department before leaving from the visit A lower number of minutes is better"                                          
##  [2] "Average (median) time patients spent in the emergency department before leaving from the visit- Psychiatric/Mental Health Patients.  A lower number of minutes is better"    
##  [3] "Average time patients spent in the emergency department before being sent home A lower number of minutes is better (high)"                                                   
##  [4] "Average time patients spent in the emergency department before being sent home A lower number of minutes is better (low)"                                                    
##  [5] "Average time patients spent in the emergency department before being sent home A lower number of minutes is better (moderate)"                                               
##  [6] "Average time patients spent in the emergency department before leaving from the visit - Psychiatric/Mental Health Patients.  A lower number of minutes is better (high)"     
##  [7] "Average time patients spent in the emergency department before leaving from the visit - Psychiatric/Mental Health Patients.  A lower number of minutes is better (low)"      
##  [8] "Average time patients spent in the emergency department before leaving from the visit - Psychiatric/Mental Health Patients.  A lower number of minutes is better (moderate)" 
##  [9] "Average time patients spent in the emergency department before leaving from the visit - Psychiatric/Mental Health Patients.  A lower number of minutes is better (very high)"
## [10] "Healthcare workers given influenza vaccination Higher percentages are better"                                                                                                
## [11] "Percentage of healthcare personnel who are up to date with COVID-19 vaccinations"                                                                                            
## [12] "Percentage of patients receiving appropriate recommendation for follow-up screening colonoscopy Higher percentages are better"                                               
## [13] "Percentage of patients who came to the emergency department with stroke symptoms who received brain scan results within 45 minutes of arrival Higher percentages are better" 
## [14] "Percentage of patients who had cataract surgery and had improvement in visual function within 90 days following the surgery Higher percentages are better"                   
## [15] "Percentage of patients who left the emergency department before being seen Lower percentages are better"                                                                     
## [16] "Percentage of patients who received appropriate care for severe sepsis and septic shock. Higher percentages are better"                                                      
## [17] "Safe Use of Opioids - Concurrent Prescribing"                                                                                                                                
## [18] "Septic Shock 3-Hour Bundle"                                                                                                                                                  
## [19] "Septic Shock 6-Hour Bundle"                                                                                                                                                  
## [20] "Severe Sepsis 3-Hour Bundle"                                                                                                                                                 
## [21] "Severe Sepsis 6-Hour Bundle"

levels(care_state$measure_id)

## NULL

is.factor(care_state$measure_id)

## [1] FALSE

care_state$measure_id <- as.factor(care_state$measure_id)
levels(care_state$measure_id)

##  [1] "HCP_COVID_19"         "IMM_3"                "OP_18b"              
##  [4] "OP_18b_HIGH_MIN"      "OP_18b_LOW_MIN"       "OP_18b_MEDIUM_MIN"   
##  [7] "OP_18b_VERY_HIGH_MIN" "OP_18c"               "OP_18c_HIGH_MIN"     
## [10] "OP_18c_LOW_MIN"       "OP_18c_MEDIUM_MIN"    "OP_18c_VERY_HIGH_MIN"
## [13] "OP_22"                "OP_23"                "OP_29"               
## [16] "OP_31"                "SAFE_USE_OF_OPIOIDS"  "SEP_1"               
## [19] "SEP_SH_3HR"           "SEP_SH_6HR"           "SEV_SEP_3HR"         
## [22] "SEV_SEP_6HR"

#filter rows where team column is equal to factor label 'OP_18b' (Average median wait time (see online info))
subset.df <- care_state %>% 
  filter(measure_id %in% c('OP_18b'))
View(subset.df)
is.numeric(subset.df$score)

## [1] TRUE

# Add custom region bins
subset.df <- subset.df |> 
  mutate(
    region = case_when(
      state %in% c("AZ", "NM", "TX", "OK", "CA", "NV") ~ "Southwest",
      state %in% c("WA", "OR", "ID", "CO", "MT", "WY", "UT") ~ "Northwest",
      state %in% c("ND", "SD", "MN", "WI", "IA", "MI") ~ "Upper Midwest",
      state %in% c("MO", "IL", "IN", "OH", "NE", "KS") ~ "Lower Midwest",
      state %in% c("FL", "GA", "AL", "MS", "SC", "NC", "TN", "KY", "LA", "AR", "WV", "VA") ~ "Southeast",
      state %in% c("NY", "NJ", "PA", "CT", "MA", "RI", "NH", "VT", "ME", "DE", "MD", "DC") ~ "Northeast",
      TRUE ~ "Other"  # fallback if state is not matched
    )
  )
??reorder
subset.df$region <- factor(subset.df$region, levels=c("Northwest", "Upper Midwest", "Lower Midwest", "Southwest", "Northeast", "Southeast", "Other"))
#null = the region does not affect the er wait times
#alt. = the region does affect the er wait times

Region Wait-times

Part 2 - Exploring the Data (Descriptive Statistics)

We calculated the mean, median, standard deviation, and standard error of the ER wait times, and plotted the oringal data as a bar graph. A tranformation was done to try to normalize the data. Then graphs by each region were made for comparison.

mean(subset.df$score, na.rm=TRUE)

## [1] 161.1923

median(subset.df$score, na.rm= TRUE)

## [1] 154.5

sd(subset.df$score, na.rm=TRUE)

## [1] 42.43243

sd(subset.df$score, na.rm=TRUE)/sqrt(52)

## [1] 5.88432

library(ggplot2)
q <- ggplot(subset.df, aes(x=state, y=score)) + 
  geom_bar(stat = "identity") + theme_minimal()
q + theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))

## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_bar()`).

## maybe we should bin by regional area
## Southwest, Northwest, Upper Midwest, Lower Midwest, Southeast, Northeast
subsetdflog <- log(subset.df$score)
hist(subsetdflog)

#Our original graph was all over the place, so we transformed the graph to see 
#if we could get a more normal distribution and the log transformation did that.

## histogram
library(ggpubr)
histogram1 <- gghistogram(subset.df, x = "score",
            add = "mean", rug = TRUE,
            color = "region", palette = c("#00AFBB", "#E7B800", "#00B", "#E80", "#F89", "#E67", "#E24"))

## Warning: Using `bins = 30` by default. Pick better value with the argument
## `bins`.

facet(histogram1 + theme_bw(), facet.by = "region")

## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_bin()`).

## Let's make a boxplot 
ggplot(subset.df, aes(x=region, y=score, fill = region)) + 
  geom_boxplot() + ylab("Median Wait Time in Emergency Department")

## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

ggplot(subset.df, aes(x=region, y=score, fill = region)) + 
  geom_bar(stat = "identity")

## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_bar()`).

#make qqplot
levels(subset.df$region)

## [1] "Northwest"     "Upper Midwest" "Lower Midwest" "Southwest"    
## [5] "Northeast"     "Southeast"     "Other"

ggqqplot1 <- ggqqplot(subset.df, x = "score",
                      add = "mean", rug = TRUE,
                      color = "region", palette = c("#00AFBB", "#E7B800", "#00B", "#E80", "#F89", "#E67", "#E24"))
facet(ggqqplot1 + theme_bw(), facet.by = "region")

## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_qq()`).

## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_qq_line()`).
## Removed 4 rows containing non-finite outside the scale range
## (`stat_qq_line()`).

Part 3 - Statistical Test (Inferential Statistics)

We included the appropraite statistical test needed for our data. Explainations for why we used each test is listed below in the hashtags.

is.numeric(subset.df$score)

## [1] TRUE

is.factor(subset.df$state)

## [1] FALSE

#subset.df$state <- as.factor(as.character(subset.df$state))
hist(subset.df$score)

#filtered.data <- filter(subset.df, region)
model.1 <- lm(score ~ region, data = subset.df)
hist(model.1$residuals)

summary(model.1)

## 
## Call:
## lm(formula = score ~ region, data = subset.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -70.667 -18.521  -5.286  13.491 116.333 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          139.286     12.691  10.975  2.6e-14 ***
## regionUpper Midwest  -11.119     18.680  -0.595 0.554674    
## regionLower Midwest    1.214     18.680   0.065 0.948459    
## regionSouthwest       13.881     18.680   0.743 0.461297    
## regionNortheast       68.048     15.969   4.261 0.000102 ***
## regionSoutheast       13.298     15.969   0.833 0.409401    
## regionOther           46.381     23.170   2.002 0.051362 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33.58 on 45 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.4475, Adjusted R-squared:  0.3738 
## F-statistic: 6.075 on 6 and 45 DF,  p-value: 0.000102

#The adjusted R squared value of 0.3738 shows there is little variance between the 
#ER wait times.

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

ANOVA.model <- car::Anova(model.1)
ANOVA.model

## Anova Table (Type II tests)
## 
## Response: score
##           Sum Sq Df F value   Pr(>F)    
## region     41093  6  6.0749 0.000102 ***
## Residuals  50733 45                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

hist(resid(model.1))

qqnorm(resid(model.1))
qqline(resid(model.1))

shapiro.test(resid(model.1))

## 
##  Shapiro-Wilk normality test
## 
## data:  resid(model.1)
## W = 0.88908, p-value = 0.0001609

#not normal

leveneTest(data = subset.df, score ~ region, center = mean)

## Levene's Test for Homogeneity of Variance (center = mean)
##       Df F value    Pr(>F)    
## group  6  7.0888 2.343e-05 ***
##       45                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

kruskal.test(score ~ region, data=subset.df)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  score by region
## Kruskal-Wallis chi-squared = 26.709, df = 6, p-value = 0.0001641

#we chose this because our shapiro showed abnormalities so we ran a levenes test. 
#the p value still showed that a kruskal should be ran. We also observed in our 
#plots that the data was not normal since there were major outliers that skewed
#our graph 
#the p value is smaller than 0.05 so we are able to reject the null. 
#T test = 26.709, df= 6, p val = 0.0001641
#the null hypothesis can be reject and the regon does significantly affect er 
#wait times due to the p value.

Part 4 - Conclusion

We can conclude that we can reject our null hypothesis that the region does not affect the wait time in the ER, since our p value = 0.0001641 which is less than 0.05 and a T test = 26.709, df= 6. This analysis shows us that there could be
faults in the government aid programs in healthcare. This also leads to poorer patient care due to long wait times and diseases are left in the public longer, or
people opt out of getting seen to avoid long lines. Limitations in this study is that city populations (and in a sense state populations) are not accurately taken into account for average wait times in the ER.

References

 R for Data Science. (2025, April 8). TidyTuesday: 2025-04-08 [Data set]. GitHub. https://github.com/rfordatascience/tidytuesday/blob/main/data/2025/2025-04-08/readme.md

Zhu, K. (2025, January 13). Mapped: Emergency room visit times by State. Visual Capitalist. https://www.visualcapitalist.com/mapped-emergency-room-visit-times-by-state/