This data was collected from multiple regions in the US. US territories along with Alaska and Hawaii are listed in other. The purpose of this study is to see if the region affects the wait time in the ER for people who are on medicaid or some type of government aid. A further study could be done to see if the amount of cities also affects the wait time. Our major question is does the region of the US affect the wait time in the ER? Our explanatory variable is the region and the response variable is the wait time. In this section we loaded all necessary packages, fixed the data, and grouped our states/US territories into regions.
#load data
care_state <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-04-08/care_state.csv')
## Rows: 1232 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): state, condition, measure_id, measure_name, footnote
## dbl (1): score
## date (2): start_date, end_date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#look at data
View(care_state)
care_state_summary <- care_state |>
group_by(state) |>
summarise(
avg_score = mean(score, na.rm = TRUE),
n = n()
) |>
arrange(desc(avg_score))
care_state_summary
## # A tibble: 56 × 3
## state avg_score n
## <chr> <dbl> <int>
## 1 RI 179. 22
## 2 MD 179. 22
## 3 DC 178. 22
## 4 MA 168. 22
## 5 DE 168. 22
## 6 ME 165. 22
## 7 PR 165. 22
## 8 VT 156. 22
## 9 IL 154. 22
## 10 NC 152. 22
## # ℹ 46 more rows
## R doesn't understand some of these variables!
## Let's help R understand which variables are factors and which are numerical
is.factor(care_state$condition)
## [1] FALSE
care_state$condition <- as.factor(care_state$condition)
levels(care_state$condition)
## [1] "Cataract surgery outcome" "Colonoscopy care"
## [3] "Electronic Clinical Quality Measure" "Emergency Department"
## [5] "Healthcare Personnel Vaccination" "Sepsis Care"
is.factor(care_state$measure_name)
## [1] FALSE
care_state$measure_name <- as.factor(care_state$measure_name)
levels(care_state$measure_name)
## [1] "Average (median) time patients spent in the emergency department before leaving from the visit A lower number of minutes is better"
## [2] "Average (median) time patients spent in the emergency department before leaving from the visit- Psychiatric/Mental Health Patients. A lower number of minutes is better"
## [3] "Average time patients spent in the emergency department before being sent home A lower number of minutes is better (high)"
## [4] "Average time patients spent in the emergency department before being sent home A lower number of minutes is better (low)"
## [5] "Average time patients spent in the emergency department before being sent home A lower number of minutes is better (moderate)"
## [6] "Average time patients spent in the emergency department before leaving from the visit - Psychiatric/Mental Health Patients. A lower number of minutes is better (high)"
## [7] "Average time patients spent in the emergency department before leaving from the visit - Psychiatric/Mental Health Patients. A lower number of minutes is better (low)"
## [8] "Average time patients spent in the emergency department before leaving from the visit - Psychiatric/Mental Health Patients. A lower number of minutes is better (moderate)"
## [9] "Average time patients spent in the emergency department before leaving from the visit - Psychiatric/Mental Health Patients. A lower number of minutes is better (very high)"
## [10] "Healthcare workers given influenza vaccination Higher percentages are better"
## [11] "Percentage of healthcare personnel who are up to date with COVID-19 vaccinations"
## [12] "Percentage of patients receiving appropriate recommendation for follow-up screening colonoscopy Higher percentages are better"
## [13] "Percentage of patients who came to the emergency department with stroke symptoms who received brain scan results within 45 minutes of arrival Higher percentages are better"
## [14] "Percentage of patients who had cataract surgery and had improvement in visual function within 90 days following the surgery Higher percentages are better"
## [15] "Percentage of patients who left the emergency department before being seen Lower percentages are better"
## [16] "Percentage of patients who received appropriate care for severe sepsis and septic shock. Higher percentages are better"
## [17] "Safe Use of Opioids - Concurrent Prescribing"
## [18] "Septic Shock 3-Hour Bundle"
## [19] "Septic Shock 6-Hour Bundle"
## [20] "Severe Sepsis 3-Hour Bundle"
## [21] "Severe Sepsis 6-Hour Bundle"
levels(care_state$measure_id)
## NULL
is.factor(care_state$measure_id)
## [1] FALSE
care_state$measure_id <- as.factor(care_state$measure_id)
levels(care_state$measure_id)
## [1] "HCP_COVID_19" "IMM_3" "OP_18b"
## [4] "OP_18b_HIGH_MIN" "OP_18b_LOW_MIN" "OP_18b_MEDIUM_MIN"
## [7] "OP_18b_VERY_HIGH_MIN" "OP_18c" "OP_18c_HIGH_MIN"
## [10] "OP_18c_LOW_MIN" "OP_18c_MEDIUM_MIN" "OP_18c_VERY_HIGH_MIN"
## [13] "OP_22" "OP_23" "OP_29"
## [16] "OP_31" "SAFE_USE_OF_OPIOIDS" "SEP_1"
## [19] "SEP_SH_3HR" "SEP_SH_6HR" "SEV_SEP_3HR"
## [22] "SEV_SEP_6HR"
#filter rows where team column is equal to factor label 'OP_18b' (Average median wait time (see online info))
subset.df <- care_state %>%
filter(measure_id %in% c('OP_18b'))
View(subset.df)
is.numeric(subset.df$score)
## [1] TRUE
# Add custom region bins
subset.df <- subset.df |>
mutate(
region = case_when(
state %in% c("AZ", "NM", "TX", "OK", "CA", "NV") ~ "Southwest",
state %in% c("WA", "OR", "ID", "CO", "MT", "WY", "UT") ~ "Northwest",
state %in% c("ND", "SD", "MN", "WI", "IA", "MI") ~ "Upper Midwest",
state %in% c("MO", "IL", "IN", "OH", "NE", "KS") ~ "Lower Midwest",
state %in% c("FL", "GA", "AL", "MS", "SC", "NC", "TN", "KY", "LA", "AR", "WV", "VA") ~ "Southeast",
state %in% c("NY", "NJ", "PA", "CT", "MA", "RI", "NH", "VT", "ME", "DE", "MD", "DC") ~ "Northeast",
TRUE ~ "Other" # fallback if state is not matched
)
)
??reorder
subset.df$region <- factor(subset.df$region, levels=c("Northwest", "Upper Midwest", "Lower Midwest", "Southwest", "Northeast", "Southeast", "Other"))
#null = the region does not affect the er wait times
#alt. = the region does affect the er wait times
We calculated the mean, median, standard deviation, and standard error of the ER wait times, and plotted the oringal data as a bar graph. A tranformation was done to try to normalize the data. Then graphs by each region were made for comparison.
mean(subset.df$score, na.rm=TRUE)
## [1] 161.1923
median(subset.df$score, na.rm= TRUE)
## [1] 154.5
sd(subset.df$score, na.rm=TRUE)
## [1] 42.43243
sd(subset.df$score, na.rm=TRUE)/sqrt(52)
## [1] 5.88432
library(ggplot2)
q <- ggplot(subset.df, aes(x=state, y=score)) +
geom_bar(stat = "identity") + theme_minimal()
q + theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_bar()`).
## maybe we should bin by regional area
## Southwest, Northwest, Upper Midwest, Lower Midwest, Southeast, Northeast
subsetdflog <- log(subset.df$score)
hist(subsetdflog)
#Our original graph was all over the place, so we transformed the graph to see
#if we could get a more normal distribution and the log transformation did that.
## histogram
library(ggpubr)
histogram1 <- gghistogram(subset.df, x = "score",
add = "mean", rug = TRUE,
color = "region", palette = c("#00AFBB", "#E7B800", "#00B", "#E80", "#F89", "#E67", "#E24"))
## Warning: Using `bins = 30` by default. Pick better value with the argument
## `bins`.
facet(histogram1 + theme_bw(), facet.by = "region")
## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Let's make a boxplot
ggplot(subset.df, aes(x=region, y=score, fill = region)) +
geom_boxplot() + ylab("Median Wait Time in Emergency Department")
## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
ggplot(subset.df, aes(x=region, y=score, fill = region)) +
geom_bar(stat = "identity")
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_bar()`).
#make qqplot
levels(subset.df$region)
## [1] "Northwest" "Upper Midwest" "Lower Midwest" "Southwest"
## [5] "Northeast" "Southeast" "Other"
ggqqplot1 <- ggqqplot(subset.df, x = "score",
add = "mean", rug = TRUE,
color = "region", palette = c("#00AFBB", "#E7B800", "#00B", "#E80", "#F89", "#E67", "#E24"))
facet(ggqqplot1 + theme_bw(), facet.by = "region")
## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_qq()`).
## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_qq_line()`).
## Removed 4 rows containing non-finite outside the scale range
## (`stat_qq_line()`).
We included the appropraite statistical test needed for our data. Explainations for why we used each test is listed below in the hashtags.
is.numeric(subset.df$score)
## [1] TRUE
is.factor(subset.df$state)
## [1] FALSE
#subset.df$state <- as.factor(as.character(subset.df$state))
hist(subset.df$score)
#filtered.data <- filter(subset.df, region)
model.1 <- lm(score ~ region, data = subset.df)
hist(model.1$residuals)
summary(model.1)
##
## Call:
## lm(formula = score ~ region, data = subset.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -70.667 -18.521 -5.286 13.491 116.333
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 139.286 12.691 10.975 2.6e-14 ***
## regionUpper Midwest -11.119 18.680 -0.595 0.554674
## regionLower Midwest 1.214 18.680 0.065 0.948459
## regionSouthwest 13.881 18.680 0.743 0.461297
## regionNortheast 68.048 15.969 4.261 0.000102 ***
## regionSoutheast 13.298 15.969 0.833 0.409401
## regionOther 46.381 23.170 2.002 0.051362 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33.58 on 45 degrees of freedom
## (4 observations deleted due to missingness)
## Multiple R-squared: 0.4475, Adjusted R-squared: 0.3738
## F-statistic: 6.075 on 6 and 45 DF, p-value: 0.000102
#The adjusted R squared value of 0.3738 shows there is little variance between the
#ER wait times.
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
ANOVA.model <- car::Anova(model.1)
ANOVA.model
## Anova Table (Type II tests)
##
## Response: score
## Sum Sq Df F value Pr(>F)
## region 41093 6 6.0749 0.000102 ***
## Residuals 50733 45
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
hist(resid(model.1))
qqnorm(resid(model.1))
qqline(resid(model.1))
shapiro.test(resid(model.1))
##
## Shapiro-Wilk normality test
##
## data: resid(model.1)
## W = 0.88908, p-value = 0.0001609
#not normal
leveneTest(data = subset.df, score ~ region, center = mean)
## Levene's Test for Homogeneity of Variance (center = mean)
## Df F value Pr(>F)
## group 6 7.0888 2.343e-05 ***
## 45
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
kruskal.test(score ~ region, data=subset.df)
##
## Kruskal-Wallis rank sum test
##
## data: score by region
## Kruskal-Wallis chi-squared = 26.709, df = 6, p-value = 0.0001641
#we chose this because our shapiro showed abnormalities so we ran a levenes test.
#the p value still showed that a kruskal should be ran. We also observed in our
#plots that the data was not normal since there were major outliers that skewed
#our graph
#the p value is smaller than 0.05 so we are able to reject the null.
#T test = 26.709, df= 6, p val = 0.0001641
#the null hypothesis can be reject and the regon does significantly affect er
#wait times due to the p value.
We can conclude that we can reject our null hypothesis that the
region does not affect the wait time in the ER, since our p value =
0.0001641 which is less than 0.05 and a T test = 26.709, df= 6. This
analysis shows us that there could be
faults in the government aid programs in healthcare. This also leads to
poorer patient care due to long wait times and diseases are left in the
public longer, or
people opt out of getting seen to avoid long lines. Limitations in this
study is that city populations (and in a sense state populations) are
not accurately taken into account for average wait times in the ER.
R for Data Science. (2025, April 8). TidyTuesday: 2025-04-08 [Data set]. GitHub. https://github.com/rfordatascience/tidytuesday/blob/main/data/2025/2025-04-08/readme.md
Zhu, K. (2025, January 13). Mapped: Emergency room visit times by State. Visual Capitalist. https://www.visualcapitalist.com/mapped-emergency-room-visit-times-by-state/