c Following my analysis, I come up with the following insights.
The critical factor in wait times are the number of patients.
The number of patients vary by number of day or day of week.
My recommendations are as follows;
Vary the number of staff depending on days of week that have more emergency cases.
Vary staff by hour of day, with more staff during hours with more emergency cases like 1100 hrs, and less during times when there are fewer cases like around midnight.
Given that the hospital has fixed number of beds, there should be arrangements to create portable, temporary emergency wards and beds to cater for the upsurge in emergency cases duting particular days of week and hours of day.
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
if (!require(pacman)) {
install.packages("pacman")
}
## Loading required package: pacman
pacman::p_load(
tidyverse, janitor, skimr,
kableExtra, readxl, conflicted,
GGally, corrplot, ggthemes,
randomForest, vip, stargazer,
naivebayes, e1071, caret,
kernlab
)
options(digits = 3)
theme_set(ggthemes::theme_clean())
In hospital emergency wards, the longer the patient waits in a queue before being attended, the greater the risk of mortality. In this analysis, I use data regarding the patient wait times in an emergency department in an anonymous hospital to derive insights that can help the hospital management reduce the wait times. Specifically, the analysis aims uncovering factors associated with admission delays so as to advise the management of actionable strategies to reduce the delays.
To achieve this goal, I run the following models for the analysis.
Simple Linear Regression Model.
Naive Bayes Model.
Random Forest Model.
Support Vector Machines (SVM) Model.
But first, I load and explore the data, run the models, and, finally, make recommendations. The analysis then concludes.
In this section, I load and explore the data.
## Load the data ----
waiting <- readxl::read_xlsx("Hw_data.xlsx") |>
janitor::clean_names()
## Overview of the data
glimpse(waiting)
## Rows: 744
## Columns: 8
## $ date_and_time <dttm> 2016-05-01 00:00:00, 2016-05-01 …
## $ number_of_ed_beds <dbl> 72, 72, 72, 72, 72, 72, 72, 72, 7…
## $ number_of_ip_beds <dbl> 532, 532, 532, 532, 532, 532, 532…
## $ number_of_ed_pts <dbl> 57, 57, 52, 40, 40, 34, 26, 28, 2…
## $ number_of_ed_pts_waiting_ip_bed <dbl> 9, 8, 9, 9, 9, 9, 11, 10, 12, 11,…
## $ number_of_critical_care_pts_display <dbl> 3.5, 5.5, 5.5, 5.5, 6.5, 5.5, 6.0…
## $ door_to_bed_time_for_last_ed_patient <dbl> 1.82, 1.05, 1.10, 1.56, 0.01, 0.1…
## $ longest_admit_time_waiting_in_ed <dbl> 3.55, 4.25, 6.91, 1.49, 2.11, 3.3…
The data consists of 744 observations of 8 variables. Table 1 below describes the variables.
# Describing the variables in the raw data ----
tribble(
~Variable, ~Description,
"date_and_time",
"Time and date of patient arrival.",
"number_of_ed_beds",
"Number of beds in the emergency department.",
"number_of_ip_beds",
"Number of inpatient beds.",
"number_of_ed_pts",
"Number of emergency department patients.",
"number_of_ed_pts_waiting_ip_bed",
"Number of emergency department patients waiting for bed",
"number_of_critical_care_pts_display",
"Number of critical care patients display",
"door_to_bed_time_for_last_ed_patient",
"Door to bed time for last patient in emergency department.",
"longest_admit_time_waiting_in_ed",
"Longest admission time for patient waiting in emergency department."
) |>
kbl(booktabs = TRUE, caption = "Variables Description") |>
kable_classic(
full_width = TRUE,
latex_options = "hold_position"
)
Variable | Description |
---|---|
date_and_time | Time and date of patient arrival. |
number_of_ed_beds | Number of beds in the emergency department. |
number_of_ip_beds | Number of inpatient beds. |
number_of_ed_pts | Number of emergency department patients. |
number_of_ed_pts_waiting_ip_bed | Number of emergency department patients waiting for bed |
number_of_critical_care_pts_display | Number of critical care patients display |
door_to_bed_time_for_last_ed_patient | Door to bed time for last patient in emergency department. |
longest_admit_time_waiting_in_ed | Longest admission time for patient waiting in emergency department. |
The target variable (dependent variable) is the longest admit time waiting in emergency department (longest_admit_time_waiting_in_ed).
The data has no missing values.
## Check missing values
waiting |>
sapply(is.na) |>
colSums() |>
tibble(variables = names(waiting), missing = _) |>
kbl(booktabs = TRUE, caption = "Missing values") |>
kable_classic(
full_width = FALSE,
latex_options = "hold_position"
)
variables | missing |
---|---|
date_and_time | 0 |
number_of_ed_beds | 0 |
number_of_ip_beds | 0 |
number_of_ed_pts | 0 |
number_of_ed_pts_waiting_ip_bed | 0 |
number_of_critical_care_pts_display | 0 |
door_to_bed_time_for_last_ed_patient | 0 |
longest_admit_time_waiting_in_ed | 0 |
Likewise, the data has no duplicate rows.
## Check duplicate observations
waiting |>
janitor::get_dupes()
## # A tibble: 0 × 9
## # ℹ 9 variables: date_and_time <dttm>, number_of_ed_beds <dbl>,
## # number_of_ip_beds <dbl>, number_of_ed_pts <dbl>,
## # number_of_ed_pts_waiting_ip_bed <dbl>,
## # number_of_critical_care_pts_display <dbl>,
## # door_to_bed_time_for_last_ed_patient <dbl>,
## # longest_admit_time_waiting_in_ed <dbl>, dupe_count <int>
Given that this is a relatively clean dataset, I embark on data exploration.
I start by summarising the data.
## Summary statistics for the data
waiting |>
select(-date_and_time) |> skimr::skim_without_charts() |>
select(-complete_rate, -n_missing) |>
set_names(c(
"no", "variable", "Mean", "SD", "Min",
"Q1", "Median", "Q3", "Max"
)) |>
select(-no) |> kbl(booktabs = TRUE, caption = "Data Summary") |>
kable_classic(
full_width = FALSE, latex_options = "hold_position"
)
variable | Mean | SD | Min | Q1 | Median | Q3 | Max |
---|---|---|---|---|---|---|---|
number_of_ed_beds | 72.000 | 0.00 | 72.00 | 72.00 | 72.00 | 72.000 | 72.00 |
number_of_ip_beds | 532.000 | 0.00 | 532.00 | 532.00 | 532.00 | 532.000 | 532.00 |
number_of_ed_pts | 60.579 | 19.57 | 20.00 | 43.00 | 62.00 | 76.000 | 109.00 |
number_of_ed_pts_waiting_ip_bed | 10.633 | 5.24 | 0.00 | 6.00 | 10.00 | 14.000 | 29.00 |
number_of_critical_care_pts_display | 5.443 | 2.35 | 0.00 | 3.50 | 5.50 | 7.000 | 14.50 |
door_to_bed_time_for_last_ed_patient | 0.761 | 1.16 | -0.07 | 0.03 | 0.23 | 0.883 | 6.76 |
longest_admit_time_waiting_in_ed | 5.041 | 5.24 | 0.00 | 1.63 | 3.36 | 6.562 | 30.02 |
I do a pairs plot for the variables to uncover the significant relationships between the variables.
## Pairs plotof the raw data
waiting |>
select(
-date_and_time, -number_of_ed_beds,
-number_of_ip_beds
) |>
GGally::ggpairs(title = "Variables Pairs Plots")
Pairs Plot for Raw Data
Note that the number of both inpatient and emergency department beds are constant and hence the NA value for the correlations of these variables and other variables in the dataset. Hence, I omit both variables in the correlation analysis.
Overall, there are significant correlations between waiting times and the number of patients (number_of_ed_pts), and the patients waiting in inpatient beds (number_of_ed_pts_waiting_ip_bed). The variables can be useful in explaining waiting times in emergency rooms except for the number of beds.
I remove the two constant columns from the data; number_of_ed_beds, number_of_ip_beds.
## Remove the columns with constant values
## These are; number_of_ed_beds, number_of_ip_beds.
waiting <- waiting |>
remove_constant()
Given that the flow of patients may vary by day of the week and time of day, I split the date_and_time variable into two variables, weekday (Sunday (1) to Saturday (7)) and hour of the day (in 24 hour format 00 to 23). Note that the data was collected in May 2016 hence the month and year may not be useful in the analysis.
Given that I also run Naive Bayes and Support Vector Machines (SVM) models, I create a new categorical variable from the wait times. Specifically, I segment the target variable (longest_admit_time_waiting_in_ed) into 9 categories. Examples of these categories are patients that wait for 0-3.34 minutes, 3.34 to 6.67 minutes, and so on. I have used my own judgement to select these segments.
## Creating new variables, day of week, hour of day, and wait_cat
## wait_cat bins the waiting times into 9 categories
waiting <- waiting |>
mutate(
week_day = wday(date_and_time, label = FALSE),
hour_day = hour(date_and_time)
) |>
select(-date_and_time) |>
mutate(
hour_day = factor(hour_day),
week_day = factor(week_day)
) |>
mutate(wait_cat = cut(longest_admit_time_waiting_in_ed,
breaks = 9
))
Figure 1 below shows the distribution of weight times. We see that
the variable is skewed with much of the wait time less than 10 minutes.
However, there are outliers of upto 30 minutes which could mean life and
death for patients. Next, we see the distribution of patients based on
the waits times for the categorical variable wait_cat
(Figure 2).
## Distribution of patient wait times
waiting |> ggplot(aes(x = longest_admit_time_waiting_in_ed)) +
geom_histogram(color = "black", fill = "gray80") +
geom_density(aes(y = ..count..)) +
labs(x = "Wait Time", y = "Count",
title = "Distribution of Wait Time")
Distribution of Patient Wait Times
## Waiting time by categories
waiting |> count(wait_cat) |>
ggplot(mapping = aes(x = wait_cat, y = n)) + geom_col() +
labs(x = "length of Wait", y = "Count",
title = "Waiting Times in Emergency Departments") + coord_flip()
Waiting Times in Emergency Departments
Again the skewness in the histogram repeats, with most waits between zeo and 4 minutes. However, there are outliers of upto thirty minutes which are a cause for concern.
Table 4 below shows the distribution of cases by week day. The table shows that most cases occur on Sundays, Mondays, and Tuesdays with 120 admissions each.
## Patients by day of the week
waiting |>
count(week_day) |>
kbl(booktabs = TRUE, caption = "Patients by Day of Week") |>
kable_classic(
full_width = FALSE,
latex_options = "hold_position"
)
week_day | n |
---|---|
1 | 120 |
2 | 120 |
3 | 120 |
4 | 96 |
5 | 96 |
6 | 96 |
7 | 96 |
For hour of day, there are a uniform number of admissions for this data (Table 5).
## Patients by hour of day
waiting |>
count(hour_day) |>
kbl(booktabs = TRUE, caption = "Patients by Hour of Day") |>
kable_classic(
full_width = FALSE,
latex_options = "hold_position"
)
hour_day | n |
---|---|
0 | 31 |
1 | 31 |
2 | 31 |
3 | 31 |
4 | 31 |
5 | 31 |
6 | 31 |
7 | 31 |
8 | 31 |
9 | 31 |
10 | 31 |
11 | 31 |
12 | 31 |
13 | 31 |
14 | 31 |
15 | 31 |
16 | 31 |
17 | 31 |
18 | 31 |
19 | 31 |
20 | 31 |
21 | 31 |
22 | 31 |
23 | 31 |
In this section, I run the models in the following order.
Linear Regression Model.
Random forest Model.
Naive Bayes Model.
Support Vector Machines Model.
The regression analysis [@]Muller2016 shows the factors that are important in determining waiting times for patients.
## Run the regression model
lm_model <- lm(longest_admit_time_waiting_in_ed ~ . - wait_cat, data = waiting)
## Generate regression model output for either HTML or PDF output
if (knitr::is_html_output()) {
stargazer::stargazer(lm_model,
type = "html",
title = "Regression Output"
)
} else {
stargazer::stargazer(lm_model,
type = "latex",
title = "Regression Output",
font.size = "tiny"
)
}
Dependent variable: | |
longest_admit_time_waiting_in_ed | |
number_of_ed_pts | 0.026 |
(0.020) | |
number_of_ed_pts_waiting_ip_bed | 0.109** |
(0.043) | |
number_of_critical_care_pts_display | -0.030 |
(0.075) | |
door_to_bed_time_for_last_ed_patient | 0.207 |
(0.156) | |
week_day2 | -0.451 |
(0.588) | |
week_day3 | 0.896 |
(0.632) | |
week_day4 | 0.493 |
(0.619) | |
week_day5 | 2.140*** |
(0.641) | |
week_day6 | 0.812 |
(0.626) | |
week_day7 | -0.448 |
(0.606) | |
hour_day1 | -0.377 |
(1.120) | |
hour_day2 | -0.010 |
(1.140) | |
hour_day3 | -0.421 |
(1.170) | |
hour_day4 | -0.593 |
(1.200) | |
hour_day5 | -0.037 |
(1.220) | |
hour_day6 | -0.249 |
(1.230) | |
hour_day7 | -0.503 |
(1.230) | |
hour_day8 | 0.647 |
(1.190) | |
hour_day9 | -0.642 |
(1.140) | |
hour_day10 | 1.550 |
(1.130) | |
hour_day11 | 3.750*** |
(1.120) | |
hour_day12 | 4.740*** |
(1.130) | |
hour_day13 | 5.110*** |
(1.130) | |
hour_day14 | 6.430*** |
(1.140) | |
hour_day15 | 4.820*** |
(1.140) | |
hour_day16 | 4.990*** |
(1.150) | |
hour_day17 | 3.640*** |
(1.150) | |
hour_day18 | 2.900** |
(1.150) | |
hour_day19 | 0.473 |
(1.170) | |
hour_day20 | 1.580 |
(1.170) | |
hour_day21 | -0.220 |
(1.170) | |
hour_day22 | -0.201 |
(1.140) | |
hour_day23 | -0.116 |
(1.120) | |
Constant | 0.326 |
(1.300) | |
Observations | 744 |
R2 | 0.339 |
Adjusted R2 | 0.308 |
Residual Std. Error | 4.360 (df = 710) |
F Statistic | 11.000*** (df = 33; 710) |
Note: | p<0.1; p<0.05; p<0.01 |
The critical factors include;
Number_of_ed_patients_waiting_in_ip_beds.
Weekday.
Hour of day.
Specifically, the number of patients waiting in inpatient beds has a direct relationship with waiting times. Besides, some days of the week are likely to experience longer waiting times than others. For instance, compared to Sunday (day 1), Thursday(day 5) has longer waiting times. Likewise, some hours of the day see longer waiting times. As case in point, 1100 hrs has a higher waiting times compared to midnight.
Recommendation: Increase the number of staff during peak days like Thursday so as to clear the number of patients waiting in inpatient beds.
I run the random forest model (Muller and Guido 2016).
## Set random seed for reproducibility
set.seed(123)
## Run the random forest model
rf_model <- randomForest(longest_admit_time_waiting_in_ed ~ .
- wait_cat, data = waiting)
summary(rf_model)
## Length Class Mode
## call 3 -none- call
## type 1 -none- character
## predicted 744 -none- numeric
## mse 500 -none- numeric
## rsq 500 -none- numeric
## oob.times 744 -none- numeric
## importance 6 -none- numeric
## importanceSD 0 -none- NULL
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 11 -none- list
## coefs 0 -none- NULL
## y 744 -none- numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## terms 3 terms call
For inference and actionable insights, I get the variable importance. Figure 4 below shows that hour of the day is of primary importance in explaining waiting times for patients. The number of patients and the number of patients waiting in inpatient beds are also important.
## Variable importance plots for the random forest model
vip(rf_model) +
labs(title = "Variable Importance: Random Forest Model")
Variable Importance: Random Forest Model
Recommendation: Vary the number of staff by the hour of day where there are more emergency cases.
To run Naive Bayes model, I have divided the outcome variable into
categories to create a categorical variable (wait_cat). The variable has
nine categories. For instance, how many times did admissions delay for
between zero and three minutes, and so on. This is the
wait_cat
variable (see section 3.4 on feature
engineering).
## Create a dataset for Naive Bayes and SVM Models
final_data <- waiting |>
select(-longest_admit_time_waiting_in_ed) |>
data.frame()
I now run the Naive Bayes model and generate the variable importance plots (Edureka 2019). Figure 5 shows the variable imprortance by category. For instance, the top left plot shows the factors that are imprtant for patients that wait for between 20 and 23.3 minutes. In that case, the number of patients is critical, followed by the door to bed time for the last emergency department patient. For the other categories, we use the same interpretation.
Recommendation: The implication is that in days and hours where there is projected to be more patients, we should have additional staff and portable emergency wards and beds.
## Create a hyperparameter tuning grid
Grid <- data.frame(usekernel = FALSE, laplace = 0, adjust = 1)
## Run the Naive Bayes model
mdl <- train(wait_cat ~ .,data = final_data,
method = "naive_bayes", trControl = trainControl(method = "none"),
tuneGrid = Grid
)
## Plot the variable importance
plot(varImp(mdl))
Variable Importance: Naive Bayes
We run the SVM model and generate variable importance plots (DataCamp n.d.). As in the Naive Bayes case, for patients that wait between 20 and 23.3 minutes, the number of patients is critical, followed by the door to bed time for the last emergency department patient.
# Create cross validation folds
train_control <- trainControl(
method = "repeatedcv", number = 10, repeats = 3)
# Fit the model
svm_model <- train(wait_cat ~ ., data = final_data,
method = "svmLinear", trControl = train_control,
preProcess = c("center", "scale"), tuneGrid = expand.grid(C = seq(0, 2,
length = 20
)))
# View the model
plot(varImp(svm_model))
Variable Importance: Random Forest Model
In this analysis, I have used emergency department data from a fictitious hospital to generate insights to reduce waiting times. Following my analysis, I come up with the following insights.
The critical factor in wait times are the number of patients.
The number of patients vary by number of day or day of week.
My recommendations are as follows;
Vary the number of staff depending on days of week that have more emergency cases.
Vary staff by hour of day, with more staff during hours with more emergency cases like 1100 hrs, and less during times when there are fewer cases like around midnight.
Given that the hospital has fixed number of beds, there should be arrangements to create portable, temporary emergency wards and beds to cater for the upsurge in emergency cases duting particular days of week and hours of day.
I have utilised the following packages in R for the analysis.
sessionInfo()
## R version 4.2.3 (2023-03-15)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.6 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/atlas/libblas.so.3.10.3
## LAPACK: /usr/lib/x86_64-linux-gnu/atlas/liblapack.so.3.10.3
##
## locale:
## [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
## [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
## [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] kernlab_0.9-32 caret_6.0-94 lattice_0.20-45
## [4] e1071_1.7-13 naivebayes_0.9.7 stargazer_5.2.3
## [7] vip_0.3.2 randomForest_4.7-1.1 ggthemes_4.2.4
## [10] corrplot_0.92 GGally_2.1.2 conflicted_1.2.0
## [13] readxl_1.4.2 kableExtra_1.3.4 skimr_2.1.5
## [16] janitor_2.2.0 lubridate_1.9.2 forcats_1.0.0
## [19] stringr_1.5.0 dplyr_1.1.1 purrr_1.0.1
## [22] readr_2.1.4 tidyr_1.3.0 tibble_3.2.1
## [25] ggplot2_3.4.2 tidyverse_2.0.0 pacman_0.5.1
##
## loaded via a namespace (and not attached):
## [1] colorspace_2.1-0 class_7.3-21 snakecase_0.11.0
## [4] base64enc_0.1-3 rstudioapi_0.14 proxy_0.4-27
## [7] listenv_0.9.0 farver_2.1.1 prodlim_2023.03.31
## [10] fansi_1.0.4 xml2_1.3.3 codetools_0.2-19
## [13] splines_4.2.3 cachem_1.0.7 knitr_1.42
## [16] jsonlite_1.8.4 pROC_1.18.0 compiler_4.2.3
## [19] httr_1.4.5 Matrix_1.5-3 fastmap_1.1.1
## [22] cli_3.6.1 htmltools_0.5.5 tools_4.2.3
## [25] gtable_0.3.3 glue_1.6.2 reshape2_1.4.4
## [28] Rcpp_1.0.10 cellranger_1.1.0 jquerylib_0.1.4
## [31] vctrs_0.6.1 svglite_2.1.1 nlme_3.1-162
## [34] iterators_1.0.14 timeDate_4022.108 gower_1.0.1
## [37] xfun_0.38 globals_0.16.2 rvest_1.0.3
## [40] timechange_0.2.0 lifecycle_1.0.3 future_1.32.0
## [43] MASS_7.3-58.2 scales_1.2.1 ipred_0.9-14
## [46] hms_1.1.3 parallel_4.2.3 RColorBrewer_1.1-3
## [49] yaml_2.3.7 memoise_2.0.1 gridExtra_2.3
## [52] sass_0.4.5 rpart_4.1.19 reshape_0.8.9
## [55] stringi_1.7.12 highr_0.10 foreach_1.5.2
## [58] hardhat_1.3.0 lava_1.7.2.1 repr_1.1.6
## [61] rlang_1.1.0 pkgconfig_2.0.3 systemfonts_1.0.4
## [64] evaluate_0.20 labeling_0.4.2 recipes_1.0.5
## [67] tidyselect_1.2.0 parallelly_1.35.0 plyr_1.8.8
## [70] magrittr_2.0.3 R6_2.5.1 generics_0.1.3
## [73] pillar_1.9.0 withr_2.5.0 survival_3.5-3
## [76] nnet_7.3-18 future.apply_1.10.0 utf8_1.2.3
## [79] tzdb_0.3.0 rmarkdown_2.21 grid_4.2.3
## [82] data.table_1.14.8 ModelMetrics_1.2.2.2 digest_0.6.31
## [85] webshot_0.5.4 stats4_4.2.3 munsell_0.5.0
## [88] viridisLite_0.4.1 bslib_0.4.2