library(tidyverse)
library(ggplot2)
library(prettyR)
BIOS 507 – Homework 4
Background
Problem 1
A national healthcare system specializes in small hospitals across the U.S. Each hospital serves a “service area” characterized by the estimated number of residents living within a 30-minute drive. For budgeting, the healthcare system wants to better understand the yearly demand for specific items so that they can stockpile them efficiently. They collected data from 18 hospitals last year on:
pop_thousands: the service-area population in thousands. These are small hospitals, so the populations they serve range from about 5,000 to around 25,000.
RSK: the number of respiratory support kits (RSKs) dispensed by the hospital last year. Note that this is a made-up item for the purposes of this homework problem.
The healthcare system has several specific questions for you: (1) Do service areas with higher populations need more RSKs? (2) Most of their hospitals service around 15 thousand people (pop_thousands = 15). How many RSKs should we expect these hospitals to use, in general? (3) They are planning to open up a new hospital next year in an area with 7,000 persons (pop_thousands = 7) living in the service area. Can you suggest how many units they stockpile and provide a reasonable range of how many units they might expect to need at this new hospital?
Please write a short report (no page requirements, but does not need to be long) answering these questions aimed at a non-statistical audience. Feel free to include any plots or tables that help make your case.
#Read in the data
rsk = read.csv("C:/Users/esincl3/OneDrive - Emory/Documents/PhD Spring 2026/BIOS 507/RSKs.csv")
#Extract descriptive statistics for both the dependent (RSK) and independent (population) variables to understand min, max, mean, median, and range, and assess any potential outliers.
#Pop_thousands
mean(rsk$pop_thousands)
[1] 846.5
median(rsk$pop_thousands)
[1] 13.5
range(rsk$pop_thousands) #Based on this, it looks like there's an outlier at the max extreme
[1]     5 15000
#Plot a histogram to see distribution
hist(rsk$pop_thousands)
#15000 is likely a data entry error and should probably be 15 given the variable is population in thousands. Additionally, the data description noted all hospitals are small, serving 5,000 to 25,000 people, and 15000 thousand would be 15,000,000.
#Change 15000 to 15
rsk <- rsk %>%
  mutate(pop_thousands = replace(pop_thousands, hospital_id == 11, 15))
#Rerun descriptive stats and histogram
mean(rsk$pop_thousands)
[1] 14
median(rsk$pop_thousands)
[1] 13.5
range(rsk$pop_thousands)
[1]  5 25
#Plot a histogram to see distribution
hist(rsk$pop_thousands)
#This all looks much more reasonable now.
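The correction above targets hospital_id 11 directly. As a hedged alternative, the sketch below flags any value outside a plausible range before rescaling it. The toy data frame and the cutoffs of 1 and 100 (thousands) are assumptions chosen for illustration; they are not values from RSKs.csv.

```r
# Hypothetical toy data standing in for RSKs.csv (all values are made up)
toy <- data.frame(
  hospital_id   = 1:5,
  pop_thousands = c(5, 12, 15000, 25, 9)
)

# Assumed plausibility cutoffs, loosely based on the stated 5 to 25 range
lower_cutoff <- 1
upper_cutoff <- 100

# Flag rows whose population falls outside the plausible range
out_of_range <- toy$pop_thousands < lower_cutoff | toy$pop_thousands > upper_cutoff
toy[out_of_range, ]

# Treat flagged values as raw counts entered by mistake and convert to thousands
toy$pop_thousands[out_of_range] <- toy$pop_thousands[out_of_range] / 1000
range(toy$pop_thousands)
```

Flagging by range rather than by ID keeps the rule visible in the code, but any flagged row should still be inspected by hand before it is changed.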
#RSK
mean(rsk$RSK)
[1] 51.27778
median(rsk$RSK)
[1] 50
range(rsk$RSK)
[1] 15 94
#Plot a histogram to see distribution
hist(rsk$RSK)
#Nothing stands out as an obvious error.
#Create a scatterplot to visualize the relationship between RSK and pop_thousands. Make sure it looks plausibly linear.
ggplot(data = rsk, aes(x = pop_thousands, y = RSK)) +
  geom_point() +
  labs(
    x = "Hospital Service Population (in thousands)",
    y = "RSKs Distributed Annually",
    title = "Scatter Plot of Hospital Service Population and RSKs Distributed"
  ) +
  theme_minimal()
#The relationship looks plausibly linear, with no obvious outliers, so a linear model is reasonable.
(1) Do service areas with higher populations need more RSKs?
Fit a simple linear regression to model the relationship between the number of RSKs distributed and the hospital service population (in thousands).
model <- lm(RSK ~ pop_thousands, data = rsk)
summary(model)
Call:
lm(formula = RSK ~ pop_thousands, data = rsk)
Residuals:
     Min       1Q   Median       3Q      Max
-20.2086  -6.2683  -0.2809   5.2741  18.7725
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)     9.3658     6.5990   1.419    0.175
pop_thousands   2.9937     0.4339   6.900 3.57e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 10.94 on 16 degrees of freedom
Multiple R-squared: 0.7485, Adjusted R-squared: 0.7327
F-statistic: 47.61 on 1 and 16 DF, p-value: 3.573e-06
confint(model, level = 0.95)
                  2.5 %   97.5 %
(Intercept)   -4.623532 23.35519
pop_thousands  2.073942  3.91348
Based on the model summary, each one-thousand person increase in hospital service population is associated with an average increase of 2.99 RSKs distributed (95% CI: 2.07, 3.91). This relationship is statistically significant (p < 0.001). This indicates that hospitals with higher service populations do need more RSKs.
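As a check on the interval reported above, the slope's confidence interval can be rebuilt by hand from the printed estimate and standard error. This is a minimal sketch that copies rounded values from the summary output rather than re-reading the data, so it should agree with confint() only up to rounding.

```r
# Values copied from the summary(model) output above (rounded as printed)
slope    <- 2.9937   # estimated slope for pop_thousands
se_slope <- 0.4339   # standard error of the slope
df_resid <- 16       # residual degrees of freedom (18 hospitals - 2 coefficients)

# Two-sided 95% critical value from the t distribution
t_crit <- qt(0.975, df = df_resid)

# Confidence interval: estimate +/- critical value * standard error
ci_slope <- slope + c(-1, 1) * t_crit * se_slope
round(ci_slope, 2)
```

Because the interval excludes zero, it tells the same story as the p-value for the slope.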
(2) Most of their hospitals service around 15 thousand people (pop_thousands = 15). How many RSKs should we expect these hospitals to use, in general?
predict(model, newdata = data.frame(pop_thousands = 15), interval = "confidence", level = 0.95)
       fit      lwr      upr
1 54.27149 48.72738 59.81559
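The confidence interval for the average hospital at pop_thousands = 15 can also be reconstructed from the printed summary. The sketch below assumes the rounded coefficients, the residual standard error of 10.94, and the corrected mean population of 14 shown earlier; it recovers the sum of squares of the predictor from the identity SE(slope) = s / sqrt(Sxx), so results match predict() only approximately.

```r
# Values copied from earlier output (rounded as printed)
b0 <- 9.3658         # intercept
b1 <- 2.9937         # slope
s  <- 10.94          # residual standard error
se_slope <- 0.4339   # standard error of the slope
n     <- 18          # number of hospitals
x_bar <- 14          # mean of pop_thousands after the correction
x0    <- 15          # population (in thousands) of interest

# Recover Sxx from SE(slope) = s / sqrt(Sxx)
sxx <- (s / se_slope)^2

# Fitted mean and its standard error at x0
fit     <- b0 + b1 * x0
se_mean <- s * sqrt(1 / n + (x0 - x_bar)^2 / sxx)

# 95% confidence interval for the mean response
t_crit  <- qt(0.975, df = n - 2)
ci_mean <- fit + c(-1, 1) * t_crit * se_mean
round(c(fit = fit, lwr = ci_mean[1], upr = ci_mean[2]), 2)
```

The interval is narrow here partly because 15 is close to the mean population of 14, where the fitted line is estimated most precisely.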
(3) They are planning to open up a new hospital next year in an area with 7,000 persons (pop_thousands = 7) living in the service area. Can you suggest how many units they stockpile and provide a reasonable range of how many units they might expect to need at this new hospital?
predict(model, newdata = data.frame(pop_thousands = 7), interval = "prediction", level = 0.95)
      fit      lwr      upr
1 30.3218 5.636101 55.00751
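To see why the prediction interval for a single new hospital is so much wider than a confidence interval for the average, the sketch below rebuilds both half-widths at pop_thousands = 7. It again assumes the rounded values printed above, so it approximates the predict() output rather than reproducing it exactly. The only difference between the two standard errors is the extra "1 +" term, which accounts for hospital-to-hospital variation around the line.

```r
# Values copied from earlier output (rounded as printed)
b0 <- 9.3658         # intercept
b1 <- 2.9937         # slope
s  <- 10.94          # residual standard error
se_slope <- 0.4339   # standard error of the slope
n     <- 18          # number of hospitals
x_bar <- 14          # mean of pop_thousands after the correction
x0    <- 7           # service population of the new hospital (thousands)

sxx    <- (s / se_slope)^2            # from SE(slope) = s / sqrt(Sxx)
fit    <- b0 + b1 * x0                # predicted RSKs at x0
t_crit <- qt(0.975, df = n - 2)

# Standard error for the mean response versus a single new observation
se_mean <- s * sqrt(1 / n + (x0 - x_bar)^2 / sxx)
se_pred <- s * sqrt(1 + 1 / n + (x0 - x_bar)^2 / sxx)

# 95% prediction interval for the new hospital
pi_new <- fit + c(-1, 1) * t_crit * se_pred
round(c(fit = fit, lwr = pi_new[1], upr = pi_new[2]), 2)

# Compare the half-widths of the two intervals
round(c(mean_half_width = t_crit * se_mean,
        pred_half_width = t_crit * se_pred), 1)
```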
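Create a graph showing the fitted model and its 95% confidence band, highlighting the estimate for a hospital serving 15,000 people (part 2). The band widens as the service population gets smaller or larger, showing that we are less confident away from the average (part 3).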
#Create the highlight points for a hospital with 15,000 people for parts 2 and 3
highlight_points <- data.frame(
pop_thousands = 15,
Type = "15,000"
)
#Predict the y-values of these new points
highlight_points$RSK <- predict(model, newdata = highlight_points)
plot = ggplot(data = rsk, aes(x = pop_thousands, y = RSK)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = TRUE, color = "blue") +
  #Map 'Type' to color to generate the legend
  geom_point(data = highlight_points,
             aes(x = pop_thousands, y = RSK, color = Type),
             size = 6) +
  #Manually set the colors for our 'Type' labels
  scale_color_manual(values = c("15,000" = "red")) +
  labs(
    x = "Hospital Service Population (in thousands)",
    y = "RSKs Distributed Annually",
    title = "Hospital Service Population and RSKs",
    color = "Hospital Service Population" # Changes the title of the legend
  ) +
  theme_minimal()
Short Report Answer
Background Note: A potential data entry error was identified for the service population of Hospital #11. The service population for this hospital was recorded as 15,000. However, service population values for all other hospitals were recorded in terms of thousands (e.g., a service population of 8,000 was entered as 8). Excluding the entry of 15,000, all other values were single- and double-digit, with a range of 5-25. A value of 15,000 would be an extreme outlier and would mean Hospital #11 has a service population of 15,000,000. Information provided about this study indicates that all included hospitals have service populations between 5,000 and 25,000. Given all of this information, the service population for Hospital #11 was corrected from 15,000 to 15.
Based on the corrected data and a simple linear regression model, there is strong statistical evidence that the number of RSKs a hospital dispenses is associated with the size of its service population (p < 0.001). Hospitals with larger service populations need more RSKs. On average, each additional 1,000 people in a hospital's service population increases the need by about 2.99 RSKs. This figure is an estimate of the average increase; we can be 95% confident that the true average increase is between 2.07 and 3.91 RSKs per additional 1,000 people.
In Figure 1 below, the blue line shows the linear regression model describing the relationship between hospital service population (in thousands) and RSKs dispensed. From this model we can estimate the average number of RSKs needed, and predict the number needed by a specific hospital, based on service population. The grey area represents a 95% confidence interval for the average (the range within which we are 95% confident the true average lies). On average, a hospital with a service population of 15,000 (red dot in Figure 1) will need 54.27 RSKs each year. This value is an estimate; however, we can be 95% confident that the average need among hospitals with a service population of 15,000 is between 48.73 and 59.82 RSKs (as indicated by the grey area).
It is important to note that it is more difficult to predict the exact number of RSKs needed by one specific hospital than the average across hospitals. Using the model shown in Figure 1, we predict that a specific new hospital with a service population of 7,000 will need 30.32 RSKs. Because we are predicting for a single hospital, the range of plausible values (a 95% prediction interval) is wider than the grey confidence band shown in Figure 1. We can be 95% confident that the number of RSKs needed by this new hospital will be between 5.64 and 55.01. If we obtain more data, we can be more certain about the relationship between hospital service population and RSKs, and may be able to provide a somewhat narrower prediction interval.
Figure 1. Hospital Service Population and RSKs.