This is dummy data.
Let’s say we’re looking at the total deaths by quarter at a large hospital.
The question we’re interested in is whether there are meaningful differences across quarters. For instance, anecdotal evidence indicates that deaths often rise in the 4th quarter of the year.
We also want to set standards for what to expect in future quarters. If there are no longer-term trends over time, then the confidence interval for Q1 that we establish from the historical data can be used to indicate whether the next Q1 has a higher- or lower-than-expected number of deaths.
Here’s what the data looks like:
# paste data using {datapasta} add-in

# packages used in the code below
library(tidyverse)   # dplyr, tidyr, purrr, ggplot2, tibble
library(DT)          # datatable()
library(Hmisc)       # smean.cl.normal(), smean.cl.boot()
library(broom)       # tidy(), glance()
library(knitr)       # kable()
library(kableExtra)  # kable_styling()
library(glmnet)      # glmnet(), cv.glmnet()

# Data ----------
df1.test_discharges <-
tibble::tribble(
~fyear, ~quarter, ~is_q4, ~site, ~total_discharges, ~alos_days, ~total_deaths,
"fy2015", "Q1", "no", "VGH", 7193, 9.5, 282,
"fy2015", "Q2", "no", "VGH", 7064, 9.7, 301,
"fy2015", "Q3", "no", "VGH", 7365, 9.8, 300,
"fy2015", "Q4", "yes", "VGH", 7190, 9.8, 300,
"fy2016", "Q1", "no", "VGH", 7246, 9.6, 319,
"fy2016", "Q2", "no", "VGH", 6897, 10.2, 268,
"fy2016", "Q3", "no", "VGH", 7241, 10, 292,
"fy2016", "Q4", "yes", "VGH", 7011, 10.2, 319,
"fy2017", "Q1", "no", "VGH", 6898, 10.2, 283,
"fy2017", "Q2", "no", "VGH", 6756, 10.2, 283,
"fy2017", "Q3", "no", "VGH", 6739, 10.1, 300,
"fy2017", "Q4", "yes", "VGH", 6931, 10.3, 330,
"fy2018", "Q1", "no", "VGH", 6900, 10.1, 270,
"fy2018", "Q2", "no", "VGH", 6784, 10, 303,
"fy2018", "Q3", "no", "VGH", 6943, 9.9, 318,
"fy2018", "Q4", "yes", "VGH", 7160, 9.8, 326,
"fy2019", "Q1", "no", "VGH", 7094, 9.8, 312,
"fy2019", "Q2", "no", "VGH", 6980, 9.8, 312,
"fy2019", "Q3", "no", "VGH", 7132, 9.9, 338,
"fy2019", "Q4", "yes", "VGH", 7149, 10.1, 335
) %>%
mutate_if(is.character, as.factor)
# str(df1.test_discharges)
# summary(df1.test_discharges)
df1.test_discharges %>%
datatable(extensions = 'Buttons',
options = list(dom = 'Bfrtip',
buttons = c('excel', "csv")))
# df1.test_discharges %>%
# ggplot(aes(x = total_deaths)) +
# geom_density()
#
# df1.test_discharges %>%
# ggplot(aes(x = alos_days)) +
# geom_density()
#
# df1.test_discharges %>%
# ggplot(aes(x = total_discharges)) +
# geom_density()
# Models ------
m1.deaths <- lm(total_deaths ~ quarter,
data = df1.test_discharges)
summary(m1.deaths)
##
## Call:
## lm(formula = total_deaths ~ quarter, data = df1.test_discharges)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.40 -10.60 0.50 10.45 28.40
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 293.200 8.023 36.546 <2e-16 ***
## quarterQ2 0.200 11.346 0.018 0.9862
## quarterQ3 16.400 11.346 1.445 0.1676
## quarterQ4 28.800 11.346 2.538 0.0219 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.94 on 16 degrees of freedom
## Multiple R-squared: 0.3615, Adjusted R-squared: 0.2418
## F-statistic: 3.02 on 3 and 16 DF, p-value: 0.06047
Note that the overall F-statistic is not significant at the usual 0.05 level (p = 0.06). This model may not be useful.
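If we want that overall test in a tidy form rather than reading it off the printout, one option (assuming {broom} is loaded, as above) is glance():
# overall model fit statistics as a one-row data frame
glance(m1.deaths) %>%
  select(r.squared, adj.r.squared, statistic, p.value)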
# m1.deaths %>%
# tidy() %>%
# datatable(extensions = 'Buttons',
# options = list(dom = 'Bfrtip',
# buttons = c('excel', "csv")))
df2.nested <-
  df1.test_discharges %>%
  group_by(quarter) %>%
  nest() %>%
  mutate(conf = map(data,
                    function(df){
                      # smean.cl.normal() returns c(Mean, Lower, Upper)
                      ci <- smean.cl.normal(df$total_deaths)

                      data.frame(lwr_death  = ci[2],
                                 mean_death = ci[1],
                                 upr_death  = ci[3])
                    })) # %>% View("confint")
df2.nested %>%
unnest(conf) %>%
select(quarter,
lwr_death:upr_death) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped",
"condensed",
"responsive"))
| quarter | lwr_death | mean_death | upr_death |
|---|---|---|---|
| Q1 | 266.9571 | 293.2 | 319.4429 |
| Q2 | 271.4573 | 293.4 | 315.3427 |
| Q3 | 286.6108 | 309.6 | 332.5892 |
| Q4 | 305.0887 | 322.0 | 338.9113 |
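This table is what the "standard-setting" idea from the introduction would use in practice. As a minimal sketch (the value 340 below is made up purely for illustration), a new Q1 count could be flagged against the historical Q1 interval like this:
new_q1_deaths <- 340  # hypothetical value for a future Q1

df2.nested %>%
  unnest(conf) %>%
  filter(quarter == "Q1") %>%
  mutate(flag = case_when(new_q1_deaths > upr_death ~ "higher than expected",
                          new_q1_deaths < lwr_death ~ "lower than expected",
                          TRUE                      ~ "within expected range"))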
df1.test_discharges %>%
ggplot(aes(x = quarter,
y = total_deaths)) +
stat_summary(fun = mean,
geom = "point") +
stat_summary(fun.data = mean_cl_normal,
geom = "errorbar") +
stat_summary(fun.data = mean_cl_boot,
geom = "errorbar",
col = "red") +
labs(title = "Dummy data - Estimates of average total deaths per quarter",
subtitle = "Black - CI based on t-dist \nRed - CI based on bootstrap \n\nModel: deaths ~ quarter") +
theme_light() +
theme(panel.grid.minor = element_line(colour = "grey95"),
panel.grid.major = element_line(colour = "grey95"))
Q. Are shorter CIs necessarily better?
Ans. No. See Hesterberg, 2015:
“…the common combination of nonparametric bootstrapping and bootstrap percentile CIs is less accurate than using t-intervals for small samples, though more accurate for larger samples.”
“(In small samples) bootstrap distributions tend to be too narrow on average …”
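We can see this tendency in our own data. Here is a minimal comparison for Q1 (assuming {Hmisc} is loaded; smean.cl.boot() is random, so we set a seed). With only five observations per quarter, the percentile-bootstrap interval will typically come out narrower than the t-interval:
q1_deaths <-
  df1.test_discharges %>%
  filter(quarter == "Q1") %>%
  pull(total_deaths)

set.seed(2020)
smean.cl.normal(q1_deaths)  # t-based interval
smean.cl.boot(q1_deaths)    # percentile bootstrap interval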
m2.deaths <- lm(total_deaths ~ is_q4,
data = df1.test_discharges)
summary(m2.deaths)
##
## Call:
## lm(formula = total_deaths ~ is_q4, data = df1.test_discharges)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.733 -15.733 1.767 13.067 39.267
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 298.733 4.728 63.187 <2e-16 ***
## is_q4yes 23.267 9.455 2.461 0.0242 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.31 on 18 degrees of freedom
## Multiple R-squared: 0.2517, Adjusted R-squared: 0.2101
## F-statistic: 6.055 on 1 and 18 DF, p-value: 0.0242
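As a quick cross-check, confint() gives the 95% intervals for the coefficients of this model; the interval for is_q4yes should exclude zero, matching the significant t-test above:
# 95% confidence intervals for the m2.deaths coefficients
confint(m2.deaths)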
# m2.deaths %>%
# tidy() %>%
# datatable(extensions = 'Buttons',
# options = list(dom = 'Bfrtip',
# buttons = c('excel', "csv")))
df2.nested <-
  df1.test_discharges %>%
  group_by(is_q4) %>%
  nest() %>%
  mutate(conf = map(data,
                    function(df){
                      # smean.cl.normal() returns c(Mean, Lower, Upper)
                      ci <- smean.cl.normal(df$total_deaths)

                      data.frame(lwr_death  = ci[2],
                                 mean_death = ci[1],
                                 upr_death  = ci[3])
                    })) # %>% View("confint")
df2.nested %>%
unnest(conf) %>%
select(is_q4,
lwr_death:upr_death) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped",
"condensed",
"responsive"))
| is_q4 | lwr_death | mean_death | upr_death |
|---|---|---|---|
| no | 287.9656 | 298.7333 | 309.5010 |
| yes | 305.0887 | 322.0000 | 338.9113 |
df1.test_discharges %>%
ggplot(aes(x = is_q4,
y = total_deaths)) +
stat_summary(fun = mean,
geom = "point") +
stat_summary(fun.data = mean_cl_normal,
geom = "errorbar") +
stat_summary(fun.data = mean_cl_boot,
geom = "errorbar",
col = "red") +
labs(title = "Dummy data - Estimates of average total deaths, Q4 vs other quarters",
subtitle = "Black - CI based on t-dist \nRed - CI based on bootstrap \n\nModel: deaths ~ is_q4") +
theme_light() +
theme(panel.grid.minor = element_line(colour = "grey95"),
panel.grid.major = element_line(colour = "grey95"))
Previous models show that not every level of the quarter variable is likely to have a significant coefficient. We could manually try to find which levels to include (as we did in model m2.deaths), but that doesn’t scale - what if we need to do this for dozens of different patient groups?
Instead, let’s try automated variable selection with LASSO.
?glmnet: “Fit a generalized linear model via penalized maximum likelihood… Fits linear, logistic and multinomial, poisson, and Cox regression models.”
?cv.glmnet: “Does k-fold cross-validation for glmnet, produces a plot, and returns a value for lambda (and gamma if relax=TRUE)”
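For reference, here is the standard form of the problem being solved in our case (Gaussian response, and glmnet’s default alpha = 1, i.e. the pure lasso penalty):

$$\min_{\beta_0,\ \beta} \ \frac{1}{2n}\sum_{i=1}^{n}\left(y_i - \beta_0 - x_i^\top \beta\right)^2 \;+\; \lambda \sum_{j} \lvert \beta_j \rvert$$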
We’ll start with cross-validation to select a value for lambda. (In an actual analysis, we would do this on a training dataset, not the full dataset.)
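(As an aside, a train/test split for that purpose might look roughly like the commented-out sketch below; it is hypothetical and not used anywhere else in this post.)
# set.seed(123)
# train_rows <- sample(seq_len(nrow(df1.test_discharges)),
#                      size = 0.8 * nrow(df1.test_discharges))
# df_train <- df1.test_discharges[train_rows, ]
# df_test  <- df1.test_discharges[-train_rows, ]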
# predictors:
x <- model.matrix(total_deaths ~ .,
data = df1.test_discharges %>%
select(quarter,
total_deaths))
# drop the first column of all 1s (glmnet will recreate this)
x <- x[, -1]
y <- df1.test_discharges$total_deaths
# Find the best lambda using cross-validation
set.seed(123)
cv.lasso <- cv.glmnet(x, y)
## Warning: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations
## per fold
cv.lasso
##
## Call: cv.glmnet(x = x, y = y)
##
## Measure: Mean-Squared Error
##
## Lambda Measure SE Nonzero
## min 0.678 348.3 75.27 2
## 1se 8.364 418.3 102.50 1
plot(cv.lasso)
The plot shows the mean cross-validated error curve, as well as a one-standard-deviation band. The left vertical line corresponds to the minimum error, while the right vertical line marks the largest value of lambda such that the error is within one standard error of the minimum (the so-called "one-standard-error" rule).
If we use cv.lasso$lambda.min, there will be 2 nonzero coefficients. If we use cv.lasso$lambda.1se, there will be only 1 nonzero coefficient.
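We can confirm those coefficient counts directly from the cross-validation object, since coef() on a cv.glmnet fit accepts s = "lambda.min" or s = "lambda.1se":
# nonzero coefficients at each candidate lambda
coef(cv.lasso, s = "lambda.min")
coef(cv.lasso, s = "lambda.1se")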
Now we fit final models with the values of lambda that we got from cross-validation.
# Fit the final model with 2 coeffs:
m3.deaths_lasso <- glmnet(x, y, lambda = cv.lasso$lambda.min)
coef(m3.deaths_lasso)
## 4 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 294.47512
## quarterQ2 .
## quarterQ3 13.94975
## quarterQ4 26.34977
tidy(m3.deaths_lasso) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped",
"condensed",
"responsive"))
| term | step | estimate | lambda | dev.ratio |
|---|---|---|---|---|
| (Intercept) | 1 | 294.47512 | 0.6784497 | 0.3580967 |
| quarterQ3 | 1 | 13.94975 | 0.6784497 | 0.3580967 |
| quarterQ4 | 1 | 26.34977 | 0.6784497 | 0.3580967 |
Note how similar these results are to those of the first model we fit, m1.deaths.
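A quick side-by-side view (a small sketch using the objects above) makes this concrete; as expected, the lasso estimates for Q3 and Q4 are shrunk toward zero relative to OLS, and Q2 is dropped entirely:
# compare OLS and lasso coefficient estimates
data.frame(ols   = coef(m1.deaths),
           lasso = as.matrix(coef(m3.deaths_lasso))[, 1])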
Finally, just to test, let’s fit the model with only 1 coefficient:
# Fit the final model with 1 coefficient:
m4.deaths_lasso <- glmnet(x, y, lambda = cv.lasso$lambda.1se)
coef(m4.deaths_lasso)
## 4 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 303.562432
## quarterQ2 .
## quarterQ3 .
## quarterQ4 3.950271
tidy(m4.deaths_lasso) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped",
"condensed",
"responsive"))
| term | step | estimate | lambda | dev.ratio |
|---|---|---|---|---|
| (Intercept) | 1 | 303.562432 | 8.364245 | 0.0782156 |
| quarterQ4 | 1 | 3.950271 | 8.364245 | 0.0782156 |
In this case, the results are quite different from the previous models. It might be best to stick with lambda.min (refer to the plot above).
Overall, using LASSO for this purpose seems very promising.
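Finally, a sketch of how this could scale to the "dozens of patient groups" scenario mentioned earlier. Our dummy data has only one site, so the site grouping below is just a stand-in for whatever grouping variable an actual analysis would use (and the helper function is hypothetical):
# fit a lasso of deaths on quarter within one group's data
fit_lasso_by_group <- function(df) {
  x <- model.matrix(total_deaths ~ quarter, data = df)[, -1]
  y <- df$total_deaths

  cv <- cv.glmnet(x, y)
  glmnet(x, y, lambda = cv$lambda.min)
}

set.seed(123)

df1.test_discharges %>%
  group_by(site) %>%
  nest() %>%
  mutate(lasso_fit   = map(data, fit_lasso_by_group),
         lasso_coefs = map(lasso_fit, tidy))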