Machine Learning in Medecine project The MLM project has been initialized in 2016 and aims to:
1.Encourage using Machine Learning techniques in medical research in Vietnam
2.Promote the use of R statistical programming language, an open source and leading tool for practicing data science.
The hospital costs for patients with heart infarction can depend on various factors such as the age (age), the number of complications (complications), the duration of stay (ichours)…9000 patients hospitalized for this heart disease with 3 descriptive variables age, complications and ichours constitutes of a nice dataset.
The main questions for this chapter are:
The original dataset was provided in Chapter 8, Machine Learning in Medicine-Cookbook Two (T. J. Cleophas and A. H. Zwinderman, SpringerBriefs in Statistics 2014). You can get the original data in SPSS format (chap8simula- tion1.sav) from their website: extras.springer.com.
Data analysis was performed in R statistical programming language. In this chapter, we use 2 approaches to study the questions: Monte Carlo simulation and Bootstrap.
The procedure consists of 4 steps
Data exploration Monte Carlo simulations Bootstrap Monte Carlo simulations for 2 scenarii
First, we load the necessary libraries and load data
library(data.table)
library(ggplot2)
library(dplyr)
library(foreign)
library(tidyr)
library(triangle)
rm(list=ls())
df <- read.spss("..data/chap8simulation1.sav",to.data.frame = T)
At firts, let’s look at the distribution of the data
df %>% gather() %>% ggplot(aes(x = value)) + geom_histogram(aes(fill=key))+facet_wrap(~key,ncol=2,scales ="free")+theme_bw()
Now we want to know how the boxplots of each variables look likes according to their age
df3 <- df %>% mutate(age = round(age)) %>% gather(.,key,value,complications:cost)
df3 %>%
ggplot(aes(x=factor(age), y= value,fill = key))+geom_boxplot()+facet_wrap(~key,ncol=5,scales = "free")+theme_bw()
The figure above is complete, however kind of difficult to see. We separate the variables by group of age.
df3 <- df3 %>%
mutate(tranche_age = cut(age, breaks = c(33,40,50,60,70,85)),
tranche_age = plyr::mapvalues(tranche_age,
from=c("(33,40]", "(40,50]", "(50,60]",
"(60,70]", "(70,85]"),
to=c("34-40", "40-50", "50-60",
"60-70", "70+")))
df3 %>%
ggplot(aes(x=factor(tranche_age), y= value,fill = key))+geom_boxplot()+facet_wrap(~key,ncol=5,scales = "free")+theme_bw()
We have an idea about the distribution forms of the variables, more outliers and increasing cost when age increases
At first, we train a linear regression on the whole dataset.
fit_lm <- lm(cost~., data = df)
res <- as.data.frame(t(fit_lm$coefficients))
summary(fit_lm)
##
## Call:
## lm(formula = cost ~ ., data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3842 -1592 -795 806 55238
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -28570.977 254.044 -112.47 <2e-16 ***
## age 202.403 2.767 73.14 <2e-16 ***
## complications 4022.405 21.661 185.70 <2e-16 ***
## ichours -111.241 2.124 -52.37 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2944 on 8996 degrees of freedom
## Multiple R-squared: 0.8307, Adjusted R-squared: 0.8306
## F-statistic: 1.471e+04 on 3 and 8996 DF, p-value: < 2.2e-16
The linear model fit quite goods on the dataset and all variables are strongly statistically significant.
Now, we will use the Monte Carlo simulations to discover what percentage of patients cost for hospital over a XXX euros. We will run 1000 iterations, and look at the best results. We also suppose that the distribution of the variables: “age” is triangle, “complication” and “ichours” are log normal.
n <- 2000
MCres <- list()
MCres[["res"]] <- rep(NA,n)
MCres[["bestsimulation"]] <- list()
for (i in 1:n) {
age <- rtriangle(9000, a=33.54, b=84.73, c=73.19)
complications <- rlnorm(n = 9000,
meanlog = meanlog(df$complications), sdlog=sdlog(df$complications))
ichours <- rlnorm(n = 9000,meanlog = meanlog(df$ichours),sdlog = sdlog(df$ichours))
newdf <- data.frame(age = age, complications = complications, ichours = ichours)
newdf <- mutate (newdf,
cost = age*res$age + complications*res$complications + ichours*res$ichours + res$Intercept)
moyenne <- (mean(newdf$cost) - mean(df$cost))*100/mean(df$cost)
# Insert the score into ParamGrid
MCres$res[i] <- moyenne
MCres$bestsimulation[[i]] <- newdf
}
Now we take the best simulation regarding to the real data and look at its 5th and 95th percentile
ind <- which(MCres$res == min(MCres$res))
newdf <- MCres$bestsimulation[[ind]]
newdf1 <- rbindlist(MCres$bestsimulation)
hist(newdf1$cost,breaks = 100)
quantile(newdf1$cost,0.05)
## 5%
## 684.9905
quantile(newdf1$cost,0.95)
## 95%
## 21961.11
So, 5% of the patients costs for the hospital less than 700 euros, and other 5% of the patients over 21000 euros
Bootstrap is a special case of the Monte Carlo simulation. The differences between the bootstrap and mote carlo methods are:
First, we need to create a bootstrap resamples. The function of bootstrap resampling is still not optimal. To limit the computational time, we only do 100 bootstrap resamples
sampler <- function(dat, clustervar, replace = TRUE, reps = 1) {
cid <- unique(dat[, clustervar[1]])
ncid <- length(cid)
recid <- sample(cid, size = ncid * reps, replace = TRUE)
if (replace) {
rid <- lapply(seq_along(recid), function(i) {
cbind(NewID = i, RowID = sample(which(dat[, clustervar] == recid[i]),
size = length(which(dat[, clustervar] == recid[i])), replace = TRUE))
})
} else {
rid <- lapply(seq_along(recid), function(i) {
cbind(NewID = i, RowID = which(dat[, clustervar] == recid[i]))
})
}
dat <- as.data.frame(do.call(rbind, rid))
dat$Replicate <- factor(cut(dat$NewID, breaks = c(1, ncid * 1:reps), include.lowest = TRUE,
labels = FALSE))
dat$NewID <- factor(dat$NewID)
return(dat)
}
n <- 100
set.seed(20)
tmp <- sampler(df, "cost", reps = n)
bigdata <- cbind(tmp, df[tmp$RowID, 1:3])
Now we do exactly the same procedure on these bootstraps samples and look at the results
hist(newdf1$cost,breaks = 100)
quantile(newdf1$cost,0.05)
## 5%
## 684.9905
quantile(newdf1$cost,0.95)
## 95%
## 21961.11
The bootstrap (resampling on real data) gave the same observation as the Monte Carlo simulation (on the simulated data). The only difference is the number of iteration, 100 for bootstrap and 1000 for MC simulations
Now let’s look at how change the cost with 2 scenarios
hist(newdf1$cost,breaks = 100)
quantile(newdf1$cost,0.05)
## 5%
## -7008.982
quantile(newdf1$cost,0.95)
## 95%
## 14389.89
So, 95% of the patients now cost less than 15000 euros
hist(newdf1$cost,breaks = 100)
quantile(newdf1$cost,0.95)
## 95%
## 24037.91
Now, the cost increases. The last 5% of the patients cost over 24000 euros
The MC simulations and Bootstrap are two powerful techniques to reinforce our data. They are especially efficient when we do not have a large dataset. In this case, we do not state any difference between both methods. However, the MC simulation need a hypothesis on the distribution form of the data. The distribution forms of the data here is quite evident.