Machine Learning in Medecine project The MLM project has been initialized in 2016 and aims to:

2.Promote the use of R statistical programming language, an open source and leading tool for practicing data science.

Background

The hospital costs for patients with heart infarction can depend on various factors such as the age (age), the number of complications (complications), the duration of stay (ichours)…9000 patients hospitalized for this heart disease with 3 descriptive variables age, complications and ichours constitutes of a nice dataset.

The main questions for this chapter are:

What percentage of patients will cost the hospital over 20000 euros? Less than 10000 euros?
How these percentages will change if a) the number of complications is reduced by 2 and b) the duration of stay by 20

Material and methods

The original dataset was provided in Chapter 8, Machine Learning in Medicine-Cookbook Two (T. J. Cleophas and A. H. Zwinderman, SpringerBriefs in Statistics 2014). You can get the original data in SPSS format (chap8simula- tion1.sav) from their website: extras.springer.com.

Data analysis was performed in R statistical programming language. In this chapter, we use 2 approaches to study the questions: Monte Carlo simulation and Bootstrap.

The procedure consists of 4 steps

Data exploration Monte Carlo simulations Bootstrap Monte Carlo simulations for 2 scenarii

First, we load the necessary libraries and load data

library(data.table)
library(ggplot2)
library(dplyr)
library(foreign)
library(tidyr)
library(triangle)

rm(list=ls())
df <- read.spss("..data/chap8simulation1.sav",to.data.frame = T)

Data Exploration

At firts, let’s look at the distribution of the data

df %>% gather() %>% ggplot(aes(x = value)) + geom_histogram(aes(fill=key))+facet_wrap(~key,ncol=2,scales ="free")+theme_bw()

Now we want to know how the boxplots of each variables look likes according to their age

df3 <- df %>% mutate(age = round(age)) %>% gather(.,key,value,complications:cost)

df3 %>% 
  ggplot(aes(x=factor(age), y= value,fill = key))+geom_boxplot()+facet_wrap(~key,ncol=5,scales = "free")+theme_bw()

The figure above is complete, however kind of difficult to see. We separate the variables by group of age.

df3 <- df3 %>%
  mutate(tranche_age = cut(age, breaks = c(33,40,50,60,70,85)),
         tranche_age = plyr::mapvalues(tranche_age,
                                       from=c("(33,40]", "(40,50]", "(50,60]",
                                              "(60,70]", "(70,85]"),
                                       to=c("34-40", "40-50", "50-60",
                                            "60-70", "70+")))

df3 %>% 
  ggplot(aes(x=factor(tranche_age), y= value,fill = key))+geom_boxplot()+facet_wrap(~key,ncol=5,scales = "free")+theme_bw()

We have an idea about the distribution forms of the variables, more outliers and increasing cost when age increases

Monte Carlo simulations

At first, we train a linear regression on the whole dataset.

fit_lm <- lm(cost~., data = df)
res <- as.data.frame(t(fit_lm$coefficients))

summary(fit_lm)

## 
## Call:
## lm(formula = cost ~ ., data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -3842  -1592   -795    806  55238 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -28570.977    254.044 -112.47   <2e-16 ***
## age              202.403      2.767   73.14   <2e-16 ***
## complications   4022.405     21.661  185.70   <2e-16 ***
## ichours         -111.241      2.124  -52.37   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2944 on 8996 degrees of freedom
## Multiple R-squared:  0.8307, Adjusted R-squared:  0.8306 
## F-statistic: 1.471e+04 on 3 and 8996 DF,  p-value: < 2.2e-16

The linear model fit quite goods on the dataset and all variables are strongly statistically significant.

Now, we will use the Monte Carlo simulations to discover what percentage of patients cost for hospital over a XXX euros. We will run 1000 iterations, and look at the best results. We also suppose that the distribution of the variables: “age” is triangle, “complication” and “ichours” are log normal.

n <- 2000
MCres <- list()
MCres[["res"]] <- rep(NA,n)
MCres[["bestsimulation"]] <- list()

for (i in 1:n) {
  age <- rtriangle(9000, a=33.54, b=84.73, c=73.19)
  complications <- rlnorm(n = 9000,
  meanlog = meanlog(df$complications), sdlog=sdlog(df$complications))
  ichours <- rlnorm(n = 9000,meanlog = meanlog(df$ichours),sdlog = sdlog(df$ichours))
  
  newdf <- data.frame(age = age, complications = complications, ichours = ichours)
  newdf <- mutate (newdf,
                   cost = age*res$age + complications*res$complications + ichours*res$ichours + res$Intercept)
  
  moyenne <- (mean(newdf$cost) - mean(df$cost))*100/mean(df$cost)
  
  # Insert the score into ParamGrid
  MCres$res[i] <- moyenne
  MCres$bestsimulation[[i]] <- newdf
}

Now we take the best simulation regarding to the real data and look at its 5th and 95th percentile

ind <- which(MCres$res == min(MCres$res))
newdf <- MCres$bestsimulation[[ind]]
newdf1 <- rbindlist(MCres$bestsimulation)

hist(newdf1$cost,breaks = 100)

quantile(newdf1$cost,0.05)

##       5% 
## 684.9905

quantile(newdf1$cost,0.95)

##      95% 
## 21961.11

So, 5% of the patients costs for the hospital less than 700 euros, and other 5% of the patients over 21000 euros

Bootstrap

Bootstrap is a special case of the Monte Carlo simulation. The differences between the bootstrap and mote carlo methods are:

Monte Carlo simulation is based on a random number and a a priori distribution form of data
Bootstrap resamples with replacement on the real data.

First, we need to create a bootstrap resamples. The function of bootstrap resampling is still not optimal. To limit the computational time, we only do 100 bootstrap resamples

sampler <- function(dat, clustervar, replace = TRUE, reps = 1) {
  cid <- unique(dat[, clustervar[1]])
  ncid <- length(cid)
  recid <- sample(cid, size = ncid * reps, replace = TRUE)
  if (replace) {
    rid <- lapply(seq_along(recid), function(i) {
      cbind(NewID = i, RowID = sample(which(dat[, clustervar] == recid[i]),
                                      size = length(which(dat[, clustervar] == recid[i])), replace = TRUE))
    })
  } else {
    rid <- lapply(seq_along(recid), function(i) {
      cbind(NewID = i, RowID = which(dat[, clustervar] == recid[i]))
    })
  }
  dat <- as.data.frame(do.call(rbind, rid))
  dat$Replicate <- factor(cut(dat$NewID, breaks = c(1, ncid * 1:reps), include.lowest = TRUE,
                              labels = FALSE))
  dat$NewID <- factor(dat$NewID)
  return(dat)
}

n <- 100
set.seed(20)
tmp <- sampler(df, "cost", reps = n)
bigdata <- cbind(tmp, df[tmp$RowID, 1:3])

Now we do exactly the same procedure on these bootstraps samples and look at the results

hist(newdf1$cost,breaks = 100)

quantile(newdf1$cost,0.05)

##       5% 
## 684.9905

quantile(newdf1$cost,0.95)

##      95% 
## 21961.11

The bootstrap (resampling on real data) gave the same observation as the Monte Carlo simulation (on the simulated data). The only difference is the number of iteration, 100 for bootstrap and 1000 for MC simulations

2 scenarii

Now let’s look at how change the cost with 2 scenarios

The number of complications is reduced by 2

hist(newdf1$cost,breaks = 100)

quantile(newdf1$cost,0.05)

##        5% 
## -7008.982

quantile(newdf1$cost,0.95)

##      95% 
## 14389.89

So, 95% of the patients now cost less than 15000 euros

The duration of stay in hospital is reduced by 20

hist(newdf1$cost,breaks = 100)

quantile(newdf1$cost,0.95)

##      95% 
## 24037.91

Now, the cost increases. The last 5% of the patients cost over 24000 euros

Monte Carlo simulation and Bootstrap

Vịt Trần

February 9, 2017