The Omawa Civic Hospital

The Omawa Civic Hospital has two branches, the first branch host Omawa residence of age 14 and above. It also holds a gynecology department that assists women in labour among other treatment and consultations. The second branch is a clinic which offers limited health care services.

When a patient needs longer term medical attention, they are usually transferred to the main building, where they can occupy a bed until the end of their care. However, the clinic has began to keep patients for longer periods of time, as the main building’s bed capacity is reaching its limit.

Thus, in order to determine whether they will need to use the clinic, or add new beds in the main build, the Omawa Civc Hospital has asked us to observe the patient length of stay in the hospital and predict the probability of staying longer than two days.

We were given two data sets, the first being the mimic3d data set, which displays the main building patients information, the second one named Patient, provides us with information about the patients visiting the clinic. As both the clinic and the main building’s administrative structures function differently, the mimic3d and Patient data set do not share all of the same variables. (View dictionnary)

Hence why, we are, for the first part of our analysis, only using the mimic3d, as it will give us a better understanding of the trends seen in the hospital.

The first part is a broad overview of the elements in the main building data set.


Estimating the length of stay per patient using Bayesian Linear Regression

In this second part of our analysis, we are trying to perform a Bayesian regression on the mimic3d data set, to predict the length of stay as well as the probability of staying longer than two days. We made a few changes in the data set to simplify our calculations:

  • We have first only kept the column that are both in the mimic3d data set as well as the patient data set

  • We ignored the admission diagnosis column, as it contained a number of categorical variables too large for the capacity of this analysis

  • We removed all the columns where the age was zero, as it made the age distribution highly skewed, and newborns length of stay are fairly constant

  • We have also ignored the marital status, to minimize the number of predictor variables, however it can still be added to the analyses.

  • Finally, we turn all categorical variables into dummies variables.

We are, again, only using the information coming from the main building to train our model as it provides a more accurate understanding of the patients length of stay.

A Bayesian Linear Regression, predicting the length of stay

Before we do any analysis, we have split our mimic3d data set into a training (80%) and test (20%) to verify the accuracy of our predictions. In the dictionaries section we can see the variables used for the regression and their transformation, both from the mimic3d and the patient data sets.

set.seed(123)
ind <- sample(seq_len(nrow(mimic)), size = .8*nrow(mimic))
train = mimic[ind,c(2,3,7,9:10,19)]
test = mimic[-ind,c(2,3,7,9:10,19)]

We then using a frequentist multivariate linear regression to estimate our coefficients \(\boldsymbol{\beta}\).

fit <- lm(LOSdays ~ age + gender_F + admit_type_ELECTIVE +
            admit_type_EMERGENCY, data = train)

coefficients(fit)
##          (Intercept)                  age             gender_F 
##         12.535815533          0.003890085         -0.306127583 
##  admit_type_ELECTIVE admit_type_EMERGENCY 
##         -3.822056006         -2.635814522

Since our predicted Y is normally distributed, our likelihood is: \(Y|\boldsymbol{X},\boldsymbol{\beta},\sigma^2 \sim N(\boldsymbol{\beta}^T\boldsymbol{X},\sigma^2I)\)

Our priors will be: \(\beta_{i} \sim N(b_{i},\sigma^2)\) and \(1/\sigma^2 \sim Gamma(1/2,\sigma_{res}^2/2)\)

with \(\beta_{i}\) being our estimated parameters.

And our posterior distribution is: \(Posterior \propto likelihood*prior\)

Since we are using logarithmic distributions in these calculations. Our posterior distribution will be:

\(Posterior \propto log(likelihood)+log(prior)\)

Y = train$LOSdays
X = as.matrix(cbind(1,train[,c(1,3:5)]))
sig = Ginv(t(X)%*%X)
mB = coefficients(fit)
sdB = sqrt(diag(vcov(fit)))
rsd = sigma(fit)

prior <- function(par){
  b0 <- par[1]
  b1 <- par[2]
  b2 <- par[3]
  b3 <- par[4]
  b4 <- par[5]
  sd <- par[6]
  return(c(b0prior = dnorm(b0, mean = mB[1], sd = sd, log = T),
             b1prior = dnorm(b1, mean = mB[2], sd = sd, log = T),
             b2prior = dnorm(b2, mean = mB[3], sd = sd, log = T),
             b3prior = dnorm(b3, mean = mB[4], sd = sd, log = T),
             b4prior = dnorm(b4, mean = mB[5], sd = sd, log = T),
             sdprior = 1/sqrt(dgamma(0.5,rsd/2))))}
likelihood <- function(X,par){
  sum(dnorm(Y, mean = X%*%par[-length(par)], sd = par[length(par)], log = T))}
posterior <- function(X,par){
  return(likelihood(X,par)+sum(prior(par)))}

Using the metropolis hasting algorithm, we estimate a series of randomly distribution set of parameters named Sample.

set.seed((123))
MH<-function(theta0, T = 1000,r = 1, n = length(theta0)){
  dim = length(theta0)
  res = matrix(0, nrow=T+1, ncol = dim)
  res[1,] = theta0
  for(i in 1:T){
    proposal = res[i, ] + rnorm(dim, mean = 0, sd = r)
    probab = exp(posterior(X,proposal) - posterior(X,res[i,]))
    if(runif(1) < probab){res[i+1,] = proposal} 
    else{res[i+1,] = res[i,]}}
return(res)} 
Sample = MH(c(mB -sdB ,9))

As we can see from the code chunk, the Metropolis Hasting algorithm estimates a series of randomly distributed parameters, adding a random walk to each new generated sets. The output of our code will be a list of 1000 estimated set of parameters, as we have adjust the number of iterations to this number.

To predict our new values, the code below averages the output of multivariate linear regression estimated using our Sample list.

prediction<-function(X){
  y = c()
  nx = as.matrix(X)
  n = nrow(X)
  for(i in 1:n){
  ny = mean(1*Sample[,1] + nx[i,1]*Sample[,2] + nx[i,2]*Sample[,3] +
              nx[i,3]*Sample[,4] + nx[i,4]*Sample[,5])
  if(ny<0){ny = 0}
  else{ny = ny}
  y = append(y,ny)
  i = i+1}
  return(y)}

mse <- function(ypred,ytrue){ ytrue = as.matrix(ytrue)
  mean((ytrue - ypred)^2)}

We can see from the table below that the training and test set derived from the mimic3d dataset both render MSE values higher than the patient dataset. However, it does not make the clinic’s dataset better for our prediction. The phenomenon can be explained by the fact that length of stay, for the main building, expands on a broader range of days than for the patient dataset.

This means that although, we can see the average length of stay for both training and test falls within out predict data, larger and more extreme values are not as well represented by the model. Where as for the patient dataset, the length of stay doesn’t tend to vary as much.

predtrain = prediction(train[,-c(2,6)])
predtest = prediction(test[,-c(2,6)])
predpatient = prediction(patient[,c(2,7:9)])

MSE = c(mse(predtrain,train[,2]),mse(predtest,test[,2]),mse(predpatient,patient[,3]))

lml = cbind(Dataset = c("Train","Test","Patient"),MSE)
kable(lml, caption = "MSE per dataset", digits = 2)
MSE per dataset
Dataset MSE
Train 113.447210126972
Test 121.82647815696
Patient 60.0282654040456
par(mfrow = c(1,3))
hist(predtrain, main="Inference on Training", xlab="Mean of responses variables  = red")
abline(v = mean(train$LOSdays),col="red")
hist(predtest, main="Inference on Test", xlab="Mean of responses variables  = red")
abline(v = mean(test$LOSdays),col="red")
hist(predpatient, main="Inference on Patient", xlab="Mean of responses variables  = red")
abline(v = mean(patient$`Length of Stay`),col="red")

Probabilities of staying longer than 2 days with Bayesian Logistic Regression

Here we are trying to perform a Bayesian Logistic Regression to estimate the probability of patient staying longer than two days.

Our likelihood function is: \(Y \sim Bern(p)\), where \(logit p = \boldsymbol{\beta}^t\boldsymbol{X}\)

Our priors will be: \(\beta_{0} \sim N(b_{0},\sigma^2)\), \(\beta_{i>0} \sim N(0,1000)\), and \(1/\sigma^2 \sim Gamma(1/2,\sigma_{res}^2/2)\)

And our posterior distribution is: \(Posterior \propto likelihood*prior\)

We are again using logarithmic distributions in these calculations.

fit <- glm(LOSdis ~ . - LOSdays, 
           data = train, family = binomial(link = "logit"))
Y = train$LOSdis
X = as.matrix(cbind(1,train[,c(1,3:5)]))
mB = coefficients(fit)
sdB = sqrt(diag(vcov(fit)))
sd = sd(Y)
prior <- function(par){
  b0 = par[1]
  b1 = par[2]
  b2 = par[3]
  b3 = par[4]
  b4 = par[5]
  b0prior = dnorm(b0, mean = mB[1], sd = sdB[1], log = T)
  b1prior = dnorm(b1, mean = 0, sd = 1000, log = T)
  b2prior = dnorm(b2, mean = 0, sd = 1000, log = T)
  b3prior = dnorm(b3, mean = 0, sd = 1000, log = T)
  b4prior = dnorm(b4, mean = 0, sd = 1000, log = T)
  insig = dgamma(0.1,0.1)
  sdprior = sqrt(1/insig)
  return(sum(b0prior, b1prior, b2prior,b3prior,b4prior,sdprior))}
likelihood <- function(X,par){
  sum(dbinom(Y, size = 1, prob = plogis(X %*% par[-length(par)]), log = TRUE))}
posterior <- function(X,par){
  return(likelihood(X,par)+prior(par))}

We then use the same metropolis hasting algorithm on our new posterior functions to estimate our probabilities, generating a new Sample set for out logistic regression.

Now, instead of using a multivariate regression model to estimate the length of stay, we instead use the Sample set to average the probability of staying longer than two days in the hospital.

probability <- function(X) {
  X = as.matrix(X)
  n <- as.numeric(nrow(X))
  probabilities <- matrix(0, nrow= n, ncol = 1)
  for (i in 1:n) {
    linear_predictor = (1*Sample[,1] + X[i,1]*Sample[,2] + X[i,2]*Sample[,3] +
              X[i,3]*Sample[,4] + X[i,4]*Sample[,5])
    probability = (mean(1 / (1 + exp(-linear_predictor))))
    probabilities[i] <- probability}
return(probabilities)}

mean.log.loss = function(yprob, ytrue){
  ytrue = as.matrix(ytrue)
mean.log.loss = mean(-ytrue*log(yprob)-(1-ytrue)*log(1-yprob))
return(mean.log.loss)}

We can see here that the mean log loss for our training and test sets are much lower than for our patient test sets. This is again because our clinic dataset doesn’t provide an accurate length of stay for patients, as they tend to be moved to the main building when they are expected to stay longer than a day.

Mean Log Loss per dataset
Dataset Mean_Log_Loss
Train 0.435178918730835
Test 0.428727847609923
Patient 1.2105985808161

Conclusion

We notice from the Bayesian analysis performed both in linear and logistic regression form, that the data given by the clinic does not accurately represent patients length of stays. Therefore for future observations and predictions, we advise the hospital to use the main buildings patient information to perform such inferences. We have both offered a function for the prediction of the length of stay and the probability of staying longer than two days as tools of prediction.

Post Mortem

Bayesian statistics can be very subjective, especially in it’s use of priors, which can make working and collaborating with other people difficult. It also renders knowing what the right process is strenuous, as there are many ways to solve one problem. After spending hours trying to figure out what was wrong with my code, and implimenting different methods and prior, I came to realize that I had to trust myself. Since we only had 3 weeks to learn the subject and complete the project, it would have been difficult to applied multiple method for validation and comparison before. However, no matter how well I have perform in this task, I now have new prior knowledge for future experiences.

- Amina

Contributors

Alphonse Marie Diatta : First part: Graphical Analysis

Amina Al-Nadjib : Second part: Bayesian Analysis

Dataset Dictionary

Mimic3d

## spc_tbl_ [58,976 × 28] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ hadm_id         : num [1:58976] 1e+05 1e+05 1e+05 1e+05 1e+05 ...
##  $ gender          : chr [1:58976] "F" "M" "F" "F" ...
##  $ age             : num [1:58976] 35 59 48 73 60 54 21 67 49 55 ...
##  $ LOSdays         : num [1:58976] 6.17 4.04 12.04 7.29 4.88 ...
##  $ admit_type      : chr [1:58976] "EMERGENCY" "EMERGENCY" "EMERGENCY" "EMERGENCY" ...
##  $ admit_location  : chr [1:58976] "CLINIC REFERRAL/PREMATURE" "EMERGENCY ROOM ADMIT" "EMERGENCY ROOM ADMIT" "EMERGENCY ROOM ADMIT" ...
##  $ AdmitDiagnosis  : chr [1:58976] "DIABETIC KETOACIDOSIS" "UPPER GI BLEED" "COPD FLARE" "BOWEL OBSTRUCTION" ...
##  $ insurance       : chr [1:58976] "Private" "Private" "Private" "Private" ...
##  $ religion        : chr [1:58976] "PROTESTANT QUAKER" "NOT SPECIFIED" "NOT SPECIFIED" "JEWISH" ...
##  $ marital_status  : chr [1:58976] "DIVORCED" "SINGLE" "SINGLE" "MARRIED" ...
##  $ ethnicity       : chr [1:58976] "WHITE" "WHITE" "BLACK/AFRICAN AMERICAN" "WHITE" ...
##  $ NumCallouts     : num [1:58976] 0.16 0.25 0 0.41 0 0.23 0.07 0.1 0 0 ...
##  $ NumDiagnosis    : num [1:58976] 2.59 2.23 0.75 0.69 3.69 1.14 0.97 1.09 12.7 1.78 ...
##  $ NumProcs        : num [1:58976] 0 0.99 0.17 0.27 0.82 0.68 1.04 0.4 4.76 0.81 ...
##  $ AdmitProcedure  : chr [1:58976] "na" "Endosc control gast hem" "Non-invasive mech vent" "Part sm bowel resect NEC" ...
##  $ NumCPTevents    : num [1:58976] 1.3 1.98 0.83 0.69 2.25 1.83 3.13 1.09 3.17 2.43 ...
##  $ NumInput        : num [1:58976] 25.1 13.6 11.5 20.3 20.5 ...
##  $ NumLabs         : num [1:58976] 43.4 55.9 33.4 32.2 50.6 ...
##  $ NumMicroLabs    : num [1:58976] 0.65 1.24 0.33 0.69 0.61 0 1.88 0.3 0 0.81 ...
##  $ NumNotes        : num [1:58976] 0.05 1.59 0.15 0.17 0.34 0.11 0.21 0.15 0 0.21 ...
##  $ NumOutput       : num [1:58976] 5.19 5.45 4.15 9.05 16.19 ...
##  $ NumRx           : num [1:58976] 14.91 7.18 6.23 11.52 25 ...
##  $ NumProcEvents   : num [1:58976] 1.13 0.99 0 0 2.87 1.14 4.1 2.28 0 1.13 ...
##  $ NumTransfers    : num [1:58976] 0.65 1.24 0.33 0.96 2.05 0.91 0.21 0.6 4.76 0.49 ...
##  $ NumChartEvents  : num [1:58976] 399 373 286 526 555 ...
##  $ ExpiredHospital : num [1:58976] 0 0 0 0 0 0 0 0 0 0 ...
##  $ TotalNumInteract: num [1:58976] 494 466 344 603 680 ...
##  $ LOSgroupNum     : num [1:58976] 1 1 3 1 1 1 3 2 0 1 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   hadm_id = col_double(),
##   ..   gender = col_character(),
##   ..   age = col_double(),
##   ..   LOSdays = col_double(),
##   ..   admit_type = col_character(),
##   ..   admit_location = col_character(),
##   ..   AdmitDiagnosis = col_character(),
##   ..   insurance = col_character(),
##   ..   religion = col_character(),
##   ..   marital_status = col_character(),
##   ..   ethnicity = col_character(),
##   ..   NumCallouts = col_double(),
##   ..   NumDiagnosis = col_double(),
##   ..   NumProcs = col_double(),
##   ..   AdmitProcedure = col_character(),
##   ..   NumCPTevents = col_double(),
##   ..   NumInput = col_double(),
##   ..   NumLabs = col_double(),
##   ..   NumMicroLabs = col_double(),
##   ..   NumNotes = col_double(),
##   ..   NumOutput = col_double(),
##   ..   NumRx = col_double(),
##   ..   NumProcEvents = col_double(),
##   ..   NumTransfers = col_double(),
##   ..   NumChartEvents = col_double(),
##   ..   ExpiredHospital = col_double(),
##   ..   TotalNumInteract = col_double(),
##   ..   LOSgroupNum = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

Patient dataset

str(Patient)
## spc_tbl_ [61 × 18] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Horodateur            : chr [1:61] "2023/06/05 10:47:32 PM UTC−4" "2023/06/05 10:53:15 PM UTC−4" "2023/06/05 11:03:37 PM UTC−4" "2023/06/05 11:09:01 PM UTC−4" ...
##  $ Age                   : num [1:61] 10 21 13 55 22 22 21 45 35 84 ...
##  $ Height                : num [1:61] 150 172 145 170 183 ...
##  $ Weight                : num [1:61] 32 55 43 150 180 84 60 75 65 50 ...
##  $ Gender                : chr [1:61] "Female" "Female" "Male" "Male" ...
##  $ Diagnosis             : chr [1:61] "Asthma" "Depression" "Hypertension" "Asthma" ...
##  $ Marital Status        : chr [1:61] "Single" "Single" "Single" "Married" ...
##  $ Income                : chr [1:61] "Low" "Low" "Low" "Medium" ...
##  $ Length of Stay        : num [1:61] 0.08 2.4 24 0.2 0.2 0.2 0.3 0.25 0.25 10.4 ...
##  $ Location/distance     : num [1:61] 5 3 12 5 8 2 1 15 10 12 ...
##  $ Cost of stay          : num [1:61] 1 1000 234 100 5000 1500 50 2000 1500 874 ...
##  $ Payment method        : chr [1:61] "Debit Card" "Insurance" "Insurance" "Credit card" ...
##  $ Inpatient/outpatient  : chr [1:61] "outpatient" "outpatient" "Inpatient" "outpatient" ...
##  $ Previous admission    : chr [1:61] "Yes" "Yes" "No" "No" ...
##  $ Severity of injuries  : chr [1:61] "Non-life Threatening" "Life-threatening" "Superficial" "Non-life Threatening" ...
##  $ Condition at discharge: chr [1:61] "Stable" "Stable" "Stable" "Stable" ...
##  $ Admission Type        : chr [1:61] "Elective" "Elective" "Elective" "Emergency" ...
##  $ Blood types           : chr [1:61] "A" "O" "RH-positive" "O" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Horodateur = col_character(),
##   ..   Age = col_double(),
##   ..   Height = col_double(),
##   ..   Weight = col_double(),
##   ..   Gender = col_character(),
##   ..   Diagnosis = col_character(),
##   ..   `Marital Status` = col_character(),
##   ..   Income = col_character(),
##   ..   `Length of Stay` = col_double(),
##   ..   `Location/distance` = col_double(),
##   ..   `Cost of stay` = col_double(),
##   ..   `Payment method` = col_character(),
##   ..   `Inpatient/outpatient` = col_character(),
##   ..   `Previous admission` = col_character(),
##   ..   `Severity of injuries` = col_character(),
##   ..   `Condition at discharge` = col_character(),
##   ..   `Admission Type` = col_character(),
##   ..   `Blood types` = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

Training dataset

## tibble [38,784 × 6] (S3: tbl_df/tbl/data.frame)
##  $ age                 : num [1:38784] 71 58 62 73 53 77 86 32 52 42 ...
##  $ LOSdays             : num [1:38784] 3.5 2.88 3.21 5.04 18.08 ...
##  $ gender_F            : int [1:38784] 1 1 0 1 0 1 1 1 1 1 ...
##  $ admit_type_ELECTIVE : int [1:38784] 0 0 0 1 0 1 1 0 0 0 ...
##  $ admit_type_EMERGENCY: int [1:38784] 1 1 1 0 1 0 0 1 1 1 ...
##  $ LOSdis              : num [1:38784] 0 1 0 0 0 0 0 0 0 0 ...
##  - attr(*, "na.action")= 'omit' Named int [1:2385] 121 145 148 168 174 176 228 250 306 309 ...
##   ..- attr(*, "names")= chr [1:2385] "121" "145" "148" "168" ...
##  - attr(*, ".internal.selfref")=<externalptr>

Patient dataset after removing variable

## spc_tbl_ [61 × 18] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Horodateur            : chr [1:61] "2023/06/05 10:47:32 PM UTC−4" "2023/06/05 10:53:15 PM UTC−4" "2023/06/05 11:03:37 PM UTC−4" "2023/06/05 11:09:01 PM UTC−4" ...
##  $ Age                   : num [1:61] 10 21 13 55 22 22 21 45 35 84 ...
##  $ Height                : num [1:61] 150 172 145 170 183 ...
##  $ Weight                : num [1:61] 32 55 43 150 180 84 60 75 65 50 ...
##  $ Gender                : chr [1:61] "Female" "Female" "Male" "Male" ...
##  $ Diagnosis             : chr [1:61] "Asthma" "Depression" "Hypertension" "Asthma" ...
##  $ Marital Status        : chr [1:61] "Single" "Single" "Single" "Married" ...
##  $ Income                : chr [1:61] "Low" "Low" "Low" "Medium" ...
##  $ Length of Stay        : num [1:61] 0.08 2.4 24 0.2 0.2 0.2 0.3 0.25 0.25 10.4 ...
##  $ Location/distance     : num [1:61] 5 3 12 5 8 2 1 15 10 12 ...
##  $ Cost of stay          : num [1:61] 1 1000 234 100 5000 1500 50 2000 1500 874 ...
##  $ Payment method        : chr [1:61] "Debit Card" "Insurance" "Insurance" "Credit card" ...
##  $ Inpatient/outpatient  : chr [1:61] "outpatient" "outpatient" "Inpatient" "outpatient" ...
##  $ Previous admission    : chr [1:61] "Yes" "Yes" "No" "No" ...
##  $ Severity of injuries  : chr [1:61] "Non-life Threatening" "Life-threatening" "Superficial" "Non-life Threatening" ...
##  $ Condition at discharge: chr [1:61] "Stable" "Stable" "Stable" "Stable" ...
##  $ Admission Type        : chr [1:61] "Elective" "Elective" "Elective" "Emergency" ...
##  $ Blood types           : chr [1:61] "A" "O" "RH-positive" "O" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Horodateur = col_character(),
##   ..   Age = col_double(),
##   ..   Height = col_double(),
##   ..   Weight = col_double(),
##   ..   Gender = col_character(),
##   ..   Diagnosis = col_character(),
##   ..   `Marital Status` = col_character(),
##   ..   Income = col_character(),
##   ..   `Length of Stay` = col_double(),
##   ..   `Location/distance` = col_double(),
##   ..   `Cost of stay` = col_double(),
##   ..   `Payment method` = col_character(),
##   ..   `Inpatient/outpatient` = col_character(),
##   ..   `Previous admission` = col_character(),
##   ..   `Severity of injuries` = col_character(),
##   ..   `Condition at discharge` = col_character(),
##   ..   `Admission Type` = col_character(),
##   ..   `Blood types` = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>