The Omawa Civic Hospital has two branches, the first branch host Omawa residence of age 14 and above. It also holds a gynecology department that assists women in labour among other treatment and consultations. The second branch is a clinic which offers limited health care services.
When a patient needs longer term medical attention, they are usually transferred to the main building, where they can occupy a bed until the end of their care. However, the clinic has began to keep patients for longer periods of time, as the main building’s bed capacity is reaching its limit.
Thus, in order to determine whether they will need to use the clinic, or add new beds in the main build, the Omawa Civc Hospital has asked us to observe the patient length of stay in the hospital and predict the probability of staying longer than two days.
We were given two data sets, the first being the mimic3d data set, which displays the main building patients information, the second one named Patient, provides us with information about the patients visiting the clinic. As both the clinic and the main building’s administrative structures function differently, the mimic3d and Patient data set do not share all of the same variables. (View dictionnary)
Hence why, we are, for the first part of our analysis, only using the mimic3d, as it will give us a better understanding of the trends seen in the hospital.
The first part is a broad overview of the elements in the main building data set.
Age and Length of Stay relationship trends
In the scatter plot of length of stay vs age, we observed a high density of points in the region where age falls between 14 and above, and the number of hospital days is less than 70 days. This suggests that a large number of patients in this age group spend a similar number of days in the hospital. However, there is a disparity between newborns and other age groups, as newborns tend to have a more consistent length of stay compared to other individuals.
The second graph shows us the density distribution of age, and its general curve shape resembles that of a normal distribution. However, there is a skweness at the very beginning, completely to the left of the graph. This skewness is likely explained by the gap observed between newborns and individuals of age 13 and oldr. This simply means that there is a small number of individuals who spend a significant amount of time in the hospital compared to the majority who spend little time in general.
We also observe a density distribution of age with an exponential shape. The graph shows a rapidly declining curve at the beginning, indicating a high frequency of younger ages. As the age increases, the curve gradually flattens out, indicating a decreasing frequency of older ages. Interpreting the graph, we can observe a high concentration of individuals in the younger age groups, with the frequency decreasing as age increases. The exponential shape suggests that the probability of encountering individuals in older age groups becomes progressively smaller. This means that as age increases, the likelihood of finding individuals in older age categories decreases.
Trends within some variables with regards to the length of Stay
Correlation between the length of stay, age and other variable
Based on the box plot of length of stay by insurance, we noticed that people with no insurance have the shortest hospital stays, which is expected. When studying the correlation between variables, we found that the correlation between LOS days and age is -0.03252, indicating a negative correlation.
By analyzing our correlation heatmap, we notice a cell of light color between LOSdays (length of stay in days) and age. This confirms that there is indeed a weak or little correlation between LOSdays and age. In other words, the age of the patients does not have a significant influence on their number of hospital days. When examining the heatmap in general, we can also identify other correlation patterns. For example, the fact that we see a diagonal of red cells, indicate a positive correlation between each variable and itself, which is expected as a variable is perfectly correlated with itself.
In this second part of our analysis, we are trying to perform a Bayesian regression on the mimic3d data set, to predict the length of stay as well as the probability of staying longer than two days. We made a few changes in the data set to simplify our calculations:
We have first only kept the column that are both in the mimic3d data set as well as the patient data set
We ignored the admission diagnosis column, as it contained a number of categorical variables too large for the capacity of this analysis
We removed all the columns where the age was zero, as it made the age distribution highly skewed, and newborns length of stay are fairly constant
We have also ignored the marital status, to minimize the number of predictor variables, however it can still be added to the analyses.
Finally, we turn all categorical variables into dummies variables.
We are, again, only using the information coming from the main building to train our model as it provides a more accurate understanding of the patients length of stay.
Before we do any analysis, we have split our mimic3d data set into a training (80%) and test (20%) to verify the accuracy of our predictions. In the dictionaries section we can see the variables used for the regression and their transformation, both from the mimic3d and the patient data sets.
set.seed(123)
ind <- sample(seq_len(nrow(mimic)), size = .8*nrow(mimic))
train = mimic[ind,c(2,3,7,9:10,19)]
test = mimic[-ind,c(2,3,7,9:10,19)]
We then using a frequentist multivariate linear regression to estimate our coefficients \(\boldsymbol{\beta}\).
fit <- lm(LOSdays ~ age + gender_F + admit_type_ELECTIVE +
admit_type_EMERGENCY, data = train)
coefficients(fit)
## (Intercept) age gender_F
## 12.535815533 0.003890085 -0.306127583
## admit_type_ELECTIVE admit_type_EMERGENCY
## -3.822056006 -2.635814522
Since our predicted Y is normally distributed, our likelihood is: \(Y|\boldsymbol{X},\boldsymbol{\beta},\sigma^2 \sim N(\boldsymbol{\beta}^T\boldsymbol{X},\sigma^2I)\)
Our priors will be: \(\beta_{i} \sim N(b_{i},\sigma^2)\) and \(1/\sigma^2 \sim Gamma(1/2,\sigma_{res}^2/2)\)
with \(\beta_{i}\) being our estimated parameters.
And our posterior distribution is: \(Posterior \propto likelihood*prior\)
Since we are using logarithmic distributions in these calculations. Our posterior distribution will be:
\(Posterior \propto log(likelihood)+log(prior)\)
Y = train$LOSdays
X = as.matrix(cbind(1,train[,c(1,3:5)]))
sig = Ginv(t(X)%*%X)
mB = coefficients(fit)
sdB = sqrt(diag(vcov(fit)))
rsd = sigma(fit)
prior <- function(par){
b0 <- par[1]
b1 <- par[2]
b2 <- par[3]
b3 <- par[4]
b4 <- par[5]
sd <- par[6]
return(c(b0prior = dnorm(b0, mean = mB[1], sd = sd, log = T),
b1prior = dnorm(b1, mean = mB[2], sd = sd, log = T),
b2prior = dnorm(b2, mean = mB[3], sd = sd, log = T),
b3prior = dnorm(b3, mean = mB[4], sd = sd, log = T),
b4prior = dnorm(b4, mean = mB[5], sd = sd, log = T),
sdprior = 1/sqrt(dgamma(0.5,rsd/2))))}
likelihood <- function(X,par){
sum(dnorm(Y, mean = X%*%par[-length(par)], sd = par[length(par)], log = T))}
posterior <- function(X,par){
return(likelihood(X,par)+sum(prior(par)))}
Using the metropolis hasting algorithm, we estimate a series of randomly distribution set of parameters named Sample.
set.seed((123))
MH<-function(theta0, T = 1000,r = 1, n = length(theta0)){
dim = length(theta0)
res = matrix(0, nrow=T+1, ncol = dim)
res[1,] = theta0
for(i in 1:T){
proposal = res[i, ] + rnorm(dim, mean = 0, sd = r)
probab = exp(posterior(X,proposal) - posterior(X,res[i,]))
if(runif(1) < probab){res[i+1,] = proposal}
else{res[i+1,] = res[i,]}}
return(res)}
Sample = MH(c(mB -sdB ,9))
As we can see from the code chunk, the Metropolis Hasting algorithm estimates a series of randomly distributed parameters, adding a random walk to each new generated sets. The output of our code will be a list of 1000 estimated set of parameters, as we have adjust the number of iterations to this number.
To predict our new values, the code below averages the output of multivariate linear regression estimated using our Sample list.
prediction<-function(X){
y = c()
nx = as.matrix(X)
n = nrow(X)
for(i in 1:n){
ny = mean(1*Sample[,1] + nx[i,1]*Sample[,2] + nx[i,2]*Sample[,3] +
nx[i,3]*Sample[,4] + nx[i,4]*Sample[,5])
if(ny<0){ny = 0}
else{ny = ny}
y = append(y,ny)
i = i+1}
return(y)}
mse <- function(ypred,ytrue){ ytrue = as.matrix(ytrue)
mean((ytrue - ypred)^2)}
We can see from the table below that the training and test set derived from the mimic3d dataset both render MSE values higher than the patient dataset. However, it does not make the clinic’s dataset better for our prediction. The phenomenon can be explained by the fact that length of stay, for the main building, expands on a broader range of days than for the patient dataset.
This means that although, we can see the average length of stay for both training and test falls within out predict data, larger and more extreme values are not as well represented by the model. Where as for the patient dataset, the length of stay doesn’t tend to vary as much.
predtrain = prediction(train[,-c(2,6)])
predtest = prediction(test[,-c(2,6)])
predpatient = prediction(patient[,c(2,7:9)])
MSE = c(mse(predtrain,train[,2]),mse(predtest,test[,2]),mse(predpatient,patient[,3]))
lml = cbind(Dataset = c("Train","Test","Patient"),MSE)
kable(lml, caption = "MSE per dataset", digits = 2)
| Dataset | MSE |
|---|---|
| Train | 113.447210126972 |
| Test | 121.82647815696 |
| Patient | 60.0282654040456 |
par(mfrow = c(1,3))
hist(predtrain, main="Inference on Training", xlab="Mean of responses variables = red")
abline(v = mean(train$LOSdays),col="red")
hist(predtest, main="Inference on Test", xlab="Mean of responses variables = red")
abline(v = mean(test$LOSdays),col="red")
hist(predpatient, main="Inference on Patient", xlab="Mean of responses variables = red")
abline(v = mean(patient$`Length of Stay`),col="red")
Here we are trying to perform a Bayesian Logistic Regression to estimate the probability of patient staying longer than two days.
Our likelihood function is: \(Y \sim Bern(p)\), where \(logit p = \boldsymbol{\beta}^t\boldsymbol{X}\)
Our priors will be: \(\beta_{0} \sim N(b_{0},\sigma^2)\), \(\beta_{i>0} \sim N(0,1000)\), and \(1/\sigma^2 \sim Gamma(1/2,\sigma_{res}^2/2)\)
And our posterior distribution is: \(Posterior \propto likelihood*prior\)
We are again using logarithmic distributions in these calculations.
fit <- glm(LOSdis ~ . - LOSdays,
data = train, family = binomial(link = "logit"))
Y = train$LOSdis
X = as.matrix(cbind(1,train[,c(1,3:5)]))
mB = coefficients(fit)
sdB = sqrt(diag(vcov(fit)))
sd = sd(Y)
prior <- function(par){
b0 = par[1]
b1 = par[2]
b2 = par[3]
b3 = par[4]
b4 = par[5]
b0prior = dnorm(b0, mean = mB[1], sd = sdB[1], log = T)
b1prior = dnorm(b1, mean = 0, sd = 1000, log = T)
b2prior = dnorm(b2, mean = 0, sd = 1000, log = T)
b3prior = dnorm(b3, mean = 0, sd = 1000, log = T)
b4prior = dnorm(b4, mean = 0, sd = 1000, log = T)
insig = dgamma(0.1,0.1)
sdprior = sqrt(1/insig)
return(sum(b0prior, b1prior, b2prior,b3prior,b4prior,sdprior))}
likelihood <- function(X,par){
sum(dbinom(Y, size = 1, prob = plogis(X %*% par[-length(par)]), log = TRUE))}
posterior <- function(X,par){
return(likelihood(X,par)+prior(par))}
We then use the same metropolis hasting algorithm on our new posterior functions to estimate our probabilities, generating a new Sample set for out logistic regression.
Now, instead of using a multivariate regression model to estimate the length of stay, we instead use the Sample set to average the probability of staying longer than two days in the hospital.
probability <- function(X) {
X = as.matrix(X)
n <- as.numeric(nrow(X))
probabilities <- matrix(0, nrow= n, ncol = 1)
for (i in 1:n) {
linear_predictor = (1*Sample[,1] + X[i,1]*Sample[,2] + X[i,2]*Sample[,3] +
X[i,3]*Sample[,4] + X[i,4]*Sample[,5])
probability = (mean(1 / (1 + exp(-linear_predictor))))
probabilities[i] <- probability}
return(probabilities)}
mean.log.loss = function(yprob, ytrue){
ytrue = as.matrix(ytrue)
mean.log.loss = mean(-ytrue*log(yprob)-(1-ytrue)*log(1-yprob))
return(mean.log.loss)}
We can see here that the mean log loss for our training and test sets are much lower than for our patient test sets. This is again because our clinic dataset doesn’t provide an accurate length of stay for patients, as they tend to be moved to the main building when they are expected to stay longer than a day.
| Dataset | Mean_Log_Loss |
|---|---|
| Train | 0.435178918730835 |
| Test | 0.428727847609923 |
| Patient | 1.2105985808161 |
We notice from the Bayesian analysis performed both in linear and logistic regression form, that the data given by the clinic does not accurately represent patients length of stays. Therefore for future observations and predictions, we advise the hospital to use the main buildings patient information to perform such inferences. We have both offered a function for the prediction of the length of stay and the probability of staying longer than two days as tools of prediction.
Bayesian statistics can be very subjective, especially in it’s use of priors, which can make working and collaborating with other people difficult. It also renders knowing what the right process is strenuous, as there are many ways to solve one problem. After spending hours trying to figure out what was wrong with my code, and implimenting different methods and prior, I came to realize that I had to trust myself. Since we only had 3 weeks to learn the subject and complete the project, it would have been difficult to applied multiple method for validation and comparison before. However, no matter how well I have perform in this task, I now have new prior knowledge for future experiences.
- Amina
Alphonse Marie Diatta : First part: Graphical Analysis
Amina Al-Nadjib : Second part: Bayesian Analysis
Mimic3d
## spc_tbl_ [58,976 × 28] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ hadm_id : num [1:58976] 1e+05 1e+05 1e+05 1e+05 1e+05 ...
## $ gender : chr [1:58976] "F" "M" "F" "F" ...
## $ age : num [1:58976] 35 59 48 73 60 54 21 67 49 55 ...
## $ LOSdays : num [1:58976] 6.17 4.04 12.04 7.29 4.88 ...
## $ admit_type : chr [1:58976] "EMERGENCY" "EMERGENCY" "EMERGENCY" "EMERGENCY" ...
## $ admit_location : chr [1:58976] "CLINIC REFERRAL/PREMATURE" "EMERGENCY ROOM ADMIT" "EMERGENCY ROOM ADMIT" "EMERGENCY ROOM ADMIT" ...
## $ AdmitDiagnosis : chr [1:58976] "DIABETIC KETOACIDOSIS" "UPPER GI BLEED" "COPD FLARE" "BOWEL OBSTRUCTION" ...
## $ insurance : chr [1:58976] "Private" "Private" "Private" "Private" ...
## $ religion : chr [1:58976] "PROTESTANT QUAKER" "NOT SPECIFIED" "NOT SPECIFIED" "JEWISH" ...
## $ marital_status : chr [1:58976] "DIVORCED" "SINGLE" "SINGLE" "MARRIED" ...
## $ ethnicity : chr [1:58976] "WHITE" "WHITE" "BLACK/AFRICAN AMERICAN" "WHITE" ...
## $ NumCallouts : num [1:58976] 0.16 0.25 0 0.41 0 0.23 0.07 0.1 0 0 ...
## $ NumDiagnosis : num [1:58976] 2.59 2.23 0.75 0.69 3.69 1.14 0.97 1.09 12.7 1.78 ...
## $ NumProcs : num [1:58976] 0 0.99 0.17 0.27 0.82 0.68 1.04 0.4 4.76 0.81 ...
## $ AdmitProcedure : chr [1:58976] "na" "Endosc control gast hem" "Non-invasive mech vent" "Part sm bowel resect NEC" ...
## $ NumCPTevents : num [1:58976] 1.3 1.98 0.83 0.69 2.25 1.83 3.13 1.09 3.17 2.43 ...
## $ NumInput : num [1:58976] 25.1 13.6 11.5 20.3 20.5 ...
## $ NumLabs : num [1:58976] 43.4 55.9 33.4 32.2 50.6 ...
## $ NumMicroLabs : num [1:58976] 0.65 1.24 0.33 0.69 0.61 0 1.88 0.3 0 0.81 ...
## $ NumNotes : num [1:58976] 0.05 1.59 0.15 0.17 0.34 0.11 0.21 0.15 0 0.21 ...
## $ NumOutput : num [1:58976] 5.19 5.45 4.15 9.05 16.19 ...
## $ NumRx : num [1:58976] 14.91 7.18 6.23 11.52 25 ...
## $ NumProcEvents : num [1:58976] 1.13 0.99 0 0 2.87 1.14 4.1 2.28 0 1.13 ...
## $ NumTransfers : num [1:58976] 0.65 1.24 0.33 0.96 2.05 0.91 0.21 0.6 4.76 0.49 ...
## $ NumChartEvents : num [1:58976] 399 373 286 526 555 ...
## $ ExpiredHospital : num [1:58976] 0 0 0 0 0 0 0 0 0 0 ...
## $ TotalNumInteract: num [1:58976] 494 466 344 603 680 ...
## $ LOSgroupNum : num [1:58976] 1 1 3 1 1 1 3 2 0 1 ...
## - attr(*, "spec")=
## .. cols(
## .. hadm_id = col_double(),
## .. gender = col_character(),
## .. age = col_double(),
## .. LOSdays = col_double(),
## .. admit_type = col_character(),
## .. admit_location = col_character(),
## .. AdmitDiagnosis = col_character(),
## .. insurance = col_character(),
## .. religion = col_character(),
## .. marital_status = col_character(),
## .. ethnicity = col_character(),
## .. NumCallouts = col_double(),
## .. NumDiagnosis = col_double(),
## .. NumProcs = col_double(),
## .. AdmitProcedure = col_character(),
## .. NumCPTevents = col_double(),
## .. NumInput = col_double(),
## .. NumLabs = col_double(),
## .. NumMicroLabs = col_double(),
## .. NumNotes = col_double(),
## .. NumOutput = col_double(),
## .. NumRx = col_double(),
## .. NumProcEvents = col_double(),
## .. NumTransfers = col_double(),
## .. NumChartEvents = col_double(),
## .. ExpiredHospital = col_double(),
## .. TotalNumInteract = col_double(),
## .. LOSgroupNum = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
Patient dataset
str(Patient)
## spc_tbl_ [61 × 18] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Horodateur : chr [1:61] "2023/06/05 10:47:32 PM UTC−4" "2023/06/05 10:53:15 PM UTC−4" "2023/06/05 11:03:37 PM UTC−4" "2023/06/05 11:09:01 PM UTC−4" ...
## $ Age : num [1:61] 10 21 13 55 22 22 21 45 35 84 ...
## $ Height : num [1:61] 150 172 145 170 183 ...
## $ Weight : num [1:61] 32 55 43 150 180 84 60 75 65 50 ...
## $ Gender : chr [1:61] "Female" "Female" "Male" "Male" ...
## $ Diagnosis : chr [1:61] "Asthma" "Depression" "Hypertension" "Asthma" ...
## $ Marital Status : chr [1:61] "Single" "Single" "Single" "Married" ...
## $ Income : chr [1:61] "Low" "Low" "Low" "Medium" ...
## $ Length of Stay : num [1:61] 0.08 2.4 24 0.2 0.2 0.2 0.3 0.25 0.25 10.4 ...
## $ Location/distance : num [1:61] 5 3 12 5 8 2 1 15 10 12 ...
## $ Cost of stay : num [1:61] 1 1000 234 100 5000 1500 50 2000 1500 874 ...
## $ Payment method : chr [1:61] "Debit Card" "Insurance" "Insurance" "Credit card" ...
## $ Inpatient/outpatient : chr [1:61] "outpatient" "outpatient" "Inpatient" "outpatient" ...
## $ Previous admission : chr [1:61] "Yes" "Yes" "No" "No" ...
## $ Severity of injuries : chr [1:61] "Non-life Threatening" "Life-threatening" "Superficial" "Non-life Threatening" ...
## $ Condition at discharge: chr [1:61] "Stable" "Stable" "Stable" "Stable" ...
## $ Admission Type : chr [1:61] "Elective" "Elective" "Elective" "Emergency" ...
## $ Blood types : chr [1:61] "A" "O" "RH-positive" "O" ...
## - attr(*, "spec")=
## .. cols(
## .. Horodateur = col_character(),
## .. Age = col_double(),
## .. Height = col_double(),
## .. Weight = col_double(),
## .. Gender = col_character(),
## .. Diagnosis = col_character(),
## .. `Marital Status` = col_character(),
## .. Income = col_character(),
## .. `Length of Stay` = col_double(),
## .. `Location/distance` = col_double(),
## .. `Cost of stay` = col_double(),
## .. `Payment method` = col_character(),
## .. `Inpatient/outpatient` = col_character(),
## .. `Previous admission` = col_character(),
## .. `Severity of injuries` = col_character(),
## .. `Condition at discharge` = col_character(),
## .. `Admission Type` = col_character(),
## .. `Blood types` = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
Training dataset
## tibble [38,784 × 6] (S3: tbl_df/tbl/data.frame)
## $ age : num [1:38784] 71 58 62 73 53 77 86 32 52 42 ...
## $ LOSdays : num [1:38784] 3.5 2.88 3.21 5.04 18.08 ...
## $ gender_F : int [1:38784] 1 1 0 1 0 1 1 1 1 1 ...
## $ admit_type_ELECTIVE : int [1:38784] 0 0 0 1 0 1 1 0 0 0 ...
## $ admit_type_EMERGENCY: int [1:38784] 1 1 1 0 1 0 0 1 1 1 ...
## $ LOSdis : num [1:38784] 0 1 0 0 0 0 0 0 0 0 ...
## - attr(*, "na.action")= 'omit' Named int [1:2385] 121 145 148 168 174 176 228 250 306 309 ...
## ..- attr(*, "names")= chr [1:2385] "121" "145" "148" "168" ...
## - attr(*, ".internal.selfref")=<externalptr>
Patient dataset after removing variable
## spc_tbl_ [61 × 18] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Horodateur : chr [1:61] "2023/06/05 10:47:32 PM UTC−4" "2023/06/05 10:53:15 PM UTC−4" "2023/06/05 11:03:37 PM UTC−4" "2023/06/05 11:09:01 PM UTC−4" ...
## $ Age : num [1:61] 10 21 13 55 22 22 21 45 35 84 ...
## $ Height : num [1:61] 150 172 145 170 183 ...
## $ Weight : num [1:61] 32 55 43 150 180 84 60 75 65 50 ...
## $ Gender : chr [1:61] "Female" "Female" "Male" "Male" ...
## $ Diagnosis : chr [1:61] "Asthma" "Depression" "Hypertension" "Asthma" ...
## $ Marital Status : chr [1:61] "Single" "Single" "Single" "Married" ...
## $ Income : chr [1:61] "Low" "Low" "Low" "Medium" ...
## $ Length of Stay : num [1:61] 0.08 2.4 24 0.2 0.2 0.2 0.3 0.25 0.25 10.4 ...
## $ Location/distance : num [1:61] 5 3 12 5 8 2 1 15 10 12 ...
## $ Cost of stay : num [1:61] 1 1000 234 100 5000 1500 50 2000 1500 874 ...
## $ Payment method : chr [1:61] "Debit Card" "Insurance" "Insurance" "Credit card" ...
## $ Inpatient/outpatient : chr [1:61] "outpatient" "outpatient" "Inpatient" "outpatient" ...
## $ Previous admission : chr [1:61] "Yes" "Yes" "No" "No" ...
## $ Severity of injuries : chr [1:61] "Non-life Threatening" "Life-threatening" "Superficial" "Non-life Threatening" ...
## $ Condition at discharge: chr [1:61] "Stable" "Stable" "Stable" "Stable" ...
## $ Admission Type : chr [1:61] "Elective" "Elective" "Elective" "Emergency" ...
## $ Blood types : chr [1:61] "A" "O" "RH-positive" "O" ...
## - attr(*, "spec")=
## .. cols(
## .. Horodateur = col_character(),
## .. Age = col_double(),
## .. Height = col_double(),
## .. Weight = col_double(),
## .. Gender = col_character(),
## .. Diagnosis = col_character(),
## .. `Marital Status` = col_character(),
## .. Income = col_character(),
## .. `Length of Stay` = col_double(),
## .. `Location/distance` = col_double(),
## .. `Cost of stay` = col_double(),
## .. `Payment method` = col_character(),
## .. `Inpatient/outpatient` = col_character(),
## .. `Previous admission` = col_character(),
## .. `Severity of injuries` = col_character(),
## .. `Condition at discharge` = col_character(),
## .. `Admission Type` = col_character(),
## .. `Blood types` = col_character()
## .. )
## - attr(*, "problems")=<externalptr>