data=read.csv("C:/Users/35386/Downloads/ArizonaCardio.csv")
#loading necessary libraries
library(knitr)
## Warning: package 'knitr' was built under R version 4.4.3
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.4.3
Question 1:
To analyse the dataset I will consider the distribution and summary statistics of the LOS (Length of Stay) and then do the same for the two standard cardiovascular procedures CABG and PTCA.
Firstly I will use the head() and str() functions, to show the first few rows of the data to let me better understand the data I will be analysing.
head(data)
## los procedure
## 1 67 1
## 2 53 0
## 3 51 1
## 4 30 0
## 5 43 1
## 6 43 1
As given in the question the dataset contains variables los (length of stay in days) and procedure, either CABG, where procedure =1, or PTCA, where procedure =0.
I will now look at the total number of patients and also the number of patients who had each treatment.
#finding total no. of rows/patients
n_total=nrow(data)
n_total
## [1] 3589
#finding no. of CABG patients
cabg=data[data$procedure=="1",]
n_cabg=nrow(cabg)
n_cabg
## [1] 1676
#finding no. of ptca patients
ptca=data[data$procedure=="0",]
n_ptca=nrow(ptca)
n_ptca
## [1] 1913
#finding proportion of CABG
p_cabg=n_cabg/n_total
p_cabg
## [1] 0.4669824
#Finding prop. of PTCA
p_ptca=1-p_cabg
p_ptca
## [1] 0.5330176
I will now more clearly depict these in first a frequency table and then a proportion table.
#creating frequency table
los_dist=data.frame("Total"=n_total,"CABG"=n_cabg,"PTCA"=n_ptca)
los_dist
## Total CABG PTCA
## 1 3589 1676 1913
kable(los_dist)
| Total | CABG | PTCA |
|---|---|---|
| 3589 | 1676 | 1913 |
#creating proportion table of data
prop.table(los_dist[2:3])
## CABG PTCA
## 1 0.4669824 0.5330176
Out of the 3589 patients who underwent one of the procedures ,1676 (0.467) of them underwent CABG and 1913 (0.533) underwent PTCA.
The summary statistics of the overall sample, CABG and PTCA can be seen in the table below that I created using the summary() function.
summary_data=summary(data$los)
summary_cabg=summary(cabg$los)
summary_ptca=summary(ptca$los)
#Creating dataframe of the 3 summary outputs
summary = data.frame(
Statistic = names(summary_data),
Overall = as.numeric(summary_data),
CABG = as.numeric(summary_cabg),
PTCA = as.numeric(summary_ptca)
)
# Presenting dataframe
kable(summary)
| Statistic | Overall | CABG | PTCA |
|---|---|---|---|
| Min. | 1.000000 | 3.00000 | 1.000000 |
| 1st Qu. | 4.000000 | 9.00000 | 2.000000 |
| Median | 8.000000 | 11.00000 | 4.000000 |
| Mean | 8.830872 | 13.02088 | 5.159958 |
| 3rd Qu. | 11.000000 | 15.00000 | 7.000000 |
| Max. | 83.000000 | 83.00000 | 53.000000 |
The overall average of LOS of a patient is 8.83 days. The average LOS of a CABG patient is 13.02 days ,compared with 5.16 days average LOS of PTCA patients. Hence on average a CABG patient stays in hospital for 4.19 days longer than a PTCA patient. This implies that CABG may have a more intense recovery and monitoring period required than PTCA.
The median of all patients is 8 , with a range of 82 days and an IQR of 7 days. For CABG the median is 11 , the range is 80 days and the IQR is 6 days. The median of PTCA is 4 days , the range is 52 days and the IQR is 5 days. At least 50% of all CABG patients are discharged within 11 days post-op vs 4 days for PTCA. (Further evidence that LOS CABG is bigger)The greater iqr and range of CABG vs PTCA implies than the spread of los is wider for CABG . NEED MORE HERE
I will depict this difference in mean and spread in a boxplot of LOS vs procedure.
#reclassifying the numeric values as categories/levels
data$procedure_cat = factor(data$procedure, levels = c("0", "1"), labels=c("PTCA","CABG"))
#creating boxplot of LOS vs procedure
ggplot(data, aes(x = procedure_cat, y = los)) +
geom_boxplot(alpha = 0.7,fill="skyblue") +
labs(
title = "Length of Stay by Procedure Type",
x = "Procedure",
y = " Length of Stay (Days)"
)
To better show the spread of the data # improve explanation
#creating log value boxplot
data$log_los=log(data$los)
ggplot(data, aes(x = procedure_cat, y = log_los)) +
geom_boxplot(alpha = 0.7,fill="lightgreen") +
labs(
title = " Log Length of Stay by Procedure Type",
x = "Procedure",
y = " Log Length of Stay log(Days))"
)
#commenting on the mean =, iqr ,etc
As we can clearly see from
I will also compute variance of the subsets before commenting.
#computing var
# Calculate variances
var_cabg <- var(cabg$los)
var_ptca <- var(ptca$los)
# Create a data frame of the variances
variance <- data.frame(
Procedure = c("CABG", "PTCA"),
Variance = c(var_cabg, var_ptca)
)
kable(variance)
| Procedure | Variance |
|---|---|
| CABG | 50.00732 |
| PTCA | 17.34365 |
#comment how variance of cabg far higher
The variance of CABG ,50.007, is significantly larger than the variance of PTCA , 17.34365. This implies that the LOS for CABG patients fluctuates far more than the LOS of PTCA patients and has substantially more uncertainty around it. This was also reflected in the wider range and iqr of CABG vs PTCA. This may point towards CABG as being a more invasive procedure with more potential for adverse side effects as seen in the extreme outliers of CABG
???(keep or adapt)
#hist of cabg
# Histogram of LOS for CABG only
ggplot(cabg, aes(x = los)) +
geom_histogram(binwidth = 5, , color = "black", alpha = 0.7) +
labs(
title = "Histogram of Length of Stay for CABG Patients",
x = "Length of Stay (Days)",
y = "Frequency"
) +
theme_minimal(base_size = 14)
#hist of dist of ptca
# Histogram of LOS for PTCA only
ggplot(ptca, aes(x = los)) +
geom_histogram(binwidth = 5, color = "black", alpha = 0.7) +
labs(
title = "Histogram of Length of Stay for PTCA Patients",
x = "Length of Stay (Days)",
y = "Frequency"
) +
theme_minimal(base_size = 14)
Comment on skew of graphs
#maybe change to hist()
Question 2 :
The hospital specifies independent gamma priors for \(\theta_c\) (LOS CABG) and \(\theta_p\) (LOS PTCA).
The expected value of a gamma distribution \(\theta\) ~ gamma(a,b) , ( where a is shape parameter and b is rate parameter)
is defined as:
\[ E(\theta)=\frac{a}{b} \]
The variance of a gamma distribution \(\theta\) ~gamma(a,b) is defined by:
\[ \text{Var}(\theta) =\frac{a}{b^2} \]
The hospital management believe apriori that the average or expected
LOS is 7 days for both CABG and PTCA . I.e that:
\[
E(\theta) = \frac{a}{b} = 7
\]
The hospital management also hold the prior belief that the standard deviation of LOS is 2 days, i.e that variance of LOS=4
\[ \text{Var}(\theta) = \frac{a}{b^2} = 4 \]
We solve this system of equations to obtain the appropriate shape and
rate parameters corresponding to their prior beliefs that:
\(\frac{a}{b}=7\) and \(\frac{a}{b^2}=4\)
Implying that:
\[ a = 7b \] \[ a = 4b^2 \]
Hence
\[ 7b = 4b^2 \]
Therefore
\[ b(4b - 7) = 0 \]
Disregarding b=0 ( which implies a=7(0)=0)
\[ 4b - 7 = 0 \]
\[ b = \frac{7}{4}=1.75 \]
\[ a = 7(1.75) = 12.25 \]
Based on the hospital’s prior beliefs of expected/average LOS is 7 days and the standard deviation of LOS is 2 days, the appropriate shape parameter a is 12.25 and rate parameter b is 1.75 for their prior gamma distributions of both \(\theta_c\) and \(\theta_p\)
\[ \theta_C \sim \text{Gamma}(12.25, 1.75) \]
\[ \theta_P \sim \text{Gamma}(12.25,1.75) \]
Question 3:
#plots put in prior mean as lines
From question 2 we know that \(\theta_C, \theta_P \sim \text{Gamma}(12.25,1.75)\) , using this we can employ the qgamma() function to find the 95% credible interval for \(\theta_c\) and \(\theta_p\). \(\theta_c\) and \(\theta_p\) have the same prior distribution, hence will have the same prior credible interval.
#Generating 95% credible interval using qgamma
a=12.25
b=1.75
prior_ci_los=qgamma(c(0.025,0.975),shape =a,rate=b )
prior_ci_los
## [1] 3.645632 11.430288
# PLotting the prior distribution
x<-seq(0,20,0.0001)
plot(x, dgamma(x,a,b),type="l",
xlab=expression(theta), ylab=expression(italic(paste("p(",theta,")",sep=""))),
main = "Prior Distribution of Theta")
mean=a/b
# 95% credible interval
lower <- qgamma(0.025, a, b)
upper <- qgamma(0.975, a, b)
# Add vertical lines
abline(v = mean, col = "blue", lwd = 2)
abline(v = lower, col = "red", lty = 2, lwd = 2)
abline(v = upper, col = "red", lty = 2, lwd = 2)
#Add labels
text(mean, max(dgamma(x,a,b))*0.9, "Mean=7", col="blue", pos=4)
text(lower, max(dgamma(x,a,b))*0.7, "2.5%", col="red", pos=4)
text(upper, max(dgamma(x,a,b))*0.7, "97.5%", col="red", pos=4)
The prior credible interval of both \(\theta_c\) and \(\theta_p\) ranges from 3.645632 to 11.430288. Based on the hospital’s management’s prior beliefs , a priori , before we’ve observed any data, we can say that there is 95% probability that the true values of \(\theta_c\) and \(\theta_p\) lie in the interval 3.645632 to 11.430288. It is worth noting that this interval is quite wide , reflecting the high level of uncertainty the hospital has in it’s prior beliefs of average length of stay.
As well as knowing we have a gamma prior distribution , we also know from question two that the likelihood functions of LOS for CABG and PTCA are poisson distributions. For modelling the posterior distribution of \(\theta_c\) and \(\theta_p\) we can use the fact that a gamma class of prior is conjugate for a poisson sampling model. Hence giving us a gamma posterior.
\[ \theta \sim \text{Gamma}(a,b) \]
\[ y_1, \ldots, y_n \sim \text{Poisson}(\theta) \]
\[ \theta \mid y_1,\ldots,y_n \sim \text{Gamma}\left(a + \sum_{i=1}^n y_i,\; b + n\right) \]
For \(\theta_c\) :
\[ \theta_c \sim \mathrm{Gamma}(a=12.25,\; b=1.75) \] \[ y_{c,1}, \ldots, y_{c,n} \mid \theta_c \sim \mathrm{Poisson}(\theta_c) \] where n=1676 is the number of CABG patients which we know from question 1.
Also \(\displaystyle \sum_{i=1}^{1676} y_{c,i}\) is the total LOS of all CABG patients
#finding total los of all cabg patients
total_los_cabg=sum(cabg$los)
total_los_cabg
## [1] 21823
Therefore :
\[ \theta_c \mid y_{c,1},\ldots,y_{c,1676} \sim \mathrm{Gamma}\!\left(12.25 + 21823,\; 1.75 + 1676\right) \] \[ \theta_c \mid y_{c,1},\ldots,y_{c,1676} \sim \mathrm{Gamma}(21835.25,\;1677.75) \]
For \(\theta_p\) :
\[ \theta_p \sim \mathrm{Gamma}(a=12.25,\; b=1.75) \]
\[ y_{p,1}, \ldots, y_{p,n} \mid \theta_p \sim \mathrm{Poisson}(\theta_p) \] where n=1913 is the total number of PTCA patients from question 1.
Also \(\displaystyle \sum_{i=1}^{1913} y_{p,i}\) is the total LOS of all PTCA patients
#finding total LOS of all PTCA patients
total_los_ptca=sum(ptca$los)
total_los_ptca
## [1] 9871
Therefore:
\[ \theta_p \mid y_{p,1},\ldots,y_{p,1913} \sim \mathrm{Gamma}\!\left(12.25 + 9871,\; 1.75 + 1913\right) \]
\[ \theta_p \mid y_{p,1},\ldots,y_{p,1913} \sim \mathrm{Gamma}\!\left(9883.25,\; 1914.75\right) \]
We can now use the qgamma() function again to find the 95% posterior credible intervals.
#calcuting 95% posterior coverage intervals LOS CABG
a_post_theta_c=a+total_los_cabg
b_post_theta_c=b+n_cabg
post_ci_theta_c=qgamma(c(0.025,0.975),shape=a_post_theta_c,rate=b_post_theta_c)
post_ci_theta_c
## [1] 12.84254 13.18779
#plotting the coverage interval for LOS CABG
x<-seq(12,14,0.001)
plot(x, dgamma(x,a_post_theta_c,b_post_theta_c),type="l",
xlab=expression(theta), ylab = expression( italic( paste("p(", theta, " | ", y[1], ",...,", y[n], ")", sep="") )
),
main = "Posterior Distribution of Theta_c")
#plotting ci
lower=post_ci_theta_c[1]
upper=post_ci_theta_c[2]
upper
## [1] 13.18779
abline(v = lower, col = "red", lty = 2, lwd = 2)
abline(v = upper, col = "red", lty = 2, lwd = 2)
#95% CI for theta_p
a_post_theta_p=a+total_los_ptca
b_post_theta_p=b+n_ptca
post_ci_theta_p=qgamma(c(0.025,0.975),a_post_theta_p,b_post_theta_p)
post_ci_theta_p
## [1] 5.060373 5.263896
#plotting ci of theta_p
x<-seq(4,6,0.0001)
plot(x, dgamma(x,a_post_theta_p,b_post_theta_p),type="l",
xlab=expression(theta), ylab = expression( italic( paste("p(", theta, " | ", y[1], ",...,", y[n], ")", sep="") ) ),
main = "Posterior Distribution of Theta_p")
#plotting ci
lower=post_ci_theta_p[1]
upper=post_ci_theta_p[2]
upper
## [1] 5.263896
abline(v = lower, col = "red", lty = 2, lwd = 2)
abline(v = upper, col = "red", lty = 2, lwd = 2)
#graphing prior vs posterior CABG
# Create range of possible theta values
x = seq(0, 20,0.001)
# Making dataframe2 of prior and post dist
df_1 = data.frame(
x = x,
prior = dgamma(x, a, b),
posterior = dgamma(x, a_post_theta_c,b_post_theta_c)
)
df_2 = data.frame(
x = x,
prior = dgamma(x, a, b),
posterior = dgamma(x, a_post_theta_p,b_post_theta_p)
)
#plotting prior vs post for theta_c
ggplot(df_1, aes(x = x)) +
geom_line(aes(y = prior, color = "Prior"), size = 1.2) +
geom_line(aes(y = posterior, color = "Posterior"), size = 1.2) +
scale_color_manual(values = c("Prior" = "blue", "Posterior" = "red")) +
labs(
title = "Prior vs Posterior Distributions for Theta_c",
x = expression(theta[C]),
y = "Density",
color = "Distribution"
) +
theme_bw()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
#plotting prior vs post for theta_p
ggplot(df_2, aes(x = x)) +
geom_line(aes(y = prior, color = "Prior"), size = 1.2) +
geom_line(aes(y = posterior, color = "Posterior"), size = 1.2) +
scale_color_manual(values = c("Prior" = "blue", "Posterior" = "red")) +
labs(
title = "Prior vs Posterior Distributions for Theta_p",
x = expression(theta[p]),
y = "Density",
color = "Distribution"
) +
theme_bw()
The 95% posterior credible interval of \(\theta_c\) is (12.84254 ,13.18779), we can interpret this as a posteriori , given our prior belief and having observed the data , that there is 95% probability that the true value of \(\theta_c\) lies in the interval (12.84254 ,13.18779) .
Similarly the 95% posterior credible interval of \(\theta_p\) is (5.060373, 5.263896), we can say a posteriori, given our prior belief and observed data, that there is 95% probability that the true value of \(\theta_p\) lies in the interval (5.060373, 5.263896).
Comparing the intervals we can see that the entire 95% credible interval of \(\theta_c\) lies above that of \(\theta_p\) . This is strong evidence to suggest that the average LOS of CABG is longer than the average LOS of PTCA having observed the data and given our prior beliefs.
Also our prior mean of 7 doesn’t lie within either credible interval ,demonstrating how as sample size increases the degree to which our posterior is informed by the observed data increases, dominating our prior belief . and causing the posterior means to be close to our sample means.
It is also worth noting that both posterior credible intervals are significantly narrower ( prior width \(\approx\) 7..78 vs posterior width \(\approx\) 0.35 and \(\approx\) 0.2 for \(\theta_c\) and \(\theta_p\) respectively) of than the prior credible interval , implying the increased confidence in our beliefs having observed a large sample of data.
Question 4:
To estimate the posterior probability that that \(\theta_c\) > \(\theta_p\) using Monte Carlo approximation we draw a S, large number ,of independent samples of \(\theta_c\) and \(\theta_p\) from their respective posterior distributions.
\[ \left\{ \theta_{C}^{(1)}, \ldots, \theta_{C}^{(S)} \right\} \;\sim\; \text{i.i.d. } \mathrm{Gamma}(21835.25,\;1677.75). \]
\[ \left\{ \theta_{p}^{(1)}, \ldots, \theta_{p}^{(S)} \right\} \;\sim\; \text{i.i.d. } \mathrm{Gamma}(9883.25,\;1914.75). \]
By finding the proportion of the sample of \(\theta_c\) is greater than the sample of \(\theta_p\) we can approximate the posterior probability that \(\theta_c\) > \(\theta_p\) via Monte Carlo Approximation because of The Law of Large Numbers.
#drawing the samples of theta_c and theta_p
s=100000
#creates s random values from gamma dist
set.seed(100)
theta_c_sample=rgamma(s,shape = a_post_theta_c,
rate = b_post_theta_c)
theta_p_sample=rgamma(s,shape = a_post_theta_p,
rate=b_post_theta_p)
#compare the samples of theta_c and theta_p
mean(theta_c_sample>theta_p_sample)
## [1] 1
#Check this is right/enough
The Monte Carlo approximation ,given the hospital’s prior belief and the observed data, tells us that there is a 100% (check this) posterior probability that \(\theta_c\) ( the average LOS of CABG) is greater than \(\theta_p\) (the average LOS of PTCA). This very high posterior probability is strong evidence to suggest that \(\theta_c\) > \(\theta_p\) .
Question 5:
To estimate the posterior predictive probability that a future CABG patient will have a longer length of stay than a future PTCA patient using Monte Carlo approximation we use the large independent samples of \(\theta_c\) and \(\theta_p\) that wetook from the posterior distributions in question 4 to sample from the posterior predictive distribution.
#finish/improve this explanation
The posterior predictive distribution is a negative binomial of the
form:
\[
p(\tilde{y}\mid y_1,\ldots,y_n)
= dnbinom\!\left(\tilde{y},\, a + \sum_{i=1}^n y_i,\, b + n\right)
\]
However for large n the posterior predictive distribution is approximately Poisson(\(\bar{y}\)) distributed.
\[ p(\tilde{Y} \mid Y_1, \dots, Y_n) \approx \text{Poisson}(\bar{Y}|\theta) \]
Using the independent samples of \(\theta_c\) and \(\theta_p\) that I obtained in question 4 I can sample \(\{\tilde{Y}_{c,1}, \dots, \tilde{Y}_{c,n}\}\) from Poisson(\(\tilde{Y}_c \mid \theta_c\)) and sample \(\{\tilde{Y}_{p,1}, \dots, \tilde{Y}_{p,n}\}\) from Poisson(\(\tilde{Y}_p \mid \theta_p\)).
By finding the proportion of predicted \(\tilde{y}_C\) > predicted \(\tilde{y}_p\) I can approximate the probability that a future CABG patient will have longer LOS than a future PTCA patient.
#generating samples of y_tilde
y_pred_c=rpois(s,theta_c_sample)
y_pred_p=rpois(s,theta_p_sample)
#compare two samples
mean(y_pred_c>y_pred_p)
## [1] 0.96052
#comment on results
Based on the sample there is 0.96052 probability that a future CABG patient has a longer LOS than a future PTCA patient, given our prior beliefs and observed data
#chats version #pbviously adapt and change as see fit
The Monte Carlo estimate of the posterior predictive probability that a future CABG patient will have a longer length of stay than a future PTCA patient is 0.9602. This indicates that, after accounting for both uncertainty in the Poisson rates and the natural variability in future patient outcomes, there is very strong evidence that a newly arriving CABG patient is more likely to stay in hospital longer than a newly arriving PTCA patient. A probability this high suggests that, under the posterior predictive distribution, the CABG procedure produces longer stays in nearly all simulated future comparisons.”
Question 6: # compare negative binomial ptca to to rpois mc approx
Using a similar method to question 5 , I will use Monte Carlo aprroximation to approximate the probability a future randomly selected PTCA patient will have LOS = 7.
Exact calculation:
As I said in question 5 the posterior predictive distribution of LOS of PTCA is a negative binomial of form:
\[
p(\tilde{y}_{P} \mid y_{P,1}, \ldots, y_{P,n_P})
= dnbinom\!\left(
\tilde{y}_{P},\;
a + \sum_{i=1}^{n_{P}} y_{P,i},\;
b + n_{P}
\right).
\] Therefore I can compute the exact posterior predictive
probability that LOS PTCA = 7 using the dnbinom() function.
exact=dnbinom(7,size=a_post_theta_p,mu=a_post_theta_p/b_post_theta_p)
exact
## [1] 0.1110023
MC Approximation:
Using the Monte Carlo generated sample of the posterior predictive distribution , which is approximated by poisson , \(\{\tilde{Y}_{p,1}, \dots, \tilde{Y}_{p,n}\}\) from Poisson(\(\tilde{Y}_p \mid \theta_p\) ) , I can find approximate the probability by finding the proportion of the sample that is greater than 7.
mc=mean(y_pred_p==7)
mc
## [1] 0.11188
#Improve this comparison
The exact and Monte Carlo approximated posterior predictive probabilities are in a high level of agreement with one and other. Emphasising the accuracy of the Monte Carlo
Question 7:
#model checking
#step 1 pick test statistic from plot of emperical vs post pred
#plot emp vs pred dist
To check that the model is a good fit to the prediction model , I will first graph the emp and pred dist
#plotting the emperical dist. vs post. pred. dist of the los of CABG
#emperical dist
# creating range of the y values in the emperical dist
y_los=0:max(cabg$los)
tbl=table(c(cabg$los,y_los))-1
#The -1 removes the artificial “1” added to each y-value to make sure every value appears in the table.
empirical=tbl/n_cabg
#post pred dist
predictive=dnbinom(y_los,size=a+total_los_cabg,
mu=(a+total_los_cabg)/(b+n_cabg))
# making 1 data frame
y_plot=c(y_los,y_los)
probs=c(predictive, empirical)
group=factor(c(rep("Posterior Predicitive",length(y_los)),
rep("Emperical Distribution",length(y_los)))
)
plot_df=data.frame(y=y_plot,prob=probs,group=group)
ggplot(plot_df,aes(x=factor(y),y=prob,fill=group))+
geom_bar(stat="identity",position = "dodge",width=0.5)+
scale_x_discrete(breaks = seq(0, max(y_los), by = 5))+
labs(title = "Empirical vs Posterior Predictive Distribution (CABG)",x="Length of Stay (days)",
y="Probability")+
theme_bw()
#undefined
#extreme value (>20 days)
Test stat t=prop of values greater than 20 days
t.obs=mean(cabg$los>20)
t.obs
## [1] 0.09307876
Observed test stat is
0.09307876
I will sample 100000 replicated samples of
t.mc=NULL
for(s in 1:100000){
theta_c=rgamma(1,a_post_theta_c,b_post_theta_c)
y1.mc=rpois(n_cabg,theta_c)
t.mc=mean(y1.mc>20)
}
comp=mean(t.mc>t.obs)
comp
## [1] 0
For a well fitting model , the observed test statistic should be a typical outcome generated by the model. Out of the 100000 replicated datasets based on the model, 0 of them had values of t greater than that of our observed data.