Exploratory Data Analysis:

The goal of this is to explore the dataset “cholangitis.csv” - which comes from a randomized, clinical trial of the immunosuppressive drug D-penicillamine at the Mayo Clinic. Patients were randomly given either D-penicillamine or a placebo in a double-blind study. The patients were people living with primary biliary cholangitis, a fatal chronic autoimmune disease of unknown cause affective the liver. There are 418 observations of 20 variables (both numeric and categorical). The study lasted about 12 years. The goal of the project is to fit a linear regression equation to the number of days a patient survives from the time of registration (n_days).

cholangitis <- read.csv("cholangitis.csv", header = TRUE)
cholangitis$sex <- factor(cholangitis$sex)
cholangitis$status <- factor(cholangitis$status)
cholangitis$stage <- factor(cholangitis$stage)
cholangitis$edema <- factor(cholangitis$edema)
cholangitis$spiders <- factor(cholangitis$spiders)
cholangitis$hepatomegaly <- factor(cholangitis$hepatomegaly)
cholangitis$ascites <- factor(cholangitis$ascites)
cholangitis$drug <- factor(cholangitis$drug)
#cut.POSIXct(cholangitis$age, breaks = "years")
head(cholangitis, 5)

To begin the exploring the data, I first run a summary on the number of days survived from time of registration:

summary(cholangitis$n_days)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      41    1093    1730    1918    2614    4795

I found that the minimum days survived was 41 and that the mean was 1918 days survived! I visualized these findings through a histogram:

hist(cholangitis$n_days, main = "Days the Patient Survived From Time of Registration", xlab = "Days", breaks = 40)
abline(v = mean(cholangitis$n_days), lty = "dashed")
abline(v = median(cholangitis$n_days))
legend("topright", legend = c("Median", "Mean"), lty = c("solid", "dashed"))

The histogram gave me a better visualization of the summarized statistics on the number of days survived. You can see that the mean and the median are quite close, indicating that it is a symmetric distribution, however the distribution is left skewed.

As I was playing around with the data, I realized that I want to see the difference in survival between the placebo and the administered drug. So I first created two subset data sets to differentiate the data.

chol_placebo <- subset(cholangitis, drug == "Placebo")
chol_drug <- subset(cholangitis, drug == "D-penicillamine")

Then, I conducted a summary on both of the datasets in order to gain a better understanding of the difference. As you can see, the the survival rate for patients who received the administered drug is higher, as the mean and the median is a little more than for the placebo treated patients.

summary(chol_placebo$n_days)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      51    1153    1811    1997    2771    4523

summary(chol_drug$n_days)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      41    1231    1895    2016    2632    4556

I arranged the data set in descending order just to understand the distribution a little better.

descplac <- arrange(chol_placebo, n_days)
descdrug <- arrange(chol_drug, n_days)

I created histograms of both the placebo and the D-penicillamine patients.

hist(chol_placebo$n_days, main = "Days Survived with Placebo", xlab = "Days Survived", breaks = 60)
abline(v = mean(chol_placebo$n_days), lty = "dashed")
abline(v = median(chol_placebo$n_days))
legend("topright", legend = c("Median", "Mean"), lty = c("solid", "dashed"))

hist(chol_drug$n_days, main = "Days Survived with Drug", xlab = "Days Survived", breaks = 60)
abline(v = mean(chol_drug$n_days), lty = "dashed")
abline(v = median(chol_drug$n_days))
legend("topright", legend = c("Median", "Mean"), lty = c("solid", "dashed"))

As you can see, the distribution of the days survived with the placebo treatment is left-skewed, but the mean and the median are still relatively close to one another. For the days survived with the D-penicillamine, you can see that the mean and median are closer together and the data follows a more normal distribution.

par(mar = c(10, 4.1, 4.1, 0.1))
boxplot(cholangitis$n_days ~ cholangitis$drug,
        main = "Days Survived w/ Drug vs. Placebo", xlab = "", ylab = "Days Survived", las = 3)

I created a boxplot to visualize the days survived with the drug treatment vs. the days survived with the placebo treatment and saw that this did not really visualize the difference as well as the histogram did.

I wanted to explore what other variables are in play to cause different survival rates for the patients. So I explored that by using gpairs() to show the pairwise relationships between edema, drug, ascites, bilirubin, and alk_phos in determining survival.

library(gpairs)
gpairs(droplevels(cholangitis[c("edema", "drug", "ascites", "bilirubin", "alk_phos")]))

I also noticed that the difference in starting age for the patients may have been a confounding variable that affected their survival. So I ran a summary on the age:

summary(cholangitis$age)/365

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   26.30   42.86   51.04   50.78   58.28   78.49

As the age is depicted in number of days, I found that after dividing it by the total amount of days in a year, that the average age for the patients was 50.78 and the minimum was 26.30. The median and mean were also relatively close together as well. However the vast difference between the minimum age and the maximum age alerted me that those might be confounding variables.

Multivariate Regression:

I started by running a summary on the regression equation to number of days survived for the overall chalongitis dataset.

summary(lm(n_days ~., cholangitis))

## 
## Call:
## lm(formula = n_days ~ ., data = cholangitis)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2182.88  -311.10   -35.55   343.92  2089.75 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.510e+03  7.844e+02   4.475 1.11e-05 ***
## id            -7.649e+00  5.263e-01 -14.535  < 2e-16 ***
## statusCL      -5.823e+02  1.678e+02  -3.469 0.000603 ***
## statusD       -1.144e+03  1.083e+02 -10.559  < 2e-16 ***
## drugPlacebo   -6.939e+01  7.803e+01  -0.889 0.374580    
## age           -8.051e-03  1.158e-02  -0.695 0.487604    
## sexM           1.316e+02  1.310e+02   1.004 0.316026    
## ascitesY      -6.190e+01  2.001e+02  -0.309 0.757231    
## hepatomegalyY -4.288e-01  9.133e+01  -0.005 0.996257    
## spidersY      -1.635e+02  9.600e+01  -1.703 0.089593 .  
## edemaS        -2.304e+02  1.412e+02  -1.632 0.103810    
## edemaY        -7.074e+02  2.157e+02  -3.280 0.001169 ** 
## bilirubin     -3.554e+01  1.279e+01  -2.779 0.005812 ** 
## cholesterol   -5.561e-02  2.010e-01  -0.277 0.782258    
## albumin        1.201e+02  1.143e+02   1.050 0.294490    
## copper        -1.531e+00  5.494e-01  -2.787 0.005680 ** 
## alk_phos       4.170e-02  1.974e-02   2.112 0.035524 *  
## sgot           1.119e+00  7.984e-01   1.401 0.162304    
## tryglicerides  7.626e-01  6.805e-01   1.121 0.263407    
## platelets      9.003e-02  4.442e-01   0.203 0.839517    
## prothrombin    1.829e+01  4.735e+01   0.386 0.699533    
## stage2        -6.910e+01  1.910e+02  -0.362 0.717779    
## stage3        -1.788e+02  1.870e+02  -0.956 0.339911    
## stage4        -3.380e+02  2.013e+02  -1.679 0.094267 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 652.8 on 283 degrees of freedom
##   (111 observations deleted due to missingness)
## Multiple R-squared:  0.6904, Adjusted R-squared:  0.6652 
## F-statistic: 27.44 on 23 and 283 DF,  p-value: < 2.2e-16

I found that id, status, the presence of edema despite diuretic therapy (edemaY), bilirubin, copper, and alk_phos were all marked as significant. Thus this led me to think about what variables were really at play in determining survival throughout this study.

I then plotted each variable with the Days survived so that I could better visualize the effect. The ID variable is irrelevant in this case. But you can see especially that the presence of edema showed a drastic difference in the days survived. As the presence of it despite diuretic treatment (the Y on the x axis refers to that) has a significantly lower mean rate for days survived. Thus this indicates that perhaps the patients who were still testing high for edema despite the treatment, which is to directly help with liver failure and primary biliary cholangitis is a condition that affects the liver, that they were already experiencing high resistance to all forms of treatment to combat the disease and that could explain why they were already predisposed to other factors that affected their survival rate.

In the stages boxplot, you can see that patients who were in Stage 1 of the disease had a little lower mean rate for surviving throughout the course of treatment.

for (i in 1:20) {
    plot(cholangitis[, i], cholangitis[, 1], xlab = names(cholangitis)[i],
        ylab = "Days Survived")
}

I created a regression equation to number of days survived with just the variables of whether they received the drug treatment and their survival rate. I wanted to still explore whether the drug treatment was affecting survival, although other variables like the presence of enema had been showing more of a causal relationship with survival. I fitted both that model with just the overal cholangitis model and ran the function anova to compare. The F value, test statistic, is 0.0266 for the drug treatment, and as that is lower than .05, that indicates the difference between the drug treatment and the placebo treatment is statistically relevant. The F value of 35.9470 for the status of the patients indicates that there is a statistically relevant relationship with the status of the patients and their survival rate.

model0 <- lm(n_days ~ drug + status, data = cholangitis)
anova(model0, cholangitis)

## Warning in anova.lmlist(object, ...): models with response '"NULL"' removed
## because response differs from model 1

I then did the same process but with a model that accounted for edema, ascites, bilirubin, and alk_phos. I could see from this anova()process that all were deemed statistically significant due to the fact that the F-values were all above 0.05. However, I could also see that ascites had the lowest F-value out of the four. And that edema, al_phos, and bilirubin were all marked as significant codes, representing the fact that they are very signficant.

m0 <- lm(n_days ~ edema + ascites + bilirubin + alk_phos, data = cholangitis)
anova(m0, cholangitis)

## Warning in anova.lmlist(object, ...): models with response '"NULL"' removed
## because response differs from model 1

Variable Selection:

Edema and status are both significantly correlated with number of days survived however as they have more than 2 levels, I could not use them as variables due to the regsubsets. Thus, I filtered those two variables, as well as stage (due to it having more than 2 levels) and id as patient identification was not relevant to determining the effect of D-penicillamine on the number of days a patient survived. I created “filterchol” to reflect that subset of the cholangitis data set.

filterchol <- subset(cholangitis, select = -c(id, status, stage, edema))

library(leaps)
Days <- regsubsets(n_days ~., filterchol)
summary(Days)

## Subset selection object
## Call: regsubsets.formula(n_days ~ ., filterchol)
## 15 Variables  (and intercept)
##               Forced in Forced out
## drugPlacebo       FALSE      FALSE
## age               FALSE      FALSE
## sexM              FALSE      FALSE
## ascitesY          FALSE      FALSE
## hepatomegalyY     FALSE      FALSE
## spidersY          FALSE      FALSE
## bilirubin         FALSE      FALSE
## cholesterol       FALSE      FALSE
## albumin           FALSE      FALSE
## copper            FALSE      FALSE
## alk_phos          FALSE      FALSE
## sgot              FALSE      FALSE
## tryglicerides     FALSE      FALSE
## platelets         FALSE      FALSE
## prothrombin       FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
##          drugPlacebo age sexM ascitesY hepatomegalyY spidersY bilirubin
## 1  ( 1 ) " "         " " " "  " "      " "           " "      "*"      
## 2  ( 1 ) " "         " " " "  " "      " "           " "      "*"      
## 3  ( 1 ) " "         " " " "  " "      " "           " "      "*"      
## 4  ( 1 ) " "         " " " "  " "      " "           " "      "*"      
## 5  ( 1 ) " "         " " " "  " "      "*"           " "      "*"      
## 6  ( 1 ) " "         " " " "  "*"      "*"           " "      "*"      
## 7  ( 1 ) " "         " " " "  "*"      "*"           "*"      "*"      
## 8  ( 1 ) " "         " " " "  "*"      "*"           "*"      "*"      
##          cholesterol albumin copper alk_phos sgot tryglicerides platelets
## 1  ( 1 ) " "         " "     " "    " "      " "  " "           " "      
## 2  ( 1 ) " "         "*"     " "    " "      " "  " "           " "      
## 3  ( 1 ) " "         "*"     " "    "*"      " "  " "           " "      
## 4  ( 1 ) " "         "*"     "*"    "*"      " "  " "           " "      
## 5  ( 1 ) " "         "*"     "*"    "*"      " "  " "           " "      
## 6  ( 1 ) " "         "*"     "*"    "*"      " "  " "           " "      
## 7  ( 1 ) " "         "*"     "*"    "*"      " "  " "           " "      
## 8  ( 1 ) " "         "*"     "*"    "*"      " "  " "           " "      
##          prothrombin
## 1  ( 1 ) " "        
## 2  ( 1 ) " "        
## 3  ( 1 ) " "        
## 4  ( 1 ) " "        
## 5  ( 1 ) " "        
## 6  ( 1 ) " "        
## 7  ( 1 ) " "        
## 8  ( 1 ) "*"

From this, we can interpret that the best model with one explanatory variable is the model with bilirubin. Then the next best model with two explanatory variables is with bilirubin and albumin. Then the next best model with three explanatory variables is bilirubin, albumin, and alk_phos. And a regression model involving all explanatory variables would be one with ascites, hepatomegaly, spiders, bilirubin, albumin, copper, alk_phos, and prothrombin.

I then wanted to estimate the error by dividing the data into training and test data. I wanted to cross-validate:

set.seed(78912)
perm <- sample(1:nrow(cholangitis))
fold <- cut(1:nrow(cholangitis), breaks = 10, labels = FALSE)
predictedError <- matrix(nrow=10, ncol = nrow(summary(Days)$which))

for (i in 1:10) {
  testIndex <- which(fold == i, arr.ind = TRUE)
  testData <- cholangitis[perm, ][testIndex, ]
  trainData <- cholangitis[perm, ][-testIndex, 
                                   ]
  predError <- apply(summary(Days)$which[, -1], 1,
                     function(x) {
                       lmo <- lm(trainData$n_days ~ ., data = trainData[, -1][, x, drop = FALSE])
                       testp <- predict(lmo, newdata = testData[, -1])
                       mean((testData$n_days - testp)^2)
                     })
  predictedError[i, ] <- predError
}
predictedError

##            [,1]      [,2]      [,3]      [,4]      [,5]      [,6]      [,7]
##  [1,]  909308.9  837028.5  798926.8  793818.8  798200.0  801812.7  814734.2
##  [2,] 1049281.8 1125086.1        NA        NA        NA        NA        NA
##  [3,] 1001586.5  960352.6 1088297.6  977583.9  984788.3  900768.0  904466.4
##  [4,] 1032484.5  991266.4  887784.4  889415.1  896309.0  817255.5  817529.8
##  [5,] 1431103.3 1386137.5 1346085.8 1204384.0 1204415.8 1122102.2 1134409.0
##  [6,] 1757563.7 1450824.6        NA        NA        NA        NA        NA
##  [7,]  986448.7  968069.8  947978.6  768928.8  770096.7  613946.2  615612.6
##  [8,] 1066452.6 1031761.0        NA        NA        NA        NA        NA
##  [9,] 1378692.6 1212215.8 1136581.8 1079726.3 1090844.0 1018659.9 1018648.3
## [10,] 1291741.3 1031049.1        NA        NA        NA        NA        NA
##            [,8]
##  [1,]  824201.3
##  [2,]        NA
##  [3,]  916807.1
##  [4,]  814058.4
##  [5,] 1132877.9
##  [6,]        NA
##  [7,]  627777.3
##  [8,]        NA
##  [9,] 1037853.0
## [10,]        NA

Then I averaged the estimates: The high variance in the estimates proves to me that I overfit the model by using all of the variables selected in the subset process above.

colMeans(predictedError)

## [1] 1190466 1099379      NA      NA      NA      NA      NA      NA

Thus, I fit a regression line with the variables outlined above from the regsubsets().I chose to exclude prothrombin due to is lesser causal relationship with number of days survived. This was also due to the overfitting of my regression model as checked prior with the cross-validation.

regression <- (lm(n_days ~ ascites + hepatomegaly + spiders + bilirubin + albumin + copper + alk_phos + drug, data = filterchol))
regression

## 
## Call:
## lm(formula = n_days ~ ascites + hepatomegaly + spiders + bilirubin + 
##     albumin + copper + alk_phos + drug, data = filterchol)
## 
## Coefficients:
##   (Intercept)       ascitesY  hepatomegalyY       spidersY      bilirubin  
##     -258.4863      -205.2065      -218.9467      -151.1010       -57.7247  
##       albumin         copper       alk_phos    drugPlacebo  
##      730.4442        -2.3995         0.1325        49.0001

I then run diagnostics on the regression model:

plot(regression)

I found from the regression diagnostics that: in the residuals vs. fitted plot, that the residuals (red line) follow a roughly linear pattern thus a linear regression model is appropriate! From the Normal Q-Q plot we see that the residuals are normally distributed as they follow the straight diagonal line. Thus, we can deduce that they are normally distributed! With the exception of 60, 325, and 87 as outliers. However they do not steer far enough away from the line to be deemed not normally distributed. From the Scale-Location model we can see that the red line is roughly horizontal, thus homoscedasticity is followed. Implying that the assumption of equal variance is met. From the Residuals vs. Leverage model, we can see that there are no points that fall outside Cook’s distance as we cannot even see the dashed red lines in the model. This implies that our data is evenly distributed and there are no overly influential points!

Some Overarching Conclusions:

From this process and through the multiple different forms of diagnostics, I can conclude that D-penicillamine had a causal effect on the number of days a patient survived. After taking into consideration the fact that many patients were at different onsets of the illnesses and that they were a variety of ages at the start of the trial, I can still conclude that D-penicillamine positively increased patient survival rates. I also learned that there were other factors in the patient that contributed to their survival rate. Those factors being the presence of ascites, the presence of hepatomegaly, the presence of spier angiomas, the seum billirubin levels, the albumin levels, the urine copper levels, and the alkaline phosphatase levels. I assume that the drug D-penicillamine could have had an effect on those variables and thus I conclude that D-penicillamine works in positively affecting the patient’s survival rate.

Cholangitis Data Analysis and Regression

Margad Bayarmagnai

5/1/2023

Exploratory Data Analysis:

Multivariate Regression:

Variable Selection:

Some Overarching Conclusions: