Primary biliary cholangitis is a chronic autoimmune disease in which the liver’s bile ducts become inflamed and, over time, are destroyed (National Institute of Diabetes and Digestive and Kidney Diseases, 2021). Complications include high blood cholesterol levels, osteoporosis, vitamin deficiencies, cirrhosis, portal hypertension, liver failure, and liver cancer (National Institute of Diabetes and Digestive and Kidney Diseases, 2021). In the United States, PBC affects about 15 out of every 100,000 men and 58 out of every 100,000 women (National Institute of Diabetes and Digestive and Kidney Diseases, 2021). Specifically, those at risk are people who are middle aged, of northern European descent, or have a relative with PBC (National Institute of Diabetes and Digestive and Kidney Diseases, 2021). Additionally, environmental factors such as infections, smoking cigarettes, or exposure to toxic chemicals may increase risk (Mayo Clinic, 2023).
Question 2
Survival analysis is a procedure used to analyze the time until an event of interest occurs (Columbia University Mailman School of Public Health, n.d.). Specifically, it requires estimating survival functions, comparison of treatments, and estimating effects of explanatory of variables in regards to the failure time (Dey et al., 2020). These methods are often used in medical studies, such as time to death for cancer patients (Clark, et al., 2003). For example, Clark et al. (2003) cite a survival analysis study on ovarian cancer, in which mortality and relapse times were tracked. A similar example is a study on lung cancer, where the event of interest is the same as the aforementioned study, but the data is grouped based on treatment (Clark et al., 2003). In either case, survival analysis proves to be a powerful method in understanding mortality rates, relapse-free survival, and the effectiveness of treatments. In the following survival analysis, I will be examining a randomized clinical trial from the Mayo Clinic in which patients with primary biliary cholangitis were either assigned D-penicillamine, a placebo, or no treatment. I will model the number of days a patient survives to gain insight into the length of survival and effectiveness of D-penicillamine.
First steps
Question 1
Read in csv file. Ensure categorical variables (status, edema, stage) are factors with correct levels.
# create "cholangitis" data framecholangitis =read.csv("~/Stat131Public/project/cholangitis.csv") # set levels for status, edema, and stage according to dictionary# status: C (not dead), CL (received liver transplant), or D (died)# edema: N (no edema and no diuretic therapy for edema), S (edema present# without diuretics, or edema resolved by diuretics), or Y (edema despite# diuretic therapy)# stage: histologic stage of disease 1, 2, 3, or 4cholangitis = cholangitis %>%mutate(status =factor(status, levels =c("C", "CL", "D")),edema =factor(edema, levels =c("N", "S", "Y")),stage =factor(stage))head(cholangitis)
id n_days status drug age sex ascites hepatomegaly spiders edema
1 1 400 D D-penicillamine 21464 F Y Y Y Y
2 2 4500 C D-penicillamine 20617 F N Y Y N
3 3 1012 D D-penicillamine 25594 M N N N S
4 4 1925 D D-penicillamine 19994 F N Y Y S
5 5 1504 CL Placebo 13918 F N Y Y N
6 6 2503 D Placebo 24201 F N Y N N
bilirubin cholesterol albumin copper alk_phos sgot tryglicerides platelets
1 14.5 261 2.60 156 1718.0 137.95 172 190
2 1.1 302 4.14 54 7394.8 113.52 88 221
3 1.4 176 3.48 210 516.0 96.10 55 151
4 1.8 244 2.54 64 6121.8 60.63 92 183
5 3.4 279 3.53 143 671.0 113.15 72 136
6 0.8 248 3.98 50 944.0 93.00 63 361
prothrombin stage
1 12.2 4
2 10.6 3
3 12.0 4
4 10.3 4
5 10.9 3
6 11.0 3
The cholangitis data set has 20 variables for the 418 patients that participated in the study.
Question 2
Examine missing values within “cholangitis” data frame.
rows_and_columns_with_na <- cholangitis[!complete.cases(cholangitis), colSums(is.na(cholangitis)) >0, drop =FALSE]head(rows_and_columns_with_na)
drug cholesterol tryglicerides
41 D-penicillamine NA NA
106 D-penicillamine NA NA
146 Placebo NA NA
178 D-penicillamine NA NA
190 D-penicillamine NA NA
313 <NA> 374 168
Missing values only exist for drug, cholesterol, and tryglicerides.
Fill in missing values because data frame has relatively few observations, and approximately 25% of observations contain missing values.
# fill in missing values for cholesterol and tryglicerides with meanscholangitis <- cholangitis %>%mutate(cholesterol =ifelse(is.na(cholesterol), mean(cholangitis$cholesterol, na.rm =TRUE), cholesterol),tryglicerides =ifelse(is.na(tryglicerides), mean(cholangitis$tryglicerides, na.rm =TRUE), tryglicerides))# replace missing values randomly with "No treatment"cholangitis$drug <-ifelse(is.na(cholangitis$drug), "No treatment", cholangitis$drug)
The cholangitis data frame no longer contains missing values. For cholesterol and tryglicerides, mean was used. For patients who received no drug, “No treatment” has been assigned.
Create heatmap to visualize correlation between all numeric variables.
The highest correlations are found between n_days and albumin, sgot and bilirubin, and copper and bilirubin.
Create pairs plot to compare correlation and view linearity between variables.
# create pairs plotcholangitis %>%select(status, n_days, age, bilirubin, cholesterol, albumin, copper, alk_phos, sgot, tryglicerides, platelets, prothrombin) %>%ggpairs(aes(fill=status)) # fill based on categorical variable status
The pairs plot reveals several valuable insights. For instance, n_days is shown to be higher for patients who survived than those who died, suggesting status D’s survival time is low. Additionally, their age tends to be higher, and they share higher correlation with variables such as bilirubin, copper, and prothrombin.
Using data from the pairs plot, filter out observations that appear to be outliers. There appear to be a significant amount of outliers for bilirubin, cholesterol, copper, and alk_phos.
# create a box plotggplot(data = cholangitis, aes(x = bilirubin)) +geom_boxplot() +labs(title ="Boxplot of bilirubin")
Observations have been filtered for bilirubin, cholesterol, copper, and alk_phos.
Create density plot to visualize distribution of n_days per treatment.
# create density plotcholangitis %>%ggplot(mapping =aes(x = n_days,color = drug)) +geom_density(kernel ="gaussian", mapping =aes(y =after_stat(density))) +labs(x ="Days between registration and earlier of: death, transplantation, or end of study",y ="Density",title ="No treatment has significantly lower number of survival days",subtitle ="Estimated Probability Density") +theme_classic()
No treatment has significantly lower number of survival days. D-penicillamine and the placebo share similar distributions.
Create density plot to visualize distribution of n_days and sex.
# create density plotcholangitis %>%ggplot(mapping =aes(x = n_days,color = sex)) +geom_density(kernel ="gaussian", mapping =aes(y =after_stat(density))) +labs(x ="Days between registration and earlier of: death, transplantation, or end of study",y ="Density",title ="Distribution for both sexes is fairly similar",subtitle ="Estimated Probability Density") +theme_classic()
Number of survival days for both sexes is fairly similar.
Create density plot to visualize distribution of n_days and ascites.
# create density plotcholangitis %>%ggplot(mapping =aes(x = n_days,color = ascites)) +geom_density(kernel ="gaussian", mapping =aes(y =after_stat(density))) +labs(x ="Days between registration and earlier of: death, transplantation, or end of study",y ="Density",title ="Presence of ascites associated with low number of days",subtitle ="Estimated Probability Density") +theme_classic()
Presence of ascites is correlated with low number of days. No presence has right skew.
Create density plot to visualize distribution of n_days and hepatomegaly.
# create density plotcholangitis %>%ggplot(mapping =aes(x = n_days,color = hepatomegaly)) +geom_density(kernel ="gaussian", mapping =aes(y =after_stat(density))) +labs(x ="Days between registration and earlier of: death, transplantation, or end of study",y ="Density",title ="Distributions of hepatomegaly are fairly similar",subtitle ="Estimated Probability Density") +theme_classic()
The distributions are fairly similar. No presence of hepatomegaly has two peaks around 1000 days and 2750 days.
Create density plot to visualize distribution of n_days and spiders.
# create density plotcholangitis %>%ggplot(mapping =aes(x = n_days,color = spiders)) +geom_density(kernel ="gaussian", mapping =aes(y =after_stat(density))) +labs(x ="Days between registration and earlier of: death, transplantation, or end of study",y ="Density",title ="Presence of spider angiomas distributions are fairly similar",subtitle ="Estimated Probability Density") +theme_classic()
Whether or not a patient had the presence of spider angiomas or not share similar distributions.
Create density plot to visualize distribution of n_days and edema.
# create density plotcholangitis %>%ggplot(mapping =aes(x = n_days,color = edema)) +geom_density(kernel ="gaussian", mapping =aes(y =after_stat(density))) +labs(x ="Days between registration and earlier of: death, transplantation, or end of study",y ="Density",title ="Edema shows high density for low number of days",subtitle ="Estimated Probability Density") +theme_classic()
Presence of edema shows high density for low number of days. No presence of edema peaks around 1300, while the final category for patients who had edema present without diuretics, or edema resolved by diuretics peaks around 1000.
Create density plot of n_days and stage.
# create density plotcholangitis %>%ggplot(mapping =aes(x = n_days,color = stage)) +geom_density(kernel ="gaussian", mapping =aes(y =after_stat(density))) +labs(x ="Days between registration and earlier of: death, transplantation, or end of study",y ="Density",title ="Stage 1 and 2, and Stage 3 and 4 are similar respectively",subtitle ="Estimated Probability Density") +theme_classic()
Stages 1 and 2 share similar distributions and peak around 2800 days. Stages 3 and 4 share similar distributions as well, but peak at 1000 and 1500 days respectively.
Create stacked bar chart to compare associations between stage and status.
# create stacked bar chartggplot(data = cholangitis, mapping =aes(x = stage, fill = status)) +geom_bar(color ="black", position ="fill") +labs(title ="As stage progresses, patients who survive decrease.",y ="Proportion")
As stage progresses, patients who survive decrease while patients who die increase. Those who survive with a liver transplant are fairly consistent between stages 2, 3, and 4.
Explanatory Modeling
Question 3
Perform a regression analysis of number of days on all explanatory variables. Exclude “id” since it is only used for patient identification.
A linear model is fit to all variables excluding “id”. The output shows the estimates for each variable. The model is fit to 380 observations as a result of filtering out outliers.
Question 4
Create residuals vs. fitted plot to assess variance.
# create residuals vs. fitted plotcholangitis_res = cholangitis_filteredcholangitis_res$residuals <-residuals(linear_model) # obtain residualscholangitis_res$fitted_values <-fitted(linear_model) # obtain fitted values# create scatterplotggplot(cholangitis_res, aes(x = fitted_values, y = residuals)) +geom_point(alpha =0.5) +geom_hline(yintercept =0, linetype ="dashed", color ="red") +labs(x ="Fitted Values", y ="Residuals", title ="Residuals vs Fitted Plot")
Residuals for smaller values tends to be smaller, but data generally does not show a pattern. Assumptions are slightly violated.
The scale plot shows a slightly increase trend. However, the pattern is small and the data appears to be random, so linearity and constant variance are only slightly violated.
Question 5
Holding all other variables constant, among patients who received D-penicillamine, a placebo, or no treatment, we expect those who received D-penicillamine to survive about 311 days more than the other two groups.
Predictive Modeling
Question 6
Create training split.
set.seed(131) # setting a seed for reproducibilitycholangitis = cholangitis %>%select(-c("id"))# create data splitdata_split <-initial_split(cholangitis, prop =0.7)# create train and test datacholangitis_train <-training(data_split)cholangitis_test <-testing(data_split)
Training and testing sets created for prediction. I chose a 0.7 split since the cholangitis data set is relatively small, and I want to ensure the training set has enough observations to convey the general trend of data.
Question 7
Use regression subsets to find which explanatory variables to include.
The model’s RMSE is 1084.367. It uses variables bilirubin and albumin.
Fit random forest.
# fit random forestrandom_forest <-randomForest(formula = n_days ~ ., data = cholangitis_train,cutoff =c(0.5, 0.5),importance =TRUE)# view which variables are importantvarImpPlot(random_forest, main ="Variable Importance Plot",pch =16)
Important variables appear to be albumin, bilirubin, and prothrombin.
Calculate RMSE based on random forest model.
# Predict using the random forest modelforest_test_predictions <-predict(random_forest, newdata = cholangitis_test)# Calculate RMSEforest_rmse <-sqrt(mean((forest_test_predictions - cholangitis_test$n_days)^2))print(forest_rmse)
[1] 867.1492
The model’s RMSE is 867.1492.
Question 9
Create markdown table for the three models created.
# create data frame for markdown tablemarkdown_data =data.frame(Model =c("Linear Regression", "Regression Trees", "Random Forest"),RMSE =c(model_8_RMSE, acc_test, forest_rmse))# print markdown tablekable(markdown_data, format ="markdown")
Model
RMSE
Linear Regression
851.7943
Regression Trees
1084.3667
Random Forest
867.1492
Using the RMSE as my testing metric, I will use my linear regression model for predictive purpose, as it has the lowest RMSE.
Next Steps
Question 1
One aspect of my project I believe can be improved is my filtering of my data. I chose to filter out outliers based on quantiles, but I could have done a more thorough job in selecting which outliers to exclude since the data set is fairly small.
Question 2
Based on my work for this project, I am interested in further exploring the relationship of bilirubin and albumin to number of days. In all my predictive models, bilirubin and albumin were considered important variables, so I am interested in closer examining their impact. Additionally, I am curious about the effectiveness of treatment on patients at different stages of the disease. Performing analysis and comparing the number of days based on that could reveal insight on treatment effectiveness for different groups.