Cholangitis Analysis

Author

Emma Yosanovich

Background

Question 1

Primary biliary cholangitis is a chronic autoimmune disease in which the liver’s bile ducts become inflamed and, over time, are destroyed (National Institute of Diabetes and Digestive and Kidney Diseases, 2021). Complications include high blood cholesterol levels, osteoporosis, vitamin deficiencies, cirrhosis, portal hypertension, liver failure, and liver cancer (National Institute of Diabetes and Digestive and Kidney Diseases, 2021). In the United States, PBC affects about 15 out of every 100,000 men and 58 out of every 100,000 women (National Institute of Diabetes and Digestive and Kidney Diseases, 2021). Specifically, those at risk are people who are middle aged, of northern European descent, or have a relative with PBC (National Institute of Diabetes and Digestive and Kidney Diseases, 2021). Additionally, environmental factors such as infections, smoking cigarettes, or exposure to toxic chemicals may increase risk (Mayo Clinic, 2023).

Question 2

Survival analysis is a procedure used to analyze the time until an event of interest occurs (Columbia University Mailman School of Public Health, n.d.). Specifically, it requires estimating survival functions, comparison of treatments, and estimating effects of explanatory of variables in regards to the failure time (Dey et al., 2020). These methods are often used in medical studies, such as time to death for cancer patients (Clark, et al., 2003). For example, Clark et al. (2003) cite a survival analysis study on ovarian cancer, in which mortality and relapse times were tracked. A similar example is a study on lung cancer, where the event of interest is the same as the aforementioned study, but the data is grouped based on treatment (Clark et al., 2003). In either case, survival analysis proves to be a powerful method in understanding mortality rates, relapse-free survival, and the effectiveness of treatments. In the following survival analysis, I will be examining a randomized clinical trial from the Mayo Clinic in which patients with primary biliary cholangitis were either assigned D-penicillamine, a placebo, or no treatment. I will model the number of days a patient survives to gain insight into the length of survival and effectiveness of D-penicillamine.

First steps

Question 1

Read in csv file. Ensure categorical variables (status, edema, stage) are factors with correct levels.

# create "cholangitis" data frame
cholangitis = read.csv("~/Stat131Public/project/cholangitis.csv") 

# set levels for status, edema, and stage according to dictionary
# status: C (not dead), CL (received liver transplant), or D (died)
# edema: N (no edema and no diuretic therapy for edema), S (edema present
# without diuretics, or edema resolved by diuretics), or Y (edema despite
# diuretic therapy)
# stage: histologic stage of disease 1, 2, 3, or 4
cholangitis = cholangitis %>%
  mutate(status = factor(status, levels = c("C", "CL", "D")),
         edema = factor(edema, levels = c("N", "S", "Y")),
         stage = factor(stage))

head(cholangitis)
  id n_days status            drug   age sex ascites hepatomegaly spiders edema
1  1    400      D D-penicillamine 21464   F       Y            Y       Y     Y
2  2   4500      C D-penicillamine 20617   F       N            Y       Y     N
3  3   1012      D D-penicillamine 25594   M       N            N       N     S
4  4   1925      D D-penicillamine 19994   F       N            Y       Y     S
5  5   1504     CL         Placebo 13918   F       N            Y       Y     N
6  6   2503      D         Placebo 24201   F       N            Y       N     N
  bilirubin cholesterol albumin copper alk_phos   sgot tryglicerides platelets
1      14.5         261    2.60    156   1718.0 137.95           172       190
2       1.1         302    4.14     54   7394.8 113.52            88       221
3       1.4         176    3.48    210    516.0  96.10            55       151
4       1.8         244    2.54     64   6121.8  60.63            92       183
5       3.4         279    3.53    143    671.0 113.15            72       136
6       0.8         248    3.98     50    944.0  93.00            63       361
  prothrombin stage
1        12.2     4
2        10.6     3
3        12.0     4
4        10.3     4
5        10.9     3
6        11.0     3

The cholangitis data set has 20 variables for the 418 patients that participated in the study.

Question 2

Examine missing values within “cholangitis” data frame.

rows_and_columns_with_na <- cholangitis[!complete.cases(cholangitis), 
                                        colSums(is.na(cholangitis)) > 0, 
                                        drop = FALSE]
head(rows_and_columns_with_na)
               drug cholesterol tryglicerides
41  D-penicillamine          NA            NA
106 D-penicillamine          NA            NA
146         Placebo          NA            NA
178 D-penicillamine          NA            NA
190 D-penicillamine          NA            NA
313            <NA>         374           168

Missing values only exist for drug, cholesterol, and tryglicerides.

Fill in missing values because data frame has relatively few observations, and approximately 25% of observations contain missing values.

# fill in missing values for cholesterol and tryglicerides with means
cholangitis <- cholangitis %>%
  mutate(cholesterol = ifelse(is.na(cholesterol), 
                              mean(cholangitis$cholesterol, 
                                   na.rm = TRUE), cholesterol),
         tryglicerides = ifelse(is.na(tryglicerides), 
                                mean(cholangitis$tryglicerides, 
                                     na.rm = TRUE), tryglicerides))

# replace missing values randomly with "No treatment"
cholangitis$drug <- ifelse(is.na(cholangitis$drug), "No treatment", 
                           cholangitis$drug)

The cholangitis data frame no longer contains missing values. For cholesterol and tryglicerides, mean was used. For patients who received no drug, “No treatment” has been assigned.

Create heatmap to visualize correlation between all numeric variables.

# create data frame of numeric data
cholangitis_numeric = cholangitis %>%
  select(n_days, age, bilirubin, cholesterol, albumin, copper, alk_phos, sgot, 
         tryglicerides, platelets, prothrombin)

# create heatmap
as.data.frame(cor(cholangitis_numeric)) %>%
    rownames_to_column("Variables_1") %>%
  pivot_longer(-c(Variables_1), names_to = "Variables_2", 
               values_to = "Correlation") %>%
  ggplot(mapping = aes(x = Variables_1, y = Variables_2)) +
  geom_tile(aes(fill = Correlation)) +
  scale_fill_gradient(low = "black", high = "red") +
  theme_classic() +
  theme(axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

The highest correlations are found between n_days and albumin, sgot and bilirubin, and copper and bilirubin.

Create pairs plot to compare correlation and view linearity between variables.

# create pairs plot
cholangitis %>%
    select(status, n_days, age, bilirubin, cholesterol, albumin, copper, 
           alk_phos, sgot, tryglicerides, platelets, prothrombin) %>%
    ggpairs(aes(fill=status)) # fill based on categorical variable status

The pairs plot reveals several valuable insights. For instance, n_days is shown to be higher for patients who survived than those who died, suggesting status D’s survival time is low. Additionally, their age tends to be higher, and they share higher correlation with variables such as bilirubin, copper, and prothrombin.

Using data from the pairs plot, filter out observations that appear to be outliers. There appear to be a significant amount of outliers for bilirubin, cholesterol, copper, and alk_phos.

# create a box plot
ggplot(data = cholangitis, aes(x = bilirubin)) +
  geom_boxplot() +
  labs(title = "Boxplot of bilirubin")

# filter out outliers
Q1 <- quantile(cholangitis$bilirubin, 0.25)
Q3 <- quantile(cholangitis$bilirubin, 0.75)

IQR_value <- Q3 - Q1

lower_bound <- Q1 - 1.5 * IQR_value
upper_bound <- Q3 + 1.5 * IQR_value

cholangitis_filter <- cholangitis[cholangitis$bilirubin >= lower_bound & 
                                cholangitis$bilirubin <= upper_bound,]

ggplot(data = cholangitis_filter, aes(x = bilirubin)) +
  geom_boxplot() +
  labs(title = "Boxplot of bilirubin without Outliers")

# create a box plot
ggplot(data = cholangitis, aes(x = cholesterol)) +
  geom_boxplot() +
  labs(title = "Boxplot of cholesterol")

# filter out outliers
Q1 <- quantile(cholangitis$cholesterol, 0.25)
Q3 <- quantile(cholangitis$cholesterol, 0.75)

IQR_value <- Q3 - Q1

lower_bound <- Q1 - 1.5 * IQR_value
upper_bound <- Q3 + 1.5 * IQR_value

cholangitis_filter <- cholangitis[cholangitis$cholesterol >= lower_bound & 
                                cholangitis$cholesterol <= upper_bound,]

ggplot(data = cholangitis_filter, aes(x = cholesterol)) +
  geom_boxplot() +
  labs(title = "Boxplot of cholesterol without Outliers")

# create a box plot
ggplot(data = cholangitis, aes(x = copper)) +
  geom_boxplot() +
  labs(title = "Boxplot of copper")

# filter out outliers
Q1 <- quantile(cholangitis$copper, 0.25)
Q3 <- quantile(cholangitis$copper, 0.75)

IQR_value <- Q3 - Q1

lower_bound <- Q1 - 1.5 * IQR_value
upper_bound <- Q3 + 1.5 * IQR_value

cholangitis_filter <- cholangitis[cholangitis$copper >= lower_bound & 
                                cholangitis$copper <= upper_bound,]

ggplot(data = cholangitis_filter, aes(x = copper)) +
  geom_boxplot() +
  labs(title = "Boxplot of copper without Outliers")

# create a box plot
ggplot(data = cholangitis, aes(x = alk_phos)) +
  geom_boxplot() +
  labs(title = "Boxplot of alkaline phosphatase")

# filter out outliers
Q1 <- quantile(cholangitis$alk_phos, 0.25)
Q3 <- quantile(cholangitis$alk_phos, 0.75)

IQR_value <- Q3 - Q1

lower_bound <- Q1 - 1.5 * IQR_value
upper_bound <- Q3 + 1.5 * IQR_value

cholangitis_filter <- cholangitis[cholangitis$alk_phos >= lower_bound & 
                                cholangitis$alk_phos <= upper_bound,]

ggplot(data = cholangitis_filter, aes(x = alk_phos)) +
  geom_boxplot() +
  labs(title = "Boxplot of alkaline phosphatase without Outliers")

Observations have been filtered for bilirubin, cholesterol, copper, and alk_phos.

Create density plot to visualize distribution of n_days per treatment.

# create density plot
cholangitis %>%
  ggplot(mapping = aes(x = n_days,
                       color = drug)) +
  geom_density(kernel = "gaussian", 
               mapping = aes(y = after_stat(density))) +
  labs(x = "Days between registration and earlier of: death, transplantation,
       or end of study",
       y = "Density",
       title = "No treatment has significantly lower number of survival days",
       subtitle = "Estimated Probability Density") +
  theme_classic()

No treatment has significantly lower number of survival days. D-penicillamine and the placebo share similar distributions.

Create density plot to visualize distribution of n_days and sex.

# create density plot
cholangitis %>%
  ggplot(mapping = aes(x = n_days,
                       color = sex)) +
  geom_density(kernel = "gaussian", 
               mapping = aes(y = after_stat(density))) +
  labs(x = "Days between registration and earlier of: death, transplantation,
       or end of study",
       y = "Density",
       title = "Distribution for both sexes is fairly similar",
       subtitle = "Estimated Probability Density") +
  theme_classic()

Number of survival days for both sexes is fairly similar.

Create density plot to visualize distribution of n_days and ascites.

# create density plot
cholangitis %>%
  ggplot(mapping = aes(x = n_days,
                       color = ascites)) +
  geom_density(kernel = "gaussian", 
               mapping = aes(y = after_stat(density))) +
  labs(x = "Days between registration and earlier of: death, transplantation,
       or end of study",
       y = "Density",
       title = "Presence of ascites associated with low number of days",
       subtitle = "Estimated Probability Density") +
  theme_classic()

Presence of ascites is correlated with low number of days. No presence has right skew.

Create density plot to visualize distribution of n_days and hepatomegaly.

# create density plot
cholangitis %>%
  ggplot(mapping = aes(x = n_days,
                       color = hepatomegaly)) +
  geom_density(kernel = "gaussian", 
               mapping = aes(y = after_stat(density))) +
  labs(x = "Days between registration and earlier of: death, transplantation,
       or end of study",
       y = "Density",
       title = "Distributions of hepatomegaly are fairly similar",
       subtitle = "Estimated Probability Density") +
  theme_classic()

The distributions are fairly similar. No presence of hepatomegaly has two peaks around 1000 days and 2750 days.

Create density plot to visualize distribution of n_days and spiders.

# create density plot
cholangitis %>%
  ggplot(mapping = aes(x = n_days,
                       color = spiders)) +
  geom_density(kernel = "gaussian", 
               mapping = aes(y = after_stat(density))) +
  labs(x = "Days between registration and earlier of: death, transplantation,
       or end of study",
       y = "Density",
       title = "Presence of spider angiomas distributions are fairly similar",
       subtitle = "Estimated Probability Density") +
  theme_classic()

Whether or not a patient had the presence of spider angiomas or not share similar distributions.

Create density plot to visualize distribution of n_days and edema.

# create density plot
cholangitis %>%
  ggplot(mapping = aes(x = n_days,
                       color = edema)) +
  geom_density(kernel = "gaussian", 
               mapping = aes(y = after_stat(density))) +
  labs(x = "Days between registration and earlier of: death, transplantation,
       or end of study",
       y = "Density",
       title = "Edema shows high density for low number of days",
       subtitle = "Estimated Probability Density") +
  theme_classic()

Presence of edema shows high density for low number of days. No presence of edema peaks around 1300, while the final category for patients who had edema present without diuretics, or edema resolved by diuretics peaks around 1000.

Create density plot of n_days and stage.

# create density plot
cholangitis %>%
  ggplot(mapping = aes(x = n_days,
                       color = stage)) +
  geom_density(kernel = "gaussian", 
               mapping = aes(y = after_stat(density))) +
  labs(x = "Days between registration and earlier of: death, transplantation,
       or end of study",
       y = "Density",
       title = "Stage 1 and 2, and Stage 3 and 4 are similar respectively",
       subtitle = "Estimated Probability Density") +
  theme_classic()

Stages 1 and 2 share similar distributions and peak around 2800 days. Stages 3 and 4 share similar distributions as well, but peak at 1000 and 1500 days respectively.

Create stacked bar chart to compare associations between stage and status.

# create stacked bar chart
ggplot(data = cholangitis, mapping = aes(x = stage, fill = status)) +
  geom_bar(color = "black", position = "fill") +
  labs(title = "As stage progresses, patients who survive decrease.",
         y = "Proportion")

As stage progresses, patients who survive decrease while patients who die increase. Those who survive with a liver transplant are fairly consistent between stages 2, 3, and 4.

Explanatory Modeling

Question 3

Perform a regression analysis of number of days on all explanatory variables. Exclude “id” since it is only used for patient identification.

# fit model
cholangitis_filtered = cholangitis_filter %>%
  select(-c("id")) # exclude id

linear_model <- lm(formula = n_days ~ ., data = cholangitis_filtered) 

model_summary = linear_model %>%
  tidy() %>%
  select(term, estimate, std.error)

print(model_summary)
# A tibble: 24 × 3
   term                estimate std.error
   <chr>                  <dbl>     <dbl>
 1 (Intercept)      -1528.       768.    
 2 statusCL          -414.       186.    
 3 statusD           -573.       111.    
 4 drugNo treatment  -302.       110.    
 5 drugPlacebo         -8.53     100.    
 6 age                  0.00318    0.0127
 7 sexM                49.6      150.    
 8 ascitesY           283.       220.    
 9 hepatomegalyY       24.9       93.8   
10 spidersY            39.6      103.    
# ℹ 14 more rows

A linear model is fit to all variables excluding “id”. The output shows the estimates for each variable. The model is fit to 380 observations as a result of filtering out outliers.

Question 4

Create residuals vs. fitted plot to assess variance.

# create residuals vs. fitted plot
cholangitis_res = cholangitis_filtered

cholangitis_res$residuals <- residuals(linear_model) # obtain residuals
cholangitis_res$fitted_values <- fitted(linear_model) # obtain fitted values

# create scatterplot
ggplot(cholangitis_res, aes(x = fitted_values, y = residuals)) +
  geom_point(alpha = 0.5) + 
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") + 
  labs(x = "Fitted Values", y = "Residuals", title = "Residuals vs Fitted Plot")

Residuals for smaller values tends to be smaller, but data generally does not show a pattern. Assumptions are slightly violated.

Use qq-plot to check for normality.

# standardize variable
residuals_standardized <- rstandard(linear_model)

# create qq-plot
qq_data <- qqnorm(residuals_standardized, plot.it = FALSE)

qq_df <- data.frame(Theoretical = qq_data$x, StandardizedResiduals = qq_data$y)

ggplot(qq_df, aes(x = Theoretical, y = StandardizedResiduals)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, color = "blue") +
  labs(title = "QQ Plot of Standardized Residuals",
       x = "Theoretical Quantiles",
       y = "Standardized Residuals")

Extreme values depart from normality. Violations are somewhat violated.

Create scale location plot to check linearity and variance.

# standardize model
sqrt_std_residuals <- sqrt(abs(rstandard(linear_model)))

# create scale location plot
scale_location_df <- data.frame(
  FittedValues = fitted(linear_model),
  SqrtStdResiduals = sqrt_std_residuals
)

ggplot(scale_location_df, aes(x = FittedValues, y = SqrtStdResiduals)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", color = "blue", se = FALSE) +
  labs(title = "Scale-Location Plot",
       x = "Fitted Values",
       y = "Sqrt(Standardized Residuals)")
`geom_smooth()` using formula = 'y ~ x'

The scale plot shows a slightly increase trend. However, the pattern is small and the data appears to be random, so linearity and constant variance are only slightly violated.

Question 5

Holding all other variables constant, among patients who received D-penicillamine, a placebo, or no treatment, we expect those who received D-penicillamine to survive about 311 days more than the other two groups.

Predictive Modeling

Question 6

Create training split.

set.seed(131)  # setting a seed for reproducibility

cholangitis = cholangitis %>%
  select(-c("id"))

# create data split
data_split <- initial_split(cholangitis, prop = 0.7)
# create train and test data
cholangitis_train <- training(data_split)
cholangitis_test <- testing(data_split)

Training and testing sets created for prediction. I chose a 0.7 split since the cholangitis data set is relatively small, and I want to ensure the training set has enough observations to convey the general trend of data.

Question 7

Use regression subsets to find which explanatory variables to include.

# regression subsets
best_k_model_results <- regsubsets(n_days ~ ., cholangitis_train)
summary(best_k_model_results)
Subset selection object
Call: regsubsets.formula(n_days ~ ., cholangitis_train)
23 Variables  (and intercept)
                 Forced in Forced out
statusCL             FALSE      FALSE
statusD              FALSE      FALSE
drugNo treatment     FALSE      FALSE
drugPlacebo          FALSE      FALSE
age                  FALSE      FALSE
sexM                 FALSE      FALSE
ascitesY             FALSE      FALSE
hepatomegalyY        FALSE      FALSE
spidersY             FALSE      FALSE
edemaS               FALSE      FALSE
edemaY               FALSE      FALSE
bilirubin            FALSE      FALSE
cholesterol          FALSE      FALSE
albumin              FALSE      FALSE
copper               FALSE      FALSE
alk_phos             FALSE      FALSE
sgot                 FALSE      FALSE
tryglicerides        FALSE      FALSE
platelets            FALSE      FALSE
prothrombin          FALSE      FALSE
stage2               FALSE      FALSE
stage3               FALSE      FALSE
stage4               FALSE      FALSE
1 subsets of each size up to 8
Selection Algorithm: exhaustive
         statusCL statusD drugNo treatment drugPlacebo age sexM ascitesY
1  ( 1 ) " "      " "     " "              " "         " " " "  " "     
2  ( 1 ) " "      " "     " "              " "         " " " "  " "     
3  ( 1 ) " "      " "     " "              " "         " " " "  " "     
4  ( 1 ) " "      "*"     " "              " "         " " " "  " "     
5  ( 1 ) " "      "*"     " "              " "         " " " "  " "     
6  ( 1 ) " "      "*"     " "              " "         " " " "  " "     
7  ( 1 ) " "      "*"     " "              " "         " " " "  " "     
8  ( 1 ) " "      "*"     "*"              " "         " " " "  " "     
         hepatomegalyY spidersY edemaS edemaY bilirubin cholesterol albumin
1  ( 1 ) " "           " "      " "    " "    " "       " "         "*"    
2  ( 1 ) " "           " "      " "    " "    "*"       " "         "*"    
3  ( 1 ) " "           " "      " "    " "    "*"       " "         "*"    
4  ( 1 ) " "           " "      " "    " "    "*"       " "         "*"    
5  ( 1 ) " "           " "      " "    " "    "*"       " "         "*"    
6  ( 1 ) " "           " "      " "    " "    "*"       " "         "*"    
7  ( 1 ) " "           " "      " "    " "    "*"       " "         "*"    
8  ( 1 ) " "           " "      " "    " "    "*"       " "         "*"    
         copper alk_phos sgot tryglicerides platelets prothrombin stage2 stage3
1  ( 1 ) " "    " "      " "  " "           " "       " "         " "    " "   
2  ( 1 ) " "    " "      " "  " "           " "       " "         " "    " "   
3  ( 1 ) " "    "*"      " "  " "           " "       " "         " "    " "   
4  ( 1 ) " "    "*"      " "  " "           " "       " "         " "    " "   
5  ( 1 ) " "    "*"      " "  " "           " "       "*"         " "    " "   
6  ( 1 ) " "    "*"      " "  " "           " "       "*"         " "    " "   
7  ( 1 ) "*"    "*"      " "  " "           " "       "*"         " "    " "   
8  ( 1 ) "*"    "*"      " "  " "           " "       "*"         " "    " "   
         stage4
1  ( 1 ) " "   
2  ( 1 ) " "   
3  ( 1 ) " "   
4  ( 1 ) " "   
5  ( 1 ) " "   
6  ( 1 ) "*"   
7  ( 1 ) "*"   
8  ( 1 ) "*"   
# function to perform k-fold cross validation
ONE_CV_FOLD <- function(fold_number, formula_string)
{
  cholangitis_train <- cholangitis %>%
  filter(folds != fold_number)

  cholangitis_test <- cholangitis %>%
  filter(folds == fold_number)

  cv_linear_model <- 
    do.call(what = "lm", 
        args = list(formula = as.formula(formula_string), 
                    data = quote(cholangitis_train)))

  cv_predictions <- predict(object= cv_linear_model, newdata = cholangitis_test)

  observations <- cholangitis_test %>%
  select(n_days) %>%
  pull()
  
  RMSE <- sqrt(mean((cv_predictions-observations)^2))
  
  return(RMSE)
}

# create folds
k <- 10
fold_vector <- 
  cut(1:nrow(cholangitis), breaks = k, labels = FALSE)

random_folds <- sample(x = fold_vector, size = nrow(cholangitis),
                       replace = FALSE)

cholangitis <- cholangitis %>%
  mutate(folds = random_folds)

# obtain RMSE based on regression subsets
model_1_RMSE <- mean(sapply(unique(fold_vector), FUN = ONE_CV_FOLD,
               formula_string = "n_days ~ albumin"))
model_2_RMSE <- mean(sapply(unique(fold_vector), FUN = ONE_CV_FOLD,
               formula_string = "n_days ~ albumin + bilirubin"))
model_3_RMSE <- mean(sapply(unique(fold_vector), FUN = ONE_CV_FOLD,
               formula_string = "n_days ~ albumin + bilirubin + alk_phos"))
model_4_RMSE <- mean(sapply(unique(fold_vector), FUN = ONE_CV_FOLD,
               formula_string = "n_days ~ albumin + bilirubin + alk_phos +
               status"))
model_5_RMSE <- mean(sapply(unique(fold_vector), FUN = ONE_CV_FOLD,
               formula_string = "n_days ~ albumin + bilirubin + alk_phos +
               status + prothrombin"))
model_6_RMSE <- mean(sapply(unique(fold_vector), FUN = ONE_CV_FOLD,
               formula_string = "n_days ~ albumin + bilirubin + alk_phos +
               status + prothrombin + stage"))
model_7_RMSE <- mean(sapply(unique(fold_vector), FUN = ONE_CV_FOLD,
               formula_string = "n_days ~ albumin + bilirubin + alk_phos + 
               status + prothrombin + stage + copper"))
model_8_RMSE <- mean(sapply(unique(fold_vector), FUN = ONE_CV_FOLD,
               formula_string = "n_days ~ albumin + bilirubin + alk_phos + 
               status + prothrombin + stage + copper + drug"))

print(min(c(model_1_RMSE, model_2_RMSE, model_3_RMSE, model_4_RMSE,
            model_5_RMSE, model_6_RMSE, model_7_RMSE, model_8_RMSE)))
[1] 851.7943

The best model to predict n_days includes albumin, bilirubin, alk_phos, status, prothrombin, stage, copper, and drug. The model’s RMSE is 851.7943.

Question 8

Create regression tree and print cp levels.

# fit regression tree
decision_tree <- 
  rpart(formula = n_days ~ ., data = cholangitis_train)

printcp(decision_tree)

Regression tree:
rpart(formula = n_days ~ ., data = cholangitis_train)

Variables actually used in tree construction:
[1] albumin       alk_phos      bilirubin     copper        platelets    
[6] prothrombin   tryglicerides

Root node error: 346922678/292 = 1188091

n= 292 

         CP nsplit rel error  xerror     xstd
1  0.184933      0   1.00000 1.00518 0.072165
2  0.062779      1   0.81507 0.87766 0.061672
3  0.037251      2   0.75229 0.91608 0.070621
4  0.035585      3   0.71504 0.89650 0.070057
5  0.032277      4   0.67945 0.90373 0.073637
6  0.027847      5   0.64717 0.89969 0.075324
7  0.019563      7   0.59148 0.89487 0.074900
8  0.015396      8   0.57192 0.91335 0.081764
9  0.015283      9   0.55652 0.90893 0.080588
10 0.014942     10   0.54124 0.92785 0.081561
11 0.013406     11   0.52630 0.92827 0.082178
12 0.012233     12   0.51289 0.95981 0.081914
13 0.011515     13   0.50066 1.00229 0.083124
14 0.011066     14   0.48914 1.00901 0.083673
15 0.010000     15   0.47808 1.02702 0.084936

Based on the data, we can prune the tree at the cp level of 0.062779.

Prune the tree.

# refit tree
refit = prune(decision_tree, cp = 0.062779)
rpart.plot(refit, type = 3)

The pruned tree includes variables bilirubin and albumin.

Now with the model fit, calculate the RMSE.

# regression tree prediction
test_predictions = predict(refit, newdata = cholangitis_test)
acc_test = sqrt(mean((test_predictions - cholangitis_test$n_days)^2))
print(acc_test)
[1] 1084.367

The model’s RMSE is 1084.367. It uses variables bilirubin and albumin.

Fit random forest.

# fit random forest
random_forest <- 
  randomForest(formula = n_days ~ ., 
        data = cholangitis_train,
        cutoff = c(0.5, 0.5),
        importance = TRUE)

# view which variables are important
varImpPlot(random_forest, 
           main = "Variable Importance Plot",
           pch = 16)

Important variables appear to be albumin, bilirubin, and prothrombin.

Calculate RMSE based on random forest model.

# Predict using the random forest model
forest_test_predictions <- predict(random_forest, newdata = cholangitis_test)

# Calculate RMSE
forest_rmse <- sqrt(mean((forest_test_predictions - cholangitis_test$n_days)^2))
print(forest_rmse)
[1] 867.1492

The model’s RMSE is 867.1492.

Question 9

Create markdown table for the three models created.

# create data frame for markdown table
markdown_data = data.frame(
  Model = c("Linear Regression", "Regression Trees", "Random Forest"),
  RMSE = c(model_8_RMSE, acc_test, forest_rmse)
)

# print markdown table
kable(markdown_data, format = "markdown")
Model RMSE
Linear Regression 851.7943
Regression Trees 1084.3667
Random Forest 867.1492

Using the RMSE as my testing metric, I will use my linear regression model for predictive purpose, as it has the lowest RMSE.

Next Steps

Question 1

One aspect of my project I believe can be improved is my filtering of my data. I chose to filter out outliers based on quantiles, but I could have done a more thorough job in selecting which outliers to exclude since the data set is fairly small.

Question 2

Based on my work for this project, I am interested in further exploring the relationship of bilirubin and albumin to number of days. In all my predictive models, bilirubin and albumin were considered important variables, so I am interested in closer examining their impact. Additionally, I am curious about the effectiveness of treatment on patients at different stages of the disease. Performing analysis and comparing the number of days based on that could reveal insight on treatment effectiveness for different groups.