Project 3 - STAT 611 - Smith, Nick

Application of the R Programming Language / SW to Biostatistical Data Analysis

Introduction

Biliary Cirrhosis Introduction

Primary biliary cirrhosis (PBC), or cholangitis, is a chronic disease of the liver. In instances where the disease is present, the bile ducts of an individual’s liver are slowly destroyed. Bile is a fluid made in the liver that aids digestion, and absorbs vitamins. It helps filter out cholesterol, toxins, and worn blood cells. [1]

Once the bile ducts of a liver are destroyed, the irreversible scarring of liver tissue - cirrhosis, occurs, and the liver will eventually fail. This can contribute to complications such as decreased mental functioning, vitamin deficiencies, increased blood pressure, liver failure, liver cancer, and death. [1]

As of now, there is no cure to PBC. The medications and treatments generally involve slowing down the active liver damage, and the other comorbidities to PBC [2]. In severe cases, liver transplant is necessary to keep a patient alive [2].

For individuals involved in medical research and those who are at risk due of PBC, it is of great interest to identify key factors that will lead to the progression of the disease (via Ludwig’s classification, PBC histiological stage(s) 1 through 4 [3]), and those that lead to death. Though there is no clear cure to PBC, identifying these important factors can assist in determining how severe the disease is on a quantitative scale. This may provide assistance for research studies and understanding factors to prioritize in the development of PBC.

Study Problem Statement

Based on the Mayo Clinic Biliary Cirrhosis clinical trial dataset, what factors are important for determining severity of the disease, and potential death?

These factors can be continuous or discrete. They can be numeric or categorical. The question is, what is important?

While a certain factor may be present more in terminally staged PBC patients, or deceased patients, this data does not imply that the PBC was the direct cause of death, or the “critical” factor was what made PBC worse.

The objective is to find these critical factors, and help provide a statistician, data scientist, biomedical researcher, etc. a basis for important comorbidities or ailments that should be treated. With that, if a number of these various symptoms are present, it can help provide (with further model building) a quantitative, data-driven, metric to predict if the patient is truly in danger.

Study Methodology, how do we answer the problem?

To answer the problem statement, the Mayo Clinic Biliary Cirrhosis clinical trial dataset [4] will be imported and cleaned.

This dataset will be analyzed for what factors are present, categorical and quantitative biomarkers.

Once these factors are identified, exploratory data analysis will be done to provide an idea of what may contribute to the stage progression and death.

After exploratory data analysis, Boruta Analysis; a machine learning algorithm, will be applied to determine what factors are statistically significant in the cleaned dataset.

Discussion will note other steps that may be done to analyze this rich and complex dataset.

Note, this is not a medical study, and correlation is not equal to causation, this report does not claim this any any manner. It is only looking at the dataset to see what features have statistical meaning.

Packages Required

The R software packages are imported as such above. Their functions are listed below:

  • ggplot2: Used for data visualization.
  • dplyr: Used for making tidy data and manipulating dataframes.
  • tibble: Used for generating easy to read and manipulate datatables.
  • corrplot: Used to generate correlation plots of quantitative variables.
  • gridExtra: Used to present ggplot2 objects in concise manner.
  • Boruta: Used to perform Boruta feature analysis and implement algorithm.
  • DT: Used to display the data via HTML as scrollable format.
  • pastecs: Used to compute descriptive statistics.
library("ggplot2")
library("dplyr")
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library("tibble")
library("corrplot")
## corrplot 0.92 loaded
library("gridExtra")
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library("Boruta")
library("DT")
library("pastecs")
## 
## Attaching package: 'pastecs'
## The following objects are masked from 'package:dplyr':
## 
##     first, last

Data Preparation

Original Source

The datasets are sourced from the Vanderbilt University Department of Biostatistics, links below:

The PBC dataset is from a Mayo Clinical trial studying PBC between 1974 and 1984. A description of the trial can be found in the text below:

“A total of 424 PBC patients, referred to Mayo Clinic during that ten-year interval, met eligibility criteria for the randomized placebo controlled trial of the drug D-penicillamine. The first 312 cases in the data set participated in the randomized trial and contain largely complete data. The additional 112 cases did not participate in the clinical trial, but consented to have basic measurements recorded and to be followed for survival. Six of those cases were lost to follow-up shortly after diagnosis, so the data here are on an additional 106 cases as well as the 312 randomized participants. Missing data items are denoted by a period.” [6]

The variable descriptions are listed below:

  • case number
  • fu.days: number of days between registration and the earlier of death, transplantion, or study analysis time in July, 1986
  • status: 0=alive, 1=dead
  • drug: 1= D-penicillamine, 2=placebo
  • age: age in days
  • sex: 0=male, 1=female
  • ascites: presence of ascites- 0=no 1=yes
  • hept: presence of hepatomegaly - 0=no 1=yes
  • spiders: presence of spiders - 0=no 1=yes
  • edema: presence of edema - 0=no edema and no diuretic therapy for edema; .5 = edema present without diuretics, or edema resolved by diuretics; 1 = edema despite diuretic therapy
  • bili: serum bilirubin in mg/dl
  • chol: serum cholesterol in mg/dl
  • albumin: albumin in gm/dl
  • copper: urine copper in ug/day
  • alk.phos: alkaline phosphatase in U/liter
  • sgot: SGOT in U/ml
  • trig: triglicerides in mg/dl
  • platelet: platelets per cubic ml / 1000
  • protime: prothrombin time in seconds
  • stage: histologic stage of disease

Data Importing and Cleaning Process

For the data analysis, some modifications will be need to be done to make it workable for analysis.

First the R script must read the .csv file into the interpreter. Then, it will confirm if the structure is consistent to the data descriptors previously mentioned.

To clean up the data and factors that will model the status of the patient (0 - alive, or 1 - deceased), some are not going to make practical sense. For brevity, the dataset will be filtered to remove the patients who are not randomized, which are inherently missing many measurements. This will limit the dataset to those treated with a placebo or the drug of interest.

Follow-up days will be omitted, as this only notes the study analysis time of the patient. This will simulate this as a random sample, and will remove this noise factor, which will not necessarily be able to be generalized as a factor. (It doesn’t make sense to note a follow-up day time as a feature/factor that will contribute to severe PBC, this is just a timekeeping feature of the clinical trial data.)

For this initial study, there will be no imputed data points. For brevity, and to reduce model noise, the NA’s and their related rows will be removed. This will further contribute to a more complete dataset.

The R code below will execute each step in sequence:

# import the data create dataframe
setwd("~/Documents/STAT 611/Project 3")

ciro.df <- read.csv("pbc.csv")
as_tibble(str(ciro.df))
## 'data.frame':    418 obs. of  19 variables:
##  $ bili    : num  14.5 1.1 1.4 1.8 3.4 ...
##  $ albumin : num  2.6 4.14 3.48 2.54 3.53 ...
##  $ stage   : int  4 3 4 4 3 3 3 3 2 4 ...
##  $ protime : num  12.2 10.6 12 10.3 10.9 ...
##  $ sex     : chr  "female" "female" "male" "female" ...
##  $ fu.days : int  400 4500 1012 1925 1504 2503 1832 2466 2400 51 ...
##  $ age     : num  58.8 56.4 70.1 54.7 38.1 ...
##  $ spiders : chr  "present" "present" "absent" "present" ...
##  $ hepatom : chr  "present" "present" "absent" "present" ...
##  $ ascites : chr  "present" "absent" "absent" "absent" ...
##  $ alk.phos: num  1718 7395 516 6122 671 ...
##  $ sgot    : num  137.9 113.5 96.1 60.6 113.2 ...
##  $ chol    : int  261 302 176 244 279 248 322 280 562 200 ...
##  $ trig    : int  172 88 55 92 72 63 213 189 88 143 ...
##  $ platelet: int  190 221 151 183 136 NA 204 373 251 302 ...
##  $ drug    : chr  "placebo" "placebo" "placebo" "placebo" ...
##  $ status  : int  1 0 1 1 0 1 0 1 1 1 ...
##  $ edema   : chr  "edema despite diuretic therapy" "no edema" "edema, no diuretic therapy" "edema, no diuretic therapy" ...
##  $ copper  : int  156 54 210 64 143 50 52 52 79 140 ...
# removed samples that are not randomized, and follow-up days:
ciro_rand.df <- subset(ciro.df, drug != "not randomized") 
ciro_rand.df <- subset(ciro_rand.df, select = -fu.days)

#comit NAs
colSums(is.na(ciro_rand.df))
##     bili  albumin    stage  protime      sex      age  spiders  hepatom 
##        0        0        0        0        0        0        0        0 
##  ascites alk.phos     sgot     chol     trig platelet     drug   status 
##        0        0        0       28       30        4        0        0 
##    edema   copper 
##        0        2
ciro_rand_cleaned.df <- na.omit(ciro_rand.df)

# confrim the NAs have been removed
colSums(is.na(ciro_rand_cleaned.df))
##     bili  albumin    stage  protime      sex      age  spiders  hepatom 
##        0        0        0        0        0        0        0        0 
##  ascites alk.phos     sgot     chol     trig platelet     drug   status 
##        0        0        0        0        0        0        0        0 
##    edema   copper 
##        0        0

Next, the cleaned data data will need to be modified in a way that sets factor levels when appropriate. Per the data descriptors, the R code below will make categorical variables into factors with levels, and will turn relevant integer variables into factors:

# take categorical variables and convert to factors

ciro_rand_cleaned.df[sapply(ciro_rand_cleaned.df, is.character)] <- lapply(ciro_rand_cleaned.df[
  sapply(ciro_rand_cleaned.df, is.character)],
              as.factor)

# convert stage and status into factor
ciro_rand_cleaned.df$status <- as.factor(ciro_rand_cleaned.df$status)
ciro_rand_cleaned.df$stage <- as.factor(ciro_rand_cleaned.df$stage)

as_tibble(str(ciro_rand_cleaned.df))
## 'data.frame':    276 obs. of  18 variables:
##  $ bili    : num  14.5 1.1 1.4 1.8 3.4 ...
##  $ albumin : num  2.6 4.14 3.48 2.54 3.53 ...
##  $ stage   : Factor w/ 4 levels "1","2","3","4": 4 3 4 4 3 3 3 2 4 4 ...
##  $ protime : num  12.2 10.6 12 10.3 10.9 ...
##  $ sex     : Factor w/ 2 levels "female","male": 1 1 2 1 1 1 1 1 1 1 ...
##  $ age     : num  58.8 56.4 70.1 54.7 38.1 ...
##  $ spiders : Factor w/ 2 levels "absent","present": 2 2 1 2 2 1 1 2 2 2 ...
##  $ hepatom : Factor w/ 2 levels "absent","present": 2 2 1 2 2 2 1 1 1 2 ...
##  $ ascites : Factor w/ 2 levels "absent","present": 2 1 1 1 1 1 1 1 2 1 ...
##  $ alk.phos: num  1718 7395 516 6122 671 ...
##  $ sgot    : num  137.9 113.5 96.1 60.6 113.2 ...
##  $ chol    : int  261 302 176 244 279 322 280 562 200 259 ...
##  $ trig    : int  172 88 55 92 72 213 189 88 143 79 ...
##  $ platelet: int  190 221 151 183 136 204 373 251 302 258 ...
##  $ drug    : Factor w/ 2 levels "D-penicillamine",..: 2 2 2 2 1 1 1 2 1 1 ...
##  $ status  : Factor w/ 2 levels "0","1": 2 1 2 2 1 1 2 2 2 2 ...
##  $ edema   : Factor w/ 3 levels "edema despite diuretic therapy",..: 1 3 2 2 3 3 3 3 1 3 ...
##  $ copper  : int  156 54 210 64 143 52 52 79 140 46 ...
##  - attr(*, "na.action")= 'omit' Named int [1:36] 6 14 40 41 42 45 49 53 58 70 ...
##   ..- attr(*, "names")= chr [1:36] "6" "14" "40" "41" ...

Final Data Preview

A preview of the dataset is created. Using dplyr, the numerical numbers are rounded for conciseness (actual analysis will use full values). An HTML datatable is generated for 25 abridged entries via the DT package. Please note, due to the columns in the dataset, the datatable is inherently large. The output is limited, but for an accurate data preview, all columns are present.

# for presentation, round the numeric values to 3 digits
ciro_rand_round <- ciro_rand_cleaned.df %>% 
  mutate_if(is.numeric, round, digits=3)

# data table of dataset, only 25 entries
datatable(head(ciro_rand_round, 25), filter = "top", options = list(
  columnDefs = list(list(className = 'dt-center', targets = 5)),
  pageLength = 5,
  lengthMenu = c(5, 10, 15, 20)), autoHideNavigation = TRUE) 

Exploratory Data Analysis

Categorical Data Visualization

Before investigating the status of the patients, the histiological stage data is important for understanding the progression of the disease, and the categorical factors present. As one can infer, as the stage progression goes from 1-4, the patient’s risk of dying will increase. Stage 4 is noted as the terminal stage prior to death [3].

To understand the study data, it will be helpful to get an idea of how the patient’s stage of PBC is distributed throughout the data:

# summary of stages
round(prop.table(table(ciro_rand_cleaned.df$stage)),2) 
## 
##    1    2    3    4 
## 0.04 0.21 0.40 0.34

Using these proportions as a guide, the distribution of patients and their respective distributions are shown below:

# summary of stages and data, continued
stagecount <- as.data.frame(table(ciro_rand_cleaned.df$stage))

colnames(stagecount)[1] <- "stage"

ggplot(stagecount,(aes(x = stage, y = Freq,fill = stage)))+
  geom_bar(stat="identity")+
  scale_fill_manual(values=c("gold","orangered","red2","purple"))+
  labs(title = "Quantity of Cirrhosis Stage Data")+
  theme(plot.title = element_text(hjust = 0.5))+
  guides(fill=guide_legend(title="stage"))

From this plot, one can infer this dataset will be skewed with more patients towards the terminal stages of PBC.

The categorical comorbidities and symptoms of PBC were previously structured. Based on some of the most notable identifiable comorbidities [3] ascites, hepatomegaly, and spiders will be analyzed.

To get an idea of their occurrence, the proportion of patients who have these ailments will be displayed in each stage. One will also see how the placebo and treatment drug (D-penicillamine) are given to the patients.

Ascites is generally considered the most lethal of the three listed ailments. This is when the liver is actively weeping fluid from the veins of the liver. This fluid will cause the patient appear bloated and will impact the surrounding organs and is generally considered a sign of liver failure [2].

Hepatomegaly is an observation of enlargement of the liver. This can occur as viral infection, or in some cases, alchohol-induced PBC or liver damage. It is not inherently fatal, but can be considered a symptom of poor liver health. [2]

Spiders, or spider angioma, are a skin symptom of veins that appear red and irritated. It is indicative of liver functioning deteriorating, and is commonly seen in alcoholic induced PBC.

The following R code will create a dataframe for each of these categorical features, and create a ggplot object for each variable across the stages. gridExtra is used to present the plots in a concise manner.

# ascites count

ascitecount <- as.data.frame(table(ciro_rand_cleaned.df$ascites, ciro_rand_cleaned.df$stage))


colnames(ascitecount)[1] <- "ascites"
colnames(ascitecount)[2] <- "stage"

asciteplot <- ggplot(ascitecount, aes(x=stage, y=Freq, fill=ascites))+
              geom_bar(stat="Identity",position="dodge")+
              labs(title="Ascites Count Among Stage Progression")+
              guides(fill=guide_legend(title="stage"))

# heptomegaly count

heptcount <- as.data.frame((table(ciro_rand_cleaned.df$hepatom, ciro_rand_cleaned.df$stage)))

colnames(heptcount)[1] <- "heptomegaly"
colnames(heptcount)[2] <- "stage"

heptplot<- ggplot(heptcount, aes(x=stage, y=Freq, fill=heptomegaly))+
  geom_bar(stat="Identity",position="dodge")+
  labs(title="Hepatomegaly Count Among Stage Progression")+
  guides(fill=guide_legend(title="stage"))


# spiders count
spiderscount <- as.data.frame((table(ciro_rand_cleaned.df$spiders,
                                     ciro_rand_cleaned.df$stage)))


colnames(spiderscount)[1] <- "spiders"
colnames(spiderscount)[2] <- "stage"

spidersplot <- ggplot(spiderscount, aes(x=stage, y=Freq, fill=spiders))+
  geom_bar(stat="Identity",position="dodge")+
  labs(title="Spider Count Among Stage Progression")+
  guides(fill=guide_legend(title="stage"))

# drug count
drugcount <- as.data.frame((table(ciro_rand_cleaned.df$drug, 
                                  ciro_rand_cleaned.df$stage)))


colnames(drugcount)[1] <- "drug"
colnames(drugcount)[2] <- "stage"

drugplot<- ggplot(drugcount, aes(x=stage, y=Freq, fill=drug))+
  geom_bar(stat="Identity",position="dodge")+
  labs(title="Drug Count Among Stage Progression")+
  guides(fill=guide_legend(title="stage"))

grid.arrange(asciteplot, drugplot) 

grid.arrange(spidersplot, heptplot)

The drug treatments are applied in a manner apparently proportional to the stage distribution previously shown, though they are randomly assigned.

Interestingly enough, ascites is present only in a small amount of patients in stage 4 PBC. This supports the idea that it is a sign of fatal liver conditions exacerbated by PBC.

Hepatomegaly appears to become more present as the PBC stage goes higher. This can be interpreted as it is a sign of declining liver health, and PBC, but perhaps not completely fatal. This also appears to be a feature that may be significant in predicting the status of a patient, and the severity of the disease.

Spiders appears in the same stages of PBC as hepatomegaly, though not as frequently. While it may be an indicator of PBC, it can be interpreted as it is not necessarily a direct sign of severe PBC.

Quantitative / Continuous Data Visualization

Prior to analyzing the continuous variables, as there are many, it will be helpful to create a dataframe of the isolated continuous variables.

# quantitative variables of interest, create correlation plot of data
ciro_rc_quant <- select_if(ciro_rand_cleaned.df, is.numeric)

head(ciro_rc_quant)

For further context of the proceeding data, below is a summary of the descriptive statistics for both of the data groups:

# summary of all quantitative variables. base summary, variance covariance, and 
# descriptive statistics via pastecs package.

# living patients
# base summary via R
ciro_base_summary<- summary(ciro_rc_quant)
ciro_base_summary
##       bili           albumin         protime           age       
##  Min.   : 0.300   Min.   :1.960   Min.   : 9.00   Min.   :26.28  
##  1st Qu.: 0.800   1st Qu.:3.310   1st Qu.:10.00   1st Qu.:41.51  
##  Median : 1.400   Median :3.545   Median :10.60   Median :49.71  
##  Mean   : 3.334   Mean   :3.517   Mean   :10.74   Mean   :49.80  
##  3rd Qu.: 3.525   3rd Qu.:3.772   3rd Qu.:11.20   3rd Qu.:56.58  
##  Max.   :28.000   Max.   :4.400   Max.   :17.10   Max.   :78.44  
##     alk.phos            sgot             chol             trig      
##  Min.   :  289.0   Min.   : 28.38   Min.   : 120.0   Min.   : 33.0  
##  1st Qu.:  922.5   1st Qu.: 82.46   1st Qu.: 249.5   1st Qu.: 85.0  
##  Median : 1277.5   Median :116.62   Median : 310.0   Median :108.0  
##  Mean   : 1996.6   Mean   :124.12   Mean   : 371.3   Mean   :125.0  
##  3rd Qu.: 2068.2   3rd Qu.:153.45   3rd Qu.: 401.0   3rd Qu.:151.2  
##  Max.   :13862.4   Max.   :457.25   Max.   :1775.0   Max.   :598.0  
##     platelet         copper      
##  Min.   : 62.0   Min.   :  4.00  
##  1st Qu.:200.0   1st Qu.: 42.75  
##  Median :257.0   Median : 74.00  
##  Mean   :261.8   Mean   :100.77  
##  3rd Qu.:318.2   3rd Qu.:129.25  
##  Max.   :563.0   Max.   :588.00
# variance and covariances rounded to 3 digits
ciro_var_living <- round(cov(ciro_rc_quant), 3)
ciro_var_living
##              bili albumin protime      age    alk.phos      sgot      chol
## bili       21.170  -0.581   1.536    3.785    1350.882   110.910   426.673
## albumin    -0.581   0.164  -0.081   -1.021     -95.477    -4.551    -6.396
## protime     1.536  -0.081   1.017    2.597     187.250     4.000    -7.901
## age         3.785  -1.021   2.597  110.735    -440.739   -72.142  -387.005
## alk.phos 1350.882 -95.477 187.250 -440.739 4475246.722 17427.545 75636.417
## sgot      110.910  -4.551   4.000  -72.142   17427.545  3217.153  4753.847
## chol      426.673  -6.396  -7.901 -387.005   75636.417  4753.847 55125.575
## trig      132.066  -2.898   1.651   16.975   25212.374   499.709  4198.049
## platelet  -31.878   7.487 -18.110 -146.657   29010.058  -311.397  4210.169
## copper    187.034  -8.590  19.026   76.752   35649.111  1511.284  2578.403
##               trig  platelet    copper
## bili       132.066   -31.878   187.034
## albumin     -2.898     7.487    -8.590
## protime      1.651   -18.110    19.026
## age         16.975  -146.657    76.752
## alk.phos 25212.374 29010.058 35649.111
## sgot       499.709  -311.397  1511.284
## chol      4198.049  4210.169  2578.403
## trig      4261.578   615.762  1639.817
## platelet   615.762  8672.984  -690.315
## copper    1639.817  -690.315  7791.371
# pastecs descriptive statistics rounded to 3 digits
ciro_desc <- round(stat.desc(ciro_rc_quant), 3)
ciro_desc

Though this report and analysis does not seek to build a regression model directly (note that the feature analysis will assist in giving a statistician or data scientist an idea of what variables to look at in this context), a correlation plot of how the variables are “interacting” or “associated” with one another is a good tool to observe.

# create correlation matrix
corr.data <- cor(ciro_rc_quant)
corr.data
##                 bili     albumin     protime         age    alk.phos
## bili      1.00000000 -0.31204315  0.33117618  0.07817686  0.13878731
## albumin  -0.31204315  1.00000000 -0.19958367 -0.23967646 -0.11149675
## protime   0.33117618 -0.19958367  1.00000000  0.24477423  0.08778416
## age       0.07817686 -0.23967646  0.24477423  1.00000000 -0.01979841
## alk.phos  0.13878731 -0.11149675  0.08778416 -0.01979841  1.00000000
## sgot      0.42498704 -0.19821020  0.06993950 -0.12086821  0.14524188
## chol      0.39496606 -0.06729864 -0.03337385 -0.15663815  0.15228106
## trig      0.43969068 -0.10967327  0.02508706  0.02470982  0.18256606
## platelet -0.07439504  0.19860002 -0.19285873 -0.14964961  0.14725018
## copper    0.46052723 -0.24040832  0.21376979  0.08263080  0.19091202
##                 sgot        chol        trig    platelet      copper
## bili      0.42498704  0.39496606  0.43969068 -0.07439504  0.46052723
## albumin  -0.19821020 -0.06729864 -0.10967327  0.19860002 -0.24040832
## protime   0.06993950 -0.03337385  0.02508706 -0.19285873  0.21376979
## age      -0.12086821 -0.15663815  0.02470982 -0.14964961  0.08263080
## alk.phos  0.14524188  0.15228106  0.18256606  0.14725018  0.19091202
## sgot      1.00000000  0.35697089  0.13495728 -0.05895142  0.30185842
## chol      0.35697089  1.00000000  0.27389605  0.19254785  0.12441341
## trig      0.13495728  0.27389605  1.00000000  0.10128464  0.28457920
## platelet -0.05895142  0.19254785  0.10128464  1.00000000 -0.08397617
## copper    0.30185842  0.12441341  0.28457920 -0.08397617  1.00000000
# correlation plot of quantitative variables
corrplot(corr.data, method = 'ellipse', order='AOE', type = 'upper')

This matrix shows how the variables are interrelated to one another. It is interesting as the vast majority of these categorical variables appear to interact in some manner. Trigliceride volume seems to not react with age or albumin levels.

There is a strong correlation with biliary levels and platelets. In fact, platelets seem to interact negetively in some manner with many of the predictors.

The correlation matrix is difficult to draw direct, “binary”, conclusions from a visual analysis, but is useful nonetheless. Age, bilirubin levels, prothrombin time, copper content in urine, and alkaline phosphate levels all appear to be present in many of the interactions.

Data visualizations of the kernel density of these variables across the living and deceased patients can provide a basis for understanding what values are present in patients, and thus, a notion of criticality in terms of understanding the severity of their PBC and potential lethality.

PCA analysis can be of importance for later studies, as noted later in the report.

# plot the stage progression across the age of patients
ggplot(ciro_rand_cleaned.df, aes(x = age, fill=as.factor(stage)))+
  geom_density(alpha = 0.40, color=NA)+
  scale_fill_manual(values=c("gold","orangered","red2","purple"))+
  labs(title = "Kernel Density of Cirrhosis Stage across Age")+
  theme(plot.title = element_text(hjust = 0.5))+
  facet_grid(stage ~ .)+
  guides(fill=guide_legend(title="stage"))

The age distribution of patients across each stage is listed above. It is clear the center of the distribution seems to shift to an “older” center as the patient age increases. It appears older patients with PBC tend to have a more terminal stage of PBC. The data are heavily skewed, and do not appear normally distributed.

# plot the statuses (0 - living, 1 - deceased) of patients
ggplot(ciro_rand_cleaned.df, aes(x = age, fill=as.factor(status)))+
  geom_density(alpha = 0.40, color=NA)+
  scale_fill_manual(values=c("gold","purple"))+
  labs(title = "Kernel Density of Cirrhosis Stage across Age and Status 
       \n(0) - Alive, (1) - Deceased")+
  theme(plot.title = element_text(hjust = 0.5))+
  facet_grid(as.factor(status) ~ .)+
  guides(fill=guide_legend(title="status"))

For living and deceased patients with PBC (recall, the deceased status doesn’t necessarily imply that the PBC was the cause of death), the center of the deceased patients appears to be higher (as expected), and with less variance. The living patient distribution appears to be parametrically different than the deceased patient distribution, which is more normal.

Age does appear to be indicative of the PBC criticality and severity.

The plots for the remaining quantitative variables are compared via their kernel densities with respect to the patients being living, or deceased. From the correlation matrix, these appeared to be also important in determining the criticality of the PBC.

# plot interesting variables from correlation plot, and statuses 
# (0 - living, 1 - deceased) of patients

# Bilirubin levels for living and deceased patients
bilplot <- ggplot(ciro_rand_cleaned.df, aes(x = bili, fill=as.factor(status)))+
  geom_density(alpha = 0.40, color=NA)+
  scale_fill_manual(values=c("gold","blue"))+
  labs(title = "Kernel Density of Serum Bilirubin Levels (mg/dl) Among Living (0) 
       and Deceased (1) Patients")+
  xlab("Serum Bilirubin Levels (mg/dl)")+
  theme(plot.title = element_text(hjust = 0.5))+
  guides(fill=guide_legend(title="status"))

# prothrombin time for living and deceased patients
proplot <- ggplot(ciro_rand_cleaned.df, aes(x = protime, 
                                            fill=as.factor(status)))+
  geom_density(alpha = 0.40, color=NA)+
  scale_fill_manual(values=c("gold","blue"))+
  labs(title = "Kernel Density of Prothrombin Time (s) Among Living (0) 
       and Deceased (1) Patients")+
  xlab("Prothrombin Time (s) ")+
  theme(plot.title = element_text(hjust = 0.5))+
  guides(fill=guide_legend(title="status"))

# urine copper levels among living and deceased patients
copplot <- ggplot(ciro_rand_cleaned.df, aes(x = copper, 
                                            fill=as.factor(status)))+
  geom_density(alpha = 0.40, color=NA)+
  scale_fill_manual(values=c("gold","blue"))+
  labs(title = "Kernel Density of Urine Copper (ug/day) Among Living (0) 
       and Deceased Patients (1)")+
  xlab("Urine Copper (ug/day)")+
  theme(plot.title = element_text(hjust = 0.5))+
  guides(fill=guide_legend(title="status"))

# alkaline phosphate levels among living and deceased patients
alkplot <- ggplot(ciro_rand_cleaned.df, aes(x = alk.phos, 
                                            fill=as.factor(status)))+
  geom_density(alpha = 0.40, color=NA)+
  scale_fill_manual(values=c("gold","blue"))+
  labs(title = "Kernel Density of Alkaline Phosphate (U/L) Among Living (0) 
       and Deceased Patients (1)")+
  theme(plot.title = element_text(hjust = 0.5))+
  guides(fill=guide_legend(title="status"))

# grid plots of quantitative variables
grid.arrange(bilplot, proplot) 

grid.arrange(copplot, alkplot)

For all four factors, many of the deceased patients had higher values. The spreads of the data are all incredibly skewed through the respective ranges. Bilirubin levels are very spread out, with a large variance. Prothrombin time among living and deceased patients appear to be the most similar, but spread differently nonetheless.

The variance across all four groups is very high. It does appear that the variances increase for the deceased patients. This consistent in each kernel density comparison.

It is unclear if the distributions are impacted by a few outliers, but there is some difference in each distribution between living and deceased PBC patients. These visualizations demonstrate these quantitative factors may be all important when determining the severity of PBC.

These plots indicate that bilirubin levels and prothrombin time may be significant contributors to PBC criticality.

Feature Selection via Boruta Analysis

Boruta analysis is a machine learning algorithm that determines whether or not a feature is important. It is a heuristic, and serial algorithm that uses random forests to have a feature compete against a “shadow feature” of itself to keep the distribution the same, but highlight if it is “important” [7].

To use it, the response, status (if living or deceased from study), will be “fit” with a training data set. This training dataset will omit the status variable, and execute the algorithm to evaluate if the features are stochastically different and important. A plot is then generated of the important features.

# Boruta Feature Analysis
train <-ciro_rand_cleaned.df
train[,"status"] = NULL

set.seed(42)
boruta_analysis = Boruta(ciro_rand_cleaned.df$status ~ ., data = train, maxRuns = 200)

plot(boruta_analysis, main = "Boruta Analysis Results", xlab = "Feature", las =2, cex.axis=.7)

Interestingly enough, this model does seem to co-align with the previously discussed exploratory data analysis. Bilirubin levels appear to be most significant, followed by prothrombin time, copper levels, alkaline phosphate levels, age, ascites presence, edema, etc.

as.data.frame(boruta_analysis$finalDecision)

The data frame above lists all of the predicted significant and important features for predicting patient status. As discussed, spiders dies not appear to be a great way of predicting the severity of PBC, but is still a good way of determining if it’s present.

Platelet measurements are also not as important, which is interesting as it appeared to correlate with many of the other variables in the matrix plot. The variables that are most important through the Boruta algorithm tend to show relatively stronger correlation values in this matrix.

Hepatomegaly is tentatively important, which is is somewhat contrary to exploratory data analysis. Perhaps it is not necessarily a sign of lethal PBC, but can indicate if it is progressing in an alarming fashion.

The drugs used in the clinical trial do not predict severity of PBC. It can be argued that this is a noise source in the analysis, as it contains placebo treatments and experimental treatments.

The treatment applied is likely not a great predictor for if a patient is in a critical state of PBC progression. The treatment was randomly applied to patients, and it is difficult, in the model these features are contributing to, to confirm if the treatment is exacerbating PBC symptoms, or aiding in slowing them down.

A different clinical study can be done, perhaps an ANOVA or Kruskal-Wallis test between experimental and control groups. This was not the intent of this study, and the predictive model these features may belong to.

Summary

1. Problem Summary

The objective of this analysis was to use the Mayo Primary Biliary Cirrhosis data to determine what factors contributed to determining the severity of PBC. The patients in the clinical trial have already been diagnosed with PBC, so the cases in which the patient did not survive, or were in a more severe stage are of great interest.

2. Study Methodology and Data Structure for Analysis

To understand the severity, this study cleaned the data and isolated the cases that are completely reported by data. This cleaned dataset was then structured in a way that could be analyzed with the R programming language, and statistical software packages.

The cleaned data was then analyzed via exploratory data analysis. This was split into two analytical groups, categorical data and quantitative, continuous, data.

Once the exploratory data analysis was completed, Boruta analysis was executed for training and test data. This led to the determination of which features are important for predicting patient status, and which are seemingly not as important. This set up a basis for further model building and analysis of the PBC dataset.

3. Insights and Conclusions from Data Analysis

Categorical Data Exploration and Visualization Insights

A series of plots was done for categorical factors. To see the progression of the disease with respect to these factors, bar charts were generated via ggplot. An understanding of the patient distribution via each stage was determined, revealing a skew towards for severe cases. With this in mind, a number of additional categorical factors were analyzed.

The presence of ascites was the most notable, as it is solely present in the terminal stage (stage 4) of PBC.

Spiders and hepatomegaly are present in stage(s) 2-4 of the disease. Spiders was not as frequent, but they both appear to be more frequent as the disease progresses. It is debatable if these two variables are important for determining disease criticality.

The drug distribution is seemingly distributed in a more proportional to the stage distribution, but ultimately looked random. This will likely not contribute to the prediction of the patient status and criticality, as it is a mix of experimental treatment and placebo.

Quantitative / Continuous Data Exploration and Visualization Insights

A correlation plot matrix of an isolated dataframe led to some interesting insights. Age, bilirubin levels, prothrombin time, copper content in urine, and alkaline phosphate levels all appear to be present in many of the interactions.

The stage progression and ages of patients revealed that age tends to increase with the more terminal progressions of PBC. As expected, this patter was observed when looking at living and deceased patients, though interestingly enough, the deceased patients seemed to have a more uniform distribution.

For the age, bilirubin levels, prothrombin time, copper content in urine, and alkaline phosphate levels, a similar pattern was seen. There are many outliers, with a strong right tailed skew. The center of the living patients seem to be more dense, perhaps indicating a center closer to the healthy, normal values of these biomarkers. The variance of deceased patients seems to be much higher. The bilirubin levels seem to be the most different between the two, with prothrombin time being more similar, but still different, between the living and deceased patients. It is clear that bilirubin levels and prothrombin time contribute to severity of the disease. From all of these plots, which are variables observed in the correlation matrix, the factors appear to contribute to the severity of the disease.

Overall, the data are not apparently normal. There are heavy skews and differences between each of these continuous observations.

Boruta Analysis and Feature Selection

Boruta Analysis is performed through the existing R package. It is a machine algorithm that selects future performance. The model used to determine importance was the patient status with the cleaned factors, categorical and continuous. Since there is a mix of factors, this can setup a number of features for future model building.

The final decision on the variable importance is plotted and a datafrane of all of the confirmed values are listed.

4. Implications of the Analysis and Study

Conclusion - It is clear from the study that bilirubin levels, albumin levels, disease stage, prothrombin time, age, ascites, alkaline phosphate levels, copper levels in urine, and the remainder listed in the Boruta dataframe are important factors for patient outcome. These were observed as important in the exploratory data analysis, and are further supported in the Boruta dataframe.

The study also confirms that spiders and the drug type in this clinical trial did not contribute in a significant manner as observed in the previous analysis and discussion. For statisticians, data scientists, biostatisticians, medical reserachers, bioengineers, etc. that are researching treatment for Primary Biliary Cirrhosis, these factors can be identified as quantitatively different for more severe cases of patient outcome. The aforementioned factors are all apparent in terminal stages, and have high levels when comparing living and deceased patients. Since there is no direct cure to PBC, these factors can be flagged as important for determining if a patient is in a more severe stage of PBC.

5. Study Limitations and Future Applications

The data analyzed in the study are skewed, and from almost 40 years ago. It could be beneficial to run the clinical trial, and run this analysis to see if it is consistent. There are also some noise factors, as mentioned before the skew is towards more severe cases and there is a drug treatment which may be introducing some noise. There are also a large number of variables interacting with one another.

With the sample set, it is possible the sample is too small. Especially with the heavy data skews, it would be interesting to see if a larger sample size across different test centers would improve the skewness. This also can be problematic for a future model. If there are many factors at hand that contribute to patient status, it is likely that many data points will be necessary to use these features in a comprehensive predictive model.

There are many ways these selected features can be analyzed further.

A list of potential analysis to build off this study are listed below:

  • Machine Learning Algorithms of Features selected: e.g. CNN or Random Forest Model of Selected Features
  • Multivariate analysis of dataset variables
  • Multiple linear or logistic regression of features
  • ANOVA or MANOVA of quantitative variables
  • Categorical data analysis across the categorical factors observed
  • Development of further metrics and biomarkers that contribute to the severity of PBC
  • Further literature review of PBC to see if these features have been observed below, and if there are other studies to perform this study on
  • Data analysis of studies focused on the flagged features (those that contribute to patient status and disease criticality.)

Appendix

References

[1] Primary biliary cirrhosis / cholangitis: https://www.mayoclinic.org/diseases-conditions/primary-biliary-cholangitis/symptoms-causes/syc-20376874

[2] PBC Treatments: https://www.nhs.uk/conditions/primary-biliary-cirrhosis-pbc/treatment/

[3] Ludwig’s Classification (PBC Histological Stage) - NIH: https://www.ncbi.nlm.nih.gov/books/NBK534940/table/cl.app4.table11/

[4] Vanderbilt Biostatistics Datasets: https://hbiostat.org/data/

[5] PBC Mayo Clinical Trial Data Description: https://hbiostat.org/data/repo/Cpbc.html

[6] Fleming and Harrington, Counting Processes and Survival Analysis, Wiley, 1991. A more extended discussion can be found in Dickson, et al., Hepatology 10:1-7 (1989) and in Markus, et al., N Eng J of Med 320:1709-13 (1989).

[7] Boruta Analysis for those in a Hurry: https://cran.r-project.org/web/packages/Boruta/vignettes/inahurry.pdf