INTRODUCTION
For this assignment, we worked with a Cirrhosis Patient Survival dataset to explore how different factors relate to liver cirrhosis outcomes. The dataset includes records for 418 patients and 20 variables, such as bilirubin, cholesterol, albumin, copper levels, liver disease stage, and survival status. Using R, we generated descriptive statistics, visualizations, and correlation analyses to identify which variables are most informative for understanding patient outcomes in liver cirrhosis.
str(cirrhosis)
## 'data.frame': 418 obs. of 20 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ N_Days : int 400 4500 1012 1925 1504 2503 1832 2466 2400 51 ...
## $ Status : chr "D" "C" "D" "D" ...
## $ Drug : chr "D-penicillamine" "D-penicillamine" "D-penicillamine" "D-penicillamine" ...
## $ Age : int 21464 20617 25594 19994 13918 24201 20284 19379 15526 25772 ...
## $ Sex : chr "F" "F" "M" "F" ...
## $ Ascites : chr "Y" "N" "N" "N" ...
## $ Hepatomegaly : chr "Y" "Y" "N" "Y" ...
## $ Spiders : chr "Y" "Y" "N" "Y" ...
## $ Edema : chr "Y" "N" "S" "S" ...
## $ Bilirubin : num 14.5 1.1 1.4 1.8 3.4 0.8 1 0.3 3.2 12.6 ...
## $ Cholesterol : int 261 302 176 244 279 248 322 280 562 200 ...
## $ Albumin : num 2.6 4.14 3.48 2.54 3.53 3.98 4.09 4 3.08 2.74 ...
## $ Copper : int 156 54 210 64 143 50 52 52 79 140 ...
## $ Alk_Phos : num 1718 7395 516 6122 671 ...
## $ SGOT : num 137.9 113.5 96.1 60.6 113.2 ...
## $ Tryglicerides: int 172 88 55 92 72 63 213 189 88 143 ...
## $ Platelets : int 190 221 151 183 136 NA 204 373 251 302 ...
## $ Prothrombin : num 12.2 10.6 12 10.3 10.9 11 9.7 11 11 11.5 ...
## $ Stage : int 4 3 4 4 3 3 3 3 2 4 ...
The dataset contains 418 observations and 20 variables. Variables include both numeric types (Bilirubin, Cholesterol, Albumin, etc.) and character types (Status, Drug, Sex, Ascites). Note that Age is recorded in days, not years. The Status variable is our dependent variable indicating patient outcome (C = censored, CL = censored due to liver transplant, D = death).
names(cirrhosis)
## [1] "ID" "N_Days" "Status" "Drug"
## [5] "Age" "Sex" "Ascites" "Hepatomegaly"
## [9] "Spiders" "Edema" "Bilirubin" "Cholesterol"
## [13] "Albumin" "Copper" "Alk_Phos" "SGOT"
## [17] "Tryglicerides" "Platelets" "Prothrombin" "Stage"
The dataset contains 20 variables covering patient demographics (Age, Sex), clinical symptoms (Ascites, Hepatomegaly, Spiders, Edema), lab results (Bilirubin, Cholesterol, Albumin, Copper, SGOT), and disease progression indicators (Stage, N_Days).
head(cirrhosis, 15)
## ID N_Days Status Drug Age Sex Ascites Hepatomegaly Spiders
## 1 1 400 D D-penicillamine 21464 F Y Y Y
## 2 2 4500 C D-penicillamine 20617 F N Y Y
## 3 3 1012 D D-penicillamine 25594 M N N N
## 4 4 1925 D D-penicillamine 19994 F N Y Y
## 5 5 1504 CL Placebo 13918 F N Y Y
## 6 6 2503 D Placebo 24201 F N Y N
## 7 7 1832 C Placebo 20284 F N Y N
## 8 8 2466 D Placebo 19379 F N N N
## 9 9 2400 D D-penicillamine 15526 F N N Y
## 10 10 51 D Placebo 25772 F Y N Y
## 11 11 3762 D Placebo 19619 F N Y Y
## 12 12 304 D Placebo 21600 F N N Y
## 13 13 3577 C Placebo 16688 F N N N
## 14 14 1217 D Placebo 20535 M Y Y N
## 15 15 3584 D D-penicillamine 23612 F N N N
## Edema Bilirubin Cholesterol Albumin Copper Alk_Phos SGOT Tryglicerides
## 1 Y 14.5 261 2.60 156 1718.0 137.95 172
## 2 N 1.1 302 4.14 54 7394.8 113.52 88
## 3 S 1.4 176 3.48 210 516.0 96.10 55
## 4 S 1.8 244 2.54 64 6121.8 60.63 92
## 5 N 3.4 279 3.53 143 671.0 113.15 72
## 6 N 0.8 248 3.98 50 944.0 93.00 63
## 7 N 1.0 322 4.09 52 824.0 60.45 213
## 8 N 0.3 280 4.00 52 4651.2 28.38 189
## 9 N 3.2 562 3.08 79 2276.0 144.15 88
## 10 Y 12.6 200 2.74 140 918.0 147.25 143
## 11 N 1.4 259 4.16 46 1104.0 79.05 79
## 12 N 3.6 236 3.52 94 591.0 82.15 95
## 13 N 0.7 281 3.85 40 1181.0 88.35 130
## 14 Y 0.8 NA 2.27 43 728.0 71.00 NA
## 15 N 0.8 231 3.87 173 9009.8 127.71 96
## Platelets Prothrombin Stage
## 1 190 12.2 4
## 2 221 10.6 3
## 3 151 12.0 4
## 4 183 10.3 4
## 5 136 10.9 3
## 6 NA 11.0 3
## 7 204 9.7 3
## 8 373 11.0 3
## 9 251 11.0 2
## 10 302 11.5 4
## 11 258 12.0 4
## 12 71 13.6 4
## 13 244 10.6 3
## 14 156 11.0 4
## 15 295 11.0 3
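# Classify one bilirubin value (mg/dl); scalar input only, so use sapply() for vectors
bilirubin_category <- function(bilirubin) {
  if (bilirubin < 1.2) {
    return("Normal")
  } else if (bilirubin < 3.0) {
    # Reaching this branch already implies bilirubin >= 1.2
    return("Mildly Elevated")
  } else {
    return("Severely Elevated")
  }
}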
bilirubin_category(0.6)
## [1] "Normal"
bilirubin_category(2.5)
## [1] "Mildly Elevated"
bilirubin_category(4.5)
## [1] "Severely Elevated"
results <- data.frame(
Bilirubin = cirrhosis$Bilirubin[1:10],
Category = sapply(cirrhosis$Bilirubin[1:10], bilirubin_category))
print(results)
## Bilirubin Category
## 1 14.5 Severely Elevated
## 2 1.1 Normal
## 3 1.4 Mildly Elevated
## 4 1.8 Mildly Elevated
## 5 3.4 Severely Elevated
## 6 0.8 Normal
## 7 1.0 Normal
## 8 0.3 Normal
## 9 3.2 Severely Elevated
## 10 12.6 Severely Elevated
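A custom function bilirubin_category() was created using the Bilirubin variable. It classifies patients as Normal (< 1.2 mg/dl), Mildly Elevated (1.2 to < 3.0 mg/dl), or Severely Elevated (≥ 3.0 mg/dl). The 1.2 mg/dl cut-off corresponds to a commonly cited upper limit of normal for serum bilirubin; the 3.0 mg/dl cut-off is the threshold chosen for this analysis. Because the function uses if/else, it accepts one value at a time and is applied to a vector with sapply().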
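As a complement to the scalar function, the same classification can be written in vectorized form. The sketch below is one possible alternative, not the method used in this report: bilirubin_category_vec is a hypothetical name, and the input values are the first ten Bilirubin readings from the preview. It uses base R's cut() with the same cut-offs and leaves missing values as NA instead of raising an error.

```r
# Vectorized sketch of the same three categories (hypothetical helper name).
# right = FALSE closes each interval on the left: [-Inf, 1.2), [1.2, 3), [3, Inf)
bilirubin_category_vec <- function(bilirubin) {
  cut(
    bilirubin,
    breaks = c(-Inf, 1.2, 3.0, Inf),
    labels = c("Normal", "Mildly Elevated", "Severely Elevated"),
    right = FALSE
  )
}

# First ten Bilirubin values from the dataset preview
bili <- c(14.5, 1.1, 1.4, 1.8, 3.4, 0.8, 1.0, 0.3, 3.2, 12.6)
categories <- bilirubin_category_vec(bili)

# Count of patients per category
table(categories)
```

The result is a factor, so the categories keep their clinical ordering in tables and plots.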
library(dplyr)  # provides %>%, filter(), select(), mutate(), arrange(), rename()
# Filter high-risk patients: Stage 2 AND Bilirubin > 4
high_risk <- cirrhosis %>%
  filter(Stage == 2 & Bilirubin > 4.0)
cat("Number of high-risk patients (Stage 2 & Bilirubin > 4):", nrow(high_risk), "\n")
## Number of high-risk patients (Stage 2 & Bilirubin > 4): 11
head(high_risk, 10)
## ID N_Days Status Drug Age Sex Ascites Hepatomegaly Spiders Edema
## 1 31 3839 D Placebo 15177 F N Y N N
## 2 95 130 D Placebo 16944 F Y Y Y Y
## 3 156 853 D Placebo 21699 F N Y N N
## 4 166 2721 C Placebo 15105 F N Y N N
## 5 288 1067 CL Placebo 17874 F N Y N S
## 6 312 788 C Placebo 12109 F N N Y N
## 7 338 791 D <NA> 17167 F <NA> <NA> <NA> N
## 8 340 3495 C <NA> 19358 F <NA> <NA> <NA> N
## 9 343 625 D <NA> 17532 F <NA> <NA> <NA> N
## 10 362 2267 CL <NA> 17897 F <NA> <NA> <NA> N
## Bilirubin Cholesterol Albumin Copper Alk_Phos SGOT Tryglicerides Platelets
## 1 4.7 296 3.44 114 9933.2 206.40 101 195
## 2 17.4 NA 2.64 182 559.0 119.35 NA 401
## 3 25.5 358 3.52 219 2468.0 201.50 205 151
## 4 5.7 1480 3.26 84 1960.0 457.25 108 213
## 5 8.7 310 3.89 107 637.0 117.00 242 298
## 6 6.4 576 3.79 186 2115.0 136.00 149 200
## 7 16.0 NA 3.42 NA NA NA NA 475
## 8 5.4 NA 4.19 NA NA NA NA 141
## 9 11.1 NA 2.84 NA NA NA NA NA
## 10 18.0 NA 3.04 NA NA NA NA 432
## Prothrombin Stage
## 1 10.3 2
## 2 11.7 2
## 3 11.5 2
## 4 9.5 2
## 5 9.6 2
## 6 10.8 2
## 7 13.8 2
## 8 11.2 2
## 9 12.2 2
## 10 9.7 2
# Filter deceased male patients (Status codes are upper case; "D" = death)
deceased_male <- cirrhosis %>%
  filter(Status == "D" & Sex == "M")
cat("Number of deceased male patients:", nrow(deceased_male), "\n")
head(deceased_male, 10)
Here we applied two filters. The first selects patients in Stage 2 whose bilirubin exceeds 4.0 mg/dl, that is, patients with markedly elevated bilirubin at a relatively early disease stage; 11 patients meet both conditions. The second isolates deceased male patients (Status "D", Sex "M") to examine sex-specific outcomes in the dataset. Status codes are case-sensitive: the recorded run compared Status against a lower-case "c", which matches no rows, so the 0-row output it produced is not reproduced here.
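String comparisons in R are case-sensitive, so it helps to inspect the codes a column actually contains before filtering on them. The following is a minimal base-R sketch on the first 15 Status and Sex values copied from the head() preview; the data frame name preview is introduced here for illustration only, and the counts apply to these 15 rows, not to the full dataset.

```r
# First 15 Status and Sex values, copied from the head() preview
preview <- data.frame(
  Status = c("D", "C", "D", "D", "CL", "D", "C", "D", "D", "D", "D", "D", "C", "D", "D"),
  Sex    = c("F", "F", "M", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "M", "F")
)

# Inspect the codes that are actually present before writing a filter
unique(preview$Status)

# A lower-case code matches nothing
n_lower <- sum(preview$Status == "c")

# Deceased male patients within the preview rows
deceased_male_preview <- subset(preview, Status == "D" & Sex == "M")
nrow(deceased_male_preview)
```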
# Dependent variable: Status (patient outcome)
# Independent variables: Bilirubin, Cholesterol, Albumin, Age, Stage, etc.
# Create data frame 1 — key lab results (independent variables)
df_labs <- cirrhosis %>%
select(Bilirubin, Cholesterol, Albumin, Copper, SGOT) %>%
mutate(PatientID = row_number())
# Create data frame 2 — patient info and outcome (dependent variable)
df_outcome <- cirrhosis %>%
select(Status, Age, Stage, Sex, N_Days) %>%
mutate(PatientID = row_number())
# Join the two data frames using inner_join (reshaping technique)
df_joined <- inner_join(df_labs, df_outcome, by = "PatientID")
cat("Joined data frame:", nrow(df_joined), "rows x", ncol(df_joined), "columns\n")
## Joined data frame: 418 rows x 11 columns
head(df_joined, 10)
## Bilirubin Cholesterol Albumin Copper SGOT PatientID Status Age Stage Sex
## 1 14.5 261 2.60 156 137.95 1 D 21464 4 F
## 2 1.1 302 4.14 54 113.52 2 C 20617 3 F
## 3 1.4 176 3.48 210 96.10 3 D 25594 4 M
## 4 1.8 244 2.54 64 60.63 4 D 19994 4 F
## 5 3.4 279 3.53 143 113.15 5 CL 13918 3 F
## 6 0.8 248 3.98 50 93.00 6 D 24201 3 F
## 7 1.0 322 4.09 52 60.45 7 C 20284 3 F
## 8 0.3 280 4.00 52 28.38 8 D 19379 3 F
## 9 3.2 562 3.08 79 144.15 9 D 15526 2 F
## 10 12.6 200 2.74 140 147.25 10 D 25772 4 F
## N_Days
## 1 400
## 2 4500
## 3 1012
## 4 1925
## 5 1504
## 6 2503
## 7 1832
## 8 2466
## 9 2400
## 10 51
Status is the dependent variable representing patient outcome. The dataset was split into two data frames, one containing lab results and one containing patient demographics and outcomes, each tagged with a PatientID built from the row number. The two were then recombined with inner_join() on PatientID as the reshaping step, which returns all 418 rows with 11 columns.
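A join keyed on row position only lines up correctly while both data frames keep their original row order. As a hedged illustration, the sketch below joins on the patient identifier instead, using base R's merge() so that it runs without extra packages. The three rows are taken from the preview (patients 1 to 3), and the second frame is deliberately shuffled; the object names are hypothetical.

```r
# Lab values for patients 1-3 (from the preview)
labs <- data.frame(ID = c(1, 2, 3), Bilirubin = c(14.5, 1.1, 1.4))

# Outcomes for the same patients, with rows deliberately out of order
outcome <- data.frame(ID = c(3, 1, 2), Status = c("D", "D", "C"))

# Inner join on the shared key; merge() keeps only IDs present in both frames
joined <- merge(labs, outcome, by = "ID")
joined
```

Because the match is made on ID, each Bilirubin value stays attached to the right Status even though the input rows were in a different order.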
# Check missing values in each column
cat("Missing values per column:\n")
## Missing values per column:
print(colSums(is.na(cirrhosis)))
## ID N_Days Status Drug Age
## 0 0 0 106 0
## Sex Ascites Hepatomegaly Spiders Edema
## 0 106 106 106 0
## Bilirubin Cholesterol Albumin Copper Alk_Phos
## 0 134 0 108 106
## SGOT Tryglicerides Platelets Prothrombin Stage
## 106 136 11 2 6
# Remove rows with any missing values
cirrhosis_clean <- na.omit(cirrhosis)
cat("\nRows before removing NA:", nrow(cirrhosis), "\n")
##
## Rows before removing NA: 418
cat("Rows after removing NA:", nrow(cirrhosis_clean), "\n")
## Rows after removing NA: 276
cat("Rows removed:", nrow(cirrhosis) - nrow(cirrhosis_clean), "\n")
## Rows removed: 142
The dataset has substantial missing data in several columns, most notably Tryglicerides (136 missing), Cholesterol (134), Copper (108), and Drug (106); Ascites, Hepatomegaly, Spiders, Alk_Phos, and SGOT are each missing 106 values as well. After applying na.omit(), the dataset was reduced from 418 to 276 complete records (142 rows removed), which are used for the remainder of the analysis.
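Dropping every row that has any missing value removes about a third of the patients. One alternative, sketched here under the assumption that a given analysis needs only a few columns, is to require completeness only in those columns. The four rows below are patients 5, 6, 13, and 14 from the preview; the object names and the choice of "needed" columns are illustrative.

```r
# Patients 5, 6, 13 and 14 from the preview (two of them have missing values)
d <- data.frame(
  ID          = c(5, 6, 13, 14),
  Bilirubin   = c(3.4, 0.8, 0.7, 0.8),
  Cholesterol = c(279, 248, 281, NA),
  Platelets   = c(136, NA, 244, 156)
)

# Share of missing values per column
na_share <- colMeans(is.na(d))
na_share

# Option 1: drop rows with a missing value in any column
all_complete <- na.omit(d)

# Option 2: drop rows only if a column needed for the analysis is missing
needed <- c("Bilirubin", "Platelets")
subset_complete <- d[complete.cases(d[, needed]), ]

c(any_column = nrow(all_complete), needed_only = nrow(subset_complete))
```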
# Check for duplicated rows
cat("Number of duplicate rows:", sum(duplicated(cirrhosis_clean)), "\n")
## Number of duplicate rows: 0
# Remove duplicates
cirrhosis_clean <- cirrhosis_clean[!duplicated(cirrhosis_clean), ]
cat("Rows after removing duplicates:", nrow(cirrhosis_clean), "\n")
## Rows after removing duplicates: 276
The dataset was checked for duplicate rows using duplicated(). Since each row represents a unique patient with a unique ID, no duplicate records were found after cleaning.
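Note that duplicated() on a whole data frame flags a row only when every column matches an earlier row. A short sketch with made-up values (the patients data frame below is hypothetical, not taken from the dataset) shows how a repeated ID with slightly different measurements would pass the full-row check but be caught by a check on the ID column alone.

```r
# Hypothetical records: ID 2 appears twice with different Bilirubin values
patients <- data.frame(ID = c(1, 2, 2), Bilirubin = c(14.5, 1.1, 1.2))

# Full-row check: no row is an exact copy of another
full_row_dups <- sum(duplicated(patients))

# Key-only check: the second ID 2 is flagged
id_dups <- sum(duplicated(patients$ID))

c(full_row = full_row_dups, by_id = id_dups)
```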
# Sort by Copper descending (highest risk first)
cirrhosis_sorted <- cirrhosis_clean %>%
arrange(desc(Copper), desc(Stage))
cat("Dataset sorted by Copper (descending) and Stage (descending):\n")
## Dataset sorted by Copper (descending) and Stage (descending):
head(cirrhosis_sorted[, c("ID", "Copper", "Stage", "Status", "Age")], 15)
## ID Copper Stage Status Age
## 1 18 588 4 D 19698
## 2 23 558 4 D 20442
## 3 22 464 4 D 20555
## 4 120 444 3 CL 12839
## 5 233 412 4 C 15591
## 6 253 380 4 C 28650
## 7 184 358 3 D 13736
## 8 241 308 4 CL 15112
## 9 149 290 4 D 22574
## 10 48 281 3 C 17947
## 11 74 280 4 D 18964
## 12 80 269 4 D 24622
## 13 193 267 4 D 20736
## 14 54 262 4 D 14317
## 15 138 262 3 D 18719
The dataset was sorted in descending order by Copper, with ties broken by descending Stage. This brings the patients with the highest Copper values, most of them in Stage 4, to the top of the dataset.
cirrhosis_renamed <- cirrhosis_clean %>%
rename(
PID = ID,
SurDays = N_Days,
PtStatus = Status,
TreatmentDrug = Drug,
PtAge = Age,
PtSex = Sex,
AbdominalFluid = Ascites,
EnlargedLiver = Hepatomegaly,
SkinSpiders = Spiders,
SrBilirubin = Bilirubin,
SrCholesterol = Cholesterol,
SrAlbumin = Albumin,
UrineCopper = Copper,
DiseaseStage = Stage )
cat("Renamed column names:\n")
## Renamed column names:
print(names(cirrhosis_renamed))
## [1] "PID" "SurDays" "PtStatus" "TreatmentDrug"
## [5] "PtAge" "PtSex" "AbdominalFluid" "EnlargedLiver"
## [9] "SkinSpiders" "Edema" "SrBilirubin" "SrCholesterol"
## [13] "SrAlbumin" "UrineCopper" "Alk_Phos" "SGOT"
## [17] "Tryglicerides" "Platelets" "Prothrombin" "DiseaseStage"
Key columns were renamed for improved readability. For example, N_Days became SurDays, Ascites became AbdominalFluid, and Hepatomegaly became EnlargedLiver, making the dataset more self-explanatory for readers unfamiliar with clinical abbreviations.
# 1. Bilirubin_Double: Bilirubin multiplied by 2 (as required by rubric)
cirrhosis_clean$Bilirubin_Double <- cirrhosis_clean$Bilirubin * 2
# 2. Albumin_Bilirubin_Ratio: ratio of Albumin to Bilirubin (liver health index)
cirrhosis_clean$Albumin_Bilirubin_Ratio <- round(cirrhosis_clean$Albumin / cirrhosis_clean$Bilirubin, 3)
# 3. Age_Years: convert Age from days to years
cirrhosis_clean$Age_Years <- round(cirrhosis_clean$Age / 365, 1)
# 4. Risk_Score: weighted formula using key clinical variables
cirrhosis_clean$Risk_Score <- round(
0.4 * cirrhosis_clean$Bilirubin +
0.3 * cirrhosis_clean$Prothrombin +
0.3 * cirrhosis_clean$Stage, 3
)
# Preview the new variables
head(cirrhosis_clean[, c("Bilirubin", "Albumin", "Age",
"Bilirubin_Double",
"Albumin_Bilirubin_Ratio",
"Age_Years", "Risk_Score")], 10)
## Bilirubin Albumin Age Bilirubin_Double Albumin_Bilirubin_Ratio Age_Years
## 1 14.5 2.60 21464 29.0 0.179 58.8
## 2 1.1 4.14 20617 2.2 3.764 56.5
## 3 1.4 3.48 25594 2.8 2.486 70.1
## 4 1.8 2.54 19994 3.6 1.411 54.8
## 5 3.4 3.53 13918 6.8 1.038 38.1
## 7 1.0 4.09 20284 2.0 4.090 55.6
## 8 0.3 4.00 19379 0.6 13.333 53.1
## 9 3.2 3.08 15526 6.4 0.963 42.5
## 10 12.6 2.74 25772 25.2 0.217 70.6
## 11 1.4 4.16 19619 2.8 2.971 53.8
## Risk_Score
## 1 10.66
## 2 4.52
## 3 5.36
## 4 5.01
## 5 5.53
## 7 4.21
## 8 4.32
## 9 5.18
## 10 9.69
## 11 5.36
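Four new variables were added. Bilirubin_Double multiplies Bilirubin by 2 as required by the rubric. Albumin_Bilirubin_Ratio is a simple ratio of Albumin to Bilirubin, used here as a rough liver health index (lower values mean low albumin relative to bilirubin). Age_Years converts the original age in days to a more readable age in years by dividing by 365. Risk_Score is a custom weighted sum, 0.4 × Bilirubin + 0.3 × Prothrombin + 0.3 × Stage; it is an ad hoc formula defined for this assignment, not a validated clinical score.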
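To make the Risk_Score formula concrete, the sketch below wraps it in a small function (risk_score is a hypothetical helper name) and evaluates it for the first two patients in the preview. For patient 1 the terms are 0.4 × 14.5 = 5.8, 0.3 × 12.2 = 3.66, and 0.3 × 4 = 1.2, which sum to 10.66.

```r
# Weighted sum used for Risk_Score, wrapped in a helper for illustration
risk_score <- function(bilirubin, prothrombin, stage) {
  round(0.4 * bilirubin + 0.3 * prothrombin + 0.3 * stage, 3)
}

# Patient 1: Bilirubin 14.5, Prothrombin 12.2, Stage 4
score_1 <- risk_score(14.5, 12.2, 4)

# Patient 2: Bilirubin 1.1, Prothrombin 10.6, Stage 3
score_2 <- risk_score(1.1, 10.6, 3)

c(patient_1 = score_1, patient_2 = score_2)
```

Because the three inputs are on different scales, the prothrombin term contributes roughly 3 to 4 points for every patient, while bilirubin drives most of the difference between patients at high values.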
# Set the random seed so the split is reproducible
set.seed(1234)
# Create 70/30 train-test split
n_rows <- nrow(cirrhosis_clean)
train_index <- sample(seq_len(n_rows), size = floor(0.70 * n_rows))
TrainingSet <- cirrhosis_clean[train_index, ]
TestingSet <- cirrhosis_clean[-train_index, ]
cat("Total rows in cleaned dataset:", nrow(cirrhosis_clean), "\n")
## Total rows in cleaned dataset: 276
dim(TrainingSet)
## [1] 193 24
dim(TestingSet)
## [1] 83 24
set.seed(1234) fixes the seed of R's random number generator so that the same rows are drawn on every run. With that seed, 70% of the cleaned dataset (193 rows) was assigned to the training set and the remaining 30% (83 rows) to the testing set. The training set can be used for building predictive models on cirrhosis outcomes, and the testing set for evaluating them.
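A simple random split does not guarantee that each outcome group appears in the same proportion in both sets, which matters when one group (such as CL) is small. The sketch below shows one way to stratify the split by outcome in base R. It is an optional variation, not the split used in this report, and the status vector is made up for illustration (50 C, 10 CL, 40 D).

```r
set.seed(1234)

# Hypothetical outcome vector; the group sizes are invented for this example
status <- rep(c("C", "CL", "D"), times = c(50, 10, 40))

# Row indices grouped by outcome
groups <- split(seq_along(status), status)

# Draw 70% of the indices within each group, then combine
train_index <- unlist(lapply(groups, function(idx) {
  idx[sample.int(length(idx), size = round(0.70 * length(idx)))]
}), use.names = FALSE)

train_status <- status[train_index]
test_status <- status[-train_index]

# Each group keeps its 70/30 proportion
table(train_status)
table(test_status)
```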
summary(cirrhosis_clean)
## ID N_Days Status Drug
## Min. : 1.00 Min. : 41 Length:276 Length:276
## 1st Qu.: 79.75 1st Qu.:1186 Class :character Class :character
## Median :157.50 Median :1788 Mode :character Mode :character
## Mean :158.62 Mean :1979
## 3rd Qu.:240.25 3rd Qu.:2690
## Max. :312.00 Max. :4556
## Age Sex Ascites Hepatomegaly
## Min. : 9598 Length:276 Length:276 Length:276
## 1st Qu.:15162 Class :character Class :character Class :character
## Median :18157 Mode :character Mode :character Mode :character
## Mean :18189
## 3rd Qu.:20668
## Max. :28650
## Spiders Edema Bilirubin Cholesterol
## Length:276 Length:276 Min. : 0.300 Min. : 120.0
## Class :character Class :character 1st Qu.: 0.800 1st Qu.: 249.5
## Mode :character Mode :character Median : 1.400 Median : 310.0
## Mean : 3.334 Mean : 371.3
## 3rd Qu.: 3.525 3rd Qu.: 401.0
## Max. :28.000 Max. :1775.0
## Albumin Copper Alk_Phos SGOT
## Min. :1.960 Min. : 4.00 Min. : 289.0 Min. : 28.38
## 1st Qu.:3.310 1st Qu.: 42.75 1st Qu.: 922.5 1st Qu.: 82.46
## Median :3.545 Median : 74.00 Median : 1277.5 Median :116.62
## Mean :3.517 Mean :100.77 Mean : 1996.6 Mean :124.12
## 3rd Qu.:3.772 3rd Qu.:129.25 3rd Qu.: 2068.2 3rd Qu.:153.45
## Max. :4.400 Max. :588.00 Max. :13862.4 Max. :457.25
## Tryglicerides Platelets Prothrombin Stage
## Min. : 33.0 Min. : 62.0 Min. : 9.00 Min. :1.00
## 1st Qu.: 85.0 1st Qu.:200.0 1st Qu.:10.00 1st Qu.:2.00
## Median :108.0 Median :257.0 Median :10.60 Median :3.00
## Mean :125.0 Mean :261.8 Mean :10.74 Mean :3.04
## 3rd Qu.:151.2 3rd Qu.:318.2 3rd Qu.:11.20 3rd Qu.:4.00
## Max. :598.0 Max. :563.0 Max. :17.10 Max. :4.00
## Bilirubin_Double Albumin_Bilirubin_Ratio Age_Years Risk_Score
## Min. : 0.600 Min. : 0.1160 Min. :26.30 Min. : 3.440
## 1st Qu.: 1.600 1st Qu.: 0.9543 1st Qu.:41.55 1st Qu.: 4.277
## Median : 2.800 Median : 2.5535 Median :49.75 Median : 4.795
## Mean : 6.667 Mean : 3.0045 Mean :49.83 Mean : 5.466
## 3rd Qu.: 7.050 3rd Qu.: 4.3780 3rd Qu.:56.62 3rd Qu.: 5.763
## Max. :56.000 Max. :13.6000 Max. :78.50 Max. :15.560
The summary statistics reveal key patterns in the dataset. The mean Bilirubin is approximately 3.33 mg/dl, well above the median of 1.4 mg/dl, and the median Stage is 3, so at least half of the patients are in Stage 3 or 4. Cholesterol (120 to 1775) and Copper (4 to 588) show wide ranges, indicating high variability in lab results across patients.
# Custom mode function (R has no built-in statistical mode)
get_mode <- function(x) {
uniq_vals <- unique(x)
uniq_vals[which.max(tabulate(match(x, uniq_vals)))]
}
cat("=== Bilirubin ===\n")
## === Bilirubin ===
cat("Mean: ", round(mean(cirrhosis_clean$Bilirubin), 2), "\n")
## Mean: 3.33
cat("Median:", median(cirrhosis_clean$Bilirubin), "\n")
## Median: 1.4
cat("Mode: ", get_mode(cirrhosis_clean$Bilirubin), "\n")
## Mode: 0.7
cat("Range: ", range(cirrhosis_clean$Bilirubin), "\n\n")
## Range: 0.3 28
cat("=== Cholesterol ===\n")
## === Cholesterol ===
cat("Mean: ", round(mean(cirrhosis_clean$Cholesterol), 2), "\n")
## Mean: 371.26
cat("Median:", median(cirrhosis_clean$Cholesterol), "\n")
## Median: 310
cat("Mode: ", get_mode(cirrhosis_clean$Cholesterol), "\n")
## Mode: 260
cat("Range: ", range(cirrhosis_clean$Cholesterol), "\n")
## Range: 120 1775
For Bilirubin, the mean (3.33) is well above the median (1.4), suggesting a right-skewed distribution: a few patients have very high bilirubin levels that pull the average up. For Cholesterol, the wide range (120 to 1775) reflects substantial variation in lipid metabolism among cirrhosis patients.
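Two details of these statistics are worth illustrating. First, the gap between mean and median already appears in the first ten Bilirubin values from the preview. Second, get_mode() returns a single value even when several values are tied for most frequent; it reports whichever tied value appears first. The sketch below restates get_mode() so that it runs on its own and adds a hypothetical helper, get_all_modes, that returns every tied value.

```r
# First ten Bilirubin values from the dataset preview
bili <- c(14.5, 1.1, 1.4, 1.8, 3.4, 0.8, 1.0, 0.3, 3.2, 12.6)
bili_mean <- mean(bili)
bili_median <- median(bili)
c(mean = bili_mean, median = bili_median)

# Same definition as in the report, repeated so this block is self-contained
get_mode <- function(x) {
  uniq_vals <- unique(x)
  uniq_vals[which.max(tabulate(match(x, uniq_vals)))]
}

# Hypothetical variant that returns all values tied for the highest count
get_all_modes <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[counts == max(counts)])
}

tied <- c(0.7, 0.8, 0.8, 0.7)
first_mode <- get_mode(tied)
all_modes <- get_all_modes(tied)
```

For a continuous lab value with many distinct readings, the mode is sensitive to rounding, so it is best read alongside the median and not on its own.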
library(ggplot2)  # provides ggplot(), geom_point(), geom_bar(), labs(), theme_minimal()
ggplot(cirrhosis_clean, aes(x = Age_Years, y = Bilirubin, color = Status)) +
geom_point(alpha = 0.6, size = 2) +
labs(
title = "Scatter Plot: Age vs Bilirubin Level",
x = "Age (Years)",
y = "Serum Bilirubin (mg/dl)",
color = "Patient Status"
) +
scale_color_manual(
values = c("C" = "steelblue", "CL" = "orange", "D" = "tomato"),
labels = c("Censored", "Liver Transplant", "Deceased")
) +
theme_minimal()
The scatter plot shows the relationship between patient age and serum bilirubin, colored by survival status. Deceased patients (red) tend to cluster at higher bilirubin levels across all age groups, suggesting that elevated bilirubin is associated with mortality. Patients who received liver transplants (orange) also show elevated bilirubin, consistent with the severity of their condition. Bilirubin therefore looks like a meaningful candidate predictor of patient outcomes in cirrhosis.
# Average Bilirubin by Patient Status
bar_data <- cirrhosis_clean %>%
group_by(Status) %>%
summarise(avg_bilirubin = mean(Bilirubin))
ggplot(bar_data, aes(x = factor(Status), y = avg_bilirubin, fill = factor(Status))) +
geom_bar(stat = "identity", width = 0.5) +
geom_text(aes(label = round(avg_bilirubin, 2)), vjust = -0.5, size = 4, fontface = "bold") +
labs(
title = "Average Bilirubin by Patient Status",
x = "Patient Status",
y = "Average Bilirubin (mg/dl)",
fill = "Status"
) +
scale_x_discrete(labels = c("Censored", "Liver Transplant", "Deceased")) +
scale_fill_manual(
values = c("C" = "steelblue", "CL" = "orange", "D" = "tomato"),
labels = c("Censored", "Liver Transplant", "Deceased")
) +
theme_minimal()
The bar plot compares the average bilirubin level across the three patient status groups. Deceased patients have the highest average bilirubin, followed by liver transplant patients, while censored patients (those not known to have died during follow-up) have the lowest levels. This pattern supports bilirubin as a key clinical marker for cirrhosis mortality, although a bar of group means does not show the spread within each group.
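Since bilirubin is right-skewed, a group mean can be pulled up by a few extreme values. As a hedged illustration, the sketch below computes the group size, mean, and median side by side with base R's tapply(), using only the first 15 rows from the preview. The numbers describe those 15 rows, not the full dataset.

```r
# First 15 Bilirubin and Status values, copied from the head() preview
bili <- c(14.5, 1.1, 1.4, 1.8, 3.4, 0.8, 1.0, 0.3, 3.2, 12.6, 1.4, 3.6, 0.7, 0.8, 0.8)
status <- c("D", "C", "D", "D", "CL", "D", "C", "D", "D", "D", "D", "D", "C", "D", "D")

# Size, mean and median of Bilirubin within each Status group
group_n <- tapply(bili, status, length)
group_mean <- tapply(bili, status, mean)
group_median <- tapply(bili, status, median)

data.frame(
  n = as.vector(group_n),
  mean = round(as.vector(group_mean), 2),
  median = as.vector(group_median),
  row.names = names(group_n)
)
```

Reporting the median and group size next to the mean makes it easier to judge whether a difference between bars reflects the typical patient or a handful of outliers.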
# Pearson correlation between Bilirubin and Prothrombin
cor_value <- cor(cirrhosis_clean$Bilirubin, cirrhosis_clean$Prothrombin, method = "pearson")
cat("Pearson Correlation (Bilirubin vs Prothrombin):", round(cor_value, 4), "\n")
## Pearson Correlation (Bilirubin vs Prothrombin): 0.3312
# Full correlation test with p-value and confidence interval
cor.test(cirrhosis_clean$Bilirubin, cirrhosis_clean$Prothrombin, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: cirrhosis_clean$Bilirubin and cirrhosis_clean$Prothrombin
## t = 5.8098, df = 274, p-value = 1.732e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2217790 0.4323401
## sample estimates:
## cor
## 0.3311762
The Pearson correlation between Bilirubin and Prothrombin time is positive (r = 0.33, 95% CI 0.22 to 0.43), indicating that as bilirubin levels increase, prothrombin time also tends to increase. The association is moderate in strength: bilirubin accounts for only about 11% of the variance in prothrombin time (r² ≈ 0.11). Both are liver function markers. When the liver is severely damaged, it fails to regulate both bilirubin clearance and blood clotting (measured by prothrombin), making this a clinically meaningful relationship. The p-value of 1.7e-08, far below 0.05, indicates that the correlation is statistically significant.
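Pearson's r measures linear association and can be influenced by extreme values, which matters here because bilirubin is right-skewed. A rank-based Spearman correlation is a common robustness check. The sketch below runs it on the first ten Bilirubin and Prothrombin values from the str() output, purely to show the call; ten rows are far too few to draw conclusions, and this check is an addition for illustration, not part of the original analysis.

```r
# First ten Bilirubin and Prothrombin values from the str() output
bilirubin <- c(14.5, 1.1, 1.4, 1.8, 3.4, 0.8, 1.0, 0.3, 3.2, 12.6)
prothrombin <- c(12.2, 10.6, 12.0, 10.3, 10.9, 11.0, 9.7, 11.0, 11.0, 11.5)

# Spearman's rho: Pearson correlation computed on the ranks
rho <- cor(bilirubin, prothrombin, method = "spearman")

# exact = FALSE avoids the warning about exact p-values when ranks are tied
spearman_test <- cor.test(bilirubin, prothrombin, method = "spearman", exact = FALSE)

rho
spearman_test$p.value
```

On the full cleaned dataset the same two calls would take cirrhosis_clean$Bilirubin and cirrhosis_clean$Prothrombin as inputs.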