COMP4028 – Assignment 1: Data Analysis using R Programming

INTRODUCTION

For this assignment, we worked with a Cirrhosis Patient Survival dataset to explore how different factors relate to liver cirrhosis outcomes. The dataset includes records for 418 patients and 20 variables, such as bilirubin, cholesterol, albumin, copper levels, liver disease stage, and survival status. Using R, we generated descriptive statistics, visualizations, and correlation analyses to identify which variables are most informative for understanding patient outcomes in liver cirrhosis.

str(cirrhosis)

## 'data.frame':    418 obs. of  20 variables:
##  $ ID           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ N_Days       : int  400 4500 1012 1925 1504 2503 1832 2466 2400 51 ...
##  $ Status       : chr  "D" "C" "D" "D" ...
##  $ Drug         : chr  "D-penicillamine" "D-penicillamine" "D-penicillamine" "D-penicillamine" ...
##  $ Age          : int  21464 20617 25594 19994 13918 24201 20284 19379 15526 25772 ...
##  $ Sex          : chr  "F" "F" "M" "F" ...
##  $ Ascites      : chr  "Y" "N" "N" "N" ...
##  $ Hepatomegaly : chr  "Y" "Y" "N" "Y" ...
##  $ Spiders      : chr  "Y" "Y" "N" "Y" ...
##  $ Edema        : chr  "Y" "N" "S" "S" ...
##  $ Bilirubin    : num  14.5 1.1 1.4 1.8 3.4 0.8 1 0.3 3.2 12.6 ...
##  $ Cholesterol  : int  261 302 176 244 279 248 322 280 562 200 ...
##  $ Albumin      : num  2.6 4.14 3.48 2.54 3.53 3.98 4.09 4 3.08 2.74 ...
##  $ Copper       : int  156 54 210 64 143 50 52 52 79 140 ...
##  $ Alk_Phos     : num  1718 7395 516 6122 671 ...
##  $ SGOT         : num  137.9 113.5 96.1 60.6 113.2 ...
##  $ Tryglicerides: int  172 88 55 92 72 63 213 189 88 143 ...
##  $ Platelets    : int  190 221 151 183 136 NA 204 373 251 302 ...
##  $ Prothrombin  : num  12.2 10.6 12 10.3 10.9 11 9.7 11 11 11.5 ...
##  $ Stage        : int  4 3 4 4 3 3 3 3 2 4 ...

The dataset contains 418 observations and 20 variables. Variables include both numeric types (Bilirubin, Cholesterol, Albumin, etc.) and character types (Status, Drug, Sex, Ascites). The Status variable is our dependent variable indicating patient outcome (C = Censored, CL = Liver Transplant, D = Death).
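Because Status is stored as a character column, it can help to convert it to a labelled factor before modelling or plotting. The following is a minimal sketch, not part of the original analysis; it uses only the first five Status values shown in the str() output, and the name status_f is introduced here for illustration.

```r
# First five Status values, copied from the output above
status <- c("D", "C", "D", "D", "CL")

# Convert to a factor with explicit, human-readable labels
status_f <- factor(
  status,
  levels = c("C", "CL", "D"),
  labels = c("Censored", "Liver Transplant", "Death")
)

# Count how many patients fall into each outcome group
table(status_f)
```

Fixing the level order explicitly keeps tables and plot legends in the same order regardless of which values happen to appear first in the data.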

names(cirrhosis)

##  [1] "ID"            "N_Days"        "Status"        "Drug"         
##  [5] "Age"           "Sex"           "Ascites"       "Hepatomegaly" 
##  [9] "Spiders"       "Edema"         "Bilirubin"     "Cholesterol"  
## [13] "Albumin"       "Copper"        "Alk_Phos"      "SGOT"         
## [17] "Tryglicerides" "Platelets"     "Prothrombin"   "Stage"

The dataset contains 20 variables covering patient demographics (Age, Sex), clinical symptoms (Ascites, Hepatomegaly, Spiders, Edema), lab results (Bilirubin, Cholesterol, Albumin, Copper, SGOT), and disease progression indicators (Stage, N_Days).

head(cirrhosis, 15)

##    ID N_Days Status            Drug   Age Sex Ascites Hepatomegaly Spiders
## 1   1    400      D D-penicillamine 21464   F       Y            Y       Y
## 2   2   4500      C D-penicillamine 20617   F       N            Y       Y
## 3   3   1012      D D-penicillamine 25594   M       N            N       N
## 4   4   1925      D D-penicillamine 19994   F       N            Y       Y
## 5   5   1504     CL         Placebo 13918   F       N            Y       Y
## 6   6   2503      D         Placebo 24201   F       N            Y       N
## 7   7   1832      C         Placebo 20284   F       N            Y       N
## 8   8   2466      D         Placebo 19379   F       N            N       N
## 9   9   2400      D D-penicillamine 15526   F       N            N       Y
## 10 10     51      D         Placebo 25772   F       Y            N       Y
## 11 11   3762      D         Placebo 19619   F       N            Y       Y
## 12 12    304      D         Placebo 21600   F       N            N       Y
## 13 13   3577      C         Placebo 16688   F       N            N       N
## 14 14   1217      D         Placebo 20535   M       Y            Y       N
## 15 15   3584      D D-penicillamine 23612   F       N            N       N
##    Edema Bilirubin Cholesterol Albumin Copper Alk_Phos   SGOT Tryglicerides
## 1      Y      14.5         261    2.60    156   1718.0 137.95           172
## 2      N       1.1         302    4.14     54   7394.8 113.52            88
## 3      S       1.4         176    3.48    210    516.0  96.10            55
## 4      S       1.8         244    2.54     64   6121.8  60.63            92
## 5      N       3.4         279    3.53    143    671.0 113.15            72
## 6      N       0.8         248    3.98     50    944.0  93.00            63
## 7      N       1.0         322    4.09     52    824.0  60.45           213
## 8      N       0.3         280    4.00     52   4651.2  28.38           189
## 9      N       3.2         562    3.08     79   2276.0 144.15            88
## 10     Y      12.6         200    2.74    140    918.0 147.25           143
## 11     N       1.4         259    4.16     46   1104.0  79.05            79
## 12     N       3.6         236    3.52     94    591.0  82.15            95
## 13     N       0.7         281    3.85     40   1181.0  88.35           130
## 14     Y       0.8          NA    2.27     43    728.0  71.00            NA
## 15     N       0.8         231    3.87    173   9009.8 127.71            96
##    Platelets Prothrombin Stage
## 1        190        12.2     4
## 2        221        10.6     3
## 3        151        12.0     4
## 4        183        10.3     4
## 5        136        10.9     3
## 6         NA        11.0     3
## 7        204         9.7     3
## 8        373        11.0     3
## 9        251        11.0     2
## 10       302        11.5     4
## 11       258        12.0     4
## 12        71        13.6     4
## 13       244        10.6     3
## 14       156        11.0     4
## 15       295        11.0     3

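# Classify a single bilirubin value (mg/dl); not vectorized, so use sapply() for columns
bilirubin_category <- function(bilirubin) {
  if (bilirubin < 1.2) {
    return("Normal")
  } else if (bilirubin < 3.0) {  # reached only when bilirubin >= 1.2
    return("Mildly Elevated")
  } else {
    return("Severely Elevated")
  }
}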
bilirubin_category(0.6)

## [1] "Normal"

bilirubin_category(2.5)

## [1] "Mildly Elevated"

bilirubin_category(4.5)

## [1] "Severely Elevated"

results <- data.frame(
  Bilirubin = cirrhosis$Bilirubin[1:10],
  Category  = sapply(cirrhosis$Bilirubin[1:10], bilirubin_category))
print(results)

##    Bilirubin          Category
## 1       14.5 Severely Elevated
## 2        1.1            Normal
## 3        1.4   Mildly Elevated
## 4        1.8   Mildly Elevated
## 5        3.4 Severely Elevated
## 6        0.8            Normal
## 7        1.0            Normal
## 8        0.3            Normal
## 9        3.2 Severely Elevated
## 10      12.6 Severely Elevated

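A custom function, bilirubin_category(), was created for the Bilirubin variable. It classifies a single value as Normal (< 1.2 mg/dl), Mildly Elevated (1.2 to < 3.0 mg/dl), or Severely Elevated (≥ 3.0 mg/dl), based on commonly used clinical thresholds for liver function assessment. Because it relies on if/else, it handles one value at a time, which is why sapply() is used to apply it to several patients.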
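The same classification can also be written in vectorized form with cut(). This is a sketch under the thresholds stated above, not the function used in the report; the name bilirubin_category_vec is hypothetical.

```r
# Vectorized alternative: classify a whole numeric vector in one call
bilirubin_category_vec <- function(bilirubin) {
  cut(
    bilirubin,
    breaks = c(-Inf, 1.2, 3.0, Inf),
    labels = c("Normal", "Mildly Elevated", "Severely Elevated"),
    right  = FALSE  # intervals are [lower, upper), so 1.2 is Mild and 3.0 is Severe
  )
}

# First ten Bilirubin values from the dataset preview
x <- c(14.5, 1.1, 1.4, 1.8, 3.4, 0.8, 1.0, 0.3, 3.2, 12.6)
data.frame(Bilirubin = x, Category = bilirubin_category_vec(x))
```

Unlike the if/else version, cut() returns NA for a missing bilirubin value instead of raising an error, which matters when a column contains NA entries.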

library(dplyr)  # provides %>%, filter(), select(), mutate(), arrange(), rename()

# Filter high-risk patients: Stage 2 AND Bilirubin > 4
high_risk <- cirrhosis %>%
  filter(Stage == 2 & Bilirubin > 4.0)
cat("Number of high-risk patients (Stage 2 & Bilirubin > 4):", nrow(high_risk), "\n")

## Number of high-risk patients (Stage 2 & Bilirubin > 4): 11

head(high_risk, 10)

##     ID N_Days Status    Drug   Age Sex Ascites Hepatomegaly Spiders Edema
## 1   31   3839      D Placebo 15177   F       N            Y       N     N
## 2   95    130      D Placebo 16944   F       Y            Y       Y     Y
## 3  156    853      D Placebo 21699   F       N            Y       N     N
## 4  166   2721      C Placebo 15105   F       N            Y       N     N
## 5  288   1067     CL Placebo 17874   F       N            Y       N     S
## 6  312    788      C Placebo 12109   F       N            N       Y     N
## 7  338    791      D    <NA> 17167   F    <NA>         <NA>    <NA>     N
## 8  340   3495      C    <NA> 19358   F    <NA>         <NA>    <NA>     N
## 9  343    625      D    <NA> 17532   F    <NA>         <NA>    <NA>     N
## 10 362   2267     CL    <NA> 17897   F    <NA>         <NA>    <NA>     N
##    Bilirubin Cholesterol Albumin Copper Alk_Phos   SGOT Tryglicerides Platelets
## 1        4.7         296    3.44    114   9933.2 206.40           101       195
## 2       17.4          NA    2.64    182    559.0 119.35            NA       401
## 3       25.5         358    3.52    219   2468.0 201.50           205       151
## 4        5.7        1480    3.26     84   1960.0 457.25           108       213
## 5        8.7         310    3.89    107    637.0 117.00           242       298
## 6        6.4         576    3.79    186   2115.0 136.00           149       200
## 7       16.0          NA    3.42     NA       NA     NA            NA       475
## 8        5.4          NA    4.19     NA       NA     NA            NA       141
## 9       11.1          NA    2.84     NA       NA     NA            NA        NA
## 10      18.0          NA    3.04     NA       NA     NA            NA       432
##    Prothrombin Stage
## 1         10.3     2
## 2         11.7     2
## 3         11.5     2
## 4          9.5     2
## 5          9.6     2
## 6         10.8     2
## 7         13.8     2
## 8         11.2     2
## 9         12.2     2
## 10         9.7     2

# Filter deceased male patients
deceased_male <- cirrhosis %>%
  filter(Status == "c" & Sex == "M")
cat("Number of deceased male patients:", nrow(deceased_male), "\n")

## Number of deceased male patients: 0

head(deceased_male, 10)

##  [1] ID            N_Days        Status        Drug          Age          
##  [6] Sex           Ascites       Hepatomegaly  Spiders       Edema        
## [11] Bilirubin     Cholesterol   Albumin       Copper        Alk_Phos     
## [16] SGOT          Tryglicerides Platelets     Prothrombin   Stage        
## <0 rows> (or 0-length row.names)

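Here we applied two filters. The first identifies patients in Stage 2 whose bilirubin exceeds 4.0 mg/dl, which returns 11 patients. The second was intended to isolate deceased male patients to examine gender-specific outcomes, but it compares Status with lowercase "c", a value that does not occur in the data (Status takes the values C, CL, and D). String comparison in R is case-sensitive, so the filter returns zero rows; selecting deceased patients requires Status == "D".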
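A small sketch of the case-sensitivity issue, using only the ID, Status, and Sex values of the first 15 patients shown earlier. It uses base R subset(), which behaves like dplyr's filter() here; the data frame name first15 is introduced for illustration.

```r
# ID, Status and Sex for the first 15 patients, copied from the head() output
first15 <- data.frame(
  ID     = 1:15,
  Status = c("D", "C", "D", "D", "CL", "D", "C", "D", "D", "D",
             "D", "D", "C", "D", "D"),
  Sex    = c("F", "F", "M", "F", "F", "F", "F", "F", "F", "F",
             "F", "F", "F", "M", "F")
)

# Lowercase "c" never matches, because the codes are uppercase
wrong_case <- subset(first15, Status == "c" & Sex == "M")
nrow(wrong_case)

# Deceased patients are coded "D"
deceased_male <- subset(first15, Status == "D" & Sex == "M")
deceased_male$ID
```

Checking unique(cirrhosis$Status) before writing a filter is a quick way to confirm the exact codes and their case.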

# Dependent variable: Status (patient outcome)
# Independent variables: Bilirubin, Cholesterol, Albumin, Age, Stage, etc.

# Create data frame 1 — key lab results (independent variables)
df_labs <- cirrhosis %>%
  select(Bilirubin, Cholesterol, Albumin, Copper, SGOT) %>%
  mutate(PatientID = row_number())

# Create data frame 2 — patient info and outcome (dependent variable)
df_outcome <- cirrhosis %>%
  select(Status, Age, Stage, Sex, N_Days) %>%
  mutate(PatientID = row_number())

# Join the two data frames using inner_join (reshaping technique)
df_joined <- inner_join(df_labs, df_outcome, by = "PatientID")

cat("Joined data frame:", nrow(df_joined), "rows x", ncol(df_joined), "columns\n")

## Joined data frame: 418 rows x 11 columns

head(df_joined, 10)

##    Bilirubin Cholesterol Albumin Copper   SGOT PatientID Status   Age Stage Sex
## 1       14.5         261    2.60    156 137.95         1      D 21464     4   F
## 2        1.1         302    4.14     54 113.52         2      C 20617     3   F
## 3        1.4         176    3.48    210  96.10         3      D 25594     4   M
## 4        1.8         244    2.54     64  60.63         4      D 19994     4   F
## 5        3.4         279    3.53    143 113.15         5     CL 13918     3   F
## 6        0.8         248    3.98     50  93.00         6      D 24201     3   F
## 7        1.0         322    4.09     52  60.45         7      C 20284     3   F
## 8        0.3         280    4.00     52  28.38         8      D 19379     3   F
## 9        3.2         562    3.08     79 144.15         9      D 15526     2   F
## 10      12.6         200    2.74    140 147.25        10      D 25772     4   F
##    N_Days
## 1     400
## 2    4500
## 3    1012
## 4    1925
## 5    1504
## 6    2503
## 7    1832
## 8    2466
## 9    2400
## 10     51

Status is the dependent variable representing patient outcome. The dataset was split into two data frames, one containing lab results and one containing patient demographics and outcomes, each tagged with a PatientID based on row position. The two were then rejoined with inner_join() as a reshaping technique.
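When joining on a key, it is worth confirming that the key is unique in both tables, since duplicated keys silently multiply rows. The sketch below illustrates that check with base R merge(), which performs an inner join by default; the toy data frames labs and outcome are assumptions built from the first three patients.

```r
# Toy versions of the two data frames, using the first three patients
labs    <- data.frame(PatientID = 1:3, Bilirubin = c(14.5, 1.1, 1.4))
outcome <- data.frame(PatientID = 1:3, Status = c("D", "C", "D"))

# anyDuplicated() returns 0 when every key value is unique
keys_unique <- anyDuplicated(labs$PatientID) == 0 &&
  anyDuplicated(outcome$PatientID) == 0

# Inner join on the shared key (merge() keeps only matching keys by default)
joined <- merge(labs, outcome, by = "PatientID")

# With unique, fully matching keys the row count is unchanged
nrow(joined) == nrow(labs)
```

Since the dataset already has an ID column, that column could serve as the join key instead of a row-position identifier, which would keep the key stable if rows were reordered.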

# Check missing values in each column
cat("Missing values per column:\n")

## Missing values per column:

print(colSums(is.na(cirrhosis)))

##            ID        N_Days        Status          Drug           Age 
##             0             0             0           106             0 
##           Sex       Ascites  Hepatomegaly       Spiders         Edema 
##             0           106           106           106             0 
##     Bilirubin   Cholesterol       Albumin        Copper      Alk_Phos 
##             0           134             0           108           106 
##          SGOT Tryglicerides     Platelets   Prothrombin         Stage 
##           106           136            11             2             6

# Remove rows with any missing values
cirrhosis_clean <- na.omit(cirrhosis)

cat("\nRows before removing NA:", nrow(cirrhosis), "\n")

## 
## Rows before removing NA: 418

cat("Rows after removing NA:", nrow(cirrhosis_clean), "\n")

## Rows after removing NA: 276

cat("Rows removed:", nrow(cirrhosis) - nrow(cirrhosis_clean), "\n")

## Rows removed: 142

The dataset had substantial missing values across several columns, most notably Tryglicerides (136 missing), Cholesterol (134), Copper (108), and Drug (106). After applying na.omit(), the dataset was reduced from 418 to 276 complete records, which are used for the remainder of the analysis.
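na.omit() drops a row if any column is missing, including columns a given analysis may never use. As a hedged alternative, complete.cases() can restrict the check to the columns of interest. The sketch below uses four patients from the preview (IDs 1, 2, 6, and 14); the object names are illustrative.

```r
# Four patients from the preview: ID 6 lacks Platelets, ID 14 lacks Cholesterol
toy <- data.frame(
  ID          = c(1, 2, 6, 14),
  Bilirubin   = c(14.5, 1.1, 0.8, 0.8),
  Cholesterol = c(261, 302, 248, NA),
  Platelets   = c(190, 221, NA, 156)
)

# Missing values per column
colSums(is.na(toy))

# Drop rows with a missing value in ANY column
all_complete <- na.omit(toy)

# Drop rows only when the columns needed for a given analysis are missing
needed <- c("Bilirubin", "Platelets")
subset_complete <- toy[complete.cases(toy[, needed]), ]

c(all_columns = nrow(all_complete), needed_only = nrow(subset_complete))
```

Which approach is appropriate depends on the analysis: listwise deletion keeps every step on the same set of patients, while column-specific deletion retains more data for each individual step.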

# Check for duplicated rows
cat("Number of duplicate rows:", sum(duplicated(cirrhosis_clean)), "\n")

## Number of duplicate rows: 0

# Remove duplicates
cirrhosis_clean <- cirrhosis_clean[!duplicated(cirrhosis_clean), ]
cat("Rows after removing duplicates:", nrow(cirrhosis_clean), "\n")

## Rows after removing duplicates: 276

The dataset was checked for duplicate rows using duplicated(). Since each row represents a unique patient with a unique ID, no duplicate records were found after cleaning.

# Sort by Copper (descending), breaking ties by Stage (descending)
cirrhosis_sorted <- cirrhosis_clean %>%
  arrange(desc(Copper), desc(Stage))

cat("Dataset sorted by Copper (descending) and Stage (descending):\n")

## Dataset sorted by Copper (descending) and Stage (descending):

head(cirrhosis_sorted[, c("ID", "Copper", "Stage", "Status", "Age")], 15)

##     ID Copper Stage Status   Age
## 1   18    588     4      D 19698
## 2   23    558     4      D 20442
## 3   22    464     4      D 20555
## 4  120    444     3     CL 12839
## 5  233    412     4      C 15591
## 6  253    380     4      C 28650
## 7  184    358     3      D 13736
## 8  241    308     4     CL 15112
## 9  149    290     4      D 22574
## 10  48    281     3      C 17947
## 11  74    280     4      D 18964
## 12  80    269     4      D 24622
## 13 193    267     4      D 20736
## 14  54    262     4      D 14317
## 15 138    262     3      D 18719

The dataset was sorted in descending order by Copper, with ties broken by Stage (also descending). This brings the patients with the highest Copper values, and among equal Copper values those with the most advanced disease stage, to the top of the dataset.
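The tie-breaking behaviour can be seen in a base R sketch using order(), shown here on four rows taken from the sorted output above (IDs 48, 74, 54, and 138). The data frame name top is introduced for illustration.

```r
# Four patients from the sorted output, deliberately shuffled
top <- data.frame(
  ID     = c(138, 48, 54, 74),
  Copper = c(262, 281, 262, 280),
  Stage  = c(3, 3, 4, 4)
)

# Negating numeric columns sorts them in descending order;
# Stage only matters where Copper values are tied (262 here)
sorted <- top[order(-top$Copper, -top$Stage), ]
sorted$ID  # 48 74 54 138
```

IDs 54 and 138 share a Copper value of 262, so the Stage 4 patient (ID 54) is placed ahead of the Stage 3 patient (ID 138), matching the order in the report's output.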

cirrhosis_renamed <- cirrhosis_clean %>%
  rename(
    PID            = ID,
    SurDays        = N_Days,
    PtStatus       = Status,
    TreatmentDrug  = Drug,
    PtAge          = Age,
    PtSex          = Sex,
    AbdominalFluid = Ascites,
    EnlargedLiver  = Hepatomegaly,
    SkinSpiders    = Spiders,
    SrBilirubin    = Bilirubin,
    SrCholesterol  = Cholesterol,
    SrAlbumin      = Albumin,
    UrineCopper    = Copper,
    DiseaseStage   = Stage
  )

cat("Renamed column names:\n")

## Renamed column names:

print(names(cirrhosis_renamed))

##  [1] "PID"            "SurDays"        "PtStatus"       "TreatmentDrug" 
##  [5] "PtAge"          "PtSex"          "AbdominalFluid" "EnlargedLiver" 
##  [9] "SkinSpiders"    "Edema"          "SrBilirubin"    "SrCholesterol" 
## [13] "SrAlbumin"      "UrineCopper"    "Alk_Phos"       "SGOT"          
## [17] "Tryglicerides"  "Platelets"      "Prothrombin"    "DiseaseStage"

Key columns were renamed for improved readability. For example, N_Days became SurDays, Ascites became AbdominalFluid, and Hepatomegaly became EnlargedLiver, making the dataset more self-explanatory for readers unfamiliar with clinical abbreviations.

# 1. Bilirubin_Double: Bilirubin multiplied by 2 (as required by rubric)
cirrhosis_clean$Bilirubin_Double <- cirrhosis_clean$Bilirubin * 2

# 2. Albumin_Bilirubin_Ratio: ratio of Albumin to Bilirubin (liver health index)
cirrhosis_clean$Albumin_Bilirubin_Ratio <- round(cirrhosis_clean$Albumin / cirrhosis_clean$Bilirubin, 3)

# 3. Age_Years: convert Age from days to years
cirrhosis_clean$Age_Years <- round(cirrhosis_clean$Age / 365, 1)

# 4. Risk_Score: weighted formula using key clinical variables
cirrhosis_clean$Risk_Score <- round(
  0.4 * cirrhosis_clean$Bilirubin +
  0.3 * cirrhosis_clean$Prothrombin +
  0.3 * cirrhosis_clean$Stage, 3
)

# Preview the new variables
head(cirrhosis_clean[, c("Bilirubin", "Albumin", "Age",
                         "Bilirubin_Double",
                         "Albumin_Bilirubin_Ratio",
                         "Age_Years", "Risk_Score")], 10)

##    Bilirubin Albumin   Age Bilirubin_Double Albumin_Bilirubin_Ratio Age_Years
## 1       14.5    2.60 21464             29.0                   0.179      58.8
## 2        1.1    4.14 20617              2.2                   3.764      56.5
## 3        1.4    3.48 25594              2.8                   2.486      70.1
## 4        1.8    2.54 19994              3.6                   1.411      54.8
## 5        3.4    3.53 13918              6.8                   1.038      38.1
## 7        1.0    4.09 20284              2.0                   4.090      55.6
## 8        0.3    4.00 19379              0.6                  13.333      53.1
## 9        3.2    3.08 15526              6.4                   0.963      42.5
## 10      12.6    2.74 25772             25.2                   0.217      70.6
## 11       1.4    4.16 19619              2.8                   2.971      53.8
##    Risk_Score
## 1       10.66
## 2        4.52
## 3        5.36
## 4        5.01
## 5        5.53
## 7        4.21
## 8        4.32
## 9        5.18
## 10       9.69
## 11       5.36

Four new variables were added. Bilirubin_Double multiplies Bilirubin by 2 as required by the rubric. Albumin_Bilirubin_Ratio divides Albumin by Bilirubin as a simple liver health indicator, where lower values suggest poorer liver function. Age_Years converts the original age in days to a more readable age in years. Risk_Score is a custom weighted formula (not a validated clinical score) combining Bilirubin, Prothrombin time, and disease Stage.
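To make the formulas concrete, the sketch below recomputes three of the derived variables for the first two patients and reproduces the values in the preview. The helper name derive_variables is hypothetical; the formulas are the ones used in the report.

```r
# Recompute the derived variables for a data frame with the required columns
derive_variables <- function(df) {
  df$Albumin_Bilirubin_Ratio <- round(df$Albumin / df$Bilirubin, 3)
  df$Age_Years  <- round(df$Age / 365, 1)   # age is recorded in days
  df$Risk_Score <- round(
    0.4 * df$Bilirubin + 0.3 * df$Prothrombin + 0.3 * df$Stage, 3
  )
  df
}

# First two patients from the dataset preview
rows <- data.frame(
  Bilirubin   = c(14.5, 1.1),
  Albumin     = c(2.60, 4.14),
  Age         = c(21464, 20617),
  Prothrombin = c(12.2, 10.6),
  Stage       = c(4, 3)
)

out <- derive_variables(rows)
out[, c("Albumin_Bilirubin_Ratio", "Age_Years", "Risk_Score")]
```

For patient 1 the risk score works out as 0.4 × 14.5 + 0.3 × 12.2 + 0.3 × 4 = 5.8 + 3.66 + 1.2 = 10.66. Because the inputs are on different scales, Bilirubin and Prothrombin dominate the score relative to Stage.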

# Set the random seed so the split is reproducible
set.seed(1234)

# Create 70/30 train-test split (floor() makes the training size an explicit integer)
n_rows <- nrow(cirrhosis_clean)
train_index <- sample(seq_len(n_rows), size = floor(0.70 * n_rows))

TrainingSet <- cirrhosis_clean[train_index, ]
TestingSet  <- cirrhosis_clean[-train_index, ]

cat("Total rows in cleaned dataset:", nrow(cirrhosis_clean), "\n")

## Total rows in cleaned dataset: 276

dim(TrainingSet)

## [1] 193  24

dim(TestingSet)

## [1] 83 24

Calling set.seed(1234) fixes the random number generator's seed so that the split is reproducible. 70% of the cleaned dataset (193 rows) was assigned to the training set and the remaining 30% (83 rows) to the testing set. The training set can be used to build predictive models of cirrhosis outcomes, with the testing set held back for evaluation.
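The split sizes follow directly from the row count: 0.70 × 276 = 193.2, which truncates to 193 training rows and leaves 83 for testing. A minimal sketch on row indices only (no patient data), which also checks that the two sets do not overlap:

```r
set.seed(1234)

n <- 276                        # rows in the cleaned dataset
n_train <- floor(0.70 * n)      # 193

# Sample training indices without replacement; the rest form the test set
train_index <- sample(seq_len(n), size = n_train)
test_index  <- setdiff(seq_len(n), train_index)

c(train = length(train_index), test = length(test_index))

# No row should appear in both sets
length(intersect(train_index, test_index)) == 0
```

With three outcome classes of unequal size, a stratified split (sampling within each Status group) may be worth considering so that both sets contain similar proportions of each outcome.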

summary(cirrhosis_clean)

##        ID             N_Days        Status              Drug          
##  Min.   :  1.00   Min.   :  41   Length:276         Length:276        
##  1st Qu.: 79.75   1st Qu.:1186   Class :character   Class :character  
##  Median :157.50   Median :1788   Mode  :character   Mode  :character  
##  Mean   :158.62   Mean   :1979                                        
##  3rd Qu.:240.25   3rd Qu.:2690                                        
##  Max.   :312.00   Max.   :4556                                        
##       Age            Sex              Ascites          Hepatomegaly      
##  Min.   : 9598   Length:276         Length:276         Length:276        
##  1st Qu.:15162   Class :character   Class :character   Class :character  
##  Median :18157   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :18189                                                           
##  3rd Qu.:20668                                                           
##  Max.   :28650                                                           
##    Spiders             Edema             Bilirubin       Cholesterol    
##  Length:276         Length:276         Min.   : 0.300   Min.   : 120.0  
##  Class :character   Class :character   1st Qu.: 0.800   1st Qu.: 249.5  
##  Mode  :character   Mode  :character   Median : 1.400   Median : 310.0  
##                                        Mean   : 3.334   Mean   : 371.3  
##                                        3rd Qu.: 3.525   3rd Qu.: 401.0  
##                                        Max.   :28.000   Max.   :1775.0  
##     Albumin          Copper          Alk_Phos            SGOT       
##  Min.   :1.960   Min.   :  4.00   Min.   :  289.0   Min.   : 28.38  
##  1st Qu.:3.310   1st Qu.: 42.75   1st Qu.:  922.5   1st Qu.: 82.46  
##  Median :3.545   Median : 74.00   Median : 1277.5   Median :116.62  
##  Mean   :3.517   Mean   :100.77   Mean   : 1996.6   Mean   :124.12  
##  3rd Qu.:3.772   3rd Qu.:129.25   3rd Qu.: 2068.2   3rd Qu.:153.45  
##  Max.   :4.400   Max.   :588.00   Max.   :13862.4   Max.   :457.25  
##  Tryglicerides     Platelets      Prothrombin        Stage     
##  Min.   : 33.0   Min.   : 62.0   Min.   : 9.00   Min.   :1.00  
##  1st Qu.: 85.0   1st Qu.:200.0   1st Qu.:10.00   1st Qu.:2.00  
##  Median :108.0   Median :257.0   Median :10.60   Median :3.00  
##  Mean   :125.0   Mean   :261.8   Mean   :10.74   Mean   :3.04  
##  3rd Qu.:151.2   3rd Qu.:318.2   3rd Qu.:11.20   3rd Qu.:4.00  
##  Max.   :598.0   Max.   :563.0   Max.   :17.10   Max.   :4.00  
##  Bilirubin_Double Albumin_Bilirubin_Ratio   Age_Years       Risk_Score    
##  Min.   : 0.600   Min.   : 0.1160         Min.   :26.30   Min.   : 3.440  
##  1st Qu.: 1.600   1st Qu.: 0.9543         1st Qu.:41.55   1st Qu.: 4.277  
##  Median : 2.800   Median : 2.5535         Median :49.75   Median : 4.795  
##  Mean   : 6.667   Mean   : 3.0045         Mean   :49.83   Mean   : 5.466  
##  3rd Qu.: 7.050   3rd Qu.: 4.3780         3rd Qu.:56.62   3rd Qu.: 5.763  
##  Max.   :56.000   Max.   :13.6000         Max.   :78.50   Max.   :15.560

The summary statistics reveal key patterns in the dataset. The mean Bilirubin is approximately 3.33 mg/dl against a median of 1.4 mg/dl, while the median Stage is 3 (on a scale of 1 to 4), suggesting that at least half of the patients are in advanced stages of cirrhosis. Cholesterol (120 to 1775) and Copper (4 to 588) show wide ranges, indicating high variability in lab results across patients.

# Custom mode function (R has no built-in statistical mode)
get_mode <- function(x) {
  uniq_vals <- unique(x)
  uniq_vals[which.max(tabulate(match(x, uniq_vals)))]
}

cat("=== Bilirubin ===\n")

## === Bilirubin ===

cat("Mean:  ", round(mean(cirrhosis_clean$Bilirubin), 2), "\n")

## Mean:   3.33

cat("Median:", median(cirrhosis_clean$Bilirubin), "\n")

## Median: 1.4

cat("Mode:  ", get_mode(cirrhosis_clean$Bilirubin), "\n")

## Mode:   0.7

cat("Range: ", range(cirrhosis_clean$Bilirubin), "\n\n")

## Range:  0.3 28

cat("=== Cholesterol ===\n")

## === Cholesterol ===

cat("Mean:  ", round(mean(cirrhosis_clean$Cholesterol), 2), "\n")

## Mean:   371.26

cat("Median:", median(cirrhosis_clean$Cholesterol), "\n")

## Median: 310

cat("Mode:  ", get_mode(cirrhosis_clean$Cholesterol), "\n")

## Mode:   260

cat("Range: ", range(cirrhosis_clean$Cholesterol), "\n")

## Range:  120 1775

For Bilirubin, the mean (3.33) is well above the median (1.4), suggesting a right-skewed distribution: a few patients have very high bilirubin levels that pull the average up. For Cholesterol, the wide range (120 to 1775) reflects substantial variation in lipid metabolism among cirrhosis patients.
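The gap between mean and median can be reproduced on a small scale with the first 15 Bilirubin values from the preview. The sketch also shows a variant of the mode helper that returns every tied mode, since get_mode() reports only the first one it encounters; the name all_modes is hypothetical.

```r
# First 15 Bilirubin values from the dataset preview
bili <- c(14.5, 1.1, 1.4, 1.8, 3.4, 0.8, 1.0, 0.3, 3.2, 12.6,
          1.4, 3.6, 0.7, 0.8, 0.8)

# A mean well above the median points to right skew
mean(bili)
median(bili)

# Return all values that share the highest frequency
all_modes <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[counts == max(counts)])
}

all_modes(bili)              # 0.8 occurs three times
all_modes(c(1, 1, 2, 2, 3))  # two modes: 1 and 2
```

For a continuous lab measurement, the mode depends heavily on how values are rounded, so the median is usually the more informative measure of a typical value.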

library(ggplot2)  # plotting functions used below

ggplot(cirrhosis_clean, aes(x = Age_Years, y = Bilirubin, color = Status)) +
  geom_point(alpha = 0.6, size = 2) +
  labs(
    title = "Scatter Plot: Age vs Bilirubin Level",
    x     = "Age (Years)",
    y     = "Serum Bilirubin (mg/dl)",
    color = "Patient Status"
  ) +
  scale_color_manual(
    values = c("C" = "steelblue", "CL" = "orange", "D" = "tomato"),
    labels = c("Censored", "Liver Transplant", "Deceased")
  ) +
  theme_minimal()

The scatter plot shows patient age against serum bilirubin, colored by status. Deceased patients (red) appear more often at higher bilirubin levels across age groups, which suggests that elevated bilirubin is associated with mortality. Patients who received liver transplants (orange) also tend to show elevated bilirubin, consistent with more severe disease. Taken together, the plot points to bilirubin as a potentially useful predictor of patient outcomes in cirrhosis.

# Average Bilirubin by Patient Status
bar_data <- cirrhosis_clean %>%
  group_by(Status) %>%
  summarise(avg_bilirubin = mean(Bilirubin))

ggplot(bar_data, aes(x = factor(Status), y = avg_bilirubin, fill = factor(Status))) +
  geom_bar(stat = "identity", width = 0.5) +
  geom_text(aes(label = round(avg_bilirubin, 2)), vjust = -0.5, size = 4, fontface = "bold") +
  labs(
    title = "Average Bilirubin by Patient Status",
    x     = "Patient Status",
    y     = "Average Bilirubin (mg/dl)",
    fill  = "Status"
  ) +
  scale_x_discrete(labels = c("Censored", "Liver Transplant", "Deceased")) +
  scale_fill_manual(
    values = c("C" = "steelblue", "CL" = "orange", "D" = "tomato"),
    labels = c("Censored", "Liver Transplant", "Deceased")
  ) +
  theme_minimal()

The bar plot compares the average bilirubin level across the three patient status groups. Deceased patients have the highest average bilirubin, followed by liver transplant patients, while censored patients (those not recorded as deceased or transplanted by the end of follow-up) have the lowest. This pattern is consistent with bilirubin being a key clinical marker of cirrhosis mortality, although group means alone do not establish predictive value.
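Because bilirubin is right-skewed, a group mean can be driven by a few extreme patients, so it can be useful to report the group size and median alongside it. The sketch below does this in base R for the first 15 patients only; the resulting numbers describe that small subset, not the full dataset.

```r
# Bilirubin and Status for the first 15 patients in the preview
bilirubin <- c(14.5, 1.1, 1.4, 1.8, 3.4, 0.8, 1.0, 0.3, 3.2, 12.6,
               1.4, 3.6, 0.7, 0.8, 0.8)
status <- c("D", "C", "D", "D", "CL", "D", "C", "D", "D", "D",
            "D", "D", "C", "D", "D")

# Group size, mean and median per status (groups are ordered C, CL, D)
group_summary <- data.frame(
  n      = as.vector(table(status)),
  mean   = as.vector(tapply(bilirubin, status, mean)),
  median = as.vector(tapply(bilirubin, status, median)),
  row.names = names(table(status))
)
group_summary
```

In this subset the deceased group has a mean of about 3.75 but a median of only 1.4, which illustrates how much a handful of very high values can lift the average.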

# Pearson correlation between Bilirubin and Prothrombin
cor_value <- cor(cirrhosis_clean$Bilirubin, cirrhosis_clean$Prothrombin, method = "pearson")
cat("Pearson Correlation (Bilirubin vs Prothrombin):", round(cor_value, 4), "\n")

## Pearson Correlation (Bilirubin vs Prothrombin): 0.3312

# Full correlation test with p-value and confidence interval
cor.test(cirrhosis_clean$Bilirubin, cirrhosis_clean$Prothrombin, method = "pearson")

## 
##  Pearson's product-moment correlation
## 
## data:  cirrhosis_clean$Bilirubin and cirrhosis_clean$Prothrombin
## t = 5.8098, df = 274, p-value = 1.732e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2217790 0.4323401
## sample estimates:
##       cor 
## 0.3311762

The Pearson correlation between Bilirubin and Prothrombin time is positive but moderate (r = 0.33, 95% CI 0.22 to 0.43), indicating that prothrombin time tends to increase as bilirubin increases. Both are liver function markers: a severely damaged liver clears bilirubin poorly and produces fewer clotting factors (reflected in a longer prothrombin time), so the relationship is clinically plausible. The p-value (1.7e-08) is far below 0.05, so the correlation is statistically significant, although it explains only about 11% of the variance (r² ≈ 0.11).
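The test statistic reported by cor.test() can be recovered from the correlation coefficient and the degrees of freedom using t = r·sqrt(df) / sqrt(1 − r²). The sketch below plugs in the reported values as a check; the variable names are illustrative.

```r
# Values reported by cor.test() above
r        <- 0.3311762
deg_free <- 274          # n - 2, with n = 276 complete records

# t statistic for testing whether the true correlation is zero
t_stat <- r * sqrt(deg_free) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), deg_free)

# Share of variance in one variable explained linearly by the other
r_squared <- r^2

round(c(t = t_stat, r_squared = r_squared), 4)
p_value
```

The result reproduces the reported t of about 5.81. With 276 observations even a moderate correlation produces a very small p-value, so statistical significance here should be read together with the effect size.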


Shikha Lachoriya, Priyanshi Kakadiya, Yesha Gajjar, Sara Mushtaq, Hetakshee Koshti, Vishwa Rathod, Nikhil Milind Dode, Tiffany Toussaint, Lalit Ratlani

2026-03-28