# devtools::install_github('thomasp85/patchwork')
# tinytex::install_tinytex()

Loading libraries

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)
library(dplyr)
library(tidyr)
library(plyr)
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## 
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## 
## The following object is masked from 'package:purrr':
## 
##     compact
library(patchwork)
library(MASS)
## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:patchwork':
## 
##     area
## 
## The following object is masked from 'package:dplyr':
## 
##     select

Overview

The dataset (cholangitis.csv) comes from a randomized, clinical trial of the immunosuppressive drug D-penicillamine at the Mayo Clinic. Patients were randomly given either D-penicillamine or a placebo (variable drug), and both the patient and the health care providers were unaware as to which the patient received (“double-blinded”). The study consisted of patients living with primary biliary cholangitis, a fatal chronic autoimmune disease of unknown cause affecting the liver. There are 418 observations of 20 variables, both numeric and categorical. The description of the variables is found in cholangitis_dictionary.csv. The study lasted about 12 years. The goal of the project is to fit a linear regression equation to the number of days a patient survives from the time of registration (n_days)

In this project, I will be exploring the dataset (cholangitis.csv) and any relationship among variables, notably how different types of drugs have affected the patient outcome and if there is any interesting patterns in demographic groups.

Importing the dictionary

chol.dict <- read.csv("cholangitis_dictionary.csv", header = TRUE)
head(chol.dict)
##   Variable
## 1       id
## 2   n_days
## 3   status
## 4     drug
## 5      age
## 6      sex
##                                                                                        Description
## 1                                                                                Unique identifier
## 2  Number of days between registration and the earlier of: death, transplantation, or end of study
## 3 Status of the patient at end of study: C (not dead), CL (received liver transplant), or D (died)
## 4                                                        Drug received: D-penicillamine or placebo
## 5                                                                         Age at enrollment [days]
## 6                                                                           M (male) or F (female)

Part I: Visualizaing Data

Data Cleaning

Importing the data: Read the data into R.

chol <- read.csv("cholangitis.csv", header = TRUE)
head(chol)
##   id n_days status            drug   age sex ascites hepatomegaly spiders edema
## 1  1    400      D D-penicillamine 21464   F       Y            Y       Y     Y
## 2  2   4500      C D-penicillamine 20617   F       N            Y       Y     N
## 3  3   1012      D D-penicillamine 25594   M       N            N       N     S
## 4  4   1925      D D-penicillamine 19994   F       N            Y       Y     S
## 5  5   1504     CL         Placebo 13918   F       N            Y       Y     N
## 6  6   2503      D         Placebo 24201   F       N            Y       N     N
##   bilirubin cholesterol albumin copper alk_phos   sgot tryglicerides platelets
## 1      14.5         261    2.60    156   1718.0 137.95           172       190
## 2       1.1         302    4.14     54   7394.8 113.52            88       221
## 3       1.4         176    3.48    210    516.0  96.10            55       151
## 4       1.8         244    2.54     64   6121.8  60.63            92       183
## 5       3.4         279    3.53    143    671.0 113.15            72       136
## 6       0.8         248    3.98     50    944.0  93.00            63       361
##   prothrombin stage
## 1        12.2     4
## 2        10.6     3
## 3        12.0     4
## 4        10.3     4
## 5        10.9     3
## 6        11.0     3
names(chol)
##  [1] "id"            "n_days"        "status"        "drug"         
##  [5] "age"           "sex"           "ascites"       "hepatomegaly" 
##  [9] "spiders"       "edema"         "bilirubin"     "cholesterol"  
## [13] "albumin"       "copper"        "alk_phos"      "sgot"         
## [17] "tryglicerides" "platelets"     "prothrombin"   "stage"

Color-coding: different levels of sex, stages, drugs, and status are assigned to different colors.

# Defining/ Assigning colors
colSex <- c("lightpink", "lightblue1")
names(colSex) <- levels(chol$sex)

colStage <- palette()
names(colStage) <- levels(chol$stage)

colDrug <- c("lightgreen", "coral3", "khaki1")
names(colDrug) <- levels(chol$drug)

colStatus <- c("salmon1", "royalblue4", "palegreen4")
names(colStatus) <- levels(chol$status)

Converting all categorical variables into factors

chol[, c(3:4, 6:10, 20)] <- lapply(chol[, c(3:4, 6:10, 20)],
    as.factor)

Exploratory data analysis (EDA)

In this part below, I performed exploratory data analysis (EDA) on the data, using appropriate tools. My objective is to find any interesting relationship among variables, features of the data that could especially explain the effects of D-penicillamine drug on various variables since the experiment focuses on the types of drugs given to patients.

For the first part of my EDA, I am going to explore the distribution of patients’ demographic and treatment, including sex, status, drugs received, age, and stage.

For the second part of my EDA, I would explore the relationship and correlation among variables, mainly:

  1. Correlation among all variables
  2. various variables vs n_days
  3. patients’ drugs vs status
  4. patients’ drugs vs stage
  5. patients’ status vs stage
  6. patients’ status vs presence of ascites, hepatomegaly, spiders, and edema
  7. patients’ stage vs presence of ascites, hepatomegaly, spiders, and edema
  8. patients’ drugs vs presence of ascites, hepatomegaly, spiders, and edema
  9. patients’ status vs level of bilirubin, cholesterol, albumin, urine copper, alkaline phosphatase, serum glutamic-oxaloacetic transaminase, triglycerides, platelets and prothrombin
  10. patients’ stage vs level of bilirubin, cholesterol, albumin, urine copper, alkaline phosphatase, serum glutamic-oxaloacetic transaminase, triglycerides, platelets and prothrombin
  11. patients’ drugs vs level of bilirubin, cholesterol, albumin, urine copper, alkaline phosphatase, serum glutamic-oxaloacetic transaminase, triglycerides, platelets and prothrombin

For the third part of my EDA, I would identify outliers and further explore the dataset with PCA analysis.

EDA (Part I)

# Basic summary of original data
summary(chol)
##        id            n_days     status                drug          age       
##  Min.   :  1.0   Min.   :  41   C :232   D-penicillamine:158   Min.   : 9598  
##  1st Qu.:105.2   1st Qu.:1093   CL: 25   Placebo        :154   1st Qu.:15644  
##  Median :209.5   Median :1730   D :161   NA's           :106   Median :18628  
##  Mean   :209.5   Mean   :1918                                  Mean   :18533  
##  3rd Qu.:313.8   3rd Qu.:2614                                  3rd Qu.:21272  
##  Max.   :418.0   Max.   :4795                                  Max.   :28650  
##                                                                               
##  sex     ascites hepatomegaly spiders edema     bilirubin       cholesterol    
##  F:374   N:390   N:203        N:298   N:354   Min.   : 0.300   Min.   : 120.0  
##  M: 44   Y: 28   Y:215        Y:120   S: 44   1st Qu.: 0.800   1st Qu.: 248.0  
##                                       Y: 20   Median : 1.400   Median : 310.0  
##                                               Mean   : 3.221   Mean   : 365.5  
##                                               3rd Qu.: 3.400   3rd Qu.: 400.0  
##                                               Max.   :28.000   Max.   :1775.0  
##                                                                NA's   :5       
##     albumin          copper          alk_phos            sgot       
##  Min.   :1.960   Min.   :  4.00   Min.   :  289.0   Min.   : 26.35  
##  1st Qu.:3.243   1st Qu.: 41.25   1st Qu.:  857.2   1st Qu.: 82.04  
##  Median :3.530   Median : 72.50   Median : 1257.0   Median :114.11  
##  Mean   :3.497   Mean   : 95.71   Mean   : 1937.1   Mean   :121.75  
##  3rd Qu.:3.770   3rd Qu.:123.00   3rd Qu.: 2039.0   3rd Qu.:151.90  
##  Max.   :4.640   Max.   :588.00   Max.   :13862.4   Max.   :457.25  
##                                                                     
##  tryglicerides     platelets      prothrombin    stage  
##  Min.   : 33.0   Min.   : 62.0   Min.   : 9.00   1: 21  
##  1st Qu.: 84.0   1st Qu.:190.0   1st Qu.:10.00   2: 94  
##  Median :109.0   Median :250.0   Median :10.60   3:156  
##  Mean   :122.9   Mean   :257.4   Mean   :10.73   4:147  
##  3rd Qu.:151.0   3rd Qu.:318.0   3rd Qu.:11.10          
##  Max.   :598.0   Max.   :721.0   Max.   :18.00          
##  NA's   :5

I printed the summary of the dataset, showing features of the dataset, including mean, median, range and IQR for numerical variables, mode for categorical.

# Drugs distribution
chol %>%
    ggplot(aes(x = drug)) + geom_bar(color = "black", fill = c("lightgreen",
    "coral3", "khaki1"), alpha = 0.4) + geom_text(stat = "count",
    aes(label = after_stat(count), vjust = 2.5)) + labs(title = "Fig. 1: Types of Drugs") +
    theme_bw()

Then, I explored the distribution of drugs. From the distribution of the drug, we can tell that there are a lot of NAs in the drugs. A reason contributing to these NAs might be patients opting out of the experiment but continuing to keep track of their measurements. Therefore, these patients would not have taken the drugs but would have their measurements recorded. I would remove the NAs under each part of the project as removing them all at once right now would result in a mismatch in indices.

Demographics
# Distribution of Sex
s <- ggplot(data = na.omit(chol), aes(x = sex, y = prop.table(stat(count)),
    label = scales::percent(prop.table(stat(count))))) + geom_bar(fill = c("lightpink",
    "lightblue1"), color = "black", alpha = 0.7) + geom_text(stat = "count",
    vjust = -0.3, size = 3) + theme_bw() + ylim(0, 1) + labs(x = "Sex",
    y = "Proportion", title = "Fig. 2: Distributiuon of Sex")

# Distribution of Age
mean_all <- na.omit(chol) %>%
    summarise(mean = mean(age)/365)
head(mean_all)
##       mean
## 1 50.01922
a <- ggplot(data = na.omit(chol), aes(x = age/365, y = ..density..)) +
    geom_histogram(color = "black", fill = "lightslategrey") +
    geom_density(color = "darkred") + geom_vline(data = mean_all,
    aes(xintercept = mean), color = "darkred", linetype = "dashed") +
    labs(x = "Age (in years)", y = "Density", title = "Fig. 3: Distribution of Age") +
    theme_bw()

# Sex vs Age
mu <- ddply(na.omit(chol), "sex", summarise, grp.mean = mean(age)/365)
head(mu)
##   sex grp.mean
## 1   F 49.19279
## 2   M 56.24041
sa <- ggplot(data = na.omit(chol), aes(x = age/365, fill = sex)) +
    geom_density(alpha = 0.4) + geom_vline(data = mu, aes(xintercept = grp.mean,
    color = sex), linetype = "dashed") + labs(x = "Age (in years)",
    y = "Density", title = "Fig. 4: Female Age vs Male Age") +
    theme_bw()

(s + a)/sa
## Warning: `stat(count)` was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

After removing all rows with NAs, we observe the distribution of demographics of the data. 88% of the patients are female, whereas 12% of those are male. Age is slightly right-skewed, median is at around 50 years old. When we compare the distribution of age between female and male, female’s age is slightly more right-skewed than that of male. Female has a lower median age at 49, compared to that of male at 56.

# Sex vs Stage
ss1 <- ggplot(data = chol, aes(x = stage, fill = sex)) + geom_bar(alpha = 0.4,
    color = "black") + labs(x = "Stage", y = "Count", title = "Fig. 5: Female vs Male Stage Distribution") +
    theme_bw()

ss2 <- ggplot(data = chol, aes(x = stage, fill = sex)) + geom_bar(alpha = 0.4,
    color = "black", position = "fill") + labs(x = "Stage", y = "Proportion",
    title = "Fig. 6: Scaled Female vs Male Stage Distribution") +
    theme_bw()

ss1/ss2

After removing all rows with NAs, we observe the distribution of female and male patients and the respective distribution of 4 stages. From figure 1, we can tell that most patients are diagnosed with stage 3 and 4, the least patients are diagnosed as stage 1. From figure 2, where the proportions are scaled, we can tell that the proportion distribution of female and male does not vary too much among al 4 stages, with around 0.12 male and 0.88 female for all four stages.

EDA (Part II)

In part II of the EDA, I would explore the relationship and correlation among variables.

# Correlation of all variables
chol1 <- chol %>%
    mutate(stagee = as.numeric(chol$stage))

print(cor(na.omit(chol1[, c(2, 5, 11:19, 21)])))
##                    n_days          age    bilirubin   cholesterol     albumin
## n_days         1.00000000 -0.135800572 -0.402359619 -0.1419653502  0.42669448
## age           -0.13580057  1.000000000  0.007746838 -0.1192923730 -0.19572023
## bilirubin     -0.40235962  0.007746838  1.000000000  0.3102500400 -0.31138756
## cholesterol   -0.14196535 -0.119292373  0.310250040  1.0000000000 -0.04959322
## albumin        0.42669448 -0.195720231 -0.311387565 -0.0495932223  1.00000000
## copper        -0.28194001  0.021450946  0.382008669  0.0850033119 -0.20468385
## alk_phos       0.15964958 -0.053454952  0.113636671  0.0887981600 -0.03805606
## sgot          -0.18931321 -0.106036931  0.365478055  0.2767445827 -0.17699989
## tryglicerides -0.09224617  0.033222223  0.292524807  0.2306153802 -0.08041092
## platelets      0.14527441 -0.144963482 -0.009612271  0.1967770313  0.15491706
## prothrombin   -0.10745289  0.118736978  0.310038829  0.0008462541 -0.18993628
## stagee        -0.36467216  0.193675801  0.202635363  0.0395684159 -0.31247837
##                     copper    alk_phos        sgot tryglicerides    platelets
## n_days        -0.281940014  0.15964958 -0.18931321  -0.092246174  0.145274413
## age            0.021450946 -0.05345495 -0.10603693   0.033222223 -0.144963482
## bilirubin      0.382008669  0.11363667  0.36547805   0.292524807 -0.009612271
## cholesterol    0.085003312  0.08879816  0.27674458   0.230615380  0.196777031
## albumin       -0.204683850 -0.03805606 -0.17699989  -0.080410919  0.154917061
## copper         1.000000000  0.20526068  0.23315000   0.204247150 -0.005660661
## alk_phos       0.205260675  1.00000000  0.10463999   0.089333814  0.085824312
## sgot           0.233150000  0.10463999  1.00000000   0.072227940 -0.112919143
## tryglicerides  0.204247150  0.08933381  0.07222794   1.000000000  0.061078540
## platelets     -0.005660661  0.08582431 -0.11291914   0.061078540  1.000000000
## prothrombin    0.172045017  0.05276679  0.08169323   0.002640346 -0.158842837
## stagee         0.199635875  0.02130375  0.14003577   0.100896301 -0.224920050
##                 prothrombin      stagee
## n_days        -0.1074528873 -0.36467216
## age            0.1187369776  0.19367580
## bilirubin      0.3100388289  0.20263536
## cholesterol    0.0008462541  0.03956842
## albumin       -0.1899362820 -0.31247837
## copper         0.1720450172  0.19963587
## alk_phos       0.0527667870  0.02130375
## sgot           0.0816932326  0.14003577
## tryglicerides  0.0026403455  0.10089630
## platelets     -0.1588428371 -0.22492005
## prothrombin    1.0000000000  0.21015604
## stagee         0.2101560423  1.00000000

I transformed stage into a numeric variable, in order to analyze the possible patterns and correlation between stage and other variables. As seen on the correlation printed above, “n_days” has the strongest relationship with “albumin”, “bilirubin”, and “stage”. We would further explore the relationship between “n_days” and categorical variables that we were not able to analyze previously.

# Drugs vs n_days
data_summary <- function(x) {
    m <- mean(x)
    ymin <- m - sd(x)
    ymax <- m + sd(x)
    return(c(y = m, ymin = ymin, ymax = ymax))
}

st <- na.omit(chol) %>%
    ggplot(aes(x = drug, y = n_days, fill = drug)) + geom_violin(color = "black",
    alpha = 0.4) + labs(title = "Fig. 7: Drug vs n_days") + stat_summary(fun.data = data_summary,
    color = "darkred", alpha = 0.7) + theme_bw()


st1 <- na.omit(chol) %>%
    ggplot(aes(x = stage, y = n_days, fill = stage)) + geom_violin(color = "black",
    alpha = 0.4) + labs(title = "Fig. 9: Stage vs n_days") +
    stat_summary(fun.data = data_summary, color = "darkred",
        alpha = 0.7) + theme_bw()

st2 <- na.omit(chol) %>%
    ggplot(aes(x = status, y = n_days, fill = status)) + geom_violin(color = "black",
    alpha = 0.4) + labs(title = "Fig. 8: Status vs n_days") +
    stat_summary(fun.data = data_summary, color = "darkred",
        alpha = 0.7) + theme_bw()

(st | st2)/st1

From fig 7, I can tell that the distribution of “n_days” does not differ much between the two drugs. However, I could observe some interesting patterns from both fig 8 and fig9, where those diagnosed with later stage (i.e. 4) survive fewer days than those diagnosed with earlier stage (i.e. 1).

# Stage / Status vs n_days, colored by drugs
sn1 <- na.omit(chol) %>%
    ggplot(aes(x = status, y = n_days, fill = drug)) + geom_boxplot(alpha = 0.4) +
    theme_bw() + labs(title = "Fig. 10: status vs n_days, colored by drugs")

sn2 <- na.omit(chol) %>%
    ggplot(aes(x = stage, y = n_days, fill = drug)) + geom_boxplot(alpha = 0.4) +
    theme_bw() + labs(title = "Fig. 11: stage vs n_days, colored by drugs")

sn1/sn2

Since fig 8 and 9 have shown interesting patterns, I further analyze the relationship among status, stage, n_days, and drugs. Fig 10 show that the distribution of n_days between the two drugs for each status does not differ that much. However, as shown in fig 11, stage 1 patients who receive placebo report a longer number of days than those who receive the drug. The distribution of n_days also does not differ much between the two drugs for patients who are diagnosed stage 2, 3, 4.

It is important to note that there might be confounding variables contributing to the distribution; the measurement of n_days is also not standardized for every patient that it records the number of days between registration and the earlier of either death, transplantation, or end of study. Patients might receive transplant but decease right after the study ended. The tails of each boxplot are also long that it is important to note that there is a large distribution of data, which might affect the accuracy of analysis. The study, therefore, might not give the full picture of the effects of drugs on patients.

I would further look into the number of patients assigned to each drug, and the distribution of status and stage in the following figures.

# Stage 1 patients who received Placebo
o_p <- na.omit(chol) %>%
    filter(stage == "1" & drug == "Placebo")

cat("Stage 1 patients who received placebo:\n")
## Stage 1 patients who received placebo:
cat("Number of observations:\n")
## Number of observations:
nrow(o_p)
## [1] 4
cat("Summary statistics:\n")
## Summary statistics:
summary(mean(o_p$n_days))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3681    3681    3681    3681    3681    3681
# Stage 1 patients who received D-penicillamine
o_d <- na.omit(chol) %>%
    filter(stage == "1" & drug == "D-penicillamine")

cat("Stage 1 patients who received D-penicillamine:\n")
## Stage 1 patients who received D-penicillamine:
cat("Number of observations:\n")
## Number of observations:
nrow(o_d)
## [1] 12
cat("Summary statistics:\n")
## Summary statistics:
summary(mean(o_d$n_days))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2693    2693    2693    2693    2693    2693

Given the stage 1 patients who receive different drugs differ a lot in their number of days survived, when I look closer in each of their summary statistics above, the number of observation of stage 1 patients who received placebo is significantly lower than those who received D-penicillamine, which might contribute to the difference in distribution of days of survival.

# Drugs vs status
d <- na.omit(chol) %>%
    ggplot(aes(x = drug, fill = status)) + geom_bar(color = "black",
    alpha = 0.4) + geom_text(stat = "count", aes(label = after_stat(count)),
    vjust = -0.18, size = 2) + labs(title = "Fig. 12: Non-Scaled Drugs vs Status") +
    scale_fill_discrete(labels = c("C (not dead)", "CL (received liver transplant)",
        "D (died)")) + theme_bw()

# Patients' status vs stage
s2 <- na.omit(chol) %>%
    ggplot(aes(x = stage, fill = status)) + geom_bar(color = "black",
    alpha = 0.4, position = "fill") + labs(title = "Fig. 13: Scaled Stage vs Status") +
    scale_fill_discrete(labels = c("C (not dead)", "CL (received liver transplant)",
        "D (died)")) + theme_bw()

d/s2

From figure 12, we observe that the distribution of both drugs is similar, and that within each drug, the distribution of patients with each 3 status is also similar.

From figure 13, we observe that there might be a relationship between stage and status and least patients are not dead when they are diagnosed with stage 1, whereas most deceased patients are diagnosed with either stage 3 or 4.

# Patients' drugs vs presence of ascites, hepatomegaly,
# spiders, and edema
a <- na.omit(chol) %>%
    ggplot(aes(x = drug, fill = ascites)) + geom_bar(color = "black",
    alpha = 0.4, position = "fill") + labs(title = "Fig. 14: Drugs vs Ascites") +
    theme_bw()

h <- na.omit(chol) %>%
    ggplot(aes(x = drug, fill = hepatomegaly)) + geom_bar(color = "black",
    alpha = 0.4, position = "fill") + labs(title = "Fig. 15: Drugs vs Hepatomegaly") +
    theme_bw()

s <- na.omit(chol) %>%
    ggplot(aes(x = drug, fill = spiders)) + geom_bar(color = "black",
    alpha = 0.4, position = "fill") + labs(title = "Fig. 16: Drugs vs Spiders") +
    theme_bw()

e <- na.omit(chol) %>%
    ggplot(aes(x = drug, fill = edema)) + geom_bar(color = "black",
    alpha = 0.4, position = "fill") + labs(title = "Fig. 17: Drugs vs Edema") +
    theme_bw()

(a + h)/(s + e)

From figure 14 to 17, we observe that patients who take D-penicillamine versus those who take placebo do not show significant difference in their presence of ascites, hepatomegaly, spiders, and edema.

# Patients' status vs presence of ascites, hepatomegaly,
# spiders, and edema
a1 <- na.omit(chol) %>%
    ggplot(aes(x = status, fill = ascites)) + geom_bar(color = "black",
    alpha = 0.4, position = "fill") + labs(title = "Fig. 18: Status vs Ascites",
    y = "proportion") + theme_bw()

h1 <- na.omit(chol) %>%
    ggplot(aes(x = status, fill = hepatomegaly)) + geom_bar(color = "black",
    alpha = 0.4, position = "fill") + labs(title = "Fig. 19: Status vs Hepatomegaly") +
    theme_bw()

s1 <- na.omit(chol) %>%
    ggplot(aes(x = status, fill = spiders)) + geom_bar(color = "black",
    alpha = 0.4, position = "fill") + labs(title = "Fig. 20: Status vs Spiders") +
    theme_bw()

e1 <- na.omit(chol) %>%
    ggplot(aes(x = status, fill = edema)) + geom_bar(color = "black",
    alpha = 0.4, position = "fill") + labs(title = "Fig. 21: Status vs Edema") +
    theme_bw()

(a1 + h1)/(s1 + e1)

From figure 18 to 21, we observe that there might be a relationship between status and presence of ascites, hepatomegaly, spiders, and edema. Deceased patients are found to have the highest proportion of showing presence of ascites, hepatomegaly, spiders, and edema. Generally, a higher proportion of patients show presence of hepatomegaly, as opposed to ascites, which lowest proportion of patients show presence to.

# Patients' stage vs presence of ascites, hepatomegaly,
# spiders, and edema
a2 <- na.omit(chol) %>%
    ggplot(aes(x = stage, fill = ascites)) + geom_bar(color = "black",
    alpha = 0.4, position = "fill") + labs(title = "Fig. 22: Stage vs Ascites") +
    theme_bw()

h2 <- na.omit(chol) %>%
    ggplot(aes(x = stage, fill = hepatomegaly)) + geom_bar(color = "black",
    alpha = 0.4, position = "fill") + labs(title = "Fig. 23: Stage vs Hepatomegaly") +
    theme_bw()

s2 <- na.omit(chol) %>%
    ggplot(aes(x = stage, fill = spiders)) + geom_bar(color = "black",
    alpha = 0.4, position = "fill") + labs(title = "Fig. 24: Stage vs Spiders") +
    theme_bw()

e2 <- na.omit(chol) %>%
    ggplot(aes(x = stage, fill = edema)) + geom_bar(color = "black",
    alpha = 0.4, position = "fill") + labs(title = "Fig. 25: Stage vs Edema") +
    theme_bw()

(a2 + h2)/(s2 + e2)

From figure 22 to 25, we observe that there might be a relationship between stage and presence of ascites, hepatomegaly, spiders, and edema. Stage 4 patients are shown to have the highest proportion of showing presence of ascites, hepatomegaly, spiders, and edema. Generally, a higher proportion of patients show presence of hepatomegaly, as opposed to ascites, which lowest proportion of patients show presence to.

# Exploring relationship among level of bilirubin,
# cholesterol, albumin, urine copper, alkaline phosphatase,
# serum glutamic-oxaloacetic transaminase, triglycerides,
# platelets and prothrombin
pairs(chol[, c(11:19)], col = colSex, main = "Fig. 26: Pairs-plot colored by Sex")

pairs(chol[, c(11:19)], col = colStage, main = "Fig. 27: Pairs-plot colored by Stage")

pairs(chol[, c(11:19)], col = colDrug, main = "Fig. 28: Pairs-plot colored by Drug")

pairs(chol[, c(11:19)], col = colStatus, main = "Fig. 29: Pairs-plot colored by Status")

From figure 26 to 29, I used pairsplot to observe the relationship among level of bilirubin, cholesterol, albumin, urine copper, alkaline phosphatase, serum glutamic-oxaloacetic transaminase, triglycerides, platelets and prothrombin, and compared to the differences among different levels of patients’ drugs, sex, status, and stage by color-code. There is no significant difference among the figures in different color code.

EDA (Part III)

Outliers: Identification and removal

In the following part, I would utilize several plots to identify outlying points and extreme values.

# Identifying outliers for variables
pairs(chol[, c(2, 11:19)], main = "Fig. 30: Pairs-plot of original data")

# Outliers
ou1 = which(chol$sgot > 400)
ou2 = which(chol$tryglicerides > 550)
ou3 = which(chol$prothrombin > 15)
ou4 = which(chol$platelets > 700)
ou = c(ou1, ou2, ou3, ou4)
ou  #outliers index
## [1] 166  75 107 191 325 334
chol[ou, ]
##      id n_days status            drug   age sex ascites hepatomegaly spiders
## 166 166   2721      C         Placebo 15105   F       N            Y       N
## 75   75   1191      D D-penicillamine 15895   F       Y            Y       Y
## 107 107   3388      C         Placebo 22836   F       N            N       N
## 191 191    216      D         Placebo 19246   F       Y            Y       Y
## 325 325   4795      C            <NA> 12419   F       N            Y       Y
## 334 334    466      D            <NA> 20454   F       N            N       N
##     edema bilirubin cholesterol albumin copper alk_phos   sgot tryglicerides
## 166     N       5.7        1480    3.26     84     1960 457.25           108
## 75      S      17.1         674    2.53    207     2078 182.90           598
## 107     N       0.6         212    4.03     10      648  71.30            77
## 191     N      24.5        1092    3.35    233     3740 147.25           432
## 325     N       1.8         280    3.24    129     1112  67.08            92
## 334     N       7.1        1775    3.51    173      622 128.65           185
##     platelets prothrombin stage
## 166       213         9.5     2
## 75        268        11.5     4
## 107       316        17.1     1
## 191       399        15.2     4
## 325       219        18.0     2
## 334       721        11.8     4

I utilized a pairsplot among all continuous variables in order to identify extreme values. From figure 30, I can tell that there are several outliers where sgot > 400, tryglicerides > 550, and prothrombin > 15, and platelets > 700. Therefore, utilizing the which() function, I was able to identify the row indices of those outliers. However, there are potential outliers that I might want to explore further and decide whether they contribute to the extreme values. In the following code, I took a close-up to “tryglicerides” vs “prothrombin”.

# Close-up; outliers
highlight_df <- na.omit(chol[chol$tryglicerides > 550 | chol$prothrombin >
    15, ])
potential_df <- na.omit(chol[chol$prothrombin > 14 & chol$prothrombin <
    15, ])

ggplot(na.omit(chol), aes(x = prothrombin, y = tryglicerides)) +
    geom_point(alpha = 0.6) + geom_point(data = highlight_df,
    aes(x = prothrombin, y = tryglicerides), color = "firebrick3") +
    geom_point(data = potential_df, aes(x = prothrombin, y = tryglicerides),
        color = "dodgerblue") + labs(title = "Fig. 31: Closeup of tryglicerides vs prothrombin") +
    theme_bw()

Even the blue point seems ambiguous, at ths point I would not consider it as an outlier and would proceed with further EDA.

# Pairs plot without outliers
pairs(chol[-ou, c(2, 11:19)], main = "Fig. 32: Pairs-plot of data without outliers")

After removing 6 outliers identified in figure 31, figure 32 shows a cleaner and distinct pattern.

# Cook's distance
par(mfrow = c(1, 2))
plot(lm(n_days ~ . - id, data = chol), which = c(4:5))

whOut <- c(44, 56, 87)
chol[whOut, ]
##    id n_days status            drug   age sex ascites hepatomegaly spiders
## 44 44   3428      D         Placebo 13727   F       N            Y       Y
## 56 56   1847      D         Placebo 12279   F       N            Y       Y
## 87 87    198      D D-penicillamine 13616   F       N            N       N
##    edema bilirubin cholesterol albumin copper alk_phos   sgot tryglicerides
## 44     Y       3.3         299    3.55    131   1029.0 119.35            50
## 56     N       1.1         498    3.80     88  13862.4  95.46           319
## 87     N       1.1         345    4.40     75   1860.0 218.55            72
##    platelets prothrombin stage
## 44       199        11.7     3
## 56       365        10.6     2
## 87       447        10.7     3
without <- c(whOut, ou)

In order to make closer observation and detect the outlying points, I fitted a linear model with the original data, and only select the residuals vs leverage plot and cook’s distance plot from diagnostic plot. Large leverage or residuals, as well as cooks distance can be utilized for indicating potential outliers. On the plot above, it gives me a clearer sense of the index of potential outliers. I put the indices of those points into a vector and combined the vector of outliers identified from pairsplot.

# Original data: PCA and outliers
pcaChol <- prcomp(na.omit(chol[, -c(1, 3:4, 6:10, 20)]), center = TRUE,
    scale = TRUE)
plot(pcaChol$x[, 1:2], main = "Fig. 33: PC1 vs PC2")
points(pcaChol$x[-without, 1:2])
text(pcaChol$x[without, 1:2], labels = without, col = "red",
    pch = 19)

I further explore the presence of outliers by carrying out PCA analysis on the numerical continuous variables. In figure 33, I performed PCA on the explanatory variables and highlighted the points to be removed in order to observe how the points are separated. As I do not see clear pattern of some points distancing from each other, I decided that I would further plot a pairsplot on all PC components and distinguish outliers.

pairs(pcaChol$x[, 1:11], main = "Fig. 34: Pairsplot of PCA")

There are potential outliers for PC5 and PC6, therefore, I would take a close-up to these two PC components in my next plot.

# Close-up: PC 5 and PC 6
plot(pcaChol$x[, 5:6], main = "Fig. 35: PC5 vs PC6")
points(pcaChol$x[-without, 5:6])
text(pcaChol$x[without, 5:6], labels = without, col = "red",
    pch = 19)

By plotting a close-up plot of PC5 against PC6 (with outliers identified previously labelled), I can tell that there are several outliers. I am going to pull out those outliers in the code afterwards.

# Identifying and removing outliers
subfive <- pcaChol$x[, 5]
subsix <- pcaChol$x[, 6]

ouu <- c(which(subfive > 4), which(subsix < -4), which(subsix >
    3))

pcaChol$x[ouu, 5:6]
##            PC5        PC6
## 107  5.6666465 -0.5482246
## 191  4.2669506  1.2008599
## 325  5.6333220 -0.6063024
## 361 -0.8830898 -4.0478721
## 24  -1.2185679  3.3080145
## 75   0.4548978  3.7090272
## 337 -0.3754562  4.0917859
plot(subfive[-ouu], subsix[-ouu], main = "Fig. 36: PC5 vs PC6",
    xlab = "PC5 (outliers removed)", ylab = "PC2=6 (outliers removed)")
points(pcaChol$x[-without, 5:6])
text(pcaChol$x[without, 5:6], labels = without, col = "red",
    pch = 19)

After removing the outliers, I plotted PC5 against PC6 again (with outliers identified previously labelled), the plot has a clearer and cleaner pattern.

Utilizing the information I compiled from the PCA analysis above, I assigned all the indices of outliers to one vector and clean the data before carrying on to building a linear model in the next section of the project.

# FInalized: Data-Cleaning
without <- c(whOut, ou, ouu)  # Combining outliers from pairsplot previously identified and from the leverage graph 

# Cleaned data without outliers
cleaned_chol <- chol[-without, ]

Part II: Multivariate Regression

Multivariate regression analysis

# lm of original data
lm1 <- lm(n_days ~ . - id, data = chol)
cat("R-squared of original data (with outliers):", summary(lm1)$r.squared)
## R-squared of original data (with outliers): 0.4592946
# Diagnostic of non-removed/ original data lm
par(mfrow = c(2, 2))
plot(lm1)

I explore the pattern of the dataset by fitting a linear model. My initial linear model would be exploring any interesting patterns between the response variable (number of days) and all other variables except id as explanatory variables.

I plotted diagnostic plots for the initial linear model, and observe any interesting patterns, including outlying points, non-linearity, heteroscedasticity, and normal assumption.

# lm of cleaned data
lm2 <- lm(n_days ~ . - id, data = na.omit(cleaned_chol))
cat("R-squared without outliers:", summary(lm2)$r.squared)
## R-squared without outliers: 0.4952265
# Diagnostic of cloeaned data lm
par(mfrow = c(2, 2))
plot(lm2)

I then plotted diagnostic plots again with the cleaned dataset, with the same explanatory variables. I compared the r square for this model (lm2), it has increased from 0.4592946 to 0.4952265.

# Determining non-linearity and heteroskedasticity
par(mfrow = c(1, 2))
plot(lm2, which = c(1, 3))

I am interested in whether the data shows patterns of non-linearity and heteroscedasticity. Therefore, I took a close-up at the “residuals vs fitted” and “scale-location” graph from diagnostic plots. From the “residuals vs fitted” plot, I am able to observe the pattern of non-linearity since the best fit line is curved. The graph has also shown pattern of heteroskedasticity as there is slight pattern with increasing variability in the residuals for larger fitted values than for smaller ones, and that the residuals are not centered at zero given that the red fit line is not flat. In the following part, I would transform the data and observe any differences made.

# Log transformation
lmLog = lm(log(n_days) ~ . - id, data = na.omit(cleaned_chol))
cat("Log-transformed model R-squared:", summary(lmLog)$r.squared)
## Log-transformed model R-squared: 0.5992828
# Sqrt transformation
lmSqrt = lm(sqrt(n_days) ~ . - id, data = na.omit(cleaned_chol))
cat("\nSqrt-transformed model R-squared:", summary(lmSqrt)$r.squared)
## 
## Sqrt-transformed model R-squared: 0.552496
par(mfrow = c(2, 2))
plot(lmLog, which = 1, main = "Log")
plot(lmSqrt, which = 1, main = "Sqrt")
plot(lmLog, which = 3, main = "Log")
plot(lmSqrt, which = 3, main = "Sqrt")

I performed both log and square root transformation on the explanatory variable, and performed regression diagnostics once again. I am only comparing the “scale-location” and “residuals vs fitted” graph to get a clearer sense of the pattern of heteroskedasticity and non-linearity. The log transformation has shown more severe pattern of heteroskedasticity, whereas the square root transformation has shown similar effects as the original graph. Despite the log-transformed model has shown a higher R-squared of 0.5980597 as opposed to that of the sqrt-transformed model of 0.5515128, square-root transformation has shown less pattern of heteroskedasticity.

par(mfrow = c(1, 2))
plot(lm2, which = 1, main = "Original")
plot(lmSqrt, which = 1, main = "Sqrt")

In order to show that the square root transformation has shown less pattern of heteroskedasticity and has fixed the problem of non-linearity, I put the “residuals vs fitted” plot of the data before and after transformation side-by-side for comparison. As shown, the square root transformation has made the data less non-linear, and that the residuals have shown lesser pattern of increasing variance.

# Numeric comparison: normal distribution of the
# standardized residuals
quantile(stdres(lm2), 1/nrow(cleaned_chol))
## 0.2487562% 
##  -2.316778
qnorm(1/nrow(cleaned_chol))
## [1] -2.80864
# Normal distribution vs QQ plot
par(mfrow = c(1, 2))
hist(stdres(lm2), freq = FALSE, breaks = 10, xlim = c(-3.5, 3.5))
curve(dnorm, add = TRUE)
plot(lm2, which = 2)

In order to perform further analytics on whether the normal assumption holds, I used the quantile() function that computes the value of the standardized residuals at the 1/n quantile, and compare to the qnorm() function that computes the value at the probability 1/n under the assumption of standard normal distribution. Since the two values slightly vary from each other, we can determine that the curve does not absolutely follow the normal assumption.

To gather more concrete information about normality, I used visualization for my comparison, where I selected the QQ plot and compare side-by-side with the distribution of standardized residuals, with normal density curve overlayed. If the standardized residuals are normally distributed, the histogram should have a roughly bell-shaped curve and follow the normal density curve overlayed. From the graph above, it shows that the distribution of standardized residuals slightly deviate from the normal distribution as it skews slightly to the left as opposed to the curve overlayed.

#
par(mfrow = c(2, 2))
plot(lm2, which = 2, main = "Original (counts)")
plot(lmLog, which = 2, main = "Log")
plot(lmSqrt, which = 2, main = "Sqrt")

I then used the square root and log transformation that I performed in previous parts, and selected only the QQ-plots to compare side-by-side. Log transformation performed the worst in holding the normal assumption, whereas the square root transformation resembles the original QQ-plot.

The QQ-plot of square root transformation seemingly best fits the line, hence, showing the most normal distribution.

In conclusion, I decided to ommit the observations that are identified previously as outliers as the removal of these observations would give me a higer r^2, I would also transform my response variable with square root as the square root transformation has solved both the problem of non-linearity and heteroskedasticity in the data.

Part III: Variable selection

In the following part, I performed variable selection to select a suitable model involving a subset of explanatory variables. I will be utilizing the stepwise method. Due to potential technical issues that may arise, I am removing all the categorical variables with more than 2 variables from the linear model before carrying out the variable selection process.

# Removal of categorical variables with more than 2 levels
lmSqrt1 <- lm(sqrt(n_days) ~ . - id - status - edema - stage,
    data = na.omit(cleaned_chol))

# Stepwise method
chol.stepwise <- step(lmSqrt1, direction = "both")
## Start:  AIC=1380.66
## sqrt(n_days) ~ (id + status + drug + age + sex + ascites + hepatomegaly + 
##     spiders + edema + bilirubin + cholesterol + albumin + copper + 
##     alk_phos + sgot + tryglicerides + platelets + prothrombin + 
##     stage) - id - status - edema - stage
## 
##                 Df Sum of Sq   RSS    AIC
## - sex            1      1.17 28189 1378.7
## - drug           1      1.81 28190 1378.7
## - tryglicerides  1      3.57 28191 1378.7
## - prothrombin    1      3.78 28192 1378.7
## - sgot           1     29.91 28218 1379.0
## - cholesterol    1     43.10 28231 1379.1
## - age            1    165.07 28353 1380.4
## <none>                       28188 1380.7
## - spiders        1    191.53 28379 1380.7
## - hepatomegaly   1    264.91 28453 1381.4
## - platelets      1    266.29 28454 1381.4
## - ascites        1    627.96 28816 1385.2
## - copper         1   1414.40 29602 1393.2
## - bilirubin      1   1433.50 29621 1393.3
## - albumin        1   2532.88 30721 1404.1
## - alk_phos       1   2635.43 30823 1405.1
## 
## Step:  AIC=1378.67
## sqrt(n_days) ~ drug + age + ascites + hepatomegaly + spiders + 
##     bilirubin + cholesterol + albumin + copper + alk_phos + sgot + 
##     tryglicerides + platelets + prothrombin
## 
##                 Df Sum of Sq   RSS    AIC
## - drug           1      1.74 28191 1376.7
## - prothrombin    1      3.24 28192 1376.7
## - tryglicerides  1      3.37 28192 1376.7
## - sgot           1     29.82 28219 1377.0
## - cholesterol    1     44.88 28234 1377.1
## - age            1    177.44 28366 1378.5
## <none>                       28189 1378.7
## - spiders        1    192.91 28382 1378.7
## - hepatomegaly   1    267.47 28456 1379.5
## - platelets      1    271.36 28460 1379.5
## + sex            1      1.17 28188 1380.7
## - ascites        1    627.22 28816 1383.2
## - bilirubin      1   1455.52 29644 1391.6
## - copper         1   1591.27 29780 1392.9
## - albumin        1   2565.18 30754 1402.5
## - alk_phos       1   2640.82 30830 1403.2
## 
## Step:  AIC=1376.69
## sqrt(n_days) ~ age + ascites + hepatomegaly + spiders + bilirubin + 
##     cholesterol + albumin + copper + alk_phos + sgot + tryglicerides + 
##     platelets + prothrombin
## 
##                 Df Sum of Sq   RSS    AIC
## - prothrombin    1      2.92 28194 1374.7
## - tryglicerides  1      3.26 28194 1374.7
## - sgot           1     29.68 28220 1375.0
## - cholesterol    1     43.72 28234 1375.2
## - age            1    175.94 28367 1376.5
## <none>                       28191 1376.7
## - spiders        1    191.65 28382 1376.7
## - platelets      1    269.67 28460 1377.5
## - hepatomegaly   1    275.71 28466 1377.6
## + drug           1      1.74 28189 1378.7
## + sex            1      1.10 28190 1378.7
## - ascites        1    625.49 28816 1381.2
## - bilirubin      1   1479.19 29670 1389.8
## - copper         1   1589.54 29780 1390.9
## - albumin        1   2564.24 30755 1400.5
## - alk_phos       1   2662.29 30853 1401.4
## 
## Step:  AIC=1374.72
## sqrt(n_days) ~ age + ascites + hepatomegaly + spiders + bilirubin + 
##     cholesterol + albumin + copper + alk_phos + sgot + tryglicerides + 
##     platelets
## 
##                 Df Sum of Sq   RSS    AIC
## - tryglicerides  1      2.31 28196 1372.8
## - sgot           1     29.23 28223 1373.0
## - cholesterol    1     46.62 28240 1373.2
## - age            1    173.03 28367 1374.5
## - spiders        1    188.77 28382 1374.7
## <none>                       28194 1374.7
## - platelets      1    268.17 28462 1375.5
## - hepatomegaly   1    273.48 28467 1375.6
## + prothrombin    1      2.92 28191 1376.7
## + drug           1      1.42 28192 1376.7
## + sex            1      0.60 28193 1376.7
## - ascites        1    624.47 28818 1379.2
## - bilirubin      1   1531.50 29725 1388.4
## - copper         1   1586.72 29780 1388.9
## - albumin        1   2561.55 30755 1398.5
## - alk_phos       1   2730.62 30924 1400.1
## 
## Step:  AIC=1372.75
## sqrt(n_days) ~ age + ascites + hepatomegaly + spiders + bilirubin + 
##     cholesterol + albumin + copper + alk_phos + sgot + platelets
## 
##                 Df Sum of Sq   RSS    AIC
## - sgot           1     30.27 28226 1371.1
## - cholesterol    1     44.79 28241 1371.2
## - age            1    171.12 28367 1372.5
## <none>                       28196 1372.8
## - spiders        1    192.37 28388 1372.8
## - platelets      1    271.63 28468 1373.6
## - hepatomegaly   1    272.19 28468 1373.6
## + tryglicerides  1      2.31 28194 1374.7
## + prothrombin    1      1.97 28194 1374.7
## + drug           1      1.38 28194 1374.7
## + sex            1      0.55 28195 1374.7
## - ascites        1    624.81 28821 1377.2
## - bilirubin      1   1579.94 29776 1386.9
## - copper         1   1594.29 29790 1387.0
## - albumin        1   2581.62 30778 1396.7
## - alk_phos       1   2737.15 30933 1398.2
## 
## Step:  AIC=1371.06
## sqrt(n_days) ~ age + ascites + hepatomegaly + spiders + bilirubin + 
##     cholesterol + albumin + copper + alk_phos + platelets
## 
##                 Df Sum of Sq   RSS    AIC
## - cholesterol    1     53.40 28280 1369.6
## - age            1    152.89 28379 1370.7
## <none>                       28226 1371.1
## - spiders        1    192.46 28419 1371.1
## - hepatomegaly   1    260.94 28487 1371.8
## - platelets      1    300.09 28526 1372.2
## + sgot           1     30.27 28196 1372.8
## + tryglicerides  1      3.35 28223 1373.0
## + prothrombin    1      1.46 28225 1373.0
## + drug           1      1.27 28225 1373.0
## + sex            1      0.52 28226 1373.1
## - ascites        1    606.18 28832 1375.3
## - copper         1   1686.94 29913 1386.2
## - bilirubin      1   1943.34 30170 1388.8
## - albumin        1   2704.18 30930 1396.1
## - alk_phos       1   2716.40 30943 1396.3
## 
## Step:  AIC=1369.62
## sqrt(n_days) ~ age + ascites + hepatomegaly + spiders + bilirubin + 
##     albumin + copper + alk_phos + platelets
## 
##                 Df Sum of Sq   RSS    AIC
## - age            1    132.98 28413 1369.0
## - spiders        1    187.39 28467 1369.6
## <none>                       28280 1369.6
## - platelets      1    261.74 28541 1370.3
## - hepatomegaly   1    273.61 28553 1370.5
## + cholesterol    1     53.40 28226 1371.1
## + sgot           1     38.89 28241 1371.2
## + prothrombin    1      4.38 28275 1371.6
## + sex            1      1.79 28278 1371.6
## + tryglicerides  1      0.93 28279 1371.6
## + drug           1      0.20 28279 1371.6
## - ascites        1    563.20 28843 1373.5
## - copper         1   1662.97 29943 1384.5
## - bilirubin      1   2511.74 30791 1392.8
## - alk_phos       1   2691.55 30971 1394.5
## - albumin        1   2743.56 31023 1395.0
## 
## Step:  AIC=1369.01
## sqrt(n_days) ~ ascites + hepatomegaly + spiders + bilirubin + 
##     albumin + copper + alk_phos + platelets
## 
##                 Df Sum of Sq   RSS    AIC
## - spiders        1    147.89 28560 1368.5
## <none>                       28413 1369.0
## + age            1    132.98 28280 1369.6
## - platelets      1    291.26 28704 1370.0
## - hepatomegaly   1    305.92 28718 1370.2
## + cholesterol    1     33.49 28379 1370.7
## + sgot           1     17.31 28395 1370.8
## + sex            1     14.28 28398 1370.9
## + drug           1      0.81 28412 1371.0
## + prothrombin    1      0.47 28412 1371.0
## + tryglicerides  1      0.06 28412 1371.0
## - ascites        1    677.29 29090 1374.0
## - copper         1   1702.13 30115 1384.2
## - bilirubin      1   2453.67 30866 1391.5
## - alk_phos       1   2727.94 31140 1394.2
## - albumin        1   2964.64 31377 1396.4
## 
## Step:  AIC=1368.55
## sqrt(n_days) ~ ascites + hepatomegaly + bilirubin + albumin + 
##     copper + alk_phos + platelets
## 
##                 Df Sum of Sq   RSS    AIC
## <none>                       28560 1368.5
## + spiders        1    147.89 28413 1369.0
## + age            1     93.48 28467 1369.6
## - platelets      1    341.09 28902 1370.1
## + cholesterol    1     32.44 28528 1370.2
## + sgot           1     19.69 28541 1370.3
## + sex            1      1.48 28559 1370.5
## + tryglicerides  1      1.41 28559 1370.5
## + drug           1      1.27 28559 1370.5
## + prothrombin    1      0.35 28560 1370.5
## - hepatomegaly   1    408.52 28969 1370.8
## - ascites        1    674.10 29235 1373.5
## - copper         1   1856.26 30417 1385.2
## - bilirubin      1   2621.01 31181 1392.5
## - alk_phos       1   2791.63 31352 1394.2
## - albumin        1   3099.21 31660 1397.0

In order to carry out variable selection, I utilized the stepwise method where R would add or remove a single variable, determining the model that most improves the model criterion score (AIC). I chose direction =“both” as it combines both backward elimination and forward selection. By combining the two methods, the risk where some variables are added or removed early in the process would be minimized and I would gain the flexibility to change my mind about the model later.

Based on the summary of stepwise method, AIC for the model “lm(sqrt(n_days) ~ ascites + hepatomegaly + bilirubin + albumin + copper + alk_phos + platelets)” is 1368.55, which is lower than any other models described in the summary. This means that the model defined above would be the optimal. Therefore, we can conclude that the variables defined above would be deemed the most important, and the final model would be “lm(sqrt(n_days) ~ ascites + hepatomegaly + bilirubin + albumin + copper + alk_phos + platelets)”.

Part IV: Final Model Regression Diagnostics

# Final model
lm_final <- lm(sqrt(n_days) ~ ascites + hepatomegaly + bilirubin +
    albumin + copper + alk_phos + platelets, data = na.omit(cleaned_chol))
cat("R-squared of final model:", summary(lm_final)$r.squared)
## R-squared of final model: 0.4779805
# Final model diagnostics
par(mfrow = c(2, 2))
plot(lm_final)

Finally, I performed diagnostics on the final model again. The final model has a r^2 of 0.4779805, which has slightly improved compared to the original model of 0.4592946. According to the “residuals vs fitted” plot, the points are scattered and the residuals do not show increasing variances with an increased in fitted values, there is no pattern of heteroskedasticity. Moreover, even the best fitted line is slightly curved, it has comparatively resembled better to a straight linear line compared to the model before transformation, and therefore, the final model has shown less sign of non-linearity. Lastly, the normal QQ plot has shown that the points stick close to the best fit line and therefore the normal assumption holds. However, it is important to note that categorical variables with more than 2 levels are manually removed before variable selection, which might have contributed to biases or have decreased the level of fitness. In order to improve the accuracy and level of fitness, I would need more skill set for dealing with the complex technical issues arise, which is, fortunately, out of the scope of this class.

In conclusion, I believe there are no regression assumptions that are obviously violated for this dataset and final model.

Part V: Conclusion

Since data comes from a randomized test and drugs are randomly assigned, we can define causality from this clinical trial.

From the exploratory data analysis I performed in part one of this report, it is noted that there are no significant effects on n_days with different drugs used except on stage 1 patients, where we observe a negative effect when using the drug compared to the placebo. Given the stage 1 patients who received placebo is significantly lower than those who received D-penicillamine, the difference in distribution of days of survival might not be explained merely by the difference of drugs. It is also important to note that n_days is described with ambiguity (as mentioned in part one) and might have contributed to measurement biases or variances throughout the report.

However, after finalizing the model in part four, drug is not selected as one of the variables that define significance through the process of stepwise regression, it shows that the type of drug does not explain a huge variation of number of days a patient survive. Therefore, in conclusion, the effect of the drug D-penicillamine on the number of days a patient survived is not significant.