Cancer Survival Analysis

Sameer Mathur

One-way ANOVA

---

READING AND DESCRIBING DATA

Reading Data

# reading data into R
cancer.df <- read.csv("cancer-survival.csv")
# attaching data columns of the dataframe
attach(cancer.df)
# dimension of the dataframe
dim(cancer.df)

[1] 64  2

This data set records the number of days cancer patients survived after treatment, as a function of the type of cancer they were suffering from, i.e. the organ affected.

Descriptive Statistics by each Organ

# descriptive statistics of survival time by organ
library(data.table)
dt <- data.table(cancer.df)
dt[, list(Count = .N,
        mean = round(mean(Survival), 2), 
        sd = round(sd(Survival), 2),
        median = round(median(Survival), 2),
        min = min(Survival),
        max = max(Survival)), 
   by = list(Organ)]

      Organ Count    mean      sd median min  max
1:  Stomach    13  286.00  346.31    124  25 1112
2: Bronchus    17  211.59  209.86    155  20  859
3:    Colon    17  457.41  427.17    372  20 1843
4:    Ovary     6  884.33 1098.58    406  89 2970
5:   Breast    11 1395.91 1238.97   1166  24 3808

ALTERNATE: Descriptive Statistics by each Organ

# descriptive statistics by each organ
library(psych)
describeBy(Survival, Organ)


 Descriptive statistics by group 
group: Breast
   vars  n    mean      sd median trimmed    mad min  max range skew
X1    1 11 1395.91 1238.97   1166 1280.33 662.72  24 3808  3784 0.81
   kurtosis     se
X1     -0.7 373.56
-------------------------------------------------------- 
group: Bronchus
   vars  n   mean     sd median trimmed    mad min max range skew kurtosis
X1    1 17 211.59 209.86    155   181.2 133.43  20 859   839 1.75     2.66
     se
X1 50.9
-------------------------------------------------------- 
group: Colon
   vars  n   mean     sd median trimmed    mad min  max range skew
X1    1 17 457.41 427.17    372   394.2 244.63  20 1843  1823 1.96
   kurtosis    se
X1     3.76 103.6
-------------------------------------------------------- 
group: Ovary
   vars n   mean      sd median trimmed    mad min  max range skew
X1    1 6 884.33 1098.58    406  884.33 386.96  89 2970  2881 1.01
   kurtosis     se
X1    -0.75 448.49
-------------------------------------------------------- 
group: Stomach
   vars  n mean     sd median trimmed    mad min  max range skew kurtosis
X1    1 13  286 346.31    124  234.64 121.57  25 1112  1087 1.27     0.25
      se
X1 96.05

Number of observations in Organ

# number of observations in organ
addmargins(table(Organ))

Organ
  Breast Bronchus    Colon    Ovary  Stomach      Sum 
      11       17       17        6       13       64

Boxplot of Survival by Organ

# box plot of survival by organ
boxplot(Survival ~ Organ, data = cancer.df,
        main = "Boxplot of Survival by Organ",
        xlab = "Organ", ylab = "Survival (Days)")

[Figure: box plot of Survival (Days) by Organ]

Mean Plot of Survival Time by Organ

# mean plot by organ
library(gplots)
plotmeans(Survival ~ Organ, data = cancer.df,
          xlab = "Organ", ylab = "Survival (Days)",
          digits=2, col = "black", ccol = "blue", barwidth = 2,
          legends = TRUE, mean.labels = TRUE, frame = TRUE)

[Figure: plot of mean Survival (Days) by Organ]

ANOVA Assumptions

Normality of the Dependent Variable (Survival)

# check for normality in each group
with(cancer.df, tapply(Survival, Organ, shapiro.test))

$Breast

    Shapiro-Wilk normality test

data:  X[[i]]
W = 0.86857, p-value = 0.07431


$Bronchus

    Shapiro-Wilk normality test

data:  X[[i]]
W = 0.76596, p-value = 0.0007186


$Colon

    Shapiro-Wilk normality test

data:  X[[i]]
W = 0.76056, p-value = 0.0006134


$Ovary

    Shapiro-Wilk normality test

data:  X[[i]]
W = 0.76688, p-value = 0.029


$Stomach

    Shapiro-Wilk normality test

data:  X[[i]]
W = 0.75473, p-value = 0.002075

Normality of the Dependent Variable (Survival)

Null hypothesis (\( H_0 \)): The data is normally distributed.

We fail to reject the null hypothesis for Breast (p = 0.074).
We reject the null hypothesis for the other organs: Bronchus, Colon, Ovary and Stomach (all p < 0.05).

In other words, we cannot assume normality in every group.
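The per-group decision above can also be tabulated rather than read off the printed output. The following is a minimal sketch on simulated stand-in data (the data frame `toy` and its groups A, B, C are hypothetical, not the cancer data); it keeps only the Shapiro-Wilk p-value for each group and flags rejections at the 5% level.

```r
# Simulated stand-in data (NOT the cancer data): three hypothetical groups
set.seed(1)
toy <- data.frame(
  y = c(rnorm(15), rexp(15), rnorm(15, mean = 2)),
  g = factor(rep(c("A", "B", "C"), each = 15))
)
# run Shapiro-Wilk within each group and keep only the p-value
pvals <- sapply(split(toy$y, toy$g), function(v) shapiro.test(v)$p.value)
# one row per group: p-value and whether normality is rejected at 5%
data.frame(p.value = round(pvals, 4), reject.H0 = pvals < 0.05)
```

With the real data, the same pattern would presumably be applied to `split(cancer.df$Survival, cancer.df$Organ)`.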

Homogeneity of Variance

# Check for homogeneity of variance
library(car)
leveneTest(Survival ~ Organ, data = cancer.df)

Levene's Test for Homogeneity of Variance (center = median)
      Df F value   Pr(>F)   
group  4  4.4524 0.003271 **
      59                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Null hypothesis (\( H_0 \)): The variance of cancer Survival is homogeneous across different types of cancer (Organ).

We reject the null hypothesis (p = 0.003271). There is heterogeneity in the variance of cancer survival.
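To see what the median-centered Levene test measures, it may help to build it by hand: take the absolute deviation of each observation from its own group median and run a one-way ANOVA on those deviations. The sketch below uses base R only and simulated stand-in data (`toy` is hypothetical, not the cancer data); it is meant as an illustration of the idea, not as a replacement for `leveneTest()`.

```r
# Simulated stand-in data (NOT the cancer data): spread grows from A to C
set.seed(2)
toy <- data.frame(
  y = c(rnorm(12, sd = 1), rnorm(12, sd = 3), rnorm(12, sd = 6)),
  g = factor(rep(c("A", "B", "C"), each = 12))
)
# absolute deviation of each observation from its own group median
group.median <- ave(toy$y, toy$g, FUN = median)
toy$abs.dev <- abs(toy$y - group.median)
# one-way ANOVA on the absolute deviations
levene.by.hand <- anova(lm(abs.dev ~ g, data = toy))
levene.by.hand
```

A large F value here means the typical distance from the group median differs between groups, which is what heterogeneity of variance looks like.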

Rectify Violation of Assumptions

We apply a log transformation to the dependent variable (Survival) and check again for normality and homogeneity of variance.

Log-Transformation to the Survival variable

# add the natural log of Survival as a new column
cancer.df$LogSurvival <- log(cancer.df$Survival)
# first few rows of the dataframe
head(cancer.df)

  Survival   Organ LogSurvival
1      124 Stomach    4.820282
2       42 Stomach    3.737670
3       25 Stomach    3.218876
4       45 Stomach    3.806662
5      412 Stomach    6.021023
6       51 Stomach    3.931826
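As a quick check of the transformation, the six Survival values printed above can be logged directly. The sketch below also shows that `exp()` undoes the transformation and that a mean taken on the log scale back-transforms to the geometric mean, which is never larger than the arithmetic mean. The variable names are illustrative only.

```r
# the six Survival values shown by head(cancer.df) above
survival <- c(124, 42, 25, 45, 412, 51)
# natural log, as used for LogSurvival
log.survival <- log(survival)
round(log.survival, 6)
# exp() undoes the transformation
all.equal(exp(log.survival), survival)
# a mean on the log scale back-transforms to the geometric mean
geo.mean <- exp(mean(log.survival))
c(arithmetic = mean(survival), geometric = geo.mean)
```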

Normality of the Dependent Variable after Log Transformation

# check for normality in each group after log transformation
with(cancer.df, tapply(LogSurvival, Organ, shapiro.test))

$Breast

    Shapiro-Wilk normality test

data:  X[[i]]
W = 0.802, p-value = 0.009995


$Bronchus

    Shapiro-Wilk normality test

data:  X[[i]]
W = 0.98047, p-value = 0.9613


$Colon

    Shapiro-Wilk normality test

data:  X[[i]]
W = 0.92636, p-value = 0.1891


$Ovary

    Shapiro-Wilk normality test

data:  X[[i]]
W = 0.983, p-value = 0.9655


$Stomach

    Shapiro-Wilk normality test

data:  X[[i]]
W = 0.92837, p-value = 0.3245

Homogeneity of Variance after Log Transformation

# Check for homogeneity of variance after log transformation
library(car)
leveneTest(LogSurvival ~ Organ, data = cancer.df)

Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  4  0.6685 0.6164
      59

Inference After Transformation

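After the log transformation, the Levene test is no longer significant (p = 0.6164), so the variances can be treated as homogeneous across groups.

The Shapiro-Wilk test no longer rejects normality for Bronchus, Colon, Ovary and Stomach. It still rejects for Breast (p = 0.009995), so normality holds in four of the five groups; the Anderson-Darling test on the model residuals in the EXTRA section does not reject normality (p = 0.161).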

Analysis of Variance (ANOVA)

Mean Plot of Survival Time by Organ

# mean plot by organ
library(gplots)
plotmeans(Survival ~ Organ, data = cancer.df,
          xlab = "Organ", ylab = "Survival (Days)",
          digits=2, col = "black", ccol = "blue", barwidth = 2,
          legends = TRUE, mean.labels = TRUE, frame = TRUE)

[Figure: plot of mean Survival (Days) by Organ]

One-way ANOVA Before Log-Transformation

# one-way ANOVA 
oneWayfit <- aov(Survival ~ Organ, data = cancer.df)
# summary of the ANOVA model
summary(oneWayfit)

            Df   Sum Sq Mean Sq F value   Pr(>F)    
Organ        4 11535761 2883940   6.433 0.000229 ***
Residuals   59 26448144  448274                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

One-way ANOVA After Log-Transformation

# one-way ANOVA 
oneWayTransfit <- aov(LogSurvival ~ Organ, data = cancer.df)
# summary of the ANOVA model
summary(oneWayTransfit)

            Df Sum Sq Mean Sq F value  Pr(>F)   
Organ        4  24.49   6.122   4.286 0.00412 **
Residuals   59  84.27   1.428                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpretation of the Result

As the p-value (0.00412) is less than 0.05, we conclude that mean log survival time differs significantly between at least some of the cancer organ groups.
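The F value and p-value in the table can be reproduced from the sums of squares and degrees of freedom alone. The sketch below copies those four numbers from the log-transformed ANOVA table above; because they are rounded in the printout, the result should agree with the table only to about two or three significant digits.

```r
# values copied from the ANOVA table after log-transformation
ss.organ <- 24.49; df.organ <- 4
ss.resid <- 84.27; df.resid <- 59
# mean squares = sum of squares / degrees of freedom
ms.organ <- ss.organ / df.organ
ms.resid <- ss.resid / df.resid
# F = between-group mean square / residual mean square
f.value <- ms.organ / ms.resid
# upper-tail probability of the F distribution
p.value <- pf(f.value, df.organ, df.resid, lower.tail = FALSE)
round(c(F = f.value, p = p.value), 4)
```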

Multiple Pairwise-Comparison between Means of the Groups

In a one-way ANOVA test, a significant p-value indicates that some of the group means are different, but we don't know which pairs of groups differ.

We can perform multiple pairwise comparisons to determine whether the mean difference between specific pairs of groups is statistically significant.

Hence, we use the Tukey Honestly Significant Differences (HSD) test for pairwise comparison.

Tukey HSD Pairwise Comparison Test

# Tukey comparison test
TukeyHSD(oneWayTransfit)

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = LogSurvival ~ Organ, data = cancer.df)

$Organ
                        diff       lwr        upr     p adj
Bronchus-Breast  -1.60543320 -2.906741 -0.3041254 0.0083352
Colon-Breast     -0.80948110 -2.110789  0.4918267 0.4119156
Ovary-Breast     -0.40798703 -2.114754  1.2987803 0.9615409
Stomach-Breast   -1.59068365 -2.968399 -0.2129685 0.0158132
Colon-Bronchus    0.79595210 -0.357534  1.9494382 0.3072938
Ovary-Bronchus    1.19744617 -0.399483  2.7943753 0.2296079
Stomach-Bronchus  0.01474955 -1.224293  1.2537924 0.9999997
Ovary-Colon       0.40149407 -1.195435  1.9984232 0.9540004
Stomach-Colon    -0.78120255 -2.020245  0.4578403 0.3981146
Stomach-Ovary    -1.18269662 -2.842480  0.4770864 0.2763506
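Because the model was fitted to LogSurvival, each `diff` is a difference in mean log survival. Exponentiating it gives a ratio of geometric mean survival times, which is often easier to read. The sketch below does this for the Bronchus-Breast row, with the numbers copied from the output above; the name `tukey.row` is illustrative only.

```r
# Bronchus-Breast row copied from the TukeyHSD output above
tukey.row <- c(diff = -1.60543320, lwr = -2.906741, upr = -0.3041254)
# exp() of a difference in mean logs = ratio of geometric means
ratio <- exp(tukey.row)
round(ratio, 3)
```

A ratio of about 0.2 suggests that geometric mean survival in the Bronchus group is roughly one fifth of that in the Breast group, and the whole interval lies below 1, in line with the significant adjusted p-value.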

Tukey Multiple Pairwise Comparisons Plot

# Tukey pair-wise comparisons plot (same model as the test above)
plot(TukeyHSD(oneWayTransfit))

[Figure: 95% family-wise confidence intervals for pairwise differences in mean LogSurvival]

What If Variances Are Heterogeneous?

The classical one-way ANOVA test assumes equal variances across groups. In our example, after the log transformation, the homogeneity of variance assumption turned out to be fine: the Levene test is not significant (p = 0.6164).

But what if homogeneity of variance is violated?

Welch One-way Test

This test is used when variances are heterogeneous.

ANOVA Test with Heterogeneity of Variance

# Welch one-way test: does not assume equal variances
oneway.test(Survival ~ Organ, data = cancer.df)


    One-way analysis of means (not assuming equal variances)

data:  Survival and Organ
F = 3.5152, num df = 4.000, denom df = 19.862, p-value = 0.02514
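To see how the Welch test relates to the classical ANOVA, `oneway.test()` can be run both ways on the same data. The sketch below uses simulated stand-in data (`toy` is hypothetical, not the cancer data): with `var.equal = TRUE` the F value should match `aov()`, while the default Welch version uses a different statistic and a fractional denominator degrees of freedom.

```r
# Simulated stand-in data (NOT the cancer data): unequal spreads
set.seed(3)
toy <- data.frame(
  y = c(rnorm(10, 0, 1), rnorm(10, 1, 3), rnorm(10, 2, 6)),
  g = factor(rep(c("A", "B", "C"), each = 10))
)
# Welch test (default: var.equal = FALSE)
welch <- oneway.test(y ~ g, data = toy)
# classical test via the same function
classic <- oneway.test(y ~ g, data = toy, var.equal = TRUE)
# F value from aov() for comparison
aov.f <- summary(aov(y ~ g, data = toy))[[1]][["F value"]][1]
c(welch.F = unname(welch$statistic),
  classic.F = unname(classic$statistic),
  aov.F = aov.f)
# denominator degrees of freedom under each approach
c(welch = unname(welch$parameter[2]), classic = unname(classic$parameter[2]))
```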

Pairwise-Comparison Test

# pairwise t tests with non-pooled SD and Benjamini-Hochberg adjustment
with(cancer.df, pairwise.t.test(Survival, Organ,
                p.adjust.method = "BH", pool.sd = FALSE))


    Pairwise comparisons using t tests with non-pooled SD 

data:  Survival and Organ 

         Breast Bronchus Colon Ovary
Bronchus 0.073  -        -     -    
Colon    0.110  0.110    -     -    
Ovary    0.443  0.349    0.443 -    
Stomach  0.073  0.502    0.349 0.349

P value adjustment method: BH
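The BH (Benjamini-Hochberg) adjustment named in the output can be reproduced in a few lines. The sketch below uses hypothetical raw p-values chosen only for illustration (they are not from the cancer data), applies the adjustment by hand and compares the result with `p.adjust()`.

```r
# hypothetical raw p-values, for illustration only
p.raw <- c(0.001, 0.012, 0.030, 0.040, 0.200)
m <- length(p.raw)
# work from the largest p-value down to the smallest
ord <- order(p.raw, decreasing = TRUE)
# scale each p-value by m / rank (ranks run m, m-1, ..., 1 in this order)
scaled <- p.raw[ord] * m / (m:1)
# running minimum keeps the adjusted values monotone; cap at 1
p.bh.manual <- numeric(m)
p.bh.manual[ord] <- pmin(1, cummin(scaled))
# built-in version for comparison
p.bh <- p.adjust(p.raw, method = "BH")
rbind(raw = p.raw, manual = p.bh.manual, p.adjust = p.bh)
```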

Kruskal-Wallis Rank Sum Test

This is a non-parametric test. This test can be used when the normality assumption is violated.

Kruskal-Wallis Rank Sum Test

# Kruskal-Wallis rank sum test
kruskal.test(Survival ~ Organ, data = cancer.df)


    Kruskal-Wallis rank sum test

data:  Survival by Organ
Kruskal-Wallis chi-squared = 14.954, df = 4, p-value = 0.004798
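The Kruskal-Wallis statistic depends only on the ranks of the observations, which is why it does not need the normality assumption. The sketch below computes it by hand on simulated stand-in data (`toy` is hypothetical, not the cancer data) and compares it with `kruskal.test()`. It assumes there are no tied values; with ties, `kruskal.test()` applies a correction that this hand calculation omits.

```r
# Simulated stand-in data (NOT the cancer data), continuous so no ties
set.seed(4)
toy <- data.frame(
  y = c(rnorm(8, 0), rnorm(8, 1), rnorm(8, 2)),
  g = factor(rep(c("A", "B", "C"), each = 8))
)
N <- nrow(toy)
# rank all observations together, ignoring group membership
r <- rank(toy$y)
# sum of ranks and sample size within each group
rank.sums <- tapply(r, toy$g, sum)
n.g <- tapply(r, toy$g, length)
# Kruskal-Wallis H statistic (no-ties formula)
H <- 12 / (N * (N + 1)) * sum(rank.sums^2 / n.g) - 3 * (N + 1)
# H is compared with a chi-squared distribution on (groups - 1) df
p.value <- pchisq(H, df = nlevels(toy$g) - 1, lower.tail = FALSE)
kw <- kruskal.test(y ~ g, data = toy)
c(H.by.hand = H, H.kruskal = unname(kw$statistic))
```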

EXTRA

ANOVA Assumptions Before-After Transformation

NORMALITY ASSUMPTIONS: Comparing Normality of Residuals Before-After Log-transformation

Before Log-transformation

# normal Q-Q plot of residuals
plot(oneWayfit, 2)

[Figure: normal Q-Q plot of residuals, before log-transformation]

After Log-transformation

# normal Q-Q plot of residuals
plot(oneWayTransfit, 2)

[Figure: normal Q-Q plot of residuals, after log-transformation]

NORMALITY ASSUMPTIONS: Comparing Statistical Test for Normality Before-After Log-transformation

Before Log-transformation

# extract the residuals
aovResiduals <- residuals(oneWayfit)
# Anderson-Darling normality test
library(nortest)
ad.test(aovResiduals)


    Anderson-Darling normality test

data:  aovResiduals
A = 4.2752, p-value = 9.621e-11

After Log-transformation

# extract the residuals
aovTransResiduals <- residuals(oneWayTransfit)
# Anderson-Darling normality test
library(nortest)
ad.test(aovTransResiduals)


    Anderson-Darling normality test

data:  aovTransResiduals
A = 0.53873, p-value = 0.161

EQUALITY OF VARIANCES: Comparing Plots for the Homogeneity of Variance Before-After Log-transformation

Before Log-transformation

# residual versus fitted plot before log-transformation
plot(oneWayfit, 1)

[Figure: residuals versus fitted values, before log-transformation]

After Log-transformation

# residual versus fitted plot after log-transformation
plot(oneWayTransfit, 1)

[Figure: residuals versus fitted values, after log-transformation]

EQUALITY OF VARIANCES: Comparing Statistical Test for the Homogeneity of Variances

Before Log-transformation

# test homogeneity of variance before log-transformation
library(car)
leveneTest(Survival ~ Organ, data = cancer.df)

Levene's Test for Homogeneity of Variance (center = median)
      Df F value   Pr(>F)   
group  4  4.4524 0.003271 **
      59                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

After Log-transformation

# test homogeneity of variance after log-transformation
library(car)
leveneTest(LogSurvival ~ Organ, data = cancer.df)

Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  4  0.6685 0.6164
      59