Sameer Mathur
One-way ANOVA
---
# reading data into R
cancer.df <- read.csv("cancer-survival.csv")
# attaching data columns of the dataframe
attach(cancer.df)
# dimension of the dataframe
dim(cancer.df)
[1] 64 2
This data set records the number of days cancer patients survived after treatment (Survival), grouped by the type of cancer they were suffering from, i.e. the body organ affected (Organ).
# summary statistics of survival time by organ
library(data.table)
dt <- data.table(cancer.df)
dt[, list(Count = .N,
          mean = round(mean(Survival), 3),
          sd = round(sd(Survival), 2),
          median = round(median(Survival), 3),
          min = min(Survival),
          max = max(Survival)),
   by = list(Organ)]
      Organ Count     mean      sd median min  max
1:  Stomach    13  286.000  346.31    124  25 1112
2: Bronchus    17  211.588  209.86    155  20  859
3:    Colon    17  457.412  427.17    372  20 1843
4:    Ovary     6  884.333 1098.58    406  89 2970
5:   Breast    11 1395.909 1238.97   1166  24 3808
# descriptive statistics by each organ
library(psych)
describeBy(Survival, Organ)
Descriptive statistics by group
group: Breast
vars n mean sd median trimmed mad min max range skew
X1 1 11 1395.91 1238.97 1166 1280.33 662.72 24 3808 3784 0.81
kurtosis se
X1 -0.7 373.56
--------------------------------------------------------
group: Bronchus
vars n mean sd median trimmed mad min max range skew kurtosis
X1 1 17 211.59 209.86 155 181.2 133.43 20 859 839 1.75 2.66
se
X1 50.9
--------------------------------------------------------
group: Colon
vars n mean sd median trimmed mad min max range skew
X1 1 17 457.41 427.17 372 394.2 244.63 20 1843 1823 1.96
kurtosis se
X1 3.76 103.6
--------------------------------------------------------
group: Ovary
vars n mean sd median trimmed mad min max range skew
X1 1 6 884.33 1098.58 406 884.33 386.96 89 2970 2881 1.01
kurtosis se
X1 -0.75 448.49
--------------------------------------------------------
group: Stomach
vars n mean sd median trimmed mad min max range skew kurtosis
X1 1 13 286 346.31 124 234.64 121.57 25 1112 1087 1.27 0.25
se
X1 96.05
# number of observations per organ
addmargins(table(Organ))
Organ
Breast Bronchus Colon Ovary Stomach Sum
11 17 17 6 13 64
# box plot of organ
boxplot(Survival ~ Organ, data = cancer.df,
main = "Boxplot of Organ",
xlab = "Organ", ylab = "Survival (Days)")
# mean plot by organ
library(gplots)
plotmeans(Survival ~ Organ, data = cancer.df,
xlab = "Organ", ylab = "Survival (Days)",
digits=2, col = "black", ccol = "blue", barwidth = 2,
legends = TRUE, mean.labels = TRUE, frame = TRUE)
# check for normality in each group
with(cancer.df, tapply(Survival, Organ, shapiro.test))
$Breast
Shapiro-Wilk normality test
data: X[[i]]
W = 0.86857, p-value = 0.07431
$Bronchus
Shapiro-Wilk normality test
data: X[[i]]
W = 0.76596, p-value = 0.0007186
$Colon
Shapiro-Wilk normality test
data: X[[i]]
W = 0.76056, p-value = 0.0006134
$Ovary
Shapiro-Wilk normality test
data: X[[i]]
W = 0.76688, p-value = 0.029
$Stomach
Shapiro-Wilk normality test
data: X[[i]]
W = 0.75473, p-value = 0.002075
Null hypothesis (\( H_0 \)): The data are normally distributed.
We fail to reject the null hypothesis for Breast (p = 0.07431).
We reject the null hypothesis for the other organs, Bronchus, Colon, Ovary and Stomach (all p < 0.05).
In other words, we cannot assume normality in every group.
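The per-group decisions are easier to compare when the Shapiro-Wilk p-values are collected into one table. The following is a minimal sketch on simulated data; `demo.df`, `shapiroP` and the group labels A and B are hypothetical names, not part of the original analysis. With the real data one would pass `cancer.df$Survival` and `cancer.df$Organ` instead.

```r
# Simulated, right-skewed data for illustration only (not the cancer data)
set.seed(1)
demo.df <- data.frame(
  Survival = c(rexp(15, rate = 1 / 300), rexp(15, rate = 1 / 900)),
  Organ    = rep(c("A", "B"), each = 15)
)

# One Shapiro-Wilk p-value per group
shapiroP <- sapply(split(demo.df$Survival, demo.df$Organ),
                   function(x) shapiro.test(x)$p.value)

# Flag the groups where normality is rejected at the 5% level
data.frame(p.value = round(shapiroP, 4), reject = shapiroP < 0.05)
```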
# Check for homogeneity of variance
library(car)
leveneTest(Survival ~ Organ, data = cancer.df)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 4 4.4524 0.003271 **
59
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Null hypothesis (\( H_0 \)): The variance of cancer Survival is homogeneous across different types of cancer (Organ).
We reject the null hypothesis. There is heterogeneity in variance of cancer survival.
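Levene's test with `center = median` amounts to a one-way ANOVA on the absolute deviations of each observation from its group median. A minimal base-R sketch on simulated data makes that explicit; all names here (`demo.df`, `absDev`, `leveneByHand`) are hypothetical and chosen only for illustration.

```r
# Simulated data for illustration only: three groups with increasing spread
set.seed(11)
demo.df <- data.frame(
  y = c(rnorm(12, sd = 1), rnorm(12, sd = 3), rnorm(12, sd = 6)),
  g = factor(rep(c("A", "B", "C"), each = 12))
)

# Absolute deviation of each observation from its own group median
demo.df$absDev <- abs(demo.df$y - ave(demo.df$y, demo.df$g, FUN = median))

# One-way ANOVA on those deviations; the F test in the first row
# plays the role of the Levene statistic
leveneByHand <- anova(lm(absDev ~ g, data = demo.df))
leveneByHand
```

The degrees of freedom follow the same pattern as in the output for the cancer data: number of groups minus one for the group row, and number of observations minus number of groups for the residual row.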
We apply a log transformation to the dependent variable (Survival) and check normality and homogeneity of variance again.
# change Survival to log of Survival
cancer.df$LogSurvival <- log(cancer.df$Survival)
# first few rows of the dataframe
head(cancer.df)
Survival Organ LogSurvival
1 124 Stomach 4.820282
2 42 Stomach 3.737670
3 25 Stomach 3.218876
4 45 Stomach 3.806662
5 412 Stomach 6.021023
6 51 Stomach 3.931826
# check for normality in each group after log transformation
with(cancer.df, tapply(LogSurvival, Organ, shapiro.test))
$Breast
Shapiro-Wilk normality test
data: X[[i]]
W = 0.802, p-value = 0.009995
$Bronchus
Shapiro-Wilk normality test
data: X[[i]]
W = 0.98047, p-value = 0.9613
$Colon
Shapiro-Wilk normality test
data: X[[i]]
W = 0.92636, p-value = 0.1891
$Ovary
Shapiro-Wilk normality test
data: X[[i]]
W = 0.983, p-value = 0.9655
$Stomach
Shapiro-Wilk normality test
data: X[[i]]
W = 0.92837, p-value = 0.3245
# Check for homogeneity of variance after log transformation
library(car)
leveneTest(LogSurvival ~ Organ, data = cancer.df)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 4 0.6685 0.6164
59
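After the log transformation, Levene's test no longer rejects homogeneity of variance (p = 0.6164), and the Shapiro-Wilk test no longer rejects normality for Bronchus, Colon, Ovary and Stomach.
Breast is the exception: normality is still rejected there (W = 0.802, p = 0.009995). Hence the equal-variance assumption is met, and the normality assumption is met in four of the five groups.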
# mean plot of log survival by organ
library(gplots)
plotmeans(LogSurvival ~ Organ, data = cancer.df,
          xlab = "Organ", ylab = "log(Survival in days)",
          digits = 2, col = "black", ccol = "blue", barwidth = 2,
          legends = TRUE, mean.labels = TRUE, frame = TRUE)
# one-way ANOVA on untransformed Survival
oneWayfit <- aov(Survival ~ Organ, data = cancer.df)
# summary of the ANOVA model
summary(oneWayfit)
Df Sum Sq Mean Sq F value Pr(>F)
Organ 4 11535761 2883940 6.433 0.000229 ***
Residuals 59 26448144 448274
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# one-way ANOVA on log-transformed Survival
oneWayTransfit <- aov(LogSurvival ~ Organ, data = cancer.df)
# summary of the ANOVA model
summary(oneWayTransfit)
Df Sum Sq Mean Sq F value Pr(>F)
Organ 4 24.49 6.122 4.286 0.00412 **
Residuals 59 84.27 1.428
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As the p-value (0.00412) is below 0.05, we conclude that mean log survival time differs significantly between at least some of the organ groups.
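The F value in the ANOVA table for LogSurvival is the ratio of the between-group mean square to the residual mean square. Using the rounded sums of squares printed in that table, the arithmetic can be checked by hand; the small difference from the reported 4.286 comes from rounding.

```latex
F = \frac{MS_{\text{Organ}}}{MS_{\text{Residuals}}}
  = \frac{SS_{\text{Organ}} / df_{\text{Organ}}}{SS_{\text{Residuals}} / df_{\text{Residuals}}}
  = \frac{24.49 / 4}{84.27 / 59}
  \approx \frac{6.122}{1.428}
  \approx 4.29
```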
In a one-way ANOVA, a significant p-value indicates that at least some of the group means differ, but it does not tell us which pairs of groups differ.
Multiple pairwise comparisons can be used to determine whether the mean difference between specific pairs of groups is statistically significant.
Hence, we use Tukey's Honestly Significant Difference (HSD) test for the pairwise comparisons.
# Tukey comparison test
TukeyHSD(oneWayTransfit)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = LogSurvival ~ Organ, data = cancer.df)
$Organ
diff lwr upr p adj
Bronchus-Breast -1.60543320 -2.906741 -0.3041254 0.0083352
Colon-Breast -0.80948110 -2.110789 0.4918267 0.4119156
Ovary-Breast -0.40798703 -2.114754 1.2987803 0.9615409
Stomach-Breast -1.59068365 -2.968399 -0.2129685 0.0158132
Colon-Bronchus 0.79595210 -0.357534 1.9494382 0.3072938
Ovary-Bronchus 1.19744617 -0.399483 2.7943753 0.2296079
Stomach-Bronchus 0.01474955 -1.224293 1.2537924 0.9999997
Ovary-Colon 0.40149407 -1.195435 1.9984232 0.9540004
Stomach-Colon -0.78120255 -2.020245 0.4578403 0.3981146
Stomach-Ovary -1.18269662 -2.842480 0.4770864 0.2763506
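In the output above only Bronchus-Breast and Stomach-Breast have an adjusted p-value below 0.05. With many groups it can help to filter the table programmatically. The sketch below does this on simulated data; `demo.df`, `demoFit`, `tukeyTab` and the group labels are hypothetical names used only for illustration.

```r
# Simulated data for illustration only: group C is shifted upwards
set.seed(42)
demo.df <- data.frame(
  y = c(rnorm(20, mean = 0), rnorm(20, mean = 0), rnorm(20, mean = 3)),
  g = factor(rep(c("A", "B", "C"), each = 20))
)
demoFit <- aov(y ~ g, data = demo.df)

# TukeyHSD() returns a list with one matrix per factor in the model
tukeyTab <- TukeyHSD(demoFit)$g

# Keep only the pairs whose adjusted p-value is below 0.05
tukeyTab[tukeyTab[, "p adj"] < 0.05, , drop = FALSE]
```

Because the model above was fitted to log(Survival), each `diff` is a difference of mean logs, so `exp(diff)` can be read as a ratio of geometric mean survival times. For Bronchus-Breast, exp(-1.605) is roughly 0.20.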
# Tukey pair-wise comparisons plot
plot(TukeyHSD(oneWayTransfit))
The classical one-way ANOVA test assumes equal variances in all groups. For log-transformed survival this assumption held: the Levene test was not significant (p = 0.6164). For untransformed Survival it was violated (p = 0.003271).
What if homogeneity of variance is violated?
Welch's one-way test, which does not assume equal variances, can be used when variances are heterogeneous.
# Welch one-way test (variances not assumed equal)
oneway.test(Survival ~ Organ, data = cancer.df)
One-way analysis of means (not assuming equal variances)
data: Survival and Organ
F = 3.5152, num df = 4.000, denom df = 19.862, p-value = 0.02514
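The non-integer denominator degrees of freedom in this output are characteristic of Welch's correction. As a minimal sketch on simulated data (all object names are hypothetical), the two variants of `oneway.test()` can be compared side by side. With `var.equal = TRUE` the function reproduces the classical F value from `aov()`.

```r
# Simulated data for illustration only: group spread grows with the mean
set.seed(7)
demo.df <- data.frame(
  y = c(rnorm(10, mean = 10, sd = 1),
        rnorm(10, mean = 12, sd = 5),
        rnorm(10, mean = 15, sd = 10)),
  g = factor(rep(c("A", "B", "C"), each = 10))
)

# Welch's test: var.equal = FALSE is the default of oneway.test()
welchFit <- oneway.test(y ~ g, data = demo.df)

# Classical F test: var.equal = TRUE matches the aov() F value
classicFit <- oneway.test(y ~ g, data = demo.df, var.equal = TRUE)
aovF <- summary(aov(y ~ g, data = demo.df))[[1]][["F value"]][1]

c(welchF   = unname(welchFit$statistic),
  classicF = unname(classicFit$statistic),
  aovF     = aovF)
```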
# pairwise t-tests with non-pooled SD and BH-adjusted p-values
with(cancer.df,
     pairwise.t.test(Survival, Organ,
                     p.adjust.method = "BH", pool.sd = FALSE))
Pairwise comparisons using t tests with non-pooled SD
data: Survival and Organ
Breast Bronchus Colon Ovary
Bronchus 0.073 - - -
Colon 0.110 0.110 - -
Ovary 0.443 0.349 0.443 -
Stomach 0.073 0.502 0.349 0.349
P value adjustment method: BH
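The BH (Benjamini-Hochberg) adjustment used here controls the false discovery rate and is less conservative than Bonferroni. A small sketch with hypothetical raw p-values (made up for illustration, not taken from the analysis above) shows how `p.adjust()` treats the two methods:

```r
# Hypothetical raw p-values, for illustration only
rawP <- c(0.001, 0.01, 0.02, 0.04, 0.5)

# Benjamini-Hochberg: the i-th smallest p-value is scaled by n / i,
# then made monotone from the largest downwards
bh <- p.adjust(rawP, method = "BH")

# Bonferroni: every p-value is multiplied by n and capped at 1
bonf <- p.adjust(rawP, method = "bonferroni")

data.frame(raw = rawP, BH = round(bh, 4), Bonferroni = bonf)
```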
The Kruskal-Wallis rank sum test is a non-parametric alternative to one-way ANOVA. It can be used when the normality assumption is violated.
# Kruskal-Wallis rank sum test
kruskal.test(Survival ~ Organ, data = cancer.df)
Kruskal-Wallis rank sum test
data: Survival by Organ
Kruskal-Wallis chi-squared = 14.954, df = 4, p-value = 0.004798
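Because the Kruskal-Wallis test works on ranks, a strictly increasing transformation such as the logarithm leaves its result unchanged, which is why running it on Survival alone is sufficient. A minimal sketch on simulated data (object names are hypothetical) illustrates this:

```r
# Simulated, right-skewed data for illustration only
set.seed(3)
demo.df <- data.frame(
  y = rexp(30, rate = 1 / 200),
  g = factor(rep(c("A", "B", "C"), each = 10))
)

# Kruskal-Wallis on the raw and on the log-transformed response
rawKW <- kruskal.test(y ~ g, data = demo.df)
logKW <- kruskal.test(log(y) ~ g, data = demo.df)

# The ranks are identical, so the test statistics agree
all.equal(unname(rawKW$statistic), unname(logKW$statistic))
```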
ANOVA Assumptions Before-After Transformation
Before Log-transformation
# normal Q-Q plot of residuals before log-transformation
plot(oneWayfit, 2)
After Log-transformation
# normal Q-Q plot of residuals after log-transformation
plot(oneWayTransfit, 2)
Before Log-transformation
# extract the residuals
aovResiduals <- residuals(oneWayfit)
# Anderson-Darling normality test
library(nortest)
ad.test(aovResiduals)
Anderson-Darling normality test
data: aovResiduals
A = 4.2752, p-value = 9.621e-11
After Log-transformation
# extract the residuals
aovTransResiduals <- residuals(oneWayTransfit)
# Anderson-Darling normality test
library(nortest)
ad.test(aovTransResiduals)
Anderson-Darling normality test
data: aovTransResiduals
A = 0.53873, p-value = 0.161
Before Log-transformations
# residual versus fitted plot before log-transformation
plot(oneWayfit, 1)
After Log-transformations
# residual versus fitted plot after log-transformation
plot(oneWayTransfit, 1)
Before Log-transformation
# test homogeneity of variance before log-transformation
library(car)
leveneTest(Survival ~ Organ, data = cancer.df)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 4 4.4524 0.003271 **
59
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
After Log-transformation
# test homogeneity of variance after log-transformation
library(car)
leveneTest(LogSurvival ~ Organ, data = cancer.df)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 4 0.6685 0.6164
59