Biostatistics 208: Final Exam

Due: Friday, August 17, 2012

## Settings for RMarkdown http://yihui.name/knitr/options#chunk_options
opts_chunk$set(comment = "", warning = FALSE, message = FALSE, echo = TRUE, 
    tidy = FALSE, fig.width = 8, fig.height = 7)
options(width = 116, scipen = 10)

setwd("~/statistics/bio208/")

Problem #2: Stature as a Prognostic Factor in Children with Cardiomyopathy
a. No hospitalizations versus one or more hospitalizations during the year (binary). How would you compare this outcome between the three groups? Describe the test you would run, summary statistics you would present, and any graphics you would produce. [4 points]

test: A chi-squared test for trend (Cochran-Armitage) can be used, as this is a 3 x 2 table in which the rows correspond to an ordinal exposure (the three stature groups).
summary: The proportion without hospitalization during follow-up in each group would be appropriate.
graphics: A three-bar bar chart of the proportion without hospitalization in each group would be easy to interpret (a code sketch follows).
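
A minimal sketch of how this could be coded, assuming a hypothetical data frame cmp with the three-level stature group as an ordered factor (group) and a 0/1 hospitalization indicator (hosp):

## hypothetical data: cmp with ordered factor group (3 stature groups) and 0/1 hosp
tab.hosp <- xtabs(data = cmp, ~ group + hosp)                    # 3 x 2 table
prop.trend.test(x = tab.hosp[, "1"], n = rowSums(tab.hosp))      # chi-squared test for trend
prop.no.hosp <- prop.table(tab.hosp, margin = 1)[, "0"]          # proportion without hospitalization
barplot(prop.no.hosp, ylab = "Proportion without hospitalization")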

b. Change in weight z-score between 1 year and baseline. How would you compare this outcome between the three groups? Describe the test you would run, summary statistics you would present, and any graphics you would produce. [4 points]

test: One-way ANOVA across the three groups would be appropriate if the distribution of the continuous change in weight z-score is close enough to normal. If the distribution is not normal, the Kruskal-Wallis test would be appropriate. If the overall test is significant, pairwise comparisons using t-tests or Wilcoxon rank sum tests can follow, with an appropriate multiple-testing adjustment of the p-values.
summary: If the changes are normally distributed, means and standard deviations should be reported. If the distribution is not normal, medians and IQRs should be reported.
graphics: In either case, a boxplot by group is the most conventional graphical display for this kind of data (a code sketch follows).
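
A sketch under the same hypothetical cmp data frame, with dwtz standing in for the change in weight z-score (both names are placeholders):

## hypothetical data: cmp with group (3 levels) and dwtz (change in weight z-score)
anova(lm(dwtz ~ group, data = cmp))                              # one-way ANOVA if normal enough
kruskal.test(dwtz ~ group, data = cmp)                           # Kruskal-Wallis otherwise
## pairwise follow-up with Bonferroni-adjusted p-values
with(cmp, pairwise.t.test(x = dwtz, g = group, p.adjust.method = "bonferroni"))
with(cmp, pairwise.wilcox.test(x = dwtz, g = group, p.adjust.method = "bonferroni"))
boxplot(dwtz ~ group, data = cmp, ylab = "Change in weight z-score")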

c. Time to incidence of congestive heart failure. How would you compare this outcome between the three groups? Describe the test you would run, summary statistics you would present, and any graphics you would produce. [4 points]

test: For time-to-event data, the log-rank test is appropriate.
summary: The median time to event with its 95% confidence interval and/or the 1-year event-free rate can be used.
graphics: A Kaplan-Meier plot by group is well suited for this purpose (a code sketch follows).
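
A sketch using hypothetical variable names chf.time (time to congestive heart failure) and chf (event indicator) in the same assumed cmp data frame:

library(survival)
## hypothetical data: cmp with chf.time (time to event) and chf (0/1 event indicator)
res.diff.chf <- survdiff(Surv(chf.time, chf == 1) ~ group, data = cmp)  # log-rank test
res.fit.chf  <- survfit(Surv(chf.time, chf == 1) ~ group, data = cmp)
summary(res.fit.chf)$table                                              # median time to event with 95% CI
plot(res.fit.chf, lty = 1:3)                                            # Kaplan-Meier curves by group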

d. New York Heart Class (4-point ordinal scale) at one year. How would you compare this outcome between the three groups? Describe the test you would run, summary statistics you would present, and any graphics you would produce. [4 points]

test: A Fisher exact test or a general chi-squared test for the 3 x 4 table is possible; a Mantel-Haenszel-type trend test that exploits the ordering of both variables is another possibility.
summary: Reporting the proportion in each NYHA category for each group is preferable. The median NYHA class is another possibility, although it is not statistically ideal.
graphics: A stacked bar chart of the NYHA categories for each group is the easiest to interpret (a code sketch follows).
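
A sketch assuming a hypothetical four-level nyha variable in cmp; the CMHtest call additionally assumes the vcdExtra package is available:

## hypothetical data: cmp with nyha (4-level ordered factor) and group (3 levels)
tab.nyha <- xtabs(data = cmp, ~ nyha + group)                    # 4 x 3 table
fisher.test(tab.nyha)                                            # Fisher exact test
chisq.test(tab.nyha)                                             # general chi-squared test
library(vcdExtra)
CMHtest(tab.nyha)                                                # CMH statistics, including the correlation (trend) component
barplot(prop.table(tab.nyha, margin = 2), legend.text = TRUE)    # stacked proportions of NYHA class per group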

e. Change in weight z-score between 1 year and baseline. This time, you will carry out an analysis separately in each of the three subgroups which make up the cohort (i.e., you are no longer comparing the three subgroups to each other). For each subgroup, how would you decide whether there has been a change in weight z-score? Describe the test you would run, summary statistics you would present, and any graphics you would produce (for each subgroup). [4 points]

test: This is now a within-group pre-post comparison, so either a paired t-test or a Wilcoxon signed rank test should be used, depending on the normality of the changes in z-scores.
summary: The mean and standard deviation, or the median and IQR, depending on the normality.
graphics: A line graph connecting the mean pre value to the mean post value, with vertical bars representing either +/- one standard error or the 95% confidence interval, is most common (a code sketch follows).
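
A sketch for a single subgroup, using hypothetical baseline (wtz0) and 1-year (wtz1) weight z-score variables; the same code would be repeated for each of the three subgroups:

## hypothetical data: cmp with wtz0 (baseline) and wtz1 (1-year) weight z-scores
sub1 <- subset(cmp, group == levels(group)[1])                   # one of the three subgroups
with(sub1, t.test(wtz1, wtz0, paired = TRUE))                    # paired t-test if the changes are normal
with(sub1, wilcox.test(wtz1, wtz0, paired = TRUE))               # Wilcoxon signed rank test otherwise
with(sub1, t.test(wtz1 - wtz0))                                  # equivalent one-sample test on the changes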

Problem #3: Income and Survival Time in Cancer Patients

library(XLConnect)
wb.3 <- loadWorkbook("~/statistics/bio208/survses.xls")
survses <- readWorksheet(wb.3, sheet = 1)
names(survses) <- tolower(names(survses))

library(survival)
survses <- within(survses, {
    surv <- Surv(time, dead == 1)
})

a. Run a test to determine whether there is a relationship between income and survival times. Explain why you chose this test, interpret the results of the test, and present a table showing the clinical magnitude of the impact of income. [10 points]
This is a comparison of survival curves. To compare the entire survival curves appropriately, the log-rank test is the most common choice.

The result of the log-rank test:

library(survival)
res.diff <- survdiff(data = survses, surv ~ income)
res.diff
Call:
survdiff(formula = surv ~ income, data = survses)

n=931, 1163 observations deleted due to missingness.

           N Observed Expected (O-E)^2/E (O-E)^2/V
income=1 252      224    190.8    5.7689     7.659
income=2 390      315    342.2    2.1546     3.853
income=3 200      168    176.6    0.4214     0.546
income=4  89       75     72.4    0.0934     0.103

 Chisq= 8.5  on 3 degrees of freedom, p= 0.0371 

The p-value is 0.0371, indicating a statistically significant difference among the income categories. Compared to the expected numbers of events, the income 1 and 4 groups had more observed deaths than expected, whereas the opposite was true for the income 2 and 3 groups.

Median survival and 95% confidence interval for each group

res.fit <- survfit(data = survses, surv ~ income)
res.tab <- summary(res.fit)$table[,5:7]                 # summary()$table is the table shown by print()
colnames(res.tab) <- c("Median survival", "lower 95% boundary", "upper 95% boundary")
res.tab
         Median survival lower 95% boundary upper 95% boundary
income=1             304                268                385
income=2             387                333                452
income=3             426                349                491
income=4             370                329                422

The median survival time was longest in income group 3, followed by income groups 2, 4, and 1. Both extremes of income status were associated with shorter survival, although the effect was more prominent in the lowest income group.

b. It is possible that IPS could be a confounder of the relationship between income and survival time. Scientifically (i.e., do not do any data analysis for this part), what two conditions must be met in order for IPS to be a confounder? Do you think that IPS would satisfy these two conditions? [4 points]
In order to be a confounder, a variable has to be associated with both the outcome and the exposure in question. Additionally, it must not be a causal intermediate, i.e., it must not lie on the pathway through which the exposure produces the outcome.
In this particular case, IPS indicates the general health of patients, so it is likely to be associated with survival, the outcome. IPS can also be associated with income, as patients in poor general condition are less likely to be able to work. Thus it is a potential confounder. However, a higher income may also buy better health care, which, by improving the patients' general health status, may result in better survival; in that case IPS would partly be a causal intermediate. As this is hard to establish here, meeting the first two conditions is taken as sufficient to treat IPS as a confounder.

c. Use the SURVSES data to examine empirically whether the two scientific conditions in part (b) might be satisfied or not. It is sufficient here to simply report and interpret p-values (no need to present or interpret effect estimates). [4 points]
The associations between IPS and survival, and between IPS and income, should be assessed. The log-rank test for the relationship between IPS and survival gave a significant p-value of 0.00104, so there is an association with the outcome. A chi-squared test of the association between the two categorical variables, income and IPS, was also significant, with a p-value of 0.00354. Thus, both conditions were met, and IPS is a confounder.

survses.subset <- subset(survses, !is.na(income))        # keep records with non-missing income, as in part (a)
res.diff2 <- survdiff(data = survses.subset, surv ~ ips) # log-rank test: IPS vs survival (p = 0.00104)
#res.diff2

tab.ips.income <- xtabs(data = survses.subset, ~ +ips +income)
res.chisq <- chisq.test(tab.ips.income, correct = FALSE) # chi-squared test: IPS vs income (p = 0.00354)

Problem #4: The Class POMS Scores

library(XLConnect)
wb.4 <- loadWorkbook("~/statistics/bio208/prepost12.xls")
prepost12 <- readWorksheet(wb.4, sheet = 1)
names(prepost12) <- tolower(names(prepost12))

## Helper function: per-column summaries (mean, SD, skewness, kurtosis, quartiles) used to assess normality
SeeNormality <-
function(data.frame, na.rm = TRUE) {
    require(e1071)
    result <- lapply(data.frame,
                     function(y) {
                         data.frame(Number      = length(y),
                                    Not.NA      = sum(!is.na(y)),
                                    Mean        = mean(y, na.rm = na.rm),
                                    SD          = sd(y, na.rm = na.rm),
                                    Skewness    = skewness(y, type = 2, na.rm = na.rm),
                                    Kurtosis    = kurtosis(y, type = 2, na.rm = na.rm),
                                    Median      = median(y, na.rm = na.rm),
                                    Quartile_25 = quantile(y, probs = 0.25, na.rm = na.rm),
                                    Quartile_75 = quantile(y, probs = 0.75, na.rm = na.rm)
                                    )
                     })
    do.call(rbind, result)
}

a. I would like to know if there has been a change in Depression (DDEPRESS), Anxiety (DANXIETY) or in Fatigue (DFATIGUE) over the course of July. Be sure to describe what test you used, why, interpret its result, and present a measure of the clinical magnitude of the effect. The sample size for each analysis should be 128. Use the continuous change scores. [6 points]

As the continuous change scores are assessed, a one-sample test against the null hypothesis of \( \Delta = 0 \) should be carried out.

Assessment of normality

library(reshape2)
q4a <- prepost12[,c("ddepress","danxiety","dfatigue")]
prepost12.melt <- melt(q4a)

SeeNormality(q4a)
         Number Not.NA  Mean    SD Skewness Kurtosis Median Quartile_25 Quartile_75
ddepress    167    128 1.234 8.623  -0.8405   5.1520      0       -2.00        5.00
danxiety    167    128 1.102 6.938  -0.2769   1.3745      1       -4.00        5.00
dfatigue    167    128 2.094 5.764   0.1429   0.5802      2       -1.25        5.25

library(ggplot2)
## Density of each change score, one panel per variable
ggplot(data = prepost12.melt) +
    geom_density(aes(x = value, color = variable)) +
    facet_grid(variable ~ .) +
    theme(legend.position = "none")

[Figure: density plots of the ddepress, danxiety, and dfatigue change scores]

For ddepress and danxiety, the kurtosis was high, so these change scores are not normally distributed.

res.ddep <- wilcox.test(prepost12$ddepress)        # one-sample Wilcoxon signed rank test
library(BSDA)
res.ddep.sign <- SIGN.test(prepost12$ddepress)     # one-sample sign test, output shown below for reference

    One-sample Sign-Test

data:  prepost12$ddepress 
s = 63, p-value = 0.2589
alternative hypothesis: true median is not equal to 0 
95 percent confidence interval:
 0 2 
sample estimates:
median of x 
          0 


res.danx <- wilcox.test(prepost12$danxiety)        # one-sample Wilcoxon signed rank test
res.dfat <- t.test(prepost12$dfatigue)             # one-sample t-test

ddepress: A one-sample Wilcoxon signed rank test was used because of the non-normality of the change score. The skewness was within the -1 to 1 range, so the distribution was considered symmetric enough for the signed rank test. The p-value was significant at 0.0193, although the median change score was 0: there were 63 data points > 0 versus 50 data points < 0. Thus, the depression score increased significantly during the follow-up period.
danxiety: A one-sample Wilcoxon signed rank test was used for the same reason. The p-value was significant at 0.049. The median change score was positive at 1 (IQR -4, 5); thus, the anxiety score increased significantly during this period.
dfatigue: A one-sample t-test was used as the distribution was close enough to normal. The p-value was significant at 0.0001. The mean change was positive at 2.09 (SD 5.76), so the fatigue score also increased significantly during this period.

b. Dichotomize the baseline Fatigue scores (i.e., Fatigue ≤ 7 vs Fatigue > 7) and also dichotomize the follow-up Fatigue scores (i.e., Fatigue2 ≤ 7 vs Fatigue2 > 7). Using these dichotomous variables, test whether there has been a change in Fatigue over the course of July. Be sure to describe what test you used, why, interpret its result, and present a measure of the clinical magnitude of the effect. The sample size for this analysis should also be 128. [4 points]

Table

library(plyr)
prepost12 <-
    mutate(prepost12,
           fatigued.pre  = fatigue  > 7,
           fatigued.post = fatigue2 > 7,
           bio = factor(bio)
           )
tab.pre.post <- xtabs(data = prepost12, ~ +fatigued.pre +fatigued.post)[2:1,2:1]
addmargins(tab.pre.post)
            fatigued.post
fatigued.pre TRUE FALSE Sum
       TRUE    41    13  54
       FALSE   39    35  74
       Sum     80    48 128

library(exact2x2)
res.exact.mc <- exact2x2::mcnemar.exact(tab.pre.post)   # exact McNemar test on the discordant pairs

As the comparison concerns a pre-post change in a dichotomous status, a test based on the discordant pairs should be performed, so the McNemar test (exact version) was carried out. The p-value was significant at 0.0004. There were 39 people who newly became fatigued, whereas only 13 people recovered from the fatigued status. Therefore, during this period, there was a significant increase in the number of people who were fatigued.

c. Dichotomize the baseline Depression scores (i.e., Depress ≤ 4 vs Depress > 4 ) and also dichotomize the follow-up Depression scores (i.e., Depress2 ≤ 4 vs Depress2 > 4 ). Using these dichotomous variables, quantify the reliability of the POMS to detect depression by reporting the agreement rate and the Kappa statistic. Briefly describe each measure and how they differ from each other. The sample size for this analysis should also be 128. (Note: To really measure reliability, the POMS should be administered during a time period during which a person’s true depression status is not changing. Although this may not be true for the data you are analyzing, pretend it is true for the sake of the exercise.) [4 points]

Table

library(plyr)
prepost12 <-
    mutate(prepost12,
           depressed.pre  = depress  > 4,
           depressed.post = depress2 > 4
           )
tab.pre.post.2 <- xtabs(data = prepost12, ~ +depressed.pre +depressed.post)[2:1,2:1]
addmargins(tab.pre.post.2)
             depressed.post
depressed.pre TRUE FALSE Sum
        TRUE    41    18  59
        FALSE   25    44  69
        Sum     66    62 128

## 2 x 2 cell counts: a = concordant positive, d = concordant negative
ai <- tab.pre.post.2[1,1]
bi <- tab.pre.post.2[1,2]
ci <- tab.pre.post.2[2,1]
di <- tab.pre.post.2[2,2]
ti <- ai + bi + ci + di

observed.agreement <- (ai + di) / ti                    # proportion of concordant pairs
## chance agreement expected from the marginal proportions
expected.agreement <- ((ai + bi) / ti) * ((ai + ci) / ti) + ((di + bi) / ti) * ((di + ci) / ti)
library(vcd)
res.kappa <- Kappa(tab.pre.post.2)                      # Cohen's kappa
res.kappa.conf <- confint(res.kappa)
discard <- list(res.kappa, res.kappa.conf)              # keep the objects without printing them

The agreement rate is (41 + 44) / 128 = 0.6641. This is the proportion of subjects in whom the status was reproducible, i.e., the test was positive twice or negative twice; it is the observed agreement rate (O). However, some agreement would occur even if the two test results were completely random; this is the expected agreement rate (E), obtained by adding the expected probabilities of the a and d cells, which are calculated from the marginal probabilities under independence. The Kappa statistic, calculated as follows, measures how much the observed agreement exceeds the expected agreement, relative to the maximum possible excess over chance.

\[ Kappa = \frac {O - E} {1 - E} \]

In this case, the expected agreement is 0.4988. Thus, the Kappa statistic is (0.6641 - 0.4988) / (1 - 0.4988) = 0.3298.

For the next three questions, I would like you to compare the follow-up (end of July) POMS scores between 3 groups of students, divided according to the variable BIO (i.e., compare the follow-up POMS scores between students who took Bio 207 versus Bio 208 versus Bio 209).

Definition of grouping: the grouping variable BIO was converted to the factor bio in the data preparation step above (bio = 1 for Bio 207, 2 for Bio 208, 3 for Bio 209).

d. Describe briefly how you would decide whether to use ANOVA or a Kruskal-Wallis test to compare POMS scores between three groups. (You should not be doing any calculations or computing for this part.) [4 points]
The distribution of each score should be assessed within each group. If the skewness is outside the -1 to 1 range or the kurtosis is above 1 in any subgroup, the non-parametric method (Kruskal-Wallis) should be used; otherwise ANOVA is reasonable. A sketch applying this rule follows the chunk below.

library(plyr)
## Per-group normality summaries of the follow-up scores
res.normality <- dlply(prepost12,
                       "bio",
                       function(y) {
                           SeeNormality(y[,c("depress2","anxiety2","fatigue2","confuse2")])
                       })
res.normality  <- lapply(res.normality[1:3], invisible)   # keep the three bio groups as a plain list
res.mean.sd    <- lapply(res.normality, "[", c("Mean", "SD"))
res.median.iqr <- lapply(res.normality, "[", c("Median", "Quartile_25", "Quartile_75"))
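
As a rough check of the rule stated in part d, the per-group summaries in res.normality above can be screened directly (a sketch; TRUE flags a score as non-normal in that group):

## screen each follow-up score: |skewness| > 1 or kurtosis > 1 in a group
flag.non.normal <- sapply(res.normality,
                          function(y) abs(y$Skewness) > 1 | y$Kurtosis > 1)
rownames(flag.non.normal) <- rownames(res.normality[[1]])
flag.non.normal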

e. Use ANOVA (regardless of appropriateness) to compare follow-up depression (DEPRESS2), anxiety (ANXIETY2), fatigue (FATIGUE2), and confusion (CONFUSE2) scores between the three groups of students. Present your results in a table with effect estimates as well as p-values. Summarize your results in a sentence (or two). [4 points]

Table: Mean (sd) for each group and ANOVA p-value

## Loop for anova
res.anova <- sapply(simplify = FALSE,
              c("depress2","anxiety2","fatigue2","confuse2"),
              function(var){
                  formula <- as.formula(paste(var, "~ bio"))
                  res.anova <- anova(lm(formula = formula, data = prepost12))
                  res.anova
              })
## extract ANOVA p-values
anova.p.values <- sapply(res.anova, "[", 1, 5)

## Format to two decimal places
formatit <- function(x, fmt = "%.2f") sprintf(fmt, x)

## Combine mean and sd to mean (sd) format
mean.sd.col.grp <-
    lapply(res.mean.sd,
           function(y) {
               y <- within(y, {
                   Mean.SD <- paste(formatit(Mean), " (", formatit(SD, "%5.2f"), ")", sep = "")
               })
               y[3]
           })

## Combine p-value, rename, and print
mean.sd.p.col.grp        <- cbind(do.call(cbind, mean.sd.col.grp), sprintf("%.3f", anova.p.values))
names(mean.sd.p.col.grp) <- c("bio = 1","bio = 2","bio = 3","p-value")
mean.sd.p.col.grp
               bio = 1       bio = 2       bio = 3 p-value
depress2 11.57 (10.98)  6.96 ( 7.03) 10.33 ( 7.71)   0.033
anxiety2  9.29 ( 8.03)  7.92 ( 6.22) 11.42 ( 6.05)   0.186
fatigue2 10.95 ( 5.70) 10.15 ( 5.48)  9.17 ( 5.15)   0.662
confuse2  6.29 ( 4.61)  3.39 ( 3.67)  4.42 ( 4.36)   0.010

depress2: The score was lowest in group 2, followed by group 3. There was a statistically significant difference among the groups.
anxiety2: The score was lowest in group 2, followed by group 1, although the difference was not statistically significant.
fatigue2: The score was lowest in group 3, followed by group 2, although the difference was not statistically significant.
confuse2: The score was lowest in group 2, followed by group 1. There was a statistically significant difference among the groups.

There were statistically significant differences in the depression score and confusion score among these three groups.

f. Use Kruskal-Wallis tests (regardless of appropriateness) to compare follow-up depression (DEPRESS2), anxiety (ANXIETY2), fatigue (FATIGUE2), and confusion (CONFUSE2) scores between the three groups of students. Present your results in a table with effect estimates as well as p-values. Summarize your results in a sentence (or two). [4 points]
Table: Median [IQR] for each group and Kruskal-Wallis test p-value

## Loop for KW
kw.p.values <- sapply(simplify = FALSE,
              c("depress2","anxiety2","fatigue2","confuse2"),
              function(var){
                  formula <- as.formula(paste(var, "~ bio"))
                  res.kw <- kruskal.test(formula = formula, data = prepost12)
                  res.kw$p.value
              })

## Format to one decimal place
formatit <- function(x, fmt = "%4.1f") sprintf(fmt, x)

## Combine median Q25 Q75 in median [Q25,Q75]
median.iqr.col.grp <-
    lapply(res.median.iqr,
           function(y) {
               y <- within(y, {
                   Median.IQR <- paste(formatit(Median), " [", formatit(Quartile_25,"%.1f"),"," , formatit(Quartile_75), "]", sep = "")
               })
               y[4]
           })

## Combine p-value, rename, and print
median.iqr.p.col.grp <- cbind(do.call(cbind, median.iqr.col.grp), sprintf("%.3f", unlist(kw.p.values)))
names(median.iqr.p.col.grp) <- c("bio = 1","bio = 2","bio = 3","p-value")
median.iqr.p.col.grp
                 bio = 1         bio = 2         bio = 3 p-value
depress2  5.0 [5.0,19.0]  5.0 [1.0,10.0]  5.5 [4.0,17.0]   0.043
anxiety2  6.0 [3.0,14.0]  7.0 [3.0,13.0] 11.0 [9.2,13.2]   0.240
fatigue2 10.0 [8.0,13.0]  9.0 [6.0,15.0]  8.0 [6.0,13.5]   0.691
confuse2  5.0 [3.0, 8.0]  3.0 [0.0, 5.0]  5.5 [0.5, 7.2]   0.018

depress2: There was a statistically significant difference among the three groups. The median was highest in group 3, followed by groups 1 and 2 (tie).
anxiety2: There was NO statistically significant difference among the three groups. The median was highest in group 3, followed by group 2.
fatigue2: There was NO statistically significant difference among the three groups. The median was highest in group 1, followed by group 2.
confuse2: There was a statistically significant difference among the three groups. The median was highest in group 3, followed by group 1.

There were statistically significant differences among the three groups for the depression score and confusion score. In both cases group 3 (BIO 209 students) had the highest score, although which specific groups differ was not confirmed by pairwise comparisons (a sketch of how such comparisons could be run follows).
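
If one wanted to follow up the two significant Kruskal-Wallis results, pairwise Wilcoxon rank sum tests with a multiplicity adjustment would be one option (a sketch, not part of the original analysis):

## pairwise Wilcoxon rank sum tests with Bonferroni adjustment (illustrative only)
with(prepost12, pairwise.wilcox.test(x = depress2, g = bio, p.adjust.method = "bonferroni"))
with(prepost12, pairwise.wilcox.test(x = confuse2, g = bio, p.adjust.method = "bonferroni"))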

g. Use analysis of variance to test whether follow-up Anger scores (ANGER2) differ between the three groups of students. Pursue pair-wise testing if appropriate using the Bonferroni method and the Student-Newman-Keul method. Interpret the results, paying particular attention to the difference between the results from the Bonferroni versus SNK procedures. [4 points]

The mean anger2 score was highest in group 3, followed by group 1.

with(prepost12, by(anger2, bio, summary))
bio: 1
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    3.00    7.00    7.71   10.00   32.00 
--------------------------------------------------------------------------------------- 
bio: 2
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    0.0     2.0     5.0     6.6     9.0    29.0 
--------------------------------------------------------------------------------------- 
bio: 3
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    7.25   12.50   12.30   15.80   30.00 

ANOVA gave a p-value of 0.023, a significant result. Therefore, pairwise t-tests with the Bonferroni correction and the SNK procedure were both carried out.

anova(lm(anger2 ~ bio, data = prepost12))
Analysis of Variance Table

Response: anger2
           Df Sum Sq Mean Sq F value Pr(>F)  
bio         2    353   176.4    3.91  0.023 *
Residuals 123   5553    45.1                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

With the Bonferroni correction, the cut-off for the unadjusted p-values is 0.05/3 = 0.0167 (equivalently, the Bonferroni-adjusted p-values are compared to 0.05). Among the pairwise comparisons, only the comparison between groups 2 and 3 was significant.

list(
        unadjusted.p = with(prepost12, pairwise.t.test(x = anger2, g = bio, p.adjust = "none"))$p.value,
        adjusted.p = with(prepost12, pairwise.t.test(x = anger2, g = bio, p.adjust = "bon"))$p.value
)
$unadjusted.p
        1        2
2 0.49459       NA
3 0.05982 0.006277

$adjusted.p
       1       2
2 1.0000      NA
3 0.1795 0.01883

In the SNK procedure, in addition to the differences between groups 2 and 3, the difference between groups 1 and 3 was also significant.

library(mutoss)
res.snk <- snk(anger2 ~ bio, data = prepost12, alpha = 0.05, silent = TRUE)
res.snk$SNK
  Comparison  Diff Statistic  Adj.P Rejected Layer
1        3-2 5.731    3.9326 0.0171     TRUE     1
2        1-2 1.112    0.9688 0.4946    FALSE     2
3        3-1 4.619    2.6865 0.0598    FALSE     2

In conclusion, group 3 (BIO 209 students) had a significantly higher anger2 score than group 2 (BIO 208 students) with both methods. With the SNK procedure, group 3 (BIO 209 students) also had a significantly higher anger2 score than group 1 (BIO 207 students). This likely reflects the additional power of the SNK procedure compared with the most conservative method, the Bonferroni correction.

All computations, except for question 4-g, were carried out with R 2.15.1 (additional packages: XLConnect, survival, reshape2, ggplot2, plyr, BSDA, exact2x2, vcd, mutoss, knitr, markdown). The pairwise comparisons for question 4-g were performed with SAS 9.3's ANOVA procedure.