GSMS Basic Medical Statistics

Question 1

A study was made of all 26 astronauts on the first eight space shuttle flights (Bungo et al., 1985). On a voluntary basis, 17 astronauts consumed large quantities of salt and fluid prior to landing as a countermeasure to space de-conditioning, while nine did not. The table below shows supine heart rates (beats/minute) before and after flights in the space shuttle. Please use the data in file ex5_1.sav.

Compare the pre- and post-flight measurements in the countermeasure group using both a parametric and a non-parametric approach. Formulate your null hypothesis and alternative. Which analysis is preferable, and why?

ANSWER: Before starting, let’s run some simple exploratory analysis to get a feel for the data.

library(tidyverse)
library(haven)
library(modelsummary)

df <- read_sav("data/ex5_1.sav") |> 
  mutate(across(where(is.double), as.double)) |> 
  mutate(
    counter = fct_recode(
      factor(counter), 
      countermeasure = "0", 
      no_countermeasure = "1")
    )

df |> 
  filter(counter == "countermeasure") |> 
  select(-counter) |> 
  datasummary_skim(title = "Summary with countermeasure")

Summary with countermeasure
	Unique (#)	Mean	SD	Min	Median	Max
pre	13	56.9	7.3	48.0	55.0	71.0
post	14	63.8	8.9	47.0	65.0	77.0
change	16	6.9	10.7	−10.0	6.0	29.0

df |> filter(counter == "no_countermeasure") |> 
  select(-counter) |> 
  datasummary_skim(title = "Summary without countermeasure")

Summary without countermeasure
	Unique (#)	Mean	SD	Min	Median	Max
pre	6	57.2	8.4	52.0	54.0	78.0
post	7	74.7	13.0	61.0	77.0	103.0
change	8	17.4	10.1	0.0	24.0	27.0

df |> 
  ggplot(aes(y = change, x = counter)) +
  geom_violin() +
  labs(x = NULL) +
  theme_minimal()

The figures above help us evaluate the distributions of our data and make an initial comparison between the two groups. We can now turn to the analysis itself. Let’s begin with the parametric approach.

library(broom)
# parametric approach
df_cm <- df |> 
  filter(counter == "countermeasure")  

t.test(df_cm$change) |> 
  tidy()

## # A tibble: 1 × 8
##   estimate statistic p.value parameter conf.low conf.high method     alternative
##      <dbl>     <dbl>   <dbl>     <dbl>    <dbl>     <dbl> <chr>      <chr>      
## 1     6.88      2.65  0.0174        16     1.38      12.4 One Sampl… two.sided

For the parametric approach, with \(\delta\) the mean change (post minus pre) in the population: \[ H_0: \delta = 0 \\ H_1: \delta \ne 0 \]

Using a paired t-test (which is identical to a one-sample t-test of the differences), we get a p-value of 0.017; the 95% CI for the mean difference is \([1.38; 12.4]\).

Conclusion: reject H0 that the mean change equals zero.
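The equivalence between the paired t-test and the one-sample t-test on the differences can be checked directly. The sketch below uses made-up heart rates (an assumption for illustration, not the values in ex5_1.sav) and base R only.

```r
# Hypothetical pre/post heart rates (NOT the values in ex5_1.sav)
pre  <- c(52, 60, 55, 48, 71, 58, 63, 50)
post <- c(61, 66, 54, 59, 77, 65, 62, 58)

paired     <- t.test(post, pre, paired = TRUE)  # paired t-test
one_sample <- t.test(post - pre, mu = 0)        # one-sample t-test on the differences

# Both calls give the same t statistic, degrees of freedom and p-value
c(paired = unname(paired$statistic), one_sample = unname(one_sample$statistic))
c(paired = paired$p.value, one_sample = one_sample$p.value)
```

This is why the analysis above can simply pass the change column to t.test().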

Let’s now do some non-parametric testing. Here the null hypothesis is that the differences between the post- and pre-flight measurements are symmetrically distributed around zero. Let’s run the Wilcoxon signed rank test.

# non-parametric approach
wilcox.test(df_cm$change, correct = FALSE, exact = FALSE) |> 
  tidy()

## # A tibble: 1 × 4
##   statistic p.value method                    alternative
##       <dbl>   <dbl> <chr>                     <chr>      
## 1       111  0.0261 Wilcoxon signed rank test two.sided

ANSWER: The p-value is 0.026, so we again reject the null hypothesis.
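To see what the statistic column of the signed rank test contains, here is a minimal sketch that computes it by hand. The differences are made up for illustration (chosen without ties or zeros) and are not the ex5_1.sav data.

```r
# Hypothetical differences without ties or zeros (NOT the ex5_1.sav data)
d <- c(9, 6, -1, 11, -3, 7, 14, 8)

r <- rank(abs(d))      # rank the absolute differences
V <- sum(r[d > 0])     # sum the ranks that belong to positive differences

res <- wilcox.test(d)  # signed rank test on the same differences
c(by_hand = V, wilcox = unname(res$statistic))
```

Under the null hypothesis, positive and negative differences should contribute similar rank sums, so a V far from the middle of its range counts as evidence against H0.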

In the light of the answer to (a), perform a suitable analysis to compare the changes in heart rate in the two groups. Again, first formulate your null hypothesis and alternative. What conclusions can be drawn about the effectiveness of the countermeasure?

ANSWER: The first group seems rather normal, the second less so, so we use a non-parametric test.

# non-parametric approach
wilcox.test(df$change ~ df$counter,  
            exact = FALSE, 
            correct = FALSE) |> 
  tidy()

## # A tibble: 1 × 4
##   statistic p.value method                 alternative
##       <dbl>   <dbl> <chr>                  <chr>      
## 1      36.5  0.0309 Wilcoxon rank sum test two.sided

Our p-value is 0.031, the same as the asymptotic significance reported by SPSS. SPSS also reports the result of an exact test, but R complains when we try to set the exact argument to TRUE, because the data contain ties. In any case, since P < 0.05 we can reject the null hypothesis. There is a tendency toward larger changes in the group that did not take countermeasures.
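The statistic reported by the rank sum test can also be reproduced by hand. The sketch below uses made-up changes for two groups (an assumption for illustration, without ties, not the ex5_1.sav data).

```r
# Hypothetical changes in two groups, no ties (NOT the ex5_1.sav data)
g1 <- c(6, -2, 9, 3, 12)
g2 <- c(17, 24, 10, 26, 5, 21)

r  <- rank(c(g1, g2))  # ranks in the pooled sample
n1 <- length(g1)

# Rank sum of the first group minus its smallest possible value
W <- sum(r[seq_len(n1)]) - n1 * (n1 + 1) / 2

res <- wilcox.test(g1, g2)
c(by_hand = W, wilcox = unname(res$statistic))
```

W equals the number of pairs in which a value from the first group exceeds a value from the second; with tied values, pairs count for one half, which explains non-integer statistics such as 36.5.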

Two astronauts each flew on two missions and are thus represented twice in the data. Why does this matter?

ANSWER: It is incorrect to analyze multiple observations on the same individuals as if they were from different people. The assumption of independence is not satisfied. Information coming from the same source is treated as if it came from independent sources, which means that we are overestimating our degrees of freedom and are therefore more likely to make type I errors.
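A small simulation can illustrate the inflation of the type I error. The setup below is a deliberately exaggerated assumption (every subject is measured twice and the two measurements are strongly correlated), not a model of the shuttle data, where only two astronauts were repeated.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Hypothetical setup: 10 subjects, true mean change 0, each measured twice.
# Both measurements share a subject effect, so they are correlated.
reject_rate <- function(naive, n_sim = 2000, n_subj = 10) {
  rejections <- replicate(n_sim, {
    subj <- rnorm(n_subj)                   # subject-level effect
    y1   <- subj + rnorm(n_subj, sd = 0.5)  # first measurement
    y2   <- subj + rnorm(n_subj, sd = 0.5)  # second measurement
    # naive: treat all 20 values as independent; otherwise use one mean per subject
    x <- if (naive) c(y1, y2) else (y1 + y2) / 2
    t.test(x)$p.value < 0.05
  })
  mean(rejections)
}

naive_rate   <- reject_rate(naive = TRUE)
correct_rate <- reject_rate(naive = FALSE)
c(naive = naive_rate, per_subject = correct_rate)
```

The per-subject analysis should reject a true null hypothesis in roughly 5% of the simulations, while the naive analysis rejects more often.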

Comment on the voluntary aspect of the study, and how it might affect the interpretation of the results.

ANSWER: In clinical research it is highly undesirable to let subjects choose their own treatments. No clinical trial conducted in this way would have credibility. Ideally (from the research point of view) the astronauts should have been randomized to receive the dietary countermeasure, but this was not set up as a prospective study.

Question 2

The table below shows the concentration of antibody to Type III Group B Streptococcus (GBS) in 20 volunteers before and after immunization (Baker et al., 1980). See also data file ex5_2.sav.

	Unique (#)	Mean	SD	Min	Median	Max
subject	20	10.5	5.9	1.0	10.5	20.0
before	9	0.7	0.4	0.4	0.6	2.0
after	12	1.9	3.0	0.4	0.8	12.2
change	11	1.2	2.9	−0.1	0.1	11.6

Formulate the H0 for the comparison of the antibody levels before and after immunization.

ANSWER: H0: mean concentration before immunization equals the mean concentration after immunization in the population; H1: H0 not true

The comparison was summarized in the report of this study as t = 1.8; P-value > 0.05. Comment on this result. What method would be more appropriate, and why?

ANSWER: The paired t-test gives P = 0.08. However, the test assumes that the differences are approximately normally distributed, which is clearly not the case here. A log transformation does not solve the problem; the distribution is still skewed. A non-parametric approach is therefore recommended. The Wilcoxon signed rank test assumes a symmetric distribution of the differences, which is also not the case here, so the sign test can be used instead.
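Base R has no dedicated sign test function, but the test reduces to a binomial test on the number of positive differences. The sketch below is one way to write it; sign_test() is a hypothetical helper and the changes are made up (not the values in ex5_2.sav).

```r
# Sign test: under H0 positive and negative changes are equally likely,
# so the number of positive changes is Binomial(n, 0.5). Zeros are dropped.
sign_test <- function(d) {
  d <- d[d != 0]
  binom.test(sum(d > 0), length(d), p = 0.5)
}

# Hypothetical changes (NOT the values in ex5_2.sav)
d <- c(0.1, 0.3, -0.1, 2.4, 0.2, 0, 11.6, 0.5, 0.1, 0.4)

res <- sign_test(d)
res$p.value
```

Because only the signs are used, the single very large change has no more influence than any other positive change, which is what makes the test suitable for strongly skewed differences.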

Analyze the data with the method you mentioned in (b); what are your conclusions?

wilcox.test(df$change, correct = TRUE, exact = FALSE) |> 
  tidy()

## # A tibble: 1 × 4
##   statistic p.value method                                           alternative
##       <dbl>   <dbl> <chr>                                            <chr>      
## 1        76 0.00412 Wilcoxon signed rank test with continuity corre… two.sided

ANSWER: As before, R complains when we try to run an exact test. The asymptotic test yields a p-value of 0.00412. SPSS calculates a P-value = 0.006.

In both cases, we reject the null hypothesis that the data before immunization come from the same population as the data after immunization. After immunization, people appear to have higher concentrations than before.

Question 3

Forty patients receiving chemotherapy as outpatients were randomized to receive either an active antiemetic treatment (n=20) or placebo (n=20) (Williams et al., 1989). The following table shows measurements (in mm) on a 100 mm linear analogue self-assessment scale for nausea. See also data file ex5_3.sav.

Formulate your H0 and alternative to compare the values in both groups.

ANSWER: H0: mean nausea is the same for both groups in the population; H1: H0 not true

Which analysis is appropriate to test this null hypothesis, and why?

ANSWER: The data do not meet the distributional assumptions for parametric methods based on the t-distribution. The groups can be compared using the non-parametric Mann-Whitney test, which tests the null hypothesis that the data in both groups come from the same distribution.

Perform the test. What are your conclusions?

Let’s run both tests, for fun:

t.test(df$nausea ~ df$group) |> 
  tidy()

## # A tibble: 1 × 10
##   estimate estimate1 estimate2 statistic  p.value parameter conf.low conf.high
##      <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>    <dbl>     <dbl>
## 1    -30.2      17.0      47.1     -4.00 0.000328      33.7    -45.5     -14.8
## # ℹ 2 more variables: method <chr>, alternative <chr>

wilcox.test(df$nausea ~ df$group, 
            correct = TRUE, 
            exact = FALSE)  |> 
  tidy()

## # A tibble: 1 × 4
##   statistic p.value method                                           alternative
##       <dbl>   <dbl> <chr>                                            <chr>      
## 1      78.5 0.00104 Wilcoxon rank sum test with continuity correcti… two.sided

ANSWER: Both tests show that the result is significant; the non-parametric test gives P = 0.001 (the same as in the SPSS output). We can reject our null hypothesis: the placebo group tends to have higher nausea scores.

Question 4

Patients with chronic renal failure undergoing haemodialysis were divided into groups with low or normal plasma heparin cofactor II (HC II) levels (Toulon et al., 1987). Five months later the acute effects of haemodialysis were studied by analyzing plasma samples taken before and after haemodialysis. As dialysis increases total protein concentration in plasma, the ratio of HC II to protein was calculated, with results shown in the following table:

These data are also in the data file ex5_4.sav. The aim was to compare the change in both groups. The data were analyzed by separate paired Wilcoxon tests on the data for each group, giving P-value < 0.01 for group 1 and P-value > 0.05 for group 2.

	Unique (#)	Mean	SD	Min	Median	Max
before	11	1.0	0.3	0.7	1.0	1.4
after	12	1.1	0.3	0.7	1.0	1.5
change	12	0.1	0.1	−0.1	0.1	0.2

	Unique (#)	Mean	SD	Min	Median	Max
before	12	1.5	0.3	1.2	1.5	2.1
after	11	1.6	0.3	1.2	1.5	2.1
change	11	0.1	0.1	−0.1	0.0	0.4

Why is it wrong to conclude, as the authors did, that HC II activity increased in group 1 but not in group 2?

Let’s run the tests, shall we?

df_low  <- df |> filter(group == "low")
df_normal  <- df |> filter(group == "normal")


wilcox.test(df_low$change, exact = FALSE)  |> 
  tidy() # p = 0.01347

## # A tibble: 1 × 4
##   statistic p.value method                                           alternative
##       <dbl>   <dbl> <chr>                                            <chr>      
## 1        71  0.0135 Wilcoxon signed rank test with continuity corre… two.sided

wilcox.test(df_normal$change, exact = FALSE) |> 
  tidy() # p = 0.09084

## # A tibble: 1 × 4
##   statistic p.value method                                           alternative
##       <dbl>   <dbl> <chr>                                            <chr>      
## 1      52.5  0.0908 Wilcoxon signed rank test with continuity corre… two.sided

ANSWER: The p-values from the paired Wilcoxon tests for groups 1 and 2 are 0.0135 and 0.091 (in SPSS 0.01 and 0.08). One p-value falling below 0.05 and the other above it does not show that the groups differ: the p-values are not far apart and, moreover, the mean changes in the two groups are almost the same. The correct way to compare the groups is to test the difference between the changes in the two groups directly, using the two-sample t-test or the Mann-Whitney test.

Carry out a better analysis to compare the change in both groups: formulate first your H0 and alternative. What are your conclusions?

car::leveneTest(df$change ~ df$group) |> 
  tidy() # no significant difference, choose equal variance

## # A tibble: 1 × 4
##   statistic p.value    df df.residual
##       <dbl>   <dbl> <int>       <int>
## 1      1.53   0.229     1          22

t.test(df$change ~ df$group, var.equal = TRUE) |> 
  tidy()

## # A tibble: 1 × 10
##   estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
##      <dbl>     <dbl>     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>     <dbl>
## 1  0.00667    0.0775    0.0708     0.156   0.878        22  -0.0820    0.0954
## # ℹ 2 more variables: method <chr>, alternative <chr>

# p = 0.878


wilcox.test(df$change ~ df$group, exact = FALSE) |> 
  tidy()

## # A tibble: 1 × 4
##   statistic p.value method                                           alternative
##       <dbl>   <dbl> <chr>                                            <chr>      
## 1      85.5   0.453 Wilcoxon rank sum test with continuity correcti… two.sided

# p = 0.453

ANSWER: As before, H0: the mean change is the same in both groups, \(\delta_1 - \delta_2 = 0\); H1: \(\delta_1 - \delta_2 \ne 0\). From Levene’s test we cannot reject the hypothesis that the variances are equal (R reports a p-value of 0.23, SPSS reports a p-value of 0.14), and since there is no indication that the data in either group deviate from a normal distribution, the two-sample t-test can be used.

The t-test gives a p-value of 0.88 (the same in R and SPSS), so we do not reject the null hypothesis: there is no significant difference. The non-parametric Wilcoxon rank sum test gives a p-value of 0.453, which does not change our decision.

Question 5

A group of 20 patients in remission from Hodgkin’s disease and a group of 20 patients in remission from other diverse, disseminated malignancies (called the non-Hodgkin’s disease group) were compared with respect to the number of T4 cells per mm³ in their blood. The counts are given in the table below. See also data file ex5_5.sav.

Which method is more appropriate for testing the null hypothesis that the counts are the same in both groups: the parametric or the non-parametric approach? Justify your answer, and perform the analysis.

t.test(df$t4 ~ df$group) |>
  tidy()

## # A tibble: 1 × 10
##   estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
##      <dbl>     <dbl>     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>     <dbl>
## 1    -301.      522.      823.     -2.11  0.0436      28.5    -593.     -9.29
## # ℹ 2 more variables: method <chr>, alternative <chr>

wilcox.test(df$t4 ~ df$group, exact = FALSE) |> 
  tidy()

## # A tibble: 1 × 4
##   statistic p.value method                                           alternative
##       <dbl>   <dbl> <chr>                                            <chr>      
## 1       135  0.0810 Wilcoxon rank sum test with continuity correcti… two.sided

ANSWER: We choose the non-parametric approach, since the data are rather skewed. The Mann-Whitney test gives a P-value of 0.08, so there is no significant evidence that the two groups differ with respect to the T4 counts.

Try a logarithmic transformation (ln), and check whether a parametric approach is appropriate on these transformed data. Perform the parametric test. What is your conclusion?

df |> 
  ggplot(aes(x = group, y = t4)) + 
  geom_violin() + 
  scale_y_log10() + 
  theme_minimal()

car::leveneTest(log(df$t4) ~ df$group) |> 
  tidy() # 0.320 --> cannot reject same variance

## # A tibble: 1 × 4
##   statistic p.value    df df.residual
##       <dbl>   <dbl> <int>       <int>
## 1      1.02   0.320     1          38

t.test(log(df$t4) ~ df$group, var.equal = TRUE)  |> 
  tidy() # p = 0.0682

## # A tibble: 1 × 10
##   estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
##      <dbl>     <dbl>     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>     <dbl>
## 1   -0.398      6.09      6.49     -1.88  0.0682        38   -0.828    0.0313
## # ℹ 2 more variables: method <chr>, alternative <chr>

wilcox.test(log(df$t4) ~ df$group, exact = FALSE)  |> 
  tidy() # p = 0.0810

## # A tibble: 1 × 4
##   statistic p.value method                                           alternative
##       <dbl>   <dbl> <chr>                                            <chr>      
## 1       135  0.0810 Wilcoxon rank sum test with continuity correcti… two.sided

ANSWER: After the transformation, the data are less skewed and, more importantly, the variances can be assumed equal (based on Levene’s test). The two-sample t-test gives a P-value of 0.07, which is very close to the non-parametric result. The null hypothesis that the mean log T4 counts are the same in both groups cannot be rejected.

Make a 95% confidence interval on the transformed data to compare both groups, and interpret the results. What is your conclusion?

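# lower CI bound, point estimate and upper CI bound on the log scale
mns <- c(
  -0.82805477,
  mean(c(-0.82805477, 0.03147101)), # = -0.398
  0.03147101
)
exp(mns) # back-transform to the ratio scale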

## [1] 0.4368983 0.6714660 1.0319715

# 0.44; 0.67, 1.03

ANSWER: On the log scale the mean difference is -0.398, with a 95% CI of [-0.828; 0.031]. After back-transformation, the estimated ratio of the geometric means in the two groups is exp(-0.398) = 0.67, with a 95% CI of [0.44; 1.03].

Interpretation: we are 95% confident that the ratio of the two geometric means lies within this interval. Since the interval contains 1, the difference is not significant at the 5% level.
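The reason the back-transformed difference is a ratio of geometric means can be verified numerically. The sketch below uses made-up counts (not the values in ex5_5.sav); geo_mean() is a hypothetical helper.

```r
# Hypothetical T4 counts in two groups (NOT the values in ex5_5.sav)
a <- c(120, 340, 560, 210, 900, 430)
b <- c(300, 800, 1500, 650, 420, 1100)

geo_mean <- function(x) exp(mean(log(x)))  # geometric mean

tt <- t.test(log(a), log(b), var.equal = TRUE)

ratio    <- exp(mean(log(a)) - mean(log(b)))  # back-transformed difference of log means
ratio_ci <- as.numeric(exp(tt$conf.int))      # back-transformed 95% CI

c(ratio = ratio, geo_ratio = geo_mean(a) / geo_mean(b))
ratio_ci
```

Since exp(x - y) = exp(x) / exp(y), a difference of 0 on the log scale corresponds to a ratio of 1 on the original scale.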

Question 6

Twenty-two patients undergoing cardiac bypass surgery were randomized to one of three ventilation groups: Group I received a 50% nitrous oxide and 50% oxygen mixture continuously for 24 hours; Group II received a 50% nitrous oxide and 50% oxygen mixture only during the operation; Group III received no nitrous oxide and 35-50% oxygen mixture continuously for 24 hours. In data file ex5_6.sav, the red cell folate levels for the three groups after 24 hours ventilation are given. The aim of this study was to compare the three groups and test whether they have the same red cell folate levels.

Make a scatter plot of the data (group on the horizontal axis). Based on this plot, what are your first conclusions with respect to the means and variances of the different groups?

ANSWER: The groups seem to differ with respect to their means, but also with respect to their variation (the ranges have different lengths).

Perform a one way ANOVA, and interpret the results. Are the conditions satisfied?

aov(rcfl ~ group, data = df)  |> 
  summary()

##             Df Sum Sq Mean Sq F value Pr(>F)  
## group        2  15516    7758   3.711 0.0436 *
## Residuals   19  39716    2090                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ANSWER: The ANOVA gives a significant result (P = 0.044), which means that the null hypothesis of equal means in the three groups is rejected. However, according to the test of homogeneity of variances, the hypothesis of equal variances must also be rejected, so the conditions for the ANOVA are not satisfied. A transformation must be considered.

Try a log transformation on the data, and perform again a one-way ANOVA. Are now the assumptions satisfied?

ANSWER: The test of equal variances for the transformed data shows that the assumption of equal variances can be made. The ANOVA gives a P-value of 0.049, which is on the boundary of significance. There is an indication that at least one pair of groups has different means.
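The code for this step is not shown, so here is a minimal sketch of how it could look. The data frame df_demo is made up (not the values in ex5_6.sav), the column names rcfl and group follow the aov() call used earlier, and the variance check is a hand-rolled Levene-type test (a one-way ANOVA on absolute deviations from the group medians) rather than a call to car::leveneTest.

```r
# Hypothetical red cell folate data (NOT the values in ex5_6.sav)
df_demo <- data.frame(
  group = factor(rep(c("g_I", "g_II", "g_III"), each = 4)),
  rcfl  = c(250, 300, 360, 400,  200, 230, 260, 310,  240, 270, 290, 330)
)
df_demo$log_rcfl <- log(df_demo$rcfl)

# Levene-type check: ANOVA on absolute deviations from each group's median
df_demo$abs_dev <- abs(df_demo$log_rcfl -
                       ave(df_demo$log_rcfl, df_demo$group, FUN = median))
levene_p <- summary(aov(abs_dev ~ group, data = df_demo))[[1]][["Pr(>F)"]][1]

# One-way ANOVA on the log scale
anova_p <- summary(aov(log_rcfl ~ group, data = df_demo))[[1]][["Pr(>F)"]][1]

c(levene = levene_p, anova = anova_p)
```

A large Levene p-value gives no reason to doubt equal variances on the log scale, in which case the ANOVA p-value can be interpreted in the usual way.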

Which means differ, in your view? Why?

## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  df$rcfl and df$group 
## 
##       g_I   g_II 
## g_II  0.042 -    
## g_III 0.464 1.000
## 
## P value adjustment method: bonferroni

ANSWER: From the pairwise comparisons with Bonferroni correction, groups I and II appear to differ with respect to their means, but only on the boundary of significance (P-value = 0.047; the R output above shows 0.042).
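The output above has the form produced by pairwise.t.test() with the Bonferroni adjustment. As a sketch of what that adjustment does, the example below uses made-up data (df_demo is hypothetical, not ex5_6.sav) and reproduces the adjusted p-values by hand.

```r
# Hypothetical red cell folate data (NOT the values in ex5_6.sav)
df_demo <- data.frame(
  group = factor(rep(c("g_I", "g_II", "g_III"), each = 4)),
  rcfl  = c(250, 300, 360, 400,  200, 230, 260, 310,  240, 270, 290, 330)
)

unadj <- pairwise.t.test(df_demo$rcfl, df_demo$group,
                         p.adjust.method = "none")$p.value
bonf  <- pairwise.t.test(df_demo$rcfl, df_demo$group,
                         p.adjust.method = "bonferroni")$p.value

# Bonferroni: multiply each p-value by the number of comparisons (3 here), cap at 1
by_hand <- pmin(3 * unadj, 1)

bonf
by_hand
```

The cap at 1 explains why an adjusted p-value of exactly 1.000 can appear in such a table.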

Try a non-parametric approach on these data. What are now your conclusions?

## 
##  Kruskal-Wallis rank sum test
## 
## data:  rcfl by group
## Kruskal-Wallis chi-squared = 4.1852, df = 2, p-value = 0.1234

ANSWER: The Kruskal-Wallis test gives P-value = 0.123: there are no significant differences between the three groups.
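The output above has the form produced by kruskal.test() with a formula. The sketch below, on made-up data (df_demo is hypothetical, not ex5_6.sav), also shows that the test gives the same result on the raw and on the log scale, because it only uses ranks.

```r
# Hypothetical red cell folate data (NOT the values in ex5_6.sav)
df_demo <- data.frame(
  group = factor(rep(c("g_I", "g_II", "g_III"), each = 4)),
  rcfl  = c(250, 300, 360, 400,  200, 230, 260, 310,  240, 270, 290, 330)
)

kw_raw <- kruskal.test(rcfl ~ group, data = df_demo)
kw_log <- kruskal.test(log(rcfl) ~ group, data = df_demo)

# A monotone transformation leaves the ranks, and hence the test, unchanged
c(raw = kw_raw$p.value, log = kw_log$p.value)
```

This is also why no transformation needs to be chosen before running a rank-based test.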

In the light of the previous answers, what would your conclusion be about the differences between the groups?

ANSWER: In a paper, you should be very cautious about concluding that there are differences between the three groups. You might stay on the safe side and publish the non-parametric results.