Principles of Biostatistics Chap 13: Nonparametric Methods

References

Non-parametric methods

Ref. Regression Methods in Biostatistics

Sign test

## Load foreign package
library(foreign)

## resting energy expenditure: CF subjects vs healthy subjects
ree <- read.dta("~/statistics/Pagano_Gauvreau/Stata/chap13/resting energy expenditure.dta")
ree
   pair   cf healthy
1     1 1153     996
2     2 1132    1080
3     3 1165    1182
4     4 1460    1452
5     5 1634    1162
6     6 1493    1619
7     7 1358    1140
8     8 1453    1123
9     9 1185    1113
10   10 1824    1463
11   11 1793    1632
12   12 1930    1614
13   13 2075    1836

## Add difference variable
ree$diff <- with(ree, cf - healthy)
ree
   pair   cf healthy diff
1     1 1153     996  157
2     2 1132    1080   52
3     3 1165    1182  -17
4     4 1460    1452    8
5     5 1634    1162  472
6     6 1493    1619 -126
7     7 1358    1140  218
8     8 1453    1123  330
9     9 1185    1113   72
10   10 1824    1463  361
11   11 1793    1632  161
12   12 1930    1614  316
13   13 2075    1836  239

## Density plot for difference
library(lattice)
densityplot(ree$diff)

plot of chunk unnamed-chunk-2


## 11 positive signs out of 13 non-zero differences
D <- 11
## Expected value of the number of plus signs under null. np
Null <- 13*(1/2)
Null
[1] 6.5
## Variance of it. npq
Var <- 13*(1/2)*(1/2)
Var
[1] 3.25
## Test statistic (large sample method)
Zplus <- (D - Null)/sqrt(Var)
Zplus
[1] 2.496
## 2-sided normal distribution-based p-value
two.side.norm.p <- pnorm(Zplus, lower.tail = F)*2
two.side.norm.p
[1] 0.01255
## 1-sided binomial distribution-based p-value
one.side.binom.p <- sum(dbinom(11:13, size = 13, prob = 1/2))
## Approximate 2-sided binomial distribution-based p-value
one.side.binom.p*2
[1] 0.02246

## Sign test: Test if the median is zero
library(BSDA)
SIGN.test(ree$diff)

    One-sample Sign-Test

data:  ree$diff
s = 11, p-value = 0.02246
alternative hypothesis: true median is not equal to 0
95 percent confidence interval:
  25.35 324.48
sample estimates:
median of x 
        161 
                  Conf.Level L.E.pt U.E.pt
Lower Achieved CI     0.9077  52.00  316.0
Interpolated CI       0.9500  25.35  324.5
Upper Achieved CI     0.9775   8.00  330.0

This is significant by either the large-sample method or the small sample method. The problem is that this method completely ignores the magnitude of these differences.

Wilcoxon signed-rank test

## Load CF FVC reduction data
fvc <- read.dta("~/statistics/Pagano_Gauvreau/Stata/chap13/reduction in fvc.dta")
## Differences
fvc$diff <- with(fvc, placebo - drug)
## Absolute differences
fvc$abs.diff <- abs(fvc$diff)
## Ranks of absolute differences
fvc$rank <- rank(fvc$abs.diff)
## Create signs
fvc$sign <- factor(fvc$diff < 0, c(FALSE,TRUE), c("1","-1"))
fvc$sign <- as.numeric(as.character(fvc$sign))
## Get signed ranks
fvc$signed.rank <- with(fvc, rank * sign)
## Show
fvc
   placebo drug diff abs.diff rank sign signed.rank
1      224  213   11       11    1    1           1
2       80   95  -15       15    2   -1          -2
3       75   33   42       42    3    1           3
4      541  440  101      101    4    1           4
5       74  -32  106      106    5    1           5
6       85  -28  113      113    6    1           6
7      293  445 -152      152    7   -1          -7
8      -23 -178  155      155    8    1           8
9      525  367  158      158    9    1           9
10     -38  140 -178      178   10   -1         -10
11     508  323  185      185   11    1          11
12     255   10  245      245   12    1          12
13     525   65  460      460   13    1          13
14    1023  343  680      680   14    1          14
## Add negative ranks as absolute values
sum.abs.neg.ranks <- sum(abs(fvc$signed.rank[fvc$signed.rank<0]))
## The expected value under null
Null <- 14*(14 + 1)/4
Null
[1] 52.5
## The variance
Var <- 14*(14 + 1)*(14*2 + 1)/24
## Test statistic
Zt <- (sum.abs.neg.ranks - Null)/sqrt(Var)
## 2-sided normal distribution-based p-value
pnorm(Zt)*2
[1] 0.03546
## Approximate 2-sided p-value from signed rank statistic
psignrank(sum.abs.neg.ranks, 14)*2
[1] 0.03528

## Using a built-in function
wilcox.test(fvc$diff)

    Wilcoxon signed rank test

data:  fvc$diff
V = 86, p-value = 0.03528
alternative hypothesis: true location is not equal to 0

Wilcoxon rank sum test

It is the non-parametric equivalent of two-sample t-test. It does not require normality, but does require two sample distributions have the same general shape

## resting energy expenditure: CF subjects vs healthy subjects
pku <- read.dta("~/statistics/Pagano_Gauvreau/Stata/chap13/pku.dta")
pku$exposure <- factor(pku$exposure)
## Assign ranks
pku$rank <- rank(pku$nma)
## Show
pku
   exposure  nma rank
1      high 28.0  1.0
2      high 35.0  3.0
3      high 37.0  4.5
4      high 37.0  4.5
5      high 43.5  9.0
6      high 44.0 10.0
7      high 45.5 11.5
8      high 46.0 13.0
9      high 48.0 17.0
10     high 48.3 18.0
11     high 48.7 19.5
12     high 51.0 23.0
13     high 52.0 25.5
14     high 53.0 28.0
15     high 53.0 28.0
16     high 54.0 31.5
17     high 54.0 31.5
18     high 55.0 34.5
19      low 34.5  2.0
20      low 37.5  6.0
21      low 39.5  7.0
22      low 40.0  8.0
23      low 45.5 11.5
24      low 47.0 14.5
25      low 47.0 14.5
26      low 47.5 16.0
27      low 48.7 19.5
28      low 49.0 21.0
29      low 51.0 23.0
30      low 51.0 23.0
31      low 52.0 25.5
32      low 53.0 28.0
33      low 54.0 31.5
34      low 54.0 31.5
35      low 55.0 34.5
36      low 56.5 36.0
37      low 57.0 37.0
38      low 58.5 38.5
39      low 58.5 38.5

## Calculate rank sums
rank.sum <- with(pku, tapply(rank, exposure, sum))
rank.sum
high  low 
 313  467 
## Set the smaller as the W statistic
W <- min(rank.sum)
W
[1] 313
## Expected value for rank sum
Null <- 18*(18 + 21 + 1)/2
Null
[1] 360
## Variance for rank sum
Var <- 18*21*(18 + 21 + 1)/12
Var
[1] 1260
## Test statistic
Zw <- (W - Null)/sqrt(Var)
Zw
[1] -1.324
## 2-sided normal distribution-based p-value
pnorm(Zw)*2
[1] 0.1855
## Approximate 2-sided p-value from rank sum test (this did not work??)
## pwilcox(W, m = 18, n = 21, lower.tail = F)*2

## Using a built-in function
wilcox.test(nma ~ exposure, data = pku)

    Wilcoxon rank sum test with continuity correction

data:  nma by exposure
W = 142, p-value = 0.1896
alternative hypothesis: true location shift is not equal to 0
## Using coin package
library(coin)
wilcox_test(nma ~ exposure, data = pku)

    Asymptotic Wilcoxon Mann-Whitney Rank Sum Test

data:  nma by exposure (high, low)
Z = -1.326, p-value = 0.1849
alternative hypothesis: true mu is not equal to 0

Quantile regression

## Load
library(quantreg)
## Fit
res.rq <- rq(nma ~ exposure, data = pku)
res.rq.null <- rq(nma ~ 1, data = pku)
## Result
summary(res.rq, se = "boot")

Call: rq(formula = nma ~ exposure, data = pku)

tau: [1] 0.5

Coefficients:
            Value  Std. Error t value Pr(>|t|)
(Intercept) 48.300  2.168     22.275   0.000  
exposurelow  2.700  2.684      1.006   0.321  
## Likelihood ratio test
neg2logLik <- 2 * as.numeric(logLik(res.rq) - logLik(res.rq.null))
pchisq(neg2logLik, 1, lower.tail = F)
[1] 0.2522

The predicted median value for the high exposure group is 48.3, whereas it is 48.3 + 2.7 = 51.0 for the low exposure group.

Gamma regression

The gamma family GLM can be used to model the mean of a skewed distribution. It can be both additive and multiplicative.

Gamma regression additive model (mean difference)

res.gamma <- glm(formula = nma ~ exposure,
                 family  = Gamma(link = "identity"),
                 data    = pku)
summary(res.gamma)

Call:
glm(formula = nma ~ exposure, family = Gamma(link = "identity"), 
    data = pku)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.4637  -0.0557   0.0327   0.1145   0.1778  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)    46.28       1.66   27.85   <2e-16 ***
exposurelow     3.09       2.34    1.32     0.19    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Gamma family taken to be 0.0232)

    Null deviance: 1.00077  on 38  degrees of freedom
Residual deviance: 0.96038  on 37  degrees of freedom
AIC: 273.2

Number of Fisher Scoring iterations: 3

The predicted mean value for the high exposure group is 46.3, whereas it is 46.3 + 3.1 = 49.4 for the low exposure group (additive effect of exposure).

Gamma regression multiplicative model (mean ratio)

res.gamma.log <- glm(formula = nma ~ exposure,
                     family  = Gamma(link = "log"),
                     data    = pku)
summary(res.gamma.log)

Call:
glm(formula = nma ~ exposure, family = Gamma(link = "log"), data = pku)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.4637  -0.0557   0.0327   0.1145   0.1778  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   3.8347     0.0359  106.81   <2e-16 ***
exposurelow   0.0646     0.0489    1.32     0.19    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Gamma family taken to be 0.0232)

    Null deviance: 1.00077  on 38  degrees of freedom
Residual deviance: 0.96038  on 37  degrees of freedom
AIC: 273.2

Number of Fisher Scoring iterations: 4
exp(coef(res.gamma.log))
(Intercept) exposurelow 
     46.278       1.067 

The predicted mean value for the high exposure group is 46.3, whereas it is 46.3 * 1.066 = 49.4 for the low exposure group (multiplicative effect of exposure).