Ref. Regression Methods in Biostatistics
Nonparametric tests do not assume a normal distribution.
They are not assumption-free, however.
The Wilcoxon and Kruskal-Wallis tests test for a location difference, but assume the same variance (scale) and shape across groups.
They are less efficient when parametric tests are appropriate.
They only provide p-values; no effect measures are provided.
Median regression can provide an effect estimate and inference about the median in nonnormal distributions.
Gamma regression can provide an effect estimate and inference about the mean in skewed distributions.
## Load foreign package
library(foreign)
## resting energy expenditure: CF subjects vs healthy subjects
ree <- read.dta("~/statistics/Pagano_Gauvreau/Stata/chap13/resting energy expenditure.dta")
ree
pair cf healthy
1 1 1153 996
2 2 1132 1080
3 3 1165 1182
4 4 1460 1452
5 5 1634 1162
6 6 1493 1619
7 7 1358 1140
8 8 1453 1123
9 9 1185 1113
10 10 1824 1463
11 11 1793 1632
12 12 1930 1614
13 13 2075 1836
## Add difference variable
ree$diff <- with(ree, cf - healthy)
ree
pair cf healthy diff
1 1 1153 996 157
2 2 1132 1080 52
3 3 1165 1182 -17
4 4 1460 1452 8
5 5 1634 1162 472
6 6 1493 1619 -126
7 7 1358 1140 218
8 8 1453 1123 330
9 9 1185 1113 72
10 10 1824 1463 361
11 11 1793 1632 161
12 12 1930 1614 316
13 13 2075 1836 239
## Density plot for difference
library(lattice)
densityplot(ree$diff)
## 11 positive signs out of 13 non-zero differences
D <- 11
## Expected value of the number of plus signs under null. np
Null <- 13*(1/2)
Null
[1] 6.5
## Variance of it. npq
Var <- 13*(1/2)*(1/2)
Var
[1] 3.25
## Test statistic (large sample method)
Zplus <- (D - Null)/sqrt(Var)
Zplus
[1] 2.496
## 2-sided normal distribution-based p-value
two.side.norm.p <- pnorm(Zplus, lower.tail = F)*2
two.side.norm.p
[1] 0.01255
## 1-sided binomial distribution-based p-value
one.side.binom.p <- sum(dbinom(11:13, size = 13, prob = 1/2))
## 2-sided binomial distribution-based p-value (doubling the one-sided tail)
one.side.binom.p*2
[1] 0.02246
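The same two-sided p-value can be obtained directly from binom.test(), which carries out the exact binomial test; a quick check (it should agree with the doubled one-sided tail above, 0.02246):
## Exact binomial test: 11 plus signs out of 13, null probability 1/2
binom.test(11, n = 13, p = 1/2)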
## Sign test: Test if the median is zero
library(BSDA)
SIGN.test(ree$diff)
One-sample Sign-Test
data: ree$diff
s = 11, p-value = 0.02246
alternative hypothesis: true median is not equal to 0
95 percent confidence interval:
25.35 324.48
sample estimates:
median of x
161
Conf.Level L.E.pt U.E.pt
Lower Achieved CI 0.9077 52.00 316.0
Interpolated CI 0.9500 25.35 324.5
Upper Achieved CI 0.9775 8.00 330.0
This result is significant by either the large-sample (normal approximation) method or the small-sample (exact binomial) method. The problem is that the sign test completely ignores the magnitude of the differences.
## Load CF FVC reduction data
fvc <- read.dta("~/statistics/Pagano_Gauvreau/Stata/chap13/reduction in fvc.dta")
## Differences
fvc$diff <- with(fvc, placebo - drug)
## Absolute differences
fvc$abs.diff <- abs(fvc$diff)
## Ranks of absolute differences
fvc$rank <- rank(fvc$abs.diff)
## Create signs
fvc$sign <- factor(fvc$diff < 0, c(FALSE,TRUE), c("1","-1"))
fvc$sign <- as.numeric(as.character(fvc$sign))
## Get signed ranks
fvc$signed.rank <- with(fvc, rank * sign)
## Show
fvc
placebo drug diff abs.diff rank sign signed.rank
1 224 213 11 11 1 1 1
2 80 95 -15 15 2 -1 -2
3 75 33 42 42 3 1 3
4 541 440 101 101 4 1 4
5 74 -32 106 106 5 1 5
6 85 -28 113 113 6 1 6
7 293 445 -152 152 7 -1 -7
8 -23 -178 155 155 8 1 8
9 525 367 158 158 9 1 9
10 -38 140 -178 178 10 -1 -10
11 508 323 185 185 11 1 11
12 255 10 245 245 12 1 12
13 525 65 460 460 13 1 13
14 1023 343 680 680 14 1 14
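As an aside, the same signs could be obtained more directly with base R's sign() function, which returns -1, 0, or 1; with no zero differences in this data it is equivalent to the factor-based recoding above.
## More direct construction of the signs
with(fvc, sign(diff))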
## Sum the absolute values of the negative signed ranks
sum.abs.neg.ranks <- sum(abs(fvc$signed.rank[fvc$signed.rank<0]))
## The expected value under null
Null <- 14*(14 + 1)/4
Null
[1] 52.5
## The variance
Var <- 14*(14 + 1)*(14*2 + 1)/24
## Test statistic
Zt <- (sum.abs.neg.ranks - Null)/sqrt(Var)
## 2-sided normal distribution-based p-value
pnorm(Zt)*2
[1] 0.03546
## 2-sided p-value from the exact signed rank distribution (doubling the lower tail)
psignrank(sum.abs.neg.ranks, 14)*2
[1] 0.03528
## Using a built-in function
wilcox.test(fvc$diff)
Wilcoxon signed rank test
data: fvc$diff
V = 86, p-value = 0.03528
alternative hypothesis: true location is not equal to 0
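The V statistic reported by wilcox.test() is the sum of the positive signed ranks, 105 - 19 = 86, rather than the sum of the negative ranks used above; by symmetry both lead to the same p-value. The paired columns can also be passed directly, and conf.int = TRUE adds a Hodges-Lehmann estimate of the location shift with a confidence interval; a sketch:
## Equivalent paired call with a Hodges-Lehmann location estimate
wilcox.test(fvc$placebo, fvc$drug, paired = TRUE, conf.int = TRUE)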
The Wilcoxon rank sum (Mann-Whitney) test is the nonparametric equivalent of the two-sample t-test. It does not require normality, but it does require that the two sample distributions have the same general shape.
## PKU study: nma scores in high vs low exposure subjects
pku <- read.dta("~/statistics/Pagano_Gauvreau/Stata/chap13/pku.dta")
pku$exposure <- factor(pku$exposure)
## Assign ranks
pku$rank <- rank(pku$nma)
## Show
pku
exposure nma rank
1 high 28.0 1.0
2 high 35.0 3.0
3 high 37.0 4.5
4 high 37.0 4.5
5 high 43.5 9.0
6 high 44.0 10.0
7 high 45.5 11.5
8 high 46.0 13.0
9 high 48.0 17.0
10 high 48.3 18.0
11 high 48.7 19.5
12 high 51.0 23.0
13 high 52.0 25.5
14 high 53.0 28.0
15 high 53.0 28.0
16 high 54.0 31.5
17 high 54.0 31.5
18 high 55.0 34.5
19 low 34.5 2.0
20 low 37.5 6.0
21 low 39.5 7.0
22 low 40.0 8.0
23 low 45.5 11.5
24 low 47.0 14.5
25 low 47.0 14.5
26 low 47.5 16.0
27 low 48.7 19.5
28 low 49.0 21.0
29 low 51.0 23.0
30 low 51.0 23.0
31 low 52.0 25.5
32 low 53.0 28.0
33 low 54.0 31.5
34 low 54.0 31.5
35 low 55.0 34.5
36 low 56.5 36.0
37 low 57.0 37.0
38 low 58.5 38.5
39 low 58.5 38.5
## Calculate rank sums
rank.sum <- with(pku, tapply(rank, exposure, sum))
rank.sum
high low
313 467
## Set the smaller as the W statistic
W <- min(rank.sum)
W
[1] 313
## Expected value for rank sum
Null <- 18*(18 + 21 + 1)/2
Null
[1] 360
## Variance for rank sum
Var <- 18*21*(18 + 21 + 1)/12
Var
[1] 1260
## Test statistic
Zw <- (W - Null)/sqrt(Var)
Zw
[1] -1.324
## 2-sided normal distribution-based p-value
pnorm(Zw)*2
[1] 0.1855
## 2-sided p-value from the rank sum distribution. Note that pwilcox() is
## parameterized by the Mann-Whitney U statistic, U = W - m*(m + 1)/2, not
## by the rank sum W itself; with ties present it is only approximate anyway
## pwilcox(W - 18*(18 + 1)/2, m = 18, n = 21)*2
## Using a built-in function
wilcox.test(nma ~ exposure, data = pku)
Wilcoxon rank sum test with continuity correction
data: nma by exposure
W = 142, p-value = 0.1896
alternative hypothesis: true location shift is not equal to 0
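The small discrepancy with the hand calculation above (0.1855 vs 0.1896) is mostly due to the continuity correction; wilcox.test() also adjusts the variance for ties. A rough hand check of the continuity-corrected statistic:
## Continuity-corrected version of the normal approximation
## (ignores the tie correction to the variance)
Zw.cc <- (W - Null + 0.5)/sqrt(Var)
pnorm(Zw.cc)*2
## roughly 0.19, close to the 0.1896 reported by wilcox.test()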
## Using coin package
library(coin)
wilcox_test(nma ~ exposure, data = pku)
Asymptotic Wilcoxon Mann-Whitney Rank Sum Test
data: nma by exposure (high, low)
Z = -1.326, p-value = 0.1849
alternative hypothesis: true mu is not equal to 0
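Before turning to regression-based effect estimates, note that wilcox.test() can itself report a Hodges-Lehmann estimate of the location shift when asked; a sketch:
## Hodges-Lehmann location-shift estimate with confidence interval
wilcox.test(nma ~ exposure, data = pku, conf.int = TRUE)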
## Load
library(quantreg)
## Fit
res.rq <- rq(nma ~ exposure, data = pku)
res.rq.null <- rq(nma ~ 1, data = pku)
## Result
summary(res.rq, se = "boot")
Call: rq(formula = nma ~ exposure, data = pku)
tau: [1] 0.5
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) 48.300 2.168 22.275 0.000
exposurelow 2.700 2.684 1.006 0.321
## Likelihood ratio test
neg2logLik <- 2 * as.numeric(logLik(res.rq) - logLik(res.rq.null))
pchisq(neg2logLik, 1, lower.tail = F)
[1] 0.2522
The predicted median value for the high exposure group is 48.3, whereas it is 48.3 + 2.7 = 51.0 for the low exposure group.
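The same fitted medians can be recovered with predict(); a minimal sketch, where newdat is just a helper data frame built so that its factor levels match those in pku (it should return 48.3 and 51.0):
## Predicted medians by exposure group
newdat <- data.frame(exposure = factor(c("high", "low"),
                                       levels = levels(pku$exposure)))
predict(res.rq, newdata = newdat)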
The gamma-family GLM can be used to model the mean of a skewed distribution. Depending on the link function, the effect can be expressed as either additive (identity link) or multiplicative (log link).
Gamma regression additive model (mean difference)
res.gamma <- glm(formula = nma ~ exposure,
family = Gamma(link = "identity"),
data = pku)
summary(res.gamma)
Call:
glm(formula = nma ~ exposure, family = Gamma(link = "identity"),
data = pku)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.4637 -0.0557 0.0327 0.1145 0.1778
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 46.28 1.66 27.85 <2e-16 ***
exposurelow 3.09 2.34 1.32 0.19
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Gamma family taken to be 0.0232)
Null deviance: 1.00077 on 38 degrees of freedom
Residual deviance: 0.96038 on 37 degrees of freedom
AIC: 273.2
Number of Fisher Scoring iterations: 3
The predicted mean value for the high exposure group is 46.3, whereas it is 46.3 + 3.1 = 49.4 for the low exposure group (additive effect of exposure).
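Because exposure is the only covariate, the identity-link gamma fit should reproduce the sample group means exactly, which gives an easy sanity check:
## Sample group means; these should match the fitted means above
## (46.28 for high, 46.28 + 3.09 = 49.37 for low)
with(pku, tapply(nma, exposure, mean))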
Gamma regression multiplicative model (mean ratio)
res.gamma.log <- glm(formula = nma ~ exposure,
family = Gamma(link = "log"),
data = pku)
summary(res.gamma.log)
Call:
glm(formula = nma ~ exposure, family = Gamma(link = "log"), data = pku)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.4637 -0.0557 0.0327 0.1145 0.1778
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.8347 0.0359 106.81 <2e-16 ***
exposurelow 0.0646 0.0489 1.32 0.19
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Gamma family taken to be 0.0232)
Null deviance: 1.00077 on 38 degrees of freedom
Residual deviance: 0.96038 on 37 degrees of freedom
AIC: 273.2
Number of Fisher Scoring iterations: 4
exp(coef(res.gamma.log))
(Intercept) exposurelow
46.278 1.067
The predicted mean value for the high exposure group is 46.3, whereas it is 46.3 * 1.067 = 49.4 for the low exposure group (multiplicative effect of exposure).
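A confidence interval for this mean ratio can be obtained by exponentiating the profile-likelihood interval for the log-link coefficient (confint() on a glm object profiles the likelihood via MASS); a sketch:
## 95% CI for the mean ratio (low vs high exposure)
exp(confint(res.gamma.log))["exposurelow", ]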