Wilcoxon Rank Sum Test

1. Solve problem Ex1.03 (No. 3 of Chapter 1) in Higgins. The data below are the yearly rainfall totals (inches) in Scranton, PA, for the years 1951–1984.

21.3 28.8 17.6 23.0 27.2 28.5 32.8 28.2 25.9 22.5 27.2 33.1 28.7 24.8 24.3 27.1 30.6 26.8 18.9 36.3 28.0 17.9 25.0 27.5 27.7 32.1 28.0 30.9 20.0 20.2 33.5 26.4 30.9 33.2

(a) Use significance level 0.05 and test if the 20th percentile of rainfall is significantly above 18 inches.

Answer

rainfall_data <- c(21.3, 28.8, 17.6, 23.0, 27.2, 28.5, 32.8, 28.2, 25.9, 22.5, 27.2, 33.1, 28.7, 24.8, 24.3, 27.1, 30.6,
                   26.8, 18.9, 36.3, 28.0, 17.9, 25.0, 27.5, 27.7, 32.1, 28.0, 30.9, 20.0, 20.2, 33.5, 26.4, 30.9, 33.2)
# Plot the empirical CDF (ECDF) of the rainfall data
plot(ecdf(rainfall_data), verticals = TRUE, do.points = FALSE,
     main = "Yearly rainfall totals in Scranton (inches)", ylab = "ECDF", xlab = "Rainfall data")

The ECDF helps us identify exactly how many data points lie at or below, or above, a given value \(x\).
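The counts that the ECDF encodes can also be read off numerically. The following is a minimal sketch, assuming the threshold x = 18 from part (a); the names rain, Fn, n_below and n_above are illustrative and not part of the original solution.

```r
# Rainfall totals from the problem statement
rain <- c(21.3, 28.8, 17.6, 23.0, 27.2, 28.5, 32.8, 28.2, 25.9, 22.5, 27.2, 33.1,
          28.7, 24.8, 24.3, 27.1, 30.6, 26.8, 18.9, 36.3, 28.0, 17.9, 25.0, 27.5,
          27.7, 32.1, 28.0, 30.9, 20.0, 20.2, 33.5, 26.4, 30.9, 33.2)

Fn <- ecdf(rain)                            # Fn(x) = proportion of observations <= x
n_below <- round(Fn(18) * length(rain))     # number of observations <= 18
n_above <- length(rain) - n_below           # number of observations > 18
c(below = n_below, above = n_above)
```

The count above 18 is the quantity that the binomial test in part (a) uses as its test statistic.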

We can also carry out an upper-tailed hypothesis test of the given research proposition:

Hypothesis:

Null Hypothesis, \(H_0\): \(\theta_{0.2}\) = 18

Alternative Hypothesis, \(H_a\): \(\theta_{0.2}\) > 18

Here, we are assuming a significance level of \(\alpha\) = 0.05.

Considering the binomial test statistic,

library(matrixStats)
#Number of total observations
n <- length(rainfall_data)
n

## [1] 34

#count the number of values greater than 18
data <- rbind(c(21.3,28.8,17.6,23.0,27.2,28.5,32.8,28.2,25.9,22.5,27.2,33.1,28.7,24.8,24.3,27.1,30.6,
26.8, 18.9, 36.3, 28.0, 17.9, 25.0, 27.5, 27.7, 32.1, 28.0, 30.9, 20.0, 20.2, 33.5, 26.4, 30.9, 33.2))
S = rowCounts(data>18)
S

## [1] 32

\(s_{obs}\) = number of observations \(X_i\) that exceed 18 = 32.

Here S is the number of observations out of n that exceed the hypothesized 20th percentile \(\theta_{0.2} = 18\). In other words,

\[S = \sum_{i=1}^{n} I(X_i > \theta_{0.2})\]

where I(A) is the indicator function that takes the value 1 if the statement A holds and 0 otherwise.

Distribution of S under \(H_0\): If \(\theta_{0.2}\) is the 20th percentile, then \(P(X_i \le \theta_{0.2}) = p = 0.2\) and \(P(X_i > \theta_{0.2}) = 1 - p = 0.8\). So, when \(H_0\) is true, we can write:

\[S \sim \text{Binomial}(34, 0.8)\]

The p-value of the upper-tailed test is

\[p\text{-value} = P(S \ge s_{obs} \mid H_0) = 1 - P(S \le 31 \mid S \sim \text{Bi}(34, 0.8)) = 0.02259587\]

Since the p-value < 0.05, we reject the null hypothesis and conclude that the 20th percentile of rainfall is significantly above 18 inches at significance level \(\alpha\) = 0.05.

The R code for calculating the upper-tailed p-value is:

y <- pbinom(31, 34, 0.8)  # P(S <= 31) under H0
z <- 1 - y                # P(S >= 32), the upper-tailed p-value
z
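As a cross-check, the same upper-tail probability can be obtained from R's built-in binom.test function. This is a minimal sketch, assuming the counts s_obs = 32 and n = 34 found above; the variable names are illustrative.

```r
s_obs <- 32   # observations exceeding 18
n     <- 34   # total observations

# Manual upper-tail probability P(S >= s_obs) under S ~ Bi(34, 0.8)
p_manual <- 1 - pbinom(s_obs - 1, n, 0.8)

# Same test via binom.test: H0: P(X > 18) = 0.8 against Ha: P(X > 18) > 0.8
bt <- binom.test(s_obs, n, p = 0.8, alternative = "greater")

c(manual = p_manual, binom.test = bt$p.value)
```

Both entries should agree with the p-value reported above, since binom.test with alternative = "greater" computes the same binomial tail.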

(b) Construct a 95% confidence interval for the median.

Answer

For the median, p = 0.5, and a 95% confidence interval corresponds to α = 0.05. The number of observations below the median therefore follows a binomial distribution, i.e. S ∼ Bi(34, 0.5).

# Ranks of the order statistics that bound the 95% CI
lower_limit <- 1 + qbinom(0.025, 34, 0.5)
lower_limit

## [1] 12

upper_limit = 1 + qbinom(0.975,34,0.5)
upper_limit

## [1] 24

We can also build the 95% CI using a confidence interval function:

x= c(21.3,28.8,17.6,23.0,27.2,28.5,32.8,28.2,25.9,22.5,27.2,33.1,28.7,24.8,24.3,27.1,30.6,
26.8, 18.9, 36.3, 28.0, 17.9, 25.0, 27.5, 27.7, 32.1, 28.0, 30.9, 20.0, 20.2, 33.5, 26.4, 30.9, 33.2)
sort (x)

##  [1] 17.6 17.9 18.9 20.0 20.2 21.3 22.5 23.0 24.3 24.8 25.0 25.9 26.4 26.8 27.1
## [16] 27.2 27.2 27.5 27.7 28.0 28.0 28.2 28.5 28.7 28.8 30.6 30.9 30.9 32.1 32.8
## [31] 33.1 33.2 33.5 36.3

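conf.med <- function(x, alpha)
{
        # (1 - alpha) confidence interval for the median: [x_(l), x_(n-l+1)]
        v <- sort(x, na.last = NA)   # order statistics, NAs dropped
        n <- length(x)
        if(n > 0) {
                m <- median(x)
                l <- qbinom(alpha/2, n, 0.5)   # rank of the lower bound
                if(l > 0) {
                        u <- n - l + 1         # rank of the upper bound
                        r <- c(m, v[l], v[u])
                }
                else r <- c(m, NA, NA)
        }
        else r <- c(NA, NA, NA)
        r <- as.data.frame(list(median = r[1], lower = r[2], upper = r[3]))
        class(r) <- "table"
        r
}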

conf.med(x,0.05) #CI for median

## median  lower  upper 
##  27.35     25   28.7

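From the output above, conf.med returns \[[x_{(l)}, x_{(u)}] = [x_{(11)}, x_{(24)}] = [25.0, 28.7]\] because it indexes the sorted data with \(l\) = qbinom(0.025, 34, 0.5) = 11 and \(u = n - l + 1 = 24\). Using instead the ranks \(l = 12\) and \(u = 24\) computed earlier gives the slightly shorter interval \[[x_{(l)}, x_{(u)}] = [x_{(12)}, x_{(24)}] = [25.9, 28.7]\] We can double-check the coverage of this interval in R, using \(u = 24\) and \(l = 12\): \[\text{pbinom}(u-1, 34, 0.5) - \text{pbinom}(l-1, 34, 0.5) = \text{pbinom}(23, 34, 0.5) - \text{pbinom}(11, 34, 0.5) \approx 0.9590404 \ge 0.95\]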

# Coverage probability of [x_(12), x_(24)] under S ~ Bi(34, 0.5)
a <- pbinom(23, 34, 0.5)
b <- pbinom(11, 34, 0.5)
diff <- a - b
diff

## [1] 0.9590404
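The coverage check above generalizes to any pair of ranks and any quantile. The helper below is a hypothetical sketch (ci_coverage is not part of the original solution) that assumes independent, continuously distributed observations; it compares the ranks (12, 24) with the ranks (11, 24) that conf.med indexes as written.

```r
# Coverage of [x_(l), x_(u)] as a confidence interval for the p-th quantile,
# based on B ~ Binomial(n, p), the number of observations below that quantile
ci_coverage <- function(l, u, n, p) {
  pbinom(u - 1, n, p) - pbinom(l - 1, n, p)
}

cov_12_24 <- ci_coverage(12, 24, 34, 0.5)  # ranks from 1 + qbinom(...)
cov_11_24 <- ci_coverage(11, 24, 34, 0.5)  # ranks indexed inside conf.med
c(cov_12_24 = cov_12_24, cov_11_24 = cov_11_24)
```

Both intervals reach at least 95% coverage; the one starting at rank 11 is wider and therefore more conservative.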

(c) Construct a 90% confidence interval for the 20th percentile.

Answer Calculation of the 90% confidence interval for the 20th percentile:

# Ranks of the order statistics that bound the 90% CI for the 20th percentile
lower_limit1 <- 1 + qbinom(0.05, 34, 0.2)
lower_limit1

## [1] 4

upper_limit1 = 1 + qbinom(0.95,34,0.2)
upper_limit1

## [1] 12

conf.quantile <- function(x, alpha, p)
{
        # (1 - alpha) confidence interval for theta_p, the pth quantile,
        # e.g. p = 0.75 corresponds to the 75th percentile
        v <- sort(x, na.last = NA)   # order statistics, NAs dropped
        n <- length(x)
        if(n > 0) {
                m <- quantile(x, p)
                l <- qbinom(alpha/2, n, p)           # rank of the lower bound
                u <- qbinom(1 - alpha/2, n, p) + 1   # rank of the upper bound
                if(l > 0) {
                        r <- c(m, v[l], v[u])
                }
                else r <- c(m, NA, NA)
        }
        else r <- c(NA, NA, NA)
        r <- as.data.frame(list(p_quantile = r[1], lower = r[2], upper = r[3]))
        class(r) <- "table"
        r
}

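Calculation of the 90% CI using the R function (a 90% interval corresponds to alpha = 0.10):

conf.quantile(x, alpha=0.10, p=0.2)

## p_quantile      lower      upper 
##       22.8       18.9       25.9

The function indexes the lower bound with \(l\) = qbinom(0.05, 34, 0.2) = 3, so it returns \[[x_{(3)}, x_{(12)}] = [18.9, 25.9]\] Using the ranks \(l = 4\) and \(u = 12\) computed above, the 90% CI for the 20th percentile is \[[x_{(l)}, x_{(u)}] = [x_{(4)}, x_{(12)}] = [20.0, 25.9]\] We can double-check its coverage in R, using \(u = 12\) and \(l = 4\): \[\text{pbinom}(u-1, 34, 0.2) - \text{pbinom}(l-1, 34, 0.2) = \text{pbinom}(11, 34, 0.2) - \text{pbinom}(3, 34, 0.2) = 0.902554 \ge 0.90\]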

# Coverage probability of [x_(4), x_(12)] under S ~ Bi(34, 0.2)
a1 <- pbinom(11, 34, 0.2)
b1 <- pbinom(3, 34, 0.2)
diff1 <- a1 - b1
diff1

## [1] 0.902554
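The order statistics at ranks 4 and 12 can be read directly from the sorted data. This is a minimal sketch, assuming the ranks are taken as 1 + qbinom(...) as in the calculation above; the names v, l, u, ci and coverage are illustrative.

```r
x <- c(21.3, 28.8, 17.6, 23.0, 27.2, 28.5, 32.8, 28.2, 25.9, 22.5, 27.2, 33.1,
       28.7, 24.8, 24.3, 27.1, 30.6, 26.8, 18.9, 36.3, 28.0, 17.9, 25.0, 27.5,
       27.7, 32.1, 28.0, 30.9, 20.0, 20.2, 33.5, 26.4, 30.9, 33.2)

n <- length(x)
v <- sort(x)                        # order statistics x_(1) <= ... <= x_(n)
l <- qbinom(0.05, n, 0.2) + 1       # lower rank
u <- qbinom(0.95, n, 0.2) + 1       # upper rank
ci <- v[c(l, u)]                    # interval [x_(l), x_(u)]
coverage <- pbinom(u - 1, n, 0.2) - pbinom(l - 1, n, 0.2)

list(ranks = c(l, u), interval = ci, coverage = coverage)
```

Reading the bounds from the sorted vector makes it easy to confirm which order statistics an interval actually uses.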

(d) The confidence interval procedure assumes that the observations are independent and identically distributed. Do you think this is a reasonable assumption for the rainfall data? If not, what could cause this assumption to be violated?

Answer The rainfall totals were all recorded at the same location (Scranton, PA) in consecutive years, 1951–1984, so they form a time series. Measurements taken repeatedly on the same subject are likely to be autocorrelated rather than independent; for example, a wet or dry spell may carry over from one year to the next. The assumption that the observations are identically distributed may still be reasonable. If independence is violated, the actual coverage of the confidence interval can differ from the nominal level, so the interval is less reliable than it would be for independent and identically distributed data.
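One informal way to examine the independence concern is the lag-1 sample autocorrelation. The sketch below assumes the values are listed in chronological order (1951 first), which the problem statement does not say explicitly, and uses the rough large-sample band ±1.96/√n as a guide rather than a formal test.

```r
# Rainfall totals, assumed to be in chronological order (1951, ..., 1984)
rain <- c(21.3, 28.8, 17.6, 23.0, 27.2, 28.5, 32.8, 28.2, 25.9, 22.5, 27.2, 33.1,
          28.7, 24.8, 24.3, 27.1, 30.6, 26.8, 18.9, 36.3, 28.0, 17.9, 25.0, 27.5,
          27.7, 32.1, 28.0, 30.9, 20.0, 20.2, 33.5, 26.4, 30.9, 33.2)

n  <- length(rain)
r1 <- cor(rain[-n], rain[-1])   # correlation between year t and year t + 1
band <- 1.96 / sqrt(n)          # approximate 95% band under independence

c(lag1 = r1, band = band)
abs(r1) > band                  # TRUE would suggest year-to-year dependence
```

A lag-1 correlation well outside the band would support the concern about dependence; a value inside it does not prove independence, since dependence at longer lags or a trend could still be present.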

2. (Ex4.04 of Higgins) Measurements of a blood enzyme LDH were taken on seven subjects before fasting and after fasting. Is there a significant difference between the LDH readings before and after fasting? Use significance level 0.05.

Subject 1 2 3 4 5 6 7
Before 89 90 87 98 120 85 97
After 76 101 84 86 105 84 93

(a) Construct appropriate null and alternative hypotheses.

Answer The data are paired, so we carry out a two-tailed hypothesis test of the given research proposition:

Hypothesis:

Null Hypothesis, \(H_0\): \(\theta\) = 0

Alternative Hypothesis, \(H_a\): \(\theta\) \(\neq\) 0

Here, we are assuming a significance level of \(\alpha\) = 0.05, where \(\theta\) is the median difference between the blood enzyme LDH readings before and after fasting.

(b) Test using Wilcoxon's signed-rank statistic. Compute the p-value of the test using the exact null distribution of SR+ and state your conclusion.

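Answer R code for Wilcoxon's signed-rank statistic. The data are paired, so paired = TRUE is required; with n = 7 and no ties or zero differences, the exact null distribution is available through exact = TRUE:

before_data = c(89,90,87,98,120,85,97)
after_data = c(76,101,84,86,105,84,93)
wilcox.test(before_data, after_data, paired = TRUE, alternative = "two.sided", exact = TRUE)

The differences (before − after) are 13, −11, 3, 12, 15, 1, 4. Ranking their absolute values gives SR+ = 6 + 2 + 5 + 7 + 1 + 3 = 24 (the statistic R reports as V) and SR− = 4. Under \(H_0\) each of the \(2^7 = 128\) sign assignments is equally likely, and 7 of them give a negative rank sum of 4 or less, so the exact two-sided p-value is 2 × 7/128 ≈ 0.1094.

Since the p-value (0.1094) is greater than α = 0.05, we fail to reject the null hypothesis; we do not have strong evidence of a significant difference between the blood enzyme LDH readings before and after fasting.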
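Because the measurements are paired, the exact null distribution of SR+ can be obtained by enumerating every assignment of signs to the ranks of the absolute differences. The following is a minimal sketch of that enumeration; the names d, r, sr_plus, null_sr and p_two are illustrative, and the two-sided p-value is taken as twice the smaller tail.

```r
before <- c(89, 90, 87, 98, 120, 85, 97)
after  <- c(76, 101, 84, 86, 105, 84, 93)

d <- before - after            # paired differences
r <- rank(abs(d))              # ranks of the absolute differences
sr_plus <- sum(r[d > 0])       # observed SR+

# All 2^7 equally likely sign assignments under H0 (1 = positive difference)
signs   <- as.matrix(expand.grid(rep(list(c(0, 1)), length(d))))
null_sr <- as.vector(signs %*% r)   # SR+ for each assignment

p_upper <- mean(null_sr >= sr_plus)
p_lower <- mean(null_sr <= sr_plus)
p_two   <- min(1, 2 * min(p_upper, p_lower))

c(SR_plus = sr_plus, p_value = p_two)
```

With only seven pairs the full enumeration has 128 rows, so the exact distribution is cheap to compute directly.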

3. (Ex4.02 of Higgins) Students in an introductory statistics course were asked at the beginning of the course and at the end of the course the extent to which they disagreed or agreed with the statement: "Statistics is very important to my major area of study." They responded on a scale of 1 to 5 with (1) strongly disagree, (2) moderately disagree, (3) neutral, (4) moderately agree, and (5) strongly agree. Data are shown in the table. Use Wilcoxon's signed-rank test to determine whether responses to this question changed significantly from the beginning to the end of the semester. Use significance level 0.05.

Subject 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
Before 2 3 4 4 3 1 3 4 4 5 3 4 2 2 4 3 4 2 2
After 2 4 4 4 4 4 3 5 4 4 4 5 4 2 5 5 4 1 2

(a) Construct appropriate null and alternative hypotheses.

Answer

Hypothesis:

Null Hypothesis, \(H_0\): \(\theta\) = 0

Alternative Hypothesis, \(H_a\): \(\theta\) \(\neq\) 0

Here, we are assuming a significance level of \(\alpha\) = 0.05, where \(\theta\) is the change in the students' responses to the question between the beginning and the end of the semester.

(b) Test using Wilcoxon's signed-rank statistic. Compute the p-value of the test using the exact null distribution of SR+ and state your conclusion.

Answer

#install packages for exact Wilcoxon signed-rank tests with ties
#install.packages("exactRankTests")
#install.packages("coin")
library(coin) #load the library coin

## Loading required package: survival

before_data1 = c(2,3,4,4,3,1,3,4,4,5,3,4,2,2,4,3,4,2,2)
after_data1 = c(2,4,4,4,4,4,3,5,4,4,4,5,4,2,5,5,4,1,2)
wilcox.test(before_data1,after_data1, paired = TRUE, alternative="two.sided",exact = NULL)

## Warning in wilcox.test.default(before_data1, after_data1, paired = TRUE, :
## cannot compute exact p-value with ties

## Warning in wilcox.test.default(before_data1, after_data1, paired = TRUE, :
## cannot compute exact p-value with zeroes

## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  before_data1 and after_data1
## V = 9, p-value = 0.02903
## alternative hypothesis: true location shift is not equal to 0

Because the data contain ties and zero differences, R cannot compute the exact p-value and reports the normal approximation with continuity correction instead. Since the p-value (0.02903) is less than the significance level α = 0.05, we reject the null hypothesis and conclude that the responses to the question changed significantly from the beginning to the end of the semester.
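An exact p-value can still be obtained by enumerating the sign assignments conditional on the observed ranks. The sketch below rests on two assumptions that are not the only possible convention: zero differences are dropped before ranking, and tied absolute differences receive midranks. The names d, r, sr_plus, null_sr and p_two are illustrative.

```r
before <- c(2, 3, 4, 4, 3, 1, 3, 4, 4, 5, 3, 4, 2, 2, 4, 3, 4, 2, 2)
after  <- c(2, 4, 4, 4, 4, 4, 3, 5, 4, 4, 4, 5, 4, 2, 5, 5, 4, 1, 2)

d <- before - after
d <- d[d != 0]                 # assumption: discard zero differences
r <- rank(abs(d))              # midranks for tied absolute differences
sr_plus <- sum(r[d > 0])       # observed SR+

# All 2^m equally likely sign assignments for the m nonzero differences
signs   <- as.matrix(expand.grid(rep(list(c(0, 1)), length(d))))
null_sr <- as.vector(signs %*% r)

p_lower <- mean(null_sr <= sr_plus)
p_upper <- mean(null_sr >= sr_plus)
p_two   <- min(1, 2 * min(p_lower, p_upper))

c(SR_plus = sr_plus, nonzero = length(d), p_value = p_two)
```

Under these assumptions the exact conditional p-value is 74/2048 ≈ 0.0361, which leads to the same conclusion at α = 0.05 as the normal approximation.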

Md Mominul Islam ID: 101009250

2/19/2021