Let us consider an example to introduce the Mann-Whitney statistic and to capture the essence of non-parametric inference in real-life situations.
Suppose we have data on the weights of n persons from Delhi and m persons from Kolkata. We want to test whether weights, and hence the tendency towards obesity, run higher in Delhi than in Kolkata.
This real-life scenario can be formulated as a testing problem, and in this context we can perform the Mann-Whitney U test, proposed by Henry Mann and Donald Ransom Whitney in 1947.
Assumptions
In order to carry out the Mann-Whitney U test, consider two families of distributions \(F(x)\) and \(G(x) = F(x-\theta)\).
Assume that both F and G are continuous.
We draw random samples \((X_1, X_2, \ldots, X_m)\) and \((Y_1, Y_2, \ldots, Y_n)\), independently and identically distributed within each sample, from F and G respectively.
Objectives
Here we want to test whether the aforementioned random samples actually come from the same population distribution, when the distributions are completely unknown.
Firstly, we examine and establish that the test statistic is distribution-free under the null hypothesis, for different choices of distributions and sample sizes.
Then we check the behaviour of the test statistic in similar situations, but under the alternative hypothesis.
After that, the limiting distribution of the test statistic is verified for large sample sizes.
Then we consider some discrete distributions under the hypotheses and study the behaviour of the statistic.
Next, we conduct some equivalent parametric tests and compare the sizes and powers of those tests with those of the aforementioned non-parametric test.
Finally, we find the Hodges-Lehmann estimator and carry out some parallel studies: checking robustness, examining stochastic dominance, and observing the relation between the Mann-Whitney statistic and the Wilcoxon rank-sum statistic.
Underlying set-up
Testing Procedure
For convenience, here we are interested only in testing H0: \(\theta\) = 0 against H1: \(\theta\) > 0. The testing procedures for the other alternative hypotheses, H2: \(\theta\) < 0 and H3: \(\theta \neq 0\), are executed in a similar manner.
Test Statistic
Consider the indicator random variables \(I(X_i > Y_j)\), \(\forall\) i=1(1)m, \(\forall\) j=1(1)n.
Define \(\phi(X_i, Y_j) = I(X_i > Y_j)\).
Then the Mann-Whitney statistic U is defined as follows:
U = \(\sum_{i=1}^{m} \sum_{j=1}^{n} \phi(X_i, Y_j)\)
H0 will be rejected in favor of H1 for small values of U, since a positive shift \(\theta\) makes the Y's tend to exceed the X's. The critical regions are adjusted correspondingly for the other alternative hypotheses.
We will show, later on, that U is a distribution free statistic, under H0.
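As a quick sanity check, \(U\) can be computed directly from its definition; for continuous data (hence no ties), the \(W\) statistic reported by R's built-in wilcox.test coincides with it.
Code
set.seed(1)
x <- rnorm(5)                        # sample from F
y <- rnorm(7)                        # sample from G
U <- sum(outer(x, y, ">"))           # U = #{(i, j) : X_i > Y_j}
U
wilcox.test(x, y)$statistic          # W equals U when there are no ties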
DISTRIBUTION FREE?
For continuous distributions
Under null hypothesis
Here we will show that the Mann-Whitney statistic is distribution-free under the null hypothesis, for different choices of continuous distributions and different choices of sample sizes. The Normal, Cauchy, Laplace and Logistic distributions are used to prove and disprove that the statistic is distribution-free under the null and alternative hypotheses, respectively.
Next we will show that the statistic is definitely not distribution-free under the alternative hypothesis under consideration, for the same choices of continuous distributions and sample sizes.
For discrete distributions
Here we will show that, for different choices of discrete distributions and sample sizes, the Mann-Whitney statistic is not distribution-free, either under the null hypothesis or under the alternative considered here: with discrete data, ties occur with positive probability, which breaks the ranking argument behind distribution-freeness. The Binomial, Poisson and Geometric distributions are used for this purpose.
We then confirm that the statistic is likewise not distribution-free under the alternative hypothesis, for different choices of discrete distributions and sample sizes.
Exact distribution of the statistic under some fixed sample sizes
Here we calculate the small-sample exact probability distribution of the Mann-Whitney U statistic for some small sample sizes and observe its behaviour. The statistic takes values from 0 to mn, both boundaries inclusive.
In this context, we consider the recursion \(d(m, n, k) = d(m, n-1, k) + d(m-1, n, k-n)\), where \(d(m, n, k)\) is the number of arrangements of m X's and n Y's leading to \(U = k\): the largest of the \(m+n\) observations is either a Y, contributing nothing to U, or an X, which exceeds all n Y's and contributes n. For example, \(d(2, 2, 2) = d(2, 1, 2) + d(1, 2, 0) = 1 + 1 = 2\), giving \(P(U = 2) = 2/6\) when m = n = 2.
\(\textbf{m=2, n=2}\)
Code
## small sample exact prob dist under H0, via the probability recursion
## p(m, n; k) = (m/(m+n)) * p(m-1, n; k) + (n/(m+n)) * p(m, n-1; k-m),
## where the array index k = U + 1 runs over 1:(m*n + 1); under H0 this is
## equivalent, by the symmetry U <-> mn - U, to the count recursion above.
u_dist <- function(m, n) {
  if (m == 0 || n == 0) return(1)          # U = 0 with probability 1
  p <- u_dist(m, n - 1)                    # distribution for sizes (m, n-1)
  q <- u_dist(m - 1, n)                    # distribution for sizes (m-1, n)
  r <- numeric(m * n + 1)
  for (k in 1:(m * n + 1)) {
    if (k <= (m - 1) * n + 1) r[k] <- r[k] + (m / (m + n)) * q[k]
    if (k > m)                r[k] <- r[k] + (n / (m + n)) * p[k - m]
  }
  r
}
## tabulate the distributions for all sample sizes up to (4, 4),
## reproducing prob22, ..., prob44 of the original script
for (m in 1:4) for (n in 1:4) {
  assign(sprintf("prob%d%d", m, n),
         data.frame(values = 0:(m * n), probability = u_dist(m, n)))
}
prob22
Under the appropriate assumptions, \(U^{*} = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} \phi(X_i, Y_j)\) is the two-sample U-statistic corresponding to the Mann-Whitney statistic. We know that, provided all moments of the appropriate order exist, \(\sqrt{N}\,[U^{*} - \gamma(F, G)] \xrightarrow{D} \mathcal{N}\left(0, \frac{r^{2} \xi_{10}}{\lambda} + \frac{s^{2} \xi_{01}}{1 - \lambda}\right)\) as \(N \to \infty\), where \((r, s)\) is the degree of the U-statistic (here \(r = s = 1\)), \(\frac{m}{N} \to \lambda\), \(\gamma(F, G) = P(X > Y)\), and \(\xi_{cd}\) denotes the covariance between two kernels of the U-statistic having \(c\) of the \(X\) observations and \(d\) of the \(Y\) observations in common. Under the null hypothesis, \(\gamma(F, F) = \frac{1}{2}\) and \(\xi_{10} = \xi_{01} = \frac{1}{12}\). Here we will show the Q-Q plots and verify the asymptotic normality of the U-statistic corresponding to the Mann-Whitney statistic, considering different continuous distributions, namely Cauchy, Laplace and Logistic, separately in the hypotheses.
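Concretely, since the kernel here has degree \(r = s = 1\), plugging the null values into the limiting variance gives
\[
\sqrt{N}\left(U^{*} - \frac{1}{2}\right) \xrightarrow{D} \mathcal{N}\left(0, \frac{1}{12\lambda} + \frac{1}{12(1-\lambda)}\right) = \mathcal{N}\left(0, \frac{1}{12\,\lambda(1-\lambda)}\right),
\]
which agrees with the familiar exact null moments \(E(U) = \frac{mn}{2}\) and \(Var(U) = \frac{mn(m+n+1)}{12}\), since \(N \cdot Var(U^{*}) = \frac{N(m+n+1)}{12mn} \to \frac{1}{12\lambda(1-\lambda)}\).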
From the Q-Q plots we get an apparent idea that the asymptotic behaviour of the non-parametric test statistic resembles that of a normal distribution. For a stronger mathematical verification, we carry out a goodness-of-fit procedure, namely the non-parametric Kolmogorov-Smirnov test for normality.
Under the null hypothesis of this goodness-of-fit test, the statistic follows a normal distribution asymptotically; the two-sided alternative asserts that it does not. We reject the null hypothesis for small p-values; that is, large p-values let us conclude that the test statistic is asymptotically normal.
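The helper check_normal used below is not displayed in this section; a minimal sketch consistent with the reported output (a two-sample Kolmogorov-Smirnov test of the standardised statistic against draws from N(0,1)), and assuming the same sampling() generator used in the power simulations later, could be:
Code
check_normal <- function(distn, m, n, iter = 200) {
  N <- m + n
  lambda <- m / N
  a <- replicate(iter, {
    x <- sampling(dist = distn, par1 = 0, par2 = 1, size = m)
    y <- sampling(dist = distn, par1 = 0, par2 = 1, size = n)
    ustar <- mean(outer(x, y, ">"))                    # U* = U / (mn)
    sqrt(N) * (ustar - 0.5) * sqrt(12 * lambda * (1 - lambda))
  })
  qqnorm(a); qqline(a)                                 # visual normality check
  ks.test(a, rnorm(m + n, 0, 1))                       # as in the output below
}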
Cauchy Distribution
\(\textbf{m=11, n=15}\)
Code
check_normal("cauchy",11,15)
Asymptotic two-sample Kolmogorov-Smirnov test
data: a and rnorm(m + n, 0, 1)
D = 0.18669, p-value = 0.3402
alternative hypothesis: two-sided
\(\textbf{m=24, n=18}\)
Code
check_normal("cauchy",24,18)
Asymptotic two-sample Kolmogorov-Smirnov test
data: a and rnorm(m + n, 0, 1)
D = 0.15233, p-value = 0.3069
alternative hypothesis: two-sided
Logistic Distribution
\(\textbf{m=11, n=15}\)
Code
check_normal("logistic",11,15)
Asymptotic two-sample Kolmogorov-Smirnov test
data: a and rnorm(m + n, 0, 1)
D = 0.11685, p-value = 0.8795
alternative hypothesis: two-sided
\(\textbf{m=24, n=18}\)
Code
check_normal("logistic",24,18)
Asymptotic two-sample Kolmogorov-Smirnov test
data: a and rnorm(m + n, 0, 1)
D = 0.15343, p-value = 0.2988
alternative hypothesis: two-sided
Laplace Distribution
\(\textbf{m=11, n=15}\)
Code
check_normal("laplace",11,15)
Asymptotic two-sample Kolmogorov-Smirnov test
data: a and rnorm(m + n, 0, 1)
D = 0.18062, p-value = 0.3801
alternative hypothesis: two-sided
\(\textbf{m=24, n=18}\)
Code
check_normal("laplace",24,18)
Asymptotic two-sample Kolmogorov-Smirnov test
data: a and rnorm(m + n, 0, 1)
D = 0.083333, p-value = 0.9423
alternative hypothesis: two-sided
CONCLUSION
From the Q-Q plots and the Kolmogorov-Smirnov tests, we conclude that the Mann-Whitney U statistic asymptotically follows a normal distribution, irrespective of the underlying continuous distributions in the corresponding hypotheses, and hence the test procedure is asymptotically distribution-free.
CONSISTENT? CHECK POWER CURVES varying location and sample sizes
Power curves
Here we plot the power curves, considering the CDFs of the Normal, Cauchy, Laplace and Logistic distributions separately as F and G, under the null and alternative hypotheses H0: \(\theta\) = 0 and H1: \(\theta\) > 0, following the notation introduced above.
Plotting the power curves, we observe that for larger sample sizes the power of the test approaches 1 faster, as expected, and that the behavioural pattern of the curves differs across distributions.
Also observe that all the tests are unbiased for large enough sample sizes.
Under alternative: H1: \(\theta\) > 0
Normal Distribution
Code
## Power of the Mann-Whitney test against right-shift alternatives
Mann_WhitneyPowerRight <- function(distn, size1, size2, alpha = 0.05,
                                   iter = 1000, rge = 3, ...) {
  effect_sizes <- seq(0, rge, by = 0.1)
  power_values <- numeric(length(effect_sizes))
  for (i in 1:length(effect_sizes)) {
    reject <- 0
    for (j in 1:iter) {
      sample1 <- sampling(dist = distn, par1 = 0, par2 = 1, size = size1)
      sample2 <- sampling(dist = distn, par1 = 0 + effect_sizes[i], par2 = 1,
                          size = size2)  # second sample of size2
      if (wilcox.test(sample1, sample2)$p.value < alpha) reject <- reject + 1
    }
    power_values[i] <- reject / iter
  }
  return(list(a = effect_sizes, b = power_values))
}
## Mann_WhitneyPowerRight(distn = "normal", size1 = 12, size2 = 15)

## normal
pov <- data.frame(
  x1 = rep(seq(0, 3, by = 0.1), 4),
  y1 = c(Mann_WhitneyPowerRight(distn = "normal", size1 = 4,  size2 = 7)$b,
         Mann_WhitneyPowerRight(distn = "normal", size1 = 11, size2 = 15)$b,
         Mann_WhitneyPowerRight(distn = "normal", size1 = 24, size2 = 18)$b,
         Mann_WhitneyPowerRight(distn = "normal", size1 = 30, size2 = 33)$b),
  size = rep(c("m=4, n=7", "m=11, n=15", "m=24, n=18", "m=30, n=33"),
             each = length(seq(0, 3, by = 0.1))))
ggplot(data = pov, aes(x = x1, y = y1, col = size)) +
  geom_line(lwd = 0.9) +
  labs(x = expression(Theta), y = "Power") +
  ggtitle(expression(paste("Power Function for alternative distribution ",
                           N(theta, 1), " for ", theta > 0))) +
  theme_stata() +
  geom_hline(yintercept = 0.05, linetype = "dashed", lwd = 0.5)
Cauchy Distribution
Code
pov1 <- data.frame(
  x1 = rep(seq(0, 5, by = 0.1), 4),
  y1 = c(Mann_WhitneyPowerRight(distn = "cauchy", size1 = 4,  size2 = 7,  rge = 5)$b,
         Mann_WhitneyPowerRight(distn = "cauchy", size1 = 11, size2 = 15, rge = 5)$b,
         Mann_WhitneyPowerRight(distn = "cauchy", size1 = 24, size2 = 18, rge = 5)$b,
         Mann_WhitneyPowerRight(distn = "cauchy", size1 = 33, size2 = 30, rge = 5)$b),
  size = rep(c("m=4, n=7", "m=11, n=15", "m=24, n=18", "m=33, n=30"),
             each = length(seq(0, 5, by = 0.1))))
ggplot(data = pov1, aes(x = x1, y = y1, col = size)) +
  geom_line(lwd = 0.9) +
  labs(x = expression(Theta), y = "Power") +
  ggtitle(expression(paste("Power Function for alternative distribution ",
                           C(theta, 1), " for ", theta > 0))) +
  theme_stata() +
  geom_hline(yintercept = 0.05, linetype = "dashed", lwd = 0.5)
Laplace Distribution
Code
pov2 <- data.frame(
  x1 = rep(seq(0, 5, by = 0.1), 4),
  y1 = c(Mann_WhitneyPowerRight(distn = "laplace", size1 = 4,  size2 = 7,  rge = 5)$b,
         Mann_WhitneyPowerRight(distn = "laplace", size1 = 11, size2 = 15, rge = 5)$b,
         Mann_WhitneyPowerRight(distn = "laplace", size1 = 24, size2 = 18, rge = 5)$b,
         Mann_WhitneyPowerRight(distn = "laplace", size1 = 33, size2 = 30, rge = 5)$b),
  size = rep(c("m=4, n=7", "m=11, n=15", "m=24, n=18", "m=33, n=30"),
             each = length(seq(0, 5, by = 0.1))))
ggplot(data = pov2, aes(x = x1, y = y1, col = size)) +
  geom_line(lwd = 0.9) +
  labs(x = expression(Theta), y = "Power") +
  ggtitle(expression(paste("Power Function for alternative distribution ",
                           Laplace(theta, 1), " for ", theta > 0))) +
  theme_stata() +
  geom_hline(yintercept = 0.05, linetype = "dashed", lwd = 0.5)
Logistic Distribution
Code
pov3 <- data.frame(
  x1 = rep(seq(0, 5, by = 0.1), 4),
  y1 = c(Mann_WhitneyPowerRight(distn = "logistic", size1 = 4,  size2 = 7,  rge = 5)$b,
         Mann_WhitneyPowerRight(distn = "logistic", size1 = 11, size2 = 15, rge = 5)$b,
         Mann_WhitneyPowerRight(distn = "logistic", size1 = 24, size2 = 18, rge = 5)$b,
         Mann_WhitneyPowerRight(distn = "logistic", size1 = 33, size2 = 30, rge = 5)$b),
  size = rep(c("m=4, n=7", "m=11, n=15", "m=24, n=18", "m=33, n=30"),
             each = length(seq(0, 5, by = 0.1))))
ggplot(data = pov3, aes(x = x1, y = y1, col = size)) +
  geom_line(lwd = 0.9) +
  labs(x = expression(Theta), y = "Power") +
  ggtitle(expression(paste("Power Function for alternative distribution ",
                           Logistic(theta, 1), " for ", theta > 0))) +
  theme_stata() +
  geom_hline(yintercept = 0.05, linetype = "dashed", lwd = 0.5)
Under alternative: H2: \(\theta\) < 0
Normal Distribution
Code
## Power of the Mann-Whitney test against left-shift alternatives
Mann_WhitneyPowerleft <- function(distn, size1, size2, alpha = 0.05,
                                  iter = 1000, rge = -3, ...) {
  effect_sizes <- seq(rge, 0, by = 0.1)
  power_values <- numeric(length(effect_sizes))
  for (i in 1:length(effect_sizes)) {
    reject <- 0
    for (j in 1:iter) {
      sample1 <- sampling(dist = distn, par1 = 0, par2 = 1, size = size1)
      sample2 <- sampling(dist = distn, par1 = 0 + effect_sizes[i], par2 = 1,
                          size = size2)  # second sample of size2
      if (wilcox.test(sample1, sample2)$p.value < alpha) reject <- reject + 1
    }
    power_values[i] <- reject / iter
  }
  return(list(a = effect_sizes, b = power_values))
}

## normal
pov <- data.frame(
  x1 = rep(seq(-3, 0, by = 0.1), 4),
  y1 = c(Mann_WhitneyPowerleft(distn = "normal", size1 = 4,  size2 = 7)$b,
         Mann_WhitneyPowerleft(distn = "normal", size1 = 11, size2 = 15)$b,
         Mann_WhitneyPowerleft(distn = "normal", size1 = 24, size2 = 18)$b,
         Mann_WhitneyPowerleft(distn = "normal", size1 = 30, size2 = 33)$b),
  size = rep(c("m=4, n=7", "m=11, n=15", "m=24, n=18", "m=30, n=33"),
             each = length(seq(-3, 0, by = 0.1))))
ggplot(data = pov, aes(x = x1, y = y1, col = size)) +
  geom_line(lwd = 0.9) +
  labs(x = expression(Theta), y = "Power") +
  ggtitle(expression(paste("Power Function for alternative distribution ",
                           N(theta, 1), " for ", theta < 0))) +
  theme_stata() +
  geom_hline(yintercept = 0.05, linetype = "dashed", lwd = 0.5)
Cauchy Distribution
Code
pov1 <- data.frame(
  x1 = rep(seq(-5, 0, by = 0.1), 4),
  y1 = c(Mann_WhitneyPowerleft(distn = "cauchy", size1 = 4,  size2 = 7,  rge = -5)$b,
         Mann_WhitneyPowerleft(distn = "cauchy", size1 = 11, size2 = 15, rge = -5)$b,
         Mann_WhitneyPowerleft(distn = "cauchy", size1 = 24, size2 = 18, rge = -5)$b,
         Mann_WhitneyPowerleft(distn = "cauchy", size1 = 33, size2 = 30, rge = -5)$b),
  size = rep(c("m=4, n=7", "m=11, n=15", "m=24, n=18", "m=33, n=30"),
             each = length(seq(-5, 0, by = 0.1))))
ggplot(data = pov1, aes(x = x1, y = y1, col = size)) +
  geom_line(lwd = 0.9) +
  labs(x = expression(Theta), y = "Power") +
  ggtitle(expression(paste("Power Function for alternative distribution ",
                           C(theta, 1), " for ", theta < 0))) +
  theme_stata() +
  geom_hline(yintercept = 0.05, linetype = "dashed", lwd = 0.5)
Laplace Distribution
Code
pov2 <- data.frame(
  x1 = rep(seq(-5, 0, by = 0.1), 4),
  y1 = c(Mann_WhitneyPowerleft(distn = "laplace", size1 = 4,  size2 = 7,  rge = -5)$b,
         Mann_WhitneyPowerleft(distn = "laplace", size1 = 11, size2 = 15, rge = -5)$b,
         Mann_WhitneyPowerleft(distn = "laplace", size1 = 24, size2 = 18, rge = -5)$b,
         Mann_WhitneyPowerleft(distn = "laplace", size1 = 33, size2 = 30, rge = -5)$b),
  size = rep(c("m=4, n=7", "m=11, n=15", "m=24, n=18", "m=33, n=30"),
             each = length(seq(-5, 0, by = 0.1))))
ggplot(data = pov2, aes(x = x1, y = y1, col = size)) +
  geom_line(lwd = 0.9) +
  labs(x = expression(Theta), y = "Power") +
  ggtitle(expression(paste("Power Function for alternative distribution ",
                           Laplace(theta, 1), " for ", theta < 0))) +
  theme_stata() +
  geom_hline(yintercept = 0.05, linetype = "dashed", lwd = 0.5)
Logistic Distribution
Code
pov3 <- data.frame(
  x1 = rep(seq(-5, 0, by = 0.1), 4),
  y1 = c(Mann_WhitneyPowerleft(distn = "logistic", size1 = 4,  size2 = 7,  rge = -5)$b,
         Mann_WhitneyPowerleft(distn = "logistic", size1 = 11, size2 = 15, rge = -5)$b,
         Mann_WhitneyPowerleft(distn = "logistic", size1 = 24, size2 = 18, rge = -5)$b,
         Mann_WhitneyPowerleft(distn = "logistic", size1 = 33, size2 = 30, rge = -5)$b),
  size = rep(c("m=4, n=7", "m=11, n=15", "m=24, n=18", "m=33, n=30"),
             each = length(seq(-5, 0, by = 0.1))))
ggplot(data = pov3, aes(x = x1, y = y1, col = size)) +
  geom_line(lwd = 0.9) +
  labs(x = expression(Theta), y = "Power") +
  ggtitle(expression(paste("Power Function for alternative distribution ",
                           Logistic(theta, 1), " for ", theta < 0))) +
  theme_stata() +
  geom_hline(yintercept = 0.05, linetype = "dashed", lwd = 0.5)
Under alternative: H3: \(\theta\)\(\neq\) 0
Normal Distribution
Code
## Power of the Mann-Whitney test against two-sided alternatives
Mann_WhitneyPowerboth <- function(distn, size1, size2, alpha = 0.05,
                                  iter = 1000, rge1 = -3, rge2 = 3, ...) {
  effect_sizes <- seq(rge1, rge2, by = 0.1)
  power_values <- numeric(length(effect_sizes))
  for (i in 1:length(effect_sizes)) {
    reject <- 0
    for (j in 1:iter) {
      sample1 <- sampling(dist = distn, par1 = 0, par2 = 1, size = size1)
      sample2 <- sampling(dist = distn, par1 = 0 + effect_sizes[i], par2 = 1,
                          size = size2)  # second sample of size2
      if (wilcox.test(sample1, sample2)$p.value < alpha) reject <- reject + 1
    }
    power_values[i] <- reject / iter
  }
  return(list(a = effect_sizes, b = power_values))
}

## normal
pov <- data.frame(
  x1 = rep(seq(-3, 3, by = 0.1), 4),
  y1 = c(Mann_WhitneyPowerboth(distn = "normal", size1 = 4,  size2 = 7)$b,
         Mann_WhitneyPowerboth(distn = "normal", size1 = 11, size2 = 15)$b,
         Mann_WhitneyPowerboth(distn = "normal", size1 = 24, size2 = 18)$b,
         Mann_WhitneyPowerboth(distn = "normal", size1 = 30, size2 = 33)$b),
  size = rep(c("m=4, n=7", "m=11, n=15", "m=24, n=18", "m=30, n=33"),
             each = length(seq(-3, 3, by = 0.1))))
ggplot(data = pov, aes(x = x1, y = y1, col = size)) +
  geom_line(lwd = 0.9) +
  labs(x = expression(Theta), y = "Power") +
  ggtitle(expression(paste("Power Function for alternative distribution ",
                           N(theta, 1), " for ", theta != 0))) +
  theme_stata() +
  geom_hline(yintercept = 0.05, linetype = "dashed", lwd = 0.5)
Cauchy Distribution
Code
pov1 <- data.frame(
  x1 = rep(seq(-5, 5, by = 0.1), 4),
  y1 = c(Mann_WhitneyPowerboth(distn = "cauchy", size1 = 4,  size2 = 7,  rge1 = -5, rge2 = 5)$b,
         Mann_WhitneyPowerboth(distn = "cauchy", size1 = 11, size2 = 15, rge1 = -5, rge2 = 5)$b,
         Mann_WhitneyPowerboth(distn = "cauchy", size1 = 24, size2 = 18, rge1 = -5, rge2 = 5)$b,
         Mann_WhitneyPowerboth(distn = "cauchy", size1 = 33, size2 = 30, rge1 = -5, rge2 = 5)$b),
  size = rep(c("m=4, n=7", "m=11, n=15", "m=24, n=18", "m=33, n=30"),
             each = length(seq(-5, 5, by = 0.1))))
ggplot(data = pov1, aes(x = x1, y = y1, col = size)) +
  geom_line(lwd = 0.9) +
  labs(x = expression(Theta), y = "Power") +
  ggtitle(expression(paste("Power Function for alternative distribution ",
                           C(theta, 1), " for ", theta != 0))) +
  theme_stata() +
  geom_hline(yintercept = 0.05, linetype = "dashed", lwd = 0.5)
Laplace Distribution
Code
pov2 <- data.frame(
  x1 = rep(seq(-3, 3, by = 0.1), 4),
  y1 = c(Mann_WhitneyPowerboth(distn = "laplace", size1 = 4,  size2 = 7,  rge1 = -3, rge2 = 3)$b,
         Mann_WhitneyPowerboth(distn = "laplace", size1 = 11, size2 = 15, rge1 = -3, rge2 = 3)$b,
         Mann_WhitneyPowerboth(distn = "laplace", size1 = 24, size2 = 18, rge1 = -3, rge2 = 3)$b,
         Mann_WhitneyPowerboth(distn = "laplace", size1 = 33, size2 = 30, rge1 = -3, rge2 = 3)$b),
  size = rep(c("m=4, n=7", "m=11, n=15", "m=24, n=18", "m=33, n=30"),
             each = length(seq(-3, 3, by = 0.1))))
ggplot(data = pov2, aes(x = x1, y = y1, col = size)) +
  geom_line(lwd = 0.9) +
  labs(x = expression(Theta), y = "Power") +
  ggtitle(expression(paste("Power Function for alternative distribution ",
                           Laplace(theta, 1), " for ", theta != 0))) +
  theme_stata() +
  geom_hline(yintercept = 0.05, linetype = "dashed", lwd = 0.5)
Logistic Distribution
Code
pov3 <- data.frame(
  x1 = rep(seq(-5, 5, by = 0.1), 4),
  y1 = c(Mann_WhitneyPowerboth(distn = "logistic", size1 = 4,  size2 = 7,  rge1 = -5, rge2 = 5)$b,
         Mann_WhitneyPowerboth(distn = "logistic", size1 = 11, size2 = 15, rge1 = -5, rge2 = 5)$b,
         Mann_WhitneyPowerboth(distn = "logistic", size1 = 24, size2 = 18, rge1 = -5, rge2 = 5)$b,
         Mann_WhitneyPowerboth(distn = "logistic", size1 = 33, size2 = 30, rge1 = -5, rge2 = 5)$b),
  size = rep(c("m=4, n=7", "m=11, n=15", "m=24, n=18", "m=33, n=30"),
             each = length(seq(-5, 5, by = 0.1))))
ggplot(data = pov3, aes(x = x1, y = y1, col = size)) +
  geom_line(lwd = 0.9) +
  labs(x = expression(Theta), y = "Power") +
  ggtitle(expression(paste("Power Function for alternative distribution ",
                           Logistic(theta, 1), " for ", theta != 0))) +
  theme_stata() +
  geom_hline(yintercept = 0.05, linetype = "dashed", lwd = 0.5)
POWER COMPARISON: PARAMETRIC & NON-PARAMETRIC
Power comparison between t-test and Mann-Whitney U test
Under right-sided alternative, H1: \(\theta\) > 0
Normal Distribution
Here we will perform the Mann-Whitney U test, taking the CDFs of normal distributions in both hypotheses; we will then perform the corresponding parametric t-test for the normal location parameters.
Observing and comparing the power curves, we will conclude that the t-test produces better results than the aforementioned non-parametric test, in terms of power and consistency.
Non-Normal Distributions
Here we will perform the Mann-Whitney U test, taking the CDFs of some non-normal distributions in both hypotheses; we will then force the corresponding parametric t-test onto the same data.
Observing and comparing the power curves, we will conclude that the Mann-Whitney U test produces better results than the t-test for all the non-normal distributions, in terms of power, consistency and unbiasedness.
We also observe that, for large enough sample sizes, the non-parametric test brings forth equivalent results for the Logistic and Laplace distributions and better results for the Cauchy distribution, as expected from the asymptotic distribution of the Mann-Whitney statistic.
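The helpers parametricPlotright and parametricPlotleft invoked below are not displayed in this section. A minimal sketch of the right-sided one, consistent with the two-sided parametricPlot defined further below (and assuming the sampling() helper and ggplot2 are already loaded), could be:
Code
parametricright <- function(dt, m, n, alpha = 0.05, iter = 1000, rge = 5) {
  locate <- seq(0, rge, by = 0.1)
  sapply(locate, function(th) {
    mean(replicate(iter, {
      a <- sampling(dist = dt, par1 = 0,  par2 = 1, size = m)
      b <- sampling(dist = dt, par1 = th, par2 = 1, size = n)
      t.test(a, b, var.equal = TRUE)$p.value < alpha
    }))
  })
}
parametricPlotright <- function(dis, sz1, sz2) {
  th <- seq(0, 5, by = 0.1)
  pov <- data.frame(
    x1 = rep(th, 2),
    y1 = c(parametricright(dis, sz1, sz2),
           Mann_WhitneyPowerRight(distn = dis, size1 = sz1, size2 = sz2, rge = 5)$b),
    Test = rep(c("t-test", "Mann-Whitney test"), each = length(th)))
  ggplot(pov, aes(x = x1, y = y1, col = Test)) +
    geom_line(lwd = 0.9) +
    labs(x = expression(theta), y = "Power") +
    geom_hline(yintercept = 0.05, linetype = "dashed", lwd = 0.5)
}
parametricPlotleft would be analogous, with the grid seq(-5, 0, by = 0.1) and Mann_WhitneyPowerleft in place of the right-sided pieces.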
Cauchy Distribution
\(\textbf{m=4, n=7}\)
Code
parametricPlotright("cauchy",4,7)
\(\textbf{m=11, n=15}\)
Code
parametricPlotright("cauchy",11,15)
\(\textbf{m=24, n=18}\)
Code
parametricPlotright("cauchy",24,18)
\(\textbf{m=33, n=30}\)
Code
parametricPlotright("cauchy",33,30)
Laplace Distribution
\(\textbf{m=4, n=7}\)
Code
parametricPlotright("laplace",4,7)
\(\textbf{m=11, n=15}\)
Code
parametricPlotright("laplace",11,15)
\(\textbf{m=24, n=18}\)
Code
parametricPlotright("laplace",24,18)
\(\textbf{m=33, n=30}\)
Code
parametricPlotright("laplace",33,30)
Logistic Distribution
\(\textbf{m=4, n=7}\)
Code
parametricPlotright("logistic",4,7)
\(\textbf{m=11, n=15}\)
Code
parametricPlotright("logistic",11,15)
\(\textbf{m=24, n=18}\)
Code
parametricPlotright("logistic",24,18)
\(\textbf{m=33, n=30}\)
Code
parametricPlotright("logistic",33,30)
Under left-sided alternative, H2: \(\theta\) < 0
Normal Distribution
Similarly, we perform the Mann-Whitney U test again, taking the CDFs of normal distributions in both hypotheses, and then perform the corresponding parametric test.
We will then observe that the t-test produces better results than the aforementioned non-parametric test, in terms of accuracy.
Non-Normal Distributions
Here we will perform the Mann-Whitney U test, taking the CDFs of some non-normal distributions in both hypotheses; we will then force the corresponding parametric t-test onto the same data.
Observing and comparing the power curves, we will conclude that the Mann-Whitney U test produces better results than the t-test for all the non-normal distributions, in terms of power, consistency and unbiasedness. We also observe that, for large enough sample sizes, the non-parametric test brings forth equivalent results for the Logistic and Laplace distributions and better results for the Cauchy distribution, as expected from the asymptotic distribution of the Mann-Whitney statistic.
Cauchy Distribution
\(\textbf{m=4, n=7}\)
Code
parametricPlotleft("cauchy",4,7)
\(\textbf{m=11, n=15}\)
Code
parametricPlotleft("cauchy",11,15)
\(\textbf{m=24, n=18}\)
Code
parametricPlotleft("cauchy",24,18)
\(\textbf{m=33, n=30}\)
Code
parametricPlotleft("cauchy",33,30)
Laplace Distribution
\(\textbf{m=4, n=7}\)
Code
parametricPlotleft("laplace",4,7)
\(\textbf{m=11, n=15}\)
Code
parametricPlotleft("laplace",11,15)
\(\textbf{m=24, n=18}\)
Code
parametricPlotleft("laplace",24,18)
\(\textbf{m=33, n=30}\)
Code
parametricPlotleft("laplace",33,30)
Logistic Distribution
\(\textbf{m=4, n=7}\)
Code
parametricPlotleft("logistic",4,7)
\(\textbf{m=11, n=15}\)
Code
parametricPlotleft("logistic",11,15)
\(\textbf{m=24, n=18}\)
Code
parametricPlotleft("logistic",24,18)
\(\textbf{m=33, n=30}\)
Code
parametricPlotleft("logistic",33,30)
Under two-sided alternative, H3: \(\theta \neq\) 0
Normal Distribution
Here again we perform the Mann-Whitney U test, taking the CDFs of normal distributions in both hypotheses, and then perform the corresponding parametric test.
We will then observe that the t-test produces better results than the aforementioned non-parametric test in terms of accuracy, though asymptotically both tests produce similar results, owing to the asymptotic behaviour of the U-statistic.
\(\textbf{m=4, n=7}\)
Code
## parametric counterpart:
parametricboth <- function(dt, m, n, alpha = 0.05, iter = 1000, ...) {
  par_power <- c()
  locate <- seq(-5, 5, by = 0.1)
  for (j in 1:length(locate)) {
    pr <- 0
    for (i in 1:iter) {
      a <- sampling(dist = dt, par1 = 0, par2 = 1, size = m)
      b <- sampling(dist = dt, par1 = locate[j], par2 = 1, size = n)
      if (t.test(a, b, var.equal = TRUE)$p.value < alpha) pr <- pr + 1
    }
    par_power <- c(par_power, pr / iter)
  }
  return(par_power)
}

parametricPlot <- function(dis, sz1, sz2, ...) {
  pov <- data.frame(
    x1 = rep(seq(-5, 5, by = 0.1), 2),
    y1 = c(parametricboth(dt = dis, sz1, sz2),
           Mann_WhitneyPowerboth(distn = dis, size1 = sz1, size2 = sz2,
                                 rge1 = -5, rge2 = 5)$b),
    Test = rep(c("t-test", "Mann-Whitney test"),
               each = length(seq(-5, 5, by = 0.1))))
  ggplot(data = pov, aes(x = x1, y = y1, col = Test)) +
    geom_line(lwd = 0.9) +
    labs(x = expression(theta), y = "Power") +
    ggtitle(bquote("Power comparison" ~ theta != 0 ~ "under" ~ .(dis) ~
                   "with sizes (" ~ .(sz1) ~ "," ~ .(sz2) ~ ")")) +
    theme_economist() +
    geom_hline(yintercept = 0.05, linetype = "dashed", lwd = 0.5)
}
parametricPlot(dis = "normal", sz1 = 4, sz2 = 7)
\(\textbf{m=11, n=15}\)
Code
parametricPlot(dis="normal",sz1=11,sz2=15)
\(\textbf{m=24, n=18}\)
Code
parametricPlot(dis="normal",sz1=24,sz2=18)
\(\textbf{m=33, n=30}\)
Code
parametricPlot(dis="normal",sz1=33,sz2=30)
Non-Normal Distributions
Here we will perform the Mann-Whitney U test, taking the CDFs of some non-normal distributions in both hypotheses; we will then force the corresponding parametric t-test onto the same data.
Observing and comparing the power curves, we will conclude that the Mann-Whitney U test produces better results than the t-test for all the non-normal distributions, in terms of power, consistency and unbiasedness.
We also observe that, for large enough sample sizes, the non-parametric test brings forth equivalent results for the Logistic and Laplace distributions and better results for the Cauchy distribution, as expected from the asymptotic distribution of the Mann-Whitney statistic.
Cauchy Distribution
\(\textbf{m=4, n=7}\)
Code
parametricPlot(dis="cauchy",sz1=4,sz2=7)
\(\textbf{m=11, n=15}\)
Code
parametricPlot(dis="cauchy",sz1=11,sz2=15)
\(\textbf{m=24, n=18}\)
Code
parametricPlot(dis="cauchy",sz1=24,sz2=18)
\(\textbf{m=33, n=30}\)
Code
parametricPlot(dis="cauchy",sz1=33,sz2=30)
Laplace Distribution
\(\textbf{m=4, n=7}\)
Code
parametricPlot(dis="laplace",sz1=4,sz2=7)
\(\textbf{m=11, n=15}\)
Code
parametricPlot(dis="laplace",sz1=11,sz2=15)
\(\textbf{m=24, n=18}\)
Code
parametricPlot(dis="laplace",sz1=24,sz2=18)
\(\textbf{m=33, n=30}\)
Code
parametricPlot(dis="laplace",sz1=33,sz2=30)
Logistic Distribution
\(\textbf{m=4, n=7}\)
Code
parametricPlot(dis="logistic",sz1=4,sz2=7)
\(\textbf{m=11, n=15}\)
Code
parametricPlot(dis="logistic",sz1=11,sz2=15)
\(\textbf{m=24, n=18}\)
Code
parametricPlot(dis="logistic",sz1=24,sz2=18)
\(\textbf{m=33, n=30}\)
Code
parametricPlot(dis="logistic",sz1=33,sz2=30)
UNBIASED? CHECK POWER CURVES varying level of significance
Power curves for different \(\alpha\)’s
Now we plot the power curves for the two-sided alternative H3: \(\theta \neq 0\) for different sizes (significance levels \(\alpha\)) and observe that the test is unbiased: the power never falls below the corresponding level. Similarly, the test can be shown to be unbiased for the one-sided alternatives.
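One compact way to tabulate the power at several levels \(\alpha\) (a sketch using the same sampling() helper; the levels shown are illustrative) is to simulate the p-values once and threshold them at each level:
Code
power_alpha <- function(distn, m, n, theta,
                        alphas = c(0.01, 0.05, 0.10), iter = 1000) {
  pvals <- replicate(iter, {
    x <- sampling(dist = distn, par1 = 0,     par2 = 1, size = m)
    y <- sampling(dist = distn, par1 = theta, par2 = 1, size = n)
    wilcox.test(x, y)$p.value
  })
  sapply(alphas, function(a) mean(pvals < a))   # power >= alpha indicates unbiasedness
}
power_alpha("normal", 24, 18, theta = 1)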
Mixing noise from Uniform distribution and showing robustness
Here we generate data from some standard continuous distributions, namely Normal, Cauchy, Laplace and Logistic, with added noise from Uniform(-4, 4), and observe the behaviour of the Mann-Whitney statistic when the data contain outlying points.
We plot the power curves under the non-parametric and parametric set-ups for the two-sided alternative H3: \(\theta \neq 0\) for different sizes and observe that the test is robust. Similarly, the test can be shown to be robust for the one-sided alternatives.
Observing the power plots, we conclude that the non-parametric procedure brings forth better results, in terms of power and consistency, than the conventional parametric t-test. Although the parametric test gives somewhat better results for larger samples, overall the Mann-Whitney test is better, whatever the underlying distribution of the generated data.
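The noise-mixing step itself is not displayed here; one plausible implementation (a sketch: the contaminated fraction eps is our assumption, since the text does not specify how much of each sample receives the Uniform(-4, 4) noise) is:
Code
noisy_power <- function(distn, m, n, theta, eps = 0.1,
                        alpha = 0.05, iter = 1000) {
  rej <- c(MW = 0, t = 0)
  for (i in 1:iter) {
    x <- sampling(dist = distn, par1 = 0,     par2 = 1, size = m)
    y <- sampling(dist = distn, par1 = theta, par2 = 1, size = n)
    x <- x + rbinom(m, 1, eps) * runif(m, -4, 4)   # contaminate ~eps of each sample
    y <- y + rbinom(n, 1, eps) * runif(n, -4, 4)
    if (wilcox.test(x, y)$p.value < alpha)              rej["MW"] <- rej["MW"] + 1
    if (t.test(x, y, var.equal = TRUE)$p.value < alpha) rej["t"]  <- rej["t"] + 1
  }
  rej / iter
}
noisy_power("normal", 24, 18, theta = 1)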
POWER COMPARISON AMONG DIFFERENT DISTRIBUTIONS for fixed \(\theta\), fixed sample sizes
Under which distributional assumption, Mann-Whitney test performs the best?
Here we fix m=24, n=18 and then compare the power of the test across the different distributions, fixing the value of \(\theta\) in each case.
\(\theta\)=0.5
Code
## power for diff dist
Nonparametricpower <- function(m = 24, n = 18, level = 0.05, iter = 1000, thet, ...) {
  x1 <- x2 <- x3 <- x4 <- 0
  for (i in 1:iter) {
    a1 <- sampling(dist = "normal",   0, 1, size = m)
    b1 <- sampling(dist = "normal",   thet, 1, size = n)   # second sample of size n
    a2 <- sampling(dist = "cauchy",   0, 1, size = m)
    b2 <- sampling(dist = "cauchy",   thet, 1, size = n)
    a3 <- sampling(dist = "laplace",  0, 1, size = m)
    b3 <- sampling(dist = "laplace",  thet, 1, size = n)
    a4 <- sampling(dist = "logistic", 0, 1, size = m)
    b4 <- sampling(dist = "logistic", thet, 1, size = n)
    if (wilcox.test(a1, b1)$p.value < level) x1 <- x1 + 1
    if (wilcox.test(a2, b2)$p.value < level) x2 <- x2 + 1
    if (wilcox.test(a3, b3)$p.value < level) x3 <- x3 + 1
    if (wilcox.test(a4, b4)$p.value < level) x4 <- x4 + 1
  }
  table <- data.frame(power = c(x1, x2, x3, x4) / iter,
                      distribution = c("normal", "cauchy", "laplace", "logistic"))
  ggplot(data = table, aes(x = factor(distribution), y = power, group = 1)) +
    geom_point(col = "purple", size = 3) +
    geom_line(lwd = 0.9, col = "violet") +
    labs(x = "Distribution", y = "Power") +
    scale_y_continuous(limits = c(0, 1)) +
    ggtitle(bquote("Power comparisons for different distributions at " ~ theta == .(thet))) +
    theme_economist()
}
Nonparametricpower(thet = 0.5)
\(\theta\)=1
Code
Nonparametricpower(thet=1)
\(\theta\)=2
Code
Nonparametricpower(thet=2)
\(\theta\)=0
Code
Nonparametricpower(thet=0)
CONCLUSION
We observe that, on the basis of the power of the non-parametric test, the distributions rank in decreasing order of performance as: Normal, Laplace, Logistic, Cauchy.
A possible justification for this is the ordering of the tails, in decreasing order of heaviness: Cauchy, Laplace, Logistic, Normal; together with the ordering of the peaks: Laplace, Normal, Logistic, Cauchy.
Also, the test produces similar results across distributions at \(\theta = 0\) and for very large values of \(\theta\), which is quite justified: at \(\theta = 0\) the power equals the size, and for large shifts the power approaches 1 everywhere.
EFFICACY OF HODGES-LEHMANN ESTIMATOR
Comparison of MSEs between Hodges-Lehmann estimator and Maximum Likelihood Estimator
Under the aforementioned set-up, the Hodges-Lehmann estimator of the location shift is easily obtained: in the two-sample problem it is the median of the pairwise differences \(Y_j - X_i\), while its one-sample analogue, used in the simulations below, is the median of the Walsh averages \((X_i + X_k)/2\).
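As a quick check (a minimal sketch), the two-sample estimate can be computed directly from its definition and compared with the "difference in location" that wilcox.test reports when conf.int = TRUE:
Code
set.seed(7)
x <- rnorm(24)                               # sample from F
y <- rnorm(18, mean = 0.5)                   # sample from G, shifted by theta = 0.5
median(outer(y, x, "-"))                     # Hodges-Lehmann estimate of the shift
wilcox.test(y, x, conf.int = TRUE)$estimate  # R reports the same quantity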
Since the MLEs for both the Logistic and Cauchy distributions do not have closed forms, we compare the performance of the Hodges-Lehmann estimator with that of the maximum likelihood estimator. The motivation behind this comparison is to check whether the Hodges-Lehmann estimator performs better than the MLE; if so, we may use the Hodges-Lehmann estimator instead, since it has a closed form and can be obtained easily.
Normal Distribution
\(\textbf{m=24, n=18}\)
Code
library(ggplot2)
# MLE of the mean of a normal distribution
mle_mean <- function(data) {
  mean(data)
}
# One-sample Hodges-Lehmann estimator: median of the Walsh averages (x_i + x_j)/2
hl_walsh <- function(data) {
  walsh <- outer(data, data, "+") / 2
  median(walsh[upper.tri(walsh, diag = TRUE)])
}
# Squared-error loss of an estimate
mse <- function(estimates, true_value) {
  mean((estimates - true_value)^2)
}
# Simulation parameters (one replicate per sample size, so the curves are noisy)
set.seed(123)
sample_sizes <- seq(10, 1000, by = 10)
true_mean <- 0   # true mean of the normal distribution
# Simulate data and calculate the loss for each estimator
mse_results <- data.frame(Sample_Size = numeric(),
                          MLE = numeric(),
                          Hodges_Lehmann = numeric())
for (n in sample_sizes) {
  data <- rnorm(n, true_mean)
  mse_results <- rbind(mse_results,
                       data.frame(Sample_Size = n,
                                  MLE = mse(mle_mean(data), true_mean),
                                  Hodges_Lehmann = mse(hl_walsh(data), true_mean)))
}
# Plot the MSE comparison
ggplot(mse_results, aes(x = Sample_Size)) +
  geom_line(aes(y = MLE, color = "MLE")) +
  geom_line(aes(y = Hodges_Lehmann, color = "Hodges-Lehmann")) +
  scale_color_manual(values = c("red", "blue")) +
  labs(title = "MSE Comparison of MLE and Hodges-Lehmann Estimator",
       x = "Sample Size", y = "Mean Squared Error") +
  theme_minimal()
Cauchy Distribution
\(\textbf{m=24, n=18}\)
Code
# Newton-Raphson MLE for the Cauchy location parameter
cauchy.mle <- function(x, start, eps = 1e-8, max.iter = 50) {
  if (missing(start)) start <- median(x)
  theta <- start
  n <- length(x)
  score <- sum(2 * (x - theta) / (1 + (x - theta)^2))
  iter <- 1
  conv <- TRUE
  while (abs(score) > eps && iter <= max.iter) {
    info  <- sum((2 - 2 * (x - theta)^2) / (1 + (x - theta)^2)^2)
    theta <- theta + score / info      # Newton step
    iter  <- iter + 1
    score <- sum(2 * (x - theta) / (1 + (x - theta)^2))
  }
  if (abs(score) > eps) {
    print("No Convergence")
    conv <- FALSE
  }
  loglik <- -sum(log(1 + (x - theta)^2))
  info   <- sum((2 - 2 * (x - theta)^2) / (1 + (x - theta)^2)^2)
  list(theta = theta, loglik = loglik, info = info, convergence = conv)
}

MSEcauchyPlot <- function(size = 42, ran, iter = 50, ...) {
  range <- seq(-ran, ran, by = 0.05)
  mse_mle <- numeric(length(range))
  mse_hle <- numeric(length(range))
  for (i in seq_along(range)) {
    mle <- numeric(iter)
    hle <- numeric(iter)
    for (j in 1:iter) {
      data   <- rcauchy(size, range[i], 1)
      mle[j] <- (cauchy.mle(data)$theta - range[i])^2
      hle[j] <- (median(data) - range[i])^2   # sample median as the robust competitor
    }
    mse_mle[i] <- mean(mle)
    mse_hle[i] <- mean(hle)
  }
  estimate <- data.frame(theta = c(range, range),
                         est = c(mse_mle, mse_hle),
                         estimator = rep(c("MLE", "HLE"), each = length(range)))
  ggplot(data = estimate, aes(x = theta, y = est, col = estimator)) +
    geom_line(lwd = 1) +
    labs(x = expression(theta), y = "MSE") +
    scale_y_continuous(limits = c(0, 0.3)) +
    ggtitle(bquote("Comparison between MLE & HLE for Cauchy(" ~ theta ~ ",1)")) +
    theme_stata()
}
MSEcauchyPlot(ran = 1)
So, observing the MSE plots, it can be concluded that the Hodges-Lehmann estimator performs better for the Normal distribution. Besides, it is also consistent, and hence the Hodges-Lehmann estimator should be preferred over the MLE for these distributions.
But for the Cauchy and Logistic distributions, due to the inefficiency of the `optim` and `nleqslv` functions in R on this problem, the comparison of the MSEs of the MLE and the Hodges-Lehmann estimator for the Logistic distribution gives absurd results. Alternatively, we tried to simulate the data by solving the score equations directly, but that process brought forth similar results; hence, for the Logistic distribution, we could not draw a valid conclusion regarding the MSE comparison.
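For the logistic case, the location score equation \(\sum_{i}\{2F(x_i - \theta) - 1\} = 0\), with \(F\) the standard logistic CDF, is monotone in \(\theta\), so a bracketing root-finder is one stable alternative to general-purpose optimisers; a minimal one-sample sketch:
Code
logistic_mle <- function(x) {
  # the score is strictly decreasing in theta and changes sign on range(x)
  uniroot(function(th) sum(2 * plogis(x - th) - 1),
          interval = range(x))$root
}
set.seed(42)
logistic_mle(rlogis(100, location = 1))   # should be close to 1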