Two sample location problem: A simulation study

Subhrangsu Bhunia
Rhitankar Bandyopadhyay
Sudipta Sarkar

2024-03-18

Introductory sections, objectives & the set-up

Introduction

  • Let us consider an example in order to introduce the Mann-Whitney statistic and to motivate the need for non-parametric inference in real-life situations.

  • Suppose we have data on the weights of n persons from Delhi and m persons from Kolkata. We want to test whether weights (and hence obesity) tend to be higher in Delhi than in Kolkata.

  • This real-life scenario can be formulated as a testing problem, for which we can perform the Mann-Whitney U test, proposed by Henry Mann and Donald Ransom Whitney in 1947.

Assumptions

  • In order to carry out the Mann-Whitney U test, consider two distribution functions F(x) and G(x) = F(x - \(\theta\)).
  • Assume that both F and G are continuous.

  • We draw independent random samples (X1, X2,…, Xm) from F and (Y1, Y2,…, Yn) from G, with the two samples independent of each other.

Objectives

  • Here we want to test whether the aforementioned random samples actually come from the same population distribution, when the distributions are completely unknown.
  • Firstly we examine and establish that the test statistic is distribution free under the null hypothesis, for different choices of distributions and sample sizes.

  • Then we check the behaviour of the test statistic under similar situations, but under the alternative hypothesis.

  • After that, the limiting distribution of the test statistic is verified for large sample sizes.

  • Then we consider some discrete distributions under the hypotheses and study the behaviour of the statistic.

  • Next, we conduct the corresponding parametric tests and compare their sizes and powers with those of the aforementioned non-parametric test.

  • Finally, we obtain the Hodges-Lehmann estimator and carry out some parallel studies, such as checking robustness, stochastic dominance, and the relation between the Mann-Whitney statistic and the Wilcoxon rank-sum statistic.

Underlying set-up

Testing Procedure

For convenience, we are interested here only in testing H0: \(\theta\) = 0 against H1: \(\theta\) > 0. The testing procedures for the other alternative hypotheses, H2: \(\theta\) < 0 and H3: \(\theta\) \(\neq\) 0, are executed in a similar manner.

Test Statistic

  • Consider the indicator random variables I(Xi > Yj), \(\forall\) i=1(1)m, \(\forall\) j=1(1)n.

  • Define \(\phi\)(Xi , Yj) = I(Xi > Yj)

  • Then the Mann-Whitney statistic U is defined as follows:

    U = \(\sum_{i=1}^{m} \sum_{j=1}^{n} \phi(x_i, y_j)\)

  • H0 will be rejected in favor of H1 for small values of U. The critical regions will be adjusted correspondingly for the equivalent alternative hypotheses.

  • Under H0, E(U) = \(\frac {mn}{2}\) , Var(U) = \(\frac {mn(m+n+1)}{12}\)

  • We will show, later on, that U is a distribution free statistic, under H0.
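
As a quick illustration (a minimal sketch with made-up toy samples, not data used elsewhere in this study), U can be computed by brute force and compared with its null mean mn/2.

Code
# Toy illustration with hypothetical samples: count the pairs with x_i > y_j
x <- c(2.1, 3.5, 1.7)            # m = 3 observations from F
y <- c(1.9, 2.8, 3.0, 4.2)       # n = 4 observations from G
U <- sum(outer(x, y, ">"))       # Mann-Whitney statistic
U                                # observed value of U
length(x) * length(y) / 2        # null expectation mn/2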

DISTRIBUTION FREE?

For continuous distributions

Under null hypothesis

Here we will show that the Mann-Whitney statistic is distribution free under the null hypothesis, for different choices of continuous distributions and sample sizes. The Normal, Cauchy, Laplace and Logistic distributions are used to demonstrate that the statistic is distribution free under the null hypothesis and, later, that it is not distribution free under the alternative.


\(\textbf{m=4, n=7}\)


Code
library(ggthemes)
library(extraDistr)
# Mann-Whitney statistic: count of pairs with sample1[i] > sample2[j]
Mann_Whitney<-function(sample1,sample2){
  count=0
  for(i in 1:length(sample1)){
    for(j in 1:length(sample2)){
      if(sample1[i]>sample2[j]) count=count+1
    }
  }
  count
}
# Draw a random sample of the given size from the named distribution,
# with parameters par1 and par2 (their meaning depends on the distribution)
sampling<-function(dist,par1,par2,size,...){
  if (dist == 'normal') {
    mean=par1;sd=par2;sample <- rnorm(size, mean, sd)
     } 
  else if (dist == 'uniform') {
    min_val= par1;max_val=par2;sample=runif(size, min_val, max_val)
  } 
  else if (dist == 'poisson') {
    lambda <- par1;sample <- rpois(size, lambda)
  } 
  else if (dist == 'cauchy') {
    location=par1;scale=par2;sample <- rcauchy(size, location,scale)
  } 
  else if (dist == 'exponential') {
    rate=par1;sample <- rexp(size,rate=par1)
  } 
  else if (dist == 'gamma') {
    shape=par1;rate=par2;sample <- rgamma(size,shape,rate)
  } 
  else if (dist == 'beta') {
    shape1=par1;shape2=par2;sample <- rbeta(size,shape1,shape2)
  } 
  else if (dist == 'laplace') {
    mu=par1;sigma=par2;sample <- rlaplace(size,mu,sigma)
  } 
  else if (dist == 'binomial') {
    trials=par1;prob=par2;sample <- rbinom(size,trials,prob)
  } 
  else if (dist == 'geometric') {
    prob=par1;sample <- rgeom(size,prob)
  } 
  else if (dist == 'logistic') {
    location=par1;scale=par2;sample <- rlogis(size,location,scale)
  } 
  else if (dist == 'lognormal') {
    meanlog=par1;sdlog=par2;sample <- rlnorm(size,meanlog,sdlog)
  } 
  else if (dist == 't') {
    sample <- rt(size,par1)
  } 
  else if (dist == 'chisq') {
    sample <- rchisq(size,par1)
  } 
  else if (dist == 'f') {
    df1=par1;df2=par2;sample <- rf(size,df1,df2)
  } 
  else {
    stop("Unsupported distribution")
  }
  
  return(sample)
}

library(ggplot2)
# Simulate `iter` values of the Mann-Whitney statistic; the second sample's
# location is shifted by `theta`
Mann_WhitneyHist<-function(iter=1000,dist,size1,size2,location=0,scale=1,theta=0,...){
  test=NULL
  for(i in 1:iter){
    x=sampling(dist,par1=location,par2=scale,size=size1)
    y=sampling(dist,par1=location+theta,par2=scale,size=size2)
    test=c(test,Mann_Whitney(x,y))
  }
 return(test)
}

dist_h0<-function(m,n,...){
data=data.frame(values=c(Mann_WhitneyHist(dist="normal",size1 =m,size2 = n,location = 0,scale=1),Mann_WhitneyHist(dist="cauchy",size1 = m,size2 = n,location = 0,scale=1),
Mann_WhitneyHist(dist="laplace",size1 = m,size2 = n,location = 0,scale=1),Mann_WhitneyHist(dist="logistic",size1 = m,size2 = n,location = 0,scale=1)),Distribution=rep(c("normal","cauchy","laplace","logistic"),each=1000))
ggplot(data,aes(x=values,col=Distribution,fill=Distribution))+
  geom_density(alpha=0.5)+
  geom_vline(xintercept = m*n/2,linetype="dashed",col="black",lwd=0.7)+
  scale_color_manual(values = c("normal"="black","cauchy"="black","laplace"="black","logistic"="black")) +
  scale_fill_manual(values = c("normal"="#90EE90","cauchy"="red","laplace"="violet","logistic"="#87CEFA")) +
  labs(title = "Distributions under H_0 ",x="Mann-Whitney Statistic",y="Density")+
  theme_stata()

}              
dist_h0(m=4,n=7)

\(\textbf{m=11, n=15}\)

Code
dist_h0(m=11,n=15)

\(\textbf{m=24, n=18}\)

Code
dist_h0(m=24,n=18)

\(\textbf{m=33, n=30}\)

Code
dist_h0(m=33,n=30)

Under alternative hypothesis

Now we show that the statistic is not distribution free under the alternative hypothesis considered here, for different choices of continuous distributions and sample sizes.


\(\textbf{m=4, n=7}\)


Code
dist_h1<-function(m,n,...){
  data=data.frame(values=c(Mann_WhitneyHist(dist="normal",size1 =m,size2 = n,location = 0,scale=1,theta=0.5),Mann_WhitneyHist(dist="cauchy",size1 = m,size2 = n,location = 0,scale=1,theta=0.5),
                           Mann_WhitneyHist(dist="laplace",size1 = m,size2 = n,location = 0,scale=1,theta=0.5),Mann_WhitneyHist(dist="logistic",size1 = m,size2 = n,location = 0,scale=1,theta=0.5)),Distribution=rep(c("normal","cauchy","laplace","logistic"),each=1000))
  ggplot(data,aes(x=values,col=Distribution,fill=Distribution))+
    geom_density(alpha=0.5)+
   scale_color_manual(values = c("normal"="black","cauchy"="black","laplace"="black","logistic"="black")) +
    scale_fill_manual(values = c("normal"="#90EE90","cauchy"="red","laplace"="violet","logistic"="#87CEFA")) +
    labs(title = "Distributions under H_1",x="Mann-Whitney Statistic",y="Density")+
    theme_stata()
}  
dist_h1(4,7)

\(\textbf{m=11, n=15}\)

Code
dist_h1(11,15)

\(\textbf{m=24, n=18}\)

Code
dist_h1(24,18)

\(\textbf{m=33, n=30}\)

Code
dist_h1(33,30)

For discrete distributions

Under null hypothesis

Here we will show that, for different choices of discrete distributions and sample sizes, the Mann-Whitney statistic is not distribution free, either under the null hypothesis or under the alternative hypothesis considered here. The Binomial, Poisson and Geometric distributions are used for this purpose.


\(\textbf{m=4, n=7}\)


Code
discretedist_h0<-function(m,n,...){
  data=data.frame(values=c(Mann_WhitneyHist(dist="poisson",size1 =m,size2 = n,location = 2),Mann_WhitneyHist(dist="binomial",size1 = m,size2 = n,location = 15,scale=0.6),
                           Mann_WhitneyHist(dist="geometric",size1 = m,size2 = n,location =0.6)),Distribution=rep(c("poisson","binomial","geometric"),each=1000))
  ggplot(data,aes(x=values,col=Distribution,fill=Distribution))+
    geom_density(alpha=0.5)+
   scale_color_manual(values = c("poisson"="black","binomial"="black","geometric"="black")) +
    scale_fill_manual(values = c("poisson"="#90EE90","binomial"="red","geometric"="violet")) +
    labs(title = "Distributions under H_0 ",x="Mann-Whitney Statistic",y="Density")+
    theme_stata()
  
}
discretedist_h0(4,7)

\(\textbf{m=11, n=15}\)

Code
discretedist_h0(11,15)

\(\textbf{m=24, n=18}\)

Code
discretedist_h0(24,18)

\(\textbf{m=33, n=30}\)

Code
discretedist_h0(33,30)

Under alternative hypothesis

Now we show that the statistic is not distribution free under the alternative hypothesis, for different choices of discrete distributions and sample sizes.


\(\textbf{m=4, n=7}\)


Code
Mann_Whitneybin<-function(iter=1000,dist,size1,size2,location=0,scale=1,theta=0,...){
  test=NULL
  for(i in 1:iter){
    x=sampling(dist,par1=location,par2=scale,size=size1)
    y=sampling(dist,par1=location,par2=scale+theta,size=size2)
    test=c(test,Mann_Whitney(x,y))
  }
  return(test)
}
discretedist_h1<-function(m,n,...){
  data=data.frame(values=c(Mann_WhitneyHist(dist="poisson",size1 =m,size2 = n,location = 2,theta =1.5),Mann_Whitneybin(dist="binomial",size1 = m,size2 = n,location = 15,scale=0.6,theta=0.1),
                           Mann_WhitneyHist(dist="geometric",size1 = m,size2 = n,location =0.6,theta=0.1)),Distribution=rep(c("poisson","binomial","geometric"),each=1000))
  ggplot(data,aes(x=values,col=Distribution,fill=Distribution))+
    geom_density(alpha=0.5)+
    scale_color_manual(values = c("poisson"="black","binomial"="black","geometric"="black")) +
    scale_fill_manual(values = c("poisson"="#90EE90","binomial"="red","geometric"="violet")) +
    labs(title = "Distributions under H_1 ",x="Mann-Whitney Statistic",y="Density")+
    theme_stata()
  
}
discretedist_h1(4,7)

\(\textbf{m=11, n=15}\)

Code
discretedist_h1(11,15)

\(\textbf{m=24, n=18}\)

Code
discretedist_h1(24,18)

\(\textbf{m=33, n=30}\)

Code
discretedist_h1(33,30)

EXACT DISTRIBUTION

Exact distribution of the statistic under some fixed sample sizes

  • Here we calculate the exact (small-sample) probability distribution of the Mann-Whitney U statistic under H0 for some small sample sizes and observe its behaviour. The statistic takes values from 0 to mn, both boundaries included.
  • In this context, we use the recursion d(m, n, k) = d(m, n-1, k) + d(m-1, n, k-n), where d(m, n, k) is the number of arrangements of m X’s and n Y’s leading to U = k; a direct implementation of this recursion is sketched below.
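
The following is a small illustrative sketch of the recursion above (not the code used for the tables below): the counts d(m, n, k) are computed recursively and normalised by the total number of arrangements.

Code
# Exact null pmf of U via d(m, n, k) = d(m, n-1, k) + d(m-1, n, k-n)
u_count <- function(m, n, k) {
  if (k < 0 || k > m * n) return(0)
  if (m == 0 || n == 0) return(as.numeric(k == 0))
  u_count(m, n - 1, k) + u_count(m - 1, n, k - n)
}
u_pmf <- function(m, n) {
  counts <- sapply(0:(m * n), function(k) u_count(m, n, k))
  counts / choose(m + n, m)          # each arrangement is equally likely under H0
}
round(u_pmf(2, 2), 4)                # should agree with the table for m = 2, n = 2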

\(\textbf{m=2, n=2}\)

Code
## small sample exact prob dist under H0
p_21<-NULL;p_12<-NULL
p_21[1]<-1/3;p_21[2]<-1/3;p_21[3]<-1/3;p_12[1]<-1/3;p_12[2]<-1/3;p_12[3]<-1/3

# Distribution of p(2,2)
m=2;n=2
p<-NULL;q<-NULL;r<-NULL;p_22<-NULL;p<-p_21;q<-p_12
for (k in 1:(m*n+1)){
  if(k<=m){
    r[k] = (m/(m+n))*q[k]
  }
  else if (k>m && k<=((m-1)*n +1) ){
    r[k]=(n/(m+n))*p[k-m]+(m/(m+n))*q[k]
  }
  else {
    r[k]=(n/(m+n))*p[k-m]
  }
}
p_22<-r
prob22=data.frame(values=0:(2*2),probability=p_22)

# Distribution of p(3,0)
p_30<-NULL;p_30[1]<-1

#Distribution of p(3,1)
p<-NULL;q<-NULL;r<-NULL;p_31<-NULL
m=3;n=1
p<-p_30;q<-p_21
for (k in 1:(m*n+1)){
  if(k<=m) r[k] = (m/(m+n))*q[k]
  else if (k>m && k<=((m-1)*n +1) ) r[k]=(n/(m+n))*p[k-m]+(m/(m+n))*q[k]
  else r[k]=(n/(m+n))*p[k-m]
}
p_31<-r
prob31=data.frame(values=0:(3*1),probability=p_31)
#Distribution of p(3,2)
p<-NULL;q<-NULL;r<-NULL;p_32<-NULL;m=3;n=2;p<-p_31;q<-p_22;r<-NULL
for (k in 1:(m*n+1)){
  if(k<=m) r[k] = (m/(m+n))*q[k]
  else if (k>m && k<=((m-1)*n +1) ) r[k]=(n/(m+n))*p[k-m]+(m/(m+n))*q[k]
  else r[k]=(n/(m+n))*p[k-m]
}
p_32<-r
prob32=data.frame(values=0:(3*2),probability=p_32)
p_03<-1

# Distribution of P(1,3)
p<-NULL;q<-NULL;r<-NULL;p_13<-NULL;m=1;n=3;p<-p_12;q<-p_03;r<-NULL
for (k in 1:(m*n+1)){
  if(k<=m) r[k] = (m/(m+n))*q[k]
  else if (k>m && k<=((m-1)*n +1) ) r[k]=(n/(m+n))*p[k-m]+(m/(m+n))*q[k]
  else r[k]=(n/(m+n))*p[k-m]
}
p_13<-r
prob13=data.frame(values=0:(1*3),probability=p_13)

# Distribution of P(2,3)
p<-NULL;q<-NULL;r<-NULL;p_23<-NULL;m=2;n=3;p<-p_22;q<-p_13;r<-NULL
for (k in 1:(m*n+1)){
  if(k<=m) r[k] = (m/(m+n))*q[k]
  else if (k>m && k<=((m-1)*n +1) ) r[k]=(n/(m+n))*p[k-m]+(m/(m+n))*q[k]
  else r[k]=(n/(m+n))*p[k-m]
}
p_23<-r
prob23=data.frame(values=0:(2*3),probability=p_23)

# Distribution of P(3,3)
p<-NULL;q<-NULL;r<-NULL;p_33<-NULL;m=3;n=3;p<-p_32;q<-p_23
for (k in 1:(m*n+1)){
  if(k<=m) r[k] = (m/(m+n))*q[k]
  else if (k>m && k<=((m-1)*n +1) ) r[k]=(n/(m+n))*p[k-m]+(m/(m+n))*q[k]
  else r[k]=(n/(m+n))*p[k-m]
}
p_33<-r
prob33=data.frame(values=0:(3*3),probability=p_33)
p_40<-1

# Distribution of P(4,1)
p<-NULL;q<-NULL;r<-NULL;p_41<-NULL;m=4;n=1;p<-p_40;q<-p_31
for (k in 1:(m*n+1)){
  if(k<=m) r[k] = (m/(m+n))*q[k]
  else if (k>m && k<=((m-1)*n +1) ) r[k]=(n/(m+n))*p[k-m]+(m/(m+n))*q[k]
  else r[k]=(n/(m+n))*p[k-m]
}
p_41<-r
prob41=data.frame(values=0:(4*1),probability=p_41)

# Distribution of P(4,2)
p<-NULL;q<-NULL;r<-NULL;p_42<-NULL;m=4;n=2;p<-p_41;q<-p_32
for (k in 1:(m*n+1)){
  if(k<=m) r[k] = (m/(m+n))*q[k]
  else if (k>m && k<=((m-1)*n +1) ) r[k]=(n/(m+n))*p[k-m]+(m/(m+n))*q[k]
  else r[k]=(n/(m+n))*p[k-m]
}
p_42<-r
prob42=data.frame(values=0:(4*2),probability=p_42)

# Distribution of P(4,3)
p<-NULL;q<-NULL;r<-NULL;p_43<-NULL;m=4;n=3;p<-p_42;q<-p_33
for (k in 1:(m*n+1)){
  if(k<=m) r[k] = (m/(m+n))*q[k]
  else if (k>m && k<=((m-1)*n +1) ) r[k]=(n/(m+n))*p[k-m]+(m/(m+n))*q[k]
  else r[k]=(n/(m+n))*p[k-m]
}
p_43<-r
prob43=data.frame(values=0:(4*3),probability=p_43)

p_04<-1
# Distribution of P(1,4)
p<-NULL;q<-NULL;r<-NULL;p_14<-NULL;m=1;n=4;p<-p_13;q<-p_04
for (k in 1:(m*n+1)){
  if(k<=m) r[k] = (m/(m+n))*q[k]
  else if (k>m && k<=((m-1)*n +1) ) r[k]=(n/(m+n))*p[k-m]+(m/(m+n))*q[k]
  else r[k]=(n/(m+n))*p[k-m]
}
p_14<-r
prob14=data.frame(values=0:(1*4),probability=p_14)


# Distribution of P(2,4)
p<-NULL;q<-NULL;r<-NULL;p_24<-NULL;m=2;n=4;p<-p_23;q<-p_14
for (k in 1:(m*n+1)){
  if(k<=m) r[k] = (m/(m+n))*q[k]
  else if (k>m && k<=((m-1)*n +1) ) r[k]=(n/(m+n))*p[k-m]+(m/(m+n))*q[k]
  else r[k]=(n/(m+n))*p[k-m]
}
p_24<-r
prob24=data.frame(values=0:(2*4),probability=p_24)

# Distribution of P(3,4)
p<-NULL;q<-NULL;r<-NULL;p_34<-NULL;m=3;n=4;p<-p_33;q<-p_24
for (k in 1:(m*n+1)){
  if(k<=m) r[k] = (m/(m+n))*q[k]
  else if (k>m && k<=((m-1)*n +1) ) r[k]=(n/(m+n))*p[k-m]+(m/(m+n))*q[k]
  else r[k]=(n/(m+n))*p[k-m]
}
p_34<-r
prob34=data.frame(values=0:(3*4),probability=p_34)

# Distribution of P(4,4)
p<-NULL;q<-NULL;r<-NULL;p_44<-NULL;m=4;n=4;p<-p_43;q<-p_34
for (k in 1:(m*n+1)){
  if(k<=m) r[k] = (m/(m+n))*q[k]
  else if (k>m && k<=((m-1)*n +1) ) r[k]=(n/(m+n))*p[k-m]+(m/(m+n))*q[k]
  else r[k]=(n/(m+n))*p[k-m]
}
p_44<-r
prob44=data.frame(values=0:(4*4),probability=p_44)
prob22
  values probability
1      0   0.1666667
2      1   0.1666667
3      2   0.3333333
4      3   0.1666667
5      4   0.1666667

\(\textbf{m=4, n=1}\)

Code
prob41
  values probability
1      0         0.2
2      1         0.2
3      2         0.2
4      3         0.2
5      4         0.2

\(\textbf{m=4, n=4}\)

Code
prob44
   values probability
1       0  0.01428571
2       1  0.01428571
3       2  0.02857143
4       3  0.04285714
5       4  0.07142857
6       5  0.07142857
7       6  0.10000000
8       7  0.10000000
9       8  0.11428571
10      9  0.10000000
11     10  0.10000000
12     11  0.07142857
13     12  0.07142857
14     13  0.04285714
15     14  0.02857143
16     15  0.01428571
17     16  0.01428571

ASYMPTOTIC ANALYSIS

Checking asymptotic normality

1. U-statistic approach

Under the appropriate assumptions, \({U^{*} = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} \phi(x_i, y_j)}\) is the two-sample U-statistic corresponding to the Mann-Whitney statistic. Provided all the required moments exist, under the null hypothesis,
\(\sqrt{N}\)[ \(U^{*}\) - \(\gamma\)(F, G)] \(\xrightarrow{D}\) \(\mathcal{N}\)(0,\(\frac{r^{2} \xi_{10}}{\lambda} + \frac{s^{2} \xi_{01}}{1 - \lambda}\)) as N \(\xrightarrow{}\) \(\infty\),
where (r, s) is the degree of the U-statistic, \(\frac{m}{N}\xrightarrow{} \lambda\), and \(\xi_{cd}\) denotes the covariance between two kernels of the U-statistic having c \(X\) observations and d \(Y\) observations in common. Under the null hypothesis, \(\xi_{10}\) = \(\xi_{01}\) = \(\frac{1}{12}\), so the limiting variance reduces to \(\frac{1}{12\lambda(1-\lambda)}\).
Here we will show the Q-Q plots and verify the asymptotic normality of the U-statistic corresponding to the Mann-Whitney statistic, considering different continuous distributions, namely, Cauchy, Laplace and Logistic, separately in the hypotheses.
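
As a quick Monte-Carlo sanity check of the limiting variance stated above (a minimal sketch using standard normal samples; the sample sizes and seed here are illustrative):

Code
# Compare the simulated variance of sqrt(N)*(U* - 1/2) with 1/(12*lambda*(1-lambda))
set.seed(123)
m <- 24; n <- 18; N <- m + n; lambda <- m / N
Ustar <- replicate(2000, {
  x <- rnorm(m); y <- rnorm(n)
  mean(outer(x, y, ">"))            # U* = U/(mn)
})
var(sqrt(N) * (Ustar - 1/2))        # simulated variance
1 / (12 * lambda * (1 - lambda))    # theoretical limiting variance under H0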

Cauchy Distribution

\(\textbf{m=11, n=15}\)

Code
check_qq<-function(distn,m,n){
  c=m/(m+n)
  a=(Mann_WhitneyHist(dist=distn,size1 =m,size2 = n,location = 0,scale=1)/(m*n)-1/2)*sqrt(n+m)/((3-2*c)/(12*c*(1-c)))
  data=data.frame(x=a)
    ggplot(data,aes(sample=x))+geom_qq(col="blue")+geom_qq_line(col="red",linetype="dashed",lwd=0.7)+
      labs(x="Theoretical Quantiles",y="Sample Quantiles",x="Theoretical Quantiles",y="Sample Quantiles")+ggtitle(paste0("QQ Plot for ",distn))+
      theme_stata()
}
check_qq("cauchy",11,15)

\(\textbf{m=24, n=18}\)

Code
check_qq("cauchy",24,18)

Laplace Distribution

\(\textbf{m=11, n=15}\)

Code
check_qq("laplace",11,15)

\(\textbf{m=24, n=18}\)

Code
check_qq("laplace",24,18)

Logistic Distribution

\(\textbf{m=11, n=15}\)

Code
check_qq("logistic",11,15)

\(\textbf{m=24, n=18}\)

Code
check_qq("logistic",24,18)

2. Goodness of fit test

  • From the Q-Q plots we get an apparent idea that the asymptotic distribution of the non-parametric test statistic behaves like a normal distribution. For a stronger mathematical verification, we carry out a goodness-of-fit procedure, namely the non-parametric Kolmogorov-Smirnov test for normality.
  • Under the null hypothesis, the (standardised) test statistic is assumed to follow a normal distribution asymptotically; under the two-sided alternative, it is assumed not to. We reject the null hypothesis for small p-values; i.e., large p-values are consistent with the statistic being asymptotically normal. A one-sample variant of this check is sketched below.
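
As an aside (an alternative check, not the procedure used in the remainder of this section), the standardised statistics could also be compared directly with the standard normal CDF via a one-sample Kolmogorov-Smirnov test, using the same standardisation as in the simulations below.

Code
# One-sample variant (sketch): compare the standardised Mann-Whitney statistics
# directly with the N(0,1) CDF instead of a finite simulated reference sample
check_normal_onesample <- function(distn, m, n) {
  c <- m / (m + n)
  a <- (Mann_WhitneyHist(dist = distn, size1 = m, size2 = n, location = 0, scale = 1) /
          (m * n) - 1/2) * sqrt(m + n) / ((3 - 2 * c) / (12 * c * (1 - c)))
  ks.test(a, "pnorm", mean = 0, sd = 1)   # one-sample KS test against N(0,1)
}
# e.g. check_normal_onesample("cauchy", 11, 15)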

Cauchy Distribution

\(\textbf{m=11, n=15}\)

Code
check_normal<-function(distn,m,n){
  c=m/(m+n)
  a=(Mann_WhitneyHist(dist=distn,size1 =m,size2 = n,location = 0,scale=1)/(m*n)-1/2)*sqrt(n+m)/((3-2*c)/(12*c*(1-c)))
  data=data.frame(x=a)
  # two-sample KS test against a freshly simulated N(0,1) reference sample of size m+n
return(ks.test(x=a,y=rnorm(m+n,0,1)))
}
check_normal("cauchy",11,15)

    Asymptotic two-sample Kolmogorov-Smirnov test

data:  a and rnorm(m + n, 0, 1)
D = 0.18669, p-value = 0.3402
alternative hypothesis: two-sided

\(\textbf{m=24, n=18}\)

Code
check_normal("cauchy",24,18)

    Asymptotic two-sample Kolmogorov-Smirnov test

data:  a and rnorm(m + n, 0, 1)
D = 0.15233, p-value = 0.3069
alternative hypothesis: two-sided

Logistic Distribution

\(\textbf{m=11, n=15}\)

Code
check_normal("logistic",11,15)

    Asymptotic two-sample Kolmogorov-Smirnov test

data:  a and rnorm(m + n, 0, 1)
D = 0.11685, p-value = 0.8795
alternative hypothesis: two-sided

\(\textbf{m=24, n=18}\)

Code
check_normal("logistic",24,18)

    Asymptotic two-sample Kolmogorov-Smirnov test

data:  a and rnorm(m + n, 0, 1)
D = 0.15343, p-value = 0.2988
alternative hypothesis: two-sided

Laplace Distribution

\(\textbf{m=11, n=15}\)

Code
check_normal("laplace",11,15)

    Asymptotic two-sample Kolmogorov-Smirnov test

data:  a and rnorm(m + n, 0, 1)
D = 0.18062, p-value = 0.3801
alternative hypothesis: two-sided

\(\textbf{m=24, n=18}\)

Code
check_normal("laplace",24,18)

    Asymptotic two-sample Kolmogorov-Smirnov test

data:  a and rnorm(m + n, 0, 1)
D = 0.083333, p-value = 0.9423
alternative hypothesis: two-sided

CONCLUSION

From the Q-Q plots and the Kolmogorov-Smirnov tests, we conclude that the Mann-Whitney U statistic asymptotically follows a normal distribution, irrespective of the underlying continuous distribution in the corresponding hypotheses, and hence the test procedure is asymptotically distribution-free.

CONSISTENT?
CHECK POWER CURVES varying location and sample sizes

Power curves

  • Here we plot the power curves, considering the CDFs of the Normal, Cauchy, Laplace and Logistic distributions, separately, as F and G, under the null and alternative hypotheses H0: \(\theta\) = 0 and H1: \(\theta\) > 0, following the usual, aforementioned notation.
  • From the power curves we observe that, for larger sample sizes, the power of the test approaches 1 faster, as expected, and that the shapes of the curves differ across distributions.

  • We also observe that all the tests appear unbiased for large enough sample sizes.

Under alternative: H1: \(\theta\) > 0

Normal Distribution

Code
Mann_WhitneyPowerRight<-function(distn,size1,size2,alpha=0.05,iter=1000,rge=3,...){
  effect_sizes <- seq(0,rge, by = 0.1);power_values=numeric(length(effect_sizes))
  for(i in 1:length(effect_sizes)){
    reject<-0
    for(j in 1:iter){
      sample1=sampling(dist=distn,par1=0,par2=1,size=size1);sample2=sampling(dist=distn,par1=(0+effect_sizes[i]),par2=1,size=size2)
      if(wilcox.test(sample1,sample2)$p.value <alpha) reject=reject+1
      }
    power_values[i]=reject/iter
  }
  return(list(a=effect_sizes,b=power_values))
}
## Mann_WhitneyPower(distn="normal",size1 = 12,size2 = 15)
## normal
pov=data.frame(x1= rep(seq(0,3, by = 0.1),4),y1=c(Mann_WhitneyPowerRight(distn="normal",size1 = 4,size2 = 7)$b,
Mann_WhitneyPowerRight(distn="normal",size1 = 11,size2 = 15)$b,
Mann_WhitneyPowerRight(distn="normal",size1 = 24,size2 = 18)$b,Mann_WhitneyPowerRight(distn="normal",size1 = 30,size2 = 33)$b),size=rep(c("m=4, n=7","m=11, n=15","m=24, n=18","m= 33, n=30"),each=length(seq(0,3, by = 0.1))))
ggplot(data=pov,aes(x=x1,y=y1,col=size))+geom_line(lwd=0.9)+
labs(x=expression(Theta),y="Power")+ggtitle(expression(paste("Power Function for alternative distribution  ", N ( theta,1)," for ",theta >0)))+
  theme_stata()+geom_hline(yintercept = 0.05,linetype="dashed",lwd=0.5)

Cauchy Distribution

Code
pov1=data.frame(x1= rep(seq(0,5, by = 0.1),4),y1=c(Mann_WhitneyPowerRight(distn="cauchy",size1 = 4,size2 = 7,rge = 5)$b,
Mann_WhitneyPowerRight(distn="cauchy",size1 = 11,size2 = 15,rge=5)$b,
Mann_WhitneyPowerRight(distn="cauchy",size1 = 24,size2=18,rge=5)$b,Mann_WhitneyPowerRight(distn="cauchy",size1 = 33,size2 = 30,rge=5)$b),size=rep(c("m=4 n=7","m=11 n=15","m=24 n=18","m= 30 n=33"),each=length(seq(0,5, by = 0.1))))
ggplot(data=pov1,aes(x=x1,y=y1,col=size))+geom_line(lwd=0.9)+
  labs(x=expression(Theta),y="Power")+ggtitle(expression(paste("Power Function for alternative distribution  ", C ( theta,1)," for ",theta >0)))+
  theme_stata()+geom_hline(yintercept = 0.05,linetype="dashed",lwd=0.5)

Laplace Distribution

Code
pov2=data.frame(x1= rep(seq(0,5, by = 0.1),4),y1=c(Mann_WhitneyPowerRight(distn="laplace",size1 = 4,size2 = 7,rge = 5)$b,
                                                   Mann_WhitneyPowerRight(distn="laplace",size1 = 11,size2 = 15,rge=5)$b,
                                                   Mann_WhitneyPowerRight(distn="laplace",size1 = 24,size2 = 18,rge=5)$b,Mann_WhitneyPowerRight(distn="laplace",size1 = 33,size2 = 30,rge = 5)$b),size=rep(c("m=4, n=7","m=11, n=15","m=24, n=18","m= 33, n=30"),each=length(seq(0,5, by = 0.1))))
ggplot(data=pov2,aes(x=x1,y=y1,col=size))+geom_line(lwd=0.9)+
  labs(x=expression(Theta),y="Power")+ggtitle(expression(paste("Power Function for alternative distribution  ", Laplace ( theta,1)," for ",theta >0)))+
  theme_stata()+geom_hline(yintercept = 0.05,linetype="dashed",lwd=0.5)

Logistic Distribution

Code
pov3=data.frame(x1= rep(seq(0,5, by = 0.1),4),y1=c(Mann_WhitneyPowerRight(distn="logistic",size1 = 4,size2 = 7,rge = 5)$b,
                                                   Mann_WhitneyPowerRight(distn="logistic",size1 = 11,size2 = 15,rge=5)$b,
                                                   Mann_WhitneyPowerRight(distn="logistic",size1 = 24,size2 = 18,rge=5)$b,Mann_WhitneyPowerRight(distn="logistic",size1 = 33,size2 = 30,rge=5)$b),size=rep(c("m=4, n=7","m=11, n=15","m=24, n=18","m= 33, n=30"),each=length(seq(0,5, by = 0.1))))
ggplot(data=pov3,aes(x=x1,y=y1,col=size))+geom_line(lwd=0.9)+
  labs(x=expression(Theta),y="Power")+ggtitle(expression(paste("Power Function for alternative distribution  ", Logistic ( theta,1)," for ",theta >0)))+
  theme_stata()+geom_hline(yintercept = 0.05,linetype="dashed",lwd=0.5)

Under alternative: H2: \(\theta\) < 0

Normal Distribution

Code
Mann_WhitneyPowerleft<-function(distn,size1,size2,alpha=0.05,iter=1000,rge=-3,...){
  effect_sizes <- seq(rge,0, by = 0.1);power_values=numeric(length(effect_sizes))
  for(i in 1:length(effect_sizes)){
    reject<-0
    for(j in 1:iter){
      sample1=sampling(dist=distn,par1=0,par2=1,size=size1);sample2=sampling(dist=distn,par1=(0+effect_sizes[i]),par2=1,size=size2)
      if(wilcox.test(sample1,sample2)$p.value <alpha) reject=reject+1
    }
    power_values[i]=reject/iter
  }
  return(list(a=effect_sizes,b=power_values))
}
## normal
pov=data.frame(x1= rep(seq(-3,0, by = 0.1),4),y1=c(Mann_WhitneyPowerleft(distn="normal",size1 = 4,size2 = 7)$b,
Mann_WhitneyPowerleft(distn="normal",size1 = 11,size2 = 15)$b,
Mann_WhitneyPowerleft(distn="normal",size1 = 24,size2 = 18)$b,Mann_WhitneyPowerleft(distn="normal",size1 = 30,size2 = 33)$b),size=rep(c("m=4, n=7","m=11, n=15","m=24, n=18","m= 33, n=30"),each=length(seq(0,3, by = 0.1))))
ggplot(data=pov,aes(x=x1,y=y1,col=size))+geom_line(lwd=.9)+
labs(x=expression(Theta),y="Power")+ggtitle(expression(paste("Power Function for alternative distribution  ", N ( theta,1)," for ",theta <0)))+
  theme_stata()+geom_hline(yintercept = 0.05,linetype="dashed",lwd=0.5)

Cauchy Distribution

Code
pov1=data.frame(x1= rep(seq(-5,0, by = 0.1),4),y1=c(Mann_WhitneyPowerleft(distn="cauchy",size1 = 4,size2 = 7,rge = -5)$b,
Mann_WhitneyPowerleft(distn="cauchy",size1 = 11,size2 = 15,rge=-5)$b,
Mann_WhitneyPowerleft(distn="cauchy",size1 = 24,size2=18,rge=-5)$b,Mann_WhitneyPowerleft(distn="cauchy",size1 = 33,size2 = 30,rge=-5)$b),size=rep(c("m=4, n=7","m=11, n=15","m=24, n=18","m= 30, n=33"),each=length(seq(0,5, by = 0.1))))
ggplot(data=pov1,aes(x=x1,y=y1,col=size))+geom_line(lwd=.9)+
  labs(x=expression(Theta),y="Power")+ggtitle(expression(paste("Power Function for alternative distribution  ", C ( theta,1)," for ",theta <0)))+
  theme_stata()+geom_hline(yintercept = 0.05,linetype="dashed",lwd=0.5)

Laplace Distribution

Code
pov2=data.frame(x1= rep(seq(-5,0, by = 0.1),4),y1=c(Mann_WhitneyPowerleft(distn="laplace",size1 = 4,size2 = 7,rge = -5)$b,
                                                   Mann_WhitneyPowerleft(distn="laplace",size1 = 11,size2 = 15,rge=-5)$b,
                                                   Mann_WhitneyPowerleft(distn="laplace",size1 = 24,size2 = 18,rge=-5)$b,Mann_WhitneyPowerleft(distn="laplace",size1 = 33,size2 = 30,rge = -5)$b),size=rep(c("m=4, n=7","m=11, n=15","m=24, n=18","m= 33, n=30"),each=length(seq(0,5, by = 0.1))))
ggplot(data=pov2,aes(x=x1,y=y1,col=size))+geom_line(lwd=0.9)+
  labs(x=expression(Theta),y="Power")+ggtitle(expression(paste("Power Function for alternative distribution  ", Laplace ( theta,1)," for ",theta <0)))+
  theme_stata()+geom_hline(yintercept = 0.05,linetype="dashed",lwd=0.5)

Logistic Distribution

Code
pov3=data.frame(x1= rep(seq(-5,0, by = 0.1),4),y1=c(Mann_WhitneyPowerleft(distn="logistic",size1 = 4,size2 = 7,rge = -5)$b,
                                                   Mann_WhitneyPowerleft(distn="logistic",size1 = 11,size2 = 15,rge=-5)$b,
                                                   Mann_WhitneyPowerleft(distn="logistic",size1 = 24,size2 = 18,rge=-5)$b,Mann_WhitneyPowerleft(distn="logistic",size1 = 33,size2 = 30,rge=-5)$b),size=rep(c("m=4, n=7","m=11, n=15","m=24, n=18","m= 33, n=30"),each=length(seq(0,5, by = 0.1))))
ggplot(data=pov3,aes(x=x1,y=y1,col=size))+geom_line(lwd=0.9)+
  labs(x=expression(Theta),y="Power")+ggtitle(expression(paste("Power Function for alternative distribution  ", Logistic ( theta,1)," for ",theta <0)))+
  theme_stata()+geom_hline(yintercept = 0.05,linetype="dashed",lwd=0.5)

Under alternative: H3: \(\theta\) \(\neq\) 0

Normal Distribution

Code
Mann_WhitneyPowerboth<-function(distn,size1,size2,alpha=0.05,iter=1000,rge1=-3,rge2=3,...){
  effect_sizes <- seq(rge1,rge2, by = 0.1);power_values=numeric(length(effect_sizes))
  for(i in 1:length(effect_sizes)){
    reject<-0
    for(j in 1:iter){
      sample1=sampling(dist=distn,par1=0,par2=1,size=size1);sample2=sampling(dist=distn,par1=(0+effect_sizes[i]),par2=1,size=size2)
      if(wilcox.test(sample1,sample2)$p.value <alpha) reject=reject+1
    }
    power_values[i]=reject/iter
  }
  return(list(a=effect_sizes,b=power_values))
}
## normal
pov=data.frame(x1= rep(seq(-3,3, by = 0.1),4),y1=c(Mann_WhitneyPowerboth(distn="normal",size1 = 4,size2 = 7)$b,
Mann_WhitneyPowerboth(distn="normal",size1 = 11,size2 = 15)$b,
Mann_WhitneyPowerboth(distn="normal",size1 = 24,size2 = 18)$b,Mann_WhitneyPowerboth(distn="normal",size1 = 30,size2 = 33)$b),size=rep(c("m=4, n=7","m=11, n=15","m=24, n=18","m= 33, n=30"),each=length(seq(-3,3, by = 0.1))))
ggplot(data=pov,aes(x=x1,y=y1,col=size))+geom_line(lwd=0.9)+
labs(x=expression(Theta),y="Power")+ggtitle(expression(paste("Power Function for alternative distribution  ", N ( theta,1)," for ",theta != 0)))+
  theme_stata()+geom_hline(yintercept = 0.05,linetype="dashed",lwd=0.5)

Cauchy Distribution

Code
pov1=data.frame(x1= rep(seq(-5,5, by = 0.1),4),y1=c(Mann_WhitneyPowerboth(distn="cauchy",size1 = 4,size2 = 7,rge1 = -5, rge2=5)$b,
Mann_WhitneyPowerboth(distn="cauchy",size1 = 11,size2 = 15,rge1 = -5, rge2=5)$b,
Mann_WhitneyPowerboth(distn="cauchy",size1 = 24,size2=18,rge1 = -5, rge2=5)$b,Mann_WhitneyPowerboth(distn="cauchy",size1 = 33,size2 = 30,rge1 = -5, rge2=5)$b),size=rep(c("m=4 n=7","m=11 n=15","m=24 n=18","m= 30 n=33"),each=length(seq(-5,5, by = 0.1))))
ggplot(data=pov1,aes(x=x1,y=y1,col=size))+geom_line(lwd=0.9)+
  labs(x=expression(Theta),y="Power")+ggtitle(expression(paste("Power Function for alternative distribution  ", C ( theta,1)," for ",theta != 0)))+
  theme_stata()+geom_hline(yintercept = 0.05,linetype="dashed",lwd=0.5)

Laplace Distribution

Code
pov2=data.frame(x1= rep(seq(-3,3, by = 0.1),4),y1=c(Mann_WhitneyPowerboth(distn="laplace",size1 = 4,size2 = 7,rge1 = -3, rge2= 3)$b,
                                                   Mann_WhitneyPowerboth(distn="laplace",size1 = 11,size2 = 15,rge1 = -3, rge2= 3)$b,
                                                   Mann_WhitneyPowerboth(distn="laplace",size1 = 24,size2 = 18,rge1 = -3, rge2= 3)$b,Mann_WhitneyPowerboth(distn="laplace",size1 = 33,size2 = 30,rge1 = -3, rge2= 3)$b),size=rep(c("m=4, n=7","m=11, n=15","m=24, n=18","m= 33, n=30"),each=length(seq(-3,3, by = 0.1))))
ggplot(data=pov2,aes(x=x1,y=y1,col=size))+geom_line(lwd=0.9)+
  labs(x=expression(Theta),y="Power")+ggtitle(expression(paste("Power Function for alternative distribution  ", Laplace ( theta,1)," for ",theta != 0)))+
  theme_stata()+geom_hline(yintercept = 0.05,linetype="dashed",lwd=0.5)

Logistic Distribution

Code
pov3=data.frame(x1= rep(seq(-5,5, by = 0.1),4),y1=c(Mann_WhitneyPowerboth(distn="logistic",size1 = 4,size2 = 7,rge1=-5, rge2=5)$b,
                                                   Mann_WhitneyPowerboth(distn="logistic",size1 = 11,size2 = 15,rge1=-5, rge2=5)$b,
                                                   Mann_WhitneyPowerboth(distn="logistic",size1 = 24,size2 = 18,rge1=-5, rge2=5)$b,Mann_WhitneyPowerboth(distn="logistic",size1 = 33,size2 = 30,rge1=-5, rge2=5)$b),size=rep(c("m=4, n=7","m=11, n=15","m=24, n=18","m= 33, n=30"),each=length(seq(-5,5, by = 0.1))))
ggplot(data=pov3,aes(x=x1,y=y1,col=size))+geom_line(lwd=0.9)+
  labs(x=expression(Theta),y="Power")+ggtitle(expression(paste("Power Function for alternative distribution  ", Logistic ( theta,1)," for ",theta != 0)))+
  theme_stata()+geom_hline(yintercept = 0.05,linetype="dashed",lwd=0.5)

POWER COMPARISON: PARAMETRIC & NON-PARAMETRIC

Power comparison between t-test and Mann-Whitney U test

Under right sided alternative, H1: \(\theta\) > 0

Normal Distribution

  • Here we will perform the Mann-Whitney U test, taking the CDFs of Normal distributions in both hypotheses, and then perform the corresponding parametric t-test for the Normal location parameters; a single-dataset illustration of the two tests is given below.
  • We will then observe and compare the power curves and conclude that the t-test produces better results than the aforementioned non-parametric test, in terms of power and consistency.
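
As a single-dataset illustration (a sketch with simulated normal samples, separate from the power simulation below; the shift of 0.8 is arbitrary), the two tests can be run side by side.

Code
# One simulated comparison of the two tests on the same normal data
set.seed(42)
a <- rnorm(11, 0, 1); b <- rnorm(15, 0.8, 1)     # second sample shifted to the right
t.test(a, b, var.equal = TRUE, alternative = "less")$p.value
wilcox.test(a, b, alternative = "less")$p.value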

\(\textbf{m=4, n=7}\)

Code
## parametric counterpart :
parametricright<-function(dt,m,n,alpha=0.05,iter=1000,...){
  par_power<-c()
  locate<-seq(0,3, by = 0.1)
  for(j in 1:length(locate)){
    pr=0
    for( i in 1:iter){
      a=sampling(dist=dt,par1 = 0,par2 = 1,size = m);b=sampling(dist=dt,par1=-locate[j],par2=1,size=n)
      if( t.test(a,b,var.equal = T,alternative = "greater")$p.value < alpha)  pr=pr+1
    }
    par_power=c(par_power,pr/iter)
  }
  return(par_power)
}
parametricPlotright<-function(dis,sz1,sz2,...){
  pov=data.frame(x1=rep(seq(0,3 ,by = 0.1),2),y1=c(parametricright(dt=dis,sz1,sz2),Mann_WhitneyPowerRight(distn=dis,size1 = sz1,size2 = sz2)$b),Test=rep(c("t-test","Mann-whitney test"),each=length(seq(0,3, by = 0.1))))
  ggplot(data=pov,aes(x=x1,y=y1,col=Test))+geom_line(lwd=0.9)+
    labs(x=expression(Theta),y="Power")+ggtitle(bquote("Power comparison, "~theta==0~"  vs " ~ theta > 0 ~ "under"~.(dis) ~ "with sizes("~.(sz1)~","~.(sz2)~")"))+
    theme_stata()+geom_hline(yintercept = 0.05,linetype="dashed",lwd=0.5)
}
parametricPlotright("normal",4,7)

\(\textbf{m=11, n=15}\)

Code
parametricPlotright("normal",11,15)

\(\textbf{m=24, n=18}\)

Code
parametricPlotright("normal",24,18)

\(\textbf{m=33, n=30}\)

Code
parametricPlotright("normal",33,30)

Non-Normal Distribution

  • Here we will perform the Mann-Whitney U test, taking the CDFs of some non-Normal distributions in both hypotheses, and then force the corresponding parametric t-test on the same data.
  • We will then observe and compare the power curves and conclude that the Mann-Whitney U test produces better results than the t-test for all the non-normal distributions, in terms of power, consistency and unbiasedness.
  • We also observe that, for large enough sample sizes, the non-parametric test gives comparable results for the Logistic and Laplace distributions and better results for the Cauchy distribution, as expected from the asymptotic distribution of the Mann-Whitney statistic.

Cauchy Distribution

\(\textbf{m=4, n=7}\)

Code
parametricPlotright("cauchy",4,7)

\(\textbf{m=11, n=15}\)

Code
parametricPlotright("cauchy",11,15)

\(\textbf{m=24, n=18}\)

Code
parametricPlotright("cauchy",24,18)

\(\textbf{m=33, n=30}\)

Code
parametricPlotright("cauchy",33,30)

Laplace Distribution

\(\textbf{m=4, n=7}\)

Code
parametricPlotright("laplace",4,7)

\(\textbf{m=11, n=15}\)

Code
parametricPlotright("laplace",11,15)

\(\textbf{m=24, n=18}\)

Code
parametricPlotright("laplace",24,18)

\(\textbf{m=33, n=30}\)

Code
parametricPlotright("laplace",33,30)

Logistic Distribution

\(\textbf{m=4, n=7}\)

Code
parametricPlotright("logistic",4,7)

\(\textbf{m=11, n=15}\)

Code
parametricPlotright("logistic",11,15)

\(\textbf{m=24, n=18}\)

Code
parametricPlotright("logistic",24,18)

\(\textbf{m=33, n=30}\)

Code
parametricPlotright("logistic",33,30)

Under left sided alternative, H2: \(\theta\) < 0

Normal Distribution

  • Similarly, we perform the Mann-Whitney U test again, taking the CDFs of Normal distributions in both hypotheses, and then perform the corresponding parametric test.
  • We will then observe that the t-test produces better results than the aforementioned non-parametric test, in terms of accuracy.

\(\textbf{m=4, n=7}\)

Code
## parametric counterpart :
parametricleft<-function(dt,m,n,alpha=0.05,iter=1000,...){
  par_power<-c()
  locate<-seq(-3,0, by = 0.1)
  for(j in 1:length(locate)){
    pr=0
    for( i in 1:iter){
      a=sampling(dist=dt,par1 = 0,par2 = 1,size = m);b=sampling(dist=dt,par1=-locate[j],par2=1,size=n)
      if( t.test(a,b,var.equal = T,alternative = "less")$p.value < alpha)  pr=pr+1
    }
    par_power=c(par_power,pr/iter)
  }
  return(par_power)
}
parametricPlotleft<-function(dis,sz1,sz2,...){
  pov=data.frame(x1=rep(seq(-3,0 ,by = 0.1),2),y1=c(parametricleft(dt=dis,sz1,sz2),Mann_WhitneyPowerleft(distn=dis,size1 = sz1,size2 = sz2)$b),Test=rep(c("t-test","Mann-whitney test"),each=length(seq(0,3, by = 0.1))))
  ggplot(data=pov,aes(x=x1,y=y1,col=Test))+geom_line(lwd=0.9)+
    labs(x=expression(Theta),y="Power")+ggtitle(bquote("Power comparison, "~theta==0~"  vs " ~ theta < 0 ~ "under"~.(dis) ~ "with sizes("~.(sz1)~","~.(sz2)~")"))+
    theme_solarized()+geom_hline(yintercept = 0.05,linetype="dashed",lwd=0.5)
}
parametricPlotleft("normal",4,7)

\(\textbf{m=11, n=15}\)

Code
parametricPlotleft("normal",11,15)

\(\textbf{m=24, n=18}\)

Code
parametricPlotleft("normal",24,18)

\(\textbf{m=33, n=30}\)

Code
parametricPlotleft("normal",33,30)

Non-Normal Distribution

  • Here we will perform the Mann-Whitney U test, taking the CDFs of some non-Normal distributions in both hypotheses, and then force the corresponding parametric t-test on the same data.
  • We will then observe and compare the power curves and conclude that the Mann-Whitney U test produces better results than the t-test for all the non-normal distributions, in terms of power, consistency and unbiasedness.
  • We also observe that, for large enough sample sizes, the non-parametric test gives comparable results for the Logistic and Laplace distributions and better results for the Cauchy distribution, as expected from the asymptotic distribution of the Mann-Whitney statistic.

Cauchy Distribution

\(\textbf{m=4, n=7}\)

Code
parametricPlotleft("cauchy",4,7)

\(\textbf{m=11, n=15}\)

Code
parametricPlotleft("cauchy",11,15)

\(\textbf{m=24, n=18}\)

Code
parametricPlotleft("cauchy",24,18)

\(\textbf{m=33, n=30}\)

Code
parametricPlotleft("cauchy",33,30)

Laplace Distribution

\(\textbf{m=4, n=7}\)

Code
parametricPlotleft("laplace",4,7)

\(\textbf{m=11, n=15}\)

Code
parametricPlotleft("laplace",11,15)

\(\textbf{m=24, n=18}\)

Code
parametricPlotleft("laplace",24,18)

\(\textbf{m=33, n=30}\)

Code
parametricPlotleft("laplace",33,30)

Logistic Distribution

\(\textbf{m=4, n=7}\)

Code
parametricPlotleft("logistic",4,7)

\(\textbf{m=11, n=15}\)

Code
parametricPlotleft("logistic",11,15)

\(\textbf{m=24, n=18}\)

Code
parametricPlotleft("logistic",24,18)

\(\textbf{m=33, n=30}\)

Code
parametricPlotleft("logistic",33,30)

Under both sided alternative, H3: \(\theta\) \(\neq\) 0

Normal Distribution

  • Here again we perform the Mann-Whitney U test, taking the CDFs of Normal distributions in both hypotheses, and then perform the corresponding parametric test.
  • We will then observe that the t-test produces better results than the aforementioned non-parametric test in terms of accuracy, although asymptotically both tests produce similar results, because of the limiting normality of the U-statistic.

\(\textbf{m=4, n=7}\)

Code
## parametric counterpart :
parametricboth<-function(dt,m,n,alpha=0.05,iter=1000,...){
par_power<-c()
locate<-seq(-5,5, by = 0.1)
for(j in 1:length(locate)){
  pr=0
for( i in 1:iter){
  a=sampling(dist=dt,par1 = 0,par2 = 1,size = m);b=sampling(dist=dt,par1=locate[j],par2=1,size=n)
  if( t.test(a,b,var.equal = T)$p.value < alpha)  pr=pr+1
}
  par_power=c(par_power,pr/iter)
}
return(par_power)
}
parametricPlot<-function(dis,sz1,sz2,...){
pov=data.frame(x1=rep(seq(-5,5, by = 0.1),2),y1=c(parametricboth(dt=dis,sz1,sz2),Mann_WhitneyPowerboth(distn=dis,size1 = sz1,size2 = sz2,rge1 = -5,rge2 = 5)$b),Test=rep(c("t-test","Mann-whitney test"),each=length(seq(-5,5, by = 0.1))))
ggplot(data=pov,aes(x=x1,y=y1,col=Test))+geom_line(lwd=0.9)+
  labs(x=expression(Theta),y="Power")+ggtitle(bquote("Power comparison"~ theta != 0 ~ "under"~.(dis) ~ "with sizes ("~.(sz1)~","~.(sz2)~")"))+
  theme_economist()+geom_hline(yintercept = 0.05,linetype="dashed",lwd=0.5)
}
parametricPlot(dis="normal",sz1=4,sz2=7)

\(\textbf{m=11, n=15}\)

Code
parametricPlot(dis="normal",sz1=11,sz2=15)

\(\textbf{m=24, n=18}\)

Code
parametricPlot(dis="normal",sz1=24,sz2=18)

\(\textbf{m=33, n=30}\)

Code
parametricPlot(dis="normal",sz1=33,sz2=30)

Non-Normal Distribution

  • Here we will perform the Mann-Whitney U test, taking the CDFs of some non-Normal distributions in both hypotheses, and then force the corresponding parametric t-test on the same data.
  • We will then observe and compare the power curves and conclude that the Mann-Whitney U test produces better results than the t-test for all the non-normal distributions, in terms of power, consistency and unbiasedness.
  • We also observe that, for large enough sample sizes, the non-parametric test gives comparable results for the Logistic and Laplace distributions and better results for the Cauchy distribution, as expected from the asymptotic distribution of the Mann-Whitney statistic.

Cauchy Distribution

\(\textbf{m=4, n=7}\)

Code
parametricPlot(dis="cauchy",sz1=4,sz2=7)

\(\textbf{m=11, n=15}\)

Code
parametricPlot(dis="cauchy",sz1=11,sz2=15)

\(\textbf{m=24, n=18}\)

Code
parametricPlot(dis="cauchy",sz1=24,sz2=18)

\(\textbf{m=33, n=30}\)

Code
parametricPlot(dis="cauchy",sz1=33,sz2=30)

Laplace Distribution

\(\textbf{m=4, n=7}\)

Code
parametricPlot(dis="laplace",sz1=4,sz2=7)

\(\textbf{m=11, n=15}\)

Code
parametricPlot(dis="laplace",sz1=11,sz2=15)

\(\textbf{m=24, n=18}\)

Code
parametricPlot(dis="laplace",sz1=24,sz2=18)

\(\textbf{m=33, n=30}\)

Code
parametricPlot(dis="laplace",sz1=33,sz2=30)

Logistic Distribution

\(\textbf{m=4, n=7}\)

Code
parametricPlot(dis="logistic",sz1=4,sz2=7)

\(\textbf{m=11, n=15}\)

Code
parametricPlot(dis="logistic",sz1=11,sz2=15)

\(\textbf{m=24, n=18}\)

Code
parametricPlot(dis="logistic",sz1=24,sz2=18)

\(\textbf{m=33, n=30}\)

Code
parametricPlot(dis="logistic",sz1=33,sz2=30)

UNBIASED?
CHECK POWER CURVES varying level of significance

Power curves for different \(\alpha\)’s

Now we plot the power curves for the two-sided alternative H3: \(\theta\) \(\neq\) 0 at different levels of significance and observe that the test is unbiased. Similarly, the test can be shown to be unbiased for the one-sided alternatives.

Normal Distribution

\(\textbf{m=11, n=15}\)

Code
Mann_WhitneySize<-function(distn,size1,size2,alpha=c(0.025,0.05,0.1),iter=1000,rge1=-3,rge2=3,...){
  effect_sizes <- seq(rge1,rge2, by = 0.1);puti=c()
  for(k in 1:length(alpha)){
    power_values=c()
  for(i in 1:length(effect_sizes)){
    reject<-0
    for(j in 1:iter){
      sample1=sampling(dist=distn,par1=0,par2=1,size=size1);sample2=sampling(dist=distn,par1=(0+effect_sizes[i]),par2=1,size=size2)
      if(wilcox.test(sample1,sample2)$p.value <alpha[k]) reject=reject+1
    }
    power_values=c(power_values,reject/iter)
  }
  puti=c(puti,power_values)
  }
  return(list(a=effect_sizes,b=puti))
  }
sizeplot<-function(dt,sz1,sz2,...){
pov=data.frame(x1= rep(seq(-3,3, by = 0.1),3),y1=Mann_WhitneySize(distn=dt,size1 = sz1,size2 = sz2)$b,
                    alpha=rep(c("0.025","0.05","0.1"),each=length(seq(-3,3, by = 0.1))))
ggplot(data=pov,aes(x=x1,y=y1,col=alpha))+geom_line(lwd=0.9)+
  labs(x=expression(Theta),y="Power")+geom_hline(yintercept = 0.05,type="dashed",col="green",lwd=0.8)+
  geom_hline(yintercept = 0.025,type="dashed",col="orange",lwd=0.8)+geom_hline(yintercept = 0.1,type="dashed",col="skyblue",lwd=0.8)+
  ggtitle(bquote("Power comparison"~theta !=0~"under"~.(dt)~ "with sizes ("~.(sz1)~" ,"~.(sz2)~" )"))+
  theme_economist()
}
sizeplot(dt="normal",11,15)

Cauchy Distribution

\(\textbf{m=11, n=15}\)

Code
sizeplot(dt="cauchy",11,15)

Laplace Distribution

\(\textbf{m=11, n=15}\)

Code
sizeplot(dt="laplace",11,15)

Logistic Distribution

\(\textbf{m=11, n=15}\)

Code
sizeplot(dt="logistic",11,15)

ROBUST?

Mixing noise from Uniform distribution and showing robustness

  • Here we generate data from some standard continuous distributions, namely Normal, Cauchy, Laplace and Logistic, with added noise from Uniform(-4,4), and observe the behaviour of the Mann-Whitney statistic when the data contain outlying points.
  • We plot the power curves and compare the non-parametric and parametric setups for the two-sided alternative H3: \(\theta\) \(\neq\) 0 and different sample sizes, and observe that the test is robust. Similarly, the test can be shown to be robust for the one-sided alternatives. A sketch of a fully contaminated Mann-Whitney power computation is given after this list.
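
The helper below is a hypothetical sketch (not part of the original simulation) of how the same Uniform(-4,4) contamination could be applied to both samples inside the Mann-Whitney simulation loop, so that both tests see contaminated data.

Code
# Sketch (assumed helper): Mann-Whitney power at a single shift theta when both
# samples are contaminated with Uniform(-4,4) noise
MW_power_noisy <- function(dt, m, n, theta, alpha = 0.05, iter = 1000) {
  mean(replicate(iter, {
    a <- sampling(dist = dt, par1 = 0,     par2 = 1, size = m) + runif(m, -4, 4)
    b <- sampling(dist = dt, par1 = theta, par2 = 1, size = n) + runif(n, -4, 4)
    wilcox.test(a, b)$p.value < alpha
  }))
}
# e.g. MW_power_noisy("normal", 11, 15, theta = 1)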

Normal Distribution

\(\textbf{m=4, n=7}\)

Code
robustboth<-function(dt,m,n,alpha=0.05,iter=1000,...){
  par_power<-c()
  locate<-seq(-3,3, by = 0.1)
  for(j in 1:length(locate)){
    pr=0
    for( i in 1:iter){
      a=sampling(dist=dt,par1 = 0,par2 = 1,size = m)+runif(m,-4,4);b=sampling(dist=dt,par1=locate[j],par2=1,size=n)+runif(n,-4,4)
      if( t.test(a,b,var.equal = T)$p.value < alpha)  pr=pr+1
    }
    par_power=c(par_power,pr/iter)
  }
  return(par_power)
}
robustPlot<-function(dis,sz1,sz2,...){
  # Note: the contaminated samples enter only the t-test power curve (robustboth);
  # the Mann-Whitney power curve below is computed from uncontaminated samples
  pov=data.frame(x1=rep(seq(-3,3, by = 0.1),2),y1=c(robustboth(dt=dis,sz1,sz2),Mann_WhitneyPowerboth(distn=dis,size1 = sz1,size2 = sz2)$b),Test=rep(c("t-test","Mann-whitney test"),each=length(seq(-3,3, by = 0.1))))
  ggplot(data=pov,aes(x=x1,y=y1,col=Test))+geom_line(lwd=0.9)+
    labs(x=expression(Theta),y="Power")+ggtitle(bquote("Power comparison, " ~ theta !=0 ~ " under"~.(dis) ~ "with sizes ("~.(sz1)~","~.(sz2)~")"))+
    theme_economist()+geom_hline(yintercept = 0.05,linetype="dashed",lwd=0.5)
}
robustPlot(dis="normal",sz1=4,sz2=7)

\(\textbf{m=11, n=15}\)

Code
robustPlot(dis="normal",sz1=11,sz2=15)

\(\textbf{m=24, n=18}\)

Code
robustPlot(dis="normal",sz1=24,sz2=18)

Cauchy Distribution

\(\textbf{m=4, n=7}\)

Code
robustPlot(dis="cauchy",sz1=4,sz2=7)

\(\textbf{m=11, n=15}\)

Code
robustPlot(dis="cauchy",sz1=11,sz2=15)

\(\textbf{m=24, n=18}\)

Code
robustPlot(dis="cauchy",sz1=24,sz2=18)

Laplace Distribution

\(\textbf{m=4, n=7}\)

Code
robustPlot(dis="laplace",sz1=4,sz2=7)

\(\textbf{m=11, n=15}\)

Code
robustPlot(dis="laplace",sz1=11,sz2=15)

\(\textbf{m=24, n=18}\)

Code
robustPlot(dis="laplace",sz1=24,sz2=18)

Logistic Distribution

\(\textbf{m=4, n=7}\)

Code
robustPlot(dis="logistic",sz1=4,sz2=7)

\(\textbf{m=11, n=15}\)

Code
robustPlot(dis="logistic",sz1=11,sz2=15)

\(\textbf{m=24, n=18}\)

Code
robustPlot(dis="logistic",sz1=24,sz2=18)

CONCLUSION

Observing the power plots, it can be concluded that the non-parametric test procedure brings forth better results, in terms of power and consistency, than the conventional parametric t-test. Although the parametric test gives somewhat better results for the Normal distribution and for larger samples, overall the Mann-Whitney test performs well whatever the underlying distribution of the generated data.

POWER COMPARISON AMONG DIFFERENT DISTRIBUTIONS for fixed \(\theta\), fixed sample sizes

Under which distributional assumption, Mann-Whitney test performs the best?

Here we fix m=24, n=18 and then compare the power of the test for different distributions, while fixing the value of \(\theta\) in each of the cases.

\(\theta\)=0.5

Code
## power for diff dist
Nonparametricpower=function(m=24,n=18,level=0.05,iter=1000,thet,...)
{ x1=0;x2=0;x3=0;x4=0
  for( i in 1:iter){
    a1=sampling(dist="normal",0,1,size=m);b1=sampling(dist="normal",thet,1,size=m)
    a2=sampling(dist="cauchy",0,1,size=m);b2=sampling(dist="cauchy",thet,1,size=m)
    a3=sampling(dist="laplace",0,1,size=m);b3=sampling(dist="laplace",thet,1,size=m)
    a4=sampling(dist="logistic",0,1,size=m);b4=sampling(dist="logistic",thet,1,size=m)
    if(wilcox.test(a1,b1)$p.value <level) x1=x1+1      
    if(wilcox.test(a2,b2)$p.value <level) x2=x2+1    
    if(wilcox.test(a3,b3)$p.value <level) x3=x3+1    
    if(wilcox.test(a4,b4)$p.value <level) x4=x4+1    
  }
table=data.frame(power=c(x1,x2,x3,x4)/iter,distribution=c("normal","cauchy","laplace","logistic"))
ggplot(data=table,aes(x=factor(distribution),y=power,group=1))+geom_point(col="purple",size=3)+geom_line(lwd=0.9,col="violet")+
  labs(x="Distribution",y="Power")+scale_y_continuous(limits=c(0,1))+
  ggtitle(bquote("Power comparisons for different distributions at  " ~theta==~"" ~.(thet)))+
  theme_economist()
}
Nonparametricpower(thet=0.5)

\(\theta\)=1

Code
Nonparametricpower(thet=1)

\(\theta\)=2

Code
Nonparametricpower(thet=2)

\(\theta\)=0

Code
Nonparametricpower(thet=0)

CONCLUSION

  • We observe that, in terms of the power of the non-parametric test, the distributions rank (in decreasing order) as Normal, Laplace, Logistic, Cauchy.
  • A possible justification is that the tails, in decreasing order of heaviness, rank as Cauchy, Laplace, Logistic, Normal, while the peaks rank as Laplace, Normal, Logistic, Cauchy.
  • Also, the test produces similar results across distributions at \(\theta\) = 0 and for very large values of \(\theta\), which is quite justified.

EFFICACY OF HODGES-LEHMANN ESTIMATOR

Comparison of MSEs between Hodges-Lehmann estimator and Maximum Likelihood Estimator

  • Under the aforementioned setup, the Hodges-Lehmann estimator can easily be obtained for estimating the location shift: in the two-sample problem it is the median of the pairwise differences Yj - Xi, and in the corresponding one-sample (symmetric) location problem it is the median of the Walsh averages; a small sketch follows below.
  • Since the MLEs for both the Logistic and Cauchy distributions do not have a closed form, we compare the performance of the Hodges-Lehmann estimator with that of the Maximum Likelihood Estimator. The motivation behind this comparison is to check whether the Hodges-Lehmann estimator performs comparably to or better than the MLE; if so, we may prefer it, because the Hodges-Lehmann estimator has an explicit form and can be obtained easily.
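
For concreteness, here is a minimal sketch of the two-sample Hodges-Lehmann shift estimate (the median of the pairwise differences) on simulated data; the samples, seed and shift below are purely illustrative.

Code
# Two-sample Hodges-Lehmann estimate of the shift theta: median of y_j - x_i
set.seed(1)
x <- rnorm(24, 0, 1)               # sample from F
y <- rnorm(18, 0.5, 1)             # sample from G = F(. - theta), theta = 0.5
median(outer(y, x, "-"))           # Hodges-Lehmann shift estimate
# wilcox.test with conf.int = TRUE reports essentially the same shift estimate
wilcox.test(y, x, conf.int = TRUE)$estimate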

Normal Distribution

\(\textbf{m=24, n=18}\)

Code
library(ggplot2)

# Function to calculate MLE for mean parameter of normal distribution
mle_mean <- function(data) {
  mean(data)
}

# Function to calculate the Hodges-Lehmann estimator (median of the Walsh averages)
hl_walsh <- function(data) {
  walsh <- outer(data, data, "+") / 2            # pairwise averages (x_i + x_j)/2
  median(walsh[lower.tri(walsh, diag = TRUE)])   # median over all pairs i <= j
}

# Function to calculate mean squared error (MSE)
mse <- function(estimates, true_value) {
  mean((estimates - true_value)^2)
}

# Simulation parameters
set.seed(123)
sample_sizes <- seq(10, 1000, by = 10)
true_mean <- 0  # True mean of normal distribution

# Simulate data and calculate MSE for each estimator
mse_results <- data.frame(
  Sample_Size = numeric(),
  MLE = numeric(),
  Hodges_Lehmann = numeric()
)

# One simulated dataset per sample size: each point is a single squared error,
# not a Monte-Carlo average, so the curves are somewhat noisy
for (n in sample_sizes) {
  data <- rnorm(n, true_mean)
  mle_estimate <- mle_mean(data)
  hl_estimate <- hl_walsh(data)
  mse_mle <- mse(mle_estimate, true_mean)
  mse_hl <- mse(hl_estimate, true_mean)
  mse_results <- rbind(mse_results, data.frame(Sample_Size = n, MLE = mse_mle, Hodges_Lehmann = mse_hl))
}

# Plot MSE comparison
ggplot(mse_results, aes(x = Sample_Size)) +
  geom_line(aes(y = MLE, color = "MLE")) +
  geom_line(aes(y = Hodges_Lehmann, color = "Hodges-Lehmann")) +
  scale_color_manual(values = c("red", "blue")) +
  labs(title = "MSE Comparison of MLE and Hodges-Lehmann Estimator",
       x = "Sample Size", y = "Mean Squared Error") +
  theme_minimal()

Cauchy Distribution

\(\textbf{m=24, n=18}\)

Code
cauchy.mle <- function(x,start,eps=1.e-8,max.iter=50){
  if (missing(start)) start <- median(x)
  theta <- start
  n <- length(x)
  score <- sum(2*(x-theta)/(1+(x-theta)^2))
  iter <- 1
  conv <- T
  while (abs(score)>eps && iter<=max.iter){
    info <- sum((2-2*(x-theta)^2)/(1+(x-theta)^2)^2)
    theta <- theta + score/info
    iter <- iter + 1
    score <- sum(2*(x-theta)/(1+(x-theta)^2))
  }
  if (abs(score)>eps) {
    print("No Convergence")
    conv <- F
  }
  loglik <- -sum(log(1+(x-theta)^2))
  info <- sum((2-2*(x-theta)^2)/(1+(x-theta)^2)^2)
  r <- list(theta=theta,loglik=loglik,info=info,convergence=conv)
  r
}
MSEcauchyPlot<-function(size=42,ran,iter=50,...){
  range=seq(-ran,ran,by=.05)
  mse1=c();mse2=c()
  for(i in 1:length(range)){
    mle=c()
    hle=c()
    for( j in 1:iter){
      data=rcauchy(size,range[i],1)
      mle=c(mle,(cauchy.mle(data)$theta-range[i])^2)
      hle=c(hle,(hl_walsh(data)-range[i])^2)   # Hodges-Lehmann: median of Walsh averages
    }
    mse1=c(mse1,mean(mle));mse2=c(mse2,mean(hle))   # Monte-Carlo MSEs over iter replications
  }
  estimate=data.frame(theta=c(range,range),est=c(mse1,mse2),estimator=rep(c("MLE","HLE"),each=length(range)))
  return(ggplot(data=estimate,aes(x=theta,y=est,col=estimator))+
       geom_line(lwd=1)+labs(x=expression(theta),y="MSE")+scale_y_continuous(limits = c(0,0.3))+
     ggtitle(bquote("Comparison between MLE & HLE for Cauchy("~theta~",1)"))+
      theme_stata())
}
MSEcauchyPlot(ran=1)

Logistic Distribution

\(\textbf{m=24, n=18}\)

Code
library(EnvStats)
MSElogisPlot<-function(size=42,ran,iter=1000,...){
  range=seq(-ran,ran,by=.1)
  mse1=c();mse2=c()
  for(i in 1:length(range)){
    mle=c()
    hle=c()
    for( j in 1:iter){
      data=rlogis(size,range[i],1)
      mle=c(mle,(elogis(data,method = 'mle')$parameters[1]-range[i])^2)
      hle=c(hle,(hl_walsh(data)-range[i])^2)   # Hodges-Lehmann: median of Walsh averages
    }
    mse1=c(mse1,mean(mle));mse2=c(mse2,mean(hle))   # Monte-Carlo MSEs over iter replications
  }
  estimate=data.frame(theta=c(range,range),est=c(mse1,mse2),estimator=rep(c("MLE","HLE"),each=length(range)))
  return(ggplot(data=estimate,aes(x=theta,y=est,col=estimator))+
           geom_line(lwd=1)+labs(x=expression(theta),y="MSE")+scale_y_continuous(limits = c(0,0.3))+
           ggtitle(bquote("Comparison between MLE & HLE for Logistic("~theta~",1)"))+
           theme_minimal())
}
MSElogisPlot(ran=1)

CONCLUSION

  • Observing the MSE plots, it can be concluded that the Hodges-Lehmann estimator performs very well for the Normal distribution. Besides, it is also consistent, and hence the Hodges-Lehmann estimator may be preferred over the MLE for these distributions.
  • But for the Cauchy and Logistic distributions, due to the inefficiency of the ‘optim’ and ‘nleqslv’ functions in R, the comparison of the MSEs of the MLE and the Hodges-Lehmann estimator gives us absurd-looking results. Alternatively, we tried to obtain the MLE by solving the score equations directly, but that approach brings forth similar results, and hence for the Logistic distribution we could not draw a valid conclusion regarding the MSE comparison.

THANK YOU!