Two-sample t-test with Mean and Median difference

Based on the fictitious data set

Treatment 1	10	15	50
Treatment 2	12	17	19

(a) Carry out the two-sample t-test to investigate whether the mean of treatment 1 is significantly larger than the mean of treatment 2.

Hypotheses Testing

Here, we assume a significance level of $\alpha$ = 0.05. Using the fictitious data set for treatments 1 and 2, we carry out an upper-tailed hypothesis test to investigate whether the mean of treatment 1 is significantly larger than the mean of treatment 2.

Hypothesis:

Null Hypothesis, $H_o$, $\mu_{x}$ = $\mu_{y}$

Alternative Hypothesis, $H_a$, $\mu_{x}$ > $\mu_{y}$

where,

Mean of treatment 1 : $\mu_{x}$

Mean of treatment 2 : $\mu_{y}$

Assumption

If we can safely assume that the data in each group follow a normal distribution, we can use a two-sample t-test to compare the means of random samples drawn from the two populations. The two-sample t-test relies on the following assumptions:

The data are continuous (not discrete).
The data follow a normal probability distribution.
The variances of the two populations are equal (required by the pooled-variance form of the test; the Welch form does not assume it).
The two samples are independent of each other, and the observations within each sample are independent and identically distributed.
Both samples are simple random samples from their respective populations. Each individual in the population has an equal probability of being selected in the sample.

Test Statistic

Treatment1 <- c(10,15,50)
Treatment2 <- c(12,17,19)
#Standard Deviation of two treatment
S_x <- sd(Treatment1)
S_x

## [1] 21.79449

S_y <- sd(Treatment2)
S_y

## [1] 3.605551

# The two-sample t-test
t.test(Treatment1,Treatment2, alternative = "greater", var.equal = FALSE)

## 
##  Welch Two Sample t-test
## 
## data:  Treatment1 and Treatment2
## t = 0.70566, df = 2.1094, p-value = 0.2751
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  -26.95749       Inf
## sample estimates:
## mean of x mean of y 
##        25        16

Here,

Mean value for treatment 1, $\bar{X}$ = 25

Mean value for treatment 2, $\bar{Y}$ = 16

Number of samples in treatment 1, $n_1$ = 3

Number of samples in treatment 2, $n_2$ = 3

Standard deviation for treatment 1, $S_x$ = 21.794

Standard deviation for treatment 2, $S_y$ = 3.6055

Mean difference, $\Delta$ = $\bar{X}- \bar{Y}= 25-16 = 9$

First, we calculate the pooled variance:

\[S_p^2 = \frac{(n_1 - 1)S_x^2 + (n_2 - 1)S_y^2} {n_1 + n_2 - 2}\]

\[S_p^2 = \frac{(3 - 1)21.794^2 + (3 - 1)3.6055^2} {3 + 3 - 2}= 243.9890 \]

Next, we take the square root of the pooled variance to get the pooled standard deviation:

\[S_p = \sqrt{243.9890} = 15.62014\]

We now have all the pieces for our test statistic: the difference of the sample means, the pooled standard deviation, and the sample sizes. We calculate the test statistic as follows:

\[T_{obs} = \frac{\bar{X} - \bar{Y}} {S_{p}\sqrt{1/n_{1} + 1/n_{2}}}= \frac{25-16} {15.62\sqrt{1/3 + 1/3}}= 0.7056788\]

The hand-calculated test statistic, 0.7056788, matches the value t = 0.70566 reported by t.test in the R output. The two agree because the sample sizes are equal, in which case the pooled and Welch test statistics coincide.
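As a check on the arithmetic, the pooled-variance calculation can be reproduced in R. This is only a minimal sketch; the variable names (sp2, t_obs) are chosen here for illustration and are not part of the original analysis.

```r
# Hand calculation of the pooled-variance t statistic (illustrative names)
Treatment1 <- c(10, 15, 50)
Treatment2 <- c(12, 17, 19)
n1 <- length(Treatment1)
n2 <- length(Treatment2)

# pooled variance: weighted average of the two sample variances
sp2 <- ((n1 - 1) * var(Treatment1) + (n2 - 1) * var(Treatment2)) / (n1 + n2 - 2)

# test statistic: mean difference divided by its pooled standard error
t_obs <- (mean(Treatment1) - mean(Treatment2)) / (sqrt(sp2) * sqrt(1 / n1 + 1 / n2))
print(c(pooled_variance = sp2, t_obs = t_obs))
```

Without intermediate rounding the pooled variance is 244, so the small gap to 243.9890 comes only from rounding the standard deviations before squaring them.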

p-value

From the R code, the Welch test gives p-value = 0.2751 with df = 2.1094. For the pooled-variance calculation above, the degrees of freedom are df = 3+3-2 = 4, and the p-value is calculated as follows:

\[p\text{-value} = P (T > T_{obs}) = P (T > 0.70567) = 1 - P (T \le 0.70567) = 0.2597\]

The R code for calculating the p-value:

#Degrees of freedom, df= 3+3-2=4
p_value <- 1-pt(0.7056788,df = 4)
p_value

## [1] 0.2596584

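So, the Welch two-sample t-test gives p-value = 0.2751 (df = 2.1094), and the pooled-variance calculation gives p-value = 0.2596584 (df = 4). The test statistic is the same in both; the p-values differ only because of the degrees of freedom. In both cases, the p-value is greater than the significance level.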
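The difference between the two p-values can be traced to the degrees of freedom. The sketch below computes the Welch-Satterthwaite degrees of freedom by hand and evaluates the same statistic against both reference distributions; the names v1, v2, df_welch and df_pooled are illustrative and do not appear in the original code.

```r
# Welch-Satterthwaite degrees of freedom versus pooled degrees of freedom
Treatment1 <- c(10, 15, 50)
Treatment2 <- c(12, 17, 19)
v1 <- var(Treatment1) / length(Treatment1)  # squared standard error, group 1
v2 <- var(Treatment2) / length(Treatment2)  # squared standard error, group 2

t_obs <- (mean(Treatment1) - mean(Treatment2)) / sqrt(v1 + v2)
df_welch <- (v1 + v2)^2 /
  (v1^2 / (length(Treatment1) - 1) + v2^2 / (length(Treatment2) - 1))
df_pooled <- length(Treatment1) + length(Treatment2) - 2

# upper-tailed p-values for the same statistic under the two t distributions
p_welch <- pt(t_obs, df = df_welch, lower.tail = FALSE)
p_pooled <- pt(t_obs, df = df_pooled, lower.tail = FALSE)
print(c(df_welch = df_welch, p_welch = p_welch, p_pooled = p_pooled))
```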

Decision about Null Hypothesis

We fail to reject the null hypothesis because the p-value = 0.2751 > 0.05.

Conclusion

We conclude that there is no strong evidence that the mean of treatment 1 is larger than the mean of treatment 2.

(b) Let the test statistic D be the difference in sample means between the two treatment groups. Find the permutation distribution of D under the null hypothesis that the two populations have the same distribution, and use the permutation test to find the p-value for the observed data (use the same hypotheses as in part (a)).

#install.packages("gtools")
library(gtools)

# using data
G <- c(10,15,50,12,17,19)
idx = combinations(n=6, r=3)
A <- c(10,15,50)
B <- c(12,17,19)
AB = c(A,B) # the combined data set
permut = NULL # the permuted data set (a 20*6 matrix)
for(i in 1:20){
permut = rbind(permut, c(AB[idx[i,]], AB[-idx[i,]]))
}
permut.A = permut[, 1:3] # the permuted A matrix (20*3)
permut.B = permut[, 4:6] # the permuted B matrix (20*3)
#Mean difference after permutation of two treatments
delta1 = apply(permut.A, 1, mean) - apply(permut.B, 1, mean)
delta1

##  [1]   9.00000 -16.33333 -13.00000 -11.66667   7.00000  10.33333  11.66667
##  [8] -15.00000 -13.66667 -10.33333  10.33333  13.66667  15.00000 -11.66667
## [15] -10.33333  -7.00000  11.66667  13.00000  16.33333  -9.00000

TS <- combinations(n=6, r=3, v=G, set=FALSE, repeats.allowed=FALSE)
TS <- cbind(TS,delta1)
TS[order(TS[,4]),]

##                   delta1
##  [1,] 10 15 12 -16.33333
##  [2,] 10 12 17 -15.00000
##  [3,] 10 12 19 -13.66667
##  [4,] 10 15 17 -13.00000
##  [5,] 10 15 19 -11.66667
##  [6,] 15 12 17 -11.66667
##  [7,] 10 17 19 -10.33333
##  [8,] 15 12 19 -10.33333
##  [9,] 12 17 19  -9.00000
## [10,] 15 17 19  -7.00000
## [11,] 10 50 12   7.00000
## [12,] 10 15 50   9.00000
## [13,] 10 50 17  10.33333
## [14,] 15 50 12  10.33333
## [15,] 10 50 19  11.66667
## [16,] 50 12 17  11.66667
## [17,] 50 12 19  13.00000
## [18,] 15 50 17  13.66667
## [19,] 15 50 19  15.00000
## [20,] 50 17 19  16.33333

Diff	-16.33	-15.00	-13.67	-13.00	-11.67	-10.33	-9.00	-7.00	7.00	9.00	10.33	11.67	13.00	13.67	15.00	16.33
Freq	1	1	1	1	2	2	1	1	1	1	2	2	1	1	1	1
Prob	0.05	0.05	0.05	0.05	0.10	0.10	0.05	0.05	0.05	0.05	0.10	0.10	0.05	0.05	0.05	0.05

#p-value calculation for upper tailed test
delta1.obs = mean(A)-mean(B)
#pvalue for permutation of sample mean
pval1.upper = mean(delta1 >= delta1.obs) #upper-tailed
pval1.upper

## [1] 0.45

Since the p-value 0.45 > 0.05, we fail to reject the null hypothesis and conclude that, based on the permutation test, there is no significant difference between the means of the two treatments.
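The same enumeration can also be written with base R only. The sketch below uses combn() in place of gtools::combinations(); the names idx, D, D_obs and p_upper are illustrative, and the small tolerance is an added safeguard against floating-point ties rather than part of the original code.

```r
# Exact permutation test for the mean difference using base R's combn()
A <- c(10, 15, 50)
B <- c(12, 17, 19)
AB <- c(A, B)

# each column of idx holds the positions assigned to treatment 1
idx <- combn(length(AB), length(A))
D <- apply(idx, 2, function(i) mean(AB[i]) - mean(AB[-i]))
D_obs <- mean(A) - mean(B)

# small tolerance guards against floating-point ties at the observed value
p_upper <- mean(D >= D_obs - 1e-12)
print(p_upper)
```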

Use the permutation test to find the p-value for the observed data (use the same hypotheses as in part (a))

Answer

rand.perm = function(x, y, R, alternative = c("two.sided", "less", "greater"), stat=c("meandiff", "mediandiff", "trmdiff", "sumX"), trim=0)
{
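  #stat="meandiff": mean difference between two groups
  #stat="mediandiff": median difference between two groups
  #stat="trmdiff": trimmed mean difference between two groups, trim=0.1 trims 10% of the observations from each end.
  #stat="sumX": sum of observations from the x group
  
  # resolve the choice vectors to a single value (defaults to the first option)
  alternative = match.arg(alternative)
  stat = match.arg(stat)
  m = length(x)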
  n = length(y)
  N = m+n
  xy = c(x, y)
  permut = NULL
  for(i in 1:R)
  {
    idx = sample(1:N, replace=FALSE)
    permut = rbind(permut, c(xy[idx[1:m]], xy[idx[-(1:m)]]))
  }
  if(stat=="meandiff") trim=0
  if(stat %in% c("meandiff", "trmdiff"))
  {
    D = apply(permut, 1, function(x) -mean(x[-(1:m)], trim)+mean(x[1:m], trim))
    Dobs = mean(x, trim) - mean(y, trim)
  }
  if(stat=="sumX")
  {
    D = apply(permut, 1, function(x) sum(x[1:m]))
    Dobs = sum(x)
  }
  if(stat=="mediandiff")
  {
    D = apply(permut, 1, function(x) -median(x[-(1:m)])+median(x[1:m]))
    Dobs = -median(y) + median(x)
  }
  if(alternative=="greater")
    pval = mean(D >= Dobs)
  if(alternative=="less")
    pval = mean(D <= Dobs)
  if(alternative=="two.sided") 
    pval = mean(abs(D) >= abs(Dobs))
  
  hist(D, main="Hist of D under the null")
  abline(v=Dobs, col=2)
  
  return(list(pval=pval, Dobs=Dobs))
}

# take a look at the histogram of the distribution 
#  of D under the null hypothesis
par(mfrow=c(1,1))

#random permutation using function rand.perm
rand.perm(Treatment1,Treatment2, R=1000, alternative = "greater", stat= "meandiff")

## $pval
## [1] 0.472
## 
## $Dobs
## [1] 9
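Because rand.perm draws its permutations at random, the estimated p-value changes from run to run. A minimal sketch of a reproducible version is given below; the seed value 2021 is arbitrary, and the compact replicate() form is an illustrative alternative, not the function used above.

```r
# Monte Carlo permutation test with a fixed seed (the seed value is arbitrary)
set.seed(2021)
x <- c(10, 15, 50)
y <- c(12, 17, 19)
xy <- c(x, y)
m <- length(x)

# draw R random relabelings and recompute the mean difference for each
R <- 1000
D <- replicate(R, {
  idx <- sample(length(xy))
  mean(xy[idx[1:m]]) - mean(xy[idx[-(1:m)]])
})
D_obs <- mean(x) - mean(y)
p_mc <- mean(D >= D_obs - 1e-12)
print(p_mc)
```

With only 20 distinct splits available, the estimate should settle near the exact value 0.45 as R grows.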

From part(a):

Hypotheses Testing

Null Hypothesis, $H_o$, $\mu_{x}$ = $\mu_{y}$

Alternative Hypothesis, $H_a$, $\mu_{x}$ > $\mu_{y}$

where,

Mean of treatment 1 : $\mu_{x}$

Mean of treatment 2 : $\mu_{y}$

Test Statistic

Mean difference, $\Delta$ = $\bar{X}- \bar{Y}= 25-16 = 9$

In the histogram produced by the function, the red vertical line marks the observed test statistic, the sample mean difference between the two treatment groups, D = 9.

p-value

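From the output above, the estimated p-value is 0.472 at our significance level of $\alpha$ = 0.05. Because the R = 1000 permutations are sampled at random, this estimate varies from run to run around the exact value 0.45 obtained from the full enumeration.

Decision about Null Hypothesis

We fail to reject the null hypothesis because the p-value = 0.472 > 0.05.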

Conclusion

Based on the permutation test, we conclude that there is no significant difference between the means of the two treatments.

(c) Find the permutation distribution of the sum of the observations from treatment 1, and show that the p-value for the observed data is the same as the p-value in part (b).

From part(a):

Hypotheses Testing

Null Hypothesis, $H_o$, $\mu_{x}$ = $\mu_{y}$

Alternative Hypothesis, $H_a$, $\mu_{x}$ > $\mu_{y}$

where,

Mean of treatment 1 : $\mu_{x}$

Mean of treatment 2 : $\mu_{y}$

Test Statistic

The test statistic is the sum of observations in Treatment 1.

# using data
G1 <- c(10,15,50,12,17,19)
TS1 <- combinations(n=6, r=3, v=G1, set=FALSE, repeats.allowed=FALSE)
sumi <- vector()
for(i in 1:length(TS1[,1])){
sumi[i] <- sum(TS1[i,])
}
TS1 <- cbind(TS1,sumi)
TS1[order(TS1[,4]),]

##                sumi
##  [1,] 10 15 12   37
##  [2,] 10 12 17   39
##  [3,] 10 12 19   41
##  [4,] 10 15 17   42
##  [5,] 10 15 19   44
##  [6,] 15 12 17   44
##  [7,] 10 17 19   46
##  [8,] 15 12 19   46
##  [9,] 12 17 19   48
## [10,] 15 17 19   51
## [11,] 10 50 12   72
## [12,] 10 15 50   75
## [13,] 10 50 17   77
## [14,] 15 50 12   77
## [15,] 10 50 19   79
## [16,] 50 12 17   79
## [17,] 50 12 19   81
## [18,] 15 50 17   82
## [19,] 15 50 19   84
## [20,] 50 17 19   86

#p-value calculation for upper tailed test
A <- c(10,15,50)
sum_A = sum(A)
sum_A

## [1] 75

#pvalue for permutation of the sum of treatment 1
pval2.upper = sum(sumi >= sum_A)/20 #upper-tailed
pval2.upper

## [1] 0.45

Sum	37	39	41	42	44	46	48	51	72	75	77	79	81	82	84	86
Freq	1	1	1	1	2	2	1	1	1	1	2	2	1	1	1	1
Prob	0.05	0.05	0.05	0.05	0.10	0.10	0.05	0.05	0.05	0.05	0.10	0.10	0.05	0.05	0.05	0.05

p-value

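From the code above, we get p-value = 0.45 at our significance level of $\alpha$ = 0.05. This is the same as the exact permutation p-value for the mean difference in part (b), because the sum of treatment 1 and the mean difference order the 20 permutations identically.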

Decision about Null Hypothesis

We fail to reject null hypothesis as the p-value= 0.45 > 0.05.

Conclusion

Based on the permutation test, we conclude that there is no significant difference between the means of the two treatments.
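The agreement between the two p-values follows from a simple identity. With three observations per group and a pooled total of 123, the mean difference is D = S/3 - (123 - S)/3 = (2S - 123)/3, where S is the sum of treatment 1, so D increases exactly when S does. The sketch below checks this numerically; it uses base R's combn(), and the names S, D, p_sum and p_mean are illustrative.

```r
# The sum of treatment 1 and the mean difference order the permutations identically
A <- c(10, 15, 50)
B <- c(12, 17, 19)
AB <- c(A, B)
idx <- combn(length(AB), length(A))

S <- apply(idx, 2, function(i) sum(AB[i]))                  # sum statistic
D <- apply(idx, 2, function(i) mean(AB[i]) - mean(AB[-i]))  # mean difference

# with equal group sizes of 3, D = (2 * S - total) / 3
same_relation <- isTRUE(all.equal(D, (2 * S - sum(AB)) / 3))
p_sum <- mean(S >= sum(A))
p_mean <- mean(D >= mean(A) - mean(B) - 1e-12)
print(c(same_relation = same_relation, p_sum = p_sum, p_mean = p_mean))
```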

(d) Let the test statistic D be the difference in sample medians between the two treatments. Find the permutation distribution of D (do not simply print out the 20 permuted D values; summarize them in a frequency table). Calculate the p-value for the observed data.

Hypotheses Testing

Null Hypothesis, $H_o$, $\theta_{x}$ = $\theta_{y}$

Alternative Hypothesis, $H_a$, $\theta_{x}$ > $\theta_{y}$

where,

Median of treatment 1 : $\theta_{x}$

Median of treatment 2 : $\theta_{y}$

Test Statistic

The test statistic D is the sample median difference between the two treatments.

Median difference, $\Delta$ = $\theta_{x}- \theta_{y}$

#median difference after permutation of two treatments
delta3 = apply(permut.A, 1, median) - apply(permut.B, 1, median)
delta3

##  [1] -2 -7 -4 -2 -5  2  4 -7 -5  2 -2  5  7 -4 -2  5  2  4  7  2

delta3.obs = median(A) - median(B)
#pvalue for permutation of sample median
pval3.upper = mean(delta3 >= delta3.obs) #upper-tailed
pval3.upper

## [1] 0.7
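The question asks for a frequency table rather than the raw list of 20 values. One way to produce it is sketched below, assuming base R's combn() for the enumeration (the names freq and prob are illustrative):

```r
# Frequency table of the permutation distribution of the median difference
A <- c(10, 15, 50)
B <- c(12, 17, 19)
AB <- c(A, B)
idx <- combn(length(AB), length(A))
delta3 <- apply(idx, 2, function(i) median(AB[i]) - median(AB[-i]))

freq <- table(delta3)           # counts of each distinct value
prob <- freq / length(delta3)   # permutation probabilities
print(freq)
print(prob)
```

For these data the table is:

Diff	-7	-5	-4	-2	2	4	5	7
Freq	2	2	2	4	4	2	2	2
Prob	0.10	0.10	0.10	0.20	0.20	0.10	0.10	0.10

Adding the probabilities for D >= -2 gives 0.20 + 0.20 + 0.10 + 0.10 + 0.10 = 0.70, in agreement with the p-value above.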

p-value

From our code, depicted above: We get p-value = 0.70 at our significance level of $\alpha$ = 0.05

Decision about Null Hypothesis

We fail to reject null hypothesis as the p-value= 0.70 > 0.05.

Conclusion

Based on the permutation test, we conclude that there is no significant difference between the medians of the two treatments.

(e) Which test statistics would you suggest for analyzing this data set and why?

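For this data set we suggest the permutation test based on the median difference rather than the two-sample t-test or the permutation test based on the mean difference. Treatment 1 contains an outlying value (50) that dominates its sample mean and inflates its standard deviation, whereas the median is robust to outliers. The permutation test is also distribution-free, so it does not rely on the normality assumption, which is hard to justify with three observations per group.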

Example 02

2. The carapace lengths (in mm) of crayfish were recorded for samples from two sections of a stream in Kansas.

Section 1	5 11 16 8 12
Section 2	17 14 15 21 19 13

You can use the following R code to load the data set:

Section1=c(5,11,16,8,12)
Section2=c(17,14,15,21,19,13)

(a) Test for differences between the two sections using a permutation test (you can decide on the test statistic). Use significance level 0.05. State the null and alternative hypotheses.

The total number of ways to divide the 11 observations into one group of 5 and one group of 6 is $\binom{11}{5}$ = 462.
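The count of 462 is the binomial coefficient for choosing which 5 of the 11 pooled observations are labeled Section 1. A one-line check (the name n_perm is illustrative):

```r
# number of ways to choose which 5 of the 11 pooled observations form Section 1
n_perm <- choose(11, 5)
print(n_perm)
```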

Hypotheses Testing

Hypothesis:

Null Hypothesis, $H_o$, Median Difference, $(\theta_{x}-\theta_{y})$ = 0

Alternative Hypothesis, $H_a$, Median Difference, $(\theta_{x}-\theta_{y})$ $\neq$ 0

where,

Median of Section 1 : $\theta_{x}$

Median of Section 2 : $\theta_{y}$

Test Statistic

The test statistic is the sample median difference between the two sections.

Median difference, $\Delta$ = $\theta_{x}- \theta_{y}$

idx = combinations(n=11, r=5)
Section1=c(5,11,16,8,12)
Section2=c(17,14,15,21,19,13)
Section1Section2 = c(Section1,Section2) # the combined data set
permut = NULL # the permuted data set 
for(i in 1:462){
permut = rbind(permut, c(Section1Section2[idx[i,]], Section1Section2[-idx[i,]]))
}
permut.Section1 = permut[, 1:5] # the permuted Section 1 matrix (462*5)
permut.Section2 = permut[, 6:11] # the permuted Section 2 matrix (462*6)
#median difference after permutation of two sections
delta4 = apply(permut.Section1, 1, median) - apply(permut.Section2, 1, median)
delta4.obs = median(Section1) - median(Section2)
delta4.obs

## [1] -5

#pvalue for permutation of sample median
pval4.2sided = mean(abs(delta4) >= abs(delta4.obs)) #two-tailed
pval4.2sided

## [1] 0.03896104

From the R code, the sample median difference between the two sections is -5.

p-value

From the code above, we get p-value = 0.03896104 (18 of the 462 permutations have an absolute median difference of at least 5) at our significance level of $\alpha$ = 0.05

Decision about Null Hypothesis

We reject the null hypothesis because the p-value = 0.03896104 < 0.05.

Conclusion

Based on the permutation test, we conclude that the median carapace lengths (in mm) of crayfish differ significantly between the two sections of the stream in Kansas.

(b) Test for differences using the Wilcoxon rank-sum test. Use significance level 0.05. State the null and alternative hypotheses.

Answer

In order to perform the Wilcoxon rank-sum test, we make the following assumptions:

The two samples are independent of one another.
The two populations have equal variance.
The observations within each sample are independent and identically distributed.

Hypotheses Testing

Hypothesis:

Null Hypothesis, $H_o$, Median Difference, $(\theta_{x}-\theta_{y})$ = 0

Alternative Hypothesis, $H_a$, Median Difference, $(\theta_{x}-\theta_{y})$ $\neq$ 0

where,

Median of Section 1 : $\theta_{x}$

Median of Section 2 : $\theta_{y}$

Section1=c(5,11,16,8,12)
Section2=c(17,14,15,21,19,13)
wilcox.test(Section1, Section2, alternative="two.sided", paired=FALSE, 
            mu=0,exact=FALSE,correct=TRUE)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Section1 and Section2
## W = 3, p-value = 0.03576
## alternative hypothesis: true location shift is not equal to 0

Test Statistic

From R code, we get:

The test statistic for the Wilcoxon rank-sum test is W = 3
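The value W = 3 can be reproduced from the definition of the statistic. For two samples, wilcox.test reports the number of pairs in which the Section 1 value exceeds the Section 2 value (ties counting one half), which equals the rank sum of Section 1 minus its smallest possible value. The sketch below computes both forms; the names W_pairs and W_ranks are illustrative.

```r
# Rank-sum statistic computed directly from its definition
Section1 <- c(5, 11, 16, 8, 12)
Section2 <- c(17, 14, 15, 21, 19, 13)

# count pairs where the Section 1 value exceeds the Section 2 value (ties count 1/2)
W_pairs <- sum(outer(Section1, Section2, ">")) +
  0.5 * sum(outer(Section1, Section2, "=="))

# equivalent form: rank sum of Section 1 minus its minimum possible value
r <- rank(c(Section1, Section2))
n1 <- length(Section1)
W_ranks <- sum(r[seq_len(n1)]) - n1 * (n1 + 1) / 2
print(c(W_pairs = W_pairs, W_ranks = W_ranks))
```

Only the value 16 in Section 1 exceeds any Section 2 values (13, 14 and 15), which is where the three pairs come from.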

p-value

From our code, depicted above: We get p-value = 0.03576 at our significance level of $\alpha$ = 0.05

Decision about Null Hypothesis

We reject the null hypothesis because the p-value = 0.03576 < 0.05.

Conclusion

Based on the Wilcoxon rank-sum test, we conclude that the median carapace lengths (in mm) of crayfish differ significantly between the two sections of the stream in Kansas.

Md Mominul Islam

3/5/2021