Based on the fictitious data set
| Treatment 1 | 10 | 15 | 50 |
|---|---|---|---|
| Treatment 2 | 12 | 17 | 19 |
Carry out the two-sample t-test to investigate if the mean of treatment 1 is significantly larger than the mean of treatment 2.
Hypotheses Testing
Here, we assume a significance level of \(\alpha\) = 0.05. Using the fictitious data for treatments 1 and 2, we carry out an upper-tailed hypothesis test to investigate whether the mean of treatment 1 is significantly larger than the mean of treatment 2.
Hypothesis:
Null Hypothesis, \(H_o\), \(\mu_{x}\) = \(\mu_{y}\)
Alternative Hypothesis, \(H_a\), \(\mu_{x}\) > \(\mu_{y}\)
where,
Mean of treatment 1 : \(\mu_{x}\)
Mean of treatment 2 : \(\mu_{y}\)
Assumption
If we can safely assume that the data in each group follow a normal distribution, we can use a two-sample t-test to compare the means of random samples drawn from these two populations. The test also assumes that the two samples are independent of each other.
Test Statistic
Treatment1 <- c(10,15,50)
Treatment2 <- c(12,17,19)
#Standard Deviation of two treatment
S_x <- sd(Treatment1)
S_x
## [1] 21.79449
S_y <- sd(Treatment2)
S_y
## [1] 3.605551
# The two-sample t-test
t.test(Treatment1,Treatment2, alternative = "greater", var.equal = FALSE)
##
## Welch Two Sample t-test
##
## data: Treatment1 and Treatment2
## t = 0.70566, df = 2.1094, p-value = 0.2751
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## -26.95749 Inf
## sample estimates:
## mean of x mean of y
## 25 16
Here,
Mean value for treatment 1, \(\bar{X}\) = 25
Mean value for treatment 2, \(\bar{Y}\) = 16
Number of samples in treatment 1, \(n_1\) = 3
Number of samples in treatment 2, \(n_2\) = 3
Standard deviation for treatment 1, \(S_x\) = 21.794
Standard deviation for treatment 2, \(S_y\) = 3.6055
Mean difference, \(\Delta\) = \(\bar{X}- \bar{Y}= 25-16 = 9\)
First, we calculate the pooled variance:
\[S_p^2 = \frac{(n_1 - 1)S_x^2 + (n_2 - 1)S_y^2} {n_1 + n_2 - 2}\]
\[S_p^2 = \frac{(3 - 1)21.794^2 + (3 - 1)3.6055^2} {3 + 3 - 2}= 243.9890 \]
Next, we take the square root of the pooled variance to get the pooled standard deviation:
\[S_p = \sqrt{243.9890} = 15.62014\]
We now have all the pieces for our test statistic: the difference of the averages, the pooled standard deviation and the sample sizes. We calculate the test statistic as follows:
\[T_{obs} = \frac{\bar{X} - \bar{Y}} {S_{p}\sqrt{1/n_{1} + 1/n_{2}}}= \frac{25-16} {15.62\sqrt{1/3 + 1/3}}= 0.7056788\]
The hand-computed test statistic, 0.7056788, agrees with the value t = 0.70566 reported by the two-sample t-test in R. With equal group sizes the pooled and Welch statistics coincide; only the degrees of freedom differ.
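As a quick check of the hand calculation, the pooled variance and test statistic can be recomputed directly in R. This is a minimal sketch: the variable names (`sp2`, `t_obs`, `pooled_fit`) are illustrative, and `var.equal = TRUE` is an assumption made here so that `t.test` reports the pooled version (df = 4) rather than the Welch version run earlier.

```r
# Treatment data from the example
x <- c(10, 15, 50)
y <- c(12, 17, 19)
n1 <- length(x)
n2 <- length(y)

# Pooled variance: weighted average of the two sample variances
sp2 <- ((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2)

# Pooled two-sample t statistic
t_obs <- (mean(x) - mean(y)) / (sqrt(sp2) * sqrt(1 / n1 + 1 / n2))

# Same test via t.test with equal variances assumed
pooled_fit <- t.test(x, y, alternative = "greater", var.equal = TRUE)

print(c(sp2 = sp2, t_obs = t_obs))
print(pooled_fit$statistic)
print(pooled_fit$parameter)
```

Working with unrounded standard deviations gives a pooled variance of 244; the value 243.9890 in the hand calculation reflects rounding of the inputs.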
p-value
From the R output, the Welch p-value is 0.2751. For the pooled calculation, the degrees of freedom are df = 3+3-2 = 4, and the p-value is:
\[p\text{-value} = P (T > T_{obs}) = P (T > 0.70567) = 1 - P (T \le 0.70567) = 0.2597\]
The R code for calculating this p-value,
#Degrees of freedom, df= 3+3-2=4
p_value <- 1-pt(0.7056788,df = 4)
p_value
## [1] 0.2596584
So, using the Welch two-sample t-test we got p-value = 0.2751, and using the pooled (equal-variance) calculation with df = 4 we got p-value = 0.2596584. In both cases, the p-value is greater than the significance level.
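The two p-values differ only through the degrees of freedom. The sketch below recomputes the Welch-Satterthwaite degrees of freedom by hand and evaluates the upper-tail probability under both choices. Variable names such as `df_welch` are illustrative and not part of the original analysis.

```r
x <- c(10, 15, 50)
y <- c(12, 17, 19)

# Estimated variance of each sample mean
vx <- var(x) / length(x)
vy <- var(y) / length(y)

# Welch t statistic (equal to the pooled statistic here because n1 == n2)
t_welch <- (mean(x) - mean(y)) / sqrt(vx + vy)

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

# Upper-tailed p-values under each choice of degrees of freedom
p_welch  <- pt(t_welch, df = df_welch, lower.tail = FALSE)
p_pooled <- pt(t_welch, df = 4, lower.tail = FALSE)

print(c(df_welch = df_welch, p_welch = p_welch, p_pooled = p_pooled))
```

The large gap between the two sample standard deviations pulls the Welch degrees of freedom well below 4, which is why the Welch p-value is the larger of the two.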
Decision about Null Hypothesis
We fail to reject the null hypothesis because the p-value = 0.2751 > 0.05.
Conclusion
We conclude that there is no strong evidence that the mean of treatment 1 is significantly larger than the mean of treatment 2.
Let the test statistic D be the difference in sample means between the two treatment groups. Find the permutation distribution of D under the null hypothesis that the two populations have the same distribution, then use the permutation test to find the p-value for the observed data (use the same hypotheses as in part (a)).
#install.packages("gtools")
library(gtools)
# using data
G <- c(10,15,50,12,17,19)
idx = combinations(n=6, r=3)
A <- c(10,15,50)
B <- c(12,17,19)
AB = c(A,B) # the combined data set
permut = NULL # the permuted data set (a 20*6 matrix)
for(i in 1:20){
permut = rbind(permut, c(AB[idx[i,]], AB[-idx[i,]]))
}
permut.A = permut[, 1:3] # the permuted A matrix (20*3)
permut.B = permut[, 4:6] # the permuted B matrix (20*3)
#Mean difference after permutation of two treatments
delta1 = apply(permut.A, 1, mean) - apply(permut.B, 1, mean)
delta1
## [1] 9.00000 -16.33333 -13.00000 -11.66667 7.00000 10.33333 11.66667
## [8] -15.00000 -13.66667 -10.33333 10.33333 13.66667 15.00000 -11.66667
## [15] -10.33333 -7.00000 11.66667 13.00000 16.33333 -9.00000
TS <- combinations(n=6, r=3, v=G, set=FALSE, repeats.allowed=FALSE)
TS <- cbind(TS,delta1)
TS[order(TS[,4]),]
## delta1
## [1,] 10 15 12 -16.33333
## [2,] 10 12 17 -15.00000
## [3,] 10 12 19 -13.66667
## [4,] 10 15 17 -13.00000
## [5,] 10 15 19 -11.66667
## [6,] 15 12 17 -11.66667
## [7,] 10 17 19 -10.33333
## [8,] 15 12 19 -10.33333
## [9,] 12 17 19 -9.00000
## [10,] 15 17 19 -7.00000
## [11,] 10 50 12 7.00000
## [12,] 10 15 50 9.00000
## [13,] 10 50 17 10.33333
## [14,] 15 50 12 10.33333
## [15,] 10 50 19 11.66667
## [16,] 50 12 17 11.66667
## [17,] 50 12 19 13.00000
## [18,] 15 50 17 13.66667
## [19,] 15 50 19 15.00000
## [20,] 50 17 19 16.33333
| Diff | -16.33 | -15.00 | -13.67 | -13.00 | -11.67 | -10.33 | -9.00 | -7.00 | 7.00 | 9.00 | 10.33 | 11.67 | 13.00 | 13.67 | 15.00 | 16.33 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Freq | 1 | 1 | 1 | 1 | 2 | 2 | 1 | 1 | 1 | 1 | 2 | 2 | 1 | 1 | 1 | 1 |
| Prob | 0.05 | 0.05 | 0.05 | 0.05 | 0.10 | 0.10 | 0.05 | 0.05 | 0.05 | 0.05 | 0.10 | 0.10 | 0.05 | 0.05 | 0.05 | 0.05 |
#p-value calculation for upper tailed test
delta1.obs = mean(A)-mean(B)
#pvalue for permutation of sample mean
pval1.upper = mean(delta1 >= delta1.obs) #upper-tailed
pval1.upper
## [1] 0.45
Since p-value 0.45 > 0.05, we fail to reject the null hypothesis and conclude that based on the permutation test, there is no significant difference between the mean of two treatments.
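The same exact permutation distribution can be obtained with base R only. The following is a minimal sketch that swaps in `combn` for `gtools::combinations`, so it is an alternative to the code above rather than the original method; the names `splits`, `D` and `p_upper` are illustrative.

```r
A <- c(10, 15, 50)
B <- c(12, 17, 19)
AB <- c(A, B)                      # combined data set

# Each column holds the indices assigned to treatment 1 in one split
splits <- combn(length(AB), length(A))

# Mean difference for every split; AB[-i] is the complementary group
D <- apply(splits, 2, function(i) mean(AB[i]) - mean(AB[-i]))

# Frequency table of the permutation distribution (rounded for display)
print(table(round(D, 2)))

# Upper-tailed p-value: share of splits at least as large as the observed one
D_obs <- mean(A) - mean(B)
p_upper <- mean(D >= D_obs)
print(p_upper)
```

Indexing the complement with `AB[-i]` removes the need to build the 20 x 6 permutation matrix row by row.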
Use the permutation test to find the p-value for the observed data (use the same hypotheses as in part (a))
Answer
rand.perm = function(x, y, R, alternative = c("two.sided", "less", "greater"), stat=c("meandiff", "mediandiff", "trmdiff", "sumX"), trim=0)
{
#stat="meandiff": mean difference between two groups
#stat="mediandiff": median difference between two groups
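#stat="trmdiff": trimmed mean difference between two groups, trim=0.1 means 10% of the data will be trimmed.
#stat="sumX": sum of observations from the x group
alternative = match.arg(alternative) # resolve the default choice vector to a single value
stat = match.arg(stat)
m = length(x)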
n = length(y)
N = m+n
xy = c(x, y)
permut = NULL
for(i in 1:R)
{
idx = sample(1:N, replace=FALSE)
permut = rbind(permut, c(xy[idx[1:m]], xy[idx[-(1:m)]]))
}
if(stat=="meandiff") trim=0
if(stat %in% c("meandiff", "trmdiff"))
{
D = apply(permut, 1, function(x) -mean(x[-(1:m)], trim)+mean(x[1:m], trim))
Dobs = mean(x, trim) - mean(y, trim)
}
if(stat=="sumX")
{
D = apply(permut, 1, function(x) sum(x[1:m]))
Dobs = sum(x)
}
if(stat=="mediandiff")
{
D = apply(permut, 1, function(x) -median(x[-(1:m)])+median(x[1:m]))
Dobs = -median(y) + median(x)
}
if(alternative=="greater")
pval = mean(D >= Dobs)
if(alternative=="less")
pval = mean(D <= Dobs)
if(alternative=="two.sided")
pval = mean(abs(D) >= abs(Dobs))
hist(D, main="Hist of D under the null")
abline(v=Dobs, col=2)
return(list(pval=pval, Dobs=Dobs))
}
# take a look at the histogram of the distribution
# of D under the null hypothesis
par(mfrow=c(1,1))
#random permutation using function rand.perm
rand.perm(Treatment1,Treatment2, R=1000, alternative = "greater", stat= "meandiff")
## $pval
## [1] 0.472
##
## $Dobs
## [1] 9
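Because `rand.perm` draws permutations at random, its p-value changes from run to run and only approximates the exact value of 0.45 obtained from the full enumeration of 20 splits. The sketch below shows the same idea in compact form with a fixed seed so that the result is reproducible. The seed and the number of draws `R` are arbitrary choices made for illustration, not values from the original analysis.

```r
set.seed(1)                        # arbitrary seed, for reproducibility only

x <- c(10, 15, 50)
y <- c(12, 17, 19)
xy <- c(x, y)
m <- length(x)
R <- 10000                         # number of random permutations (assumed)

# Shuffle the pooled data R times and recompute the mean difference
D <- replicate(R, {
  s <- sample(xy)                  # random relabelling of the six values
  mean(s[1:m]) - mean(s[-(1:m)])
})

# Monte Carlo estimate of the upper-tailed p-value
p_hat <- mean(D >= mean(x) - mean(y))
print(p_hat)
```

Increasing `R` shrinks the Monte Carlo error, but with only 20 distinct splits the exact enumeration is both cheaper and more accurate for this data set.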
From part(a):
Hypotheses Testing
Null Hypothesis, \(H_o\), \(\mu_{x}\) = \(\mu_{y}\)
Alternative Hypothesis, \(H_a\), \(\mu_{x}\) > \(\mu_{y}\)
where,
Mean of treatment 1 : \(\mu_{x}\)
Mean of treatment 2 : \(\mu_{y}\)
Test Statistic
Mean difference, \(\Delta\) = \(\bar{X}- \bar{Y}= 25-16 = 9\)
The vertical line in the histogram produced by the function marks the observed test statistic, D = 9, the difference in sample means between the two treatment groups.
p-value
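From our code, depicted above: We get p-value = 0.472 at our significance level of \(\alpha\) = 0.05. Because the permutations are drawn at random, this value varies slightly from run to run.
Decision about Null Hypothesis
We fail to reject the null hypothesis because the p-value = 0.472 > 0.05.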
Conclusion
Based on the permutation test, we can conclude by saying that there is no significant difference between the mean of two treatments.
(c) Find the permutation distribution of the sum of the observations from treatment 1, and show that the p-value for the observed data is the same as the p-value in part (b).
From part(a):
Hypotheses Testing
Null Hypothesis, \(H_o\), \(\mu_{x}\) = \(\mu_{y}\)
Alternative Hypothesis, \(H_a\), \(\mu_{x}\) > \(\mu_{y}\)
where,
Mean of treatment 1 : \(\mu_{x}\)
Mean of treatment 2 : \(\mu_{y}\)
Test Statistic
The test statistic is the sum of observations in Treatment 1.
# using data
G1 <- c(10,15,50,12,17,19)
TS1 <- combinations(n=6, r=3, v=G1, set=FALSE, repeats.allowed=FALSE)
sumi <- vector()
for(i in 1:length(TS1[,1])){
sumi[i] <- sum(TS1[i,])
}
TS1 <- cbind(TS1,sumi)
TS1[order(TS1[,4]),]
## sumi
## [1,] 10 15 12 37
## [2,] 10 12 17 39
## [3,] 10 12 19 41
## [4,] 10 15 17 42
## [5,] 10 15 19 44
## [6,] 15 12 17 44
## [7,] 10 17 19 46
## [8,] 15 12 19 46
## [9,] 12 17 19 48
## [10,] 15 17 19 51
## [11,] 10 50 12 72
## [12,] 10 15 50 75
## [13,] 10 50 17 77
## [14,] 15 50 12 77
## [15,] 10 50 19 79
## [16,] 50 12 17 79
## [17,] 50 12 19 81
## [18,] 15 50 17 82
## [19,] 15 50 19 84
## [20,] 50 17 19 86
#p-value calculation for upper tailed test
A <- c(10,15,50)
sum_A = sum(A)
sum_A
## [1] 75
#pvalue for permutation of the treatment 1 sum
pval2.upper = sum(sumi >= sum_A)/20 #upper-tailed
pval2.upper
## [1] 0.45
| Sum | 37 | 39 | 41 | 42 | 44 | 46 | 48 | 51 | 72 | 75 | 77 | 79 | 81 | 82 | 84 | 86 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Freq | 1 | 1 | 1 | 1 | 2 | 2 | 1 | 1 | 1 | 1 | 2 | 2 | 1 | 1 | 1 | 1 |
| Prob | 0.05 | 0.05 | 0.05 | 0.05 | 0.10 | 0.10 | 0.05 | 0.05 | 0.05 | 0.05 | 0.10 | 0.10 | 0.05 | 0.05 | 0.05 | 0.05 |
p-value
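From our code, depicted above: We get p-value = 0.45 at our significance level of \(\alpha\) = 0.05. This equals the p-value from the exact permutation test on the mean difference. The total of all six observations is fixed, so the mean difference is an increasing function of the treatment 1 sum, and both statistics rank the 20 permutations in the same order.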
Decision about Null Hypothesis
We fail to reject null hypothesis as the p-value= 0.45 > 0.05.
Conclusion
Based on the permutation test, we can conclude by saying that there is no significant difference between the mean of two treatments.
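The equality of the two p-values can be checked numerically. With pooled total T and group sizes m and n, the mean difference is D = S/m - (T - S)/n, where S is the treatment 1 sum, so D increases linearly in S. The sketch below verifies this for all 20 splits, using base R `combn` in place of `gtools::combinations`; the names `S`, `D` and `D_from_S` are illustrative.

```r
A <- c(10, 15, 50)
B <- c(12, 17, 19)
AB <- c(A, B)
m <- length(A)
n <- length(B)
total <- sum(AB)                   # fixed across all permutations

splits <- combn(length(AB), m)

# Sum of the treatment 1 group and mean difference for each split
S <- apply(splits, 2, function(i) sum(AB[i]))
D <- apply(splits, 2, function(i) mean(AB[i]) - mean(AB[-i]))

# D rewritten as a linear, increasing function of S
D_from_S <- S / m - (total - S) / n
print(all.equal(D, D_from_S))

# Upper-tailed p-values from the two statistics
p_sum  <- mean(S >= sum(A))
p_mean <- mean(D >= mean(A) - mean(B))
print(c(p_sum = p_sum, p_mean = p_mean))
```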
(d) Let the test statistic D be the difference in sample medians between the two treatments. Find the permutation distribution of D (do not simply print out the 20 permuted D values, but summarize them in a frequency table). Calculate the p-value for the observed data.
Hypotheses Testing
Null Hypothesis, \(H_o\), \(\theta_{x}\) = \(\theta_{y}\)
Alternative Hypothesis, \(H_a\), \(\theta_{x}\) > \(\theta_{y}\)
where,
Median of treatment 1 : \(\theta_{x}\)
Median of treatment 2 : \(\theta_{y}\)
Test Statistic
The test statistic D is the difference in sample medians between the two treatments.
Median difference, \(\Delta\) = \(\theta_{x}- \theta_{y}\)
#median difference after permutation of two treatments
delta3 = apply(permut.A, 1, median) - apply(permut.B, 1, median)
delta3
## [1] -2 -7 -4 -2 -5 2 4 -7 -5 2 -2 5 7 -4 -2 5 2 4 7 2
delta3.obs = median(A) - median(B)
#pvalue for permutation of sample median
pval3.upper = mean(delta3 >= delta3.obs) #upper-tailed
pval3.upper
## [1] 0.7
p-value
From our code, depicted above: We get p-value = 0.70 at our significance level of \(\alpha\) = 0.05
Decision about Null Hypothesis
We fail to reject null hypothesis as the p-value= 0.70 > 0.05.
Conclusion
Based on the permutation test, we can conclude by saying that there is no significant difference between the median of two treatments.
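The 20 permuted median differences can be summarized in a frequency table with `table`. This is a minimal base R sketch that rebuilds the splits with `combn` so that it runs on its own; the names `D_med`, `freq` and `prob` are illustrative.

```r
A <- c(10, 15, 50)
B <- c(12, 17, 19)
AB <- c(A, B)

# All 20 ways to assign three of the six values to treatment 1
splits <- combn(length(AB), length(A))

# Median difference for each split
D_med <- apply(splits, 2, function(i) median(AB[i]) - median(AB[-i]))

# Frequency and probability of each distinct value of D
freq <- table(D_med)
prob <- prop.table(freq)
print(freq)
print(round(prob, 2))

# Upper-tailed p-value for the observed median difference
p_upper <- mean(D_med >= median(A) - median(B))
print(p_upper)
```

The table makes it easy to read the p-value off as the total probability of the values at or above the observed difference of -2.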
(e) Which test statistics would you suggest for analyzing this data set and why?
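The permutation test based on the median difference is preferred over both the two-sample t-test and the permutation test based on the mean difference. The median is robust to outliers, such as the value 50 in treatment 1, and the permutation test is distribution-free, so it does not rely on the normality assumption, which is hard to justify with three observations per group.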
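A small illustration of the robustness argument: the sketch below compares the mean and median of treatment 1 with the same group after the largest value is replaced. The replacement value 20 is purely hypothetical and is used only to show how strongly a single observation moves the mean.

```r
trt1 <- c(10, 15, 50)              # observed treatment 1 data
trt1_alt <- c(10, 15, 20)          # hypothetical: 50 replaced by 20

# The mean shifts with the extreme value, the median does not
summary_obs <- c(mean = mean(trt1), median = median(trt1))
summary_alt <- c(mean = mean(trt1_alt), median = median(trt1_alt))

print(rbind(observed = summary_obs, hypothetical = summary_alt))
```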
Example 02
2. The carapace lengths (in mm) of crayfish were recorded for samples from two sections of a stream in Kansas.
| Section 1 | 5 11 16 8 12 |
|---|---|
| Section 2 | 17 14 15 21 19 13 |
You can use the following R code to load the data set:
Section1=c(5,11,16,8,12)
Section2=c(17,14,15,21,19,13)
(a) Test for differences between the two sections using a permutation test (you can decide on the test statistic). Use significance level 0.05. State the null and alternative hypotheses.
The total number of distinct ways to split the 11 observations into groups of 5 and 6 is \(\binom{11}{5}\) = 462.
Hypotheses Testing
Hypothesis:
Null Hypothesis, \(H_o\), median difference \((\theta_{x}-\theta_{y})\) = 0
Alternative Hypothesis, \(H_a\), median difference \((\theta_{x}-\theta_{y})\) \(\neq\) 0
where,
Median of Section 1 : \(\theta_{x}\)
Median of Section 2 : \(\theta_{y}\)
Test Statistic
The test statistic is the difference in sample medians between the two sections.
Median difference, \(\Delta\) = \(\theta_{x}- \theta_{y}\)
idx = combinations(n=11, r=5)
Section1=c(5,11,16,8,12)
Section2=c(17,14,15,21,19,13)
Section1Section2 = c(Section1,Section2) # the combined data set
permut = NULL # the permuted data set
for(i in 1:462){
permut = rbind(permut, c(Section1Section2[idx[i,]], Section1Section2[-idx[i,]]))
}
permut.Section1 = permut[, 1:5] # the permuted Section 1 matrix (462*5)
permut.Section2 = permut[, 6:11] # the permuted Section 2 matrix (462*6)
#median difference after permutation of two sections
delta4 = apply(permut.Section1, 1, median) - apply(permut.Section2, 1, median)
delta4.obs = median(Section1) - median(Section2)
delta4.obs
## [1] -5
#pvalue for permutation of sample median
pval4.2sided = mean(abs(delta4) >= abs(delta4.obs)) #two-tailed
pval4.2sided
## [1] 0.03896104
From the R code, the sample median difference between the two sections is -5.
p-value
From our code, depicted above: We get p-value = 0.03896104 at our significance level of \(\alpha\) = 0.05
Decision about Null Hypothesis
We reject the null hypothesis because the p-value = 0.03896104 < 0.05.
Conclusion
Based on the permutation test, we conclude that the median carapace lengths (in mm) of crayfish differ significantly between the two sections of the stream in Kansas.
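As a cross-check, the exact two-sided permutation test for the median difference can be written with base R alone. This sketch uses `combn` instead of `gtools::combinations` and indexes the complementary group directly, so no columns of a permutation matrix need to be sliced by hand. The names `pooled_data`, `splits` and `p_two_sided` are illustrative.

```r
Section1 <- c(5, 11, 16, 8, 12)
Section2 <- c(17, 14, 15, 21, 19, 13)
pooled_data <- c(Section1, Section2)
m <- length(Section1)

# Each column is one choice of 5 indices for Section 1
splits <- combn(length(pooled_data), m)
n_splits <- ncol(splits)           # equals choose(11, 5)

# Median difference for every split; pooled_data[-i] is the other group
D <- apply(splits, 2, function(i) {
  median(pooled_data[i]) - median(pooled_data[-i])
})

# Two-sided p-value: splits at least as extreme as the observed difference
D_obs <- median(Section1) - median(Section2)
p_two_sided <- mean(abs(D) >= abs(D_obs))

print(c(n_splits = n_splits, D_obs = D_obs, p_two_sided = p_two_sided))
```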
(b) Test for differences using the Wilcoxon rank-sum test. Use significance level 0.05. State the null and alternative hypotheses.
Answer
In order to perform the Wilcoxon rank-sum test, we assume that the two samples are independent random samples and that the two distributions differ, if at all, only by a location shift.
Hypotheses Testing
Hypothesis:
Null Hypothesis, \(H_o\), median difference \((\theta_{x}-\theta_{y})\) = 0
Alternative Hypothesis, \(H_a\), median difference \((\theta_{x}-\theta_{y})\) \(\neq\) 0
where,
Median of Section 1 : \(\theta_{x}\)
Median of Section 2 : \(\theta_{y}\)
Section1=c(5,11,16,8,12)
Section2=c(17,14,15,21,19,13)
wilcox.test(Section1, Section2, alternative="two.sided", paired=FALSE,
mu=0,exact=FALSE,correct=TRUE)
##
## Wilcoxon rank sum test with continuity correction
##
## data: Section1 and Section2
## W = 3, p-value = 0.03576
## alternative hypothesis: true location shift is not equal to 0
Test Statistic
From R code, we get:
The test statistic for the Wilcoxon rank-sum test is W = 3
p-value
From our code, depicted above: We get p-value = 0.03576 at our significance level of \(\alpha\) = 0.05
Decision about Null Hypothesis
We reject the null hypothesis because the p-value = 0.03576 < 0.05.
Conclusion
Based on the Wilcoxon rank-sum test, we conclude that the median carapace lengths (in mm) of crayfish differ significantly between the two sections of the stream in Kansas.
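To see where W and the p-value come from, the statistic can be rebuilt from the joint ranks. The sketch below computes W as the rank sum of Section 1 minus its smallest possible value, then applies the normal approximation with continuity correction. It assumes no ties, which holds for these data; the names `W`, `z` and `p_approx` are illustrative.

```r
Section1 <- c(5, 11, 16, 8, 12)
Section2 <- c(17, 14, 15, 21, 19, 13)
m <- length(Section1)
n <- length(Section2)

# Rank all 11 observations together (no ties in these data)
r <- rank(c(Section1, Section2))

# W: rank sum of the first sample minus its minimum possible value
W <- sum(r[seq_len(m)]) - m * (m + 1) / 2

# Normal approximation with continuity correction (no tie adjustment)
d <- W - m * n / 2
sigma <- sqrt(m * n * (m + n + 1) / 12)
z <- (d - sign(d) * 0.5) / sigma
p_approx <- 2 * pnorm(-abs(z))

# Compare with the built-in test under the same settings
p_r <- wilcox.test(Section1, Section2, exact = FALSE, correct = TRUE)$p.value
print(c(W = W, p_approx = p_approx, p_r = p_r))
```

With samples this small and no ties, `exact = TRUE` would also be a reasonable choice in `wilcox.test`; the normal approximation is used here only to mirror the call above.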