N = 10000
p_1 = rnorm(n=N,mean=0,sd=1)
p_2 = rnorm(n=N,mean=2,sd=1)
sample() function in R. We will draw samples of size 100 from each population and compute their means and check if they are equal. We will do this 1000 times.# number of samples
n_samples = 100
# number of times
n_times = 1000
# compute the mean difference an store them in the container below
container = c()
for(k in 1:n_times){
# draw samples from the populations
X_1 = sample(p_1,n_samples)
X_2 = sample(p_2,n_samples)
# compute their mean difference and store in container
container[k] = mean(X_2)-mean(X_1)
}
hist(container,col='gray')
container = c()
for(k in 1:n_times){
# draw samples from the populations
X_11 = sample(p_1,n_samples)
X_12 = sample(p_1,n_samples)
# compute their mean difference and store in container
container[k] = mean(X_11)-mean(X_12)
}
hist(container,col='gray')
# count the number of times the difference was exactly zero
sum(container == 0)
[1] 0
Now when statisticians says the null \(\bar X_1 - \bar X_2=0\), it means that under this null assumption the we expect the distribution of the difference to be the second histogram above, i.e., center around zero. Lets compare the second histogram (a histogram constructed under the null) to a normal distribution
qqnorm(container)
qqline(container)
Ok it seems like it the null hypothesis is true, the null distribution (the histogram centered at zero) is normal. We know the mean of the null distribution is zero. What is it standard deviation. To avoid answering that question, instead of looking at \(z_1 =\bar X_1 - \bar X_2\) (\(z_1\) is the test statistics) we could look at \[ z_2 = \frac{\bar X_1 - \bar X_2}{\text{std}(\bar X_1 - \bar X_2)} \] \(z_2\) is a normalised test statistic and \(\text{std}(z_2)=1\)
z_1 = c()
z_2 = c()
for(k in 1:n_times){
# draw samples from the populations
X_11 = sample(p_1,n_samples)
X_12 = sample(p_1,n_samples)
# compute their mean difference and store in container
z_1[k] = mean(X_11)-mean(X_12)
}
z_2 = z_1/sd(z_1)
hist(z_1,col='gray',xlim = c(-4,4))
hist(z_2,add=TRUE,col=rgb(0,0,0.5,0.3))
So we could now conlude that the distribution of \(z_2\) is standard normal. So we are saying that under the null hypothesis, the distribution of \(z_2\) is a standard normal distribution. Is this always true?
Write two functions twosample.t(X1, X2, n.perm) and paired.t(X1,X2, n.perm) that return the permutation p-value of a two-sided twosample t-test and a two-sided paired t-test based on n.perm permutations. You should implement this test yourself, so don’t use a test from e.g. the coin package. As your answer to this question, copy-paste the (readable) R code of these two functions so that we can check your implementation.
twosample.t(X1, X2, n.perm) and paired.t(X1,X2, n.perm) have NOTHING to do with a \(t\)-test. Please don’t be confuse with the .t in the function names. This can be misleading I know, the .t in the names simply means they are tests.# two samples
#_______
# pool samples
# draw two groups from pooled sample without replacement
# compute difference between their means
# compare with observed
twosample.t= function(X1,X2,n.perm){
len = length(X1)
# observed value
obs = mean(X1)-mean(X2)
# pooled samples,under the null it does not matter where the sample belongs
X = c(X1,X2)
# their indices
index = seq(1,2*len)
statistics = c()
for(i in 1:n.perm){
# sample the indices, it does not matter which sample was the observation
# came from
index_sample_1 = sample(index,len,replace = FALSE)
index_sample_2 = index[!(index %in% index_sample_1)]
statistics[i] = mean(X[index_sample_1])-mean(X[index_sample_2])
}
# proportions of test statistics below obs
p_value = sum(abs(statistics)>abs(obs))/n.perm
# p_value = ifelse(p_value>0.5,1-p_value,p_value)
return(p_value)
}
# compute X1-X2
# switch their signs randomly
# compute mean of difference
# compare with the observed value
paired.t= function(X1,X2,n.perm){
len = length(X1)
# observed value
obs = mean(X1)-mean(X2)
statistics = c()
for(i in 1:n.perm){
signs = sample(c(1,-1),len,replace = TRUE)
# change sign of difference randomly it does not matter which measurement was done
# first, this what the null hypothesis says
statistics[i] = mean((X1-X2)*signs)
}
# proportions of test statistics below obs
p_value = sum(abs(statistics)>abs(obs))/n.perm
# p_value = ifelse(p_value>0.5,1-p_value,p_value)
return(p_value)
}
# play with the two tests change the means of X1 and X2 and see what happens
N<-100
n.perm = 1000
X1 = rnorm(N,mean=2,sd=2)
X2 = rnorm(N,mean=2.3,sd=2)
paired.t(X1,X2,n.perm=n.perm)
[1] 0.201
twosample.t(X1,X2,n.perm=n.perm)
[1] 0.173
Write a function simulate.data(n, delta, correlation) that simulates \(n\) paired observations (so the total number of observations is \(2n\)), say
X1 and X2, from a skewed distribution where E[X1]−E[X2] = delta. The parameter correlation can take on a number from 1 to 5 and is related to the strength of the correlation. More specifically, correlation = 1 should correspond to cor(X1, X2) = 0, correlation = 2 to cor(X1, X2) = 0.1, correlation = 3 to cor(X1, X2) = 0.25, correlation = 4 to cor(X1, X2) = 0.5 and correlation = 5 to cor(X1, X2) = 0.7. Note that many parameters are left free in this function (e.g. the marginal means
and variances - you can choose the values yourself for this). As your an- swer to this question, copy-paste the (readable) R code of this function.
Tip: you might want to examine the code below to get inspiration on how you can simulate correlated skewed bivariate data. Note that cor below will not exactly equal cor(X1, X2), so you will have to fine tune this. You can do this be setting \(n\) to a large number (e.g. 10000) and see what the sample correlation is (which will be very close to cor(X1, X2) because \(n\) is large) for a choice of cor. It is not mandatory to use this code.
library(mvtnorm)
n <- 100000
cor <- .1
Y <- rmvnorm(n, mean = c(0,2),sigma = matrix(nrow = 2, ncol = 2, c(1, cor, cor, 1)))
cov(Y)
[,1] [,2]
[1,] 0.99953182 0.09751816
[2,] 0.09751816 1.00656266
Z <- pnorm(Y) # standard normal
X1 <- qexp(Z[,1])
X2 <- qexp(Z[,2])
cor_data = data.frame(correlation=c(1:5),cor=c(0,0.1,0.25,0.5,0.7))
simulate.data=function(n, delta, correlation){
cor = cor_data[correlation,2]
Y <- rmvnorm(n, mean = c(0,delta),sigma = matrix(nrow = 2, ncol = 2, c(1, cor, cor, 1)))
Z <- pnorm(Y) # standard normal
X1 <- qexp(Z[,1])
X2 <- qexp(Z[,2])
data.frame(X1,X2)
}
Create a graph where you plot the empirical type I and type II error for each test as a function of the correlation where the correlation takes on the values 0, 0.1, 0.25, 0.5 and 0.7 and for \(\alpha\) = 5%. Here \(n\) should be equal to \(5\) (mimicking a very small sample) and to 100 (mimicking a larger sample) and you can choose the number of permutations and Monte-Carlo simulations yourself (the higher the better, but higher also means computationally more intensive, so there is a trade-off). In your answer to this question you should only include two plots: one for n = 5 and one for n = 100 each with 4 curves (for both errors and for both tests) - this plot should be readable in black-and-white and please provide an informative caption. Make sure that when you simulate data under \(H_A\) the type II error is not too close to zero or one (because then it is hard to compare the methods - the delta is allowed to be different when n = 5 and when n = 100).
## 1) state the type I and II errors
## 2) simulate data with null hypothesis true
## 3) simulate data with null hypothesis false
## set n = 5
## set n = 100
## set delta for n = 5 and 100
alpha = 0.05
B = 100
n.perm = 1000
paired_con = c()
twosample_con = c()
# containers for n=5 type I and type II
error_type_I = data.frame(matrix(NA,nrow=2,ncol = 5))
error_type_II = data.frame(matrix(NA,nrow=2,ncol = 5))
rownames(error_type_I) = c('type_I_paired','type_I_twosample')
colnames(error_type_I)= paste('correlation_',seq(1,5))
rownames(error_type_II) = c('type_II_paired','type_II_twosample')
colnames(error_type_II)= paste('correlation_',seq(1,5))
# our simulation and plot function
correlations = c(1:5)
all_sim = function(n,delta){
for(j in 1:length(correlations)){
for(b in 1:B){
data = simulate.data(n=n,delta=0,correlation=correlations[j])
paired_con[b] = paired.t(X1=data$X1,X2=data$X2,n.perm = n.perm)<=alpha
twosample_con[b] = twosample.t(X1=data$X1,X2=data$X2,n.perm = n.perm)<=alpha
}
error_type_I[1,j] = sum(paired_con)/B
error_type_I[2,j] = sum(twosample_con)/B
}
#type II
for(j in 1:length(correlations)){
for(b in 1:B){
data = simulate.data(n=n,delta=delta,correlation=correlations[j])
paired_con[b] = paired.t(X1=data$X1,X2=data$X2,n.perm = n.perm)>alpha
twosample_con[b] = twosample.t(X1=data$X1,X2=data$X2,n.perm = n.perm)>alpha
}
error_type_II[1,j] = sum(paired_con)/B
error_type_II[2,j] = sum(twosample_con)/B
}
corr_ = c(0,0.1,0.25,0.5,0.7)
ylimits = c(min(error_type_I,error_type_II),max(error_type_I,error_type_II))
plot(
x = corr_,
y = error_type_I[1, ],
type = 'b',
main = paste('n = ',n, 'type I and II error rates'),
xlab = 'correlations',
ylab = 'Error rate',
ylim = ylimits,
lty=1,
pch = 1
)
lines(x = corr_,
y = error_type_I[2, ],
type = 'b',
lty = 2,
pch=2)
lines(
x = corr_,
y = error_type_II[1, ],
type = 'b',
lty = 3,
pch = 3
)
lines(
x = corr_,
y = error_type_II[2, ],
type = 'b',
lty = 4,
pch = 4
)
legend("topright",
legend = c('Paired type I', 'two_sample type I','Paired type II', 'two_sample type II'),
lty = c(1, 2,3,5),pch=c(1,2,3,4))
}
all_sim(n=5,delta=2)
What do you conclude for n=5, two sample test seems to perform better with small samples?.
set.seed(100)
all_sim(n=100,delta = 0.35)
We now turn to confidence intervals and we are first interested in some analytical results.
confidence intervals, you can assume that the variables are normally distributed or you can rely on large-sample approximations.
\[ E(\bar X_1 - \bar X_2) = \bar X_1-\bar X_2 \] \[ \text{Var}(\bar X_1 - \bar X_2) = \frac{ \sigma_1^2+\sigma_2^2}{n}-2\text{Cov}(\bar X_1,\bar X_2) \] b) Independence \[ Z = \frac{\bar X_1-\bar X_2}{\sqrt{\frac{ \sigma_1^2+\sigma_2^2}{n}}}\sim N(0,1) \]
\(\alpha\)% confidence is \[ \Phi^{-1}(\frac{\alpha}{2})\leq Z\leq \Phi^{-1}(1-\frac{\alpha}{2}) \]
where \(\Phi(.)\) is the cummulative normal distribution function. or \[ \sqrt{\frac{\sigma_1^2+\sigma_2^2}{n}}\Phi^{-1}(\frac{\alpha}{2})\leq \bar X_1-\bar X_2\leq \Phi^{-1}(1-\frac{\alpha}{2}) \sqrt{\frac{\sigma_1^2+\sigma_2^2}{n}} \] \[ \left[(\bar X_1-\bar X_2)+\sqrt{\frac{\sigma_1^2+\sigma_2^2}{n}}\Phi^{-1}(\frac{\alpha}{2}), ( \bar X_1-\bar X_2)+\Phi^{-1}(1-\frac{\alpha}{2}) \sqrt{\frac{\sigma_1^2+\sigma_2^2}{n}}\right] \] b) Dependence \[ Z = \frac{\bar X_1-\bar X_2}{\sqrt{\frac{ \sigma_1^2+\sigma_2^2-2\text{Cov}(\bar X_1,\bar X_2)}{n}}}\sim N(0,1) \]
Derive the confidence interval in a similar way as above.
Simulate data under \(H_0\) or \(H_A\) (you can choose and you can reuse the code from the previous questions) for n = 100 and compute the empirical coverage and average width of a \(95\%\) confidence interval for E [X1]−E [X2]
(using the two confidence intervals you have derived in the previous ques- tion - so you implement these confidence intervals in R yourself). Make
two graphs: one with the empirical coverage as a function of the correla- tion and one with the average width as a function of the correlation where the correlation takes on the values 0, 0.1, 0.25, 0.5 and 0.7.
## coverage = proportion of times the true value is in the confidence interval
## length of CI = max-min
## Important note
## Note that the data from our sample function has been transformed so a delta = 0
## correponds to a specific value in the transformed scale we set the sample size very
## large to see the corresponding value
n <- 1000000
cor <- .1
Y <- rmvnorm(n, mean = c(0,0),sigma = matrix(nrow = 2, ncol = 2, c(1, cor, cor, 0)))
sigma is numerically not positive semidefinite
Z <- pnorm(Y) # standard normal
X1 <- qexp(Z[,1])
X2 <- qexp(Z[,2])
trans_delta = round(mean(X2)-mean(X1),2)
#the correponding value is -0.3 round to 2 dp
n = 100
alpha = 0.05
delta = 1
coverage_dep = c()
coverage_ind = c()
length_dep = c()
length_ind = c()
for(j in 1:length(correlations)){
len_dep = c()
len_ind = c()
cove_dep = c()
cove_ind = c()
for(b in 1:B){
data = simulate.data(n=n,delta=0,correlation=correlations[j])
v1 = var(data$X1)
v2 = var(data$X2)
deltahat = mean(data$X2) - mean(data$X1)
term_ind = sqrt((v1+v2)/n)*qnorm(1-(alpha/2))
term_dep = sqrt((v1+v2-2*cov(data$X1,data$X2))/n)*qnorm(1-alpha/2)
# compute
lb_ind = deltahat-term_ind
up_ind = deltahat+term_ind
lb_dep = deltahat-term_dep
up_dep = deltahat+term_dep
len_ind[b] = up_ind-lb_ind
len_dep[b] = up_dep-lb_dep
cove_ind[b]= lb_ind <= trans_delta & trans_delta<=up_ind
cove_dep[b]= lb_dep <= trans_delta & trans_delta<=up_dep
}
coverage_dep[j] = sum(cove_dep)/B
coverage_ind[j] = sum(cove_ind)/B
length_dep[j] = mean(len_dep)
length_ind[j] = mean(len_ind)
}
corr_ = c(0,0.1,0.25,0.5,0.7)
ylimits = c(min(coverage_dep,coverage_ind),max(coverage_dep,coverage_ind))
plot(x = corr_,y = coverage_dep,type = 'b',
xlab = 'correlations',ylab = 'Coverage',ylim = ylimits,
lty=1,pch = 1
)
lines(x = corr_,y = coverage_ind,type = 'b',lty = 2,pch=2)
### add a legend
corr_ = c(0,0.1,0.25,0.5,0.7)
ylimits = c(min(length_dep,length_ind),max(length_dep,length_ind))
plot(x = corr_,y = length_dep,type = 'b',
xlab = 'Correlation',ylab = 'average width',ylim = ylimits,
lty=1,pch = 1
)
lines(x = corr_,y = length_ind,type = 'b',lty = 2,pch=2)