This section covers two main topics, bootstrapping and permutation testing, which fall under the broader category of resampling methods.
Definition:
Bootstrapping is used for constructing confidence intervals and calculating standard errors for statistics whose sampling distributions are difficult to obtain (e.g., because of limited data or no closed form). A common definition: it allows estimation of the sampling distribution of almost any statistic using very simple methods.
Explain Further:
Basic bootstrap: use the observed data to construct an estimated population distribution by random sampling with replacement. From this distribution (constructed from the observed data) we can estimate the distribution of the statistic we're interested in.
[Figure 1.]
Explanation for Figure 1:
(Theoretically, the average of a single die roll is 3.5.) Here we run 1000 such experiments, each consisting of 50 die rolls, and take the mean of each. The y-axis has an unusual scale because we display the result as a density function, so the colored area is theoretically 1. (What we actually want is the distribution of means, but in practice we only have one set of observations.)
We used that single observed sample to create many (1000) resampled datasets by sampling with replacement from it. We drew 50,000 values in total, giving 1000 resampled datasets of 50 rolls each.
(Sampling the one observed distribution many times gives us the distribution of means, which in turn gives us standard errors and confidence intervals for the statistic.)
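As a minimal sketch of this die-roll bootstrap (the variable names rolls, B, resamples, and means are illustrative, not from the original):
set.seed(1)                                         # for reproducibility
rolls <- sample(1:6, 50, replace = TRUE)            # one observed experiment of 50 rolls
B <- 1000                                           # number of bootstrap resamples
resamples <- matrix(sample(rolls, 50 * B, replace = TRUE), B, 50)
means <- apply(resamples, 1, mean)                  # mean of each resampled experiment
hist(means, probability = TRUE)                     # density-scaled histogram, as in Figure 1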
Example:
Take the dataset \(sh\), which records sons' heights and has length 1078.
head(sh)
[1] 59.77827 63.21404 63.34242 62.79238 64.28113 64.24221
Now we create 1000 resampled datasets of the same length as the original \(sh\) by sampling \(sh\) with replacement 1000 * length(sh) times, and store the results in a matrix with 1000 rows; each row represents one resampled dataset with length(sh) entries.
Then we take the median of each row and plot the result: 1000 experiments, each consisting of 1078 draws.
Note that every time we draw from the empirical distribution \(sh\), each of its 1078 data points is equally likely to be pulled. Therefore, the probability of drawing any one of them is 1/1078.
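The resampled medians used below might be constructed as follows (a sketch; it assumes \(sh\) is already loaded, e.g. the sheight column of the father.son data in the UsingR package):
B <- 1000
n <- length(sh)                                     # 1078
resamples <- matrix(sample(sh, n * B, replace = TRUE), B, n)
resampledMedians <- apply(resamples, 1, median)     # one median per resampled dataset
plot(density(resampledMedians))                     # the curve shown in Figure 2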
[Figure 2.]
Explanation for Figure 2:
The resulting density curve of the resampled medians (one median per resampled dataset). It estimates the sampling distribution of the median of the original, i.e. observed, data \(sh\).
median(sh)
[1] 68.61582
median(resampledMedians)
[1] 68.61273
The latter comes from the resampled datasets. The two results are numerically very close.
[Back to Theory]
Suppose you have a statistic that estimates some population parameter, but you don’t know its sampling distribution. The bootstrap principle uses the distribution defined by the observed data to approximate the sampling distribution of that statistic.
The nice thing about bootstrapping is that you can always do it with simulation. The general procedure is to first simulate \(B\) complete data sets from the observed data by sampling with replacement. Make sure \(B\) is large and that you're sampling WITH replacement to create data sets the same size as the original.
sam <- sample(fh, nh * B, replace = TRUE)
resam <- matrix(sam, B, nh)
The latter two parameters are the row and column counts. This draws nh * B samples with replacement from \(fh\), the fathers' height dataset, where nh = length(fh) = 1078 and B is the number of resampled datasets (1000, as in the earlier example).
meds <- apply(resam, 1,median)
Take the median of each row of \(resam\). The second argument, the number 1, specifies that the function is applied to the rows of resam (instead of the columns).
median(meds)-median(fh)
[1] 0
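The difference above shows essentially no bias in the bootstrap medians. The spread of meds also estimates the standard error of the median; as a sketch:
sd(meds)   # bootstrap estimate of the standard error of the median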
quantile(resampledMedians, c(.025, .975)) — sons' data
2.5% 97.5%
68.42898 68.80709
quantile(meds, c(.025, .975)) — fathers' data
2.5% 97.5%
67.55396 67.94197
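Taking the 2.5th and 97.5th percentiles of the resampled medians in this way gives 95% bootstrap percentile confidence intervals for the two medians.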
Permutation testing is another handy tool for group comparisons.
Like bootstrapping, permutation testing resamples a single dataset many times and calculates a statistic from each resampling. It tests whether outcomes are independent of group identity: each resampling simply permutes the group labels associated with the outcomes.
Example:
[Purpose] Testing the effect of a certain drug that is supposed to promote plant growth.
[\(H_0\)] The drug has no effect at all, i.e. the distributions of the two groups are identical.
Here we have two groups of data recording plant heights, measured with and without the drug.
A<-c(24,43,58,67,61,44,67,49,59,52,62,50)
B<-c(42,43,65,26,33,41,19,54,42,20,17,60,37,42,55,28)
#Construct the test statistic: the difference between the mean heights of the two groups.
S_obs<-mean(A) - mean(B)
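For these data mean(A) = 53 and mean(B) = 39, so S_obs = 14.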
#Now implement the permutation test to get a p-value
# 1. Merge the two datasets into one
a<-c(24,43,58,67,61,44,67,49,59,52,62,50,42,43,65,26,33,41,19,54,42,20,17,60,37,42,55,28)
# 2. Randomly split the merged data into two groups X_a, X_b of the original sizes
X_a<- c(43,17,44,62,60,26,28,61,50,43,33,19) #length is 12
X_b<-c(55,41,42,65,59,24,54,52,42,49,37,67,67,20,42,58) #length is 16
# 3. Compute the difference between the means of the two groups:
S_1st <-mean(X_a) - mean(X_b)
# 4. Repeat the previous two steps many times (here 10000; the more repetitions, the more accurate the estimated null distribution)
a<-c(24,43,58,67,61,44,67,49,59,52,62,50,42,43,65,26,33,41,19,54,42,20,17,60,37,42,55,28)
# re-create the merged vector from step 1
group<-factor(c(rep("A",12),rep("B",16)))
# group labels: 12 'A's followed by 16 'B's
data<-data.frame(group, a)
# create a data frame for the permutation resampling
testStat <- function(x) {
  mean(x[x$group == 'A', 2]) - mean(x[x$group == 'B', 2])
}
# testStat expects a data frame whose first column is the group label and second column the outcome
result <- replicate(10000, testStat(data.frame(group, sample(a))))
hist(result,breaks = 21,probability = TRUE)
lines(density(result))
# compare the original observed statistic to the permutation distribution
test_p <- result[result > S_obs]   # permuted statistics exceeding the observed difference
p_value<-length(test_p)/10000
p_value
## [1] 0.0067
# p < 0.05, so we reject H_0
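Note that this p-value is one-sided: it counts only permuted differences larger than the observed one. For a two-sided test one would instead compute, as a sketch:
p_value_two_sided <- mean(abs(result) >= abs(S_obs))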