Resampling refers to a family of methods for:
- Estimating the precision of a sample statistic by using subsets of the available data (jackknifing) or by drawing randomly with replacement from a set of data points (bootstrapping)
- Exchanging labels on data points when performing significance tests (permutation tests)
- Validating models by using random subsets (bootstrapping, cross-validation)
Resampling: drawing repeated samples from the original data sample, using the observed/generated data to produce new hypothetical samples that mimic the underlying population, which can then be analyzed.
We resample because collecting data is expensive, because there is not enough data available, or because there is insufficient information about the distribution (i.e. the distribution is unknown, we are uncomfortable making assumptions about it, or the distribution of the test statistic is not easily computed).
Resampling works for any test statistic, regardless of whether or not its distribution is known.
When the assumptions of standard methods are met, those methods have high statistical power, but their flexibility is relatively low.
Resampling helps analyze quantifiable data that do not satisfy the assumptions of traditional parametric tests (e.g. t-tests, ANOVA, two-sample mean tests, F-tests).
Types of Resampling:
Monte Carlo Simulation:
- derives data from a mechanism (such as a proportion) that models the process you wish to understand (the population)
Permutation Test:
- a type of statistical significance test
- a reference distribution is obtained by calculating all possible values of a test statistic under rearrangements of the labels on the observed data points
- suitable whenever the null hypothesis makes all permutations of the observed data equally likely
- should be employed when you are dealing with an unknown distribution
Bootstrapping:
- estimates the sampling distribution of an estimator by sampling with replacement from the original sample
- most often used to derive robust estimates of standard errors and confidence intervals for a population parameter
Jackknife:
- used in statistical inference to estimate the bias and standard error of a statistic
- provides a systematic method of resampling with a modest amount of calculation
- offers an improved estimate of the parameter with less bias
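As a rough illustration of the bootstrap and jackknife entries above, the following R sketch estimates the standard error of a sample mean both ways. The data vector `x`, the seed, and the number of bootstrap replicates `B` are made up for illustration; they are assumptions, not values from these notes.

```r
# Hypothetical data: 30 draws from a skewed distribution (illustrative only)
set.seed(42)
x <- rexp(30, rate = 1 / 10)
n <- length(x)

# Bootstrap: resample the observed sample WITH replacement, B times,
# and recompute the statistic on each resample
B <- 2000
boot_means <- replicate(B, mean(sample(x, size = n, replace = TRUE)))
boot_se <- sd(boot_means)                          # bootstrap standard error
boot_ci <- quantile(boot_means, c(0.025, 0.975))   # percentile interval

# Jackknife: leave one observation out at a time
jack_means <- sapply(seq_len(n), function(i) mean(x[-i]))
jack_se <- sqrt((n - 1) / n * sum((jack_means - mean(jack_means))^2))
jack_bias <- (n - 1) * (mean(jack_means) - mean(x))

# Classical formula for the standard error of the mean, for comparison
formula_se <- sd(x) / sqrt(n)
```

For the sample mean, the jackknife standard error coincides with the classical `sd(x) / sqrt(n)` and the jackknife bias estimate is zero, so the comparison is mostly a sanity check; the two resampling methods become more useful for statistics that have no simple formula.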
Permutation Test:
Permutation: given a set of objects, how many different combinations of those objects can you create?
Permutation here is essentially an nCk problem, where \[\binom{n}{k} = \frac{n!}{k!(n-k)!}\]
n is the number of elements/objects to choose from and k is the number of elements/objects chosen. For example, if you have 6 objects and can only take three at a time, how many unique combinations of those 6 objects will you get? \[\binom{6}{3} = \frac{6!}{3!(6-3)!} = \frac{720}{6 \cdot 6} = 20\]
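The count for the 6-choose-3 example can be checked in R. This is only a small sketch using base R's `choose`, `factorial`, and `combn`; the variable names are illustrative.

```r
n <- 6  # objects to choose from
k <- 3  # objects taken at a time

n_closed_form <- choose(n, k)                                       # 20
n_factorials  <- factorial(n) / (factorial(k) * factorial(n - k))  # 20

# combn() enumerates the selections: each column is one way to pick 3 of 6
groups <- combn(n, k)
n_enumerated <- ncol(groups)                                        # 20
```

Enumerating with `combn` is what makes an exact permutation test possible on small samples, since each column can serve as one assignment of observations to a group.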
Test a Null Hypothesis:
- Establish a test statistic (e.g. risk of diabetes)
- compute the sampling distribution of the test statistic when the null hypothesis is true
- The p-value is the probability that the test statistic would be at least as extreme as the one we observed if the null hypothesis were true
- to estimate the sampling distribution of the test statistic, we need many samples generated under the strong null
Permutation:
A permutation test is a simple way to compute the sampling distribution for any test statistic under the strong null hypothesis that a set of variants has ABSOLUTELY NO EFFECT on the outcome
Permutation is only valid when the null hypothesis is one of NO ASSOCIATION
If the null is true, changing the exposure would have no effect on the outcome
If the null is true, the shuffled data sets should look like the real data; otherwise they should look different from the real data
Permutation tests are viable when we assume there is no difference between the treated and the untreated. In other words, the null value is 0 in a permutation test because the difference in means should not be significantly different from zero if the treatment has no effect
Permutations are just simulated data, and since we assume that the treatment has no effect, it doesn’t matter if we assign different results to different people
The ranking of the real test statistic among the shuffled test statistics gives a p-value
Procedures for Permutation Tests:
Analyze the Problem:
- What is the hypothesis and the alternative?
- What distribution is the data drawn from?
- What losses are associated with bad decisions?
Choose a Test Statistic: one that will distinguish the hypothesis from the alternative
Rearrange the Observations (i.e. Permutations):
- Compute the test statistic for all possible permutations of the observations and generate a distribution of values of the statistic of interest under the null hypothesis of no difference between the two populations
Make a Decision:
- Compare the observed statistic to this empirical sampling distribution to see how unlikely it would be if the two distributions were the same (as one would with a t-test)
- If the value of the test statistic for the original data is an extreme value in the permutation distribution of the statistic, reject the null hypothesis
- If it is NOT an extreme value, fail to reject the null
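The procedure above can be sketched as an exact permutation test on a very small data set, where enumerating every relabeling is feasible. The outcome values, the group sizes, and the choice of a difference in means as the test statistic are all assumptions made for illustration.

```r
# Hypothetical outcomes for two small groups (illustrative values only)
treatment <- c(82, 91, 77, 88)
control   <- c(70, 75, 84, 68)

pooled <- c(treatment, control)
n_t <- length(treatment)

# Test statistic: difference in group means for a given labelling
diff_means <- function(values, treated_idx) {
  mean(values[treated_idx]) - mean(values[-treated_idx])
}

observed <- mean(treatment) - mean(control)

# Every way of labelling n_t of the pooled observations as "treatment";
# each column of `assignments` is one relabeling
assignments <- combn(length(pooled), n_t)
perm_stats <- apply(assignments, 2, function(idx) diff_means(pooled, idx))

# Two-sided p-value: share of relabelings at least as extreme as observed
# (the small tolerance guards against floating-point ties)
p_value <- mean(abs(perm_stats) >= abs(observed) - 1e-12)
```

With 8 observations split 4 and 4 there are only 70 relabelings, so full enumeration is cheap. For larger samples the number of relabelings grows quickly, and a random subset of permutations is usually drawn instead.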
Permutation:
collect data from control and treatment
merge samples to form a pseudo-population
sample w/o replacement from the pseudo-population to simulate control and treatment groups
compute the target statistic for each resample
\[T =\frac{\bar{X} - \bar{Y}}{s /\sqrt{n}} \]
- where s is the standard deviation
- n is the sample size (the degrees of freedom are n - 1)
When using a two-sample t-test with pooled variance, where S^2 is the sample variance:
\[T =\frac{\bar{X} - \bar{Y}}{\sqrt{\frac{(n_x - 1)S^2_x + (n_y - 1)S^2_y}{(n_x - 1) + (n_y - 1)}}\,\sqrt{\frac{1}{n_x} + \frac{1}{n_y}}}\]
t.test(permutations$treatment_mean, permutations$control_mean, var.equal = TRUE, paired = FALSE)
Two Sample t-test
data: permutations$treatment_mean and permutations$control_mean
t = 1.0824, df = 38, p-value = 0.2859
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-6.208601 20.475267
sample estimates:
mean of x mean of y
77.55000 70.41667
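The merge, shuffle, and recompute steps listed earlier can be sketched with random permutations rather than full enumeration. The data here are simulated and are not the data behind the output above; the group sizes, the seed, the number of resamples `R`, and the helper `t_stat` are assumptions for illustration. The statistic is the pooled-variance t statistic, the same one `t.test(..., var.equal = TRUE)` reports.

```r
set.seed(1)
# Hypothetical scores (illustrative only)
control   <- rnorm(20, mean = 70, sd = 20)
treatment <- rnorm(20, mean = 78, sd = 20)

# Merge the samples into one pseudo-population
pooled <- c(control, treatment)
n_c <- length(control)
n_all <- length(pooled)

# Pooled-variance two-sample t statistic
t_stat <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  sp <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / (sp * sqrt(1 / nx + 1 / ny))
}

observed <- t_stat(treatment, control)

# Shuffle the pooled values (sampling without replacement), split them
# into simulated control and treatment groups, and recompute the statistic
R <- 5000
perm_t <- replicate(R, {
  shuffled <- sample(pooled)
  t_stat(shuffled[(n_c + 1):n_all], shuffled[1:n_c])
})

# Two-sided p-value; the +1 counts the observed arrangement itself
p_value <- (sum(abs(perm_t) >= abs(observed)) + 1) / (R + 1)
```

Adding 1 to the numerator and denominator is a common convention that keeps the estimated p-value from being exactly zero when only a random subset of permutations is examined.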
Computing Sample Size:
\[n = \left(\frac{Z\sigma}{E}\right)^2\]
| Z | the value from the standard normal distribution reflecting the confidence level that will be used | Z = 1.96 for 95% (get value from Z table) |
| sigma | standard deviation of the outcome variable | example |
| E | desired margin of error | example |
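A small worked example of the sample size formula, in R. The values of `sigma` and `E` are hypothetical, chosen only to show the arithmetic; `qnorm` gives the Z value for the chosen confidence level instead of a Z table lookup.

```r
conf_level <- 0.95
z <- qnorm(1 - (1 - conf_level) / 2)  # about 1.96 for a 95% level

sigma <- 15  # hypothetical standard deviation of the outcome variable
E <- 2       # hypothetical desired margin of error

# n = (Z * sigma / E)^2, rounded up to a whole number of observations
n_required <- ceiling((z * sigma / E)^2)  # 217
```

Rounding up rather than to the nearest integer keeps the achieved margin of error at or below the target.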
