26 June, 2015
In a randomized experiment, researchers use some randomization technique to assign the treatment to experimental units and then observe the effect on the treated. For example: randomly select schools in order to provide them with computers.
In an observational study, researchers merely observe individuals; they do not interfere in the assignment process because it is beyond their control. For example: observe poor people who are enrolled in some kind of income subsidy program.
It is not always possible to randomize, but always randomize if possible!
In public policy evaluation it is very common to evaluate programs designed for a massive population. Decision makers implement programs that affect a lot of people. In this set-up we always want to know if the program intervention had an effect on the population of interest.
Since these evaluations usually cost millions of dollars, when you plan a sampling design you also want to measure other variables of interest that characterize the people within the program. That requires some level of statistical precision!
At the end you should STOP and DECIDE whether your research should focus on estimation or on hypothesis testing.
We select a random sample in order to draw conclusions about the entire population.
Samples do not necessarily maintain the distribution of the population. A representative sample is one that, when expanded, reproduces a pseudo-population that is very similar (in distribution) to the population of interest.
Some people say that a good sample must resemble the population of interest in such a way that some categories appear with the same proportions in the sample as in the population.
Suppose that the aim is to estimate the production of iron in a country, and we know that the iron is produced by two huge steel companies and by several hundred small craft-industries. Does the best design consist of selecting each unit with the same probability?
First, one will inquire about the production of the two biggest companies. Next, one will select a sample of small companies according to an appropriate sampling design.
This is the simplest sampling design. It assumes that the variable of interest is uniformly distributed across the population.
The vast majority of sample size calculators over the internet are based on this sampling scheme! However, this design is rarely used in practice.
This design uses (continuous) auxiliary information \(x\) to select a sample. The greater the value of \(x\), the greater the chance of being selected. This variable is commonly called a measure of size (MOS).
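As a minimal sketch, the following code selects a PPS sample, assuming the sampling package is available (any package implementing unequal-probability designs would do); the MOS vector x and all sizes are hypothetical.

library(sampling)

set.seed(123)
N <- 1000                               # hypothetical population size
x <- rgamma(N, shape = 2, scale = 50)   # hypothetical measure of size (MOS)
n <- 100                                # desired sample size

pik <- inclusionprobabilities(x, n)     # first-order inclusion probabilities
s <- UPsystematic(pik)                  # systematic PPS selection (0/1 vector)
sample_ids <- which(s == 1)             # units with larger x are more likely here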
This design assumes that you know a priori the membership of every unit of the population in a set of subgroups (or strata).
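A minimal sketch of this idea, in the spirit of the iron example above: a take-all stratum for the two big companies and a simple random sample of the small ones. All names and sizes are hypothetical.

set.seed(123)
companies <- data.frame(
  id   = 1:502,
  size = c("big", "big", rep("small", 500))   # two big firms, 500 small ones
)

take_all <- companies[companies$size == "big", ]   # census of the big stratum
small    <- companies[companies$size == "small", ]
srs      <- small[sample(nrow(small), 50), ]       # SRS of 50 small companies

s <- rbind(take_all, srs)                          # final stratified sample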
When the information comes directly from households or people, it is common to lack a proper sampling frame from which to select a sample directly. In that case, the sample is usually selected in several stages (for example: first municipalities, then blocks, and finally households).
In the end, all of the foundations of classical statistical inference are based on a sequence of random variables that are independent and identically distributed (IID).
However, these assumptions hold only if the sampling design is simple random sampling with replacement. Hence, proper inference must take into account the representativeness principle (or the golden rule of survey sampling).
In order to evaluate whether a policy had an effect (or impact) on the population of interest, you should answer the following question:
To answer this important question, you have to emulate the counterfactual reality of the sample of people who were exposed to the treatment! For unit \(k\), this means estimating:
\[\beta_k = E(Y_k | P = 1) - E(Y_k | P = 0)\]
This way, the causal impact \(\beta\) of a program \(P\) on an outcome \(Y\) is the difference between the expected outcome under exposure to the program, \(E(Y | P = 1)\), and the expected outcome without the program, \(E(Y | P = 0)\).
This is a property that ensures that findings can be generalized to the population of eligible units. In other words, the expanded sample must be representative of the population of eligible units.
If the sample is selected randomly, but the treatment is not assigned at random, then the expanded sample could be representative, but the comparison group may not be a suitable (valid) one.
This property ensures that the impact is well estimated because the selected sample focuses on those units that properly represent the counterfactual.
The impact of a program in the finite population \(U\) is computed by means of the average treatment effect (ATE), defined as the difference of averages over the treated population \(U_T\) and the control population \(U_C\):
\(\beta = \frac{1}{N_T}\sum_{k \in U_T} Y_k(1) - \frac{1}{N_C}\sum_{k \in U_C} Y_k(0)\)
This parameter can be written as the regression coefficient in the following model
\(Y_k = \alpha + \beta T_k + E_k\)
Where \(k \in U\), \(T_k = 1\) if unit \(k\) receives the treatment; otherwise, \(T_k = 0\).
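A quick simulation (with hypothetical data) illustrates the equivalence between the difference of averages and the regression coefficient:

set.seed(123)
N  <- 10000
Tk <- rbinom(N, 1, 0.5)          # randomized treatment indicator
Y  <- 2 + 1.5 * Tk + rnorm(N)    # true impact: beta = 1.5

mean(Y[Tk == 1]) - mean(Y[Tk == 0])   # difference of averages
coef(lm(Y ~ Tk))["Tk"]                # the same value as a regression slope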
The impact of a program is unbiasedly estimated by means of the following expression:
\(\hat{\beta} = \frac{\sum_{k \in S_T} \frac{Y_k(1)}{\pi_k}}{\sum_{k \in S_T} \frac{1}{\pi_k}} - \frac{\sum_{k \in S_C} \frac{Y_k(0)}{\pi_k}}{\sum_{k \in S_C} \frac{1}{\pi_k}} = \hat{\bar{Y}}_T - \hat{\bar{Y}}_C\)
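A minimal sketch of this estimator in base R; the vectors y, pik and treat (the outcome, the inclusion probabilities and the treatment indicator of the respondents) are hypothetical.

hajek_ate <- function(y, pik, treat) {
  w   <- 1 / pik                                                  # sampling weights
  y_t <- sum(w[treat == 1] * y[treat == 1]) / sum(w[treat == 1])  # treated mean
  y_c <- sum(w[treat == 0] * y[treat == 0]) / sum(w[treat == 0])  # control mean
  y_t - y_c                                                       # estimated ATE
}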
This estimator can be viewed as the sampling estimator \(\hat{\beta}\) of the regression coefficient in:
\(Y_k = \hat{\alpha} + \hat{\beta} T_k + e_k; \ \ \ \ \ \ \ \ k \in S\)
In order to include covariates when computing impact, we must specify the following model:
\(Y_k = \alpha + \beta T_k + \boldsymbol{\gamma}' \mathbf{X}_k + E_k\)
If we define \(\boldsymbol{\theta} = (\alpha, \beta, \boldsymbol{\gamma})'\), and \(\mathbf{Z}= (1, \mathbf{T}, \mathbf{X})\), then we have that:
\(\boldsymbol{\theta} = (\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'Y\)
The impact of a program in the finite population, conditional on the covariates \(\mathbf{X}\), is computed by means of the second entry of the vector of regression coefficients \(\boldsymbol{\theta}\).
The impact of a program is estimated by means of the second entry of the following vector of estimated regression coefficients:
\(\hat{\boldsymbol{\theta}} = \left(\sum_{k \in S} \frac{\mathbf{Z}_k \mathbf{Z}_k'}{\pi_k}\right)^{-1}\left(\sum_{k \in S} \frac{\mathbf{Z}_k Y_k}{\pi_k}\right)\)
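In practice this design-weighted regression can be fitted with the survey package, as in the following sketch; the data frame dat and the variables Y, T, X1, X2 and pik are hypothetical names.

library(survey)

des <- svydesign(ids = ~1, probs = ~pik, data = dat)  # design with the pi_k's
fit <- svyglm(Y ~ T + X1 + X2, design = des)          # design-weighted least squares
coef(fit)["T"]                                        # impact, conditional on X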
Although this estimator is not unbiased for the ATE, it is consistent (asymptotically unbiased).
Regular assignment differs from randomization in that units receive the treatment under an unknown probabilistic scheme. Thus, conditional on covariates, the probability of receiving the treatment (the propensity score) is unknown.
To address this issue, it is usual to match the set of units receiving the treatment, conditional on covariates, with those units not receiving the treatment. Thus, for the sake of internal validity, the population may be reduced to a matched subset of smaller size.
Under this setup, the regression model that accounts for the ATE is
\(Y_k = \alpha + \beta T_k + E_k\)
For \(k \in U_M\), where \(U_M\) is the matched subset.
The problem of the matched sample can be viewed as a problem of domains in survey sampling. Domains are defined as a partition of the whole population whose membership is known only after interviewing the units in the sample.
The impact of a program is estimated by means of the second entry of the following vector of estimated regression coefficients:
\(\hat{\boldsymbol{\theta}} = \left(\sum_{k \in S_M} \frac{\mathbf{Z}_k \mathbf{Z}_k'}{\pi_k}\right)^{-1}\left(\sum_{k \in S_M} \frac{\mathbf{Z}_k Y_k}{\pi_k}\right)\)
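A hedged sketch of the whole pipeline, assuming the MatchIt package for the matching step and the survey package for the design-weighted fit; dat, Y, T, X1, X2 and pik are hypothetical names.

library(MatchIt)
library(survey)

m     <- matchit(T ~ X1 + X2, data = dat, method = "nearest")  # propensity matching
dat_m <- match.data(m)                                         # the matched subset (S_M)

des_m <- svydesign(ids = ~1, probs = ~pik, data = dat_m)
fit_m <- svyglm(Y ~ T, design = des_m)                         # regression on the matched domain
coef(fit_m)["T"]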
The variance of sampling estimators of the ATE must take into account the complex sampling design. For example, if stratified sampling is used, the linearized estimated variance of \(\hat{\beta}\) is
\(Var(\hat{\beta}) = \sum_{k \in S_T}\sum_{l \in S_T} \frac{\Delta_{kl}}{\pi_{kl}} \frac{u_k(1)}{\pi_k} \frac{u_l(1)}{\pi_l} + \sum_{k \in S_C}\sum_{l \in S_C} \frac{\Delta_{kl}}{\pi_{kl}} \frac{u_k(0)}{\pi_k} \frac{u_l(0)}{\pi_l}\)
where \(\Delta_{kl} = \pi_{kl} - \pi_{k}\pi_{l}\), \(\pi_{kl} = Pr(k \in S, l \in S)\), \(u_k(1) = Y_k(1) - \hat{\bar{Y}}_T\), and \(u_k(0) = Y_k(0) - \hat{\bar{Y}}_C\).
However, under other sampling plans, this variance may be more complex and may include some covariance terms.
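As a sketch, the double sum above can be transcribed directly for the special case of simple random sampling without replacement, where \(\pi_k = n/N\) and \(\pi_{kl} = n(n-1)/(N(N-1))\) have closed forms; y is a hypothetical vector of observed outcomes for one group (treated or controls).

group_var_term <- function(y, n, N) {
  u    <- y - mean(y)                    # u_k: residuals around the group mean
  pik  <- n / N                          # first-order inclusion probability
  pikl <- n * (n - 1) / (N * (N - 1))    # second-order inclusion probability
  v <- 0
  for (k in seq_len(n)) {
    for (l in seq_len(n)) {
      pkl   <- if (k == l) pik else pikl # pi_kk = pi_k
      Delta <- pkl - pik^2               # Delta_kl = pi_kl - pi_k * pi_l
      v     <- v + (Delta / pkl) * (u[k] / pik) * (u[l] / pik)
    }
  }
  v
}
# Var(beta_hat) is then the sum of the treated and control terms:
# group_var_term(yT, nT, NT) + group_var_term(yC, nC, NC)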
In order to take the sampling scheme into account when computing the variance (the tool for building confidence intervals and hypothesis tests) of any estimator \(\hat{T}\), we use the design effect (DEFF), defined as the ratio of the variance under the complex design \(p\) to the variance under simple random sampling:
\(DEFF = \frac{Var_p(\hat{T})}{Var_{SI}(\hat{T})}\).
This way, the variance of the estimator under the complex sampling can be written as
\(Var_p(\hat{T}) = DEFF \times Var_{SI}(\hat{T})\)
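In practice the DEFF of an estimator can be obtained with the survey package, as in this sketch; dat, strata, pik and y are hypothetical names.

library(survey)

des <- svydesign(ids = ~1, strata = ~strata, probs = ~pik, data = dat)
svymean(~y, des, deff = TRUE)   # reports the estimate, its SE and the DEFF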
When defining confidence intervals, we take into account the margin of error as a function of the variance of an estimator \(\hat{T}\):
\(ME = Z_{1- \alpha/2} \sqrt{Var(\hat{T})}\)
Sometimes (for continuous type variables) it is useful to define the relative margin of error:
\(RME = Z_{1- \alpha/2} \frac{\sqrt{Var(\hat{T})}}{\hat{T}}\)
Note that the coefficient of variation of \(\hat{T}\) is given by
\(CV = \frac{\sqrt{Var(\hat{T})}}{\hat{T}}\)
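A small helper gathering the three quantities, given a point estimate and its estimated variance (the inputs below are hypothetical):

precision <- function(est, varest, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)     # Z_{1 - alpha/2}
  c(ME  = z * sqrt(varest),          # margin of error
    RME = z * sqrt(varest) / est,    # relative margin of error
    CV  = sqrt(varest) / est)        # coefficient of variation
}

precision(est = 0.4, varest = 0.0001)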
Assuming a population of size \(N\), and a selected sample of size \(n\), the variance for the difference of proportions is defined as:
\(Var(\hat{P}_1-\hat{P}_2)=\frac{DEFF}{n}\left(1-\frac{n}{N}\right)(P_1Q_1+P_2Q_2)\)
This way, a 95% confidence interval for the difference of two proportions is given by
\(CI(95\%)_{P_1-P_2}=(\hat{P}_1-\hat{P}_2) \pm Z_{1-\alpha/2} \sqrt{\frac{DEFF}{n}\left(1-\frac{n}{N}\right)(P_1Q_1+P_2Q_2)}\)
Then the margin of error is
\(e< Z_{1-\alpha/2}\sqrt{\frac{DEFF}{n}\left(1-\frac{n}{N}\right)(P_1Q_1+P_2Q_2)}\)
Solving this inequality for \(n\), the minimum sample size is
\(n> \dfrac{DEFF(P_1Q_1+P_2Q_2)}{\dfrac{e^2}{Z_{1-\alpha/2}^2}+\dfrac{DEFF(P_1Q_1+P_2Q_2)}{N}}\)
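The formula can be transcribed directly as a sanity check (not a replacement for the package function below); for the inputs of the next example it reproduces n = 2498.

n_dp <- function(N, P1, P2, DEFF, conf, me) {
  z <- qnorm(1 - (1 - conf) / 2)                 # Z_{1 - alpha/2}
  S <- DEFF * (P1 * (1 - P1) + P2 * (1 - P2))    # DEFF (P1 Q1 + P2 Q2)
  ceiling(S / (me^2 / z^2 + S / N))
}

n_dp(N = 100000, P1 = 0.5, P2 = 0.5, DEFF = 1.2, conf = 0.95, me = 0.03)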
By using the ss4dp function, it is possible to compute the minimum sample size required in order to ensure the desired precision:
library(samplesize4surveys)
ss4dp(N=100000, P1=0.5, P2=0.5, DEFF=1.2, conf=0.95, me=0.03)[2]

## With the parameters of this function: N = 1e+05 P1 = 0.5 P2 = 0.5 DEFF = 1.2 conf = 0.95 .
##
## The estimated sample size to obatin a maximun coefficient of variation of 5 % is n= 1e+05 .
## The estimated sample size to obatin a maximun margin of error of 3 % is n= 2498 .
##

## $n.me
## [1] 2498
library(samplesize4surveys)
ss4dp(N=100000, P1=0.5, P2=0.55, DEFF=1.2, conf=0.95, cve=0.15, me=0.03, plot=TRUE)

## With the parameters of this function: N = 1e+05 P1 = 0.5 P2 = 0.55 DEFF = 1.2 conf = 0.95 .
##
## The estimated sample size to obatin a maximun coefficient of variation of 15 % is n= 9595 .
## The estimated sample size to obatin a maximun margin of error of 3 % is n= 2485 .
##

## $n.cve
## [1] 9595
##
## $n.me
## [1] 2485
Assume that the impact is measured as the difference between a proportion among the treated and a proportion among the controls:
\(H_o: P_1-P_2=0 \ \ \ \ \ vs. \ \ \ \ \ H_a: P_1 -P_2 =D > 0\)
where \(D\) is sometimes called the effect (and it is defined by the policy expert). Based on the normality of the sampling estimators, we reject the null hypothesis according to the following decision rule:
\(\frac{\hat{P}_1-\hat{P}_2}{\sqrt{\frac{DEFF}{n}\left(1-\frac{n}{N}\right)(P_1Q_1+P_2Q_2)}} > Z_{1-\alpha}\)
Power, here denoted \(\beta\), is defined as the probability of rejecting the null hypothesis given that the alternative hypothesis is true.
\(\beta = Pr\left(\dfrac{\hat{P}_1-\hat{P}_2}{\sqrt{\frac{DEFF}{n}\left(1-\frac{n}{N}\right)(P_1Q_1+P_2Q_2)}} > Z_{1-\alpha} \left. | \right. P_1 -P_2 =D \right)\)
After some algebra, we find that
\(\beta = 1-\Phi\left(Z_{1-\alpha} - \dfrac{D}{\sqrt{\frac{DEFF}{n}\left(1-\frac{n}{N}\right)(P_1Q_1+P_2Q_2)}} \right)\)
Solving for \(n\), the minimum sample size that guarantees a power of \(\beta\) for detecting an effect \(D\) is
\(n \geq \dfrac{DEFF(P_1Q_1+P_2Q_2)}{\dfrac{D^2}{(Z_{1-\alpha}+Z_{\beta})^2}+\dfrac{DEFF(P_1Q_1+P_2Q_2)}{N}}\)
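As before, the formula can be transcribed directly as a sanity check against ss4dpH below, with \(Z_{\beta} = \Phi^{-1}(power)\); for the inputs of the next example it reproduces n = 1463.

n_dpH <- function(N, P1, P2, D, conf, power, DEFF) {
  z_alpha <- qnorm(conf)                          # Z_{1 - alpha} (one-sided test)
  z_beta  <- qnorm(power)                         # Z_{beta}: beta here is the power
  S <- DEFF * (P1 * (1 - P1) + P2 * (1 - P2))
  ceiling(S / (D^2 / (z_alpha + z_beta)^2 + S / N))
}

n_dpH(N = 100000, P1 = 0.5, P2 = 0.5, D = 0.05, conf = 0.95, power = 0.8, DEFF = 1.2)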
By using the ss4dpH function, it is possible to compute the minimum sample size required in order to ensure the desired power:
library(samplesize4surveys)
ss4dpH(N = 100000, P1 = 0.5, P2 = 0.5, D=0.05, conf = 0.95, power = 0.8, DEFF = 1.2)
## [1] 1463
ss4dpH(N = 100000, P1 = 0.5, P2 = 0.5, D=0.03, conf = 0.95, power = 0.9, DEFF = 1.2)
## [1] 5401
library(samplesize4surveys)
ss4dpH(N = 100000, P1 = 0.5, P2 = 0.5, D=0.05, conf = 0.95, power = 0.8, DEFF = 1.2, plot = TRUE)
## [1] 1463
samplesize4surveys R package