This document introduces the use of the survey
package for R for estimating ratios as well as using ratio and regression estimators. It also demonstrates the relationship between ratios and domain estimation, and between ratio and regression estimators and calibration.1
Estimating a ratio of two population totals is done using the svyratio
function. For an example, consider the families
data frame that contains survey data of families from a community (families are sampling units), where \(y_i\) is the weekly medical expenses of a family, and \(x_i\) is the family size. The parameter of interest is the ratio of total medical expenses to the total number of people in the community, which can be written as \[
R = \frac{\tau_y}{\tau_x} = \frac{\sum_{i=1}^N y_i}{\sum_{i=1}^N x_i}.
\] This can be estimated using the data in the sample of families. The survey data are available in a data frame called families
that can be read into a R session using the following commands.
library(downloader)
source_url("https://db.tt/VGK0FP4z", prompt = FALSE)
Here are the first six observations of the sample of \(n\) = 33 families.
head(families)
family size net expenses
1 1 2 372 28.6
2 2 3 372 41.6
3 3 3 522 45.4
4 4 5 390 61.0
5 5 4 348 82.4
6 6 7 552 56.4
We will assume that the sample was selected using simple random sampling from a population of \(N\) = 600 families. First we need to specify the design. For more information on how to specify a simple random sampling design using the survey
package see the document Simple Random Sampling Analysis in R.
library(survey)
families$N <- 600
families.srs <- svydesign(id = ~1, data = families, fpc = ~N)
For a simple random sampling design the estimator is \[
r = \frac{\bar{y}}{\bar{x}} \ \ \text{where} \ \ \bar{y} = \frac{1}{n}\sum_{i=1}^n y_i \ \ \text{and} \ \ \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i,
\] and its estimated variance is \[
\hat{V}(r) = \left(1-\frac{n}{N}\right)\left(\frac{1}{\bar{x}^2}\right)\frac{s_r^2}{n} \ \ \text{where} \ \ s_r^2 = \frac{1}{n-1}\sum_{i=1}^n(y_i - rx_i)^2.
\] These are computed using the svyratio
function, which requires that we specify the numerator variable (\(y_i\)), the denominator variable (\(x_i\)), and the design object.
svyratio(~expenses, ~size, design = families.srs)
Ratio estimator: svyratio.survey.design2(~expenses, ~size, design = families.srs)
Ratios=
size
expenses 14.58862
SEs=
size
expenses 1.029044
Thus we estimate that for the community the weekly medical expenses is about $14.59 per person, with a bound on the error of estimation of about 2 \(\times\) $1.03 = $2.06 per person.
Domain estimators of means in post-stratification (i.e., when samples are not drawn from specific domains but rather classified into domains after sampling) are ratios. Under simple random sampling the usual estimator of the domain mean \(\mu_d\) can be written as \[
\bar{y}_d = \frac{\sum_{i=1}^n d_i y_i}{\sum_{i=1}^n d_i},
\] where \[
d_i =
\begin{cases}
1, & \text{if the $i$-th unit is in the domain}, \\
0, & \text{if the $i$-th unit is not in the domain}.
\end{cases}
\] For an example consider the dogscats
data frame that can be loaded as follows.
library(downloader)
source_url("https://db.tt/uoDbuoRR", prompt = FALSE)
head(dogscats)
unit type visits expenses
1 1 Dog 4 45.14
2 2 Dog 5 50.13
3 3 Cat 2 27.15
4 4 Dog 3 45.80
5 5 Cat 1 23.39
6 6 Cat 2 8.24
These data are from an example by Levy and Lemeshow (1990).2 Here is their description of the survey.
“A veterinarian is interested in studying the annual veterinary costs of his clientele (who own either dogs or cats). From a separate record system, he knows that he sees 850 dogs and 450 cats regularly in his practice (these are numbers of animals, not numbers of visits). He knows that the information on type of animal (i.e., dog or cat) is contained in the medical records, but that it would take too much time to sort the records into strata defined by animal type. So he decides to select a simple random sample and then poststratify. He regards the poststratification process as necessary since he knows that, on average, it costs more to keep dogs healthy than to keep cats healthy. He samples 50 records, recording the total amount of money spent (including medication) by the owners of the animals he saw over the past two years.”
Suppose were were specifically interested in estimating the mean expenses for dogs. There are several ways to do this. One is to create \(y_i^* = d_iy_i\) and use svyratio
. We need to create the new variable before specifying the design.
dogscats$N <- 850 + 450
dogscats$d <- ifelse(dogscats$type == "Dog", 1, 0)
dogscats$dy <- dogscats$expenses * dogscats$d
head(dogscats)
unit type visits expenses N d dy
1 1 Dog 4 45.14 1300 1 45.14
2 2 Dog 5 50.13 1300 1 50.13
3 3 Cat 2 27.15 1300 0 0.00
4 4 Dog 3 45.80 1300 1 45.80
5 5 Cat 1 23.39 1300 0 0.00
6 6 Cat 2 8.24 1300 0 0.00
dogscats.srs <- svydesign(id = ~1, data = dogscats, fpc = ~N)
Now we can apply svyratio
to the variables dy
and d
.
svyratio(~dy, ~d, design = dogscats.srs)
Ratio estimator: svyratio.survey.design2(~dy, ~d, design = dogscats.srs)
Ratios=
d
dy 49.85844
SEs=
d
dy 1.44369
Thus we esimate that the mean expenses for dogs is about $49.86 with a bound on the error of estimation of about $2.89. Thus we can always estimate a domain mean from a post-stratified design by estimating a ratio. However survey
has simpler way to estimate domain means. If we use the svyby
function it will recognize that the design was post-stratified (assuming we specified the design correct) and do the appropriate calculations without the need to create \(d_i\) and \(d_iy_i\).3:
svyby(~expenses, by = ~type, design = dogscats.srs, FUN = svymean)
type expenses se
Cat Cat 21.71111 1.96505
Dog Dog 49.85844 1.44369
Note that the estimate and standard error for dogs are the same as before. One thing to note is that when computing the estimated variance of the domain mean, svyratio
and svyby
are using the estimated value of \(\mu_x\) in the formula \[
\hat{V}(r) = \left(1-\frac{n}{N}\right)\left(\frac{1}{\mu_x^2}\right)\frac{s_r^2}{n} \ \ \text{where} \ \ s_r^2 = \frac{1}{n-1}\sum_{i=1}^n(y_i^* - rx_i)^2.
\] That is, the sample mean \(\bar{x}\) is use in place of \(\mu_x\). But note that \(x_i = d_i\) so \(\mu_x = N_d/N\). Since \(\mu_x\) is not given it is assumed to be unknown so survey
uses the estimator \(\bar{x} = n_d/n\) in place of \(\mu_x\) when computing the estimated variance. However in this case we know that \(\mu_x\) = 850/1300. To adjust the standard error given by svyratio
and svyby
to use this information, we could multiply it \(\bar{x}/\mu_x\) to effectively replace \(\bar{x}\) with \(\mu_x\). Domain totals can also be computed using svyratio
or svyby
.
svyby(~expenses, by = ~type, design = dogscats.srs, FUN = svytotal)
type expenses se
Cat Cat 10160.80 2108.889
Dog Dog 41482.22 4520.707
The estimator that is used is \(\hat{\tau}_d = \frac{N}{n}\sum_{i=1}^n d_iy_i\), which we can verify by using svytotal
with \(d_iy_i\).
svytotal(~dy, design = dogscats.srs)
total SE
dy 41482 4520.7
However this estimator does not take into account that we know the size of each domain (i.e., \(N_d\)). If \(N_d\) is known as it is here then we can use the estimator \(\tau_d = N_d\bar{y}_d\). To compute this estimator and its standard error we can multiply the estimate and standard error for \(\bar{y}_d\) by \(N_d\). Using the results above, the estimated total expenses for dogs is \(\hat{\tau}_d \approx\) 850 \(\times\) $49.86 = $42379.67.
Under simple random sampling the ratio estimators for \(\mu_y\) and \(\tau_y\) are \[
\hat{\mu}_y = r\mu_x \ \ \text{and} \ \ \hat{\tau}_y = r\tau_x,
\] respectively, where \(r = \bar{y}/\bar{x}\). Consider again the families
data frame. Suppose that we know that the total number of people in the community of \(N\) = 600 households is 2700 people. We can then use family size as an auxiliary variable with \(\tau_x\) = 2700 and \(\mu_x\) = 2700/600 = 4.5. A ratio estimator for \(\mu_y\) can be computed using the predict
function in combination with svyratio
.
predict(svyratio(~expenses, ~size, design = families.srs), total = 2700)
$total
size
expenses 39389.27
$se
size
expenses 2778.418
So the estimated total weekly expenses for the community is $39389.27 with a bound on the error of estimation of about $5556.84. It is interesting to compare the ratio estimator with the estimator \(N\bar{y}\) which does not use the auxiliary variable.
svytotal(~expenses, design = families.srs)
total SE
expenses 32626 2129.5
Notice that the standard error for the ratio estimator is larger than that for the estimator \(N\bar{y}\). The reason why can be seen if we look at a scatter plot of the data.
with(families, plot(size, expenses))
with(families, abline(0, mean(expenses)/mean(size)))
Note that the relationship betweem expenses and size is relatively weak. For a ratio estimator to be efficient we need a strong relationship between the target and auxiliary variable such that the target variable is approximately proportional to the auxiliary variable. That doesn’t appear to be the case here.
The predict
function does not directly provide a way to compute the ratio estimator of \(\mu_y\), but it can be computed relatively easily from the ratio estimator of \(\tau_y\) because \(\hat{\mu}_y = \hat{\tau}_y/N\), so to get the estimate and standard error for the ratio estimator of \(\mu_y\) we can simply divide the estimate and standard error for the ratio estimator of \(\tau_y\) that are given by predict
by \(N\).
Ratio estimators, regression estimators, and generalized regression estimators can be computed using the calibrate
function. What this function does is reweight the sample to incorporate the information known about the auxiliary variables and calibrate the sample with respect to that information. Then functions such as svymean
or svytotal
can be applied to the calibrated sample. Although ratio estimators can be computed using predict
with svyratio
, survey
does not have a separate function for regression estimators, but they can be implemented using the calibrate
function, as can ratio estimators.
For an example consider the data frame trees
which is included with R.
head(trees)
Girth Height Volume
1 8.3 70 10.3
2 8.6 65 10.3
3 8.8 63 10.2
4 10.5 72 16.4
5 10.7 81 18.8
6 10.8 83 19.7
This data frame is a sample of \(n\) = 31 black cherry trees. Assume that the target variavble is Volume
while Girth
and Height
are potential auxiliary variables. Also for the purpose of this example assume that this is a sample from a population of \(N\) = 1000 trees, and that the totals for Girth
and Height
are perhaps known (\(\tau_{g}\) = 14000 and \(\tau_h\) = 77000), but we are interested in estimating the total volume (\(\tau_v\)) and mean volume (\(\mu_v\)) of the 1000 trees, which are unknown. We will assume that the sample was selected using simple random sampling.
trees$N <- 1000
trees.srs <- svydesign(id = ~1, data = trees, fpc = ~N)
First we can see how to compute a ratio estimator using calibrate
. Suppose that we are going to only use girth
as an auxiliary variable. The ratio estimator of total volume can then be computed using predict
with svyratio
.
predict(svyratio(~Volume, ~Girth, design = trees.srs), total = 14000)
$total
Girth
Volume 31882.64
$se
Girth
Volume 1811.891
But we can also do this with calibrate
.
trees.cal <- calibrate(trees.srs, formula = ~Girth - 1, population = 14000,
variance = 1)
svytotal(~Volume, design = trees.cal)
total SE
Volume 31883 1811.9
Notice that calibrate
doesn’t produce any estimates or standard errors itself. It only calibrates the sample. Estimates can then be obtained using other functions. The arguments of calibrate
need some explanation. The formula
argument specifies the kind of calibration we are using. It uses a conventional method in R for specifying linear regression models. The “formula” ~Girth-1
implies \(y_i = a + bx_i\) but with \(a=0\). The -1
can be read as “remove the intercept” from the formula. Note that the prediction formula \(y_i = a + bx_i\) is consistent with \(y_i = rx_i\) if \(a=0\). However to get \(b=\bar{y}/\bar{x}\) we need to specify a particular variance structure where the variance of \(y_i\) is assumed to be proportional to \(x_i\).4 This is achieved by the variance = 1
argument. The population = 14000
argument is how we tell calibrate
that \(\tau_g\) = 140000. A nice feature of calibrate is that it also allows us to compute the ratio estimator of \(\mu_y\) directly.
svymean(~Volume, design = trees.cal)
mean SE
Volume 30.171 2.9062
Calibration works by reweighting so that the new weights satisfy certain constaints. For a ratio estimator this is \(\tau_x = \sum_{i=1}^n w_ix_i\) so that the “estimate” of the total of the auxiliary variable agrees with the known value of \(\tau_x\). We can see this if we extract the weights using the weights
function and use them to “estimate” \(\tau_g\).
sum(weights(trees.cal) * trees$Girth)
[1] 14000
We can also confirm that the estimate for the total \(\tau_y\) is computed using \(\hat{\tau}_y = \sum_{i=1}^n w_iy_i\) using the weights created by calibrate
.
sum(weights(trees.cal) * trees$Volume)
[1] 31882.64
The relationship between tree volume and girth might be better approximated by a line that is not contrained to have a zero intercept. This can be seen if we look at a scatter plot of the sample data.
with(trees, plot(Girth, Volume))
with(trees, abline(0, mean(Volume)/mean(Girth)))
The line shown passes through the origin and the coordinate \((\bar{x},\bar{y})\), which is what would generate the predictions for a ratio estimator, but it doesn’t appear appropriate here. A regression estimator may be a better choice. Regression estimators can be implemented using calibrate.
trees.cal <- calibrate(trees.srs, formula = ~Girth, population = c(`(Intercept)` = 1000,
Girth = 14000))
svytotal(~Volume, trees.cal)
total SE
Volume 33978 857.99
Here formula = ~Girth
implies that prediction model \(y_i = a + bx_i\). Unless we include the -1
an intercept is assumed by default. The standard regression estimator estimates the slope and intercept assuming a constant variance, which is the default so unlike with the ratio estimator there is no need to specify anything for the variance
argument. The population
argument provides two values: \(N\) and \(\tau_g\). The syntax is a bit odd, but consider that we can write the prediction model alternatively as \(y_i = ax_{i0} + bx_{ig}\), where \(x_{ig}\) is the girth of the \(i\)-th tree, and \(x_{i0} = 1\) for all trees. The regression estimator satisfies the calibration constraints \[
N = \sum_{i=1}^n w_i x_{i0} = \sum_{i=1}^n w_i\ \ \text{and} \ \ \tau_g = \sum_{i=1}^n w_i x_{ig},
\] which we can confirm numerically:
sum(weights(trees.cal))
[1] 1000
sum(weights(trees.cal) * trees$Girth)
[1] 14000
So basically with the population
argument we are specifying that the (Intercept)
“variable”" has a total of 1000, to calibrate with respect to the known population total \(N\), and that the Girth
variable has a total of 14000 in the population.
Note that the standard error for the regression estimator for the total volume in quite a bit smaller than that for the ratio estimator. We might be able to do even better by using tree height as well as girth as an auxiliary variable. This is a generalized regression estimator.
trees.cal <- calibrate(trees.srs, formula = ~Girth + Height, population = c(`(Intercept)` = 1000,
Girth = 14000, Height = 77000))
svytotal(~Volume, trees.cal)
total SE
Volume 34049 749.65
The standard error suggests that including the second auxiliary variable may be wortwhile. Note that formula = ~Girth+Height
argument implies the prediction model \(y_i = a + b_gx_{ig} + b_hx_{ih}\) where \(x_{ig}\) and \(x_{ih}\) are the girth and height of the \(i\)-th tree. The calibrate
function allows us to include any number of auxiliary variables. Auxiliary variables can also be categorical as shown in the next section.
Post-stratification is basically calibration with a categorical auxiliary variable. First consider the results from applying the postStratify
function to the survey data in the dogscats
data frame.5 Here is the estimate of the total expenses using post-stratification.
pop <- data.frame(type = c("Dog", "Cat"), Freq = c(850, 450))
dogscats.pst <- postStratify(dogscats.srs, strata = ~type, population = pop)
svytotal(~expenses, design = dogscats.pst)
total SE
expenses 52150 1512.5
The same estimate can be obtained using calibration.
dogscats.cal <- calibrate(dogscats.srs, formula = ~type, population = c(`(Intercept)` = 1300,
typeDog = 850))
svytotal(~expenses, design = dogscats.cal)
total SE
expenses 52150 1512.5
The specification of the population
argument may seem a little strange because only the total for dogs has been specified. The reason is that the underlying regression model uses dummy/indicator variables, and only one of these is needed two distinguish between two strata. By default, R will order the strata alphanumerically and omit the first so that dummy/indicator variables exist only for the second through the last strata. So the size of the first strata does not need to be specified.
Double or two-phase sampling can be used when we want to use a ratio or (generalized) regression estimator, or other calibration methods, but the necessary population totals of the auxiliary variables are unknown. A simulated data set from a double-sampling design can be brought into an R session as follows.
library(downloader)
source_url("https://db.tt/bvWY2DgV", prompt = FALSE)
These data include \(n'\) = 200 observations, which also include the \(n\) = 31 observations of the black cherry trees. The auxiliary variables Girth
and Volume
are observed for \(n'\) = 200 trees, but Volume
was observed for only a sub-sample of \(n\) = 31 trees. Assume that the large sample was selected using simple random sampling from a population of \(N\) = 1000 trees, while the smaller sample was selected using simple random sampling from the larger sample. The data frame is called trees.double
. Note that the data frame contains a variable called subsample
which is TRUE
if the observation is from the second phase sample, and FALSE
otherwise.6
trees.double$N <- 1000
trees.2ph <- twophase(id = list(~1, ~1), data = trees.double, fpc = list(~N,
NULL), subset = ~subsample)
trees.cal <- calibrate(trees.2ph, formula = ~Girth + Height, phase = 2)
Note that here we do not specify the population
argument. Instead by including the argument phase = 2
we tell calibrate
to calibrate the observations in the second phase sample using the information from the first phase sample. Ratio or (generalized) regression estimators can then be obtained by applying functions to trees.cal
.
svytotal(~Volume, design = trees.cal)
total SE
Volume 30519 1263.9
This document is a work in progress. It is based on version 3.30.3 of the survey
package.↩
Levy, P. S. & Lemeshow, S. (1999). Sampling of populations: Methods and appliations. (3rd ed.). New York: Wiley.↩
Not all statistical packages will correctly recognize the design when domain means are requested, but the survey
package tries to be “smart” and use what you specified when you created the design object.↩
The regression line implied by a ratio estimator is the weighted least squares regression line for the model \(y_i = bx_i + \epsilon_i\) where the weights are \(w_i = 1/x_i\). Note that these weights are not the same as design or calibration weights. In R this regression line can be obtained using lm(y ~ x, weights = 1/x)
.↩
The postStratify
function is discussed in the documents Simple Random Sampling Analysis in R and Stratified Random Sampling Analysis in R.↩
You could create this variable yourself with something like trees.double$subsample <- ifelse(is.na(trees.double$Volume), TRUE, FALSE)
or trees.double$subsample <- is.na(trees.double$Volume)
.↩