Ratio and Regression Estimators in R

This document introduces the use of the survey package for R for estimating ratios as well as using ratio and regression estimators. It also demonstrates the relationship between ratios and domain estimation, and between ratio and regression estimators and calibration.¹

Estimating a Ratio

Estimating a ratio of two population totals is done using the svyratio function. For an example, consider the families data frame that contains survey data of families from a community (families are sampling units), where $y_i$ is the weekly medical expenses of a family, and $x_i$ is the family size. The parameter of interest is the ratio of total medical expenses to the total number of people in the community, which can be written as \[ R = \frac{\tau_y}{\tau_x} = \frac{\sum_{i=1}^N y_i}{\sum_{i=1}^N x_i}. \] This can be estimated using the data in the sample of families. The survey data are available in a data frame called families that can be read into a R session using the following commands.

library(downloader)
source_url("https://db.tt/VGK0FP4z", prompt = FALSE)

Here are the first six observations of the sample of $n$ = 33 families.

head(families)

  family size net expenses
1      1    2 372     28.6
2      2    3 372     41.6
3      3    3 522     45.4
4      4    5 390     61.0
5      5    4 348     82.4
6      6    7 552     56.4

We will assume that the sample was selected using simple random sampling from a population of $N$ = 600 families. First we need to specify the design. For more information on how to specify a simple random sampling design using the survey package see the document Simple Random Sampling Analysis in R.

library(survey)
families$N <- 600
families.srs <- svydesign(id = ~1, data = families, fpc = ~N)

For a simple random sampling design the estimator is \[ r = \frac{\bar{y}}{\bar{x}} \ \ \text{where} \ \ \bar{y} = \frac{1}{n}\sum_{i=1}^n y_i \ \ \text{and} \ \ \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i, \] and its estimated variance is \[ \hat{V}(r) = \left(1-\frac{n}{N}\right)\left(\frac{1}{\bar{x}^2}\right)\frac{s_r^2}{n} \ \ \text{where} \ \ s_r^2 = \frac{1}{n-1}\sum_{i=1}^n(y_i - rx_i)^2. \] These are computed using the svyratio function, which requires that we specify the numerator variable ($y_i$), the denominator variable ($x_i$), and the design object.

svyratio(~expenses, ~size, design = families.srs)

Ratio estimator: svyratio.survey.design2(~expenses, ~size, design = families.srs)
Ratios=
             size
expenses 14.58862
SEs=
             size
expenses 1.029044

Thus we estimate that for the community the weekly medical expenses is about $14.59 per person, with a bound on the error of estimation of about 2 $\times$ $1.03 = $2.06 per person.

Domain Estimation

Domain estimators of means in post-stratification (i.e., when samples are not drawn from specific domains but rather classified into domains after sampling) are ratios. Under simple random sampling the usual estimator of the domain mean $\mu_d$ can be written as \[ \bar{y}_d = \frac{\sum_{i=1}^n d_i y_i}{\sum_{i=1}^n d_i}, \] where \[ d_i = \begin{cases} 1, & \text{if the $i$-th unit is in the domain}, \\ 0, & \text{if the $i$-th unit is not in the domain}. \end{cases} \] For an example consider the dogscats data frame that can be loaded as follows.

library(downloader)
source_url("https://db.tt/uoDbuoRR", prompt = FALSE)
head(dogscats)

  unit type visits expenses
1    1  Dog      4    45.14
2    2  Dog      5    50.13
3    3  Cat      2    27.15
4    4  Dog      3    45.80
5    5  Cat      1    23.39
6    6  Cat      2     8.24

These data are from an example by Levy and Lemeshow (1990).² Here is their description of the survey.

“A veterinarian is interested in studying the annual veterinary costs of his clientele (who own either dogs or cats). From a separate record system, he knows that he sees 850 dogs and 450 cats regularly in his practice (these are numbers of animals, not numbers of visits). He knows that the information on type of animal (i.e., dog or cat) is contained in the medical records, but that it would take too much time to sort the records into strata defined by animal type. So he decides to select a simple random sample and then poststratify. He regards the poststratification process as necessary since he knows that, on average, it costs more to keep dogs healthy than to keep cats healthy. He samples 50 records, recording the total amount of money spent (including medication) by the owners of the animals he saw over the past two years.”

Suppose were were specifically interested in estimating the mean expenses for dogs. There are several ways to do this. One is to create $y_i^* = d_iy_i$ and use svyratio. We need to create the new variable before specifying the design.

dogscats$N <- 850 + 450
dogscats$d <- ifelse(dogscats$type == "Dog", 1, 0)
dogscats$dy <- dogscats$expenses * dogscats$d
head(dogscats)

  unit type visits expenses    N d    dy
1    1  Dog      4    45.14 1300 1 45.14
2    2  Dog      5    50.13 1300 1 50.13
3    3  Cat      2    27.15 1300 0  0.00
4    4  Dog      3    45.80 1300 1 45.80
5    5  Cat      1    23.39 1300 0  0.00
6    6  Cat      2     8.24 1300 0  0.00

dogscats.srs <- svydesign(id = ~1, data = dogscats, fpc = ~N)

Now we can apply svyratio to the variables dy and d.

svyratio(~dy, ~d, design = dogscats.srs)

Ratio estimator: svyratio.survey.design2(~dy, ~d, design = dogscats.srs)
Ratios=
          d
dy 49.85844
SEs=
         d
dy 1.44369

Thus we esimate that the mean expenses for dogs is about $49.86 with a bound on the error of estimation of about $2.89. Thus we can always estimate a domain mean from a post-stratified design by estimating a ratio. However survey has simpler way to estimate domain means. If we use the svyby function it will recognize that the design was post-stratified (assuming we specified the design correct) and do the appropriate calculations without the need to create $d_i$ and $d_iy_i$.³:

svyby(~expenses, by = ~type, design = dogscats.srs, FUN = svymean)

    type expenses      se
Cat  Cat 21.71111 1.96505
Dog  Dog 49.85844 1.44369

Note that the estimate and standard error for dogs are the same as before. One thing to note is that when computing the estimated variance of the domain mean, svyratio and svyby are using the estimated value of $\mu_x$ in the formula \[ \hat{V}(r) = \left(1-\frac{n}{N}\right)\left(\frac{1}{\mu_x^2}\right)\frac{s_r^2}{n} \ \ \text{where} \ \ s_r^2 = \frac{1}{n-1}\sum_{i=1}^n(y_i^* - rx_i)^2. \] That is, the sample mean $\bar{x}$ is use in place of $\mu_x$. But note that $x_i = d_i$ so $\mu_x = N_d/N$. Since $\mu_x$ is not given it is assumed to be unknown so survey uses the estimator $\bar{x} = n_d/n$ in place of $\mu_x$ when computing the estimated variance. However in this case we know that $\mu_x$ = 850/1300. To adjust the standard error given by svyratio and svyby to use this information, we could multiply it $\bar{x}/\mu_x$ to effectively replace $\bar{x}$ with $\mu_x$. Domain totals can also be computed using svyratio or svyby.

svyby(~expenses, by = ~type, design = dogscats.srs, FUN = svytotal)

    type expenses       se
Cat  Cat 10160.80 2108.889
Dog  Dog 41482.22 4520.707

The estimator that is used is $\hat{\tau}_d = \frac{N}{n}\sum_{i=1}^n d_iy_i$, which we can verify by using svytotal with $d_iy_i$.

svytotal(~dy, design = dogscats.srs)

   total     SE
dy 41482 4520.7

However this estimator does not take into account that we know the size of each domain (i.e., $N_d$). If $N_d$ is known as it is here then we can use the estimator $\tau_d = N_d\bar{y}_d$. To compute this estimator and its standard error we can multiply the estimate and standard error for $\bar{y}_d$ by $N_d$. Using the results above, the estimated total expenses for dogs is $\hat{\tau}_d \approx$ 850 $\times$ $49.86 = $42379.67.

Ratio Estimators

Under simple random sampling the ratio estimators for $\mu_y$ and $\tau_y$ are \[ \hat{\mu}_y = r\mu_x \ \ \text{and} \ \ \hat{\tau}_y = r\tau_x, \] respectively, where $r = \bar{y}/\bar{x}$. Consider again the families data frame. Suppose that we know that the total number of people in the community of $N$ = 600 households is 2700 people. We can then use family size as an auxiliary variable with $\tau_x$ = 2700 and $\mu_x$ = 2700/600 = 4.5. A ratio estimator for $\mu_y$ can be computed using the predict function in combination with svyratio.

predict(svyratio(~expenses, ~size, design = families.srs), total = 2700)

$total
             size
expenses 39389.27

$se
             size
expenses 2778.418

So the estimated total weekly expenses for the community is $39389.27 with a bound on the error of estimation of about $5556.84. It is interesting to compare the ratio estimator with the estimator $N\bar{y}$ which does not use the auxiliary variable.

svytotal(~expenses, design = families.srs)

         total     SE
expenses 32626 2129.5

Notice that the standard error for the ratio estimator is larger than that for the estimator $N\bar{y}$. The reason why can be seen if we look at a scatter plot of the data.

with(families, plot(size, expenses))
with(families, abline(0, mean(expenses)/mean(size)))

Note that the relationship betweem expenses and size is relatively weak. For a ratio estimator to be efficient we need a strong relationship between the target and auxiliary variable such that the target variable is approximately proportional to the auxiliary variable. That doesn’t appear to be the case here.

The predict function does not directly provide a way to compute the ratio estimator of $\mu_y$, but it can be computed relatively easily from the ratio estimator of $\tau_y$ because $\hat{\mu}_y = \hat{\tau}_y/N$, so to get the estimate and standard error for the ratio estimator of $\mu_y$ we can simply divide the estimate and standard error for the ratio estimator of $\tau_y$ that are given by predict by $N$.

Regression Estimators and Calibration

Ratio estimators, regression estimators, and generalized regression estimators can be computed using the calibrate function. What this function does is reweight the sample to incorporate the information known about the auxiliary variables and calibrate the sample with respect to that information. Then functions such as svymean or svytotal can be applied to the calibrated sample. Although ratio estimators can be computed using predict with svyratio, survey does not have a separate function for regression estimators, but they can be implemented using the calibrate function, as can ratio estimators.

For an example consider the data frame trees which is included with R.

head(trees)

  Girth Height Volume
1   8.3     70   10.3
2   8.6     65   10.3
3   8.8     63   10.2
4  10.5     72   16.4
5  10.7     81   18.8
6  10.8     83   19.7

This data frame is a sample of $n$ = 31 black cherry trees. Assume that the target variavble is Volume while Girth and Height are potential auxiliary variables. Also for the purpose of this example assume that this is a sample from a population of $N$ = 1000 trees, and that the totals for Girth and Height are perhaps known ($\tau_{g}$ = 14000 and $\tau_h$ = 77000), but we are interested in estimating the total volume ($\tau_v$) and mean volume ($\mu_v$) of the 1000 trees, which are unknown. We will assume that the sample was selected using simple random sampling.

trees$N <- 1000
trees.srs <- svydesign(id = ~1, data = trees, fpc = ~N)

First we can see how to compute a ratio estimator using calibrate. Suppose that we are going to only use girth as an auxiliary variable. The ratio estimator of total volume can then be computed using predict with svyratio.

predict(svyratio(~Volume, ~Girth, design = trees.srs), total = 14000)

$total
          Girth
Volume 31882.64

$se
          Girth
Volume 1811.891

But we can also do this with calibrate.

trees.cal <- calibrate(trees.srs, formula = ~Girth - 1, population = 14000, 
    variance = 1)
svytotal(~Volume, design = trees.cal)

       total     SE
Volume 31883 1811.9

Notice that calibrate doesn’t produce any estimates or standard errors itself. It only calibrates the sample. Estimates can then be obtained using other functions. The arguments of calibrate need some explanation. The formula argument specifies the kind of calibration we are using. It uses a conventional method in R for specifying linear regression models. The “formula” ~Girth-1 implies $y_i = a + bx_i$ but with $a=0$. The -1 can be read as “remove the intercept” from the formula. Note that the prediction formula $y_i = a + bx_i$ is consistent with $y_i = rx_i$ if $a=0$. However to get $b=\bar{y}/\bar{x}$ we need to specify a particular variance structure where the variance of $y_i$ is assumed to be proportional to $x_i$.⁴ This is achieved by the variance = 1 argument. The population = 14000 argument is how we tell calibrate that $\tau_g$ = 140000. A nice feature of calibrate is that it also allows us to compute the ratio estimator of $\mu_y$ directly.

svymean(~Volume, design = trees.cal)

         mean     SE
Volume 30.171 2.9062

Calibration works by reweighting so that the new weights satisfy certain constaints. For a ratio estimator this is $\tau_x = \sum_{i=1}^n w_ix_i$ so that the “estimate” of the total of the auxiliary variable agrees with the known value of $\tau_x$. We can see this if we extract the weights using the weights function and use them to “estimate” $\tau_g$.

sum(weights(trees.cal) * trees$Girth)

[1] 14000

We can also confirm that the estimate for the total $\tau_y$ is computed using $\hat{\tau}_y = \sum_{i=1}^n w_iy_i$ using the weights created by calibrate.

sum(weights(trees.cal) * trees$Volume)

[1] 31882.64

The relationship between tree volume and girth might be better approximated by a line that is not contrained to have a zero intercept. This can be seen if we look at a scatter plot of the sample data.

with(trees, plot(Girth, Volume))
with(trees, abline(0, mean(Volume)/mean(Girth)))

The line shown passes through the origin and the coordinate $(\bar{x},\bar{y})$, which is what would generate the predictions for a ratio estimator, but it doesn’t appear appropriate here. A regression estimator may be a better choice. Regression estimators can be implemented using calibrate.

trees.cal <- calibrate(trees.srs, formula = ~Girth, population = c(`(Intercept)` = 1000, 
    Girth = 14000))
svytotal(~Volume, trees.cal)

       total     SE
Volume 33978 857.99

Here formula = ~Girth implies that prediction model $y_i = a + bx_i$. Unless we include the -1 an intercept is assumed by default. The standard regression estimator estimates the slope and intercept assuming a constant variance, which is the default so unlike with the ratio estimator there is no need to specify anything for the variance argument. The population argument provides two values: $N$ and $\tau_g$. The syntax is a bit odd, but consider that we can write the prediction model alternatively as $y_i = ax_{i0} + bx_{ig}$, where $x_{ig}$ is the girth of the $i$-th tree, and $x_{i0} = 1$ for all trees. The regression estimator satisfies the calibration constraints \[ N = \sum_{i=1}^n w_i x_{i0} = \sum_{i=1}^n w_i\ \ \text{and} \ \ \tau_g = \sum_{i=1}^n w_i x_{ig}, \] which we can confirm numerically:

sum(weights(trees.cal))

[1] 1000

sum(weights(trees.cal) * trees$Girth)

[1] 14000

So basically with the population argument we are specifying that the (Intercept) “variable”" has a total of 1000, to calibrate with respect to the known population total $N$, and that the Girth variable has a total of 14000 in the population.

Note that the standard error for the regression estimator for the total volume in quite a bit smaller than that for the ratio estimator. We might be able to do even better by using tree height as well as girth as an auxiliary variable. This is a generalized regression estimator.

trees.cal <- calibrate(trees.srs, formula = ~Girth + Height, population = c(`(Intercept)` = 1000, 
    Girth = 14000, Height = 77000))
svytotal(~Volume, trees.cal)

       total     SE
Volume 34049 749.65

The standard error suggests that including the second auxiliary variable may be wortwhile. Note that formula = ~Girth+Height argument implies the prediction model $y_i = a + b_gx_{ig} + b_hx_{ih}$ where $x_{ig}$ and $x_{ih}$ are the girth and height of the $i$-th tree. The calibrate function allows us to include any number of auxiliary variables. Auxiliary variables can also be categorical as shown in the next section.

Post-Stratification as Calibration

Post-stratification is basically calibration with a categorical auxiliary variable. First consider the results from applying the postStratify function to the survey data in the dogscats data frame.⁵ Here is the estimate of the total expenses using post-stratification.

pop <- data.frame(type = c("Dog", "Cat"), Freq = c(850, 450))
dogscats.pst <- postStratify(dogscats.srs, strata = ~type, population = pop)
svytotal(~expenses, design = dogscats.pst)

         total     SE
expenses 52150 1512.5

The same estimate can be obtained using calibration.

dogscats.cal <- calibrate(dogscats.srs, formula = ~type, population = c(`(Intercept)` = 1300, 
    typeDog = 850))
svytotal(~expenses, design = dogscats.cal)

         total     SE
expenses 52150 1512.5

The specification of the population argument may seem a little strange because only the total for dogs has been specified. The reason is that the underlying regression model uses dummy/indicator variables, and only one of these is needed two distinguish between two strata. By default, R will order the strata alphanumerically and omit the first so that dummy/indicator variables exist only for the second through the last strata. So the size of the first strata does not need to be specified.

Double (Two-Phase) Sampling and Calibration

Double or two-phase sampling can be used when we want to use a ratio or (generalized) regression estimator, or other calibration methods, but the necessary population totals of the auxiliary variables are unknown. A simulated data set from a double-sampling design can be brought into an R session as follows.

library(downloader)
source_url("https://db.tt/bvWY2DgV", prompt = FALSE)

These data include $n'$ = 200 observations, which also include the $n$ = 31 observations of the black cherry trees. The auxiliary variables Girth and Volume are observed for $n'$ = 200 trees, but Volume was observed for only a sub-sample of $n$ = 31 trees. Assume that the large sample was selected using simple random sampling from a population of $N$ = 1000 trees, while the smaller sample was selected using simple random sampling from the larger sample. The data frame is called trees.double. Note that the data frame contains a variable called subsample which is TRUE if the observation is from the second phase sample, and FALSE otherwise.⁶

trees.double$N <- 1000
trees.2ph <- twophase(id = list(~1, ~1), data = trees.double, fpc = list(~N, 
    NULL), subset = ~subsample)
trees.cal <- calibrate(trees.2ph, formula = ~Girth + Height, phase = 2)

Note that here we do not specify the population argument. Instead by including the argument phase = 2 we tell calibrate to calibrate the observations in the second phase sample using the information from the first phase sample. Ratio or (generalized) regression estimators can then be obtained by applying functions to trees.cal.

svytotal(~Volume, design = trees.cal)

       total     SE
Volume 30519 1263.9

This document is a work in progress. It is based on version 3.30.3 of the survey package.↩
Levy, P. S. & Lemeshow, S. (1999). Sampling of populations: Methods and appliations. (3rd ed.). New York: Wiley.↩
Not all statistical packages will correctly recognize the design when domain means are requested, but the survey package tries to be “smart” and use what you specified when you created the design object.↩
The regression line implied by a ratio estimator is the weighted least squares regression line for the model $y_i = bx_i + \epsilon_i$ where the weights are $w_i = 1/x_i$. Note that these weights are not the same as design or calibration weights. In R this regression line can be obtained using lm(y ~ x, weights = 1/x).↩
The postStratify function is discussed in the documents Simple Random Sampling Analysis in R and Stratified Random Sampling Analysis in R.↩
You could create this variable yourself with something like trees.double$subsample <- ifelse(is.na(trees.double$Volume), TRUE, FALSE) or trees.double$subsample <- is.na(trees.double$Volume).↩