Survey Weighting for Stratas and Clusters

Here is an example of selecting and analyzing data from clusters and stratas. The first example uses two packages survey and sampling, which will each need to be installed and libraried. In this first example, we look at how to select a stratified random sample and then calculate the mean using the correct survey design. The first step is to create an artificial data set. In this data set, there are three continuous variables (y, x1, and x2) and one strata variable indicating which of the five stratas a data point lies within. We can then use the strata function in the sampling package to select a subset of data based upon stratas. In the example below, we specified the strata variable as x3 and are telling R to select 100 units from each of the five stratas using a simple random sample without replacement (srswor) method. Then we need to combine the original data with the other variables (i.e. x1 and x2 are dropped when we selected the stratified sample) to obtain a full data set. Then we can calculate the survey mean using the svydesign function in the survey package. For this example, the id =~ 1, because we used a simple random sample from the stratas, the strata argument calls for the variable containing the stratum, which is Stratum in this example, and then we need to specify the data, which is the full data set. Finally, we can calculate the mean by using the surveymean function. In this example, we are calculating the mean of y, by specifying the y variable in the full data set and also specifying the survey design.

library(survey)
library(sampling)

set.seed(123)
data = data.frame(y = rnorm(1000), x1 = rnorm(1000), x2 = rnorm(1000), x3 = rep(c(1,2,3,4,5), 200))
data = as.data.frame(data)

dataStrat = strata(data = data, stratanames = c("x3"), size = c(rep(100,5)), method = "srswor", set.seed(123))

dataStratFull = getdata(data, dataStrat)
dataStratFull = as.data.frame(dataStratFull)
head(dataStratFull)

##             y         x1          x2 x3 ID_unit Prob Stratum
## 1  -0.5604756 -0.9957987 -0.51160372  1       1  0.5       1
## 6   1.7150650  1.0405735 -0.61526832  1       6  0.5       1
## 11  1.2240818  2.7973911 -0.83599931  1      11  0.5       1
## 16  1.7869131  0.1870511  1.12265422  1      16  0.5       1
## 36  0.6886403 -1.1368931  0.12703861  1      36  0.5       1
## 41 -0.6947070 -1.0223473  0.04599377  1      41  0.5       1

dataStratDesign = svydesign(id = ~1, strata = ~ Stratum, data = dataStratFull)

svymean(~ dataStratFull$y , dataStratDesign)

##                      mean     SE
## dataStratFull$y 0.0074284 0.0451

Now we are going to select a random sample from clusters. Y, x1, and x4 (previously x3) are the same as before, and now we have a variable titled x3 that identifies the cluster. In this example, there are 100 clusters with 100 data points in each cluster. To select the data based upon cluster we use the cluster function. The cluster function is the same as the strata, but instead of strataname, we input the clustername (x3 in this example) and then select the number of data points we want from each cluster indicated by the size argument, which is 50 in this example. The total sample is size 5,000, because we are selecting 50 clusters from the 100 possible clusters. There are 100 data points for each cluster (50*100 = 5,000). Then for the survey design instead of identifying the strata variable, the user will identify the cluster variable.

set.seed(123)
data = data.frame(y = rnorm(10000), x1 = rnorm(10000), x2 = rnorm(10000), x3 = rep(c(1:100), 100), x4 = rep(c(1,2,3,4,5), 2000))
data = as.data.frame(data)

library(sampling)

dataCluster = cluster(data = data, clustername = c("x3"), size = 50, method = "srswor", set.seed(123))

dataClusterFull = getdata(data, dataCluster)
dataClusterFull = as.data.frame(dataClusterFull)
head(dataClusterFull)

##               y         x1         x2 x4 x3 ID_unit Prob
## 1501 -0.8209867 -0.5582810 -1.8983838  1  1    1501  0.5
## 1901  1.5373275 -1.8923072 -0.3615739  1  1    1901  0.5
## 7501  0.3500025 -0.7912208 -1.1700894  1  1    7501  0.5
## 2701 -2.0061200 -0.9547640  0.3813596  1  1    2701  0.5
## 7601  0.8852981  0.3733094  1.4495605  1  1    7601  0.5
## 901  -1.0141142  0.2793207 -2.0064708  1  1     901  0.5

dataClusterDesign = svydesign(id = ~1, cluster = ~ x3, data = dataClusterFull) 
svymean(~ dataClusterFull$y , dataClusterDesign)

##                        mean     SE
## dataClusterFull$y 0.0088334 0.0142