Step By Step Guide to Creating Basic Rake Weights in R

Survey weights are widely used in survey research for a variety of purposes. In this tutorial, we will be focusing on one specific form of survey weights called a “rake weight”. Rake weights are used to make the survey sample match the target population on a set of demographic, and sometimes attitudinal, measures. They are used to ensure the sample’s demographics match the target population’s demographics. This numerical correction will change how much each individual case in your dataset is contributing to the overall, or sub-group, mean values across your sample data.

First, we load necessary packages to compute and analyze the weights. If a package is not installed on your machine, you must first install it before this chunk of code will run.

1. Import your survey data into R

We need to import our survey data into R. The way we do this will vary by the format of your data. Here, we will pull data directly from Github then read the data into R.

One critical thing to note is that you must force your data into as.data.frame or else the rake weighting function will not work. We do that below using this line of code for our sample data: sample<-as.data.frame(sample)

#Pull Data Directly from Github - Need Internet Connection 
url <- "https://github.com/drCES/survey_weighting_dacss695/raw/main/weighting_sample.dta"
sample <- read_dta(url)# Read the Stata file into R
names(sample) #Looks at all the variables in the dataset

##  [1] "caseid"       "gov_choice"   "age_group"    "pid_4"        "ideo5"       
##  [6] "prop_111"     "prop_112"     "trump_app"    "hick_app"     "gardner_app" 
## [11] "cong_app"     "scotus_app"   "pot_law"      "gambling"     "fracking"    
## [16] "gun_control"  "anger"        "pride"        "hope"         "disgust"     
## [21] "worry"        "trump_app2"   "hick_app2"    "gardner_app2" "cong_app2"   
## [26] "scotus_app2"  "pot_law2"     "gambling2"    "fracking2"    "gun_control2"
## [31] "weight_org"   "pid_x"        "sex"          "race_4"       "speakspanish"
## [36] "marstat"      "child18"      "employ"       "faminc_new"   "casscd"      
## [41] "religiosity"  "educ"

sample<-as.data.frame(sample)#For ANESRAKE to work, data must be in a data.frame

2. Save Your Target Population Demographic Parameters

You will need to know the target population proportion for each of the variables you wish to weight your sample data on. How easy it will be to find your population values will be based on your specific target population. For this tutorial, use the provided population values for each demographic variable on our Canvas page.

Some populations will be relatively easy to find (e.g. think adult demographic proportions in the United States from the Census, CPS, or ACS results and all the sub-geographic levels that accompany them), but others won’t be as easy. Sometimes, you cannot know your target population proportions so in those cases you will not be able to weight your survey sample data.

For background on the survey data, it was a political poll conducted in October 2018 with the sample consisting of likely Colorado voters (this is the population of interest) in the then upcoming 2018 election. This gubernatorial election year poll measured multiple things including:

2018 Colorado Gubernatorial Preference - Jared Polis (D) or Walker Stapleton (R)
Policy & Approval Questions: Marijuana Legalization, Fracking, Gun Control laws, President Donald Trump, Governor John Hickenlooper (at the time)
Demographic questions for survey weighting purposes
- Age, Race/Ethnicity, Sex, Education, Marital Status, Child in the Home, Congressional District

Saving New Vectors With Target Population Demographic Values

Using this data, we will review how to create survey weights to make the sample demographic profile match the population demographic profile. To illustrate the principle, we will start with simple weights using just two demographic variables commonly used in calculating survey weights: sex (unfortunately only biological sex was collected in this survey) and race/ethnicity (split into 4 categories: white, African-American, Hispanic, Other Person of Color).

We must save a vector of data with the target population demographic proportions, so in this case we will save two vectors one called sex and one called race_4. Before we start the process of creating survey weights, it is important to review the variables themselves. We begin there.

Reviewing Unweighted Sample Demographic Profile

It is important to review the unweighted demographic values in your dataset prior to building your weights. First, it familiarizes you with your data and what values each variable can take on. Secondly, while creating weights we have to follow two critical steps to get the weighting code to work.

1. Matching Names

The names we give these vectors matter and must match the names of the appropriate demographic variable in your sample data. Since the vector names we chose were sex and race_4, the variable names in the sample data must be exactly sex and race_4 Otherwise, the code will not be able to match the two and will fail.

2. Matching Orders

The second critical thing to get correct is the order the proportion values are entered into the vector. The order must match the order the proportion values are stored in the sample data. In this example, the order of proportions stored in the sex variable in the sample data is (female, male) so the values we give the sex vector must be in that exact order as well. The same is true for the race_4 variable, which has 4 groups in the sample data: white, African-American, Hispanic, Other Person of Color The proportion order in our vector for the race_4 must match that exactly as well otherwise you are creating incorrect weights or best-case scenario getting an error message.

In this code chunk, we look at the frequency that each response option for the sex and race_4 variable is selected using the freq command and also review the value labels for each variable using the var_labels command.

freq(sample$sex) #Actual Pop: F= 0.516, M=0.484

## RECODE of gender (gender) 
##       Frequency Percent
## 1           489   51.42
## 2           462   48.58
## Total       951  100.00

val_labels(sample$sex)

## Female   Male 
##      1      2

freq(sample$race_4) #Actual Pop: 0.714, 0.050,  0.177,  0.059

## RECODE of race (race) 
##       Frequency Percent
## 1           835  87.802
## 2            24   2.524
## 3            42   4.416
## 4            50   5.258
## Total       951 100.000

val_labels(sample$race_4)

##     White     Black  Hispanic Other POC 
##         1         2         3         4

Here, we see the sample demographic profile for the sex and race_4 variables. In this sample, roughly 51.4% identified as female while the other 48.6% identified as male. This is very close to the population value of 51.6% & 48.4% indicating that our sample has good balance on this variable. If you will be weighting on a lot of other variables, even though the sample looks identical to the population on the sex variable you still would want to include it in the weighting scheme. This ensures that by weighting on the other demographic, you do not accidentally skew the sex variable out of balance with the population.

Creating Survey Weight Vectors

Now, that we know our variable names in our dataset as well as the order of the value labels, we can create the weight vectors necessary to create our survey weights. Weight vectors simply are the population values for each demographic variable included in our weights listed in the same order as they appear in the dataset. The weight vectors must match the name of the demographic variable in the dataset - here we are using sex and race_4 - and we must put the population values into the vector in the appropriate sequential order.

Here, because we know that option 1 in the sex variable is female while option 2 is male, we simply input the population values (.516 for females & .484 for males) from Canvas in that order. For the race_4 variable, the order goes white, African-American, Hispanic, other Person of Color so we input the 4 population values from Canvas into the weight vector in that exact order. Then we use the sum command to ensure it rounds to exactly 1 (otherwise you will get an error when creating the weights).

sex <- c(.516, .484)  #Target values for females and males; label order (female, male)
sum(sex) #proportions should = 1 so this checks that it does

## [1] 1

race_4 <-c(0.714, 0.05, 0.177, 0.059) #Target values race/ethnic identities - white, black, Hispanic, all others
sum(race_4) #proportions should = 1 so this checks that it does

## [1] 1

If you have population information available, all things equal, you should build survey weights using all of the variables you have information about. For intuition purposes, it is important to think about what a survey weight is doing. It is changing the proportion of different subgroups in the sample so that their opinion is more or less weighted (depending on how common/unique that specific demographic profile is). Survey weights will only change the outcome variable estimate - here who will you vote for - if the different subgroups in the sample have significantly different answers to the survey question.

Now, we look at support for the gubernatorial candidate by the two variables we will weight the sample on: sex & race_4. This is not a required step in creating survey weights but can be helpful to understand why differences might exist between the weighted and unweighted samples.

#Shows the unweighted support for the gubernatorial candidates by sex
##Update Independent variable that comes first; leave gov_choice variable and everything else the same
CrossTable(sample$sex, sample$gov_choice, prop.c = FALSE, prop.r = TRUE, prop.chisq = FALSE, prop.t = FALSE)

##    Cell Contents 
## |-------------------------|
## |                       N | 
## |           N / Row Total | 
## |-------------------------|
## 
## ===================================================
##               sample$gov_choice
## sample$sex        1       2       3       4   Total
## ---------------------------------------------------
## 1               248     217      17       7     489
##               0.507   0.444   0.035   0.014   0.514
## ---------------------------------------------------
## 2               186     253      18       5     462
##               0.403   0.548   0.039   0.011   0.486
## ---------------------------------------------------
## Total           434     470      35      12     951
## ===================================================

#Shows the unweighted support for the gubernatorial candidates by race/ethnicity
##Update Independent variable that comes first; leave gov_choice variable and everything else the same
CrossTable(sample$race_4, sample$gov_choice, prop.c = FALSE, prop.r = TRUE, prop.chisq = FALSE, prop.t = FALSE)

##    Cell Contents 
## |-------------------------|
## |                       N | 
## |           N / Row Total | 
## |-------------------------|
## 
## ======================================================
##                  sample$gov_choice
## sample$race_4        1       2       3       4   Total
## ------------------------------------------------------
## 1                  373     424      29       9     835
##                  0.447   0.508   0.035   0.011   0.878
## ------------------------------------------------------
## 2                   18       4       1       1      24
##                  0.750   0.167   0.042   0.042   0.025
## ------------------------------------------------------
## 3                   24      15       3       0      42
##                  0.571   0.357   0.071   0.000   0.044
## ------------------------------------------------------
## 4                   19      27       2       2      50
##                  0.380   0.540   0.040   0.040   0.053
## ------------------------------------------------------
## Total              434     470      35      12     951
## ======================================================

We do see differences in support for the gubernatorial candidates by the two weighting variables. Females were much more likely to support candidate 1 - Jared Polis - while males were much more likely to support candidate 2 - Walker Stapleton. Similar trends are seen in the race_4 variable where white people are more likely to support Walker Stapleton while African-Americans and Hispanics are more likely to support Jared Polis.

Because there are differences in support for the different candidates by the subgroups that are used in the weighting scheme, it is likely that there will be differences between the unweighted and weighted samples. Had the different subgroups answered the question in the same way, the unweighted and weighted sample means would be virtually identical. That is not the case here though.

Now, let’s actually calculate the weights to make the sample match the population on the specified variables.

3. Calculating Your Rake Weights

To begin, we will use the newly created vectors sex and race_4 to calculate our weights. This is a multistage process where it can wrong at any step. Be careful as you work through this part.

1. Create List

Next, we create a list that merges the two demographic vectors, sex and race_4, for use in the weighting process. Remember, the names in both files must match exactly. We give this list the name of targets to reflect this is the target population parameters we want to match the sample data to. We then give the column names to match with the sample data. Do not change the name of targets as this is used later on so must remain as is.

#Now we save these values as a list and call the list targets
#Step 1: Save the target list 
targets <- list(sex, race_4)
# remember, these names will have to match
names(targets) <- c("sex", "race_4")

2. Calculate the Weights

Once we have combined our weighting variables in the targets list, we use the anesrake function to create the survey weights for our data. This function has many possible items that could be used, with all the possible items listed in the following R chunk. You should view the R documentation for all possible things it can do.

For our purposes, we will be focusing on a few things that will be noted. We will calculate a new dataframe called myweights where we input the targets list, the name of our sample data sample, a caseid value that uniquely identifies each case, the cap item tells the function to cap the size of the survey weights at 8 and not allow any case to have a weight larger than that value. The type item tells the function how it should handle, if at all, a target population demographic that is very close to the sample value for that same demographic. For instance, the sex variable sample was almost identical to the population value so it might not be necessary to weight on that variable. The type function tells R how to handle this type of variable.

You’ll see in the output once you run the anesrake function how many iterations it took for the raking to converge on this specific set of weights. Here, it took 3 iterations across the two target demographic variables.

#anesrake(target values, dataframe, caseid, weightvec = NULL, cap = 5,
#verbose = FALSE, maxit = 1000, type = "pctlim", pctlim = 5,
#nlim = 5, filter = 1, choosemethod = "total", iterate = TRUE)

#Step 2 - Calculate the Rake Weight
set.seed(1967) #Set the seed for replication  
myweights <- anesrake(targets, sample, 
  caseid = sample$caseid, cap = 8, type = "pctlim", pctlim=.025)

## [1] "Raking converged in 3 iterations"

3. Save Weights in Sample Data

Now that we have calculated the weights for each respondent in our sample, we save that newly created weight as a new variable in our existing sample data. We now have a weight variable that we can use in our analysis of the data.

#Step 3 - Save the Rake Weight at the end of your sample data
sample$weight  <- unlist(myweights[1])

4. Reviewing the Newly Created Survey Weights

Before we start the analysis of the weighted data, let’s examine the newly created survey weights saved in our sample data.

With only 2 target weighting variables with 8 total categories combined between them, we can examine the weights individually by group. To do this, we will use the srvyr package to examine the weight size by the target groups.

#Displays summary of the weight size to see range
summary(sample$weight)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.8132  0.8132  0.8132  1.0000  0.8132  4.0078

#Shows the weight size by demographic groups used in the weighting scheme
sample %>% 
  as_survey(weights = c(weight)) %>%
  group_by(sex, race_4) %>% 
  summarise(weight = survey_mean(weight, na.rm = T))

## # A tibble: 8 × 4
## # Groups:   sex [2]
##   sex        race_4        weight weight_se
##   <dbl+lbl>  <dbl+lbl>      <dbl>     <dbl>
## 1 1 [Female] 1 [White]      0.813  0       
## 2 1 [Female] 2 [Black]      1.98   5.19e-17
## 3 1 [Female] 3 [Hispanic]   4.01   0       
## 4 1 [Female] 4 [Other POC]  1.12   4.58e-17
## 5 2 [Male]   1 [White]      0.813  0       
## 6 2 [Male]   2 [Black]      1.98   0       
## 7 2 [Male]   3 [Hispanic]   4.01   0       
## 8 2 [Male]   4 [Other POC]  1.12   4.21e-17

Now we see the weight size for each of the 8 groups that we weighted our sample data on. Obviously, with more demographic variables including in the weighting scheme this list would get much more cumbersome but for pedagogical purposes it is important to look at these values to understand their meaning.

For Hispanic (race_4=3), females (sex = 1) the weight equals 4.01. This means that Hispanic females are under-represented in the sample data since the value is over 1. Fundamentally what this means is that for each Hispanic female in the sample data, they are “speaking” for 4.01 Hispanic females from the target population. Compare this value to a white female whose weight equals .813. This value, being under 1, means that the sample has oversampled white females so the ones in the sample speak for roughly .813 white females in the sample.

Examine the Weighted Demographic Values

We should also look at the weighted demographic values to ensure the weights worked as we hope they do - i.e. that the weighted sample demographic values match the target population values. If the weights worked as they should, the weighted sample values for the two variables in the weighting scheme should match the population values we specified.

sample %>%
  as_survey(weights = c(weight)) %>%
  group_by(race_4) %>% #Update variable in ()
  summarise(n = survey_total()) %>% 
  mutate(weighted_sample = n /sum(n))

## # A tibble: 4 × 4
##   race_4            n  n_se weighted_sample
##   <dbl+lbl>     <dbl> <dbl>           <dbl>
## 1 1 [White]     679.   8.21           0.714
## 2 2 [Black]      47.6  9.59           0.05 
## 3 3 [Hispanic]  168.  25.4            0.177
## 4 4 [Other POC]  56.1  7.73           0.059

#Saves the weighted & unweighted size of the race_4 demographics
ag_w<-sample %>%
  as_survey(weights = c(weight)) %>%
  group_by(race_4) %>% #Update variable in ()
  summarise(n = survey_total()) %>% 
  mutate(weighted_sample = n /sum(n))

ag_uw<- sample %>%
  group_by(race_4) %>%  #Update variable in ()
  summarise(n = n()) %>% 
  mutate(unweighted_sample = n /sum(n))
ag_combo<-left_join(ag_w, ag_uw, by = "race_4", suffix = c("", "_pop")) %>%  #Update variable in by =""
  group_by(race_4)  #Update variable in ()

ag_combo$ag_diff_per<- 100*(ag_combo$weighted_sample-ag_combo$unweighted_sample)
ag_combo

## # A tibble: 4 × 7
## # Groups:   race_4 [4]
##   race_4            n  n_se weighted_sample n_pop unweighted_sample ag_diff_per
##   <dbl+lbl>     <dbl> <dbl>           <dbl> <int>             <dbl>       <dbl>
## 1 1 [White]     679.   8.21           0.714   835            0.878      -16.4  
## 2 2 [Black]      47.6  9.59           0.05     24            0.0252       2.48 
## 3 3 [Hispanic]  168.  25.4            0.177    42            0.0442      13.3  
## 4 4 [Other POC]  56.1  7.73           0.059    50            0.0526       0.642

print(ag_combo$weighted_sample)

## [1] 0.714 0.050 0.177 0.059

print(targets)

## $sex
## [1] 0.516 0.484
## 
## $race_4
## [1] 0.714 0.050 0.177 0.059

Here, we see that the weighted sample values for the race_4 variable match the specified population values exactly. This is what we want to see. Sometimes with a lot of variables in the weighting scheme, these will not match exactly but you want them to be extremely close no matter the number of variables. Now, let’s do the same thing for the sex variable.

#Saves the weighted & unweighted size of the sex demographics
sex_w<-sample %>%
  as_survey(weights = c(weight)) %>%
  group_by(sex) %>% #Update variable in ()
  summarise(n = survey_total()) %>% 
  mutate(weighted_sample = n /sum(n))

sex_uw<- sample %>%
  group_by(sex) %>%  #Update variable in ()
  summarise(n = n()) %>% 
  mutate(unweighted_sample = n /sum(n))
sex_combo<-left_join(sex_w, sex_uw, by = "sex", suffix = c("", "_pop")) %>%  #Update variable in by =""
  group_by(sex)  #Update variable in ()

sex_combo <- sex_combo %>%
  mutate(sex_diff_per = 100 * (weighted_sample - unweighted_sample))
sex_combo

## # A tibble: 2 × 7
## # Groups:   sex [2]
##   sex            n  n_se weighted_sample n_pop unweighted_sample sex_diff_per
##   <dbl+lbl>  <dbl> <dbl>           <dbl> <int>             <dbl>        <dbl>
## 1 1 [Female]  502.  22.4           0.528   489             0.514         1.41
## 2 2 [Male]    449.  20.2           0.472   462             0.486        -1.41

print(sex_combo$weighted_sample)

## [1] 0.5283409 0.4716591

print(targets)

## $sex
## [1] 0.516 0.484
## 
## $race_4
## [1] 0.714 0.050 0.177 0.059

Once again, the values match exactly for the sex variable between the weighted sample and the population values. We should feel confident that our weights worked as anticipated so we can proceed with our analysis of the weighted data.

5. Evaluating Influence of Weights on Reported Mean Values in the Sample Data - Smallish Weights

Next, let’s examine what impact these weights have on our sample values. Using tidyverse language, we can calculate the weighted and unweighted means for survey variables then compare the differences between the two. For any ordinal, continuous, or categorical variable, you can compare differences between the two samples.

Before looking at the gubernatorial results, we look at the weighted means for a dichotomous variable - fracking2 which equals 1 when respondent supports banning fracking in some situations while 0 means the respondent does not support banning fracking in those situations.

In the following code chunk, you will see two means calculations. The first one is simply the unweighted mean for the fracking2 variable whereas the second is the weighted means version. The code for the weighted means is slightly more complex as you must specify as_survey(weights=weight) to include the newly created weights in the analysis plus changing mean to survey_mean in the final line of the code to indicate that you are running a weighted mean using the srvyr package. We then bind the two dataframes together and calculate the difference in means between the two.

fracking_uw<-sample %>% #Looks at the unweighted support for fracking in CO
  summarise(unweight_support = mean(fracking2, na.rm = T))

fracking_w<-sample %>% #Looks at the weighted support for fracking in CO
  as_survey(weights = c(weight)) %>%
   summarise(weight_support = survey_mean(fracking2, na.rm = T))


fracking_combo<-cbind(fracking_uw, fracking_w ) 

fracking_combo <- mutate(fracking_combo, difference = weight_support - unweight_support)
fracking_combo

##   unweight_support weight_support weight_support_se   difference
## 1        0.5719027      0.5678267        0.01974743 -0.004075958

When we compare the differences on support for the fracking ban between the weighted and unweighted samples, we see that there is virtually no difference between the weighted and unweighted estimates of how supportive Coloradoans are of fracking. Why is this? This occurs sometimes when the weights that applied simply do not change the sample composition enough to have an influence on the overall sample mean.

Now, let’s look at a the gubernatorial results. This is a slightly more complex approach than above since we have 4 choices rather than the mean of a variable.

#Gubernatorial Vote Choice - Weighted & Unweighted 
gov_uw<-sample %>% #Save new dataframe for later manipulation
  group_by(gov_choice) %>% #Group the outcome by which candidate respondent supports
  filter(!is.na(gov_choice)) %>%  #Removes any missing data from the sample 
  summarise(n = n()) %>%  #Counts the number of respondents who support each candidate 
  mutate(unweight_support = n /sum(n)) #Calculate percentage of respondents who support each candidate

gov_w<-sample %>%
  as_survey(weights = c(weight)) %>% #Save new dataframe for later manipulation
  group_by(gov_choice) %>% #Group the outcome by which candidate respondent supports
  filter(!is.na(gov_choice)) %>%  #Removes any missing data from the sample 
  summarise(n = survey_total()) %>%  #Counts the weighted number of respondents who support each candidate 
  mutate(weight_support = n /sum(n))  #Calculate weighted percentage of respondents who support each candidate

gov_combo<-cbind(gov_uw, gov_w) #Combine the two files

gov_combo$diff <- gov_combo$weight_support - gov_combo$unweight_support
gov_combo

##   gov_choice   n unweight_support gov_choice         n      n_se weight_support
## 1          1 434       0.45636172          1 456.49080 22.542174     0.48001136
## 2          2 470       0.49421661          2 443.13338 19.102835     0.46596570
## 3          3  35       0.03680336          3  39.83149  8.497982     0.04188380
## 4          4  12       0.01261830          4  11.54432  3.502599     0.01213914
##            diff
## 1  0.0236496356
## 2 -0.0282509112
## 3  0.0050804304
## 4 -0.0004791548

With the gubernatorial results, the unweighted sample how candidate 2, Stapleton, leading candidate 1, Polis by 49.4% to 45.6%. Polis wins this election with 53.4% of the vote to Stapleton’s 42.8% for a margin of 10.6%. Obviously, the unweighted sample is biased in favor of Stapleton. When we look at the weighted sample, we see Polis slightly ahead with 47.9% of the vote compared to Stapleton’s 46.6% of the vote. While this poll is closer to the true margin, it is still far off. What happened? We are only including two variables in our weighting scheme when we should be including as many as we have information available for, provided we have sufficient sample size of that many.

We have 7 total variables to use in the weighting scheme and above we only used two. Let’s add two additional variables to the weighting scheme to see how that is influential. In your in-class activity, you will add the remaining 3 variables to see what impact that has on the estimates.

6. Create New Weighting Scheme That Incorporates More Demographic Variables

Typically, when creating survey weights you will include more than just 2 demographic variables into your weighting scheme. Here, we use 4 variables to create a new weight: sex, race/ethnicity, education, and martial status.

We will follow the same procedure here as above just simply with more variables included. Once again, it is important to ensure that the target population values match the name of its associated variable in the sample data.

First, we save the target population values and then prepare the data to be used in the weighting.

#Save new vectors with target population values for weights 
sex <- c(.516, .484)  ##Target values for females and males; label order (female, male)
sum(sex)

## [1] 1

race_4 <-c(0.714,   0.050,  0.177,  0.059) #Target values race/ethnic identities - white, black, Hispanic, all others
sum(race_4)

## [1] 1

educ <-c(0.211, 0.226,  0.086,  0.304,  0.173) #Target values education - HS or less, Some college, AA, BA, Graduate degree
sum(educ)

## [1] 1

marstat<-c(0.549,   0.015,  0.099,  0.039,  0.258,  0.040) #Target values marital status - Married, Separate,   Divorced,   Widowed,    Single, Domestic)
sum(marstat)

## [1] 1

#Combine the demographic vectors into a list
targets <- list(sex, race_4, educ, marstat)
# remember, these names will have to match the column names & order in the sample data 
names(targets) <- c("sex", "race_4", "educ", "marstat")

Then we use the anesrake package to create the survey weights to make the sample values match the target population values.

set.seed(1946)
myweights <- anesrake(targets, sample, 
                      caseid = sample$caseid, cap = 8, type = "pctlim", pctlim=.05)

## [1] "Raking converged in 19 iterations"

sample$full_weight  <- unlist(myweights[1])

summary(sample$full_weight)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4452  0.6299  0.7396  1.0000  1.1688  7.9985

Let’s look at how well the weights worked to match the sample data to the target population values for the educ and marstat measures.

For education, respondents with a high school degree or less were under-represented in the initial unweighted sample, which is a typical bias in online samples. By weighting on the known population values for education, the sample better reflects the overall percentage of people with lower educational attainment.

#Shows the weighted size of the educ demographics
##We can also bind the two values together to calculate their differences 
e_w<-sample %>%
  as_survey(weights = c(full_weight)) %>%
  group_by(educ) %>%
  summarise(n = survey_total()) %>% 
  mutate(weighted_sample = n /sum(n))

e_uw<- sample %>%
  group_by(educ) %>%
  summarise(n = n()) %>% 
  mutate(unweighted_sample = n /sum(n))
e_combo<-left_join(e_w, e_uw, by = "educ") %>%
  group_by(educ)

e_combo$diff <- e_combo$weighted_sample - e_combo$unweighted_sample
e_combo

## # A tibble: 5 × 7
## # Groups:   educ [5]
##   educ                n.x  n_se weighted_sample   n.y unweighted_sample     diff
##   <dbl+lbl>         <dbl> <dbl>           <dbl> <int>             <dbl>    <dbl>
## 1 1 [HS grad or le… 201.  23.8            0.211   120             0.126  0.0848 
## 2 2 [Some college]  215.  16.1            0.226   223             0.234 -0.00849
## 3 3 [AA degree]      81.8  8.71           0.086   111             0.117 -0.0307 
## 4 4 [BA]            289.  19.6            0.304   282             0.297  0.00747
## 5 5 [Post-grad]     165.  11.6            0.173   215             0.226 -0.0531

For education, the weight had a dramatic influence on the number of high school only educated respondents increasing their percentage of the sample from 12% to 21% of the sample matching the population.

Next, we review the impact of weighting by martial status on sample composition.

#Shows the weighted size of the marstat demographics
##We can also bind the two values together to calculate their differences 
marstat_w<-sample %>%
  as_survey(weights = c(full_weight)) %>%
  group_by(marstat) %>%
  summarise(n = survey_total()) %>% 
  mutate(weighted_sample = n /sum(n))

marstat_uw<- sample %>%
  group_by(marstat) %>%
  summarise(n = n()) %>% 
  mutate(unweighted_sample = n /sum(n))
marstat_combo<-left_join(marstat_w, marstat_uw, by = "marstat") %>%
  group_by(marstat)

marstat_combo$ed_diff_per<- 100*(marstat_combo$weighted_sample-marstat_combo$unweighted_sample)
marstat_combo

## # A tibble: 6 × 7
## # Groups:   marstat [6]
##   marstat          n.x  n_se weighted_sample   n.y unweighted_sample ed_diff_per
##   <dbl+lbl>      <dbl> <dbl>           <dbl> <int>             <dbl>       <dbl>
## 1 1 [Married]    522.  20.2            0.549   592            0.623      -7.35  
## 2 2 [Separated]   14.3  4.00           0.015    15            0.0158     -0.0773
## 3 3 [Divorced]    94.1  8.97           0.099   129            0.136      -3.66  
## 4 4 [Widowed]     37.1  4.88           0.039    59            0.0620     -2.30  
## 5 5 [Never marr… 245.  26.0            0.258   128            0.135      12.3   
## 6 6 [Domestic /…  38.0  8.22           0.04     28            0.0294      1.06

Examining the impact of weighting by martial status, we see the 5th option - single, never married - having a large difference between the two samples. This indicates that single, never married respondents are underrepresented compared to their true population value.

Now, we look at the differences in gubernatorial support by martial status to see how impactful this weighting scheme might be.

CrossTable(sample$marstat, sample$gov_choice, prop.c = FALSE, prop.r = TRUE, prop.chisq = FALSE, prop.t = FALSE)

##    Cell Contents 
## |-------------------------|
## |                       N | 
## |           N / Row Total | 
## |-------------------------|
## 
## =======================================================
##                   sample$gov_choice
## sample$marstat        1       2       3       4   Total
## -------------------------------------------------------
## 1                   240     320      23       9     592
##                   0.405   0.541   0.039   0.015   0.623
## -------------------------------------------------------
## 2                     8       6       1       0      15
##                   0.533   0.400   0.067   0.000   0.016
## -------------------------------------------------------
## 3                    69      54       5       1     129
##                   0.535   0.419   0.039   0.008   0.136
## -------------------------------------------------------
## 4                    20      37       2       0      59
##                   0.339   0.627   0.034   0.000   0.062
## -------------------------------------------------------
## 5                    80      45       2       1     128
##                   0.625   0.352   0.016   0.008   0.135
## -------------------------------------------------------
## 6                    17       8       2       1      28
##                   0.607   0.286   0.071   0.036   0.029
## -------------------------------------------------------
## Total               434     470      35      12     951
## =======================================================

Here, we see a large difference for martial status 5 - single, never married - on their support for Jared Polis. This group largely supported Polis compared to Walker so by increasing the impact of this group with the weighting scheme the overall weighted vote estimate for Polis should increase as well.

#Gubernatorial Vote Choice - Weighted & Unweighted 
gov_w<-sample %>%
  as_survey(weights = c(full_weight)) %>%
  filter(!is.na(gov_choice)) %>% 
  group_by(gov_choice) %>%
  summarise(n = survey_total()) %>% 
  mutate(weight_support = n /sum(n)) 

gov_uw<-sample %>%
  group_by(gov_choice) %>%
  filter(!is.na(gov_choice)) %>% 
  summarise(n = n()) %>% 
  mutate(unweight_support = n /sum(n))

gov_combo<-cbind(gov_uw, gov_w) 


gov_combo$diff <- gov_combo$weight_support - gov_combo$unweight_support

gov_outcome<-cbind(gov_combo$gov_choice, gov_combo$weight_support, gov_combo$unweight_support, gov_combo$diff) 

colnames(gov_outcome) <- c("candidate", "weighted support", "unewighted support", "diff") 
gov_outcome

##      candidate weighted support unewighted support          diff
## [1,]         1       0.50436596         0.45636172  0.0480042392
## [2,]         2       0.44871485         0.49421661 -0.0455017618
## [3,]         3       0.03603287         0.03680336 -0.0007704949
## [4,]         4       0.01088631         0.01261830 -0.0017319826

Once we apply our new weights with the additional two weighting variables, we can see how much closer to the true election result we are. With this weighting scheme, candidate 1 - Polis - is estimated to have roughly 50.4% of the vote (true support was 53.4%) while Stapleton is estimated to have roughly 44.9% of the vote (true support was 42.8%). With this weighting scheme incorporating more population variables, the estimate becomes much closer to the true population. However, it is still off. Your job in the class activity is to improve upon this prediction.

7. Concluding Thoughts

This is an important lesson for the application of the survey weights. The target population values that you weight your survey sample data to match can have profound implications on the conclusions you and others draw from your survey results. In the case, the decision to weight the survey to give more voice to the Republican members of the sample influenced the conclusions drawn about support for various policies being debated in the public realm. This makes it critically important to make sure that the target population values that are chosen are as accurate as possible and publicly defensible.

Overall, this tutorial has taken you through how to calculate survey weights using the anesrake package. Using a sample political poll, you hopefully learned how to create target demographic population vectors, which then merge with our sample demographic values. Following this, you learned how to calculate directly survey weights, evaluate the success/failure of the survey weighting process, and compare the impact of using the survey weight on the conclusions drawn from the results.

End of tutorial

sessionInfo()

## R version 4.2.2 (2022-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19045)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] anesrake_0.80   weights_1.0.4   Hmisc_5.1-0     srvyr_1.2.0    
##  [5] survey_4.2-1    survival_3.5-5  Matrix_1.5-4.1  descr_1.1.7    
##  [9] labelled_2.11.0 lubridate_1.9.2 forcats_1.0.0   stringr_1.5.0  
## [13] dplyr_1.1.2     purrr_1.0.1     readr_2.1.4     tidyr_1.3.0    
## [17] tibble_3.2.1    ggplot2_3.4.4   tidyverse_2.0.0 haven_2.5.2    
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.6        jsonlite_1.8.4    splines_4.2.2     gtools_3.9.4     
##  [5] bslib_0.4.2       Formula_1.2-5     highr_0.10        yaml_2.3.7       
##  [9] pillar_1.9.0      backports_1.4.1   lattice_0.21-8    glue_1.6.2       
## [13] digest_0.6.31     checkmate_2.2.0   minqa_1.2.5       colorspace_2.1-0 
## [17] htmltools_0.5.5   pkgconfig_2.0.3   broom_1.0.5       xtable_1.8-4     
## [21] scales_1.2.1      gdata_2.19.0      tzdb_0.4.0        lme4_1.1-33      
## [25] timechange_0.2.0  htmlTable_2.4.1   generics_0.1.3    cachem_1.0.8     
## [29] withr_2.5.0       nnet_7.3-19       cli_3.4.1         crayon_1.5.2     
## [33] magrittr_2.0.3    evaluate_0.21     mice_3.15.0       fansi_1.0.4      
## [37] nlme_3.1-162      MASS_7.3-60       foreign_0.8-84    tools_4.2.2      
## [41] data.table_1.14.8 hms_1.1.3         mitools_2.4       lifecycle_1.0.3  
## [45] munsell_0.5.0     cluster_2.1.4     compiler_4.2.2    jquerylib_0.1.4  
## [49] rlang_1.1.1       nloptr_2.0.3      rstudioapi_0.15.0 htmlwidgets_1.6.2
## [53] base64enc_0.1-3   rmarkdown_2.22    boot_1.3-28.1     gtable_0.3.3     
## [57] curl_5.0.0        DBI_1.1.3         R6_2.5.1          gridExtra_2.3    
## [61] knitr_1.43        fastmap_1.1.1     utf8_1.2.3        stringi_1.7.12   
## [65] Rcpp_1.0.10       vctrs_0.6.2       rpart_4.1.19      tidyselect_1.2.0 
## [69] xfun_0.39

Survey Weighting in R: Using Rake Weights to Adjust Samples to Population Demographics

Dr. C.E. Stapleton

2024-11-12

Note: This tutorial uses the `anesrake` package to calculate the survey weights. They are many other packages to calculate weights so this is just one possible approach that could successfully be used to create survey weights.

Step By Step Guide to Creating Basic Rake Weights in R

1. Import your survey data into R

2. Save Your Target Population Demographic Parameters

Saving New Vectors With Target Population Demographic Values

Reviewing Unweighted Sample Demographic Profile

1. Matching Names

2. Matching Orders

Creating Survey Weight Vectors

3. Calculating Your Rake Weights

1. Create List

2. Calculate the Weights

3. Save Weights in Sample Data

4. Reviewing the Newly Created Survey Weights

Examine the Weighted Demographic Values

5. Evaluating Influence of Weights on Reported Mean Values in the Sample Data - Smallish Weights

6. Create New Weighting Scheme That Incorporates More Demographic Variables

7. Concluding Thoughts

Survey Weighting in R: Using Rake Weights to Adjust Samples to Population Demographics

Dr. C.E. Stapleton

2024-11-12

Note: This tutorial uses the anesrake package to calculate the survey weights. They are many other packages to calculate weights so this is just one possible approach that could successfully be used to create survey weights.

Step By Step Guide to Creating Basic Rake Weights in R

1. Import your survey data into R

2. Save Your Target Population Demographic Parameters

Saving New Vectors With Target Population Demographic Values

Reviewing Unweighted Sample Demographic Profile

1. Matching Names

2. Matching Orders

Creating Survey Weight Vectors

3. Calculating Your Rake Weights

1. Create List

2. Calculate the Weights

3. Save Weights in Sample Data

4. Reviewing the Newly Created Survey Weights

Examine the Weighted Demographic Values

5. Evaluating Influence of Weights on Reported Mean Values in the Sample Data - Smallish Weights

6. Create New Weighting Scheme That Incorporates More Demographic Variables

7. Concluding Thoughts

Note: This tutorial uses the `anesrake` package to calculate the survey weights. They are many other packages to calculate weights so this is just one possible approach that could successfully be used to create survey weights.