anesrake package to calculate the survey weights. There are many other packages that calculate weights, so this is just one possible approach to creating survey weights.
Survey weights are widely used in survey research for a variety of purposes. In this tutorial, we will be focusing on one specific form of survey weight called a “rake weight”. Rake weights are used to make the survey sample match the target population on a set of demographic, and sometimes attitudinal, measures. This numerical correction changes how much each individual case in your dataset contributes to the overall, or sub-group, mean values across your sample data.
First, we load necessary packages to compute and analyze the weights. If a package is not installed on your machine, you must first install it before this chunk of code will run.
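The chunk that loads the packages is not shown here, so the following is only a minimal sketch of what it might look like. The package list is an assumption, inferred from the functions used later in this tutorial (read_dta, val_labels, freq, CrossTable, as_survey, anesrake) and from the session information at the end; adjust it to your own setup.

```r
# Packages assumed from the functions used later in this tutorial
pkgs <- c("tidyverse", "haven", "labelled", "descr", "srvyr", "anesrake")

# Check which of them are already installed
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  # Install these once with install.packages() before continuing
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}

# Load the packages that are available
for (p in pkgs[installed]) {
  library(p, character.only = TRUE)
}
```

The sketch only reports missing packages rather than installing them, so running it has no side effects beyond attaching what is already on the machine.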
We need to import our survey data into R. The way we do this will vary by the format of your data. Here, we will pull the data directly from GitHub and then read it into R.
One critical thing to note is that you must convert your data with as.data.frame, or else the rake weighting function will not work. We do that below using this line of code for our sample data:
sample <- as.data.frame(sample)
#Pull data directly from GitHub - needs an internet connection
url <- "https://github.com/drCES/survey_weighting_dacss695/raw/main/weighting_sample.dta"
sample <- read_dta(url) #Read the Stata file into R
names(sample) #Looks at all the variables in the dataset
## [1] "caseid" "gov_choice" "age_group" "pid_4" "ideo5"
## [6] "prop_111" "prop_112" "trump_app" "hick_app" "gardner_app"
## [11] "cong_app" "scotus_app" "pot_law" "gambling" "fracking"
## [16] "gun_control" "anger" "pride" "hope" "disgust"
## [21] "worry" "trump_app2" "hick_app2" "gardner_app2" "cong_app2"
## [26] "scotus_app2" "pot_law2" "gambling2" "fracking2" "gun_control2"
## [31] "weight_org" "pid_x" "sex" "race_4" "speakspanish"
## [36] "marstat" "child18" "employ" "faminc_new" "casscd"
## [41] "religiosity" "educ"
sample <- as.data.frame(sample) #For anesrake to work, data must be in a data.frame
You will need to know the target population proportion for each of the variables you wish to weight your sample data on. How easy it will be to find your population values will be based on your specific target population. For this tutorial, use the provided population values for each demographic variable on our Canvas page.
Some populations will be relatively easy to find (e.g. think adult demographic proportions in the United States from the Census, CPS, or ACS results and all the sub-geographic levels that accompany them), but others won’t be as easy. Sometimes, you cannot know your target population proportions so in those cases you will not be able to weight your survey sample data.
For background on the survey data, it was a political poll conducted in October 2018 with the sample consisting of likely Colorado voters (this is the population of interest) in the then upcoming 2018 election. This gubernatorial election year poll measured multiple things including:
Using this data, we will review how to create survey weights to make the sample demographic profile match the population demographic profile. To illustrate the principle, we will start with simple weights using just two demographic variables commonly used in calculating survey weights: sex (unfortunately only biological sex was collected in this survey) and race/ethnicity (split into 4 categories: white, African-American, Hispanic, Other Person of Color).
We must save a vector of data with the target population demographic proportions, so in this case we will save two vectors, one called sex and one called race_4. Before we start the process of creating survey weights, it is important to review the variables themselves. We begin there.
It is important to review the unweighted demographic values in your dataset prior to building your weights. First, it familiarizes you with your data and what values each variable can take on. Secondly, while creating weights we have to follow two critical steps to get the weighting code to work.
The names we give these vectors matter and must match the names of the appropriate demographic variables in your sample data. Since the vector names we chose were sex and race_4, the variable names in the sample data must be exactly sex and race_4. Otherwise, the code will not be able to match the two and will fail.
The second critical thing to get correct is the order in which the proportion values are entered into the vector. The order must match the order in which the categories are stored in the sample data. In this example, the order of the categories stored in the sex variable in the sample data is (female, male), so the values we give the sex vector must be in that exact order as well. The same is true for the race_4 variable, which has 4 groups in the sample data: white, African-American, Hispanic, Other Person of Color. The proportion order in our race_4 vector must match that exactly as well; otherwise you will get an error message in the best case, or silently create incorrect weights in the worst case.
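The first of these two requirements can be checked programmatically. The sketch below uses toy data and a hypothetical helper, check_targets(), which is not part of anesrake; it confirms that each target vector has a matching variable name and the right number of categories. It cannot tell whether the order is right, so that still has to be verified by eye against the value labels.

```r
# Toy sample data (hypothetical values, not the tutorial data)
toy <- data.frame(
  sex    = c(1, 2, 1, 1, 2),
  race_4 = c(1, 1, 3, 2, 4)
)

# Target proportions, listed in the order of the category codes
toy_targets <- list(
  sex    = c(0.516, 0.484),
  race_4 = c(0.714, 0.050, 0.177, 0.059)
)

# check_targets() is a hypothetical helper, not part of anesrake
check_targets <- function(targets, data) {
  for (v in names(targets)) {
    # The target name must exist as a variable in the sample data
    if (!v %in% names(data)) {
      stop("No variable named '", v, "' in the sample data")
    }
    # The number of target values must equal the number of categories
    n_levels <- length(unique(na.omit(data[[v]])))
    if (length(targets[[v]]) != n_levels) {
      stop("'", v, "' has ", n_levels, " categories in the data but ",
           length(targets[[v]]), " target values")
    }
  }
  invisible(TRUE)
}

check_targets(toy_targets, toy)
```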
In this code chunk, we look at how often each response option for the sex and race_4 variables is selected using the freq command, and also review the value labels for each variable using the val_labels command.
freq(sample$sex) #Actual Pop: F= 0.516, M=0.484
## RECODE of gender (gender)
## Frequency Percent
## 1 489 51.42
## 2 462 48.58
## Total 951 100.00
val_labels(sample$sex)
## Female Male
## 1 2
freq(sample$race_4) #Actual Pop: 0.714, 0.050, 0.177, 0.059
## RECODE of race (race)
## Frequency Percent
## 1 835 87.802
## 2 24 2.524
## 3 42 4.416
## 4 50 5.258
## Total 951 100.000
val_labels(sample$race_4)
## White Black Hispanic Other POC
## 1 2 3 4
Here, we see the sample demographic profile for the sex and race_4 variables. In this sample, roughly 51.4% identified as female while the other 48.6% identified as male. This is very close to the population values of 51.6% and 48.4%, indicating that our sample has good balance on this variable. If you will be weighting on a lot of other variables, you would still want to include sex in the weighting scheme even though the sample looks nearly identical to the population on it. This ensures that by weighting on the other demographics, you do not accidentally skew the sex variable out of balance with the population.
Now that we know the variable names in our dataset as well as the order of the value labels, we can create the weight vectors necessary to create our survey weights. Weight vectors are simply the population values for each demographic variable included in our weights, listed in the same order as they appear in the dataset. The weight vectors must match the name of the demographic variable in the dataset - here we are using sex and race_4 - and we must put the population values into the vector in the appropriate sequential order.
Here, because we know that option 1 in the sex variable is female while option 2 is male, we simply input the population values (.516 for females and .484 for males) from Canvas in that order. For the race_4 variable, the order goes white, African-American, Hispanic, other Person of Color, so we input the 4 population values from Canvas into the weight vector in that exact order. Then we use the sum command to ensure the proportions add up to exactly 1 (otherwise you will get an error when creating the weights).
sex <- c(.516, .484) #Target values for females and males; label order (female, male)
sum(sex) #proportions should = 1 so this checks that it does
## [1] 1
race_4 <- c(0.714, 0.05, 0.177, 0.059) #Target values race/ethnic identities - white, black, Hispanic, all others
sum(race_4) #proportions should = 1 so this checks that it does
## [1] 1
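Because proportions are stored as floating-point numbers, a sum that prints as 1 is not always exactly equal to 1. As a small sketch, a tolerance-based comparison with base R's all.equal is a safer programmatic check than ==. The helper name sums_to_one() is made up for this illustration.

```r
sex    <- c(0.516, 0.484)
race_4 <- c(0.714, 0.050, 0.177, 0.059)

# TRUE when the proportions sum to 1 within floating-point tolerance
sums_to_one <- function(p) isTRUE(all.equal(sum(p), 1))

sums_to_one(sex)
sums_to_one(race_4)
sums_to_one(c(0.5, 0.4))  # FALSE: these proportions only sum to 0.9
```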
If you have population information available, all things equal, you should build survey weights using all of the variables you have information about. For intuition purposes, it is important to think about what a survey weight is doing. It is changing the proportion of different subgroups in the sample so that their opinion is more or less weighted (depending on how common/unique that specific demographic profile is). Survey weights will only change the outcome variable estimate - here who will you vote for - if the different subgroups in the sample have significantly different answers to the survey question.
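To make this intuition concrete, here is a toy illustration in base R. The data are entirely hypothetical: group "a" makes up 80% of the sample but only 50% of the population, so its cases are weighted down and group "b" is weighted up.

```r
# Hypothetical toy data: 8 cases in group "a", 2 cases in group "b"
group  <- rep(c("a", "b"), times = c(8, 2))

# Weight = population share / sample share for each group
weight <- ifelse(group == "a", 0.5 / 0.8, 0.5 / 0.2)

# Case 1: the two groups answer differently, so weighting moves the estimate
y_diff <- ifelse(group == "a", 1, 0)
mean(y_diff)                   # unweighted mean
weighted.mean(y_diff, weight)  # weighted mean

# Case 2: both groups have the same average answer, so weighting changes nothing
y_same <- rep(c(1, 0), times = 5)
mean(y_same)
weighted.mean(y_same, weight)
```

In the first case the estimate moves from 0.8 to 0.5; in the second it stays at 0.5 either way, because both groups give the same average answer.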
Now, we look at support for the gubernatorial candidates by the two variables we will weight the sample on: sex and race_4. This is not a required step in creating survey weights but can be helpful to understand why differences might exist between the weighted and unweighted samples.
#Shows the unweighted support for the gubernatorial candidates by sex
##Update Independent variable that comes first; leave gov_choice variable and everything else the same
CrossTable(sample$sex, sample$gov_choice, prop.c = FALSE, prop.r = TRUE, prop.chisq = FALSE, prop.t = FALSE)
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## |-------------------------|
##
## ===================================================
## sample$gov_choice
## sample$sex 1 2 3 4 Total
## ---------------------------------------------------
## 1 248 217 17 7 489
## 0.507 0.444 0.035 0.014 0.514
## ---------------------------------------------------
## 2 186 253 18 5 462
## 0.403 0.548 0.039 0.011 0.486
## ---------------------------------------------------
## Total 434 470 35 12 951
## ===================================================
#Shows the unweighted support for the gubernatorial candidates by race/ethnicity
##Update Independent variable that comes first; leave gov_choice variable and everything else the same
CrossTable(sample$race_4, sample$gov_choice, prop.c = FALSE, prop.r = TRUE, prop.chisq = FALSE, prop.t = FALSE)
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## |-------------------------|
##
## ======================================================
## sample$gov_choice
## sample$race_4 1 2 3 4 Total
## ------------------------------------------------------
## 1 373 424 29 9 835
## 0.447 0.508 0.035 0.011 0.878
## ------------------------------------------------------
## 2 18 4 1 1 24
## 0.750 0.167 0.042 0.042 0.025
## ------------------------------------------------------
## 3 24 15 3 0 42
## 0.571 0.357 0.071 0.000 0.044
## ------------------------------------------------------
## 4 19 27 2 2 50
## 0.380 0.540 0.040 0.040 0.053
## ------------------------------------------------------
## Total 434 470 35 12 951
## ======================================================
We do see differences in support for the gubernatorial candidates by the two weighting variables. Females were much more likely to support candidate 1 - Jared Polis - while males were much more likely to support candidate 2 - Walker Stapleton. Similar trends are seen in the race_4 variable, where white people are more likely to support Walker Stapleton while African-Americans and Hispanics are more likely to support Jared Polis.
Because there are differences in support for the different candidates by the subgroups that are used in the weighting scheme, it is likely that there will be differences between the unweighted and weighted samples. Had the different subgroups answered the question in the same way, the unweighted and weighted sample means would be virtually identical. That is not the case here though.
Now, let’s actually calculate the weights to make the sample match the population on the specified variables.
To begin, we will use the newly created vectors sex and race_4 to calculate our weights. This is a multistage process that can go wrong at any step. Be careful as you work through this part.
Next, we create a list that combines the two demographic vectors, sex and race_4, for use in the weighting process. Remember, the names in the list and in the sample data must match exactly. We give this list the name targets to reflect that these are the target population parameters we want to match the sample data to. We then name the list elements to match the variable names in the sample data. Do not change the name of targets, as it is used later on and so must remain as is.
#Now we save these values as a list and call the list targets
#Step 1: Save the target list
targets <- list(sex, race_4)
# remember, these names will have to match
names(targets) <- c("sex", "race_4")
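As an equivalent alternative, the list can be named at the moment it is created, which removes one place where the names could drift out of sync. This is only a sketch of a base R idiom, not a requirement of anesrake.

```r
sex    <- c(0.516, 0.484)
race_4 <- c(0.714, 0.050, 0.177, 0.059)

# Name each element while building the list
targets <- list(sex = sex, race_4 = race_4)
names(targets)
```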
Once we have combined our weighting variables in the targets list, we use the anesrake function to create the survey weights for our data. This function has many possible arguments, all of which are listed in the following R chunk. You should view the R documentation for everything it can do.
For our purposes, we will focus on a few arguments. We will calculate a new object called myweights, where we input the targets list, the name of our sample data (sample), and a caseid value that uniquely identifies each case. The cap argument tells the function to cap the size of the survey weights at 8 and not allow any case to have a weight larger than that value. The type argument tells the function how it should handle, if at all, a target population demographic that is very close to the sample value for that same demographic. For instance, the sample value for the sex variable was almost identical to the population value, so it might not be necessary to weight on that variable. The type argument tells R how to handle this type of variable.
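To build intuition for type = "pctlim", the sketch below compares the unweighted sample proportions reported earlier with the targets and flags the variables whose gap exceeds a threshold. This is a simplified illustration under the assumption that discrepancies are summed across categories (as the choosemethod = "total" default suggests); discrepancy() is a hypothetical helper, and the internal calculation in anesrake may differ in detail.

```r
# Unweighted sample proportions from the frequency tables above
sample_props <- list(
  sex    = c(489, 462) / 951,
  race_4 = c(835, 24, 42, 50) / 951
)
targets <- list(
  sex    = c(0.516, 0.484),
  race_4 = c(0.714, 0.050, 0.177, 0.059)
)

# Hypothetical helper: total absolute gap between sample and target
discrepancy <- function(s, t) sum(abs(s - t))

gaps <- mapply(discrepancy, sample_props, targets)
round(gaps, 4)

# Variables whose gap exceeds the threshold would be raked on
pctlim <- 0.025
names(gaps)[gaps > pctlim]
```

Under this reading only race_4 clears the threshold, which is consistent with the weights shown later varying by race but not by sex.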
You’ll see in the output, once you run the anesrake function, how many iterations it took for the raking to converge on this specific set of weights. Here, it took 3 iterations across the two target demographic variables.
#anesrake(target values, dataframe, caseid, weightvec = NULL, cap = 5,
#verbose = FALSE, maxit = 1000, type = "pctlim", pctlim = 5,
#nlim = 5, filter = 1, choosemethod = "total", iterate = TRUE)
#Step 2 - Calculate the Rake Weight
set.seed(1967) #Set the seed for replication
myweights <- anesrake(targets, sample,
                      caseid = sample$caseid, cap = 8, type = "pctlim", pctlim=.025)
## [1] "Raking converged in 3 iterations"
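For readers curious about what "raking" and "iterations" mean here, the following is a minimal base R sketch of the general technique (iterative proportional fitting) on hypothetical toy data. It is not the anesrake implementation: rake_sketch() is a made-up function with no weight cap and no pctlim screening, and it simply adjusts the weights one variable at a time until they stop changing.

```r
# Toy sample (hypothetical): counts of cases in each sex-by-race cell
toy    <- expand.grid(sex = 1:2, race_4 = 1:4)
counts <- c(40, 38, 3, 2, 4, 3, 5, 5)
toy    <- toy[rep(seq_len(nrow(toy)), times = counts), ]

targets <- list(
  sex    = c(0.516, 0.484),
  race_4 = c(0.714, 0.050, 0.177, 0.059)
)

# rake_sketch() is a hypothetical helper, not part of anesrake
rake_sketch <- function(data, targets, max_iter = 100, tol = 1e-8) {
  w <- rep(1, nrow(data))
  for (i in seq_len(max_iter)) {
    w_old <- w
    for (v in names(targets)) {
      # Position of each case's category among the sorted category codes
      idx <- match(data[[v]], sort(unique(data[[v]])))
      # Current weighted share of each category of v
      share <- as.vector(tapply(w, idx, sum)) / sum(w)
      # Scale each case so the weighted shares hit the target for v
      w <- w * (targets[[v]] / share)[idx]
    }
    # Stop once a full pass no longer changes the weights
    if (max(abs(w - w_old)) < tol) break
  }
  w / mean(w)  # rescale so the weights average 1
}

w <- rake_sketch(toy, targets)
tapply(w, toy$sex, sum) / sum(w)     # weighted sex shares
tapply(w, toy$race_4, sum) / sum(w)  # weighted race shares
```

After convergence the weighted shares for both variables sit on their targets at the same time, which is the property the raking step is meant to deliver.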
Now that we have calculated the weights for each respondent in our sample, we save that newly created weight as a new variable in our existing sample data. We now have a weight variable that we can use in our analysis of the data.
#Step 3 - Save the Rake Weight at the end of your sample data
sample$weight <- unlist(myweights[1])
Before we start the analysis of the weighted data, let’s examine the newly created survey weights saved in our sample data.
With only 2 target weighting variables, which together define 8 sex-by-race groups, we can examine the weights individually by group. To do this, we will use the srvyr package to examine the weight size by the target groups.
#Displays summary of the weight size to see range
summary(sample$weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.8132 0.8132 0.8132 1.0000 0.8132 4.0078
#Shows the weight size by demographic groups used in the weighting scheme
sample %>%
  as_survey(weights = c(weight)) %>%
group_by(sex, race_4) %>%
summarise(weight = survey_mean(weight, na.rm = T))
## # A tibble: 8 × 4
## # Groups: sex [2]
## sex race_4 weight weight_se
## <dbl+lbl> <dbl+lbl> <dbl> <dbl>
## 1 1 [Female] 1 [White] 0.813 0
## 2 1 [Female] 2 [Black] 1.98 5.19e-17
## 3 1 [Female] 3 [Hispanic] 4.01 0
## 4 1 [Female] 4 [Other POC] 1.12 4.58e-17
## 5 2 [Male] 1 [White] 0.813 0
## 6 2 [Male] 2 [Black] 1.98 0
## 7 2 [Male] 3 [Hispanic] 4.01 0
## 8 2 [Male] 4 [Other POC] 1.12 4.21e-17
Now we see the weight size for each of the 8 groups that we weighted our sample data on. Obviously, with more demographic variables included in the weighting scheme this list would get much more cumbersome, but for pedagogical purposes it is important to look at these values to understand their meaning.
For Hispanic (race_4 = 3) females (sex = 1), the weight equals 4.01. This means that Hispanic females are under-represented in the sample data, since the value is over 1. Fundamentally, what this means is that each Hispanic female in the sample data is “speaking” for 4.01 Hispanic females from the target population. Compare this value to a white female, whose weight equals .813. This value, being under 1, means that the sample has oversampled white females, so each one in the sample speaks for roughly .813 white females in the target population.
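In this two-variable example the weights vary only by race_4, so each one can be reproduced by hand as the target proportion divided by the unweighted sample proportion. The sketch below does that arithmetic with the counts from the frequency table above; it is an illustration of where the numbers come from, not a general replacement for raking.

```r
n <- 951
race_counts  <- c(White = 835, Black = 24, Hispanic = 42, Other_POC = 50)
race_targets <- c(0.714, 0.050, 0.177, 0.059)

# Weight = target share / unweighted sample share
race_weights <- race_targets / (race_counts / n)
round(race_weights, 3)
```

These ratios line up with the 0.813, 1.98, 4.01, and 1.12 shown in the group table.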
We should also look at the weighted demographic values to ensure the weights worked as we hope they do - i.e. that the weighted sample demographic values match the target population values. If the weights worked as they should, the weighted sample values for the two variables in the weighting scheme should match the population values we specified.
sample %>%
  as_survey(weights = c(weight)) %>%
group_by(race_4) %>% #Update variable in ()
summarise(n = survey_total()) %>%
mutate(weighted_sample = n /sum(n))
## # A tibble: 4 × 4
## race_4 n n_se weighted_sample
## <dbl+lbl> <dbl> <dbl> <dbl>
## 1 1 [White] 679. 8.21 0.714
## 2 2 [Black] 47.6 9.59 0.05
## 3 3 [Hispanic] 168. 25.4 0.177
## 4 4 [Other POC] 56.1 7.73 0.059
#Saves the weighted & unweighted size of the race_4 demographics
ag_w <- sample %>%
  as_survey(weights = c(weight)) %>%
  group_by(race_4) %>% #Update variable in ()
  summarise(n = survey_total()) %>%
  mutate(weighted_sample = n /sum(n))
ag_uw <- sample %>%
  group_by(race_4) %>% #Update variable in ()
  summarise(n = n()) %>%
  mutate(unweighted_sample = n /sum(n))
ag_combo <- left_join(ag_w, ag_uw, by = "race_4", suffix = c("", "_pop")) %>% #Update variable in by =""
  group_by(race_4) #Update variable in ()
ag_combo$ag_diff_per <- 100*(ag_combo$weighted_sample-ag_combo$unweighted_sample)
ag_combo
## # A tibble: 4 × 7
## # Groups: race_4 [4]
## race_4 n n_se weighted_sample n_pop unweighted_sample ag_diff_per
## <dbl+lbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
## 1 1 [White] 679. 8.21 0.714 835 0.878 -16.4
## 2 2 [Black] 47.6 9.59 0.05 24 0.0252 2.48
## 3 3 [Hispanic] 168. 25.4 0.177 42 0.0442 13.3
## 4 4 [Other POC] 56.1 7.73 0.059 50 0.0526 0.642
print(ag_combo$weighted_sample)
## [1] 0.714 0.050 0.177 0.059
print(targets)
## $sex
## [1] 0.516 0.484
##
## $race_4
## [1] 0.714 0.050 0.177 0.059
Here, we see that the weighted sample values for the race_4 variable match the specified population values exactly. This is what we want to see. Sometimes, with a lot of variables in the weighting scheme, these will not match exactly, but you want them to be extremely close no matter the number of variables. Now, let’s do the same thing for the sex variable.
#Saves the weighted & unweighted size of the sex demographics
sex_w <- sample %>%
  as_survey(weights = c(weight)) %>%
  group_by(sex) %>% #Update variable in ()
  summarise(n = survey_total()) %>%
  mutate(weighted_sample = n /sum(n))
sex_uw <- sample %>%
  group_by(sex) %>% #Update variable in ()
  summarise(n = n()) %>%
  mutate(unweighted_sample = n /sum(n))
sex_combo <- left_join(sex_w, sex_uw, by = "sex", suffix = c("", "_pop")) %>% #Update variable in by =""
  group_by(sex) #Update variable in ()
sex_combo <- sex_combo %>%
  mutate(sex_diff_per = 100 * (weighted_sample - unweighted_sample))
sex_combo
## # A tibble: 2 × 7
## # Groups: sex [2]
## sex n n_se weighted_sample n_pop unweighted_sample sex_diff_per
## <dbl+lbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
## 1 1 [Female] 502. 22.4 0.528 489 0.514 1.41
## 2 2 [Male] 449. 20.2 0.472 462 0.486 -1.41
print(sex_combo$weighted_sample)
## [1] 0.5283409 0.4716591
print(targets)
## $sex
## [1] 0.516 0.484
##
## $race_4
## [1] 0.714 0.050 0.177 0.059
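For the sex variable, the weighted sample values (52.8% female, 47.2% male) are close to, but do not exactly match, the population values (51.6% and 48.4%). The weights by group shown earlier vary only by race_4, which suggests that sex was left out of the raking because its unweighted values were already within the pctlim threshold of the target. The remaining gap is small, so we can proceed with our analysis of the weighted data.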
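The weighted-versus-target comparison carried out for race_4 and sex can also be wrapped in a small base R function so the same check is easy to repeat for any weighting variable. The function compare_props() and the toy data are hypothetical, and the sketch assumes the variable has no missing values.

```r
# compare_props() is a hypothetical helper, not part of srvyr or anesrake
compare_props <- function(x, w, target) {
  unweighted <- as.vector(table(x)) / length(x)
  weighted   <- as.vector(tapply(w, x, sum)) / sum(w)
  data.frame(
    category   = sort(unique(x)),
    unweighted = unweighted,
    weighted   = weighted,
    target     = target,
    gap        = weighted - target  # should be near 0 if the weights worked
  )
}

# Toy data (hypothetical): category 1 is over-represented in the sample
x <- c(1, 1, 1, 1, 2)
w <- c(0.625, 0.625, 0.625, 0.625, 2.5)
res <- compare_props(x, w, target = c(0.5, 0.5))
res
```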
Next, let’s examine what impact these weights have on our sample values. Using tidyverse language, we can calculate the weighted and unweighted means for survey variables and then compare the differences between the two. For any ordinal, continuous, or categorical variable, you can compare differences between the two samples.
Before looking at the gubernatorial results, we look at the weighted means for a dichotomous variable - fracking2 - which equals 1 when the respondent supports banning fracking in some situations and 0 when the respondent does not support banning fracking in those situations.
In the following code chunk, you will see two means calculations. The first one is simply the unweighted mean for the fracking2 variable, whereas the second is the weighted means version. The code for the weighted means is slightly more complex, as you must specify as_survey(weights=weight) to include the newly created weights in the analysis and change mean to survey_mean in the final line of the code to indicate that you are running a weighted mean using the srvyr package. We then bind the two dataframes together and calculate the difference in means between the two.
fracking_uw <- sample %>% #Looks at the unweighted support for fracking in CO
  summarise(unweight_support = mean(fracking2, na.rm = T))
fracking_w <- sample %>% #Looks at the weighted support for fracking in CO
  as_survey(weights = c(weight)) %>%
  summarise(weight_support = survey_mean(fracking2, na.rm = T))
fracking_combo <- cbind(fracking_uw, fracking_w)
fracking_combo <- mutate(fracking_combo, difference = weight_support - unweight_support)
fracking_combo
## unweight_support weight_support weight_support_se difference
## 1 0.5719027 0.5678267 0.01974743 -0.004075958
When we compare support for the fracking ban between the weighted and unweighted samples, we see virtually no difference between the two estimates of how supportive Coloradoans are of banning fracking. Why is this? This occurs sometimes when the weights that are applied simply do not change the sample composition enough to influence the overall sample mean.
Now, let’s look at the gubernatorial results. This is a slightly more complex approach than above since we have 4 choices rather than the mean of a variable.
#Gubernatorial Vote Choice - Weighted & Unweighted
gov_uw <- sample %>% #Save new dataframe for later manipulation
  group_by(gov_choice) %>% #Group the outcome by which candidate respondent supports
  filter(!is.na(gov_choice)) %>% #Removes any missing data from the sample
  summarise(n = n()) %>% #Counts the number of respondents who support each candidate
  mutate(unweight_support = n /sum(n)) #Calculate percentage of respondents who support each candidate
gov_w <- sample %>% #Save new dataframe for later manipulation
  as_survey(weights = c(weight)) %>% #Apply the survey weight
  group_by(gov_choice) %>% #Group the outcome by which candidate respondent supports
  filter(!is.na(gov_choice)) %>% #Removes any missing data from the sample
  summarise(n = survey_total()) %>% #Counts the weighted number of respondents who support each candidate
  mutate(weight_support = n /sum(n)) #Calculate weighted percentage of respondents who support each candidate
gov_combo <- cbind(gov_uw, gov_w) #Combine the two files
gov_combo$diff <- gov_combo$weight_support - gov_combo$unweight_support
gov_combo
## gov_choice n unweight_support gov_choice n n_se weight_support
## 1 1 434 0.45636172 1 456.49080 22.542174 0.48001136
## 2 2 470 0.49421661 2 443.13338 19.102835 0.46596570
## 3 3 35 0.03680336 3 39.83149 8.497982 0.04188380
## 4 4 12 0.01261830 4 11.54432 3.502599 0.01213914
## diff
## 1 0.0236496356
## 2 -0.0282509112
## 3 0.0050804304
## 4 -0.0004791548
With the gubernatorial results, the unweighted sample shows candidate 2, Stapleton, leading candidate 1, Polis, by 49.4% to 45.6%. Polis won this election with 53.4% of the vote to Stapleton’s 42.8%, a margin of 10.6 percentage points. Obviously, the unweighted sample is biased in favor of Stapleton. When we look at the weighted sample, we see Polis slightly ahead with 48.0% of the vote compared to Stapleton’s 46.6%. While this estimate is closer to the true margin, it is still far off. What happened? We are only including two variables in our weighting scheme when we should be including as many as we have information available for, provided the sample is large enough to support that many.
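The margin comparison can be laid out as simple arithmetic. The sketch below uses the support values from the output above and the election result quoted in the text, and expresses each margin as Polis minus Stapleton in percentage points.

```r
# Polis minus Stapleton, in percentage points
actual     <- 53.4 - 42.8
unweighted <- 100 * (0.45636172 - 0.49421661)
weighted   <- 100 * (0.48001136 - 0.46596570)

margins <- round(c(actual = actual, unweighted = unweighted, weighted = weighted), 1)
margins
```

The two-variable weight moves the estimated margin from about -3.8 to about +1.4 points, still well short of the actual 10.6.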
We have 7 total variables to use in the weighting scheme and above we only used two. Let’s add two additional variables to the weighting scheme to see how that is influential. In your in-class activity, you will add the remaining 3 variables to see what impact that has on the estimates.
Typically, when creating survey weights you will include more than just 2 demographic variables in your weighting scheme. Here, we use 4 variables to create a new weight: sex, race/ethnicity, education, and marital status.
We will follow the same procedure here as above just simply with more variables included. Once again, it is important to ensure that the target population values match the name of its associated variable in the sample data.
First, we save the target population values and then prepare the data to be used in the weighting.
#Save new vectors with target population values for weights
sex <- c(.516, .484) #Target values for females and males; label order (female, male)
sum(sex)
## [1] 1
race_4 <- c(0.714, 0.050, 0.177, 0.059) #Target values race/ethnic identities - white, black, Hispanic, all others
sum(race_4)
## [1] 1
educ <- c(0.211, 0.226, 0.086, 0.304, 0.173) #Target values education - HS or less, Some college, AA, BA, Graduate degree
sum(educ)
## [1] 1
marstat <- c(0.549, 0.015, 0.099, 0.039, 0.258, 0.040) #Target values marital status - Married, Separated, Divorced, Widowed, Single, Domestic
sum(marstat)
## [1] 1
#Combine the demographic vectors into a list
targets <- list(sex, race_4, educ, marstat)
# remember, these names will have to match the column names & order in the sample data
names(targets) <- c("sex", "race_4", "educ", "marstat")
Then we use the anesrake package to create the survey weights to make the sample values match the target population values.
set.seed(1946)
myweights <- anesrake(targets, sample,
                      caseid = sample$caseid, cap = 8, type = "pctlim", pctlim=.05)
## [1] "Raking converged in 19 iterations"
sample$full_weight <- unlist(myweights[1])
summary(sample$full_weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4452 0.6299 0.7396 1.0000 1.1688 7.9985
Let’s look at how well the weights worked to match the sample data to the target population values for the educ and marstat measures.
For education, respondents with a high school degree or less were under-represented in the initial unweighted sample, which is a typical bias in online samples. By weighting on the known population values for education, the sample better reflects the overall percentage of people with lower educational attainment.
#Shows the weighted size of the educ demographics
##We can also bind the two values together to calculate their differences
e_w <- sample %>%
  as_survey(weights = c(full_weight)) %>%
  group_by(educ) %>%
  summarise(n = survey_total()) %>%
  mutate(weighted_sample = n /sum(n))
e_uw <- sample %>%
  group_by(educ) %>%
  summarise(n = n()) %>%
  mutate(unweighted_sample = n /sum(n))
e_combo <- left_join(e_w, e_uw, by = "educ") %>%
  group_by(educ)
e_combo$diff <- e_combo$weighted_sample - e_combo$unweighted_sample
e_combo
## # A tibble: 5 × 7
## # Groups: educ [5]
## educ n.x n_se weighted_sample n.y unweighted_sample diff
## <dbl+lbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
## 1 1 [HS grad or le… 201. 23.8 0.211 120 0.126 0.0848
## 2 2 [Some college] 215. 16.1 0.226 223 0.234 -0.00849
## 3 3 [AA degree] 81.8 8.71 0.086 111 0.117 -0.0307
## 4 4 [BA] 289. 19.6 0.304 282 0.297 0.00747
## 5 5 [Post-grad] 165. 11.6 0.173 215 0.226 -0.0531
For education, the weight had a dramatic influence on the share of respondents with a high school education or less, increasing it from 12.6% to 21.1% of the sample, which matches the population value.
Next, we review the impact of weighting by marital status on sample composition.
#Shows the weighted size of the marstat demographics
##We can also bind the two values together to calculate their differences
marstat_w <- sample %>%
  as_survey(weights = c(full_weight)) %>%
  group_by(marstat) %>%
  summarise(n = survey_total()) %>%
  mutate(weighted_sample = n /sum(n))
marstat_uw <- sample %>%
  group_by(marstat) %>%
  summarise(n = n()) %>%
  mutate(unweighted_sample = n /sum(n))
marstat_combo <- left_join(marstat_w, marstat_uw, by = "marstat") %>%
  group_by(marstat)
marstat_combo$ed_diff_per <- 100*(marstat_combo$weighted_sample-marstat_combo$unweighted_sample)
marstat_combo
## # A tibble: 6 × 7
## # Groups: marstat [6]
## marstat n.x n_se weighted_sample n.y unweighted_sample ed_diff_per
## <dbl+lbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
## 1 1 [Married] 522. 20.2 0.549 592 0.623 -7.35
## 2 2 [Separated] 14.3 4.00 0.015 15 0.0158 -0.0773
## 3 3 [Divorced] 94.1 8.97 0.099 129 0.136 -3.66
## 4 4 [Widowed] 37.1 4.88 0.039 59 0.0620 -2.30
## 5 5 [Never marr… 245. 26.0 0.258 128 0.135 12.3
## 6 6 [Domestic /… 38.0 8.22 0.04 28 0.0294 1.06
Examining the impact of weighting by marital status, we see that the 5th option - single, never married - has a large difference between the two samples. This indicates that single, never married respondents are underrepresented in the unweighted sample compared to their true population value.
Now, we look at the differences in gubernatorial support by marital status to see how impactful this weighting scheme might be.
CrossTable(sample$marstat, sample$gov_choice, prop.c = FALSE, prop.r = TRUE, prop.chisq = FALSE, prop.t = FALSE)
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## |-------------------------|
##
## =======================================================
## sample$gov_choice
## sample$marstat 1 2 3 4 Total
## -------------------------------------------------------
## 1 240 320 23 9 592
## 0.405 0.541 0.039 0.015 0.623
## -------------------------------------------------------
## 2 8 6 1 0 15
## 0.533 0.400 0.067 0.000 0.016
## -------------------------------------------------------
## 3 69 54 5 1 129
## 0.535 0.419 0.039 0.008 0.136
## -------------------------------------------------------
## 4 20 37 2 0 59
## 0.339 0.627 0.034 0.000 0.062
## -------------------------------------------------------
## 5 80 45 2 1 128
## 0.625 0.352 0.016 0.008 0.135
## -------------------------------------------------------
## 6 17 8 2 1 28
## 0.607 0.286 0.071 0.036 0.029
## -------------------------------------------------------
## Total 434 470 35 12 951
## =======================================================
Here, we see a large difference for marital status 5 - single, never married - in support for Jared Polis. This group largely supported Polis over Stapleton, so by increasing the impact of this group with the weighting scheme, the overall weighted vote estimate for Polis should increase as well.
#Gubernatorial Vote Choice - Weighted & Unweighted
gov_w <- sample %>%
  as_survey(weights = c(full_weight)) %>%
  filter(!is.na(gov_choice)) %>%
  group_by(gov_choice) %>%
  summarise(n = survey_total()) %>%
  mutate(weight_support = n /sum(n))
gov_uw <- sample %>%
  group_by(gov_choice) %>%
  filter(!is.na(gov_choice)) %>%
  summarise(n = n()) %>%
  mutate(unweight_support = n /sum(n))
gov_combo <- cbind(gov_uw, gov_w)
gov_combo$diff <- gov_combo$weight_support - gov_combo$unweight_support
gov_outcome <- cbind(gov_combo$gov_choice, gov_combo$weight_support, gov_combo$unweight_support, gov_combo$diff)
colnames(gov_outcome) <- c("candidate", "weighted support", "unweighted support", "diff")
gov_outcome
## candidate weighted support unweighted support diff
## [1,] 1 0.50436596 0.45636172 0.0480042392
## [2,] 2 0.44871485 0.49421661 -0.0455017618
## [3,] 3 0.03603287 0.03680336 -0.0007704949
## [4,] 4 0.01088631 0.01261830 -0.0017319826
Once we apply our new weights with the additional two weighting variables, we can see how much closer to the true election result we are. With this weighting scheme, candidate 1 - Polis - is estimated to have roughly 50.4% of the vote (true support was 53.4%) while Stapleton is estimated to have roughly 44.9% of the vote (true support was 42.8%). With this weighting scheme incorporating more population variables, the estimate becomes much closer to the true population. However, it is still off. Your job in the class activity is to improve upon this prediction.
This is an important lesson for the application of the survey weights. The target population values that you weight your survey sample data to match can have profound implications on the conclusions you and others draw from your survey results. In the case, the decision to weight the survey to give more voice to the Republican members of the sample influenced the conclusions drawn about support for various policies being debated in the public realm. This makes it critically important to make sure that the target population values that are chosen are as accurate as possible and publicly defensible.
Overall, this tutorial has taken you through how to calculate survey weights using the anesrake package. Using a sample political poll, you hopefully learned how to create target demographic population vectors, which are then matched to the sample demographic values. Following this, you learned how to calculate survey weights directly, evaluate the success or failure of the survey weighting process, and compare the impact of using the survey weight on the conclusions drawn from the results.
End of tutorial
sessionInfo()
## R version 4.2.2 (2022-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19045)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## attached base packages:
## [1] grid stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] anesrake_0.80 weights_1.0.4 Hmisc_5.1-0 srvyr_1.2.0
## [5] survey_4.2-1 survival_3.5-5 Matrix_1.5-4.1 descr_1.1.7
## [9] labelled_2.11.0 lubridate_1.9.2 forcats_1.0.0 stringr_1.5.0
## [13] dplyr_1.1.2 purrr_1.0.1 readr_2.1.4 tidyr_1.3.0
## [17] tibble_3.2.1 ggplot2_3.4.4 tidyverse_2.0.0 haven_2.5.2
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.6 jsonlite_1.8.4 splines_4.2.2 gtools_3.9.4
## [5] bslib_0.4.2 Formula_1.2-5 highr_0.10 yaml_2.3.7
## [9] pillar_1.9.0 backports_1.4.1 lattice_0.21-8 glue_1.6.2
## [13] digest_0.6.31 checkmate_2.2.0 minqa_1.2.5 colorspace_2.1-0
## [17] htmltools_0.5.5 pkgconfig_2.0.3 broom_1.0.5 xtable_1.8-4
## [21] scales_1.2.1 gdata_2.19.0 tzdb_0.4.0 lme4_1.1-33
## [25] timechange_0.2.0 htmlTable_2.4.1 generics_0.1.3 cachem_1.0.8
## [29] withr_2.5.0 nnet_7.3-19 cli_3.4.1 crayon_1.5.2
## [33] magrittr_2.0.3 evaluate_0.21 mice_3.15.0 fansi_1.0.4
## [37] nlme_3.1-162 MASS_7.3-60 foreign_0.8-84 tools_4.2.2
## [41] data.table_1.14.8 hms_1.1.3 mitools_2.4 lifecycle_1.0.3
## [45] munsell_0.5.0 cluster_2.1.4 compiler_4.2.2 jquerylib_0.1.4
## [49] rlang_1.1.1 nloptr_2.0.3 rstudioapi_0.15.0 htmlwidgets_1.6.2
## [53] base64enc_0.1-3 rmarkdown_2.22 boot_1.3-28.1 gtable_0.3.3
## [57] curl_5.0.0 DBI_1.1.3 R6_2.5.1 gridExtra_2.3
## [61] knitr_1.43 fastmap_1.1.1 utf8_1.2.3 stringi_1.7.12
## [65] Rcpp_1.0.10 vctrs_0.6.2 rpart_4.1.19 tidyselect_1.2.0
## [69] xfun_0.39