suppressPackageStartupMessages(library(survey))
suppressPackageStartupMessages(library(tidyverse))Warning: package 'ggplot2' was built under R version 4.5.2
data <- readRDS("cleaned_NYU_data.rds")
options(survey.lonely.psu = "adjust")Intro to the Surveys Package in R
suppressPackageStartupMessages(library(survey))
suppressPackageStartupMessages(library(tidyverse))Warning: package 'ggplot2' was built under R version 4.5.2
data <- readRDS("cleaned_NYU_data.rds")
options(survey.lonely.psu = "adjust")We will work with a binary variable, TCHN1_2 which is coded as 1 for “Yes” and 0 for “No.” Because of this coding, the mean of the variable can be interpreted directly as the proportion of respondents who answered “Yes.” The dataset also includes a survey weight, TMETAwts1, which adjusts the sample to better represent the target population.
We begin with the most familiar approach which is computing simple, unweighted means. Here we compute the unweighted proportion answering “Yes” for the full set of items TCHN1_2 through TCHN1_6.
vars <- paste0("TCHN1_", 2:6)
unweighted_yes <- sapply(data[vars], mean, na.rm = TRUE)
unweighted_yes TCHN1_2 TCHN1_3 TCHN1_4 TCHN1_5 TCHN1_6
0.5172070 0.1281796 0.4518703 0.1286783 0.1406484
Because each variable is coded as 1 for “Yes” and 0 for “No,” each unweighted mean is the unweighted proportion of “Yes” responses. This calculation implicitly treats every respondent as equally representative of the population. In other words, it assumes the data were collected using a simple random sample with no weighting or design features. While these estimates are easy to compute and often useful for quick exploration, they are generally inappropriate for inference when survey weights are present.
If you prefer a cleaner display, you can convert the result into a small table and simply calculate the percent No:
unweighted_tbl <- data.frame(item = names(unweighted_yes),
pct_yes = round(100 * as.numeric(unweighted_yes),1),
pct_no = round(100 * (1 - as.numeric(unweighted_yes)),1))
unweighted_tbl item pct_yes pct_no
1 TCHN1_2 51.7 48.3
2 TCHN1_3 12.8 87.2
3 TCHN1_4 45.2 54.8
4 TCHN1_5 12.9 87.1
5 TCHN1_6 14.1 85.9
Our next step is to incorporate survey weights directly using base R’s weighted.mean() function.
vars <- paste0("TCHN1_", 2:6)
weighted_yes <- sapply(data[vars],
function(x) weighted.mean(x, w = data$TMETAwts1, na.rm = TRUE))
weighted_yes TCHN1_2 TCHN1_3 TCHN1_4 TCHN1_5 TCHN1_6
0.5096717 0.1297673 0.4260652 0.1328597 0.1361525
This improves on the unweighted mean by accounting for unequal representation in the sample. Respondents with larger weights contribute more to the estimate, while those with smaller weights contribute less. However, this approach still falls short for formal inference. Although the point estimates are often reasonable, weighted.mean() does not produce design based standard errors and does not account for other aspects of the survey design. As a result, it should be viewed as a descriptive or exploratory tool rather than a complete inferential solution. Once again, if you prefer a cleaner display, you can convert the result into a small table:
weighted_tbl <- data.frame(
item = names(weighted_yes),
pct_yes = round(100 * as.numeric(weighted_yes),1),
pct_no = round(100 * (1 - as.numeric(weighted_yes)),1)
)
weighted_tbl item pct_yes pct_no
1 TCHN1_2 51.0 49.0
2 TCHN1_3 13.0 87.0
3 TCHN1_4 42.6 57.4
4 TCHN1_5 13.3 86.7
5 TCHN1_6 13.6 86.4
One thing worth noting is that, in this example, the unweighted and weighted means are very close to one another. This is not something we should expect in general, nor does it imply that weighting is unnecessary. Instead, this similarity reflects strong quota design and effective sample balancing during data collection. When a survey is fielded with well calibrated quotas that closely match the target population on key dimensions, unweighted estimates may already be quite close to their weighted counterparts. The weighting process in this case is doing relatively little correction because there is relatively little imbalance to correct. It is important to emphasize that this is a contingent outcome of good survey design, not a general rule. In many real world settings, especially when quotas are loose, response rates vary sharply across groups, or post-stratification variables are weakly controlled, unweighted and weighted estimates can differ substantially. The purpose of this comparison is not to suggest that weights can be ignored, but to show that good design upstream can reduce the amount of adjustment required downstream.
Surveys PackageThe survey package provides tools specifically designed for analyzing survey data. The first step is to declare a survey design object, which tells R how the data were collected and how weights should be used.
des <- svydesign(
ids = ~1, #~1 tells R there is no clustering variable (treat each case as its own)
weights = ~TMETAwts1, #survey weight that adjusts the sample to the target population
data= data #data frame containing the survey variables
)In this example, ids = ~1 indicates that we are not explicitly modeling clustering. The weight variable is supplied via the weights argument. Once this design object is created, we can compute design based estimates.
svymean(~TCHN1_2, des, na.rm = TRUE) mean SE
TCHN1_2 0.50967 0.0151
This output provides the weighted proportion of respondents answering “Yes,” along with a standard error that correctly reflects the survey design. This is the preferred approach when reporting results from survey data. In some contexts, it is useful to display both “Yes” and “No” explicitly rather than inferring one from the other. This can be done by temporarily treating the variable as a factor within the call to svymean().
svymean(
~factor(TCHN1_2, levels = c(1, 0), labels = c("Yes", "No")),
des,
na.rm = TRUE
) mean SE
factor(TCHN1_2, levels = c(1, 0), labels = c("Yes", "No"))Yes 0.50967 0.0151
factor(TCHN1_2, levels = c(1, 0), labels = c("Yes", "No"))No 0.49033 0.0151
The output shows the weighted proportion for each response category. The two values sum to one (up to rounding), and each has an associated design based standard error. Although all three methods aim to summarize the same underlying variable, they answer subtly different questions. The unweighted mean ignores the survey design entirely and is rarely appropriate for inference. The base R weighted mean incorporates weights but does not provide correct uncertainty estimates. The survey package approach accounts for both weights and design, making it the best choice for reporting and inference.
Once you move beyond a single overall estimate, subgroup analysis is where the consequences of weighting usually become most visible. Differences between unweighted and weighted estimates that appear small in the aggregate can widen meaningfully within subgroups such as party identification, gender, or age. For that reason, subgroup comparisons are not optional in survey analysis, they are often the primary diagnostic for understanding why weights matter.
In practice, it is helpful to examine subgroup estimates three ways: unweighted (as a baseline), weighted using base R (as a descriptive adjustment), and design-based using the survey package (as the inferentially correct approach). Looking at all three side by side helps clarify how much the weighting scheme is actually influencing conclusions.
In the code below, we first create a labeled party variable purely for readability.
party <- factor(
data$PTPID3Basic,
levels = c(1, 2, 3),
labels = c("Democrat", "Independent", "Republican")
)This is the simplest approach and treats the sample as if it were a simple random sample within each group.
tapply(data$TCHN1_2, party, mean, na.rm = TRUE) Democrat Independent Republican
0.5103245 0.4847973 0.5496599
This incorporates survey weights but still does not produce design-based standard errors.
tapply(seq_len(nrow(data)), party, function(idx) {
weighted.mean(data$TCHN1_2[idx], w = data$TMETAwts1[idx], na.rm = TRUE)
}) Democrat Independent Republican
0.4923955 0.4662629 0.5724767
This is the standard approach for reporting subgroup differences in survey data, because it respects the survey design and provides valid standard errors.
#Add the party factor into the design object
#(This does not modify the original data frame.)
des2 <- update(des, party = party)
svyby(
~TCHN1_2,
~party,
des2,
svymean,
na.rm = TRUE
) party TCHN1_2 se
Democrat Democrat 0.4923955 0.02625306
Independent Independent 0.4662629 0.02711003
Republican Republican 0.5724767 0.02422314
In many high-quality surveys, the unweighted and weighted subgroup estimates may still be fairly close, especially when quotas and field controls are strong. Even in those cases, the survey package approach remains important because it provides the correct uncertainty estimates and protects you against cases where design features matter more strongly.
Age is another dimension where weighting differences often become especially clear. Because age is frequently used in quota sampling and weighting schemes, unweighted estimates can over or under-represent certain age groups in ways that meaningfully affect conclusions. Looking at unweighted and weighted estimates side by side helps make this visible.
In the example below, we examine the share answering “Yes” to TCHN1_2 across four age groups. We compute the unweighted subgroup means first, then the design-based weighted means using the survey package. Both steps are shown together so the contrast is immediate.
age_grp <- factor(
data$PTAge4,
levels = c(1, 2, 3, 4),
labels = c("Age 18–34", "Age 35–44", "Age 45–64", "Age 65+")
)
#Unweighted subgroup means
unweighted_age <- tapply(
data$TCHN1_2,
age_grp,
mean,
na.rm = TRUE
)
#Design-based weighted subgroup means - adds age into the design
des_age <- update(des, age_grp = age_grp)
weighted_age <- svyby(
~TCHN1_2,
~age_grp,
des_age,
svymean,
na.rm = TRUE
)
#Combine unweighted and weighted results into one table
age_tbl <- data.frame(
Age_Group = names(unweighted_age),
Unweighted_Pct_Yes = round(100 * as.numeric(unweighted_age),1),
Weighted_Pct_Yes = round(100 * as.numeric(coef(weighted_age)),1),
Weighted_SE = round(100 * as.numeric(SE(weighted_age)),1)
)
age_tbl Age_Group Unweighted_Pct_Yes Weighted_Pct_Yes Weighted_SE
1 Age 18–34 71.0 71.8 2.6
2 Age 35–44 64.5 59.5 3.3
3 Age 45–64 41.9 39.8 2.5
4 Age 65+ 26.0 25.5 2.8
In many surveys, the overall unweighted and weighted estimates may look fairly similar, giving the impression that weights do not matter much. This example shows why that impression can be misleading. Within age groups, differences between unweighted and weighted estimates can be larger, and those differences can change how patterns are interpreted across the population.
In this case, the largest adjustment appears among respondents ages 35–44, where the weighted estimate of generative AI use is roughly five percentage points lower than the unweighted estimate. Smaller downward adjustments are also visible for adults ages 45–64. By contrast, estimates for adults ages 18–34 increase slightly after weighting, while estimates for those 65 and older change very little. Together, these patterns suggest that the unweighted sample somewhat overrepresents generative AI users in the middle age groups, and that weighting primarily corrects those imbalances rather than altering the overall age gradient.
Age-based subgroup analysis is often where we first see the practical value of survey weights. It reinforces the idea that weights are not merely a technical detail, but a substantive part of how survey estimates are constructed and interpreted.
So far, we have focused on means and subgroup comparisons. Another common step is to estimate relationships using regression models. Regression is often where analysts are most tempted to “just add weights” and move on. This section shows why it is important to distinguish between an unweighted regression, a regression that simply includes weights, and a design-based regression using the survey package.
We will model the probability of answering “Yes” to TCHN1_2 as a function of three predictors:
education (TCHEduUS23), an ordinal variable capturing highest level of education
age group (PTAge4)
party identification (PTPID3Basic)
The outcome TCHN1_2 is binary (0/1), so we use a logistic regression framework.
For readability, we first convert the categorical predictors into factors with meaningful labels. This step does not change the underlying data, it just makes model output easier to interpret.
edu <- factor(
data$TCHEduUS23,
levels = 1:6,
labels = c(
"Less than HS",
"High School",
"Some College",
"Associate / Trade",
"Bachelor's",
"Postgraduate"
)
)
age_grp <- factor(
data$PTAge4,
levels = c(1, 2, 3, 4),
labels = c("Age 18–34", "Age 35–44", "Age 45–64", "Age 65+")
)
party <- factor(
data$PTPID3Basic,
levels = c(1, 2, 3),
labels = c("Democrat", "Independent", "Republican")
)We begin with a standard logistic regression that ignores the survey design entirely.
unweighted_glm <- glm(
TCHN1_2 ~ edu + age_grp + party,
data = data,
family = binomial()
)
summary(unweighted_glm)
Call:
glm(formula = TCHN1_2 ~ edu + age_grp + party, family = binomial(),
data = data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.008439 0.302829 -0.028 0.977769
eduHigh School 0.553728 0.293947 1.884 0.059596 .
eduSome College 0.884756 0.300616 2.943 0.003249 **
eduAssociate / Trade 1.309833 0.316570 4.138 3.51e-05 ***
eduBachelor's 1.824190 0.314673 5.797 6.75e-09 ***
eduPostgraduate 2.318078 0.326353 7.103 1.22e-12 ***
age_grpAge 35–44 -0.539771 0.145004 -3.722 0.000197 ***
age_grpAge 45–64 -1.430053 0.132698 -10.777 < 2e-16 ***
age_grpAge 65+ -2.401490 0.162986 -14.734 < 2e-16 ***
partyIndependent -0.157067 0.124812 -1.258 0.208236
partyRepublican 0.173536 0.119752 1.449 0.147303
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2777.1 on 2004 degrees of freedom
Residual deviance: 2359.8 on 1994 degrees of freedom
AIC: 2381.8
Number of Fisher Scoring iterations: 4
This model treats the data as if they were drawn from a simple random sample. The coefficients describe associations within the sample, but the standard errors and statistical significance do not reflect the survey design or weighting.
To obtain population-representative estimates and valid uncertainty, we instead fit the model using svyglm().
#Add predictors into the survey design object
des_reg <- update(
des,
edu = edu,
age_grp = age_grp,
party = party
)
weighted_glm <- svyglm(
TCHN1_2 ~ edu + age_grp + party,
design = des_reg,
family = quasibinomial()
)
summary(weighted_glm)
Call:
svyglm(formula = TCHN1_2 ~ edu + age_grp + party, design = des_reg,
family = quasibinomial())
Survey design:
update(des, edu = edu, age_grp = age_grp, party = party)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.38881 0.39659 -0.980 0.327015
eduHigh School 0.86483 0.38899 2.223 0.026309 *
eduSome College 1.32991 0.39538 3.364 0.000784 ***
eduAssociate / Trade 1.69177 0.41108 4.115 4.02e-05 ***
eduBachelor's 2.18008 0.40728 5.353 9.66e-08 ***
eduPostgraduate 2.67948 0.41593 6.442 1.47e-10 ***
age_grpAge 35–44 -0.82447 0.19042 -4.330 1.57e-05 ***
age_grpAge 45–64 -1.69699 0.17447 -9.726 < 2e-16 ***
age_grpAge 65+ -2.66540 0.20869 -12.772 < 2e-16 ***
partyIndependent -0.06433 0.16726 -0.385 0.700580
partyRepublican 0.41709 0.16100 2.591 0.009652 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for quasibinomial family taken to be 0.9883782)
Number of Fisher Scoring iterations: 4
This model estimates the same conceptual relationship as the unweighted regression, but it does so using survey weighted estimating equations. The coefficients now reflect population level associations, and the standard errors properly account for the survey design. In many applications, the direction of effects will be similar across the two models. For example, higher education levels may be associated with a greater likelihood of answering “Yes,” and older age groups may be associated with lower usage. However, the magnitude of coefficients and their statistical significance can change once survey weights are applied. This happens because weighting changes the relative influence of observations in the regression. Groups that are overrepresented in the raw sample exert less influence in the weighted model, while underrepresented groups exert more. As a result, regression estimates can shift in meaningful ways even when overall descriptive patterns appear stable.
To make the differences between the two models easier to see, it’s helpful to place their coefficients side by side in a single table. This allows us to focus on how point estimates and uncertainty change once survey weights and design are taken into account.
In the code below, we extract coefficients and standard errors from both the unweighted and survey-weighted models and combine them into a single comparison table. This table is not meant to support formal hypothesis testing, instead, it is a diagnostic and teaching tool that highlights how weighting can alter regression results.
#extract coefs and ses from the unweighted model
unw_coef <- summary(unweighted_glm)$coefficients
unweighted_tbl <- data.frame(
term = rownames(unw_coef),
estimate_unwgt = round(unw_coef[, "Estimate"],3),
se_unwgt = round(unw_coef[, "Std. Error"],3),
row.names = NULL
)
#extract coefs and ses from the weighted model
w_coef <- summary(weighted_glm)$coefficients
weighted_tbl <- data.frame(
term = rownames(w_coef),
estimate_wgt = round(w_coef[, "Estimate"],3),
se_wgt = round(w_coef[, "Std. Error"],3),
row.names = NULL
)
#combine them into comparison table
coef_compare <- merge(
unweighted_tbl,
weighted_tbl,
by = "term",
all = TRUE
)
coef_compare term estimate_unwgt se_unwgt estimate_wgt se_wgt
1 (Intercept) -0.008 0.303 -0.389 0.397
2 age_grpAge 35–44 -0.540 0.145 -0.824 0.190
3 age_grpAge 45–64 -1.430 0.133 -1.697 0.174
4 age_grpAge 65+ -2.401 0.163 -2.665 0.209
5 eduAssociate / Trade 1.310 0.317 1.692 0.411
6 eduBachelor's 1.824 0.315 2.180 0.407
7 eduHigh School 0.554 0.294 0.865 0.389
8 eduPostgraduate 2.318 0.326 2.679 0.416
9 eduSome College 0.885 0.301 1.330 0.395
10 partyIndependent -0.157 0.125 -0.064 0.167
11 partyRepublican 0.174 0.120 0.417 0.161
When you inspect this table, pay attention to which coefficients move the most after weighting and whether standard errors increase or decrease. Often, the direction of effects will remain the same, but the magnitude and apparent strength of associations can shift. These shifts reflect the fact that the survey-weighted model is rebalancing the influence of different groups to better represent the population.