Yuxin Zhang yuxin.zhang@unitn.it
Research Design Lab,
01-02 April 2026
A survey weight is a numerical value assigned to each observation
It tells us how much each observation should contribute to representing population parameters
In other words, a survey weight adjusts the sample so that it better reflects the population, using prior knowledge about that population
Used to adjust for known differences in the probability of being selected
\(p_i\) = probability of selection of unit \(i\)
\(d_i = \frac{1}{p_i}\) (sampling or design weight)
Scenario 1: Equal selection probability for all units
Scenario 2: Unequal selection probability across groups
Example: In a random address-based survey, individuals living alone are more likely to be sampled than those living with others
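As a minimal sketch of Scenario 2 in R, assume a hypothetical address-based design in which every address has the same chance of being drawn and one adult is then picked at random within the address. The household sizes and the address selection probability are made up for illustration; only the relation \(d_i = 1/p_i\) comes from the definition above.

```r
# Hypothetical sample of four respondents (illustrative values only)
sampled <- data.frame(
  id      = 1:4,
  hh_size = c(1, 2, 2, 4)   # number of adults living at the sampled address
)

p_address <- 0.01  # assumed: same selection probability for every address

# One adult is drawn at random within each address,
# so the person-level selection probability is p_address / hh_size
sampled$p_i <- p_address / sampled$hh_size

# Design weight: inverse of the selection probability
sampled$d_i <- 1 / sampled$p_i

print(sampled)
```

Under these assumptions the person living alone receives the smallest design weight, because that person was the most likely to be selected.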
Then, not everyone sampled will actually take part in the survey (missing contact information/nonresponse/refusal/…)
And this can vary systematically across groups in a population
Used to adjust for known differences in the probability of responding
People with different characteristics are not equally likely to respond to the survey
E.g., by gender/age/education/income/ethnicity…
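One simple way to turn this idea into numbers is a weighting-class adjustment: estimate the response rate within each group and weight respondents by its inverse. The sketch below assumes this approach and uses hypothetical age groups and counts; it is an illustration, not necessarily the procedure used in any particular survey.

```r
# Hypothetical counts by age group (illustrative values only)
groups <- data.frame(
  group     = c("18-34", "35-64", "65+"),
  sampled   = c(200, 300, 100),   # units selected into the sample
  responded = c(80, 180, 80)      # units that actually took part
)

# Estimated response probability within each group
groups$resp_rate <- groups$responded / groups$sampled

# Nonresponse adjustment factor: inverse of the response probability
groups$nr_weight <- 1 / groups$resp_rate

print(groups)
```

Groups that respond less often get a larger adjustment factor, so that the respondents in each group again stand in for everyone who was sampled from it.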
In stratified sampling, the sample design requires strata to be defined in advance.
Therefore, strata should be constructed based on known population characteristics using external sources (e.g., information from the census).
Population: 50% men, 50% women
Sample: 30% men, 70% women
Which group should get higher weights?
Why?
\[ \bar{y} = \frac{\sum_{i=1}^{n} y_i}{n} \]
\[ \bar{y}_w = \frac{\sum_{i=1}^{n} w_i y_i}{\sum_{i=1}^{n} w_i} \]
Suppose the true population and your sample differ:
| Group | Population count (\(N_g\)) | Sample count (\(n_g\)) | Score (\(y_g\)) |
|---|---|---|---|
| Men | 50 | 30 | 6 |
| Women | 50 | 70 | 8 |
Task
Compute the weight for each group: \(w_g = \frac{\%\,\text{population}_g}{\%\,\text{sample}_g}\)
Assign the weight to each observation (as weights live at the observation level!)
Compute the unweighted and weighted mean scores, and compare them
Hint:
Men are underrepresented → weight should be > 1
Women are overrepresented → weight should be < 1
| Group | Population count (\(N_g\)) | Sample count (\(n_g\)) | Score (\(y_g\)) | Weight (\(w_g\)) |
|---|---|---|---|---|
| Men | 50 | 30 | 6 | 1.67 |
| Women | 50 | 70 | 8 | 0.71 |
By steps:
Unweighted average score:
\(\bar{y} = \frac{(30 \cdot 6) + (70 \cdot 8)}{30 + 70} = \frac{180 + 560}{100} = 7.4\)
Weighted average score:
\(\bar{y}_w = \frac{(1.67 \cdot 30 \cdot 6) + (0.71 \cdot 70 \cdot 8)}{(1.67 \cdot 30) + (0.71 \cdot 70)} \approx 7.0\)
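The same steps can be checked in R. This sketch expands the group table to 100 individual observations and uses base R's `weighted.mean()`; the variable names are chosen here for illustration.

```r
# Group-level information from the table
pop_share    <- c(Men = 0.5, Women = 0.5)   # population shares
sample_share <- c(Men = 0.3, Women = 0.7)   # sample shares
w_g <- pop_share / sample_share             # group weights (about 1.67 and 0.71)

# Expand to the observation level: 30 men scoring 6, 70 women scoring 8
group <- rep(c("Men", "Women"), times = c(30, 70))
score <- ifelse(group == "Men", 6, 8)
w     <- unname(w_g[group])                 # each person carries their group's weight

mean(score)                # unweighted mean
weighted.mean(score, w)    # weighted mean
```

Because the weights are not rounded here, the weighted mean comes out at exactly 7 rather than approximately 7.0.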
Please calculate unweighted and weighted average scores using:
| id | score | weight |
|---|---|---|
| 1 | 50 | 1 |
| 2 | 60 | 1 |
| 3 | 55 | 1 |
| 4 | 70 | 1 |
| 5 | 65 | 1 |
| 6 | 40 | 1 |
| 7 | 45 | 1 |
| 8 | 50 | 1 |
| 9 | 75 | 1 |
| 10 | 90 | 50 |
## [1] 60
## [1] 84.91525
Extreme weights can heavily influence overall estimates. So, always inspect the distribution of the weights before using them.
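A minimal R sketch that reproduces the two values above (60 and 84.91525) from the ten-row table, and that looks at the weights first. The variable names are chosen here for illustration.

```r
# Data from the ten-row table
score  <- c(50, 60, 55, 70, 65, 40, 45, 50, 75, 90)
weight <- c(rep(1, 9), 50)

# Inspect the weights before using them
summary(weight)
share <- weight / sum(weight)   # share of the total weight held by each observation
round(share, 3)

mean(score)                     # unweighted mean: 60
weighted.mean(score, weight)    # weighted mean: 84.91525
```

Observation 10 holds 50 of the 59 weight units, which is why the weighted mean is pulled so close to its score of 90.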
Pros:
Cons:
Can give disproportionate influence to a small number of atypical observations
Can introduce additional bias if the prior information for weighting is incorrect
Can increase the variance of estimates (see Korn & Graubard (1999) Section 4.3 for details and examples)
Step 1: Weighted mean: \[ \bar{y}_w = \frac{\sum_{i=1}^n w_i y_i}{\sum_{i=1}^n w_i} \]
Step 2: Weighted variance: \[ s_w^2 = \frac{\sum_{i=1}^n w_i (y_i - \bar{y}_w)^2}{\sum_{i=1}^n w_i - 1} \]
Step 3: Effective sample size: \[ n_{\text{eff}} = \frac{\left(\sum_{i=1}^n w_i\right)^2}{\sum_{i=1}^n w_i^2} \]
Step 4: Standard error: \[ \mathrm{SE}(\bar{y}_w) = \sqrt{\frac{s_w^2}{n_{\rm eff}}} \]
Note: The conventional standard error can be misleading when weights are highly unequal, because most of the information comes from a few heavily weighted observations. To adjust for this, we divide by the effective sample size \(n_{\text{eff}}\) rather than the nominal sample size \(n\) when computing the standard error: \(\mathrm{SE}(\bar{y}_w) = \sqrt{s_w^2 / n_{\text{eff}}}\).
As the weights become more equal, \(n_{\text{eff}}\) approaches the original sample size \(n\); when a single observation dominates the weights, \(n_{\text{eff}}\) approaches 1 (with a few equally dominant observations, it approaches roughly their number), reflecting the fact that the estimate is based on very limited independent information.
When weights are equal: \(n_{\text{eff}} = n\)
When weights are highly unequal:
\(\sum w_i^2 \uparrow\)
\(n_{\text{eff}} \downarrow\)
\(\mathrm{SE}(\bar{y}_w) \uparrow\)
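Both limiting cases follow directly from the formula for \(n_{\text{eff}}\). As a short check, first let every weight equal the same constant \(c > 0\); then let a single weight \(w_1 = W\) grow while the other \(n - 1\) weights stay at 1 (both setups are illustrative assumptions):

\[ w_i = c \ \text{for all } i \quad\Rightarrow\quad n_{\text{eff}} = \frac{(n c)^2}{n c^2} = n \]

\[ w_1 = W,\ w_2 = \dots = w_n = 1 \quad\Rightarrow\quad n_{\text{eff}} = \frac{(W + n - 1)^2}{W^2 + n - 1} \longrightarrow 1 \quad \text{as } W \to \infty \]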
Consider 5 observations with scores:
| id | score | weight |
|---|---|---|
| 1 | 50 | 1 |
| 2 | 80 | 1 |
| 3 | 55 | 1 |
| 4 | 70 | 1 |
| 5 | 65 | 1 |
Calculate the weighted mean and the SE of the weighted mean.
Weighted mean (all weights = 1, reduces to regular mean): \[ \bar{y}_w = \frac{\sum_{i=1}^n w_i y_i}{\sum_{i=1}^n w_i} = \frac{1\cdot50 + 1\cdot80 + 1\cdot55 + 1\cdot70 + 1\cdot65}{1+1+1+1+1} = 64 \]
Weighted variance: \[ s_w^2 = \frac{\sum_{i=1}^n w_i (y_i - \bar{y}_w)^2}{\sum_{i=1}^n w_i - 1} \]
\[ = \frac{(50-64)^2 + (80-64)^2 + (55-64)^2 + (70-64)^2 + (65-64)^2}{5-1} = \frac{570}{4} = 142.5 \]
Effective sample size: \[ n_{\rm eff} = \frac{(\sum_{i=1}^n w_i)^2}{\sum_{i=1}^n w_i^2} = \frac{(1+1+1+1+1)^2}{1^2+1^2+1^2+1^2+1^2} = \frac{25}{5} = 5 \]
SE of the weighted mean: \[ \mathrm{SE}(\bar{y}_w) = \sqrt{\frac{s_w^2}{n_{\rm eff}}} = \sqrt{\frac{142.5}{5}} = \sqrt{28.5} \approx 5.34 \]
Now consider two observations with extreme weights:
| id | score | weight |
|---|---|---|
| 1 | 50 | 100 |
| 2 | 80 | 100 |
| 3 | 55 | 1 |
| 4 | 70 | 1 |
| 5 | 65 | 1 |
Calculate the weighted mean and the SE of the weighted mean again.
Weighted mean: \[ \bar{y}_w = \frac{\sum_i w_i y_i}{\sum_i w_i} = \frac{100\cdot50 + 100\cdot80 + 1\cdot55 + 1\cdot70 + 1\cdot65}{100+100+1+1+1} \approx 65.0 \]
Weighted variance: \[ s_w^2 = \frac{\sum_{i=1}^n w_i (y_i - \bar{y}_w)^2}{\sum_{i=1}^n w_i - 1} \]
\[ = \frac{100\cdot(50-65)^2 + 100\cdot(80-65)^2 + 1\cdot(55-65)^2 + 1\cdot(70-65)^2 + 1\cdot(65-65)^2}{100+100+1+1+1 - 1} \]
\[ = \frac{45125}{202} \approx 223.4 \]
Effective sample size: \[ n_{\rm eff} = \frac{(\sum_{i=1}^n w_i)^2}{\sum_{i=1}^n w_i^2} = \frac{(100+100+1+1+1)^2}{100^2 + 100^2 + 1^2 + 1^2 + 1^2} = \frac{41209}{20003} \approx 2.06 \]
SE of the weighted mean: \[ \mathrm{SE}(\bar{y}_w) = \sqrt{\frac{s_w^2}{n_{\rm eff}}} = \sqrt{\frac{223.4}{2.06}} \approx \sqrt{108.4} \approx 10.4 \]
| Scenario | Weighted Mean \(\bar{y}_w\) | Weighted Variance \(s_w^2\) | Effective n \(n_{\rm eff}\) | SE of mean \(\mathrm{SE}(\bar{y}_w)\) |
|---|---|---|---|---|
| Equal weights | 64 | 142.5 | 5 | 5.34 |
| Extreme weights | 65 | 223.4 | 2.06 | 10.4 |
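For contrast, here is what happens in the extreme-weights scenario if the nominal sample size \(n = 5\) is used in place of \(n_{\rm eff}\). This extra calculation is added only for illustration and reuses the weighted variance from the table:

\[ \sqrt{\frac{s_w^2}{n}} = \sqrt{\frac{223.4}{5}} = \sqrt{44.68} \approx 6.68 \]

That is noticeably smaller than the 10.4 obtained with \(n_{\rm eff} \approx 2.06\), which is the sense in which the conventional standard error can be misleading when weights are highly unequal.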
You can also do the calculation using R or Stata:
```r
# Solution in R
df <- data.frame(
  y  = c(50, 80, 55, 70, 65),
  w1 = c(1, 1, 1, 1, 1),       # equal weights
  w2 = c(100, 100, 1, 1, 1)    # large unequal weights
)

# create a function
compute_stats <- function(y, w) {
  ybar <- sum(w * y) / sum(w)                   # weighted mean
  sw2  <- sum(w * (y - ybar)^2) / (sum(w) - 1)  # weighted variance
  neff <- (sum(w)^2) / sum(w^2)                 # effective sample size
  se   <- sqrt(sw2 / neff)                      # standard error
  return(data.frame(
    mean = ybar,
    weighted_variance = sw2,
    n_eff = neff,
    se = se))
}

# 4.1 equal weights
res_equal <- compute_stats(df$y, df$w1)

# 4.2 unequal weights
res_unequal <- compute_stats(df$y, df$w2)

# show results
results <- rbind(
  Equal_Weights = res_equal,
  Unequal_Weights = res_unequal)

round(results, 3) |>
  gt::gt()
```

| mean | weighted_variance | n_eff | se |
|---|---|---|---|
| 64.000 | 142.50 | 5.00 | 5.339 |
| 64.975 | 223.39 | 2.06 | 10.413 |
Using weights without justifying underlying assumptions is bad practice. So, before computing or using them, ask yourself:
1. What is my target population?
2. Is my sample representative of this population?
3. Do I have enough prior knowledge on the population?
4. Am I doing description or modeling?
5. Is selection related to my outcome of interest?
Final note:
“It is good practice to report both weighted and unweighted estimates[…] because the contrast serves as a useful joint test against model misspecification and/or misunderstanding of the sampling process” (Solon et al., 2015).