According to Yannick Duprez’s very handy reference, using pweight
(probability weights) in a regression is equivalent to using aweight
(analytic weights) and robust standard errors.
Here’s the logic that I’m going to work through:
pweight
is equivalent to using aweight
with robust standard errors.aweight
in Stata is equivalent to using the weights
param in glm
aweight
+ robust
in Stata is equivalent to using the weights
param and the robust SE function we just wrote.As a bonus, I’m also going to use the weights function in the survey package to see how this works.
rm(list=ls())
data(api, package = 'survey')
df <- apistrat[, c('api00', 'ell', 'meals', 'mobility', 'cname', 'pw')]
df$cname <- ifelse(
df$cname %in% c('Fresno', 'Santa Clara', 'San Bernadino'),
'Group1', 'Group2')
In Stata, adding pweight
is equivalent to aweight
+ robust
(Replicates!)
## . reg api00 ell meals mobility i.cname [pweight = pw]
## (sum of wgt is 6.1940e+03)
##
## Linear regression Number of obs = 200
## F( 4, 195) = 104.68
## Prob > F = 0.0000
## R-squared = 0.6601
## Root MSE = 72.589
##
## ------------------------------------------------------------------------------
## | Robust
## api00 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
## -------------+----------------------------------------------------------------
## ell | -.513901 .4063997 -1.26 0.208 -1.315404 .287602
## meals | -3.148314 .2925792 -10.76 0.000 -3.72534 -2.571288
## mobility | .2346743 .4047053 0.58 0.563 -.5634871 1.032836
## |
## cname |
## Group2 | -9.708186 19.92028 -0.49 0.627 -48.99504 29.57867
## _cons | 830.4303 21.18687 39.20 0.000 788.6455 872.2152
## ------------------------------------------------------------------------------
##
## . estout . using mod1.txt, cells("b se t p") stats(N) replace
## (note: file mod1.txt not found)
## (output written to mod1.txt)
##
## . estimates store t1
## . reg api00 ell meals mobility i.cname [aweight = pw], robust
## (sum of wgt is 6.1940e+03)
##
## Linear regression Number of obs = 200
## F( 4, 195) = 104.68
## Prob > F = 0.0000
## R-squared = 0.6601
## Root MSE = 72.589
##
## ------------------------------------------------------------------------------
## | Robust
## api00 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
## -------------+----------------------------------------------------------------
## ell | -.513901 .4063997 -1.26 0.208 -1.315404 .287602
## meals | -3.148314 .2925792 -10.76 0.000 -3.72534 -2.571288
## mobility | .2346743 .4047053 0.58 0.563 -.5634871 1.032836
## |
## cname |
## Group2 | -9.708186 19.92028 -0.49 0.627 -48.99504 29.57867
## _cons | 830.4303 21.18687 39.20 0.000 788.6455 872.2152
## ------------------------------------------------------------------------------
##
## . estout . using mod1.txt, cells("b se t p") stats(N) replace
## (note: file mod1.txt not found)
## (output written to mod1.txt)
##
## . estimates store t1
In Stata, adding aweight
is equivalent to using the weights
option in glm
(Replicates!)
## . reg api00 ell meals mobility i.cname [aweight = pw]
## (sum of wgt is 6.1940e+03)
##
## Source | SS df MS Number of obs = 200
## -------------+------------------------------ F( 4, 195) = 94.68
## Model | 1995449.59 4 498862.398 Prob > F = 0.0000
## Residual | 1027477.06 195 5269.11315 R-squared = 0.6601
## -------------+------------------------------ Adj R-squared = 0.6531
## Total | 3022926.65 199 15190.5862 Root MSE = 72.589
##
## ------------------------------------------------------------------------------
## api00 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
## -------------+----------------------------------------------------------------
## ell | -.513901 .3720566 -1.38 0.169 -1.247673 .2198705
## meals | -3.148314 .2701049 -11.66 0.000 -3.681016 -2.615612
## mobility | .2346743 .462905 0.51 0.613 -.6782688 1.147617
## |
## cname |
## Group2 | -9.708186 16.87379 -0.58 0.566 -42.98675 23.57038
## _cons | 830.4303 20.26218 40.98 0.000 790.4692 870.3915
## ------------------------------------------------------------------------------
##
## . estout . using mod1.txt, cells("b se t p") stats(N) replace
## (note: file mod1.txt not found)
## (output written to mod1.txt)
##
## . estimates store t1
y <- glm(api00 ~ ell + meals + mobility + cname,
data = df, weights = pw)
summary(y)
##
## Call:
## glm(formula = api00 ~ ell + meals + mobility + cname, data = df,
## weights = pw)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1076.96 -317.47 -87.58 217.20 1164.29
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 830.4303 20.2622 40.984 <2e-16 ***
## ell -0.5139 0.3721 -1.381 0.169
## meals -3.1483 0.2701 -11.656 <2e-16 ***
## mobility 0.2347 0.4629 0.507 0.613
## cnameGroup2 -9.7082 16.8738 -0.575 0.566
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 163184.4)
##
## Null deviance: 93620038 on 199 degrees of freedom
## Residual deviance: 31820964 on 195 degrees of freedom
## AIC: 2309.7
##
## Number of Fisher Scoring iterations: 2
In Stata, robust
should replicate with the following function (h/t Nat Olin):
robust_se <- function(model){
# get parameters
x <- model.matrix(model)
n <- nrow(x)
k <- length(coef(model))
# See pg 6 of Yannick - note that length k is num_pred + indicator
dfc <- n / (n - k)
# make sandwich
u <- model$residuals
bread <- solve(crossprod(x))
meat <- t(x) %*% (dfc * diag(u^2)) %*% x
v <- bread %*% meat %*% bread
return(v)
}
(Replicates!)
## . reg api00 ell meals mobility i.cname, robust
##
## Linear regression Number of obs = 200
## F( 4, 195) = 86.85
## Prob > F = 0.0000
## R-squared = 0.5897
## Root MSE = 78.276
##
## ------------------------------------------------------------------------------
## | Robust
## api00 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
## -------------+----------------------------------------------------------------
## ell | -.6955824 .4562246 -1.52 0.129 -1.59535 .2041856
## meals | -2.865973 .3150943 -9.10 0.000 -3.487403 -2.244543
## mobility | .0238613 .4857161 0.05 0.961 -.93407 .9817927
## |
## cname |
## Group2 | -12.82289 20.02464 -0.64 0.523 -52.31557 26.66979
## _cons | 807.4927 21.83548 36.98 0.000 764.4287 850.5567
## ------------------------------------------------------------------------------
##
## . estout . using mod1.txt, cells("b se t p") stats(N) replace
## (note: file mod1.txt not found)
## (output written to mod1.txt)
##
## . estimates store t1
mod <- glm(api00 ~ ell + meals + mobility + cname, data = df)
lmtest::coeftest(mod, vcov = robust_se(mod))
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 807.492693 21.835481 36.9808 <2e-16 ***
## ell -0.695582 0.456225 -1.5246 0.1273
## meals -2.865973 0.315094 -9.0956 <2e-16 ***
## mobility 0.023861 0.485716 0.0491 0.9608
## cnameGroup2 -12.822890 20.024643 -0.6404 0.5219
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In Stata, adding aweight
& robust
is equivalent to using the weights
option in glm
+ robust_se
Does not replicate :(
## . reg api00 ell meals mobility i.cname [pweight = pw]
## (sum of wgt is 6.1940e+03)
##
## Linear regression Number of obs = 200
## F( 4, 195) = 104.68
## Prob > F = 0.0000
## R-squared = 0.6601
## Root MSE = 72.589
##
## ------------------------------------------------------------------------------
## | Robust
## api00 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
## -------------+----------------------------------------------------------------
## ell | -.513901 .4063997 -1.26 0.208 -1.315404 .287602
## meals | -3.148314 .2925792 -10.76 0.000 -3.72534 -2.571288
## mobility | .2346743 .4047053 0.58 0.563 -.5634871 1.032836
## |
## cname |
## Group2 | -9.708186 19.92028 -0.49 0.627 -48.99504 29.57867
## _cons | 830.4303 21.18687 39.20 0.000 788.6455 872.2152
## ------------------------------------------------------------------------------
##
## . estout . using mod1.txt, cells("b se t p") stats(N) replace
## (note: file mod1.txt not found)
## (output written to mod1.txt)
##
## . estimates store t1
y <- glm(api00 ~ ell + meals + mobility + cname,
data = df, weights = pw)
lmtest::coeftest(y, vcov = robust_se(y))
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 830.43035 22.43148 37.0207 <2e-16 ***
## ell -0.51390 0.44726 -1.1490 0.2506
## meals -3.14831 0.31928 -9.8606 <2e-16 ***
## mobility 0.23467 0.56912 0.4123 0.6801
## cnameGroup2 -9.70819 20.48117 -0.4740 0.6355
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Does not replicate :(
## . reg api00 ell meals mobility i.cname [pweight = pw]
## (sum of wgt is 6.1940e+03)
##
## Linear regression Number of obs = 200
## F( 4, 195) = 104.68
## Prob > F = 0.0000
## R-squared = 0.6601
## Root MSE = 72.589
##
## ------------------------------------------------------------------------------
## | Robust
## api00 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
## -------------+----------------------------------------------------------------
## ell | -.513901 .4063997 -1.26 0.208 -1.315404 .287602
## meals | -3.148314 .2925792 -10.76 0.000 -3.72534 -2.571288
## mobility | .2346743 .4047053 0.58 0.563 -.5634871 1.032836
## |
## cname |
## Group2 | -9.708186 19.92028 -0.49 0.627 -48.99504 29.57867
## _cons | 830.4303 21.18687 39.20 0.000 788.6455 872.2152
## ------------------------------------------------------------------------------
##
## . estout . using mod1.txt, cells("b se t p") stats(N) replace
## (note: file mod1.txt not found)
## (output written to mod1.txt)
##
## . estimates store t1
dstrat <- survey::svydesign(id = ~1, weights = ~pw, data = df)
y <- survey::svyglm(api00 ~ ell + meals + mobility + cname,
design = dstrat)
summary(y)
##
## Call:
## svyglm(formula = api00 ~ ell + meals + mobility + cname, design = dstrat)
##
## Survey design:
## survey::svydesign(id = ~1, weights = ~pw, data = df)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 830.4303 20.9729 39.595 <2e-16 ***
## ell -0.5139 0.4023 -1.277 0.203
## meals -3.1483 0.2896 -10.870 <2e-16 ***
## mobility 0.2347 0.4006 0.586 0.559
## cnameGroup2 -9.7082 19.7191 -0.492 0.623
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 5163.201)
##
## Number of Fisher Scoring iterations: 2
Interesting, this is the same as using glm
with the weight
parameter and robust SEs.
dstrat <- survey::svydesign(id = ~1, weights = ~pw, data = df)
y <- survey::svyglm(api00 ~ ell + meals + mobility + cname,
design = dstrat)
lmtest::coeftest(y, vcov = robust_se(y))
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 830.43035 22.43148 37.0207 <2e-16 ***
## ell -0.51390 0.44726 -1.1490 0.2506
## meals -3.14831 0.31928 -9.8606 <2e-16 ***
## mobility 0.23467 0.56912 0.4123 0.6801
## cnameGroup2 -9.70819 20.48117 -0.4740 0.6355
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1