According to Yannick Duprez’s very handy reference, using pweight (probability weights) in a regression is equivalent to using aweight (analytic weights) and robust standard errors.

Here’s the logic that I’m going to work through:

  1. Validate that (in Stata), pweight is equivalent to using aweight with robust standard errors.
  2. Validate that aweight in Stata is equivalent to using the weights param in glm
  3. Validate that our function in R to calculate robust standard errors replicates the results in Stata.
  4. Validate that using aweight + robust in Stata is equivalent to using the weights param and the robust SE function we just wrote.

As a bonus, I’m also going to use the weights function in the survey package to see how this works.

Data Munging

rm(list=ls())
data(api, package = 'survey')

df <- apistrat[, c('api00', 'ell', 'meals', 'mobility', 'cname', 'pw')]
df$cname <- ifelse(
  df$cname %in% c('Fresno', 'Santa Clara', 'San Bernadino'), 
  'Group1', 'Group2')

Probability Weight = Analytic Weights + Robust SE

In Stata, adding pweight is equivalent to aweight + robust

(Replicates!)

Probability Weights

## . reg api00 ell meals mobility i.cname [pweight = pw]
## (sum of wgt is   6.1940e+03)
## 
## Linear regression                                      Number of obs =     200
##                                                        F(  4,   195) =  104.68
##                                                        Prob > F      =  0.0000
##                                                        R-squared     =  0.6601
##                                                        Root MSE      =  72.589
## 
## ------------------------------------------------------------------------------
##              |               Robust
##        api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
## -------------+----------------------------------------------------------------
##          ell |   -.513901   .4063997    -1.26   0.208    -1.315404     .287602
##        meals |  -3.148314   .2925792   -10.76   0.000     -3.72534   -2.571288
##     mobility |   .2346743   .4047053     0.58   0.563    -.5634871    1.032836
##              |
##        cname |
##      Group2  |  -9.708186   19.92028    -0.49   0.627    -48.99504    29.57867
##        _cons |   830.4303   21.18687    39.20   0.000     788.6455    872.2152
## ------------------------------------------------------------------------------
## 
## . estout . using mod1.txt, cells("b se t p") stats(N) replace
## (note: file mod1.txt not found)
## (output written to mod1.txt)
## 
## . estimates store t1

Analytic Weights + Robust

## . reg api00 ell meals mobility i.cname [aweight = pw], robust
## (sum of wgt is   6.1940e+03)
## 
## Linear regression                                      Number of obs =     200
##                                                        F(  4,   195) =  104.68
##                                                        Prob > F      =  0.0000
##                                                        R-squared     =  0.6601
##                                                        Root MSE      =  72.589
## 
## ------------------------------------------------------------------------------
##              |               Robust
##        api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
## -------------+----------------------------------------------------------------
##          ell |   -.513901   .4063997    -1.26   0.208    -1.315404     .287602
##        meals |  -3.148314   .2925792   -10.76   0.000     -3.72534   -2.571288
##     mobility |   .2346743   .4047053     0.58   0.563    -.5634871    1.032836
##              |
##        cname |
##      Group2  |  -9.708186   19.92028    -0.49   0.627    -48.99504    29.57867
##        _cons |   830.4303   21.18687    39.20   0.000     788.6455    872.2152
## ------------------------------------------------------------------------------
## 
## . estout . using mod1.txt, cells("b se t p") stats(N) replace
## (note: file mod1.txt not found)
## (output written to mod1.txt)
## 
## . estimates store t1

Analytics Weights = GLM Weights Param

In Stata, adding aweight is equivalent to using the weights option in glm

(Replicates!)

Stata

## . reg api00 ell meals mobility i.cname [aweight = pw]
## (sum of wgt is   6.1940e+03)
## 
##       Source |       SS       df       MS              Number of obs =     200
## -------------+------------------------------           F(  4,   195) =   94.68
##        Model |  1995449.59     4  498862.398           Prob > F      =  0.0000
##     Residual |  1027477.06   195  5269.11315           R-squared     =  0.6601
## -------------+------------------------------           Adj R-squared =  0.6531
##        Total |  3022926.65   199  15190.5862           Root MSE      =  72.589
## 
## ------------------------------------------------------------------------------
##        api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
## -------------+----------------------------------------------------------------
##          ell |   -.513901   .3720566    -1.38   0.169    -1.247673    .2198705
##        meals |  -3.148314   .2701049   -11.66   0.000    -3.681016   -2.615612
##     mobility |   .2346743    .462905     0.51   0.613    -.6782688    1.147617
##              |
##        cname |
##      Group2  |  -9.708186   16.87379    -0.58   0.566    -42.98675    23.57038
##        _cons |   830.4303   20.26218    40.98   0.000     790.4692    870.3915
## ------------------------------------------------------------------------------
## 
## . estout . using mod1.txt, cells("b se t p") stats(N) replace
## (note: file mod1.txt not found)
## (output written to mod1.txt)
## 
## . estimates store t1

R

y <- glm(api00 ~ ell + meals + mobility + cname,
         data = df, weights = pw)
summary(y)
## 
## Call:
## glm(formula = api00 ~ ell + meals + mobility + cname, data = df, 
##     weights = pw)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1076.96   -317.47    -87.58    217.20   1164.29  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 830.4303    20.2622  40.984   <2e-16 ***
## ell          -0.5139     0.3721  -1.381    0.169    
## meals        -3.1483     0.2701 -11.656   <2e-16 ***
## mobility      0.2347     0.4629   0.507    0.613    
## cnameGroup2  -9.7082    16.8738  -0.575    0.566    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 163184.4)
## 
##     Null deviance: 93620038  on 199  degrees of freedom
## Residual deviance: 31820964  on 195  degrees of freedom
## AIC: 2309.7
## 
## Number of Fisher Scoring iterations: 2

Replicating Robust SE

In Stata, robust should replicate with the following function (h/t Nat Olin):

robust_se <- function(model){
  
  # get parameters
  x <- model.matrix(model)
  n <- nrow(x)
  k <- length(coef(model))
  # See pg 6 of Yannick - note that length k is num_pred + indicator
  dfc <- n / (n - k)
  
  # make sandwich
  u <- model$residuals
  bread <- solve(crossprod(x))
  meat <- t(x) %*% (dfc * diag(u^2)) %*% x
  v <- bread %*% meat %*% bread
  
  return(v)
}

(Replicates!)

Stata

## . reg api00 ell meals mobility i.cname, robust
## 
## Linear regression                                      Number of obs =     200
##                                                        F(  4,   195) =   86.85
##                                                        Prob > F      =  0.0000
##                                                        R-squared     =  0.5897
##                                                        Root MSE      =  78.276
## 
## ------------------------------------------------------------------------------
##              |               Robust
##        api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
## -------------+----------------------------------------------------------------
##          ell |  -.6955824   .4562246    -1.52   0.129     -1.59535    .2041856
##        meals |  -2.865973   .3150943    -9.10   0.000    -3.487403   -2.244543
##     mobility |   .0238613   .4857161     0.05   0.961      -.93407    .9817927
##              |
##        cname |
##      Group2  |  -12.82289   20.02464    -0.64   0.523    -52.31557    26.66979
##        _cons |   807.4927   21.83548    36.98   0.000     764.4287    850.5567
## ------------------------------------------------------------------------------
## 
## . estout . using mod1.txt, cells("b se t p") stats(N) replace
## (note: file mod1.txt not found)
## (output written to mod1.txt)
## 
## . estimates store t1

R

mod <- glm(api00 ~ ell + meals + mobility + cname, data = df)
lmtest::coeftest(mod, vcov = robust_se(mod))
## 
## z test of coefficients:
## 
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 807.492693  21.835481 36.9808   <2e-16 ***
## ell          -0.695582   0.456225 -1.5246   0.1273    
## meals        -2.865973   0.315094 -9.0956   <2e-16 ***
## mobility      0.023861   0.485716  0.0491   0.9608    
## cnameGroup2 -12.822890  20.024643 -0.6404   0.5219    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Replication in R

In Stata, adding aweight & robust is equivalent to using the weights option in glm + robust_se

Does not replicate :(

Stata

## . reg api00 ell meals mobility i.cname [pweight = pw]
## (sum of wgt is   6.1940e+03)
## 
## Linear regression                                      Number of obs =     200
##                                                        F(  4,   195) =  104.68
##                                                        Prob > F      =  0.0000
##                                                        R-squared     =  0.6601
##                                                        Root MSE      =  72.589
## 
## ------------------------------------------------------------------------------
##              |               Robust
##        api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
## -------------+----------------------------------------------------------------
##          ell |   -.513901   .4063997    -1.26   0.208    -1.315404     .287602
##        meals |  -3.148314   .2925792   -10.76   0.000     -3.72534   -2.571288
##     mobility |   .2346743   .4047053     0.58   0.563    -.5634871    1.032836
##              |
##        cname |
##      Group2  |  -9.708186   19.92028    -0.49   0.627    -48.99504    29.57867
##        _cons |   830.4303   21.18687    39.20   0.000     788.6455    872.2152
## ------------------------------------------------------------------------------
## 
## . estout . using mod1.txt, cells("b se t p") stats(N) replace
## (note: file mod1.txt not found)
## (output written to mod1.txt)
## 
## . estimates store t1

R

y <- glm(api00 ~ ell + meals + mobility + cname,
         data = df, weights = pw)
lmtest::coeftest(y, vcov = robust_se(y))
## 
## z test of coefficients:
## 
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 830.43035   22.43148 37.0207   <2e-16 ***
## ell          -0.51390    0.44726 -1.1490   0.2506    
## meals        -3.14831    0.31928 -9.8606   <2e-16 ***
## mobility      0.23467    0.56912  0.4123   0.6801    
## cnameGroup2  -9.70819   20.48117 -0.4740   0.6355    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Bonus: Survey package

Does not replicate :(

Stata

## . reg api00 ell meals mobility i.cname [pweight = pw]
## (sum of wgt is   6.1940e+03)
## 
## Linear regression                                      Number of obs =     200
##                                                        F(  4,   195) =  104.68
##                                                        Prob > F      =  0.0000
##                                                        R-squared     =  0.6601
##                                                        Root MSE      =  72.589
## 
## ------------------------------------------------------------------------------
##              |               Robust
##        api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
## -------------+----------------------------------------------------------------
##          ell |   -.513901   .4063997    -1.26   0.208    -1.315404     .287602
##        meals |  -3.148314   .2925792   -10.76   0.000     -3.72534   -2.571288
##     mobility |   .2346743   .4047053     0.58   0.563    -.5634871    1.032836
##              |
##        cname |
##      Group2  |  -9.708186   19.92028    -0.49   0.627    -48.99504    29.57867
##        _cons |   830.4303   21.18687    39.20   0.000     788.6455    872.2152
## ------------------------------------------------------------------------------
## 
## . estout . using mod1.txt, cells("b se t p") stats(N) replace
## (note: file mod1.txt not found)
## (output written to mod1.txt)
## 
## . estimates store t1

surveyglm

dstrat <- survey::svydesign(id = ~1, weights = ~pw, data = df)
y <- survey::svyglm(api00 ~ ell + meals + mobility + cname,
                       design = dstrat)
summary(y)
## 
## Call:
## svyglm(formula = api00 ~ ell + meals + mobility + cname, design = dstrat)
## 
## Survey design:
## survey::svydesign(id = ~1, weights = ~pw, data = df)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 830.4303    20.9729  39.595   <2e-16 ***
## ell          -0.5139     0.4023  -1.277    0.203    
## meals        -3.1483     0.2896 -10.870   <2e-16 ***
## mobility      0.2347     0.4006   0.586    0.559    
## cnameGroup2  -9.7082    19.7191  -0.492    0.623    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 5163.201)
## 
## Number of Fisher Scoring iterations: 2

surveyglm + robust SE

Interesting, this is the same as using glm with the weight parameter and robust SEs.

dstrat <- survey::svydesign(id = ~1, weights = ~pw, data = df)
y <- survey::svyglm(api00 ~ ell + meals + mobility + cname,
                       design = dstrat)
lmtest::coeftest(y, vcov = robust_se(y))
## 
## z test of coefficients:
## 
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 830.43035   22.43148 37.0207   <2e-16 ***
## ell          -0.51390    0.44726 -1.1490   0.2506    
## meals        -3.14831    0.31928 -9.8606   <2e-16 ***
## mobility      0.23467    0.56912  0.4123   0.6801    
## cnameGroup2  -9.70819   20.48117 -0.4740   0.6355    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1