DEM7283: Statistics for Demographic Data II

Objectives of the exercise
Data source
Set working directory and load packages
Load and process data set
Objective 1: Outcome of interest, hypothesis statement and variables specification
Objective 2: Fit Poission regression model
Objective 3: Fit Alternative Model, Negative Binomial regression model
- Compare coefficients
- Fit Negative Binomial regression
Acknowledgements

Author: Elhakim Ibrahim

Instructor: Corey Sparks, PhD¹

April 06, 2020

Objectives of the exercise

In this exercise, we shall:

Define a count outcome and state a research question about the outcome
- Determine if an offset term is necessary? Why or why not?
Consider a Poisson regression model for the outcome
- Evaluate the level of dispersion in the outcome
- Evaluate if the Poisson model is a good choice
Consider a Negative binomial model
Compare the model fits of the alternative models based on AIC values

Data source

The exercise is based on data for the State of Texas extracted from nationally representative 2016 Behavioral Risk Factor Surveillance System (BRFSS) SMART metro area survey data set were employed².

Set working directory and load packages

setwd("C:/A/01/02S20/D283/H/05")

suppressPackageStartupMessages({
    library(car)
    library(dplyr)
    library(MASS)
    library(gtools)
    library(lmtest)
    library(survey)
    library(sjPlot)
    library(sjmisc)
    library(forcats)
    library(stargazer)
    library(questionr)
    library(htmlTable)
    library(sjlabelled)
})

Load and process data set

load(url("https://github.com/coreysparks/data/blob/master/brfss_2017.Rdata?raw=true"))

mydata0 <- brfss_17
rawnames <- names(mydata0)
newnames <- tolower(gsub(pattern = "_", replacement = "", x = rawnames))
# above identifies and removes '_' from the variable names and changes all
# to lowercase
names(mydata0) <- newnames
head(newnames, n = 35)

 [1] "dispcode" "statere1" "safetime" "hhadult"  "genhlth"  "physhlth"
 [7] "menthlth" "poorhlth" "hlthpln1" "persdoc2" "medcost"  "checkup1"
[13] "bphigh4"  "bpmeds"   "cholchk1" "toldhi2"  "cholmed1" "cvdinfr4"
[19] "cvdcrhd4" "cvdstrk3" "asthma3"  "asthnow"  "chcscncr" "chcocncr"
[25] "chccopd1" "havarth3" "addepev2" "chckidny" "diabete3" "diabage2"
[31] "lmtjoin3" "arthdis2" "arthsocl" "joinpai1" "sex"

Objective 1: Outcome of interest, hypothesis statement and variables specification

Outcome of interest

Alcohol consumption

Our outcome of interest is number of days respondent consumed alcohol. We will measure this as a count variable

# explore variable number of alcday5
summarytools::freq(mydata0$alcday5, round.digits = 1, cumul = T, report.nas = T)

Frequencies  
mydata0$alcday5  
Label: DAYS IN PAST 30 HAD ALCOHOLIC BEVERAGE  
Type: Numeric  

                Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
----------- -------- --------- -------------- --------- --------------
        101     9402       4.3            4.3       4.1            4.1
        102     7786       3.5            7.8       3.4            7.4
        103     5242       2.4           10.2       2.3            9.7
        104     2477       1.1           11.3       1.1           10.8
        105     2317       1.1           12.4       1.0           11.8
        106      739       0.3           12.7       0.3           12.1
        107     4392       2.0           14.7       1.9           14.0
        201    18747       8.5           23.2       8.1           22.1
        202    14233       6.5           29.7       6.2           28.3
        203     8112       3.7           33.4       3.5           31.8
        204     6823       3.1           36.5       3.0           34.8
        205     6849       3.1           39.6       3.0           37.7
        206     2494       1.1           40.8       1.1           38.8
        207     1769       0.8           41.6       0.8           39.6
        208     2021       0.9           42.5       0.9           40.5
        209      179       0.1           42.6       0.1           40.5
        210     5266       2.4           45.0       2.3           42.8
        211       37       0.0           45.0       0.0           42.8
        212     1029       0.5           45.5       0.4           43.3
        213       42       0.0           45.5       0.0           43.3
        214      350       0.2           45.6       0.2           43.4
        215     4238       1.9           47.6       1.8           45.3
        216      117       0.1           47.6       0.1           45.3
        217       54       0.0           47.6       0.0           45.4
        218      114       0.1           47.7       0.0           45.4
        219       13       0.0           47.7       0.0           45.4
        220     4190       1.9           49.6       1.8           47.2
        221      105       0.0           49.6       0.0           47.3
        222       86       0.0           49.7       0.0           47.3
        223       45       0.0           49.7       0.0           47.3
        224      102       0.0           49.8       0.0           47.4
        225     1812       0.8           50.6       0.8           48.2
        226       94       0.0           50.6       0.0           48.2
        227      109       0.0           50.7       0.0           48.2
        228      469       0.2           50.9       0.2           48.4
        229      194       0.1           51.0       0.1           48.5
        230     7606       3.5           54.4       3.3           51.8
        777     1712       0.8           55.2       0.7           52.6
        888    97541      44.4           99.6      42.2           94.8
        999      913       0.4          100.0       0.4           95.2
       <NA>    11055                                4.8          100.0
      Total   230875     100.0          100.0     100.0          100.0

# subset data and limit to realistic number of alcday5
mydata0 = subset(mydata0, alcday5 < 777)

# check the variable structure and counts
str(mydata0$alcday5)

 num [1:119654] 107 215 210 204 201 102 101 205 107 107 ...

summary(mydata0$alcday5)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    101     107     202     180     207     230

summarytools::freq(mydata0$alcday5, round.digits = 1, cumul = T, report.nas = T)

Frequencies  
mydata0$alcday5  
Type: Numeric  

                Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
----------- -------- --------- -------------- --------- --------------
        101     9402       7.9            7.9       7.9            7.9
        102     7786       6.5           14.4       6.5           14.4
        103     5242       4.4           18.7       4.4           18.7
        104     2477       2.1           20.8       2.1           20.8
        105     2317       1.9           22.8       1.9           22.8
        106      739       0.6           23.4       0.6           23.4
        107     4392       3.7           27.0       3.7           27.0
        201    18747      15.7           42.7      15.7           42.7
        202    14233      11.9           54.6      11.9           54.6
        203     8112       6.8           61.4       6.8           61.4
        204     6823       5.7           67.1       5.7           67.1
        205     6849       5.7           72.8       5.7           72.8
        206     2494       2.1           74.9       2.1           74.9
        207     1769       1.5           76.4       1.5           76.4
        208     2021       1.7           78.1       1.7           78.1
        209      179       0.1           78.2       0.1           78.2
        210     5266       4.4           82.6       4.4           82.6
        211       37       0.0           82.6       0.0           82.6
        212     1029       0.9           83.5       0.9           83.5
        213       42       0.0           83.5       0.0           83.5
        214      350       0.3           83.8       0.3           83.8
        215     4238       3.5           87.4       3.5           87.4
        216      117       0.1           87.5       0.1           87.5
        217       54       0.0           87.5       0.0           87.5
        218      114       0.1           87.6       0.1           87.6
        219       13       0.0           87.6       0.0           87.6
        220     4190       3.5           91.1       3.5           91.1
        221      105       0.1           91.2       0.1           91.2
        222       86       0.1           91.3       0.1           91.3
        223       45       0.0           91.3       0.0           91.3
        224      102       0.1           91.4       0.1           91.4
        225     1812       1.5           92.9       1.5           92.9
        226       94       0.1           93.0       0.1           93.0
        227      109       0.1           93.1       0.1           93.1
        228      469       0.4           93.5       0.4           93.5
        229      194       0.2           93.6       0.2           93.6
        230     7606       6.4          100.0       6.4          100.0
       <NA>        0                                0.0          100.0
      Total   119654     100.0          100.0     100.0          100.0

Hypothesis and offset term justification

It is hypothesized that number of alcohol consumption days will not differ signifcantly by racial/ethinic identity net of other covariates. To test the hypothesis, we employ race/ethnicity as key indpendent variable and would control for age, sex, educational attainment, employment and smoking statuses as important covariates. Model specification for the outcome will include an offset term. This is due to unequal population shares and expectedly unequal rate of distribution of parity across racial/ethnic groups.

Specification of covariates

Race/ethnic identity

mydata0$racegr3 <- as.factor(mydata0$racegr3)

mydata0 <- mydata0 %>% mutate(racegrp = factor(racegr3, levels = c(1, 2, 3, 
    4, 5), labels = c("NH White", "NH Black", "Other", "Other", "Hispanic")))

mydata0$racegrp <- fct_relevel(mydata0$racegrp, "NH White", "NH Black", "Hispanic")
summarytools::freq(mydata0$racegrp, round.digits = 1, cumul = T, report.nas = T)

Frequencies  
mydata0$racegrp  
Type: Factor  

                   Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
-------------- -------- --------- -------------- --------- --------------
      NH White    92815      78.9           78.9      77.6           77.6
      NH Black     9634       8.2           87.1       8.1           85.6
      Hispanic     9237       7.8           94.9       7.7           93.3
         Other     6003       5.1          100.0       5.0           98.4
          <NA>     1965                                1.6          100.0
         Total   119654     100.0          100.0     100.0          100.0

Age group

mydata0$agegrp <- cut(mydata0$age80, breaks = c(0, 24, 39, 59, 79, 99))

summarytools::freq(mydata0$agegrp, round.digits = 1, cumul = T, report.nas = T)

Frequencies  
mydata0$agegrp  
Type: Factor  

                  Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
------------- -------- --------- -------------- --------- --------------
       (0,24]     7587       6.3            6.3       6.3            6.3
      (24,39]    24167      20.2           26.5      20.2           26.5
      (39,59]    40422      33.8           60.3      33.8           60.3
      (59,79]    40801      34.1           94.4      34.1           94.4
      (79,99]     6677       5.6          100.0       5.6          100.0
         <NA>        0                                0.0          100.0
        Total   119654     100.0          100.0     100.0          100.0

Sex

mydata0$sex <- Recode(mydata0$sex, recodes = "1='Male';2='Female';else=NA", 
    as.factor = T)

summarytools::freq(mydata0$sex, round.digits = 1, cumul = T, report.nas = T)

Frequencies  
mydata0$sex  
Type: Factor  

                 Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
------------ -------- --------- -------------- --------- --------------
      Female    60593      50.7           50.7      50.6           50.6
        Male    58996      49.3          100.0      49.3           99.9
        <NA>       65                                0.1          100.0
       Total   119654     100.0          100.0     100.0          100.0

Education attainment

mydata0$edu <- Recode(mydata0$educa, recodes = "1:2='0Primary';3='1Some HS';4='2HS Grad';
                    5='3Some Coll';6='4Coll Grad';9=NA", 
    as.factor = T)

mydata0$edu <- relevel(mydata0$edu, "4Coll Grad", )
summarytools::freq(mydata0$edu, round.digits = 1, cumul = T, report.nas = T)

Frequencies  
mydata0$edu  
Type: Factor  

                     Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
---------------- -------- --------- -------------- --------- --------------
      4Coll Grad    60998      51.1           51.1      51.0           51.0
        0Primary     1168       1.0           52.1       1.0           52.0
        1Some HS     2971       2.5           54.6       2.5           54.4
        2HS Grad    22669      19.0           73.5      18.9           73.4
      3Some Coll    31586      26.5          100.0      26.4           99.8
            <NA>      262                                0.2          100.0
           Total   119654     100.0          100.0     100.0          100.0

Employment status

mydata0$employ <- Recode(mydata0$employ1, recodes = "1:2='0Employed';2:6='3NILF';7='2Retired';
                       8='1Unemployed';else=NA", 
    as.factor = T)

summarytools::freq(mydata0$employ, round.digits = 1, cumul = T, report.nas = T)

Frequencies  
mydata0$employ  
Type: Factor  

                      Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
----------------- -------- --------- -------------- --------- --------------
        0Employed    72056      60.6           60.6      60.2           60.2
      1Unemployed     3884       3.3           63.9       3.2           63.5
         2Retired    30347      25.5           89.4      25.4           88.8
            3NILF    12642      10.6          100.0      10.6           99.4
             <NA>      725                                0.6          100.0
            Total   119654     100.0          100.0     100.0          100.0

Smoking status

mydata0$smoke <- Recode(mydata0$smoker3, recodes = "1:2='Current'; 3='Former';
                      4='Never smoked'; else=NA", 
    as.factor = T)

mydata0$smoke <- relevel(mydata0$smoke, ref = "Never smoked")
summarytools::freq(mydata0$smoke, round.digits = 1, cumul = T, report.nas = T)

Frequencies  
mydata0$smoke  
Type: Factor  

                       Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
------------------ -------- --------- -------------- --------- --------------
      Never smoked    66616      56.0           56.0      55.7           55.7
           Current    16487      13.9           69.8      13.8           69.5
            Former    35910      30.2          100.0      30.0           99.5
              <NA>      641                                0.5          100.0
             Total   119654     100.0          100.0     100.0          100.0

Objective 2: Fit Poission regression model

Select working variables and filter complete observations

mydata <- mydata0 %>% dplyr::select(alcday5, racegrp, agegrp, sex, edu, employ, 
    smoke, mmsawt, ststr) %>% dplyr::filter(complete.cases(.))

Summarize data

summary(mydata)

    alcday5        racegrp          agegrp          sex       
 Min.   :101   NH White:91781   (0,24] : 7382   Female:59073  
 1st Qu.:107   NH Black: 9485   (24,39]:23435   Male  :57125  
 Median :202   Hispanic: 9036   (39,59]:39226                 
 Mean   :180   Other   : 5896   (59,79]:39722                 
 3rd Qu.:207                    (79,99]: 6433                 
 Max.   :230                                                  
         edu                employ               smoke      
 4Coll Grad:59460   0Employed  :70480   Never smoked:65077  
 0Primary  : 1110   1Unemployed: 3759   Current     :16025  
 1Some HS  : 2879   2Retired   :29603   Former      :35096  
 2HS Grad  :22044   3NILF      :12356                       
 3Some Coll:30705                                           
                                                            
     mmsawt          ststr      
 Min.   :    0   Min.   : 1011  
 1st Qu.:  122   1st Qu.:13051  
 Median :  306   Median :28042  
 Mean   :  686   Mean   :26200  
 3rd Qu.:  748   3rd Qu.:39099  
 Max.   :35484   Max.   :52122

Define complex survey design object

options(survey.lonely.psu = "adjust")
survdes <- svydesign(ids = ~1, strata = ~ststr, weights = ~mmsawt, data = mydata)

Explore sample distribution by race/ethnicity and alcohol consumption days

Absolute distribution

a = table(mydata$alcday5, mydata$racegrp)
addmargins(a)

     
      NH White NH Black Hispanic  Other    Sum
  101     6730      873      934    557   9094
  102     5915      682      592    393   7582
  103     4218      366      307    219   5110
  104     2033      144      141     98   2416
  105     2012       93       90     74   2269
  106      639       26       25     24    714
  107     3751      181      159    154   4245
  201    12920     2056     2075   1102  18153
  202    10086     1497     1398    818  13799
  203     5999      761      705    416   7881
  204     5089      588      591    352   6620
  205     5365      517      471    314   6667
  206     1961      160      176    121   2418
  207     1410      117      113     90   1730
  208     1620      117      147     94   1978
  209      141       14       11      9    175
  210     4279      319      299    222   5119
  211       27        2        4      3     36
  212      844       50       67     46   1007
  213       34        1        4      1     40
  214      276       26       26     14    342
  215     3530      213      215    168   4126
  216       97        2       11      4    114
  217       40        6        1      5     52
  218       90        8        8      4    110
  219        7        3        0      1     11
  220     3625      166      134    152   4077
  221       83        2        7      6     98
  222       82        1        1      1     85
  223       39        4        0      1     44
  224       87        6        4      4    101
  225     1632       49       40     48   1769
  226       84        3        2      3     92
  227       96        6        1      3    106
  228      422       12        9     12    455
  229      172        4        2      9    187
  230     6346      410      266    354   7376
  Sum    91781     9485     9036   5896 116198

Fit Poisson regression

poisson.fit <- svyglm(alcday5 ~ factor(racegrp) + factor(agegrp) + factor(sex) + 
    factor(edu) + factor(employ) + factor(smoke), design = survdes, family = poisson)

summary(poisson.fit)


Call:
svyglm(formula = alcday5 ~ factor(racegrp) + factor(agegrp) + 
    factor(sex) + factor(edu) + factor(employ) + factor(smoke), 
    design = survdes, family = poisson)

Survey design:
svydesign(ids = ~1, strata = ~ststr, weights = ~mmsawt, data = mydata)

Coefficients:
                           Estimate Std. Error t value     Pr(>|t|)    
(Intercept)                5.202104   0.006094  853.59      < 2e-16 ***
factor(racegrp)NH Black    0.001926   0.005380    0.36      0.72032    
factor(racegrp)Hispanic   -0.002394   0.005872   -0.41      0.68346    
factor(racegrp)Other       0.000495   0.007311    0.07      0.94599    
factor(agegrp)(24,39]     -0.011627   0.006199   -1.88      0.06072 .  
factor(agegrp)(39,59]     -0.017805   0.006135   -2.90      0.00371 ** 
factor(agegrp)(59,79]     -0.019309   0.007061   -2.73      0.00625 ** 
factor(agegrp)(79,99]     -0.045230   0.012126   -3.73      0.00019 ***
factor(sex)Male           -0.014553   0.003209   -4.53 0.0000057708 ***
factor(edu)0Primary       -0.007652   0.015779   -0.48      0.62772    
factor(edu)1Some HS       -0.006887   0.010102   -0.68      0.49540    
factor(edu)2HS Grad        0.013108   0.004362    3.01      0.00266 ** 
factor(edu)3Some Coll      0.008221   0.003832    2.15      0.03191 *  
factor(employ)1Unemployed  0.044737   0.007631    5.86 0.0000000046 ***
factor(employ)2Retired     0.011776   0.005490    2.15      0.03195 *  
factor(employ)3NILF       -0.003676   0.005119   -0.72      0.47263    
factor(smoke)Current       0.002360   0.005013    0.47      0.63778    
factor(smoke)Former       -0.007652   0.004040   -1.89      0.05821 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 12.5)

Number of Fisher Scoring iterations: 4

round(exp(summary(poisson.fit)$coef[-1, 1]), 2)

  factor(racegrp)NH Black   factor(racegrp)Hispanic 
                     1.00                      1.00 
     factor(racegrp)Other     factor(agegrp)(24,39] 
                     1.00                      0.99 
    factor(agegrp)(39,59]     factor(agegrp)(59,79] 
                     0.98                      0.98 
    factor(agegrp)(79,99]           factor(sex)Male 
                     0.96                      0.99 
      factor(edu)0Primary       factor(edu)1Some HS 
                     0.99                      0.99 
      factor(edu)2HS Grad     factor(edu)3Some Coll 
                     1.01                      1.01 
factor(employ)1Unemployed    factor(employ)2Retired 
                     1.05                      1.01 
      factor(employ)3NILF      factor(smoke)Current 
                     1.00                      1.00 
      factor(smoke)Former 
                     0.99

Interpretation:

The results suggest that no significant difference in mean count of days of alcohol comsumption by race/ethnicity. Meanwhile, the average count is inversely associated with age with significant reduction in the mean as age increased. The mean count is also lower for males than females. Compared with college graduates, those who are high school graduates and those with some college education had higher mean count of alcohol consumption days. While the retirees had higher mean count of alcohol consumption days than those in other employment groups, no signifcant difference was observed by smoking status.

Evaluate the extent to which our data fit the model

Check for overdispersion

Overdispersion measure tells about variablity in our data and the degree to which our selected model fit the data. If there is more variability than expected in the data, it is overdispersed. In other words, there is overdispersion which basically implies that the model do not fit the data. The main problem with overdispersion is that it leads to standard errors for model parameters that are too small. Thus, if present and not accounted for, overdispersion could lead us to wrongly infer a relationship as being significant when indeed is not!

We will evaluate overdispersion using two easy approaches:

General residuals ratio test:
- compares model residual deviance to model residual degrees of freedom
- the ratio should equal 1 if the model fits the data; implies that the mean and variance are equal
- the two values will be the same, or very similar if the model fits the data
- the larger residual deviance becomes relative to residual degrees of freedom, the more overdispersed the model becomes
Goodness of fit test:
- uses Chi-square distribution of model residual deviance with degrees of freedom (df, n-p) equal to the degrees of freedom
- The test should be insignificant with p-value>0 if the model fits the data
- significant test with p<0 indicates overdispersion

Note that the svyglm() function includes a scaling term for overdispersion, so this is already taken into account. But assuming we have a non-complex survey data without survey design parameters, we can measure overdispersionby using the residual deviance.

To do that, we would refit our model using glm () function and without survey design specification as implemented below.

Refit model

poisson.fit2 = glm(alcday5 ~ factor(racegrp) + factor(agegrp) + factor(sex) + 
    factor(edu) + factor(employ) + factor(smoke), data = mydata0, family = poisson)
summary(poisson.fit2)


Call:
glm(formula = alcday5 ~ factor(racegrp) + factor(agegrp) + factor(sex) + 
    factor(edu) + factor(employ) + factor(smoke), family = poisson, 
    data = mydata0)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
 -7.04   -5.83    1.66    1.99    4.05  

Coefficients:
                           Estimate Std. Error z value Pr(>|z|)    
(Intercept)                5.204404   0.001005 5177.06  < 2e-16 ***
factor(racegrp)NH Black   -0.002689   0.000817   -3.29  0.00100 ** 
factor(racegrp)Hispanic   -0.002716   0.000866   -3.14  0.00170 ** 
factor(racegrp)Other       0.000540   0.001012    0.53  0.59347    
factor(agegrp)(24,39]     -0.010547   0.001017  -10.37  < 2e-16 ***
factor(agegrp)(39,59]     -0.014890   0.000980  -15.19  < 2e-16 ***
factor(agegrp)(59,79]     -0.013283   0.001061  -12.53  < 2e-16 ***
factor(agegrp)(79,99]     -0.030900   0.001431  -21.59  < 2e-16 ***
factor(sex)Male           -0.007996   0.000445  -17.97  < 2e-16 ***
factor(edu)0Primary       -0.005564   0.002332   -2.39  0.01705 *  
factor(edu)1Some HS        0.003872   0.001467    2.64  0.00829 ** 
factor(edu)2HS Grad        0.003547   0.000614    5.78  7.4e-09 ***
factor(edu)3Some Coll      0.007298   0.000538   13.55  < 2e-16 ***
factor(employ)1Unemployed  0.036848   0.001266   29.12  < 2e-16 ***
factor(employ)2Retired     0.005086   0.000699    7.28  3.4e-13 ***
factor(employ)3NILF        0.002659   0.000753    3.53  0.00042 ***
factor(smoke)Current       0.001589   0.000688    2.31  0.02095 *  
factor(smoke)Former       -0.003952   0.000516   -7.66  1.9e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1614201  on 116197  degrees of freedom
Residual deviance: 1612008  on 116180  degrees of freedom
  (3456 observations deleted due to missingness)
AIC: 2423802

Number of Fisher Scoring iterations: 4

Residuals ratio test

# Compare ratios of residuals
resid.ratios = sqrt(poisson.fit2$deviance/poisson.fit2$df.residual)
resid.ratios

[1] 3.72

The ratio exceeds the expected value of 1. This indicates overdispersion as the variance much larger than the mean.

Goodness of fit test

# Use Chi-square of residuals
resid.chisq = 1 - pchisq(poisson.fit2$deviance, df = poisson.fit2$df.residual)
resid.chisq

[1] 0

The value is significant; p-values=0. This further confirms overdispersion and that the model does not fit the data.

Objective 3: Fit Alternative Model, Negative Binomial regression model

In fitting negative binomial models R will not use survey design. Hence, we will fit the model using sample weights. We achieve this by standardizing the weights to equal the sample size rather than the population size by dividing each person’s weight by mean weight. We will also use clustered standard errors to calculate the standard errors for the model which account for homogeneity within strata.

Compare coefficients

Compare coefficients generated with clustered standard errors with coefficient in model with non-clustered specification.

library(sandwich)
coeftest(poisson.fit2, vcov = vcovHC(poisson.fit2, type = "HC1", cluster = "ststr"))


z test of coefficients:

                          Estimate Std. Error z value     Pr(>|z|)    
(Intercept)                5.20440    0.00338 1541.62      < 2e-16 ***
factor(racegrp)NH Black   -0.00269    0.00276   -0.97      0.32980    
factor(racegrp)Hispanic   -0.00272    0.00292   -0.93      0.35142    
factor(racegrp)Other       0.00054    0.00349    0.15      0.87715    
factor(agegrp)(24,39]     -0.01055    0.00340   -3.11      0.00190 ** 
factor(agegrp)(39,59]     -0.01489    0.00328   -4.55 0.0000054445 ***
factor(agegrp)(59,79]     -0.01328    0.00360   -3.69      0.00022 ***
factor(agegrp)(79,99]     -0.03090    0.00511   -6.04 0.0000000015 ***
factor(sex)Male           -0.00800    0.00157   -5.08 0.0000003681 ***
factor(edu)0Primary       -0.00556    0.00816   -0.68      0.49518    
factor(edu)1Some HS        0.00387    0.00502    0.77      0.44072    
factor(edu)2HS Grad        0.00355    0.00215    1.65      0.09890 .  
factor(edu)3Some Coll      0.00730    0.00189    3.87      0.00011 ***
factor(employ)1Unemployed  0.03685    0.00393    9.38      < 2e-16 ***
factor(employ)2Retired     0.00509    0.00252    2.02      0.04349 *  
factor(employ)3NILF        0.00266    0.00258    1.03      0.30296    
factor(smoke)Current       0.00159    0.00244    0.65      0.51408    
factor(smoke)Former       -0.00395    0.00185   -2.14      0.03267 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Fit Negative Binomial regression

negbin.fit1 = glm.nb(alcday5 ~ factor(racegrp), data = mydata, weights = mmsawt/mean(mmsawt, 
    na.rm = T))

negbin.fit2 = glm.nb(alcday5 ~ factor(racegrp) + factor(agegrp) + factor(sex) + 
    factor(edu) + factor(employ) + factor(smoke), data = mydata, weights = mmsawt/mean(mmsawt, 
    na.rm = T))

# clx2(fit.nb2,cluster =sub$ststr)
test1 <- coeftest(negbin.fit1, vcov = vcovHC(negbin.fit2, type = "HC1", cluster = "ststr"))
test2 <- coeftest(negbin.fit2, vcov = vcovHC(negbin.fit2, type = "HC1", cluster = "ststr"))

stargazer(negbin.fit1, negbin.fit2, style = "demography", type = "text", t.auto = F, 
    p.auto = F, coef = list(test1[, 1], test2[, 1]), se = list(test1[, 2], test2[, 
        2]), p = list(test1[, 4], test2[, 4]), font.size = "normalsize", dep.var.caption = c("Outcome: Parity"), 
    dep.var.labels.include = F, align = T, single.row = T, no.space = T, column.sep.width = "4pt", 
    covariate.labels = c("NH Black", "Hispanic", "Other", "25-39", "40-59", 
        "60-79", "80+", "Male", "Primary", "Some HS", "HS Grad", "Some Coll", 
        "Unemployed", "NILF", "Retired", "Current", "Former"), column.labels = c("Model 1, ß(SE)", 
        "Model 2, ß(SE)"), column.separate = c(1, 1), star.cutoffs = c(0.05, 
        0.01, 0.001), object.names = F, model.numbers = F, report = "vcs*", 
    notes.align = "l", notes.label = "Notes:", multicolumn = T, out = "tabnegativebinomial.htm")


--------------------------------------------------
                Model 1, ß(SE)    Model 2, ß(SE)  
--------------------------------------------------
NH Black         0.007 (0.005)     0.002 (0.005)  
Hispanic        -0.002 (0.006)    -0.003 (0.006)  
Other            0.002 (0.007)    0.0004 (0.007)  
25-39                             -0.012 (0.006)  
40-59                            -0.018 (0.006)** 
60-79                            -0.019 (0.007)** 
80+                              -0.045 (0.012)***
Male                             -0.015 (0.003)***
Primary                           -0.008 (0.016)  
Some HS                           -0.007 (0.010)  
HS Grad                           0.013 (0.004)** 
Some Coll                         0.008 (0.004)*  
Unemployed                       0.045 (0.008)*** 
NILF                              0.012 (0.006)*  
Retired                           -0.004 (0.005)  
Current                            0.002 (0.005)  
Former                            -0.008 (0.004)  
Constant       5.180 (0.006)***  5.200 (0.006)*** 
N                   116,198           116,198     
Log Likelihood   -621,903.000      -621,758.000   
theta          12.500*** (0.055) 12.500*** (0.055)
AIC              1,243,814.000     1,243,552.000  
--------------------------------------------------
Notes:         *p < .05; **p < .01; ***p < .001

Model evaluation:

Both the log likelihood and AIC values of model 2 are lower than observed for model 1. This result indicates that model 2 with additional covariates fit the data better than model 1 with race/ethnicity as the only predictor. In other words, model 2 explain more variations in the outcome based on the data than model 1.

Acknowledgements

This content was inspired by Dr. Corey Sparks’ lab session and his easy-to-follow instructional materials accessible at https://rpubs.com/corey_sparks.
The 2016 BRFSS SMART metro area survey data are open source resources made available by the US Centers for Disease Control and Prevention at https://www.cdc.gov/brfss/smart/smart_2016.html.
Stargazer is authored by Marek Hlavac and full document can be downloaded from https://cran.r-project.org/web/packages/stargazer/stargazer.pdf

##########################################################################################################