Homework 4 - DEM 7283

Defining a count outcome for the ELS:2002 dataset

For this homework assignment, I will be creating a “count” outcome variable from the ELS:2002 dataset. I have chosen “Sports Participation”. This is a variable which asks the student how many different sports they have participated in during the base year of the survey. In other words, our outcome is the risk of participation in sports activity over the year (1 year) preceeding the survey.

#creating sport count
els$sport_count<-recode(els$bynsport,recodes="99=0;-9=NA;-8=NA;-4=NA")
hist(els$sport_count)

summary(els$sport_count)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   1.000   1.057   2.000   7.000    1848

As we can see, most students have not participated in sports, while there is a good number who have participated in one sport, and a “long tail” of students who have particpated in many sports.

Research Question regarding Sports Participation

I am interested in the effect of non-traditional family structure upon particpation in school sports. My hypothesis is that non-traditional family structure has a negative effect upon sports participation in the year preceeding the survey. I also believe that sports participation over the previous year may be affected by gender (boys may be encouraged to play sports more than girls), race and ethnicity (some ethnicities place great value upon scholastic sports) and family SES (wealthier families may be able to better afford the high costs associated with sports participation).

#non-tradtional families
els<-els %>% mutate(non_trad_family=case_when(.$byfcomp %in% c(2:9) ~ 1,
                                              .$byfcomp==1 ~ 0,
                                              .$byfcomp %in% c(-4,-8,-9) ~ NA_real_))
#race-ethnicity
els<-els %>% mutate(race_eth=case_when(.$byrace %in% c(4:5) ~ "hispanic",
                                       .$byrace==1~"native_american",
                                       .$byrace==2~"asian_pacific",
                                       .$byrace==3~"black",
                                       .$byrace==6~"multirace",
                                       .$byrace==7~"white",
                                       .$byrace %in% c(-4,-8) ~ NA_character_))
els$race_eth<-as.factor(els$race_eth)
els$race_eth<-relevel(els$race_eth,ref="white")


#gender (male)
els<-els %>% mutate(male=case_when(.$bysex==1~1,
                                   .$bysex==2~0,
                                   .$bysex %in% c(-4,-8) ~ NA_real_))

#family SES
els$ses_ordinal<-recode(els$byses1qu,recodes="1=1;2=2;3=3;4=4;else=NA",as.factor.result = T)
els$ses_ordinal<-relevel(els$ses_ordinal,ref="1")
els$ses_number<-recode(els$byses1qu,recodes="1=1;2=2;3=3;4=4;else=NA",as.factor.result = F)

Offset Term: Is it needed?

In this particular case, an offset term is not needed, as the risk of exposure to sports is being measured only over the period of one (1) year. In theory, everyone in this group has the same baseline risk of being exposed to sports participation.

Poisson regression attempt

Let’s try a Poisson regression model to see if non-traditional family structure has a negative effect upon risk of sports participation:

#first, I create my dataset:
data1<-els %>% dplyr::select(stu_id,strat_id,bystuwt,sport_count,non_trad_family,race_eth,male,ses_ordinal,ses_number) %>% filter(complete.cases(.))

#then I create my survey design:
des<-svydesign(ids=~stu_id,strata = ~strat_id,weights=~bystuwt,data=data1)

fit1_poisson<-svyglm(sport_count~non_trad_family+race_eth+male+ses_ordinal,design=des,family=poisson)
summary(fit1_poisson)

## 
## Call:
## svyglm(formula = sport_count ~ non_trad_family + race_eth + male + 
##     ses_ordinal, design = des, family = poisson)
## 
## Survey design:
## svydesign(ids = ~stu_id, strata = ~strat_id, weights = ~bystuwt, 
##     data = data1)
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -0.197214   0.039462  -4.998 5.88e-07 ***
## non_trad_family         -0.097415   0.027647  -3.524 0.000427 ***
## race_ethasian_pacific   -0.320384   0.057549  -5.567 2.64e-08 ***
## race_ethblack            0.109023   0.041491   2.628 0.008608 ** 
## race_ethhispanic        -0.117948   0.043583  -2.706 0.006813 ** 
## race_ethmultirace        0.009589   0.068586   0.140 0.888808    
## race_ethnative_american -0.109781   0.145818  -0.753 0.451545    
## male                     0.173529   0.025426   6.825 9.17e-12 ***
## ses_ordinal2             0.131678   0.042851   3.073 0.002124 ** 
## ses_ordinal3             0.222028   0.041193   5.390 7.16e-08 ***
## ses_ordinal4             0.320851   0.040018   8.018 1.16e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1.688818)
## 
## Number of Fisher Scoring iterations: 6

We can now run the risk ratios for our poisson model:

# poisson risk ratios:
round(exp(summary(fit1_poisson)$coef[-1,1]),3)

##         non_trad_family   race_ethasian_pacific           race_ethblack 
##                   0.907                   0.726                   1.115 
##        race_ethhispanic       race_ethmultirace race_ethnative_american 
##                   0.889                   1.010                   0.896 
##                    male            ses_ordinal2            ses_ordinal3 
##                   1.189                   1.141                   1.249 
##            ses_ordinal4 
##                   1.378

So, students from non-traditional familes are about 10% less likely to participate in school sports than those from traditional families. There is also a positive relationship between family SES and sports participation, and some ethnicities are more likely to participate in school sports than others.

Dispersion parameter: 1.6 (Poisson not a good choice)

According to our model summary, we have a dispersion parameter of 1.6. This means that there is more variability than we would expect for a Poisson model, and therefore the model is incorrect.

This can also be shown as follows:

fit2_poisson<-glm(sport_count~non_trad_family+race_eth+male+ses_ordinal,data=data1,family=poisson)
1-pchisq(fit2_poisson$deviance,df=fit2_poisson$df.residual)

## [1] 0

A p value of “0” means that the model does not fit the data, so the Poisson model is not a good choice.

Negative binomial model

Since the Poisson model is not a good choice, we can try a negative binomial model. To do this, we’ll need to get robust standard errors as follows:

#special thanks to Dr. Sparks and the guy he took this from

clx2 <-   function(fm, dfcw,  cluster){
    # R-codes (www.r-project.org) for computing
    # clustered-standard errors. Mahmood Arai, Jan 26, 2008.
    
    # The arguments of the function are:
    # fitted model, cluster1
    # You need to install libraries `sandwich' and `lmtest'
    
    # reweighting the var-cov matrix for the within model
    require(sandwich);require(lmtest)
    if(class(fm)=="zeroinfl"|class(fm)=="hurdle") {
    M <- length(unique(cluster))   
    N <- length(cluster)           
    K <- dim(fm$vcov)[1]        #here is the rank from the zero inflated fits             
    dfc <- (M/(M-1))*((N-1)/(N-K))  
    uj  <- apply(estfun(fm),2, function(x) tapply(x, cluster, sum, na.rm=T));
    vcovCL <- dfc[1]*sandwich(fm, meat=crossprod(uj)/N)*dfcw #fix a length problem in dfc
    list(summary=coeftest(fm, vcovCL))}
    else if(class(fm)!="zeroinfl"){
    M <- length(unique(cluster))
    N <- length(cluster)
    K <- fm$rank
    dfc <- (M/(M - 1)) * ((N - 1)/(N - K))
    uj <- apply(estfun(fm), 2, function(x) tapply(x, cluster, sum, na.rm=T));
    rcse.cov <- dfc * sandwich(fm, meat = crossprod(uj)/N)
    rcse.se <- coeftest(fm, rcse.cov)
    return(list( rcse.se))}
}

Now we can create new standardized weights:

data1$new_wts<-data1$bystuwt/mean(data1$bystuwt,na.rm=T)
glm_poisson<-glm(sport_count~non_trad_family+race_eth+male+ses_ordinal,data=data1,weights=new_wts,family=poisson)

And test a poisson glm with the new weights:

clx2(glm_poisson,cluster=data1$strat_id)

## Warning in if (class(fm) == "zeroinfl" | class(fm) == "hurdle") {: the
## condition has length > 1 and only the first element will be used

## Warning in if (class(fm) != "zeroinfl") {: the condition has length > 1 and
## only the first element will be used

## [[1]]
## 
## z test of coefficients:
## 
##                           Estimate Std. Error z value  Pr(>|z|)    
## (Intercept)             -0.1972140  0.0434474 -4.5391 5.648e-06 ***
## non_trad_family         -0.0974146  0.0278033 -3.5037 0.0004588 ***
## race_ethasian_pacific   -0.3203839  0.0650782 -4.9231 8.520e-07 ***
## race_ethblack            0.1090229  0.0486204  2.2423 0.0249403 *  
## race_ethhispanic        -0.1179482  0.0492190 -2.3964 0.0165571 *  
## race_ethmultirace        0.0095894  0.0701261  0.1367 0.8912325    
## race_ethnative_american -0.1097808  0.1380761 -0.7951 0.4265701    
## male                     0.1735287  0.0268328  6.4670 9.994e-11 ***
## ses_ordinal2             0.1316777  0.0409648  3.2144 0.0013071 ** 
## ses_ordinal3             0.2220279  0.0428793  5.1780 2.243e-07 ***
## ses_ordinal4             0.3208509  0.0412737  7.7737 7.621e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The coefficients are the same, but the standard errors are considerably higher!

Now we can create a true negative binomial model using “glm.nb”:

fit.nb<-glm.nb(sport_count~non_trad_family+race_eth+male+ses_ordinal,data=data1,weights=new_wts)
clx2(fit.nb,cluster=data1$strat_id)

## Warning in if (class(fm) == "zeroinfl" | class(fm) == "hurdle") {: the
## condition has length > 1 and only the first element will be used

## Warning in if (class(fm) != "zeroinfl") {: the condition has length > 1 and
## only the first element will be used

## [[1]]
## 
## z test of coefficients:
## 
##                          Estimate Std. Error z value  Pr(>|z|)    
## (Intercept)             -0.203265   0.043787 -4.6421 3.449e-06 ***
## non_trad_family         -0.095244   0.027924 -3.4109 0.0006476 ***
## race_ethasian_pacific   -0.321199   0.065794 -4.8819 1.051e-06 ***
## race_ethblack            0.108038   0.048797  2.2140 0.0268258 *  
## race_ethhispanic        -0.121868   0.049737 -2.4503 0.0142753 *  
## race_ethmultirace        0.010758   0.072364  0.1487 0.8818137    
## race_ethnative_american -0.112249   0.136993 -0.8194 0.4125700    
## male                     0.179900   0.027439  6.5563 5.516e-11 ***
## ses_ordinal2             0.135438   0.041027  3.3012 0.0009626 ***
## ses_ordinal3             0.224199   0.042926  5.2229 1.761e-07 ***
## ses_ordinal4             0.324678   0.041280  7.8652 3.684e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Which model is better?

Here, we see the AIC for our original model:

AIC(fit2_poisson)

## [1] 41830.34

The AIC for the Poisson glm with standardized weights:

AIC(glm_poisson)

## [1] 41591.61

And, finally, the AIC for our negative-binomial model:

AIC(fit.nb)

## [1] 39742.88

Clearly, the negative-binomial model has a better AIC which means that we will use the negative-binomial model to analyze our results.

Conclusion

Negative binomial odds ratios:

nb.coef<-round(exp(summary(fit.nb)$coef[-1,1]),3)
binded<-data.frame(cbind(round(exp(coef(fit.nb)),3),round(exp(confint(fit.nb)),3)))

## Waiting for profiling to be done...

names(binded)<-c("Odds_Ratio","2.5%","97.5%")
print(binded)

##                         Odds_Ratio  2.5% 97.5%
## (Intercept)                  0.816 0.767 0.868
## non_trad_family              0.909 0.869 0.951
## race_ethasian_pacific        0.725 0.645 0.814
## race_ethblack                1.114 1.045 1.187
## race_ethhispanic             0.885 0.829 0.945
## race_ethmultirace            1.011 0.911 1.121
## race_ethnative_american      0.894 0.712 1.118
## male                         1.197 1.148 1.249
## ses_ordinal2                 1.145 1.075 1.220
## ses_ordinal3                 1.251 1.175 1.333
## ses_ordinal4                 1.384 1.297 1.476

We see from the negative-binomial model that non-traditional families are roughly 9.1% less likely to have participated in college sports during the year previous to the survey, with race, gender and SES status being held equal.