Research Questions:

What are the facorts associated with preterm birth in Texas counties?

Variables:

  1. Preterm: Number of Preterm birth (Birth before 37weeks of pregnancy)
  2. hpsa10: 0=no shortage, 1=shortage 2=partially shortage of hospitals in a county.

Background:

Preterm birth is always higher compared to national preterm birth rate.On top of that preterm birth rate in Texas has been increase and now it 10.8%. 0.8% higher than total United States preterm birth. Literature says, absence of prenatal care has a impact over preterm birth as well as the race/ethnic backgroud. As most of the preterm birth occures among Black birhs compared to Hispanics and whites. However, for this study I have analysed whether there is a association between preterm birth and counties with shortage of hospitals because hospitals give the prenatal care.

Result:

I found in possion and in alternative models that counties which have shortage of hospitals are more likely to have higher preterm birth rates.

#Load Data

1) Define a count outcome for the dataset of your choosing

In the ARF datatset, preterm birth is a number of preterm deliveries which are before 37 weeks of pregnancy.

a. State a research question about your outcome

What are the facorts associated with preterm birth in Texas counties?

b. Is an offset term necessary? why or why not?

All the county population or number of birth are not same. The number of birth of each county is necessary because we need to find the percentage of preterm birth in each county. Hence we need use off-set term.

Renaming variables

library(dplyr)
files2 <- files%>%
      mutate(cofips=f00004, 
         coname=f00010,
         state = f00011,
         births1416=f1254614,
         births0608=f1254608,
         lowbw1416=f1255314,
         lowbw0608=f1255308,
         preterm = f1360814, 
         childpov10= f1332210,
         rucc= as.factor(f0002013),
         hpsa10= as.factor(f0978710),
         obgyn10_pc= 1000*(f1168410/ f0453010) )%>%
  dplyr::select(births1416, lowbw1416,births0608, lowbw0608, preterm, state, cofips, coname, childpov10, rucc, hpsa10, obgyn10_pc)%>%
  filter(complete.cases(.))%>%
  as.data.frame()


head(files2)
##   births1416 lowbw1416 births0608 lowbw0608 preterm state cofips  coname
## 1        657        62        701        65      92    01  01001 Autauga
## 2       2278       198       2194       189     291    01  01003 Baldwin
## 3        276        33        321        41      50    01  01005 Barbour
## 4        271        28        247        31      41    01  01007    Bibb
## 5        686        55        705        52      81    01  01009  Blount
## 6        136        22        173        24      28    01  01011 Bullock
##   childpov10 rucc hpsa10 obgyn10_pc
## 1       17.5   02      1 0.00000000
## 2       20.2   03      1 0.11521685
## 3       36.2   06      2 0.00000000
## 4       28.6   01      1 0.00000000
## 5       24.1   01      1 0.00000000
## 6       40.3   06      1 0.09162544
summary(files2)
##    births1416       lowbw1416         births0608       lowbw0608      
##  Min.   :    67   Min.   :  11.00   Min.   :    84   Min.   :   11.0  
##  1st Qu.:   268   1st Qu.:  22.00   1st Qu.:   287   1st Qu.:   23.0  
##  Median :   499   Median :  39.00   Median :   521   Median :   43.0  
##  Mean   :  1742   Mean   : 141.01   Mean   :  1810   Mean   :  147.9  
##  3rd Qu.:  1254   3rd Qu.:  99.75   3rd Qu.:  1300   3rd Qu.:  103.0  
##  Max.   :126007   Max.   :8969.00   Max.   :140251   Max.   :10211.0  
##                                                                       
##     preterm           state              cofips             coname         
##  Min.   :   11.0   Length:2238        Length:2238        Length:2238       
##  1st Qu.:   32.0   Class :character   Class :character   Class :character  
##  Median :   58.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :  197.6                                                           
##  3rd Qu.:  142.8                                                           
##  Max.   :11494.0                                                           
##                                                                            
##    childpov10         rucc     hpsa10    obgyn10_pc     
##  Min.   : 2.70   06     :483    :  0   Min.   :0.00000  
##  1st Qu.:17.70   01     :405   0:433   1st Qu.:0.00000  
##  Median :24.10   02     :355   1:834   Median :0.05731  
##  Mean   :24.42   03     :301   2:971   Mean   :0.06926  
##  3rd Qu.:30.10   07     :275           3rd Qu.:0.10382  
##  Max.   :61.10   04     :214           Max.   :0.82115  
##                  (Other):205

Making map

library(tigris)
## To enable 
## caching of data, set `options(tigris_use_cache = TRUE)` in your R script or .Rprofile.
## 
## Attaching package: 'tigris'
## The following object is masked from 'package:graphics':
## 
##     plot
library(sf)
## Linking to GEOS 3.6.1, GDAL 2.2.3, PROJ 4.9.3
library(ggplot2)

options(tigris_class="sf")
usco<-counties(cb=T, year=2015, state = "TX")
usco$cofips<-usco$GEOID
sts<-states(cb = T, year=2015)
sts<-st_boundary(sts)%>%
  filter(STATEFP==48)

files2_m<-geo_join(usco, files2, by_sp="cofips", by_df="cofips",how="left" )
## Warning: Column `cofips` has different attributes on LHS and RHS of join
## Warning: st_crs<- : replacing crs does not reproject data; use st_transform for
## that
files2_m%>%
  filter(STATEFP==48)%>%
  mutate(ppb=preterm/births1416)%>%
  mutate(pb_group = cut(ppb, breaks=quantile(ppb, p=seq(0,1,length.out = 6), na.rm=T ), include.lowest=T ))%>%
  ggplot()+
  geom_sf(aes(fill=pb_group, color=NA))+
  scale_color_brewer(palette = "Blues")+
  scale_fill_brewer(palette = "Blues",na.value = "grey50")+
  geom_sf(data=sts["STATEFP"=="48",], color="black")+
  coord_sf(crs = 102008)+
  ggtitle(label = "Proportion of births that were preterm, 2014-2016")

Interpretation of map

We can see from the map is, end of the southern part of the Texas is under high proportion of preterm birth compared to the total number of birth of 2014-2016.The Coastal parts are mostly has highest preterm birth. In addition we can also see that rural parts in the north west of the Texas counties like Pecos and Reeves are in between the highest rate of preterm births since 2014-2016. Important part to notice here is, majority of the border counties are slightly lower than the higher range.

2) Consider a Poisson regression model for the outcomea. Evaluate the level of dispersion in the outcome. Is the Poisson model a good choice?

files3<-filter(files2_m, is.na(preterm)==F)


fit_pois<- glm(preterm ~ offset(log(births1416)) + hpsa10, 
               family=poisson, 
               data=files3)
summary(fit_pois)
## 
## Call:
## glm(formula = preterm ~ offset(log(births1416)) + hpsa10, family = poisson, 
##     data = files3)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -7.2377  -0.5404   0.1685   0.9559  10.4849  
## 
## Coefficients:
##             Estimate Std. Error  z value Pr(>|z|)    
## (Intercept) -2.15211    0.01514 -142.154  < 2e-16 ***
## hpsa101      0.05352    0.01616    3.312 0.000928 ***
## hpsa102      0.05903    0.01756    3.361 0.000775 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 607.13  on 160  degrees of freedom
## Residual deviance: 594.63  on 158  degrees of freedom
## AIC: 1609.3
## 
## Number of Fisher Scoring iterations: 4
Caution: This model has higher residual deviance compared to degrees of freedom. If the model was appropriate for this data then two of them were supposed to be same. This indicates this model has a issue of overdisperson.

To get the risk ratio we will exponentiate the coefficients.

exp(coef(fit_pois))
## (Intercept)     hpsa101     hpsa102 
##   0.1162382   1.0549800   1.0608022

Interpretation:

Counties which has shortage of hospitals, have a mean preterm birth is 5.4% which is higher than no shortage counties.Whereas counties which have partial shortage of hospitals have mean preterm birth rate of 6% which is higher than no shortage area.

files3$est_pois<- fitted(fit_pois)

files3$est_rate<- files3$est_pois/ files3$births1416

aggregate(est_rate ~ hpsa10, data=files3, FUN = mean)
##   hpsa10  est_rate
## 1      0 0.1162382
## 2      1 0.1226290
## 3      2 0.1233057

Interpretation:

Difference between the mean of preterm birth rate of no shortage area and whole shortage area is 5.4%.

Difference between the mean of preterm birth rate of no shortage area and partially shortage area is 6%.

Doing arithmatic to see the same result.

(0.1226290-0.1162382)/0.1162382
## [1] 0.0549802
(0.1233057-0.1162382)/0.1162382
## [1] 0.06080187

Overdispersion test:

scale<-sqrt(fit_pois$deviance/fit_pois$df.residual)
scale
## [1] 1.939966

If there were no difference between mean and variance then it was to be 1. This tell that there is overdispersion in this model. We can also check it by godness of fit. This will also tell us that by larger that this model is fit for the model. However, if it yields lower p-value, it is not appropriate for this model.

1-pchisq(fit_pois$deviance, df = fit_pois$df.residual)
## [1] 0

The value is lower than the significant level. Therefore we can conclude that poisson is not a appropriate model for this fit.

Let us see whether quasi-poisson model yields p-value more than 1 or not. I am using quasi-poisson model because it includes overdisperson parameter.

fit_q_pois<- glm(preterm ~ offset(log(births1416)) + hpsa10, 
               family=quasipoisson, 
               data=files3)
summary(fit_q_pois)
## 
## Call:
## glm(formula = preterm ~ offset(log(births1416)) + hpsa10, family = quasipoisson, 
##     data = files3)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -7.2377  -0.5404   0.1685   0.9559  10.4849  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.15211    0.02950 -72.960   <2e-16 ***
## hpsa101      0.05352    0.03149   1.700   0.0912 .  
## hpsa102      0.05903    0.03421   1.725   0.0864 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 3.796133)
## 
##     Null deviance: 607.13  on 160  degrees of freedom
## Residual deviance: 594.63  on 158  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 4

Here we see disperson parameter for quasipoissan is 3.7 and substantive difference is similar. I would use this model.

3.Consider a negative binomial model

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
fit_nb<- glm.nb(preterm ~ offset(log(births1416)) + hpsa10, 
               data=files3)
summary(fit_nb)
## 
## Call:
## glm.nb(formula = preterm ~ offset(log(births1416)) + hpsa10, 
##     data = files3, init.theta = 119.2885051, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1721  -0.4639   0.0305   0.6846   2.4871  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.13528    0.02620 -81.491   <2e-16 ***
## hpsa101      0.06136    0.03155   1.945   0.0518 .  
## hpsa102      0.03740    0.03165   1.181   0.2374    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(119.2885) family taken to be 1)
## 
##     Null deviance: 127.26  on 160  degrees of freedom
## Residual deviance: 123.44  on 158  degrees of freedom
## AIC: 1254.9
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  119.3 
##           Std. Err.:  23.5 
## 
##  2 x log-likelihood:  -1246.861

Risk fators are different compared to poisson model.

exp(coef(fit_nb))
## (Intercept)     hpsa101     hpsa102 
##   0.1182118   1.0632823   1.0381035

Interpretation:

6.3% increased risk of preterm birth for counties with shortage of hospitals.

3.8% increased risk of preterm birth for countires with partial shortage of hospitals.

If we look at the AIC value, compared to poisson model, negetive binomial model has much lower AIC value.

4) Compare the model fits of the alternative models using AIC

Relative Risk Model

pbrate<-(sum(files3$preterm, na.rm=T)/sum(files3$births1416, na.rm=T)) 
pbrate
## [1] 0.122197

Which estimates that on average, 12% of births coulb be preterm. We can apply this rate to each county’s births.

files3$E <- pbrate* files3$births1416

head(files3[, c("coname", "births1416", "preterm", "E")])
## Simple feature collection with 6 features and 4 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -98.64623 ymin: 27.83954 xmax: -94.04272 ymax: 33.31201
## epsg (SRID):    4269
## proj4string:    +proj=longlat +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +no_defs
##      coname births1416 preterm         E                       geometry
## 1   Aransas        263      32  32.13782 MULTIPOLYGON (((-96.8229 28...
## 2       Bee        397      48  48.51222 MULTIPOLYGON (((-98.08976 2...
## 3      Cass        360      52  43.99093 MULTIPOLYGON (((-94.65384 3...
## 4     Comal       1510     195 184.51752 MULTIPOLYGON (((-98.64612 2...
## 5     Ellis       2171     237 265.28976 MULTIPOLYGON (((-97.08743 3...
## 6 Galveston       4174     574 510.05042 MULTIPOLYGON (((-94.52566 2...

This shows us the observed number of preterm births, and the expected number.

files3%>%
  ggplot( aes(preterm))+geom_histogram()+ggtitle("Distribution of preterm births", "y")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

files3%>%
  ggplot( aes(preterm/E))+geom_histogram()+ggtitle("Distribution of standardized preterm births", "y/E")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

In order to estimate the model with the expected counts, changed the offset term in the model, otherwise everything is the same:

fit_pois_E<-glm(preterm ~ offset(log(E+.000001)) + hpsa10, 
               family=poisson, 
               data=files3)
summary(fit_pois_E)
## 
## Call:
## glm(formula = preterm ~ offset(log(E + 1e-06)) + hpsa10, family = poisson, 
##     data = files3)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -7.2377  -0.5404   0.1685   0.9559  10.4849  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.04999    0.01514  -3.302 0.000959 ***
## hpsa101      0.05352    0.01616   3.312 0.000928 ***
## hpsa102      0.05903    0.01756   3.361 0.000775 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 607.13  on 160  degrees of freedom
## Residual deviance: 594.63  on 158  degrees of freedom
## AIC: 1609.3
## 
## Number of Fisher Scoring iterations: 4

In fact, these results are identical to those from the Poisson model with the births as the offset term.

Binomial count model

fit_bin<-glm(cbind(preterm, births1416)~  hpsa10, 
               family=binomial, 
               data=files3)
summary(fit_bin)
## 
## Call:
## glm(formula = cbind(preterm, births1416) ~ hpsa10, family = binomial, 
##     data = files3)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -6.8548  -0.5108   0.1594   0.9005   9.8501  
## 
## Coefficients:
##             Estimate Std. Error  z value Pr(>|z|)    
## (Intercept) -2.15211    0.01600 -134.549  < 2e-16 ***
## hpsa101      0.05352    0.01708    3.133  0.00173 ** 
## hpsa102      0.05903    0.01857    3.179  0.00148 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 540.39  on 160  degrees of freedom
## Residual deviance: 529.23  on 158  degrees of freedom
## AIC: 1525.1
## 
## Number of Fisher Scoring iterations: 4

In this model, we see the exact same results compared to the Poisson model. The only difference is the AIC for the binomial model is substantially lower, suggesting that that distribution is preferred for this analysis.