Counties in North Carolina

Counties in North Carolina

This data reflects all births recorded in North Carolina in 2012. We will analyze the data several ways. The main question we would like to answer is:

Does early prenatal care (beginning before month 5) decrease the risk of preterm birth (birth at less than 37 weeks)?

Initial Data Exploration

First, let’s explore the data we have. First, we see that our dataset has 122,513 observations and 127 variables. We can also produce simple five number summaries of some of our variables. Shown here are weeks of gestation (wksgest), month that prenatal care began (mdif), and maternal age (mage). We are also interested in how much data is missing in these dataset. For example, for month of prenatal care, there are 2155 missing observations.

## [1] 122513    117
##     wksgest           mdif             mage      
##  Min.   :18.00   Min.   : 1.000   Min.   :10.00  
##  1st Qu.:38.00   1st Qu.: 2.000   1st Qu.:23.00  
##  Median :39.00   Median : 3.000   Median :27.00  
##  Mean   :38.83   Mean   : 6.183   Mean   :27.57  
##  3rd Qu.:40.00   3rd Qu.: 4.000   3rd Qu.:32.00  
##  Max.   :99.00   Max.   :99.000   Max.   :99.00

Data Visualization

It is often easier to interpret this data when shown visually. Here, we produce some simple graphs and tables to explore the data.

Formatting Variables

After getting an overview of our data, we’ll transform some variables so they make more sense. We’ll remove missing variables, label 0s and 1s appropriately, and re-code variable names. Here is an example of the work we’ll do on the race and ethnicity variables.

##       
##            N     U     Y  <NA>
##   1    68995    47  3026     0
##   2    28842    12   487     0
##   3     1645     0    81     0
##   4     4877    12 14489     0
##   <NA>     0     0     0     0
mrace methnic raceeth_f n
1 N White non-Hispanic 68995
1 U Other 47
1 Y White Hispanic 3026
2 N African American 28842
2 U African American 12
2 Y African American 487
3 N American Indian or Alaska Native 1645
3 Y American Indian or Alaska Native 81
4 N Other 4877
4 U Other 12
4 Y Other 14489

Coding our Exposure and Outcome Variables

This code shows how we are defining our exposure and outcome variable. Prenatal care begins at or before month 5 of pregnancy, and preterm birth is defined as 37 weeks or less of gestation.

# Create new variable for prenatal care. It is dichotomous. If 5 or less, then code it 1 (yes) and if not, 0 (no)
births$pnc5 <- ifelse(births$mdif <= 5, 1, 0)

# Label it
births$pnc5_f <- factor(births$pnc5, levels=c(0,1), labels=c("No Early PNC", "Early PNC"))

# Create new variable for preterm births that are greater/equal to 20 and less than 37 weeks (1, yes)
births$preterm <- ifelse(births$wksgest >= 20 & births$wksgest < 37, 1,
                         ifelse(births$wksgest >= 37 & births$wksgest < 99, 0, NA)) 
# Label it
births$preterm_f <- factor(births$preterm, levels=c(0,1), labels=c("term", "preterm"))

Create Table One

Now we’ll take a look at our outcome by a few variables of interest - maternal age, prenatal care before/after month 5, smoking status, and sex. The CreateTableOne package automatically runs a chi-squared test of significance on each variable against the outcome. Here, we see that all of our variables have a statistically significant impact on preterm birth.

##                         Stratified by preterm_f
##                          term           preterm       p      test
##   n                      108182         14155                    
##   mage (mean (SD))        27.54 (5.91)  27.72 (6.39)   0.001     
##   pnc5_f = Early PNC (%)  97113 (91.2)  12267 (89.1)  <0.001     
##   smoker_f = Smoker (%)   11140 (10.3)   1787 (12.7)  <0.001     
##   sex_f = Female (%)      53304 (49.3)   6612 (46.7)  <0.001

Creating Exclusion Criteria

We’ll set some inclusion and exclusion criteria for this large dataset. I’ve included the full code below, but much of it is commented out so that it wouldn’t run through RMarkdown (it would take too long to process).

# Some of these are commented out, becuase lubridate doesn't work in RMarkdown

# Create many inclusion variables, all of which have to be =1 to be included.
# births$weeknum = births$dob_d %>% week

# Has gestation data (weeks), and at least 20 weeks gestation pre-birth (weeks >= 20)
births$wksgest[births$wksgest == 99] = NA
births$incl_hasgest = !is.na(births$wksgest) %>% as.numeric

# Create incl_enoughgest in the births dataset to be a 1 if the birth has >= 20 weeks of gestation, 0 if otherwise.
# Less than 20 weeks gestation before Jan 1 of year (LMP > Aug 20), or weeks-weeknum>=19, w/ weeknum=1 for Jan 1-7
births$incl_enoughgest = as.numeric(births$wksgest >= 20)

# Start date of gestation was 45 (max gest in'11) w prior to Jan 1 of next year, so all births are observed in year
# births$incl_lateenough = as.numeric(births$wksgest - births$weeknum <= 19)

# Create incl_earlyenough to be 1 when weeknum - wksgest <= 7, 0 otherwise.
# births$incl_earlyenough = as.numeric(births$weeknum - births$wksgest <=7)

# Create incl_singleton to be 1 when plur is 1, 0 otherwise
births$incl_singleton = as.numeric(births$plur == 1)

# .................

# Create congenital anomaly variable using apply()
# Create new dataframe (outside of births!) with all the variables that indicate a congenital abnormality
congen_anom = c("anen","mnsb","cchd", "cdh", "omph","gast", "limb", "cl","cp","dowt","cdit", "hypo")
# births[,congen_anom] %>% head
# births[,congen_anom] %>% map(table, useNA = "always")

# if all congenital anomaly variables are "N", 0 otherwise
#births$incl_noanomalies = as.numeric(apply(births[,congen_anom]=="N", FUN=all, 1)) #All not present

#births$incl_noanomalies %>% head #shows us that most are 1, have none of the abnormalities

#births$include_allpass = apply(births[, grepl("incl_", names(births))], FUN=all, MARGIN = 1) #(HW2.1.Q4d indicator for the inclusion criteria yes/no)

# old_births <- births #save the dataset, because we are about to do some more stuff to births
old_births = read_csv("old_births.csv")

# now write over births to only be the births that pass the inclusion criteria. this will drop rows / cases
# births = births[births$include_allpass, ]

dim(births) # 62,370 observations with all our inclusion criteria
## [1] 122513    127
dim(old_births) #122,513 observations
## [1] 122513    121

Variables of Interest

We’ll create a new dataset with only 12 variables we’re most interested in.

## Parsed with column specification:
## cols(
##   X1 = col_double(),
##   pnc5_f = col_character(),
##   preterm_f = col_character(),
##   smoker_f = col_character(),
##   wksgest = col_double(),
##   sex = col_double(),
##   meduc = col_double(),
##   mage = col_double(),
##   raceeth_f = col_character(),
##   weeknum = col_double(),
##   include_allpass = col_logical(),
##   cores = col_double(),
##   kotel = col_double()
## )

We’ll create a new Table One with the smaller data set with only our main variables of interest.

##                                      
##                                       Overall      
##   n                                   62370        
##   pnc5_f (%)                                       
##      Early PNC                        56087 (89.9) 
##      No Early PNC                      5442 ( 8.7) 
##      NA                                 841 ( 1.3) 
##   preterm_f = term (%)                56256 (90.2) 
##   smoker_f (%)                                     
##      Non-smoker                       55702 (89.3) 
##      Smoker                            6642 (10.6) 
##      NA                                  26 ( 0.0) 
##   raceeth_f (%)                                    
##      African American                 14840 (23.8) 
##      American Indian or Alaska Native   884 ( 1.4) 
##      Other                            10013 (16.1) 
##      White Hispanic                    1581 ( 2.5) 
##      White non-Hispanic               35052 (56.2) 
##   wksgest (mean (SD))                 38.91 (2.44) 
##   mage (mean (SD))                    27.52 (5.98) 
##   meduc (mean (SD))                    2.79 (1.23)

Plotting

Weeks Gestation

Now we’ll visually represent some of this analysis. The first chart shows that the majority of births occur right before the 40 week mark. The second chart shows distribution of weeks of gestation by birth outcome, but also overlays the factors of race/ethnicity (x-axis, top) and maternal education (y-axis, right).

Maternal Age

It’s possibly to print out the number and percent of preterm births at every level of maternal age. However, this data is easier to interpret in a graph. Here we see that women of lower maternal age (less than about 18) and higher maternal age (move than 35) are more likely to have preterm birth.

mage n pct_earlyPNC pct_preterm
10 1 0.00000 0.00000
11 1 100.00000 0.00000
12 5 80.00000 0.00000
13 23 65.21739 22.72727
14 102 69.60784 24.75248
15 355 77.39130 19.54674
16 846 82.89157 15.52133
17 1665 84.67692 13.88221
18 2877 86.47121 13.21018
19 4490 86.73330 11.76995
20 5539 87.15207 11.49882
21 6244 87.49594 11.89484
22 6564 88.74049 11.17889
23 6581 89.66528 12.07553
24 6533 89.99844 10.89488
25 6625 90.44831 10.26921
26 6716 90.78208 10.54984
27 7084 91.86864 10.76227
28 6953 92.61105 10.37613
29 7005 93.10545 10.62187
30 6934 93.13625 10.81588
31 6670 92.94511 10.80757
32 6026 93.52664 11.08158
33 5361 93.53800 11.18372
34 4681 93.65875 11.14677
35 4034 93.22930 12.60233
36 3266 92.90162 12.72226
37 2642 92.92852 12.88367
38 2030 92.53133 13.90533
39 1570 91.88312 13.88535
40 1152 92.68078 15.67944
41 806 92.99363 17.08229
42 510 91.38277 16.96252
43 293 91.57895 20.47782
44 156 87.41722 17.30769
45 64 87.09677 26.56250
46 36 91.66667 22.22222
47 26 92.30769 38.46154
48 8 100.00000 50.00000
49 6 83.33333 50.00000
50 5 100.00000 60.00000
51 2 100.00000 0.00000
52 2 50.00000 0.00000
54 1 NaN 0.00000
55 1 100.00000 0.00000
NA 22 59.09091 33.33333

Risk

We’ll now calculate the interactions between early start to prenatal care and the birth outcome. First, we’ll draw out the 2x2 tables with the absolute values and the percentages.

preterm term
Early PNC 5290 50797
No Early PNC 696 4746
preterm term
Early PNC 0.0943178 0.9056822
No Early PNC 0.1278942 0.8721058
What is the crude risk difference?
pt[2,1]-pt[1,1] # No early PNC/preterm - early PNC/preter
## [1] 0.0335764

Interpretation: The population who did not recieve early prenatal care had 3.3 additional preterm births out of 100 people compared to the population that did have early prenatal care.

What is the crude risk ratio?
pt[2,1]/pt[1,1]
## [1] 1.355992

Interpretation: The risk of preterm birth among people who did not recieve early prenatal care is 1.35 the risk of preterm birth among people who did recieve early prenatal care.

What is the odds ratio?
(pt[2,1]/pt[1,1])/(pt[2,2]/pt[1,2])
## [1] 1.408199

Interpretation: The odds of preterm birth among women who did not recieve early prenatal care were 1.4 the odds of preterm birth among women who did recieve prenatal care.

What is the attributable proportion?
((pt[2,1]-pt[1,1]) / pt[2,1])*100
## [1] 26.25327

Interpretation: 26% of the preterm births in this population could be attributed to lack of early prenatal care.

Looking by Race/Ethnicity

race_preterm = table(births_62_sm$raceeth_f, births_62_sm$preterm_f)
knitr::kable(race_preterm)
preterm term
African American 2049 12791
American Indian or Alaska Native 123 761
Other 971 9042
White Hispanic 128 1453
White non-Hispanic 2843 32209
pt2 = prop.table(table(births_62_sm$pnc5_f, births_62_sm$preterm_f, births_62_sm$raceeth_f),
                 margin = c(1,3)) # not sure what margin does, but it changes the outcome
pt2_df = data.frame(pt2); names(pt2_df) = c("exposure", "outcome", "group", "estimate")

knitr::kable(pt2_df)
exposure outcome group estimate
Early PNC preterm African American 0.1340852
No Early PNC preterm African American 0.1623094
Early PNC term African American 0.8659148
No Early PNC term African American 0.8376906
Early PNC preterm American Indian or Alaska Native 0.1328225
No Early PNC preterm American Indian or Alaska Native 0.1744186
Early PNC term American Indian or Alaska Native 0.8671775
No Early PNC term American Indian or Alaska Native 0.8255814
Early PNC preterm Other 0.0954092
No Early PNC preterm Other 0.1120968
Early PNC term Other 0.9045908
No Early PNC term Other 0.8879032
Early PNC preterm White Hispanic 0.0757030
No Early PNC preterm White Hispanic 0.1142857
Early PNC term White Hispanic 0.9242970
No Early PNC term White Hispanic 0.8857143
Early PNC preterm White non-Hispanic 0.0782831
No Early PNC preterm White non-Hispanic 0.1064133
Early PNC term White non-Hispanic 0.9217169
No Early PNC term White non-Hispanic 0.8935867

Modelling

Creating 4 distinct models

We’ll create four models with different variables.

m1 = m_crude_rd

m2 = glm(data=births, preterm_f ~ pnc5_f + smoker_f, family=binomial(link=“identity”)) model_results = rbind(m_crude_results, data.frame(model=“M2: M1+smoking”, cbind(broom::tidy(m2), confint(m2)), stringsAsFactors = F)[2,])

m3 = glm(data=births, preterm_f ~ pnc5_f + smoker_f + mage, family=binomial(link=“identity”)) model_results = rbind(model_results, data.frame(model=“M3: M2+mage”, cbind(broom::tidy(m3), confint(m3)), stringsAsFactors = F)[2,])

m4 = glm(data=births, preterm_f ~ pnc5_f + smoker_f + mage + mage_sq, family=binomial(link=“identity”))

Observations 120214 (2299 missing obs. deleted)
Dependent variable preterm_f
Type Generalized linear model
Family binomial
Link identity
χ²(1) 67.4
Pseudo-R² (Cragg-Uhler) 0.0
Pseudo-R² (McFadden) 0.0
AIC 85528.9
BIC 85548.3
Est. 2.5% 97.5% z val. p
(Intercept) 0.1 0.1 0.1 41.9 0.0
pnc5_fEarly PNC -0.0 -0.0 -0.0 -7.8 0.0
Standard errors: MLE
## [1] "term"    "preterm"
## # A tibble: 3 x 8
##   term      estimate std.error statistic  p.value conf.low conf.high model_name 
##   <chr>        <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl> <chr>      
## 1 pnc5_fEa~  -0.0270   0.00346     -7.82 5.45e-15  -0.0339   -0.0203 M1: Crude  
## 2 pnc5_fEa~  -0.0251   0.00346     -7.25 4.27e-13  -0.0320   -0.0184 M2: M1 + s~
## 3 pnc5_fEa~  -0.0259   0.00347     -7.46 8.54e-14  -0.0328   -0.0192 M3: M2 + m~

Best Model

glm(data=births, preterm_f ~ pnc5_f + smoker_f, family=binomial(link="identity")) #Discussion of model spec
## 
## Call:  glm(formula = preterm_f ~ pnc5_f + smoker_f, family = binomial(link = "identity"), 
##     data = births)
## 
## Coefficients:
##     (Intercept)  pnc5_fEarly PNC   smoker_fSmoker  
##         0.13487         -0.02509          0.02380  
## 
## Degrees of Freedom: 120148 Total (i.e. Null);  120146 Residual
##   (2364 observations deleted due to missingness)
## Null Deviance:       85530 
## Residual Deviance: 85400     AIC: 85410
mfull_emm_rd = glm(data=births,
                   preterm_f ~ pnc5_f * raceeth_f +
                     smoker_f + mage + mage_sq,
                   family=binomial(link="identity"))

# summary(mfull_emm_rd)

broom::tidy(mfull_emm_rd) #this is the best output
## # A tibble: 13 x 5
##    term                                    estimate std.error statistic  p.value
##    <chr>                                      <dbl>     <dbl>     <dbl>    <dbl>
##  1 (Intercept)                              2.95e-1 0.0189      15.6    1.32e-54
##  2 pnc5_fEarly PNC                         -1.78e-2 0.00515     -3.46   5.47e- 4
##  3 raceeth_fAfrican American                5.82e-2 0.00794      7.32   2.49e-13
##  4 raceeth_fOther                           1.79e-3 0.00823      0.218  8.28e- 1
##  5 raceeth_fWhite Hispanic                 -2.69e-3 0.0178      -0.151  8.80e- 1
##  6 raceeth_fAmerican Indian or Alaska Nat~  7.81e-2 0.0300       2.60   9.27e- 3
##  7 smoker_fSmoker                           2.81e-2 0.00320      8.77   1.79e-18
##  8 mage                                    -1.49e-2 0.00133    -11.2    5.27e-29
##  9 mage_sq                                  2.85e-4 0.0000235   12.2    5.34e-34
## 10 pnc5_fEarly PNC:raceeth_fAfrican Ameri~ -1.38e-3 0.00833     -0.165  8.69e- 1
## 11 pnc5_fEarly PNC:raceeth_fOther           9.49e-3 0.00862      1.10   2.71e- 1
## 12 pnc5_fEarly PNC:raceeth_fWhite Hispanic -4.08e-4 0.0187      -0.0218 9.83e- 1
## 13 pnc5_fEarly PNC:raceeth_fAmerican Indi~ -4.09e-2 0.0313      -1.31   1.91e- 1

Spatial Information

General maps

We’ll run in a map of North Carolina and transform it based on the Coordinate Reference System.

## Reading layer `nc_counties_simplified' from data source `C:\Users\LBQ8\OneDrive - CDC\R for epi 2020 data pack\NC-Counties-Simple\nc_counties_simplified.shp' using driver `ESRI Shapefile'
## Simple feature collection with 100 features and 19 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: -84.32187 ymin: 33.7529 xmax: -75.40039 ymax: 36.58816
## geographic CRS: WGS 84

More detailed geographic data with our variables of interest (though, without labels).

Adding economic data

We can “harvest” economic data from the web. In this case, tier 3 represents the most developed counties, and tier 1 represents the least developed.

##   [1] "County"       "Tier"         "Alamance"     "2"            "Alexander"   
##   [6] "2"            "Alleghany"    "2"            "Anson"        "1"           
##  [11] "Ashe"         "2"            "Avery"        "2"            "Beaufort"    
##  [16] "1"            "Bertie"       "1"            "Bladen"       "1"           
##  [21] "Brunswick"    "3"            "Buncombe"     "3"            "Burke"       
##  [26] "2"            "Cabarrus"     "3"            "Caldwell"     "1"           
##  [31] "Camden"       "2"            "Carteret"     "2"            "Caswell"     
##  [36] "1"            "Catawba"      "2"            "Chatham"      "3"           
##  [41] "Cherokee"     "2"            "Chowan"       "1"            "Clay"        
##  [46] "2"            "Cleveland"    "1"            "Columbus"     "1"           
##  [51] "Craven"       "2"            "Cumberland"   "1"            "Currituck"   
##  [56] "3"            "Dare"         "2"            "Davidson"     "2"           
##  [61] "Davie"        "3"            "Duplin"       "1"            "Durham"      
##  [66] "3"            "Edgecombe"    "1"            "Forsyth"      "2"           
##  [71] "Franklin"     "2"            "Gaston"       "2"            "Gates"       
##  [76] "2"            "Graham"       "1"            "Granville"    "2"           
##  [81] "Greene"       "1"            "Guilford"     "2"            "Halifax"     
##  [86] "1"            "Harnett"      "2"            "Haywood"      "3"           
##  [91] "Henderson"    "3"            "Hertford"     "1"            "Hoke"        
##  [96] "2"            "Hyde"         "1"            "Iredell"      "3"           
## [101] "Jackson"      "2"            "Johnston"     "3"            "Jones"       
## [106] "1"            "Lee"          "2"            "Lenoir"       "1"           
## [111] "Lincoln"      "3"            "Macon"        "2"            "Madison"     
## [116] "2"            "Martin"       "1"            "McDowell"     "2"           
## [121] "Mecklenburg"  "3"            "Mitchell"     "1"            "Montgomery"  
## [126] "2"            "Moore"        "3"            "Nash"         "1"           
## [131] "New Hanover"  "3"            "Northampton"  "1"            "Onslow"      
## [136] "1"            "Orange"       "3"            "Pamlico"      "2"           
## [141] "Pasquotank"   "1"            "Pender"       "3"            "Perquimans"  
## [146] "1"            "Person"       "2"            "Pitt"         "1"           
## [151] "Polk"         "2"            "Randolph"     "2"            "Richmond"    
## [156] "1"            "Robeson"      "1"            "Rockingham"   "1"           
## [161] "Rowan"        "2"            "Rutherford"   "1"            "Sampson"     
## [166] "1"            "Scotland"     "1"            "Stanly"       "2"           
## [171] "Stokes"       "2"            "Surry"        "2"            "Swain"       
## [176] "1"            "Transylvania" "2"            "Tyrrell"      "1"           
## [181] "Union"        "3"            "Vance"        "1"            "Wake"        
## [186] "3"            "Warren"       "1"            "Washington"   "1"           
## [191] "Watauga"      "3"            "Wayne"        "1"            "Wilkes"      
## [196] "1"            "Wilson"       "1"            "Yadkin"       "2"           
## [201] "Yancey"       "2"

Data by economic tier

cores n county_name FIPS econ_tier county_name_ord variable value
1 1757 Alamance 37001 2 Alamance pct_earlyPNC 90.36697
3 386 Alexander 37003 2 Alexander pct_earlyPNC 91.96891
5 88 Alleghany 37005 1 Alleghany pct_earlyPNC 84.70588
7 244 Anson 37007 1 Anson pct_earlyPNC 84.01639
9 256 Ashe 37009 1 Ashe pct_earlyPNC 92.88538
11 143 Avery 37011 2 Avery pct_earlyPNC 96.50350

Interactive chart of Preterm Birth, Early Prenatal Care, and Economic Tier by County

This graph, which you can zoom in on using toggles in the upper-right hand corner, allows us to see the interaction between many of our variable of interest. It includes our exposure and outcome variable, spatial information by county, and economic tier of the county. Hovering over a point will bring up detailed information for that county.