Author: YJ Choi
Date: 2019-06-10

Introduction

This is a markdown file to reproduce analysis and results for a published paper: Does age-adjusted measurement of contraceptive use better explain the relationship between fertility and contraception? (Demographic Research 2018 Dec 39;45:1227-1240). The abstract, the complete paper, and full publication information is available here.

There are three sections in this document for: accessing public data used for the study; preparing the analysis data file (i.e., data processing/manipulation); and conducting analysis. Analysis section contains code that is only relevant to reproduce results presented in the paper. The paper was prepared using Stata, and a Stata do file to reproduce the results is available at GitHub.

"R code is shown in a gray box"

Output is in white box. (Not all reults are shown in the output document. To see results, simply remove the argument "results=FALSE" from a code chunk)

1. Data Access

Data came from all available standard Demographic and Health Surveys. The analysis dataset (i.e., estimates for select indicators from multiple surveys) was obtained in September 2017 through the DHS API (http://api.dhsprogram.com). The indicators are:
- Total fertility rate (TFR) (15-49),
- Contraceptive prevalence rate (CPR) among all women, and
- union status (i.e., percentage of women 15-49 years who are currently married or live with a partner).

1.1. Define relevant indicators for the study

First, get relevant indicator data from DHS API. More information on indicators available at DHS API is here. For this study, the following indicators were used.
- FE_FRTR_W_TFR
- FP_CUSA_W_ANY
- MA_MSTA_W_UNI

1.2. Call API data for each indicator

See more information about calling DHS API data here: (http://rpubs.com/YJ_Choi/HowTo_DHSAPI_Introduction). Here APIkey is used to overcome the per page limit.

suppressPackageStartupMessages(library(jsonlite)) 
suppressPackageStartupMessages(library(data.table))
suppressPackageStartupMessages(library(dplyr))

#set and use your DHS API key as environment 
Sys.getenv("MYDHSAPIKEY")

#call and save individual indicator data here 
#this code uses API key, "DUMMY-123456". Replace it with your own valid key 

#FE_FRTR_W_TFR  
url<-(paste0("http://api.dhsprogram.com/rest/dhs/data?f=json&indicatorIds=FE_FRTR_W_TFR&breakdown=all&perpage=20000&APIkey=", "DUMMY-123456"))
jsondata<-fromJSON(url) 
dta<-data.table(jsondata$Data)
dta<-select(dta, CountryName, SurveyId, Value, 
            CharacteristicCategory, CharacteristicLabel)   
API_FE_FRTR_W_TFR<- dta %>% rename(FE_FRTR_W_TFR=Value)

#FP_CUSA_W_ANY  
url<-(paste0("http://api.dhsprogram.com/rest/dhs/data?f=json&indicatorIds=FP_CUSA_W_ANY&breakdown=all&perpage=20000&APIkey=", "DUMMY-123456"))
jsondata<-fromJSON(url) 
dta<-data.table(jsondata$Data)
dta<-select(dta, CountryName, SurveyId, Value, 
            CharacteristicCategory, CharacteristicLabel)
API_FP_CUSA_W_ANY<- dta %>% rename(FP_CUSA_W_ANY=Value)

#MA_MSTA_W_UNI      
url<-(paste0("http://api.dhsprogram.com/rest/dhs/data?f=json&indicatorIds=MA_MSTA_W_UNI&breakdown=all&perpage=20000&APIkey=", "DUMMY-123456"))
jsondata<-fromJSON(url) 
dta<-data.table(jsondata$Data)
dta<-select(dta, CountryName, SurveyId, Value, 
            CharacteristicCategory, CharacteristicLabel)    
API_MA_MSTA_W_UNI<- dta %>% rename(MA_MSTA_W_UNI=Value)

1.3. Merge the individual indicator datasets

Then, merge all three datasets into one dataset.

#check the three datasets
dim(API_FE_FRTR_W_TFR)
dim(API_FP_CUSA_W_ANY)
dim(API_MA_MSTA_W_UNI)
table(API_FE_FRTR_W_TFR$CharacteristicCategory)
table(API_FP_CUSA_W_ANY$CharacteristicCategory)
table(API_MA_MSTA_W_UNI$CharacteristicCategory)

suppressPackageStartupMessages(library(dplyr))

#recode total labeled differently in API_MA_MSTA_W_UNI
API_MA_MSTA_W_UNI<-mutate(API_MA_MSTA_W_UNI,
    CharacteristicCategory = ifelse(CharacteristicCategory == "Total 15-49", "Total", CharacteristicCategory),  
    CharacteristicLabel = ifelse(CharacteristicLabel == "Total 15-49", 
                        "Total", CharacteristicLabel)
)

#merge all three indicator datasets here
dtaapi<-full_join(API_FE_FRTR_W_TFR, API_FP_CUSA_W_ANY, 
                  by =c("CountryName", "SurveyId", "CharacteristicCategory", "CharacteristicLabel") ) 
dim(dtaapi)
names(dtaapi)
table(dtaapi$CharacteristicCategory)

dtaapi<-full_join(dtaapi, API_MA_MSTA_W_UNI, 
                  by =c("CountryName", "SurveyId", "CharacteristicCategory", "CharacteristicLabel") ) 
dim(dtaapi)
names(dtaapi)
table(dtaapi$CharacteristicCategory)

As of 2019-06-10, the API calls generated 11831 observations - i.e., subgroup-specific estimates - from 316 surveys. Sort the data by “SurveyId”, “CharacteristicCategory”, “CharacteristicLabel” to explore the structure further.

2. Data Management

2.1. Tidy the dataset

suppressPackageStartupMessages(library(dplyr)) 
suppressPackageStartupMessages(library(Hmisc))

#rename
dta<-dtaapi %>%
    rename (cpr_all =   FP_CUSA_W_ANY) %>%
    rename (tfr =   FE_FRTR_W_TFR) %>%
    rename (inunion =   MA_MSTA_W_UNI) %>%
    rename (country =   CountryName) %>%
    rename (group   =   CharacteristicCategory) %>% 
    rename (grouplabel  =   CharacteristicLabel) 
colnames(dta)<-tolower(names(dta))

#check "group"
table(dta$group)

#recode 
dta<-mutate(dta,
    group = ifelse(group == "Total 15-49", "Total", group),     
    grouplabel = ifelse(grouplabel == "Total 15-49", 
                        "Total", grouplabel)
)

#create basic variables
dta<-mutate(dta,
    year=as.numeric(substr(surveyid,3,6)),
    type=substr(surveyid,7,9), 
    period2=year>=2001
)

label(dta$year)<- "year of survey" 
label(dta$type)<- "type of survey"
label(dta$cpr_all)<- "CPR among all women, regardless of marital status"    
label(dta$period2)<- "1985-2000 vs. 2001-2015"

dim(dta)

Unit of current data set is a survey-subgroup specific estimate. We only need estimates among all women and estimates by age group and overall. Thus, drop all estimates by subgroup other than age. Also, drop all estimates from surveys that are not standard DHS.

table(dta$group)
table(dta$type)
summary(dta$cpr_all)

dta<-dta %>% 
    filter(group=="Total" | group=="Age (5-year groups)") %>% 
    filter(type=="DHS") %>% 
    filter(!is.na(cpr_all))
dim(dta)

2.2. Create regional category variables

Create regional variables, following UN Statistics Division’s classification: the M49 standard (https://unstats.un.org/unsd/methodology/m49/). Here, we first scrape the webpage to get the data table of countries by geographic regions and then merge it with the analysis dataset, dta. For more information, see here: (http://rpubs.com/YJ_Choi/AssignRegionalClassification)

Scrape regional classification data.

suppressPackageStartupMessages(library(rvest))

# data table scraping from the web
ctry_UNSD<-read_html("https://unstats.un.org/unsd/methodology/m49/") %>% 
    html_nodes("table") %>%
    .[[7]] %>% 
    html_table(header = TRUE)     
# tidy up
ctry_UNSD<-ctry_UNSD %>%
    rename (country =   "Country or Area") %>% 
    rename (M49 =   "M49 code") %>% 
    rename (ISOalpha3   =   "ISO-alpha3 code") %>% 
    select(country, M49, ISOalpha3) 
ctry_UNSD$country<-as.character(ctry_UNSD$country)
ctry_UNSD<-ctry_UNSD %>% 
    mutate(
    UNSDsubregion=country,
    UNSDsubregion=ifelse(ISOalpha3!="", "", UNSDsubregion)
    )
for (i in 1:nrow(ctry_UNSD)){
    if (ctry_UNSD[i,4]==""){
    ctry_UNSD[i,4]=ctry_UNSD[i-1,4]
    }}
# Keep only country rows and replace country names as needed for merge
ctry_UNSD<-ctry_UNSD %>% 
    filter(ISOalpha3!="") %>% 
    select(country, UNSDsubregion) %>% 
    mutate(
    country = ifelse(country == "Bolivia (Plurinational State of)", "Bolivia", country) ,
    country = ifelse(country == "Cabo Verde", "Cape Verde", country) , 
    country = ifelse(country == "Democratic Republic of the Congo", "Congo Democratic Republic", country) ,
    country = ifelse(country == "Côte d'Ivoire", "Cote d'Ivoire", country) ,
    country = ifelse(country == "Kyrgyzstan", "Kyrgyz Republic", country) , 
    country = ifelse(country == "Republic of Moldova", "Moldova", country) , 
    country = ifelse(country == "United Republic of Tanzania", "Tanzania", country) ,
    country = ifelse(country == "Viet Nam", "Vietnam", country) 
    )
label(ctry_UNSD$UNSDsubregion) <- "Sub-region, UNSD Methodology 49"

Then, merge the two data frames: ctry_UNSD & dta (cleaned DHS API data).

# inspect and confirm "country" variable
length(unique(dta$country)) #number of unique countries
length(unique(ctry_UNSD$country)) #number of unique countries

# merge UNSD country list with the anaysis data 
dim(dta)
dim(ctry_UNSD)
dta<-left_join(dta, ctry_UNSD, by = "country")
dim(dta)
length(unique(dta$surveyid)) #number of unique surveys
length(unique(dta$country)) #number of unique countries

# check all countries now have UNSDsubregion, 
table(dta$UNSDsubregion, exclude = NULL)
    test<-filter(dta, UNSDsubregion=="Western Africa")
    table(test$country)
dta<- mutate(dta, 
    UNSDsubregion=ifelse(country=="Cote d'Ivoire", "Western Africa", UNSDsubregion) )
table(dta$UNSDsubregion, exclude = NULL)

dta<- mutate(dta, 
    ssa=UNSDsubregion=="Eastern Africa"|
        UNSDsubregion=="Middle Africa"|
        UNSDsubregion=="Southern Africa"|
        UNSDsubregion=="Western Africa")
label(dta$ssa) <- "Sub-Saharan Africa"
table(dta$ssa) 

dta<- mutate(dta, 
    region5="NULL",   
    region5 = ifelse(UNSDsubregion == "South-eastern Asia", "SSEA", region5),
    region5 = ifelse(UNSDsubregion == "Southern Asia", "SSEA", region5),
    
    region5 = ifelse(UNSDsubregion == "Central Asia", "MENACAE", region5),
    region5 = ifelse(UNSDsubregion == "Western Asia", "MENACAE", region5),    
    region5 = ifelse(UNSDsubregion == "Eastern Europe", "MENACAE", region5),
    region5 = ifelse(UNSDsubregion == "Southern Europe", "MENACAE", region5),
    region5 = ifelse(UNSDsubregion == "Northern Africa", "MENACAE", region5),
    
    region5 = ifelse(UNSDsubregion == "Caribbean", "LAC", region5),
    region5 = ifelse(UNSDsubregion == "Central America", "LAC", region5),
    region5 = ifelse(UNSDsubregion == "South America", "LAC", region5),

    region5 = ifelse(UNSDsubregion == "Eastern Africa", "SSA, Southern & Eastern", region5),
    region5 = ifelse(UNSDsubregion == "Southern Africa", "SSA, Southern & Eastern", region5),    
    region5 = ifelse(UNSDsubregion == "Middle Africa", "SSA, Central & Western", region5),
    region5 = ifelse(UNSDsubregion == "Western Africa", "SSA, Central & Western", region5))
table(dta$region5)

dta<- mutate(dta, 
    region4 = region5,
    region4 = ifelse(region5 == "MENACAE", "Asia, Middle East, Europe", region4),
    region4 = ifelse(region5 == "SSEA", "Asia, Middle East, Europe", region4))
table(dta$region4)

dta<- mutate(dta, 
    region5_1 = region5 == "SSEA",
    region5_2 = region5 == "LAC", 
    region5_3 = region5 == "MENACAE", 
    region5_4 = region5 == "SSA, Central & Western",
    region5_5 = region5 == "SSA, Southern & Eastern" ) 

dta$region5_1 <-as.numeric(dta$region5_1)
dta$region5_2 <-as.numeric(dta$region5_2)
dta$region5_3 <-as.numeric(dta$region5_3)
dta$region5_4 <-as.numeric(dta$region5_4)
dta$region5_5 <-as.numeric(dta$region5_5)

2.3. Create age-adjusted CPR

Then, now we construct age-adjusted CPR. See the paper for detailed methods and background. In summary, average of age-specific CPR will be calculated - weighted by the population size in each age-group.

dtasum<-mutate(dta, 
        temp_17=    NA  , 
        temp_22=    NA  , 
        temp_27=    NA  , 
        temp_32=    NA  , 
        temp_37=    NA  , 
        temp_42=    NA  , 
        temp_47=    NA  , 
        temp_17=ifelse(grouplabel =="15-19", cpr_all, temp_17),
        temp_22=ifelse(grouplabel =="20-24", cpr_all, temp_22), 
        temp_27=ifelse(grouplabel =="25-29", cpr_all, temp_27), 
        temp_32=ifelse(grouplabel =="30-34", cpr_all, temp_32), 
        temp_37=ifelse(grouplabel =="35-39", cpr_all, temp_37), 
        temp_42=ifelse(grouplabel =="40-44", cpr_all, temp_42), 
        temp_47=ifelse(grouplabel =="45-49", cpr_all, temp_47) ) %>%
    group_by(surveyid) %>% 
    mutate(
        temp_17 = mean(temp_17, na.rm=TRUE) , 
        temp_22 = mean(temp_22, na.rm=TRUE) , 
        temp_27 = mean(temp_27, na.rm=TRUE) , 
        temp_32 = mean(temp_32, na.rm=TRUE) , 
        temp_37 = mean(temp_37, na.rm=TRUE) , 
        temp_42 = mean(temp_42, na.rm=TRUE) , 
        temp_47 = mean(temp_47, na.rm=TRUE) ,
        aacpr_all= 5*(
            temp_17+temp_22+temp_27+temp_32+temp_37+temp_42+temp_47)/35
        ) %>% 
    select( -starts_with("temp")) %>%
    filter(group=="Total") 

dtasum$aacpr_all <- round(dtasum$aacpr_all,1)
dim(dtasum)

2.4. Create variables to quantify relationship between unadjusted vs. age-adjusted CPR

dtasum<-mutate(dtasum,
               ratio_cpr    =aacpr_all / cpr_all) 
#label(dtasum$aacpr_all)<- "age-adjusted CPR" 
#label(dtasum$ratio_cpr)<- "ratio: AACPR/CPR" 
dtasum$ratio_cpr <- round(dtasum$ratio_cpr,4)

obs<-nrow(dtasum)
countries<-length(unique(dtasum$country))

2.5. Keep analysis sample

Now the dataset, dtasum, has 273 observations, at the survey level. But, the study was conducted in a subset of surveys that were accessed as of September, 2017 (259 surveys conducted in 85 countries). Thus, for the purpose of reproducing the results, drop newly released surveys since September 2017. They are likely surveys conducted in 2014, 2015 or later.

dtasum<-mutate(dtasum, 
          newsurvey=
            year>=2017  |
            (year==2016 & (surveyid!="AM2016DHS" & surveyid!="ET2016DHS"
                            & surveyid!="MM2016DHS"))|
            (year==2015 & (surveyid!="CO2015DHS" & surveyid!="GU2015DHS" 
                            & surveyid!="MW2015DHS" & surveyid!="RW2015DHS" 
                            & surveyid!="TZ2015DHS" & surveyid!="ZW2015DHS"
                            & surveyid!="AO2015DHS"))|
            (year==2014 & (surveyid!="BD2014DHS" & surveyid!="EG2014DHS" 
                            & surveyid!="GH2014DHS" & surveyid!="KE2014DHS" 
                            & surveyid!="KH2014DHS" & surveyid!="LS2014DHS" 
                            & surveyid!="SN2014DHS" & surveyid!="TD2014DHS")) 
          ) %>%
        filter(newsurvey==0) %>%
        select(-"newsurvey")

obs<-nrow(dtasum)
countries<-length(unique(dtasum$country))

Now the analysis dataset, dtasum, has 259 observations (i.e., DHS surveys) from 85 countries, Yay!!

3. Data Analysis

There are four tables and one figure in results. The following presents code for the tables and the figure in the order of their appearance in the paper.

NOTE: This section is work in progress. Still trying to figure out correct model specification in R.

3.1. Table 1

Table 1 presents and compares results from linear regression of contraceptive use on TFR between two periods: 1985-2000 vs. 2001-2016.

All countries (upper panel)

Sub-Saharan African countries (lower panel)

3.2. Table 2

Table 2 illustrates examples of contraceptive prevalence rate by age group, and comparison between unadjusted and age-adjusted measures.

df<-dta %>% 
    filter(surveyid=="EG2014DHS" | surveyid=="DR2013DHS") %>% 
    filter(group!="Total") %>% 
    select(surveyid, group, grouplabel, cpr_all)
df

    surveyid               group grouplabel cpr_all
1  DR2013DHS Age (5-year groups)      15-19    22.8
2  DR2013DHS Age (5-year groups)      20-24    44.6
3  DR2013DHS Age (5-year groups)      25-29    60.1
4  DR2013DHS Age (5-year groups)      30-34    67.9
5  DR2013DHS Age (5-year groups)      35-39    72.3
6  DR2013DHS Age (5-year groups)      40-44    71.4
7  DR2013DHS Age (5-year groups)      45-49    70.2
8  EG2014DHS Age (5-year groups)      15-19    20.0
9  EG2014DHS Age (5-year groups)      20-24    41.2
10 EG2014DHS Age (5-year groups)      25-29    53.6
11 EG2014DHS Age (5-year groups)      30-34    62.3
12 EG2014DHS Age (5-year groups)      35-39    68.2
13 EG2014DHS Age (5-year groups)      40-44    63.9
14 EG2014DHS Age (5-year groups)      45-49    45.5

df<-dtasum %>% 
    filter(surveyid=="EG2014DHS" | surveyid=="DR2013DHS") %>% 
    select(surveyid, cpr_all, aacpr_all, ratio_cpr)
df

# A tibble: 2 x 4
# Groups:   surveyid [2]
  surveyid  cpr_all        aacpr_all      ratio_cpr     
  <chr>     <S3: labelled> <S3: labelled> <S3: labelled>
1 DR2013DHS 55.1           58.47143       1.0611875     
2 EG2014DHS 55.0           50.67143       0.9212987

3.3. Figure 1

The figure shows the association between the level of CPR among all women and the relative difference between age-adjusted and unadjusted CPR, overall and by region.

#scatter plot with the regresion line
ggplot(dtasum, aes(cpr_all, ratio_cpr)) +  
    geom_point()+
    geom_segment(aes(x = p1x, y = p1y, xend = p2x, yend = p2y))+
    geom_hline(yintercept=1, linetype="dashed")+
    labs(y= "The ratio of the age-adjusted CPR to unadjusted CPR",
         x= "CPR, among all women (%)") +
    scale_y_continuous(limits = c(0.8,1.3), 
                       breaks = seq(0.8,1.3,0.1))+
    scale_x_continuous(limits = c(0,80), 
                       breaks = seq(0,80,20))+
    ggtitle("Overall") +
    theme(strip.text = element_text(size=rel(0.7)),
          legend.text = element_text(size=rel(0.7)),
          axis.text.x = element_text(size=rel(0.7)),
          axis.text.y = element_text(size=rel(0.7)),
          axis.title.x = element_text(size=rel(0.7)),
          axis.title.y = element_text(size=rel(0.7)))

suppressPackageStartupMessages(library(plm))
suppressPackageStartupMessages(library(ggplot2))
#BY REGION >>> needs to be improved with looping !!!
#fixed effect regression line, since plm is not an option for geom_smooth

{r figure1B, comment=""}
#THIS chuck runs fine individually, but not when knitted. WHY?? 
suppressPackageStartupMessages(library(ggplot2))

ggplot(dtasum, aes(cpr_all, ratio_cpr, color=region5)) +  
    geom_point()+
    facet_wrap(~region4, ncol=2) +
    geom_hline(yintercept=1, linetype="dashed")+
    labs(y= "The ratio of the age-adjusted CPR to unadjusted CPR",
         x= "CPR, among all women (%)") +
    scale_y_continuous(limits = c(0.8,1.3), 
                       breaks = seq(0.8,1.3,0.1))+
    scale_x_continuous(limits = c(0,80), 
                       breaks = seq(0,80,20))+
    ggtitle("By Region") +
    theme(strip.text = element_text(size=rel(0.7)),
          legend.text = element_text(size=rel(0.7)),
          axis.text.x = element_text(size=rel(0.7)),
          axis.text.y = element_text(size=rel(0.7)),
          axis.title.x = element_text(size=rel(0.7)),
          axis.title.y = element_text(size=rel(0.7)),
          legend.position = "none")

Note: LAC: Latin America and the Caribbean, SSA: sub-Saharan Africa; Fitted line is based on linear regression using a country-level fixed effect model. Only significant (p-value<0.05) associations are presented. TO DO: Correct fitted line by region

3.4. Table 3

Linear regression of contraceptive use on TFR by unadjusted vs. age adjusted contraceptive use. Coming soon, after getting model specification correctly… Ugh.

3.5. Table 4

Linear regression of contraceptive use on TFR by unadjusted vs. age adjusted contraceptive use, by time period. Coming soon, after getting model specification correctly… Ugh.

1985-2000 (upper panel)

2001-2015 (lower panel)