Author: YJ Choi
Last updated: 2019-10-04
This is a markdown file to share how to access and process DHS API data for various analyses, using illustrative examples. There are two sections in this document: (1) accessing DHS API data; and (2) wrangling the data into a structure that is generally appropriate for common analyses.
Although I am a huge fan of micro-level open data (e.g., DHS recode files, PMA2020 data files), I have also become fond of pre-calculated indicator data from the DHS API and have even published a few papers using only the API data, like this and this. The API data, however, can be used for purposes beyond research and publications, as illustrated in the figure above. I most recently developed a data visualization app here, using DHS API data.
For basic but important information about the DHS API, check the website. For those who are familiar with DHS STATcompiler: the same database (i.e., estimates for indicators) is used for both STATcompiler and the API, so we get the same results/data from the two sources.
"R code is shown in a gray box"
Output is in a white box. This markdown includes only selected results that are useful for understanding the process. To see the results of each step, simply remove the argument "results=FALSE" from a code chunk.
When you get DHS API data, there are rows (i.e., observations) and columns (i.e., variables), like this example. Each row is a pre-calculated estimate (under the column “Value”) of:
- a specific indicator for
- a specific denominator by background characteristics (e.g., all, those in urban, those who finished primary school education) from
- a specific survey.
So calling the API data means you specify these points in the URL (section 1.2).
First, select indicators that are relevant for your specific study. The list of available indicators is here. For the purpose of this demonstration, the following four indicators are used.
Indicator | DHS API indicator ID | Definition |
---|---|---|
Total fertility rate | FE_FRTR_W_TFR | Total fertility rate for the three years preceding the survey for age group 15-49 expressed per woman |
Modern contraceptive prevalence rate (%) | FP_CUSA_W_MOD | Percentage of women currently using any modern method of contraception |
Women with secondary education (%) | ED_EDAT_W_SEC | Percentage of the de facto female household population age 6 and over who attended secondary education |
Women in union (%) | MA_MSTA_W_UNI | Percentage of women married or living in union |
Detailed information about DHS API calls/queries, with good examples, is available here and here, too. Highly recommended resources!
Below is a dissected example of a DHS API URL. It calls data:
- in json format
- for one indicator, total fertility rate
- for all surveys
- for all background characteristics
- with maximum of 10,000 rows per page.
Figure 1. Example of DHS API query URL
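To make the dissected pieces concrete, here is the same URL assembled from its query parameters in R. This is just a sketch mirroring the figure; the parameter values are the ones used throughout this document.

# The example URL from Figure 1, assembled parameter by parameter
base<-"http://api.dhsprogram.com/rest/dhs/data?"
query<-paste("f=json",                     # return format: json
             "indicatorIds=FE_FRTR_W_TFR", # one indicator: total fertility rate
             "surveyid=all",               # all surveys
             "breakdown=all",              # all background characteristics
             "perpage=10000",              # maximum of 10,000 rows per page
             sep="&")
url<-paste0(base, query)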
And, why these particular specifications?
I learned the API with json and just like it. If you prefer other formats, see here for available options.
For those who just want to literally see the data, without further manipulation or analysis, html is a good option, producing results like this.
My personal preference is to call each indicator separately, and then merge them together. This is a straightforward and generic approach that you can apply to any case, simply by modifying the indicator list in your code.
However, if you wish, you can call multiple indicators at once, separated by a comma (e.g., indicatorIds=FE_FRTR_W_TFR,FP_CUSA_W_MOD,ED_EDAT_W_SEC,MA_MSTA_W_UNI). If so, each row will be an estimate/value for one indicator, and the dataset needs to be “reshaped wide” for analysis (a sketch follows below). Also, calling many indicators at a time can be very slow.
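For reference, here is a minimal sketch of such a multi-indicator call and the “reshape wide” step. It assumes tidyr (1.0.0 or later, for pivot_wider) and uses the IndicatorId column returned by the API to name the new columns.

library(jsonlite)
library(dplyr)
library(tidyr)
# One call for all four indicators; each row is one indicator's estimate
url<-paste0("http://api.dhsprogram.com/rest/dhs/data?f=json",
            "&indicatorIds=FE_FRTR_W_TFR,FP_CUSA_W_MOD,ED_EDAT_W_SEC,MA_MSTA_W_UNI",
            "&surveyid=all&breakdown=all&perpage=10000&APIkey=DUMMY-123456")
long<-fromJSON(url)$Data %>%
  select(CountryName, SurveyId, IndicatorId, Value,
         CharacteristicCategory, CharacteristicLabel)
# Reshape wide: one column per indicator
wide<-long %>%
  pivot_wider(names_from=IndicatorId, values_from=Value)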
In case you are curious: if you do not specify indicators, the default will bring all available indicators, which will probably take a very, very, very long time.
Since I am mostly interested in multi-country analyses, I usually call all surveys. You can define selected countries (e.g., countryIds=MW,TZ) or surveys (e.g., surveyIds=MW2015DHS,TZ2017MIS), if you mostly work with those data. If you do not specify anything, the default is all surveys. I included surveyid=all for demonstrative purposes only.
I also like to call data by all subgroups (i.e., estimates disaggregated by every available subgroup for the particular indicator). Even if your analysis requires only national average data, having all the data gives you different analytical options and discussion points down the road.
Currently, you can get: (1) only the national-level estimates, (2) only estimates disaggregated by sub-national geographic regions, or (3) estimates disaggregated by all subgroups (including region). It is not (yet?) possible to call selected subgroups based on household/individual background characteristics (e.g., household wealth quintile). If you do not specify anything, the default is national-level estimates.
However, I should also note that calling all data can be slow, depending on the type of indicator, as some indicators are disaggregated by many subgroups (i.e., there are many rows to call), given the typical survey sample size.
Finally, each API call returns 100 observations by default and a maximum of 1,000. However, depending on the number of surveys and countries of your interest, your complete API data may well exceed this limit. To deal with this issue, there are two solutions.
A highly recommended solution: register at the DHS API site, obtain a DHS API key, and set a customized perpage limit. With an API key, the page limit is large enough (or may not exist at all; to be confirmed) that you will not need to worry about this, even if your call covers all of the nearly 300 surveys conducted since the 1980s.
Alternatively, you can set the per-page limit to 1,000 and append the data from each page. You will need to know how many pages there are for each indicator (another reason to avoid this approach), and then make the call from page 1 to the last page. The required number of pages for an indicator depends on various factors (the number of subgroups, which in turn depends on the indicator definition itself and the survey design/sample size). In my experience, I rarely had to deal with more than 10 pages to get all data from all surveys. But always check whether the last survey of your interest (based on the alphanumeric order of surveys) is included in the final data. But, really, why would you not register?!
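If you do go the page-by-page route, a minimal sketch looks like this. It assumes the JSON response reports the total number of pages in a TotalPages field; check your own response to confirm.

library(jsonlite)
library(dplyr)
# Append pages of (at most) 1,000 rows each until the last page
base<-paste0("http://api.dhsprogram.com/rest/dhs/data?f=json",
             "&indicatorIds=FE_FRTR_W_TFR&surveyid=all&breakdown=all&perpage=1000")
first<-fromJSON(paste0(base, "&page=1"))
dta<-first$Data
if (first$TotalPages > 1){
  for (p in 2:first$TotalPages){
    dta<-bind_rows(dta, fromJSON(paste0(base, "&page=", p))$Data)
  }
}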
Okay, enough talking, let’s call the data.
# Load required packages
library(jsonlite)
library(data.table)
library(dplyr)
# This code uses a dummy API key, "DUMMY-123456"; replace it with your own valid key
#FE_FRTR_W_TFR
url<-("http://api.dhsprogram.com/rest/dhs/data?f=json&indicatorIds=FE_FRTR_W_TFR&surveyid=all&breakdown=all&perpage=10000&APIkey=DUMMY-123456")
jsondata<-fromJSON(url)
dta<-data.table(jsondata$Data)
dta<-select(dta, CountryName, SurveyId, Value,
CharacteristicCategory, CharacteristicLabel)
FE_FRTR_W_TFR<- dta %>% rename(FE_FRTR_W_TFR=Value)
#FP_CUSA_W_MOD
url<-("http://api.dhsprogram.com/rest/dhs/data?f=json&indicatorIds=FP_CUSA_W_MOD&surveyid=all&breakdown=all&perpage=10000&APIkey=DUMMY-123456")
jsondata<-fromJSON(url)
dta<-data.table(jsondata$Data)
dta<-select(dta, CountryName, SurveyId, Value,
CharacteristicCategory, CharacteristicLabel)
FP_CUSA_W_MOD<- dta %>% rename(FP_CUSA_W_MOD=Value)
#ED_EDAT_W_SEC
url<-("http://api.dhsprogram.com/rest/dhs/data?f=json&indicatorIds=ED_EDAT_W_SEC&surveyid=all&breakdown=all&perpage=10000&APIkey=DUMMY-123456")
jsondata<-fromJSON(url)
dta<-data.table(jsondata$Data)
dta<-select(dta, CountryName, SurveyId, Value,
CharacteristicCategory, CharacteristicLabel)
ED_EDAT_W_SEC<- dta %>% rename(ED_EDAT_W_SEC=Value)
#MA_MSTA_W_UNI
url<-("http://api.dhsprogram.com/rest/dhs/data?f=json&indicatorIds=MA_MSTA_W_UNI&surveyid=all&breakdown=all&perpage=10000&APIkey=DUMMY-123456")
jsondata<-fromJSON(url)
dta<-data.table(jsondata$Data)
dta<-select(dta, CountryName, SurveyId, Value,
CharacteristicCategory, CharacteristicLabel)
MA_MSTA_W_UNI<- dta %>% rename(MA_MSTA_W_UNI=Value)
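The four blocks above repeat the same pattern, indicator by indicator. If you prefer, you can wrap the repeated pattern in a small helper function, like this sketch (getAPIdata is just an illustrative name):

# Helper: call one indicator and rename "Value" to the indicator's name
getAPIdata<-function(indicator){
  url<-paste0("http://api.dhsprogram.com/rest/dhs/data?f=json&indicatorIds=",
              indicator,
              "&surveyid=all&breakdown=all&perpage=10000&APIkey=DUMMY-123456")
  dta<-data.table(fromJSON(url)$Data)
  dta<-select(dta, CountryName, SurveyId, Value,
              CharacteristicCategory, CharacteristicLabel)
  setnames(dta, "Value", indicator)   # data.table::setnames
  dta
}
FE_FRTR_W_TFR<-getAPIdata("FE_FRTR_W_TFR")
FP_CUSA_W_MOD<-getAPIdata("FP_CUSA_W_MOD")
ED_EDAT_W_SEC<-getAPIdata("ED_EDAT_W_SEC")
MA_MSTA_W_UNI<-getAPIdata("MA_MSTA_W_UNI")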
As of 2019-10-04, the API calls generated four datasets (FE_FRTR_W_TFR, FP_CUSA_W_MOD, ED_EDAT_W_SEC, and MA_MSTA_W_UNI) with 7917, 2222, 8789, and 4378 observations, respectively. See the differences in the number of observations across the datasets? That is because of two factors: (1) differences in the number of surveys that collected information for each indicator; and, more importantly, (2) differences in the number and type of disaggregation categories.
We need to inspect the data further before merging all individual indicator datasets into one based on three ID variables: “SurveyId”, “CharacteristicCategory”, and “CharacteristicLabel” (“CountryName” is included in the merge as well, so that it appears only once in the merged data). In particular, check “CharacteristicCategory”, which will give you the first clue if anything is inconsistent. See below how each of the four indicators has a different set of disaggregation dimensions.
# Check disaggregation categories for each indicator
table(FE_FRTR_W_TFR$CharacteristicCategory)
Education Education (2 groups) Region
1199 618 3847
Residence Total Wealth quintile
632 316 1305
table(FP_CUSA_W_MOD$CharacteristicCategory)
Age (5-year groups) Total
1944 278
table(ED_EDAT_W_SEC$CharacteristicCategory)
Household members age Region Residence
3380 3364 520
Total Wealth quintile
260 1265
table(MA_MSTA_W_UNI$CharacteristicCategory)
Age (10-year groups) Age (5-year groups) Age (grouped)
1168 2042 876
Total 15-49
292
While most indicators share common and constant categories (e.g., residence and wealth quintile), some indicators may have unique category names and labels, even for the same subgroup. For example, see “Total 15-49” in the indicator MA_MSTA_W_UNI, where other indicators use “Total”. Thus, recode values of the ID variables as needed before merging.
library(dplyr)
# Recode labels of the ID variables as needed: MA_MSTA_W_UNI example
MA_MSTA_W_UNI<-MA_MSTA_W_UNI %>%
mutate(
CharacteristicCategory = ifelse(
CharacteristicCategory == "Total 15-49",
"Total",
CharacteristicCategory),
CharacteristicLabel = ifelse(
CharacteristicLabel == "Total 15-49",
"Total",
CharacteristicLabel)
)
# define merge ID variables
idvars<-c("CountryName", "SurveyId", "CharacteristicCategory", "CharacteristicLabel")
dtaapi<-FE_FRTR_W_TFR %>%
full_join(FP_CUSA_W_MOD, by =idvars) %>%
full_join(ED_EDAT_W_SEC, by =idvars) %>%
full_join(MA_MSTA_W_UNI, by =idvars)
dim(dtaapi)
names(dtaapi)
table(dtaapi$CharacteristicCategory)
obs<-nrow(dtaapi)
surveys<-length(unique(dtaapi$SurveyId))
The merged dataset has 15619 observations (i.e., subgroup-specific estimates) from 322 surveys. It has many more observations than any of the individual datasets, because of differences in the available “CharacteristicCategory” across the indicators, as shown above in Section 1.3. Sort the data by “SurveyId”, “CharacteristicCategory”, and “CharacteristicLabel” to explore the structure further.
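For example, a quick way to sort with dplyr:

# Sort by the ID variables to inspect the merged structure
dtaapi %>% arrange(SurveyId, CharacteristicCategory, CharacteristicLabel)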
library(dplyr)
library(Hmisc)
# Rename
dta<-dtaapi %>%
rename (tfr = FE_FRTR_W_TFR) %>%
rename (mcpr_all = FP_CUSA_W_MOD) %>%
rename (wedusec = ED_EDAT_W_SEC) %>%
rename (inunion = MA_MSTA_W_UNI) %>%
rename (country = CountryName) %>%
rename (group = CharacteristicCategory) %>%
rename (grouplabel = CharacteristicLabel)
colnames(dta)<-tolower(names(dta))
label(dta$tfr)<- "TFR"
label(dta$mcpr_all)<- "MCPR among all women"
label(dta$wedusec)<- "Women with secondary education (%)"
label(dta$inunion)<- "Women in union (%)"
# Check "group"
table(dta$group)
Age (10-year groups) Age (5-year groups) Age (grouped)
1168 2049 876
Education Education (2 groups) Household members age
1199 618 3380
Region Residence Total
3953 644 322
Wealth quintile
1410
# Extract survey year and type from surveyid (e.g., "MW2015DHS": characters 3-6 give the year, 7-9 the type)
dta<-dta %>%
  mutate(
    year=as.numeric(substr(surveyid,3,6)),
    type=substr(surveyid,7,9))
label(dta$year)<- "year of survey"
label(dta$type)<- "type of survey"
Create regional variables, following my preferred classification, the [UN Statistics Division's M49 standard](https://unstats.un.org/unsd/methodology/m49/). Here, we first scrape the webpage to get the data table of countries by geographic region and then merge it with the dataset, dta. For more information about UNSD's regional classification, see here.
library(rvest)
# Scrape the country/region data table from the UNSD web page
ctry_UNSD<-read_html("https://unstats.un.org/unsd/methodology/m49/") %>%
  html_nodes("table") %>%
  .[[7]] %>%   # the seventh table on the page holds the country list
  html_table(header = TRUE)
# Tidy up
ctry_UNSD<-ctry_UNSD %>%
rename (country = "Country or Area") %>%
rename (M49 = "M49 code") %>%
rename (ISOalpha3 = "ISO-alpha3 code") %>%
select(country, M49, ISOalpha3)
ctry_UNSD$country<-as.character(ctry_UNSD$country)
ctry_UNSD<-ctry_UNSD %>%
mutate(
UNSDsubregion=country,
UNSDsubregion=ifelse(ISOalpha3!="", "", UNSDsubregion)
)
# Carry each sub-region name down to the country rows that follow it
for (i in 1:nrow(ctry_UNSD)){
  if (ctry_UNSD[i,4]==""){   # column 4 is UNSDsubregion
    ctry_UNSD[i,4]=ctry_UNSD[i-1,4]
  }}
# Keep only country rows and, also, replace country names as needed for merge
ctry_UNSD<-ctry_UNSD %>%
filter(ISOalpha3!="") %>%
select(country, UNSDsubregion) %>%
mutate(
country = ifelse(country == "Bolivia (Plurinational State of)", "Bolivia", country) ,
country = ifelse(country == "Cabo Verde", "Cape Verde", country) ,
country = ifelse(country == "Democratic Republic of the Congo", "Congo Democratic Republic", country) ,
country = ifelse(country == "Côte d'Ivoire", "Cote d'Ivoire", country) ,
country = ifelse(country == "Kyrgyzstan", "Kyrgyz Republic", country) ,
country = ifelse(country == "Republic of Moldova", "Moldova", country) ,
country = ifelse(country == "United Republic of Tanzania", "Tanzania", country) ,
country = ifelse(country == "Viet Nam", "Vietnam", country)
)
label(ctry_UNSD$UNSDsubregion) <- "Sub-region, UNSD Methodology 49"
Then, merge the two data frames: ctry_UNSD & dta (cleaned DHS API data).
# Inspect and confirm "country" variable
length(unique(dta$country)) #number of unique countries
length(unique(ctry_UNSD$country)) #number of unique countries
# Merge UNSD country list with the DHS API data, "dta"
dim(dta)
dim(ctry_UNSD)
dta<-left_join(dta, ctry_UNSD, by = "country")
dim(dta)
length(unique(dta$surveyid)) #number of unique surveys
length(unique(dta$country)) #number of unique countries
# Check whether all countries now have UNSDsubregion
table(dta$UNSDsubregion, exclude = NULL)
Caribbean Central America Central Asia
629 492 359
Eastern Africa Eastern Europe Middle Africa
3718 86 910
Northern Africa South-eastern Asia South America
550 1357 1587
Southern Africa Southern Asia Southern Europe
539 1233 97
Western Africa Western Asia <NA>
2910 956 196
test<-filter(dta, UNSDsubregion=="Western Africa")
table(test$country)
Benin Burkina Faso Gambia Ghana Guinea
272 257 49 315 152
Liberia Mali Mauritania Niger Nigeria
203 307 46 189 326
Senegal Sierra Leone Togo
514 136 144
# Replace UNSDsubregion if missing
dta<- mutate(dta,
UNSDsubregion=ifelse(country=="Cote d'Ivoire", "Western Africa", UNSDsubregion) )
table(dta$UNSDsubregion, exclude = NULL)
Caribbean Central America Central Asia
629 492 359
Eastern Africa Eastern Europe Middle Africa
3718 86 910
Northern Africa South-eastern Asia South America
550 1357 1587
Southern Africa Southern Asia Southern Europe
539 1233 97
Western Africa Western Asia
3106 956
# Generate regional variables as needed per your analysis. Two examples (the second follows below):
dta<- mutate(dta,
       ssa=UNSDsubregion %in% c("Eastern Africa", "Middle Africa",
                                "Southern Africa", "Western Africa"))
label(dta$ssa) <- "Sub-Saharan Africa"
table(dta$ssa)
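And here is the second example: an illustrative grouping of my own, using the Asian sub-regions that appear in the data above.

dta<- mutate(dta,
       asia=UNSDsubregion %in% c("Central Asia", "South-eastern Asia",
                                 "Southern Asia", "Western Asia"))
label(dta$asia) <- "Asia"
table(dta$asia)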
The current dataset, dta, has observations at the survey-subgroup level. There are 15619 observations from 322 surveys in 86 countries, as of 2019-10-04.
But the unit of analysis can vary: a country, a survey, or the survey-subgroup-specific estimate itself. Studies like this and this create survey-level measures, starting from survey-subgroup-specific estimates. For example, you can simply keep only the “Total” observations.
dtasurvey<-dta %>% filter(group=="Total")
The resulting dataset, dtasurvey, has 322 observations from 322 surveys in 86 countries.
But if your study uses subgroup-level estimates from the latest survey in each country, you can keep only the observations from each country's most recent survey.
dtacountrylatest<-dta %>%
group_by(country) %>%
mutate(maxyear = max(year)) %>%
filter(year==maxyear)
In this case, the dataset, dtacountrylatest, has 4142 observations from 86 surveys in 86 countries.
Now, create your own DHS API indicator dataset. Enjoy!