Author: YJ Choi
Last updated: 2019-10-04
This is a markdown file to share how to access and process DHS API data for various analyses, using illustrative examples. There are two sections in this document: (1) accessing DHS API data; and (2) wrangling the data into a structure that is generally appropriate for common analyses.
Although I am a huge fan of micro-level open data (e.g., DHS recode files, PMA2020 data files), I have also become fond of pre-calculated indicator data from the DHS API and have even published a few papers using only the API data, like this and this. The API data, however, can be used for purposes beyond research and publications, as illustrated in the figure above. I most recently developed a data visualization app here, using DHS API data.
For basic but important information about the DHS API, check the website. For those who are familiar with DHS STATcompiler: the same database (i.e., estimates for indicators) is used for both STATcompiler and the API, so we get the same results/data from the two sources.
"R code is shown in a gray box"
Output is in a white box. This markdown includes only selected results that are useful for understanding the process. To see the results of each step, simply remove the argument "results=FALSE" from a code chunk.
When you get DHS API data, there are rows (i.e., observations) and columns (i.e., variables), like this example. Each row is a pre-calculated estimate (under the column “Value”) of:
- a specific indicator for
- a specific denominator by background characteristics (e.g., all, those in urban, those who finished primary school education) from
- a specific survey.
So calling the API data means you specify these points in the URL (section 1.2).
First, select indicators that are relevant for your specific study. The list of available indicators is here. For the purpose of this demonstration, the following four indicators are used.
Indicator | DHS API indicator ID | Definition |
---|---|---|
Total fertility rate | FE_FRTR_W_TFR | Total fertility rate for the three years preceding the survey for age group 15-49 expressed per woman |
Modern contraceptive prevalence rate (%) | FP_CUSA_W_MOD | Percentage of women currently using any modern method of contraception |
Women with secondary education (%) | ED_EDAT_W_SEC | Percentage of the de facto female household population age 6 and over who attended secondary education |
Women in union (%) | MA_MSTA_W_UNI | Percentage of women married or living in union |
Detailed information about DHS API calls/queries, with good examples, is available here and here, too. Highly recommended resources!
Below is a dissected example of a DHS API URL. It calls data:
- in json format
- for one indicator, total fertility rate
- for all surveys
- for all background characteristics
- with maximum of 10,000 rows per page.
Figure 1. Example of DHS API query URL
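To make the dissected pieces concrete, here is the same URL assembled from its query parameters in R. This is just a sketch mirroring the figure; the parameter values are the ones used throughout this document.

# The example URL from Figure 1, assembled parameter by parameter
base<-"http://api.dhsprogram.com/rest/dhs/data?"
query<-paste("f=json",                     # return format: json
             "indicatorIds=FE_FRTR_W_TFR", # one indicator: total fertility rate
             "surveyid=all",               # all surveys
             "breakdown=all",              # all background characteristics
             "perpage=10000",              # maximum of 10,000 rows per page
             sep="&")
url<-paste0(base, query)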
And, why these particular specifications?
I learned the API with json and just like it. If you prefer other formats, see here for available options.
For those who just want to literally see the data, without further manipulation or analysis, html is a good option, producing results like this.
My personal preference is to call each indicator separately, and then merge them together. This is a straightforward and generic approach that you can apply to any case, simply by modifying the indicator list in your code.
However, if you wish, you can call multiple indicators at once, separated by a comma (e.g., indicatorIds=FE_FRTR_W_TFR,FP_CUSA_W_MOD,ED_EDAT_W_SEC,MA_MSTA_W_UNI). If so, each row will be an estimate/value for one indicator, and the dataset needs to be “reshaped wide” for analysis (a sketch follows below). Also, calling many indicators at a time can be very slow.
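For reference, here is a minimal sketch of such a multi-indicator call and the “reshape wide” step. It assumes tidyr (1.0.0 or later, for pivot_wider) and uses the IndicatorId column returned by the API to name the new columns.

library(jsonlite)
library(dplyr)
library(tidyr)
# One call for all four indicators; each row is one indicator's estimate
url<-paste0("http://api.dhsprogram.com/rest/dhs/data?f=json",
            "&indicatorIds=FE_FRTR_W_TFR,FP_CUSA_W_MOD,ED_EDAT_W_SEC,MA_MSTA_W_UNI",
            "&surveyid=all&breakdown=all&perpage=10000&APIkey=DUMMY-123456")
long<-fromJSON(url)$Data %>%
  select(CountryName, SurveyId, IndicatorId, Value,
         CharacteristicCategory, CharacteristicLabel)
# Reshape wide: one column per indicator
wide<-long %>%
  pivot_wider(names_from=IndicatorId, values_from=Value)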
In case you are curious: if you do not specify indicators, the default will bring all available indicators, which will probably take a very, very, very long time.
Since I am mostly interested in multi-country analyses, I usually call all surveys. You can define selected countries (e.g., countryIds=MW,TZ) or surveys (e.g., surveyIds=MW2015DHS,TZ2017MIS), if you mostly work with those data. If you do not specify anything, the default is all surveys. I included surveyid=all for demonstrative purposes only.
I also like to call data by all subgroups (i.e., estimates disaggregated by every available subgroup for the particular indicator). Even if your analysis requires only national average data, having all the data gives you different analytical options and discussion points down the road.
Currently, you can get: (1) only the national-level estimates, (2) only estimates disaggregated by sub-national geographic regions, or (3) estimates disaggregated by all subgroups (including region). It is not (yet?) possible to call selected subgroups based on household/individual background characteristics (e.g., household wealth quintile). If you do not specify anything, the default is national-level estimates.
However, I should also note that calling all data can be slow, depending on the type of indicator, as some indicators are disaggregated by many subgroups (i.e., there are many rows to call), given the typical survey sample size.
Finally, each API call returns 100 observations by default and a maximum of 1,000. However, depending on the number of surveys and countries of your interest, your complete API data may well exceed this limit. To deal with this issue, there are two solutions.
A highly recommended solution: register at the DHS API site, obtain a DHS API key, and set a customized perpage limit. With an API key, the page limit is large enough (or may not exist at all; to be confirmed) that you will not need to worry about this, even if your call covers all of the nearly 300 surveys conducted since the 1980s.
Alternatively, you can set the per-page limit to 1,000 and append the data from each page. You will need to know how many pages there are for each indicator (another reason to avoid this approach), and then make the call from page 1 to the last page. The required number of pages for an indicator depends on various factors (the number of subgroups, which in turn depends on the indicator definition itself and the survey design/sample size). In my experience, I rarely had to deal with more than 10 pages to get all data from all surveys. But always check whether the last survey of your interest (based on the alphanumeric order of surveys) is included in the final data. But, really, why would you not register?!
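If you do go the page-by-page route, a minimal sketch looks like this. It assumes the JSON response reports the total number of pages in a TotalPages field; check your own response to confirm.

library(jsonlite)
library(dplyr)
# Append pages of (at most) 1,000 rows each until the last page
base<-paste0("http://api.dhsprogram.com/rest/dhs/data?f=json",
             "&indicatorIds=FE_FRTR_W_TFR&surveyid=all&breakdown=all&perpage=1000")
first<-fromJSON(paste0(base, "&page=1"))
dta<-first$Data
if (first$TotalPages > 1){
  for (p in 2:first$TotalPages){
    dta<-bind_rows(dta, fromJSON(paste0(base, "&page=", p))$Data)
  }
}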
Okay, enough talking, let’s call the data.
# Load required packages
library(jsonlite)
library(data.table)
library(dplyr)
# This code uses a dummy API key, "DUMMY-123456"; replace it with your own valid key
#FE_FRTR_W_TFR
url<-("http://api.dhsprogram.com/rest/dhs/data?f=json&indicatorIds=FE_FRTR_W_TFR&surveyid=all&breakdown=all&perpage=10000&APIkey=DUMMY-123456")
jsondata<-fromJSON(url)
dta<-data.table(jsondata$Data)
dta<-select(dta, CountryName, SurveyId, Value,
CharacteristicCategory, CharacteristicLabel)
FE_FRTR_W_TFR<- dta %>% rename(FE_FRTR_W_TFR=Value)
#FP_CUSA_W_MOD
url<-("http://api.dhsprogram.com/rest/dhs/data?f=json&indicatorIds=FP_CUSA_W_MOD&surveyid=all&breakdown=all&perpage=10000&APIkey=DUMMY-123456")
jsondata<-fromJSON(url)
dta<-data.table(jsondata$Data)
dta<-select(dta, CountryName, SurveyId, Value,
CharacteristicCategory, CharacteristicLabel)
FP_CUSA_W_MOD<- dta %>% rename(FP_CUSA_W_MOD=Value)
#ED_EDAT_W_SEC
url<-("http://api.dhsprogram.com/rest/dhs/data?f=json&indicatorIds=ED_EDAT_W_SEC&surveyid=all&breakdown=all&perpage=10000&APIkey=DUMMY-123456")
jsondata<-fromJSON(url)
dta<-data.table(jsondata$Data)
dta<-select(dta, CountryName, SurveyId, Value,
CharacteristicCategory, CharacteristicLabel)
ED_EDAT_W_SEC<- dta %>% rename(ED_EDAT_W_SEC=Value)
#MA_MSTA_W_UNI
url<-("http://api.dhsprogram.com/rest/dhs/data?f=json&indicatorIds=MA_MSTA_W_UNI&surveyid=all&breakdown=all&perpage=10000&APIkey=DUMMY-123456")
jsondata<-fromJSON(url)
dta<-data.table(jsondata$Data)
dta<-select(dta, CountryName, SurveyId, Value,
CharacteristicCategory, CharacteristicLabel)
MA_MSTA_W_UNI<- dta %>% rename(MA_MSTA_W_UNI=Value)
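The four blocks above repeat the same pattern, indicator by indicator. If you prefer, you can wrap the repeated pattern in a small helper function, like this sketch (getAPIdata is just an illustrative name):

# Helper: call one indicator and rename "Value" to the indicator's name
getAPIdata<-function(indicator){
  url<-paste0("http://api.dhsprogram.com/rest/dhs/data?f=json&indicatorIds=",
              indicator,
              "&surveyid=all&breakdown=all&perpage=10000&APIkey=DUMMY-123456")
  dta<-data.table(fromJSON(url)$Data)
  dta<-select(dta, CountryName, SurveyId, Value,
              CharacteristicCategory, CharacteristicLabel)
  setnames(dta, "Value", indicator)   # data.table::setnames
  dta
}
FE_FRTR_W_TFR<-getAPIdata("FE_FRTR_W_TFR")
FP_CUSA_W_MOD<-getAPIdata("FP_CUSA_W_MOD")
ED_EDAT_W_SEC<-getAPIdata("ED_EDAT_W_SEC")
MA_MSTA_W_UNI<-getAPIdata("MA_MSTA_W_UNI")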
As of 2019-10-04, the API calls generated four datasets (FE_FRTR_W_TFR, FP_CUSA_W_MOD, ED_EDAT_W_SEC, and MA_MSTA_W_UNI) with 7917, 2222, 8789, and 4378 observations, respectively. See the differences in the number of observations across the datasets? That is because of two factors: (1) differences in the number of surveys that collected information for each indicator; and, more importantly, (2) differences in the number and type of disaggregation categories.
We need to inspect the data further before merging all individual indicator datasets into one based on three ID variables: “SurveyId”, “CharacteristicCategory”, and “CharacteristicLabel” (“CountryName” is included in the merge as well, so that it appears only once in the merged data). In particular, check “CharacteristicCategory”, which will give you the first clue if anything is inconsistent. See below how each of the four indicators has a different set of disaggregation dimensions.
# Check disaggregation categories for each indicator
table(FE_FRTR_W_TFR$CharacteristicCategory)
Education Education (2 groups) Region
1199 618 3847
Residence Total Wealth quintile
632 316 1305
table(FP_CUSA_W_MOD$CharacteristicCategory)
Age (5-year groups) Total
1944 278
table(ED_EDAT_W_SEC$CharacteristicCategory)
Household members age Region Residence
3380 3364 520
Total Wealth quintile
260 1265
table(MA_MSTA_W_UNI$CharacteristicCategory)
Age (10-year groups) Age (5-year groups) Age (grouped)
1168 2042 876
Total 15-49
292
While most indicators share common and constant categories (e.g., residence and wealth quintile), some indicators may have unique category names and labels, even for the same subgroup. For example, see “Total 15-49” in the indicator MA_MSTA_W_UNI, where other indicators use “Total”. Thus, recode values of the ID variables as needed before merging.
library(dplyr)
# Recode labels of the ID variables as needed: MA_MSTA_W_UNI example
MA_MSTA_W_UNI<-MA_MSTA_W_UNI %>%
mutate(
CharacteristicCategory = ifelse(
CharacteristicCategory == "Total 15-49",
"Total",
CharacteristicCategory),
CharacteristicLabel = ifelse(
CharacteristicLabel == "Total 15-49",
"Total",
CharacteristicLabel)
)
# define merge ID variables
idvars<-c("CountryName", "SurveyId", "CharacteristicCategory", "CharacteristicLabel")
dtaapi<-FE_FRTR_W_TFR %>%
full_join(FP_CUSA_W_MOD, by =idvars) %>%
full_join(ED_EDAT_W_SEC, by =idvars) %>%
full_join(MA_MSTA_W_UNI, by =idvars)
dim(dtaapi)
names(dtaapi)
table(dtaapi$CharacteristicCategory)
obs<-nrow(dtaapi)
surveys<-length(unique(dtaapi$SurveyId))
The merged dataset has 15619 observations (i.e., subgroup-specific estimates) from 322 surveys. It has many more observations than any of the individual datasets, because of differences in the available “CharacteristicCategory” across the indicators, as shown above in Section 1.3. Sort the data by “SurveyId”, “CharacteristicCategory”, and “CharacteristicLabel” to explore the structure further.
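For example, a quick way to sort with dplyr:

# Sort by the ID variables to inspect the merged structure
dtaapi %>% arrange(SurveyId, CharacteristicCategory, CharacteristicLabel)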
library(dplyr)
library(Hmisc)
# Rename
dta<-dtaapi %>%
rename (tfr = FE_FRTR_W_TFR) %>%
rename (mcpr_all = FP_CUSA_W_MOD) %>%
rename (wedusec = ED_EDAT_W_SEC) %>%
rename (inunion = MA_MSTA_W_UNI) %>%
rename (country = CountryName) %>%
rename (group = CharacteristicCategory) %>%
rename (grouplabel = CharacteristicLabel)
colnames(dta)<-tolower(names(dta))
label(dta$tfr)<- "TFR"
label(dta$mcpr_all)<- "MCPR among all women"
label(dta$wedusec)<- "Women with secondary education (%)"
label(dta$inunion)<- "Women in union (%)"
# Check "group"
table(dta$group)
Age (10-year groups) Age (5-year groups) Age (grouped)
1168 2049 876
Education Education (2 groups) Household members age
1199 618 3380
Region Residence Total
3953 644 322
Wealth quintile
1410
# Extract survey year and type from surveyid (e.g., "MW2015DHS": characters 3-6 give the year, 7-9 the type)
dta<-dta %>%
  mutate(
    year=as.numeric(substr(surveyid,3,6)),
    type=substr(surveyid,7,9))
label(dta$year)<- "year of survey"
label(dta$type)<- "type of survey"
Create regional variables, following my preferred classification, the [UN Statistics Division's M49 standard](https://unstats.un.org/unsd/methodology/m49/). Here, we first scrape the webpage to get the data table of countries by geographic region and then merge it with the dataset, dta. For more information about UNSD's regional classification, see here.
library(rvest)
# Scrape the country/region data table from the UNSD web page
ctry_UNSD<-read_html("https://unstats.un.org/unsd/methodology/m49/") %>%
  html_nodes("table") %>%
  .[[7]] %>%   # the seventh table on the page holds the country list
  html_table(header = TRUE)
# Tidy up
ctry_UNSD<-ctry_UNSD %>%
rename (country = "Country or Area") %>%
rename (M49 = "M49 code") %>%
rename (ISOalpha3 = "ISO-alpha3 code") %>%
select(country, M49, ISOalpha3)
ctry_UNSD$country<-as.character(ctry_UNSD$country)
ctry_UNSD<-ctry_UNSD %>%
mutate(
UNSDsubregion=country,
UNSDsubregion=ifelse(ISOalpha3!="", "", UNSDsubregion)
)
# Carry each sub-region name down to the country rows that follow it
for (i in 1:nrow(ctry_UNSD)){
  if (ctry_UNSD[i,4]==""){   # column 4 is UNSDsubregion
    ctry_UNSD[i,4]=ctry_UNSD[i-1,4]
  }}
# Keep only country rows and, also, replace country names as needed for merge
ctry_UNSD<-ctry_UNSD %>%
filter(ISOalpha3!="") %>%
select(country, UNSDsubregion) %>%
mutate(
country = ifelse(country == "Bolivia (Plurinational State of)", "Bolivia", country) ,
country = ifelse(country == "Cabo Verde", "Cape Verde", country) ,
country = ifelse(country == "Democratic Republic of the Congo", "Congo Democratic Republic", country) ,
country = ifelse(country == "Côte d'Ivoire", "Cote d'Ivoire", country) ,
country = ifelse(country == "Kyrgyzstan", "Kyrgyz Republic", country) ,
country = ifelse(country == "Republic of Moldova", "Moldova", country) ,
country = ifelse(country == "United Republic of Tanzania", "Tanzania", country) ,
country = ifelse(country == "Viet Nam", "Vietnam", country)
)
label(ctry_UNSD$UNSDsubregion) <- "Sub-region, UNSD Methodology 49"
Then, merge the two data frames: ctry_UNSD & dta (cleaned DHS API data).
# Inspect and confirm "country" variable
length(unique(dta$country)) #number of unique countries
length(unique(ctry_UNSD$country)) #number of unique countries
# Merge UNSD country list with the DHS API data, "dta"
dim(dta)
dim(ctry_UNSD)
dta<-left_join(dta, ctry_UNSD, by = "country")
dim(dta)
length(unique(dta$surveyid)) #number of unique surveys
length(unique(dta$country)) #number of unique countries
# Check whether all countries now have UNSDsubregion
table(dta$UNSDsubregion, exclude = NULL)
Caribbean Central America Central Asia
629 492 359
Eastern Africa Eastern Europe Middle Africa
3718 86 910
Northern Africa South-eastern Asia South America
550 1357 1587
Southern Africa Southern Asia Southern Europe
539 1233 97
Western Africa Western Asia <NA>
2910 956 196
test<-filter(dta, UNSDsubregion=="Western Africa")
table(test$country)
Benin Burkina Faso Gambia Ghana Guinea
272 257 49 315 152
Liberia Mali Mauritania Niger Nigeria
203 307 46 189 326
Senegal Sierra Leone Togo
514 136 144
# Replace UNSDsubregion if missing
dta<- mutate(dta,
UNSDsubregion=ifelse(country=="Cote d'Ivoire", "Western Africa", UNSDsubregion) )
table(dta$UNSDsubregion, exclude = NULL)
Caribbean Central America Central Asia
629 492 359
Eastern Africa Eastern Europe Middle Africa
3718 86 910
Northern Africa South-eastern Asia South America
550 1357 1587
Southern Africa Southern Asia Southern Europe
539 1233 97
Western Africa Western Asia
3106 956
# Generate regional variables as needed per your analysis. Two examples (the second follows below):
dta<- mutate(dta,
       ssa=UNSDsubregion %in% c("Eastern Africa", "Middle Africa",
                                "Southern Africa", "Western Africa"))
label(dta$ssa) <- "Sub-Saharan Africa"
table(dta$ssa)
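And here is the second example: an illustrative grouping of my own, using the Asian sub-regions that appear in the data above.

dta<- mutate(dta,
       asia=UNSDsubregion %in% c("Central Asia", "South-eastern Asia",
                                 "Southern Asia", "Western Asia"))
label(dta$asia) <- "Asia"
table(dta$asia)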
The current dataset, dta, has observations at the survey-subgroup level. There are 15619 observations from 322 surveys in 86 countries, as of 2019-10-04.
But the unit of analysis can vary: a country, a survey, or the survey-subgroup-specific estimate itself. Studies like this and this create survey-level measures, starting from survey-subgroup-specific estimates. For example, you can simply keep only the “Total” observations.
dtasurvey<-dta %>% filter(group=="Total")
The resulting dataset, dtasurvey, has 322 observations from 322 surveys in 86 countries.
But if your study uses subgroup-level estimates from the latest survey in each country, you can keep only the observations from each country's most recent survey.
dtacountrylatest<-dta %>%
group_by(country) %>%
mutate(maxyear = max(year)) %>%
filter(year==maxyear)
In this case, the dataset, dtacountrylatest, has 4142 observations from 86 surveys in 86 countries.
Now, create your own DHS API indicator dataset. Enjoy!